In recent years, the big data era has produced data of increasing volume and complexity that requires processing. To analyze and process these large amounts of data, applications are being scaled on large clusters using distributed data processing frameworks. A more recent trend utilizes hardware accelerators to offload computationally intensive tasks and reduce compute time and energy consumption. As a result, data center deployments containing heterogeneous compute infrastructure are growing rapidly. As an alternative to the more commonly used general-purpose GPUs (GPGPUs), the field-programmable gate array (FPGA) is becoming an increasingly popular choice of accelerator. Its effectiveness in accelerating highly parallel applications, combined with the flexibility of its reconfigurable nature, makes it well suited for a wide range of applications. Because an FPGA is a spatial compute resource, the problem size a single FPGA can process is bounded by the available programmable logic and memory. However, applications that do not require the full resources of an FPGA can be scaled vertically by instantiating multiple instances of the hardware design on a single node. A barrier to the adoption of FPGAs is the complexity of hardware design, which requires in-depth hardware-specific expertise. Additionally, integrating FPGAs into distributed data processing frameworks is a challenge in itself.
These challenges are being addressed in two directions. High-level synthesis (HLS) tools and compilers are being developed to decrease the complexity of hardware design by allowing users to develop FPGA designs in high-level languages. Additionally, ready-to-use FPGA designs for common applications are increasingly available in hardware libraries such as the Vitis libraries.
To aid the adoption of FPGAs and improve their accessibility, this work presents OctoRay: a Python framework with a focus on ease of use that allows users to flexibly and transparently scale applications both vertically and horizontally on FPGA clusters. Scaling a binarized convolutional neural network (CNN) with OctoRay resulted in performance improvements that were linear in the number of nodes or replicated instances used. The framework was also used to analyze the cost-efficiency of a cluster of low-end PYNQ-Z1 FPGAs compared to a data-center-class Alveo U280 FPGA. A partially hardware-accelerated implementation of Full Waveform Inversion (FWI), a seismic imaging algorithm, was developed and used to conduct the investigation. It was concluded that 32 PYNQ-Z1s are required to match the performance of a single Alveo U280 FPGA. An important bottleneck in the performance of the PYNQ-Z1s was the low-performance host processor, on which a significant portion of FWI was executed. The limited resources available on a PYNQ-Z1 restricted the attainable accuracy of FWI to a bare minimum. An FWI hardware design with the same specifications, built for the high-end FPGA, utilized only a fraction of its resources, far from harnessing its full potential. It was concluded that applications which, unlike FWI, do not require the abundance of resources a high-end FPGA offers but do benefit from rapid development cycles and low energy consumption are well suited to a distributed composition of low-end FPGAs.