In modern High-Performance Computing (HPC) systems based on Field-Programmable Gate Array (FPGA) accelerators, First-In, First-Out queues (FIFOs) are crucial for data buffering and balancing, particularly in demanding applications that require high-throughput data streaming. Prominent examples include large-scale signal processing, machine learning, and online network processing. Typically, these FIFOs are implemented using the available on-chip block RAM (BRAM) organized as a circular structure, leveraging the reconfigurable nature of FPGAs to manage data efficiently. However, as applications process ever larger volumes of data, these BRAM-based implementations consume a significant portion of the on-chip resources in both area and power, posing challenges for performance and efficiency.
This thesis explores an alternative approach to large, efficient, reconfigurable on-chip FIFOs: a dedicated ASIC block containing a linear shift-register structure. Unlike circular FIFOs, which rely on control logic such as address pointers and decoders, linear FIFOs simplify the implementation by shifting the data directly through the registers. This approach eliminates the additional control logic typically associated with FPGA circular buffers, making it a potentially more efficient solution for a large class of streaming applications.
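The contrast between the two FIFO organizations can be illustrated with a small behavioral model. The Python sketch below is only an illustration under stated assumptions, not the hardware design evaluated in this thesis: the class names `CircularFifo` and `LinearShiftFifo` are hypothetical, and the linear FIFO is assumed to behave as a fixed-depth delay line that moves every word one stage per shift.

```python
class CircularFifo:
    """Pointer-based FIFO: words stay in place, read/write addresses move."""

    def __init__(self, depth):
        self.mem = [None] * depth
        self.depth = depth
        self.wr = 0      # write address pointer
        self.rd = 0      # read address pointer
        self.count = 0   # occupancy, used for full/empty detection

    def push(self, word):
        if self.count == self.depth:
            raise OverflowError("FIFO full")
        self.mem[self.wr] = word
        self.wr = (self.wr + 1) % self.depth  # wrap around: circular structure
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("FIFO empty")
        word = self.mem[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return word


class LinearShiftFifo:
    """Shift-register FIFO (assumed delay-line behavior): no addresses at all."""

    def __init__(self, depth):
        self.stages = [None] * depth

    def shift(self, word_in):
        # The last stage leaves the FIFO; every other word moves one stage on.
        word_out = self.stages[-1]
        self.stages = [word_in] + self.stages[:-1]
        return word_out


if __name__ == "__main__":
    circ = CircularFifo(4)
    for w in range(4):
        circ.push(w)
    print([circ.pop() for _ in range(4)])

    lin = LinearShiftFifo(4)
    print([lin.shift(w) for w in range(8)])
```

In this model the circular FIFO needs two pointers, wrap-around arithmetic, and occupancy tracking, whereas the linear FIFO needs only a shift. The price is that every stored word moves on every shift, which in hardware means every stage is a register rather than a denser memory cell.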
The first phase of this work focuses on implementing various FIFO designs on a Xilinx Virtex-7 series FPGA, utilizing different on-chip memory resources: registers, lookup tables (LUTs), and BRAM blocks. These implementations are investigated across different bit widths, in the range of hundreds of bits, and FIFO depths of several thousand stages. The second phase compares these FPGA implementations with an ASIC-based linear FIFO of similar size, synthesized using the TSMC 40 nm standard cell library. The results demonstrate that while the ASIC-based linear FIFO offers over twice the performance of the FPGA-based designs, it requires six to seven times more area. The area gap is attributed to register-based storage cells being approximately ten times larger than SRAM cells, highlighting the inherent trade-off between performance and area in such designs.
To address this area overhead, the thesis further investigates the use of two ring counters to replace the address decoders typically found in SRAM-based FIFO designs. As data storage requirements increase, the complexity of the address decoders also grows. By simplifying the control logic with ring counters, the design achieves area reductions of 50\% to 76\% for FIFO depths of 1 Kbits and 2 Kbits across various bit widths. These findings underscore the significant potential of this approach to optimize both area and performance in memory-intensive FPGA applications.
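The idea of replacing an address decoder with a ring counter can be sketched behaviorally. The Python model below is a hedged illustration, not the circuit synthesized in this thesis: `RingCounter`, `binary_decode`, and `RingPointerFifo` are hypothetical names, and the model assumes one ring counter selects the write row and the other selects the read row. A ring counter holds a one-hot word that rotates by one position per access, so its state can drive the row-select lines directly, whereas a binary pointer must first pass through a decoder.

```python
class RingCounter:
    """One-hot rotating register; its state is already a row-select vector."""

    def __init__(self, depth):
        self.depth = depth
        self.state = 1  # exactly one bit set, starting at row 0

    def advance(self):
        # Rotate left by one position, wrapping the top bit back to bit 0.
        msb = (self.state >> (self.depth - 1)) & 1
        mask = (1 << self.depth) - 1
        self.state = ((self.state << 1) & mask) | msb

    def selected_row(self):
        return self.state.bit_length() - 1


def binary_decode(addr, depth):
    """What an address decoder computes from a binary pointer (for comparison)."""
    return 1 << (addr % depth)


class RingPointerFifo:
    """FIFO whose write and read rows are chosen by two ring counters."""

    def __init__(self, depth):
        self.mem = [None] * depth
        self.wr = RingCounter(depth)
        self.rd = RingCounter(depth)
        self.count = 0

    def push(self, word):
        if self.count == len(self.mem):
            raise OverflowError("FIFO full")
        self.mem[self.wr.selected_row()] = word
        self.wr.advance()
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("FIFO empty")
        word = self.mem[self.rd.selected_row()]
        self.rd.advance()
        self.count -= 1
        return word
```

Because a FIFO only ever accesses rows sequentially, the one-hot state after k accesses equals the decoder output for address k modulo the depth, so the decoder becomes unnecessary. The ring counter needs one flip-flop per row, so the area savings reported above are a result of the thesis and do not follow from this model.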