High Performance ASIC Processor Design for DNA Basecallers
Abstract
Genomics has revolutionized medicine and biological research by providing deeper insights into the genetic makeup of organisms, advancing our understanding of diseases, and enabling personalized medicine. These breakthroughs are driven by advances in genome sequencing technologies, bioinformatics, and neural networks. The advent of third-generation sequencing technology has further accelerated progress by allowing for long-read sequencing, which enhances the accuracy and efficiency of genome assembly. Oxford Nanopore Technologies (ONT) offers advanced sequencers that use nanopore-based technology to read DNA sequences in real time. However, the raw sequenced data contains noise and requires a basecalling stage to read DNA sequences with the required accuracy. Basecalling relies on deep neural networks (DNNs) to achieve high-accuracy reads, but the significant computational power required makes basecalling a costly process, especially in real-time applications. The hardware accelerators currently used for these compute-intensive basecallers are state-of-the-art GPUs that cost over $10,000 per unit. This thesis explores the design of a custom hardware accelerator for ONT’s basecalling program, Bonito, with the aim of providing a cost-effective alternative to existing accelerators such as Nvidia GPUs and Groq Tensor Streaming Processors (TSPs). Bonito’s DNN is dominated by five Long Short-Term Memory (LSTM) layers, which account for 90% of its execution time. The custom accelerator targets these compute-intensive LSTMs to reduce execution time. This work provides a comprehensive analysis of LSTM performance and behavior on GPUs and Groq TSPs, emphasizing their architectural benefits and limitations in the context of basecalling. It also evaluates an Application-Specific Integrated Circuit (ASIC) implementation of an existing FPGA-based LSTM accelerator design.
The analysis shows that the LSTM layers of Bonito’s High-Accuracy (HAC) model contain many sequential matrix multiplications that are compute-intensive and require high memory bandwidth to accommodate the data transfers. Furthermore, Bonito’s small problem size, with 384 features per input vector, prevents GPUs from fully utilizing their available compute cores. Combined with slow per-core performance, GPUs executing LSTMs achieve only 13.5% of their maximum TFLOP/s at FP16 precision. Groq uses a heterogeneous architecture with fast, separate MXM units executing matrix multiplications and VXM units executing point-wise operations. Data travels between these compute units through streaming channels and MEM units. The LSTM analysis on Groq showed three main issues. First, Groq uses a 320-element-wide data channel to transfer data across the chip, whereas Bonito has 384 hidden features as input. Because Groq supports physical tensors of at most 320 elements, the 384-element input is sliced into two 192-element partial inputs, which effectively doubles the cycle cost by using two slices instead of one. Second, performing matrix multiplications requires these slices to be transferred from MEM to MXM by executing reload operations to the MXM weight buffer. This data transfer consumes bandwidth on the streaming channel, which introduces pipeline stalls, so clock cycles are spent on MEM operations instead of compute operations in the MXM or VXM. Lastly, the analysis shows that the VXM forms a bottleneck in the LSTM execution, accounting for 50% of the clock cycles, whereas the MXM accounts for the other 50%. This shows that the special functions and additions inside an LSTM cell slow down Groq’s overall performance. Following the analysis of the existing architectures, the ASIC evaluation produced a synthesis result in which one block of 384 LSTM Processing Engines (PEs) achieves a clock speed of 434 MHz and occupies 8.07 mm² on a 40 nm process node.
When these PEs are placed on a Groq-based chip layout with an area of 725 mm², the 40 nm PEs can achieve 79.3 TFLOP/s at FP16 precision. After scaling the results from the 40 nm process node to Groq’s 14 nm process node, the 14 nm PEs achieve 448 TFLOP/s at FP16. These results suggest that a custom LSTM accelerator could compete in performance with state-of-the-art solutions while being more cost-effective. Future work is suggested to investigate a cycle reduction in the multiply-accumulate stage and to evaluate the ASIC design on a modern process node, as the current ASIC design uses a 40 nm process node dating from 2008. This work supports the future development and optimization of a custom LSTM accelerator tailored to Bonito’s DNN requirements, paving the road toward more affordable genome sequencing.
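
The arithmetic behind several figures quoted in this abstract can be reproduced with a short sketch. The Python below is illustrative only and is not taken from the thesis. It assumes that each LSTM layer's input width equals its hidden width (384), that the cell is a standard four-gate LSTM, and that the whole 725 mm² die could be filled with PE blocks, which ignores memory, I/O, and routing and so gives an upper bound. All function names are hypothetical.

```python
import math

HIDDEN = 384           # features per input vector (from the abstract)
INPUT = 384            # ASSUMPTION: layer input width equals the hidden width
CHANNEL_WIDTH = 320    # Groq streaming channel width in elements (from the abstract)


def slice_plan(features, channel_width):
    """Split a vector into the fewest equal slices that fit the channel."""
    n_slices = math.ceil(features / channel_width)
    slice_width = math.ceil(features / n_slices)
    return n_slices, slice_width


def lstm_step_ops(input_size, hidden_size):
    """Operation counts for one time step of a standard four-gate LSTM cell.

    Bias additions are excluded. The matrix work is 4 gates, each a
    hidden_size x (input_size + hidden_size) matrix-vector product.
    """
    macs = 4 * hidden_size * (input_size + hidden_size)
    pointwise = {
        "sigmoid": 3 * hidden_size,   # input, forget, and output gates
        "tanh": 2 * hidden_size,      # candidate state and tanh(c_t)
        "multiply": 3 * hidden_size,  # f*c, i*g, o*tanh(c)
        "add": hidden_size,           # f*c + i*g
    }
    return macs, pointwise


# Groq slicing: 384 elements do not fit one 320-wide channel.
n_slices, slice_width = slice_plan(HIDDEN, CHANNEL_WIDTH)

# Matrix and point-wise work per LSTM time step.
macs, pointwise = lstm_step_ops(INPUT, HIDDEN)

# Upper bound on 8.07 mm^2 PE blocks in 725 mm^2 (ASSUMPTION: all area is PEs).
max_blocks = math.floor(725 / 8.07)

# Scaling factor implied by the two reported throughput figures.
implied_scaling = 448 / 79.3

print(f"slices: {n_slices} x {slice_width} elements")
print(f"MACs per time step: {macs}")
print(f"point-wise ops per time step: {pointwise}")
print(f"PE blocks (upper bound): {max_blocks}")
print(f"implied 40 nm -> 14 nm throughput factor: {implied_scaling:.2f}")
```

The slice plan reproduces the two 192-element partial inputs described above. The operation counts show why both the matrix units and the point-wise units matter in an LSTM cell. The implied throughput factor is only the ratio of the two reported figures and says nothing about how the thesis performed the process-node correction.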