Learning Structured Sparsity for Efficient Nanopore DNA Basecalling Using Delayed Masking

Abstract

High-accuracy nanopore basecalling uses large deep neural networks, which require powerful GPUs and are therefore undesirable for sequencing experiments outside the lab. Research has shown that this requirement can be circumvented by using smaller models to increase efficiency as well as basecalling speed. However, this comes at the cost of reduced accuracy, running counter to the trend of increasingly complex models that extract the highest possible accuracy from the source data. We propose learning structured sparsity during model training to find an improved trade-off between accuracy and model size, and thus basecalling speed. Our work introduces an improved pruning method that uses a delayed masking scheduler, removes redundant masks to save compute, and is optimized for the basecaller training process. Using a standardized benchmarking method, we find that model size can be reduced by up to 21× at the cost of a 0.1% to 1.3% reduction in match rate compared to Bonito-HAC. Our results indicate that the size of basecalling models can be reduced drastically with minimal impact on accuracy, as long as researchers use appropriate training methods. Furthermore, our work helps democratize nanopore DNA sequencing, broadening the reach and impact of this technology. The code with the masking mechanism to reproduce our results is available at https://github.com/meesfrensel/efficient-basecallers.
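To make the delayed-masking idea concrete, below is a minimal PyTorch sketch, not the paper's implementation. It assumes a hypothetical `DelayedMaskingScheduler` that leaves the model dense for a warm-up period and then re-applies channel-level structured masks after every optimizer step; the names and parameters (`structured_mask`, `target_sparsity`, `delay_steps`) are illustrative assumptions, and the actual mask granularity, schedule, and learning rule are those described in the paper and the linked repository.

```python
import torch


def structured_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Build a mask that zeroes whole output channels (rows) with the
    lowest L1 norm. Channel-level pruning is one common form of
    structured sparsity; the paper's mask granularity may differ."""
    num_rows = weight.shape[0]
    num_pruned = int(sparsity * num_rows)
    # Score each output channel by the L1 norm of its weights.
    row_scores = weight.abs().sum(dim=tuple(range(1, weight.dim())))
    pruned = torch.topk(row_scores, num_pruned, largest=False).indices
    mask = torch.ones(num_rows, dtype=weight.dtype, device=weight.device)
    mask[pruned] = 0.0
    # Reshape so the mask broadcasts over the remaining weight dims.
    return mask.view(-1, *([1] * (weight.dim() - 1)))


class DelayedMaskingScheduler:
    """Illustrative scheduler: train dense for `delay_steps` steps, then
    re-apply structured masks after each step. `target_sparsity` and
    `delay_steps` are hypothetical parameters for this sketch."""

    def __init__(self, layers, target_sparsity=0.9, delay_steps=10_000):
        self.layers = list(layers)
        self.target_sparsity = target_sparsity
        self.delay_steps = delay_steps
        self.step_count = 0

    def step(self):
        self.step_count += 1
        if self.step_count < self.delay_steps:
            return  # warm-up: let the dense model train undisturbed
        with torch.no_grad():
            for layer in self.layers:
                mask = structured_mask(layer.weight, self.target_sparsity)
                layer.weight.mul_(mask)  # prune selected channels in place
```

In a training loop, one would call `scheduler.step()` after each `optimizer.step()`, e.g. with `layers = [m for m in model.modules() if isinstance(m, torch.nn.Linear)]` for a hypothetical `model`. Because masks are only applied after the warm-up and are recomputed each step, gradients reach all weights early in training and the pruned structure can still shift as training proceeds.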