GSST: High Throughput Parallel String Decompression on GPU


Abstract

This thesis describes how the throughput of data ingestion on GPUs can be increased by using data compression. This is done through two main contributions. First, a high-level model is presented to assess the impact of compression on ingestion throughput. Second, a novel decompression algorithm called GSST (GPU Static Symbol Table) is developed, optimized for GPU parallelism. GSST achieves state-of-the-art performance, striking an effective balance between compression ratio and decompression throughput.
The work done in this thesis has contributed to the submission of two scientific papers:
1. GSST: Parallel string decompression at 150 GB/s on GPU [1] (to be updated to 191 GB/s upon submission).
2. Benchmarking GPU Direct Storage for High-Performance Filesystems: Impact & Future Trends [2]
The ingestion throughput model quantifies the impact of data compression on GPU data ingestion. It offers insight into how the compression ratio and decompression throughput jointly determine overall ingestion performance, and it shows that, as storage devices become faster, the decompression algorithm must increase its throughput to keep pace, while the compression ratio becomes less influential on the ingestion throughput.
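The abstract does not spell out the model's exact form, but its stated behavior (faster links shift the bottleneck to the decompressor) is consistent with a simple two-stage pipeline bound. The sketch below assumes that form: effective ingestion of uncompressed data is limited by whichever is slower, the link rate scaled up by the compression ratio or the decompression throughput. The function name and the min-of-stages formulation are assumptions for illustration, not the thesis's exact model.

```python
def ingestion_throughput(link_gbps: float, ratio: float, decomp_gbps: float) -> float:
    """Effective uncompressed-data ingestion rate in GB/s.

    A compressed stream arriving at link_gbps expands by `ratio` after
    decompression, so the pipeline is bounded by the slower stage:
    the expanded link rate or the decompressor itself.
    (Assumed min-of-stages model, for illustration only.)
    """
    return min(link_gbps * ratio, decomp_gbps)

# GSST figures from the abstract: 2.74x ratio, 191 GB/s decompression.
print(ingestion_throughput(25.0, 2.74, 191.0))   # link-bound regime
print(ingestion_throughput(100.0, 2.74, 191.0))  # decompressor-bound regime
```

Under this assumed model, raising the link speed only helps until the decompressor saturates, which matches the abstract's observation that decompression throughput matters more as storage gets faster.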
GSST is the solution proposed to increase data ingestion throughput on GPUs. It adapts the FSST (Fast Static Symbol Table) algorithm to the parallel architecture of GPUs. GSST's performance is driven by six optimizations. Three format optimizations (block parallelism, split parallelism, and coalesced memory access) increase parallelism and throughput by changing the way data is stored. Additionally, three memory management techniques are implemented to effectively utilize the memory throughput of a GPU: the use of shared memory, aligned memory accesses, and asynchronous data transfers.
Using the ingestion throughput model, GSST is evaluated against the state-of-the-art GPU compression algorithms from nvCOMP. The results reveal that GSST achieves a decompression throughput of 191 GB/s with a compression ratio of 2.74 on an NVIDIA A100. While nvCOMP's ANS and Bitcomp outperform GSST in decompression throughput, they offer lower compression ratios. Conversely, Zstd achieves a higher compression ratio but significantly lower decompression throughput, positioning GSST as an effective balance of decompression throughput and compression ratio.
The data ingestion model demonstrates that GSST offers the highest ingestion throughput among the tested compression algorithms when ingesting data over a connection with a throughput between 0.8 GB/s and 87 GB/s. This means GSST is ideally suited for use with top-of-the-line networking equipment and even provides headroom for future improvements in connection throughput. Additionally, GSST is extremely memory-efficient, using significantly less GPU memory than all of nvCOMP’s compression algorithms. In some cases, GSST uses 3,500 times less memory, and in the best scenarios, over 67 million times less.
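The 0.8 to 87 GB/s window can be understood as a crossover effect: high-ratio, slow decompressors win on very slow links, while low-ratio, very fast decompressors win only once the link itself is extremely fast. The sketch below illustrates this with the same assumed min-of-stages model; the GSST figures are from the abstract, but the two competitor profiles are hypothetical placeholders (not measured nvCOMP numbers), chosen only to show how such a window arises.

```python
def ingest(link: float, ratio: float, decomp: float) -> float:
    # Effective uncompressed ingestion rate (GB/s): the slower of
    # the expanded link stage and the decompression stage.
    # (Assumed min-of-stages model, for illustration only.)
    return min(link * ratio, decomp)

algos = {
    "GSST":             (2.74, 191.0),  # figures from the abstract
    "fast, low ratio":  (1.50, 400.0),  # hypothetical ANS/Bitcomp-like profile
    "slow, high ratio": (3.50, 20.0),   # hypothetical Zstd-like profile
}

for link in (0.5, 10.0, 150.0):
    best = max(algos, key=lambda a: ingest(link, *algos[a]))
    print(f"{link:6.1f} GB/s link -> best: {best}")
```

With these placeholder profiles, the high-ratio algorithm wins on the slowest link, the low-ratio but faster decompressor wins on the fastest link, and the balanced option wins across the middle range, mirroring the window the model identifies for GSST.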
By leveraging these format and memory management optimizations, GSST provides a powerful, efficient solution for industries using large-scale data systems such as high-performance computing and data analytics.
The GSST source code will be made available on GitHub [3].

Files

GSST_Msc_Thesis_Robin_Vonk_Rep... (pdf)
- Embargo expired on 01-01-2025
Unknown license