A High-Bandwidth Snappy Decompressor in Reconfigurable Logic

Fang, J.; Chen, Jianyu; Al-Ars, Zaid; Hofstee, H.P.; Hidders, Jan

doi:10.1109/CODESISSS.2018.8525953

Conference paper (2018)

Authors

J. Fang (Computer Engineering)

Jianyu Chen (Student)

Zaid Al-Ars (Computer Engineering)

H.P. Hofstee (Computer Engineering)

Jan Hidders (Vrije Universiteit Brussel)

Research Group

Computer Engineering

DOI: https://doi.org/10.1109/CODESISSS.2018.8525953

To reference this document use:

http://resolver.tudelft.nl/uuid:d7c56cf6-9817-4f8d-bdb2-98e287e5b9d8

Published Date

30-09-2018

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

While in-memory databases have largely removed I/O as a bottleneck for database operations, loading the data from storage into memory remains a significant limiter to end-to-end performance. Snappy is a widely used compression algorithm in the Hadoop ecosystem and in database systems, and is an option in often-used file formats such as Parquet and ORC. Compression reduces the amount of data that must be transferred to and from storage, saving both storage space and storage bandwidth. While it is easy for a CPU Snappy decompressor to keep up with the bandwidth of a hard disk drive, when moving to NVMe devices attached with high-bandwidth connections such as PCIe Gen4 or OpenCAPI, the decompression speed of a CPU is insufficient. We propose an FPGA-based Snappy decompressor that can process multiple tokens in parallel and operates on each FPGA block RAM independently. Read commands are recycled until the read data is valid, dramatically reducing control complexity. One instance of our decompression engine takes 9% of the LUTs in the XCKU15P FPGA and achieves up to 3 GB/s (5 GB/s) decompression rate from the input (output) side, about an order of magnitude faster than a single CPU thread. Parquet allows for independent decompression of multiple pages, and instantiating eight of these units on an XCKU15P FPGA can keep up with the highest-performance interface bandwidths.
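
For readers unfamiliar with the "tokens" the abstract refers to, the following is a minimal software sketch of decoding the raw (unframed) Snappy block format, which consists of a varint length followed by literal and copy tokens. It is an illustration only and not the FPGA design described in the paper; the function names `read_varint` and `snappy_decompress` are hypothetical and chosen for this example.

```python
def read_varint(buf, pos):
    """Decode a little-endian base-128 varint; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def snappy_decompress(buf):
    """Decode a raw Snappy block sequentially, one token at a time.

    Illustrative sketch only (hypothetical helper, not the paper's engine).
    """
    expected_len, pos = read_varint(buf, 0)  # preamble: uncompressed length
    out = bytearray()
    while pos < len(buf):
        tag = buf[pos]
        pos += 1
        kind = tag & 0x03  # low two bits select the token type
        if kind == 0:
            # Literal token: the upper six bits hold length - 1, or the
            # values 60..63 meaning the length follows in 1..4 bytes.
            length = (tag >> 2) + 1
            if length > 60:
                extra = length - 60
                length = int.from_bytes(buf[pos:pos + extra], "little") + 1
                pos += extra
            out += buf[pos:pos + length]
            pos += length
            continue
        if kind == 1:
            # Copy with 1-byte offset: length 4..11, 11-bit offset.
            length = ((tag >> 2) & 0x07) + 4
            offset = ((tag >> 5) << 8) | buf[pos]
            pos += 1
        else:
            # Copy with 2-byte (kind 2) or 4-byte (kind 3) offset.
            length = (tag >> 2) + 1
            width = 2 if kind == 2 else 4
            offset = int.from_bytes(buf[pos:pos + width], "little")
            pos += width
        if offset == 0 or offset > len(out):
            raise ValueError("invalid copy offset")
        start = len(out) - offset
        # Byte-by-byte so that overlapping copies (offset < length) work.
        for i in range(length):
            out.append(out[start + i])
    if len(out) != expected_len:
        raise ValueError("decoded length does not match preamble")
    return bytes(out)


# A 4-byte literal "abcd" followed by a copy of 8 bytes at offset 4.
compressed = bytes([0x0C, 0x0C]) + b"abcd" + bytes([0x11, 0x04])
print(snappy_decompress(compressed))
```

Each copy token reads bytes that earlier tokens produced, so a sequential loop like this one handles a single token per iteration. That dependency is presumably part of what makes processing multiple tokens in parallel, as the abstract describes, a hardware design problem rather than a simple replication of decoders.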

License info not available