Transparently Accelerating Spark SQL Code on Computing Hardware

Abstract

Driven by new digital business models, the importance of big data analytics continues to grow. Initially, data analytics clusters were mainly limited by the throughput of network links and the performance of I/O operations. With recent hardware developments, this has changed, and the performance of CPUs and memory access has often become the new limiting factor. Heterogeneous computing systems, consisting of CPUs and other computing hardware such as GPUs and FPGAs, try to overcome this limitation by offloading the computational work to the most suitable hardware.

Accelerating computations by offloading work to specialized computing hardware often requires expert knowledge and extensive effort. In contrast, Apache Spark has become one of the most widely used data analytics tools, not least because of its user-friendly API. Notably, its Spark SQL component allows users to define declarative queries without having to write any imperative code. The present work investigates how to close this gap and elaborates on how Spark SQL's internal information can be used to offload computations without requiring the user to configure Spark any further.
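As a brief illustration of this declarative style, the following sketch submits a plain SQL query through Spark's Scala API; the table name, schema, and sample data are purely hypothetical and not taken from the thesis.

import org.apache.spark.sql.SparkSession

object DeclarativeQueryExample {
  def main(args: Array[String]): Unit = {
    // Start a local Spark session; master URL and application name are illustrative.
    val spark = SparkSession.builder()
      .appName("spark-sql-declarative-example")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Register a small in-memory table; name, schema, and data are made up.
    Seq((1, 42.0), (2, 17.5), (3, 99.9))
      .toDF("id", "amount")
      .createOrReplaceTempView("orders")

    // The query itself is purely declarative; Spark's Catalyst optimizer
    // decides how it is planned and executed, which is the hook a
    // transparent accelerator integration can exploit.
    spark.sql("SELECT id, amount * 1.19 AS gross FROM orders WHERE amount > 20").show()

    spark.stop()
  }
}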

To this end, the present work uses the Apache Arrow in-memory format to exchange data efficiently between different accelerators. It evaluates Spark SQL's extensibility for providing custom acceleration and its new columnar processing facilities, including their compatibility with the Apache Arrow format. Furthermore, the present work demonstrates the technical feasibility of such an acceleration by providing a Proof-of-Concept implementation that integrates Spark with tools from the Arrow ecosystem, such as Gandiva and Fletcher. Gandiva uses the SIMD capabilities of modern CPUs to accelerate computations, and Fletcher enables FPGA-accelerated computations. Finally, the present work demonstrates that, even for simple computations, integrating these accelerators leads to significant performance improvements: with Gandiva the computation became 1.27 times faster, and with Fletcher up to 13 times faster.
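To give an impression of the kind of extension point such an integration can use, the following is a minimal, hypothetical sketch of Spark 3.x's SparkSessionExtensions.injectColumnar hook; the class name and the pass-through rule are illustrative assumptions and do not reproduce the thesis's actual Proof-of-Concept code.

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

// Hypothetical extension class; Spark loads it when it is listed in the
// spark.sql.extensions configuration property.
class AcceleratorExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectColumnar { _: SparkSession =>
      new ColumnarRule {
        // Runs before Spark inserts row-to-column transitions, so eligible
        // operators can be swapped for Arrow-columnar, accelerator-backed ones.
        override def preColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
          override def apply(plan: SparkPlan): SparkPlan = {
            // A real rule would pattern-match on projections and filters whose
            // expressions Gandiva or Fletcher can evaluate and return custom
            // columnar plan nodes; this sketch leaves the plan unchanged.
            plan
          }
        }
      }
    }
  }
}

Because such an extension class is activated through the spark.sql.extensions configuration property, the replacement of plan nodes remains invisible to the user, who keeps writing ordinary declarative queries.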