Workload Characterization and Modeling, and the Design and Evaluation of Cache Policies for Big Data Storage Workloads in the Cloud
Abstract
The proliferation of big data processing platforms has already led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems enables tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. In this work, we focus on understanding the behavior and cache performance of the storage subsystem used for Spark workloads in the cloud. First, we statistically characterize its usage. Second, we design a generative model to address the scarcity of workload traces. Third, we design a cache policy that puts the insights from our characterization to work. Finally, we evaluate the performance of different cache policies for big data workloads via simulation.