Volker Markl | TU Delft Repository

Scotty: General and Efficient Open-source Window Aggregation for Stream Processing Systems

Journal article (2021) - Jonas Traub (author), Philipp Marian Grulich (author), Alejandro Rodríguez Cuéllar (author), Sebastian Breß (author), Asterios Katsifodimos (author), Tilmann Rabl (author), Volker Markl (author)

Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, or minimizing memory usage. However, each technique operates under different assumptions with respect to workload chara ...

An Intermediate Representation for Optimizing Machine Learning Pipelines

Journal article (2019) - Andreas Kunft (author), Asterios Katsifodimos (author), Sebastian Schelter (author), Sebastian Breß (author), Tilmann Rabl (author), Volker Markl (author)

Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses ...

Poster: Generating Reproducible Out-of-Order Data Streams

Conference paper (2019) - Philipp Grulich (author), Jonas Traub (author), Sebastian Bress (author), A Katsifodimos (author), Volker Markl (author), Tilmann Rabl (author)

Evaluating modern stream processing systems in a reproducible manner requires data streams with different data distributions, data rates, and real-world characteristics such as delayed and out-of-order tuples. In this paper, we present an open source stream generator which genera ...

Efficient Window Aggregation with General Stream Slicing

Conference paper (2019) - Jonas Traub (author), Philipp Marian Grulich (author), Alejandro Rodríguez Cuéllar (author), Sebastian Bress (author), A Katsifodimos (author), Tilmann Rabl (author), Volker Markl (author)

Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, and minimizing memory usage. However, each technique operates under different assumptions with respect to workload char ...

Muses

Distributed data migration system for polystores

Conference paper (2019) - Abdulrahman Kaitoua (author), Tilmann Rabl (author), A. Katsifodimos (author), Volker Markl (author)

Large datasets can originate from various sources and are being stored in heterogeneous formats, schemas, and locations. Typical data science tasks need to combine those datasets in order to increase their value and extract knowledge. This is done in various data processing syste ...

Benchmarking Distributed Stream Data Processing Systems

Conference paper (2018) - Jeyhun Karimov (author), Tilmann Rabl (author), A Katsifodimos (author), Roman Samarev (author), Henri Heiskanen (author), Volker Markl (author)

The need for scalable and efficient stream analysis has led to the development of many open-source streaming data processing systems (SDPSs) with highly diverging capabilities and performance characteristics. While first initiatives try to compare the systems for simple workloads ...

Scotty

Efficient window aggregation for out-of-order stream processing

Conference paper (2018) - Jonas Traub (author), Philipp Marian Grulich (author), Alejandro Rodríguez Cuéllar (author), Sebastian Bress (author), A Katsifodimos (author), Tilmann Rabl (author), Volker Markl (author)

Computing aggregates over windows is at the core of virtually every stream processing job. Typical stream processing applications involve overlapping windows and, therefore, cause redundant computations. Several techniques prevent this redundancy by sharing partial aggregates amo ...

Optimized on-demand data streaming from sensor nodes

Conference paper (2017) - Jonas Traub (author), Sebastian Breß (author), Tilmann Rabl (author), Asterios Katsifodimos (author), Volker Markl (author)

Real-time sensor data enables diverse applications such as smart metering, traffic monitoring, and sport analysis. In the Internet of Things, billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all availab ...

Large-scale data stream processing systems

Book chapter (2017) - Paris Carbone (author), Gábor E. Gévay (author), Gábor Hermann (author), A Katsifodimos (author), Juan Soto (author), Volker Markl (author), Seif Haridi (author)

In our data-centric society, online services, decision making, and other aspects are increasingly becoming heavily dependent on trends and patterns extracted from data. A broad class of societal-scale data management problems requires system support for processing unbounded data ...

BlockJoin

Efficient Matrix Partitioning Through Joins

Conference paper (2017) - Andreas Kunft (author), Asterios Katsifodimos (author), Sebastian Schelter (author), Tilmann Rabl (author), Volker Markl (author)

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operator ...

Apache Flink in current research

Journal article (2016) - Tilmann Rabl (author), Jonas Traub (author), A. Katsifodimos (author), Volker Markl (author)

Emma in action

Declarative Dataflows for scalable data analysis

Conference paper (2016) - Alexander Alexandrov (author), Andreas Salzmann (author), Georgi Krastev (author), Asterios Katsifodimos (author), Volker Markl (author)

Parallel dataflow APIs based on second-order functions were originally seen as a flexible alternative to SQL. Over time, however, their complexity increased due to the number of physical aspects that had to be exposed by the underlying engines in order to facilitate efficient executi ...

Bridging the Gap

Towards optimization across linear and relational Algebra

Conference paper (2016) - Andreas Kunft (author), Alexander Alexandrov (author), A. Katsifodimos (author), Volker Markl (author)

Advanced data analysis typically requires some form of preprocessing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, e ...

Cutty

Aggregate sharing for user-defined windows

Conference paper (2016) - Paris Carbone (author), Jonas Traub (author), Asterios Katsifodimos (author), Seif Haridi (author), Volker Markl (author)

Aggregation queries on data streams are evaluated over evolving and often overlapping logical views called windows. While the aggregation of periodic windows was extensively studied in the past through the use of aggregate sharing techniques such as Panes and Pairs, little to no ...

Implicit Parallelism through Deep Language Embedding

Journal article (2016) - Alexander Alexandrov (author), A Katsifodimos (author), Georgi Krastev (author), Volker Markl (author)

Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, over the past decade this programming paradigm has found its way into the core APIs of parallel dataflow engine ...

Implicit parallelism through deep language embedding

Conference paper (2015) - Alexander Alexandrov (author), Andreas Kunft (author), A. Katsifodimos (author), Felix Schüler (author), Lauritz Thamsen (author), Odej Kao (author), Tobias Herb (author), Volker Markl (author)

The appeal of MapReduce has spawned a family of systems that implement or extend it. In order to enable parallel collection processing with User-Defined Functions (UDFs), these systems expose extensions of the MapReduce programming model as library-based dataflow APIs that are tigh ...

Optimistic recovery for iterative dataflows in action

Conference paper (2015) - Sergey Dudoladov (author), C. Xu (author), Sebastian Schelter (author), Asterios Katsifodimos (author), Stephan Ewen (author), Kostas Tzoumas (author), Volker Markl (author)

Over the past years, parallel dataflow systems have been employed for advanced analytics in the field of data mining where many algorithms are iterative. These systems typically provide fault tolerance by periodically checkpointing the algorithm's state and, in case of failure, r ...

Apache Flink™

Stream and Batch Processing in a Single Engine

Journal article (2015) - Paris Carbone (author), Asterios Katsifodimos (author), Stephan Ewen (author), Volker Markl (author), Seif Haridi (author), Kostas Tzoumas (author)

Apache Flink is an open-source system for processing streaming and batch data. Flink is built on the philosophy that many classes of data processing applications, including real-time analytics, continuous data pipelines, historic data processing (batch), and iterative algorithms ...