M. Fragkoulis

Stateful Entities: Object-oriented Cloud Applications as Distributed Dataflows

Conference paper (2024) - Kyriakos Psarakis (author), W.D. Zorgdrager (author), Marios Fragkoulis (author), Guido Salvaneschi (author), Asterios Katsifodimos (author)

Although the cloud has reached a state of robustness, the burden of using its resources falls on the shoulders of programmers who struggle to keep up with ever-growing cloud infrastructure services and abstractions. As a result, state management, scaling, operation, and failure m ...

Evaluating Stream Processing Autoscalers

While the concept of large-scale stream processing is very popular nowadays, efficient dynamic allocation of resources is still an open issue in the area. The database research community has yet to evaluate different autoscaling techniques for stream processing engines under a ro ...

CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows

Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the moment of writing, virtu ...

A survey on the evolution of stream processing systems

Journal article (2023) - Marios Fragkoulis (author), Paris Carbone (author), Vasiliki Kalavri (author), Asterios Katsifodimos (author)

Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamen ...

Towards Evaluating Stream Processing Autoscalers

In this work, we evaluate autoscaling solutions for stream processing engines. Although autoscaling has become a mainstream subject of research in the last decade, the database research community has yet to evaluate different autoscaling techniques under a proper benchmarking set ...

Stateful Entities: Object-oriented Cloud Applications as Distributed Dataflows

Abstract (2023) - Kyriakos Psarakis (author), W.D. Zorgdrager (author), Marios Fragkoulis (author), Guido Salvaneschi (author), Asterios Katsifodimos (author)

While there are multiple approaches for distributed application programming (e.g., Bloom [2], Hilda [14], Cloudburst [12], AWS Lambda, Azure Durable Functions, and Orleans [3, 4]), in practice developers mainly use libraries of popular general-purpose languages, such as Spring Boot in Java and Flask in Python. None of these approaches offers message processing guarantees, failing to support exactly-once processing: the ability of a system to reflect the changes of a message to the state exactly one time. Instead, all of the above approaches offer at-most-once or at-least-once processing semantics. Programmers then have to “pollute” their business logic with consistency checks, state rollbacks, timeouts, retries, and idempotency [8, 9]. We argue that no matter how we approach cloud programming, unless an execution engine offers exactly-once processing guarantees, we will never remove the burden of distributed systems aspects from programmers. In short, exactly-once processing should be assumed at the level of the programming model. To the best of our knowledge, the only systems able to guarantee exactly-once message processing [5, 11] at the time of writing are batch [1, 7, 15] and streaming [6, 10, 13] dataflow systems. However, their programming model follows the paradigm of functional dataflow APIs, which are cumbersome to use and require training and heavy rewrites of the typical imperative code that developers prefer to use for expressing application logic. For these reasons, we believe that the dataflow model should be used as a low-level IR for the modeling and execution of distributed applications, but not as a programmer-facing model. Technically, one of the main challenges in adopting a dataflow-based intermediate representation is that the dataflow model is essentially functional, with immutable values being propagated across operators that typically do not share a global state.
Hence, adopting a dataflow-based IR entails translating (arbitrary) imperative code into the functional style. Compiler research has systematically explored only the opposite direction: compiling code in functional programming languages into a representation that is executable on imperative architectures, like virtually all modern microprocessors. The translation from imperative to functional or dataflow programming, however, remains largely unexplored. To this end, we report on Stateful Entities, a prototypical programming model (exemplified in Figure 1), compiler pipeline, and IR that compiles imperative, transactional object-oriented applications into distributed dataflow graphs and executes them on existing dataflow systems. The proposed system presented in this paper can be found at: https://github.com/delftdata/stateflow. Our preliminary experiments showed that the translation of imperative programs into dataflow graphs yields very promising performance results, with latency below 50 ms.
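
The "pollution" of business logic mentioned in the abstract can be illustrated with a minimal sketch. It assumes an at-least-once channel that redelivers one message, and it uses a hypothetical `Account` entity with a `seen_ids` set for deduplication. None of these names come from the paper or from the stateflow repository; the code only shows the kind of idempotency check a programmer has to write by hand when the execution engine does not provide exactly-once processing.

```python
from dataclasses import dataclass, field


@dataclass
class Account:
    """Hypothetical entity whose state should reflect each message once."""
    balance: int = 0
    seen_ids: set = field(default_factory=set)  # ids of messages already applied

    def deposit(self, message_id: str, amount: int) -> bool:
        """Apply a deposit unless this message id was already applied."""
        if message_id in self.seen_ids:
            return False  # duplicate delivery: leave the state untouched
        self.balance += amount
        self.seen_ids.add(message_id)
        return True


# Assumed at-least-once delivery: "m1" arrives a second time after a retry.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]

naive_balance = 0
account = Account()
for message_id, amount in deliveries:
    naive_balance += amount              # no check: the duplicate is counted twice
    account.deposit(message_id, amount)  # deduplicated by message id

print(naive_balance, account.balance)
```

The deduplication set is bookkeeping that has nothing to do with the deposit itself, which is the sort of code the abstract argues an exactly-once execution engine should make unnecessary.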

Leveraging Large Language Models for Sequential Recommendation

Conference paper (2023) - Jesse Harte (author), Wouter Zorgdrager (author), Panagiotis Louridas (author), Asterios Katsifodimos (author), Dietmar Jannach (author), Marios Fragkoulis (author)

Sequential recommendation problems have received increasing attention in research during the past few years, leading to the inception of a large variety of algorithmic approaches. In this work, we explore how large language models (LLMs), which are nowadays introducing disruptive ...

S-QUERY

Opening the Black Box of Internal Stream Processor State

Conference paper (2022) - Jim Verheijde (author), Vassilios Karakoidas (author), Marios Fragkoulis (author), Asterios Katsifodimos (author)

Distributed streaming dataflow systems have evolved into scalable and fault-tolerant production-grade systems. Their applicability has departed from the mere analysis of streaming windows and complex-event processing, and now includes cloud applications and machine learning infer ...

Join Path-Based Data Augmentation for Decision Trees

Conference paper (2022) - Andra Ionescu (author), Rihan Hai (author), Marios Fragkoulis (author), Asterios Katsifodimos (author)

Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the ric ...

Transactions across serverless functions leveraging stateful dataflows

Journal article (2022) - Martijn de Heus (author), Kyriakos Psarakis (author), Marios Fragkoulis (author), Asterios Katsifodimos (author)

Serverless computing is currently the fastest-growing cloud services segment. The most prominent serverless offering is Function-as-a-Service (FaaS), where users write functions and the cloud automates deployment, maintenance, and scalability. Although FaaS is a good fit for executing stateless functions, it does not adequately support stateful constructs like microservices and scalable, low-latency cloud applications. Recently, there have been multiple attempts to add first-class support for state in FaaS systems, such as Microsoft Orleans, Azure Durable Functions, or Beldi. These approaches execute business code inside stateless functions, handing over their state to an external database. In contrast, approaches such as Apache Flink's StateFun follow a different design: a dataflow system such as Apache Flink handles all state management, messaging, and checkpointing by executing a stateful dataflow graph that provides exactly-once state processing guarantees. This design relieves programmers from having to “pollute” their business logic with distributed systems error checking, management, and mitigation. Although programmers can easily develop applications without worrying about messaging and state management, executing transactions across stateful functions remains an open problem. In this paper, we introduce a programming model and implementation for transaction orchestration of stateful serverless functions. Our programming model supports serializable distributed transactions with two-phase commit, as well as eventually consistent workflows with Sagas. We design and implement our programming model on Apache Flink StateFun to leverage Flink's exactly-once processing and state management guarantees. Our experiments show that the approach of building transactional systems on top of dataflow graphs can achieve very high throughput, but with latency overhead due to the checkpointing mechanism that guarantees exactly-once processing.
We compare our approach to Beldi, which implements two-phase commit on AWS Lambda functions backed by DynamoDB for state management, as well as to an implementation of a system that uses CockroachDB as its backend.
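
As a rough illustration of the Saga style of workflow named in the abstract, the sketch below runs a list of steps in order and, if one fails, undoes the steps that already completed by calling their compensating actions in reverse. The `run_saga` helper, the step functions, and the `state` dictionary are hypothetical and run in a single process. They are not the paper's programming model or the StateFun API, and the sketch ignores messaging, failures of the coordinator, and persistence.

```python
from typing import Callable, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[Callable[[dict], None], Callable[[dict], None]]


def run_saga(state: dict, steps: List[Step]) -> bool:
    """Run actions in order; on failure, compensate completed steps in reverse."""
    completed: List[Callable[[dict], None]] = []
    for action, compensation in steps:
        try:
            action(state)
            completed.append(compensation)
        except Exception:
            for undo in reversed(completed):
                undo(state)
            return False
    return True


def reserve_stock(state: dict) -> None:
    state["stock"] -= 1


def release_stock(state: dict) -> None:
    state["stock"] += 1


def charge_payment(state: dict) -> None:
    if state["credit"] < state["price"]:
        raise ValueError("insufficient credit")
    state["credit"] -= state["price"]


def refund_payment(state: dict) -> None:
    state["credit"] += state["price"]


# The payment step fails, so the stock reservation is compensated.
state = {"stock": 3, "credit": 5, "price": 10}
ok = run_saga(state, [(reserve_stock, release_stock), (charge_payment, refund_payment)])
print(ok, state)
```

Between the reservation and its compensation, other readers could observe the reduced stock, which is why Saga workflows are described as eventually consistent, in contrast to serializable transactions with two-phase commit.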

Clonos

Consistent Causal Recovery for Highly-Available Streaming Dataflows

Conference paper (2021) - Pedro F. Silvestre (author), Marios Fragkoulis (author), Diomidis Spinellis (author), Asterios Katsifodimos (author)

Stream processing lies in the backbone of modern businesses, being employed for mission critical applications such as real-time fraud detection, car-trip fare calculations, traffic management, and stock trading. Large-scale applications are executed by scale-out stream processing ...

Valentine in Action

Matching Tabular Data at Scale

Journal article (2021) - Christos Koutras (author), Kyriakos Psarakis (author), George Siachamis (author), Andra Ionescu (author), Marios Fragkoulis (author), Angela Bonifati (author), Asterios Katsifodimos (author)

Capturing relationships among heterogeneous datasets in large data lakes - traditionally termed schema matching - is one of the most challenging problems that corporations and institutions face nowadays. Discovering and integrating datasets heavily relies on the effectiveness of ...

Hazelcast Jet

Low-latency stream processing at the 99.99th percentile

Journal article (2021) - Can Gencer (author), Marko Topolnik (author), Viliam Ďurina (author), Emin Demirci (author), Ensar B. Kahveci (author), Ali Gürbüz (author), Ondřej Lukáš (author), Marios Fragkoulis (author), Asterios Katsifodimos (author), More authors...

Jet is an open-source, high-performance, distributed stream processor built at Hazelcast during the last five years. Jet was engineered with millisecond latency on the 99.99th percentile as its primary design goal. Originally Jet’s purpose was to be an execution engine that perfo ...

Distributed transactions on serverless stateful functions

Conference paper (2021) - Martijn de Heus (author), Kyriakos Psarakis (author), Marios Fragkoulis (author), Asterios Katsifodimos (author)

Serverless computing is currently the fastest-growing cloud services segment. The most prominent serverless offering is Function-as-a-Service (FaaS), where users write functions and the cloud automates deployment, maintenance, and scalability. Although FaaS is a good fit for exec ...

Valentine: Evaluating Matching Techniques for Dataset Discovery

Conference paper (2021) - Christos Koutras (author), George Siachamis (author), Andra Ionescu (author), Kyriakos Psarakis (author), Jerry Brons (author), Marios Fragkoulis (author), Christoph Lofi (author), Angela Bonifati (author), Asterios Katsifodimos (author)

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema ...

Beyond Analytics

The Evolution of Stream Processing Systems

Conference paper (2020) - Paris Carbone (author), Marios Fragkoulis (author), Vasiliki Kalavri (author), Asterios Katsifodimos (author)

Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. The goal of this tutorial is threefold. First, we aim to ...

REMA

Graph embeddings-based relational schema matching

Abstract (2020) - Christos Koutras (author), Marios Fragkoulis (author), Asterios Katsifodimos (author), Christoph Lofi (author)

Schema matching is the process of capturing correspondence between attributes of different datasets and it is one of the most important prerequisite steps for analyzing heterogeneous data collections. State-of-the-art schema matching algorithms that use simple schema- or instance ...

Live interactive queries to a software application's memory profile

Journal article (2019) - Marios Fragkoulis (author), Diomidis Spinellis (author), Panagiotis Louridas (author)

Memory operations are critical to an application's reliability and performance. To reason about their correctness and track opportunities for optimisations, sophisticated instrumentation frameworks, such as Valgrind and Pin, have been developed. Both provide only limited faciliti ...

Operational stream processing

Towards scalable and consistent event-driven applications

Conference paper (2019) - Asterios Katsifodimos (author), Marios Fragkoulis (author)

In the last decade we are witnessing a widespread adoption of architectural styles such as microservices, for building event-driven software applications and deploying them in cloud infrastructures. Such services favor the separation of a database into independent silos of data, ...

Smelly relations

Measuring and understanding database schema quality

Conference paper (2018) - Tushar Sharma (author), Marios Fragkoulis (author), Stamatia Rizou (author), Magiel Bruntink (author), Diomidis Spinellis (author)

Context: Databases are an integral element of enterprise applications. Similarly to code, database schemas are also prone to smells - best practice violations. Objective: We aim to explore database schema quality, associated characteristics and their relationships with other soft ...
