A. Katsifodimos | TU Delft Repository

Evaluating Stream Processing Autoscalers

Conference paper (2024) - G. Siachamis (author), G.C. Christodoulou (author), K. Psarakis (author), M. Fragkoulis (author), A. van Deursen (author), A Katsifodimos (author)

While the concept of large-scale stream processing is very popular nowadays, efficient dynamic allocation of resources is still an open issue in the area. The database research community has yet to evaluate different autoscaling techniques for stream processing engines under a ro ...

Stateful Entities: Object-oriented Cloud Applications as Distributed Dataflows

Conference paper (2024) - K. Psarakis (author), W.D. Zorgdrager (author), M. Fragkoulis (author), Guido Salvaneschi (author), A Katsifodimos (author)

Although the cloud has reached a state of robustness, the burden of using its resources falls on the shoulders of programmers who struggle to keep up with ever-growing cloud infrastructure services and abstractions. As a result, state management, scaling, operation, and failure m ...

CheckMate: Evaluating Checkpointing Protocols for Streaming Dataflows

Conference paper (2024) - G. Siachamis (author), K. Psarakis (author), M. Fragkoulis (author), A. van Deursen (author), Paris Carbone (author), A Katsifodimos (author)

Stream processing in the last decade has seen broad adoption in both commercial and research settings. One key element for this success is the ability of modern stream processors to handle failures while ensuring exactly-once processing guarantees. At the moment of writing, virtu ...

Key Insights from a Feature Discovery User Study

Conference paper (2024) - A. Ionescu (author), Zeger Mouw (author), E.A. Aivaloglou (author), A Katsifodimos (author)

Multiple works in data management research focus on automating the processes of data augmentation and feature discovery to save users from having to perform these tasks manually. Yet, this automation often leads to a disconnect with the users, as it fails to consider the specific ...

Adaptive Distributed Streaming Similarity Joins

Conference paper (2023) - G. Siachamis (author), K. Psarakis (author), M. Fragkoulis (author), Odysseas Papapetrou (author), A. van Deursen (author), A Katsifodimos (author)

How can we perform similarity joins of multi-dimensional streams in a distributed fashion, achieving low latency? Can we adaptively repartition those streams in order to retain high performance under concept drifts? Current approaches to similarity joins are either restricted to ...

Optimizing Machine Learning Inference Queries for Multiple Objectives

Conference paper (2023) - Z. Li (author), Mariette Schonfeld (author), R. Hai (author), A. Bozzon (author), A Katsifodimos (author)

Given a set of pre-trained Machine Learning (ML) models, can we solve complex analytic tasks that make use of those models by formulating ML inference queries? Can we mitigate different tradeoffs, e.g., high accuracy, low execution costs and memory footprint, when optimizing the ...

Amalur: Data Integration Meets Machine Learning

Conference paper (2023) - R. Hai (author), C. Koutras (author), A. Ionescu (author), Z. Li (author), W. Sun (author), Jessie van Schijndel (author), Yan Kang (author), A Katsifodimos (author)

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demands a lot of manua ...

A survey on the evolution of stream processing systems

Journal article (2023) - M. Fragkoulis (author), Paris Carbone (author), Vasiliki Kalavri (author), A Katsifodimos (author)

Stream processing has been an active research field for more than 20 years, but it is now witnessing its prime time due to recent successful efforts by the research community and numerous worldwide open-source communities. This survey provides a comprehensive overview of fundamen ...

Leveraging Large Language Models for Sequential Recommendation

Conference paper (2023) - Jesse Harte (author), Wouter Zorgdrager (author), Panos Louridas (author), A Katsifodimos (author), Dietmar Jannach (author), M. Fragkoulis (author)

Sequential recommendation problems have received increasing attention in research during the past few years, leading to the inception of a large variety of algorithmic approaches. In this work, we explore how large language models (LLMs), which are nowadays introducing disruptive ...

An Empirical Performance Comparison between Matrix Multiplication Join and Hash Join on GPUs

Conference paper (2023) - W. Sun (author), A Katsifodimos (author), R. Hai (author)

Recent advances in Graphics Processing Units (GPUs) have facilitated a significant performance boost for database operators, in particular, joins. It has been intensively studied how conventional join implementations, such as hash joins, benefit from the massive parallelism of GPU ...

Accelerating Machine Learning Queries with Linear Algebra Query Processing

Conference paper (2023) - W. Sun (author), A Katsifodimos (author), R. Hai (author)

The rapid growth of large-scale machine learning (ML) models has led numerous commercial companies to utilize ML models for generating predictive results to help business decision-making. As two primary components in traditional predictive pipelines, data processing, and model pr ...

Topio: An Open-Source Web Platform for Trading Geospatial Data

Conference paper (2023) - A. Ionescu (author), Kostas Patroumpas (author), K. Psarakis (author), Georgios Chatzigeorgakidis (author), Diego Collarana (author), Kai Barenscher (author), Dimitrios Skoutas (author), A Katsifodimos (author), Spiros Athanasiou (author)

The increasing need for data trading across businesses nowadays has created a demand for data marketplaces. However, despite the intentions of both data providers and consumers, today’s data marketplaces remain mere data catalogs. We believe that marketplaces of the future requir ...

Towards Evaluating Stream Processing Autoscalers

Conference paper (2023) - G. Siachamis (author), Job Kanis (author), Wybe Koper (author), K. Psarakis (author), M. Fragkoulis (author), A. van Deursen (author), A Katsifodimos (author)

In this work, we evaluate autoscaling solutions for stream processing engines. Although autoscaling has become a mainstream subject of research in the last decade, the database research community has yet to evaluate different autoscaling techniques under a proper benchmarking set ...

Metadata Representations for Queryable Repositories of Machine Learning Models

Journal article (2023) - Z. Li (author), Henk Kant (author), R. Hai (author), A Katsifodimos (author), Marco Brambilla (author), A. Bozzon (author)

Machine learning (ML) practitioners and organizations are building model repositories of pre-trained models, referred to as model zoos. These model zoos contain metadata describing the properties of the ML models and datasets. The metadata serves crucial roles for reporting, audi ...

Optimizing ML Inference Queries Under Constraints

Conference paper (2023) - Z. Li (author), W. Sun (author), R. Hai (author), A. Bozzon (author), A Katsifodimos (author)

The proliferation of pre-trained ML models in public Web-based model zoos facilitates the engineering of ML pipelines to address complex inference queries over datasets and streams of unstructured content. Constructing an optimal plan for a query is hard, especially when constraints ...

Stateful Entities: Object-oriented Cloud Applications as Distributed Dataflows

Abstract (2023) - K. Psarakis (author), W.D. Zorgdrager (author), M. Fragkoulis (author), Guido Salvaneschi (author), A Katsifodimos (author)

While there are multiple approaches for distributed application programming (e.g., Bloom [2], Hilda [14], Cloudburst [12], AWS Lambda, Azure Durable Functions, and Orleans [3, 4]), in practice developers mainly use libraries of popular general-purpose languages, such as Spring Boot in Java and Flask in Python. None of these approaches offers message processing guarantees, failing to support exactly-once processing: the ability of a system to reflect the changes of a message to the state exactly one time. Instead, all of the above approaches offer at-most-once or at-least-once processing semantics. Programmers then have to “pollute” their business logic with consistency checks, state rollbacks, timeouts, retries, and idempotency [8, 9]. We argue that no matter how we approach cloud programming, unless an execution engine offers exactly-once processing guarantees, we will never remove the burden of distributed systems aspects from programmers. In short, exactly-once processing should be assumed at the level of the programming model. To the best of our knowledge, the only systems able to guarantee exactly-once message processing [5, 11] at the time of writing are batch [1, 7, 15] and streaming [6, 10, 13] dataflow systems. However, their programming model follows the paradigm of functional dataflow APIs, which are cumbersome to use and require training and heavy rewrites of the typical imperative code that developers prefer to use for expressing application logic. For these reasons, we believe that the dataflow model should be used as a low-level intermediate representation (IR) for the modeling and execution of distributed applications, but not as a programmer-facing model. Technically, one of the main challenges in adopting a dataflow-based IR is that the dataflow model is essentially functional, with immutable values being propagated across operators that typically do not share a global state.
Hence, adopting a dataflow-based IR entails translating (arbitrary) imperative code into the functional style. Compiler research has systematically explored only the opposite direction: compiling code in functional programming languages into a representation that is executable on imperative architectures, like virtually all modern microprocessors. Yet, the translation from imperative to functional or dataflow programming remains largely unexplored. To this end, we report on Stateful Entities, a prototypical programming model (exemplified in Figure 1), compiler pipeline, and IR that compiles imperative, transactional object-oriented applications into distributed dataflow graphs and executes them on existing dataflow systems. The proposed system presented in this paper can be found at: https://github.com/delftdata/stateflow. Our preliminary experiments showed that the translation of imperative programs into dataflow graphs yields very promising performance results, with less than 50 ms latency.

Automatic Table Union Search with Tabular Representation Learning

Conference paper (2023) - Xuming Hu (author), Shen Wang (author), Xiao Qin (author), Chuan Lei (author), Zhengyuan Shen (author), Christos Faloutsos (author), A Katsifodimos (author), George Karypis (author), Lijie Wen (author), Philip S. Yu (author)

Given a data lake of tabular data as well as a query table, how can we retrieve all the tables in the data lake that can be unioned with the query table? Table union search constitutes an essential task in data discovery and preparation as it enables data scientists to navigate m ...

Macaroni: Crawling and Enriching Metadata from Public Model Zoos

Conference paper (2023) - Z. Li (author), R. Hai (author), A Katsifodimos (author), A. Bozzon (author)

Machine learning (ML) researchers and practitioners are building repositories of pre-trained models, called model zoos. These model zoos contain metadata that detail various properties of the ML models and datasets, which are useful for reporting, auditing, reproducibility, and i ...

Topio Marketplace: Search and Discovery of Geospatial Data

Conference paper (2023) - A. Ionescu (author), Alexandra Alexandridou (author), K. Psarakis (author), Kostas Patroumpas (author), Georgios Chatzigeorgakidis (author), Dimitrios Skoutas (author), Spiros Athanasiou (author), R. Hai (author), A Katsifodimos (author)

The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years, but are yet to be put to pra ...

Join Path-Based Data Augmentation for Decision Trees

Conference paper (2022) - A. Ionescu (author), R. Hai (author), M. Fragkoulis (author), A Katsifodimos (author)

Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the ric ...