A. Ionescu | TU Delft Repository

Feature Discovery for Data-Centric AI

Doctoral thesis (2025) - Andra Ionescu (author), GJPM Houben (promotor), A Katsifodimos (copromotor), Rihan Hai (copromotor)

We are witnessing a paradigm shift in machine learning (ML) and artificial intelligence (AI) from a focus primarily on innovating ML models, the model-centric paradigm, to prioritising high-quality, reliable data for AI/ML applications, the data-centric paradigm. This emphasis on data has led to the development of an economy around data, creating data marketplace platforms where data is traded as a commodity. However, trading data involves constraints that reflect the specific needs of users, such as enriching or augmenting their datasets or creating datasets with particular properties. These constraints pose challenges the data management community has already addressed independently of the marketplace platform context. As such, in this thesis, as a first act of research, we integrate approaches and practices from the data management community into the context of an open-source data marketplace platform, following a survey of industry professionals who produce, trade, and purchase data assets.

Aligned with the objective of the data-centric AI paradigm to create high-quality training datasets, our research focuses on developing automated methods that identify relevant and related features (e.g., columns) with which a given dataset can be augmented. This effort has led to the research and design of feature discovery, which sits at the intersection of dataset discovery (finding related datasets), data integration (joining datasets), and feature selection (selecting highly predictive features for ML models). We have developed an automated approach for feature discovery that improves upon existing automated data augmentation techniques in both the effectiveness and the efficiency of finding the most relevant features.

However, with the adoption of automatic approaches, we discovered that in moving towards data-centric AI, we risk detaching not only from model-centric but also from user-centric AI. To assess the extent to which users (e.g., data scientists, data engineers, ML engineers) rely on and trust automatic approaches and to determine their feature discovery pipeline, we conducted 19 interviews based on a use-case study. The results revealed that users doubt the automated methods and want to be involved in the process instead. Consequently, we decided to incorporate the users into the feature discovery process and to explore whether their involvement (e.g., by adding domain and business knowledge) improves the quality of the resulting dataset and the feature discovery process.

Thus, we created a human-in-the-loop approach for feature discovery, which was evaluated by conducting interviews with a subset of our initial candidate pool. The results confirmed that a human-in-the-loop method is more approachable for users as it provides control over and insights into the process, as well as the opportunity to inject their knowledge, ensuring that the resulting dataset is relevant for their data tasks.

With this thesis, we make scientific contributions to the field of data management by offering novel insights into users' workflows and designing and developing resources that enhance feature discovery. We hope our contributions will serve as a valuable resource for future work in user-centric and data-centric feature discovery.
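
The abstract above describes feature discovery as a combination of finding related datasets, joining them, and selecting predictive features. The following is a minimal sketch of that idea only, not the method developed in the thesis. It assumes that candidate tables are already known and share a join key with the base table, and it swaps in absolute Pearson correlation with the label as a simple stand-in for feature relevance. All names (`discover_features`, `left_join`, `pearson`, the toy tables) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when it is undefined."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def left_join(base, candidate, key, prefix):
    """Left-join candidate rows onto base rows; new columns get a table prefix."""
    index = {row[key]: row for row in candidate}
    joined = []
    for row in base:
        match = index.get(row[key], {})
        extra = {f"{prefix}.{c}": v for c, v in match.items() if c != key}
        joined.append({**row, **extra})
    return joined


def discover_features(base, candidates, key, label, top_k):
    """Join each candidate table and rank its columns by |correlation| with the label.

    Assumption: correlation is only an illustrative relevance score.
    """
    base_columns = {c for row in base for c in row}
    scored = []
    for name, table in candidates.items():
        joined = left_join(base, table, key, prefix=name)
        new_columns = sorted({c for row in joined for c in row} - base_columns)
        for column in new_columns:
            # Score only the rows where the join produced a value.
            rows = [r for r in joined if column in r]
            xs = [r[column] for r in rows]
            ys = [r[label] for r in rows]
            scored.append((column, abs(pearson(xs, ys))))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data (invented for illustration).
base = [
    {"id": 1, "label": 0},
    {"id": 2, "label": 0},
    {"id": 3, "label": 1},
    {"id": 4, "label": 1},
]
candidates = {
    "income": [{"id": 1, "salary": 10}, {"id": 2, "salary": 12},
               {"id": 3, "salary": 30}, {"id": 4, "salary": 33}],
    "noise": [{"id": 1, "shoe": 40}, {"id": 2, "shoe": 44},
              {"id": 3, "shoe": 44}, {"id": 4, "shoe": 40}],
}

top = discover_features(base, candidates, key="id", label="label", top_k=1)
print(top)
```

On this toy data the `income.salary` column ranks first because it tracks the label, while `noise.shoe` is uncorrelated with it. A realistic pipeline would also have to discover the candidate tables and join keys and handle multi-hop joins, which this sketch leaves out.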

Key Insights from a Feature Discovery User Study

Conference paper (2024) - A. Ionescu (author), Zeger Mouw (author), E. Aivaloglou (author), Asterios Katsifodimos (author)

Multiple works in data management research focus on automating the processes of data augmentation and feature discovery to save users from having to perform these tasks manually. Yet, this automation often leads to a disconnect with the users, as it fails to consider the specific ...

Human-in-the-Loop Feature Discovery for Tabular Data

Conference paper (2024) - A. Ionescu (author), Zeger Mouw (author), E. Aivaloglou (author), Rihan Hai (author), Asterios Katsifodimos (author)

In recent years, researchers have developed several methods to automate discovering datasets and augmenting features for training Machine Learning (ML) models. Together with feature selection, these efforts have paved the way towards what is termed the feature discovery process. ...

Tutorials at the International Conference on Distributed and Event-based Systems (DEBS 2024)

Conference paper (2024) - Pieter Bonte (author), A. Ionescu (author)

This paper introduces the 4 tutorials that were organized at the International Conference on Distributed and Event-based Systems (DEBS 2024).

Amalur

Data Integration Meets Machine Learning

Conference paper (2023) - R. Hai (author), Christos Koutras (author), A. Ionescu (author), Ziyu Li (author), Wenbo Sun (author), Jessie van Schijndel (author), Yan Kang (author), A. Katsifodimos (author)

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manua ...

Topio: An Open-Source Web Platform for Trading Geospatial Data

Conference paper (2023) - Andra Ionescu (author), Kostas Patroumpas (author), K. Psarakis (author), Georgios Chatzigeorgakidis (author), Diego Collarana (author), Kai Barenscher (author), Dimitrios Skoutas (author), A Katsifodimos (author), Spiros Athanasiou (author)

The increasing need for data trading across businesses nowadays has created a demand for data marketplaces. However, despite the intentions of both data providers and consumers, today’s data marketplaces remain mere data catalogs. We believe that marketplaces of the future requir ...

Topio Marketplace: Search and Discovery of Geospatial Data

Conference paper (2023) - A. Ionescu (author), Alexandra Alexandridou (author), K. Psarakis (author), Kostas Patroumpas (author), Georgios Chatzigeorgakidis (author), Dimitrios Skoutas (author), Spiros Athanasiou (author), R. Hai (author), Asterios Katsifodimos (author)

The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years, but are yet to be put to pra ...

Join Path-Based Data Augmentation for Decision Trees

Conference paper (2022) - A. Ionescu (author), R. Hai (author), Marios Fragkoulis (author), A Katsifodimos (author)

Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the ric ...

Amalur

Next-generation Data Integration in Data Lakes

Abstract (2022) - Rihan Hai (author), Christos Koutras (author), Andra Ionescu (author), A. Katsifodimos (author)

Data science workflows often require extracting, preparing and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake, and export it as a table, in order for it ...

Interactive Data Discovery in Data Lakes

Conference paper (2021) - A. Ionescu (author), A Katsifodimos (author), GJPM Houben (author)

As data is produced at an unprecedented rate, the need and expectation to make it easily available for the end-users is growing. Dataset Discovery has become an important subject in the data management community, as it represents the means of providing the data to the user and ...

Valentine in Action

Matching Tabular Data at Scale

Journal article (2021) - Christos Koutras (author), K. Psarakis (author), G. Siachamis (author), Andra Ionescu (author), Marios Fragkoulis (author), Angela Bonifati (author), A Katsifodimos (author)

Capturing relationships among heterogeneous datasets in large data lakes - traditionally termed schema matching - is one of the most challenging problems that corporations and institutions face nowadays. Discovering and integrating datasets heavily relies on the effectiveness of ...

Valentine: Evaluating Matching Techniques for Dataset Discovery

Conference paper (2021) - Christos Koutras (author), G. Siachamis (author), A. Ionescu (author), K. Psarakis (author), H.A.J. Brons (author), Marios Fragkoulis (author), Christoph Lofi (author), Angela Bonifati (author), Asterios Katsifodimos (author)

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema ...