C. Koutras | TU Delft Repository

Tabular Schema Matching for Modern Settings

Doctoral thesis (2024) - C. Koutras (author), Geert Jan Houben (promotor), Asterios Katsifodimos (copromotor), Christoph Lofi (copromotor)

Schema matching is a critical data integration process that aims to capture relevance between elements of different datasets; when the datasets are tabular, it amounts to discovering related columns among them. Accurately discovering column matches is integral to several applications, such as entity resolution, data cleaning and data augmentation. While a multitude of schema matching methods exists in the literature, we identify three major issues: i) there is no comprehensive study comparing them in terms of effectiveness and efficiency, owing to unavailable implementations and a lack of evaluation datasets, ii) existing methods might be impractical or even inapplicable in certain modern settings, and iii) the heterogeneity and complexity of data can hinder existing methods in capturing relevance among columns, as certain assumptions might not hold for the entirety of the underlying datasets. In this thesis, we tackle these issues by reviewing existing schema matching techniques and proposing novel methods capable of addressing the challenges imposed by modern settings.
Starting with Chapter 2, we present an extensive comparative study of existing schema matching methods by introducing Valentine, an open-source experimental suite that encompasses several state-of-the-art schema matching solutions. To guide the evaluation process towards modern applications, we extract four relatedness scenarios from the dataset discovery literature. To tackle the lack of existing datasets with ground truth, we devise a principled fabrication process. Our findings lead to insights that can help improve future research in the field of schema matching, and they inform the design choices we make for the novel methods presented in the following chapters.
Next, in Chapter 3, we turn our focus to applying schema matching among datasets stored in different data silos, which cannot be collocated and each of which contains information about column matches. To this end, we introduce SiMa, a matching method that leverages the existing matches in each silo to build a column match prediction model powered by a Graph Neural Network (GNN). To do so, SiMa transforms the columns in each silo, and the matches among them, into a graph, and performs targeted negative edge sampling and incremental training to enhance the learning process. In our experimental evaluation, we show the benefits of using SiMa over state-of-the-art techniques, in terms of both effectiveness and efficiency.
Finally, Chapter 4 discusses the problem of discovering join relationships among datasets in a repository. To ameliorate the shortcomings of previous methods, we propose OmniMatch, a self-supervised method that can effectively capture both equi- and fuzzy-joins among tabular data. At the core of the method is the exploitation of a comprehensive set of similarity signals among columns, which are then transformed into a similarity graph. This graph, in conjunction with automatically generated positive and negative column match examples, enables the use of a Relational Graph Convolution Network (RGCN) to train a generalizable join prediction model. We compare the effectiveness of OmniMatch with that of several other state-of-the-art matching and column representation methods, and we verify the usefulness of a wide spectrum of similarity signals for capturing joins.
We conclude the thesis by reviewing our main findings, reflecting on our contributions and discussing potential limitations of the methods and approaches presented. Moreover, based on the insights we gain from surveying and developing novel matching methods, we discuss challenges and future directions in the field.
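As a rough illustration of what "discovering related columns" between two tables can mean in practice, the sketch below scores every column pair by the Jaccard overlap of their distinct values and keeps the pairs above a threshold. It is a generic instance-based baseline and not one of the methods proposed in the thesis; the function names, the example tables and the 0.5 threshold are all assumptions made for illustration.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def match_columns(table_a, table_b, threshold=0.5):
    """Return (column_a, column_b, score) triples scoring at least `threshold`.

    Tables are plain dicts mapping a column name to its list of values
    (a hypothetical representation chosen to keep the sketch dependency-free).
    """
    matches = []
    for (name_a, values_a), (name_b, values_b) in product(
        table_a.items(), table_b.items()
    ):
        score = jaccard(values_a, values_b)
        if score >= threshold:
            matches.append((name_a, name_b, score))
    # Highest-scoring candidate matches first.
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical toy tables: the column names differ, but the values overlap.
customers = {
    "cust_id": [1, 2, 3, 4],
    "city": ["Delft", "Leiden", "Utrecht", "Delft"],
}
orders = {
    "customer": [2, 3, 4, 5],
    "total": [10.0, 25.5, 7.25, 12.0],
}

matches = match_columns(customers, orders)
for col_a, col_b, score in matches:
    print(f"{col_a} <-> {col_b}: {score:.2f}")
```

A single value-overlap signal like this breaks down when related columns share few exact values, which is one reason the abstract stresses combining a broader set of similarity signals.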

Data Lakes

A Survey of Functions and Systems

Journal article (2023) - R. Hai (author), C. Koutras (author), Christoph Quix (author), Matthias Jarke (author)

Data lakes are becoming increasingly prevalent for Big Data management and data analytics. In contrast to traditional 'schema-on-write' approaches such as data warehouses, data lakes are repositories storing raw data in its original formats and providing a common access interface ...

Amalur

Data Integration Meets Machine Learning

Conference paper (2023) - R. Hai (author), Christos Koutras (author), A. Ionescu (author), Ziyu Li (author), Wenbo Sun (author), Jessie van Schijndel (author), Yan Kang (author), A. Katsifodimos (author)

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of manua ...

Amalur

Next-generation Data Integration in Data Lakes

Abstract (2022) - Rihan Hai (author), Christos Koutras (author), Andra Ionescu (author), A. Katsifodimos (author)

Data science workflows often require extracting, preparing and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake, and export it as a table, in order for it ...

Valentine in Action

Matching Tabular Data at Scale

Journal article (2021) - Christos Koutras (author), K. Psarakis (author), G. Siachamis (author), Andra Ionescu (author), Marios Fragkoulis (author), Angela Bonifati (author), A. Katsifodimos (author)

Capturing relationships among heterogeneous datasets in large data lakes - traditionally termed schema matching - is one of the most challenging problems that corporations and institutions face nowadays. Discovering and integrating datasets heavily relies on the effectiveness of ...

Valentine: Evaluating Matching Techniques for Dataset Discovery

Conference paper (2021) - Christos Koutras (author), G. Siachamis (author), Andra Ionescu (author), K. Psarakis (author), Jerry Brons (author), M. Fragkoulis (author), C. Lofi (author), Angela Bonifati (author), A. Katsifodimos (author)

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema ...

REMA

Graph embeddings-based relational schema matching

Abstract (2020) - Christos Koutras (author), Marios Fragkoulis (author), Asterios Katsifodimos (author), C. Lofi (author)

Schema matching is the process of capturing correspondence between attributes of different datasets and it is one of the most important prerequisite steps for analyzing heterogeneous data collections. State-of-the-art schema matching algorithms that use simple schema- or instance ...

Data as a language

A novel approach to data integration

Abstract (2019) - C. Koutras (author)

In modern enterprises, both operational and organizational data is typically spread across multiple heterogeneous systems, databases and file systems. Recognizing the value of their data assets, companies and institutions construct data lakes, storing disparate datasets from differ ...