Authored

Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the ric ...
Machine learning (ML) practitioners and organizations are building model zoos of pre-trained models, containing metadata describing properties of the ML models and datasets that are useful for reporting, auditing, reproducibility, and interpretability purposes. The metadata is cu ...

Amalur

Next-generation Data Integration in Data Lakes

Data science workflows often require extracting, preparing, and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake, and export it as a table, in order for ...

Dynamic Digital Twin

Diagnosis, Treatment, Prediction, and Prevention of Disease During the Life Course

A digital twin (DT), originally defined as a virtual representation of a physical asset, system, or process, is a new concept in health care. A DT in health care is not a single technology but a domain-adapted multimodal modeling approach incorporating the acquisition, managem ...

This chapter introduces the most important features of data lake systems, and from there it outlines an architecture for these systems. The vision for a data lake system is based on a generic and extensible architecture with a unified data model, facilitating the ingestion, stora ...
Schema mappings express the relationships between sources in data interoperability scenarios and can be expressed in various formalisms. Source-to-target tuple-generating dependencies (s-t tgds) can be easily used for data transformation or query rewriting tasks. Second-order tgd ...
New levels of cross-domain collaboration between manufacturing companies throughout the supply chain are anticipated to bring benefits to both suppliers and consumers of products. Enabling a fine-grained sharing and analysis of data among different stakeholders in an automated ma ...
In this work, we present a Metadata Framework in the direction of extending intelligence mechanisms from the Cloud to the Edge. To this end, we build on our previously introduced notion of Data Lagoons—the analogous to Data Lakes at the network edge—and we introduce a novel archi ...
Functional dependencies are important for the definition of constraints and relationships that have to be satisfied by every database instance. Relaxed functional dependencies (RFDs) can be used for data exploration and profiling in datasets with lower data quality. In this work, ...
The increasing popularity of NoSQL systems has led to the model of polyglot persistence, in which several data management systems with different data models are used. Data lakes realize the polyglot persistence model by collecting data from various sources, by storing the data i ...
JSON has become one of the most popular data formats. Yet studies on JSON data integration (DI) are scarce. In this work, we study one of the key DI tasks, nested mapping generation in the context of integrating heterogeneous JSON based data sources. We propose a novel mapping re ...
Interdisciplinary research and development projects in medical engineering benefit from well-selected collaboration partners. The process of finding such partners from often unfamiliar fields is difficult, but can be supported by an expert profile that is based on patent analysis and c ...
Medical engineering (ME) is an interdisciplinary domain with short innovation cycles. Usually, researchers from several fields cooperate in ME research projects. To support the identification of suitable partners for a project, we present an integrated approach for patent classif ...
As the challenge of our time, Big Data still poses many research challenges, especially the variety of data. The high diversity of data sources often results in information silos, a collection of non-integrated data management systems with heterogeneous schemas, query languages, and A ...
The heterogeneity of sources in Big Data systems requires new integration approaches which can handle the large volume of the data as well as its variety. Data lakes have been proposed to reduce the upfront integration costs and to provide more flexibility in integrating and analy ...
In addition to volume and velocity, Big data is also characterized by its variety. Variety in structure and semantics requires new integration approaches which can resolve the integration challenges also for large volumes of data. Data lakes should reduce the upfront integration ...
Successful research and development projects start with finding the right partners for the venture. Especially for interdisciplinary projects, this is a difficult and tedious task as experts from foreign domains are not known. Furthermore, the transfer of knowledge from research ...
Data Warehouses (DW) have to continuously adapt to evolving business requirements, which implies structure modification (schema changes) and data migration requirements in the system design. However, it is challenging for designers to control the performance and cost overhead of ...
Successful research and development projects start with finding the right partners for the venture. Especially for interdisciplinary projects, this is a difficult task as experts from foreign domains are not known. Furthermore, the transfer of knowledge from research into practic ...

Contributed

The workflow of a data science practitioner includes gathering information from different sources and applying machine learning (ML) models. Such dispersed information can be combined through a process known as Data Integration (DI), which defines relations between entities and a ...