A. Ionescu

20 records found

Authored

This paper introduces the 4 tutorials that were organized at the International Conference on Distributed and Event-based Systems (DEBS 2024).

Multiple works in data management research focus on automating the processes of data augmentation and feature discovery to save users from having to perform these tasks manually. Yet, this automation often leads to a disconnect with the users, as it fails to consider the speci ...

The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years, but are yet to be put to pra ...
The increasing need for data trading across businesses nowadays has created a demand for data marketplaces. However, despite the intentions of both data providers and consumers, today’s data marketplaces remain mere data catalogs. We believe that marketplaces of the future requir ...

Amalur

Data Integration Meets Machine Learning

Machine learning (ML) training data is often scattered across disparate collections of datasets, called data silos. This fragmentation poses a major challenge for data-intensive ML applications: integrating and transforming data residing in different sources demand a lot of ma ...

Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the ric ...

Amalur

Next-generation Data Integration in Data Lakes

Data science workflows often require extracting, preparing, and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake, and export it as a table, in order for ...

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema ...

Valentine in Action

Matching Tabular Data at Scale

Capturing relationships among heterogeneous datasets in large data lakes - traditionally termed schema matching - is one of the most challenging problems that corporations and institutions face nowadays. Discovering and integrating datasets heavily relies on the effectiveness of ...
As data is produced at an unprecedented rate, the need and expectation to make it easily available for the end-users is growing. Dataset Discovery has become an important subject in the data management community, as it represents the means of providing the data to the user and ...

Contributed

The advancement of artificial intelligence (AI) has led to an increased demand for both a greater volume and quality of data. In many companies, data is dispersed across multiple tables, yet AI models typically require data in a single table format. This necessitates the merging ...
As more data is collected every day, it becomes increasingly expensive to process. To reduce these costs, dimensionality reduction can be used to lower the number of features per instance in a given dataset.

In this paper, we will compare four possible met ...
Thus far the democratization of machine learning, which resulted in the field of AutoML, has focused on the automation of model selection and hyperparameter optimization. Nevertheless, the need for high-quality databases to increase performance has sparked interest in correlation ...

Automatic feature discovery

A comparative study between filter and wrapper feature selection techniques

The curse of dimensionality is a common challenge in machine learning, and feature selection techniques are commonly employed to address this issue by selecting a subset of relevant features. However, there is no consistently superior approach for choosing the most significant su ...

Encoding methods for categorical data

A comparative analysis for linear models, decision trees, and support vector machines

This paper presents a comprehensive evaluation and comparison of encoding methods for categorical data in the context of machine learning. The study focuses on five popular encoding techniques: one-hot, ordinal, target, catboost, and count encoders. These methods are evaluated us ...
The data used in machine learning algorithms strongly influences the algorithms' capabilities. Feature selection techniques can choose a set of columns that meet a certain learning goal. There is a wide variety of feature selection methods; however, the ones we cover in this comp ...
The speed of data growth has increased exponentially over the past decade, highlighting modern organizations' need for data discovery systems. Several (automated) schema matching approaches have been proposed to find related data, exploiting different parts of schema in ...
Automatic machine learning is a subfield of machine learning that automates the common procedures faced in predictive tasks. One such procedure is automatic data augmentation, where one desires to enrich the existing data to increase model performance. In relationa ...
The democratization of data science, and in particular of the machine learning pipeline, has focused on the automation of model selection, feature processing, and hyperparameter tuning. Nevertheless, the need for high-quality data for increased performance has sparked interest in ...
Machine learning models require rich, high-quality datasets to achieve high accuracy. With the current exponential growth of generated data, it is becoming increasingly hard to prepare high-quality tables within a reasonable time frame. To combat this issue, automated data augmentation ...