A. Katsifodimos

54 records found

We are witnessing a paradigm shift in machine learning (ML) and artificial intelligence (AI) from a focus primarily on innovating ML models, the model-centric paradigm, to prioritising high-quality, reliable data for AI/ML applications, the data-centric paradigm. This emphasis on ...
In the digital era, XML data is fundamental for various applications, requiring robust methods to ensure data integrity and security. Traditional digital watermarking techniques face challenges due to XML's hierarchical structure. Zero-watermarking, which derives a watermark from ...
In the digital age, the proliferation of personal data within databases has made them prime targets for cyberattacks. As the volume of data increases, so does the frequency and sophistication of these attacks. This thesis investigates database security threats by deploying open s ...
Security researchers and industry firms employ Internet-wide scanning for information collection, vulnerability detection and security evaluation, while cybercriminals make use of it to find and attack unsecured devices. Internet scanning plays a considerable role in threat ...
The advancement of artificial intelligence (AI) has led to an increased demand for both a greater volume and quality of data. In many companies, data is dispersed across multiple tables, yet AI models typically require data in a single table format. This necessitates the merging ...
This thesis embarks on the quest to efficiently compute similarities between data streams in real-time, a task burgeoning in importance with the advent of big data and real-time analytics. At the heart of this endeavor is the expansion of the Condor framework to accommodate new p ...
Schema matching is a critical data integration process, which aims at capturing relevance between elements of different datasets; when datasets are tabular, it translates to the process of discovering related columns among them. Accurately discovering column matches is integral f ...
Over the last two decades, the machine learning (ML) field has witnessed a dramatic expansion, propelled by burgeoning data volumes and the advancement of computational technologies. Deep learning (DL) in particular has demonstrated remarkable success across a wide range of domai ...
Data processing has heavily evolved in the last two decades, from single-node processing to distributed processing and from the MapReduce paradigm to the stream processing paradigm. At the same time, cloud computing has emerged as the primary means of deploying and operating a da ...
Similarity joins are operations which involve identifying similar pairs of records within one or multiple datasets. These operations are typically time-sensitive, as timely identification of relations can lead to increased profitability. Therefore, it is advantageous to analyze t ...
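A similarity join, stripped to its essentials, can be sketched as follows. The records, threshold, and naive all-pairs loop below are hypothetical illustrations; production engines use pruning filters and, as the abstract above discusses, streaming-friendly variants.

```python
# Toy self-similarity-join sketch: find record pairs whose token sets have
# Jaccard similarity at or above a threshold. Records/threshold are made up.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def similarity_join(records, threshold):
    """Naive all-pairs join; real engines prune with prefix/length filters."""
    return [(i, j) for i in range(len(records))
                   for j in range(i + 1, len(records))
                   if jaccard(records[i], records[j]) >= threshold]

records = [["data", "stream", "join"],
           ["data", "stream", "engine"],
           ["graph", "query"]]
similarity_join(records, 0.5)    # → [(0, 1)]
```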
General-purpose GPUs, renowned for their exceptional parallel processing capabilities and throughput, hold great promise for enhancing the efficiency of data analytics tasks. At the same time, recent developments in query execution engines have integrated the support of OLAP oper ...
The use of data streams has increased substantially over the last two decades. With this increase comes the need for fast and consistent fault recovery. Rollback recovery mechanisms from traditional distributed systems have been adapted successfully for stream engines. ...
Serverless computing has allowed developers to write pieces of code consisting solely of the necessary functionality, without having to think about the underlying infrastructure. One prominent model is Function-as-a-Service (FaaS), where the code is structured into functions th ...
Today's need for highly available systems leads to data partitioning and replication across multiple nodes. Providing strong transactional consistency in a distributed database requires extensive communication. For this, algorithms such as two-phase commit are used. These communi ...
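The two-phase commit protocol mentioned above can be sketched in a few lines. The `Participant` class and coordinator loop are deliberate simplifications for illustration; a real implementation adds durable logging, timeouts, and crash recovery.

```python
# Toy two-phase commit: phase 1 collects votes, phase 2 broadcasts the
# global decision. Commit happens only if every participant votes yes.

class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: a yes vote is a promise that this node can commit.
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def finish(self, commit):
        # Phase 2: apply the coordinator's global decision.
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # phase 1: prepare/vote
    decision = all(votes)                         # unanimous yes required
    for p in participants:
        p.finish(decision)                        # phase 2: commit or abort
    return decision

nodes = [Participant("a"), Participant("b"), Participant("c", will_commit=False)]
two_phase_commit(nodes)   # aborts, because node "c" votes no
```

The two communication rounds (prepare, then decide) are exactly the cost the abstract refers to: every transaction pays at least two network round trips to all participants.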
The adoption of the serverless architecture and the Function-as-a-Service model has significantly increased in recent years, with more enterprises migrating their software and hardware to the cloud. However, most applications require state management, leading to the use of extern ...
The data used in machine learning algorithms strongly influences the algorithms' capabilities. Feature selection techniques can choose a set of columns that meet a certain learning goal. There is a wide variety of feature selection methods; however, the ones we cover in this comp ...

Encoding methods for categorical data

A comparative analysis for linear models, decision trees, and support vector machines

This paper presents a comprehensive evaluation and comparison of encoding methods for categorical data in the context of machine learning. The study focuses on five popular encoding techniques: one-hot, ordinal, target, catboost, and count encoders. These methods are evaluated us ...
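Two of the five encoders compared in the paper, one-hot and ordinal, can be sketched directly; the category values below are hypothetical and the implementations are minimal stand-ins for what libraries provide.

```python
# Minimal sketches of ordinal and one-hot encoding for one categorical column.

def ordinal_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping))
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Expand each category into a 0/1 indicator vector."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories

colors = ["red", "green", "red", "blue"]
ordinal, mapping = ordinal_encode(colors)   # [0, 1, 0, 2]
one_hot, cats = one_hot_encode(colors)      # rows over ["blue", "green", "red"]
```

Target, catboost, and count encoders differ in that they replace categories with statistics of the training data (target means, ordered target statistics, frequencies) rather than with purely positional codes.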

Automatic feature discovery

A comparative study between filter and wrapper feature selection techniques

The curse of dimensionality is a common challenge in machine learning, and feature selection techniques are commonly employed to address this issue by selecting a subset of relevant features. However, there is no consistently superior approach for choosing the most significant su ...
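The contrast between the two families compared in the study can be reduced to control flow: a filter method scores each feature independently of any model, while a wrapper method repeatedly evaluates candidate subsets with a model. The variance score and the evaluation callback below are simple hypothetical stand-ins.

```python
# Filter vs wrapper feature selection, stripped to their control flow.
# X is a list of feature columns; the scoring functions are illustrative only.

def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def filter_select(X, k):
    """Filter: score each feature on its own (here: variance), keep the top-k."""
    scores = [(variance(col), j) for j, col in enumerate(X)]
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])

def wrapper_select(X, k, evaluate):
    """Wrapper: greedy forward selection driven by a model-evaluation callback."""
    chosen = []
    while len(chosen) < k:
        best = max((j for j in range(len(X)) if j not in chosen),
                   key=lambda j: evaluate(chosen + [j]))
        chosen.append(best)
    return sorted(chosen)

# Hypothetical data: three feature columns, the second one constant.
X = [[1.0, 2.0, 3.0, 4.0],    # feature 0: high variance
     [5.0, 5.0, 5.0, 5.0],    # feature 1: zero variance
     [1.0, 1.0, 2.0, 2.0]]    # feature 2: some variance

filter_select(X, 2)           # keeps features 0 and 2
wrapper_select(X, 2, lambda s: sum(variance(X[j]) for j in s))
```

The wrapper's cost is visible in the sketch: it calls `evaluate` once per candidate per round, which with a real model means many training runs, whereas the filter scores each column exactly once.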
Thus far the democratization of machine learning, which resulted in the field of AutoML, has focused on the automation of model selection and hyperparameter optimization. Nevertheless, the need for high-quality databases to increase performance has sparked interest in correlation ...
As more and more data is collected every day, it becomes increasingly expensive to process. To reduce these costs, dimensionality reduction can be applied to decrease the number of features per instance in a given dataset.

In this paper, we will compare four possible met ...
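The abstract's list of compared methods is truncated above, so as a purely illustrative assumption, here is a minimal sketch of one standard dimensionality-reduction method, PCA via SVD, projecting samples onto their top-k principal components.

```python
# Minimal PCA sketch: project n samples x d features onto k principal axes.
import numpy as np

def pca_reduce(X, k):
    """Return the n x k projection of X onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                      # center each feature
    # Right singular vectors of the centered data are the principal axes,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca_reduce(X, 2)    # shape (100, 2)
```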