From Feature Selection to Data Augmentation: the ADA Algorithm

Bachelor thesis (2022)

Authors

E. Cruset Pla Electrical Engineering, Mathematics and Computer Science

Contributors

R. Hai Web Information Systems - (mentor)

A. Ionescu Web Information Systems - (mentor)

D.H.J. Epema Data-Intensive Systems - (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science

To reference this document use:

http://resolver.tudelft.nl/uuid:ece35d68-e261-4c8b-9ae5-a497715d1059

Published Date

22-06-2022

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The democratization of data science, and in particular of the machine learning pipeline, has focused on automating model selection, feature processing, and hyperparameter tuning. Nevertheless, because model performance depends on high-quality data, there is growing interest in including data augmentation in these automated machine learning techniques. This research approaches the topic by examining different feature selection techniques in order to determine what makes a feature desirable. We introduce an automatic data augmentation process, tailored for support vector machines, that employs sample joins. The approach is evaluated across different setups and datasets, and with other machine learning models: CART, random forests, and XGBoost. The results are mixed: the algorithm identifies the features that contain the signal, yielding accuracy scores close to those of models trained with all the data, but its computational time is higher. A theoretical analysis suggests that the methodology might be helpful in particular cases where the data is structured in specific ways.

Files

CSE3000ResearchProject.pdf

(pdf | 0.268 Mb)