From Feature Selection to Data Augmentation: the ADA Algorithm

Abstract

The democratization of data science, and in particular of the machine learning pipeline, has focused on the automation of model selection, feature processing, and hyperparameter tuning. Nevertheless, the need for high-quality data to improve performance has sparked interest in including data augmentation in these automatic machine learning techniques. This research approaches the topic by examining different feature selection techniques in order to determine what makes a feature desirable. We introduce an automatic data augmentation process, tailored for support vector machines, that employs sample joins. The approach is evaluated across different setups and datasets, and with other machine learning models: CART, random forests, and XGBoost. The results are mixed: the algorithm identifies the features containing the signal, yielding accuracy scores close to those of models trained with all the data, but at a higher computational cost. A theoretical analysis suggests that the methodology might be helpful in particular cases where the data is structured in specific ways.