Data-Driven Empirical Analysis of Correlation-Based Feature Selection Techniques

Abstract

Thus far, the democratization of machine learning, which gave rise to the field of AutoML, has focused on automating model selection and hyperparameter optimization. Nevertheless, the need for high-quality data to boost predictive performance has sparked interest in correlation-based feature selection: a simple, fast, yet effective approach to removing noise and redundancy from relational data. However, little attention has been paid to which correlation metric to choose in order to maximize the performance of ML systems. Our research investigates the effectiveness and efficiency of four widely known correlation measures, namely Pearson, Spearman, Cramér's V, and Symmetric Uncertainty, in a setting that simulates an AutoML pipeline. We show that the theoretical assumptions of these methods do not always hold in practice, and we shed light on the main aspects to consider when integrating correlation-based feature selection into ML systems. Notably, the results indicate that the performance obtained with correlation-based methods depends far more on the types and number of features in the underlying database than on the choice of ML algorithm. We draw promising conclusions that can further the advancement of AutoML systems by making feature selection fully automatic and computationally tractable.
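To make the four measures concrete, the following is a minimal, illustrative sketch of how each can be computed with NumPy/SciPy. It is not the paper's implementation: the helper functions `cramers_v` and `symmetric_uncertainty` and the synthetic data are assumptions introduced here for illustration; Pearson and Spearman come directly from `scipy.stats`.

```python
# Illustrative sketch (not the paper's code) of the four correlation
# measures named in the abstract: Pearson, Spearman, Cramér's V, and
# Symmetric Uncertainty.
import numpy as np
from scipy import stats

def cramers_v(x, y):
    """Cramér's V between two categorical variables, via the chi-squared
    statistic of their contingency table: V = sqrt(chi2 / (n * (k - 1)))."""
    xi = {v: i for i, v in enumerate(sorted(set(x)))}
    yi = {v: i for i, v in enumerate(sorted(set(y)))}
    table = np.zeros((len(xi), len(yi)))
    for a, b in zip(x, y):          # build the contingency table
        table[xi[a], yi[b]] += 1
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)) for discrete variables,
    with mutual information obtained from the joint entropy."""
    def entropy(v):
        _, counts = np.unique(v, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))
    hx, hy = entropy(x), entropy(y)
    hxy = entropy([f"{a}|{b}" for a, b in zip(x, y)])  # joint entropy
    mi = hx + hy - hxy
    return float(2 * mi / (hx + hy)) if (hx + hy) > 0 else 0.0

# Hypothetical data: a strongly related numeric pair and a categorical column.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.integers(0, 3, size=200)

print("Pearson :", stats.pearsonr(a, b)[0])
print("Spearman:", stats.spearmanr(a, b)[0])
print("Cramér's V (identical vars):", cramers_v(c, c))          # -> 1.0
print("Symmetric Uncertainty (identical vars):",
      symmetric_uncertainty(c, c))                              # -> 1.0
```

In a feature-selection loop, one of these measures would score each feature against the target (relevance) and against other features (redundancy), dropping features whose score falls below a threshold; which measure applies depends on whether the pair of columns is numeric or categorical.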