Machine learning models require rich, high-quality data sets to achieve high accuracy. With the current exponential growth in the volume of generated data, it is becoming increasingly hard to prepare high-quality tables within a reasonable time frame. To combat this issue, automated data augmentation methods have emerged in recent years. However, existing solutions do not take into account the specific ML algorithm that will be trained on the data.
In this paper, we propose PCADA, a data augmentation framework designed specifically for the random forest classifier. The algorithm uses sample joins to estimate the partial correlation between each feature in the neighbouring tables and the target column, while controlling for all other features.
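To make the idea of scoring a candidate feature on a sampled join concrete, the following is a minimal Python sketch, not the framework's actual implementation. It assumes a residual-based estimate of partial correlation (regress both the candidate feature and the target on the remaining features, then correlate the residuals) and a single join key. The function names `partial_correlation` and `score_candidate`, the parameter `sample_size`, and the synthetic tables are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd


def partial_correlation(x, y, controls):
    """Correlation between x and y after regressing both on the controls."""
    # Design matrix: intercept column plus the control features.
    Z = np.column_stack([np.ones(len(x)), controls])
    # Residuals of x and y once the linear effect of the controls is removed.
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


def score_candidate(base, neighbour, key, feature, target,
                    sample_size=500, seed=0):
    """Estimate the partial correlation of one neighbouring feature
    with the target on a sampled join (hypothetical helper)."""
    # Join only a random sample of the base table to keep the cost low.
    sample = base.sample(n=min(sample_size, len(base)), random_state=seed)
    joined = sample.merge(neighbour[[key, feature]], on=key, how="inner")
    # Every remaining base-table column acts as a control.
    controls = joined.drop(columns=[key, feature, target]).to_numpy(dtype=float)
    return partial_correlation(
        joined[feature].to_numpy(dtype=float),
        joined[target].to_numpy(dtype=float),
        controls,
    )


# Synthetic data (assumption): the target depends on "a" and "useful",
# but not on "noise".
rng = np.random.default_rng(42)
n = 1000
a = rng.normal(size=n)
useful = rng.normal(size=n)
noise = rng.normal(size=n)
label = (a + 2 * useful + 0.1 * rng.normal(size=n) > 0).astype(int)

base = pd.DataFrame({"id": np.arange(n), "a": a, "target": label})
neighbour = pd.DataFrame({"id": np.arange(n), "useful": useful, "noise": noise})

useful_score = score_candidate(base, neighbour, "id", "useful", "target")
noise_score = score_candidate(base, neighbour, "id", "noise", "target")
print(f"useful: {useful_score:.3f}, noise: {noise_score:.3f}")
```

In this sketch, a candidate whose score is close to zero adds little beyond the features already in the base table, while a score of large magnitude marks a feature worth joining. Working on a sample of the join rather than the full join is what keeps the estimate cheap.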
Moreover, we show that partial correlation is the most suitable characteristic for determining feature importance for the random forest classifier. In addition, we demonstrate that PCADA can improve accuracy and run-time in comparison with baseline data augmentation approaches.
Finally, we show that the framework can also be used with other decision-tree-based classifiers (CART, XGBoost) and with a linear classifier (Support Vector Machine).