A comparative study for using PCA, LDA, GDA, and Lasso for dimensionality reduction before classification algorithms

Abstract

As more data is collected every day, processing it becomes increasingly expensive. One way to reduce these costs is dimensionality reduction, which decreases the number of features per instance in a given dataset.

In this paper, we compare four dimensionality reduction methods: the feature extraction methods PCA, LDA, and GDA, and the feature selection method Lasso. We mainly examine how the number of features retained by each method affects the accuracy of several classification algorithms, and how long each method takes to run; the experimental setup is sketched below.
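The following is a minimal sketch of this kind of comparison, not the paper's actual experimental code. It assumes scikit-learn and its bundled digits dataset, uses logistic regression as the downstream classifier, and approximates Lasso-based feature selection with SelectFromModel; GDA is omitted because scikit-learn ships no implementation of it.

```python
# Sketch: reduce the data to at most k features with each method, then
# measure classifier accuracy and the time the reduction step takes.
# (Dataset, classifier, and hyperparameters are illustrative assumptions.)
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

k = 9  # features to keep; LDA is capped at n_classes - 1 (= 9 here)
reducers = {
    "PCA": PCA(n_components=k),
    "LDA": LinearDiscriminantAnalysis(n_components=k),
    # Lasso is a regression model; here its nonzero coefficients
    # merely rank features for selection.
    "Lasso": SelectFromModel(Lasso(alpha=0.01), max_features=k),
}

for name, reducer in reducers.items():
    start = time.perf_counter()
    Z_tr = reducer.fit_transform(X_tr, y_tr)  # supervised methods use y
    elapsed = time.perf_counter() - start
    Z_te = reducer.transform(X_te)
    clf = LogisticRegression(max_iter=5000).fit(Z_tr, y_tr)
    print(f"{name}: {Z_tr.shape[1]} features, "
          f"accuracy={clf.score(Z_te, y_te):.3f}, fit time={elapsed:.3f}s")
```

Sweeping k over a range of values reproduces the core of the study's question: how accuracy degrades (or not) as each method discards more features.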

Our research highlights LDA as a highly effective method for aggressively reducing the dimensionality of data used with logistic regression and Support Vector Machines (SVMs). We identified Lasso as the preferred choice when the training dataset is small or when the random forest algorithm is used for classification. Principal Component Analysis (PCA) occupies a middle ground between LDA's strength in aggressive dimensionality reduction and Lasso's strength in preserving accuracy. GDA (with a linear kernel function) turned out to be significantly slower than the other methods, while its results were, most of the time, on par with LDA's.
