A comparative study for using PCA, LDA, GDA, and Lasso for dimensionality reduction before classification algorithms

Abstract

As more data is collected every day, processing it becomes increasingly expensive. One way to reduce these costs is dimensionality reduction, which decreases the number of features per instance in a given dataset.

In this paper, we compare four dimensionality reduction methods: the feature extraction methods PCA, LDA, and GDA, and the feature selection method Lasso. We mainly examine how the number of features retained by each method affects the accuracy of several classification algorithms, and how long each method takes to run; the experimental setup is sketched below.
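The following is a minimal sketch of this kind of comparison, not the paper's actual experimental code. It assumes scikit-learn and its bundled digits dataset, uses logistic regression as the downstream classifier, and approximates Lasso-based feature selection with SelectFromModel; GDA is omitted because scikit-learn ships no implementation of it.

```python
# Sketch: reduce the data to at most k features with each method, then
# measure classifier accuracy and the time the reduction step takes.
# (Dataset, classifier, and hyperparameters are illustrative assumptions.)
import time

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

k = 9  # features to keep; LDA is capped at n_classes - 1 (= 9 here)
reducers = {
    "PCA": PCA(n_components=k),
    "LDA": LinearDiscriminantAnalysis(n_components=k),
    # Lasso is a regression model; here its nonzero coefficients
    # merely rank features for selection.
    "Lasso": SelectFromModel(Lasso(alpha=0.01), max_features=k),
}

for name, reducer in reducers.items():
    start = time.perf_counter()
    Z_tr = reducer.fit_transform(X_tr, y_tr)  # supervised methods use y
    elapsed = time.perf_counter() - start
    Z_te = reducer.transform(X_te)
    clf = LogisticRegression(max_iter=5000).fit(Z_tr, y_tr)
    print(f"{name}: {Z_tr.shape[1]} features, "
          f"accuracy={clf.score(Z_te, y_te):.3f}, fit time={elapsed:.3f}s")
```

Sweeping k over a range of values reproduces the core of the study's question: how accuracy degrades (or not) as each method discards more features.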

Our research highlights LDA as a highly effective method for aggressively reducing the dimensionality of data used with logistic regression and Support Vector Machines (SVMs). We identified Lasso as the preferred choice when the training dataset is small or when the random forest algorithm is used for classification. Principal Component Analysis (PCA) occupies a middle ground between LDA's strength in aggressive dimensionality reduction and Lasso's strength in preserving accuracy. GDA (with a linear kernel function) turned out to be significantly slower than the other methods, while its results were, most of the time, on par with LDA's.
