Exploring Intronic RNA-Seq Read Counts for Machine Learning Phenotype Prediction

Master thesis (2023)

Authors

T.J. Zuiker (Electrical Engineering, Mathematics and Computer Science)

Contributors

Joana Gonçalves (mentor), Pattern Recognition and Bioinformatics

Faculty

Electrical Engineering, Mathematics and Computer Science

Machine Learning (ML), Gene Expression, Introns, Transcriptomics

To reference this document use:

http://resolver.tudelft.nl/uuid:cc70564c-0797-43e4-be8c-a5fb0c0ba0c8

Published Date

01-11-2023

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The inclusion of intronic reads in the downstream analysis of RNA-sequencing (RNA-seq) data has long been controversial. Recent studies show that intronic reads do contain relevant biological signal, and some have found differential expression unique to intronic reads in certain diseases. Nevertheless, most disease prediction studies use only exonic read counts as model input. In this study, we investigate how informative intronic read counts are for RNA-seq-based machine learning prediction tasks, and we explore ways to combine exonic and intronic read counts to increase predictive performance. To this end, we use an RNA-seq dataset originating from four brain regions and predict several clinical labels, including Alzheimer's disease and dementia. We first identify differentially expressed genes through differential gene expression (DGE) analysis. Next, we evaluate the predictive performance of exonic and intronic read counts separately using logistic regression. We then explore basic machine learning techniques for combining the information in both sets, and we construct our own model architectures designed to gain information from using both sets together. We show, for this dataset, that exonic and intronic reads yield overlapping but also unique differentially expressed genes. Using these genes, we show that predictive performance with exonic and intronic reads is very similar for all predicted labels. We further show that, even though different genes are identified, the biologically relevant signal for the prediction task appears to be the same in exonic and intronic read counts. We are not able to leverage the combination of the counts to further increase predictive performance. Existing disease prediction models have neglected intronic reads. In light of our findings, machine learning models that incorporate intronic reads could potentially discover novel biological insights.
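
The comparison described in the abstract (logistic regression on exonic counts, on intronic counts, and on both combined) can be illustrated with a minimal sketch. This is not the thesis implementation. The data are synthetic, and every function name, the gene count, the noise level, and the training settings are hypothetical choices made for illustration only. The combination shown is plain feature concatenation, one possible basic technique. The sketch uses only the Python standard library so that it runs without extra dependencies.

```python
import math
import random


def make_synthetic(n_samples=200, n_genes=5, seed=0):
    """Hypothetical data: exonic and intronic values are noisy readouts
    of the same label-dependent shift (an assumption, not real RNA-seq)."""
    rng = random.Random(seed)
    exonic, intronic, labels = [], [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)
        shift = 1.0 if label else -1.0
        exonic.append([shift + rng.gauss(0, 1.5) for _ in range(n_genes)])
        intronic.append([shift + rng.gauss(0, 1.5) for _ in range(n_genes)])
        labels.append(label)
    return exonic, intronic, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, lr=0.1, epochs=200, l2=0.01):
    """L2-regularised logistic regression via batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    """Fraction of samples whose predicted class matches the label."""
    w, b = model
    preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0
             for xi in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


def compare_feature_sets(seed=0, train_fraction=0.7):
    """Train one model per feature set and report held-out accuracy."""
    exonic, intronic, labels = make_synthetic(seed=seed)
    combined = [e + i for e, i in zip(exonic, intronic)]  # concatenation
    split = int(train_fraction * len(labels))
    results = {}
    for name, X in [("exonic", exonic), ("intronic", intronic),
                    ("combined", combined)]:
        model = fit_logreg(X[:split], labels[:split])
        results[name] = accuracy(model, X[split:], labels[split:])
    return results


if __name__ == "__main__":
    for name, acc in compare_feature_sets().items():
        print(f"{name}: held-out accuracy = {acc:.2f}")
```

Because the synthetic exonic and intronic features carry independent noise, the numbers this sketch produces say nothing about the thesis results. It only shows the shape of the comparison. A real analysis would start from normalised read counts restricted to the genes selected by DGE analysis, and would typically use cross-validation rather than a single split.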

Files

MSc_Thesis_Thomas_Zuiker_25_10... (pdf)

(pdf | 36.8 Mb)

Unknown license