Exploring Intronic RNA-Seq Read Counts for Machine Learning Phenotype Prediction
Abstract
The inclusion of intronic reads in the downstream analysis of RNA-sequencing (RNA-seq) data has long been controversial. Recent studies show that intronic reads do contain relevant biological signal, and some have discovered differential expression unique to intronic reads in certain diseases. Nevertheless, most disease prediction studies use only exonic read counts as input to their models. In this study, we investigate the informativeness of intronic read counts for RNA-seq-based machine learning prediction tasks and explore ways to combine exonic and intronic read counts to increase predictive performance. To this end, we use an RNA-seq dataset originating from four different brain regions and predict multiple clinical labels, including Alzheimer's disease and dementia. We start by identifying differentially expressed genes through differential gene expression (DGE) analysis. Next, we evaluate the predictive performance of both exonic and intronic read counts using logistic regression. Subsequently, we explore basic machine learning techniques to combine the information contained in both sets, and we construct our own model architectures with the aim of gaining information by using both sets. We show, for this dataset, that exonic and intronic reads share some differentially expressed genes but also each have unique ones. Using these genes, we show that the predictive performance of exonic and intronic reads is very similar for all predicted labels. We further show that, even though different genes are identified, the biologically relevant signal for the prediction task appears to be the same in exonic and intronic read counts. We are not able to leverage the combination of the counts to further increase predictive performance. Existing disease prediction models have neglected the inclusion of intronic reads. In light of our findings, machine learning models that incorporate intronic reads could potentially discover novel biological insights.
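The comparison described above (logistic regression on exonic counts, on intronic counts, and on a simple combination of both) can be illustrated with a minimal, dependency-free sketch. Everything in it is an assumption made for illustration and is not taken from the study: the data are synthetic stand-ins for normalized read counts, both feature sets are simulated as noisy views of one shared latent signal, the combination is plain feature concatenation, and the function names (`simulate`, `fit_logistic`, `compare_feature_sets`) and all hyperparameters are hypothetical. In practice a library implementation of logistic regression with cross-validation would be the more likely choice.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate(n_samples=300, n_genes=20, n_informative=5, seed=0):
    """Synthetic stand-in for normalized exonic/intronic counts (assumption)."""
    rng = random.Random(seed)
    exonic, intronic, labels = [], [], []
    for _ in range(n_samples):
        y = rng.randint(0, 1)
        # Per-gene latent activity; only the first n_informative genes
        # shift with the label.
        latent = [rng.gauss(float(y) if g < n_informative else 0.0, 1.0)
                  for g in range(n_genes)]
        # Assumption: both read types observe the same latent signal,
        # each with its own independent measurement noise.
        exonic.append([a + rng.gauss(0.0, 0.3) for a in latent])
        intronic.append([a + rng.gauss(0.0, 0.3) for a in latent])
        labels.append(y)
    return exonic, intronic, labels


def fit_logistic(X, y, lr=0.3, epochs=100, l2=0.01):
    """L2-regularized logistic regression fitted by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(model, X, y):
    """Fraction of samples whose predicted class matches the label."""
    w, b = model
    hits = sum((b + sum(wj * xj for wj, xj in zip(w, xi)) > 0) == (yi == 1)
               for xi, yi in zip(X, y))
    return hits / len(y)


def compare_feature_sets(n_train=200):
    """Train on the first n_train samples, score on the held-out remainder."""
    exonic, intronic, labels = simulate()
    # Simplest possible combination: concatenate the two feature vectors.
    combined = [e + i for e, i in zip(exonic, intronic)]
    results = {}
    for name, X in [("exonic", exonic), ("intronic", intronic),
                    ("combined", combined)]:
        model = fit_logistic(X[:n_train], labels[:n_train])
        results[name] = accuracy(model, X[n_train:], labels[n_train:])
    return results


results = compare_feature_sets()
for name, acc in results.items():
    print(f"{name}: held-out accuracy = {acc:.2f}")
```

Because the simulated exonic and intronic features are built from the same latent signal, the sketch only shows how such a comparison can be set up; the accuracies it prints say nothing about real RNA-seq data.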