Representation counts: the impact of embedding models on disease detection tasks from microbiome sequencing data

Abstract

The human microbiome, the ensemble of microorganisms found in and on the human body, plays a key role in human health and disease. However, microbiome analysis remains a significant challenge for machine learning algorithms. Microbiome sequence datasets typically combine high dimensionality with relatively few labels, making it difficult for a model to distinguish informative features from random noise and to avoid overfitting. It is therefore essential to reduce the dimensionality of the input data while preserving its structure and information, so that a model can learn from it properly. K-mer frequency vectors and learnable representations produced by encoders are among the embedding methods proposed in the literature to reduce the dimensionality of the input space for machine learning algorithms operating on biological sequences. This work compares how various embedding techniques influence the performance of a downstream disease detection task on microbiome sequencing data. In particular, the research shows that k-mer frequency vectors lead to better classification performance (AUC = 0.88) than NeuroSEED embeddings (AUC = 0.76) in Euclidean space. The work also shows that the formulation of the classification problem is critical to improving overall disease detection performance.
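
To make the k-mer frequency representation mentioned above concrete, the following is a minimal sketch of how a single nucleotide sequence could be mapped to a fixed-length vector. It is not the pipeline used in this work: the function name `kmer_frequency_vector`, the default `k`, the `ACGT` alphabet, the normalization to relative frequencies, and the choice to skip windows containing ambiguous bases are all assumptions made for illustration.

```python
from itertools import product


def kmer_frequency_vector(sequence, k=3, alphabet="ACGT"):
    """Return relative k-mer frequencies, ordered lexicographically by k-mer.

    Hypothetical helper for illustration; the output has len(alphabet) ** k
    entries regardless of the input sequence length.
    """
    # Enumerate every possible k-mer once so each one has a fixed position.
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = sequence.upper()
    # Slide a window of width k over the sequence, one base at a time.
    for start in range(len(seq) - k + 1):
        position = index.get(seq[start:start + k])
        if position is not None:  # assumption: skip windows with bases such as N
            counts[position] += 1

    total = sum(counts)
    if total == 0:
        return [0.0] * len(kmers)
    # Normalize counts so that sequences of different lengths are comparable.
    return [count / total for count in counts]


vec = kmer_frequency_vector("ACGTACGT", k=2)
print(len(vec))
```

With `k=2` the vector has 4^2 = 16 dimensions whatever the sequence length, which is what makes this kind of representation usable as fixed-size input to a downstream classifier. Larger values of `k` capture more context at the cost of an exponentially larger vector.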