Interpretable Machine Learning for Biomarker Discovery in Imaging Mass Spectrometry Data
Abstract
Imaging mass spectrometry (IMS) is a multiplexed chemical imaging technique that enables spatially targeted molecular mapping of biological samples at cellular resolution. Within a single experiment, IMS can measure the spatial distribution and relative concentration of thousands of distinct molecular species across the surface of a tissue sample. The large size and high dimensionality of IMS datasets, which can comprise hundreds of thousands of pixels and hundreds to thousands of molecular ions tracked per pixel, have made computational approaches necessary for effective analysis. This thesis focuses primarily on biomarker discovery in IMS data using supervised machine learning algorithms. Biomarker discovery is the identification of molecular markers that enable the recognition of a specific biological state, for example distinguishing diseased tissue from healthy tissue. Biomarkers are increasingly used in biology and medicine for diagnostic and prognostic purposes, as well as for driving the development of new drugs and therapies.

Traditionally, the focus has been on maximizing the predictive performance of supervised machine learning models, without necessarily examining the models' internal decision-making processes. Yet, to generate insight into the underlying chemical mechanism of a disease or drug action, we must go beyond prediction alone and learn how these empirically trained models make their decisions and which molecular species are the primary chemical drivers of that prediction process. Machine learning model interpretability is the ability to explain a model's predictions; in practice, it translates into the ability to explicitly report the relative predictive importance of each of the dataset's features. When analyzing IMS data, interpretability is crucial for understanding how the spatial distribution and relative concentration of certain molecular features relate to the labeling of pixels into different physiological classes. The key to our data-driven approach to biomarker discovery in IMS data is to establish, for a specific biomedical recognition task, a means of ranking the molecular features used by supervised machine learning models according to their predictive importance scores. Ensuring model interpretability and feature ranking in supervised machine learning allows empirical model building to be used as a filtering mechanism that rapidly determines, among thousands of features, those that exert the greatest influence on a specific class determination. In terms of biology, the top-ranking features can empirically highlight important molecular drivers of the biological process under examination and help generate new hypotheses. In terms of translational medicine, such top-ranking features can yield a shortlist of candidate biomarkers worthy of further clinical investigation.

Three classifiers, namely logistic regression, random forests, and support vector machines, are implemented, and their performance is compared in terms of accuracy, precision, recall, scale invariance, sensitivity to noise, and computational efficiency.
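As a concrete illustration (not code from the thesis), such a comparison could be set up as follows, for instance with scikit-learn; the data matrix, labels, and hyperparameters below are hypothetical stand-ins for an IMS pixel-by-ion intensity matrix and its per-pixel tissue-class annotations:

```python
# Illustrative sketch only, not code from the thesis: comparing the three
# classifier families on a hypothetical pixel-by-ion intensity matrix using
# scikit-learn. X stands in for an (n_pixels, n_ions) IMS data matrix and
# y for per-pixel tissue-class labels.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((1000, 300))                    # hypothetical: 1000 pixels, 300 m/z features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # toy labels driven by two features

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=5,
                            scoring=("accuracy", "precision", "recall"))
    print(f"{name}: "
          f"accuracy={scores['test_accuracy'].mean():.3f}, "
          f"precision={scores['test_precision'].mean():.3f}, "
          f"recall={scores['test_recall'].mean():.3f}")
```

Within the same loop, scale invariance could be probed by rescaling individual columns of X and checking whether the cross-validated scores change, and sensitivity to noise by adding small perturbations to X; these probes are suggestions, not the thesis's exact protocol.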
Subsequently, several approaches to explaining these classifiers' predictions are implemented and investigated: model-specific interpretability methods are tied to intrinsically interpretable classifiers, such as generalized linear models and decision trees, whereas model-agnostic interpretability methods can also explain the predictions of black-box models, such as support vector machines with nonlinear kernels or deep neural networks. In addition to three model-specific methods, we present two post-hoc model-agnostic interpretability methods: permutation importance and Shapley importance. Our implementation of Shapley importance, based on Shapley values from cooperative game theory, is novel. Having observed variability among the rankings produced by different interpretability methods, we investigate improving the inter-method reliability of feature rankings by decorrelating the features prior to training the classifiers. We also propose a robust ensemble approach to interpretability that aggregates the importance scores attributed to each feature by different model-specific interpretability methods. We demonstrate our methodology on two biomedical case studies: a MALDI-FTICR IMS dataset acquired from a coronal section of a rat brain, and a MALDI-TOF IMS dataset acquired from a sagittal section of a mouse pup.
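For reference, the Shapley value from cooperative game theory assigns to feature i its average marginal contribution over all coalitions S of the remaining features N \ {i}, for a given value function v:

\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)

The sketch below is again purely illustrative and is not the thesis's novel implementation: it shows the two model-agnostic routes on hypothetical data, using scikit-learn's permutation_importance for the first and a generic Monte-Carlo approximation of Shapley importance for the second, in which features outside the current coalition keep the values of a randomly drawn background pixel.

```python
# Illustrative sketch only, not the thesis's novel implementation.
# Two post-hoc model-agnostic importance measures on hypothetical data:
# (1) permutation importance via scikit-learn, and
# (2) a Monte-Carlo approximation of Shapley importance in which features
#     outside the current coalition keep the values of a random background pixel.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 20))                     # hypothetical pixel-by-ion matrix
y = (X[:, 3] + X[:, 7] > 1.0).astype(int)      # toy labels driven by features 3 and 7

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# (1) Permutation importance: drop in test accuracy when a feature's values
#     are shuffled, breaking its association with the labels.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print("top features by permutation importance:",
      np.argsort(perm.importances_mean)[::-1][:5])

# (2) Monte-Carlo Shapley importance for a single pixel x: average marginal
#     change in the predicted class-1 probability as features are added one
#     by one in random order, starting from a random background pixel.
def shapley_importance(model, x, background, n_samples=200, rng=rng):
    n_features = x.shape[0]
    phi = np.zeros(n_features)
    for _ in range(n_samples):
        order = rng.permutation(n_features)
        z = background[rng.integers(len(background))].copy()
        prev = model.predict_proba(z[None, :])[0, 1]
        for j in order:
            z[j] = x[j]                                  # feature j joins the coalition
            curr = model.predict_proba(z[None, :])[0, 1]
            phi[j] += curr - prev                        # marginal contribution of j
            prev = curr
    return phi / n_samples

phi = shapley_importance(model, X_test[0], X_train)
print("top features by |Shapley importance| for one pixel:",
      np.argsort(np.abs(phi))[::-1][:5])
```

In a full analysis, per-pixel Shapley scores would typically be aggregated, for example by mean absolute value across pixels, to obtain a dataset-level feature ranking comparable to the permutation-based one; the aggregation scheme shown here is a common convention, not necessarily the one used in the thesis.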