Background: Histopathological examination is essential in the diagnostic workflow of oropharyngeal squamous cell carcinoma (OPSCC). We aimed to develop a machine learning pipeline to predict human papillomavirus (HPV) status in OPSCC patients based on clinical variables and multiparametric magnetic resonance imaging (MRI).
Methods: In a dataset of OPSCC patients (n=59), we extracted features from three categories: clinical variables; histogram parameters from a diffusion-weighted imaging (DWI) MRI model; and radiomics based on T2-weighted and DWI-MRI. We performed ten-times repeated stratified five-fold cross-validation, and divided each outer training set (80%) into an inner training set (80% of the outer training set) and a validation set (20% of the outer training set) using five-fold stratified cross-validation. On each inner training set, we applied three feature selection methods (LASSO, statistical analysis and manual selection) and tuned and trained seven classifiers (logistic regression, k-nearest neighbours, naive Bayes, random forest, support vector machine, XGBoost and LightGBM) to find the optimal combination of features, hyperparameters and classifier. We ensembled the inner-fold models to fit on the outer training set and tested the ensemble on the outer test set (20%). We constructed additional models with subsets of the features.
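The nested cross-validation scheme described in the Methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the data are synthetic (generated with scikit-learn's `make_classification`), a univariate F-test filter stands in for the feature selection step, logistic regression is the only classifier, the grid `K_GRID` and the helper `fit_inner_ensemble` are hypothetical, and the inner-fold models are ensembled by averaging their predicted probabilities, which is one plausible reading of the ensembling step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 59-patient cohort; NOT the study data.
X, y = make_classification(n_samples=59, n_features=20, n_informative=5,
                           random_state=0)

K_GRID = [3, 5, 10]  # hypothetical grid: number of features to keep


def fit_inner_ensemble(X_tr, y_tr, seed):
    """Return one tuned model per inner fold (five models in total)."""
    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    models = []
    for fit_idx, val_idx in inner_cv.split(X_tr, y_tr):
        best_auc, best_model = -np.inf, None
        for k in K_GRID:
            # Selection sits inside the pipeline, so it only sees the
            # inner training data and never the validation or test folds.
            model = make_pipeline(StandardScaler(),
                                  SelectKBest(f_classif, k=k),
                                  LogisticRegression(max_iter=1000))
            model.fit(X_tr[fit_idx], y_tr[fit_idx])
            val_prob = model.predict_proba(X_tr[val_idx])[:, 1]
            auc = roc_auc_score(y_tr[val_idx], val_prob)
            if auc > best_auc:
                best_auc, best_model = auc, model
        models.append(best_model)
    return models


# Outer loop: ten repeats of stratified five-fold CV (50 train/test splits).
outer_cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
outer_aucs = []
for i, (train_idx, test_idx) in enumerate(outer_cv.split(X, y)):
    models = fit_inner_ensemble(X[train_idx], y[train_idx], seed=i)
    # Assumed ensembling rule: average the inner-fold models' probabilities.
    test_prob = np.mean(
        [m.predict_proba(X[test_idx])[:, 1] for m in models], axis=0)
    outer_aucs.append(roc_auc_score(y[test_idx], test_prob))

print(f"AUC: {np.mean(outer_aucs):.3f} ± {np.std(outer_aucs):.3f}")
```

Keeping feature selection and tuning inside the inner loop means each outer test fold is used only once, for evaluation, which is the point of nesting with a cohort this small.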
Results: The combined model achieved an area under the curve of 0.793±0.136. Models including clinical features outperformed models without clinical features (p<0.001). Features from all feature categories were selected for the combined model.
Conclusion: We were able to predict HPV status in OPSCC patients using multiparametric MRI and clinical variables with reasonable accuracy, though retraining and validation on larger, external datasets are needed before clinical implementation.