Motivation. DNA molecules mutate thousands of times every day. Some mutations are harmful to human cells and may lead to loss of function in important genes involved in DNA damage repair (DDR) mechanisms. Diseases such as tumors can exploit mutations in important driver DDR genes to proliferate rapidly. Specific patterns of mutations (or signatures) are insightful indicators of DDR malfunction, which can be exploited to provide targeted treatment (e.g., by leveraging synthetic lethalities). Different methods have been developed to extract relevant mutational signatures from the genomes of tumor patients. Most approaches are unsupervised and thus do not optimize toward distinguishing DDR deficiencies (DDRd). Supervised approaches do optimize for this, but because DDRd ground truth is lacking for tumor patient genomes, they rely on labeled in vitro data from tumor cell line genomes during training. Semi-supervised learning could bridge the gap by jointly exploiting labeled cell line and unlabeled patient mutation profiles, in order to generalize to patient tumors and provide more clinically relevant DDRd mutational signatures.
Results. We propose Pseudo-labeling Semi-Supervised NMF (PSS-NMF), a novel integrated signature extraction and label prediction method, which extends supervised non-negative matrix factorization (NMF) with the ability to incorporate unlabeled samples into training via pseudo-labeling. Models learned using PSS-NMF were benchmarked on two tasks: cancer type prediction and DDRd prediction. PSS-NMF consistently improved prediction for patient tumors over the supervised NMF baseline on both tasks, learning signatures that transferred better to the patient tumor domain: Macro F1 scores were 0.3842 vs 0.1331 (PSS-NMF vs supervised NMF) for cancer type prediction, and 0.4928 vs 0.4704 for DDRd prediction. We further validated that the DDRd signatures identified by PSS-NMF were biologically relevant, by comparing them to known DDRd-related mutational signatures curated in COSMIC and by investigating their exposures in patient tumor genomes.
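To make the general idea of pseudo-labeling on top of NMF concrete, the following is a minimal sketch and not the PSS-NMF implementation. It rests on several simplifying assumptions: plain NMF with multiplicative updates stands in for the supervised NMF objective, a nearest-centroid rule on the exposures stands in for the label predictor, and the confidence threshold, function names, and synthetic count data are all hypothetical. A small Macro F1 helper is included because that is the metric reported above.

```python
import numpy as np

def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Plain NMF (multiplicative updates): X (n x m) ~ W (n x k) @ H (k x m).
    W holds per-sample exposures, H holds the k signatures."""
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], k)) + eps
    H = rng.random((k, X.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def fit_centroids(W, y, n_classes):
    # Assumes every class has at least one (pseudo-)labeled sample.
    return np.stack([W[y == c].mean(axis=0) for c in range(n_classes)])

def predict_proba(W, centroids):
    # Softmax over negative squared distances to each class centroid.
    d = ((W[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def pseudo_label_fit(X_lab, y_lab, X_unl, k, n_classes, threshold=0.9, rounds=5):
    """Hypothetical simplification: factorize all samples jointly, then
    repeatedly pseudo-label confident unlabeled samples and refit."""
    W, H = nmf(np.vstack([X_lab, X_unl]), k)
    W_lab, W_unl = W[:len(X_lab)], W[len(X_lab):]
    pseudo = np.full(len(X_unl), -1)  # -1 marks "still unlabeled"
    for _ in range(rounds):
        keep = pseudo >= 0
        centroids = fit_centroids(
            np.vstack([W_lab, W_unl[keep]]),
            np.concatenate([y_lab, pseudo[keep]]),
            n_classes,
        )
        proba = predict_proba(W_unl, centroids)
        new_pseudo = np.where(proba.max(axis=1) >= threshold, proba.argmax(axis=1), -1)
        if np.array_equal(new_pseudo, pseudo):
            break
        pseudo = new_pseudo
    return H, centroids, pseudo

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Synthetic stand-in data (not real mutation profiles): two classes whose
# counts concentrate in different halves of a 12-channel profile.
rng = np.random.default_rng(1)
profile = np.array([[8] * 6 + [1] * 6, [1] * 6 + [8] * 6], dtype=float)
y_lab = np.array([0] * 10 + [1] * 10)
X_lab = rng.poisson(profile[y_lab]).astype(float)
y_unl_true = np.array([0] * 15 + [1] * 15)
X_unl = rng.poisson(profile[y_unl_true]).astype(float)

H, centroids, pseudo = pseudo_label_fit(X_lab, y_lab, X_unl, k=2, n_classes=2)
```

In this sketch the labeled rows play the role of cell line profiles and the unlabeled rows the role of patient profiles; samples whose predicted class probability stays below the threshold simply remain unlabeled (`-1`) and do not influence the refit.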