Mitigating selection bias in synthetic lethality prediction using metric learning
Abstract
Synthetic lethality (SL) is a relationship between two genes, exploited for targeted anti-cancer therapy, whereby functional loss of both genes induces cell death, but the functional loss of either gene alone is non-lethal. Computational prediction of SL gene pairs is sought after because laboratory screening for SL is expensive. Existing labeled SL pairs from wet-lab experiments often focus on specific genes or pathways, resulting in notable selection bias. Current SL prediction methods ignore this bias when training on available SL labels and fail to generalize if test sets follow a different selection bias. One way to mitigate bias is to incorporate unlabeled pairs during model learning. However, conventional semi-supervised methods such as self-training can reinforce bias by adding confidently pseudolabeled pairs, which tend to be most similar to previously included samples. We present DBST, a self-training strategy that addresses this issue by promoting diversity in the selection of pseudolabeled samples. This is achieved by using metric learning to find a class-contrastive representation of the feature space, based on which DBST selects diverse (that is, mutually dissimilar) pseudolabeled pairs. In results for five cancer types, semi-supervised models, including DBST, delivered improved SL prediction performance over the supervised model. Additionally, DBST incorporated unlabeled samples that were more dissimilar to one another than those added by standard self-training. In experiments with differing biases between train and test sets, DBST showed a slight improvement in performance compared to the supervised model.
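
To make the contrast between confidence-based and diversity-promoting selection concrete, the following is a minimal illustrative sketch and not the DBST implementation. It assumes each unlabeled gene pair already has an embedding in some learned metric space and a pseudolabel confidence. The function names (`select_confident`, `select_diverse`), the confidence threshold, the Euclidean distance, and the greedy farthest-point rule are all hypothetical choices made for illustration; the abstract does not specify how DBST measures or enforces diversity.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_confident(candidates, k):
    """Standard self-training step (for contrast): take the k most confident
    pseudolabeled candidates. `candidates` is a list of (embedding, confidence)."""
    order = sorted(range(len(candidates)), key=lambda i: -candidates[i][1])
    return order[:k]


def select_diverse(candidates, labeled, k, min_conf=0.9):
    """Hypothetical diversity-promoting step: among candidates whose confidence
    is at least `min_conf` (an assumed threshold), greedily pick the one that is
    farthest from everything already included (labeled or previously picked)."""
    if not labeled:
        raise ValueError("need at least one labeled embedding as a reference")
    pool = [i for i, (_, conf) in enumerate(candidates) if conf >= min_conf]
    reference = list(labeled)
    chosen = []
    while pool and len(chosen) < k:
        # Score each candidate by its distance to the nearest included sample.
        best = max(
            pool,
            key=lambda i: min(euclidean(candidates[i][0], r) for r in reference),
        )
        chosen.append(best)
        reference.append(candidates[best][0])
        pool.remove(best)
    return chosen


# Toy data (made up): one labeled pair and four unlabeled candidates.
labeled = [(0.0, 0.0)]
candidates = [
    ((0.1, 0.0), 0.99),  # very confident, nearly identical to the labeled pair
    ((0.2, 0.0), 0.98),  # very confident, also close to the labeled pair
    ((5.0, 5.0), 0.91),  # confident enough, far from the labeled pair
    ((9.0, 9.0), 0.50),  # far away but below the confidence threshold
]

confident_pick = select_confident(candidates, k=2)
diverse_pick = select_diverse(candidates, labeled, k=2)
```

On this toy data, confidence-only selection picks the two candidates closest to the existing labeled pair, while the diversity-promoting rule first picks the distant but still confident candidate. This mirrors the abstract's concern that confident pseudolabels tend to resemble samples already included.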