Synthetic lethality (SL) arises between two genes when simultaneous loss of function of both genes renders a cell inviable, whereas loss of either gene alone is tolerated. This can be exploited for therapy: a drug that perturbs one gene of an SL pair selectively kills diseased cells in which the other gene is already inactive (e.g. through a naturally occurring mutation). Computational prediction of SL relationships is appealing because it can restrict costly and labour-intensive experimental testing to the most promising candidate pairs. Although machine learning models have shown promising results for SL prediction compared to traditional statistical approaches, crucial questions remain. First, which sources of molecular data are most useful for SL prediction? Many approaches rely on either cell line or patient tumour data separately, and ignore data from healthy tissue. We argue that these should be combined, both to leverage relevant data sources that are available only for cancer cell models or only for patient tumours, and to enable the transfer of knowledge between models and actual patient tumours. Likewise, changes in the relationship of gene pairs between healthy and tumour tissue may be informative for SL prediction. We assess several machine learning techniques to best leverage molecular profiles for cancer-specific or pan-cancer SL prediction. Second, what are the effects of selection bias on SL prediction methods, and which techniques are most robust? This question has been insufficiently addressed, as models in the literature are often tested on data from only one or two cancer types or datasets. We investigate robustness to cancer representation and gene selection biases, which are inherent to most SL datasets. We hypothesise that approaches based on matrix factorisation will be especially sensitive to gene selection bias, as they depend on an a priori SL network structure, which also determines the scope of the prediction space.
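To illustrate why a factorisation-based predictor is tied to the a priori SL network, the following minimal sketch factorises a toy SL adjacency matrix. It is not the method evaluated in this work: a plain low-rank eigendecomposition stands in for matrix factorisation, and the gene names (`G1` to `G5`), the SL pairs, and the helper functions `factorise` and `predict` are all hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy SL network; genes and pairs are illustrative, not real data.
genes = ["G1", "G2", "G3", "G4", "G5"]
known_sl_pairs = [("G1", "G2"), ("G1", "G3"), ("G2", "G4")]  # G5 has no known partner

index = {gene: i for i, gene in enumerate(genes)}

# Symmetric binary adjacency matrix of the known SL network.
A = np.zeros((len(genes), len(genes)))
for a, b in known_sl_pairs:
    A[index[a], index[b]] = 1.0
    A[index[b], index[a]] = 1.0


def factorise(adjacency, rank):
    """Low-rank reconstruction of a symmetric matrix (stand-in for matrix factorisation)."""
    eigvals, eigvecs = np.linalg.eigh(adjacency)
    # Keep the `rank` components with the largest absolute eigenvalues.
    top = np.argsort(np.abs(eigvals))[::-1][:rank]
    return (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T


scores = factorise(A, rank=2)


def predict(gene_a, gene_b):
    """Return the reconstructed SL score for a pair of genes in the network."""
    if gene_a not in index or gene_b not in index:
        # A gene absent from the training network has no latent factors at all.
        raise KeyError("gene outside the a priori SL network")
    return scores[index[gene_a], index[gene_b]]


# G5 is in the matrix but has no known SL partner, so its scores carry no signal.
print(np.round(scores[index["G5"]], 6))
```

In this sketch, a gene that is missing from the network cannot be scored at all, and a gene that is present but has no recorded SL partners (`G5`) receives reconstructed scores of zero for every pair. Both effects follow from how the genes in the SL dataset were selected, which is the kind of gene selection bias the hypothesis above refers to.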