Cancer is one of the leading causes of death. To reduce the number of cancer deaths, several screening methods are used to detect cancer at an earlier stage, which improves survival rates. Current screening methods are often invasive, costly and not very accurate. Therefore, new methods are being sought that are cheaper, less invasive and more accurate. One of these methods is fragmentomics. Multiple approaches have been proposed for using fragmentomics analysis in cancer screening, including the short/long fragment ratio and the nucleotides at the fragment ends. Previous works using fragmentomics analysis to predict cancer apply different pre-processing steps, with limited explanation of why those steps were chosen. Research into the effects of these pre-processing steps is lacking. Two main pre-processing steps in the field are correcting GC bias and filtering on mapping quality (MAPQ). Here we investigated the impact of three GC-correction methods by applying each correction and then analyzing the resulting fragmentation profiles using short/long fragment ratios. Furthermore, three MAPQ filtering thresholds were studied. deepTools correction of the GC bias lowered performance, with accuracy dropping from 77.8% to 69.4%. Applying LOESS correction to all fragments at once resulted in an accuracy of 83.3%, while applying LOESS correction to the short and long fragments separately resulted in an accuracy of 91.7%. The impact of filtering on mapping quality was determined by comparing the results of analyzing all fragments with those of analyzing only fragments passing a mapping-quality threshold of 5, 20 or 30.
Not filtering by mapping quality had a large impact on the profiles of cancer samples, with a Kolmogorov-Smirnov (KS) test statistic of 0.08 for MAPQ 5 and MAPQ 20 and larger differences in correlations between healthy and cancer samples. Classification performance was much higher without filtering, with an accuracy of 97.3%, which dropped as the filtering threshold was raised, bottoming out at 62.7% for a threshold of MAPQ 30. Due to limitations of the study, the combination of not filtering on MAPQ and using the separate LOESS correction was not studied.
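
To make the two quantities discussed above concrete, the following is a minimal sketch of how a MAPQ threshold and a per-bin short/long fragment ratio could be computed. It is an illustration under stated assumptions, not the pipeline used in this work: the size ranges for short and long fragments, the 5 Mb bin size, the tuple layout of a fragment, and the function name `short_long_ratios` are all hypothetical, and the filter is assumed to keep fragments whose MAPQ is at or above the threshold.

```python
from collections import defaultdict

# Assumed size ranges in base pairs; the text does not state the cut-offs used.
SHORT_RANGE = (100, 150)
LONG_RANGE = (151, 220)


def short_long_ratios(fragments, min_mapq=0, bin_size=5_000_000):
    """Return {(chrom, bin_index): short/long ratio} after MAPQ filtering.

    `fragments` is an iterable of (chrom, start, length, mapq) tuples
    (a hypothetical layout chosen for this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # [short count, long count] per bin
    for chrom, start, length, mapq in fragments:
        if mapq < min_mapq:  # assumed rule: keep fragments with MAPQ >= threshold
            continue
        key = (chrom, start // bin_size)
        if SHORT_RANGE[0] <= length <= SHORT_RANGE[1]:
            counts[key][0] += 1
        elif LONG_RANGE[0] <= length <= LONG_RANGE[1]:
            counts[key][1] += 1
    # Bins without long fragments are skipped to avoid division by zero.
    return {key: s / l for key, (s, l) in counts.items() if l > 0}


# Toy fragments, invented purely to show how the threshold changes the profile.
fragments = [
    ("chr1", 1_000, 120, 60),
    ("chr1", 2_000, 180, 60),
    ("chr1", 3_000, 140, 3),
    ("chr1", 6_000_000, 130, 25),
    ("chr1", 6_500_000, 200, 25),
]

for threshold in (0, 5, 20, 30):
    print(threshold, short_long_ratios(fragments, min_mapq=threshold))
```

In this toy data, raising the threshold removes both individual fragments and, at the highest threshold, an entire bin, which is one way a filtering choice can change the resulting fragmentation profile.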