Computational analysis of copy number profiles of tumors
Abstract
Cancer is a genetic disease. The activation, alteration or deactivation of cancer genes can stimulate undesirable cell proliferation. Cancer genes can be subdivided into oncogenes and tumor suppressors. Oncogenes, such as growth factor receptors, are altered and/or overexpressed genes that are causally linked to tumorigenesis. Tumor suppressors, by contrast, are typically underexpressed or deleted in tumors, since they would otherwise serve a protective role.
There are two main genetic mechanisms that can activate or deactivate cancer genes: mutations and DNA copy number alterations. In this work, we focus on detecting novel cancer genes using somatic DNA copy number data. The philosophy is simple: if independently acquired somatic amplifications or deletions occur frequently across multiple tumor samples, they are likely to harbor oncogenes or tumor suppressors, respectively. With a single tumor DNA copy number profile, it is not possible to know which copy number alterations activate or deactivate cancer genes, since many of the alterations (referred to as passenger aberrations) occur due to genomic instability and do not necessarily provide a selective advantage for cancerous cells. However, when aggregating across many samples, we expect cancer genes to be amplified or deleted more frequently than expected by chance, which allows us to detect them.
This application can be regarded as a peak calling problem. We aggregate (sum) copy number profiles across many tumors and call peaks that are significantly high. To do this, we define a null model that describes the behavior of the aggregate copy number profile that would arise if only passenger aberrations occurred. The null aggregate profile (also called the noise profile) exhibits high autocorrelation across the genome due to the segmented nature of copy number profiles.
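The aggregation and peak calling idea can be illustrated with a small simulation. The sketch below is not the framework developed in this work: it swaps in a simple permutation null (each sample is cyclically shifted by a random offset, which preserves the segmented structure but destroys any recurrence across samples) and thresholds the aggregate at the 95th percentile of the null maximum. The function names, the segment lengths, the driver position and the 60% driver frequency are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate_profiles(n_samples, n_probes, rng, driver=None):
    """Binary amplification calls built from random segments (passengers).

    If `driver` is a (start, end) pair, roughly 60% of samples also
    carry an amplification of that region (a hypothetical cancer gene).
    """
    profiles = np.zeros((n_samples, n_probes), dtype=int)
    for i in range(n_samples):
        for _ in range(3):  # three passenger segments per sample
            start = rng.integers(0, n_probes)
            length = rng.integers(20, 200)
            profiles[i, start:start + length] = 1
        if driver is not None and rng.random() < 0.6:
            lo, hi = driver
            profiles[i, lo:hi] = 1
    return profiles

def null_max_distribution(profiles, n_perm, rng):
    """Maximum of the aggregate after cyclically shifting each sample."""
    n_samples, n_probes = profiles.shape
    maxima = np.empty(n_perm)
    for p in range(n_perm):
        shifts = rng.integers(0, n_probes, size=n_samples)
        shifted = np.stack([np.roll(row, s) for row, s in zip(profiles, shifts)])
        maxima[p] = shifted.sum(axis=0).max()
    return maxima

rng = np.random.default_rng(0)
profiles = simulate_profiles(100, 2000, rng, driver=(1000, 1020))

# Aggregate (sum) the profiles across samples.
aggregate = profiles.sum(axis=0)

# Call probes whose aggregate exceeds the 95th percentile of the null maximum.
null_max = null_max_distribution(profiles, 200, rng)
threshold = np.quantile(null_max, 0.95)
peak_probes = np.flatnonzero(aggregate > threshold)
print(threshold, peak_probes.min(), peak_probes.max())
```

Thresholding on the null maximum controls the chance of any false peak rather than the false discovery rate, and it requires repeated permutation; it is shown here only to make the notion of a noise profile concrete.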
We therefore developed a statistical framework for calling peaks of varying widths where the noise profile can exhibit strong autocorrelation. The framework allows us to detect such peaks with high statistical power while controlling the false discovery rate of detected peaks. We employ two concepts. First, we take advantage of the fact that broad peaks can be detected with much higher statistical power after smoothing the profile, and we developed techniques for adaptive smoothing. Second, we use a powerful statistic called the expected Euler characteristic, which is insensitive to platform resolution, is directly compatible with our smoothing methodology, and can be used directly to estimate the expected number of false positive peaks called.
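To give a feel for why the expected Euler characteristic is useful, the sketch below uses the standard one-dimensional result for smooth, stationary, unit-variance Gaussian noise, which is an assumption made here and not necessarily the null model of this work. On a circular domain of length L, the expected number of connected regions above a threshold u is (L / 2π) · sqrt(λ) · exp(−u² / 2), where λ is the variance of the derivative of the noise; for white noise smoothed with a Gaussian kernel of standard deviation σ, λ = 1 / (2σ²). The function names and parameter values are hypothetical.

```python
import numpy as np

def smooth_noise(n, sigma, rng):
    """Unit-variance white noise smoothed circularly with a Gaussian kernel."""
    x = np.arange(n)
    d = np.minimum(x, n - x)                 # circular distance to index 0
    kernel = np.exp(-0.5 * (d / sigma) ** 2)
    kernel /= np.sqrt(np.sum(kernel ** 2))   # keeps the output variance at 1
    white = rng.standard_normal(n)
    return np.real(np.fft.ifft(np.fft.fft(white) * np.fft.fft(kernel)))

def count_excursions(profile, u):
    """Number of connected regions above u on a circular profile."""
    above = profile > u
    return int(np.sum(above & ~np.roll(above, 1)))  # count upcrossings

def expected_ec(n, sigma, u):
    """Expected Euler characteristic of the excursion set above u."""
    lam = 1.0 / (2.0 * sigma ** 2)           # variance of the derivative
    return n * np.sqrt(lam) / (2.0 * np.pi) * np.exp(-0.5 * u ** 2)

rng = np.random.default_rng(1)
n, sigma, u = 10000, 10.0, 2.5
observed = np.mean(
    [count_excursions(smooth_noise(n, sigma, rng), u) for _ in range(200)]
)
predicted = expected_ec(n, sigma, u)
print(observed, predicted)
```

The prediction depends on the smoothness of the noise rather than on the number of probes, which reflects the insensitivity to platform resolution, and at high thresholds it approximates the expected number of false positive peaks.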
This framework does not rely directly on the inherent properties of DNA copy number profiles and can therefore be applied to many other problems, given suitably defined null models. Although the mathematics we develop in this framework might be taxing at times, the resulting equations that are ultimately used in our peak calling algorithms are simple, and their validity can easily be verified by simulating data and comparing our theoretical expectations with measured observations.