Biostatistics-Lecture 20 Copy number variation detection

Biostatistics-Lecture 20Copy number variation detection Ruibin Xi Peking University School of Mathematical Sciences

Copy number variation (CNV) • CNVs: gains or losses of genomic segments • CNVs account for a substantial proportion of human genomic variations • In the Database of Genomic Variation (DGV), over 30% of human genome can be influenced by CNVs

CNV Distributions in 200+ Normal Human Genomes (Nature 2006)

CNVs are associated with many diseases

CNV in cancer genome Kim et. al. 2013 Genome Res.

CNV detection strategies with HTS data

CNV detection using read-depth • Read-depth: read density in a genomic region • If there is no bias, the read-depth in a genomic region should be roughly proportional to the copy number • But there are often biases in the NGS data.

Algorithms • CNV-seq (Xie and Tammi 2009) • SegSeq (Chiang et al. 2009) • rSW-seq (Kim et al. 2010) • BIC-seq(Xi et al. 2011) and NBIC-seq • FREEC (Boeva et al. 2011)

Circular Binary Segmentation (CBS) • CBS (Olshen et al. 2004) a sequence of random variables An index γ is called a change point if ~ ~ • Binary segmentation Test statistic for one change point

Circular Binary Segmentation (CBS) • Circular Binary Segmentation Test statistic

CNV-seq • Use sliding window • Model the read count in each window as Poisson distribution N: number of reads in the window W: window size G: Genome size When λ is large, this can be approximated by a Gaussian distribution • Given a case and a control, the copy ratio z is the read count ratio

CNV-seq • The distribution of Gaussian ratio distribution is cumbersome, instead use Approximately a standard Gaussian distribution P-value

Seg-seq • In a window a length L, the number of reads for normal genome follows a Poisson distribution with A: reference genome size : total normal reads • The number of reads for tumor in the window r: copy ratio : total tumor reads

Seg-seq • The read count ratio Log(R) approximately a log-normal distribution when To test if a position x is a change point use

Seg-seq • Algorithm: • For each position in the genome, choose a window such that its left and right window contain a fixed number of reads w • Test if the window is a possible change point (i.e. ) if yes, remove the all position in this window and consider next position • Merge the initial segments if two adjacent segments has p-value more than

Poisson or Negative binomial model ?

BIC-seq: Statistical Model • Given a short read that is mapped to the reference genome, it consists of two pieces of information • The position on the reference genome • The read type : tumor ( ) or normal ( ). • Assume the distribution of is . • By Bayes’ theorem where is the marginal distribution of .

Statistical Model (cont. 1) • Denote be the probability of a read at position being a tumor read, i.e. . • Given mapped short reads , the joint likelihood is • To identify CNV regions, it is enough to identify the breakpoints. Larger smaller Larger smaller

Statistical Model (cont. 2) • Assume that is a constant between any two neighboring breakpoints. • Given the breakpoints on a chromosome c, where is the length of the chromosome c. • Let be the common probabilities between the breakpoints and . The likelihood can be written as • One set of breakpoints corresponds to one model. Then, we could use a model selection criterion such as the Bayesian information criterion (BIC) to select the breakpoints.

Bayesian information criterion (BIC) • The general definition of the BIC of a model is • L: the likelihood function evaluated at the MLE • k: the number of parameters in the model • n: the total number of observations

BIC (cont.) • Given the breakpoints , the BIC is • : the number of tumor reads between and . • : the total number of reads between and • : the MLE of the parameter • λ>0 : tuning parameter • Note that the term is common for all different models. Therefore, we can drop it when comparing different models.

Asymptotic result • Assume for all . Then, the breakpoint set that minimizes the BIC is a consistent estimator of the true breakpoint set, i.e. it will converge to the true breakpoint set in probability as N, the number of observations, goes to infinity.

BIC-seq: an algorithm for detecting somatic CNVs in tumor genomes

Some remarks • Used red-black tree to accelerate the algorithm • Outlier removal: • Look at local genomic window to determine if the read count at a nucleotide position is an outlier • Assign credible interval to a breakpoint • Gibbs sampling An outlier example

Application of BIC-seq on a GBM tumor genome • Applied BIC-seq on a GBM tumor genome • Tumor: 10X • Normal: 7x • Detected 291 putative CNVs ranging from 40bp to 5.7 Mb • Compare the copy ratio estimate given by BIC-seq and an array-based platform

Application of BIC-seq on a GBM tumor genome (Cont.) • Selected 16 small CNVs ranging from 110 bp to 14 kb for qPCR validation • 14 out of 16 (87.5%) were validated

Application of BIC-seq on a GBM tumor genome (Cont.)

Biostatistics-Lecture 20 Copy number variation detection