320 likes | 637 Views
Biostatistics-Lecture 20 Copy number variation detection. Ruibin Xi Peking University School of Mathematical Sciences. Copy number variation (CNV). CNVs: gains or losses of genomic segments CNVs account for a substantial proportion of human genomic variations
E N D
Biostatistics-Lecture 20Copy number variation detection Ruibin Xi Peking University School of Mathematical Sciences
Copy number variation (CNV) • CNVs: gains or losses of genomic segments • CNVs account for a substantial proportion of human genomic variations • In the Database of Genomic Variation (DGV), over 30% of human genome can be influenced by CNVs
CNV Distributions in 200+ Normal Human Genomes (Nature 2006)
CNV in cancer genome Kim et. al. 2013 Genome Res.
CNV in cancer genome Kim et. al. 2013 Genome Res.
CNV detection using read-depth • Read-depth: read density in a genomic region • If there is no bias, the read-depth in a genomic region should be roughly proportional to the copy number • But there are often biases in the NGS data.
Algorithms • CNV-seq (Xie and Tammi 2009) • SegSeq (Chiang et al. 2009) • rSW-seq (Kim et al. 2010) • BIC-seq(Xi et al. 2011) and NBIC-seq • FREEC (Boeva et al. 2011)
Circular Binary Segmentation (CBS) • CBS (Olshen et al. 2004) a sequence of random variables An index γ is called a change point if ~ ~ • Binary segmentation Test statistic for one change point
Circular Binary Segmentation (CBS) • Circular Binary Segmentation Test statistic
CNV-seq • Use sliding window • Model the read count in each window as Poisson distribution N: number of reads in the window W: window size G: Genome size When λ is large, this can be approximated by a Gaussian distribution • Given a case and a control, the copy ratio z is the read count ratio
CNV-seq • The distribution of Gaussian ratio distribution is cumbersome, instead use Approximately a standard Gaussian distribution P-value
Seg-seq • In a window a length L, the number of reads for normal genome follows a Poisson distribution with A: reference genome size : total normal reads • The number of reads for tumor in the window r: copy ratio : total tumor reads
Seg-seq • The read count ratio Log(R) approximately a log-normal distribution when To test if a position x is a change point use
Seg-seq • Algorithm: • For each position in the genome, choose a window such that its left and right window contain a fixed number of reads w • Test if the window is a possible change point (i.e. ) if yes, remove the all position in this window and consider next position • Merge the initial segments if two adjacent segments has p-value more than
Seg-seq • Algorithm: • For each position in the genome, choose a window such that its left and right window contain a fixed number of reads w • Test if the window is a possible change point (i.e. ) if yes, remove the all position in this window and consider next position • Merge the initial segments if two adjacent segments has p-value more than
BIC-seq: Statistical Model • Given a short read that is mapped to the reference genome, it consists of two pieces of information • The position on the reference genome • The read type : tumor ( ) or normal ( ). • Assume the distribution of is . • By Bayes’ theorem where is the marginal distribution of .
Statistical Model (cont. 1) • Denote be the probability of a read at position being a tumor read, i.e. . • Given mapped short reads , the joint likelihood is • To identify CNV regions, it is enough to identify the breakpoints. Larger smaller Larger smaller
Statistical Model (cont. 2) • Assume that is a constant between any two neighboring breakpoints. • Given the breakpoints on a chromosome c, where is the length of the chromosome c. • Let be the common probabilities between the breakpoints and . The likelihood can be written as • One set of breakpoints corresponds to one model. Then, we could use a model selection criterion such as the Bayesian information criterion (BIC) to select the breakpoints.
Bayesian information criterion (BIC) • The general definition of the BIC of a model is • L: the likelihood function evaluated at the MLE • k: the number of parameters in the model • n: the total number of observations
BIC (cont.) • Given the breakpoints , the BIC is • : the number of tumor reads between and . • : the total number of reads between and • : the MLE of the parameter • λ>0 : tuning parameter • Note that the term is common for all different models. Therefore, we can drop it when comparing different models.
Asymptotic result • Assume for all . Then, the breakpoint set that minimizes the BIC is a consistent estimator of the true breakpoint set, i.e. it will converge to the true breakpoint set in probability as N, the number of observations, goes to infinity.
BIC-seq: an algorithm for detecting somatic CNVs in tumor genomes
Some remarks • Used red-black tree to accelerate the algorithm • Outlier removal: • Look at local genomic window to determine if the read count at a nucleotide position is an outlier • Assign credible interval to a breakpoint • Gibbs sampling An outlier example
Application of BIC-seq on a GBM tumor genome • Applied BIC-seq on a GBM tumor genome • Tumor: 10X • Normal: 7x • Detected 291 putative CNVs ranging from 40bp to 5.7 Mb • Compare the copy ratio estimate given by BIC-seq and an array-based platform
Application of BIC-seq on a GBM tumor genome (Cont.) • Selected 16 small CNVs ranging from 110 bp to 14 kb for qPCR validation • 14 out of 16 (87.5%) were validated