1 / 29

Biostatistics-Lecture 20 Copy number variation detection

Biostatistics-Lecture 20 Copy number variation detection. Ruibin Xi Peking University School of Mathematical Sciences. Copy number variation (CNV). CNVs: gains or losses of genomic segments CNVs account for a substantial proportion of human genomic variations

Download Presentation

Biostatistics-Lecture 20 Copy number variation detection

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Biostatistics-Lecture 20Copy number variation detection Ruibin Xi Peking University School of Mathematical Sciences

  2. Copy number variation (CNV) • CNVs: gains or losses of genomic segments • CNVs account for a substantial proportion of human genomic variations • In the Database of Genomic Variation (DGV), over 30% of human genome can be influenced by CNVs

  3. CNV Distributions in 200+ Normal Human Genomes (Nature 2006)

  4. CNVs are associated with many diseases

  5. CNV in cancer genome Kim et. al. 2013 Genome Res.

  6. CNV in cancer genome Kim et. al. 2013 Genome Res.

  7. CNV detection strategies with HTS data

  8. CNV detection using read-depth • Read-depth: read density in a genomic region • If there is no bias, the read-depth in a genomic region should be roughly proportional to the copy number • But there are often biases in the NGS data.

  9. Algorithms • CNV-seq (Xie and Tammi 2009) • SegSeq (Chiang et al. 2009) • rSW-seq (Kim et al. 2010) • BIC-seq(Xi et al. 2011) and NBIC-seq • FREEC (Boeva et al. 2011)

  10. Circular Binary Segmentation (CBS) • CBS (Olshen et al. 2004) a sequence of random variables An index γ is called a change point if ~ ~ • Binary segmentation Test statistic for one change point

  11. Circular Binary Segmentation (CBS) • Circular Binary Segmentation Test statistic

  12. CNV-seq • Use sliding window • Model the read count in each window as Poisson distribution N: number of reads in the window W: window size G: Genome size When λ is large, this can be approximated by a Gaussian distribution • Given a case and a control, the copy ratio z is the read count ratio

  13. CNV-seq • The distribution of Gaussian ratio distribution is cumbersome, instead use Approximately a standard Gaussian distribution P-value

  14. Seg-seq • In a window a length L, the number of reads for normal genome follows a Poisson distribution with A: reference genome size : total normal reads • The number of reads for tumor in the window r: copy ratio : total tumor reads

  15. Seg-seq • The read count ratio Log(R) approximately a log-normal distribution when To test if a position x is a change point use

  16. Seg-seq • Algorithm: • For each position in the genome, choose a window such that its left and right window contain a fixed number of reads w • Test if the window is a possible change point (i.e. ) if yes, remove the all position in this window and consider next position • Merge the initial segments if two adjacent segments has p-value more than

  17. Seg-seq • Algorithm: • For each position in the genome, choose a window such that its left and right window contain a fixed number of reads w • Test if the window is a possible change point (i.e. ) if yes, remove the all position in this window and consider next position • Merge the initial segments if two adjacent segments has p-value more than

  18. Poisson or Negative binomial model ?

  19. BIC-seq: Statistical Model • Given a short read that is mapped to the reference genome, it consists of two pieces of information • The position on the reference genome • The read type : tumor ( ) or normal ( ). • Assume the distribution of is . • By Bayes’ theorem where is the marginal distribution of .

  20. Statistical Model (cont. 1) • Denote be the probability of a read at position being a tumor read, i.e. . • Given mapped short reads , the joint likelihood is • To identify CNV regions, it is enough to identify the breakpoints. Larger smaller Larger smaller

  21. Statistical Model (cont. 2) • Assume that is a constant between any two neighboring breakpoints. • Given the breakpoints on a chromosome c, where is the length of the chromosome c. • Let be the common probabilities between the breakpoints and . The likelihood can be written as • One set of breakpoints corresponds to one model. Then, we could use a model selection criterion such as the Bayesian information criterion (BIC) to select the breakpoints.

  22. Bayesian information criterion (BIC) • The general definition of the BIC of a model is • L: the likelihood function evaluated at the MLE • k: the number of parameters in the model • n: the total number of observations

  23. BIC (cont.) • Given the breakpoints , the BIC is • : the number of tumor reads between and . • : the total number of reads between and • : the MLE of the parameter • λ>0 : tuning parameter • Note that the term is common for all different models. Therefore, we can drop it when comparing different models.

  24. Asymptotic result • Assume for all . Then, the breakpoint set that minimizes the BIC is a consistent estimator of the true breakpoint set, i.e. it will converge to the true breakpoint set in probability as N, the number of observations, goes to infinity.

  25. BIC-seq: an algorithm for detecting somatic CNVs in tumor genomes

  26. Some remarks • Used red-black tree to accelerate the algorithm • Outlier removal: • Look at local genomic window to determine if the read count at a nucleotide position is an outlier • Assign credible interval to a breakpoint • Gibbs sampling An outlier example

  27. Application of BIC-seq on a GBM tumor genome • Applied BIC-seq on a GBM tumor genome • Tumor: 10X • Normal: 7x • Detected 291 putative CNVs ranging from 40bp to 5.7 Mb • Compare the copy ratio estimate given by BIC-seq and an array-based platform

  28. Application of BIC-seq on a GBM tumor genome (Cont.) • Selected 16 small CNVs ranging from 110 bp to 14 kb for qPCR validation • 14 out of 16 (87.5%) were validated

  29. Application of BIC-seq on a GBM tumor genome (Cont.)

More Related