SEGMENTAL VARIATION (Copy Number Variants and other gross chromosomal rearrangements) Allen E. Bale, M.D. Dept. of Genetics
Importance of Copy Number Variants (CNVs) and Other Rearrangements in Health and Disease • Constitutional (germ-line) variants in hereditary conditions • Large and small copy number variants • Translocations and inversions: rarely cause a phenotype but may generate CNVs due to mis-pairing during meiosis • Somatically acquired variants in cancer • Duplications and deletions: amplification of oncogene; loss of tumor suppressor • Translocations and inversions: place oncogene under control of an active promoter
What is the origin of structural variants? • An area of active research • Recurrent constitutional CNVs: Often related to illegitimate recombination between homologous, but non-identical, sequences • Rare, non-recurrent, constitutional CNVs: No obvious sequence homology at breakpoints; possibly non-homologous end joining • Tumor CNVs: Any mechanism to create a rearrangement that favors tumor growth, often non-homologous end joining
Limitations of Cytogenetics • Cell has to be proliferating in order to arrest chromosomes at metaphase (when they are visible under the microscope) • Resolution is limited (in the range of 5 Mb) • Requires highly skilled technologists and still a lot of hands-on time, even with sophisticated image processing
Submicroscopic CNVs: Array CGH (frequently referred to as “chromosome microarray”)
Example: Submicroscopic 22q deletion • Abnormal nose, ears, and palate • Also heart, parathyroid, and thymus abnormalities
Limitations of Array CGH • Can’t detect translocations and inversions • Resolution still limited by number of probes on the array—typical resolution about 100 kb • Still a fair amount of variability in results depending on exactly which array is used
Genome-scale sequencing to detect rearrangements If you could sequence each chromosome as one continuous piece of DNA, from one end to the other with no gaps in the sequence, what structural variants would you miss?
Genome-scale sequencing to detect rearrangements What methods are currently in use? • Depth-of-coverage methods: Regions that are deleted or duplicated should yield fewer or more reads, respectively • Detection of breakpoints by: • Short paired reads (like Illumina paired-end sequencing): Are the sequences at the two ends of a fragment both from the same chromosome? Are they the right distance apart? • Long reads (kb-scale): Direct sequencing of breakpoints
Genome-scale sequencing to detect rearrangements • Depth-of-coverage method • Detection of breakpoints by short paired reads • Detection of breakpoints by long reads • Compared with cytogenetics and array CGH, how would the approaches above perform? • What would be missed by depth-of-coverage reading? • What would be missed by detection of breakpoints? • What problems do you foresee with these two approaches?
Depth-of-coverage example: Whole exome sequencing as a tool to identify both sequence variants and CNVs
Whole exome sequencing (see Dr. Lifton’s lecture) • Capture portions of the genome containing exons in order to efficiently sequence coding regions • Not designed for CNV detection, but potentially contains information on gene dosage • For any gene, the number of fragments captured on the array and sequenced should be proportional to the representation in the starting material
Does this work at all? • Total reads on the X chromosome were counted in a series of males and females • Gene dosage for the X chromosome in males should be half the gene dosage for the X chromosome in females
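The sanity check on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the read counts are made up, the function names are hypothetical, and X-chromosome reads are normalized to autosomal reads (the slide does not say how the original analysis normalized for sequencing depth).

```python
from statistics import mean

def x_dosage(x_reads: int, autosomal_reads: int) -> float:
    """X-chromosome reads relative to autosomal reads for one sample.

    Dividing by autosomal reads (an assumption, not the slide's stated
    method) corrects for differences in overall sequencing depth.
    """
    if autosomal_reads <= 0:
        raise ValueError("autosomal_reads must be positive")
    return x_reads / autosomal_reads

def male_to_female_ratio(males, females):
    """Mean X dosage in males divided by mean X dosage in females.

    Each argument is a list of (x_reads, autosomal_reads) tuples.
    """
    male_mean = mean(x_dosage(x, a) for x, a in males)
    female_mean = mean(x_dosage(x, a) for x, a in females)
    return male_mean / female_mean

# Made-up counts, for illustration only.
males = [(1_000, 50_000), (1_100, 52_000)]
females = [(2_000, 50_000), (2_050, 51_000)]

ratio = male_to_female_ratio(males, females)
print(f"male/female X dosage ratio: {ratio:.2f}")
```

If read depth tracks gene dosage, the ratio should come out close to one half, as it does for these invented numbers.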
Does it work for single exons? Reads were counted for each exon of the OTC gene on the X chromosome. Males should have one half the female dosage. • Read number varies among exons due to different capture efficiencies but is consistent from subject to subject. • Exons with sufficient read numbers show a dosage effect. • Performs very well for this 70 kb gene taken as a single unit.
Approach to scanning the whole genome for CNVs • The genome was divided into 50 kb windows. • Intervals with zero reads were removed. • Mean number of reads and standard deviations for each interval were calculated from 10 exome sequences. • Depth of coverage in a single patient was compared to average and standard deviation of depth of coverage. • Algorithms were developed for: • Classifying X chromosome as being deleted in males compared with females • Classifying X chromosome as being duplicated in females compared with males
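The comparison of one patient against a reference set can be sketched as a per-interval z-score. This is a minimal sketch, not the algorithm actually used: the function names, the cutoff of 3 standard deviations, and the depth values are all assumptions for illustration, and depths are assumed to be already normalized for total read count.

```python
from statistics import mean, stdev

def window_zscores(reference, patient):
    """Compare a patient's depth per interval with a reference set.

    reference: dict mapping interval -> list of depths from reference exomes
    patient:   dict mapping interval -> depth in the patient
    Returns a dict mapping interval -> z-score.
    """
    zscores = {}
    for interval, depths in reference.items():
        mu = mean(depths)
        if mu == 0:
            continue  # intervals with zero reads are removed
        sd = stdev(depths)
        if sd == 0:
            continue  # no spread in the reference, z-score undefined
        zscores[interval] = (patient[interval] - mu) / sd
    return zscores

def classify(zscores, cutoff=3.0):
    """Label each interval; the cutoff is a hypothetical choice."""
    calls = {}
    for interval, z in zscores.items():
        if z >= cutoff:
            calls[interval] = "gain"
        elif z <= -cutoff:
            calls[interval] = "loss"
        else:
            calls[interval] = "normal"
    return calls

# Made-up depths for three 50 kb intervals in 10 reference exomes.
reference = {
    ("chr5", 0): [100, 102, 98, 101, 99, 100, 103, 97, 100, 100],
    ("chr5", 1): [50, 52, 48, 51, 49, 50, 53, 47, 50, 50],
    ("chr5", 2): [0] * 10,  # no exons captured here
}
patient = {("chr5", 0): 150, ("chr5", 1): 51, ("chr5", 2): 0}

calls = classify(window_zscores(reference, patient))
print(calls)
```

With only 10 reference exomes the standard deviation estimate is itself noisy, which is one reason the additional filters on the following slides are needed.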
Chromosomal coverage with non-zero, 50 kb intervals corresponds exactly to density of coding sequences
Test case: Female with a 338 kb duplication on 5q35. Diagram shows all loci passing the initial algorithm
Filter #1: Require two adjacent intervals to both be deleted or duplicated
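A minimal sketch of this adjacency filter follows. It assumes that calls are keyed by chromosome and interval index and that "adjacent" means consecutive indices on the same chromosome; the slide does not specify either detail, and the function name is hypothetical.

```python
def require_adjacent(calls):
    """Keep a call only if a neighboring interval has the same call.

    calls: dict mapping (chromosome, interval_index) -> "gain" or "loss"
    """
    kept = {}
    for (chrom, idx), state in calls.items():
        neighbors = ((chrom, idx - 1), (chrom, idx + 1))
        if any(calls.get(n) == state for n in neighbors):
            kept[(chrom, idx)] = state
    return kept

# Made-up calls: only the two consecutive chr5 gains should survive.
calls = {
    ("chr5", 10): "gain",
    ("chr5", 11): "gain",
    ("chr5", 20): "gain",  # isolated interval
    ("chr7", 3): "loss",   # neighbor has the opposite call
    ("chr7", 4): "gain",
}

kept = require_adjacent(calls)
print(kept)
```

The trade-off is that a real CNV confined to a single interval is discarded along with the isolated false positives.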
Filter #2: Remove “deleted regions” that contain heterozygous variants
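The reasoning behind this filter is that a truly deleted region is present in only one copy, so heterozygous variants should not be observed within it. A minimal sketch, assuming half-open region coordinates and a sorted list of heterozygous variant positions per chromosome (both data layouts and all names are hypothetical):

```python
from bisect import bisect_left

def has_het_site(region, het_sites):
    """True if any heterozygous variant lies in [start, end) of the region.

    region:    (chromosome, start, end)
    het_sites: dict mapping chromosome -> sorted list of positions
    """
    chrom, start, end = region
    positions = het_sites.get(chrom, [])
    i = bisect_left(positions, start)  # first position >= start
    return i < len(positions) and positions[i] < end

def drop_deletions_with_het_sites(deletions, het_sites):
    """Remove candidate deletions that contain a heterozygous variant."""
    return [r for r in deletions if not has_het_site(r, het_sites)]

# Made-up data: the first candidate contains a heterozygous variant.
het_sites = {"chr5": [1_000, 75_000, 300_000]}
deletions = [
    ("chr5", 50_000, 100_000),
    ("chr5", 100_000, 150_000),
    ("chr7", 0, 50_000),
]

kept = drop_deletions_with_het_sites(deletions, het_sites)
print(kept)
```

In practice a single heterozygous call may itself be a genotyping error, so requiring more than one such variant before discarding a region would be a reasonable variation.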
Application to 7 subjects with deletions or duplications in 500 kb to 1 Mb range
Some problems with use of exome data • Intervals with no genes are not covered (important?) • Intervals with large genes having close homologs elsewhere in the genome cannot be accurately evaluated. • Because this technology is evolving rapidly, the normal standard to which a test sample is compared needs to be a pool of recent exome sequences (huge false discovery rate with non-homogeneous samples).
For a review of published depth-of-coverage methods for exome or genome data, see: Klambauer, G. et al. (2012). "cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate." Nucleic Acids Res. The paper compares several programs, none of which works really well. Two newer programs for exome sequencing are in your reading list.
Paired-end methods • Illumina HiSeq, the current industry leader in high-throughput sequencing, generates short reads from fragments 200 to 600 bp long. • Reading both ends of the same fragment gives you sequences that should lie 200 to 600 bp apart • Other methods can generate paired fragments that lie even farther apart
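The two questions asked of each read pair (same chromosome? right distance apart?) can be sketched as a simple classifier. The function name and labels are hypothetical, the 200 to 600 bp range is taken from the slide, and read orientation, which is what reveals inversions, is ignored here to keep the sketch short.

```python
def classify_pair(chrom1, pos1, chrom2, pos2, min_span=200, max_span=600):
    """Classify one read pair by where its two ends map on the reference.

    Positions are the mapped coordinates of the two fragment ends.
    """
    if chrom1 != chrom2:
        return "different_chromosomes"  # possible translocation
    span = abs(pos2 - pos1)
    if span > max_span:
        return "too_far_apart"          # possible deletion in the sample
    if span < min_span:
        return "too_close"              # possible insertion in the sample
    return "concordant"

# Made-up read pairs, for illustration only.
pairs = [
    ("chr1", 10_000, "chr1", 10_400),
    ("chr1", 10_000, "chr1", 25_000),
    ("chr1", 10_000, "chr1", 10_050),
    ("chr1", 10_000, "chr9", 500_000),
]
labels = [classify_pair(*p) for p in pairs]
print(labels)
```

A single discordant pair proves little; callers typically look for clusters of pairs that support the same breakpoint.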
Long paired-end methods Paired end mapping—up to thousands of bp apart From Korbel et al., 2009
Analyzing structural variations from paired end data • PEMer (Korbel et al., 2009): For discovery of CNVs and inversions; could also be implemented for translocations • BreakDancer (Chen et al., 2009): For discovery of CNVs, inversions, and translocations
Identifying Structural Mutations with paired end sequence: What goes wrong?
How to overcome problems with paired end detection of CNVs Separating the wheat from the chaff • Technical artifacts (ligation of unrelated fragments during library preparation) may be numerous but will be random • Artifacts related to homologous sequences (see previous slide) will be reproducible but common to all samples • Real structural variants will be reproducible within a sample and not common to all samples • How much reading depth do you need to detect the real variants?
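The three categories above suggest a simple triage of candidate breakpoints by how many read pairs support them within a sample and how many samples share them. This is a sketch under stated assumptions: the thresholds (`min_support`, `max_sample_fraction`), the labels, and the data are all hypothetical, and a genuinely common polymorphism would be mislabeled as a homology artifact by this rule.

```python
def triage(support_by_sample, sample, min_support=3, max_sample_fraction=0.9):
    """Sort one sample's candidate breakpoints into three categories.

    support_by_sample: dict mapping sample -> {breakpoint: supporting pairs}
    """
    n_samples = len(support_by_sample)
    result = {}
    for bp, support in support_by_sample[sample].items():
        if support < min_support:
            # Random ligation artifacts are not reproducible within a sample.
            result[bp] = "probable_random_artifact"
            continue
        n_with = sum(
            1 for calls in support_by_sample.values()
            if calls.get(bp, 0) >= min_support
        )
        if n_with / n_samples >= max_sample_fraction:
            # Reproducible but seen in (nearly) every sample.
            result[bp] = "probable_homology_artifact"
        else:
            result[bp] = "candidate_variant"
    return result

# Made-up support counts for three samples.
support_by_sample = {
    "S1": {"bpA": 1, "bpB": 5, "bpC": 7},
    "S2": {"bpB": 6},
    "S3": {"bpB": 4},
}

result = triage(support_by_sample, "S1")
print(result)
```

The `min_support` threshold is where reading depth enters: at low depth a real heterozygous variant may be covered by too few pairs to clear it.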
Toward direct sequencing of breakpoints • Long reads • PacBio can generate reads of 1000 bp or so • Nanopore sequencing is said to generate reads in the tens of thousands of bp • Strobe sequencing with PacBio: Normally read length is limited due to inactivation of the polymerase by the laser. Short bursts of laser give sample sequences along a stretch of DNA in the 20 kb range.
Programs for analysis of longer reads that directly sequence breakpoints • CREST (Wang et al., 2011): Detects small and large structural variants by direct sequencing of breakpoints. • SRiC (Zhang et al., 2011): Similar to CREST • Algorithm for strobe reads (Ritz et al., 2010)
Conclusions • Structural variation in the genome accounts for a great deal of human phenotypic variability including disease • Depth-of-coverage methods can detect many CNVs but not inversions and translocations. Variation from sample to sample limits sensitivity and specificity. • Whole genome sequencing, which can identify all types of structural variants, will supersede depth-of-coverage methods. • Large scale and small scale duplications and repetitive sequences remain a major obstacle.
Acknowledgments for exome CNV analysis Department of Genetics Patricia Gordon Christopher Heffelfinger Murim Choi Shrikant Mane Richard Lifton Allen Bale Neuropsychiatric Genetics Program Stephan Sanders Matthew State School of Public Health, Biostatistics Division Annette Molinaro