
Structural Variation in the 1000 Genomes Project

Structural Variation in the 1000 Genomes Project. Bob Handsaker, Broad Institute, Program in Medical and Population Genetics; Harvard Medical School, Dept. of Genetics (McCarroll Lab), on behalf of the 1000 Genomes Structural Variation Analysis Group. October 23, 2013.



Presentation Transcript


  1. Structural Variation in the 1000 Genomes Project Bob Handsaker Broad Institute, Program in Medical and Population Genetics Harvard Medical School Dept. of Genetics (McCarroll Lab) on behalf of the 1000 Genomes Structural Variation Analysis Group October 23, 2013

  2. Ascertaining large variants from short reads
     [Figure: four evidence types, each shown as sample reads aligned to the reference]
     • Read Pairs (RP): deletion, tandem duplication, mobile element insertion (MEI), no SV
     • Read Depth (RD): duplication, deletion
     • Split Reads (SR): deletion
     • Assembly (AS): novel sequence insertion, deletion
     Slide courtesy of Jan Korbel, Ryan Mills
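
The read-pair signal named on this slide can be illustrated with a minimal sketch. It assumes that each read pair has already been reduced to its mapped span on the reference, and it flags pairs whose span is far larger than the library's typical insert size, which is the pattern a deletion would produce. The function name, the z-score cutoff and the numbers are hypothetical; discovery tools such as BreakDancer or Genome STRiP do considerably more than this.

```python
from statistics import mean, stdev

def deletion_like_pairs(spans, library_spans, z_cutoff=3.0):
    """Return mapped spans that are much larger than the library's usual insert size.

    spans         -- mapped spans (bp) of the read pairs under test
    library_spans -- spans of pairs assumed to be concordant (hypothetical input)
    z_cutoff      -- illustrative threshold, not taken from any published caller
    """
    mu = mean(library_spans)
    sigma = stdev(library_spans)
    threshold = mu + z_cutoff * sigma
    return [s for s in spans if s > threshold]

# Made-up numbers: a ~400 bp library and two pairs spanning a ~2.5 kb gap.
library = [380, 395, 400, 405, 410, 390, 420, 400]
candidates = [402, 398, 2900, 3100, 410]
print(deletion_like_pairs(candidates, library))
```

A single stretched pair is weak evidence, since chimeric molecules produce the same pattern; a real caller would require several such pairs clustering at the same location.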

  3. Why is structural variation calling challenging? • Artifacts abound • Millions of chimeric molecules generated during library construction • Read depth varies across the genome and across libraries • Alignment algorithms are misled by the genome’s repeats • Low-coverage sequencing • Data is not definitive in each genome • False discoveries can accumulate across genomes • Deep genomes • Increased depth can help, but methodology is more important

  4. Structural Variation in 1000 Genomes What are the goals? Create a reference panel for imputing structural polymorphisms Create a comprehensive catalog of human structural polymorphism

  5. 1000 Genomes Project Phases: Structural Variation Goals
     Pilot Phase (2010/2011): 179 low-coverage genomes, 2 deep trios, plus exome sequencing
     • Variant catalog of multiple variant types
     • Genotypes for some variants (deletions, mobile element insertions)
     Phase 1 (2012): 1092 low-coverage genomes
     • Expanded catalog of deletions
     • Integrated haplotypes combining deletions with SNPs / indels
     Phase 3 (2013/2014): 2535 low-coverage genomes, 2 deep trios (updated), 427 deep genomes (135 trios) from Complete Genomics
     • Expanded variant catalog covering many variant types
     • Integrated haplotypes including more forms of structural variation

  6. 1000G Phase 1 – Deletion calling pipeline
     Five deletion discovery algorithms
     • BreakDancer: read pairs, Washington University
     • CNVnator: read depth, Yale
     • Delly: read pairs/depth, EMBL
     • Genome STRiP: read pairs/depth, Harvard/Broad
     • Pindel: split reads, Leiden University
     Three validation methods
     • OMNI 2.5 SNP arrays, probe intensities (Broad)
     • Array CGH, 2x1M arrays, 25 samples (HMS)
     • PCR on 100 sites from each algorithm (EMBL)
     Two breakpoint assembly methods
     • Tigra_SV + CROSSMATCH alignment (U Texas)
     • Tigra_SV + AGE alignment CNVnator (Yale)
     Genotyping / call reconciliation
     • Genome STRiP (Harvard/Broad)
     Phasing onto integrated haplotypes
     • Beagle (Univ. of Washington) plus MaCH (Univ. of Michigan)

  7. Deletion call sets and validation results
     Genotyped call set: median length 2,974 bp; 57% of sites are novel.
     The union and filtered call sets are available as supplemental data files from the 1000 Genomes ftp site. These are less stringent supersets of the calls that are phased on the integrated haplotypes.
     [Figure: length distribution and novelty]

  8. Deletion genotyping
     Goal: Accurate genotype likelihoods across all samples
     Genome STRiP uses an integrated likelihood framework to combine evidence from read depth and read pairs (and optionally split reads). Adjudicates between redundant calls.
     [Figure: genotyping accuracy; axis: normalized read depth]
     Handsaker, R.E., Korn, J.M., Nemesh, J. & McCarroll, S.A. Discovery and genotyping of genome structural polymorphism by sequencing on a population scale. Nat Genet 43, 269-76 (2011)
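
The idea of combining independent evidence sources into one set of genotype likelihoods can be sketched as follows. This is an illustrative toy model, not Genome STRiP's actual likelihood framework: it assumes a Poisson read-depth model for copy number 0, 1 and 2, takes the read-pair log-likelihoods as given (hypothetical values), and adds the two in log space before normalising under a flat prior. All names and parameters are assumptions made for the example.

```python
import math

COPY_NUMBERS = (0, 1, 2)

def poisson_log_pmf(k, lam):
    """Log of the Poisson probability of observing k events with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def depth_log_likelihoods(observed_reads, expected_diploid_reads, floor=0.01):
    """Log-likelihood of the observed read count under each copy number.

    The floor keeps the expected count above zero for copy number 0, so a few
    mismapped reads do not make a homozygous deletion impossible (an assumption).
    """
    result = {}
    for cn in COPY_NUMBERS:
        lam = max(cn / 2.0, floor) * expected_diploid_reads
        result[cn] = poisson_log_pmf(observed_reads, lam)
    return result

def combine(*evidence):
    """Add independent log-likelihoods and normalise to probabilities (flat prior)."""
    total = {cn: sum(e[cn] for e in evidence) for cn in COPY_NUMBERS}
    peak = max(total.values())  # subtract the peak for numerical stability
    weights = {cn: math.exp(v - peak) for cn, v in total.items()}
    norm = sum(weights.values())
    return {cn: w / norm for cn, w in weights.items()}

# Made-up example: 22 reads observed where 40 are expected for two copies,
# plus hypothetical read-pair log-likelihoods that also favour one copy.
rd = depth_log_likelihoods(observed_reads=22, expected_diploid_reads=40)
rp = {0: -6.0, 1: -1.0, 2: -8.0}
posterior = combine(rd, rp)
print(max(posterior, key=posterior.get))
```

Working in log space lets each evidence type contribute additively, which is why a sample with ambiguous read depth can still be genotyped confidently when read pairs agree.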

  9. Variant Call Format: conventions used for large deletions in 1000G
     VCF file format specification: http://vcftools.sourceforge.net/specs.html

     CHROM  POS      ID       REF      ALT    QUAL  FILTER  INFO
     1      2371985  DEL_455  CGCTGG…  C      .     .       END=2374312;HOMLEN=5;HOMSEQ=GCTGG;CIPOS=-25,9;CIEND=-10,25
     1      2918690  DEL_833  G        <DEL>  .     .       END=2919922;CIPOS=-18,19;CIEND=-17,33

     Precise variants
     Field   Meaning                                                   Example
     REF     the entire reference allele (may be long)                 CGCTGGCCT…
     ALT     the entire alternate allele (usually short)               C
     HOMSEQ  identical sequence at breakpoint (if any)                 GCTGG
     HOMLEN  length of HOMSEQ                                          5
     CIPOS   confidence interval on POS (before breakpoint assembly)   -25,9
     CIEND   confidence interval on END (before breakpoint assembly)   -10,25

     Imprecise variants
     Field   Meaning                                                   Example
     REF     a single base (at POS)                                    G
     ALT     will be <DEL>                                             <DEL>
     END     best estimate of end coordinate                           2919922
     CIPOS   confidence interval on POS                                -18,19
     CIEND   confidence interval on END                                -17,33
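
The conventions on this slide can be read back out of a record with a few lines of Python. The sketch below assumes the INFO keys shown above (END, CIPOS, CIEND) are present, treats a record as imprecise when ALT is the symbolic `<DEL>` allele, and takes the deleted length as END minus POS, since POS is the base preceding the deleted sequence. The function names are hypothetical, and a production workflow would more likely use an existing VCF library.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

def deletion_interval(pos, alt, info):
    """Summarise a deletion record using the slide's conventions."""
    fields = parse_info(info)
    end = int(fields["END"])
    ci_pos = tuple(int(x) for x in fields["CIPOS"].split(","))
    ci_end = tuple(int(x) for x in fields["CIEND"].split(","))
    return {
        "precise": alt != "<DEL>",  # symbolic ALT marks an imprecise call
        "length": end - pos,        # POS is the base before the deleted sequence
        "start_range": (pos + ci_pos[0], pos + ci_pos[1]),
        "end_range": (end + ci_end[0], end + ci_end[1]),
    }

# The two example records from the slide.
del_455 = deletion_interval(
    2371985, "C", "END=2374312;HOMLEN=5;HOMSEQ=GCTGG;CIPOS=-25,9;CIEND=-10,25")
del_833 = deletion_interval(
    2918690, "<DEL>", "END=2919922;CIPOS=-18,19;CIEND=-17,33")
print(del_455["length"], del_455["precise"])
print(del_833["length"], del_833["start_range"], del_833["end_range"])
```

Note that for precise records the slide describes CIPOS and CIEND as the intervals from before breakpoint assembly, so they describe the original discovery uncertainty rather than the assembled breakpoint.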

  10. Genotyping novel variants in 1000 Genomes
      Scenario: You find evidence of a deletion in your sample/patient. Is it rare, or has it been observed in other people?
      • You can genotype it in 1000 Genomes
        • You can use Genome STRiP to genotype your variant in the public 1000 Genomes data set
        • More accurate than looking up sites in databases, where calls may be based on older methods
        • Evaluate and interpret the evidence directly
        • Allows you to see if a site is definitively monomorphic
      • Features
        • Results in minutes
        • Includes population frequency data and plots
        • Remotely access 1000G BAM files (s3, ftp or http)
        • Recipes and tools available for running on the Amazon cloud or using your local compute resources
      [Figures: copy number across individuals at a monomorphic site (chr20 14517379-14527474, 10.1 Kb) and at a polymorphic site (chr20 3821195-3825139, 3.9 Kb)]
      http://www.broadinstitute.org/software/genomestrip/cookbook
      Genome STRiP is part of iSeqTools, the NHGRI informatics tools network

  11. Structural Variation in 1000 Genomes What are the goals? Create a reference panel for imputing structural polymorphisms Create a comprehensive catalog of human structural polymorphism

  12. Extended Variant Catalog – Phase 1 and Pilot
      • Union of raw Phase 1 deletion calls (113,694 sites)
        • ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/input_call_sets/ALL.wgs.merged_5_del_call_sets_bps.20101123.sv_dels.low_coverage.sites.vcf.gz
      • High specificity subset (23,594 sites, estimated FDR < 5%)
        • Same file, use all records with FILTER != "NONVAL"
      • Genotyped subset (site list only, 14,422 sites)
        • Same file, use records with FILTER == "PASS"
        • Genotypes are in the integrated call set
        • ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/phase1/analysis_results/integrated_call_sets/
      • 1000 Genomes Pilot SV Catalog
        • Includes 5,371 mobile element insertions, tandem duplications, novel sequence insertions
        • ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/paper_data_sets/companion_papers/mapping_structural_variation/
      Mapping copy number variation by population-scale genome sequencing. Mills, et al. Nature 2011 Feb 3;470(7332):59-65
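
The FILTER rules on this slide translate directly into a small tally over the sites file. The sketch below is a minimal illustration that works on any iterable of VCF lines; the record lines in the example are invented, and the FILTER value "OTHER" is a placeholder for whatever non-PASS, non-NONVAL values the real file contains, which the slide does not list.

```python
def classify_sites(lines):
    """Count records in each subset described on the slide, using column 7 (FILTER)."""
    counts = {"union": 0, "high_specificity": 0, "genotyped": 0}
    for line in lines:
        if line.startswith("#"):
            continue  # skip VCF header lines
        filt = line.rstrip("\n").split("\t")[6]
        counts["union"] += 1                 # every record is in the union
        if filt != "NONVAL":
            counts["high_specificity"] += 1  # FILTER != "NONVAL"
        if filt == "PASS":
            counts["genotyped"] += 1         # FILTER == "PASS"
    return counts

# Invented records for illustration only; "OTHER" is a placeholder FILTER value.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\tDEL_a\tG\t<DEL>\t.\tPASS\tEND=2000",
    "1\t5000\tDEL_b\tT\t<DEL>\t.\tNONVAL\tEND=5600",
    "1\t9000\tDEL_c\tA\t<DEL>\t.\tOTHER\tEND=9900",
    "2\t1200\tDEL_d\tC\t<DEL>\t.\tPASS\tEND=4200",
]
print(classify_sites(example))
```

To run this over the downloaded file, the lines can come from `gzip.open(path, "rt")` in the Python standard library, since the sites file is gzip-compressed.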

  13. What is coming in Phase 3: Large deletions, duplications and multi-allelic CNVs
      • Six or more algorithms contributing calls (UW, Yale, EMBL, WashU, Sanger, BI)
      • Tens of thousands of new variants
      • As many as possible will be phased on to haplotypes
      • Will support imputation of these variants into other data sets (e.g. GWAS)
      • Multi-allelic CNV analysis using Genome STRiP
      [Figures: copy number across samples at chr7:143884042-143951887 (ARHGEF35, OR2A42), chr8:7200001-7436084 (DEFB4B, SPAG11B) and chr11:18941449-18963993 (MRGPRX1)]

  14. What is coming in Phase 3: Mobile elements
      • Polymorphic ALUs, L1 LINEs, SVA, HERV
      • Polymorphic reference elements and insertions relative to the reference
      • As many as possible will be phased on to haplotypes
      [Figure: read-pair (RP) events at a reference ALU and at a non-reference mobile element, showing MEI position, ME type, unique-aligned reads and multi-aligned reads]
      Jiantao Wu, Wan-Ping Lee, Gabor Marth

  15. What is coming in Phase 3: Polymorphic inversions
      [Figure: reference region versus observed alignments. The line corresponds to the expected location of alignment if the event does not add or delete sequence; the box corresponds to event boundaries.]
      Ali Bashir, Markus Fritz, Eric Schadt, Jan Korbel

  16. What is coming in Phase 3: Nuclear Mitochondrial Insertions (NUMTs)
      Mitochondrial DNA that has been integrated into the nuclear genome
      [Figure: MT insertion in the sample relative to the reference (b37), bounded by insert size]
      Gargi Dayama, Sarah Emery, Jeff Kidd, Ryan Mills

  17. Structural variation in 1000 Genomes • 1000 Genomes Phase 1 • High quality deletion call set • Low false discovery rate; high genotype accuracy • Integrated reference panel for imputation • 1000 Genomes Phase 3 • Greatly expanded variant catalog • Larger reference panel (26 populations) • Imputation resources for other variant types

  18. 1000 Genomes Structural Variation Analysis Group
      • WashU – Asif Chinwalla, Kai Ye
      • WT Sanger Institute – Klaudia Walter, Manuela Zanda, Sarah Lindsay, Thomas Keane
      • Yale – Alexej Abyzov, Jasmine Mu, Ekta Khurana, Mark Gerstein
      • EMBL – Adrian Stütz, Tobias Rausch, Andreas Schlattl, Markus Fritz
      • Univ of Washington – Peter Sudmant, Art Ko, Fereydoun Hormozdiari, John Huddleston
      • Oxford – Zamin Iqbal, Gil McVean
      • Bilkent University – Can Alkan
      • LSU – Miriam Konkel, Jerilyn Walker, Mark Batzer
      • UNC Charlotte – Mindy Shi
      • MSSM – Seungtai Yoon, Vlad Makarov, Jayon Lihm
      • AECOM – Kenny Ye
      • Boston College – Chip Stewart, Deniz Kural, Michael Stromberg, Alistair Ward, Jiantao Wu, Wan-Ping Lee, Gabor Marth
      • Broad Institute – Josh Korn, Jim Nemesh, Marcin von Grotthuss, Bob Handsaker, Steve McCarroll
      • UCSD – Doug Greer, Jonathan Sebat
      • UT / MD Anderson – Ken Chen
      • Univ of Michigan – Goo Jun, Gargi Dayama, Sarah Emery, Jeff Kidd, Ryan Mills
      • Univ of Maryland – Eugene Gardner, Scott Devine
      Co-chairs: Jan Korbel (EMBL), Evan Eichler (U. Washington), Charles Lee (Jax)
