
1000G Pilot 3 Progress in silico analysis and comparison to experimental validation

1000G Pilot 3 Progress in silico analysis and comparison to experimental validation. Gabor Marth (Boston College) + A + L Kiran Garimella (Broad Institute) + C February 2, 2010. Acknowledgements. Boston College Amit Indap Wen Fung Leong Gabor Marth Cornell Andy Clark Stanford


Presentation Transcript


  1. 1000G Pilot 3 Progress: in silico analysis and comparison to experimental validation • Gabor Marth (Boston College) + A + L • Kiran Garimella (Broad Institute) + C • February 2, 2010

  2. Acknowledgements • Boston College: Amit Indap, Wen Fung Leong, Gabor Marth • Cornell: Andy Clark • Stanford: Simon Gravel, Carlos Bustamante • Michigan: Tom Blackwell • Baylor: Matthew Bainbridge, Fuli Yu, Donna Muzny, Richard Gibbs • Broad: Chris Hartl, Kiran Garimella, Carrie Sougnez, Mark DePristo • WUGSC: Dan Koboldt, Bob Fulton • WTSI: Aarno Palotie

  3. Data • Capture targets: • Started with ~1,000 genes / ~10,000 exons / 2.3Mb • 1.43Mb of total target length shared between 4 data centers used for this analysis • Samples: • 697 total samples • 7 populations • Sequence coverage: • Goal was deep per-sample coverage • Effective coverage somewhat reduced by fragment duplications • Capture technologies: • Nimblegen solid phase • Agilent liquid phase • Sequencing technologies: • SLX • 454 • Data producers: • BCM • BI • WTSI • WUGSC

  4. Pipelines • Inputs: all 697 samples, and the seven population samples (CEU, CHB, CHD, JPT, LWK, TSI, YRI) • Stages: SNP calling, then SNP statistics • Outputs: segregating sites in each population sample; union of all called sites in all 697 samples

  5. BC and BI call sets are converging • Panels: all called sites; called sites per population (BC/BI intersection)

  6. SNP calls (per population)

  7. SNP calls (all samples) • BI: 18,149 SNPs • BC: 14,502 SNPs • BC only: 1,741 SNPs, 79 dbSNPs (dbSNP = 4.54%) • BC∩BI: 12,761 SNPs, 3,869 dbSNPs (dbSNP = 30.32%) • BI only: 5,388 SNPs, 172 dbSNPs (dbSNP = 3.19%) • BC∪BI = 19,890 SNPs
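The counts on this slide can be cross-checked with simple set arithmetic. The sketch below is only a consistency check on the transcribed numbers, not part of either calling pipeline; it assumes the 1,741-site region is BC only and the 5,388-site region is BI only, which is the reading under which the per-caller totals add up. The variable and function names are illustrative.

```python
# Region counts transcribed from the slide: SNP sites and how many are in dbSNP.
# Assumption: 1,741 is the BC-only region and 5,388 the BI-only region.
bc_only = {"snps": 1741, "dbsnp": 79}
shared = {"snps": 12761, "dbsnp": 3869}
bi_only = {"snps": 5388, "dbsnp": 172}


def dbsnp_percent(region):
    """Percentage of a region's SNP sites that are already in dbSNP."""
    return round(100.0 * region["dbsnp"] / region["snps"], 2)


# Each caller's total is its private region plus the shared region.
bc_total = bc_only["snps"] + shared["snps"]
bi_total = bi_only["snps"] + shared["snps"]
# The union counts every region exactly once.
union_total = bc_only["snps"] + shared["snps"] + bi_only["snps"]

for name, region in [("BC only", bc_only), ("BC and BI", shared), ("BI only", bi_only)]:
    print(name, region["snps"], dbsnp_percent(region))
print("BC", bc_total, "BI", bi_total, "union", union_total)
```

Under that reading the totals reproduce the 14,502 (BC), 18,149 (BI), and 19,890 (union) figures quoted on the slide.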

  8. Genotype call accuracy relative to HapMap3 • Data quality in CHB and JPT samples seems consistently lower • Statistics only include genotype calls at SNP sites in BC∩BI

  9. Genotype calls • Filtering: • BC filters on genotype call quality • BI reports a genotype for any site covered by at least one read (the Broad caller does not filter on genotype quality) • Nominally, BI makes more calls than BC and has, on average, higher AF • All SNP sites considered: # SNP sites = 3,489, r = 0.9921 • Only SNP sites with >= 80% called genotypes: # SNP sites = 3,075, r = 0.9979 • Good allele frequency concordance between BC and BI • At genotype calls that pass the BC filter and where BI also makes a call, no discordance was found
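The allele frequency comparison on this slide can be illustrated with a small sketch. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with None for a missing call, that r is the Pearson correlation of per-site allele frequencies, and that the 80% threshold is applied to both callers; none of these details are stated in the slides. The site names and toy genotypes are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length lists; None if undefined."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def allele_frequency(genotypes):
    """Alternate allele frequency over called genotypes; None if no calls."""
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called)) if called else None


def af_concordance(caller_a, caller_b, min_call_rate=0.0):
    """Return (number of sites compared, Pearson r) over shared sites."""
    xs, ys = [], []
    for site in caller_a.keys() & caller_b.keys():
        ga, gb = caller_a[site], caller_b[site]
        # Assumption: the call-rate filter applies to both callers.
        if call_rate(ga) < min_call_rate or call_rate(gb) < min_call_rate:
            continue
        fa, fb = allele_frequency(ga), allele_frequency(gb)
        if fa is None or fb is None:
            continue
        xs.append(fa)
        ys.append(fb)
    return len(xs), pearson_r(xs, ys)


# Hypothetical toy data: site "s4" is poorly called by caller A.
toy_a = {"s1": [0, 0, 1, 2], "s2": [1, 1, 0, 0], "s3": [2, 2, 2, 1], "s4": [0, None, None, None]}
toy_b = {"s1": [0, 0, 1, 2], "s2": [1, 1, 0, 0], "s3": [2, 2, 2, 1], "s4": [2, 2, 2, 2]}
n_all, r_all = af_concordance(toy_a, toy_b)
n_filtered, r_filtered = af_concordance(toy_a, toy_b, min_call_rate=0.8)
```

In the toy data, dropping the sparsely called site removes the one outlier and raises r, which mirrors the direction of the change reported on the slide.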

  10. 1KG validation executive summary • Evaluated BI and BC calls against validation • 1KG chip (details in Appendix) • 312/697 samples across 7 populations represented • ~300 sites (150 novel) overlap with Pilot 3 target region • Concordance with 1KG chip is very high • Where covered (> 5 reads): • 302/312 (97%) of samples have >90% variant sensitivity • 269/312 (86%) of samples have >90% genotype sensitivity • Remaining disparities between 1KG chip and Pilot 3 calls can be explained by data quality issues • Later sequencing has far greater concordance with chip than earlier sequencing
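The slide does not define its two sensitivity measures, so the following is a minimal sketch under assumed definitions: variant sensitivity is the fraction of chip non-reference genotypes at which the sequencing call is also non-reference, and genotype sensitivity is the fraction at which the called genotype matches the chip exactly. Only sites with more than 5 reads are evaluated, following the "where covered" condition. The function name, genotype coding (alternate-allele counts), and toy data are hypothetical.

```python
def sample_sensitivity(chip, calls, depth, min_depth=6):
    """Per-sample (variant sensitivity, genotype sensitivity), or None.

    chip, calls: dict mapping site -> alternate allele count (0, 1, or 2).
    depth: dict mapping site -> number of reads covering the site.
    min_depth=6 encodes the slide's "> 5 reads" coverage condition.
    """
    evaluated = variant_hits = genotype_hits = 0
    for site, chip_gt in chip.items():
        if chip_gt == 0:
            continue  # sensitivity is measured at chip variant genotypes only
        if depth.get(site, 0) < min_depth:
            continue  # site not covered well enough to be evaluated
        evaluated += 1
        call_gt = calls.get(site, 0)  # assumption: an absent call means hom-ref
        if call_gt > 0:
            variant_hits += 1
        if call_gt == chip_gt:
            genotype_hits += 1
    if evaluated == 0:
        return None
    return variant_hits / evaluated, genotype_hits / evaluated


# Hypothetical toy sample: site "e" is under-covered, site "d" is hom-ref on the chip.
chip = {"a": 1, "b": 2, "c": 1, "d": 0, "e": 1}
calls = {"a": 1, "b": 1, "c": 0, "e": 1}
depth = {"a": 10, "b": 8, "c": 7, "d": 20, "e": 3}
variant_sens, genotype_sens = sample_sensitivity(chip, calls, depth)

# The slide's summary percentages follow directly from its sample counts.
pct_variant = round(100 * 302 / 312)
pct_genotype = round(100 * 269 / 312)
```

Under these definitions genotype sensitivity can never exceed variant sensitivity, which is consistent with fewer samples clearing the 90% bar for genotypes (269) than for variants (302).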

  11. Nearly all samples in call-set overlap have high sensitivity and specificity • All but one of the samples with low PPV (false-positive rate > 10%) are among the earliest-sequenced samples (JPT/CHB/CHD) • The 10 low-sensitivity samples have strange allele balances and are likely contaminated • Plot axis: Pilot 3 individual (312 individuals total after eliminating low-coverage samples)
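The slide equates low PPV with a false-positive rate above 10%, that is, PPV below 90% when the rate is taken as the share of calls that are false. A minimal sketch of flagging such samples follows; the sample names and counts are hypothetical, and the per-sample true-positive and false-positive counts are assumed to come from comparison with the validation chip.

```python
def ppv(true_positives, false_positives):
    """Positive predictive value: share of variant calls that are confirmed."""
    total = true_positives + false_positives
    return true_positives / total if total else None


def flag_low_ppv(counts_by_sample, threshold=0.90):
    """Return sorted sample names whose PPV falls below the threshold."""
    flagged = []
    for sample, (tp, fp) in counts_by_sample.items():
        value = ppv(tp, fp)
        if value is not None and value < threshold:
            flagged.append(sample)
    return sorted(flagged)


# Hypothetical per-sample (true positive, false positive) counts.
counts = {"sampleA": (95, 5), "sampleB": (80, 20), "sampleC": (0, 0)}
low_ppv_samples = flag_low_ppv(counts)
```

Samples with no evaluated calls are skipped rather than flagged, since their PPV is undefined.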

  12. Mean sensitivity/PPV per population is good, and improves on more recently-sequenced populations • Per-population annotations (sequencing date, platform, centers): 8/2008, ILMN/454, All Ctrs • 8/2008, ILMN/454, All Ctrs • 8/2008, ILMN/454, BI/BCM • 8/2008, ILMN/454, BI/BCM • 1/2009, 454, BCM • 10/2008, ILMN, BI/SC • 2008/2009, ILMN/454, All Ctrs • N samples: 13, 69, 27, 102, 69, 3, 24

  13. Low-frequency / singleton validation: executive summary • Low-frequency Sequenom assay (details in Appendix) • Chose 105 putative novel singletons from early Pilot 3 46-CEU-sample callsets (called by at least 2 of 4 callers) • Validated sites in those 46 individuals • 89/105 are true singletons • 16/105 are false-positive singletons (hom-refs and two non-singletons) • Concordance with low-frequency assay is very high • Callsets today (January 2010) • In BI and BC overlap, recovered 71/89 (80%) of assayed singletons with 0 false-positives and 0 non-singletons • In BI and BC union, recovered all 89 singletons with 3 false-positives and 0 non-singletons
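The singleton figures on this slide reduce to a few ratios, worked through below as a sketch. The counts are taken from the slide; the false-discovery calculation assumes the 3 false positives in the union are drawn from the 105 assayed sites, which the slide does not state explicitly. Function names are illustrative.

```python
ASSAYED = 105          # putative novel singletons sent to the assay
TRUE_SINGLETONS = 89   # confirmed as singletons
ASSAY_FALSE_POS = 16   # hom-refs plus two non-singletons


def recovery(recovered, true_total=TRUE_SINGLETONS):
    """Fraction of validated singletons present in a call set."""
    return recovered / true_total


def false_discovery(false_pos, true_pos):
    """Share of recovered assayed sites that failed validation."""
    total = false_pos + true_pos
    return false_pos / total if total else 0.0


# Early call sets: how many of the chosen sites validated as singletons.
early_validation_rate = TRUE_SINGLETONS / ASSAYED

# January 2010 call sets.
overlap_recovery = recovery(71)       # BC and BI overlap
overlap_fdr = false_discovery(0, 71)
union_recovery = recovery(89)         # BC and BI union
union_fdr = false_discovery(3, 89)

print(round(100 * early_validation_rate, 1),
      round(100 * overlap_recovery),
      round(100 * union_fdr, 1))
```

The trade-off is the usual one between intersection and union: the overlap gives up some singletons (71 of 89) in exchange for no false positives among the assayed sites, while the union recovers all 89 at the cost of 3 false positives.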

  14. Callers are able to detect most singletons with a very low false-positive rate • Joint calls find every singleton in the assay, with exceedingly few false positives.

  15. Conclusions / future directions • Data quality has improved significantly over the life of the project • Both BC and BI pipelines produce high-quality call sets • Good agreement between call sets • intersection highly concordant with experimental validation data • Estimated FP rate below 5% • The current Pilot 3 release is the BC∩BI (intersection) call set • We are proceeding with validations • Dual focus: accuracy and functional classes • Results will inform future releases

  16. APPENDIX

  17. Population spectrum of called SNPs

  18. Population-spectrum of called SNPs • Observation: BC calls more SNPs at the population level, but fewer SNP sites overall • Reason: BC tends to call the same site in more populations…
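The observation on this slide is a property of set unions: a caller can report more sites per population yet fewer distinct sites overall if its per-population call sets overlap heavily. The sketch below shows this with toy data; the population labels, site identifiers, and the two callers' call sets are hypothetical and chosen only to reproduce the pattern.

```python
def summarize(calls_by_population):
    """Return (sum of per-population site counts, number of distinct sites)."""
    per_population_total = sum(len(sites) for sites in calls_by_population.values())
    distinct_sites = set().union(*calls_by_population.values())
    return per_population_total, len(distinct_sites)


# Hypothetical caller X: calls largely the same sites in every population.
caller_x = {"P1": {1, 2, 3}, "P2": {1, 2, 3}, "P3": {1, 2, 3, 4}}
# Hypothetical caller Y: calls fewer sites per population, mostly different ones.
caller_y = {"P1": {1, 2}, "P2": {3, 5}, "P3": {4, 6, 7}}

x_per_population, x_distinct = summarize(caller_x)
y_per_population, y_distinct = summarize(caller_y)
```

Caller X has the larger per-population total but the smaller union, the same pattern the slide attributes to BC relative to BI.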

  19. BC/BI SNP calls per population (more detail)

  20. SNP calls (per population)

  21. Broad & BC calls: CEU • BC / Broad Venn diagram; columns: SNP, #dbSNP (%), Ts/Tv • 613, 122 (19.90%), 0.92 • 3,489, 2,300 (65.92%), 3.47 (BC∩Broad) • 327, 52 (15.90%), 1.32

  22. Broad & BC calls: CHB • BC / Broad Venn diagram; columns: SNP, #dbSNP (%), Ts/Tv • 925, 247 (26.70%), 1.23 • 3,415, 1,795 (52.56%), 3.74 (BC∩Broad) • 557, 32 (5.75%), 1.37

  23. Broad & BC calls: CHD • BC / Broad Venn diagram; columns: SNP, #dbSNP (%), Ts/Tv • 3,431, 1,724 (50.25%), 3.64 (BC∩Broad) • 450, 31 (6.44%), 1.33 • 831, 200 (24.07%), 1.68

  24. Broad & BC calls: JPT • BC / Broad Venn diagram; columns: SNP, #dbSNP (%), Ts/Tv • 983, 271 (27.57%), 1.54 • 2,900, 1,679 (57.90%), 3.67 (BC∩Broad) • 1,819, 31 (1.70%), 0.74

  25. Broad & BC calls: LWK • BC / Broad Venn diagram; columns: SNP, #dbSNP (%), Ts/Tv • 580, 136 (23.45%), 2.09 • 5,459, 2,736 (50.12%), 3.67 (BC∩Broad) • 911, 89 (9.77%), 1.56

  26. Broad & BC calls: TSI • BC / Broad Venn diagram; columns: SNP, #dbSNP (%), Ts/Tv • 448, 105 (23.44%), 0.71 • 3,281, 2,152 (65.59%), 3.54 (BC∩Broad) • 1,004, 48 (4.78%), 0.85

  27. Broad & BC calls: YRI • BC / Broad Venn diagram; columns: SNP, #dbSNP (%), Ts/Tv • 716, 112 (15.64%), 0.95 • 5,175, 2,785 (53.82%), 3.56 (BC∩Broad) • 694, 71 (10.23%), 1.48
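The Ts/Tv column in slides 21 to 27 is the ratio of transition SNPs (A↔G or C↔T) to transversion SNPs (all other base changes). A minimal sketch of computing it from a list of reference/alternate allele pairs follows; the input format and function names are assumptions for illustration, not the callers' own code.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    return pair == PURINES or pair == PYRIMIDINES


def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


# Hypothetical toy call set: three transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
toy_ratio = ts_tv_ratio(toy_snps)
```

Applied to each Venn region separately, a calculation of this kind yields the per-region Ts/Tv values tabulated on these slides.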

  28. BC vs. BI allele frequency comparisons per population at SNPs in the BC∩BI call set

  29. BC/BI genotype calls (CHB & CHD) • CHB, all SNPs: #sites=3415, r=0.9925 • CHB, SNPs with >= 80% called genotypes: #sites=3028, r=0.9993 • CHD, all SNPs: #sites=3431, r=0.9941 • CHD, SNPs with >= 80% called genotypes: #sites=3310, r=0.9991

  30. BC/BI genotype calls (TSI & JPT) • JPT, all SNPs: #sites=2900, r=0.9922 • JPT, SNPs with >= 80% called genotypes: #sites=2370, r=0.9991 • TSI, all SNPs: #sites=3281, r=0.9912 • TSI, SNPs with >= 80% called genotypes: #sites=3108, r=0.9973

  31. BC/BI genotype calls (LWK & YRI) • LWK, all SNPs: #sites=5459, r=0.9924 • LWK, SNPs with >= 80% called genotypes: #sites=5337, r=0.9984 • YRI, all SNPs: #sites=5175, r=0.9917 • YRI, SNPs with >= 80% called genotypes: #sites=4276, r=0.9978

  32. Low frequency / singleton validation design

  33. Per population PPV and sensitivity
