
1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation)

Amit Indap, Wen-Fung Leong, Gabor Marth (Boston College); Chris Hartl and Kiran Garimella (Broad Institute). 1000 Genomes Project Analysis Group, February 2, 2010.


Presentation Transcript


  1. 1000G Pilot 3 Progress (in silico analysis and comparison to experimental validation). Amit Indap, Wen-Fung Leong, Gabor Marth (Boston College); Chris Hartl and Kiran Garimella (Broad Institute). 1000 Genomes Project Analysis Group, February 2, 2010

  2. Acknowledgements
     • Baylor: Matthew Bainbridge, Fuli Yu, Donna Muzny, Richard Gibbs
     • Broad: Chris Hartl, Kiran Garimella, Carrie Sougnez, Mark DePristo
     • Wash. U.: Dan Koboldt, Bob Fulton
     • Sanger: Aarno Palotie
     • Boston College: Amit Indap, Wen Fung Leong, Gabor Marth
     • Cornell: Andy Clark
     • Stanford: Simon Gravel, Carlos Bustamante
     • Michigan: Tom Blackwell

  3. Data
     • Capture targets:
        • Started with ~1,000 genes / ~10,000 exons / 2.3 Mb
        • 1.43 Mb of total target length shared between the 4 data centers was used for this analysis
     • Samples:
        • 697 total samples
        • 7 populations
     • Sequence coverage:
        • Goal was deep per-sample coverage
        • Effective coverage somewhat reduced by fragment duplications
     • Capture technologies:
        • Nimblegen solid phase
        • Agilent liquid phase
     • Sequencing technologies:
        • SLX
        • 454
     • Data producers:
        • BCM
        • BI
        • WTSI
        • WUGSC
     1. Mean of coverage medians per sample and population

  4. Pipelines
     [Diagram labels]
     • SNP calling: CEU, CHB, CHD, JPT, LWK, TSI, YRI; all 697 samples
     • SNP statistics: segregating sites in each population sample; union of all called sites in all 697 samples

  5. BC and BI call sets are converging
     [Plot panels: All called sites; Called sites per population (BC/BI intersection)]

  6. SNP calls (per population)

  7. SNP calls (all samples)
     • BI: 18,149 SNPs; BC: 14,502 SNPs; BC ∪ BI = 19,890 SNPs
     • BC only: 1,741 SNPs, 79 in dbSNP (4.54%)
     • BC∩BI: 12,761 SNPs, 3,869 in dbSNP (30.32%)
     • BI only: 5,388 SNPs, 172 in dbSNP (3.19%)
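The counts on this slide are mutually consistent, which a few lines of Python can confirm. This is only an illustrative check: the variable names are ours, and the assignment of 1,741 and 5,388 to the BC-only and BI-only regions is inferred from the per-caller totals rather than stated explicitly on the slide.

```python
# Region sizes as (SNPs, dbSNP hits), read from the slide. Which outer count
# belongs to which caller is an inference from the BC and BI totals.
regions = {
    "BC only": (1_741, 79),
    "BC and BI": (12_761, 3_869),
    "BI only": (5_388, 172),
}

# Each caller's total is its private region plus the shared region.
bc_total = regions["BC only"][0] + regions["BC and BI"][0]
bi_total = regions["BI only"][0] + regions["BC and BI"][0]
union_total = sum(n_snps for n_snps, _ in regions.values())


def dbsnp_rate(n_snps, n_dbsnp):
    """Percentage of called SNPs that are already present in dbSNP."""
    return 100.0 * n_dbsnp / n_snps


rates = {name: dbsnp_rate(*counts) for name, counts in regions.items()}

print(bc_total, bi_total, union_total)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% in dbSNP")
```

The shared region has a much higher dbSNP rate than either caller-specific region, which is the pattern the slide is drawing attention to.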

  8. Genotype call accuracy relative to HapMap3
     • Data quality in CHB and JPT samples seems consistently lower
     • Statistics only include genotype calls at SNP sites in BC∩BI

  9. Genotype calls
     • Filtering:
        • BC filters on genotype call quality
        • BI reports a genotype for any site covered by at least one read (the Broad caller does not filter on genotype quality)
        • Nominally, BI makes more calls than BC and has, on average, higher AF
     • Good allele frequency concordance between BC and BI:
        • All SNP sites considered: # SNP sites=3,489, r=0.9921
        • Only SNP sites with >= 80% called genotypes: # SNP sites=3,075, r=0.9979
     • At genotype calls that pass the BC filter and where BI also makes a call, no discordance was found
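The allele frequency comparison on this slide can be sketched as follows. This is a minimal illustration, not either group's pipeline: it assumes genotypes are coded as alt-allele counts (0/1/2) with `None` for a no-call, and every function name and the toy data are hypothetical.

```python
from math import sqrt


def allele_frequency(genotypes):
    """Alt-allele frequency over called genotypes (0/1/2); None means no call."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def call_rate(genotypes):
    """Fraction of samples with a called genotype at this site."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def af_concordance(calls_a, calls_b, min_call_rate=0.0):
    """Return (number of shared sites kept, r) for two site -> genotypes maps."""
    xs, ys = [], []
    for site in sorted(calls_a.keys() & calls_b.keys()):
        ga, gb = calls_a[site], calls_b[site]
        if min(call_rate(ga), call_rate(gb)) < min_call_rate:
            continue  # drop sites where either caller has too many no-calls
        fa, fb = allele_frequency(ga), allele_frequency(gb)
        if fa is None or fb is None:
            continue
        xs.append(fa)
        ys.append(fb)
    return len(xs), pearson_r(xs, ys)


# Toy data: four sites, four samples, two hypothetical call sets.
bc = {"s1": [0, 1, None, 0], "s2": [2, 2, 1, 1], "s3": [0, 0, 0, 1], "s4": [None, None, None, 1]}
bi = {"s1": [0, 1, 0, 0], "s2": [2, 2, 1, 2], "s3": [0, 0, 0, 1], "s4": [1, 0, 0, 1]}

n_all, r_all = af_concordance(bc, bi)
n_filtered, r_filtered = af_concordance(bc, bi, min_call_rate=0.8)
print(n_all, round(r_all, 3), n_filtered, round(r_filtered, 3))
```

In the toy data, as on the slide, requiring at least 80% called genotypes drops some sites and raises the correlation, because frequencies estimated from very few called genotypes are noisy.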

  10. 1KG validation executive summary
     • Evaluated BI and BC calls against validation
     • 1KG chip¹
        • 312/697 samples across 7 populations represented
        • ~300 sites (150 novel) overlap with Pilot 3 target region
     • Concordance with 1KG chip is very high
        • Where covered (> 5 reads):
           • 302/312 (97%) of samples have >90% variant sensitivity
           • 269/312 (86%) of samples have >90% genotype sensitivity
        • Remaining disparities between 1KG chip and Pilot 3 calls can be explained by data quality issues
        • Later sequencing has far greater concordance with chip than earlier sequencing
     1. Details in Appendix

  11. Variant PPV/Sensitivity to 1KG chip is reasonably high for most samples; discordant samples are poorly sequenced
     • Spikes in Variant PPV are due to low-quality sequencing in JPT samples (see Appendix)
     [Plot x-axis: Sample (318 Pilot 3 samples overlapping with 1KG chip)]

  12. After filtering out sites with < 4 reads, nearly all samples in call-set overlap have high sensitivity and specificity
     • All but one of the samples with low PPV (false-positive rate > 10%) are among the earliest-sequenced samples (JPT/CHB/CHD)
     • These 10 low-sensitivity samples have strange allele balances and are likely contaminated
     [Plot x-axis: Sample (312 Pilot 3 samples after eliminating those with low coverage)]
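The per-sample metrics plotted here can be sketched as below. This assumes the usual definitions (PPV = TP / (TP + FP), sensitivity = TP / (TP + FN)) with the chip treated as truth; the slides do not spell out their exact formulas, and the function name, data layout, and toy values are hypothetical.

```python
def variant_ppv_sensitivity(seq_calls, chip_calls, depth, min_depth=4):
    """Variant-level PPV and sensitivity of one sample's calls against a chip.

    seq_calls, chip_calls: site -> genotype coded as alt-allele count (0/1/2).
    depth: site -> number of reads covering the site in this sample.
    Sites with fewer than min_depth reads are skipped entirely.
    """
    tp = fp = fn = 0
    for site, chip_gt in chip_calls.items():
        if depth.get(site, 0) < min_depth:
            continue  # too few reads to judge the sequencing call
        seq_variant = seq_calls.get(site, 0) > 0
        chip_variant = chip_gt > 0
        if seq_variant and chip_variant:
            tp += 1
        elif seq_variant:
            fp += 1
        elif chip_variant:
            fn += 1
    ppv = tp / (tp + fp) if tp + fp else None
    sensitivity = tp / (tp + fn) if tp + fn else None
    return ppv, sensitivity


# Toy sample: site "b" is a false positive, but it has only 2 reads.
chip = {"a": 1, "b": 0, "c": 2, "d": 1, "e": 0}
seq = {"a": 1, "b": 1, "c": 2, "d": 0, "e": 0}
reads = {"a": 10, "b": 2, "c": 8, "d": 6, "e": 9}

unfiltered = variant_ppv_sensitivity(seq, chip, reads, min_depth=0)
filtered = variant_ppv_sensitivity(seq, chip, reads, min_depth=4)
print(unfiltered, filtered)
```

In this toy case the depth filter removes the one low-coverage false positive, so PPV rises while sensitivity is unchanged, which mirrors the effect the slide describes.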

  13. Concordance to chip tracks closely with submission-to-DCC date (proxy for sequencing date)
     • The most recently sequenced samples have higher concordance to 1KG chip
     • Submitted 8/08-10/08: median number of lanes: 3
     • Submitted 12/08-7/09: median number of lanes: 2
     • Increase in number of sites with < 4 reads corresponds with fewer lanes being run per sample
     [Plot x-axis: Sample (312 Pilot 3 samples sorted by earliest DCC submission date)]

  14. Mean sensitivity/PPV per population is good, and improves on more recently-sequenced populations
     [Plot column annotations (date, platform, centers): 8/2008 ILMN/454 All Ctrs; 8/2008 ILMN/454 All Ctrs; 8/2008 ILMN/454 BI/BCM; 8/2008 ILMN/454 BI/BCM; 1/2009 454 BCM; 10/2008 ILMN BI/SC; 2008/2009 ILMN/454 All Ctrs. N Samples, in extracted order: 13, 69, 27, 102, 69, 3, 24]

  15. Low-frequency / singleton validation: executive summary
     • Low-frequency Sequenom assay¹
        • Chose 105 putative novel singletons from early Pilot 3 46-CEU-sample callsets (called in at least 2/4 callers)
        • Validated sites in those 46 individuals
           • 89/105 are true singletons
           • 16/105 are false-positive singletons (hom-refs and two non-singletons)
     • Concordance with low-frequency assay is very high
        • Callsets today (January 2010)
           • In BI and BC overlap, recovered 71/89 (80%) of assayed singletons with 0 false-positives and 0 non-singletons
           • In BI and BC union, recovered all 89 singletons with 3 false-positives and 0 non-singletons
     1. Details in Appendix

  16. Callers are able to detect most singletons with very low false-positive rate
     • Callset union finds every singleton in the assay with few false-positives
     [Tables: Assay Performance; Callset Performance]
     1. HWE violations, no-call rate > 5%

  17. Many sites shared between P3 and external projects; low overall FP rate
     • Calls (90 CEU samples)
        • Loci in P1/P2 = 60%
        • Loci in other projects/databases = 71%¹
        • FP Rate (sites on validation chips) = 5.3%
        • FN Rate (sites on validation chips) < 5%²
     • Calls (overall)
        • FP Rate (sites on validation chips) = 9.1%³
        • FN Rate (sites on validation chips) < 5%³
     • FP rate is likely a slight overestimate: a site that is hom-ref across the 69 CEU samples on the chip may still carry a variant in one of the other 21 samples not represented in the validation assay
     • Some of these FPs are also due to sample contamination in older lanes
     1. Sites seen across all 91 Pilot 3 CEU individuals, occurring in dbSNP 129, HapMap 3, Pilot 1, or Pilot 2
     2. No per-locus FNs observed in overlapping set
     3. Includes FP and FN errors due to sample contamination/data quality

  18. Conclusions / future directions
     • Data quality has improved significantly over the life of the project
     • Both BC and BI pipelines produce high-quality call sets
        • Good agreement between call sets
        • Intersection highly concordant with experimental validation data
        • Estimated FP rate below ~9%
     • The current Pilot 3 release is the BC∩BI (intersection) call set
     • We are proceeding with validations
        • Dual focus: accuracy and functional classes
        • Results will inform future releases

  19. APPENDIX

  20. Population spectrum of called SNPs

  21. Population spectrum of called SNPs
     • Observation: BC calls more SNPs at the population level, but fewer SNP sites overall
     • Reason: BC tends to call the same site in more populations…

  22. BC/BI SNP calls per population (more detail)

  23. SNP calls (per population)

  24. Broad & BC calls: CEU (call sets: BC, Broad; per group: SNPs, # in dbSNP (%), Ts/Tv)
     • 613, 122 (19.90%), 0.92
     • 3,489, 2,300 (65.92%), 3.47
     • 327, 52 (15.90%), 1.32

  25. Broad & BC calls: CHB (call sets: BC, Broad; per group: SNPs, # in dbSNP (%), Ts/Tv)
     • 925, 247 (26.70%), 1.23
     • 3,415, 1,795 (52.56%), 3.74
     • 557, 32 (5.75%), 1.37

  26. Broad & BC calls: CHD (call sets: BC, Broad; per group: SNPs, # in dbSNP (%), Ts/Tv)
     • 3,431, 1,724 (50.25%), 3.64
     • 450, 31 (6.44%), 1.33
     • 831, 200 (24.07%), 1.68

  27. Broad & BC calls: JPT (call sets: BC, Broad; per group: SNPs, # in dbSNP (%), Ts/Tv)
     • 983, 271 (27.57%), 1.54
     • 2,900, 1,679 (57.90%), 3.67
     • 1,819, 31 (1.70%), 0.74

  28. Broad & BC calls: LWK (call sets: BC, Broad; per group: SNPs, # in dbSNP (%), Ts/Tv)
     • 580, 136 (23.45%), 2.09
     • 5,459, 2,736 (50.12%), 3.67
     • 911, 89 (9.77%), 1.56

  29. Broad & BC calls: TSI (call sets: BC, Broad; per group: SNPs, # in dbSNP (%), Ts/Tv)
     • 448, 105 (23.44%), 0.71
     • 3,281, 2,152 (65.59%), 3.54
     • 1,004, 48 (4.78%), 0.85

  30. Broad & BC calls: YRI (call sets: BC, Broad; per group: SNPs, # in dbSNP (%), Ts/Tv)
     • 716, 112 (15.64%), 0.95
     • 5,175, 2,785 (53.82%), 3.56
     • 694, 71 (10.23%), 1.48
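The Ts/Tv column in these per-population slides is the ratio of transitions (A<->G, C<->T) to transversions (all other substitutions). A minimal sketch of the computation, with hypothetical function names and toy input:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ts_tv_ratio(snps):
    """Ts/Tv for an iterable of (ref, alt) single-base substitutions."""
    snps = list(snps)
    n_ts = sum(is_transition(ref, alt) for ref, alt in snps)
    n_tv = len(snps) - n_ts
    return n_ts / n_tv if n_tv else float("inf")


# Toy call set: three transitions and two transversions.
toy_calls = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "G")]
ratio = ts_tv_ratio(toy_calls)
print(ratio)
```

Each base has one possible transition and two possible transversions, so substitutions drawn uniformly at random would give a ratio near 0.5. That is why a low Ts/Tv in a group of calls is commonly read as a sign of a higher error rate in that group.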

  31. BC vs. BI allele frequency comparisons per population at SNPs in the BC∩BI call set

  32. BC/BI genotype calls (CHB & CHD)
     • CHB, all SNPs: #sites=3415, r=0.9925
     • CHB, SNPs with >= 80% called genotypes: #sites=3028, r=0.9993
     • CHD, all SNPs: #sites=3431, r=0.9941
     • CHD, SNPs with >= 80% called genotypes: #sites=3310, r=0.9991

  33. BC/BI genotype calls (TSI & JPT)
     • JPT, all SNPs: #sites=2900, r=0.9922
     • JPT, SNPs with >= 80% called genotypes: #sites=2370, r=0.9991
     • TSI, all SNPs: #sites=3281, r=0.9912
     • TSI, SNPs with >= 80% called genotypes: #sites=3108, r=0.9973

  34. BC/BI genotype calls (LWK & YRI)
     • LWK, all SNPs: #sites=5459, r=0.9924
     • LWK, SNPs with >= 80% called genotypes: #sites=5337, r=0.9984
     • YRI, all SNPs: #sites=5175, r=0.9917
     • YRI, SNPs with >= 80% called genotypes: #sites=4276, r=0.9978

  35. Low frequency / singleton validation design

  36. Recap: Novel singletons from 66 CEU samples chosen for validation
     • Interesting singleton: a putative SNP…
        • that is novel (not in dbSNP 129)
        • that has been identified by the BC or BI caller
        • that only occurs in 1 out of 66 of the test individuals
        • where the individual in whom the SNP is identified is the same among callers
        • that is also identified by one other caller
        • whose locus has nominal coverage in other non-variant samples
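The selection criteria above can be read as a filter over candidate sites. The sketch below is one possible encoding, not the authors' code: the `Candidate` record, the caller names other than BC and BI, and the `min_depth=10` stand-in for "nominal coverage" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    site: str
    in_dbsnp129: bool
    carriers_by_caller: dict  # caller name -> set of sample IDs called variant
    min_depth_noncarriers: int  # lowest read depth among non-variant samples


def is_interesting_singleton(c, primary_callers=("BC", "BI"), min_callers=2, min_depth=10):
    """Apply the slide's criteria; min_depth is a hypothetical coverage cutoff."""
    if c.in_dbsnp129:
        return False  # must be novel
    calling = {name: s for name, s in c.carriers_by_caller.items() if s}
    if not any(name in calling for name in primary_callers):
        return False  # must be called by BC or BI
    if len(calling) < min_callers:
        return False  # must be confirmed by at least one other caller
    if any(len(s) != 1 for s in calling.values()):
        return False  # each caller must see exactly one carrier
    if len(set().union(*calling.values())) != 1:
        return False  # and it must be the same individual in every caller
    return c.min_depth_noncarriers >= min_depth


candidates = [
    Candidate("chr1:100", False, {"BC": {"NA1"}, "BI": {"NA1"}, "other": set()}, 20),
    Candidate("chr1:200", True, {"BC": {"NA1"}, "BI": {"NA1"}}, 20),
    Candidate("chr1:300", False, {"BC": {"NA1"}, "BI": {"NA2"}}, 20),
    Candidate("chr1:400", False, {"BC": {"NA1"}}, 20),
    Candidate("chr1:500", False, {"BC": {"NA1"}, "BI": {"NA1"}}, 3),
]
selected = [c.site for c in candidates if is_interesting_singleton(c)]
print(selected)
```

Only the first toy candidate passes; the others fail on novelty, carrier agreement, caller support, and coverage respectively.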

  37. Data and Definitions
     • Sequenom validation run on 46 of 66 individuals (Broad did not have DNA for all 66 samples)
     • Sequenom calls filtered by Broad standard metrics (no significant deviation from Hardy-Weinberg; no-call rate of < 5%)
     • Concordance checked across the call sets that were used for selection, and the new Broad and BC calls

  38. Validated true singletons may not be singletons
     • Because 20 members of the population could not be genotyped, a validated novel singleton may actually also be present in one or more of those 20 individuals
     • Basic pop-gen gives some ballpark estimates:
        • Probability that a validated singleton is in one of the other 20 individuals: 1.2% ( $= 1 - (1 - \theta)^{20}$ )
        • Probability that all validated singletons are truly singletons: 33.5% ( $= (1 - P[\text{event above}])^{89}$ )
     * $\theta = 1/1600$
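The two ballpark estimates can be evaluated directly from the formulas on the slide. The variable names below are ours; the inputs (θ = 1/1600, 20 ungenotyped individuals, 89 validated singletons) are the slide's.

```python
theta = 1 / 1600   # per-individual probability used on the slide
n_untyped = 20     # individuals without Sequenom genotypes
n_validated = 89   # singletons confirmed by the assay

# Probability that a given validated singleton is also carried by at least
# one of the untyped individuals: 1 - (1 - theta)^20.
p_seen_elsewhere = 1 - (1 - theta) ** n_untyped

# Probability that none of the 89 validated singletons is carried elsewhere,
# treating sites as independent: (1 - p_seen_elsewhere)^89.
p_all_true = (1 - p_seen_elsewhere) ** n_validated

print(f"{p_seen_elsewhere:.2%}  {p_all_true:.1%}")
```

Direct evaluation gives about 1.2% for the first quantity and roughly one third for the second. The second value comes out slightly below the slide's 33.5%; rounding of an intermediate value on the slide is one possible explanation, though that is a guess.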

  39. Per population PPV and sensitivity

  40. Variant PPV/Sensitivity – unadjusted for depth
     [Plot x-axis: Individual in Pilot 3 (318 overlapping individuals)]

  41. Variant PPV/Sensitivity for CEU
     • Per-Locus FP Rate: 5.3%
     • Per-Locus FN Rate: < 5% (no FN observed)
     [Plot x-axis: CEU Individual in Pilot 3 (68 well-covered individuals)]

  42. Variant PPV/Sensitivity for CEU – Counting Low Depth
     [Plot x-axis: CEU Individual in Pilot 3 (69 individuals)]

  43. Variant PPV/Sensitivity for CHB
     • Per-Locus FP Rate: 9.4%
     • Per-Locus FN Rate: < 5% (no locus FN observed)
     [Plot x-axis: CHB Individual in Pilot 3 (13 well-covered individuals)]

  44. Variant PPV/Sensitivity for CHB – Counting Low Depth
     [Plot x-axis: CHB Individual in Pilot 3 (14 individuals)]

  45. Variant PPV/Sensitivity for CHD
     • Per-Locus FP Rate: 3.4%
     • Per-Locus FN Rate: < 5% (3 FN in 555 TP observed)
     [Plot x-axis: CHD Individual in Pilot 3 (28 well-covered individuals)]

  46. Variant PPV/Sensitivity for CHD – Counting Low Depth
     [Plot x-axis: CHD Individual in Pilot 3 (28 individuals)]

  47. Variant PPV/Sensitivity for JPT
     • Per-Locus FP Rate: 2.2%
     • Per-Locus FN Rate: < 5% (no locus FN observed)
     [Plot x-axis: JPT Individual in Pilot 3 (104 well-covered individuals)]

  48. Variant PPV/Sensitivity for JPT – Counting Low Depth
     [Plot x-axis: JPT Individual in Pilot 3 (104 individuals)]

  49. Variant PPV/Sensitivity for LWK
     • Per-Locus FP Rate: 1.3%
     • Per-Locus FN Rate: < 5% (1 FN in 755 TP observed)
     [Plot x-axis: LWK Individual in Pilot 3 (70 well-covered individuals)]

  50. Variant PPV/Sensitivity for LWK – Counting Low Depth
     [Plot x-axis: LWK Individual in Pilot 3 (70 individuals)]
