1000G Pilot 3 Progress ( in silico analysis and comparison to experimental validation)

Amit Indap, Wen-Fung Leong, Gabor Marth (Boston College); Chris Hartl and Kiran Garimella (Broad Institute)
1000 Genomes Project Analysis Group, February 2, 2010

Acknowledgements
• Baylor: Matthew Bainbridge, Fuli Yu, Donna Muzny, Richard Gibbs
• Broad: Chris Hartl, Kiran Garimella, Carrie Sougnez, Mark DePristo
• Wash. U.: Dan Koboldt, Bob Fulton
• Sanger: Aarno Palotie
• Boston College: Amit Indap, Wen Fung Leong, Gabor Marth
• Cornell: Andy Clark
• Stanford: Simon Gravel, Carlos Bustamante
• Michigan: Tom Blackwell

Data
• Capture targets:
  • Started with ~1,000 genes / ~10,000 exons / 2.3Mb
  • 1.43Mb of total target length shared between 4 data centers used for this analysis
• Samples:
  • 697 total samples
  • 7 populations
• Sequence coverage:
  • Goal was deep per-sample coverage
  • Effective coverage somewhat reduced by fragment duplications
• Capture technologies:
  • Nimblegen solid phase
  • Agilent liquid phase
• Sequencing technologies:
  • SLX
  • 454
• Data producers:
  • BCM
  • BI
  • WTSI
  • WUGSC
1. Mean of coverage medians per sample and population

Pipelines
• SNP calling: run per population (CEU, CHB, JPT, YRI, TSI, LWK, CHD) and on all 697 samples
• SNP statistics:
  • Segregating sites in each population sample
  • Union of all called sites in all 697 samples

BC and BI call sets are converging
[Figure panels: All called sites; Called sites per population (BC/BI intersection)]

SNP calls (per population)

SNP calls (all samples)
• BI: 18,149 SNPs
• BC: 14,502 SNPs
• BC ∪ BI: 19,890 SNPs
• BC only: 1,741 SNPs, 79 in dbSNP (4.54%)
• BC∩BI: 12,761 SNPs, 3,869 in dbSNP (30.32%)
• BI only: 5,388 SNPs, 172 in dbSNP (3.19%)
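The counts on this slide are internally consistent, which a few lines of arithmetic can confirm. This is a minimal sketch with the three Venn regions entered by hand; the dictionary layout and variable names are illustrative and not part of either pipeline.

```python
# Venn regions from the "SNP calls (all samples)" slide: (SNPs, sites in dbSNP).
regions = {
    "BC only": (1741, 79),
    "BC and BI": (12761, 3869),
    "BI only": (5388, 172),
}

# Each caller's total is its private region plus the shared region.
bc_total = regions["BC only"][0] + regions["BC and BI"][0]
bi_total = regions["BI only"][0] + regions["BC and BI"][0]
union_total = sum(n for n, _ in regions.values())

# Percentage of each region that is already present in dbSNP.
dbsnp_pct = {name: round(100 * known / n, 2) for name, (n, known) in regions.items()}

print(bc_total, bi_total, union_total)
print(dbsnp_pct)
```

The shared region carries a much higher dbSNP fraction than either caller-specific region, which is the pattern the slide uses to argue for the intersection as the release set.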

Genotype call accuracy relative to HapMap3
• Data quality in CHB and JPT samples seems consistently lower
• Statistics only include genotype calls at SNP sites in BC∩BI

Genotype calls
• Filtering:
  • BC filters on genotype call quality
  • BI reports a genotype for any site covered by at least one read (the Broad caller does not filter on genotype quality)
• Nominally, BI makes more calls than BC and has, on average, higher AF
• BC vs. BI allele frequency comparison:
  • All SNP sites considered: # SNP sites = 3,489, r = 0.9921
  • Only SNP sites with >= 80% called genotypes: # SNP sites = 3,075, r = 0.9979
• Good allele frequency concordance between BC and BI
• At genotype calls that pass the BC filter and where BI also makes a call, no discordance was found
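The allele frequency comparison above can be sketched as follows. This is an illustration under stated assumptions, not either group's code: genotypes are coded as alternate-allele counts (0/1/2) with None for a no-call, the call-rate threshold is applied to both callers, and the site names and toy genotypes are hypothetical.

```python
from math import sqrt

def allele_frequency(genotypes):
    """Alt-allele frequency over called genotypes; None if nothing was called."""
    called = [g for g in genotypes if g is not None]
    return sum(called) / (2 * len(called)) if called else None

def call_rate(genotypes):
    """Fraction of samples with a genotype call."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def af_concordance(calls_a, calls_b, min_call_rate=0.0):
    """Return (# shared sites kept, r) for two {site: genotypes} call sets."""
    xs, ys = [], []
    for site in calls_a.keys() & calls_b.keys():
        ga, gb = calls_a[site], calls_b[site]
        if min(call_rate(ga), call_rate(gb)) < min_call_rate:
            continue  # assumption: the threshold applies to both callers
        fa, fb = allele_frequency(ga), allele_frequency(gb)
        if fa is None or fb is None:
            continue
        xs.append(fa)
        ys.append(fb)
    return len(xs), pearson_r(xs, ys)

# Hypothetical toy call sets: four sites, five samples each.
bc = {"s1": [0, 1, None, None, 0], "s2": [1, 1, 2, 0, 1],
      "s3": [0, 0, 0, 1, None], "s4": [2, 2, 1, 2, 2]}
bi = {"s1": [0, 1, 1, 0, 0], "s2": [1, 1, 2, 0, 1],
      "s3": [0, 0, 0, 1, 0], "s4": [2, 2, 1, 2, 2]}

n_all, r_all = af_concordance(bc, bi)
n_filtered, r_filtered = af_concordance(bc, bi, min_call_rate=0.8)
print(n_all, round(r_all, 4), n_filtered, round(r_filtered, 4))
```

Restricting to well-called sites removes frequencies estimated from few genotypes, which is one plausible reason the filtered comparison on the slide shows the higher r.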

1KG validation executive summary
• Evaluated BI and BC calls against validation
  • 1KG chip¹
  • 312/697 samples across 7 populations represented
  • ~300 sites (150 novel) overlap with Pilot 3 target region
• Concordance with 1KG chip is very high
  • Where covered (> 5 reads):
    • 302/312 (97%) of samples have >90% variant sensitivity
    • 269/312 (86%) of samples have >90% genotype sensitivity
• Remaining disparities between 1KG chip and Pilot 3 calls can be explained by data quality issues
  • Later sequencing has far greater concordance with chip than earlier sequencing
1. Details in Appendix

Variant PPV/Sensitivity to 1KG chip is reasonably high for most samples; discordant samples are poorly sequenced
• Spikes in Variant PPV are due to low-quality sequencing in JPT samples (see Appendix)
[Figure: x-axis: Sample (318 Pilot 3 samples overlapping with 1KG chip)]

After filtering out sites with < 4 reads, nearly all samples in call-set overlap have high sensitivity and specificity
• All but one of the samples with low PPV (false-positive rate > 10%) are among the earliest-sequenced samples (JPT/CHB/CHD)
• The 10 low-sensitivity samples have strange allele balances and are likely contaminated
[Figure: x-axis: Sample (312 Pilot 3 samples after eliminating those with low coverage)]
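The per-sample metrics plotted on these slides can be sketched as below. The slides do not define the metrics explicitly, so this assumes the usual definitions: a site counts as a variant when the genotype carries at least one non-reference allele, sensitivity is TP / (TP + FN), and PPV is TP / (TP + FP), with the chip taken as truth. The record layout and toy values are hypothetical.

```python
def variant_metrics(records, min_depth=4):
    """Variant sensitivity and PPV for one sample against chip genotypes.

    records: iterable of (chip_genotype, called_genotype, depth) tuples,
    with genotypes coded as alt-allele counts (0, 1, or 2).
    Sites covered by fewer than min_depth reads are skipped.
    """
    tp = fp = fn = 0
    for chip_gt, call_gt, depth in records:
        if depth < min_depth:
            continue
        chip_var, call_var = chip_gt > 0, call_gt > 0
        if chip_var and call_var:
            tp += 1
        elif call_var:
            fp += 1
        elif chip_var:
            fn += 1
    sensitivity = tp / (tp + fn) if tp + fn else None
    ppv = tp / (tp + fp) if tp + fp else None
    return sensitivity, ppv

# Hypothetical sample: two low-depth sites carry the only errors.
sample = [(1, 1, 12), (2, 2, 9), (0, 0, 15), (1, 0, 2),
          (0, 1, 3), (1, 1, 6), (0, 0, 8)]

print(variant_metrics(sample, min_depth=0))  # unadjusted for depth
print(variant_metrics(sample, min_depth=4))  # sites with < 4 reads removed
```

In this toy sample both errors sit at low-depth sites, so the depth filter removes them, mirroring the contrast between the unadjusted and depth-filtered plots.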

Concordance to chip tracks closely with submission-to-DCC date (proxy for sequencing date)
• The most recently sequenced samples have higher concordance to 1KG chip
• Submitted 8/08-10/08: median number of lanes: 3
• Submitted 12/08-7/09: median number of lanes: 2
• Increase in number of sites with < 4 reads corresponds with fewer lanes being run per sample
[Figure: x-axis: Sample (312 Pilot 3 samples sorted by earliest DCC submission date)]

Mean sensitivity/PPV per population is good, and improves in more recently sequenced populations
[Figure: per-population annotations (date, platform, centers): 8/2008 ILMN/454 All Ctrs (two populations); 8/2008 ILMN/454 BI/BCM (two populations); 1/2009 454 BCM; 10/2008 ILMN BI/SC; 2008/2009 ILMN/454 All Ctrs. N Samples (as extracted): 13, 69, 27, 102, 69, 3, 24]

Low-frequency / singleton validation: executive summary
• Low-frequency Sequenom assay¹
  • Chose 105 putative novel singletons from early Pilot 3 46-CEU-sample callsets (called in at least 2/4 callers)
  • Validated sites in those 46 individuals
    • 89/105 are true singletons
    • 16/105 are false-positive singletons (hom-refs and two non-singletons)
• Concordance with low-frequency assay is very high
  • Callsets today (January 2010)
    • In BI and BC overlap, recovered 71/89 (80%) of assayed singletons with 0 false-positives and 0 non-singletons
    • In BI and BC union, recovered all 89 singletons with 3 false-positives and 0 non-singletons
1. Details in Appendix
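The recovery figures above come down to set operations between a call set and the assayed sites. A minimal sketch, assuming sites are represented as hashable identifiers; the site names and toy sets below are hypothetical, and only the 71/89 figure is taken from the slide.

```python
def singleton_recovery(callset, true_singletons, false_singletons):
    """Count assayed singletons recovered and assay-refuted sites called."""
    recovered = len(callset & true_singletons)
    false_pos = len(callset & false_singletons)
    return recovered, false_pos, recovered / len(true_singletons)

# Hypothetical assay outcome and call sets.
validated = {"site1", "site2", "site3", "site4", "site5"}
refuted = {"site6", "site7"}
bc_calls = {"site1", "site2", "site3", "site6"}
bi_calls = {"site1", "site2", "site4", "site5"}

overlap_result = singleton_recovery(bc_calls & bi_calls, validated, refuted)
union_result = singleton_recovery(bc_calls | bi_calls, validated, refuted)
print(overlap_result, union_result)

# The slide's overlap figure: 71 of 89 assayed singletons recovered.
print(round(100 * 71 / 89))
```

The toy sets reproduce the trade-off described on the slide: the overlap recovers fewer singletons with no false positives, while the union recovers all of them at the cost of a few.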

Callers are able to detect most singletons with very low false-positive rate
• Callset union finds every singleton in the assay with few false-positives
[Tables: Assay Performance; Callset Performance]
1. HWE violations, no-call rate > 5%

Many sites shared between P3 and external projects; low overall FP rate
• Calls (90 CEU samples):
  • Loci in P1/P2 = 60%
  • Loci in other projects/databases = 71%¹
  • FP Rate (sites on validation chips) = 5.3%
  • FN Rate (sites on validation chips) < 5%²
• Calls (overall):
  • FP Rate (sites on validation chips) = 9.1%³
  • FN Rate (sites on validation chips) < 5%³
• FP rate is likely a slight overestimate, because a hom-ref site across the 69 CEU samples on the chip doesn't preclude a variant harbored in one of the other 21 samples not represented in the validation assay. Some of these FPs are also due to sample contamination in older lanes.
1. Sites seen across all 91 Pilot 3 CEU individuals, occurring in dbSNP 129, HapMap 3, Pilot 1, or Pilot 2
2. No per-locus FNs observed in overlapping set
3. Includes FP and FN errors due to sample contamination/data quality

Conclusions / future directions
• Data quality has improved significantly over the life of the project
• Both BC and BI pipelines produce high-quality call sets
  • Good agreement between call sets
  • Intersection highly concordant with experimental validation data
  • Estimated FP rate below ~9%
• The current Pilot 3 release is the BC∩BI (intersection) call set
• We are proceeding with validations
  • Dual focus: accuracy and functional classes
  • Results will inform future releases

APPENDIX

Population spectrum of called SNPs

Population-spectrum of called SNPs
• Observation: BC calls more SNPs at the population level, but fewer SNP sites overall
• Reason: BC tends to call the same site in more populations…
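The observation above follows from how per-population counts relate to the union: a site called in k populations contributes k to the per-population total but only 1 to the overall count. A minimal sketch with hypothetical site identifiers and call sets:

```python
from collections import Counter

def population_spectrum(calls_by_pop):
    """Map k -> number of sites called in exactly k populations."""
    per_site = Counter(site for sites in calls_by_pop.values() for site in sites)
    return Counter(per_site.values())

# Hypothetical per-population call sets.
calls = {
    "CEU": {"a", "b", "c"},
    "YRI": {"a", "b", "d"},
    "JPT": {"a"},
}

spectrum = population_spectrum(calls)
per_population_total = sum(len(sites) for sites in calls.values())
union_total = len(set().union(*calls.values()))

print(dict(spectrum), per_population_total, union_total)
```

A caller whose spectrum is shifted toward larger k can therefore report more SNPs per population while reporting fewer distinct sites overall.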

BC/BI SNP calls per population (more detail)

SNP calls (per population)

Broad & BC calls: CEU (Venn regions: BC, Broad; rows in extracted order)
  SNP      #dbSNP (%)         Ts/Tv
  613      122 (19.90%)       0.92
  3,489    2,300 (65.92%)     3.47
  327      52 (15.90%)        1.32

Broad & BC calls: CHB (Venn regions: BC, Broad; rows in extracted order)
  SNP      #dbSNP (%)         Ts/Tv
  925      247 (26.70%)       1.23
  3,415    1,795 (52.56%)     3.74
  557      32 (5.75%)         1.37

Broad & BC calls: CHD (Venn regions: BC, Broad; rows in extracted order)
  SNP      #dbSNP (%)         Ts/Tv
  3,431    1,724 (50.25%)     3.64
  450      31 (6.44%)         1.33
  831      200 (24.07%)       1.68

Broad & BC calls: JPT (Venn regions: BC, Broad; rows in extracted order)
  SNP      #dbSNP (%)         Ts/Tv
  983      271 (27.57%)       1.54
  2,900    1,679 (57.90%)     3.67
  1,819    31 (1.70%)         0.74

Broad & BC calls: LWK (Venn regions: BC, Broad; rows in extracted order)
  SNP      #dbSNP (%)         Ts/Tv
  580      136 (23.45%)       2.09
  5,459    2,736 (50.12%)     3.67
  911      89 (9.77%)         1.56

Broad & BC calls: TSI (Venn regions: BC, Broad; rows in extracted order)
  SNP      #dbSNP (%)         Ts/Tv
  448      105 (23.44%)       0.71
  3,281    2,152 (65.59%)     3.54
  1,004    48 (4.78%)         0.85

Broad & BC calls: YRI (Venn regions: BC, Broad; rows in extracted order)
  SNP      #dbSNP (%)         Ts/Tv
  716      112 (15.64%)       0.95
  5,175    2,785 (53.82%)     3.56
  694      71 (10.23%)        1.48
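The Ts/Tv column in these tables is the ratio of transitions (A<->G, C<->T) to transversions (all other substitutions). A minimal sketch of the computation, assuming each SNP is given as a (ref, alt) pair of distinct bases; the function names and toy list are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ts_tv(snps):
    """Transition/transversion ratio for a list of (ref, alt) SNPs."""
    ts = sum(is_transition(ref, alt) for ref, alt in snps)
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")

# Hypothetical call set: four transitions and two transversions.
toy_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("T", "C"), ("G", "T")]
print(ts_tv(toy_snps))
```

Because each base has one possible transition and two possible transversions, substitutions drawn uniformly at random would give a ratio near 0.5, so a low Ts/Tv in a set of calls is commonly read as a sign of more false positives.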

BC vs. BI allele frequency comparisons per population at SNPs in the BC∩BI call set

BC/BI genotype calls (CHB & CHD)
• CHB, all SNPs: #sites = 3,415, r = 0.9925
• CHB, SNPs with >= 80% called genotypes: #sites = 3,028, r = 0.9993
• CHD, all SNPs: #sites = 3,431, r = 0.9941
• CHD, SNPs with >= 80% called genotypes: #sites = 3,310, r = 0.9991

BC/BI genotype calls (TSI & JPT)
• JPT, all SNPs: #sites = 2,900, r = 0.9922
• JPT, SNPs with >= 80% called genotypes: #sites = 2,370, r = 0.9991
• TSI, all SNPs: #sites = 3,281, r = 0.9912
• TSI, SNPs with >= 80% called genotypes: #sites = 3,108, r = 0.9973

BC/BI genotype calls (LWK & YRI)
• LWK, all SNPs: #sites = 5,459, r = 0.9924
• LWK, SNPs with >= 80% called genotypes: #sites = 5,337, r = 0.9984
• YRI, all SNPs: #sites = 5,175, r = 0.9917
• YRI, SNPs with >= 80% called genotypes: #sites = 4,276, r = 0.9978

Low frequency / singleton validation design

Recap: Novel singletons from 66 CEU samples chosen for validation
• Interesting singleton: a putative SNP…
  • that is novel (not in dbSNP 129)
  • that has been identified by the BC or BI caller
  • that only occurs in 1 out of 66 of the test individuals
  • where the individual in whom the SNP is identified is the same among callers
  • that is also identified by one other caller
  • whose locus has nominal coverage in other non-variant samples

Data and Definitions
• Sequenom validation run on 46 of 66 individuals (Broad did not have DNA for all 66 samples)
• Sequenom calls filtered by Broad standard metrics (no significant deviation from Hardy-Weinberg; no-call rate of <5%)
• Concordance checked across call sets which were used for selection, and the new Broad and BC calls
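A filter of the kind described above can be sketched as follows. The slides give only the no-call threshold; the Hardy-Weinberg test used here (a one-degree-of-freedom chi-square test) and its p-value cutoff are assumptions made for illustration, and the function names are hypothetical. For sparse sites such as singletons an exact test is generally preferred over the chi-square approximation.

```python
from math import erfc, sqrt

def hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt):
    """P-value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_hom_ref + n_het + n_hom_alt
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of chi-square with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))

def passes_assay_qc(genotypes, max_no_call=0.05, min_hwe_p=1e-3):
    """genotypes: alt-allele counts (0/1/2), None for a no-call.

    max_no_call follows the slide (<5%); min_hwe_p is an assumed cutoff.
    """
    calls = [g for g in genotypes if g is not None]
    if 1 - len(calls) / len(genotypes) >= max_no_call:
        return False
    counts = [calls.count(k) for k in (0, 1, 2)]
    return hwe_chi2_pvalue(*counts) >= min_hwe_p

singleton_site = [0] * 45 + [1]           # one heterozygote among 46 samples
all_het_site = [1] * 46                   # typical assay artifact
sparse_site = [0] * 40 + [1] + [None] * 5  # too many no-calls

print(passes_assay_qc(singleton_site), passes_assay_qc(all_het_site),
      passes_assay_qc(sparse_site))
```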

Validated true singletons may not be singletons
• Because 20 members of the population could not be genotyped, a validated novel singleton may also be present in one or more of those 20 individuals
• Basic pop-gen gives some ballpark estimates:
  • Probability that a validated singleton is in one of the other 20 individuals: 1.2% ( = 1 – (1 – θ)^20 )
  • Probability that all validated singletons are truly singletons: 33.5% ( = (1 – P[event above])^89 )
* θ = 1/1600
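The two ballpark estimates can be evaluated directly from the formulas on the slide. This sketch only plugs in the slide's own numbers (θ = 1/1600, 20 ungenotyped individuals, 89 validated singletons); the variable names are illustrative.

```python
theta = 1 / 1600   # per-individual probability used on the slide
n_untyped = 20     # individuals without Sequenom genotypes
n_validated = 89   # validated singletons

# Probability that a given validated singleton also occurs in at least
# one of the ungenotyped individuals.
p_elsewhere = 1 - (1 - theta) ** n_untyped

# Probability that none of the validated singletons occurs elsewhere,
# treating sites as independent.
p_all_true = (1 - p_elsewhere) ** n_validated

print(f"{100 * p_elsewhere:.2f}% {100 * p_all_true:.1f}%")
```

The first value rounds to the slide's 1.2%. The second comes out near one third; its last digits are sensitive to how the first value is rounded before exponentiation, so a small difference from the slide's 33.5% is expected.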

Per population PPV and sensitivity

Variant PPV/Sensitivity – unadjusted for depth
[Figure: x-axis: Individual in Pilot 3 (318 overlapping individuals)]

Variant PPV/Sensitivity for CEU
• Per-Locus FP Rate: 5.3%
• Per-Locus FN Rate: < 5% (no FN observed)
[Figure: x-axis: CEU Individual in Pilot 3 (68 well-covered individuals)]

Variant PPV/Sensitivity for CEU – Counting Low Depth
[Figure: x-axis: CEU Individual in Pilot 3 (69 individuals)]

Variant PPV/Sensitivity for CHB
• Per-Locus FP Rate: 9.4%
• Per-Locus FN Rate: < 5% (no locus FN observed)
[Figure: x-axis: CHB Individual in Pilot 3 (13 well-covered individuals)]

Variant PPV/Sensitivity for CHB – Counting Low Depth
[Figure: x-axis: CHB Individual in Pilot 3 (14 individuals)]

Variant PPV/Sensitivity for CHD
• Per-Locus FP Rate: 3.4%
• Per-Locus FN Rate: < 5% (3 FN in 555 TP observed)
[Figure: x-axis: CHD Individual in Pilot 3 (28 well-covered individuals)]

Variant PPV/Sensitivity for CHD – Counting Low Depth
[Figure: x-axis: CHD Individual in Pilot 3 (28 individuals)]

Variant PPV/Sensitivity for JPT
• Per-Locus FP Rate: 2.2%
• Per-Locus FN Rate: < 5% (no locus FN observed)
[Figure: x-axis: JPT Individual in Pilot 3 (104 well-covered individuals)]

Variant PPV/Sensitivity for JPT – Counting Low Depth
[Figure: x-axis: JPT Individual in Pilot 3 (104 individuals)]

Variant PPV/Sensitivity for LWK
• Per-Locus FP Rate: 1.3%
• Per-Locus FN Rate: < 5% (1 FN in 755 TP observed)
[Figure: x-axis: LWK Individual in Pilot 3 (70 well-covered individuals)]

Variant PPV/Sensitivity for LWK – Counting Low Depth
[Figure: x-axis: LWK Individual in Pilot 3 (70 individuals)]
