Ingredients for a successful genome-wide association studies: A statistical view

Scott Weiss and Christoph Lange. Channing Laboratory, Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, Massachusetts; Department of Biostatistics, Harvard School of Public Health.

Presentation Transcript


  1. Ingredients for a successful genome-wide association studies: A statistical view. Scott Weiss and Christoph Lange. Channing Laboratory, Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital, Boston, Massachusetts; Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts

  2. Overview:
     • What are genome-wide association studies?
     • What are the statistical requirements for a successful genome-wide association study?
       • Sufficient sample sizes
       • LD coverage
       • Genotype quality
       • Design of genome-wide association studies / handling of the multiple testing problem

  3. The human genome
     • 22 autosomes plus the sex chromosomes X and Y
     • many possible genes
     • ~30,000-50,000 genes
     • ~8,000,000 SNPs
     • How can we find disease genes?

  4. The human genome: How can we find disease genes?
     Genotyping all loci is not possible (not yet!) => Utilization of 2 concepts:
     1.) Linkage disequilibrium (LD): correlation of alleles at two loci
     2.) Genetic association: a particular form of a DNA polymorphism occurs more frequently in subjects with a phenotype of interest

  5. Genetic Association
     [Diagram with three nodes: Disease Phenotype, DSL (disease susceptibility locus), Marker. Phenotype to DSL: test for genetic association between the phenotype and the DSL. DSL to Marker: LD / correlation. Phenotype to Marker: test for association between phenotype and marker locus.]

  6. Genome-wide association study
     Definition: Association analysis performed with a panel of polymorphic markers adequately spaced to capture most of the linkage disequilibrium information in the entire genome in the study population.
     Usually: 100,000 SNPs or more
     [Diagram: Human Genome and Disease Phenotype joined by a "?" => Test for association]

  7. What are the statistical requirements for a successful genome-wide association study?
     • Sufficient sample sizes
     • LD coverage
     • Genotyping quality
     • Design of genome-wide association studies / handling of the multiple testing problem

  8. Sample size requirements:
     [Diagram with three nodes: Disease Phenotype, DSL (disease susceptibility locus), Marker. Phenotype to DSL: test for genetic association between the phenotype and the DSL. DSL to Marker: LD / correlation. Phenotype to Marker: test for association between phenotype and marker locus.]
     Sufficient statistical power is needed to detect the association.

  9. Example of required sample sizes: required sample sizes to achieve 80% power in a case/control study at a significance level of 10^-7
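
The numbers shown on this slide are not part of the transcript. As a rough illustration of how such sample sizes can be obtained, the sketch below applies the textbook normal-approximation formula for comparing two proportions to allele frequencies in cases and controls. The function names, the choice of an allelic test (two alleles per subject), and the equal number of cases and controls are assumptions made for illustration, not the authors' calculation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def case_allele_freq(p_control, odds_ratio):
    """Allele frequency in cases implied by an allelic odds ratio."""
    odds = odds_ratio * p_control / (1.0 - p_control)
    return odds / (1.0 + odds)


def cases_needed(p_control, odds_ratio, alpha=1e-7, power=0.80):
    """Approximate number of cases (with an equal number of controls)
    for a two-sided allelic test. Assumes odds_ratio != 1."""
    p0 = p_control
    p1 = case_allele_freq(p_control, odds_ratio)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p0 + p1) / 2.0
    numerator = (z_alpha * sqrt(2.0 * p_bar * (1.0 - p_bar))
                 + z_beta * sqrt(p0 * (1.0 - p0) + p1 * (1.0 - p1))) ** 2
    alleles_per_group = numerator / (p1 - p0) ** 2
    # Each subject contributes two alleles to the allelic test.
    return ceil(alleles_per_group / 2.0)


if __name__ == "__main__":
    for maf in (0.05, 0.1, 0.2, 0.3):
        print(maf, cases_needed(maf, odds_ratio=1.5))
```

The stringent level of 10^-7 enters only through `z_alpha`, which is why genome-wide studies need many more subjects than a single-marker test at 5%.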

  10. What are the statistical requirements for a successful genome-wide association study?
     • Sufficient sample sizes
     • LD coverage
     • Genotyping quality
     • Design of genome-wide association studies / handling of the multiple testing problem

  11. Linkage disequilibrium (LD):
     [Diagram with three nodes: Disease Phenotype, DSL (disease susceptibility locus), Marker. Phenotype to DSL: test for genetic association between the phenotype and the DSL. DSL to Marker: LD / correlation. Phenotype to Marker: test for association between phenotype and marker locus.]
     The set of markers has to contain a marker that is “sufficiently” correlated with the DSL so that the genetic association at the DSL is also visible at the marker locus.

  12. Measures of genetic correlation between markers
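
The body of this slide is not preserved in the transcript. For orientation, the sketch below computes three commonly used pairwise LD measures (D, D' and r^2) from haplotype and allele frequencies. That these are the measures shown on the slide is an assumption; only r^2 is named on the following slide. The function and argument names are illustrative.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    Assumes both loci are polymorphic (0 < p_a < 1 and 0 < p_b < 1).
    """
    d = p_ab - p_a * p_b
    # D' rescales D by the largest magnitude it can take given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    # r^2 is the squared correlation between the alleles at the two loci.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


if __name__ == "__main__":
    print(ld_measures(p_ab=0.4, p_a=0.5, p_b=0.5))
```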

  13. The interpretation of r^2
     r^2 N is the “effective sample size”: if a marker M and causal gene G are in LD, then a study with N cases and controls which measures M (but not G) will have the same power to detect an association as a study with r^2 N cases and controls that directly measured G.
     Goal: The markers that are genotyped should be selected so that they have high r^2 values (preferably at least 80%) with the markers that are not genotyped.
     Good SNP selection will be key for the success of GWA studies.
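
A small worked example of the r^2 N rule stated on this slide. The helper names are illustrative, and the second function simply inverts the rule to ask how many subjects are needed at the marker to match a study that genotypes the causal variant directly.

```python
from math import ceil


def effective_sample_size(n, r2):
    """Sample size at the causal locus that gives the same power as
    n subjects genotyped at a marker with LD r^2 to that locus."""
    return r2 * n


def required_n_at_marker(n_direct, r2):
    """Subjects needed at the marker to match n_direct subjects
    genotyped directly at the causal locus (inverse of the rule above)."""
    return ceil(n_direct / r2)


if __name__ == "__main__":
    # With r^2 = 0.8, a study of 1,000 subjects behaves like one of 800.
    print(effective_sample_size(1000, 0.8))
    print(required_n_at_marker(1000, 0.8))
```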

  14. SNP Selection for GWA Studies
     • Really a challenge for industry development, not an investigator’s laboratory
     • However, need to select a panel with adequate LD coverage for study population
     • Assessment of Illumina Sentrix HumanHap300 BeadChip (R. Lazarus)
       • Studied LD coverage of ENCODE regions: ten 500 kb regions that were completely sequenced in HapMap in 60 CEPH parents
       • Assessed LD coverage of 6226 common ENCODE-region SNPs (MAF > 0.1)
       • Found maximum r^2 of each ENCODE SNP with a SNP on the HumanHap300 panel
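
The maximum-r^2 assessment described in the last bullets can be sketched as follows. This is an illustration on toy data, not the analysis by R. Lazarus: it assumes phased haplotypes coded as 0/1 allele vectors, and the function names, dictionary layout, and the 0.8 threshold default are assumptions.

```python
def r_squared(x, y):
    """Squared correlation between two 0/1 allele vectors over the same haplotypes."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = mean_x * (1 - mean_x)
    var_y = mean_y * (1 - mean_y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def max_r2_coverage(targets, panel, threshold=0.8):
    """For each target SNP, find its maximum r^2 with any panel SNP, and
    report the fraction of targets whose maximum reaches the threshold."""
    best = {
        name: max(r_squared(alleles, p) for p in panel.values())
        for name, alleles in targets.items()
    }
    covered = sum(1 for v in best.values() if v >= threshold) / len(best)
    return best, covered


if __name__ == "__main__":
    # Toy haplotypes (hypothetical data): 4 haplotypes, 2 target SNPs, 1 panel SNP.
    targets = {"t1": [0, 0, 1, 1], "t2": [0, 1, 0, 1]}
    panel = {"p1": [0, 0, 1, 1]}
    print(max_r2_coverage(targets, panel))
```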

  15. Genotyping quality (QC):
     [Diagram with three nodes: Disease Phenotype, DSL (disease susceptibility locus), Marker. Phenotype to DSL: test for genetic association between the phenotype and the DSL. DSL to Marker: LD / correlation. Phenotype to Marker: test for association between phenotype and marker locus.]
     The genotype quality has to be sufficient so that the genetic association at the DSL is also visible at the marker loci that are in LD with the DSL.

  16. For example, the dependence of the power of a GWA on the call rate
     Scenario:
     • Case/control study: 1,500 cases & controls
     • Odds-ratio: 1.5
     • Overall significance level: 5%
     • Adjustment for multiple comparisons: Bonferroni 5%/500,000 = 10^-7
     => Power as a function of allele frequency and call rates
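
The Bonferroni arithmetic and the baseline power for this scenario (before any call-rate effects) can be sketched as below. The sketch reads "1,500 cases & controls" as 1,500 cases and 1,500 controls and uses a normal approximation for an allelic test; both are assumptions, and the model behind the tables on the following slides is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def bonferroni_threshold(overall_alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return overall_alpha / n_tests


def allelic_power(p_control, odds_ratio, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allelic test (two alleles per subject)."""
    odds = odds_ratio * p_control / (1 - p_control)
    p_case = odds / (1 + odds)
    se = sqrt(p_case * (1 - p_case) / (2 * n_cases)
              + p_control * (1 - p_control) / (2 * n_controls))
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = abs(p_case - p_control) / se
    # Probability that the test statistic falls in either rejection tail.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


if __name__ == "__main__":
    alpha = bonferroni_threshold(0.05, 500_000)  # 5% / 500,000 = 10^-7
    for maf in (0.05, 0.1, 0.2, 0.3):
        print(maf, round(allelic_power(maf, 1.5, 1500, 1500, alpha), 3))
```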

  17. Power levels and avg number of false positives: Avg call rate by genotype: 100%, 100%, 100%

  18. Power levels and avg number of false positives: Avg call rate by genotype: 99%, 99%, 99%

  19. Power levels and avg number of false positives: Avg call rate by genotype: 98%, 98%, 98%

  20. Power levels and avg number of false positives: Avg call rate by genotype: 99%, 95%, 99%

  21. For example, the dependence of the power of a GWA on the call rate
     Conclusion:
     • Call rate has moderate effect on power (for nearly perfect call rates)
     • Call rate has large effect on number of false positives (for nearly perfect call rates)
     Situation even worse for multi-stage designs!

  22. Genotyping quality (QC):
     [Diagram with three nodes: Disease Phenotype, DSL (disease susceptibility locus), Marker. Phenotype to DSL: test for genetic association between the phenotype and the DSL. DSL to Marker: LD / correlation. Phenotype to Marker: test for association between phenotype and marker locus.]
     The genotype quality has to be sufficient so that the false positive rate does not dilute the “real” signals.

  23. Design of genome-wide association studies/Handling of the multiple testing problem:

  24. “Using the same data set for screening and testing”: An approach for family-based designs
     • Balance false-negatives with false-positives
       • We don’t want to test all SNPs
       • “You break it, you buy it”
     • Genomic screening and testing using the same data set
       • Test the “promising” SNPs
       • Ignore the “less-promising” SNPs

  25. PBAT
     • PBAT* screening approach
       • Family-based studies, quantitative traits
       • Address multiple-comparisons
       • Screen and test using the same dataset
     *Van Steen K, McQueen MB, Herbert A et al. (2005). Genomic screening and replication using the same data set in family-based association testing. Nat Genet 37:683-691.

  26. PBAT: Screening Step
     1. Screen
     • Use ‘between-family’ information E(X|S) to estimate the strength of the genetic association
     • Based on the estimate ab, calculate conditional power for
     • Select top N SNPs on the basis of power

  27. PBAT: Testing Step
     2. Test
     • Use ‘within-family’ information
     • FBAT statistic (independent of ‘between-family’ info)
     • Adjust for N tests (not 500K!)

  28. The 3 steps of the screening technique (Nature Genetics (2005)):
     Step 1: Replace X by E(X) and estimate power/effect size
     Step 2: Select combination with maximal power
     Step 3: Replace E(X) by X and compute FBAT test statistic for SNP2 and Trait
     [Diagram: Trait connected to SNP 1 through SNP 6 via E(X1|P) through E(X6|P); power values shown: 23%, 15%, 15%, 89%, 35%, 85%]
     P-value for FBAT statistic: 0.5%
     This p-value does not need to be adjusted for multiple comparisons!!!
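
The selection logic of slides 26 to 28 can be sketched in a few lines. This is not PBAT code: the function name and inputs are hypothetical, and the conditional power estimates and FBAT p-values are assumed to have been computed elsewhere (from between-family and within-family information, respectively). The sketch only shows the ranking by power and the Bonferroni adjustment for the N selected SNPs rather than for all genotyped SNPs.

```python
def screen_and_test(power_by_snp, fbat_pvalue_by_snp, top_n, overall_alpha=0.05):
    """Select the top_n SNPs by estimated conditional power (screening step),
    then test only those SNPs, adjusting for top_n tests (testing step)."""
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    # Screening: rank SNPs by estimated power, highest first.
    ranked = sorted(power_by_snp, key=power_by_snp.get, reverse=True)
    selected = ranked[:top_n]
    # Testing: Bonferroni over the selected SNPs only.
    threshold = overall_alpha / len(selected)
    return {snp: fbat_pvalue_by_snp[snp] <= threshold for snp in selected}


if __name__ == "__main__":
    # Hypothetical toy values, not taken from the slide's figure.
    power = {"snpA": 0.20, "snpB": 0.90, "snpC": 0.10}
    pvalues = {"snpA": 0.001, "snpB": 0.005, "snpC": 0.20}
    print(screen_and_test(power, pvalues, top_n=1))
    print(screen_and_test(power, pvalues, top_n=2))
```

With `top_n=1` the threshold equals the overall significance level, which matches the slide's remark that the single selected p-value needs no further adjustment.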

  29. PBAT Software implementation
     • Family-based studies
     • Quantitative traits & dichotomous traits
     • Single marker, haplotype, multi-marker
     • Time-to-onset, multivariate data, time-series data
     • Professional version distributed by Golden Helix…

  30. Golden Helix Software for Illumina Whole Genome Analysis
     • Golden Helix is Harvard’s PBAT commercialization partner
       • Easy-to-use, user-friendly graphical interface
       • Professional PBAT training and consulting
       • Rapid customer support
     • “Accelerating the Quest for Significance”
       • Powerful methods for both family and unrelated individuals
       • Run on hundreds of processors with distributed computing
       • Illumina data import directly supported
     • “I was able to do in 3 days what it has taken our lab 2 years to try and do with [other] collaborations.” – Golden Helix customer
     www.goldenhelix.com