Genome-wide association studies

C2BAT: Using the same data set for screening and testing. A testing strategy for genome-wide association studies in case/control design
Matt McQueen, Jessica Su, Nan Laird and Christoph Lange
Harvard School of Public Health

Genome-wide association studies
Limitations of linkage analysis and the potential of association analysis => genome-wide association studies (Risch & Merikangas 1997)
More than 100,000 SNPs and phenotypes are tested for association.
Statistical roadblock: a severe multiple testing problem!
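To make the size of the multiple testing problem concrete, the following minimal sketch computes Bonferroni-adjusted per-test significance levels. The family-wise level of 0.05 and the comparison against 100 tested markers are illustrative assumptions, not values taken from the study, and the function name is hypothetical.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Assumed family-wise level of 0.05 (illustrative).
genome_wide = bonferroni_threshold(0.05, 100_000)  # all SNPs tested
small_subset = bonferroni_threshold(0.05, 100)     # only 100 markers tested

print(f"per-test level, 100,000 tests: {genome_wide:.1e}")
print(f"per-test level, 100 tests:     {small_subset:.1e}")
print(f"ratio: {small_subset / genome_wide:.0f}")
```

Testing only a small subset of markers relaxes the per-test threshold by the ratio of the two test counts, which is the motivation for a screening step that does not have to be paid for statistically.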

“Using the same data set for screening and testing”
Screening technique: S
Testing statistic: T
Testing strategy:
• Assess the evidence for association for all SNPs based on S (Screening Step)
• Select a small subset of N markers (10-200)
• Compute the association test conditional upon S and adjust for N comparisons (Testing Step)
If the screening step and the testing step are statistically independent, we can look at the data in the screening step without paying a “statistical price” for it.

“Using the same data set for screening and testing”
General concept proposed by Laird and Lange (2006, Nat Rev Genet)
Decomposition of the joint likelihood:
P( {phenotype, genotype} ) = P( {phenotype, genotype} | S({phenotype, genotype}) ) * P( S({phenotype, genotype}) )
• First factor, P( {phenotype, genotype} | S({phenotype, genotype}) ) = Testing Step
• Second factor, P( S({phenotype, genotype}) ) = Screening Step
S = “Summary test statistic to assess evidence for association”
Requirements for S:
• The association test has to condition on S
• S has to contain information about the potential association as well
Testing strategy:
• Assess the evidence for association for all SNPs based on S (Screening Step)
• Select a small subset of N markers (10-200)
• Compute the association test conditional upon S and adjust for N comparisons (Testing Step)
The screening step and the testing step are statistically independent!

“Using the same data set for screening and testing”
Application to family-based association tests (Van Steen et al (2005))
Decomposition of the joint likelihood:
P( {phenotype, genotype, parental genotype} ) = P( {phenotype, genotype} | {phenotype, parental genotype} ) * P( {phenotype, parental genotype} )
S = “phenotype and parental genotype/sufficient statistic”
• P( {phenotype, parental genotype} ) = between-family component (Fulker et al (1999)) = Screening Step, based on the conditional mean model (Lange et al (2003))
• P( {phenotype, genotype} | {phenotype, parental genotype} ) = within-family component (Fulker et al (1999)) = Testing Step, based on FBAT (Laird et al (2000))
Alternative approach:
• Instead of using the between-family component (Screening Step) and the within-family component (Testing Step) in a two-stage testing strategy, one could include both components in the test statistic, e.g. QTDT (Abecasis et al (2000))
Disadvantages:
• Only marginal power gains (5%) over the FBAT statistic when a single SNP is tested (Abecasis et al (2001))
• Lack of robustness against population admixture (Yu et al (2006))
Properties of the testing strategy:
• Outperforms standard adjustments for multiple comparisons by factors of up to 40
• Additional power boost through the use of complex phenotypes such as longitudinal data: discovery of INSIG2 in a 100K scan in the Framingham Heart Study, the first replicable association for BMI / obesity (Herbert et al (2006, Science))

“Using the same data set for screening and testing”
Can we translate this concept to association studies in unrelated cases and controls?
χ²-tests and Armitage trend tests are conditional tests that condition upon the margins
=> The data-partitioning statistic S is given by the margins of the table
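As an illustration of a test that conditions on the table margins, here is a minimal sketch of the Armitage trend test for a 2 x 3 case/control genotype table, written with the Python standard library only. The additive scores (0, 1, 2), the function name and the example counts are assumptions made for illustration; the slides do not specify which variant of the test is used.

```python
import math


def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
    """Return (chi2, p) for the Armitage trend test with 1 degree of freedom.

    cases[i] and controls[i] are the counts of subjects carrying genotype i.
    The statistic uses the cell counts only through their deviation from
    what the margins (row totals and column totals) predict.
    """
    col_totals = [r + s for r, s in zip(cases, controls)]  # genotype margins
    n = sum(col_totals)                                    # grand total
    n_cases = sum(cases)                                   # row margin
    n_controls = n - n_cases                               # row margin

    sum_xr = sum(x * r for x, r in zip(scores, cases))
    sum_xn = sum(x * c for x, c in zip(scores, col_totals))
    sum_xxn = sum(x * x * c for x, c in zip(scores, col_totals))

    a = n * sum_xr - n_cases * sum_xn
    b = n * sum_xxn - sum_xn ** 2
    if n_cases == 0 or n_controls == 0 or b == 0:
        return 0.0, 1.0  # no information about a trend

    chi2 = n * a * a / (n_cases * n_controls * b)
    p = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p


# Hypothetical genotype counts (0, 1, 2 copies of the allele).
chi2, p = armitage_trend_test(cases=[10, 20, 30], controls=[30, 20, 10])
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

For the example table the margins are 60 cases, 60 controls and 40 subjects per genotype, and the statistic evaluates to 20.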

Testing strategy:
1.) Divide the table into a “screening table” and a “testing table”
2.) For each SNP, use the “screening table” and the margins of the “testing table” to assess the evidence for association (Screening Step)
3.) Select the most promising N SNPs and test them for association based on the data of the testing table (Testing Step)
How can we obtain information about an association from the margins?
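The overall flow of the strategy can be sketched on simulated data as follows. This is a deliberately simplified stand-in and not the C2BAT procedure itself: the screening step here ranks SNPs using the screening table only and ignores the margins of the testing table, because the slides do not say how those margins enter the screening statistic. The 50/50 split, the allele frequencies, the function names and the toy data are all assumptions.

```python
import math
import random


def trend_chi2(genotypes, status):
    """Armitage trend chi-square (1 df) from per-subject genotypes (0/1/2) and status (1 = case)."""
    n = len(genotypes)
    n_cases = sum(status)
    sx = sum(genotypes)
    sxx = sum(g * g for g in genotypes)
    sxy = sum(g for g, y in zip(genotypes, status) if y == 1)
    a = n * sxy - n_cases * sx
    b = n * sxx - sx * sx
    if n_cases == 0 or n_cases == n or b == 0:
        return 0.0
    return n * a * a / (n_cases * (n - n_cases) * b)


def screen_and_test(snps, status, n_select, alpha=0.05, seed=0):
    """Split subjects at random, rank SNPs on one half, test the top n_select on the other half."""
    rng = random.Random(seed)
    subjects = list(range(len(status)))
    rng.shuffle(subjects)
    half = len(subjects) // 2
    screen_idx, test_idx = subjects[:half], subjects[half:]

    def stat(snp, idx):
        return trend_chi2([snp[i] for i in idx], [status[i] for i in idx])

    # Screening step (simplified): rank all SNPs on the screening subjects.
    ranked = sorted(range(len(snps)), key=lambda j: stat(snps[j], screen_idx), reverse=True)

    # Testing step: test only the selected SNPs, Bonferroni-adjusted for n_select.
    results = {}
    for j in ranked[:n_select]:
        p = math.erfc(math.sqrt(stat(snps[j], test_idx) / 2.0))
        results[j] = (p, p < alpha / n_select)
    return results


# Toy data: 100 cases, 100 controls, 1 associated SNP (index 0) and 499 null SNPs.
rng = random.Random(1)
status = [1] * 100 + [0] * 100


def draw_genotype(freq):
    """Number of risk alleles (0, 1 or 2) for an allele frequency freq."""
    return int(rng.random() < freq) + int(rng.random() < freq)


snps = [[draw_genotype(0.7 if y else 0.2) for y in status]]
snps += [[draw_genotype(0.3) for _ in status] for _ in range(499)]

results = screen_and_test(snps, status, n_select=10)
for j, (p, significant) in sorted(results.items()):
    print(j, f"{p:.3g}", significant)
```

Because the two halves contain different subjects, the testing step only needs to be adjusted for the 10 selected SNPs rather than for all 500. A single random split makes the outcome depend on the split, which is the issue the re-sampling of the tables is meant to address.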

Results will depend on the actual random split-up of the tables!
Solution:
1.) Re-sampling of the tables
2.) p-value for the testing set based on p(data) = p(data | S(data)) * p(S(data)) and Monte Carlo simulations

Simulation Study

Can C2BAT find INSIG2 in the 100K scan in the Framingham Heart Study again?
• 1400 probands in about 300 families
• Randomly select 150 unrelated cases/controls (BMI > 28 = “affected”)
• => Apply the standard analysis (p-value adjusted by Bonferroni correction) and C2BAT to see whether INSIG2 reaches genome-wide significance
For 1000 replicates:
Power of the standard analysis to detect INSIG2: 5%
Power of C2BAT to detect INSIG2: 17%

Future work:
1.) Extension to quantitative traits => expression analysis
2.) Gene-gene interactions
Software: www.c2bat.com
