
EPI293: Design and analysis of gene association studies. Winter Term 2006. Lecture 3: Statistical review, single-locus association tests. Peter Kraft, pkraft@hsph.harvard.edu, Bldg 2 Rm 206, 2-4271.



  1. EPI293: Design and analysis of gene association studies • Winter Term 2006 • Lecture 3: Statistical review, single-locus association tests • Peter Kraft, pkraft@hsph.harvard.edu, Bldg 2 Rm 206, 2-4271

  2. Outline • Statistical review • One-locus tests • Multiple single locus tests

  3. Outline • Statistical review • Pearson’s chi-square • Likelihood theory • Measures of model fit: AIC, BIC • Bayesian data analysis • One-locus tests • Multiple single-locus tests

  4. Pearson’s chi-squared • Do categorical data have hypothesized dist’n? • Are outcome and exposure independent (k × l tables)? • Do genotypes follow Hardy-Weinberg proportions? • i indexes I categories; $O_i$ and $E_i$ are the observed and expected counts in category i • Test statistic: $T = \sum_{i=1}^{I} (O_i - E_i)^2 / E_i$ • $T \sim \chi^2_d$ under null • d = no. parms under alternative – no. parms under null

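  5. Example: 2 × 3 table • Let $n_{00}$, $n_{01}$ and $n_{02}$ be counts of controls with genotypes aa, Aa, and AA, respectively • Let $n_{10}$, $n_{11}$ and $n_{12}$ be the same for cases • $n_{0\cdot}$ and $n_{1\cdot}$ are total no.s of controls, cases • $n_{\cdot 0}$ is total no. of aa genotypes, etc. • $T = \sum_{i=0}^{1} \sum_{j=0}^{2} (n_{ij} - E_{ij})^2 / E_{ij}$ with $E_{ij} = n_{i\cdot} n_{\cdot j} / n_{\cdot\cdot}$; $T \sim \chi^2_2$ under null • 2 d.f. from 4-2 = 2 or standard formula: (k-1)(l-1) = 2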
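The 2 × 3 statistic can be computed directly from the cell counts. The following is a minimal Python sketch, not part of the original slides: the function name `pearson_chi2` is made up for illustration, and the counts are the ones used in the SAS example on slide 24. The p-value line relies on the closed form exp(-T/2), which holds only for 2 d.f.

```python
import math

def pearson_chi2(table):
    """Pearson chi-square for a k x l table given as a list of rows of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    t = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # E_ij = n_i. * n_.j / n..
            t += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return t, df

# Rows: controls, cases; columns: aa, Aa, AA (counts from the slide 24 data step)
counts = [[655, 310, 37],
          [535, 401, 67]]
T, df = pearson_chi2(counts)
p_value = math.exp(-T / 2)  # chi-square tail probability, valid for df == 2 only
print(f"T = {T:.2f} on {df} d.f., p = {p_value:.2e}")
```

For these counts T comes out at roughly 32.4, which is in line with the score test that PROC LOGISTIC reports for the co-dominant model on slide 26.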

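  6. Example: test for departure from HWE • With genotype counts $n_0$, $n_1$, $n_2$ (aa, Aa, AA), $n = n_0 + n_1 + n_2$ and $\hat{p} = (2n_2 + n_1)/(2n)$: $T = \sum_{g=0}^{2} (n_g - E_g)^2 / E_g$, where $E_0 = n(1-\hat{p})^2$, $E_1 = 2n\hat{p}(1-\hat{p})$, $E_2 = n\hat{p}^2$ • Under null T is a chi-square with 1 d.f. • Two parameters under alternative minus one under null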
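A small Python sketch of the HWE test follows. It is an illustration only: `hwe_chi2` is a hypothetical helper name, and the expected counts are the usual Hardy-Weinberg proportions evaluated at the estimated allele frequency. The 1 d.f. p-value uses the identity P(chi-square_1 > T) = erfc(sqrt(T/2)).

```python
import math

def hwe_chi2(n_aa, n_Aa, n_AA):
    """Pearson chi-square test for departure from Hardy-Weinberg proportions."""
    n = n_aa + n_Aa + n_AA
    p = (2 * n_AA + n_Aa) / (2 * n)  # estimated frequency of allele A
    # Note: if p is 0 or 1 an expected count is zero and the statistic is undefined.
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    observed = [n_aa, n_Aa, n_AA]
    t = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(t / 2))  # upper tail of chi-square with 1 d.f.
    return t, p_value

# Counts exactly in HWE proportions versus a sample with no heterozygotes
t_hwe, p_hwe = hwe_chi2(25, 50, 25)
t_dev, p_dev = hwe_chi2(50, 0, 50)
print(t_hwe, p_hwe)
print(t_dev, p_dev)
```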

  7. Likelihood theory • Likelihood is function of model parameters, given a probabilistic model and data • Probability of observed data for given parameter values • Assume observations (indexed by i) are independent • Let $X_i$ be data for observation i • $\beta$ = parameters of interest; $\gamma$ = “nuisance” parameters • $L(\beta, \gamma) = \prod_i \Pr(X_i \mid \beta, \gamma)$ • Maximize L to estimate $\beta$, $\gamma$ (MLEs) • Equivalent to maximizing log L • Usually requires computers

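  8. Example: MLE for allele frequency • Multinomial likelihood (genotype counts $n_0$, $n_1$, $n_2$ for aa, Aa, AA; A allele frequency p; HWE assumed): $L(p) \propto [(1-p)^2]^{n_0} [2p(1-p)]^{n_1} [p^2]^{n_2}$ • Maximum at p = 0, p = 1, or where the “score” $U(p) = \partial/\partial p \, \log L = 0$ • … so MLE of p is $(2n_2 + n_1) / (2n)$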
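The step from the score equation to the MLE can be written out as follows. This derivation is a sketch that assumes Hardy-Weinberg proportions and uses $n_0$, $n_1$, $n_2$ for the counts of aa, Aa and AA genotypes, with $n = n_0 + n_1 + n_2$ and p the frequency of allele A.

```latex
\begin{align*}
\log L(p) &= (2n_0 + n_1)\log(1-p) + (n_1 + 2n_2)\log p + \text{const} \\
U(p) = \frac{\partial}{\partial p}\log L(p) &= \frac{n_1 + 2n_2}{p} - \frac{2n_0 + n_1}{1-p} \\
U(\hat{p}) = 0 \;&\Rightarrow\; (n_1 + 2n_2)(1-\hat{p}) = (2n_0 + n_1)\,\hat{p} \\
&\Rightarrow\; \hat{p} = \frac{n_1 + 2n_2}{2(n_0 + n_1 + n_2)} = \frac{2n_2 + n_1}{2n}
\end{align*}
```

In words, the MLE is the observed proportion of A alleles among the 2n alleles sampled.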

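  9. Example: unconditional logistic regression • $\Pr(D_i = 1 \mid Z_i, X_i) = \mathrm{expit}(\beta' Z_i + \gamma' X_i)$, where $\mathrm{expit}(t) = 1/(1 + e^{-t})$ • J exposures of interest (Z, coefficients $\beta$), K “nuisance” parameters (X, coefficients $\gamma$) • No closed-form solution for parameter estimates • Need computer: SAS PROC LOGISTIC, R glm() etc.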

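  10. Tests based on likelihood theory • Score test • $U(\beta_0) \sim N(0, \mathrm{Var}(U))$ • If observations are independent $\mathrm{Var}(U) \approx I = -\partial^2/\partial\beta^2 \, \log L$ • $U' I^{-1} U \sim \chi^2$ with $\dim(\beta_0)$ d.f. • Often has convenient formula (e.g. McNemar’s test) • Wald test • For large enough samples: $\hat{\beta} \sim N(\beta, \mathrm{Var}(\hat{\beta}))$ • If observations are independent: $\mathrm{Var}(\hat{\beta}) \approx I^{-1}$ • Leads to usual test: $(\hat{\beta} - \beta_0)' I (\hat{\beta} - \beta_0) \sim \chi^2$ with $\dim(\beta_0)$ d.f. • “Easy” to robustify if observations are not independent • Sandwich or Huber-White estimate: $\mathrm{Var}(\hat{\beta}) \approx I^{-1} (\sum_i U_i U_i') I^{-1}$ • Likelihood ratio test • Intuitive test of hypotheses that constrain multiple parms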

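  11. Likelihood ratio test • $LR = \max_{\theta \in \Theta_1} L(\theta) \,/\, \max_{\theta \in \Theta_0} L(\theta)$ (alternative model over null model) • $\Theta_0 \subset \Theta_1$, i.e. models are “nested” • $LRT = 2 \log LR \sim \chi^2_d$ under null, $d = \dim(\Theta_1) - \dim(\Theta_0)$ • E.g. $\Theta_1 = \{\beta_1, \beta_2 : \beta_1, \beta_2 \in (-\infty, \infty)\}$, $\Theta_0 = \{\beta_1 = \beta_2 = 0\}$, d = 2.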

  12. Likelihood ratio test: example • Case-control study of CHD • Z is BMI, coded in tertiles • I.e. $Z_i' = (Z_{i1}, Z_{i2})$ • $Z_{i1}$=1 if i in middle tertile, 0 otherwise • $Z_{i2}$=1 if i in top tertile, 0 otherwise • X includes intercept, age (as a linear term) • Null: Pr(D=1|Z,X) = expit[$\gamma' X$] • (two parameters) • Alternative: Pr(D=1|Z,X) = expit[$\beta' Z + \gamma' X$] • (four parameters) • Likelihood ratio test has 4-2=2 degrees of freedom

  13. Measures of model fit • Not all models are nested within each other • Dominant, recessive models for a given risk allele • Locus A versus locus B • Interested in model fit per se • Which model(s) best describe(s) data • More parameters = more flexibility = smaller -2 log L, hence a “penalty for ‘overfitting’” on K, the number of parameters • Akaike Information Criterion • AIC = -2 log L + 2 K • Bayes Information Criterion • BIC = -2 log L + log(n) K • Smaller is better (but read software manual) • AIC is an estimate of “in-sample error” using log-likelihood loss function • BIC is a rough estimate of -2 log Pr(Model|Data)
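The two criteria are simple enough to check by hand. The sketch below is illustrative (the helper names `aic` and `bic` are not from the slides) and plugs in the -2 Log L value that slide 27 reports for the model with genotype and covariate, taking K = 4 (intercept, two genotype terms, x). Using n = 12, the number of rows in that data step, is an assumption; it happens to reproduce the SC value printed on the slide, which is one reason to read the software manual.

```python
import math

def aic(neg2_loglik, k):
    """Akaike Information Criterion: -2 log L + 2K."""
    return neg2_loglik + 2 * k

def bic(neg2_loglik, k, n):
    """Bayes (Schwarz) Information Criterion: -2 log L + log(n) K."""
    return neg2_loglik + math.log(n) * k

neg2ll = 5460.033           # -2 Log L reported on slide 27 (genotype + covariate model)
aic_b = aic(neg2ll, 4)      # K = 4 parameters
bic_b = bic(neg2ll, 4, 12)  # n = 12 data rows (assumption, see lead-in)
print(f"AIC = {aic_b:.3f}, BIC = {bic_b:.3f}")
```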

  14. Bayesian data analysis • Frequentists assume there is a true model with true parameter values, which we estimate given the data • Pearson’s chi-square, likelihood theory: all frequentist • Bayesians assume the parameters (including perhaps model form) are random variables, and calculate the posterior distribution given the data • Advantages • Can account for “prior information” about distribution of parms • Quite complicated models are mathematically tractable • Disadvantages • Requires assumptions about “prior information” • Bayes’ Theorem: $p(\theta \mid X) = p(X \mid \theta)\, p(\theta) / \int p(X \mid \theta)\, p(\theta)\, d\theta$, where $p(\theta)$ is prior distribution of $\theta$ • “Fully Bayes” = assumes prior completely known; “empirical Bayes” = assumes prior depends on “hyper parameters” (e.g. mean and variance) which are estimated from data

  15. “Fully Bayes” example • Say we collect n std’zed continuous measurements • $X_i \sim N(\mu, 1)$ • Say that a priori $\mu \sim N(0, \tau_0^2)$ • Then posterior distribution of $\mu$ has mean $n\bar{X} / (n + 1/\tau_0^2)$ and variance $1 / (n + 1/\tau_0^2)$ • What does this mean? (a) For n large relative to $1/\tau_0^2$, “the data swamp the prior” (b) for n small relative to $1/\tau_0^2$, the prior swamps the data (c) different priors lead to different results
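The “data swamp the prior” point can be seen numerically. This is a minimal sketch assuming unit-variance normal data and a normal prior centered at zero, as on the slide; the function name `normal_posterior` and the symbol `tau0_sq` for the prior variance are choices made here for illustration.

```python
def normal_posterior(xbar, n, tau0_sq):
    """Posterior mean and variance of mu for X_i ~ N(mu, 1), mu ~ N(0, tau0_sq)."""
    precision = n + 1.0 / tau0_sq  # data precision plus prior precision
    return n * xbar / precision, 1.0 / precision

# Large n relative to 1/tau0_sq: posterior mean stays close to the sample mean
mean_big, var_big = normal_posterior(xbar=1.0, n=100, tau0_sq=1.0)
# Small n relative to 1/tau0_sq: posterior mean is pulled toward the prior mean 0
mean_small, var_small = normal_posterior(xbar=1.0, n=1, tau0_sq=0.01)
print(mean_big, var_big)
print(mean_small, var_small)
```

With the same sample mean of 1, the first call returns a posterior mean near 1 and the second a posterior mean near 0.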

  16. Empirical Bayes example: hierarchical modeling • Say Z1,…,Z5 measure consumption of five food types • First stage model: • Pr(D=1|Z) = expit[$\beta_0 + \beta_1 Z_1 + … + \beta_5 Z_5$] • Second stage model (prior): • $\beta_1 = \alpha_0 + \alpha_1 X_1 + \epsilon_1$; $\beta_2 = \alpha_0 + \alpha_1 X_2 + \epsilon_2$; etc. • …where $X_i$ is amount of nutrient of interest in food i • “regressing effect of Z on X” • Prior depends on three parameters: $\alpha_0$, $\alpha_1$ and var($\epsilon$) • $\alpha_0$, $\alpha_1$ estimated from data • var($\epsilon$) can be estimated from data or treated as fixed • Or chosen to minimize prediction error • Advantages • Reduce parameter variance • Allow high-dimensional models to be fit • Disadvantages • Must make assumptions in second-stage model • For us: what is the at-risk allele, which loci are “exchangeable”

  17. Outline • Statistical review • One-locus tests • Diallelic • Multiallelic • Multiple single-locus tests

  18. Simple three × two tables • Advantages • Simplicity, completeness • Robust to true dominance pattern • Disadvantage • Statistic unreliable when few homozygote variants (AA) • Pearson’s chi-square T (the 2 × 3 statistic of slide 5) has 2 d.f. under null

  19. Simple two × two tables • Dominant model: collapse genotypes to (Aa or AA) vs. aa • Recessive model: collapse genotypes to AA vs. (aa or Aa) • Test statistic now has 1 d.f. under null • Advantages • Simplicity • Disadvantages • Lose some information in presentation • Not robust to true dominance pattern

  20. Simple trend test • Armitage’s Trend Test • Test linear trend in log(OR) with no. of A alleles (notation from slide 18) • Test statistic still has 1 d.f. under null • Advantages • Simplicity, retain information in presentation (2x3 table) • More robust than dominant, recessive tests • Disadvantage • Not as robust as 2 d.f. test
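The slide's formula did not survive in this transcript, so the sketch below uses the standard Cochran-Armitage form of the trend statistic with allele-count scores 0, 1, 2. It is an illustration, not the slide's own notation; `armitage_trend` is a hypothetical name, and the counts are those of the SAS example on slide 24.

```python
import math

def armitage_trend(controls, cases, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic (1 d.f.) for a 2 x 3 genotype table."""
    col_tot = [c + d for c, d in zip(controls, cases)]  # subjects per genotype
    n_total = sum(col_tot)
    n_cases = sum(cases)
    s_wr = sum(w * r for w, r in zip(weights, cases))
    s_wn = sum(w * n for w, n in zip(weights, col_tot))
    s_w2n = sum(w * w * n for w, n in zip(weights, col_tot))
    numerator = n_total * (n_total * s_wr - n_cases * s_wn) ** 2
    denominator = n_cases * (n_total - n_cases) * (n_total * s_w2n - s_wn ** 2)
    t = numerator / denominator
    p_value = math.erfc(math.sqrt(t / 2))  # upper tail of chi-square with 1 d.f.
    return t, p_value

# Genotype counts in the order aa, Aa, AA
t_trend, p_trend = armitage_trend(controls=[655, 310, 37], cases=[535, 401, 67])
print(f"trend chi-square = {t_trend:.4f}, p = {p_trend:.2e}")
```

On these counts the statistic is about 32.11, the same value PROC LOGISTIC prints as the score test for the additive model on slide 25.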

  21. Allelic test • For all the previous tests, the unit of observation was the subject (genotype) • Total number of observations = $n_{\cdot\cdot}$ = number of subjects • Can also treat alleles as the unit of observation • Now total number of observations is $2 n_{\cdot\cdot}$ • Great! I’ve doubled my sample size! But… • … my Type I error could be inflated if locus is out of HWE… • … and $OR_{allele}$ requires careful interpretation • Reference: Sasieni, P.D., From genotypes to genes: doubling the sample size. Biometrics, 1997. 53(4): p. 1253-61
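To make the “doubled sample size” concrete, the following sketch converts a genotype table into a 2 × 2 table of allele counts and applies the usual 2 × 2 chi-square. The helper names are invented for this example and the counts come from the SAS data step on slide 24; as the slide warns, the 1 d.f. reference distribution is only trustworthy when the locus is in HWE.

```python
def allele_counts(genotype_counts):
    """Turn (n_aa, n_Aa, n_AA) subject counts into [a alleles, A alleles]."""
    n_aa, n_Aa, n_AA = genotype_counts
    return [2 * n_aa + n_Aa, n_Aa + 2 * n_AA]

def chi2_2x2(table):
    """Pearson chi-square (1 d.f., no continuity correction) for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

controls = (655, 310, 37)
cases = (535, 401, 67)
allele_table = [allele_counts(controls), allele_counts(cases)]
t_allelic = chi2_2x2(allele_table)
n_alleles = sum(sum(row) for row in allele_table)
print(allele_table, n_alleles, round(t_allelic, 2))
```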

  22. Examples • Codominant test: Pearson’s chi-square 1.86 on 2 d.f., p=.39 • Allelic test: Pearson’s chi-square 1.62 on 1 d.f., p=.20 • “Truth:” $RR_{Aa}$ = 1.25, $RR_{AA}$ = 1.5

  23. 2x3 (etc.) tables via logistic regression • Trick: create genotype coding variable Z • One d.f. tests • Dominant: Z=1 if genotype is AA or Aa, 0 otherwise • Recessive: Z=1 if genotype is AA, 0 otherwise • Trend (a.k.a. linear or additive): Z = # A alleles • If genotype is AA then Z=2, if Aa then Z=1 etc. • Score test from this model = Armitage’s trend test • Two d.f. test • Create two “dummy” variables • Z1 = 1 if genotype is Aa, 0 otherwise • Z2 = 1 if genotype is AA, 0 otherwise • Perform likelihood ratio test • Advantages of logistic regression • Adjust for other variables, test several loci simultaneously
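The four codings listed on this slide can be generated from a genotype string in a few lines. This is a small illustrative sketch; the function `code_genotype` and the dictionary keys are names chosen here, and genotypes are assumed to be two-character strings such as "Aa".

```python
def code_genotype(genotype, risk_allele="A"):
    """Return dominant, recessive, additive and dummy codings for one genotype."""
    k = genotype.count(risk_allele)  # number of copies of the risk allele
    return {
        "dominant": int(k >= 1),             # 1 if AA or Aa
        "recessive": int(k == 2),            # 1 if AA
        "additive": k,                       # 0, 1 or 2
        "dummy": (int(k == 1), int(k == 2)), # (Z1, Z2) for the 2 d.f. test
    }

for g in ("aa", "Aa", "AA"):
    print(g, code_genotype(g))
```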

  24. How to fit using logistic regression (in SAS)

      data example;
        input caco z n;
        cards;
      0 0 655
      0 1 310
      0 2 37
      1 0 535
      1 1 401
      1 2 67
      ;
      run;

      /* Additive */
      proc logistic descending;
        model caco=z;
        weight n;
      run;

      /* Co-dominant */
      proc logistic descending;
        class z (ref=first);
        model caco=z;
        weight n;
      run;

  25. Additive model

      Testing Global Null Hypothesis: BETA=0

      Test               Chi-Square   DF   Pr > ChiSq
      Likelihood Ratio      32.3558    1       <.0001
      Score                 32.1106    1       <.0001
      Wald                  31.6368    1       <.0001

      Analysis of Maximum Likelihood Estimates

      Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
      Intercept    1    -0.1963           0.0568           11.9363       0.0006
      z            1     0.4331           0.0770           31.6368       <.0001
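For readers without SAS, the additive fit can be approximated with a few Newton-Raphson steps in plain Python. This is a sketch under stated assumptions rather than a reimplementation of PROC LOGISTIC: it fits Pr(caco=1) = expit(b0 + b1*z) to the grouped counts from slide 24, and all function and variable names are invented for the example.

```python
import math

# Grouped counts from the slide 24 data step: (caco, z, n)
ROWS = [(0, 0, 655), (0, 1, 310), (0, 2, 37),
        (1, 0, 535), (1, 1, 401), (1, 2, 67)]

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def loglik(b0, b1, rows):
    """Binomial log-likelihood for grouped data."""
    total = 0.0
    for y, z, n in rows:
        p = expit(b0 + b1 * z)
        total += n * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

def score_and_info(b0, b1, rows):
    """Score vector U and information matrix I (as i00, i01, i11)."""
    u0 = u1 = i00 = i01 = i11 = 0.0
    for y, z, n in rows:
        p = expit(b0 + b1 * z)
        u0 += n * (y - p)
        u1 += n * (y - p) * z
        w = n * p * (1.0 - p)
        i00 += w
        i01 += w * z
        i11 += w * z * z
    return (u0, u1), (i00, i01, i11)

def fit_additive(rows, n_iter=25):
    """Newton-Raphson: beta <- beta + I^{-1} U, starting from zero."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        (u0, u1), (i00, i01, i11) = score_and_info(b0, b1, rows)
        det = i00 * i11 - i01 * i01
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1

b0, b1 = fit_additive(ROWS)
_, (i00, i01, i11) = score_and_info(b0, b1, ROWS)
var_b1 = i00 / (i00 * i11 - i01 * i01)  # [I^{-1}] entry for the slope
wald = b1 ** 2 / var_b1

n_cases = sum(n for y, _, n in ROWS if y == 1)
n_total = sum(n for _, _, n in ROWS)
p0 = n_cases / n_total                  # intercept-only MLE
ll_null = n_cases * math.log(p0) + (n_total - n_cases) * math.log(1.0 - p0)
lrt = 2.0 * (loglik(b0, b1, ROWS) - ll_null)
print(f"b0={b0:.4f} b1={b1:.4f} Wald={wald:.4f} LRT={lrt:.4f}")
```

The printed estimates, Wald statistic and likelihood ratio statistic can be compared against the PROC LOGISTIC output for the additive model.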

  26. Co-dominant model

      Testing Global Null Hypothesis: BETA=0

      Test               Chi-Square   DF   Pr > ChiSq
      Likelihood Ratio      32.5780    2       <.0001
      Score                 32.4012    2       <.0001
      Wald                  32.0451    2       <.0001

      Analysis of Maximum Likelihood Estimates

      Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
      Intercept    1     0.2163           0.0753            8.2417       0.0041
      z 1          1     0.0411           0.0871            0.2232       0.6366
      z 2          1     0.3775           0.1402            7.2486       0.0071

  27. Adjusting for covariates

      data example;
        input caco z x n;
        cards;
      0 0 0 655
      0 1 0 310
      0 2 0 37
      1 0 0 535
      1 1 0 401
      1 2 0 67
      0 0 1 642
      0 1 1 311
      0 2 1 31
      1 0 1 542
      1 1 1 391
      1 2 1 59
      ;
      run;

      /* A) covariate only */
      proc logistic descending;
        model caco=x;
        weight n;
      run;

      /* B) genotype (co-dominant) plus covariate */
      proc logistic descending;
        class z (ref=first);
        model caco=z x;
        weight n;
      run;

      A) Model Fit Statistics

      Criterion   Intercept Only   Intercept and Covariates
      AIC               5520.818                   5522.805
      SC                5521.302                   5523.775
      -2 Log L          5518.818                   5518.805

      B) Model Fit Statistics

      Criterion   Intercept Only   Intercept and Covariates
      AIC               5520.818                   5468.033
      SC                5521.302                   5469.972
      -2 Log L          5518.818                   5460.033

      Likelihood ratio test for genotype, adjusted for x (model A vs. model B): T=5518.8-5460.0=58.8 on 2 d.f.
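The likelihood ratio statistic on this slide is just a difference of two -2 Log L values, and with 2 d.f. its p-value has the closed form exp(-T/2). The short sketch below uses the unrounded values printed in the Model Fit Statistics output; treating 5518.805 (covariate-only model) as the null value is an interpretation of the slide, which rounds both candidates to 5518.8.

```python
import math

neg2ll_covariate_only = 5518.805  # -2 Log L, model with x only
neg2ll_with_genotype = 5460.033   # -2 Log L, model with z (2 d.f.) and x

lrt = neg2ll_covariate_only - neg2ll_with_genotype
p_value = math.exp(-lrt / 2)      # chi-square tail probability, valid for 2 d.f. only
print(f"LRT = {lrt:.3f} on 2 d.f., p = {p_value:.2e}")
```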

  28. Which test to use? • No fishing expeditions (without paying a price)!

  29. Which test to use? (cont) Different colors = different true models. Points are comparisons of power for different models. Codominant offers gain in power under true recessive model, for little cost under other true models.

  30. PK’s soapbox • For complex diseases, the “mode of inheritance” (dominant, recessive, et cetera) is an antiquated and potentially dangerous concept • “Mode of inheritance” developed for simple Mendelian diseases with near-complete penetrance • Complex diseases involve multiple loci, have high phenocopy rates • A marker that is in LD with a causal gene will “look co-dominant,” even if the causal gene is actually recessive or dominant • Few can resist temptation to “go fishing” • Reporting results for “most significant” coding leads to inflated Type I error, narrow CIs • Very difficult--if not impossible--to choose between competing models • Suggestion: emphasize results from co-dominant coding • Co-dominant coding is “model free” (model is saturated) • ORs convey more information • Generally retains power relative to other codings (even though test has 2 d.f. instead of 1) • Switch to dominant coding only if homozygous carriers are very rare (As yet) unpublished simulation study by Jean Yen supports these hypotheses

  31. If Causal Variant is Dominant (recessive, codominant), What does Marker-Trait correlation pattern look like? Next few slides borrowed from Bruce Weir

  32. Quantitative Traits

  33. Two-allele Models

  34. Trait Mean and Variance

  35. Marker and Trait Values

  36. Two-allele Situation

  37. Trait Values for Marker Loci

  38. Quantitative Traits

  39. Two-allele Models

  40. Trait Mean and Variance

  41. Marker and Trait Values

  42. Two-allele Situation

  43. Trait Values for Marker Loci

  44. BINARY TRAITS

  45. Multi-allelic tests • General test has $\binom{K}{2} + K - 1$ d.f. for K alleles (one fewer than the $\binom{K}{2} + K$ possible genotypes) • Number of d.f. gets large quickly • Even bigger problem with sparse cells • Can we use information about dominance pattern?
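How quickly the general test grows can be tabulated in a couple of lines. In this sketch (function names are illustrative) a marker with K alleles has C(K,2) heterozygous plus K homozygous genotypes, and the general case-control test has one fewer d.f. than the number of genotypes, which matches the 9 d.f. reported for the four-allele example on slide 48.

```python
import math

def n_genotypes(k_alleles):
    """Number of unordered genotypes: C(K,2) heterozygotes + K homozygotes."""
    return math.comb(k_alleles, 2) + k_alleles

def general_test_df(k_alleles):
    """Degrees of freedom of the general genotype test in a case-control table."""
    return n_genotypes(k_alleles) - 1

for k in (2, 3, 4, 6, 10):
    print(k, n_genotypes(k), general_test_df(k))
```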

  46. Multi-allelic dominance • E.g. ABO blood group • “A is dominant to O”; “B is dominant to O” • $Z_A$=1 if G ∈ {AO, AA}, 0 otherwise • $Z_B$=1 if G ∈ {BO, BB}, 0 otherwise • $Z_O$=1 if G=OO, 0 otherwise • $Z_{AB}$=1 if G=AB, 0 otherwise • Have to leave one of these vars out as reference group • We do not generally know multiallelic dominance relations… • Maybe one allele carries risk? Maybe two have same risk profile? Maybe something odd like ABO alleles? • …and general test quickly becomes problematic • Compromise: additive model • $Z_A$=# of A alleles, $Z_B$=# of B alleles, etc. • Advantages: • Number of parameters does not explode with number of alleles • Test is insensitive to choice of baseline
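The additive compromise amounts to counting copies of each non-baseline allele. A minimal sketch follows, assuming genotypes are given as pairs of allele labels; the function name `additive_coding` and the `Z<allele>` key convention are made up for this illustration.

```python
def additive_coding(genotype, alleles, baseline):
    """Count copies of each non-baseline allele in a two-allele genotype."""
    return {f"Z{a}": genotype.count(a) for a in alleles if a != baseline}

abo_alleles = ["A", "B", "O"]
for g in [("A", "O"), ("A", "A"), ("A", "B"), ("O", "O")]:
    print(g, additive_coding(g, abo_alleles, baseline="O"))
```

With K alleles this gives K - 1 covariates, however many genotypes there are.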

  47. Example: four-allele marker

  48. Full genotype analysis

      Contrast      OR       LowCI    UpCI
      Z 01 vs 00    1.198    0.522    2.751
      Z 02 vs 00    0.875    0.544    1.407
      Z 03 vs 00    0.870    0.482    1.57
      Z 11 vs 00    <0.001   <0.001   >999.999
      Z 12 vs 00    0.915    0.396    2.119
      Z 13 vs 00    1.953    0.494    7.719
      Z 22 vs 00    1.953    0.839    4.549
      Z 23 vs 00    1.065    0.556    2.041
      Z 33 vs 00    0.837    0.285    2.454

      Chi-square=9.1803 on 9 d.f. p=0.4208

      Additive coding & “global test”

      Allele   OR      LowCI   UpCI
      Z1       1.045   0.633   1.725
      Z2       1.138   0.842   1.536
      Z3       1.018   0.720   1.440

      Chi-square=0.7292 on 3 d.f. p=0.8663

      Could also perform four tests (allele 0 versus all others, allele 1 versus all others, etc.), but this requires adjustment for multiple (correlated) tests

  49. Outline • Statistical review • One-locus tests • Multiple single-locus tests

  50. Motivation for testing multiple loci • Want to test as many candidates as possible • Increase odds that we can detect at least one causal gene • Motivation behind genome-wide association scans • Analytic issue: multiple testing • Want to boost power… • …by better predicting untyped variants… • E.g. haplotypes • …or by capturing gene-gene interactions • Analytic issues: multiple testing, model selection
