EPI293 Design and analysis of gene association studies
Winter Term 2006
Lecture 3: Statistical review, single-locus association tests
Peter Kraft
pkraft@hsph.harvard.edu
Bldg 2 Rm 206
2-4271
Outline • Statistical review • One-locus tests • Multiple single-locus tests
Outline • Statistical review • Pearson’s chi-square • Likelihood theory • Measures of model fit: AIC, BIC • Bayesian data analysis • One-locus tests • Multiple single-locus tests
Pearson’s chi-square • Do categorical data have a hypothesized distribution? • Are outcome and exposure independent (k × l tables)? • Do genotypes follow Hardy-Weinberg proportions? • i indexes the I categories • Test statistic T = Σᵢ (Oᵢ − Eᵢ)²/Eᵢ, comparing observed (O) and expected (E) counts • T ~ χ²_d under the null • d = no. of parameters under the alternative − no. of parameters under the null
Example: 2 × 3 table • Let n00, n01, and n02 be the counts of controls with genotypes aa, Aa, and AA, respectively • Let n10, n11, and n12 be the same for cases • n0. and n1. are the total numbers of controls and cases • n.0 is the total number of aa genotypes, etc. • T = Σᵢⱼ (nᵢⱼ − nᵢ.n.ⱼ/n..)² / (nᵢ.n.ⱼ/n..) ~ χ²₂ • 2 d.f. from 4 − 2 = 2, or from the standard formula (k − 1)(l − 1) = 2
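As a quick illustration, here is a minimal R sketch of this 2 × 3 (codominant, 2 d.f.) test, using the example genotype counts from the SAS data step later in this lecture; R is mentioned alongside SAS below as one of the available tools.

# Pearson chi-square test of independence for a 2 x 3 case-control genotype table.
# Counts taken from the worked SAS example later in the lecture (aa, Aa, AA).
counts <- matrix(c(655, 310, 37,    # controls
                   535, 401, 67),   # cases
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("control", "case"), c("aa", "Aa", "AA")))
chisq.test(counts)            # 2 d.f. test of independence
chisq.test(counts)$expected   # expected counts n_i. * n_.j / n..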
Example: test for departure from HWE • T = Σ_g (n_g − E_g)² / E_g, where the expected counts under HWE are E_aa = n(1 − p̂)², E_Aa = 2np̂(1 − p̂), E_AA = np̂², with p̂ the estimated allele frequency • Under the null T is a chi-square with 1 d.f. • Two parameters under the alternative minus one under the null
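A minimal R sketch of this HWE test, assuming we only have genotype counts (here the control counts from the previous example are reused for illustration):

# Chi-square test for departure from Hardy-Weinberg proportions (1 d.f.).
obs  <- c(aa = 655, Aa = 310, AA = 37)                 # genotype counts
n    <- sum(obs)
p    <- unname((2 * obs["AA"] + obs["Aa"]) / (2 * n))  # MLE of the A allele frequency
expd <- n * c((1 - p)^2, 2 * p * (1 - p), p^2)         # expected counts under HWE
stat <- sum((obs - expd)^2 / expd)
pchisq(stat, df = 1, lower.tail = FALSE)               # 2 free genotype freqs - 1 allele freq = 1 d.f.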
Likelihood theory • The likelihood is a function of the model parameters, given a probabilistic model and data • It is the probability of the observed data for given parameter values • Assume observations (indexed by i) are independent • Let Xᵢ be the data for observation i • θ = parameters of interest; η = “nuisance” parameters • L(θ, η) = Πᵢ Pr(Xᵢ; θ, η) • Maximize L to estimate θ and η (MLEs) • Equivalent to maximizing log L • Usually requires a computer
Example: MLE for allele frequency • Multinomial likelihood: L(p) ∝ [(1 − p)²]^n0 [2p(1 − p)]^n1 [p²]^n2, where n0, n1, n2 are the aa, Aa, AA counts and n = n0 + n1 + n2 • Maximum at 0, 1, or where the “score” U(p) = ∂/∂p log L = 0 • … so the MLE of p is p̂ = (2n2 + n1) / (2n)
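For completeness, the score calculation sketched above can be written out as a short derivation (in LaTeX form, using the genotype counts n0, n1, n2 defined above):

\log L(p) = (2n_0 + n_1)\log(1-p) + (2n_2 + n_1)\log p + \text{const.}

U(p) = \frac{\partial}{\partial p}\log L(p)
     = \frac{2n_2 + n_1}{p} - \frac{2n_0 + n_1}{1-p} = 0
\quad\Rightarrow\quad
\hat{p} = \frac{2n_2 + n_1}{2n}.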
Example: unconditional logistic regression • Pr(D = 1 | Z, X) = expit[β′Z + α′X], with J exposures of interest (Z) and K “nuisance” parameters (X, including the intercept) • No closed-form solution for the parameter estimates • Need a computer: SAS PROC LOGISTIC, R glm(), etc.
Tests based on likelihood theory • Score test • Under the null, U(θ0) ~ N(0, Var(U)) • If observations are independent, Var(U) can be estimated by the information I = −∂²/∂θ² log L • U′I⁻¹U ~ χ² with dim(θ0) d.f. • Often has a convenient closed form (e.g. McNemar’s test) • Wald test • For large enough samples: θ̂ ~ N(θ, Var(θ̂)) • If observations are independent: Var(θ̂) ≈ I⁻¹ • Leads to the usual test: (θ̂ − θ0)′ I (θ̂ − θ0) ~ χ² with dim(θ0) d.f. • “Easy” to robustify if observations are not independent • Sandwich or Huber-White estimate: Var(θ̂) ≈ I⁻¹ (Σᵢ UᵢUᵢ′) I⁻¹ • Likelihood ratio test • Intuitive test of hypotheses that constrain multiple parameters
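As an informal sketch of the Wald and sandwich ideas in R (data simulated, and the sandwich and lmtest packages are assumed to be installed; this is an illustration, not the lecture’s own example):

# Wald tests with model-based vs. "sandwich" (Huber-White) variance estimates.
library(sandwich)
library(lmtest)

set.seed(1)
z <- rbinom(1000, size = 2, prob = 0.3)                   # genotype coded 0/1/2
d <- rbinom(1000, size = 1, prob = plogis(-0.5 + 0.3*z))  # simulated case-control status
fit <- glm(d ~ z, family = binomial)

coeftest(fit)                                    # Wald tests, model-based (information) variance
coeftest(fit, vcov = vcovHC(fit, type = "HC0"))  # Wald tests, sandwich variance
# With genuinely dependent (e.g. clustered) observations, vcovCL() from the
# sandwich package would be the cluster-robust analogue.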
Likelihood ratio test • Alternative model Θ1, null model Θ0 ⊂ Θ1, i.e. the models are “nested” • LRT = 2 log LR = 2[log L(θ̂1) − log L(θ̂0)] ~ χ²_d under the null, d = dim(Θ1) − dim(Θ0) • E.g. Θ1 = {β1, β2: β1, β2 ∈ (−∞, ∞)}, Θ0 = {β1 = β2 = 0}, d = 2
Likelihood ratio test: example • Case-control study of CHD • Z is BMI, coded in tertiles, i.e. Zᵢ′ = (Zᵢ1, Zᵢ2) • Zᵢ1 = 1 if i is in the middle tertile, 0 otherwise • Zᵢ2 = 1 if i is in the top tertile, 0 otherwise • X includes the intercept and age (as a linear term) • Null: Pr(D = 1 | Z, X) = expit[α′X] (two parameters) • Alternative: Pr(D = 1 | Z, X) = expit[β′Z + α′X] (four parameters) • The likelihood ratio test has 4 − 2 = 2 degrees of freedom
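A minimal R sketch of this 2 d.f. likelihood ratio test, with simulated data standing in for the CHD example (all variable names and effect sizes here are hypothetical):

# LRT for the tertile effect, adjusting for age.
set.seed(2)
n    <- 2000
age  <- rnorm(n, 60, 8)
bmi  <- rnorm(n, 26, 4)
tert <- cut(bmi, quantile(bmi, c(0, 1/3, 2/3, 1)), include.lowest = TRUE,
            labels = c("low", "mid", "high"))
d <- rbinom(n, 1, plogis(-3 + 0.03*age + 0.2*(tert == "mid") + 0.4*(tert == "high")))

fit0 <- glm(d ~ age,        family = binomial)   # null: intercept + age (2 parameters)
fit1 <- glm(d ~ tert + age, family = binomial)   # alternative: + 2 tertile dummies (4 parameters)
anova(fit0, fit1, test = "Chisq")                # LRT on 4 - 2 = 2 d.f.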
Measures of model fit • Not all models are nested within each other • Dominant vs. recessive models for a given risk allele • Locus A versus locus B • Sometimes we are interested in model fit per se: which model(s) best describe(s) the data? • Akaike Information Criterion: AIC = −2 log L + 2K • Bayes Information Criterion: BIC = −2 log L + log(n) K • More parameters = more flexibility = smaller −2 log L; the K term is a “penalty for overfitting” • Smaller is better (but read the software manual) • AIC is an estimate of “in-sample error” using a log-likelihood loss function; BIC is a rough estimate of −2 log Pr(Model | Data)
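Continuing the R sketch from the likelihood ratio test example above (fit0 and fit1 are the null and alternative logistic fits), the same criteria can be computed directly:

# Model fit criteria; smaller is better.
# AIC = -2 log L + 2K, BIC = -2 log L + log(n) K.
AIC(fit0, fit1)
BIC(fit0, fit1)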
Bayesian data analysis • Frequentists assume there is a true model with true parameter values, which we estimate given the data • Pearson’s chi-square, likelihood theory: all frequentist • Bayesians treat the parameters (including perhaps the model form) as random variables, and calculate their posterior distribution given the data • Bayes’ Theorem: Pr(θ | Data) ∝ Pr(Data | θ) π(θ), where π(θ) is the prior distribution of θ • Advantages • Can account for “prior information” about the distribution of the parameters • Quite complicated models are mathematically tractable • Disadvantages • Requires assumptions about “prior information” • “Fully Bayes” = assumes the prior is completely known; “empirical Bayes” = assumes the prior depends on “hyperparameters” (e.g. a mean and variance) which are estimated from the data
“Fully Bayes” example • Say we collect n standardized continuous measurements, Xᵢ ~ N(μ, 1) • Say that a priori μ ~ N(0, σ0²) • Then the posterior distribution of μ has mean Σᵢ Xᵢ / (n + 1/σ0²) and variance 1 / (n + 1/σ0²) • What does this mean? (a) For n large relative to 1/σ0², “the data swamp the prior”; (b) for n small relative to 1/σ0², the prior swamps the data; (c) different priors lead to different results
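A small numeric illustration of this shrinkage in R (the values, including the sample size and prior variances, are hypothetical):

# Normal-normal posterior from the slide: Xi ~ N(mu, 1), prior mu ~ N(0, s0sq).
post <- function(x, s0sq) {
  n <- length(x)
  c(mean = sum(x) / (n + 1/s0sq), var = 1 / (n + 1/s0sq))
}
set.seed(3)
x <- rnorm(5, mean = 1)     # small sample
post(x, s0sq = 0.1)         # tight prior: posterior mean pulled toward 0
post(x, s0sq = 100)         # diffuse prior: posterior mean close to the sample mean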
Empirical Bayes example: hierarchical modeling • Say Z1, …, Z5 measure consumption of five food types • First-stage model: Pr(D = 1 | Z) = expit[β0 + β1 Z1 + … + β5 Z5] • Second-stage model (prior): β1 = π0 + π1 X1 + ε1; β2 = π0 + π1 X2 + ε2; etc. … • … where Xᵢ is the amount of the nutrient of interest in food i • “Regressing the effect of Z on X” • The prior depends on three parameters: π0, π1, and var(ε) • π0 and π1 are estimated from the data • var(ε) can be estimated from the data or treated as fixed • Or chosen to minimize prediction error • Advantages • Reduce parameter variance • Allow high-dimensional models to be fit • Disadvantages • Must make assumptions in the second-stage model • For us: what is the at-risk allele, which loci are “exchangeable”
Outline • Statistical review • One-locus tests • Diallelic • Multiallelic • Multiple single-locus tests
Simple three × two tables • Advantages • Simplicity, completeness • Robust to the true dominance pattern • Disadvantage • Statistic unreliable when there are few homozygote variants (AA) • T = Σᵢⱼ (nᵢⱼ − nᵢ.n.ⱼ/n..)² / (nᵢ.n.ⱼ/n..) has 2 d.f. under the null
Simple two × two tables • Collapse genotypes: dominant model (aa vs. Aa + AA) or recessive model (aa + Aa vs. AA) • Test statistic now has 1 d.f. under the null • Advantages • Simplicity • Disadvantages • Lose some information in the presentation • Not robust to the true dominance pattern
Simple trend test • Armitage’s trend test • Test for a linear trend in log(OR) with the number of A alleles (notation from the 2 × 3 table slide): with scores x = (0, 1, 2), T = U²/Var₀(U) ~ χ²₁, where U = Σⱼ xⱼ(n1ⱼ − n1.n.ⱼ/n..) and Var₀(U) is estimated under the null; this is the score test of the slope in an additive logistic model • The test statistic still has 1 d.f. under the null • Advantages • Simplicity; retains the information in the presentation (2 × 3 table) • More robust than the dominant and recessive tests • Disadvantage • Not as robust as the 2 d.f. test
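A minimal R sketch of the trend test, using the genotype counts from the worked SAS example later in the lecture; the resulting chi-square should agree closely with the 1 d.f. score statistic reported for the additive logistic fit below.

# Cochran-Armitage trend test on a 2 x 3 case-control genotype table.
cases  <- c(aa = 535, Aa = 401, AA = 67)        # cases by genotype
totals <- c(aa = 655, Aa = 310, AA = 37) + cases  # column totals
prop.trend.test(cases, totals)   # default scores are affine-equivalent to 0,1,2; 1 d.f.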
Allelic test • For all the previous tests, the unit of observation was the subject (genotype) • Total number of observations = n.. = number of subjects • Can also treat alleles as the unit of observation • Now the total number of observations is 2 n.. • Great! I’ve doubled my sample size! But… • … my Type I error could be inflated if the locus is out of HWE… • … and the allelic OR requires careful interpretation • Sasieni PD. From genotypes to genes: doubling the sample size. Biometrics 1997;53(4):1253-61
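A short R sketch of the allelic (2 × 2) test, again reusing the example genotype counts; building the allele table from the genotypes makes the “doubled sample size” explicit.

# Allelic test: count alleles instead of genotypes (2n.. observations).
geno <- rbind(control = c(aa = 655, Aa = 310, AA = 37),
              case    = c(aa = 535, Aa = 401, AA = 67))
allele_tab <- cbind(a = 2*geno[, "aa"] + geno[, "Aa"],
                    A = 2*geno[, "AA"] + geno[, "Aa"])
chisq.test(allele_tab, correct = FALSE)   # 1 d.f.; valid Type I error relies on HWE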
Examples • Codominant test: Pearson’s chi-square = 1.86 on 2 d.f., p = 0.39 • Allelic test: Pearson’s chi-square = 1.62 on 1 d.f., p = 0.20 • “Truth:” RR_Aa = 1.25, RR_AA = 1.5
2 × 3 (etc.) tables via logistic regression • Trick: create a genotype coding variable Z • One d.f. tests • Dominant: Z = 1 if genotype is AA or Aa, 0 otherwise • Recessive: Z = 1 if genotype is AA, 0 otherwise • Trend (a.k.a. linear or additive): Z = # of A alleles • If the genotype is AA then Z = 2, if Aa then Z = 1, etc. • The score test from this model = Armitage’s trend test • Two d.f. test • Create two “dummy” variables • Z1 = 1 if genotype is Aa, 0 otherwise • Z2 = 1 if genotype is AA, 0 otherwise • Perform a likelihood ratio test • Advantages of logistic regression • Adjust for other variables, test several loci simultaneously
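The same coding trick can be sketched in R’s glm(), using the grouped counts from the SAS data step on the next slide; the additive fit should reproduce the estimates shown in the SAS output that follows (log-OR ≈ 0.43 per A allele).

# Genotype codings via logistic regression on grouped data.
dat <- data.frame(caco = rep(c(0, 1), each = 3),
                  z    = rep(0:2, times = 2),
                  n    = c(655, 310, 37, 535, 401, 67))
dat$dom <- as.numeric(dat$z >= 1)   # dominant: Aa or AA
dat$rec <- as.numeric(dat$z == 2)   # recessive: AA only

glm(caco ~ dom, family = binomial, weights = n, data = dat)   # dominant (1 d.f.)
glm(caco ~ rec, family = binomial, weights = n, data = dat)   # recessive (1 d.f.)
glm(caco ~ z,   family = binomial, weights = n, data = dat)   # additive / trend (1 d.f.)

fit2 <- glm(caco ~ factor(z), family = binomial, weights = n, data = dat)  # codominant
fit0 <- glm(caco ~ 1,         family = binomial, weights = n, data = dat)
anova(fit0, fit2, test = "Chisq")                                          # 2 d.f. LRT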
How to fit using logistic regression (in SAS)

data example;
  input caco z n;   /* caco = case-control status, z = # of A alleles, n = cell count */
  cards;
0 0 655
0 1 310
0 2 37
1 0 535
1 1 401
1 2 67
;
run;

* Additive coding;
proc logistic descending;
  model caco = z;
  weight n;
run;

* Co-dominant coding;
proc logistic descending;
  class z (ref=first);
  model caco = z;
  weight n;
run;
Additive model

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square   DF   Pr > ChiSq
Likelihood Ratio       32.3558    1       <.0001
Score                  32.1106    1       <.0001
Wald                   31.6368    1       <.0001

Analysis of Maximum Likelihood Estimates
                          Standard       Wald
Parameter   DF  Estimate     Error   Chi-Square   Pr > ChiSq
Intercept    1   -0.1963    0.0568      11.9363       0.0006
z            1    0.4331    0.0770      31.6368       <.0001
Co-dominant model

Testing Global Null Hypothesis: BETA=0
Test                Chi-Square   DF   Pr > ChiSq
Likelihood Ratio       32.5780    2       <.0001
Score                  32.4012    2       <.0001
Wald                   32.0451    2       <.0001

Analysis of Maximum Likelihood Estimates
                          Standard       Wald
Parameter   DF  Estimate     Error   Chi-Square   Pr > ChiSq
Intercept    1    0.2163    0.0753       8.2417       0.0041
z 1          1    0.0411    0.0871       0.2232       0.6366
z 2          1    0.3775    0.1402       7.2486       0.0071
Adjusting for covariates

data example;
  input caco z x n;   /* x = binary covariate */
  cards;
0 0 0 655
0 1 0 310
0 2 0 37
1 0 0 535
1 1 0 401
1 2 0 67
0 0 1 642
0 1 1 311
0 2 1 31
1 0 1 542
1 1 1 391
1 2 1 59
;
run;

* A) Covariate only;
proc logistic descending;
  model caco = x;
  weight n;
run;

* B) Genotype (co-dominant) plus covariate;
proc logistic descending;
  class z (ref=first);
  model caco = z x;
  weight n;
run;

A) Model Fit Statistics
              Intercept   Intercept and
Criterion          Only      Covariates
AIC            5520.818        5522.805
SC             5521.302        5523.775
-2 Log L       5518.818        5518.805

B) Model Fit Statistics
              Intercept   Intercept and
Criterion          Only      Covariates
AIC            5520.818        5468.033
SC             5521.302        5469.972
-2 Log L       5518.818        5460.033

Likelihood ratio test for genotype, adjusted for x: T = 5518.8 − 5460.0 = 58.8 on 2 d.f.
Which test to use? • No fishing expeditions (without paying a price)!
Which test to use? (cont) • [Figure: power comparisons; different colors = different true models, points compare power of the different analysis models] • The codominant test offers a gain in power under a true recessive model, for little cost under other true models
PK’s soapbox • For complex diseases, the “mode of inheritance” (dominant, recessive, et cetera) is an antiquated and potentially dangerous concept • “Mode of inheritance” was developed for simple Mendelian diseases with near-complete penetrance • Complex diseases involve multiple loci and have high phenocopy rates • A marker that is in LD with a causal gene will “look co-dominant,” even if the causal gene is actually recessive or dominant • Few can resist the temptation to “go fishing” • Reporting results for the “most significant” coding leads to inflated Type I error and overly narrow CIs • Very difficult, if not impossible, to choose between competing models • Suggestion: emphasize results from the co-dominant coding • Co-dominant coding is “model free” (the model is saturated) • ORs convey more information • Generally retains power relative to other codings (even though the test has 2 d.f. instead of 1) • Switch to dominant coding only if homozygous carriers are very rare • (As yet) unpublished simulation study by Jean Yen supports these hypotheses
If the causal variant is dominant (recessive, codominant), what does the marker-trait correlation pattern look like? • [Next few slides borrowed from Bruce Weir]
Multi-allelic tests • The general (full genotype) test has K(K+1)/2 − 1 d.f. for K alleles (e.g. 9 d.f. for 4 alleles, as in the example below) • The number of d.f. gets large quickly • An even bigger problem is sparse cells • Can we use information about the dominance pattern?
Multi-allelic dominance • E.g. the ABO blood group • “A is dominant to O”; “B is dominant to O” • ZA = 1 if G ∈ {AO, AA}, 0 otherwise • ZB = 1 if G ∈ {BO, BB}, 0 otherwise • ZO = 1 if G = OO, 0 otherwise • ZAB = 1 if G = AB, 0 otherwise • Have to leave one of these variables out as the reference group • We do not generally know multi-allelic dominance relations… • Maybe one allele carries risk? Maybe two have the same risk profile? Maybe something odd like the ABO alleles? • …and the general test quickly becomes problematic • Compromise: additive model • ZA = # of A alleles, ZB = # of B alleles, etc. • Advantages • The number of parameters does not explode with the number of alleles • The test is insensitive to the choice of baseline
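A rough R sketch of the additive allele-count coding for a multi-allelic locus; the data here are simulated under the null, and the allele labels and frequencies are hypothetical.

# Additive coding: one allele-count variable per allele, one dropped as baseline
# (the counts always sum to 2), followed by a "global" additive test.
set.seed(4)
allele_labels <- c("0", "1", "2", "3")
n  <- 1500
a1 <- sample(allele_labels, n, replace = TRUE, prob = c(0.5, 0.2, 0.2, 0.1))
a2 <- sample(allele_labels, n, replace = TRUE, prob = c(0.5, 0.2, 0.2, 0.1))
d  <- rbinom(n, 1, 0.4)

Z    <- sapply(allele_labels, function(a) (a1 == a) + (a2 == a))  # copies of each allele
Zsub <- Z[, c("1", "2", "3")]                                     # allele "0" as baseline
fit  <- glm(d ~ Zsub, family = binomial)
fit0 <- glm(d ~ 1,    family = binomial)
anova(fit0, fit, test = "Chisq")   # 3 d.f. global additive test, as in the example below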
Full genotype analysis

Contrast       OR       LowCI    UpCI
Z 01 vs 00     1.198    0.522    2.751
Z 02 vs 00     0.875    0.544    1.407
Z 03 vs 00     0.870    0.482    1.57
Z 11 vs 00    <0.001   <0.001   >999.999
Z 12 vs 00     0.915    0.396    2.119
Z 13 vs 00     1.953    0.494    7.719
Z 22 vs 00     1.953    0.839    4.549
Z 23 vs 00     1.065    0.556    2.041
Z 33 vs 00     0.837    0.285    2.454
Chi-square = 9.1803 on 9 d.f., p = 0.4208

Additive coding & “global test”

Allele   OR      LowCI   UpCI
Z1       1.045   0.633   1.725
Z2       1.138   0.842   1.536
Z3       1.018   0.720   1.440
Chi-square = 0.7292 on 3 d.f., p = 0.8663

Could also perform four tests: allele 0 versus all others, allele 1 versus all others, etc., but this requires adjustment for multiple (correlated) tests
Outline • Statistical review • One-locus tests • Multiple single-locus tests
Motivation for testing multiple loci • Want to test as many candidates as possible • Increase odds that we can detect at least one causal gene • Motivation behind genome-wide association scans • Analytic issue: multiple testing • Want to boost power… • …by better predicting untyped variants… • E.g. haplotypes • …or by capturing gene-gene interactions • Analytic issues: multiple testing, model selection