310 likes | 555 Views
Association Analysis, Logistic Regression, R and S-PLUS. Richard Mott http://bioinformatics.well.ox.ac.uk/lectures/. Logistic Regression in Statistical Genetics. Applicable to Association Studies Data: Binary outcomes (eg disease status) Dependent on genotypes [+ sex, environment]
E N D
Association Analysis, Logistic Regression, R and S-PLUS Richard Mott http://bioinformatics.well.ox.ac.uk/lectures/
Logistic Regression in Statistical Genetics • Applicable to Association Studies • Data: • Binary outcomes (eg disease status) • Dependent on genotypes [+ sex, environment] • Aim is to identify which factors influence the outcome • Rigorous tests of statistical significance • Flexible modelling language • Generalisation of Chi-Squared Test
What is R ? • Statistical analysis package • Free • Similar to commercial package S-PLUS • Runs on Unix, Windows, Mac • www.r-project.org • Many packages for statistical genetics, microarray analysis available in R • Easily Programmable
Modelling in R • Data for individual labelled i=1…n: • Response yi • Genotypes gij for markers j=1..m
Coding Unphased Genotypes • Several possibilities: • AA, AG, GG original genotypes • 12, 21, 22 • 1, 2, 3 • 0, 1, 2 # of G alleles • Missing Data • NA default in R
Using R • Load genetic logistic regression tools • > source(‘logistic.R’) • Read data table from file • > t <- read.table(‘geno.dat’, header=TRUE) • Column names • names(t) • t$yresponse (0,1) • t$m1, t$m2, ….Genotypes for each marker
Contigency Tables in R • ftable(t$y,t$m31)prints the contingency table > ftable(t$y,t$m31) 11 12 22 0 515 387 75 1 28 11 2 >
Chi-Squared Test in R > chisq.test(t$y,t$m31) Pearson's Chi-squared test data: t$y and t$m31 X-squared = 3.8424, df = 2, p-value = 0.1464 Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) >
The Logistic Model • Prob(Yi=0) = exp(hi)/(1+exp(hi)) • hi = Sjxij bj- Linear Predictor • xij– Design Matrix (genotypes etc) • bj– Model Parameters (to be estimated) • Model is investigated by • estimating the bj’s by maximum likelihood • testing if the estimates are different from 0
The Logistic FunctionProb(Yi=0) = exp(hi)/(1+exp(hi)) Prob(Y=0) h
Additive Genotype Model • Code genotypes as • AA x=0, • AG x=1, • GG x=2 • Linear Predictor • h = b0 + xb1 • P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) • PAA = P(Y=0|x=0) = exp(b0)/(1+exp(b0)) • PAG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1)) • PGG = P(Y=0|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))
Additive Model: b0 = -2 b1 = 2PAA = 0.12 PAG = 0.50 PGG = 0.88 Prob(Y=0) h
Additive Model: b0 = 0 b1 = 2PAA = 0.50 PAG = 0.88 PGG = 0.98 Prob(Y=0) h
Recessive Model • Code genotypes as • AA x=0, • AG x=0, • GG x=1 • Linear Predictor • h = b0 + xb1 • P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) • PAA = PAG = P(Y=0|x=0) = exp(b0)/(1+exp(b0)) • PGG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))
Recessive Model: b0 = 0 b1 = 2PAA = PAG = 0.50 PGG = 0.88 Prob(Y=0) h
Genotype Model • Each genotype has an independent probability • Code genotypes as (for example) • AA x=0, y=0 • AG x=1, y=0 • GG x=0, y=1 • Linear Predictor • h = b0 + xb1+yb2 two parameters • P(Y=0|x) = exp(b0 + xb1+yb2)/(1+exp(b0 + xb1+yb2)) • PAA = P(Y=0|x=0,y=0) = exp(b0)/(1+exp(b0)) • PAG = P(Y=0|x=1,y=0) = exp(b0 + b1)/(1+exp(b0 + b1)) • PGG = P(Y=0|x=0,y=1) = exp(b0 + b2)/(1+exp(b0 + b2))
Genotype Model: b0 = 0 b1 = 2 b2 = -1 PAA = 0.5 PAG = 0.88 PGG = 0.27 Prob(Y=0) h
Data Transformation • g <- t$m1 • use these functions to treat a genotype vector in a certain way: • a <- additive(g) • r <- recessive(g) • d <- dominant(g) • g <- genotype(g)
Fitting the Model • afit <- glm( t$y ~ additive(g),family=‘binomial’) • rfit <- glm( t$y ~ recessive(g),family=‘binomial’) • dfit <- glm( t$y ~ dominant(g),family=‘binomial’) • gfit <- glm( t$y ~ genotype(g),family=‘binomial’) • Equivalent models: • genotype = dominant + recessive • genotype = additive + recessive • genotype = additive + dominant • genotype ~ standard chi-squared test of genotype association
Parameter Estimates > summary(glm( t$y ~ genotype(t$m31), family='binomial')) Coefficients: Estimate Std. Error z value Pr(>|z|) b0 (Intercept) -2.9120 0.1941 -15.006 <2e-16 *** b1 genotype(t$m31)12 -0.6486 0.3621 -1.791 0.0733 . b2 genotype(t$m31)22 -0.7124 0.7423 -0.960 0.3372 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >
Analysis of DevianceChi-Squared Test > anova(glm( t$y ~ genotype(t$m31), family='binomial')) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m31) 2 3.96 1015 339.76
Model Comparison • Compare general model with additive, dominant or recessive models: > afit <- glm(t$y ~ additive(t$m20)) > gfit <- glm(t$y ~ genotype(t$m20)) > anova(afit,gfit) Analysis of Deviance Table Model 1: t$y ~ additive(t$m20) Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance 1 1016 38.301 2 1015 38.124 1 0.177 >
Scanning all Markers > logscan(t,model=‘additive’) Deviance DF Pval LogPval m1 8.604197e+00 1 3.353893e-03 2.474450800 m2 7.037336e+00 1 7.982767e-03 2.097846522 m3 6.603882e-01 1 4.164229e-01 0.380465360 m4 3.812860e+00 1 5.086054e-02 1.293619014 m5 7.194936e+00 1 7.310960e-03 2.136025588 m6 2.449127e+00 1 1.175903e-01 0.929628598 m7 2.185613e+00 1 1.393056e-01 0.856031566 m8 1.227191e+00 1 2.679539e-01 0.571939852 m9 2.532562e+01 1 4.842353e-07 6.314943565 m10 5.729634e+01 1 3.748518e-14 13.426140380 m11 3.107441e+01 1 2.483233e-08 7.604982503 … … …
Multilocus Models • Can test the effects of fitting two or more markers simultaneously • Several multilocus models are possible • Interaction Model assumes that each combination of genotypes has a different effect • eg t$y ~ t$m10 * t$m15
Multi-Locus Models > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial') > anova(f) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m13) 2 108.68 1015 235.03 genotype(t$m26) 2 1.14 1013 233.89 genotype(t$m13):genotype(t$m26) 3 6.03 1010 227.86 > pchisq(6.03,2,lower.tail=F) calculate p-value [1] 0.04904584
Adding the effects of Sex and other Covariates • Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age • Fit models of the form • fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) • fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’)
Adding the effects of Sex and other Covariates • Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age • anova(fit1,fit2)
Multiple Testing • Take care interpreting significance levels when performing multiple tests • Linkage disequilibrium can reduce the effective number of independent tests • Permutation is a safe procedure to determine significance • Repeat j=1..N times: • Permute disease status y between individuals • Fit all markers • Record maximum deviance maxdev[j] over all markers • Permutation p-value for a marker is the proportion of times the permuted maximum deviance across all markers exceeds the observed deviance for the marker • logscan(t,permute=1000)slow!
Haplotype Association • Haplotype Association • Different from multiple genotype models • Phase taken into account • Haplotype association can be modelled in a similar logistic framework • Treat haplotypes as extended alleles • Fit additive, recessive, dominant & genotype models as before • Eg haplotypes are h = AAGCAT, ATGCTT, etc • y ~ additive(h) • y ~ dominant(h) etc