Association Analysis, Logistic Regression, R and S-PLUS

Association Analysis, Logistic Regression, R and S-PLUS Richard Mott http://bioinformatics.well.ox.ac.uk/lectures/

Logistic Regression in Statistical Genetics • Applicable to Association Studies • Data: • Binary outcomes (eg disease status) • Dependent on genotypes [+ sex, environment] • Aim is to identify which factors influence the outcome • Rigorous tests of statistical significance • Flexible modelling language • Generalisation of Chi-Squared Test

What is R ? • Statistical analysis package • Free • Similar to commercial package S-PLUS • Runs on Unix, Windows, Mac • www.r-project.org • Many packages for statistical genetics, microarray analysis available in R • Easily Programmable

Modelling in R • Data for individual labelled i=1…n: • Response yi • Genotypes gij for markers j=1..m

Coding Unphased Genotypes • Several possibilities: • AA, AG, GG original genotypes • 12, 21, 22 • 1, 2, 3 • 0, 1, 2 # of G alleles • Missing Data • NA default in R

Using R • Load genetic logistic regression tools • > source(‘logistic.R’) • Read data table from file • > t <- read.table(‘geno.dat’, header=TRUE) • Column names • names(t) • t$yresponse (0,1) • t$m1, t$m2, ….Genotypes for each marker

Contigency Tables in R • ftable(t$y,t$m31)prints the contingency table > ftable(t$y,t$m31) 11 12 22 0 515 387 75 1 28 11 2 >

Chi-Squared Test in R > chisq.test(t$y,t$m31) Pearson's Chi-squared test data: t$y and t$m31 X-squared = 3.8424, df = 2, p-value = 0.1464 Warning message: Chi-squared approximation may be incorrect in: chisq.test(t$y, t$m31) >

The Logistic Model • Prob(Yi=0) = exp(hi)/(1+exp(hi)) • hi = Sjxij bj- Linear Predictor • xij– Design Matrix (genotypes etc) • bj– Model Parameters (to be estimated) • Model is investigated by • estimating the bj’s by maximum likelihood • testing if the estimates are different from 0

The Logistic FunctionProb(Yi=0) = exp(hi)/(1+exp(hi)) Prob(Y=0) h

Types of genetic effect at a single locus

Additive Genotype Model • Code genotypes as • AA x=0, • AG x=1, • GG x=2 • Linear Predictor • h = b0 + xb1 • P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) • PAA = P(Y=0|x=0) = exp(b0)/(1+exp(b0)) • PAG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1)) • PGG = P(Y=0|x=2) = exp(b0 + 2b1)/(1+exp(b0 + 2b1))

Additive Model: b0 = -2 b1 = 2PAA = 0.12 PAG = 0.50 PGG = 0.88 Prob(Y=0) h

Additive Model: b0 = 0 b1 = 2PAA = 0.50 PAG = 0.88 PGG = 0.98 Prob(Y=0) h

Recessive Model • Code genotypes as • AA x=0, • AG x=0, • GG x=1 • Linear Predictor • h = b0 + xb1 • P(Y=0|x) = exp(b0 + xb1)/(1+exp(b0 + xb1)) • PAA = PAG = P(Y=0|x=0) = exp(b0)/(1+exp(b0)) • PGG = P(Y=0|x=1) = exp(b0 + b1)/(1+exp(b0 + b1))

Recessive Model: b0 = 0 b1 = 2PAA = PAG = 0.50 PGG = 0.88 Prob(Y=0) h

Genotype Model • Each genotype has an independent probability • Code genotypes as (for example) • AA x=0, y=0 • AG x=1, y=0 • GG x=0, y=1 • Linear Predictor • h = b0 + xb1+yb2 two parameters • P(Y=0|x) = exp(b0 + xb1+yb2)/(1+exp(b0 + xb1+yb2)) • PAA = P(Y=0|x=0,y=0) = exp(b0)/(1+exp(b0)) • PAG = P(Y=0|x=1,y=0) = exp(b0 + b1)/(1+exp(b0 + b1)) • PGG = P(Y=0|x=0,y=1) = exp(b0 + b2)/(1+exp(b0 + b2))

Genotype Model: b0 = 0 b1 = 2 b2 = -1 PAA = 0.5 PAG = 0.88 PGG = 0.27 Prob(Y=0) h

Models in Rresponse ygenotype g

Data Transformation • g <- t$m1 • use these functions to treat a genotype vector in a certain way: • a <- additive(g) • r <- recessive(g) • d <- dominant(g) • g <- genotype(g)

Fitting the Model • afit <- glm( t$y ~ additive(g),family=‘binomial’) • rfit <- glm( t$y ~ recessive(g),family=‘binomial’) • dfit <- glm( t$y ~ dominant(g),family=‘binomial’) • gfit <- glm( t$y ~ genotype(g),family=‘binomial’) • Equivalent models: • genotype = dominant + recessive • genotype = additive + recessive • genotype = additive + dominant • genotype ~ standard chi-squared test of genotype association

Parameter Estimates > summary(glm( t$y ~ genotype(t$m31), family='binomial')) Coefficients: Estimate Std. Error z value Pr(>|z|) b0 (Intercept) -2.9120 0.1941 -15.006 <2e-16 *** b1 genotype(t$m31)12 -0.6486 0.3621 -1.791 0.0733 . b2 genotype(t$m31)22 -0.7124 0.7423 -0.960 0.3372 --- Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 >

Analysis of DevianceChi-Squared Test > anova(glm( t$y ~ genotype(t$m31), family='binomial')) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m31) 2 3.96 1015 339.76

Model Comparison • Compare general model with additive, dominant or recessive models: > afit <- glm(t$y ~ additive(t$m20)) > gfit <- glm(t$y ~ genotype(t$m20)) > anova(afit,gfit) Analysis of Deviance Table Model 1: t$y ~ additive(t$m20) Model 2: t$y ~ genotype(t$m20) Resid. Df Resid. Dev Df Deviance 1 1016 38.301 2 1015 38.124 1 0.177 >

Scanning all Markers > logscan(t,model=‘additive’) Deviance DF Pval LogPval m1 8.604197e+00 1 3.353893e-03 2.474450800 m2 7.037336e+00 1 7.982767e-03 2.097846522 m3 6.603882e-01 1 4.164229e-01 0.380465360 m4 3.812860e+00 1 5.086054e-02 1.293619014 m5 7.194936e+00 1 7.310960e-03 2.136025588 m6 2.449127e+00 1 1.175903e-01 0.929628598 m7 2.185613e+00 1 1.393056e-01 0.856031566 m8 1.227191e+00 1 2.679539e-01 0.571939852 m9 2.532562e+01 1 4.842353e-07 6.314943565 m10 5.729634e+01 1 3.748518e-14 13.426140380 m11 3.107441e+01 1 2.483233e-08 7.604982503 … … …

Multilocus Models • Can test the effects of fitting two or more markers simultaneously • Several multilocus models are possible • Interaction Model assumes that each combination of genotypes has a different effect • eg t$y ~ t$m10 * t$m15

Multi-Locus Models > f <- glm( t$y ~ genotype(t$m13) * genotype(t$m26) , family='binomial') > anova(f) Analysis of Deviance Table Model: binomial, link: logit Response: t$y Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 1017 343.71 genotype(t$m13) 2 108.68 1015 235.03 genotype(t$m26) 2 1.14 1013 233.89 genotype(t$m13):genotype(t$m26) 3 6.03 1010 227.86 > pchisq(6.03,2,lower.tail=F) calculate p-value [1] 0.04904584

Adding the effects of Sex and other Covariates • Read in sex and other covariate data, eg. age from a file into variables, say a$sex, a$age • Fit models of the form • fit1 <- glm(t$y ~ additive(t$m10) + a$sex + a$age, family=‘binomial’) • fit2 <- glm(t$y ~ a$sex + a$age, family=‘binomial’)

Adding the effects of Sex and other Covariates • Compare models using anova – test if the effect of the marker m10 is significant after taking into account sex and age • anova(fit1,fit2)

Multiple Testing • Take care interpreting significance levels when performing multiple tests • Linkage disequilibrium can reduce the effective number of independent tests • Permutation is a safe procedure to determine significance • Repeat j=1..N times: • Permute disease status y between individuals • Fit all markers • Record maximum deviance maxdev[j] over all markers • Permutation p-value for a marker is the proportion of times the permuted maximum deviance across all markers exceeds the observed deviance for the marker • logscan(t,permute=1000)slow!

Haplotype Association • Haplotype Association • Different from multiple genotype models • Phase taken into account • Haplotype association can be modelled in a similar logistic framework • Treat haplotypes as extended alleles • Fit additive, recessive, dominant & genotype models as before • Eg haplotypes are h = AAGCAT, ATGCTT, etc • y ~ additive(h) • y ~ dominant(h) etc

Association Analysis, Logistic Regression, R and S-PLUS

Association Analysis, Logistic Regression, R and S-PLUS

Presentation Transcript

Logistic regression

Logistic Regression

Linear Discriminant Analysis and Logistic Regression

Logistic Regression

Logistic Regression

Logistic regression

Logistic Regression

Logistic Regression and Discriminant Function Analysis

Logistic Regression

Multiple Discriminant Analysis and Logistic Regression

Logistic Regression Analysis

Logistic Regression

Logistic Regression

Logistic Regression

Logistic Regression

Logistic regression

Logistic Regression

Logistic Regression

Logistic Regression