This workshop provides an introduction to linear regression, logistic model, survival analysis, and statistics for genome-wide association studies. It includes applications to genetic association studies, lecture notes, and R functions.
Canadian Bioinformatics Workshops www.bioinformatics.ca
Essential Statistics in Biology: Getting the Numbers Right - II
Laurent Briollais
Samuel Lunenfeld Research Institute, University of Toronto
laurent@lunenfeld.ca
Outline: what you’ll find... • Introduction to: • linear regression • the logistic model • survival analysis • statistics for genome-wide association studies • Applications to genetic association studies • R functions
References • Lecture notes from Frank Harrell: http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/BioMod • Hosmer & Lemeshow (1999). Applied Survival Analysis: Regression Modelling of Time to Event Data. NY: Wiley • My friends Karen Kopciuk, Yun-Hee Choi and (almost) anonymous web contributors for providing some slides and R code
R packages used for this talk • library("genetics") • library("fdrtool") • library("haplo.stats") • library("survival") • library("GenABEL")
Outline • General concepts • A few examples • Simple linear regression • Multivariate linear regression • Regression modeling strategies • A case study in R: CEPH data
General Concepts • Regression is the study of the dependence between a response variable (y), the dependent variable, and one or several predictors (x), the independent variables. • Regression can be used for different purposes: estimation, prediction, hypothesis testing, ... • It is a simple representation/modeling of the dependence between several variables. • BUT it makes assumptions (a linear relationship, normal errors: a parametric framework) that need to be checked carefully.
Statistical Modeling Procedure [Flowchart: Model Specification -> Parameter Estimation -> Model Adequate? If No, return to Model Specification; if Yes, Use Model.]
Example 1: inheritance of height. Y = daughter's height; X = mother's height. [Figure: scatterplot with reference lines of slope = 1 and slope < 1.] From Weisberg's book.
Example 2: Microarray Study. Gene expression response to dose effect. [Figure: dose-response curves for seven genes, Gene 1 through Gene 7.] Y = gene expression; X = dose of treatment. From Lobenhofer et al., 2004.
Example 3: Sib-pair linkage analysis. From Haseman and Elston, 1972. Y = (Y1 - Y2)² = squared sib-pair trait difference (phenotypic resemblance); X = estimated proportion of alleles the sibs share (genotypic resemblance).
Example 4: Association studies. Genotype-by-phenotype plots illustrating the effect of genotype (x-axis) on standardized general cognitive ability scores (y-axis). From Butcher et al., Genes Brain Behav. 2008 June; 7(4): 435–446.
Case study: A genetic association study • CEPH pedigrees, consisting of 12 multigenerational Caucasian families from Utah including 107 individuals. DNA for the CEPH family pedigrees is available for genetic studies. • The marker data consist of genotypes for 20 SNP markers; six of them are genotyped for all individuals, while the remaining 14 are genotyped in only 50-54 individuals. • The study looks for an association between these SNPs and a gene expression phenotype (mRNA). • http://www.sph.umich.edu/csg/abecasis/Merlin/tour/assoc.html
Case study: previous results. Very significant result!
Case study: trait distribution
R code:
ceph.data <- read.table(paste(my.directory, "ceph_data.txt", sep=""), header=T, na.strings="0/0")
attach(ceph.data)
allele.snp1 <- allele.count(genotype(snp1), 1)
plot(allele.snp1, qt)
R output
Simple Linear Regression Day II. Section VI 1/1/2020 E(Y|X): population expected value, or long-run average, of y conditioned on the value of x. E.g.: the population average blood pressure for a 30-year-old. • α : y-intercept • β : slope of y on x Simple linear regression is used when • only two variables are of interest • one variable is a response and one a predictor • no adjustment is needed for confounding or other between-subject variation
The Model Specification • E(y|x) = α + βx • y = α + βx + e • e is a random error (residual) representing variation between subjects in y, present even if x is constant, e.g. variation in blood pressure among patients of the same age
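The two-line model above is easy to make concrete. Here is a minimal sketch (not from the slides; the parameter values are hypothetical) that simulates data from y = α + βx + e and recovers the parameters with lm():

```r
## A minimal sketch (values hypothetical): simulate from the model and
## recover alpha and beta with lm().
set.seed(42)
n     <- 100
alpha <- 9.7    # hypothetical intercept
beta  <- -0.15  # hypothetical slope
x <- rbinom(n, size = 2, prob = 0.3)  # e.g. an allele count: 0, 1 or 2
e <- rnorm(n, mean = 0, sd = 0.45)    # between-subject random error
y <- alpha + beta * x + e

fit <- lm(y ~ x)
coef(fit)  # estimates should be close to alpha and beta
```

With a larger n, the estimates converge to the true α and β; the residual standard error estimates the sd of e.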
Assumptions • Linearity • σ² is constant, independent of x • Observations (the e's) are independent of each other • For proper statistical inference (CIs, p-values), y (e) is normal conditional on x
Estimation We need a criterion for what counts as good estimates. One criterion is to choose the values of the two parameters that minimize the sum of squared errors (SSE) in predicting individual subject responses. Let a, b be guesses for α, β. For a sample of size n: (x1,y1), (x2,y2), …, (xn,yn), SSE = Σ (yi - a - b·xi)². The values that minimize SSE are the least squares estimates.
Estimates These are obtained from b = Lxy / Lxx and a = ȳ - b·x̄, where Lxy = Σ (xi - x̄)(yi - ȳ) and Lxx = Σ (xi - x̄)². Note: a term of Lxy is positive when x and y are concordant, in terms of both being above their means or both being below their means.
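As a sanity check, the closed-form least squares estimates can be computed by hand and compared with lm(); the small data set below is made up purely for illustration:

```r
## Hand computation of the least squares estimates (made-up data):
##   b = Lxy / Lxx,  a = ybar - b * xbar
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

Lxy <- sum((x - mean(x)) * (y - mean(y)))
Lxx <- sum((x - mean(x))^2)
b   <- Lxy / Lxx              # slope estimate
a   <- mean(y) - b * mean(x)  # intercept estimate

c(a, b)                       # agrees with coef(lm(y ~ x))
```

For these data Lxy = 19.6 and Lxx = 10, so b = 1.96 and a = 0.14, exactly what lm() returns.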
Case study: R
R code:
lm.snp1 <- lm(qt ~ allele.snp1, data=ceph.data)
summary(lm.snp1)
R output
Residuals:
     Min       1Q   Median       3Q      Max
-1.48691 -0.29568 -0.00657  0.36666  0.81169
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  9.71522    0.06212 156.405   <2e-16 ***
allele.snp1 -0.15503    0.06773  -2.289   0.0241 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4503 on 105 degrees of freedom
Multiple R-Squared: 0.04753, Adjusted R-squared: 0.03845
F-statistic: 5.239 on 1 and 105 DF, p-value: 0.02409
Case study: R R code: xyplot(qt ~ allele.snp1, data=ceph.data, type=c('p','r')) R output y=9.71-0.16x
Case study: R
R code:
anova(lm.snp1)
SSE <- sum(lm.snp1$res^2)
print(SSE)
ybar <- mean(qt)   # ybar must be defined before use
SSR <- sum((qt - lm.snp1$res - ybar)^2)
print(SSR)
SST <- SSR + SSE
print(SST)
F.stat <- (SSR/1) / (SSE/(length(qt)-2))
print(F.stat)
pf(F.stat, 1, length(qt)-2, lower.tail=F)
Anova table
Response: qt
             Df  Sum Sq Mean Sq F value  Pr(>F)
allele.snp1   1  1.0624  1.0624  5.2391 0.02409 *
Residuals   105 21.2919  0.2028
Computing the F-statistic Remember that the F-statistic is the ratio of the regression mean square to the residual mean square: F = (SSRegression / (p - 1)) / (SSResidual / (n - p))
Classical ANOVA analysis • Model: y = α + β1·x1 + β2·x2 + e, where x1 and x2 are 0/1 indicators of current and previous exposure • α : mean for controls • β1 : mean for currently exposed minus mean for controls • β2 : mean for previously exposed minus mean for controls • β1 - β2 : mean for previously exposed minus mean for currently exposed • In ANOVA, • SSR is called the sum of squares between treatments • SSE is called the sum of squares within treatments • No need to learn formulas specifically for ANOVA
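A sketch of how this coding arises in R, using hypothetical three-group exposure data: with R's default treatment contrasts, the intercept is the control-group mean and each coefficient is a group mean minus the control mean, matching the slide's α, β1 and β2:

```r
## Hypothetical 3-group exposure data: ANOVA fitted as a linear model.
## With R's default treatment coding, the intercept is the control mean
## and each coefficient is a group mean minus the control mean.
set.seed(1)
group <- factor(rep(c("control", "current", "previous"), each = 20))
mu    <- c(control = 10, current = 12, previous = 11)  # made-up group means
y     <- rnorm(60, mean = mu[as.character(group)], sd = 1)

fit <- lm(y ~ group)
coef(fit)    # (Intercept) ~ alpha; groupcurrent ~ beta1; groupprevious ~ beta2
anova(fit)   # SS between treatments (SSR) and within treatments (SSE)
```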
Classical ANOVA analysis
R code:
anova(lm(qt ~ snp1, data=ceph.data))
R output
Analysis of Variance Table
Response: qt
           Df Sum Sq Mean Sq F value Pr(>F)
snp1        2   1.21    0.60    2.97  0.056
Residuals 104  21.14    0.20
Prediction R code: predict(lm.snp1, newdata=data.frame(allele.snp1=c(0,1)), interval=c("prediction"), level=0.95) R output fit lower upper 1 9.71 8.81 10.61 2 9.56 8.66 10.45
Model Adequacy (Diagnostic) Check We need to check residual plots to validate the underlying assumptions: • the relationship between response and regressor is linear (at least approximately) • the error term has zero mean • the error term has constant variance • the errors are normally distributed (required for tests and intervals)
Residual Plot: Residual vs. Fitted Value [Figure: four example panels, one adequate (random scatter) and three inadequate (systematic patterns).] Check constant variance and linearity, and look for potential outliers. Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis
Assessing goodness of fit Plot of residuals vs. predicted values R code: plot(predict(lm.snp1), residuals(lm.snp1)) abline(h=0, lty=2) R output
Backup slide Residual Plot: Normal Probability Plot [Figure: five example panels, one adequate (points on a straight line) and four inadequate; y-axis: cumulative probability of the normal distribution.] Check the normality assumption. Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis
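In R the normal probability plot of the residuals is drawn with qqnorm() and qqline(); the sketch below uses simulated data, and for the case study one would substitute residuals(lm.snp1):

```r
## Sketch: normal Q-Q plot of residuals to check the normality assumption.
## Simulated data here; for the case study, use residuals(lm.snp1) instead.
set.seed(7)
x <- rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)
r <- residuals(fit)

qqnorm(r)           # points near a straight line support normality
qqline(r, lty = 2)
## plot(fit, which = 2) draws the same diagnostic directly.
```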
Multiple Linear Regression Assume the variables act linearly on y. Model: E(y|x1, …, xp) = α + β1·x1 + … + βp·xp, or y = α + β1·x1 + … + βp·xp + e. For two x-variables: E(y|x1, x2) = α + β1·x1 + β2·x2. Estimated equation: ŷ = a + b1·x1 + b2·x2. Least squares criterion for fitting the model (estimating the parameters): solve for a, b1, b2 to minimize SSE.
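A minimal sketch of a two-predictor least squares fit, using simulated data with hypothetical parameter values (not the case-study estimates):

```r
## Simulated two-predictor example (hypothetical parameter values):
set.seed(3)
n  <- 100
x1 <- rbinom(n, size = 2, prob = 0.3)  # e.g. an allele count
x2 <- rbinom(n, size = 1, prob = 0.5)  # e.g. a sex indicator
y  <- 9.7 - 0.15 * x1 + 0.1 * x2 + rnorm(n, sd = 0.45)

fit <- lm(y ~ x1 + x2)
coef(fit)   # a, b1, b2 chosen to minimize the SSE
```

Each bj is a partial effect: the change in E(y) per unit change in xj, holding the other predictor fixed.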
Inference A general principle in regression models: a set of variables can be tested for their combined partial effects by removing that set of variables from the model and measuring the harm (increase in SSE) done to the model. Let "full" refer to quantities computed from the full model including all variables, and "reduced" to the model containing only the adjustment variables and not the variables being tested. Dropping variables increases SSE and decreases SSR, unless the dropped variables had exactly zero slope estimates in the full model (which never happens).
Inference - cont’d • SSEreduced - SSEfull = SSRfull - SSRreduced, so the numerator of the F-test can use either SSE or SSR • Form of the partial F-test: the change in SS when dropping the variables of interest, divided by the change in d.f., then divided by the MSE; the MSE is chosen as the one that best estimates σ², namely the MSE from the full model: F = ((SSEreduced - SSEfull) / q) / MSEfull • The full model has p slopes; suppose we want to test q of the slopes
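In R the partial F-test is available directly via anova(reduced, full); the sketch below (simulated data) drops q = 2 predictors from a full model with p = 3 slopes and verifies the hand computation:

```r
## Partial F-test by dropping q = 2 of p = 3 slopes (simulated data):
## F = ((SSE_reduced - SSE_full) / q) / MSE_full
set.seed(5)
n  <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.8 * x1 + rnorm(n)   # x2 and x3 have zero true effect

full    <- lm(y ~ x1 + x2 + x3)
reduced <- lm(y ~ x1)

## Hand computation, matching anova(reduced, full):
q    <- 2
Fnum <- (sum(residuals(reduced)^2) - sum(residuals(full)^2)) / q
Fden <- sum(residuals(full)^2) / full$df.residual  # MSE of the full model
Fnum / Fden

anova(reduced, full)   # same partial F-test, with its p-value
```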
Case study: R R code: xyplot(qt ~ allele.snp1|sex, data=ceph.data, type=c('p','r')) R output
Case study: R
R code:
anova(lm(qt ~ allele.snp1 + sex, data=ceph.data))
R output
Analysis of Variance Table
Response: qt
             Df  Sum Sq Mean Sq F value  Pr(>F)
allele.snp1   1  1.0624  1.0624  5.1901 0.02476 *
sex           1  0.0034  0.0034  0.0167 0.89748
Residuals   104 21.2884  0.2047
Stepwise variable selection Uses the Akaike information criterion (AIC), based on log(RSS/n) plus a penalty for the number of parameters.
R code:
step(lm(qt ~ snp1+snp2+snp3+snp4+snp4+snp5, data=ceph.data))
R output
Start: AIC=-97.45
qt ~ snp1 + snp2 + snp3 + snp4 + snp4 + snp5

Step 1: AIC=-97.45
qt ~ snp1 + snp3 + snp4 + snp5
        Df Sum of Sq   RSS     AIC
- snp3   2     0.266 6.872 -99.321
- snp5   1     0.249 6.855 -97.455
<none>               6.607 -97.450
- snp4   2     0.547 7.154 -97.154
- snp1   2     1.163 7.769 -92.696

Step 2: AIC=-99.32
qt ~ snp1 + snp4 + snp5
        Df Sum of Sq   RSS     AIC
- snp4   2     0.491 7.363 -99.596
<none>               6.872 -99.321
- snp5   1     0.261 7.134 -99.304
- snp1   2     1.076 7.948 -95.467

Step 3: AIC=-99.6
qt ~ snp1 + snp5
        Df Sum of Sq   RSS     AIC
<none>               7.363 -99.596
- snp5   1     0.416 7.779 -98.625
- snp1   2     0.846 8.209 -97.726

Call:
lm(formula = qt ~ snp1 + snp5, data = ceph.data)

Coefficients:
(Intercept)     snp11/3     snp13/3     snp53/3
     9.3085      0.3145      0.4235      0.1962
Robust linear regression • Quantile regression: Koenker and Bassett (1978) • M-estimator: Huber (1981), Hampel et al. (1986) • Bivariate skew-t distribution: Azzalini & Capitanio (2003)
Quantile regression Quantile regression estimates the conditional median (or another quantile) of the response variable, whereas ordinary least squares regression estimates the conditional mean; quantile regression is therefore generally more robust to outliers. The estimation is performed by solving: min over β of Σi ρq(yi - xi′β), where ρq(z) = z(q - I(z < 0)), 0 < q < 1, and I(·) is the indicator function. For the L1-norm estimator (the regression median), q = 0.5.
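One widely used implementation is Koenker's quantreg package (an assumption here: it is not among the packages listed earlier in this deck, so it would need to be installed). The sketch below contrasts the least squares fit with the regression median (q = 0.5, the tau argument of rq()) in the presence of a gross outlier:

```r
## Median regression with the quantreg package (assumed installed;
## not part of the packages listed earlier in this deck).
library(quantreg)

set.seed(9)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50)
y[50] <- 100                 # plant a gross outlier

coef(lm(y ~ x))              # least squares: pulled toward the outlier
coef(rq(y ~ x, tau = 0.5))   # regression median: much less affected
```

Other values of tau (e.g. 0.25, 0.75) fit other conditional quantiles with the same call.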
Example 3: Sib-pair linkage analysis (revisited). From Haseman and Elston, 1972. Y = (Y1 - Y2)² = squared sib-pair trait difference (phenotypic resemblance); X = estimated proportion of alleles the sibs share (genotypic resemblance).
Simulated bivariate trait distributions Bivariate skew-Normal distribution Bivariate Normal distribution Bivariate skew-t distribution Bivariate t distribution