1 / 53

Canadian Bioinformatics Workshops

This workshop provides an introduction to linear regression, logistic model, survival analysis, and statistics for genome-wide association studies. It includes applications to genetic association studies, lecture notes, and R functions.

bgreen
Download Presentation

Canadian Bioinformatics Workshops

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Canadian Bioinformatics Workshops www.bioinformatics.ca

  2. Essential Statistics inBiology: Getting theNumbers Right - II Laurent Briollais Samuel Lunenfeld Research Institute University of Toronto laurent@lunenfeld.ca

  3. Outline: what you’ll find... • Introduction to: • linear regression • logistic model • survival analysis • statistics for genomewide association studies • Applications to genetic association studies • R functions

  4. References • Lecture notes from Frank Harell http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/BioMod • Hosmer & Lemeshow (1999). Applied Survival Analysis: Regression Modelling of Time to Event Data. NY: Wiley • My friends Karen Kopciuk, Yun-Hee Choi and (almost) anonymous web contributors for providing some slides and R codes

  5. R packages used for this talk • library("genetics") • library("fdrtool") • library("haplo.stats") • library(“survival”) • library(“GenABEL”)

  6. Introduction to Linear Regression

  7. Outline • General concepts • A few examples • Simple linear regression • Multivariate linear regression • Regression modeling strategies • A case study in R: CEPH data

  8. General Concepts • Regression is the study of dependence between a response variable (y), the dependent variable and one or several predictors (x), the independent variables. • Regression can be used for different purposes:estimation, prediction, hypothesis testing,... • It is a simple representation/modeling of the dependence between several variables. • BUT it makes assumptions: linear relationship, normal errors (= parametric framework) that need to be checked carefully

  9. What is meant by linear relationship?

  10. Statistical Modeling Procedure Model Specification No Parameter Estimation Model Adequate? Yes Use Model

  11. Example 1: inheritance of height Slope = 1 Slope < 1 Y = Daughter’s height From Weisber’s book X = Mother’s height

  12. Gene expression response to dose effect Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Example 2: Microarray Study Y = Gene expression From Lobenhofer et al., 2004 X = Dose of treatment

  13. Example 3: Sib-pair linkage analysis FromHaseman and Elston, 1972 Y= (Y1-Y2)2 =Squqred sib-pair trait differences (phenotypic resemblance) X =Estimated proportion of alleles the sib shared (genotypic resemblance)

  14. Example 4: Association studies Genotype-by-phenotype plots illustrating the effect of genotype (x-axis) on standardized general cognitive abilityscores (y-axis). From Butcher et al.,Genes Brain Behav. 2008 June; 7(4): 435–446.

  15. Case study: A genetic association study • CEPH pedigrees which consist of 12 multigenerational Caucasian families from Utah inlcuding 107 individuals. DNA for the CEPH family pedigree is available for genetic studies. • The marker data consist of genotypes for 20 SNP markers, six of them are genotyped for all individuals, the remaining 14 are genotyped in only 50-54 individuals. • The study looks for an association between these SNPs and a gene expression phenotype (mRNA) • http://www.sph.umich.edu/csg/abecasis/Merlin/tour/assoc.html

  16. Case study: previous results Very significant result !

  17. Case study: trait distribution ceph.data<-read.table(paste(my.directory,"ceph_data.txt",sep=""),header=T,na.strings = "0/0") attach(ceph.data) allele.snp1<-allele.count(genotype(snp1),1) plot(allele.snp1,qt) R output

  18. Simple Linear Regression Day II. Section VI 1/1/2020 E(Y|X): Population expected value or long-run average of y conditioned on the value of x Eg: population average blood pressure for a 30-year old • α : y-intercept • β : slope of y on x Simple linear regression is used when • Only two variables are of interest • One variable is a response and one a predictor • No adjustment is needed for confounding or other between-subject variation 19

  19. The Model Specification Day II. Section VI 1/1/2020 • E(y|x) = α + ßx • Y = α + ßx + e, • e is a random error (residual) representing variation between subjects in y • Even if x is constant, e.g. variation in blood pressure for patients of the same age 20

  20. Assumptions Day II. Section VI 1/1/2020 • Linearity • σ2 is constant, independent of x • Observations (e’s) are independent of each other • For proper statistical inference (CI, p-values), y (e) is normal conditional on x 21

  21. Estimation Day II. Section VI 1/1/2020 Need a criterion for what are good estimates One criterion is to choose values of the two parameters that minimize the sum of squared errors (SSE) in predicting individual subject responses Let a, b be guesses for α, ß Sample of size n: (x1,y1), (x2,y2), …, (xn,yn) Values that minimize SSE are least squares estimates 22

  22. Ordinary least-squares estimation

  23. Estimates Day II. Section VI 1/1/2020 These are obtained from Note: A term from Lxy will be positive when x and y are concordant in terms of both being above their means or both being below their means. 24

  24. Case study: R R code: lm.snp1<-lm(qt ~ allele.snp1, data=ceph.data) summary(lm.snp1) R output Residuals: Min 1Q Median 3Q Max -1.48691 -0.29568 -0.00657 0.36666 0.81169 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 9.71522 0.06212 156.405 <2e-16 *** allele.snp1 -0.15503 0.06773 -2.289 0.0241 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.4503 on 105 degrees of freedom Multiple R-Squared: 0.04753, Adjusted R-squared: 0.03845 F-statistic: 5.239 on 1 and 105 DF, p-value: 0.02409

  25. Case study: R R code: xyplot(qt ~ allele.snp1, data=ceph.data, type=c('p','r')) R output y=9.71-0.16x

  26. Inference about the parameters

  27. Case study: R R code: anova(lm.snp1) SSE<-sum(lm.snp1$res^2) print(SSE) SSR<-sum((qt-lm.snp1$res-ybar)^2) print(SSR) SST<-SSR+SSE print(SST) F.stat<-(SSR/1)/(SSE/(length(qt)-2)) print(F.stat) pf(F.stat, 1, length(qt)-2,lower.tail=F) Anova table Response: qt Df Sum Sq Mean Sq F value Pr(>F) allele.snp1 1 1.0624 1.0624 5.2391 0.02409 * Residuals 105 21.2919 0.2028

  28. Computing the F-statistic Remember that the F-statistic is the ratio ofthe mean squareddeviations from the means of the fitted values to the residuals. Fstat = (SSRegression/p−1)/(SSResidual/n−p)

  29. Coefficient of determination

  30. Classical ANOVA analysis Day II. Section VI 1/1/2020 • Model • α : mean for controls • β1: mean for currently exposed minus mean for controls • β2: mean for previously exposed minus mean for controls • β1-β2: mean for previously exposed minus currently exposed • In ANOVA, • SSR is called sum of squares between treatments • SSE is called sum of squares within treatment • Don’t need to learn formulas specifically for ANOVA 31

  31. Classical ANOVA analysis R code: anova(lm(qt~snp1, data=ceph.data)) R output Classical analysis of Variance Table Response: qt Df Sum Sq Mean Sq F value Pr(>F) snp1 2 1.21 0.60 2.97 0.056 Residuals 104 21.14 0.20

  32. Estimate of the variance

  33. Prediction R code: predict(lm.snp1, newdata=data.frame(allele.snp1=c(0,1)), interval=c("prediction"), level=0.95) R output fit lower upper 1 9.71 8.81 10.61 2 9.56 8.66 10.45

  34. Model Adequacy (Diagnostic) Check We need to check residual plots to validate underlying assumptions: Relationship between response and regressor is linear (at least approximately). Error term,  has zero mean Error term,  has constant variance Errors are normally distributed (required for tests and intervals)

  35. Residual Plot: Residual vs. Fitted Value Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis Adequate Inadequate Check constant variance and linearity, and look for potential outliers Inadequate Inadequate

  36. Assessing goodness of fit Plot of residuals vs. predicted values R code: plot(predict(lm.snp1), residuals(lm.snp1)) abline(h=0, lty=2) R output

  37. Backup slide Residual Plot: Normal Probability Plot Adequate Inadequate Inadequate Inadequate Inadequate Cumulative probability of normal distribution Check the normality assumption Source: Montgomery et al., 2001, Introduction to Linear Regression Analysis

  38. Multiple Linear Regression Day II. Section VI 1/1/2020 Assume the variables act linearly against y Model: Or For two x-variables: Estimated equations: Least squares criterion for fitting the model (estimating the parameters) Solve for a, b1, b2 to minimize SSE 39

  39. Inference Day II. Section VI 1/1/2020 A general principle in regression models: a set of variables can be tested for their combined partial effects by removing that set of variables from the model and measuring the harm (SSE) done to the model Let full refer to computed values from the full model including all variables; reduced model containing only the adjustment variables and not the variables being tested Dropping variables SSE, SSR unless the dropped variables had exactly zero slope estimates in the full model (which never happens) 40

  40. Inference -cont’d Day II. Section VI 1/1/2020 • SSEreduced – SSE full = SSRfull – SSRreduced Numerator of F test can use either SSE or SSR • Form of partial F-test: change in SS when dropping the variables of interest divided by change in d.f., then divided by MSE; MSE is chosen as that which best estimates σ2, namely the MSE from the full model • Full model has p slopes; suppose we want to test q of the slopes 41

  41. Case study: R R code: xyplot(qt ~ allele.snp1|sex, data=ceph.data, type=c('p','r')) R output

  42. Case study: R R output Analysis of Variance Table Response: qt Df Sum Sq Mean Sq F value Pr(>F) allele.snp1 1 1.0624 1.0624 5.1901 0.02476 * sex 1 0.0034 0.0034 0.0167 0.89748 Residuals 104 21.2884 0.2047

  43. Stepwise variable selection Use of Akaike criteria(AIC) log (RSS/n) step(lm(qt ~ snp1+snp2+snp3+snp4+snp4+snp5, data=ceph.data)) Step: AIC=-99.6 qt ~ snp1 + snp5 Df Sum of Sq RSS AIC <none> 7.363 -99.596 - snp5 1 0.416 7.779 -98.625 - snp1 2 0.846 8.209 -97.726 Call: lm(formula = qt ~ snp1 + snp5, data = ceph.data) Coefficients: (Intercept) snp11/3 snp13/3 snp53/3 9.3085 0.3145 0.4235 0.1962 > Start: AIC=-97.45 qt ~ snp1 + snp2 + snp3 + snp4 + snp4 + snp5 Step: AIC=-97.45 qt ~ snp1 + snp3 + snp4 + snp5 Df Sum of Sq RSS AIC - snp3 2 0.266 6.872 -99.321 - snp5 1 0.249 6.855 -97.455 <none> 6.607 -97.450 - snp4 2 0.547 7.154 -97.154 - snp1 2 1.163 7.769 -92.696 Step: AIC=-99.32 qt ~ snp1 + snp4 + snp5 Df Sum of Sq RSS AIC - snp4 2 0.491 7.363 -99.596 <none> 6.872 -99.321 - snp5 1 0.261 7.134 -99.304 - snp1 2 1.076 7.948 -95.467 Step3 Step1 Step2 R output

  44. Influential Observation

  45. Robust linear regression Quantile regression:Koenker and Bassett (1978) M-estimator:Huber (1981), Hampel et al. (1986) Bivariate skew-t distribution:Azzalini & Capitanio (2003)

  46. Quantile regression Quantile regression can estimate the conditional median or other quantiles of the response variable, rather than ordinary least squares regression which estimates the conditional mean. Thus quantile regression is generally more robust to outliers. The estimation is performed by solving: Where rq(z) = z(q -I(z<0)), 0 < q < 1 and I() is the indicator function For L1-norm estimator (regression median) q = 0.5

  47. Example 3: Sib-pair linkage analysis FromHaseman and Elston, 1972 Y= (Y1-Y2)2 =Squqred sib-pair trait differences (phenotypic resemblance) X =Estimated proportion of alleles the sib shared (genotypic resemblance)

  48. Simulated bivariate trait distributions Bivariate skew-Normal distribution Bivariate Normal distribution Bivariate skew-t distribution Bivariate t distribution

  49. Simulated distribution of Y=(Y1-Y2)2

More Related