160 likes | 285 Views
BIO503: Lecture 4 Statistical models in R --- Recap ---. Stefan Bentink bentink@jimmy.harvard.edu. residual error. dependent variable. intercept. independent variable. regression coefficient. Linear Regression Models.
E N D
BIO503: Lecture 4Statistical models in R--- Recap --- Stefan Bentink bentink@jimmy.harvard.edu
residual error dependent variable intercept independent variable regression coefficient Linear Regression Models Using the methods of least squares, we can derive the following estimators: Our goal is to test the hypothesis: We can do this with a T test: under the null hypothesis, this follows a T distribution with (n-1) df.
Some important functions my.model <- lm(x~y) summary(my.model) anova(my.model) predict(my.model,new.data)
Specifying Models In R we use model formula to specify the model we want to fit to our data. y ~ x Simple Linear Regression y ~ x – 1 Simple Linear Regression without the intercept (line goes through origin) y ~ x1 + x2 + x3 Multiple Regression y ~ x + I(x^2) Quadratic Regression log(y) ~ x1 + x2 Multiple Regression of Transformed Variable For factors A, B: y ~ A 1-way ANOVA y ~ A + B 2-way ANOVA y ~ A*B 2-way ANOVA + interaction term
ANOVA Example Let's use a different dataset: > library(MASS) > data(ChickWeight) > attach(ChickWeight) The factor Diet has 4 levels. > levels(Diet) > anova(lm(weight ~ Diet, data=ChickWeight)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet 3 155863 51954 10.81 6.433e-07 Residuals 574 2758693 4806
Two-way ANOVA We can fit a two-way ANOVA: > anova(lm(weight ~ Diet + Chick, data=ChickWeight)) Analysis of Variance Table Response: weight Df Sum Sq Mean Sq F value Pr(>F) Diet 3 155863 51954 11.5045 2.571e-07 Chick 46 374243 8136 1.8015 0.001359 Residuals 528 2384450 4516 The interpretation of the model output is sequential, from the bottom to the top. This line tests the model: weight ~ Diet + Chick This line tests the model: weight ~ Diet vs weight ~ Diet + Chick.
Generalized Linear Models Linear regression models hinge on the assumption that the response variable follows a Normal distribution. Generalized linear models are able to handle non-Normal response variables and transformations to linearity.
Logistic Regression When faced with a binary response Y = (0,1), we use logistic regression. where Logit
Fit the Logistic Regression Model > anes.logit <- glm(move ~ conc, family=binomial(link=logit), data=anesthetic) The output summary looks like this: > summary(anes.logit) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 6.469 2.418 2.675 0.00748 ** conc -5.567 2.044 -2.724 0.00645 ** Estimates of P(Y=1) are given by: > fitted.values(anes.logit)
Update models and model selection Some handy functions to know about: new.model <- update(old.model, new.formula) Model Selection functions available in the MASS package drop1, dropterm add1, addterm step stepAIC
Problem 5 – Survival Analysis • Read in the data file aml.txt. This data stores the survival data on patients with Acute Myelogenous Leukemia. • Compute the Kaplan-Meier estimate for all patients in this data. Compute the corresponding Kaplan-Meier plot. Construct Kaplan-Meier plots grouped by chemotherapy status. • Using a log-rank test, test if the two survival curves (patients on maintenance chemotherapy, patients who are not) are identical. • Fit a Cox proportional hazards model to the data set. • Plot these survival functions for patients from the different groups.
Survival Analysis library(survival) Example: aml leukemia data Kaplan-Meier curve fit1 <- survfit(Surv(aml$time,aml$status)~1) summary(fit1) plot(fit1) Log-rank test survdiff(Surv(time, status)~x, data=aml)
Survival analysis > cp <- coxph(Surv(aml$time, + aml$status)~x,data=aml) > > summary(cp) > > plot(survfit(Surv(aml$time,aml$status)~x, + data=aml),col=c("red","green"),lwd=2)