BIO503: Lecture 5. Jess Mar, Department of Biostatistics, jmar@hsph.harvard.edu. Harvard School of Public Health, Wintersession 2009.
Roadmap for Today
Some More Advanced Statistical Models
• Multiple Linear Regression
• Generalized Linear Models
• Logistic Regression
• Poisson Regression
• Survival Analysis
Multivariate Data Analysis
Programming Tutorials
Bits & Pieces
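Multiple Linear Regression
Some handy functions to know about:
new.model <- update(old.model, new.formula)
Model selection functions (drop1, add1 and step come with base R in the stats package; dropterm, addterm and stepAIC are available in the MASS package):
• drop1, dropterm
• add1, addterm
• step, stepAIC
Similarly, anova(modObj, test="Chisq") produces a sequential table of tests for the terms in a fitted model.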
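A minimal sketch of these functions in use. The built-in swiss data set and the model names here are assumptions chosen for illustration; they are not the course data.

```r
library(MASS)  # recommended package shipped with R; provides dropterm, addterm, stepAIC

# Assumption: the built-in `swiss` data stands in for the course data.
full.model <- lm(Fertility ~ ., data = swiss)

# update() refits with a modified formula; `. ~ . - Examination` drops one term.
reduced.model <- update(full.model, . ~ . - Examination)

# Single-term deletions from the full model, each with an F test.
print(dropterm(full.model, test = "F"))

# Backward selection by AIC; trace = FALSE suppresses the step-by-step log.
step.model <- stepAIC(full.model, direction = "backward", trace = FALSE)
print(formula(step.model))

# Compare the two nested fits directly.
print(anova(reduced.model, full.model))
```

stepAIC only ever accepts a step that lowers AIC, so the selected model never has a higher AIC than the starting model.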
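Generalized Linear Models
Linear regression models hinge on the assumption that, given the covariates, the response variable follows a Normal distribution. Generalized linear models handle non-Normal response variables (e.g. binary or count data) by relating the mean of the response to a linear predictor through a link function, which plays the role of a transformation to linearity.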
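Logistic Regression
When faced with a binary response Y = (0,1), we use logistic regression, which models the log odds of Y = 1 as a linear function of the covariate: $\log\frac{p}{1-p} = \beta_0 + \beta_1 x$, where $p = P(Y = 1 \mid x)$.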
Problem 2 – Logistic Regression
Read in the anaesthetic data set, data file: anaesthetic.txt.
Covariates:
move: binary numeric vector for patient movement (1 = movement, 0 = no movement)
conc: anaesthetic concentration
Goal: estimate how the probability of movement varies with increasing concentration of the anaesthetic agent.
Fit the Logistic Regression Model
> anes.logit <- glm(nomove ~ conc, family=binomial(link=logit), data=anesthetic)
The output summary looks like this:
> summary(anes.logit)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -6.469      2.418  -2.675  0.00748 **
conc           5.567      2.044   2.724  0.00645 **
Estimates of P(Y=1), here the probability of no movement since the response is nomove, are given by:
> fitted.values(anes.logit)
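The same steps can be tried without the data file. This is a sketch on simulated data: the data frame sim, the sample size and the true coefficients (-6 and 5) are assumptions made up for illustration and are not the anaesthetic data.

```r
set.seed(503)

# Hypothetical stand-in for the anaesthetic data: concentration and a 0/1 response.
n <- 200
conc <- runif(n, 0.8, 2.5)
nomove <- rbinom(n, 1, plogis(-6 + 5 * conc))  # assumed true model, for illustration only
sim <- data.frame(conc = conc, nomove = nomove)

# Fit the logistic regression exactly as on the slide.
sim.logit <- glm(nomove ~ conc, family = binomial(link = "logit"), data = sim)
print(summary(sim.logit)$coefficients)

# Fitted probabilities P(Y = 1) for the observed concentrations.
p.hat <- fitted.values(sim.logit)

# Predicted probabilities at new concentrations.
p.new <- predict(sim.logit, newdata = data.frame(conc = c(1.0, 1.5)),
                 type = "response")
print(p.new)
```

Using type = "response" in predict returns probabilities; the default returns values on the log odds scale.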
Estimating Log Odds
To get back the log odds, log(p / (1 - p)), for each observation:
> anes.logit$linear.predictors
> plot(anesthetic$conc, anes.logit$linear.predictors)
> abline(coefficients(anes.logit))
The coefficient of conc is the log odds ratio for a one-unit increase in concentration. Looks like the odds of not moving increase significantly when you increase the concentration of the anesthetic agent beyond 0.8.
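A short sketch of how the log odds, fitted probabilities and odds ratio relate. The data are simulated under assumed coefficients, so the numbers do not correspond to the anaesthetic example.

```r
set.seed(1)

# Hypothetical data: same structure as the anaesthetic example, simulated values.
conc <- runif(150, 0.8, 2.5)
nomove <- rbinom(150, 1, plogis(-6 + 5 * conc))
fit <- glm(nomove ~ conc, family = binomial)

# The linear predictor is the log odds log(p / (1 - p)) of each fitted probability.
log.odds <- fit$linear.predictors
p.hat <- fitted.values(fit)
same <- all.equal(unname(log.odds), unname(qlogis(p.hat)))

# exp(coefficient) is the odds ratio for a one-unit increase in conc.
odds.ratio <- unname(exp(coef(fit)["conc"]))

# Wald 95% confidence interval, transformed to the odds ratio scale.
ci <- exp(confint.default(fit)["conc", ])
print(c(odds.ratio = odds.ratio, ci))
```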
Problem 3 – Multiple Logistic Regression
Read in the data set birthwt.txt. We fit a logistic regression with several covariates using the glm function and the binomial family.
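One possible version of this fit, assuming that birthwt.txt has the same variables as the birthwt data frame in the MASS package (that correspondence is an assumption). The choice of covariates is also only an illustration.

```r
library(MASS)  # provides the birthwt data frame

# Assumption: MASS::birthwt stands in for birthwt.txt.
# Recode the numeric codes as factors so glm treats them as categories.
bw <- transform(birthwt,
                race  = factor(race, labels = c("white", "black", "other")),
                smoke = factor(smoke, labels = c("no", "yes")))

# low = 1 if birth weight is below 2.5 kg; covariates chosen for illustration.
bw.logit <- glm(low ~ age + lwt + race + smoke, family = binomial, data = bw)
print(summary(bw.logit)$coefficients)

# Odds ratios for each term.
print(exp(coef(bw.logit)))

# Sequential analysis of deviance with chi-squared tests.
print(anova(bw.logit, test = "Chisq"))
```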
Problem 4 - Poisson Regression Poisson regression is often used for the analysis of count data or the calculation of rates associated with a rare event or disease. Example: schooldata.csv. We can fit the Poisson regression model using the glm function and the poisson family.
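A sketch of a Poisson fit on simulated counts. The variables days.absent, math.score and program are hypothetical and are not taken from schooldata.csv; the true coefficients are assumptions chosen so the example has a known answer.

```r
set.seed(42)

# Hypothetical count data: days absent per student.
n <- 300
math.score <- rnorm(n, mean = 50, sd = 10)
program <- factor(sample(c("A", "B"), n, replace = TRUE))
mu <- exp(1.5 - 0.02 * (math.score - 50) + 0.4 * (program == "B"))
school <- data.frame(days.absent = rpois(n, mu), math.score, program)

# Poisson regression with the canonical log link.
pois.fit <- glm(days.absent ~ math.score + program,
                family = poisson(link = "log"), data = school)
print(summary(pois.fit)$coefficients)

# exp(coefficient) is a rate ratio: the multiplicative change in the expected count.
rate.ratio <- exp(coef(pois.fit))
print(rate.ratio)

# Rough overdispersion check: Pearson chi-squared / residual df should be near 1.
dispersion <- sum(residuals(pois.fit, type = "pearson")^2) / df.residual(pois.fit)
print(dispersion)
```

When the counts come from different amounts of exposure (e.g. person-years), an offset(log(exposure)) term in the formula turns the model into one for rates.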
Survival Analysis
> library(survival)
Example: aml leukemia data
Kaplan-Meier curve (survfit takes a formula; ~ 1 fits a single curve):
> fit1 <- survfit(Surv(aml$time[1:11], aml$status[1:11]) ~ 1)
> summary(fit1)
> plot(fit1)
Log-rank test
> survdiff(Surv(time, status)~x, data=aml)
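A sketch that fits Kaplan-Meier curves for both groups of the aml data at once and extracts the log-rank p-value. The object names are illustrative.

```r
library(survival)  # recommended package shipped with R; provides the aml data

# One Kaplan-Meier curve for all patients, and one per maintenance group.
km.all <- survfit(Surv(time, status) ~ 1, data = aml)
km.by.group <- survfit(Surv(time, status) ~ x, data = aml)
print(km.by.group)

# Plot both curves; line types follow the order of the factor levels of x.
plot(km.by.group, lty = 1:2, xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(aml$x), lty = 1:2)

# Log-rank test comparing the two groups.
lr <- survdiff(Surv(time, status) ~ x, data = aml)
print(lr)

# p-value from the chi-squared statistic (groups - 1 degrees of freedom).
p.value <- 1 - pchisq(lr$chisq, df = length(lr$n) - 1)
print(p.value)
```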
Survival Analysis
Fit a Cox proportional hazards model
> coxfit1 <- coxph(Surv(time, status)~x, data=aml)
> summary(coxfit1)
Cumulative baseline hazard estimator:
> basehaz(coxph(Surv(time, status)~x, data=aml))
Survival function for one group (x is a factor in aml, so newdata must use one of its levels):
> plot(survfit(coxfit1, newdata=data.frame(x="Maintained")))
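A sketch of the Cox model workflow on the aml data, with predicted survival curves for both groups. The object names are illustrative.

```r
library(survival)

# Cox proportional hazards model with maintenance group as the only covariate.
coxfit <- coxph(Surv(time, status) ~ x, data = aml)
print(summary(coxfit))

# Hazard ratio relative to the first level of x.
hr <- unname(exp(coef(coxfit)))
print(hr)

# Cumulative baseline hazard, evaluated at the reference level (centered = FALSE).
H0 <- basehaz(coxfit, centered = FALSE)
print(head(H0))

# Model-based survival curves, one per group.
new.patients <- data.frame(x = factor(levels(aml$x), levels = levels(aml$x)))
sf <- survfit(coxfit, newdata = new.patients)
plot(sf, lty = 1:2, xlab = "Time", ylab = "Survival probability")
legend("topright", legend = levels(aml$x), lty = 1:2)
```

Unlike the Kaplan-Meier curves, these two curves are constrained by the proportional hazards assumption, so they cannot cross.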
Cluster Analysis Clustering observations on the basis of experiments or across a time series. Clustering experiments together on the basis of observations. A clustering problem is generally much harder than a classification problem because we don’t know the number of classes. Hierarchical Methods: (Agglomerative, Divisive) + (Single, Average, Complete) Linkage… Model-based Methods: Mixed models. Plaid models. Mixture models…
Examples of Clustering Algorithms Available in R
Hierarchical Methods: hclust, agnes
Partitioning Methods: som, kmeans, pam
Packages: cluster
[Figure: data matrix, labelled "Different Samples" and "Observations"]
Hierarchical Clustering
[Figure (source: J-Express manual): agglomerative methods go from n genes in n clusters to n genes in 1 cluster; divisive methods go the other way.]
We join (or break) nodes based on the notion of maximum (or minimum) ‘similarity’.
Similarity measures: Euclidean distance, (Pearson) correlation
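Different Ways to Determine Distances Between Clusters
Single linkage: the smallest distance between a member of one cluster and a member of the other.
Complete linkage: the largest distance between a member of one cluster and a member of the other.
Average linkage: the average of all pairwise distances between members of the two clusters.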
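The distance measures and linkage rules above can be compared with hclust. This is a sketch on a simulated matrix; the 20 "genes" by 6 "samples" layout and the two-group structure are assumptions made for illustration.

```r
set.seed(7)

# Hypothetical expression-like matrix: 20 "genes" (rows) by 6 "samples" (columns),
# with the first 10 genes centred at 0 and the last 10 centred at 4.
x <- rbind(matrix(rnorm(10 * 6, mean = 0), nrow = 10),
           matrix(rnorm(10 * 6, mean = 4), nrow = 10))
rownames(x) <- paste0("g", 1:20)

# Two dissimilarity measures between genes.
d.euc <- dist(x, method = "euclidean")
d.cor <- as.dist(1 - cor(t(x)))  # 1 - Pearson correlation between gene profiles

# Agglomerative clustering under three linkage rules.
linkages <- c("single", "complete", "average")
trees <- lapply(linkages, function(m) hclust(d.euc, method = m))
names(trees) <- linkages

# The same call works with the correlation-based dissimilarity.
tree.cor <- hclust(d.cor, method = "average")

# Cut the average-linkage tree into two clusters and plot the dendrogram.
groups <- cutree(trees[["average"]], k = 2)
print(table(groups))
plot(trees[["average"]], main = "Average linkage, Euclidean distance")
```

For divisive clustering, diana in the cluster package takes the same kind of dissimilarity object.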
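Partitioning Methods
Examples of partitioning methods are k-means and partitioning around medoids (pam).
Gap statistic:
> source("http://www.bioconductor.org/biocLite.R")
> biocLite("SAGx")
> ?gap
The goal is to choose the number of clusters k for which the gap statistic is large: the gap compares the observed within-cluster dispersion with what is expected under a reference distribution with no clusters.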
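A sketch of pam and a gap statistic calculation. It uses clusGap from the cluster package as a stand-in for the SAGx function named on the slide (a swapped-in implementation, not the one from the lecture), and the three-cluster data are simulated.

```r
library(cluster)  # recommended package shipped with R; provides pam, clusGap, maxSE

set.seed(11)

# Hypothetical data: three well-separated 2-D clusters of 30 points each.
centers <- rbind(c(0, 0), c(5, 5), c(10, 0))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centers[i, 1], 0.5), rnorm(30, centers[i, 2], 0.5))
}))

# Partitioning around medoids with k = 3.
pam.fit <- pam(x, k = 3)
print(pam.fit$medoids)

# Gap statistic for k = 1..6, with k-means as the clustering function.
# B is the number of reference data sets; nstart is passed on to kmeans.
gap <- clusGap(x, FUNcluster = kmeans, K.max = 6, B = 50,
               verbose = FALSE, nstart = 10)
print(gap$Tab)

# Smallest k whose gap is within one standard error of the next one.
k.hat <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
print(k.hat)
```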
K-means Clustering
W – within-cluster variance
B – between-cluster variance
Reference: J-Express manual
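The W and B quantities can be read off a kmeans fit. This is a sketch on simulated two-cluster data; the data and object names are assumptions for illustration.

```r
set.seed(21)

# Hypothetical data: two 2-D clusters of 30 points each.
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2))

# nstart = 20 reruns k-means from 20 random starts and keeps the best solution.
km <- kmeans(x, centers = 2, nstart = 20)

W <- km$tot.withinss  # total within-cluster sum of squares
B <- km$betweenss     # between-cluster sum of squares
total <- km$totss     # total sum of squares; W + B adds up to this
print(c(W = W, B = B, total = total))

# W for k = 1..5: k-means minimises W for a fixed k, and W falls as k grows.
W.by.k <- sapply(1:5, function(k) kmeans(x, centers = k, nstart = 20)$tot.withinss)
print(W.by.k)
```

Because W always decreases as k grows, it cannot be used on its own to pick k, which is the motivation for criteria such as the gap statistic.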
Classification (Machine Learning)
Machine learning algorithms predict the classes of new objects based on patterns discerned from existing data.
Goal: derive a rule (classifier) that assigns a new object (e.g. patient microarray profile) to a pre-specified group (e.g. aggressive vs non-aggressive prostate cancer).
• Classification algorithms are a form of supervised learning.
• Clustering algorithms are a form of unsupervised learning.
R packages:
• class – contains knn, SOM
• nnet
• MLInterfaces (Bioconductor) – a simplified way to construct machine learning algorithms from microarray data.
Classification
Linear Discriminant Analysis: lda
Support Vector Machines: library(e1071), svm
K-nearest neighbors: knn
Tree-based methods: rpart, randomForest
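A sketch comparing three of these classifiers on a train/test split. The built-in iris data, the 100/50 split and k = 5 are assumptions chosen for illustration; svm and randomForest are left out because their packages are not shipped with R.

```r
library(MASS)   # lda
library(class)  # knn
library(rpart)  # rpart

set.seed(2009)

# Assumption: iris stands in for a labelled data set; 100 rows for training, 50 for testing.
train.idx <- sample(nrow(iris), 100)
train <- iris[train.idx, ]
test <- iris[-train.idx, ]

# Linear discriminant analysis.
lda.fit <- lda(Species ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)$class

# k-nearest neighbours: no separate fitting step, predictions come straight from knn().
knn.pred <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 5)

# Classification tree.
tree.fit <- rpart(Species ~ ., data = train, method = "class")
tree.pred <- predict(tree.fit, newdata = test, type = "class")

# Proportion of test cases classified correctly by each method.
accuracy <- c(lda = mean(lda.pred == test$Species),
              knn = mean(knn.pred == test$Species),
              rpart = mean(tree.pred == test$Species))
print(accuracy)

# Confusion matrix for one of the classifiers.
print(table(predicted = lda.pred, actual = test$Species))
```

Accuracy is measured on held-out rows because a classifier evaluated on its own training data looks better than it is.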
Scaling Methods
Principal Component Analysis: prcomp
Multi-dimensional Scaling (MDS): cmdscale (classical MDS), isoMDS in MASS
Self Organizing Maps: SOM
Independent Component Analysis: fastICA
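A sketch of PCA and classical MDS on the same data. The iris measurements are an assumption chosen for illustration. With Euclidean distances, classical MDS recovers the principal component scores up to sign, which the last lines check.

```r
# Assumption: the four numeric iris columns stand in for a multivariate data set.
x <- as.matrix(iris[, 1:4])

# Principal component analysis on centred, unscaled data.
pca <- prcomp(x, center = TRUE, scale. = FALSE)
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var.explained, 3))

# Classical multi-dimensional scaling on Euclidean distances, two dimensions.
mds <- cmdscale(dist(x), k = 2)

# Correlation between matching PCA and MDS axes (absolute value, since signs are arbitrary).
agreement <- abs(diag(cor(pca$x[, 1:2], mds)))
print(agreement)

# Plot the first two principal components, coloured by species.
plot(pca$x[, 1:2], col = as.integer(iris$Species), xlab = "PC1", ylab = "PC2")
```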
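R Shortcuts
Ctrl + A: move to the start of the line
Ctrl + E: move to the end of the line
Ctrl + K: delete from the cursor to the end of the line
Esc: interrupt the current command
{Up, Down} Arrow: step through the command history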
Laundry List
• .Rprofile file
• Outline of R packages
• Graphics – lattice, Rwiki
• Homework
• R/SAS/Stata Comparison
• Exercises