Logistic Regression and Generalized Linear Models:

Blood Screening, Women’s Role in Society, and Colonic Polyps Logistic Regression and Generalized Linear Models:

Blood Screening • ESR measurements (erythrocyte sedimentation rate)‏ • Study checks for the rate of increase of ESR when (blood proteins) fibrinogen and globulin increase • Looking for an association between the probability of an ESR reading > 20mm/hr and the levels of the two plasma proteins. • Less than 20mm/hr indicates a healthy individual

Multiple Regression Doesn’t Work • Not Normally distributed • The response variable is binary. • In fact, the distribution is Binomial. The can be seen by looking at how the error term relates to the probability. If y = 1 then the error term is 1- P(y=1). We will assume for our Random Variables Y that

Logistic Regression, a Generalized Linear Model • Modelling the expected value of the response requires a transformation • This chapter is about the logit transformation of a model, which is of the form • If the response variable is 1, the log odds of the response is the logit function of a probability. • Fixing all explanatory variables but xj, exp(Bj) represents the odds that the response variable is 1 when xj increases by 1.

R commands • Plasma_glm_1 <- glm(ESR ~ fibrinogen, data = plasma, family = binomial())‏ • Fits the model, in this example specifying the logistic function is implied because of the binomial() parameter is already passed. • Layout(matrix(1:2, ncol = 2))‏ • Lets you choose the layout of your graph screen. • Cdplot(ESR ~ fibrinogen, data = plasma)‏ • Cdplot(ESR ~ globulin, data = plasma)‏ • cdplot “plots conditional densities describing how the conditional distribution of a categorical variable y changes over a numerical variable x” (help(“cdplot”))‏

Interpretation • The area of the dark region is the probability that ESR < 20 mm/hr, and this decreases as the protein levels increase • There is not much shape to the globulin density function.

R Commands • Confint(plasma_glm_1, parm = “fibrinogen”)‏ • Gives a confidence interval • Output: • Waiting for profiling to be done... • 2.5 % 97.5 % • 0.3389465 3.9988602 • Exp(coef(plasma_glm_1)[“fibrinogen”])‏ • This is necessary to do the reverse transformation to get the model coefficients • Output: • fibrinogen • 6.215715

R Commands • Summary(plasma_glm_1)‏ • Output: • Deviance Residuals: • Min 1Q Median 3Q Max • -0.9298 -0.5399 -0.4382 -0.3356 2.4794 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) -6.8451 2.7703 -2.471 0.0135 * • fibrinogen 1.8271 0.9009 2.028 0.0425 * • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • Null deviance: 30.885 on 31 degrees of freedom • Residual deviance: 24.840 on 30 degrees of freedom

R Commands • Exp(confint(plasma_glm_1, parm = “fibrinogen”))‏ • Output: • 2.5 % 97.5 % • 1.403468 54.535954 • Plasma_glm_2 <- glm(ESR ~ fibrinogen + globulin, data = plasma, family = binomial())‏ • Use logistic regression for both variables fibrinogen and globulin • Summary(Plasma_glm_2)‏

R Commands • Summary(Plasma_glm_2)‏ • Output • Deviance Residuals: • Min 1Q Median 3Q Max • -0.9683 -0.6122 -0.3458 -0.2116 2.2636 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) -12.7921 5.7963 -2.207 0.0273 * • fibrinogen 1.9104 0.9710 1.967 0.0491 * • globulin 0.1558 0.1195 1.303 0.1925 • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • Null deviance: 30.885 on 31 degrees of freedom • Residual deviance: 22.971 on 29 degrees of freedom

Interpretation • large confidence interval is because there are not many observations in all, and even fewer ESR > 20mm/hr • The globulin coefficient is basically zero

Anova(plasma_glm_1, plasma_glm_2, test = “Chisq”)‏ Compares the two models one of just fibrinogen, the other of both fibrinogen and globulin Why Chisquared test? ANOVA assumes normal distribution for each data set. Why is this OK? Output Analysis of Deviance Table Model 1: ESR ~ fibrinogen Model 2: ESR ~ fibrinogen + globulin Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 30 24.8404 2 29 22.9711 1 1.8692 0.1716 Comparing the residual deviance between the two models, we see that the difference is not significant. We must surmise that there is no association between globulin and ESR level. R Commands

R Commands • Prob <- predict(plasma_glm_1, type = “response” • The model of just fibrinogen useful in creating a neat looking bubble plot. First we must take the predicted values from the first model and use them to determine the size of the bubbles • Plot(globulin ~ fibrinogen, data = plasma, xlim = c(2, 6), ylim = c(25, 50), pch = “.”)‏ • Plots the second model • Symbols(plasma$fibrinogen, plasma$globulin, circles = prob, add = TRUE)‏ • Uses the values of the first model to create different bubble sizes.

The plot shows how the probability of having an ESR > 20mm/hr increases as fibrinogen increases Bubbles size is Probability

Generalized Linear Model • Unifies the logistic regression, Analysis of Variance and multiple regression techniques. • GLMs have three essential parts • An error distribution- the distribution of the response variable. • Normal for Analysis of Variance and Multiple Regression • Binomial for Logistic Regression

Main parts of GLM (continued)‏ • A link function which links the explanatory variables to the expected value of the response. • Logit function for logistic regression • Identity function for ANOVA and multiple regression • A variance function which shows the dependency of the response variable variability on the mean

Measure of Fit • The deviance shows how well the model fits the data • Comparing two model’s deviances • Use a likelihood ratio test • Compare using Chi-square distribution

Women’s Role in Society • Response to “Women should take care of running their homes and leave the running the country up to men”. • Factors: Education, Sex • Response: Agree or Disagree • data is presented as categories with counts for each education, sex combination

R Commands • Womensrole_glm_1 <- glm(cbind(agree, disagree) ~ sex + education, data = womensrole, family= binomial())‏ • This uses the cbind function to change data from two responses to one response that is a matrix of agree, disagree counts. • Summary(womensrole_glm_1)‏ • We can see that education is a significant factor to the response.

Summary Output • Deviance Residuals: • Min 1Q Median 3Q Max • -2.72544 -0.86302 -0.06525 0.84340 3.13315 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) 2.50937 0.18389 13.646 <2e-16 *** • Why is there a P-value for the intercept? • sexFemale -0.01145 0.08415 -0.136 0.892 • education -0.27062 0.01541 -17.560 <2e-16 *** • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for binomial family taken to be 1) • Null deviance: 451.722 on 40 degrees of freedom • Residual deviance: 64.007 on 38 degrees of freedom

Declaring a function in R • Myplot <- function(role.fitted) { • Declares the function myplot and passes it an object “role.fitted” to be used in the function • f <- womensrole$sex == “Female” • Stores everything in the data with sex = femail in f • plot(womensrole$education, role.fitted, type = "n", ylab = "probability of agreeing", xlab = "Education", ylim = c(0,1))‏ • Plots education against the role.fitted object which will be some predicted values from our GLM defined as • Role.fitted1 <- predict(womensrole_glm_1, type = “response”)‏

Myplot function (cont)‏ • lines(womensrole$education[!f], role.fitted[!f], lty = 1) • lines(womensrole$education[f], role.fitted[f], lty = 2) • These graph the lines, one for males and one for femails. Lty indicates the kind of line (in this case solid or dotted)‏ • lgtxt <- c("Fitted (Males)", "Fitted (Females)") • legend("topright", lgtxt, lty = 1:2, bty = "n") • These add a legend for each line.

Myplot Function (cont)‏ • y<-womensrole$agree/ (womensrole$agree + womensrole$disagree)‏ • A basic calculation of the proportion of women that agree • text(womensrole$education, y, ifelse(f, "\\VE", "\\MA"), family = "HersheySerif", cex = 1.25)‏ • Plots y and not y using male/female symbols

Interpretation • The two fitted lines indicate that sex does not change a probability of agreeing vs education • The symbols of unfitted observations may indicate an interaction between sex and education

MyPlot Sex:Education • By running the same analysis, e.g. • Womensrole_glm_2 <- glm(cbind(agree, disagree) ~ sex*education, data = womensrole, family = binomial())‏ • Summary(womensrole_glm_2)‏ • This summary shows a significant p-value for the sex:education interaction • Role.fitted2 <- predict(womensrole_glm_1, type = “response”)‏ • Myplot(role.fitted2)‏ • The plot shows that less education is associated with agreement that women belong in the home.

The Deviance Residual • One of the many methods for checking the adequacy of the model fit • The deviance residual is the square root of the part of each observation that contributes to the deviance • Res <- residuals( womensrole_glm_2, type = “deviance”)‏ • Pulls the residuals for many other models as well • Plot(predict(womensrole_glm_2), res, xlab = “fitted values”, ylab = “Residuals”, lim = max(abs(res))*c(-1,1))‏ • No visible pattern: fit appears ok • Abline(h = 0, lty = 2)‏ • Adds a dotted line at height 0

Drug Treatment TestingFamlilial Andenomatous Polyposis (FAP)‏ • Counts of colonic polyps after 12 months of treatment • Don’t want to be the guy that did that • Placebo controlled (Binary factor)‏ • Age

GLM Analysis • Multiple Regression won’t work. • Count data strictly positive • Normality is not probable • Poisson Regression- GLM with a log link function • Ensures a Poisson Distribution • Ensures positive fitted amounts • R command: • Polyps_glm_1 <- glm(number ~ treat + age, data = polyps, family = poisson())‏

Model Summary • Call: • glm(formula = number ~ treat + age, family = poisson(), data = polyps)‏ • Deviance Residuals: • Min 1Q Median 3Q Max • -4.2212 -3.0536 -0.1802 1.4459 5.8301 • Coefficients: • Estimate Std. Error z value Pr(>|z|) • (Intercept) 4.529024 0.146872 30.84 < 2e-16 *** • treatdrug -1.359083 0.117643 -11.55 < 2e-16 *** • age -0.038830 0.005955 -6.52 7.02e-11 *** • --- • Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 • (Dispersion parameter for poisson family taken to be 1)‏ • Null deviance: 378.66 on 19 degrees of freedom • Residual deviance: 179.54 on 17 degrees of freedom • AIC: 273.88 • Number of Fisher Scoring iterations: 5

Model Problem: Overdispersion • In previous models the variance can be seen as dependent solely on the mean • E.G. Binomial, Poisson • In practice, this doesn’t always work. • Sometimes the raw data points are not independent, there is some correlation, in this case, possible “clustering” • Compare residual deviance and degrees of freedom to determine • These should be basically equal

Model Solution: Quasi-Likelihood • This procedure estimates the other factors that might contribute to the variance • R Command • Polyps_glm_2 <- glm(number ~ treat + age, data = polyps, family = quasipoisson())‏ • Summary(polyps_glm_2)‏ • The coefficients are still significant, but less so

Homework • Please run through the commands for the myplot function (pg 100) and send me the command script. • Please send me what you thought of the presentation, give me a grade and add any constructive criticism. • zweihanderdawg@gmail.com

Logistic Regression and Generalized Linear Models:

Logistic Regression and Generalized Linear Models:

Presentation Transcript

Chapter 13

Regression for Data Mining

Logistic Regression – Basic Relationships

Introduction to Generalized Linear Models

Multiple Regression Analysis

This chapter uses MS Excel and Weka

limma: Linear Models for Microarray Data

Principles of Biostatistics Simple Linear Regression

Introduction to Linear Regression and Correlation Analysis

VI. Logistic Regression

Regression Analysis

3.3 Hypothesis Testing in Multiple Linear Regression

Survival Analysis in SAS

Chapter 2 The Simple Linear Regression Model: Specification and Estimation

Linear Mixed Models: An Introduction

Relative Importance of Predictors with Regression Models

Correlation and regression

Nonlinear Regression Models

Multilevel Regression Models

Regression