Introduction to Logistic Regression Analysis Dr Tuan V. Nguyen Garvan Institute of Medical Research Sydney, Australia
Introductory example 1 • Gender difference in preference for white wine. A group of 57 men and 167 women were asked to state their preference for a new white wine. The results are as follows:

        Like  Dislike  Total
Men      23      34      57
Women    35     132     167

Question: Is there a gender effect on preference?
Introductory example 2 Fat concentration and preference. 435 samples of a sauce of varying fat concentration were tasted by consumers. There were two outcomes: like or dislike. The results are as follows:

Fat conc.  1.35  1.60  1.75  1.85  1.95  2.05  2.15  2.25  2.35
Like         13    19    67    45    71    50    35     7     1
Dislike       0     0     2     5     8    20    31    49    12

Question: Is there an effect of fat concentration on preference?
Consideration … • The question in example 1 can be addressed by "traditional" analyses such as the z-statistic or chi-square test. • The question in example 2 is harder to handle, as the factor (fat concentration) is a continuous variable while the outcome is categorical (like or dislike). • However, there is a much better and more systematic method to analyse these data: logistic regression.
Odds and odds ratio • Let P be the probability of preference; then the odds of preference is O = P / (1 - P) • O(men) = 0.403 / 0.597 = 0.676 • O(women) = 0.209 / 0.791 = 0.265 • Odds ratio: OR = O(men) / O(women) = 0.676 / 0.265 = 2.55 (Meaning: the odds of preference are 2.55 times higher in men than in women)
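These hand calculations can be reproduced in R; the following is a minimal sketch using the counts from the preference table (23/34 for men, 35/132 for women):

```r
# Odds and odds ratio from the 2 x 2 preference table
like    <- c(men = 23, women = 35)
dislike <- c(men = 34, women = 132)
p    <- like / (like + dislike)   # probabilities of preference: 0.403, 0.209
odds <- p / (1 - p)               # odds: about 0.676 and 0.265
odds["men"] / odds["women"]       # odds ratio: about 2.55
```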
Meanings of odds ratio • OR > 1: the odds of preference are higher in men than in women • OR < 1: the odds of preference are lower in men than in women • OR = 1: the odds of preference in men are the same as in women • How do we assess the "significance" of OR?
Computing variance of odds ratio • The significance of OR can be tested by calculating its variance. • The variance of OR can be indirectly calculated by working with logarithmic scale: • Convert OR to log(OR) • Calculate variance of log(OR) • Calculate 95% confidence interval of log(OR) • Convert back to 95% confidence interval of OR
Computing variance of odds ratio
• OR = (23/34) / (35/132) = 2.55
• log(OR) = log(2.55) = 0.937
• Variance of log(OR): V = 1/23 + 1/34 + 1/35 + 1/132 = 0.109
• Standard error of log(OR): SE = sqrt(0.109) = 0.330
• 95% confidence interval of log(OR): 0.937 ± 1.96 × 0.330 = 0.290 to 1.584
• Convert back to the 95% confidence interval of OR: exp(0.290) = 1.34 to exp(1.584) = 4.87
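The confidence-interval steps above can be sketched in R as:

```r
# 95% CI for the odds ratio, computed on the log scale
counts <- c(23, 34, 35, 132)          # like/dislike for men, then women
or <- (23/34) / (35/132)              # about 2.55
se <- sqrt(sum(1 / counts))           # about 0.330
exp(log(or) + c(-1.96, 1.96) * se)    # about 1.34 to 4.87
```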
Logistic analysis by R
sex <- c(1, 2)
like <- c(23, 35)
dislike <- c(34, 132)
total <- like + dislike
prob <- like/total
logistic <- glm(prob ~ sex, family="binomial", weights=total)
> summary(logistic)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5457     0.5725   0.953  0.34044
sex          -0.9366     0.3302  -2.836  0.00456 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7.8676e+00 on 1 degrees of freedom
Residual deviance: 2.2204e-15 on 0 degrees of freedom
AIC: 13.629
Note: exp(0.9366) = 2.55; men are coded 1 and women 2, so the negative slope for sex reproduces the odds ratio of 2.55 for men versus women computed by hand.
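The hand-calculated odds ratio and its Wald 95% CI can also be recovered from the fitted model. A sketch (the sign is flipped because men are coded 1 and women 2, so men sit one unit lower on the sex variable):

```r
# Odds ratio for men vs women from the fitted coefficients
exp(-coef(logistic)["sex"])                    # about 2.55
exp(-rev(confint.default(logistic)["sex", ]))  # about 1.34 to 4.87
```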
Analysis by using R
conc <- c(1.35, 1.60, 1.75, 1.85, 1.95, 2.05, 2.15, 2.25, 2.35)
like <- c(13, 19, 67, 45, 71, 50, 35, 7, 1)
dislike <- c(0, 0, 2, 5, 8, 20, 31, 49, 12)
total <- like + dislike
prob <- like/total
plot(prob ~ conc, pch=16, xlab="Concentration")
Logistic regression model for continuous factor - model • Let p = probability of preference • The logit of p is: logit(p) = log[p / (1 - p)] • Model: logit(p) = a + b(FAT), where a is the intercept and b is the slope, both of which have to be estimated from the data
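The link function and its inverse can be written directly in R. A minimal sketch (the coefficient values are taken from the glm output shown below):

```r
# logit link and its inverse
logit     <- function(p) log(p / (1 - p))
inv.logit <- function(x) 1 / (1 + exp(-x))

# With the fitted values a = 22.708 and b = -10.662, the predicted
# probability of preference at fat concentration 1.8 is:
inv.logit(22.708 - 10.662 * 1.8)   # about 0.97
```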
Analysis by using R
logistic <- glm(prob ~ conc, family="binomial", weights=total)
summary(logistic)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.78226  -0.69052   0.07981   0.36556   1.36871
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   22.708      2.266  10.021   <2e-16 ***
conc         -10.662      1.083  -9.849   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 198.7115 on 8 degrees of freedom
Residual deviance: 8.5568 on 7 degrees of freedom
AIC: 37.096
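To judge the fit visually, the fitted curve can be overlaid on the observed proportions; a sketch using predict() with type = "response":

```r
# Overlay the fitted logistic curve on the observed proportions
xx <- seq(1.3, 2.4, by = 0.01)
yy <- predict(logistic, newdata = data.frame(conc = xx), type = "response")
plot(prob ~ conc, pch = 16, xlab = "Concentration")
lines(xx, yy)
```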
Logistic regression model for continuous factor – Interpretation • The odds ratio associated with each 0.1 increase in fat concentration was 2.90 (95% CI: 2.34 to 3.59) • Interpretation: each 0.1 increase in fat concentration was associated with a 2.9-fold increase in the odds of disliking the product (the slope is negative for liking, so exp(10.662 × 0.1) = 2.90 is the odds ratio for disliking). Since the 95% confidence interval excludes 1, this association was statistically significant at the p < 0.05 level.
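A sketch of how the 2.90 (2.34 to 3.59) figures follow from the glm output:

```r
# OR per 0.1 increase in fat concentration, expressed as odds of disliking
b  <- -10.662                           # slope from summary(logistic)
se <- 1.083                             # its standard error
exp(-0.1 * b)                           # about 2.90
exp(-0.1 * (b + c(1.96, -1.96) * se))   # about 2.34 to 3.59
```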
Multiple logistic regression
Outcome: fracture (fx: 0 = no, 1 = yes)
Predictors (independent variables): age, bmi, bmd, ictp, pinp
Question: which variables are important for fracture?
 id fx age     bmi   bmd  ictp   pinp
  1  1 79 24.7252 0.818 9.170 37.383
  2  1 89 25.9909 0.871 7.561 24.685
  3  1 70 25.3934 1.358 5.347 40.620
  4  1 88 23.2254 0.714 7.354 56.782
  5  1 85 24.6097 0.748 6.760 58.358
  6  0 68 25.0762 0.935 4.939 67.123
  7  0 70 19.8839 1.040 4.321 26.399
  8  0 69 25.0593 1.002 4.212 47.515
  9  0 74 25.6544 0.987 5.605 26.132
 10  0 79 19.9594 0.863 5.204 60.267
...
137  0 64 38.0762 1.086 5.043 32.835
138  1 80 23.3887 0.875 4.086 23.837
139  0 67 25.9455 0.983 4.328 71.334
Multiple logistic regression: R analysis
setwd("c:/works/stats")
fracture <- read.table("fracture.txt", header=TRUE, na.strings=".")
names(fracture)
fulldata <- na.omit(fracture)
attach(fulldata)
# spell out the predictors so that the id column is not used as one
temp <- glm(fx ~ age + bmi + bmd + ictp + pinp, family="binomial", data=fulldata)
search <- step(temp)
summary(search)
Bayesian Model Average (BMA) analysis
library(BMA)
xvars <- fulldata[, 3:7]   # age, bmi, bmd, ictp, pinp
y <- fx
bma.search <- bic.glm(xvars, y, strict=FALSE, OR=20, glm.family="binomial")
summary(bma.search)
imageplot.bma(bma.search)
Bayesian Model Average (BMA) analysis
> summary(bma.search)
Call:
Best 5 models (cumulative posterior probability = 0.8836):

           p!=0    EV        SD      model 1   model 2   model 3   model 4   model 5
Intercept  100    -2.85012   2.8651   -3.920    -1.065    -1.201    -8.257    -0.072
age         15.3   0.00845   0.0261     .         .         .        0.063      .
bmi         21.7  -0.02302   0.0541     .         .       -0.116      .       -0.070
bmd         39.7  -1.34136   1.9762     .       -3.499      .         .       -2.696
ictp       100.0   0.64575   0.1699    0.606     0.687     0.680     0.554     0.714
pinp         5.7  -0.00037   0.0041     .         .         .         .         .

nVar                                     1         2         2         2         3
BIC                                  -525.044  -524.939  -523.625  -522.672  -521.032
post prob                               0.307     0.291     0.151     0.094     0.041
Bayesian Model Average (BMA) analysis > imageplot.bma(bma.search)
Summary of main points • The logistic regression model is used to analyse the association between a binary outcome and one or more determinants. • The determinants can be binary, categorical, or continuous measurements. • The model is logit(p) = log[p / (1-p)] = a + bX, where X is a factor, and a and b are estimated from the observed data.
Summary of main points • exp(b) is the odds ratio associated with a one-unit increment in the determinant X. • The logistic regression model can be extended to include many determinants: logit(p) = log[p / (1-p)] = a + b1X1 + b2X2 + b3X3 + …