Introduction to Logistic Regression Analysis Dr Tuan V. Nguyen Garvan Institute of Medical Research Sydney, Australia
Introductory example 1 • Gender difference in preference for white wine. A group of 57 men and 167 women were asked to state their preference for a new white wine. The results are as follows:

        Like  Dislike  Total
Men      23      34      57
Women    35     132     167

Question: Is there a gender effect on preference?
Introductory example 2 Fat concentration and preference. 435 samples of a sauce of varying fat concentration were tasted by consumers. There were two outcomes: like or dislike. The results are as follows:

Fat conc.  1.35  1.60  1.75  1.85  1.95  2.05  2.15  2.25  2.35
Like         13    19    67    45    71    50    35     7     1
Dislike       0     0     2     5     8    20    31    49    12

Question: Is there an effect of fat concentration on preference?
Consideration … • The question in example 1 can be addressed by "traditional" analyses such as the z-statistic or chi-square test. • The question in example 2 is harder to handle, as the factor (fat concentration) is a continuous variable while the outcome is categorical (like or dislike). • However, there is a much better and more systematic method to analyse these data: logistic regression.
Odds and odds ratio • Let P be the probability of preference; then the odds of preference is O = P / (1 - P) • O(men) = 0.403 / 0.597 = 0.676 • O(women) = 0.209 / 0.791 = 0.265 • Odds ratio: OR = O(men) / O(women) = 0.676 / 0.265 = 2.55 (Meaning: the odds of preference are 2.55 times higher in men than in women)
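These hand calculations can be reproduced in R; the following is a minimal sketch using the counts from the preference table (23/34 for men, 35/132 for women):

```r
# Odds and odds ratio from the 2 x 2 preference table
like    <- c(men = 23, women = 35)
dislike <- c(men = 34, women = 132)
p    <- like / (like + dislike)   # probabilities of preference: 0.403, 0.209
odds <- p / (1 - p)               # odds: about 0.676 and 0.265
odds["men"] / odds["women"]       # odds ratio: about 2.55
```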
Meanings of odds ratio • OR > 1: the odds of preference are higher in men than in women • OR < 1: the odds of preference are lower in men than in women • OR = 1: the odds of preference in men are the same as in women • How do we assess the "significance" of OR?
Computing variance of odds ratio • The significance of OR can be tested by calculating its variance. • The variance of OR can be indirectly calculated by working with logarithmic scale: • Convert OR to log(OR) • Calculate variance of log(OR) • Calculate 95% confidence interval of log(OR) • Convert back to 95% confidence interval of OR
Computing variance of odds ratio
• OR = (23/34) / (35/132) = 2.55
• log(OR) = log(2.55) = 0.937
• Variance of log(OR): V = 1/23 + 1/34 + 1/35 + 1/132 = 0.109
• Standard error of log(OR): SE = sqrt(0.109) = 0.330
• 95% confidence interval of log(OR): 0.937 ± 1.96 × 0.330 = 0.290 to 1.584
• Convert back to the 95% confidence interval of OR: exp(0.290) = 1.34 to exp(1.584) = 4.87
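The confidence-interval steps above can be sketched in R as:

```r
# 95% CI for the odds ratio, computed on the log scale
counts <- c(23, 34, 35, 132)          # like/dislike for men, then women
or <- (23/34) / (35/132)              # about 2.55
se <- sqrt(sum(1 / counts))           # about 0.330
exp(log(or) + c(-1.96, 1.96) * se)    # about 1.34 to 4.87
```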
Logistic analysis by R
sex <- c(1, 2)
like <- c(23, 35)
dislike <- c(34, 132)
total <- like + dislike
prob <- like/total
logistic <- glm(prob ~ sex, family="binomial", weights=total)
> summary(logistic)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5457     0.5725   0.953  0.34044
sex          -0.9366     0.3302  -2.836  0.00456 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7.8676e+00 on 1 degrees of freedom
Residual deviance: 2.2204e-15 on 0 degrees of freedom
AIC: 13.629
Note: exp(0.9366) = 2.55; men are coded 1 and women 2, so the negative slope for sex reproduces the odds ratio of 2.55 for men versus women computed by hand.
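The hand-calculated odds ratio and its Wald 95% CI can also be recovered from the fitted model. A sketch (the sign is flipped because men are coded 1 and women 2, so men sit one unit lower on the sex variable):

```r
# Odds ratio for men vs women from the fitted coefficients
exp(-coef(logistic)["sex"])                    # about 2.55
exp(-rev(confint.default(logistic)["sex", ]))  # about 1.34 to 4.87
```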
Analysis by using R
conc <- c(1.35, 1.60, 1.75, 1.85, 1.95, 2.05, 2.15, 2.25, 2.35)
like <- c(13, 19, 67, 45, 71, 50, 35, 7, 1)
dislike <- c(0, 0, 2, 5, 8, 20, 31, 49, 12)
total <- like + dislike
prob <- like/total
plot(prob ~ conc, pch=16, xlab="Concentration")
Logistic regression model for continuous factor - model • Let p = probability of preference • The logit of p is: logit(p) = log[p / (1 - p)] • Model: logit(p) = a + b(FAT), where a is the intercept and b is the slope, both of which have to be estimated from the data
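The link function and its inverse can be written directly in R. A minimal sketch (the coefficient values are taken from the glm output shown below):

```r
# logit link and its inverse
logit     <- function(p) log(p / (1 - p))
inv.logit <- function(x) 1 / (1 + exp(-x))

# With the fitted values a = 22.708 and b = -10.662, the predicted
# probability of preference at fat concentration 1.8 is:
inv.logit(22.708 - 10.662 * 1.8)   # about 0.97
```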
Analysis by using R
logistic <- glm(prob ~ conc, family="binomial", weights=total)
summary(logistic)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.78226  -0.69052   0.07981   0.36556   1.36871
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   22.708      2.266  10.021   <2e-16 ***
conc         -10.662      1.083  -9.849   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 198.7115 on 8 degrees of freedom
Residual deviance: 8.5568 on 7 degrees of freedom
AIC: 37.096
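To judge the fit visually, the fitted curve can be overlaid on the observed proportions; a sketch using predict() with type = "response":

```r
# Overlay the fitted logistic curve on the observed proportions
xx <- seq(1.3, 2.4, by = 0.01)
yy <- predict(logistic, newdata = data.frame(conc = xx), type = "response")
plot(prob ~ conc, pch = 16, xlab = "Concentration")
lines(xx, yy)
```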
Logistic regression model for continuous factor – Interpretation • The odds ratio associated with each 0.1 increase in fat concentration was 2.90 (95% CI: 2.34 to 3.59) • Interpretation: each 0.1 increase in fat concentration was associated with a 2.9-fold increase in the odds of disliking the product (the slope is negative for liking, so exp(10.662 × 0.1) = 2.90 is the odds ratio for disliking). Since the 95% confidence interval excludes 1, this association was statistically significant at the p < 0.05 level.
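A sketch of how the 2.90 (2.34 to 3.59) figures follow from the glm output:

```r
# OR per 0.1 increase in fat concentration, expressed as odds of disliking
b  <- -10.662                           # slope from summary(logistic)
se <- 1.083                             # its standard error
exp(-0.1 * b)                           # about 2.90
exp(-0.1 * (b + c(1.96, -1.96) * se))   # about 2.34 to 3.59
```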
Multiple logistic regression
Outcome: fracture (fx: 0 = no, 1 = yes)
Predictors (independent variables): age, bmi, bmd, ictp, pinp
Question: which variables are important for fracture?
 id fx age     bmi   bmd  ictp   pinp
  1  1 79 24.7252 0.818 9.170 37.383
  2  1 89 25.9909 0.871 7.561 24.685
  3  1 70 25.3934 1.358 5.347 40.620
  4  1 88 23.2254 0.714 7.354 56.782
  5  1 85 24.6097 0.748 6.760 58.358
  6  0 68 25.0762 0.935 4.939 67.123
  7  0 70 19.8839 1.040 4.321 26.399
  8  0 69 25.0593 1.002 4.212 47.515
  9  0 74 25.6544 0.987 5.605 26.132
 10  0 79 19.9594 0.863 5.204 60.267
...
137  0 64 38.0762 1.086 5.043 32.835
138  1 80 23.3887 0.875 4.086 23.837
139  0 67 25.9455 0.983 4.328 71.334
Multiple logistic regression: R analysis
setwd("c:/works/stats")
fracture <- read.table("fracture.txt", header=TRUE, na.strings=".")
names(fracture)
fulldata <- na.omit(fracture)
attach(fulldata)
# spell out the predictors so that the id column is not used as one
temp <- glm(fx ~ age + bmi + bmd + ictp + pinp, family="binomial", data=fulldata)
search <- step(temp)
summary(search)
Bayesian Model Average (BMA) analysis
library(BMA)
xvars <- fulldata[, 3:7]   # age, bmi, bmd, ictp, pinp
y <- fx
bma.search <- bic.glm(xvars, y, strict=FALSE, OR=20, glm.family="binomial")
summary(bma.search)
imageplot.bma(bma.search)
Bayesian Model Average (BMA) analysis
> summary(bma.search)
Call:
Best 5 models (cumulative posterior probability = 0.8836):

           p!=0    EV        SD      model 1   model 2   model 3   model 4   model 5
Intercept  100    -2.85012   2.8651   -3.920    -1.065    -1.201    -8.257    -0.072
age         15.3   0.00845   0.0261     .         .         .        0.063      .
bmi         21.7  -0.02302   0.0541     .         .       -0.116      .       -0.070
bmd         39.7  -1.34136   1.9762     .       -3.499      .         .       -2.696
ictp       100.0   0.64575   0.1699    0.606     0.687     0.680     0.554     0.714
pinp         5.7  -0.00037   0.0041     .         .         .         .         .

nVar                                     1         2         2         2         3
BIC                                  -525.044  -524.939  -523.625  -522.672  -521.032
post prob                               0.307     0.291     0.151     0.094     0.041
Bayesian Model Average (BMA) analysis > imageplot.bma(bma.search)
Summary of main points • The logistic regression model is used to analyse the association between a binary outcome and one or more determinants. • The determinants can be binary, categorical, or continuous measurements. • The model is logit(p) = log[p / (1-p)] = a + bX, where X is a factor, and a and b are estimated from the observed data.
Summary of main points • exp(b) is the odds ratio associated with a one-unit increment in the determinant X. • The logistic regression model can be extended to include many determinants: logit(p) = log[p / (1-p)] = a + b1X1 + b2X2 + b3X3 + …