Logistic Regression for binary outcomes


Presentation Transcript


  1. Logistic Regression for binary outcomes

  2. In linear regression, Y is continuous. In logistic regression, Y is binary (0 or 1), and the average of Y is the proportion P. We can’t use linear regression because:
     • Y can’t be linearly related to the Xs.
     • Y does NOT have a Gaussian (normal) distribution around its “mean” P.

     We need a “linearizing” transformation and a non-Gaussian error model.

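  3. Since $0 \le P \le 1$, we might use the odds, $\text{odds} = P/(1-P)$. The odds have no “ceiling” but have a “floor” of zero. So we use the logit transformation, $\operatorname{logit}(P) = \ln\left(P/(1-P)\right) = \ln(\text{odds})$, which has neither a floor nor a ceiling.

     Model: $\operatorname{logit}(P) = \ln\left(P/(1-P)\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$

     or, equivalently, $\text{odds} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k} = e^{\text{logit}}$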

  4. Since $P = \text{odds}/(1 + \text{odds})$ and $\text{odds} = e^{\text{logit}}$,

     $P = \dfrac{e^{\text{logit}}}{1 + e^{\text{logit}}} = \dfrac{1}{1 + e^{-\text{logit}}}$
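
The conversions between probability, odds, and logit on the last two slides can be sketched in a few lines of Python. The function names below are illustrative choices, not part of the presentation.

```python
import math


def odds(p: float) -> float:
    """Odds for a probability p, with 0 <= p < 1."""
    return p / (1.0 - p)


def logit(p: float) -> float:
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(odds(p))


def inv_logit(z: float) -> float:
    """Probability for a given logit z: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


# Round trip: a probability mapped to the logit scale and back.
p = 0.3
print(odds(p), logit(p), inv_logit(logit(p)))
```

The round trip returns the starting probability, which is the identity shown on the slide above.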

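  5. If $\ln(\text{odds}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$, then

     $\text{odds} = (e^{\beta_0})(e^{\beta_1 X_1})(e^{\beta_2 X_2}) \cdots (e^{\beta_k X_k})$

     or $\text{odds} = (\text{base odds}) \times \text{OR}_1 \times \text{OR}_2 \times \dots \times \text{OR}_k$

     The model is multiplicative on the odds scale. The base odds, $e^{\beta_0}$, are the odds when all Xs = 0. $\text{OR}_i = e^{\beta_i X_i}$ is the odds ratio for the ith X, relative to $X_i = 0$.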

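  6. Interpreting β coefficients
     Example: dichotomous X, with X = 0 for males and X = 1 for females.
     $\operatorname{logit}(P) = \beta_0 + \beta_1 X$
     M: X = 0, $\operatorname{logit}(P_m) = \beta_0$
     F: X = 1, $\operatorname{logit}(P_f) = \beta_0 + \beta_1$
     $\operatorname{logit}(P_f) - \operatorname{logit}(P_m) = \beta_1$
     So $\log(\text{OR}) = \beta_1$ and $e^{\beta_1} = \text{OR}$.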

  7. Example: P is the proportion with disease.
     $\operatorname{logit}(P) = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{sex}$, where “sex” is coded 0 for M, 1 for F.
     • $e^{\beta_2}$ is the OR for disease for F vs M of the same age.
     • $e^{\beta_1}$ is the factor by which the odds of disease are multiplied for a one-year increase in age.
     • $(e^{\beta_1})^k = e^{k\beta_1}$ is the OR for a k-year difference in age between two groups of the same sex.

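  8. Example: P is the proportion with an MI.
     Predictors:
     • age in years
     • htn = hypertension (1 = yes, 0 = no)
     • smoke = smoking (1 = yes, 0 = no)

     $\operatorname{logit}(P) = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{htn} + \beta_3\,\text{smoke}$

     Q: What is the OR for a 40-year-old with hypertension vs an otherwise identical 30-year-old without hypertension?
     A: $(\beta_0 + 40\beta_1 + \beta_2 + \beta_3\,\text{smoke}) - (\beta_0 + 30\beta_1 + \beta_3\,\text{smoke}) = 10\beta_1 + \beta_2 = \log \text{OR}$, so $\text{OR} = e^{10\beta_1 + \beta_2}$.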
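
The cancellation in the MI example can be checked numerically. The sketch below uses hypothetical coefficient values, since the slides give none; only the structure of the model comes from the example.

```python
import math

# Hypothetical coefficients, chosen only for illustration (not from the slides).
BETA = {"intercept": -6.0, "age": 0.05, "htn": 0.7, "smoke": 0.9}


def logit_mi(age, htn, smoke, beta=BETA):
    """Logit of P(MI) under the model beta0 + beta1*age + beta2*htn + beta3*smoke."""
    return (beta["intercept"] + beta["age"] * age
            + beta["htn"] * htn + beta["smoke"] * smoke)


def odds_ratio(person_a, person_b, beta=BETA):
    """OR comparing person_a to person_b: exp of the difference in logits."""
    return math.exp(logit_mi(beta=beta, **person_a) - logit_mi(beta=beta, **person_b))


# The smoking term cancels, so the OR is the same for smokers and non-smokers.
for smoke in (0, 1):
    a = {"age": 40, "htn": 1, "smoke": smoke}
    b = {"age": 30, "htn": 0, "smoke": smoke}
    print(smoke, odds_ratio(a, b), math.exp(10 * BETA["age"] + BETA["htn"]))
```

The intercept and the smoking term drop out of the difference, which is why the OR depends only on the age and hypertension coefficients.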

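  9. Interactions
     P is the proportion with CHD. S: 1 = smoking, 0 = non-smoking. D: 1 = drinking, 0 = non-drinking.
     $\operatorname{logit}(P) = \beta_0 + \beta_1 S + \beta_2 D + \beta_3 SD$
     The referent category is S = 0, D = 0.

     | S | D | odds | OR |
     |---|---|------|----|
     | 0 | 0 | $e^{\beta_0}$ | $\text{OR}_{00} = e^{\beta_0}/e^{\beta_0} = 1$ |
     | 1 | 0 | $e^{\beta_0+\beta_1}$ | $\text{OR}_{10} = e^{\beta_1}$ |
     | 0 | 1 | $e^{\beta_0+\beta_2}$ | $\text{OR}_{01} = e^{\beta_2}$ |
     | 1 | 1 | $e^{\beta_0+\beta_1+\beta_2+\beta_3}$ | $\text{OR}_{11} = e^{\beta_1+\beta_2+\beta_3}$ |

     When will $\text{OR}_{11} = \text{OR}_{10} \times \text{OR}_{01}$? If and only if $\beta_3 = 0$.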
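
A short sketch of the interaction table, assuming hypothetical coefficient values (the slide gives only symbols). It shows that the ratio $\text{OR}_{11}/(\text{OR}_{10} \times \text{OR}_{01})$ equals $e^{\beta_3}$, so the joint OR is the product of the separate ORs only when $\beta_3 = 0$.

```python
import math


def interaction_odds_ratios(b1, b2, b3):
    """ORs vs the referent (S=0, D=0) for logit = b0 + b1*S + b2*D + b3*S*D."""
    return {
        (0, 0): 1.0,
        (1, 0): math.exp(b1),
        (0, 1): math.exp(b2),
        (1, 1): math.exp(b1 + b2 + b3),
    }


# Hypothetical coefficients for illustration only.
for b3 in (0.0, 0.4):
    ors = interaction_odds_ratios(b1=0.6, b2=0.3, b3=b3)
    ratio = ors[(1, 1)] / (ors[(1, 0)] * ors[(0, 1)])
    print(b3, ors[(1, 1)], ratio)
```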

  10. Interpretation example
     Potential predictors (13) of in-hospital infection mortality (yes or no). Crabtree et al., JAMA, 8 Dec 1999, No. 22, 2143-2148.
     • Gender (female or male)
     • Age in years
     • APACHE score (0-129)
     • Diabetes (y/n)
     • Renal insufficiency / hemodialysis (y/n)
     • Intubation / mechanical ventilation (y/n)
     • Malignancy (y/n)
     • Steroid therapy (y/n)
     • Transfusions (y/n)
     • Organ transplant (y/n)
     • WBC count
     • Max temperature (degrees)
     • Days from admission to treatment (> 7 days)

  11. Factors Associated With Mortality for All Infections

     | Characteristic | Odds ratio | 95% CI | p value |
     |----------------|------------|--------|---------|
     | Increasing APACHE* score | 1.15 | 1.11-1.18 | <.001 |
     | Transfusion (y/n) | 4.15 | 2.46-6.99 | <.001 |
     | Increasing age | 1.03 | 1.02-1.05 | <.001 |
     | Malignancy | 2.60 | 1.62-4.17 | <.001 |
     | Max temperature | 0.70 | 0.58-0.85 | <.001 |
     | Admission to treatment > 7 days | 1.66 | 1.05-2.61 | 0.03 |
     | Female (y/n) | 1.32 | 0.90-1.94 | 0.16 |

     *APACHE = Acute Physiology & Chronic Health Evaluation score

  12. Diabetes complications: descriptive statistics

     Table of obese by diabetes complication (frequencies)

     | obese | complication: no (0) | complication: yes (1) | Total | % yes |
     |-------|----------------------|-----------------------|-------|-------|
     | no (0) | 56 | 28 | 84 | 28/84 = 33% |
     | yes (1) | 20 | 41 | 61 | 41/61 = 67% |
     | Total | 76 | 69 | 145 | |
     | % obese | 26% | 59% | | |

     RR = 2.0, OR = 4.1, p < 0.001

     Fasting glucose (“fast glu”), mg/dl

     | Group | n | min | median | mean | max |
     |-------|---|-----|--------|------|-----|
     | No complication | 76 | 70.0 | 90.0 | 91.2 | 112.0 |
     | Complication | 69 | 75.0 | 114.0 | 155.9 | 353.0 |

     p=

     Steady state glucose (“steady glu”), mg/dl

     | Group | n | min | median | mean | max |
     |-------|---|-----|--------|------|-----|
     | No complication | 76 | 29.0 | 105.0 | 114.0 | 273.0 |
     | Complication | 69 | 60.0 | 257.0 | 261.5 | 480.0 |

     p=

  13. Diabetes complication

     | Parameter | DF | beta | SE(b) | Chi-square | p |
     |-----------|----|------|-------|------------|---|
     | Intercept | 1 | -14.70 | 3.231 | 20.706 | <.0001 |
     | obese | 1 | 0.328 | 0.615 | 0.285 | 0.5938 |
     | Fast glu | 1 | 0.108 | 0.031 | 12.456 | 0.0004 |
     | Steady glu | 1 | 0.023 | 0.005 | 18.322 | <.0001 |

     Log odds of diabetes complication = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu

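  14. Statistical significance of the βs
     • Linear regression: $t = b/\text{SE}$ → p value
     • Logistic regression: $\chi^2 = (b/\text{SE})^2$ → p value

     For a 95% CI on the odds ratio, first form the CI for β on the log scale, $(b - 1.96\,\text{SE},\ b + 1.96\,\text{SE})$, then take antilogs of each end: $(e^{b - 1.96\,\text{SE}},\ e^{b + 1.96\,\text{SE}})$.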
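
As a minimal sketch of the Wald test described here, the chi-square statistic and its p value can be computed with the standard library alone. The sketch assumes the statistic is referred to a chi-square distribution with 1 df, whose upper tail equals the two-sided normal tail probability of $|b/\text{SE}|$; the function name is illustrative.

```python
import math


def wald_test(b: float, se: float):
    """Return (chi-square, p value) for the Wald test of a single coefficient."""
    z = b / se
    chi_square = z * z
    # Upper tail of chi-square with 1 df = two-sided normal tail of |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return chi_square, p_value


# "obese" row of the diabetes complication model: b = 0.328, SE = 0.615.
print(wald_test(0.328, 0.615))
```

Because the tabled b and SE are rounded, results computed this way can differ slightly from the printed chi-square values.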

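  15. Diabetes complications: odds ratio estimates

     | Effect | Point estimate | 95% Wald lower limit | 95% Wald upper limit |
     |--------|----------------|----------------------|----------------------|
     | obese | $e^{0.328} = 1.388$ | 0.416 | 4.631 |
     | Fast glu | $e^{0.108} = 1.114$ | 1.049 | 1.182 |
     | Steady glu | $e^{0.023} = 1.023$ | 1.012 | 1.033 |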
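
The odds ratio estimates can be approximately reproduced from the coefficients and standard errors of the fitted model. This is a sketch using the rounded b and SE values from the parameter table, so the limits agree with the printed ones only to about two decimal places.

```python
import math


def odds_ratio_ci(b: float, se: float, z: float = 1.96):
    """Return (OR, lower, upper): antilogs of b and of b -/+ z*SE."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


# Rounded (b, SE) pairs from the diabetes complication model.
model = {"obese": (0.328, 0.615), "fast glu": (0.108, 0.031), "steady glu": (0.023, 0.005)}

for name, (b, se) in model.items():
    point, lower, upper = odds_ratio_ci(b, se)
    print(f"{name:10s} OR={point:.3f}  95% CI=({lower:.3f}, {upper:.3f})")
```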

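  16. Model fit: linear vs logistic regression, with k variables and n observations

     | Variation | df | Sum of squares (linear) or deviance (logistic) |
     |-----------|----|-----------------------------------------------|
     | Model | k | G |
     | Error | n − k − 1 | D |
     | Total | n − 1 | T (fixed) |

     $Y_i$ = ith observation, $\hat{Y}_i$ = prediction for the ith observation.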

  17. Good regression models have large G and small D. For logistic regression, the mean deviance, D/(n − k − 1), should be near 1.0. There are two versions of $R^2$ for logistic regression.

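  18. Goodness of fit: deviance
     Deviance in logistic regression is like the sum of squares in linear regression.

     | Source | df | −2 log L | p value |
     |--------|----|----------|---------|
     | Model (G) | 3 | 117.21 | < 0.001 |
     | Error (D) | 141 | 83.46 | |
     | Total (T) | 144 | 200.67 | |

     Mean deviance = 83.46/141 = 0.59 (want mean deviance to be ≤ 1).
     $R^2_{\text{pseudo}} = G/\text{total} = 117/201 = 0.58$, $R^2_{\text{cs}} = 0.554$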
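
The fit summaries on this slide follow directly from G, D, and n. In the sketch below, the pseudo $R^2$ is G/T as on the slide. The slide does not give a formula for $R^2_{\text{cs}}$; the code assumes it is the Cox-Snell $R^2$, $1 - e^{-G/n}$, which reproduces the reported 0.554 with n = 145.

```python
import math


def deviance_summary(g: float, d: float, n: int, k: int) -> dict:
    """Fit summaries from model deviance G, error deviance D, n observations, k predictors."""
    total = g + d
    return {
        "mean_deviance": d / (n - k - 1),          # error deviance per error df
        "r2_pseudo": g / total,                    # share of total deviance explained
        "r2_cox_snell": 1.0 - math.exp(-g / n),    # assumed formula for R2cs
    }


# Diabetes complication model: G = 117.21, D = 83.46, n = 145, k = 3.
print(deviance_summary(117.21, 83.46, 145, 3))
```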

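  19. Goodness of fit: Hosmer-Lemeshow (H-L) chi-square
     Compare observed vs model-predicted (expected) frequencies by decile of predicted probability.

     | decile | total | obs yes | exp yes | obs no | exp no |
     |--------|-------|---------|---------|--------|--------|
     | 1 | 16 | 0 | 0.23 | 16 | 15.8 |
     | 2 | 15 | 0 | 0.61 | 15 | 14.4 |
     | 3 | 15 | 0 | 1.31 | 15 | 13.7 |
     | … | | | | | |
     | 8 | 16 | 15 | 15.6 | 1 | 0.40 |
     | 9 | 23 | 23 | 23.0 | 0 | 0.00 |

     chi-square = 9.89, df = 7, p = 0.1946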

  20. Goodness of fit vs $R^2$
     Interpretation when goodness of fit is acceptable but $R^2$ is poor:
     • Need to include interactions or transform the X variables in the model?
     • Need to obtain more X variables?

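  21. Sensitivity & Specificity
     With a = true positives, b = false positives, c = false negatives, d = true negatives:
     • Sensitivity = a/(a+c), false negative rate = c/(a+c)
     • Specificity = d/(b+d), false positive rate = b/(b+d)
     • Accuracy = W × sensitivity + (1 − W) × specificity, where W is a weight between 0 and 1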
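
These definitions translate directly into code. The sketch below takes the four cell counts, with a and c the counts among true positives' group and b and d among the true negatives' group, as the formulas imply. The counts in the usage line are hypothetical.

```python
def classification_summary(a: int, b: int, c: int, d: int, w: float = 0.5) -> dict:
    """Summaries of a 2x2 table: a = true pos, b = false pos, c = false neg, d = true neg."""
    sensitivity = a / (a + c)
    specificity = d / (b + d)
    return {
        "sensitivity": sensitivity,
        "false_negative_rate": c / (a + c),
        "specificity": specificity,
        "false_positive_rate": b / (b + d),
        "accuracy": w * sensitivity + (1 - w) * specificity,
    }


# Hypothetical counts for illustration only.
print(classification_summary(a=40, b=10, c=10, d=90))
```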

  22. Any good classification rule, including a logistic model, should have high sensitivity and specificity. In logistic regression, we choose a cutpoint, $P_c$:
     • Predict positive if $P > P_c$
     • Predict negative if $P < P_c$

  23. Diabetes complication
     $\operatorname{logit}(P_i)$ = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu
     $P_i = 1/(1 + \exp(-\text{logit}))$
     Compute $P_i$ for all observations, then find the value of $P_i$ (call it $P_0$) that maximizes accuracy = 0.5 sensitivity + 0.5 specificity. This is an ROC analysis using the logit (or $P_i$).
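
The cutpoint search can be sketched as a scan over the observed predicted probabilities. The data below are a small hypothetical example, and the code adopts one convention the slides leave open: an observation whose probability equals the cutpoint is classified positive.

```python
def best_cutpoint(probs, labels, w=0.5):
    """Return (cutpoint, accuracy) maximizing w*sensitivity + (1-w)*specificity.

    Candidates are the observed probabilities; predict positive when p >= cutpoint.
    On ties in accuracy, the smallest cutpoint is kept.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_pc, best_acc = None, -1.0
    for pc in sorted(set(probs)):
        true_pos = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= pc)
        true_neg = sum(1 for p, y in zip(probs, labels) if y == 0 and p < pc)
        acc = w * true_pos / n_pos + (1 - w) * true_neg / n_neg
        if acc > best_acc:
            best_pc, best_acc = pc, acc
    return best_pc, best_acc


# Hypothetical predicted probabilities and outcomes, for illustration only.
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(best_cutpoint(probs, labels))
```

Each candidate cutpoint corresponds to one point on the ROC curve, so the scan is a search along that curve for the point with the highest weighted accuracy.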

  24. ROC for logistic model

  25. Diabetes model accuracy
     Logit = 0.447, $P_0 = e^{0.447}/(1 + e^{0.447}) = 0.61$
     Sens = 55/69 = 79.7%, Spec = 65/76 = 85.5%
     Accuracy = (81.2% + 85.5%)/2 = 83.4%

  26. C statistic (report this)
     $n_0$ = number negative, $n_1$ = number positive. Make all $n_0 \times n_1$ pairs (1, 0).
     • Concordant if predicted P for Y=1 > predicted P for Y=0
     • Discordant if predicted P for Y=1 < predicted P for Y=0

     $C = \dfrac{\text{num concordant} + 0.5 \times \text{num ties}}{n_0 \times n_1}$

     C = 0.949 for the diabetes complication model.
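
The pairwise definition of C can be coded directly. This is a simple O($n_0 \times n_1$) sketch on hypothetical data, not the computation used for the diabetes model.

```python
def c_statistic(probs, labels):
    """C = (concordant pairs + 0.5 * tied pairs) / (n0 * n1)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    concordant = 0
    ties = 0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                concordant += 1
            elif p_pos == p_neg:
                ties += 1
    return (concordant + 0.5 * ties) / (len(pos) * len(neg))


# Hypothetical predicted probabilities and outcomes, for illustration only.
probs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(c_statistic(probs, labels))
```

C is the area under the ROC curve: 0.5 means the model orders pairs no better than chance, and 1.0 means every positive is ranked above every negative.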

  27. Logistic model is also a discriminant model (LDA) Histograms of logit scores for each group

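  28. Poisson Regression
     Y is a count, a small non-negative integer: 0, 1, 2, …
     Model: $\ln(\text{mean } Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$, so $\text{mean } Y = \exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k)$.
     $d(\text{mean } Y)/dX_i = \beta_i \times \text{mean } Y$, so $\beta_i = \left(d(\text{mean } Y)/dX_i\right)/\text{mean } Y$.
     $100\,\beta_i$ is approximately the percent change in mean Y per unit change in $X_i$; the exact percent change is $100\,(e^{\beta_i} - 1)$.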
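
A short sketch of the percent-change interpretation, with hypothetical coefficients. It compares the approximation 100 β with the exact change 100 (e^β − 1) implied by the log-linear model; the two are close when β is small.

```python
import math


def mean_count(b0, betas, xs):
    """Mean of Y under the Poisson model: exp(b0 + sum of beta_i * x_i)."""
    return math.exp(b0 + sum(b * x for b, x in zip(betas, xs)))


def exact_percent_change(beta):
    """Exact percent change in mean Y for a one-unit increase in X."""
    return 100.0 * (math.exp(beta) - 1.0)


# Hypothetical coefficients for illustration only.
b0, betas = 0.2, [0.05]
before = mean_count(b0, betas, [10])
after = mean_count(b0, betas, [11])
print(100 * (after / before - 1), exact_percent_change(0.05), 100 * 0.05)
```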

  29. End

  30. Equation for logit = log odds = depression “score”
     logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income
     odds of depression = $e^{\text{logit}}$, risk = odds/(1 + odds)
     Coding:
     • Female: 0 for M, 1 for F
     • Chron ill: 0 for no, 1 for yes
     • Income in 1000s
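
The depression score converts to odds and risk in two steps. The sketch below uses the coefficients from the equation above; the example person (a woman without chronic illness and an income of 20 thousand) is hypothetical.

```python
import math

# Coefficients from the depression model equation.
COEF = {"intercept": -1.8259, "female": 0.8332, "chron_ill": 0.3578, "income": -0.0299}


def depression_risk(female: int, chron_ill: int, income_thousands: float):
    """Return (logit, odds, risk) for one person."""
    logit = (COEF["intercept"] + COEF["female"] * female
             + COEF["chron_ill"] * chron_ill + COEF["income"] * income_thousands)
    odds = math.exp(logit)
    risk = odds / (1.0 + odds)
    return logit, odds, risk


# Hypothetical person, for illustration only.
logit, odds, risk = depression_risk(female=1, chron_ill=0, income_thousands=20)
print(f"logit={logit:.4f}  odds={odds:.4f}  risk={risk:.4f}")
```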

  31. Example: Depression (y/n)
     Model for depression

     | term | coeff = β | SE | p value |
     |------|-----------|----|---------|
     | Intercept | -1.8259 | 0.4495 | 0.0001 |
     | female | 0.8332 | 0.3882 | 0.0319 |
     | chron ill | 0.3578 | 0.3300 | 0.2782 |
     | income | -0.0299 | 0.0135 | 0.0268 |

     Female and chron ill are binary; income is in 1000s.

  32. ORs

     | term | coeff = β | OR = $e^{\beta}$ |
     |------|-----------|------------------|
     | Intercept | -1.8259 | --- |
     | female | 0.8332 | 2.301 |
     | chron ill | 0.3578 | 1.430 |
     | income | -0.0299 | 0.971 |
