Logistic Regression for binary outcomes
In linear regression, Y is continuous. In logistic regression, Y is binary (0, 1) and the average of Y is P. We can't use linear regression because:
• Y can't be linearly related to the Xs.
• Y does NOT have a Gaussian (normal) distribution around its "mean" P.
We need a "linearizing" transformation and a non-Gaussian error model.
Since 0 ≤ P ≤ 1, we might use odds = P/(1-P). Odds have no "ceiling" but do have a "floor" of zero. So we use the logit transformation:
ln(P/(1-P)) = ln(odds) = logit(P)
The logit has neither a floor nor a ceiling.
Model: logit = ln(P/(1-P)) = β0 + β1X1 + β2X2 + … + βkXk
or equivalently odds = e^(β0 + β1X1 + β2X2 + … + βkXk) = e^logit
Since P = odds/(1 + odds) and odds = e^logit,
P = e^logit/(1 + e^logit) = 1/(1 + e^(-logit))
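A minimal numeric sketch of the logit and inverse-logit transformations (not from the source; it uses only the standard math module):

```python
import math

def logit(p):
    """ln(P / (1 - P)): maps a probability to the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """P = 1 / (1 + e^(-logit)): maps a logit back to a probability."""
    return 1 / (1 + math.exp(-x))

for p in (0.1, 0.5, 0.9):
    x = logit(p)
    print(f"P={p:.1f}  odds={p/(1-p):.3f}  logit={x:+.3f}  back={inv_logit(x):.1f}")
```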
If ln(odds) = β0 + β1X1 + β2X2 + … + βkXk, then
odds = (e^β0)(e^(β1X1))(e^(β2X2)) … (e^(βkXk))
or odds = (base odds) × OR1 × OR2 × … × ORk
The model is multiplicative on the odds scale. (Base odds are the odds when all Xs = 0; ORi = odds ratio for the ith X.)
Interpreting β coefficients
Example: dichotomous X, with X = 0 for males, X = 1 for females.
logit(P) = β0 + β1X
M: X = 0, logit(Pm) = β0
F: X = 1, logit(Pf) = β0 + β1
logit(Pf) - logit(Pm) = β1
So log(OR) = β1 and e^β1 = OR.
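A short simulated illustration of this result (assuming numpy and statsmodels are available; the data are generated for the example, not taken from the text): with a single 0/1 predictor, the fitted e^β1 matches the odds ratio computed straight from the 2x2 table.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
sex = rng.integers(0, 2, 500)                    # 0 = male, 1 = female
p = 1 / (1 + np.exp(-(-1.0 + 0.8 * sex)))        # true beta0 = -1.0, beta1 = 0.8
y = rng.binomial(1, p)

fit = sm.Logit(y, sm.add_constant(sex)).fit(disp=0)
b0, b1 = fit.params

# 2x2 table odds ratio: (odds in females) / (odds in males)
odds_f = y[sex == 1].mean() / (1 - y[sex == 1].mean())
odds_m = y[sex == 0].mean() / (1 - y[sex == 0].mean())
print("exp(beta1) =", round(np.exp(b1), 3), " table OR =", round(odds_f / odds_m, 3))
```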
Example: P is the proportion with disease.
logit(P) = β0 + β1 age + β2 sex, where sex is coded 0 for M, 1 for F.
The OR for F vs M for disease is e^β2 if both are the same age.
e^β1 is the factor by which the odds of disease are multiplied for a one-year increase in age.
(e^β1)^k = e^(kβ1) is the OR for a k-year difference in age between two groups of the same gender.
Example: P is the proportion with an MI.
Predictors: age in years; htn = hypertension (1 = yes, 0 = no); smoke = smoking (1 = yes, 0 = no).
logit(P) = β0 + β1 age + β2 htn + β3 smoke
Q: Want the OR for a 40-year-old with hypertension vs an otherwise identical 30-year-old without hypertension.
A: (β0 + β1·40 + β2 + β3 smoke) - (β0 + β1·30 + β3 smoke) = 10β1 + β2 = log OR, so OR = e^(10β1 + β2).
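A worked version of the Q/A above as a short sketch; the β values here are made-up placeholders, since the text gives no numeric estimates for this model.

```python
import math

beta1 = 0.05   # hypothetical per-year increase in log odds for age
beta2 = 0.70   # hypothetical increase in log odds for hypertension

# The intercept and smoking terms cancel because both subjects share them.
log_or = 10 * beta1 + beta2
print("OR =", round(math.exp(log_or), 2))   # exp(10*beta1 + beta2)
```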
Interactions
P is the proportion with CHD. S: 1 = smoking, 0 = non; D: 1 = drinking, 0 = non.
logit(P) = β0 + β1S + β2D + β3SD
Referent category is S = 0, D = 0.

S  D  odds                  OR
0  0  e^β0                  OR00 = e^β0/e^β0 = 1
1  0  e^(β0+β1)             OR10 = e^β1
0  1  e^(β0+β2)             OR01 = e^β2
1  1  e^(β0+β1+β2+β3)       OR11 = e^(β1+β2+β3)

When will OR11 = OR10 × OR01? If and only if β3 = 0 (see the sketch below).
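A quick numeric check of that claim with arbitrary, hypothetical β values:

```python
import math

def cell_or(b1, b2, b3, s, d):
    """Odds ratio vs. the S=0, D=0 referent for given smoking/drinking codes."""
    return math.exp(b1 * s + b2 * d + b3 * s * d)

for b3 in (0.0, 0.5):
    or10 = cell_or(0.4, 0.7, b3, 1, 0)
    or01 = cell_or(0.4, 0.7, b3, 0, 1)
    or11 = cell_or(0.4, 0.7, b3, 1, 1)
    print(f"beta3={b3}: OR11={or11:.3f}  OR10*OR01={or10*or01:.3f}")
```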
Interpretation example
Potential predictors (13) of in-hospital infection mortality (yes or no).
Crabtree et al., JAMA, 8 Dec 1999, No. 22, 2143-2148.
• Gender (female or male)
• Age in years
• APACHE score (0-129)
• Diabetes (y/n)
• Renal insufficiency / hemodialysis (y/n)
• Intubation / mechanical ventilation (y/n)
• Malignancy (y/n)
• Steroid therapy (y/n)
• Transfusions (y/n)
• Organ transplant (y/n)
• WBC count
• Max temperature (degrees)
• Days from admission to treatment (> 7 days)
Factors Associated With Mortality for All Infections

Characteristic          Odds Ratio (95% CI)   p value
Incr APACHE score*      1.15 (1.11-1.18)      <.001
Transfusion (y/n)       4.15 (2.46-6.99)      <.001
Increasing age          1.03 (1.02-1.05)      <.001
Malignancy              2.60 (1.62-4.17)      <.001
Max temperature         0.70 (0.58-0.85)      <.001
Adm to treat > 7 d      1.66 (1.05-2.61)      0.03
Female (y/n)            1.32 (0.90-1.94)      0.16

*APACHE = Acute Physiology & Chronic Health Evaluation score
Diabetes complications - descriptive stats

Table of obese by diabetes complication (frequencies)
                 complication: no (0)   yes (1)   Total   % yes
obese: no (0)          56                 28        84    28/84 = 33%
obese: yes (1)         20                 41        61    41/61 = 67%
Total                  76                 69       145
% obese                26%                59%
RR = 2.0, OR = 4.1, p < 0.001

Fasting glucose ("fast glu"), mg/dl
                    n    min    median   mean    max
No complication     76   70.0   90.0     91.2    112.0
Complication        69   75.0   114.0    155.9   353.0    p =

Steady-state glucose ("steady glu"), mg/dl
                    n    min    median   mean    max
No complication     76   29.0   105.0    114.0   273.0
Complication        69   60.0   257.0    261.5   480.0    p =
Diabetes complication - fitted model

Parameter    DF   beta     SE(b)   Chi-Square   p
Intercept    1    -14.70   3.231   20.706       <.0001
obese        1    0.328    0.615   0.285        0.5938
Fast glu     1    0.108    0.031   2.456        0.0004
Steady glu   1    0.023    0.005   18.322       <.0001

Log odds of diabetes complication = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu
Statistical significance of the βs
Linear regression: t = b/SE → p value
Logistic regression: Χ² = (b/SE)² → p value
To get a CI for the OR, first form the (95%) CI for β on the log scale:
b - 1.96 SE, b + 1.96 SE
then take antilogs of each end:
e^(b - 1.96 SE), e^(b + 1.96 SE)
Diabetes complications - odds ratio estimates

Effect        Point Estimate        95% Wald Confidence Limits
obese         e^0.328 = 1.388       0.416 - 4.631
Fast glu      e^0.108 = 1.114       1.049 - 1.182
Steady glu    e^0.023 = 1.023       1.012 - 1.033
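These limits can be reproduced from the coefficient table two slides back using the CI recipe on the previous slide; a small sketch for the fast glu row (b = 0.108, SE = 0.031):

```python
import math

b, se = 0.108, 0.031
lo, hi = b - 1.96 * se, b + 1.96 * se            # 95% CI for beta (log scale)
print("OR =", round(math.exp(b), 3),
      " 95% CI =", round(math.exp(lo), 3), "-", round(math.exp(hi), 3))
# OR ~ 1.114 with CI ~ 1.048-1.184, matching the reported 1.049-1.182 up to rounding.
```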
Model fit - linear vs logistic regression
k variables, n observations

Variation   df     Sum of squares or deviance
Model       k      G
Error       n-k    D
Total       n-1    T  (fixed)

Yi = ith observation, Ŷi = prediction for the ith observation
Good regression models have large G and small D. For logistic regression, D/(n-k), the mean deviance, should be near 1.0. There are two versions of R² for logistic regression.
Goodness of fit: deviance
Deviance in logistic regression plays the role of sums of squares in linear regression.

            df    -2 log L   p value
Model (G)   3     117.21     < 0.001
Error (D)   141   83.46
Total (T)   144   200.67

Mean deviance = 83.46/141 = 0.59 (want mean deviance ≤ 1)
R²pseudo = G/Total = 117/201 = 0.58,  R²cs = 0.554
Goodness of fit: Hosmer-Lemeshow chi-square
Compare observed vs model-predicted (expected) frequencies by decile of predicted risk.

Decile   Total   Obs y   Exp y   Obs no   Exp no
1        16      0       0.23    16       15.8
2        15      0       0.61    15       14.4
3        15      0       1.31    15       13.7
…
8        16      15      15.6    1        0.40
9        23      23      23.0    0        0.00

Chi-square = 9.89, df = 7, p = 0.1946
Goodness of fit vs R²
How to interpret a model whose goodness of fit is acceptable but whose R² is poor:
• Do we need to include interactions or transform the X variables in the model?
• Do we need to obtain more X variables?
Sensitivity & Specificity
(In the 2x2 table of predicted vs observed outcome: a = true positives, b = false positives, c = false negatives, d = true negatives.)
Sensitivity = a/(a+c), false negative rate = c/(a+c)
Specificity = d/(b+d), false positive rate = b/(b+d)
Accuracy = W × sensitivity + (1-W) × specificity
Any good classification rule, including a logistic model, should have high sensitivity and specificity. In logistic regression, we choose a cutpoint Pc:
Predict positive if P > Pc
Predict negative if P < Pc
Diabetes complication
logit(Pi) = -14.7 + 0.328 obese + 0.108 fast glu + 0.023 steady glu
Pi = 1/(1 + exp(-logit))
Compute Pi for all observations, then find the value of Pi (call it P0) that maximizes accuracy = 0.5 sensitivity + 0.5 specificity. This is an ROC analysis using the logit (or Pi); a sketch of the search follows.
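A minimal sketch of that cutpoint search (numpy assumed available; toy outcomes and predicted probabilities stand in for the fitted model's Pi values):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)                                  # toy 0/1 outcomes
p_hat = np.clip(y * 0.3 + rng.uniform(0, 0.7, 200), 0, 1)    # toy predicted Pi

best = None
for pc in np.unique(p_hat):                  # candidate cutpoints P0
    pred = (p_hat >= pc).astype(int)
    sens = (pred[y == 1] == 1).mean()
    spec = (pred[y == 0] == 0).mean()
    acc = 0.5 * sens + 0.5 * spec            # equally weighted accuracy
    if best is None or acc > best[0]:
        best = (acc, pc, sens, spec)

print("best cutpoint P0=%.3f  sens=%.3f  spec=%.3f  accuracy=%.3f"
      % (best[1], best[2], best[3], best[0]))
```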
Diabetes model accuracy
Logit = 0.447, P0 = e^0.447/(1 + e^0.447) = 0.61
Sens = 55/69 = 79.7%, Spec = 65/76 = 85.5%
Accuracy = (81.2% + 85.5%)/2 = 83.4%
C statistic (report this)
n0 = number of negatives, n1 = number of positives. Form all n0 × n1 (1, 0) pairs.
A pair is concordant if the predicted P for the Y = 1 member exceeds the predicted P for the Y = 0 member, and discordant if it is smaller.
C = (number concordant + 0.5 × number of ties) / (n0 × n1)
C = 0.949 for the diabetes complication model.
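A brute-force sketch of the C statistic computed exactly as described, on a handful of made-up predicted probabilities (fine for small n):

```python
import itertools

p_pos = [0.9, 0.7, 0.6, 0.4]      # predicted P for observations with Y = 1
p_neg = [0.8, 0.5, 0.3, 0.2]      # predicted P for observations with Y = 0

conc = disc = ties = 0
for p1, p0 in itertools.product(p_pos, p_neg):   # all n1 x n0 pairs
    if p1 > p0:
        conc += 1
    elif p1 < p0:
        disc += 1
    else:
        ties += 1

c = (conc + 0.5 * ties) / (len(p_pos) * len(p_neg))
print("C =", c)
```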
The logistic model is also a discriminant model (LDA).
[Figure: histograms of logit scores for each group]
Poisson Regression
Y is a count taking small non-negative integer values: 0, 1, 2, …
Model: ln(mean Y) = β0 + β1X1 + β2X2 + … + βkXk
so mean Y = exp(β0 + β1X1 + β2X2 + … + βkXk)
dY/dXi = βi × mean Y, so βi = (dY/dXi)/mean Y
100βi is (approximately) the percent change in mean Y per unit change in Xi.
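The exact multiplicative change in mean Y per unit of Xi is e^βi, so the exact percent change is 100(e^βi - 1); 100βi is the small-β approximation. A quick numeric check:

```python
import math

for beta in (0.02, 0.10, 0.50):
    exact_pct = 100 * (math.exp(beta) - 1)   # exact percent change per unit of Xi
    approx_pct = 100 * beta                  # the rule-of-thumb approximation
    print(f"beta={beta:.2f}: exact {exact_pct:.1f}%  approx {approx_pct:.1f}%")
```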
Equation for the logit = log odds = depression "score"
logit = -1.8259 + 0.8332 female + 0.3578 chron ill - 0.0299 income
odds of depression = e^logit, risk = odds/(1 + odds)
Coding: female: 0 for M, 1 for F; chron ill: 0 for no, 1 for yes; income in 1000s.
Example: Depression (y/n) - model for depression

Term        coeff = β   SE        p value
Intercept   -1.8259     0.4495    0.0001
female      0.8332      0.3882    0.0319
chron ill   0.3578      0.3300    0.2782
income      -0.0299     0.0135    0.0268

Female and chron ill are binary; income is in 1000s.
Odds ratios

Term        coeff = β   OR = e^β
Intercept   -1.8259     ---
female      0.8332      2.301
chron ill   0.3578      1.430
income      -0.0299     0.971
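A short sketch that plugs an illustrative (made-up) profile into the depression equation above: a female with a chronic illness and income of 20 (i.e., 20 thousand).

```python
import math

# logit = -1.8259 + 0.8332*female + 0.3578*chron_ill - 0.0299*income
logit = -1.8259 + 0.8332 * 1 + 0.3578 * 1 - 0.0299 * 20
odds = math.exp(logit)            # odds of depression
risk = odds / (1 + odds)          # predicted probability of depression
print(f"logit={logit:.3f}  odds={odds:.3f}  risk={risk:.3f}")
```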