Discover the fundamentals of logistic regression modeling with dichotomous dependent variables. Learn how to estimate probabilities and interpret odds ratios in predicting outcomes. This approach is essential when traditional OLS regression falls short.
Logistic Regression Modeling with Dichotomous Dependent Variables
A New Type of Model… • Dichotomous Dependent Variable: • Why did someone vote for Bush or Kerry? • Why did residents own or rent their houses? • Why do some people drink alcohol and others don’t? • What determined if a household owned a car?
Dependent Variable… • Is binary, with a yes or no answer • Can be coded 1 for yes and 0 for no • There are no other valid responses.
Problem: OLS Regression does not model the relationship well
Solution: Use a Different Functional Form • The properties we need: • The model should be bounded by 0 and 1 • The model should estimate the value of the dependent variable as the probability of being in one category or the other, e.g., an owner or a renter, or a Bush voter or a Kerry voter
Solution, cont. • We want to know the probability, p, that a particular case falls in the 0 or the 1 category. • We want to derive a model that gives good estimates of that probability, that is, of how likely a particular case is to be a 0 or a 1.
The Logistic Function • Probability that a case is a 0 or a 1 is distributed according to the logistic function.
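A minimal sketch of the logistic function in Python (Python is used here only for illustration; the output in these slides comes from a statistics package):

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The function has exactly the bounded behavior the model requires:
print(logistic(-5))  # close to 0
print(logistic(0))   # exactly 0.5
print(logistic(5))   # close to 1
```

However large or small z gets, the result stays strictly between 0 and 1, which is what lets us read it as a probability.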
Remember probabilities… • Probabilities range from 0 to 1. • Probability: the frequency of being in one category relative to the total of all categories. • Example: the probability that the first card dealt in a card game is the queen of hearts is 1/52 (one in 52). • Linear regression does us no good here: its predicted values are not constrained to the 0–1 range, so they cannot be read as probabilities.
But can we manipulate probabilities to estimate the logistic function? • Steps: • Convert probabilities to odds ratios • Convert odds ratios to log odds or logits
Manipulating probabilities to estimate the logistic function

Case      P      1-P    P/(1-P)   ln(P/(1-P))
  1     0.010   0.990     0.010     -4.595
  2     0.050   0.950     0.053     -2.944
  3     0.100   0.900     0.111     -2.197
  4     0.200   0.800     0.250     -1.386
  5     0.300   0.700     0.429     -0.847
  6     0.400   0.600     0.667     -0.405
  7     0.500   0.500     1.000      0.000
  8     0.600   0.400     1.500      0.405
  9     0.700   0.300     2.333      0.847
 10     0.800   0.200     4.000      1.386
 11     0.900   0.100     9.000      2.197
 12     0.950   0.050    19.000      2.944
 13     0.990   0.010    99.000      4.595
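The conversions in the table can be reproduced in a few lines of Python:

```python
import math

# For each probability P, compute the odds P/(1-P) and the
# log odds (logit) ln(P/(1-P)), as in the table above.
ps = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50,
      0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
for p in ps:
    odds = p / (1 - p)
    logit = math.log(odds)
    print(f"{p:5.2f} {odds:8.3f} {logit:8.3f}")
```

Note the symmetry around P = .5: the logit of .3 is -0.847 and the logit of .7 is +0.847, and the logit is unbounded in both directions even though P stays between 0 and 1.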
Steps…. • Log odds (logit) = a + bx • Odds = exp(a + bx) • Probability = exp(a + bx) / [1 + exp(a + bx)]; that is, probability is distributed according to the logistic function
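The steps can be sketched as a small Python helper (the coefficients a and b here are placeholders, not estimates from any real model):

```python
import math

def probability_from_logit(a, b, x):
    """Run the steps in reverse: linear predictor -> odds -> probability."""
    log_odds = a + b * x       # step 1: the log odds are linear in x
    odds = math.exp(log_odds)  # step 2: exponentiate to get the odds
    return odds / (1 + odds)   # step 3: odds/(1+odds) is the probability

# A log odds of 0 always corresponds to a probability of .5:
print(probability_from_logit(0.0, 1.0, 0.0))
```

This same three-step conversion is used in the worked examples later in the slides.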
An Example • Determinants of Homeownership: • Age of the householder • Age of the householder squared • Building Type • Year house was built • Householder’s Ethnicity • Occupational status scale
Calculating the Model • Maximum Likelihood Estimation (not OLS) • Estimates of the b’s, standard errors, t ratios and p values for coefficients • Coefficients are estimates of the impact of the independent variable on the logit of the dependent variable
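As a purely illustrative sketch (synthetic data and made-up coefficients; real packages use faster algorithms such as Newton-Raphson and also report the standard errors, t-ratios, and p-values mentioned above), maximum likelihood estimation can be demonstrated by gradient ascent on the log likelihood:

```python
import math
import random

# Simulate data from a known logistic model, then recover the
# coefficients by climbing the log likelihood surface.
random.seed(42)
true_a, true_b = 0.5, 1.0
xs = [random.uniform(-3, 3) for _ in range(1000)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(true_a + true_b * x))) else 0
      for x in xs]

a = b = 0.0
rate = 0.5
for _ in range(2000):
    grad_a = grad_b = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(a + b * x)))  # predicted probability
        grad_a += y - p                       # d(log likelihood)/da
        grad_b += (y - p) * x                 # d(log likelihood)/db
    a += rate * grad_a / len(xs)
    b += rate * grad_b / len(xs)

print(round(a, 2), round(b, 2))  # should land near the true values 0.5 and 1.0
```

The gradient terms show why the coefficients are read on the logit scale: each observation pulls the linear predictor a + bx toward its observed 0/1 outcome.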
Logistic Regression Model

Parameter             Estimate    S.E.   t-ratio  p-value
CONSTANT                -6.976   1.501    -4.647    0.000
AGE                      0.250   0.060     4.132    0.000
AGESQ                   -0.002   0.001    -3.400    0.001
BLDGTYP2$_cottage        0.036   0.277     0.131    0.895
BLDGTYP2$_duplex        -1.432   0.328    -4.363    0.000
YEAR                     0.061   0.022     2.757    0.006
GERMAN                   0.706   0.264     2.677    0.007
POLISH                   0.777   0.422     1.841    0.066
OCCSCALE                 0.190   0.091     2.074    0.038
Logistic Regression Model, cont.

Parameter             Odds Ratio    Upper    Lower
AGE                        1.284    1.445    1.140
AGESQ                      0.998    0.999    0.997
BLDGTYP2$_cottage          1.037    1.784    0.603
BLDGTYP2$_duplex           0.239    0.454    0.125
YEAR                       1.063    1.109    1.018
GERMAN                     2.026    3.398    1.208
POLISH                     2.175    4.972    0.951
OCCSCALE                   1.209    1.446    1.011

Log Likelihood of constants-only model = LL(0) = -303.864
2*[LL(N)-LL(0)] = 85.180 with 8 df, chi-square p-value = 0.000
McFadden's Rho-Squared = 0.140
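The fit statistics hang together arithmetically, which a few lines of Python can confirm:

```python
# Recover the fitted model's log likelihood and McFadden's rho-squared
# from the two numbers reported in the output above.
ll_0 = -303.864           # LL(0): log likelihood of the constants-only model
chi_sq = 85.180           # 2 * [LL(N) - LL(0)]: the model chi-square

ll_n = ll_0 + chi_sq / 2  # implied log likelihood of the fitted model
rho_sq = 1 - ll_n / ll_0  # McFadden's rho-squared = 1 - LL(N)/LL(0)

print(ll_n, round(rho_sq, 3))  # implied LL(N) and rho-squared of about 0.140
```

Rho-squared is a pseudo R-squared: it measures the proportional improvement in log likelihood over the constants-only model, not explained variance as in OLS.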
Converting Odds Ratios to Probabilities • Odds = P/(1-P); an odds ratio compares the odds of one group with the odds of another. • For Germans, the odds ratio of 2.026 means that, controlling for the other variables, the odds of owning are 2.026 times the odds for the omitted category (Americans and other ethnicities). • So Germans are more likely to own houses than Americans. • Can we be more specific?
Calculating Probability of a Case • Log odds of homeownership = -6.976 + .250Age - .002Agesquared + .036 cottage – 1.432 duplex + .061Year + .706 German + .777 Polish + .190 occscale • Plug in values and solve the equation. • Exponentiate the result to create the odds • Convert the odds to a probability for the case.
Calculations • Log odds of homeownership = -6.976 + .250Age - .002Agesquared + .036 cottage – 1.432 duplex + .061Year + .706 German + .777 Polish + .190 occscale • For a 40-year-old, skilled, American-born worker living in a residence built in 1892 (here YEAR enters as 5 and OCCSCALE as 3): • Log odds of homeownership = -6.976 + .250*40 - .002*1600 + .061*5 + .190*3 • Log odds = .699
Calculations, cont. • log odds = .699 • odds = antilog (exponentiation) of .699 = 2.012 • odds = P/(1-P) = 2.012 • Solve for P. The result is .67.
More calculations…. • How about a 40-year-old German skilled worker in an 1892 residence? • Log odds of homeownership = -6.976 + .250Age - .002Agesquared + .036 cottage – 1.432 duplex + .061Year + .706 German + .777 Polish + .190 occscale • Log odds = -6.976 + .250*40 - .002*1600 + .061*5 + .706 + .190*3 = 1.405 • Note that this is just the American worker's log odds plus the German coefficient: .699 + .706 = 1.405. • Equivalently, on the odds scale, multiplying the American worker's odds by the odds ratio for "German" gives the German worker's odds: 2.012 * 2.026 = 4.076.
More calculations • Convert the log odds to odds, i.e., take the antilog: exp(1.405) = 4.076. • Odds = 4.076 = P/(1-P). • Solve for P: P = .803. • So the probability of homeownership rises from .67 for the American worker to .803 for the German, an increase of about 13 percentage points.
More calculations • For a 30-year-old American worker in a residence built in 1892: • Log odds = -6.976 + .250*30 - .002*900 + .061*5 + .190*3 = -0.401 • Odds = antilog of (-.401) = 0.670 • Probability of ownership = .670/1.670 = 0.401
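The worked cases can be checked with a short, illustrative Python function built from the fitted coefficients (the function name is ours; YEAR and OCCSCALE enter in the coded units used in the plug-ins, 5 and 3):

```python
import math

def p_own(age, year, occscale, cottage=0, duplex=0, german=0, polish=0):
    """Probability of homeownership from the fitted coefficients in the slides.

    YEAR and OCCSCALE are in the coded units used in the worked examples
    (YEAR = 5 for a house built in 1892, OCCSCALE = 3 for a skilled worker).
    """
    log_odds = (-6.976 + 0.250 * age - 0.002 * age**2 + 0.036 * cottage
                - 1.432 * duplex + 0.061 * year + 0.706 * german
                + 0.777 * polish + 0.190 * occscale)
    odds = math.exp(log_odds)
    return odds / (1 + odds)

print(round(p_own(40, 5, 3), 2))            # 40-year-old American: about .67
print(round(p_own(40, 5, 3, german=1), 2))  # 40-year-old German: about .80
print(round(p_own(30, 5, 3), 2))            # 30-year-old American: about .40
```

Wrapping the equation in a function makes it easy to explore other profiles, e.g., the effect of living in a duplex at any given age.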
Classification Table

Model Prediction Success Table

                     Predicted Choice
Actual Choice      Response   Reference   Actual Total
Response            281.647      85.353        367.000
Reference            85.353      58.647        144.000
Pred. Total         367.000     144.000        511.000

Correct               0.767       0.407
Success Index         0.049       0.125
Total Correct         0.666

Sensitivity: 0.767   Specificity: 0.407
False Reference: 0.233   False Response: 0.593
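The summary rates follow directly from the cell counts, as a short Python check shows (the fractional counts appear to be expected counts summed from predicted probabilities, which is how some packages build prediction success tables):

```python
# Cell counts from the prediction success table above.
resp_resp, resp_ref = 281.647, 85.353  # actual Response row
ref_resp, ref_ref = 85.353, 58.647     # actual Reference row
total = resp_resp + resp_ref + ref_resp + ref_ref  # 511 cases

sensitivity = resp_resp / (resp_resp + resp_ref)   # correct among actual Response
specificity = ref_ref / (ref_resp + ref_ref)       # correct among actual Reference
total_correct = (resp_resp + ref_ref) / total      # overall correct fraction

print(round(sensitivity, 3), round(specificity, 3), round(total_correct, 3))
```

Note that sensitivity and specificity condition on the actual category (the rows), while the predicted totals run down the columns.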
Extending the Logic… • Logistic regression can be extended to dependent variables with more than two categories, yielding multinomial (multi-response) models • Classification tables can be used to understand misclassified cases • Results can be analyzed for patterns across different values of the independent variables.