Discover the fundamentals of logistic regression modeling with dichotomous dependent variables. Learn how to estimate probabilities and interpret odds ratios in predicting outcomes. This approach is essential when traditional OLS regression falls short.
Logistic Regression Modeling with Dichotomous Dependent Variables
A New Type of Model… • Dichotomous Dependent Variable: • Why did someone vote for Bush or Kerry? • Why did residents own or rent their houses? • Why do some people drink alcohol and others don’t? • What determined if a household owned a car?
Dependent Variable… • Is binary, with a yes or no answer • Can be coded 1 for yes and 0 for no • There are no other valid responses.
Problem: OLS Regression does not model the relationship well
Solution: Use a Different Functional Form • The properties we need: • The model should be bounded by 0 and 1 • The model should estimate the value of the dependent variable as the probability of being in one category or the other, e.g., an owner or a renter, or a Bush voter or a Kerry voter
Solution, cont. • We want to know the probability, p, that a particular case falls in the 0 or the 1 category. • We want to derive a model that gives good estimates of that probability, that is, of how likely a particular case is to be a 0 or a 1.
The Logistic Function • Probability that a case is a 0 or a 1 is distributed according to the logistic function.
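A minimal sketch of the logistic function in Python (Python is used here only for illustration; the output in these slides comes from a statistics package):

```python
import math

def logistic(z):
    """Logistic (sigmoid) function: maps any real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# The function has exactly the bounded behavior the model requires:
print(logistic(-5))  # close to 0
print(logistic(0))   # exactly 0.5
print(logistic(5))   # close to 1
```

However large or small z gets, the result stays strictly between 0 and 1, which is what lets us read it as a probability.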
Remember probabilities… • Probabilities range from 0 to 1. • Probability: the frequency of being in one category relative to the total of all categories. • Example: the probability that the first card dealt in a card game is the queen of hearts is 1/52 (one in 52). • Linear regression does us no good here: its predicted values are not constrained to the 0–1 range, so they cannot be read as probabilities.
But can we manipulate probabilities to estimate the logistic function? • Steps: • Convert probabilities to odds ratios • Convert odds ratios to log odds or logits
Manipulating probabilities to estimate the logistic function

Case      P      1-P    P/(1-P)   ln(P/(1-P))
  1     0.010   0.990     0.010     -4.595
  2     0.050   0.950     0.053     -2.944
  3     0.100   0.900     0.111     -2.197
  4     0.200   0.800     0.250     -1.386
  5     0.300   0.700     0.429     -0.847
  6     0.400   0.600     0.667     -0.405
  7     0.500   0.500     1.000      0.000
  8     0.600   0.400     1.500      0.405
  9     0.700   0.300     2.333      0.847
 10     0.800   0.200     4.000      1.386
 11     0.900   0.100     9.000      2.197
 12     0.950   0.050    19.000      2.944
 13     0.990   0.010    99.000      4.595
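The conversions in the table can be reproduced in a few lines of Python:

```python
import math

# For each probability P, compute the odds P/(1-P) and the
# log odds (logit) ln(P/(1-P)), as in the table above.
ps = [0.01, 0.05, 0.10, 0.20, 0.30, 0.40, 0.50,
      0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
for p in ps:
    odds = p / (1 - p)
    logit = math.log(odds)
    print(f"{p:5.2f} {odds:8.3f} {logit:8.3f}")
```

Note the symmetry around P = .5: the logit of .3 is -0.847 and the logit of .7 is +0.847, and the logit is unbounded in both directions even though P stays between 0 and 1.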
Steps…. • Log odds (logit) = a + bx • Odds = exp(a + bx) • Probability = exp(a + bx) / [1 + exp(a + bx)]; that is, probability is distributed according to the logistic function
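The steps can be sketched as a small Python helper (the coefficients a and b here are placeholders, not estimates from any real model):

```python
import math

def probability_from_logit(a, b, x):
    """Run the steps in reverse: linear predictor -> odds -> probability."""
    log_odds = a + b * x       # step 1: the log odds are linear in x
    odds = math.exp(log_odds)  # step 2: exponentiate to get the odds
    return odds / (1 + odds)   # step 3: odds/(1+odds) is the probability

# A log odds of 0 always corresponds to a probability of .5:
print(probability_from_logit(0.0, 1.0, 0.0))
```

This same three-step conversion is used in the worked examples later in the slides.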
An Example • Determinants of Homeownership: • Age of the householder • Age of the householder squared • Building Type • Year house was built • Householder’s Ethnicity • Occupational status scale
Calculating the Model • Maximum Likelihood Estimation (not OLS) • Estimates of the b’s, standard errors, t ratios and p values for coefficients • Coefficients are estimates of the impact of the independent variable on the logit of the dependent variable
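As a purely illustrative sketch (synthetic data and made-up coefficients; real packages use faster algorithms such as Newton-Raphson and also report the standard errors, t-ratios, and p-values mentioned above), maximum likelihood estimation can be demonstrated by gradient ascent on the log likelihood:

```python
import math
import random

# Simulate data from a known logistic model, then recover the
# coefficients by climbing the log likelihood surface.
random.seed(42)
true_a, true_b = 0.5, 1.0
xs = [random.uniform(-3, 3) for _ in range(1000)]
ys = [1 if random.random() < 1 / (1 + math.exp(-(true_a + true_b * x))) else 0
      for x in xs]

a = b = 0.0
rate = 0.5
for _ in range(2000):
    grad_a = grad_b = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + math.exp(-(a + b * x)))  # predicted probability
        grad_a += y - p                       # d(log likelihood)/da
        grad_b += (y - p) * x                 # d(log likelihood)/db
    a += rate * grad_a / len(xs)
    b += rate * grad_b / len(xs)

print(round(a, 2), round(b, 2))  # should land near the true values 0.5 and 1.0
```

The gradient terms show why the coefficients are read on the logit scale: each observation pulls the linear predictor a + bx toward its observed 0/1 outcome.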
Logistic Regression Model

Parameter             Estimate    S.E.   t-ratio  p-value
CONSTANT                -6.976   1.501    -4.647    0.000
AGE                      0.250   0.060     4.132    0.000
AGESQ                   -0.002   0.001    -3.400    0.001
BLDGTYP2$_cottage        0.036   0.277     0.131    0.895
BLDGTYP2$_duplex        -1.432   0.328    -4.363    0.000
YEAR                     0.061   0.022     2.757    0.006
GERMAN                   0.706   0.264     2.677    0.007
POLISH                   0.777   0.422     1.841    0.066
OCCSCALE                 0.190   0.091     2.074    0.038
Logistic Regression Model, cont.

Parameter             Odds Ratio    Upper    Lower
AGE                        1.284    1.445    1.140
AGESQ                      0.998    0.999    0.997
BLDGTYP2$_cottage          1.037    1.784    0.603
BLDGTYP2$_duplex           0.239    0.454    0.125
YEAR                       1.063    1.109    1.018
GERMAN                     2.026    3.398    1.208
POLISH                     2.175    4.972    0.951
OCCSCALE                   1.209    1.446    1.011

Log Likelihood of constants-only model = LL(0) = -303.864
2*[LL(N)-LL(0)] = 85.180 with 8 df, chi-square p-value = 0.000
McFadden's Rho-Squared = 0.140
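The fit statistics hang together arithmetically, which a few lines of Python can confirm:

```python
# Recover the fitted model's log likelihood and McFadden's rho-squared
# from the two numbers reported in the output above.
ll_0 = -303.864           # LL(0): log likelihood of the constants-only model
chi_sq = 85.180           # 2 * [LL(N) - LL(0)]: the model chi-square

ll_n = ll_0 + chi_sq / 2  # implied log likelihood of the fitted model
rho_sq = 1 - ll_n / ll_0  # McFadden's rho-squared = 1 - LL(N)/LL(0)

print(ll_n, round(rho_sq, 3))  # implied LL(N) and rho-squared of about 0.140
```

Rho-squared is a pseudo R-squared: it measures the proportional improvement in log likelihood over the constants-only model, not explained variance as in OLS.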
Converting Odds Ratios to Probabilities • Odds = P/(1-P); an odds ratio compares the odds of one group with the odds of another. • For Germans, the odds ratio of 2.026 means that, controlling for the other variables, the odds of owning are 2.026 times the odds for the omitted category (Americans and other ethnicities). • So Germans are more likely to own houses than Americans. • Can we be more specific?
Calculating Probability of a Case • Log odds of homeownership = -6.976 + .250Age - .002Agesquared + .036 cottage – 1.432 duplex + .061Year + .706 German + .777 Polish + .190 occscale • Plug in values and solve the equation. • Exponentiate the result to create the odds • Convert the odds to a probability for the case.
Calculations • Log odds of homeownership = -6.976 + .250Age - .002Agesquared + .036 cottage – 1.432 duplex + .061Year + .706 German + .777 Polish + .190 occscale • For a 40-year-old, skilled, American-born worker living in a residence built in 1892 (here YEAR enters as 5 and OCCSCALE as 3): • Log odds of homeownership = -6.976 + .250*40 - .002*1600 + .061*5 + .190*3 • Log odds = .699
Calculations, cont. • log odds = .699 • odds = antilog (exponentiation) of .699 = 2.012 • odds = P/(1-P) = 2.012 • Solve for P. The result is .67.
More calculations…. • How about a 40-year-old German skilled worker in an 1892 residence? • Log odds of homeownership = -6.976 + .250Age - .002Agesquared + .036 cottage – 1.432 duplex + .061Year + .706 German + .777 Polish + .190 occscale • Log odds = -6.976 + .250*40 - .002*1600 + .061*5 + .706 + .190*3 = 1.405 • Note that this is just the American worker's log odds plus the German coefficient: .699 + .706 = 1.405. • Equivalently, on the odds scale, multiplying the American worker's odds by the odds ratio for "German" gives the German worker's odds: 2.012 * 2.026 = 4.076.
More calculations • Convert the log odds to odds, i.e., take the antilog: exp(1.405) = 4.076. • Odds = 4.076 = P/(1-P). • Solve for P: P = .803. • So the probability of homeownership rises from .67 for the American worker to .803 for the German, an increase of about 13 percentage points.
More calculations • For a 30-year-old American worker in a residence built in 1892: • Log odds = -6.976 + .250*30 - .002*900 + .061*5 + .190*3 = -0.401 • Odds = antilog of (-.401) = 0.670 • Probability of ownership = .670/1.670 = 0.401
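The worked cases can be checked with a short, illustrative Python function built from the fitted coefficients (the function name is ours; YEAR and OCCSCALE enter in the coded units used in the plug-ins, 5 and 3):

```python
import math

def p_own(age, year, occscale, cottage=0, duplex=0, german=0, polish=0):
    """Probability of homeownership from the fitted coefficients in the slides.

    YEAR and OCCSCALE are in the coded units used in the worked examples
    (YEAR = 5 for a house built in 1892, OCCSCALE = 3 for a skilled worker).
    """
    log_odds = (-6.976 + 0.250 * age - 0.002 * age**2 + 0.036 * cottage
                - 1.432 * duplex + 0.061 * year + 0.706 * german
                + 0.777 * polish + 0.190 * occscale)
    odds = math.exp(log_odds)
    return odds / (1 + odds)

print(round(p_own(40, 5, 3), 2))            # 40-year-old American: about .67
print(round(p_own(40, 5, 3, german=1), 2))  # 40-year-old German: about .80
print(round(p_own(30, 5, 3), 2))            # 30-year-old American: about .40
```

Wrapping the equation in a function makes it easy to explore other profiles, e.g., the effect of living in a duplex at any given age.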
Classification Table

Model Prediction Success Table

                     Predicted Choice
Actual Choice      Response   Reference   Actual Total
Response            281.647      85.353        367.000
Reference            85.353      58.647        144.000
Pred. Total         367.000     144.000        511.000

Correct               0.767       0.407
Success Index         0.049       0.125
Total Correct         0.666

Sensitivity: 0.767   Specificity: 0.407
False Reference: 0.233   False Response: 0.593
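The summary rates follow directly from the cell counts, as a short Python check shows (the fractional counts appear to be expected counts summed from predicted probabilities, which is how some packages build prediction success tables):

```python
# Cell counts from the prediction success table above.
resp_resp, resp_ref = 281.647, 85.353  # actual Response row
ref_resp, ref_ref = 85.353, 58.647     # actual Reference row
total = resp_resp + resp_ref + ref_resp + ref_ref  # 511 cases

sensitivity = resp_resp / (resp_resp + resp_ref)   # correct among actual Response
specificity = ref_ref / (ref_resp + ref_ref)       # correct among actual Reference
total_correct = (resp_resp + ref_ref) / total      # overall correct fraction

print(round(sensitivity, 3), round(specificity, 3), round(total_correct, 3))
```

Note that sensitivity and specificity condition on the actual category (the rows), while the predicted totals run down the columns.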
Extending the Logic… • Logistic regression can be extended to dependent variables with more than two categories, yielding multinomial (multi-response) models • Classification tables can be used to understand misclassified cases • Results can be analyzed for patterns across different values of the independent variables.