Logistic Regression and Odds Ratios

Psych 818 - DeShon

Dichotomous Response • Used when the outcome or DV is a dichotomous, random variable • Can only take one of two possible values (1,0) • Pass/Fail • Disease/No Disease • Agree/Disagree • True/False • Present/Absent • This data structure causes problems for OLS regression

Dichotomous Response • Properties of dichotomous response variables (Y) • POSITIVE RESPONSE (Success = 1): probability p • NEGATIVE RESPONSE (Failure = 0): probability q = (1-p) • E(Y) = p, estimated by the observed proportion of successes • Var(Y) = p*q • Oops! The variance depends on the mean

Dichotomous Response • Let's generate some (0,1) data • Y <- rbinom(n=1000, size=1, prob=.3) • mean(Y) = 0.295 • population mean: p = .3 • var(Y) = 0.208 • population variance: p*q = (.3 * .7) = .21 • hist(Y)
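The dependence of the variance on the mean can be checked numerically. The sketch below repeats the simulation for several values of p and compares the sample variance with p*(1-p). The seed, the sample size, and the grid of probabilities are arbitrary choices made for illustration.

```r
# Simulate Bernoulli data for several success probabilities and compare
# the sample variance with the theoretical value p * (1 - p).
set.seed(818)                      # arbitrary seed, for reproducibility only
n  <- 1000                         # illustrative sample size
ps <- c(0.1, 0.3, 0.5, 0.7, 0.9)   # illustrative grid of probabilities

check <- t(sapply(ps, function(p) {
  y <- rbinom(n = n, size = 1, prob = p)
  c(p = p, mean = mean(y), var = var(y), theory = p * (1 - p))
}))

# One row per value of p; the variance peaks at p = .5 and shrinks toward 0 and 1.
print(round(check, 3))
```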

Describing Dichotomous Data • Proportion of successes (p) • Odds • Odds of an event is the probability it occurs divided by the probability it does not occur • p/(1-p) • if p=.53; odds=.53/.47 = 1.13

Modeling Y (Categorical X) • Odds Ratio • Used to compare two proportions across groups • odds for males = .53/(1-.53) = 1.13 • odds for females = .62/(1-.62) = 1.63 • Odds-ratio = 1.63/1.13 = 1.44 • The odds of a 1 for a female are 1.44 times the odds for a male • Or… 1.13/1.63 = 0.69 • The odds of a 1 for a male are .69 times the odds for a female • OR > 1: increased odds for group 1 relative to 2 • OR = 1: no difference in odds for group 1 relative to 2 • OR < 1: lower odds for group 1 relative to 2
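The odds and odds-ratio arithmetic can be written as two small helper functions. The function names `odds` and `odds_ratio` are made up for this sketch. The proportions are the ones used for males (.53) and females (.62).

```r
# Odds for a proportion p, and the odds ratio comparing two proportions.
odds       <- function(p) p / (1 - p)
odds_ratio <- function(p1, p2) odds(p1) / odds(p2)

p_male   <- 0.53
p_female <- 0.62

odds(p_male)                    # about 1.13
odds(p_female)                  # about 1.63
odds_ratio(p_female, p_male)    # females relative to males
odds_ratio(p_male, p_female)    # males relative to females (the reciprocal)
```

Working from unrounded odds gives an odds ratio near 1.45. Dividing the rounded odds, 1.63/1.13, gives 1.44.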

Modeling Y (Categorical X) • Odds-ratio for a 2 x 2 table • Odds(Hi) • 11/4 • Odds(Lo) • 2/6 • O.R. = (11/4)/(2/6) = 8.25 • Odds of HD are 8.25 times as large for high cholesterol
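For a 2 x 2 table of counts, the odds ratio can be computed row by row or with the cross-product shortcut. The counts below are hypothetical and are not the table from the slide. They are chosen only so the arithmetic is easy to follow.

```r
# Odds ratio from a 2 x 2 table of counts (hypothetical counts).
# Rows are the two groups, columns are outcome present / absent.
tab <- matrix(c(20, 10,
                 5, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(group   = c("Hi", "Lo"),
                              outcome = c("yes", "no")))

odds_by_group <- tab[, "yes"] / tab[, "no"]     # odds within each row
or_hat <- unname(odds_by_group["Hi"] / odds_by_group["Lo"])

# Equivalent cross-product form: (a * d) / (b * c)
or_cross <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

print(odds_by_group)
print(c(or_hat = or_hat, or_cross = or_cross))
```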

Odds-Ratio • Ranges from 0 to infinity • 0 ← 1 → ∞ (1 = no difference) • Tends to be skewed • Often transform to log-odds to get symmetry • The log-OR comparing females to males = log(1.44) = 0.36 • The log-OR comparing males to females = log(1/1.44) = -0.36
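The symmetry of the log scale is easy to verify. This short sketch takes the odds ratio of 1.44 from the example and compares it with its reciprocal on both scales. The variable names are illustrative.

```r
# Odds ratios are asymmetric around 1; log odds ratios are symmetric around 0.
or_f_vs_m <- 1.44             # females relative to males
or_m_vs_f <- 1 / or_f_vs_m    # males relative to females

c(or_f_vs_m, or_m_vs_f)               # 1.44 and roughly 0.69
c(log(or_f_vs_m), log(or_m_vs_f))     # equal magnitude, opposite sign
```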

Modeling Y (Continuous X) • We need to form a general prediction model • Standard OLS regression won’t work • The errors of a dichotomous variable cannot be normally distributed with constant variance • Also, the estimated parameters don’t make much sense • Let’s look at a scatterplot of dichotomous data…

Dichotomous Scatterplot • What smooth function can we use to model something that looks like this?

Dichotomous Scatterplot • OLS regression? Smooth but…

Dichotomous Scatterplot • Could break X into groups to form a more continuous scale for Y • proportion or percentage scale

Dichotomous Scatterplot • Now, plot the categorized data • Notice the “S” shape? = sigmoid • Notice that we just shifted to a continuous scale?

Dichotomous Scatterplot • We can fit a smooth function by modeling the probability of success (“1”) directly • Model the probability of a ‘1’ rather than the (0,1) data directly
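The grouping idea from these scatterplot slides can be reproduced on simulated data. Everything in the sketch below is an assumption made for illustration: the predictor range, the bin width, and the curve used to generate the 0/1 responses. It is not the data behind the original plots.

```r
# Simulate a dichotomous outcome whose probability rises with x.
set.seed(818)                                # arbitrary seed
n <- 500
x <- runif(n, min = 20, max = 70)            # hypothetical predictor
y <- rbinom(n, size = 1, prob = plogis(-5 + 0.11 * x))   # hypothetical curve

# Break X into groups and compute the proportion of 1s in each group.
bins   <- cut(x, breaks = seq(20, 70, by = 5), include.lowest = TRUE)
prop_1 <- tapply(y, bins, mean)
mids   <- seq(22.5, 67.5, by = 5)            # bin midpoints

# Raw 0/1 points in grey, grouped proportions as connected dots.
plot(x, y, col = "grey", xlab = "X", ylab = "Y and proportion of 1s")
points(mids, prop_1, pch = 19, type = "b")
```

The grouped proportions live on a continuous 0 to 1 scale and trace the rising part of an S-shaped curve, which is what the logistic function is meant to capture.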

Another Example

Another Example (cont)

Logistic Equation • $E(y|x) = \pi(x) = \frac{e^{u}}{1 + e^{u}}$, where $u = \beta_0 + \beta_1 x$ • $\pi(x)$ = probability that a person with a given x-score will have a score of ‘1’ on Y • Could just expand u to include more predictors for a multiple logistic regression

Logistic Regression • $\beta_0$ shifts the distribution (changes the value of x where $\pi(x)$ = .5, which is $-\beta_0/\beta_1$) • $\beta_1$ reflects the steepness of the transition (slope)
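A quick way to see what the two parameters do is to draw a few curves. In this sketch `b0` and `b1` are names chosen here for the intercept and slope, and the values are illustrative only.

```r
# Logistic function: probability of a 1 as a function of x.
# b0 (intercept) and b1 (slope) are illustrative names and values.
logistic <- function(x, b0, b1) {
  u <- b0 + b1 * x
  exp(u) / (1 + exp(u))           # same result as plogis(u)
}

x <- seq(-10, 10, by = 0.5)
curve_a <- logistic(x, b0 = 0, b1 = 1)   # reference curve, crosses .5 at x = 0
curve_b <- logistic(x, b0 = 2, b1 = 1)   # shifted: crosses .5 at x = -2
curve_c <- logistic(x, b0 = 0, b1 = 3)   # steeper transition

matplot(x, cbind(curve_a, curve_b, curve_c), type = "l", lty = 1:3,
        xlab = "x", ylab = "P(Y = 1 | x)")
```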

Features of Logistic Regression • Change in probability is not constant (linear) with constant changes in X • The probability of a success (Y = 1) given the predictor variable (X) is a non-linear function of X • Can rewrite the logistic equation as an odds: $\frac{\pi(x)}{1 - \pi(x)} = e^{\beta_0 + \beta_1 x}$

Logit Transform • Can linearize the logistic equation by using the “logit” transformation • Apply the natural log to both sides of the odds equation • Yields the logit or log-odds: $\ln\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \beta_0 + \beta_1 x$

Logit Transformation • The logit transformation puts the interpretation of the regression estimates back on familiar footing • $\beta_0$ = expected value of the logit (log-odds) when X = 0 • $\beta_1$ = ‘logit difference’ = the amount the logit (log-odds) changes with a one-unit change in X

Logit • The natural log of the odds • Often called the log-odds • The logit scale is continuous, linear, and functions much like a z-score scale • p = 0.50, then logit = 0 • p = 0.70, then logit = 0.85 • p = 0.30, then logit = -0.85
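The logit values can be tabulated directly. The helper name `logit` is chosen here. Base R's `qlogis()` computes the same quantity.

```r
# Logit (log-odds) for a few probabilities.
logit <- function(p) log(p / (1 - p))    # same as qlogis(p)

p <- c(0.10, 0.30, 0.50, 0.70, 0.90)
logit_table <- data.frame(p     = p,
                          odds  = round(p / (1 - p), 2),
                          logit = round(logit(p), 2))
print(logit_table)
```

The logits are symmetric around 0: probabilities equally far from .50 give the same magnitude with opposite signs.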

Odds-Ratios and Logistic Regression • The slope may also be interpreted as the log odds-ratio associated with a unit increase in x • $\exp(\beta_1)$ = odds-ratio • Compare the log odds (logit) of a person with a score of x to a person with a score of x+1
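The comparison of x with x+1 can be carried out numerically. In the sketch below the intercept `b0` and slope `b1` are hypothetical values. The point is that the odds ratio for a one-unit increase is the same at every x, while the change in probability is not.

```r
# Hypothetical intercept and slope on the logit scale.
b0 <- -2
b1 <- 0.5

odds_at <- function(x) exp(b0 + b1 * x)      # odds of a 1 at a given x
prob_at <- function(x) plogis(b0 + b1 * x)   # probability of a 1 at a given x

x      <- c(0, 3, 10)
ratio  <- odds_at(x + 1) / odds_at(x)        # odds ratio for x + 1 versus x
diff_p <- prob_at(x + 1) - prob_at(x)        # change in probability

print(data.frame(x, odds_ratio = ratio, prob_change = diff_p))
exp(b1)                                      # matches every entry of odds_ratio
```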

There and back again… • If the data are consistent with a logistic function, then the relationship between the predictor and the logit is linear • The logit scale is somewhat difficult to understand • Could interpret as odds, but people seem to prefer probability as the natural scale, so…

There and back again… • Logit ↔ Odds ↔ Probability
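Moving between the three scales takes one line each. The function names below are invented for this sketch and the logit values are arbitrary.

```r
# Conversions between the logit, odds, and probability scales.
logit_to_odds <- function(logit) exp(logit)
odds_to_prob  <- function(odds) odds / (1 + odds)
prob_to_logit <- function(p) log(p / (1 - p))

logit <- c(-2, -0.85, 0, 0.85, 2)      # arbitrary values on the logit scale
odds  <- logit_to_odds(logit)
prob  <- odds_to_prob(odds)

print(round(data.frame(logit, odds, prob), 3))

# Going back again recovers the starting logits.
all.equal(prob_to_logit(prob), logit)
```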

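Estimation • OLS assumptions are not met, so some variant of maximum likelihood estimation (MLE) is used • Let’s develop the likelihood • Assuming observations are independent: $L = \prod_{i=1}^{n} P(Y_i = y_i)$

Estimation • Likelihood: $L = \prod_{i=1}^{n} \pi(x_i)^{y_i} \left[1 - \pi(x_i)\right]^{1 - y_i}$ • Recall: $\pi(x_i) = \frac{e^{u_i}}{1 + e^{u_i}}$, with $u_i = \beta_0 + \beta_1 x_i$

Estimation • Upon substitution: $\ln L = \sum_{i=1}^{n} \left[ y_i(\beta_0 + \beta_1 x_i) - \ln\left(1 + e^{\beta_0 + \beta_1 x_i}\right) \right]$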
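To make the estimation step concrete, the log-likelihood for independent 0/1 observations with a logistic mean can be maximized numerically and compared with `glm()`. This is a minimal sketch on simulated data. The sample size, the generating coefficients, and the use of `optim()` with BFGS are all assumptions made for illustration. `glm()` itself uses Fisher scoring, not this optimizer.

```r
# Simulated data with known (hypothetical) coefficients.
set.seed(818)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative log-likelihood: -sum( y * u - log(1 + exp(u)) ), u = b0 + b1 * x.
negloglik <- function(beta) {
  u <- beta[1] + beta[2] * x
  -sum(y * u - log(1 + exp(u)))
}

# Minimize the negative log-likelihood, starting from (0, 0).
fit_optim <- optim(par = c(0, 0), fn = negloglik, method = "BFGS")

# Reference fit from glm().
fit_glm <- glm(y ~ x, family = binomial)

print(rbind(optim = fit_optim$par, glm = unname(coef(fit_glm))))
```

The two rows should agree to several decimal places, since both procedures maximize the same likelihood.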

Example • Heart Disease & Age • 100 participants • DV = presence of heart disease • IV = Age

Heart Disease Example

Heart Disease Example • library(MASS) • glm(formula = y ~ age, family = binomial, data = mydata)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
age          0.11092    0.02406   4.610 4.02e-06 ***

    Null deviance: 136.66  on 99  degrees of freedom
Residual deviance: 107.35  on 98  degrees of freedom
AIC: 111.35

Number of Fisher Scoring iterations: 4
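Several numbers in this output can be rebuilt from the others, which helps when reading it. The sketch below uses only the reported estimates, standard errors, and deviances. The variable names are illustrative, and the drop-in-deviance test at the end is a standard check that the slide itself does not show.

```r
# Reported estimates and standard errors from the output above.
est <- c(intercept = -5.30945, age = 0.11092)
se  <- c(intercept =  1.13365, age = 0.02406)

# z value = estimate / standard error; two-sided p-value from the normal.
z       <- est / se
p_value <- 2 * pnorm(-abs(z))
print(cbind(est, se, z = round(z, 3), p_value))

# AIC = residual deviance + 2 * (number of estimated coefficients).
aic <- 107.35 + 2 * length(est)
print(aic)

# Drop in deviance from the null model to the model with age,
# referred to a chi-square with 99 - 98 = 1 degree of freedom.
dev_drop <- 136.66 - 107.35
p_lr     <- pchisq(dev_drop, df = 99 - 98, lower.tail = FALSE)
print(c(dev_drop = dev_drop, p_lr = p_lr))
```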

Heart Disease Example • Logistic regression • $\pi(\text{age}) = \frac{e^{-5.31 + 0.111\,\text{age}}}{1 + e^{-5.31 + 0.111\,\text{age}}}$ • Odds-Ratio • exp(.111) = 1.117

Heart Disease Example • In terms of logits… • $\ln\left[\frac{\pi(\text{age})}{1 - \pi(\text{age})}\right] = -5.31 + 0.111\,\text{age}$
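The fitted coefficients can be turned back into odds and probabilities for particular ages. This sketch uses the reported estimates (-5.30945 and 0.11092). The ages are picked arbitrarily and the variable names are illustrative.

```r
# Reported coefficients from the heart disease example.
b0 <- -5.30945
b1 <-  0.11092

age   <- c(30, 40, 50, 60, 70)     # arbitrary ages for illustration
logit <- b0 + b1 * age             # linear on the logit scale
odds  <- exp(logit)
prob  <- odds / (1 + odds)         # same as plogis(logit)

print(round(data.frame(age, logit, odds, prob), 3))

exp(b1)          # odds ratio for a one-year increase in age
exp(10 * b1)     # odds ratio for a ten-year increase in age
-b0 / b1         # age at which the predicted probability is .5
```

Each additional year multiplies the odds by the same factor, but the change in probability depends on where on the curve the starting age sits.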
