Logistic Regression and Odds Ratios Psych 818 - DeShon
Dichotomous Response • Used when the outcome or DV is a dichotomous random variable • Can take only one of two possible values (1, 0) • Pass/Fail • Disease/No Disease • Agree/Disagree • True/False • Present/Absent • This data structure causes problems for OLS regression
Dichotomous Response • Properties of dichotomous response variables (Y) • Positive response (success = 1) occurs with probability p • Negative response (failure = 0) occurs with probability q = (1 - p) • E(Y) = p, the observed proportion of successes • Var(Y) = p*q • Oops! The variance depends on the mean
Dichotomous Response • Let's generate some (0,1) data in R • Y <- rbinom(n=1000, size=1, prob=.3) • mean(Y) = 0.295 ≈ p = .3 • var(Y) = 0.208 ≈ p*q = (.3 * .7) = .21 • hist(Y)
Describing Dichotomous Data • Proportion of successes (p) • Odds • Odds of an event is the probability it occurs divided by the probability it does not occur • p/(1-p) • if p=.53; odds=.53/.47 = 1.13
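The odds calculation above is a one-liner. A minimal sketch (in Python rather than the deck's R, purely for illustration):

```python
def odds(p):
    # Odds of an event: probability it occurs divided by
    # the probability it does not occur
    return p / (1 - p)

print(round(odds(0.53), 2))  # .53/.47 -> 1.13
```

This reproduces the slide's example: p = .53 gives odds of about 1.13.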
Modeling Y (Categorical X) • Odds Ratio • Used to compare the odds of an event across two groups • odds for males = .53/(1-.53) = 1.13 • odds for females = .62/(1-.62) = 1.63 • Odds-ratio = 1.63/1.13 = 1.44 • A female is 1.44 times more likely than a male to get a 1 • Or… 1.13/1.63 = 0.69 • A male is .69 times as likely as a female to get a 1 • OR > 1: increased odds for group 1 relative to 2 • OR = 1: no difference in odds for group 1 relative to 2 • OR < 1: lower odds for group 1 relative to 2
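The male/female comparison can be checked directly. A sketch (Python for illustration; the group proportions .53 and .62 are the slide's):

```python
def odds(p):
    # Odds: p / (1 - p)
    return p / (1 - p)

# Odds ratio for females vs. males
or_f_vs_m = odds(0.62) / odds(0.53)   # ~1.447 unrounded
# The slide's 1.44 comes from dividing the pre-rounded odds 1.63/1.13
print(round(or_f_vs_m, 3))
```

Note the small discrepancy: dividing the exact odds gives about 1.447, while dividing the rounded odds (1.63/1.13) gives 1.44.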
Modeling Y (Categorical X) • Odds-ratio for a 2 x 2 table • Odds(Hi) = 11/4 • Odds(Lo) = 2/5 • O.R. = (11/4)/(2/5) = 8.25 • The odds of heart disease are 8.25 times larger for those with high cholesterol
Odds-Ratio • Ranges from 0 to infinity, with 1 indicating no difference • The distribution tends to be skewed • Often transform to log-odds to get symmetry • The log-OR comparing females to males = log(1.44) = 0.36 • The log-OR comparing males to females = log(0.69) = -0.36
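The symmetry claim is easy to verify: reversing the comparison flips the sign of the log-odds-ratio. A quick check (Python for illustration):

```python
import math

or_f_vs_m = 1.44               # odds ratio, females vs. males (from the slide)
or_m_vs_f = 1 / or_f_vs_m      # reversing the comparison inverts the OR

print(round(math.log(or_f_vs_m), 2))  # 0.36
print(round(math.log(or_m_vs_f), 2))  # -0.36: same magnitude, opposite sign
```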
Modeling Y (Continuous X) • We need to form a general prediction model • Standard OLS regression won't work • The errors of a dichotomous variable cannot be normally distributed with constant variance • Also, the estimated parameters don't make much sense • Let's look at a scatterplot of dichotomous data…
Dichotomous Scatterplot • What smooth function can we use to model something that looks like this?
Dichotomous Scatterplot • OLS regression? Smooth but…
Dichotomous Scatterplot • Could break X into groups to form a more continuous scale for Y • proportion or percentage scale
Dichotomous Scatterplot • Now, plot the categorized data • Notice the "S" shape? That's a sigmoid • Notice that we just shifted to a continuous scale
Dichotomous Scatterplot • We can fit a smooth function by modeling the probability of a success ('1') directly, rather than the (0,1) data
Logistic Equation • E(y|x) = π(x) = exp(β₀ + β₁x) / (1 + exp(β₀ + β₁x)) = probability that a person with a given x-score will have a score of '1' on Y • Could expand the linear predictor, u = β₀ + β₁x, to include more predictors for a multiple logistic regression
Logistic Regression • β₀ shifts the distribution (determines the value of x where π = .5) • β₁ reflects the steepness of the transition (slope)
Features of Logistic Regression • The change in probability is not constant (linear) with constant changes in X • The probability of a success (Y = 1) given the predictor (X) is a non-linear function of X • Can rewrite the logistic equation as an odds: π(x)/(1 - π(x)) = exp(β₀ + β₁x)
Logit Transform • Can linearize the logistic equation by using the "logit" transformation • Apply the natural log to both sides of the odds equation • Yields the logit, or log-odds: ln[π(x)/(1 - π(x))] = β₀ + β₁x
Logit Transformation • The logit transformation puts the interpretation of the regression estimates back on familiar footing • β₀ = expected value of the logit (log-odds) when X = 0 • β₁ = 'logit difference' = the amount the logit (log-odds) changes with a one-unit change in X
Logit • Logit • the natural log of the odds • often called a log odds • logit scale is continuous, linear, and functions much like a z-score scale. • p = 0.50, then logit = 0 • p = 0.70, then logit = 0.84 • p = 0.30, then logit = -0.84
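The logit values on the slide can be computed directly. A minimal sketch (Python for illustration):

```python
import math

def logit(p):
    # Natural log of the odds, p / (1 - p)
    return math.log(p / (1 - p))

print(logit(0.5))             # 0.0
print(round(logit(0.7), 3))   # 0.847 (the slide rounds to 0.84)
print(round(logit(0.3), 3))   # -0.847: symmetric around p = .5
```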
Odds-Ratios and Logistic Regression • The slope may also be interpreted as the log odds-ratio associated with a unit increase in x • exp(β₁) = odds-ratio • Compare the log odds (logit) of a person with a score of x to a person with a score of x + 1: [β₀ + β₁(x + 1)] - [β₀ + β₁x] = β₁
There and back again… • If the data are consistent with a logistic function, then the relationship between the model and the logit is linear • The logit scale is somewhat difficult to understand • Could interpret as odds but people seem to prefer probability as the natural scale, so…
There and back again… • Logit → Odds → Probability • odds = exp(logit) • probability = odds / (1 + odds)
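The round trip from logit back to probability is two steps: exponentiate to get odds, then convert odds to a probability. A sketch (Python for illustration):

```python
import math

def logit_to_prob(l):
    odds = math.exp(l)          # logit -> odds
    return odds / (1 + odds)    # odds -> probability

print(logit_to_prob(0.0))      # 0.5: a logit of 0 is even odds
```

A logit of 0.847 maps back to a probability of about .70, undoing the logit example above.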
Estimation • The data don't meet OLS assumptions, so a variant of maximum likelihood estimation (MLE) is used • Let's develop the likelihood • Assuming observations are independent, the likelihood is the product of the contributions of the n observations
Estimation • Likelihood: L(β₀, β₁) = ∏ᵢ P(Yᵢ = yᵢ) • Recall that P(Yᵢ = 1) = πᵢ and P(Yᵢ = 0) = 1 - πᵢ
Estimation • Upon substitution: L(β₀, β₁) = ∏ᵢ πᵢ^yᵢ (1 - πᵢ)^(1 - yᵢ)
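In practice the log of this likelihood is maximized, which turns the product into a sum. A sketch of the log-likelihood for a Bernoulli outcome (Python for illustration; the data values are made up):

```python
import math

def log_likelihood(y, pi):
    # Sum over observations of y*ln(pi) + (1 - y)*ln(1 - pi),
    # the log of the product likelihood on the slide
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
               for yi, p in zip(y, pi))

# Two hypothetical observations with predicted probabilities .7 and .3:
# a success predicted at .7 and a failure predicted at .3
print(round(log_likelihood([1, 0], [0.7, 0.3]), 4))
```

Software such as R's glm() maximizes this quantity iteratively (the "Fisher Scoring iterations" in the output below).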
Example • Heart Disease & Age • 100 participants • DV = presence of heart disease • IV = Age
Heart Disease Example • glm(formula = y ~ age, family = binomial, data = mydata)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.30945    1.13365  -4.683 2.82e-06 ***
age          0.11092    0.02406   4.610 4.02e-06 ***
Null deviance: 136.66 on 99 degrees of freedom
Residual deviance: 107.35 on 98 degrees of freedom
AIC: 111.35
Number of Fisher Scoring iterations: 4
Heart Disease Example • Logistic regression equation: logit(π) = -5.309 + 0.111(age) • Odds-Ratio • exp(0.111) = 1.117 • The odds of heart disease increase by a factor of 1.117 with each additional year of age
Heart Disease Example • In terms of logits, the predicted log-odds of heart disease for a person of a given age is -5.309 + 0.111(age)
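The fitted coefficients can be turned into a predicted probability for any age by back-transforming the logit. A sketch (Python for illustration; the coefficients are the glm estimates above, and age 50 is an arbitrary example value):

```python
import math

# Fitted coefficients from the glm output
b0, b1 = -5.30945, 0.11092

def prob_hd(age):
    # Predicted logit, back-transformed to a probability
    l = b0 + b1 * age
    return math.exp(l) / (1 + math.exp(l))

print(round(prob_hd(50), 3))  # ~0.559
```

So the model predicts roughly a 56% chance of heart disease at age 50, and the probability rises with age since β₁ is positive.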