
Logistic Regression


Presentation Transcript


  1. Logistic Regression Predicting Dichotomous Data

  2. Predicting a Dichotomy • Response variable has only two states: male/female, present/absent, yes/no, etc. • Linear regression fails because it cannot keep predictions within the bounds of 0–1 • Predictors may be continuous or categorical

  3. Logistic Model • Explanatory variables are used to predict the probability that the response is in the target state (male, present, yes, etc.) • We fit a linear model to the log of the odds that an event will occur • If the probability that an event will occur is p, then the odds = p/(1-p)

  4. Logits • Equations: • logit(p) = log(p/(1-p)) • logit(p) = b0 + b1x1 + b2x2 + … • So logistic regression is a linear regression on logits (logs of the odds)
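As a quick numerical check, base R's qlogis() and plogis() implement the logit and its inverse, so the definitions above can be verified directly:

p <- 0.75
odds <- p / (1 - p)    # 3: the event is three times as likely as not
log(odds)              # the logit, about 1.0986
qlogis(p)              # same value: qlogis() is the logit function
plogis(log(odds))      # 0.75 again: plogis() inverts the logit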

  5. Assumptions • Dichotomous response (only two states possible) • Outcomes statistically independent • Model contains all relevant predictors and no irrelevant ones • Sample sizes of about 50 cases per predictor

  6. Two Approaches • Data consisting of individual cases with a dichotomous variable • Grouped data where the number present and number absent are known for each combination of explanatory variables (in practice these will usually be categorical/ordinal)
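For the grouped approach, glm() accepts a two-column matrix of (present, absent) counts as the response. A minimal sketch with made-up counts (hypothetical data, not the Snodgrass example):

# Hypothetical grouped data: counts of present/absent per size class
grouped <- data.frame(SizeClass = 1:5,
                      Present = c(2, 5, 9, 14, 18),
                      Absent  = c(18, 15, 11, 6, 2))
# cbind(successes, failures) on the left-hand side fits the same logistic model
fit <- glm(cbind(Present, Absent) ~ SizeClass, family=binomial(logit), data=grouped)
summary(fit)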

  7. Inverting Snodgrass • Instead of seeing if houses inside the white wall are larger than those outside, we can use area to predict where the house is located.
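The commands that follow assume the Snodgrass house data are already loaded; one common source is the archdata package (an assumption here, substitute wherever your copy of the data lives):

# install.packages("archdata")   # if needed; assumed source of the data
library(archdata)
data(Snodgrass)
str(Snodgrass)   # confirm Inside, Area, Total, and Types are present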

  8. # Use Rcmdr to create a dichotomous variable In (1 = inside the wall, 0 = outside)
Snodgrass$In <- with(Snodgrass, ifelse(Inside=="Inside", 1, 0))
# Use Rcmdr to bin Area into 10 equal-width bins, using numbers to label them
Snodgrass$AreaBin <- bin.var(Snodgrass$Area, bins=10, method='intervals', labels=FALSE)
# Use Rcmdr to compute the mean Area and mean In for each AreaBin
AggregatedData <- aggregate(Snodgrass[,c("Area","In"), drop=FALSE],
  by=list(AreaBin=Snodgrass$AreaBin), FUN=mean)
# Plot the raw data
plot(In ~ Area, data=Snodgrass, las=1)
# Add the group means by AreaBin
points(AggregatedData[,2:3], type="b", pch=16)

  9. Fitting a Simple Model • We start with a simple model using Area only • Statistics | Fit Models | Generalized Linear Model • In is the response, Area is the explanatory variable • Family is binomial, Link function is logit

  10. > GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)
> summary(GLM.1)

Call:
glm(formula = In ~ Area, family = binomial(logit), data = Snodgrass)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.1103 -0.4815 -0.1836  0.2885  2.5706

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.663071   1.818444  -4.764 1.90e-06 ***
Area         0.034760   0.007515   4.626 3.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 123.669  on 90  degrees of freedom
Residual deviance:  57.728  on 89  degrees of freedom
AIC: 61.728

Number of Fisher Scoring iterations: 6

  11. Results • The slope for Area is highly significant – Area is a significant predictor of the odds of being inside the white wall • The residual deviance (57.7) is less than the residual degrees of freedom (89), which suggests no overdispersion and that the binomial model fits
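Because the model is linear in the log odds, exponentiating the Area coefficient converts it to a multiplicative effect on the odds; a quick follow-up check:

# Odds-ratio interpretation of the slope
exp(coef(GLM.1)["Area"])
# exp(0.03476) is about 1.035: each extra unit of Area multiplies the
# odds of being inside the white wall by about 1.035 (roughly 3.5%)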

  12. # Rcmdr command
> GLM.1 <- glm(In ~ Area, family=binomial(logit), data=Snodgrass)
# Typed commands
> x <- seq(20, 470, 5)
> y <- predict(GLM.1, data.frame(Area=x), type="response")
> plot(In ~ Area, data=Snodgrass, las=2)
> points(AggregatedData[,2:3], type="b", lty=2, pch=16)
> lines(x, y, col="red", lwd=2)
# Rcmdr command
> Snodgrass$Predicted <- with(Snodgrass,
+   factor(ifelse(fitted.GLM.1 < .5, "Outside", "Inside")))
# Use Rcmdr to produce a crosstabulation of Inside and Predicted
> .Table <- xtabs(~Inside+Predicted, data=Snodgrass)
> .Table
         Predicted
Inside    Inside Outside
  Inside      29       9
  Outside      5      48
> (29 + 48)/(29 + 9 + 5 + 48)
[1] 0.8461538

Predictions are correct 84.6% of the time.

  13. Expanding the Model • Expand the model by adding Total and Types • Check the results – neither of the new variables is significant, but this could be due to the high correlation between the two (+0.94) • Delete Types and try again

  14. Third Model • Without Types, Total is now highly significant • An ANOVA comparing the 2nd and 3rd models shows no significant difference, so the 3rd (simpler) model is preferred • The AIC (Akaike's Information Criterion) is also lower, which is better • The new model is 89% accurate
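A sketch of how that comparison might be run with typed commands (the model names GLM.2 and GLM.3 are illustrative, and Total and Types are assumed to be columns of Snodgrass as the slides describe):

# 2nd model: Area plus the two highly correlated count variables
GLM.2 <- glm(In ~ Area + Total + Types, family=binomial(logit), data=Snodgrass)
# 3rd model: drop Types
GLM.3 <- glm(In ~ Area + Total, family=binomial(logit), data=Snodgrass)
# Likelihood-ratio test: does dropping Types significantly worsen the fit?
anova(GLM.3, GLM.2, test="Chisq")
# Lower AIC is preferred
AIC(GLM.2, GLM.3)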

  15. Akaike Information Criterion • AIC measures the relative goodness of fit of a statistical model • Roughly, it describes the tradeoff between the accuracy and the complexity of the model • A method for comparing different statistical models – the model with the lower AIC is generally preferred
