1. Advanced Research Methods II, 04/13/2009: Logistic Regression

2. Topic Overview
• What is it?
• Basic procedures:
  • Estimating the model coefficients
  • Interpreting the logistic regression coefficients
  • Assessing model fit
  • Evaluating association
• Logistic regression vs. discriminant analysis
• Conducting LR using SPSS

3. What is it?
• An analysis examining the relationship between one or more IVs (continuous or categorical) and a categorical DV.
• Why can't we use multiple linear regression (MLR)? Recall that MLR predicts the value of the DV from a linear combination of the IVs:
  Y = b0 + b1X1 + b2X2 + ... + bkXk + e
• The predicted value of the DV is the mean of all cases that have the same values on the IVs.
• When the DV is categorical (0, 1), the mean of all cases with the same IV values becomes the probability of being in category 1 for cases with those values.
• So, can MLR be used to predict the probability of being in one category? Problem: a probability ranges from 0.00 to 1.00, while the linear combination of the IVs can range from -∞ to +∞.

4. What is it? (cont.)
• Why can't we use MLR? Using it would violate the assumptions of linearity and homoscedasticity.
• Solution: find a function that transforms the probability so that the resulting variable can vary freely (i.e., take values from -∞ to +∞) and can therefore serve as the dependent variable predicted by the linear combination of the IVs.
• That function is ln[p/(1-p)] = logit(p), the natural logarithm of the odds.
• z is the natural logarithm of O (i.e., z = ln(O)) when e^z = O (e ≈ 2.71828).
• O takes values from 0 to +∞, while z takes values from -∞ to +∞ (z is negative when O < 1; z = 0 when O = 1). See the sketch below.
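To make the transform concrete, here is a minimal Python sketch (not part of the original slides) of the logit function and its inverse, showing how probabilities near 0 and 1 map to arbitrarily large negative and positive values:

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the whole real line via ln(p / (1 - p))."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map any real number z back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Probabilities near 0 and 1 map to large negative / positive logits,
# so the transformed variable can range over (-inf, +inf).
for p in (0.001, 0.2, 0.5, 0.8, 0.999):
    print(f"p = {p:>5}  ->  logit(p) = {logit(p):+.2f}")
```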

5. What is it? (cont.)
• Odds: the ratio of two probabilities, O = p/(1-p); O ranges from 0 to +∞.
  E.g., DV: dropout (0 = staying, 1 = leaving). If the probability of leaving is 0.20, the probability of staying is 1 - 0.20 = 0.80, and the odds of leaving are 0.20/0.80 = 0.25.
• Log odds: logit(p) = ln(O) = ln[p/(1-p)]; logit(p) ranges from -∞ to +∞. For the example above, logit(0.20) = -1.39.
• Other examples:
  p = .30  O = 0.43  logit(p) = -0.85
  p = .40  O = 0.67  logit(p) = -0.41
  p = .50  O = 1.00  logit(p) =  0.00
  p = .60  O = 1.50  logit(p) = +0.41
  p = .70  O = 2.33  logit(p) = +0.85
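The table above can be reproduced in a few lines of Python (an illustrative sketch, not part of the original deck):

```python
import math

for p in (0.20, 0.30, 0.40, 0.50, 0.60, 0.70):
    odds = p / (1 - p)          # O = p / (1 - p)
    log_odds = math.log(odds)   # logit(p) = ln(O)
    print(f"p = {p:.2f}   O = {odds:.2f}   logit(p) = {log_odds:+.2f}")
```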

6. What is it? (cont.)
• Logistic regression: determining a linear combination of the IVs such that it has the highest correlation with the logit of the probability of belonging to the category coded 1 in the original DV. That means estimating a set of coefficients b0 through bk:
  logit(p) = ln(O) = ln[p/(1-p)] = b0 + b1x1 + b2x2 + ... + bkxk

7. Basic Procedures: Estimating the Coefficients
• Based on the maximum likelihood (ML) procedure: estimating, through iterations, a set of coefficients (weights) for the IVs that maximizes the likelihood (probability) of observing the pattern of data.
• For computational convenience, the ML procedure minimizes the value -2LL (the natural logarithm of the likelihood of observing the current data, multiplied by -2).
• -2LL is distributed as a chi-square with n - k degrees of freedom (n = sample size, k = number of IVs). A crude illustration of the iterative fitting follows.
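To show what the iterations are doing, here is a deliberately crude Python sketch that minimizes -2LL by gradient descent on an invented toy dataset; real packages such as SPSS use Newton-Raphson (iteratively reweighted least squares) instead:

```python
import math

# Toy data: one IV (x) and a binary DV (y); values invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [0,   0,   1,   0,   1,   1]

def neg2ll(b0, b1):
    """-2 * log-likelihood of the data under the model logit(p) = b0 + b1*x."""
    ll = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
        ll += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -2 * ll

# Crude gradient descent on -2LL; each pass nudges the coefficients toward
# values that make the observed data more likely (i.e., a smaller -2LL).
b0 = b1 = 0.0
for _ in range(20000):
    g0 = g1 = 0.0
    for xi, yi in zip(x, y):
        p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
        g0 += -2 * (yi - p)        # d(-2LL)/db0
        g1 += -2 * (yi - p) * xi   # d(-2LL)/db1
    b0 -= 0.01 * g0
    b1 -= 0.01 * g1

print(f"b0 = {b0:.3f}  b1 = {b1:.3f}  -2LL = {neg2ll(b0, b1):.3f}")
```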

8. Basic Procedures: Interpreting the Logistic Regression Coefficients
• Meaning of the regression coefficients in MLR: Y = b0 + b1X1 + b2X2 + ... + bkXk, where b1 is the average change in Y associated with a one-unit change in X1.
• In logistic regression: ln(O) = b0 + b1X1 + b2X2 + ... + bkXk, where b1 is the average change in ln(O) associated with a one-unit change in X1.
• It follows that when X1 increases by one unit, the new odds equal the current odds multiplied by e^b1. e^b1 is called the odds ratio (OR; the ratio of the new odds to the old odds), as illustrated below.
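A quick numerical check of this interpretation, using hypothetical coefficients (b0 = -2.0 and b1 = 0.8 are invented for illustration):

```python
import math

b0, b1 = -2.0, 0.8   # hypothetical fitted coefficients

def odds(x1):
    """Odds of belonging to category 1 when the IV equals x1: e^(b0 + b1*x1)."""
    return math.exp(b0 + b1 * x1)

# A one-unit increase in X1 multiplies the odds by e^b1 (the odds ratio):
print(odds(3.0) / odds(2.0))   # ratio of new odds to old odds
print(math.exp(b1))            # e^b1 -- the same value, about 2.2255
```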

9. Basic Procedures: Interpreting the Logistic Regression Coefficients (cont.)
• When b1 > 0, OR > 1: an increase in X1 leads to an increase in the odds (probability) of belonging to category 1.
• When b1 < 0, OR < 1: an increase in X1 leads to a decrease in the odds (probability) of belonging to category 1.
• When X1 is a categorical variable (e.g., male = 0, female = 1), e^b1 is the odds ratio of females to males (e.g., the ratio of the odds that a female student will drop out to the odds that a male student will drop out).
• Note: the odds ratio (a ratio of odds) is different from the risk ratio (a ratio of probabilities).

10. Basic Procedures: Assessing Model Fit
• Uses -2LL (distributed as a chi-square; larger when the model fits poorly).
• Based on hierarchically nested models: compare the -2LL of the model with all the IVs to that of the "null" model, i.e., the model with no IVs.
• Null hypothesis H0: both models fit equally well (i.e., the IVs are not good predictors).
• Alternative hypothesis H1: the model with all the IVs fits better (i.e., the IVs are good predictors).
• If the chi-square is significant, reject H0. A sketch of this test follows.
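A minimal sketch of this likelihood-ratio test, assuming invented -2LL values and k = 3 IVs (7.81 is the chi-square critical value for df = 3 at alpha = .05):

```python
# Hypothetical -2LL values from a fitted run (numbers invented for illustration):
neg2ll_null = 120.5   # model with the intercept only
neg2ll_full = 101.3   # model with all k = 3 IVs

# Model chi-square: the drop in -2LL, with df = number of IVs added.
chi_sq = neg2ll_null - neg2ll_full   # 19.2
df = 3
CRITICAL_05 = 7.81                   # chi-square(3) cutoff at alpha = .05

if chi_sq > CRITICAL_05:
    print(f"chi-square({df}) = {chi_sq:.1f}: reject H0; the IVs improve fit.")
else:
    print(f"chi-square({df}) = {chi_sq:.1f}: fail to reject H0.")
```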

11. Basic Procedures: Evaluating the Association ("Effect Size")
• "R-square-like" indices: based on -2LL, and conceptually analogous to the R-square in MLR in the sense that they reflect the proportion of information in the data explained by the IVs.
• Cox and Snell's R-square (R²M):
  • Range: 0 ≤ R²M ≤ 1 (though its attainable maximum falls below 1)
  • Affected by the base rate
• Nagelkerke's R-square (R²N):
  • Computed by dividing R²M by its maximum possible value
  • Range: 0 ≤ R²N ≤ 1
  • Also affected by the base rate

12. Basic Procedures: Evaluating the Association ("Effect Size") (cont.)
• Likelihood R-square: R²L = (L0 - LM) / L0, where L0 = -2LL for the null model and LM = -2LL for the full model with all IVs.
  • Range: 0 ≤ R²L ≤ 1
  • Not affected by the base rate; probably the best index
  • Not provided directly by SPSS, but easily calculated from the -2LL of the null model and that of the full model (see the sketch below)
• Classification accuracy (based on the classification table):
  • Same as in DA
  • Can be used to compare DA and LR
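All three indices can be computed by hand from the two -2LL values; here is a sketch with invented numbers (L0, LM, and n are hypothetical):

```python
import math

# Hypothetical values: L0 = -2LL of the null model, LM = -2LL of the full
# model, n = sample size (all invented for illustration).
L0, LM, n = 120.5, 101.3, 100

r2_cs = 1 - math.exp(-(L0 - LM) / n)   # Cox & Snell's R-square
r2_max = 1 - math.exp(-L0 / n)         # its maximum attainable value
r2_n = r2_cs / r2_max                  # Nagelkerke: rescaled to top out at 1
r2_l = (L0 - LM) / L0                  # likelihood R-square

print(f"Cox & Snell = {r2_cs:.3f}  Nagelkerke = {r2_n:.3f}  Likelihood = {r2_l:.3f}")
```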

13. Discriminant Analysis and Logistic Regression
• LR requires less stringent assumptions, so it can be used when the IVs include both continuous and categorical variables, or when Box's test suggests that the assumption of equal within-group variance-covariance matrices is not tenable.
• LR uses maximum likelihood estimation and therefore requires larger sample sizes.
• Comparisons of the solutions provided by LR and DA are inconclusive (Fan & Wang, 1999).

14. Conducting LR using SPSS: Data
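The transcript captures only this slide's title; the remaining slides presumably show the SPSS data view, dialogs, and output. As a rough stand-in, here is a Python analogue using statsmodels on simulated data (the variable names, coefficients, and data are hypothetical, not the course dataset):

```python
import numpy as np
import statsmodels.api as sm

# Simulate data echoing the slides' example: predict dropout
# (0 = staying, 1 = leaving) from GPA and gender (0 = male, 1 = female).
rng = np.random.default_rng(0)
gpa = rng.uniform(1.0, 4.0, size=200)
gender = rng.integers(0, 2, size=200)
true_logit = 2.0 - 1.0 * gpa + 0.5 * gender        # invented "true" model
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Fit by maximum likelihood, as SPSS's logistic regression procedure does.
X = sm.add_constant(np.column_stack([gpa, gender]))
result = sm.Logit(y, X).fit()

print(result.summary())                            # coefficients, LLR chi-square, pseudo R-square
print("Odds ratios:", np.exp(result.params[1:]))   # e^b for each IV
```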
