Standard Binary Logistic Regression. Logistic regression.
Logistic regression • Logistic regression is used to analyze relationships between a dichotomous dependent variable and metric or non-metric independent variables. (SPSS now supports Multinomial Logistic Regression that can be used with more than two groups, but our focus now is on binary logistic regression for two groups.) • Logistic regression combines the independent variables to estimate the probability that a particular event will occur, i.e. a subject will be a member of one of the groups defined by the dichotomous dependent variable. In SPSS, the model is always constructed to predict the group with the higher numeric code. If responses are coded 1 for Yes and 2 for No, SPSS will predict membership in the No category. If responses are coded 1 for No and 2 for Yes, SPSS will predict membership in the Yes category. We will refer to the predicted event for a particular analysis as the modeled event. • Predicting the “No” event creates some awkward wording in our problems. Our only option for changing this is to recode the variable. • If the probability for group membership in the modeled category is above some cut point (the default is 0.50), the subject is predicted to be a member of the modeled group. If the probability is below the cut point, the subject is predicted to be a member of the other group. • For any given case, logistic regression computes the probability that a case with a particular set of values for the independent variables is a member of the modeled category.
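The probability and cut-point rule described above can be sketched in a few lines of Python. This is a minimal illustration, not SPSS's implementation: the function names, the coefficients, and the case values are all hypothetical.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Probability of the modeled event from the logistic function."""
    logit = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-logit))

def classify(probability, cut_point=0.50):
    """True when the case is predicted to be in the modeled group."""
    return probability > cut_point

# Hypothetical intercept, coefficients, and case values (illustration only).
p = predicted_probability(-1.0, [0.5, 0.25], [2.0, 4.0])
print(round(p, 3), classify(p))  # 0.731 True
```

With a logit of 1.0, the predicted probability is about 0.731, which is above the default cut point of 0.50, so this hypothetical case is classified in the modeled group.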
Level of measurement requirements • Logistic regression analysis requires that the dependent variable be dichotomous. • Logistic regression analysis requires that the independent variables be metric or non-metric. The logistic regression procedure will dummy-code non-metric variables for us. For logistic regression, we will use indicator dummy-coding, rather than deviation dummy-coding since I think it makes more sense to compare the odds for two groups rather than compare the odds for one group to the average odds for all groups. • If an independent variable is ordinal, we can either treat it as non-metric and dummy-code it or we can treat it as interval, in which case we will attach the usual caution. • Dichotomous independent variables do not have to be dummy-coded, but in our problems we will have SPSS dummy-code them because then we do not need to worry about the original codes for the variable as we can always interpret
Dummy-coding in SPSS - 1 When we want SPSS to dummy-code a variable, we enter the specifications in the Define Categorical Variables dialog box. Here we are dummy-coding sex, using the defaults of indicator coding with the last category as the reference category. SPSS shows you its coding scheme in the table of Categorical Variables Codings in the output. Since we chose the last category as reference, FEMALE is coded 0. In the table of coefficients, the dummy-coded variable is referred to by its original name plus the value for the Parameter coding in the Categorical Variables Codings table.
Dummy-coding in SPSS - 2 Here we are dummy-coding sex, using indicator coding with the first category as the reference category. Note that you must click on the Change button after selecting the First option button. SPSS shows you its coding scheme in the table of Categorical Variables Codings in the output. Since we chose the first category as reference, MALE is coded 0. In the table of coefficients, the dummy-coded variable is still referred to by its original name plus the value for the Parameter coding in the Categorical Variables Codings table, but in this case it stands for females.
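To make the effect of the reference category concrete, here is a small Python sketch of indicator coding. The function name and the numeric codes (1 = male, 2 = female) are assumptions for illustration; SPSS does this coding internally.

```python
def indicator_code(values, categories, reference="last"):
    """Indicator (dummy) coding: one 0/1 column per non-reference category."""
    ordered = sorted(categories)
    ref = ordered[-1] if reference == "last" else ordered[0]
    kept = [c for c in ordered if c != ref]
    return [[1 if v == c else 0 for c in kept] for v in values]

# Assumed codes for illustration: 1 = male, 2 = female.
sex = [1, 2, 2, 1]
last_ref = indicator_code(sex, {1, 2}, reference="last")    # female is the reference
first_ref = indicator_code(sex, {1, 2}, reference="first")  # male is the reference
print(last_ref)   # [[1], [0], [0], [1]]
print(first_ref)  # [[0], [1], [1], [0]]
```

With the last category as reference, females are coded 0 and sex(1) stands for males. With the first category as reference, males are coded 0 and sex(1) stands for females.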
Assumptions • Logistic regression does not make any assumptions of normality, linearity, and homogeneity of variance for the independent variables. • When the variables satisfy the assumptions of normality, linearity, and homogeneity of variance, discriminant analysis has historically been cited as the more effective statistical procedure for evaluating relationships with a non-metric dependent variable. However, logistic regression is being used more and more frequently because it can be interpreted similarly to other general linear model problems. • When the variables do not satisfy the assumptions of normality, linearity, and homogeneity of variance, logistic regression is the statistic of choice since it does not make these assumptions. • Multicollinearity is a problem for logistic regression with the same consequences as multiple regression, i.e. we are likely to misinterpret the contribution of independent variables when they are collinear. SPSS does not compute tolerance values for logistic regression, so we will detect it through the examination of standard errors. We will not interpret models when evidence of multicollinearity is found. • Evidence of multicollinearity is detected as a numerical problem in the attempted solution.
Numerical problems • The maximum likelihood method used to calculate logistic regression is an iterative fitting process that cycles through repetitions to find an answer. • Sometimes, the method will break down and not be able to converge or find an answer. • Sometimes the method will produce wildly improbable results, reporting that a one-unit change in an independent variable increases the odds of the modeled event by hundreds of thousands or millions. These implausible results can be produced by multicollinearity, categories of predictors having no cases or zero cells, and complete separation whereby the two groups are perfectly separated by the scores on one or more independent variables. • The clue that we have numerical problems and should not interpret the results is a standard error larger than 2.0 for one or more independent variables (this does not apply to the constant).
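The standard error check can be expressed as a simple screen over the coefficient table. The sketch below assumes the standard errors have been copied by hand from the SPSS output into a dictionary; the function name and all values are hypothetical.

```python
def numerical_problem_flags(standard_errors, threshold=2.0):
    """Names of predictors whose standard error exceeds the threshold.

    The constant is skipped because the rule does not apply to it.
    """
    return [name for name, se in standard_errors.items()
            if name != "Constant" and se > threshold]

# Hypothetical standard errors, as they might be read from a coefficient table.
se = {"educ": 0.07, "sex(1)": 0.41, "health(1)": 5812.3, "Constant": 2.9}
flags = numerical_problem_flags(se)
print(flags)  # ['health(1)']
```

Any name returned by the screen is a signal that the model should not be interpreted as it stands.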
Sample size requirements • The minimum number of cases per independent variable is 10, using a guideline provided by Hosmer and Lemeshow, authors of Applied Logistic Regression, one of the main resources for Logistic Regression. • If we do not meet the sample size requirement, it is suggested that this be mentioned as a limitation to our analysis. If the relationships between predictors and the dependent variable are strong, we may still attain statistical significance with smaller samples.
Methods for including variables • SPSS supports three methods for including variables in the regression equation: • the standard or simultaneous method, in which all independent variables are included at the same time • the hierarchical method, in which control variables are entered in the analysis before the predictors whose effects we are primarily concerned with • the stepwise method (forward conditional or forward LR in SPSS), in which variables are selected in the order in which they maximize the statistically significant contribution to the model. • For all methods, the contribution to the model is measured by the model chi-square, a statistical measure of the fit between the dependent and independent variables, like R².
Computational method • Multiple regression uses the least-squares method to find the coefficients for the independent variables in the regression equation, i.e. it computes coefficients that minimize the squared residuals for all cases. • Logistic regression uses maximum-likelihood estimation to compute the coefficients for the logistic regression equation. This method attempts to find coefficients that match the breakdown of cases on the dependent variable. • The overall measure of how well the model fits is given by the -2 log likelihood value, which is similar to the residual or error sum of squares value for multiple regression. A model that fits the data well will have a small -2 log likelihood value. A perfect model would have a -2 log likelihood value of zero. • Maximum-likelihood estimation is an iterative procedure that successively works to get closer and closer to the correct answer. When SPSS reports the "iterations," it is telling us how many cycles it took to get the answer.
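The iterative character of maximum-likelihood estimation can be illustrated with a short Newton-Raphson fit for a model with one predictor. This is a teaching sketch under stated assumptions, not the algorithm SPSS documents: the function name, the data, and the convergence tolerance are all hypothetical.

```python
import math

def fit_logistic(x, y, max_iter=25, tol=1e-8):
    """Newton-Raphson fit of P(y=1) = 1 / (1 + exp(-(a + b*x)))."""
    a, b = 0.0, 0.0
    history = []  # -2 log likelihood at the start of each iteration
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        history.append(-2.0 * sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
                                  for pi, yi in zip(p, y)))
        # Gradient of the log likelihood with respect to a and b
        g0 = sum(yi - pi for pi, yi in zip(p, y))
        g1 = sum((yi - pi) * xi for pi, yi, xi in zip(p, y, x))
        # Information matrix (negative Hessian), 2 x 2
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system directly
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b, history

# Hypothetical data: the two groups overlap, so there is no complete separation.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 1, 0, 1, 0, 1, 1]
a, b, history = fit_logistic(x, y)
print(len(history), [round(h, 3) for h in history])
```

Each entry in `history` corresponds to one cycle, and the -2 log likelihood shrinks as the coefficients settle. With completely separated groups the same loop would fail to settle, which is the kind of numerical problem that shows up as very large standard errors.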
Overall test of relationship • Errors in a logistic regression model are measured in terms of “-2 log likelihood” values, which are analogous to the sums of squares in multiple regression. When an independent variable has a relationship to the dependent variable, the measure of error decreases. Since the log likelihood is a negative number, “-2 log likelihood” (abbreviated as -2LL) is a positive number, and an improvement in the relationship is indicated by a smaller value, e.g. if -2LL were 200, a -2LL of 100 would represent an improvement. • The overall test of relationship among the independent variables and groups defined by the dependent variable is based on the reduction in the -2 log likelihood value from a model which does not contain any independent variables to the model that contains the independent variables. • This difference in -2 log likelihood follows a chi-square distribution, and is referred to as the model chi-square. • The significance test for the model chi-square is our statistical evidence of the presence of a relationship between the dependent variable and the combination of the independent variables. • In a hierarchical logistic regression, the significance test for the addition of the predictor variables is based on the block chi-square in the omnibus tests of model coefficients.
Overall test of relationship in SPSS output Though the iteration history is not usually an output of interest, it does show us how the model chi-square value is derived. The original -2 Log Likelihood value is 213.891. At the end of this step, the -2 Log Likelihood value is 192.726. 213.891 – 192.726 = 21.165, the value for Model Chi-square in the table of Omnibus Tests of Model Coefficients.
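The subtraction on this slide can be checked directly. The sketch below uses the two -2 Log Likelihood values quoted above; the function name is hypothetical.

```python
def model_chi_square(initial_neg2ll, final_neg2ll):
    """Reduction in -2LL from the model with no predictors to the fitted model."""
    return initial_neg2ll - final_neg2ll

chi_square = model_chi_square(213.891, 192.726)
print(round(chi_square, 3))  # 21.165
```

The significance of this value depends on the degrees of freedom (the number of coefficients added to the model), which SPSS reports alongside it in the Omnibus Tests of Model Coefficients table.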
Relationship of Individual Independent Variables and Dependent Variable • There is a test of significance for the relationship between an individual independent variable and the dependent variable, a significance test of the Wald statistic. • The individual coefficients represent the change in the log odds of being a member of the modeled category. Because they are expressed in log units, they are not directly interpretable. However, if the b coefficient is used as the power to which the base of the natural logarithm (2.71828) is raised, the result, Exp(B), represents the multiplicative change in the odds of the modeled event associated with a one-unit change in the independent variable. • If a coefficient is positive, Exp(B) will be greater than one, meaning that the odds of the modeled event increase. If a coefficient is negative, Exp(B) will be less than one, and the odds of the event occurring decrease. A coefficient of zero (0) has an Exp(B) of 1.0, meaning that this coefficient does not change the odds of the event one way or the other. • The interpretive statement for individual relationships, provided they are statistically significant, incorporates the odds ratio or Exp(B) in SPSS output.
Interpreting individual relationships - 1 Exp(B) can be interpreted as a percentage change by subtracting 1.0 from the Exp(B) value. In this example, Exp(B) – 1.0 = .204 – 1.0 = -.796 We can state this finding as females (sex(1) value in this example) were 79.6% less likely to … Note: in this example, sex was coded so that males was the reference category.
Interpreting individual relationships - 2 Exp(B) can be interpreted as a multiplier when percentage change is confusing. In this example, Exp(B) – 1.0 = 4.902 – 1.0 = 3.902, or 390.2% more likely. We can state this finding as males (sex(1) value in this example) were 4.9 or approximately 5 times more likely to … Note: in this example, sex was coded so that females was the reference category.
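The arithmetic on the two preceding slides can be reproduced in Python. The function name is hypothetical, and the Exp(B) values are the ones quoted on the slides. Strictly speaking, the percentage change applies to the odds of the modeled event.

```python
import math

def percent_change_in_odds(exp_b):
    """Percentage change in the odds implied by an Exp(B) value."""
    return (exp_b - 1.0) * 100.0

print(round(percent_change_in_odds(0.204), 1))   # -79.6
print(round(percent_change_in_odds(4.902), 1))   # 390.2

# Exp(B) is e raised to the power of the b coefficient, so b = ln(Exp(B)).
b = math.log(0.204)
print(round(b, 3))             # -1.59
print(round(math.exp(-b), 3))  # 4.902
```

Reversing the reference category changes the sign of b and therefore inverts Exp(B). Since 1 / 0.204 is approximately 4.902, the two slides are consistent with the same relationship viewed from opposite reference categories, although that reading is an inference from the numbers rather than something the slides state.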
Strength of logistic regression relationship • While logistic regression does compute correlation measures to estimate the strength of the relationship (pseudo R square measures, such as Nagelkerke's R²), these correlation measures do not really tell us much about the accuracy or errors associated with the model. • A more useful measure to assess the utility of a logistic regression model is classification accuracy, which compares predicted group membership based on the logistic model to the actual, known group membership, which is the value for the dependent variable.
Evaluating usefulness for logistic models • The benchmark that we will use to characterize a logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone. • Even if the independent variables had no relationship to the groups defined by the dependent variable, we would still expect to be correct in our predictions of group membership some percentage of the time. This is referred to as by chance accuracy. • The estimate of by chance accuracy that we will use is the proportional by chance accuracy rate, computed by squaring the proportion of cases in each group and summing the results.
Comparing accuracy rates • To characterize our model as useful, we compare the overall percentage accuracy rate produced by SPSS at the last step in which variables are entered to 25% more than the proportional by chance accuracy. (Note: SPSS does not compute a cross-validated accuracy rate for logistic regression.) SPSS reports the overall accuracy rate in the Classification Table. The overall accuracy rate computed by SPSS was 67.6% in this example.
Computing by chance accuracy The number of cases in each group is found in the Classification Table at Step 0 (before any independent variables are included). The proportion of cases in the largest group is equal to the overall percentage (60.3%). The proportional by chance accuracy rate was computed by calculating the proportion of cases for each group based on the number of cases in each group in the classification table at Step 0, and then squaring and summing the proportion of cases in each group (0.397² + 0.603² = 0.521). The proportional by chance accuracy criterion is 65.2% (1.25 x 52.1% = 65.2%). Since the accuracy rate in this example, 67.6%, is greater than the 65.2% by chance accuracy criterion, this model would be characterized as useful.
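The by chance accuracy computation can be verified with a few lines of Python. The function name is hypothetical, and the proportions and accuracy rate are the ones from this example.

```python
def proportional_by_chance_accuracy(proportions):
    """Sum of the squared group proportions."""
    return sum(p ** 2 for p in proportions)

chance = proportional_by_chance_accuracy([0.397, 0.603])
criterion = 1.25 * chance          # 25% improvement over chance
model_accuracy = 0.676             # overall accuracy rate reported by SPSS
useful = model_accuracy > criterion

print(round(chance, 3))     # 0.521
print(round(criterion, 3))  # 0.652
print(useful)               # True
```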
Outliers • Logistic regression models the relationship between a set of independent variables and the probability that a case is a member of one of the categories of the dependent variable. (In SPSS, the modeled category is the one with the higher numeric code.) If the probability is greater than 0.50, the case is classified in the modeled category. If the probability is less than 0.50, the case is classified in the other category. • The actual probability of the modeled event for any case is either 1.0 or 0.0, i.e. a case is in the modeled category or it is not. • The residual is the difference between the actual probability and the predicted probability for a case. If the predicted probability for a case that actually belonged to the modeled category was 0.80, the residual would be 1.00 – 0.80 = 0.20. • The residual can be standardized by dividing it by an estimate of its standard deviation. Since the dependent variable is dichotomous or binary, the standard deviation for proportions is used.
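The standardization step can be sketched as follows, taking the standard deviation for a proportion to be the square root of p(1 – p), where p is the predicted probability. The function name is hypothetical, and the sketch is an illustration of the definition above rather than a reproduction of SPSS's internal code.

```python
import math

def standardized_residual(actual, predicted):
    """Residual divided by the standard deviation for a proportion."""
    return (actual - predicted) / math.sqrt(predicted * (1.0 - predicted))

# The case from the text: in the modeled category, predicted probability 0.80.
z = standardized_residual(1.0, 0.80)
print(round(z, 2))  # 0.5
```

A residual of 0.20 divided by a standard deviation of 0.40 gives a standardized residual of 0.50, well inside the range of ordinary cases. A case in the modeled category with a predicted probability of only 0.10 would have a standardized residual of 3.0.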
Strategy for Outliers • Our strategy for evaluating the impact of outliers on our logistic regression model will parallel what we have done for multiple regression and discriminant analysis: • First, we run a baseline model including all cases. • Second, we run a model excluding outliers whose standardized residual is greater than 2.58 or less than -2.58 (z-score for p = .01). • If the model excluding outliers has a classification accuracy rate that is 2% or more higher than the accuracy rate of the baseline model, we will interpret the revised model. If the accuracy rate of the revised model without outliers is less than 2% more accurate, we will interpret the baseline model.
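The decision rule in the last bullet can be written as a small function. The function name and the accuracy rates below are hypothetical, and accuracy is expressed in percentage points.

```python
def model_to_interpret(baseline_accuracy, revised_accuracy, threshold=2.0):
    """Choose the model to interpret from two accuracy rates (in percent)."""
    if revised_accuracy - baseline_accuracy >= threshold:
        return "revised"
    return "baseline"

# Hypothetical accuracy rates for a baseline model and a model without outliers.
print(model_to_interpret(67.6, 70.1))  # revised
print(model_to_interpret(67.6, 68.5))  # baseline
```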
The Problem in Blackboard • The problem statement tells us: • the variables included in the analysis • whether each variable should be treated as metric or non-metric • the type of dummy coding and reference category for non-metric variables • the alpha for both the statistical relationships and for diagnostic tests
The Statement about Level of Measurement The first statement in the problem asks about level of measurement. Standard binary logistic regression requires that the dependent variable be dichotomous, the metric independent variables be interval level, and the non-metric independent variables be dummy-coded if they are not dichotomous. SPSS Binary Logistic Regression calls non-metric variables “categorical.” SPSS Binary Logistic Regression will dummy-code categorical variables for us, provided it is useful to use either the first or last category as the reference category.
Marking the Statement about Level of Measurement • Mark the check box as a correct statement because: • The dependent variable "computer use" [compuse] is dichotomous level, satisfying the requirement for the dependent variable. • The independent variables "highest year of school completed" [educ] and "socioeconomic index" [sei] are interval level, satisfying the requirement for independent variables. • The independent variable "sex" [sex] is dichotomous level, satisfying the requirement for independent variables. • The independent variable "condition of health" [health] is ordinal level which the problem instructs us to dummy-code as a non-metric variable.
The Statement about Outliers While we do not need to be concerned about normality, linearity, and homogeneity of variance, we need to determine whether or not outliers are substantially reducing the classification accuracy of the model. To test for outliers, we run the binary logistic regression in SPSS and check for outliers. Next, we exclude the outliers and run the logistic regression a second time. We then compare the accuracy rates of the models with and without the outliers. If the model without outliers is 2% or more accurate than the model with outliers, we interpret the model excluding outliers.
Running the standard binary logistic regression Select the Regression | Binary Logistic… command from the Analyze menu.
Selecting the dependent variable First, highlight the dependent variable compuse in the list of variables. Second, click on the right arrow button to move the dependent variable to the Dependent text box.
Selecting the independent variables Move the independent variables stated in the problem to the Covariates list box.
Declare the categorical variables - 1 To tell SPSS that two of the variables are non-metric and need to be dummy-coded, click on the Categorical button.
Declare the categorical variables - 2 Move the variables sex and health to the Categorical Covariates list box. SPSS assigns its default method for dummy-coding, Indicator coding, to each variable, placing the name of the coding scheme in parentheses after each variable name.
Declare the categorical variables - 3 We could change the dummy-coding to a different scheme by choosing another method from the drop-down menu, and clicking on the Change button. However, we will use indicator dummy-coding for our logistic regression problems, so that we are comparing the difference in odds between two specific groups, rather than comparing one group to the average odds for all other groups. I think “average odds” complicates the interpretation.
Declare the categorical variables - 4 We will also accept the default of using the last valid category as the reference category for each variable (we do not use higher numbered missing values as a reference category). Click on the Continue button to close the dialog box. Note that sex is a dichotomous variable, and does not require dummy-coding. I prefer to dummy-code it anyhow so that my interpretation is consistently based on the difference between categories coded 0 and 1. I do not need to alter my interpretation if two different numbers were used for the original coding.
Specifying the method for including variables SPSS provides us with two methods for including variables: to enter all of the independent variables at one time, and a stepwise method for selecting variables using a statistical test to determine the order in which variables are included. SPSS also supports the specification of "Blocks" of variables for testing hierarchical models. Since the problem calls for a standard binary logistic regression, we accept the default Enter method for including variables.
Adding outliers to the data set - 1 SPSS will calculate the values for standardized residuals and save them to the data set so that we can check for outliers and remove the outliers easily if we need to run a model excluding outliers. Click on the Save… button to request the statistics that we want to save.
Adding outliers to the data set - 2 First, mark the checkbox for Standardized residuals in the Residuals panel. Second, click on the Continue button to complete the specifications.
Requesting the output Click on the OK button to request the output. While optional statistical output is available, we do not need to request any optional statistics.
Detecting the presence of outliers - 1 SPSS created a new variable, ZRE_1, which contains the standardized residual. If SPSS finds that the data set already contains a ZRE_1 variable, it will create ZRE_2. I find it easier to delete the ZRE_1 variable after each analysis rather than have multiple ZRE_ variables in the data set, requiring that I remember which one goes with which analysis.
Detecting the presence of outliers - 2 • To detect outliers, we will sort the ZRE_1 column twice: • first, in ascending order to identify outliers with a standardized residual of -2.58 or less. • second, in descending order to identify outliers with a standardized residual of +2.58 or greater. Click the right mouse button on the column header and select Sort Ascending from the pop-up menu.
Detecting the presence of outliers - 3 After scrolling down past the cases with missing data (. in the ZRE_1 column), we see that we have five outliers that have standardized residuals of -2.58 or less.
Detecting the presence of outliers - 4 To check for outliers with large positive standardized residuals, click the right mouse button on the column header and select Sort Descending from the pop-up menu.
Detecting the presence of outliers - 5 After scrolling up to the top of the data set, we see that there are no outliers that have standardized residuals of +2.58 or more. Since we found outliers, we will run the model excluding them and compare accuracy rates to determine which one we will interpret. Had there been no outliers, we would move on to the issue of sample size.
Running the model excluding outliers - 1 We will use a Select Cases command to exclude the outliers from the analysis.
Running the model excluding outliers - 2 First, in the Select Cases dialog box, mark the option button If condition is satisfied. Second, click on the If button to specify the condition.
Running the model excluding outliers - 3 To eliminate the outliers, we request that the cases that are not outliers be selected into the analysis. The formula specifies that we should include cases if the absolute value of the standardized residual (ZRE_1) is less than 2.58. The abs() or absolute value function tells SPSS to ignore the sign of the value. After typing in the formula, click on the Continue button to close the dialog box.
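The logic of the selection condition can be mirrored in Python. This is only an analogy for what the Select Cases formula does: the function name and residual values are hypothetical, and `None` stands in for a case with a missing residual.

```python
def select_non_outliers(residuals, cutoff=2.58):
    """True for cases kept in the analysis: residual present and |z| < cutoff."""
    return [r is not None and abs(r) < cutoff for r in residuals]

# Hypothetical standardized residuals; None marks a case with missing data.
zre_1 = [0.45, -3.12, None, 2.10, -2.58, 1.97]
selected = select_non_outliers(zre_1)
print(selected)  # [True, False, False, True, False, True]
```

Cases with a residual of exactly 2.58 in absolute value are treated as outliers here, matching the "2.58 or greater" wording used when sorting, and cases with missing residuals are not selected.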
Running the model excluding outliers - 4 SPSS displays the condition we entered on the Select Cases dialog box. Click on the OK button to close the dialog box.
Running the model excluding outliers - 5 SPSS indicates which cases are excluded by drawing a slash across the case number. Scrolling down in the data, we see that the outliers and cases with missing values are excluded.
Running the model excluding outliers - 6 To run the logistic regression excluding outliers, select Logistic Regression from the Dialog Recall menu.
Running the model excluding outliers - 7 The only change we will make is to clear the check box for saving standardized residuals. Click on the Save button to open the dialog box.
Running the model excluding outliers - 8 First, clear the check box for Standardized residuals. Second, click on the Continue button to close the dialog box.
Running the model excluding outliers - 9 Finally, click on the OK button to request the output.