Logistic Regression

Outline
• Basic Concepts of Logistic Regression
• Finding Logistic Regression Coefficients using Excel’s Solver
• Significance Testing of the Logistic Regression Coefficients
• Testing the Fit of the Logistic Regression Model
• Finding Logistic Regression Coefficients using Newton’s Method
• Comparing Logistic Regression Models
• Hosmer-Lemeshow Test

Basic Concepts of Logistic Regression
• The basic approach is to use the following regression model, employing the notation from Definition 3 of Method of Least Squares for Multiple Regression:
$$\ln \mathrm{Odds}(E) = b_0 + b_1 x_1 + \cdots + b_k x_k$$
where the odds function is as given in the following definition.

Definition 1:
• Odds(E) is the odds that event E occurs, namely
$$\mathrm{Odds}(E) = \frac{P(E)}{1 - P(E)}$$
Where p has a value 0 ≤ p ≤ 1 (i.e. p is a probability value), we can define the odds function as
$$\mathrm{Odds}(p) = \frac{p}{1 - p}$$
• For our purposes, the odds function has the advantage of transforming the probability function, which has values from 0 to 1, into an equivalent function with values between 0 and ∞. When we take the natural log of the odds function, we get a range of values from -∞ to ∞.

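Definition 2:
• The logit function is the log of the odds function, namely
$$\mathrm{logit}(p) = \ln \mathrm{Odds}(p) = \ln \frac{p}{1 - p}$$
or, equivalently, $\mathrm{logit}(p) = \ln p - \ln(1 - p)$.

Definition 3:
• Based on the logistic model as described above, we have
$$\mathrm{logit}(p) = b_0 + b_1 x_1 + \cdots + b_k x_k$$
where p = P(E).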

It now follows that
$$\frac{p}{1 - p} = e^{b_0 + b_1 x_1 + \cdots + b_k x_k}$$
and so
$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_k x_k)}}$$
• For our purposes we take the event E to be that the dependent variable y has value 1. If y takes only the values 0 or 1, we can think of E as success and the complement E′ of E as failure. This is as for the trials in a binomial distribution.
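The relationship between probability, odds, and logit can be checked numerically. The following is a minimal sketch using only Python's standard library; the function names `odds`, `logit`, and `inv_logit` are chosen here for illustration and do not come from the slides.

```python
import math

def odds(p):
    """Odds corresponding to a probability p, for 0 <= p < 1."""
    return p / (1.0 - p)

def logit(p):
    """Natural log of the odds, defined for 0 < p < 1."""
    return math.log(odds(p))

def inv_logit(z):
    """Inverse of logit: maps any real number z back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Round trip: applying inv_logit to logit(p) should recover p.
for p in (0.1, 0.5, 0.9):
    print(p, odds(p), logit(p), inv_logit(logit(p)))
```

Here `inv_logit` plays the role of the formula for p above, with z standing for the linear combination of the coefficients and the independent variables.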

Just as for the regression model studied in Regression and Multiple Regression, a sample consists of n data elements of the form (yi, xi1, xi2, …, xik), but for logistic regression each yi only takes the value 0 or 1. Now let Ei = the event that yi = 1 and pi = P(Ei). Just as the regression line studied previously provides a way to predict the value of the dependent variable y from the values of the independent variables x1, …, xk in
$$\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$$
for logistic regression we have
$$p = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \cdots + b_k x_k)}}$$
• Note that since the yi have a proportion distribution, by Property 2 of Proportion Distribution, var(yi) = pi (1 – pi).

In the case where k = 1, we have
$$p = \frac{1}{1 + e^{-(b_0 + b_1 x)}}$$
Such a curve has a sigmoid shape. The values of b0 and b1 determine the location, direction and spread of the curve. The curve is symmetric about the point where $x = -b_0/b_1$. In fact, the value of p is 0.5 for this value of x.

Figure: Sigmoid curve for p
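The symmetry of the sigmoid curve can be illustrated with a short sketch. The coefficient values below are hypothetical, chosen only to make the check concrete.

```python
import math

def p_of_x(x, b0, b1):
    """Single-predictor logistic curve p = 1 / (1 + exp(-(b0 + b1*x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

b0, b1 = -3.0, 0.5          # hypothetical coefficients
x_mid = -b0 / b1            # point of symmetry, where p = 0.5
d = 2.0                     # arbitrary offset on either side of x_mid

left = p_of_x(x_mid - d, b0, b1)
right = p_of_x(x_mid + d, b0, b1)

print(p_of_x(x_mid, b0, b1))   # value of p at the point of symmetry
print(left + right)            # points equidistant from x_mid sum to 1
```

The second printed value shows the symmetry: the curve lies as far below 0.5 on one side of the midpoint as it lies above 0.5 on the other.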

Logistic regression is used instead of ordinary multiple regression because the assumptions required for ordinary regression are not met. In particular:
1. The assumption of the linear regression model that the values of y are normally distributed cannot be met since y only takes the values 0 and 1.
2. The assumption of the linear regression model that the variance of y is constant across values of x (homogeneity of variances) also cannot be met with a binary variable. Since the variance is p(1–p), when 50 percent of the sample consists of 1s the variance is .25, its maximum value. As we move to more extreme values, the variance decreases. When p = .10 or .90, the variance is (.1)(.9) = .09, and as p approaches 1 or 0, the variance approaches 0.
3. Using the linear regression model, the predicted values will become greater than one or less than zero if you move far enough along the x-axis. Such values are theoretically inadmissible for probabilities.
• For the logistic model, the least squares approach to calculating the values of the coefficients bi cannot be used; instead, maximum likelihood techniques, as described below, are employed to find these values.

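Definition 4:
• The odds ratio between two data elements in the sample is defined as follows:
$$\mathrm{OR}(x_i, x_j) = \frac{\mathrm{Odds}(E_i)}{\mathrm{Odds}(E_j)} = \frac{p_i/(1 - p_i)}{p_j/(1 - p_j)}$$
Using the notation px = P(x), the log odds ratio of the estimates is defined as
$$\ln \mathrm{OR}(x_i, x_j) = \mathrm{logit}(p_{x_i}) - \mathrm{logit}(p_{x_j})$$

In the case where k = 1,
$$\ln \mathrm{OR}(x+1, x) = \big(b_0 + b_1 (x+1)\big) - \big(b_0 + b_1 x\big) = b_1$$
Thus, $\mathrm{OR}(x+1, x) = e^{b_1}$. Furthermore, for any value of d, $\mathrm{OR}(x+d, x) = e^{d\,b_1}$.

Note that when x is a dichotomous variable, $\mathrm{OR}(1, 0) = e^{b_1}$.
• E.g. when x = 0 for male and x = 1 for female, then $e^{b_1}$ represents the odds ratio between females and males. If for example b1 = 2, and we are measuring the probability of getting cancer under certain conditions, then $e^{b_1} = e^2 \approx 7.4$, which would mean that the odds of females getting cancer would be 7.4 times the odds of males getting cancer under the same conditions.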
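The link between a coefficient and an odds ratio can be sketched in a few lines. This assumes a single-predictor model in which a change of d units in x multiplies the odds by exp(d·b1); the helper names and the intercept value are hypothetical.

```python
import math

def odds_ratio(b1, d=1.0):
    """Odds ratio for a change of d units in x, for a single-predictor model."""
    return math.exp(d * b1)

def model_odds(x, b0, b1):
    """Odds p/(1-p) implied by the logistic model at x."""
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
    return p / (1.0 - p)

b1 = 2.0                      # value used in the cancer example in the text
print(round(odds_ratio(b1), 1))

# The same ratio computed directly from the model, for a hypothetical intercept.
b0 = -1.5
print(model_odds(1, b0, b1) / model_odds(0, b0, b1))
```

The intercept cancels in the ratio, which is why the odds ratio for a dichotomous variable depends only on b1.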

The model we will use is based on the binomial distribution, namely the probability that the sample data occurs as it does is given by
$$L = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}$$
Taking the natural log of both sides and simplifying we get the following definition.

Definition 5:
• The log-likelihood statistic is defined as follows:
$$LL = \sum_{i=1}^{n} \big[\, y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \,\big]$$
where the yi are the observed values while the pi are the corresponding theoretical values.
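As a rough illustration of the log-likelihood statistic, the sketch below computes LL for a small made-up sample and confirms that it equals the log of the binomial-style likelihood product. The data values are hypothetical.

```python
import math

def log_likelihood(y, p):
    """LL = sum of y_i*ln(p_i) + (1 - y_i)*ln(1 - p_i) over the sample."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

# Hypothetical observed outcomes (0 or 1) and model probabilities.
y = [1, 0, 1, 1]
p = [0.8, 0.3, 0.6, 0.9]

ll = log_likelihood(y, p)

# Likelihood as a product: p_i when y_i = 1, (1 - p_i) when y_i = 0.
likelihood = math.prod(pi if yi == 1 else 1 - pi for yi, pi in zip(y, p))

print(ll, math.log(likelihood))
```

Working with LL rather than the product avoids numerical underflow when the sample is large, since the product of many probabilities quickly becomes extremely small.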

Example 1: A sample was made of 760 people who received doses of radiation between 0 and 1000 rems following a recent nuclear accident. Of these, 302 died, as shown in the table in Figure 2. Each row in the table represents the midpoint of an interval of 100 rems (i.e. 0-100, 100-200, etc.). Figure 2.

Solution: Let Ei = the event that a person in the ith interval survived. The table also shows the probability P(Ei) and odds Odds(Ei) of survival for a person in each interval. Note that P(Ei) = the percentage of people in interval i who survived and
$$\mathrm{Odds}(E_i) = \frac{P(E_i)}{1 - P(E_i)}$$
In Figure 3 we plot the values of P(Ei) vs. i and Odds(Ei) vs. i. We see that the second of these plots is reasonably linear.

Given that there is only one independent variable (namely x = # of rems), we can use the following model:
$$\mathrm{logit}(p) = a + bx, \quad \text{i.e.} \quad p = \frac{1}{1 + e^{-(a + bx)}}$$
Here we use coefficients a and b instead of b0 and b1 just to keep the notation simple.
• We show two different methods for finding the values of the coefficients a and b. The first uses Excel’s Solver tool and the second uses Newton’s method. Before proceeding it might be worthwhile to review how to use Excel’s Solver tool (see Goal Seeking and Solver) and how to apply Newton’s method (see Newton’s Method). We will use both methods to maximize the value of the log-likelihood statistic as defined in Definition 5.

Finding Logistic Regression Coefficients using Excel’s Solver • We now show how to find the coefficients for the logistic regression model using Excel’s Solver capability (see also Goal Seeking and Solver). We start with Example 1 from Basic Concepts of Logistic Regression.

Example 1 (continued): From Definition 1 of Basic Concepts of Logistic Regression, the predicted value pi for the probability of survival for each interval i is given by the following formula
$$p_i = \frac{1}{1 + e^{-(a + b x_i)}}$$
where xi represents the number of rems for interval i. The log-likelihood statistic as defined in Definition 5 of Basic Concepts of Logistic Regression is given by
$$LL = \sum_{i=1}^{n} \big[\, y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \,\big]$$
where yi is the observed value of survival for the ith sample element. Since we are aggregating the sample elements into intervals, we use the modified version of the formula, namely
$$LL = \sum_{i=1}^{r} n_i \big[\, y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \,\big]$$
where yi is the observed probability of survival in the ith of r intervals and ni is the number of people in the ith interval. We capture this information in the worksheet in Figure 1 (based on the data in Figure 2 of Basic Concepts of Logistic Regression).

In Figure 1, column I contains the rem values for each interval (a copy of columns A and E). Column J contains the observed probability of survival for each interval (a copy of column F). Column K contains the values of each pi. E.g. cell K4 contains the formula =1/(1+EXP(-O5-O6*I4)) and initially has value 0.5 based on the initial guess of the coefficients a and b given in cells O5 and O6 (which we arbitrarily set to zero). Cell L14 contains the value of LL using the formula =SUM(L4:L13), where L4 contains the formula =(B4+C4)*(J4*LN(K4)+(1-J4)*LN(1-K4)), and similarly for the other cells in column L.
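The worksheet calculation in columns K and L can be mirrored in a few lines of Python. This is a sketch only: the rows below are hypothetical stand-ins for the table in Figure 2 (which is not reproduced here), and the function names are illustrative.

```python
import math

def predicted_p(x, a, b):
    """Column K: p_i = 1 / (1 + exp(-(a + b*x_i)))."""
    return 1.0 / (1.0 + math.exp(-(a + b * x)))

def grouped_log_likelihood(rows, a, b):
    """Cell L14: sum over intervals of n_i*(y_i*ln(p_i) + (1-y_i)*ln(1-p_i)).

    rows holds (x, successes, failures) per interval.
    """
    total = 0.0
    for x, successes, failures in rows:
        n = successes + failures          # corresponds to B4+C4
        y = successes / n                 # observed proportion (column J)
        p = predicted_p(x, a, b)          # predicted probability (column K)
        total += n * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total

# Hypothetical summary data: (rems midpoint, survived, died).
rows = [(50, 40, 10), (150, 30, 20), (250, 10, 40)]

# With a = b = 0 every p_i is 0.5, so LL reduces to n * ln(0.5).
ll_start = grouped_log_likelihood(rows, 0.0, 0.0)
print(ll_start)

# For a sample of 760 people the same starting value would be:
print(round(760 * math.log(0.5), 3))
```

The last line reproduces the initial LL of -526.792 quoted for Example 1, since that starting value depends only on the total sample size of 760.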

We now use Excel’s Solver tool by selecting Data > Analysis|Solver and filling in the dialog box that appears as described in Figure 2 (see Goal Seeking and Solver for more details). Our objective is to maximize the value of LL (in cell L14) by changing the coefficients (in cells O5 and O6). It is important, however, to make sure that the Make Unconstrained Variables Non-Negative checkbox is not checked, since the coefficients may be negative. When we click on the Solve button we get a message that Solver has successfully found a solution, i.e. it has found values for a and b which maximize LL.

We elect to keep the solution found and Solver automatically updates the worksheet from Figure 1 based on the values it found for a and b. The resulting worksheet is shown in Figure 3. We see that a = 4.476711 and b = -0.00721. Thus the logistic regression model is given by the formula
$$p = \frac{1}{1 + e^{-(4.476711 - 0.00721x)}}$$

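For example, the predicted probability of survival when exposed to 380 rems of radiation is given by
$$p = \frac{1}{1 + e^{-(4.476711 - 0.00721 \cdot 380)}} \approx 0.85$$
Note that the ratio of the odds of survival at 180 rems to the odds of survival at 200 rems is
$$e^{-0.00721\,(180 - 200)} = e^{0.1442} \approx 1.155$$
Thus, the odds that a person exposed to 180 rems survives are 15.5% greater than those of a person exposed to 200 rems.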
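These two calculations can be reproduced with the coefficients reported by Solver. The sketch below uses the rounded values a = 4.476711 and b = -0.00721 from the text, so results may differ slightly in the last digits from a worksheet that holds more decimal places.

```python
import math

a, b = 4.476711, -0.00721     # coefficients reported in the text

def p_survive(rems):
    """Predicted probability of survival at a given radiation dose."""
    return 1.0 / (1.0 + math.exp(-(a + b * rems)))

p380 = p_survive(380)

# Ratio of the odds of survival at 180 rems to the odds at 200 rems.
or_180_200 = math.exp(b * (180 - 200))

print(round(p380, 2), round(or_180_200, 3))
```

The odds ratio depends only on the difference in dose (20 rems) and on b, not on the intercept a.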

Real Statistics Data Analysis Tool: The Real Statistics Resource Pack provides the Logistic Regression supplemental data analysis tool. This tool takes as input a range which lists the sample data followed by the number of occurrences of success and failure (this is considered to be the summary form). E.g. for Example 1 this is the data in range A3:C13 of Figure 1. For this problem there was only one independent variable (number of rems). If additional independent variables are used then the input will contain additional columns, one for each independent variable.

We show how to use this tool to create a spreadsheet similar to the one in Figure 3. First press Ctrl-m to bring up the menu of Real Statistics supplemental data analysis tools and choose the Logistic Regression option. This brings up the dialog box shown in Figure 4. Now select A3:C13 as the Input Range (see Figure 5) and since this data is in summary form with column headings, select the Summary data option for the Input Format and check Headings included with data. Next select the Solver as the Analysis Type and keep the default Alpha and Classification Cutoff values of .05 and .5 respectively.

Finally press the OK button to obtain the output displayed in Figure 5, where the input data from range A3:C13 of Figure 1 is repeated in the same cells.

Note that the coefficients (range Q7:Q8) are set initially to zero and LL (cell M16) is calculated to be -526.792 (exactly as in Figure 1). The output from the Logistic Regression data analysis tool also contains many fields which will be explained later. As described in Figure 2, we can now use Excel’s Solver tool to find the logistic regression coefficients. The result is shown in Figure 6. We obtain the same values for the regression coefficients as we obtained previously in Figure 3, and all the other cells are updated with the correct values as well.

Significance Testing of the Logistic Regression Coefficients
• Definition 1: For any coefficient b the Wald statistic is given by the formula
$$\mathrm{Wald} = \frac{b}{s.e.(b)}$$
• For ordinary regression we can calculate a statistic t ~ T(dfRes) which can be used to test the hypothesis that a coefficient b = 0. The Wald statistic is approximately normal and so it can be used to test whether the coefficient b = 0 in logistic regression.
• Since the Wald statistic is approximately normal, by Theorem 1 of Chi-Square Distribution, Wald² is approximately chi-square, and, in fact, Wald² ~ χ²(df) where df = k – k0, k = the number of parameters (i.e. the number of coefficients) in the model (the full model) and k0 = the number of parameters in a reduced model (esp. the baseline model, which doesn’t use any of the variables, only the intercept).

Property 1: The covariance matrix S for the coefficient matrix B is given by the matrix formula
$$S = (X^T V X)^{-1}$$
where X is the r × (k+1) design matrix (as described in Definition 3 of Least Squares Method for Multiple Regression) and V = [vij] is the r × r diagonal matrix whose diagonal elements are vii = ni pi(1–pi), where ni = the number of observations in group i and pi = the probability of success predicted by the model for elements in group i. Groups correspond to the rows of matrix X and consist of the various combinations of values of the independent variables. Note that $S = (X^T W)^{-1}$ where W is X with each element in the ith row of X multiplied by vii.

Observation: The standard errors of the logistic regression coefficients are the square roots of the entries on the diagonal of the covariance matrix in Property 1.
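For the single-predictor case the matrix formula of Property 1 reduces to inverting a 2 × 2 matrix, which can be sketched without any matrix library. The group data below are hypothetical, and the function name is illustrative.

```python
import math

def covariance_matrix(xs, ns, ps):
    """S = (X^T V X)^(-1) for design rows (1, x_i) and V = diag(n_i*p_i*(1-p_i))."""
    v = [n * p * (1 - p) for n, p in zip(ns, ps)]
    # Entries of the symmetric 2x2 matrix X^T V X.
    m00 = sum(v)
    m01 = sum(vi * x for vi, x in zip(v, xs))
    m11 = sum(vi * x * x for vi, x in zip(v, xs))
    det = m00 * m11 - m01 * m01
    # Closed-form inverse of a 2x2 symmetric matrix.
    return [[m11 / det, -m01 / det],
            [-m01 / det, m00 / det]]

# Hypothetical groups: predictor value, group size, predicted probability.
xs = [50.0, 150.0, 250.0]
ns = [50, 50, 50]
ps = [0.8, 0.6, 0.3]

S = covariance_matrix(xs, ns, ps)

# Standard errors: square roots of the diagonal entries of S.
se_intercept = math.sqrt(S[0][0])
se_slope = math.sqrt(S[1][1])
print(se_intercept, se_slope)
```

With more than one independent variable the same formula applies, but a general matrix inverse would be needed in place of the 2 × 2 closed form.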

Example 1 (Coefficients): We now turn our attention to the coefficient table given in range E18:L20 of Figure 6 of Finding Logistic Regression Coefficients using Solver (repeated in Figure 1 below). Figure 1 – Output from Logistic Regression tool

Using Property 1 we calculate the covariance matrix S (range V6:W7) for the coefficient matrix B via the formula $S = (X^T V X)^{-1}$. Actually, for computational reasons it is better to use the following equivalent array formula:

The formulas used to calculate the values for the Rems coefficient (row 20) are given in Figure 2. Note that Wald represents the Wald² statistic and that lower and upper represent the 100(1–α)% confidence interval of exp(b). Since 1 = exp(0) is not in the confidence interval (.991743, .993871), the Rems coefficient b is significantly different from 0 and should therefore be retained in the model.
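Because Figure 2 is not reproduced here, the following sketch works backwards from the numbers quoted in the text. It assumes the confidence interval for exp(b) has the usual Wald form exp(b ± z·s.e.) with α = .05; under that assumption the standard error, the Wald² statistic and its p-value can be recovered approximately. The variable names are illustrative.

```python
import math

b = -0.00721                          # Rems coefficient from the text
lower, upper = 0.991743, 0.993871     # confidence interval for exp(b) from the text
z_crit = 1.959964                     # standard normal critical value for alpha = .05

# Assumption: the interval is exp(b - z*se) to exp(b + z*se).
se = (math.log(upper) - math.log(lower)) / (2 * z_crit)

wald_squared = (b / se) ** 2

# Upper-tail probability of chi-square with 1 df: P(X > w) = erfc(sqrt(w / 2)).
p_value = math.erfc(math.sqrt(wald_squared / 2))

print(se, wald_squared, p_value)
print(lower <= 1.0 <= upper)          # is exp(0) = 1 inside the interval?
```

Since b is quoted to only a few significant digits, the recovered values are approximate and need not match the tool's output exactly.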

Observation: The % Correct statistic (cell N16 of Figure 1) is another way to gauge the fit of the model to the observed data. The statistic says that 76.8% of the observed cases are predicted accurately by the model. This statistic is calculated as follows: For any observed values of the independent variables, when the predicted value of p is greater than or equal to .5 (viewed as predicting success), the % correct is equal to the observed number of successes divided by the total number of observations (for those values of the independent variables). When p < .5 (viewed as predicting failure), the % correct is equal to the observed number of failures divided by the total number of observations. These values are weighted by the number of observations of that type and then summed to provide the % correct statistic for all the data. For example, for the case where Rem = 450, p-Pred = .774 (cell J10), which predicts success (i.e. survived). Thus the % Correct for Rem = 450 is 85/108 = 78.7% (cell N10). The weighted sum (found in cell N16) of all these cells is then calculated by the formula =SUMPRODUCT(N6:N15,H6:H15)/H16.
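The % Correct calculation can be sketched as follows. The first row uses the Rem = 450 figures quoted in the text (85 survivors out of 108, predicted probability .774); the second row is hypothetical and is included only to show a group where failure is predicted.

```python
def percent_correct(rows, cutoff=0.5):
    """Weighted % correct over groups of (successes, failures, predicted_p).

    A group with predicted_p >= cutoff predicts success, so its successes
    count as correct; otherwise its failures count as correct.
    """
    total = sum(s + f for s, f, _ in rows)
    correct = sum(s if p >= cutoff else f for s, f, p in rows)
    return correct / total

rem_450 = (85, 23, 0.774)        # from the text: 85 of 108 survived
low_survival = (10, 40, 0.2)     # hypothetical group with p below the cutoff

print(round(100 * percent_correct([rem_450]), 1))
print(percent_correct([rem_450, low_survival]))
```

Summing the correct counts and dividing by the total is equivalent to the weighted average computed by the SUMPRODUCT formula, because each group's % correct is weighted by its number of observations.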

Testing the Fit of the Logistic Regression Model
• For larger values of b, the standard error and the Wald statistic become inflated, which increases the probability that b is viewed as not making a significant contribution to the model even when it does (i.e. a type II error).

To overcome this problem it is better to test on the basis of the log-likelihood statistic since
$$2(LL_1 - LL_0) \sim \chi^2(df)$$
where df = k – k0 and where LL1 refers to the full log-likelihood model and LL0 refers to a model with fewer coefficients (especially the model with only the intercept b0 and no other coefficients). This is equivalent to
$$-2 \ln \frac{L_0}{L_1} \sim \chi^2(df)$$

Observation: For ordinary regression the coefficient of determination is
$$R^2 = \frac{SS_{Reg}}{SS_T}$$
Thus R² measures the percentage of variance explained by the regression model. We need a similar statistic for logistic regression. We define the following three pseudo-R² statistics for logistic regression.

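Definition 1: The log-linear ratio R² is defined as follows:
$$R_L^2 = 1 - \frac{LL_1}{LL_0}$$
where LL1 refers to the full log-likelihood model and LL0 refers to a model with fewer coefficients (especially the model with only the intercept b0 and no other coefficients).

Cox and Snell’s R² is defined as
$$R_{CS}^2 = 1 - \left(\frac{L_0}{L_1}\right)^{2/n} = 1 - e^{-2(LL_1 - LL_0)/n}$$
where n = the sample size.

Nagelkerke’s R² is defined as
$$R_N^2 = \frac{R_{CS}^2}{1 - L_0^{2/n}} = \frac{R_{CS}^2}{1 - e^{2 LL_0/n}}$$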

Observation I: Since Cox and Snell’s R² cannot achieve a value of 1, Nagelkerke’s R² was developed to have properties more similar to the R² statistic used in ordinary regression.

Observation II: The initial value L0 of L, i.e. where we only include the intercept value b0, is given by
$$L_0 = \left(\frac{n_0}{n}\right)^{n_0} \left(\frac{n_1}{n}\right)^{n_1}, \quad \text{i.e.} \quad LL_0 = n_0 \ln\frac{n_0}{n} + n_1 \ln\frac{n_1}{n}$$
where n0 = number of observations with value 0, n1 = number of observations with value 1 and n = n0 + n1.
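The three pseudo-R² statistics can be sketched directly from the log-likelihood values. In the example below, n0 = 302 (died) and n1 = 458 (survived) follow from the sample of 760 in Example 1, and the full-model LL is inferred from the likelihood-ratio statistic of 280.246 reported later in that example, assuming the reduced model is the intercept-only model; treat the resulting figures as illustrative rather than as the tool's exact output.

```python
import math

def intercept_only_ll(n0, n1):
    """LL0 for the model with only the intercept b0."""
    n = n0 + n1
    return n0 * math.log(n0 / n) + n1 * math.log(n1 / n)

def pseudo_r2(ll1, ll0, n):
    """Return (log-linear ratio, Cox and Snell, Nagelkerke) pseudo-R2 values."""
    r2_l = 1 - ll1 / ll0
    r2_cs = 1 - math.exp(-2 * (ll1 - ll0) / n)
    r2_n = r2_cs / (1 - math.exp(2 * ll0 / n))
    return r2_l, r2_cs, r2_n

n0, n1 = 302, 458                  # died, survived (760 in total)
ll0 = intercept_only_ll(n0, n1)

# Assumption: 2*(LL1 - LL0) = 280.246, so LL1 = LL0 + 280.246 / 2.
ll1 = ll0 + 280.246 / 2

r2_l, r2_cs, r2_n = pseudo_r2(ll1, ll0, n0 + n1)
print(ll0, ll1)
print(r2_l, r2_cs, r2_n)
```

Nagelkerke's value is always at least as large as Cox and Snell's, because it divides by a quantity between 0 and 1.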

As described above, the likelihood-ratio test statistic equals
$$-2 \ln \frac{L_0}{L_1} = 2(LL_1 - LL_0)$$
where L1 is the maximized value of the likelihood function for the full model, while L0 is the maximized value of the likelihood function for the reduced model. The test statistic has a chi-square distribution with df = k1 – k0, i.e. the number of parameters in the full model minus the number of parameters in the reduced model.

Example 1: Determine whether there is a significant difference in survival rate between the different values of rem in Example 1 of Basic Concepts of Logistic Regression. Also calculate the various pseudo-R² statistics. We are essentially comparing the logistic regression model with coefficient b to that of the model without coefficient b. We begin by calculating LL1 (for the full model with b) and LL0 (for the reduced model without b). Here LL1 is found in cell M16 or T6 of Figure 6 of Finding Logistic Coefficients using Solver.

We now use the following test:
$$\chi^2 = 2(LL_1 - LL_0) = 280.246$$
where df = 1. Since p-value = CHITEST(280.246,1) = 6.7E-63 < .05 = α, we conclude that differences in rems yield a significant difference in survival. The pseudo-R² statistics are calculated as in Definition 1. All these values are reported by the Logistic Regression data analysis tool (see range S5:T16 of Figure 6 of Finding Logistic Coefficients using Solver).
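The p-value quoted for this test can be checked without a statistics library, because for one degree of freedom the chi-square upper tail equals erfc(√(x/2)). The function name below is illustrative, and the shortcut applies only when df = 1.

```python
import math

def chi2_upper_tail_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

statistic = 280.246            # likelihood-ratio statistic from the text
alpha = 0.05

p_value = chi2_upper_tail_df1(statistic)
print(p_value)
print(p_value < alpha)
```

For other degrees of freedom a general chi-square survival function would be needed instead of this one-line identity.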

Finding Logistic Regression Coefficients using Newton’s Method
• Property 1: The maximum of the log-likelihood statistic (from Definition 5 of Basic Concepts of Logistic Regression) occurs when
$$\frac{\partial LL}{\partial b_j} = \sum_{i=1}^{n} (y_i - p_i)\, x_{ij} = 0 \quad \text{for } j = 0, 1, \ldots, k$$
where xi0 = 1 for all i.

Observation: Thus, to find the values of the coefficients bj we need to solve these k + 1 equations. We can do this iteratively using Newton’s method (see Definition 2 of Newton’s Method and Property 2 of Newton’s Method) as described in Property 2.

Property 2: Let B = [bj] be the (k+1) × 1 column vector of logistic regression coefficients, let Y = [yi] be the n × 1 column vector of observed outcomes of the dependent variable, let X be the n × (k+1) design matrix (see Definition 3 of Least Squares Method for Multiple Regression), let P = [pi] be the n × 1 column vector of predicted values of success and V = [vi] be the n × n diagonal matrix where vi = pi(1 – pi). Then if B0 is an initial guess of B and for all m we define the following iteration
$$B_{m+1} = B_m + (X^T V X)^{-1} X^T (Y - P)$$
(with P and V evaluated at Bm), then for m sufficiently large B ≈ Bm, and so Bm is a reasonable estimate of the coefficient vector.

Observation: If we group the data as we did in Example 1 of Basic Concepts of Logistic Regression (i.e. summary data), then Property 2 holds where Y = [yi] is the r × 1 column vector of summarized observed outcomes of the dependent variable, X is the corresponding r × (k+1) design matrix, P = [pi] is the r × 1 column vector of predicted values of success and V = [vi] is the r × r diagonal matrix where vi = ni pi (1 – pi).

Example 1 (using Newton’s Method): We now return to the problem of finding the coefficients a and b for Example 1 of Basic Concepts of Logistic Regression, this time using Newton’s method. We apply Newton’s method to find the coefficients as described in Figure 1. The method converges in only 4 iterations with the values a = 4.47665 and b = -0.0072. The regression equation is therefore logit(p) = 4.47665 – 0.0072x.
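The Newton iteration for summary data with one independent variable can be sketched in plain Python. This is an illustration under stated assumptions, not the worksheet in Figure 1: the rows are hypothetical (the table from Example 1 is not reproduced here), observed outcomes are taken as success counts, and predicted values as n·p, so that the update is B + (XᵀVX)⁻¹Xᵀ(observed − predicted) with V = diag(n·p·(1−p)).

```python
import math

def newton_logistic(rows, max_iter=25, tol=1e-10):
    """Fit logit(p) = a + b*x to rows of (x, successes, failures)."""
    a, b = 0.0, 0.0                       # initial guess B0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, s, f in rows:
            n = s + f
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            r = s - n * p                 # observed minus predicted successes
            v = n * p * (1.0 - p)         # diagonal entry of V
            g0 += r                       # X^T (Y - P), intercept component
            g1 += r * x                   # X^T (Y - P), slope component
            h00 += v                      # entries of X^T V X
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Step = (X^T V X)^(-1) X^T (Y - P), using the 2x2 inverse.
        da = (h11 * g0 - h01 * g1) / det
        db = (-h01 * g0 + h00 * g1) / det
        a, b = a + da, b + db
        if abs(da) < tol and abs(db) < tol:
            break
    return a, b

# Hypothetical summary data: (rems midpoint, survived, died).
rows = [(50, 45, 5), (150, 40, 10), (250, 30, 20), (350, 15, 35), (450, 5, 45)]
a_hat, b_hat = newton_logistic(rows)
print(a_hat, b_hat)
```

At the returned values the two score equations of Property 1 are satisfied to within rounding, which is the condition for a maximum of LL.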

Example 2: A study was made as to whether environmental temperature or immersion in water of the hatching egg had an effect on the gender of a particular type of small reptile. The table in Figure 2 shows the temperature (in degrees Celsius) and immersion in water (0 = no and 1 = yes) of the 49 eggs which resulted in a live birth as well as the sex of the reptile that hatched. Determine the odds that a female will be born if the temperature is 23 degrees with the egg immersed in water vs. not immersed in water.

We use the Logistic Regression supplemental data analysis tool, selecting the Raw data and Newton Method options as shown in Figure 3.

After pressing the OK button we obtain the output displayed in Figure 4. Here we only show the first 19 elements in the sample, although the full sample is contained in range A4:C52. Note that in the raw data option the Input Range (range A4:C52) consists of one column for each independent variable (Temp and Water for this example) and a final column only containing the values 0 or 1, where 1 indicates “success” (Male in this case) and 0 indicates “failure” (Female in this case). Please don’t read any gender discrimination into these choices: we would get the same result if we chose Female to be success and Male to be failure.

The model indicates that to predict the probability that a reptile will be male you can use the following formula: We can now obtain the desired results as shown in Figure 5 by copying any formula for p-Pred from Figure 4 and making a minor modification. Here we copied the formula from cell K6 into cells G29 and G30.

The formula that now appears in cell G29 will be =1/(1+EXP(-$R$7-MMULT(A29:B29,$R$8:$R$9))). You just need to change the part A29:B29 to E29:F29 (where the values of Temp and Water actually appear). The resulting formula =1/(1+EXP(-$R$7-MMULT(E29:F29,$R$8:$R$9))) will give the result shown in Figure 5.
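The worksheet formula is the logistic prediction with two independent variables. A rough Python equivalent is sketched below; the coefficient values are hypothetical placeholders, not the fitted values from Figure 4, so the printed numbers do not answer Example 2.

```python
import math

def predict(b0, coefs, values):
    """Equivalent of =1/(1+EXP(-b0-MMULT(values, coefs)))."""
    z = b0 + sum(c * v for c, v in zip(coefs, values))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for intercept, Temp and Water (NOT from Figure 4).
b0, b_temp, b_water = -5.0, 0.2, 0.8

# Probability of "success" (male) at 23 degrees, immersed vs. not immersed.
p_male_wet = predict(b0, [b_temp, b_water], [23, 1])
p_male_dry = predict(b0, [b_temp, b_water], [23, 0])

# Odds of a female birth are (1 - p) / p, since female is coded as failure.
odds_female_wet = (1 - p_male_wet) / p_male_wet
odds_female_dry = (1 - p_male_dry) / p_male_dry

print(p_male_wet, p_male_dry)
print(odds_female_wet, odds_female_dry)
```

Whatever the fitted coefficients are, the ratio of the two female odds equals exp(−b_water), because Temp is held fixed at 23 degrees.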

Comparing Logistic Regression Models • Example 1: Repeat the study from Example 3 of Finding Logistic Regression Coefficients using Newton’s Method based on the summary data shown in Figure 1.

Using the Logistic Regression supplemental data analysis tool, selecting the Newton Method option, we obtain the output displayed in Figure 2.
