
Multiple Regression

Exploring various strategies for selecting the best regression equation: one that explains much of the variance in Y while remaining simple and interpretable.


Presentation Transcript


  1. Multiple Regression Selecting the Best Equation

  2. Techniques for Selecting the "Best" Regression Equation • The best regression equation is not necessarily the equation that explains the most variance in Y (the highest R2); that equation is simply the one with all the variables included. • The best equation should also be simple and interpretable (i.e. contain a small number of variables). • Simplicity (interpretability) and reliability are opposing criteria. • The best equation is a compromise between these two.

  3. We will discuss several strategies for selecting the best equation: • All Possible Regressions: uses R2, s2, Mallows Cp, where Cp = RSSp / s2complete - [n - 2(p + 1)] • "Best Subset" Regression: uses R2, adjusted R2 (Ra2), Mallows Cp • Backward Elimination • Stepwise Regression

  4. Model The general linear model with intercept b0: Y = b0 + b1 X1 + b2 X2 + ... + bp Xp + e

  5. The ANOVA table partitions the total sum of squares: SSTotal = SSRegression + SSError

  6. An Example In this example the following four chemicals are measured when curing cement: X1 = amount of tricalcium aluminate, 3CaO·Al2O3 X2 = amount of tricalcium silicate, 3CaO·SiO2 X3 = amount of tetracalcium aluminoferrite, 4CaO·Al2O3·Fe2O3 X4 = amount of dicalcium silicate, 2CaO·SiO2 Y = heat evolved in calories per gram of cement.

  7. The data are given below:

     X1  X2  X3  X4    Y
      7  26   6  60   79
      1  29  15  52   74
     11  56   8  20  104
     11  31   8  47   88
      7  52   6  33   96
     11  55   9  22  109
      3  71  17   6  103
      1  31  22  44   73
      2  54  18  22   93
     21  47   4  26  116
      1  40  23  34   84
     11  66   9  12  113
     10  68   8  12  109
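A minimal sketch of how this data table could be entered for the later examples, assuming Python with numpy (the array name `data` and the split into `X` and `y` are just illustrative choices):

```python
import numpy as np

# Cement data transcribed from the table above (13 observations).
# Columns: X1, X2, X3, X4 (ingredient amounts), Y (heat evolved, cal/g).
data = np.array([
    [ 7, 26,  6, 60,  79],
    [ 1, 29, 15, 52,  74],
    [11, 56,  8, 20, 104],
    [11, 31,  8, 47,  88],
    [ 7, 52,  6, 33,  96],
    [11, 55,  9, 22, 109],
    [ 3, 71, 17,  6, 103],
    [ 1, 31, 22, 44,  73],
    [ 2, 54, 18, 22,  93],
    [21, 47,  4, 26, 116],
    [ 1, 40, 23, 34,  84],
    [11, 66,  9, 12, 113],
    [10, 68,  8, 12, 109],
], dtype=float)

X = data[:, :4]   # predictors X1..X4
y = data[:, 4]    # response Y
```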

  8. I All Possible Regressions • Suppose we have the p independent variables X1, X2, ..., Xp. • Then there are 2^p subsets of variables.

  9. Variables in Equation / Model (all 2^3 = 8 subsets of X1, X2, X3):
  no variables: Y = b0 + e
  X1: Y = b0 + b1 X1 + e
  X2: Y = b0 + b2 X2 + e
  X3: Y = b0 + b3 X3 + e
  X1, X2: Y = b0 + b1 X1 + b2 X2 + e
  X1, X3: Y = b0 + b1 X1 + b3 X3 + e
  X2, X3: Y = b0 + b2 X2 + b3 X3 + e
  and X1, X2, X3: Y = b0 + b1 X1 + b2 X2 + b3 X3 + e

  10. Use of R2 1. Suppose we carry out the 2^p runs, one for each subset. Divide the runs into the following sets: Set 0: no variables. Set 1: one independent variable. ... Set p: p independent variables. 2. Order the runs in each set according to R2. 3. Examine the leaders in each set, looking for consistent patterns and taking into account correlation between independent variables.

  11. Example (k = 4): X1, X2, X3, X4. Variables in the leading runs, with 100 R2%:
  Set 1: X4                 67.5 %
  Set 2: X1, X2             97.9 %
         X1, X4             97.2 %
  Set 3: X1, X2, X4         98.234 %
  Set 4: X1, X2, X3, X4     98.237 %
  Examination of the correlation coefficients reveals a high correlation between X1 and X3 (r13 = -0.824) and between X2 and X4 (r24 = -0.973). Best equation: Y = b0 + b1 X1 + b4 X4 + e
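As a hedged illustration of the all-possible-regressions procedure, the sketch below fits every subset by ordinary least squares and reports the leading run (highest R2) in each set. It reuses the `X` and `y` arrays from the earlier sketch; the helper name `fit_ols` is an arbitrary choice.

```python
from itertools import combinations
import numpy as np

def fit_ols(Xsub, y):
    """Least-squares fit with an intercept; returns (RSS, R^2)."""
    Z = np.column_stack([np.ones(len(y)), Xsub])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    rss = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    return rss, 1.0 - rss / sst

# Leading run (highest R^2) within each set of subset size m = 1..4.
for m in range(1, X.shape[1] + 1):
    runs = [(fit_ols(X[:, list(idx)], y)[1], idx)
            for idx in combinations(range(X.shape[1]), m)]
    r2, best = max(runs)
    print(f"Set {m}: X{[i + 1 for i in best]}  100 R^2 = {100 * r2:.1f}%")
```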

  12. Use of R2: the number of variables required, p, coincides with where R2 begins to level out.

  13. Use of the Residual Mean Square (RMS) (s2) • The residual mean square for a p-variable run is s2(p) = RSSp / (n - p - 1). • When all of the variables having a non-zero effect have been included in the model, the residual mean square is an estimate of s2. • If "significant" variables have been left out, the RMS will be biased upward.

  14. No. of variables p    RMS s2(p)                               Average s2(p)
      1                     115.06, 82.39, 176.31, 80.35            113.53
      2                     5.79*, 122.71, 7.48**, 86.59, 17.57     47.00
      3                     5.35, 5.33, 5.65, 8.20                  6.13
      4                     5.98                                    5.98
  * run X1, X2    ** run X1, X4    s2 is approximately 6.
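A short follow-on sketch, under the same assumptions (reusing `fit_ols`, `combinations`, `X`, and `y` from the earlier sketches), that computes s2(p) = RSSp / (n - p - 1) for each run and averages it within each set size:

```python
# Residual mean square s^2(p) for every subset, averaged within each size.
n = len(y)
for m in range(1, X.shape[1] + 1):
    rms_values = []
    for idx in combinations(range(X.shape[1]), m):
        rss, _ = fit_ols(X[:, list(idx)], y)
        rms_values.append(rss / (n - m - 1))
    print(f"p = {m}: average s^2(p) = {np.mean(rms_values):.2f}")
```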

  15. Use of s2: the number of variables required, p, coincides with where s2 levels out.

  16. Use of Mallows Cp • If the equation with p variables is adequate then both s2complete and RSSp / (n - p - 1) will be estimating s2. • If "significant" variables have been left out then RSSp / (n - p - 1) will be biased upward.

  17. Then, for an adequate p-variable equation, E[RSSp] is approximately (n - p - 1) s2, so Cp = RSSp / s2complete - [n - 2(p + 1)] is approximately (n - p - 1) - n + 2(p + 1) = p + 1. • Thus if we plot, for each run, Cp vs p and look for Cp close to p + 1, then we will be able to identify models giving a reasonable fit.

  18. Run                      Cp                            p + 1
  no variables                 443.2                         1
  1, 2, 3, 4                   202.5, 142.5, 315.2, 138.7    2
  12, 13, 14                   2.7, 198.1, 5.5               3
  23, 24, 34                   62.4, 138.2, 22.4             3
  123, 124, 134, 234           3.0, 3.0, 3.5, 7.5            4
  1234                         5.0                           5
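The corresponding Cp computation can be sketched the same way, using the full four-variable model to estimate s2complete (again reusing `fit_ols`, `combinations`, `X`, and `y` from the earlier sketches):

```python
# Mallows Cp = RSS_p / s^2_complete - [n - 2(p + 1)] for every run.
n, k = X.shape
rss_full, _ = fit_ols(X, y)
s2_complete = rss_full / (n - k - 1)

def mallows_cp(rss, p):
    return rss / s2_complete - (n - 2 * (p + 1))

# Intercept-only run (p = 0): its RSS is just the total sum of squares.
sst = float(((y - y.mean()) ** 2).sum())
print(f"no variables: Cp = {mallows_cp(sst, 0):.1f}, p + 1 = 1")

for m in range(1, k + 1):
    for idx in combinations(range(k), m):
        rss, _ = fit_ols(X[:, list(idx)], y)
        cp = mallows_cp(rss, m)
        flag = "  <-- Cp close to p + 1" if abs(cp - (m + 1)) < 1.0 else ""
        print(f"run X{[i + 1 for i in idx]}: Cp = {cp:6.1f}, p + 1 = {m + 1}{flag}")
```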

  19. Use of Cp (plot of Cp versus p): the number of variables required, p, coincides with where Cp becomes close to p + 1.

  20. Methods to Select Best Equation: Summary. I All Possible Regressions • Suppose we have the p independent variables X1, X2, ..., Xp. • Then there are 2^p subsets of variables. • In all possible subsets regression, we regress Y on each subset of X1, X2, ..., Xp.

  21. 1. We carry out the 2^p runs, one for each subset. Divide the runs into the following sets: Set 0: no variables. Set 1: one independent variable. ... Set p: p independent variables. 2. Order the runs in each set according to R2, s2, or Mallows Cp. 3. Examine the leaders in each set, looking for consistent patterns and taking into account correlation between independent variables. 4. Decide on the best equation.

  22. Use of R2: the number of variables required, p, coincides with where R2 begins to level out.

  23. Use of s2: the number of variables required, p, coincides with where s2 levels out.

  24. Use of Cp (plot of Cp versus p): the number of variables required, p, coincides with where Cp becomes close to p + 1.

  25. II "Best Subset" Regression • Similar to all possible regressions. • If p, the number of variables, is large then the number of runs performed, 2^p, could be extremely large. • In this algorithm the user supplies a value K and the algorithm identifies, for each number of variables m, the best K subsets of X1, X2, ..., Xp for predicting Y.
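A brute-force sketch of the idea: for each subset size m, keep the K best subsets ranked by R2. Real best-subset implementations use branch-and-bound searches to avoid fitting all 2^p subsets; this version only illustrates the kind of output produced, and reuses `fit_ols` and `combinations` from the earlier sketches.

```python
def best_k_subsets(X, y, K=2):
    """For each subset size m, return the K best subsets ranked by R^2."""
    results = {}
    for m in range(1, X.shape[1] + 1):
        scored = sorted(
            ((fit_ols(X[:, list(idx)], y)[1], idx)
             for idx in combinations(range(X.shape[1]), m)),
            reverse=True)
        results[m] = [([i + 1 for i in idx], round(r2, 4))
                      for r2, idx in scored[:K]]
    return results

for m, best in best_k_subsets(X, y, K=2).items():
    print(f"m = {m}: {best}")
```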

  26. III Backward Elimination • In this procedure the complete regression equation is determined containing all the variables - X1, X2, ..., Xp. • Then variables are checked one at a time and the least significant is dropped from the model at each stage. • The procedure is terminated when all of the variables remaining in the equation provide a significant contribution to the prediction of the dependent variable Y.

  27. The precise algorithm proceeds as follows: 1. Fit a regression equation containing all the variables X1, X2, ..., Xp.

  28. The Partial F statistic: 2. A partial F-test is computed for each of the independent variables still in the equation: Partial F = (RSS2 - RSS1) / MSE1, where RSS1 = the residual sum of squares with all variables that are presently in the equation, RSS2 = the residual sum of squares with one of the variables removed, and MSE1 = the Mean Square for Error with all variables that are presently in the equation.
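A small sketch of this computation, reusing `fit_ols`, `X`, and `y` from the earlier sketches (the helper name `partial_f` is an arbitrary choice):

```python
def partial_f(X, y, in_model):
    """Partial F = (RSS2 - RSS1) / MSE1 for each variable in `in_model`
    (a list of 0-based column indices of X)."""
    n = len(y)
    rss1, _ = fit_ols(X[:, in_model], y)
    mse1 = rss1 / (n - len(in_model) - 1)
    f_values = {}
    for j in in_model:
        reduced = [i for i in in_model if i != j]   # drop one variable
        rss2, _ = fit_ols(X[:, reduced], y)
        f_values[j] = (rss2 - rss1) / mse1
    return f_values

# Partial F for each of X1..X4 with all four variables in the equation.
print({f"X{j + 1}": round(f, 3)
       for j, f in partial_f(X, y, [0, 1, 2, 3]).items()})
```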

  29. 3. The lowest partial F value, FLowest, is compared with Fa for some pre-specified a. If FLowest > Fa then accept the equation as it stands. If FLowest < Fa then remove that variable and return to step 2. If a is small then Fa will be large, making it easier to remove variables. Increasing the value of a will decrease the value of Fa, making it harder to remove variables and resulting in an equation with more variables.

  30. Thus • Using a value of a that is too small may result in important variables being dropped from the equation. • It may be useful to try several values of a to determine which variables are needed and which are close to being significant.

  31. Example (k = 4) (same example as before): X1, X2, X3, X4. 1. X1, X2, X3, X4 in the equation. The lowest partial F = 0.018 (X3) is compared with Fa(1,8) = 3.46 for a = 0.10. Remove X3.

  32. 2. X1, X2, X4 in the equation. The lowest partial F = 1.86 (X4) is compared with Fa(1,9) = 3.36 for a = 0.10. Remove X4.

  33. 3. X1, X2 in the equation. The partial F values for both X1 and X2 exceed Fa(1,10) = 3.36 for a = 0.10. The equation is accepted as it stands: Y = 52.58 + 1.47 X1 + 0.66 X2. Note: F to Remove = partial F.
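Putting steps 1-3 together, a hedged sketch of the backward-elimination loop, reusing `partial_f` from the sketch above. The critical value is hard-coded near F0.10(1, 8..10), roughly 3.3 to 3.5, purely for illustration; in practice it would be taken from F tables for the current error degrees of freedom.

```python
def backward_elimination(X, y, f_alpha=3.4):
    """Drop the variable with the smallest partial F while that F is
    below the critical value f_alpha; stop when all remaining variables
    are significant."""
    in_model = list(range(X.shape[1]))          # start with all variables
    while len(in_model) > 1:
        f_vals = partial_f(X, y, in_model)
        worst = min(f_vals, key=f_vals.get)     # lowest partial F
        if f_vals[worst] > f_alpha:             # everything left is significant
            break
        print(f"remove X{worst + 1} (partial F = {f_vals[worst]:.3f})")
        in_model.remove(worst)
    return [i + 1 for i in in_model]

print("variables retained:", backward_elimination(X, y))   # slides give X1, X2
```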

  34. IV Stepwise Regression • In this procedure the regression equation starts with no variables in the model. • Variables are then checked one at a time using the partial correlation coefficient (or an equivalent statistic, F to enter) as a measure of importance in predicting the dependent variable Y. • At each stage the variable with the highest significant partial correlation coefficient is added to the model. • Once this has been done, the partial F statistic (F to remove) is computed for all variables now in the model, to check whether any of the variables previously added can now be deleted.

  35. This procedure is continued until no further variables can be added or deleted from the model. • The partial correlation coefficient for a given variable is the correlation between the given variable and the response when the present independent variables in the equation are held fixed. • It is also the correlation between the given variable and the residuals computed from fitting an equation with the present independent variables in the equation.

  36. Equivalent Statistics • F to enter • F to remove

  37. Example (k = 4) (same example as before): X1, X2, X3, X4. 1. With no variables in the equation, the correlation of each independent variable with the dependent variable Y is computed. The highest significant correlation (r = -0.821) is with variable X4. Thus the decision is made to include X4. Regress Y on X4: significant, thus we keep X4.

  38. Compute the partial correlation coefficients of Y with all other independent variables, given X4 in the equation. The highest partial correlation is with the variable X1 ([rY1.4]^2 = 0.915). Thus the decision is made to include X1.

  39. Regress Y on X1, X4: R2 = 0.972, F = 176.63. Check to see if variables in the equation can be eliminated: For X1 the partial F value = 108.22 (F0.10(1,8) = 3.46). Retain X1. For X4 the partial F value = 154.295 (F0.10(1,8) = 3.46). Retain X4.

  40. Compute the partial correlation coefficients of Y with all other independent variables, given X4 and X1 in the equation. The highest partial correlation is with the variable X2 ([rY2.14]^2 = 0.358). Thus the decision is made to include X2. Regress Y on X1, X2, X4: R2 = 0.982. Check to see if variables in the equation can be eliminated: the lowest partial F value = 1.863, for X4 (F0.10(1,9) = 3.36). Remove X4, leaving X1 and X2.
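A compressed sketch of the stepwise procedure of slides 34-40, reusing `partial_f` from the earlier sketch. It uses F to enter and F to remove (equivalent to ranking by partial correlation), with thresholds hard-coded for illustration only.

```python
def stepwise(X, y, f_in=3.3, f_out=3.3):
    """Add the candidate with the largest F to enter, then drop any
    variable already in the model whose F to remove falls below f_out."""
    in_model = []
    while True:
        # F to enter for each candidate: the partial F it would have if added.
        candidates = [j for j in range(X.shape[1]) if j not in in_model]
        f_enter = {j: partial_f(X, y, in_model + [j])[j] for j in candidates}
        if not f_enter or max(f_enter.values()) < f_in:
            break                                   # nothing left to add
        best = max(f_enter, key=f_enter.get)
        in_model.append(best)
        print(f"add X{best + 1} (F to enter = {f_enter[best]:.2f})")
        # F to remove for variables already in the model.
        f_remove = partial_f(X, y, in_model)
        worst = min(f_remove, key=f_remove.get)
        if len(in_model) > 1 and f_remove[worst] < f_out:
            in_model.remove(worst)
            print(f"remove X{worst + 1} (F to remove = {f_remove[worst]:.2f})")
    return [i + 1 for i in in_model]

print("stepwise selection:", stepwise(X, y))
```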

  41. Examples Using Statistical Packages
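As one example of such a package, the equation accepted in the backward-elimination example could be checked with statsmodels in Python (assuming statsmodels is installed, and reusing the `X` and `y` arrays from the earlier sketches):

```python
import statsmodels.api as sm

# Regress Y on X1 and X2 (columns 0 and 1) with an intercept.
final = sm.OLS(y, sm.add_constant(X[:, [0, 1]])).fit()
print(final.params)      # compare with slide 33: Y = 52.58 + 1.47 X1 + 0.66 X2
print(final.rsquared)    # compare with the R2 values reported above
```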
