Chapter 12 Multiple Regression and Model Building
Multiple Regression and Model Building
Part 1 Basic Multiple Regression
Part 2 Using Squared and Interaction Terms
Part 3 Dummy Variables and Advanced Statistical Inferences
Part 4 Model Building and Model Diagnostics

Part 1 Basic Multiple Regression
12.1 The Multiple Regression Model
12.2 The Least Squares Estimates and Point Estimation and Prediction
12.3 Model Assumptions and the Standard Error
12.4 R² and Adjusted R²
12.5 The Overall F Test
12.6 Testing the Significance of an Independent Variable
12.7 Confidence and Prediction Intervals

Part 2 Using Squared and Interaction Terms
12.8 The Quadratic Regression Model
12.9 Interaction

Part 3 Dummy Variables and Advanced Statistical Inferences
12.10 Using Dummy Variables to Model Qualitative Independent Variables
12.11 The Partial F Test: Testing the Significance of a Portion of a Regression Model

Part 4 Model Building and Model Diagnostics
12.12 Model Building and the Effects of Multicollinearity
12.13 Residual Analysis in Multiple Regression
12.14 Diagnostics for Detecting Outlying and Influential Observations
12.15 Logistic Regression

12.1 The Multiple Regression Model (Part 1)
• Simple linear regression used one independent variable to explain the dependent variable
• Some relationships are too complex to be described using a single independent variable
• Multiple regression models use two or more independent variables to describe the dependent variable
• This allows multiple regression models to handle more complex situations
• In principle there is no limit to the number of independent variables a model can use, although the data must supply more observations than parameters (n > k + 1)
• Multiple regression has only one dependent variable

The Multiple Regression Model
The linear regression model relating y to x1, x2, …, xk is
  y = μy|x1,x2,…,xk + ε = β0 + β1x1 + β2x2 + … + βkxk + ε
where
• μy|x1,x2,…,xk = β0 + β1x1 + β2x2 + … + βkxk is the mean value of the dependent variable y when the values of the independent variables are x1, x2, …, xk
• β0, β1, β2, …, βk are the regression parameters relating the mean value of y to x1, x2, …, xk
• ε is an error term that describes the effects on y of all factors other than the independent variables x1, x2, …, xk

12.2 The Least Squares Estimates and Point Estimation and Prediction
• The estimation/prediction equation is
  ŷ = b0 + b1x01 + b2x02 + … + bkx0k
• ŷ is the point estimate of the mean value of the dependent variable when the values of the independent variables are x01, x02, …, x0k
• It is also the point prediction of an individual value of the dependent variable at those same values
• b0, b1, b2, …, bk are the least squares point estimates of the parameters β0, β1, β2, …, βk
• x01, x02, …, x0k are specified values of the independent (predictor) variables x1, x2, …, xk

Calculating the Model
• A formula exists for computing the least squares model for multiple regression
• This formula is written using matrix algebra and is presented in Appendix G of the CD-ROM
• In practice, the model can be easily computed using Excel, MINITAB, MegaStat or many other computer packages

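As one possible illustration of "many other computer packages", the sketch below computes least squares point estimates with NumPy. The data are made up and noise-free (generated as y = 1 + 2·x1 - 3·x2), so they are an assumption for demonstration only and do not come from the slides.

```python
import numpy as np

# Made-up, noise-free data for illustration only (not from the slides).
# y is generated exactly as 1 + 2*x1 - 3*x2, so the fit should recover
# those three coefficients.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 - 3.0 * x2

# Design matrix: a column of ones for the intercept, then one column per x.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares point estimates b0, b1, b2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Point estimate / point prediction at specified values x01 = 2.5, x02 = 3.0.
y_hat = b @ np.array([1.0, 2.5, 3.0])
```

With real, noisy data the estimates would of course not reproduce the generating coefficients exactly.
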
Example 12.3: Fuel Consumption Case
Minitab Output

FuelCons = 13.1 - 0.0900 Temp + 0.0825 Chill

Predictor       Coef     StDev       T      P
Constant     13.1087    0.8557   15.32  0.000
Temp        -0.09001   0.01408   -6.39  0.001
Chill        0.08249   0.02200    3.75  0.013

S = 0.3671   R-Sq = 97.4%   R-Sq(adj) = 96.3%

Analysis of Variance
Source            DF      SS      MS      F      P
Regression         2  24.875  12.438  92.30  0.000
Residual Error     5   0.674   0.135
Total              7  25.549

Predicted Values (Temp = 40, Chill = 10)
   Fit  StDev Fit          95.0% CI           95.0% PI
10.333      0.170  ( 9.895, 10.771)  ( 9.293, 11.374)

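The "Fit" value in the output can be checked by plugging Temp = 40 and Chill = 10 into the fitted equation. The short sketch below does this with the unrounded coefficients from the Coef column; the variable names are illustrative.

```python
# Coefficients copied from the Coef column of the Minitab output above.
b0, b_temp, b_chill = 13.1087, -0.09001, 0.08249

# Specified values of the independent variables.
temp, chill = 40, 10

# Point estimate of mean fuel consumption (and point prediction of an
# individual value) at these settings.
fit = b0 + b_temp * temp + b_chill * chill
```

The result is about 10.333, in agreement with the reported Fit.
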
12.3 Model Assumptions and the Standard Error
• The model is
  y = μy|x1,x2,…,xk + ε = β0 + β1x1 + β2x2 + … + βkxk + ε
• Assumptions for multiple regression are stated about the model error terms, the ε's

The Regression Model Assumptions, Continued
• Mean of Zero Assumption: the mean of the error terms is equal to 0
• Constant Variance Assumption: the variance of the error terms, σ², is the same for every combination of values of x1, x2, …, xk
• Normality Assumption: the error terms follow a normal distribution for every combination of values of x1, x2, …, xk
• Independence Assumption: the values of the error terms are statistically independent of each other

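Mean Square Error
  s² = MSE = SSE / (n - (k + 1))
• This is the point estimate of the residual variance σ²
• SSE is the sum of squared residuals, SSE = Σ(yi - ŷi)²
• This formula is slightly different from simple regression, where the divisor is n - 2 (the case k = 1)
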
Standard Error
  s = √MSE = √(SSE / (n - (k + 1)))
• This is the point estimate of the residual standard deviation σ
• MSE is the mean square error from the previous slide
• This formula too is slightly different from simple regression

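As a quick check of these two formulas, the sketch below recomputes MSE and s from the Analysis of Variance table of Example 12.3 (SSE = 0.674, n = 8 observations since the total has 7 degrees of freedom, k = 2 predictors).

```python
import math

# Values read from the Example 12.3 Minitab output.
sse = 0.674   # Residual Error sum of squares
n = 8         # Total DF is 7, so n = 8
k = 2         # Temp and Chill

mse = sse / (n - (k + 1))   # point estimate of the error variance
s = math.sqrt(mse)          # point estimate of the error standard deviation
```

This gives MSE of about 0.135 and s of about 0.367, matching the reported MS and S up to rounding of the printed SSE.
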
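12.4 R² and Adjusted R²
• Total variation = Σ(yi - ȳ)²
• Explained variation = Σ(ŷi - ȳ)²
• Unexplained variation = SSE = Σ(yi - ŷi)²
• The multiple coefficient of determination is R² = Explained variation / Total variation
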
Multiple Correlation Coefficient R
• The multiple correlation coefficient R is just the square root of R²
• With simple linear regression, r would take on the sign of b1
• There are multiple bj's with multiple regression
• For this reason, R is defined as the positive square root and is never negative
• To interpret the direction of the relationship between a particular x and y, you must look to the sign of the appropriate bj coefficient

The Adjusted R²
• Adding an independent variable to a multiple regression model never lowers R², and it almost always raises it
• R² will rise slightly even if the new variable has no relationship to y
• The adjusted R² corrects this tendency in R²
• As a result, it gives a better estimate of the importance of the independent variables

Calculating the Adjusted R²
The adjusted multiple coefficient of determination is
  adjusted R² = (R² - k / (n - 1)) × ((n - 1) / (n - (k + 1)))

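A short sketch, using the sums of squares from the Example 12.3 output, shows how R² and the adjusted R² could be computed. The second expression, 1 - (1 - R²)(n - 1)/(n - (k + 1)), is an algebraically equivalent form included only as a cross-check.

```python
# Sums of squares from the Example 12.3 Analysis of Variance table.
explained = 24.875   # Regression SS
total = 25.549       # Total SS
n, k = 8, 2

r2 = explained / total

# Adjusted R-squared in the form (R^2 - k/(n-1)) * (n-1)/(n-(k+1)).
adj_r2 = (r2 - k / (n - 1)) * ((n - 1) / (n - (k + 1)))

# Equivalent form, used here only as a cross-check.
adj_r2_alt = 1 - (1 - r2) * (n - 1) / (n - (k + 1))
```

The results are roughly 0.974 and 0.963, matching R-Sq = 97.4% and R-Sq(adj) = 96.3%.
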
12.5 The Overall F Test
• To test H0: β1 = β2 = … = βk = 0 versus Ha: at least one of β1, β2, …, βk ≠ 0
• The test statistic is
  F(model) = (Explained variation / k) / (Unexplained variation / (n - (k + 1)))
• Reject H0 in favor of Ha if F(model) > Fα or p-value < α
• Fα is based on k numerator and n - (k + 1) denominator degrees of freedom

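The F(model) statistic can be recomputed from the Example 12.3 Analysis of Variance table, as in this sketch. Because the printed sums of squares are rounded, the result is slightly below the reported 92.30.

```python
# Values from the Example 12.3 Analysis of Variance table.
explained = 24.875    # Regression SS
unexplained = 0.674   # Residual Error SS
n, k = 8, 2

# F(model) = (explained / k) / (unexplained / (n - (k + 1)))
f_model = (explained / k) / (unexplained / (n - (k + 1)))
```

The value is about 92.27 on 2 and 5 degrees of freedom.
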
12.6 Testing the Significance of an Independent Variable
• A variable in a multiple regression model is not likely to be useful unless there is a significant relationship between it and y
• To test significance, we use the null hypothesis H0: βj = 0
• Versus the alternative hypothesis Ha: βj ≠ 0
• The test statistic is t = bj / s_bj, where s_bj is the standard error of bj, based on n - (k + 1) degrees of freedom

Testing Significance of an Independent Variable #4
• It is customary to test the significance of every independent variable in a regression model
• If we can reject H0: βj = 0 at the 0.05 level of significance, we have strong evidence that the independent variable xj is significantly related to y
• If we can reject H0: βj = 0 at the 0.01 level of significance, we have very strong evidence that the independent variable xj is significantly related to y
• The smaller the significance level α at which H0 can be rejected, the stronger is the evidence that xj is significantly related to y

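To make the test concrete, the sketch below recomputes the t statistics of Example 12.3 as coefficient divided by standard error and compares them with two-sided critical values for 5 degrees of freedom. The critical values 2.571 (α = 0.05) and 4.032 (α = 0.01) are taken from a standard t table, not from the slides.

```python
# (Coef, StDev) pairs from the Example 12.3 Minitab output.
estimates = {
    "Constant": (13.1087, 0.8557),
    "Temp": (-0.09001, 0.01408),
    "Chill": (0.08249, 0.02200),
}

# Two-sided critical values for n - (k + 1) = 5 degrees of freedom,
# taken from a standard t table (assumption: not given in the slides).
T_CRIT_05 = 2.571   # t_{0.025}, alpha = 0.05
T_CRIT_01 = 4.032   # t_{0.005}, alpha = 0.01

t_stats = {name: coef / se for name, (coef, se) in estimates.items()}
significant_05 = {name: abs(t) > T_CRIT_05 for name, t in t_stats.items()}
significant_01 = {name: abs(t) > T_CRIT_01 for name, t in t_stats.items()}
```

All three coefficients are significant at the 0.05 level, while Chill (t of about 3.75) is not significant at the 0.01 level, which is consistent with its reported p-value of 0.013.
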
12.7 Confidence and Prediction Intervals
• The point on the regression surface corresponding to particular values x01, x02, …, x0k of the independent variables is
  ŷ = b0 + b1x01 + b2x02 + … + bkx0k
• It is unlikely that this value will equal the mean value of y for these x values
• Therefore, we need to place bounds on how far the predicted value might be from the actual value
• We can do this by calculating a confidence interval for the mean value of y and a prediction interval for an individual value of y

Distance Value
• Both the confidence interval for the mean value of y and the prediction interval for an individual value of y employ a quantity called the distance value
• With simple regression, we were able to calculate the distance value fairly easily
• However, for multiple regression, calculating the distance value requires matrix algebra
• See Appendix G on CD-ROM for more detail

A Confidence Interval for a Mean Value of y
• Assume that the regression assumptions hold
• The formula for a 100(1 - α)% confidence interval for the mean value of y is
  [ŷ ± tα/2 · s · √(distance value)]
• This is based on n - (k + 1) degrees of freedom

A Prediction Interval for an Individual Value of y
• Assume that the regression assumptions hold
• The formula for a 100(1 - α)% prediction interval for an individual value of y is
  [ŷ ± tα/2 · s · √(1 + distance value)]
• This is based on n - (k + 1) degrees of freedom

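Both intervals in Example 12.3 can be approximately reproduced from the printed output. The sketch below relies on the fact that Minitab's "StDev Fit" equals s·√(distance value), so the distance value never has to be computed with matrix algebra here. The critical value t = 2.571 (5 degrees of freedom, 95%) comes from a standard t table and is not given in the slides.

```python
import math

# Values from the Example 12.3 Minitab output.
fit = 10.333      # y-hat at Temp = 40, Chill = 10
se_fit = 0.170    # StDev Fit = s * sqrt(distance value)
s = 0.3671        # standard error

# t_{0.025} with n - (k + 1) = 5 degrees of freedom (from a t table).
t_crit = 2.571

# Distance value implied by the output.
distance_value = (se_fit / s) ** 2

# 95% confidence interval for the mean value of y.
ci_half = t_crit * s * math.sqrt(distance_value)
ci = (fit - ci_half, fit + ci_half)

# 95% prediction interval for an individual value of y.
pi_half = t_crit * s * math.sqrt(1 + distance_value)
pi = (fit - pi_half, fit + pi_half)
```

The endpoints agree with the reported CI (9.895, 10.771) and PI (9.293, 11.374) to within about 0.001, the difference being due to rounding in the printed inputs.
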
12.8 The Quadratic Regression Model (Part 2)
• One useful form of linear regression is the quadratic regression model (it is still "linear" because it is linear in the parameters)
• Assume that we have n observations of x and y
• The quadratic regression model relating y to x is
  y = β0 + β1x + β2x² + ε
where
• β0 + β1x + β2x² is the mean value of the dependent variable y when the value of the independent variable is x
• β0, β1, and β2 are unknown regression parameters relating the mean value of y to x
• ε is an error term that describes the effects on y of all factors other than x and x²

More Variables
• We have only looked at the simple case where we have y and x
• That gave us the following quadratic regression model
  y = β0 + β1x + β2x² + ε
• However, we are not limited to just two terms
• The following would also be a valid quadratic regression model
  y = β0 + β1x1 + β2x1² + β3x2 + β4x3 + ε

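Fitting a quadratic model only requires adding a squared column to the design matrix. The sketch below shows this with NumPy on made-up, noise-free data generated as y = 2 + 3x - 0.5x²; the data are an assumption for illustration and are not from the slides.

```python
import numpy as np

# Made-up, noise-free data for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x - 0.5 * x**2

# Columns: intercept, x, and x squared. The model is still linear in
# the parameters, so ordinary least squares applies unchanged.
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Further predictors such as x2 and x3 would simply be appended as extra columns.
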
12.9 Interaction
• Multiple regression models often contain interaction variables
• These are variables that are formed by multiplying two independent variables together
• For example, x1·x2
• In this case, the x1·x2 variable would appear in the model along with both x1 and x2
• We use interaction variables when the relationship between the mean value of y and one of the independent variables is dependent on the value of another independent variable

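The effect of an interaction variable can be seen in a small sketch. The data below are made up and noise-free, generated as y = 1 + 2·x1 + 3·x2 + 0.5·x1·x2 (an assumption for illustration). With the product term in the model, the slope of y with respect to x1 is b1 + b3·x2, so it changes with x2.

```python
import numpy as np

# Made-up, noise-free data on a small grid, for illustration only.
x1 = np.array([0.0, 1.0, 2.0, 0.0, 1.0, 2.0])
x2 = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2 + 0.5 * x1 * x2

# The interaction variable x1*x2 appears alongside both x1 and x2.
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# Slope of mean y with respect to x1 at two different values of x2.
slope_x1_when_x2_is_0 = b[1] + b[3] * 0.0
slope_x1_when_x2_is_1 = b[1] + b[3] * 1.0
```
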
12.10 Using Dummy Variables to Model Qualitative Independent Variables (Part 3)
• So far, we have only looked at including quantitative data in a regression model
• However, we may wish to include descriptive qualitative data as well
• For example, we might want to include the gender of respondents
• We can model the effects of different levels of a qualitative variable by using what are called dummy variables
• These are also known as indicator variables

How to Construct Dummy Variables
• A dummy variable always has a value of either 0 or 1
• For example, to model sales at two locations, we would code the first location as 0 and the second as 1
• Operationally, it does not matter which is coded 0 and which is coded 1

What If We Have More Than Two Categories?
• Consider having three categories, say A, B, and C
• We cannot code this using one dummy variable
• Coding A = 0, B = 1, and C = 2 would be invalid because it assumes the difference between A and B is the same as the difference between B and C
• We must use multiple dummy variables
• Specifically, a categories require a - 1 dummy variables
• For A, B, and C, we would need two dummy variables
• x1 is 1 for A, zero otherwise
• x2 is 1 for B, zero otherwise
• If x1 and x2 are both zero, the category must be C
• This is why a third dummy variable is not needed

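The coding scheme for A, B, and C can be sketched in a few lines of Python. The list of observed categories is made up for illustration.

```python
# Made-up sequence of observed categories, for illustration only.
categories = ["A", "B", "C", "A", "C", "B"]

# Two dummy variables for three categories; C is the baseline.
x1 = [1 if c == "A" else 0 for c in categories]   # 1 for A, 0 otherwise
x2 = [1 if c == "B" else 0 for c in categories]   # 1 for B, 0 otherwise

# Rows where both dummies are zero correspond to category C.
is_c = [a == 0 and b == 0 for a, b in zip(x1, x2)]
```
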
Interaction Models
• So far, we have only considered dummy variables as stand-alone variables
• The model so far is
  y = β0 + β1x + β2D + ε
  where D is a dummy variable
• However, we can also look at the interaction between a dummy variable and other variables
• That model would take the form
  y = β0 + β1x + β2D + β3xD + ε
• With an interaction term, both the intercept and slope are shifted: when D = 0 the mean of y is β0 + β1x, and when D = 1 it is (β0 + β2) + (β1 + β3)x

12.11 The Partial F Test: Testing the Significance of a Portion of a Regression Model
• So far, we have looked at testing single slope coefficients using the t test
• We have also looked at testing all the coefficients at once using the F test
• The partial F test allows us to test the significance of any set of independent variables in a regression model

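The Partial F Test
• Complete model: y = β0 + β1x1 + … + βgxg + β(g+1)x(g+1) + … + βkxk + ε
• Reduced model: y = β0 + β1x1 + … + βgxg + ε
• To test H0: β(g+1) = β(g+2) = … = βk = 0 versus Ha: at least one of β(g+1), β(g+2), …, βk is not equal to 0
• Partial F statistic:
  F = [(SSE_R - SSE_C) / (k - g)] / [SSE_C / (n - (k + 1))]
  where SSE_R and SSE_C are the unexplained variations (SSE) of the reduced and complete models
• Reject H0 in favor of Ha if F > Fα or p-value < α
• Fα is based on k - g numerator and n - (k + 1) denominator degrees of freedom
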
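The partial F statistic is simple arithmetic once the two SSE values are known. The sketch below wraps it in a function; the function name and the numbers in the example call are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def partial_f(sse_reduced, sse_complete, n, k, g):
    """Partial F statistic for testing the last k - g coefficients.

    sse_reduced  : SSE of the reduced model (g independent variables)
    sse_complete : SSE of the complete model (k independent variables)
    n            : number of observations
    """
    numerator = (sse_reduced - sse_complete) / (k - g)
    denominator = sse_complete / (n - (k + 1))
    return numerator / denominator

# Hypothetical numbers for illustration only (not from the slides):
# dropping 2 of 5 variables raises SSE from 60 to 100 with n = 30.
f_stat = partial_f(sse_reduced=100.0, sse_complete=60.0, n=30, k=5, g=3)
```

Here F = (40 / 2) / (60 / 24) = 8.0, to be compared with Fα on 2 and 24 degrees of freedom.
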
12.12 Model Building and the Effects of Multicollinearity (Part 4)
Multicollinearity refers to the condition where the independent variables (or predictors) in a model are dependent, related, or correlated with each other
Effects
• Hinders ability to use t statistics and p-values to assess the relative importance of predictors
• Does not hinder ability to predict the dependent (or response) variable
Detection
• Scatter plot matrix
• Correlation matrix
• Variance inflation factors (VIF)

Variance Inflation Factors (VIF)
The variance inflation factor for the jth independent (or predictor) variable xj is
  VIFj = 1 / (1 - Rj²)
where Rj² is the multiple coefficient of determination for the regression model relating xj to the other predictors: x1, …, x(j-1), x(j+1), …, xk
Notes:
• VIFj = 1 implies xj is not related to the other predictors
• max(VIFj) > 10 suggests severe multicollinearity
• mean(VIFj) substantially greater than 1 suggests severe multicollinearity

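The definition translates directly into code: regress each predictor on the others, take Rj², and invert 1 - Rj². The sketch below does this with NumPy. The function name `vif` and the two small data sets are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of X.

    X holds one predictor per column and no intercept column.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        xj = X[:, j]
        # Regress x_j on an intercept plus all the other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ coef
        r2_j = 1.0 - (resid @ resid) / ((xj - xj.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2_j))
    return np.array(factors)

# Made-up predictors: the first pair is uncorrelated, the second pair
# is highly correlated.
uncorrelated = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]])
correlated = np.array([[1, 1], [2, 2], [3, 3], [4, 4], [5, 6]])

vif_uncorrelated = vif(uncorrelated)
vif_correlated = vif(correlated)
```

The uncorrelated pair gives VIFs of 1, while the correlated pair gives VIFs of 37, well above the threshold of 10 mentioned in the notes.
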
Comparing Regression Models on R², s, Adjusted R², and Prediction Interval
• Multicollinearity causes problems evaluating the p-values of the model
• Therefore, we need to evaluate more than the additional importance of each independent variable
• We also need to evaluate how the variables work together
• One way to do this is to determine if the overall model gives a high R² and adjusted R², a small s, and short prediction intervals

Comparing Regression Models on R², s, Adjusted R², and Prediction Interval, Continued
• Adding any independent variable will keep R² the same or increase it
• This holds even when adding an unimportant independent variable
• Thus, R² cannot tell us (by decreasing) that adding an independent variable is undesirable
• A better criterion is the size of the standard error s
• If s increases when an independent variable is added, we should not add that variable
• However, decreasing s alone is not enough
• Adding a variable reduces the degrees of freedom, n - (k + 1), which raises the tα/2 point and tends to make the prediction interval for y wider
• Therefore, an independent variable should only be included if it reduces s enough to offset the higher t value and reduces the length of the desired prediction interval for y

C Statistic
• Another quantity for comparing regression models is called the C statistic
• Also known as the Cp statistic
• First, calculate the mean square error for the model containing all p potential independent variables
• This is denoted s²p
• Next, calculate SSE for a reduced model with k independent variables
• Calculate C as
  C = SSE / s²p - [n - 2(k + 1)]

C Statistic, Continued
• We want the value of C to be small
• Adding unimportant independent variables will raise the value of C
• While we want C to be small, we also wish to find a model for which C roughly equals k + 1
• A model with C substantially greater than k + 1 has substantial bias and is undesirable
• If a model has a small value of C and C for this model is less than k + 1, then it is not biased and the model should be considered desirable

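The C statistic is a one-line calculation once SSE and the full-model mean square error are available. The sketch below uses a hypothetical function name and hypothetical numbers, chosen so that C comes out equal to k + 1.

```python
def c_statistic(sse, s2_full, n, k):
    """C = SSE / s_p^2 - [n - 2(k + 1)].

    sse     : SSE of the reduced model with k independent variables
    s2_full : mean square error of the model with all p potential variables
    n       : number of observations
    """
    return sse / s2_full - (n - 2 * (k + 1))

# Hypothetical numbers for illustration only (not from the slides).
n, k = 20, 2
c = c_statistic(sse=34.0, s2_full=2.0, n=n, k=k)
```

Here C = 17 - 14 = 3, which equals k + 1, the benchmark against which C is judged.
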
Stepwise Regression and Backward Elimination
• Testing various combinations of variables can be tedious
• In many situations, it is useful to have an iterative model selection procedure
• At each step, a single independent variable is added to or deleted from the model
• The model is then reevaluated
• This continues until a final model is found
• There are two such approaches: stepwise regression and backward elimination

Stepwise Regression #1
• Assume there are p potential independent variables
• Further, assume that p is large
• Stepwise regression uses t statistics to determine the significance of the independent variables in various models
• Stepwise regression needs two alpha values
• αentry, the probability of a type I error related to entering an independent variable into the model
• αstay, the probability of a type I error related to retaining an independent variable that was previously entered into the model

Stepwise Regression #2
• Step 1: The stepwise procedure considers the p possible one-independent-variable regression models
• It finds the variable with the largest absolute t statistic
• This variable is denoted x[1]
• If x[1] is not significant at the αentry level, the process terminates by concluding that none of the independent variables are significant
• Otherwise, x[1] is retained for use in Step 2

Stepwise Regression #3
• Step 2: The stepwise procedure considers the p - 1 possible two-independent-variable models of the form
  y = β0 + β1x[1] + β2xj + ε
• For each new variable, it tests H0: β2 = 0 versus Ha: β2 ≠ 0
• It picks the variable giving the largest absolute t statistic
• If the resulting variable is significant, the procedure checks x[1] against αstay to see if it should stay in the model
• This check is needed because multicollinearity can make x[1] insignificant once the new variable is in the model

Stepwise Regression #4
• Further steps: This adding and checking for removal continues until all non-selected independent variables are insignificant and will not enter the model
• The procedure will also terminate when the variable to be added to the model is the one just removed from it

Backward Elimination
• With backward elimination, we begin with a full regression model containing all p potential independent variables
• We then find the one having the smallest absolute t statistic (equivalently, the largest p-value)
• If this variable is significant, we stop
• If this variable is insignificant, it is dropped and the regression is rerun with p - 1 potential independent variables
• The process continues to remove variables one at a time until all the variables are significant

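The backward elimination steps above can be sketched with NumPy as follows. This is a simplified illustration, not the procedure as implemented in any particular package: the function names are hypothetical, and a single fixed critical value `t_crit` is used at every step, whereas a real procedure would use the critical value (or p-value) for the current degrees of freedom. The data are made up so that x2 is unrelated to y by construction.

```python
import numpy as np

def t_statistics(design, y):
    """t = b_j / s_bj for each column of a design matrix with an intercept."""
    n, p = design.shape
    xtx_inv = np.linalg.inv(design.T @ design)
    b = xtx_inv @ design.T @ y
    resid = y - design @ b
    s2 = (resid @ resid) / (n - p)            # MSE with n - (k + 1) df
    return b / np.sqrt(s2 * np.diag(xtx_inv))

def backward_elimination(X, y, names, t_crit):
    """Drop the predictor with the smallest |t| until all |t| >= t_crit."""
    cols = list(range(X.shape[1]))
    while cols:
        design = np.column_stack([np.ones(len(y)), X[:, cols]])
        abs_t = np.abs(t_statistics(design, y))[1:]   # skip the intercept
        worst = int(np.argmin(abs_t))
        if abs_t[worst] >= t_crit:
            break                                     # everything is significant
        cols.pop(worst)                               # drop and refit
    return [names[c] for c in cols]

# Made-up data: y depends on x1 only; the small error pattern is
# orthogonal to the intercept, x1, and x2.
x1 = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=float)
x2 = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
error = 0.1 * np.array([-1, 1, 1, -1, 1, -1, -1, 1], dtype=float)
y = 2.0 + 3.0 * x1 + error

# 2.571 is t_{0.025} with 5 degrees of freedom (from a t table).
kept = backward_elimination(np.column_stack([x1, x2]), y, ["x1", "x2"], t_crit=2.571)
```

In this example x2 has a t statistic of essentially zero and is dropped, leaving only x1 in the final model.
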