DSCI 5340: Predictive Modeling and Business Forecasting Spring 2013 – Dr. Nick Evangelopoulos

Lecture 2: Review of Multiple Regression (Ch. 4-5). Material based on: Bowerman-O’Connell-Koehler, Brooks/Cole

Review of textbook HW • Page 127-128: Ex 3.12 (use Excel) • Page 128: Ex 3.13, 3.17 • Page 132: Ex 3.25 • Page 134: Ex 3.35

Excel Data Analysis Add-in In Excel, make sure the Analysis ToolPak add-in is enabled.

Ex 3.12 Page 128 Scatter Plot An accountant wishes to predict direct labor cost (y) based on the batch size (x) of a product produced in a job shop. Data for 12 production runs are given.

Ex 3.13 Page 128 Interpretation of Mean of Y Given X a. μy|x=60 = β0 + β1(60): the mean value of y over repeated observations at x = 60. This is the point on the regression line at x = 60. b. μy|x=30 = β0 + β1(30): the mean value of y over repeated observations at x = 30. This is the point on the regression line at x = 30. The distribution of y values around the mean at x = 30 should be similar to that at x = 60. c. Interpretation of slope: as the batch size increases by one unit, the mean direct labor cost increases by an estimated b1 = 10.1463. Fitted model: ŷ = 18.49 + 10.15x
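A minimal sketch of how the fitted model on this slide turns into point estimates at the two batch sizes discussed. The coefficients are the rounded values quoted on the slide; the function and constant names are illustrative only.

```python
# Rounded least-squares estimates quoted on the slide.
B0_HAT = 18.49   # estimated intercept
B1_HAT = 10.15   # estimated slope (10.1463 before rounding)

def predict_labor_cost(batch_size):
    """Point estimate of mean direct labor cost at the given batch size."""
    return B0_HAT + B1_HAT * batch_size

for batch_size in (30, 60):
    estimate = predict_labor_cost(batch_size)
    print(f"x = {batch_size}: estimated mean labor cost = {estimate:.2f}")
# x = 30: estimated mean labor cost = 322.99
# x = 60: estimated mean labor cost = 627.49
```

Raising the batch size by one unit changes the estimate by exactly the slope, which is the interpretation given in part c.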

Ex 3.13 Page 128 Interpretation of Model Intercept b0: 18.49 is the estimated labor cost if the batch size is 0. Theoretically, this cost would be 0, but it can be interpreted as fixed costs. Interpretation of Error Term: There may be other factors that determine direct labor costs, such as benefits to employees, type of product, number of employees, etc. Thus, the model might be more accurate with additional independent variables; the error term accounts for the effect of these omitted factors.

Ex 3.17 Page 128 Accu-Copiers, Inc., sells and services the Accu-500 copying machine. As part of its standard service contract, the company agrees to perform routine service on the copier. To obtain information about the time it takes to perform routine service, Accu-Copiers has collected data for 11 service calls, shown in Table 3.7 (p. 126)

EX 3.17 Page 128

Ex 3.25 Page 132: Test for correlation The test for correlation between X and Y, H0: ρ = 0 vs. Ha: ρ ≠ 0, has the same test statistic and p-value as the test for significance of the regression slope coefficient. However, the two tests use different assumptions.
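The equality of the two test statistics can be checked numerically. The sketch below uses a small made-up data set (not the textbook's) and computes the t statistic for H0: ρ = 0 from the sample correlation r, and the t statistic for a zero slope from the least-squares fit. Both are referred to a t distribution with n − 2 degrees of freedom.

```python
import math

# Hypothetical data for illustration only (not from the textbook).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.0, 2.0, 6.0, 5.0, 9.0, 8.0, 12.0, 10.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Test of H0: rho = 0, based on the sample correlation r.
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Test of a zero slope, based on the least-squares estimate b1.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))          # standard error of the regression
t_slope = b1 / (s / math.sqrt(sxx))   # slope divided by its standard error

print(f"t (correlation) = {t_corr:.4f}, t (slope) = {t_slope:.4f}")
```

The two printed values agree up to floating-point rounding, so the p-values agree as well.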

EX 3-35 Page 134 A State Department of Taxation asked taxpayers to report the time y (in hours) required to complete a tax form and the number of times x (including this one) the taxpayer has filled out this form

EX 3-35 Page 134 To understand this model, note that as x increases, 1/x decreases, and thus μy|x decreases (provided the coefficient on 1/x is positive).

Multiple Regression Graphically

Residuals The residuals will be denoted êi: êi = yi − ŷi. They represent the distance of each observed value of the dependent variable from the estimated regression line, or the portion of the variation in y that cannot be “explained” with the data available. What assumptions can we test using these residuals?

Regression model assumptions What are the assumptions of regression analysis? How can these assumptions be checked? • The relationship is linear. • The disturbances ei have constant variance σe². • The disturbances are independent. • The disturbances are normally distributed.

Graphical Techniques • scatterplots • residual plots • histograms (not an exact science)

Properties of residual plots Property 1: The average of the residuals will be equal to zero. This property holds regardless of whether the assumptions are true or not and is a direct result of the way the least-squares method works. Property 2: There should be no systematic pattern in a residual plot. (What is a systematic pattern?) Property 3: Residuals should look like random numbers chosen from a normal distribution. (How close to normality should the chart look?)
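Property 1 can be verified directly. The sketch below fits a least-squares line to a small made-up data set (hypothetical values, illustration only) and shows that the residuals average to zero up to floating-point rounding. The helper name `fit_line` is an assumption of this example.

```python
# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.3, 3.1, 5.2, 5.8, 8.1, 8.4]

def fit_line(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / len(residuals)

# Zero up to floating-point rounding, whatever the data look like.
print(f"mean residual = {mean_residual:.2e}")
```

This holds because the least-squares normal equations force the residuals to sum to zero whenever the model includes an intercept, so it says nothing about whether the model assumptions are satisfied.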

Residual plots In a residual analysis it is suggested that the following plots be used: 1. Plot the residuals versus each explanatory variable. 2. Plot the residuals versus the predicted or fitted values. 3. If the data are measured over time, plot the residuals versus some variable representing the time sequence. Which assumptions can each of these plots support, or show to be violated?

Residual plots Plots may be constructed using the actual residuals, êi, or the standardized residuals. The standardized residuals are simply the residuals divided by their standard deviation. Why do you think standardized residuals are sometimes used instead of regular residuals?
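One reason to standardize is that the result no longer depends on the units of y. The sketch below, using made-up cost data, expresses the same response in dollars and in cents: the raw residuals differ by a factor of 100, while the standardized residuals are identical. It assumes the version that divides each residual by s·sqrt(1 − hi), where hi is the leverage; dividing by s alone is a common simplification.

```python
import math

def residuals_raw_and_standardized(x, y):
    """Return (raw residuals, standardized residuals) for a least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    raw = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in raw) / (n - 2))
    # Leverage of each point in simple linear regression.
    lev = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    std = [e / (s * math.sqrt(1 - h)) for e, h in zip(raw, lev)]
    return raw, std

# Hypothetical data for illustration only.
x = [10, 20, 30, 40, 50, 60]
cost_dollars = [120.0, 228.0, 310.0, 432.0, 515.0, 640.0]
cost_cents = [100 * c for c in cost_dollars]

raw_d, std_d = residuals_raw_and_standardized(x, cost_dollars)
raw_c, std_c = residuals_raw_and_standardized(x, cost_cents)

for e_d, e_c, z_d, z_c in zip(raw_d, raw_c, std_d, std_c):
    print(f"raw: {e_d:8.2f} vs {e_c:10.2f}   standardized: {z_d:6.3f} vs {z_c:6.3f}")
```

Because standardized residuals are unit-free, the same cutoffs (for example, values beyond about 2 in absolute value) can be applied to any data set.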

No Violations of the Assumptions of Regression Plot shows random residuals

Does This Plot Suggest That One of the Assumptions of Regression Analysis Is Violated?

PLOT OF RESIDUALS - Standardized values are small.

Outliers The method of least squares estimation chooses the regression coefficient estimates so that the error sum of squares, SSE, is a minimum. In doing this, the sum of squared distances from the observed y values, yi, to the points on the regression line or surface, ŷi, is minimized. Least squares thus tries to avoid any large distances from yi to ŷi.

Outliers OUTLIER: A sample data point whose y value is much different from the y values of the other points in the sample. As a rule of thumb, an observation is flagged as an outlier when its studentized residual is greater than 2 in absolute value. An outlier does not have to be influential. That is, removing the outlier may not change the regression coefficients very much.

No influential observations

A High Leverage Observation That is Not Influential

Leverages The slope of the line appears to be determined almost entirely by this one point. The sixth observation is said to have high leverage and is referred to as a leverage point. What do you think the term “leverage point” means?
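A short sketch of how leverage can be quantified in simple linear regression, where hi = 1/n + (xi − x̄)² / Sxx depends only on how far xi lies from the mean of the x values. The data are made up for illustration, and the cutoff of twice the average leverage, 2(k + 1)/n, is a common rule of thumb assumed here rather than taken from the slide.

```python
def leverages(x):
    """Leverage h_i of each observation in a simple linear regression on x."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical x values: five clustered points and one far to the right.
x = [1, 2, 3, 4, 5, 20]
k = 1                                # number of predictors
h = leverages(x)
cutoff = 2 * (k + 1) / len(x)        # rule-of-thumb cutoff (an assumption)

high_leverage = [i for i, hi in enumerate(h) if hi > cutoff]
for i, hi in enumerate(h):
    print(f"observation {i + 1}: x = {x[i]:2d}, leverage = {hi:.3f}")
print("flagged (0-based indices):", high_leverage)
```

The leverages always sum to k + 1, so one very large value means a single point accounts for most of the information that determines the fitted line.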

Studentized residuals Another measure sometimes used in place of the standardized residual is the standardized residual computed after deleting the ith observation. This measure is called the studentized residual or studentized deleted residual. (Note that SAS refers to the standardized residual as the studentized residual.)
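A minimal sketch of the studentized deleted residual for simple linear regression, using made-up data in which the sixth y value is unusually high. For each observation the line is refitted without that point, so the standard error s(i) cannot be inflated by the point being judged. The residual from the full fit is then divided by s(i)·sqrt(1 − hi). Function names are illustrative, and the cutoff of 2 is the rule of thumb from the Outliers slide.

```python
import math

def fit_line(x, y):
    """Return (b0, b1, sse) for a least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, sse

def studentized_deleted_residuals(x, y):
    """Studentized deleted residual of each observation (lists in, list out)."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b0, b1, _ = fit_line(x, y)
    result = []
    for i in range(n):
        e_i = y[i] - (b0 + b1 * x[i])               # residual from the full fit
        h_i = 1 / n + (x[i] - x_bar) ** 2 / sxx     # leverage of observation i
        # Refit without observation i to estimate the error standard deviation.
        _, _, sse_del = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        s_del = math.sqrt(sse_del / (n - 1 - 2))    # n - 1 points, 2 coefficients
        result.append(e_i / (s_del * math.sqrt(1 - h_i)))
    return result

# Hypothetical data for illustration only; the sixth y value is out of line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 18.0, 14.1, 16.0]

t = studentized_deleted_residuals(x, y)
flagged = [i for i, ti in enumerate(t) if abs(ti) > 2]
print("flagged as outliers (0-based indices):", flagged)
```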

Checking Model Assumptions • Checking Assumption 1 (normal distribution): construct a histogram of the residuals. • Checking Assumption 2 (constant variance): plot residuals versus predicted Y values. • Checking Assumption 3 (errors are independent): Durbin-Watson statistic; plot of errors versus time.
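For the independence check, the Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below applies it to two made-up residual series listed in time order. Values near 2 are consistent with no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals listed in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # sign flips every period
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]    # long runs of the same sign

print(f"alternating: d = {durbin_watson(alternating):.3f}")   # d = 3.333
print(f"persistent:  d = {durbin_watson(persistent):.3f}")    # d = 0.667
```

A formal test compares d with the tabulated lower and upper bounds for the given sample size and number of predictors.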

Detecting Sample Outliers • Sample leverages • Standardized residuals • Cook’s distance measure: Di = (standardized residual)² × [1 / (k + 1)] × [hi / (1 − hi)], where hi is the leverage of observation i and k is the number of predictors.
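A minimal sketch of Cook's distance for simple linear regression (k = 1), combining the standardized residual and the leverage as in the formula on this slide. The data are made up: the sixth point lies far to the right and well below the trend of the others, so it pulls the fitted line toward itself. The standardized residual is taken to be ei / (s·sqrt(1 − hi)), which is the version for which this formula holds.

```python
import math

K = 1  # number of predictors in simple linear regression

def cooks_distances(x, y):
    """Cook's distance of each observation for a least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in resid) / (n - K - 1))
    distances = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - x_bar) ** 2 / sxx        # leverage
        std_resid = e / (s * math.sqrt(1 - h))     # standardized residual
        distances.append(std_resid ** 2 * (1 / (K + 1)) * (h / (1 - h)))
    return distances

# Hypothetical data for illustration only.
x = [1, 2, 3, 4, 5, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]

d = cooks_distances(x, y)
for i, di in enumerate(d):
    print(f"observation {i + 1}: Cook's D = {di:.3f}")
```

The sixth observation has a modest residual but a very large distance, because the leverage ratio hi / (1 − hi) multiplies the squared standardized residual.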

Example of An Influential Observation

Should an unusual observation be deleted? If an observation is exerting undue influence on the fit of the model, then from an exploratory and data-mining standpoint, removing the observation may reveal substantial changes in the model. An observation may be miscoded or may not belong with the rest of the collected data. As a rule of thumb, no more than 10% of the data should be deleted to improve the model.

Dummy Variables

Test of Null Hypothesis (F-test) Tests the null hypothesis H0: β2 = β3 = … = βp = 0 vs. Ha: at least one βi is not zero. The null hypothesis is known as a joint or simultaneous hypothesis, because it tests the values of all βi simultaneously. This F-test assesses the overall significance of the regression model.
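The overall F statistic can be written in terms of R²: F = (R² / k) / ((1 − R²) / (n − k − 1)), with k and n − k − 1 degrees of freedom. The sketch below evaluates it for made-up values. Here k denotes the number of slope coefficients being tested, an assumption about notation, and the function name is illustrative.

```python
def overall_f(r_squared, n, k):
    """Overall F statistic for a model with k predictors and n observations."""
    explained_per_df = r_squared / k
    unexplained_per_df = (1 - r_squared) / (n - k - 1)
    return explained_per_df / unexplained_per_df

# Hypothetical values, for illustration only.
f_stat = overall_f(r_squared=0.80, n=25, k=3)
print(f"F = {f_stat:.2f} on (3, 21) degrees of freedom")   # F = 28.00
```

H0 is rejected when the statistic exceeds the tabulated F value for the chosen significance level and these degrees of freedom.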

Model building: Backward Selection • A “deconstruction” approach • Begin with the saturated (full) regression model • Compute the drop in R² as a consequence of eliminating each predictor variable, and the partial F-test value, treating each variable as if it were the last to enter the regression equation • Compare the lowest partial F-test value (designated FL) to the critical value of F (designated FC) a. If FL < FC, remove the variable, recompute the regression equation using the remaining predictor variables, and return to step 2. b. If FL > FC, adopt the regression equation as calculated.
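The backward selection loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a production routine: the data are made up, the helper names are hypothetical, and a single fixed critical value `f_crit` stands in for the tabulated F value, which in practice depends on the degrees of freedom at each step.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z

def sse(columns, y):
    """Error sum of squares for y regressed on an intercept plus the columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def backward_select(predictors, y, f_crit):
    """Drop the weakest predictor until the lowest partial F reaches f_crit."""
    kept = list(predictors)
    n = len(y)
    while kept:
        sse_full = sse([predictors[name] for name in kept], y)
        mse_full = sse_full / (n - len(kept) - 1)
        # Partial F: increase in SSE when one predictor is removed, over MSE.
        partial_f = {
            name: (sse([predictors[o] for o in kept if o != name], y) - sse_full)
            / mse_full
            for name in kept
        }
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_crit:
            break                      # FL >= FC: adopt the current equation
        kept.remove(weakest)           # FL < FC: remove and refit
    return kept

# Hypothetical data: y depends on x1 and x2; x3 is unrelated noise.
predictors = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [5, 3, 8, 1, 9, 2, 7, 4, 10, 6],
    "x3": [1, 0, 1, 0, 0, 1, 1, 0, 0, 1],
}
y = [17.3, 12.8, 29.7, 11.4, 37.2, 17.9, 35.1, 27.6, 48.0, 38.1]

selected = backward_select(predictors, y, f_crit=4.0)  # 4.0 is a placeholder
print("retained predictors:", selected)
```

Solving the normal equations directly keeps the example dependency-free. A numerically safer least-squares routine would be preferable for real data.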

Model building: Stepwise Selection • Calculate correlations of all predictors with the response variable. • Select the predictor variable with the highest correlation. Regress Y on Xi. Retain the predictor if there is a significant F-test value. • Calculate partial correlations of all variables not in the equation with the response variable. Select the next predictor to enter as the one with the highest partial correlation. Call this predictor Xj. • Compute the regression equation with both Xi and Xj entered. Retain Xj if its partial F-value exceeds the tabulated F value with (1, n − 2 − 1) df. • Now determine whether Xi warrants retention. Compute its partial F-value as if Xj had been entered into the equation first.

Stepwise Continued • Retain Xi if its partial F-value exceeds the tabulated F value. • Enter a new variable Xk. Compute the regression with three predictors. Compute partial F-values for Xi, Xj and Xk. • Determine which should be retained by comparing each observed partial F with the critical F. • Retain the regression equation when no other predictor can be entered or removed from the model.
