Multiple Regression
Introduction • In this chapter, we extend the simple linear regression model. Any number of independent variables is now allowed. • We wish to build a model that fits the data better than the simple linear regression model.
Computer printout is used to help us: • Assess/Validate the model • How well does it fit the data? • Is it useful? • Are any of the required conditions violated? • Apply the model • Interpreting the coefficients • Estimating the expected value of the dependent variable
Model and Required Conditions • We allow for k independent variables to potentially be related to the dependent variable: Y = b0 + b1X1 + b2X2 + … + bkXk + e • Y is the dependent variable. • X1, X2, …, Xk are the independent variables. • b0, b1, …, bk are the coefficients. • e is the random error variable.
Multiple Regression for k = 2, Graphical Demonstration • The simple linear regression model allows for one independent variable, X: Y = b0 + b1X + e. Its graph, Y = b0 + b1X, is a straight line. • The multiple linear regression model allows for more than one independent variable: Y = b0 + b1X1 + b2X2 + e. • Note how the straight line becomes a plane: Y = b0 + b1X1 + b2X2. [Figure: the regression plane plotted against the axes X1, X2 and Y]
Required Conditions for the Error Variable • The error e is normally distributed. • The mean is equal to zero and the standard deviation is constant (se) for all possible values of the Xi's. • All errors are independent.
Estimating the Coefficients and Assessing the Model • The procedure used to perform regression analysis: • Obtain the model coefficients and statistics using Excel. • Diagnose violations of required conditions. Try to remedy problems when identified. • Assess the model fit using statistics obtained from the sample. • If the model assessment indicates good fit to the data, use it to interpret the coefficients and generate predictions.
Example 18.1 Where to locate a new motor inn? • La Quinta Motor Inns is planning an expansion. • Management wishes to predict which sites are likely to be profitable, defined as having 50% or higher operating margin (net profit expressed as a percentage of total revenue). • Several potential predictors of profitability are: • Competition (room supply) • Market awareness (competing motel) • Demand generators (office and college) • Demographics (household income) • Physical quality/location (distance to downtown)
Variables used in the model, by category: • Profitability: Margin (operating margin). • Competition/Supply: Rooms (number of hotel/motel rooms within 3 miles of the site). • Market Awareness: Nearest (distance to the nearest motel). • Demand/Customers: Office (office space) and College (college enrollment). • Community: Income (median household income). • Physical: Disttwn (distance to downtown).
Model and Data • Data were collected from 100 randomly selected inns that belong to La Quinta (data file Xm18-01), and the following suggested model was run: Margin = b0 + b1Rooms + b2Nearest + b3Office + b4College + b5Income + b6Disttwn + e
Excel Output This is the sample regression equation (sometimes called the prediction equation) Margin = 38.14 - 0.0076 Rooms +1.65 Nearest + 0.020 Office + 0.21 College + 0.41 Income - 0.23 Disttwn
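The coefficients in a printout like this are least-squares estimates. As a minimal sketch of that computation (not Excel's actual routine), the function below solves the normal equations by Gauss-Jordan elimination in pure Python. The name fit_ols and the small data set are hypothetical; the La Quinta file Xm18-01 is not reproduced here.

```python
def fit_ols(X, y):
    """Least-squares coefficients [b0, b1, ..., bk] for rows of predictors X."""
    rows = [[1.0] + list(x) for x in X]          # prepend the intercept column
    p = len(rows[0])                             # p = k + 1 coefficients
    # Augmented normal equations [A'A | A'y]
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting: bring the largest entry in this column to the diagonal
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


# Hypothetical data generated exactly from Y = 2 + 3*X1 - 1*X2 (no error term)
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - x2 for x1, x2 in X]
coeffs = fit_ols(X, y)
print([round(c, 6) for c in coeffs])
```

Because the toy data contain no random error, the fitted coefficients reproduce the generating values; with real data the estimates would differ from the true coefficients.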
Model Assessment • The model is assessed using three measures: • The standard error of estimate • The coefficient of determination • The F-test of the analysis of variance • The standard error of estimate is used in the calculations for the other measures.
Standard Error of Estimate • The standard deviation of the error is estimated by the Standard Error of Estimate: se = sqrt( SSE / (n - k - 1) ), where the divisor is n - k - 1 because k+1 coefficients were estimated. • The magnitude of se is judged by comparing it to the mean value of Y.
From the printout, se = 5.51 • The mean value of Y can be determined from the sample data. • It seems that se is not particularly small (relative to the mean of Y). • Question: Can we conclude the model does not fit the data well? Not necessarily.
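The standard error of estimate is the square root of SSE divided by n - k - 1, the same divisor that appears in MSE. A minimal sketch of that calculation follows; the function name and the toy residuals are hypothetical, since the La Quinta residuals are not reproduced here.

```python
import math


def standard_error_of_estimate(residuals, k):
    """sqrt(SSE / (n - k - 1)) for a model with k independent variables."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)   # sum of squared errors
    return math.sqrt(sse / (n - k - 1))


# Hypothetical residuals from a model with k = 2 independent variables
toy_residuals = [1.0, -2.0, 0.5, 1.5, -1.0, 0.0]
se_toy = standard_error_of_estimate(toy_residuals, k=2)
print(f"se = {se_toy:.3f}")
```

To judge the magnitude, divide the result by the sample mean of Y; a small ratio suggests the typical error is small relative to the values being predicted.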
Coefficient of Determination • The definition is: R2 = 1 - SSE/SS(Total) • From the printout, R2 = 0.5251 • 52.51% of the variation in operating margin is explained by the six independent variables. 47.49% is unexplained. • When adjusted for the impact of k relative to n (intended to flag potential problems with small sample size), we have: Adjusted R2 = 1 - [SSE/(n-k-1)] / [SS(Total)/(n-1)] = 49.44%
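Since SSE/SS(Total) = 1 - R2, the adjusted R2 can be recomputed from R2, n and k alone: Adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1). The sketch below applies this to the values quoted above (R2 = 0.5251, n = 100, k = 6); the function name is an illustrative choice.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 = 1 - [SSE/(n-k-1)] / [SS(Total)/(n-1)], rewritten in terms of R2."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


adj = adjusted_r2(r2=0.5251, n=100, k=6)
print(f"Adjusted R2 = {adj:.4f}")
```

The result is about 0.494, which agrees with the 49.44% on the printout up to rounding of the reported R2.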
Testing the Validity of the Model • Consider the question: Is there at least one independent variable linearly related to the dependent variable? • To answer this question, we test the hypothesis: H0: b1 = b2 = … = bk = 0 H1: At least one bi is not equal to zero. • If at least one bi is not equal to zero, the model has some validity. • The test is similar to an Analysis of Variance ...
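• The hypotheses can be tested by an ANOVA procedure. The Excel output has the following layout (here n = 100 and k = 6):
Source        d.f.             SS            MS                     F
Regression    k = 6            SSR           MSR = SSR/k            MSR/MSE
Residual      n-k-1 = 93       SSE           MSE = SSE/(n-k-1)
Total         n-1 = 99         SSR + SSE
SSR: Sum of Squares for Regression. SSE: Sum of Squares for Error.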
As in analysis of variance, we have: [Total Variation in Y] = SSR + SSE. Large F indicates a large SSR; that is, much of the variation in Y is explained by the regression model. Therefore, if F is large, the model is considered valid and hence the null hypothesis should be rejected. • The rejection region: F > Fa,k,n-k-1, where a is the significance level.
Fa,k,n-k-1 = F0.05,6,100-6-1 = 2.17 • F = 17.14 > 2.17 • Also, the p-value (Significance F) = 0.0000 • Reject the null hypothesis. • Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis: at least one of the bi is not equal to zero. Thus, at least one independent variable is linearly related to Y. This linear regression model is valid.
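The F statistic can be cross-checked from R2, because dividing both MSR and MSE by SS(Total) gives F = (R2/k) / ((1 - R2)/(n - k - 1)). The sketch below uses the values quoted in this example; the function name is hypothetical and the critical value 2.17 is taken from the text rather than computed.

```python
def f_statistic(r2, n, k):
    """F = MSR/MSE expressed through R2 (both mean squares divided by SS(Total))."""
    explained_per_df = r2 / k
    unexplained_per_df = (1.0 - r2) / (n - k - 1)
    return explained_per_df / unexplained_per_df


f_value = f_statistic(r2=0.5251, n=100, k=6)
critical_value = 2.17                 # F0.05,6,93 as quoted in the text
reject_h0 = f_value > critical_value  # rejection region: F > critical value
print(f"F = {f_value:.2f}, reject H0: {reject_h0}")
```

The recomputed F agrees with the 17.14 on the printout to two decimals, so the decision to reject the null hypothesis is unchanged.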
Interpreting the Coefficients • b0 = 38.14. This is the intercept, the value of Y when all the variables take the value zero. Since the data ranges of the independent variables do not cover the value zero, do not interpret the intercept. • b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by .0076% (assuming the other variables are held constant).
b2 = 1.65. In this model, for each additional mile between the nearest competitor and a La Quinta inn, the operating margin increases on average by 1.65%, when the other variables are held constant. • b3 = 0.020. For each additional 1000 sq-ft of office space, the operating margin increases on average by .02%, when the other variables are held constant. • b4 = 0.21. For each additional thousand students, the operating margin increases on average by .21%, when the other variables are held constant.
b5 = 0.41. For each increment of $1000 in median household income, the operating margin would increase on average by .41%, when the other variables remain constant. • b6 = -0.23. For each additional mile to the downtown center, the operating margin decreases on average by .23%, when the other variables are held constant.
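Testing Individual Coefficients • The hypotheses for each bi are: H0: bi = 0, H1: bi ≠ 0 • Test statistic: t = (estimate of bi) / (standard error of that estimate), with d.f. = n - k - 1 • Excel output: in the printout, two coefficients are annotated "Insufficient evidence" and one row is annotated "Ignore".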
La Quinta Inns, Point Estimate (data file Xm18-01) • Predict the average operating margin of an inn at a site with the following characteristics: • 3815 rooms within 3 miles, • Closest competitor .9 miles away, • 476,000 sq-ft of office space, • 24,500 college students, • $35,000 median household income, • 11.2 miles distance to downtown center. MARGIN = 38.14 - 0.0076 (3815) + 1.65 (.9) + 0.020 (476) + 0.21 (24.5) + 0.41 (35) - 0.23 (11.2) = 37.1%
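The point estimate is just the sample regression equation evaluated at the site's values, with office space, enrollment and income expressed in thousands. A short sketch of the arithmetic follows; the dictionary layout and the function name are illustrative choices, while the numbers are those given above.

```python
# Coefficients from the sample regression equation in this example
coefficients = {
    "Intercept": 38.14,
    "Rooms": -0.0076,
    "Nearest": 1.65,
    "Office": 0.020,
    "College": 0.21,
    "Income": 0.41,
    "Disttwn": -0.23,
}

# Site characteristics; Office, College and Income are in thousands
site = {
    "Rooms": 3815,
    "Nearest": 0.9,
    "Office": 476,
    "College": 24.5,
    "Income": 35,
    "Disttwn": 11.2,
}


def predict_margin(site_values):
    """Evaluate the prediction equation for one site."""
    return coefficients["Intercept"] + sum(
        coefficients[name] * value for name, value in site_values.items()
    )


predicted = predict_margin(site)
print(f"Predicted operating margin: {predicted:.1f}%")
```

Keeping the units consistent with the fitted data matters: entering 476,000 instead of 476 for office space would inflate the prediction by several thousand percentage points.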
Regression Diagnostics • The conditions required for the model assessment to apply must be checked. • Is the error variable normally distributed? Draw a histogram of the residuals. • Is the error variance constant? Plot the residuals versus the predicted values of Y. • Are the errors independent? Plot the residuals versus the time periods. • Can we identify outliers? • Is multicollinearity (correlation between the Xi's) a problem?
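One simple first look at multicollinearity is the pairwise correlation between the independent variables. The sketch below computes Pearson correlations in pure Python; the function name, the variable names X1 to X3 and the data are all hypothetical, and no particular cutoff for "too high" is implied by the source.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predictors: X2 moves almost in step with X1, X3 does not
predictors = {
    "X1": [1, 2, 3, 4, 5],
    "X2": [2.1, 3.9, 6.2, 7.8, 10.1],
    "X3": [5, 1, 4, 2, 3],
}

names = list(predictors)
pairs = {
    (a, b): pearson(predictors[a], predictors[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for (a, b), r in pairs.items():
    print(f"corr({a}, {b}) = {r:.3f}")
```

A pair with correlation close to 1 or -1, such as X1 and X2 here, signals that the two variables carry nearly the same information, which makes their individual coefficients hard to interpret.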