Ch 17 Lab
Introduction
• In this chapter we extend the simple linear regression model to allow for any number of independent variables.
• We expect to build a model that fits the data better than the simple linear regression model.

Introduction
• We shall use computer printout to
  • Assess the model
    • How well does it fit the data?
    • Is it useful?
    • Are any required conditions violated?
  • Employ the model
    • Interpreting the coefficients
    • Making predictions using the prediction equation
    • Estimating the expected value of the dependent variable
Model and Required Conditions
• We allow for k independent variables to potentially be related to the dependent variable:

  y = β0 + β1x1 + β2x2 + … + βkxk + ε

  where y is the dependent variable, x1, …, xk are the independent variables, β0, β1, …, βk are the coefficients, and ε is the random error variable.
The Least Squares Method
• Least Squares Criterion: choose the coefficient estimates to minimize SSE = Σ(yi − ŷi)².
• Computation of Coefficients' Values: the formulas for the regression coefficients b0, b1, b2, …, bk involve matrix algebra. We will rely on computer software packages to perform the calculations (a sketch follows).
• A Note on Interpretation of Coefficients: bi represents an estimate of the change in y corresponding to a one-unit change in xi when all other independent variables are held constant.
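The slides defer the matrix algebra to software, but the computation itself is compact. A minimal sketch, assuming a small made-up dataset (the numbers are illustrative, not from the text): the normal equations (XᵀX)b = Xᵀy yield the coefficient estimates.

```python
import numpy as np

# Illustrative data: 5 observations, 2 independent variables (not from the text)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.0, 7.0, 12.0, 13.0, 17.0])

# Prepend a column of ones so the intercept b0 is estimated as well
X1 = np.column_stack([np.ones(len(y)), X])

# Solve the normal equations (X'X) b = X'y for the least squares coefficients
b = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(b)  # [b0, b1, b2]
```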
Multiple Regression for k = 2
• The simple linear regression model allows for one independent variable x: y = β0 + β1x + ε. Its graph is a straight line.
• The multiple linear regression model allows for more than one independent variable. For k = 2, y = β0 + β1x1 + β2x2 + ε, and the straight line becomes a plane over the (x1, x2) space.
Required conditions for the error variable
• The error ε is normally distributed.
• The mean is equal to zero and the standard deviation σε is constant for all values of y.
• The errors are independent.
Estimating the Coefficients and Assessing the Model, Example • Example 17.1 Where to locate a new motor inn? • La Quinta Motor Inns is planning an expansion. • Management wishes to predict which sites are likely to be profitable. • Several areas where predictors of profitability can be identified are: • Competition • Market awareness • Demand generators • Demographics • Physical quality
Estimating the Coefficients and Assessing the Model, Example
[Diagram: profitability (operating margin) and its proposed predictors]
• Rooms – number of hotel/motel rooms within 3 miles of the site (competition)
• Nearest – distance to the nearest competing hotel (market awareness)
• Office space, College enrollment – demand generators (customers)
• Income – median household income (community)
• Disttwn – distance to downtown (physical)
Estimating the Coefficients and Assessing the Model, Example
• Data were collected from 100 randomly selected inns that belong to La Quinta, and the following model was run:
  Margin = β0 + β1Rooms + β2Nearest + β3Office + β4College + β5Income + β6Disttwn
  (data file: Xm17-01)
Regression Analysis, Excel Output
• This is the sample regression equation (sometimes called the prediction equation):
  Margin = 38.14 − 0.0076Number + 1.65Nearest + 0.020Office Space + 0.21Enrollment + 0.41Income − 0.23Distance
Model Assessment
• The model is assessed using three tools:
  • The standard error of estimate
  • The coefficient of determination
  • The F-test of the analysis of variance
• The standard error of estimate is also used in constructing the other two tools.
Standard Error of Estimate
• The standard deviation of the error is estimated by the standard error of estimate:
  sε = √(SSE / (n − k − 1))
• The magnitude of sε is judged by comparing it to the mean of the dependent variable, ȳ.
Standard Error of Estimate
• From the printout, sε = 5.51.
• Compared with the mean value of y calculated from the data, sε does not seem particularly small.
• Question: Can we conclude the model does not fit the data well?
Coefficient of Determination
• The definition is R² = 1 − SSE / SS(Total).
• From the printout, R² = 0.5251: 52.51% of the variation in operating margin is explained by the six independent variables; 47.49% remains unexplained.
• When adjusted for degrees of freedom,
  Adjusted R² = 1 − [SSE/(n − k − 1)] / [SS(Total)/(n − 1)] = 49.44%
  (both statistics are checked in the sketch below).
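Both formulas are easy to verify in code. The sketch below uses SSE and SS(Total) values back-calculated from the printout quantities (sε = 5.51, R² = 0.5251), so treat them as approximations rather than the actual Excel figures.

```python
def r_squared(sse, sst):
    """Coefficient of determination: 1 - SSE / SS(Total)."""
    return 1 - sse / sst

def adjusted_r_squared(sse, sst, n, k):
    """R^2 adjusted for degrees of freedom."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))

# La Quinta example: n = 100 inns, k = 6 predictors.
# SSE and SS(Total) are back-calculated from s_e = 5.51 and R^2 = 0.5251,
# so they are approximate, not the actual printout values.
sse, sst, n, k = 2823.5, 5945.6, 100, 6
print(round(r_squared(sse, sst), 4))                 # ~0.5251
print(round(adjusted_r_squared(sse, sst, n, k), 4))  # ~0.4945 (printout: 49.44%)
```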
Why use the adjusted coefficient of determination?
• Whenever an independent variable is added, SSR increases (it can never decrease), so R² necessarily increases.
• Mathematically: SST = SSR + SSE. Total variation is fixed, so adding one more independent variable shrinks SSE, and SSR grows.
• Graphically: with two independent variables, the residuals are the shortest distances to the regression plane; with one independent variable, they are the shortest distances to the regression line.
Testing the Validity of the Model
• We pose the question: Is there at least one independent variable linearly related to the dependent variable?
• To answer the question we test the hypotheses
  H0: β1 = β2 = … = βk = 0
  H1: At least one βi is not equal to zero.
• If at least one βi is not equal to zero, the model has some validity.
Testing the Validity of the La Quinta Inns Regression Model
• The hypotheses are tested by an ANOVA procedure (the Excel output):

| Source     | d.f.      | SS  | MS                  | F       |
|------------|-----------|-----|---------------------|---------|
| Regression | k         | SSR | MSR = SSR/k         | MSR/MSE |
| Residual   | n − k − 1 | SSE | MSE = SSE/(n−k−1)   |         |
| Total      | n − 1     |     |                     |         |
Testing the Validity of the La Quinta Inns Regression Model
• [Total variation in y] = SSR + SSE. A large F results from a large SSR; in that case much of the variation in y is explained by the regression model, the model is useful, and the null hypothesis should be rejected. Therefore, the rejection region is
• Rejection region: F > Fα,k,n−k−1
Testing the Validity of the La Quinta Inns Regression Model
• Fα,k,n−k−1 = F0.05,6,100−6−1 = 2.17, and F = 17.14 > 2.17. Also, the p-value (Significance F) = 0.0000. Reject the null hypothesis.
• Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one βi is not equal to zero; thus, at least one independent variable is linearly related to y. This linear regression model is valid. (A computational sketch follows.)
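A sketch of the same decision computed directly, assuming only the figures quoted on the slide (F = 17.14, k = 6, n − k − 1 = 93); scipy recomputes the critical value slightly more precisely than the table lookup.

```python
from scipy import stats

F, k, df2, alpha = 17.14, 6, 93, 0.05   # from the slide: k = 6, n - k - 1 = 93

F_crit = stats.f.ppf(1 - alpha, k, df2)  # critical value F_{0.05,6,93}
p_value = stats.f.sf(F, k, df2)          # P(F_{6,93} > 17.14)

print(round(F_crit, 2))                  # ~2.2 (the slide's table value: 2.17)
print(F > F_crit, p_value < alpha)       # True True -> reject H0
```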
Testing the Coefficients
• The hypotheses for each βi are
  H0: βi = 0
  H1: βi ≠ 0
• Test statistic: t = (bi − βi) / s_bi, with d.f. = n − k − 1. See the Excel printout for each coefficient's t statistic and p-value; a sketch follows.
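A minimal sketch of the coefficient t test. The standard error s_bi = 0.63 below is a hypothetical placeholder (the slide does not reproduce the printout's standard errors); only the formula is taken from the slide.

```python
from scipy import stats

def t_test_coefficient(b_i, se_bi, n, k, alpha=0.05):
    """Two-tail t test of H0: beta_i = 0 vs H1: beta_i != 0."""
    t = b_i / se_bi                      # (b_i - 0) / s_{b_i}
    df = n - k - 1
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    p_value = 2 * stats.t.sf(abs(t), df)
    return t, t_crit, p_value

# b_i = 1.65 is the Nearest coefficient from the printout; the standard
# error 0.63 is a hypothetical placeholder for illustration only.
t, t_crit, p = t_test_coefficient(b_i=1.65, se_bi=0.63, n=100, k=6)
print(round(t, 2), round(t_crit, 2), round(p, 4))  # reject H0 if |t| > t_crit
```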
Using the Linear Regression Equation
• The model can be used for making predictions by
  • producing a prediction interval estimate of a particular value of y, for given values of the xi;
  • producing a confidence interval estimate of the expected value of y, for given values of the xi.
• The model can also be used to learn about the relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients bi.
La Quinta Inns, Predictions
• Predict the average operating margin of an inn at a site with the following characteristics (Xm17-01):
  • 3815 rooms within 3 miles,
  • closest competitor 0.9 miles away,
  • 476,000 sq ft of office space,
  • 24,500 college students,
  • $35,000 median household income,
  • 11.2 miles to downtown center.
MARGIN = 38.14 − 0.0076(3815) + 1.65(.9) + 0.020(476) + 0.21(24.5) + 0.41(35) − 0.23(11.2) = 37.1%
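The plug-in arithmetic can be checked directly. The sketch below uses the slide's coefficients and site characteristics, with the predictors expressed in the model's units (office space in thousands of sq ft, enrollment in thousands, income in $1000s).

```python
# Coefficients of the sample regression equation (from the Excel output)
b = [38.14, -0.0076, 1.65, 0.020, 0.21, 0.41, -0.23]

# Site characteristics, in the units the model expects
x = [3815,   # rooms within 3 miles
     0.9,    # miles to the closest competitor
     476,    # office space (thousands of sq ft)
     24.5,   # college enrollment (thousands)
     35,     # median household income ($1000s)
     11.2]   # miles to downtown

margin = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
print(round(margin, 1))  # 37.1 (%)
```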
La Quinta Inns, Predictions
• Interval estimates by Excel (Data Analysis Plus):
  • It is predicted, with 95% confidence, that the operating margin of this particular site will lie between 25.4% and 48.8%.
  • It is estimated, with 95% confidence, that the average operating margin of all sites that fit this category falls between 33.0% and 41.2%.
  • The average inn in this category would not be profitable (margin less than 50%).
Example 1
Consider the following statistics of a multiple regression model: total variation in y = SS(Total) = 1000, SSE = 300, n = 50, and k = 4.
(a) Determine the standard error of estimate.
(b) Determine the multiple coefficient of determination.
(c) Determine the F statistic.
ANSWER: (a) 2.582  (b) 70%  (c) F = MSR/MSE = 26.25
(The arithmetic is sketched below.)
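The arithmetic behind the three answers, as a short sketch:

```python
import math

sst, sse, n, k = 1000, 300, 50, 4
ssr = sst - sse                        # SSR = 700

se = math.sqrt(sse / (n - k - 1))      # (a) standard error of estimate
r2 = 1 - sse / sst                     # (b) coefficient of determination
f = (ssr / k) / (sse / (n - k - 1))    # (c) F = MSR / MSE

print(round(se, 3), r2, round(f, 2))   # 2.582 0.7 26.25
```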
Example 2
• Pat Statsdud, a student ranking near the bottom of the class, decided that a certain amount of studying could actually improve final grades. However, too much studying would not be warranted, since Pat's ambition (if that's what one could call it) was to ultimately graduate with the absolute minimum level of work. Pat was registered in a statistics course, which had only 3 weeks to go before the final exam, and where the final grade was determined in the following way: Total mark = 20% (Assignment) + 30% (Midterm test) + 50% (Final exam). To determine how much work to do in the remaining 3 weeks, Pat needed to be able to predict the final exam mark on the basis of the assignment mark and the midterm mark. Pat's marks on these were 12/20 and 14/30, respectively. Accordingly, Pat undertook the following analysis: the final exam mark, assignment mark, and midterm test mark for 30 students who took the statistics course last year were collected.
Example 2 continued
• a. Determine the regression equation.
• b. What is the standard error of estimate? Briefly describe how you interpret this statistic.
• c. What is the coefficient of determination? What does this statistic tell you?
• d. Test the validity of the model.
• e. Interpret each of the coefficients.
• f. Can Pat infer that the assignment mark is linearly related to the final grade in this model?
• g. Can Pat infer that the midterm mark is linearly related to the final grade in this model?
• h. Predict Pat's final exam mark with 95% confidence.
• i. Predict Pat's final grade with 95% confidence.
[Supplement] Partial F Test (I)
• Usage: determine when to add or delete a group of variables.
• Assume the original model has q independent variables.
• If we add variables Xq+1, Xq+2, …, Xp to the model, test H0: βq+1 = … = βp = 0 with
  F = [(SSE_reduced − SSE_full) / (p − q)] / [SSE_full / (n − p − 1)]
  which under H0 has an F distribution with p − q and n − p − 1 degrees of freedom.
Example 1
• Ten days of driving records were sampled from a trucking company, recording miles driven (X1), freight volume (X2), vehicle type (X3), and driving time (Y). Computer output gave the following sums of squares (SS) and degrees of freedom:
• SSR(X1, X2, X3) = 270, d.f. = 3; SSR(X1, X2) = 250, d.f. = 2; SSE(X1, X2, X3) = 30, d.f. = 6
• Test whether X3 is worth adding to the model (α = 5%). (A worked sketch follows.)
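A worked sketch of the partial F test for this problem. Since SST is fixed, SSR(X1,X2,X3) − SSR(X1,X2) equals the drop in SSE from adding X3, so the formula from the previous slide applies directly; the critical value F0.05,1,6 ≈ 5.99.

```python
from scipy import stats

ssr_full, ssr_reduced = 270, 250  # SSR with and without X3
sse_full, df_full = 30, 6         # full-model SSE and its d.f.
num_added = 1                     # one variable (X3) added

# Partial F: extra regression sum of squares per added variable,
# divided by the full model's MSE
F = ((ssr_full - ssr_reduced) / num_added) / (sse_full / df_full)
F_crit = stats.f.ppf(0.95, num_added, df_full)

print(round(F, 2), round(F_crit, 2))  # 4.0 5.99
print(F > F_crit)                     # False -> X3 is not worth adding at 5%
```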
Regression Diagnostics
• The conditions required for the model assessment to apply must be checked:
  • Is the error variable normally distributed? → Draw a histogram of the residuals.
  • Is the error variance constant? → Plot the residuals versus ŷ.
  • Are the errors independent? → Plot the residuals versus the time periods.
  • Can we identify outliers?
  • Is multicollinearity (intercorrelation) a problem?
Diagnostics: Multicollinearity
• Example 17.2: Predicting house price (Xm17-02)
  • A real estate agent believes that a house's selling price can be predicted from the house size, number of bedrooms, and lot size.
  • A random sample of 100 houses was drawn and the data recorded.
  • Analyze the relationships among the four variables.
Diagnostics: Multicollinearity
• The proposed model is PRICE = β0 + β1BEDROOMS + β2H-SIZE + β3LOTSIZE + ε
• The model is valid, yet no single variable is significantly related to the selling price?!
Diagnostics: Multicollinearity
• Multicollinearity is found to be a problem.
• Multicollinearity causes two kinds of difficulties:
  • The t statistics appear to be too small.
  • The b coefficients cannot be interpreted as "slopes".
(A sketch of one common screening check follows.)
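One common screening check (not shown on the slide, so treat it as a supplementary illustration) is the pairwise correlation matrix of the predictors; correlations near ±1 among the x's signal multicollinearity. The data below are synthetic, constructed only to mimic the house-price situation.

```python
import numpy as np

# Synthetic predictors mimicking the house-price situation:
# house size tracks bedrooms, and lot size tracks house size
rng = np.random.default_rng(0)
bedrooms = rng.integers(2, 6, size=100).astype(float)
h_size = 500 * bedrooms + rng.normal(0, 150, size=100)
lot_size = 2 * h_size + rng.normal(0, 400, size=100)

X = np.column_stack([bedrooms, h_size, lot_size])
print(np.round(np.corrcoef(X, rowvar=False), 2))
# Off-diagonal correlations near 1 are the multicollinearity signal:
# each predictor carries little information the others do not.
```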
Remedying Violations of the Required Conditions • Nonnormality or heteroscedasticity can be remedied using transformations on the y variable. • The transformations can improve the linear relationship between the dependent variable and the independent variables. • Many computer software systems allow us to make the transformations easily.
Diagnostics: The Error Distribution
[Histogram of the residuals] The histogram suggests the errors may be normally distributed.
Diagnostics: Heteroscedasticity
[Plot of residuals versus predicted y] There appears to be no problem of heteroscedasticity (the error variance seems constant).
Reducing Nonnormality by Transformations
• A brief list of transformations (an illustration follows):
  • y′ = log y (for y > 0): use when σε increases with y, or when the error distribution is positively skewed.
  • y′ = y²: use when σε² is proportional to E(y), or when the error distribution is negatively skewed.
  • y′ = y^(1/2) (for y > 0): use when σε² is proportional to E(y).
  • y′ = 1/y: use when σε² increases significantly when y increases beyond some critical value.
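As a small illustration of the first transformation, assuming a hypothetical positive response variable:

```python
import numpy as np

y = np.array([1.2, 3.5, 8.0, 20.1, 55.3])  # hypothetical positive responses

# y' = log y compresses large values, which can stabilise an error
# variance that grows with y and reduce positive skew
y_log = np.log(y)
print(np.round(y_log, 3))
# Refit the regression with y_log as the dependent variable, then
# back-transform predictions with np.exp.
```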
Diagnostics: First Order Autocorrelation
[Plot of residuals over time] The errors are not independent!
Durbin-Watson Test: Are the Errors Autocorrelated?
• This test detects first order autocorrelation between consecutive residuals in a time series.
• If autocorrelation exists, the error variables are not independent.
• The statistic is
  d = Σ_{i=2..n} (e_i − e_{i−1})² / Σ_{i=1..n} e_i²
  where e_i is the residual at time i.
Positive First Order Autocorrelation
[Plot: residuals drift slowly around zero over time]
Positive first order autocorrelation occurs when consecutive residuals tend to be similar. Then the value of d is small (less than 2).
Negative First Order Autocorrelation
[Plot: residuals swing sign from one period to the next]
Negative first order autocorrelation occurs when consecutive residuals tend to differ markedly. Then the value of d is large (greater than 2).
One-Tail Test for Positive First Order Autocorrelation
• If d < dL, there is enough evidence to conclude that positive first-order correlation exists.
• If d > dU, there is not enough evidence to conclude that positive first-order correlation exists.
• If d is between dL and dU, the test is inconclusive.
One-Tail Test for Negative First Order Autocorrelation
• If d > 4 − dL, negative first order correlation exists.
• If d < 4 − dU, negative first order correlation does not exist.
• If d falls between 4 − dU and 4 − dL, the test is inconclusive.
Two-Tail Test for First Order Autocorrelation
• If d < dL or d > 4 − dL, first order autocorrelation exists.
• If d falls between dL and dU, or between 4 − dU and 4 − dL, the test is inconclusive.
• If d falls between dU and 4 − dU, there is no evidence of first order autocorrelation.
Diagnostics: First Order Autocorrelation
• Using the computer (Excel):
  • Tools > Data Analysis > Regression (check the residual option, then OK)
  • Tools > Data Analysis Plus > Durbin-Watson Statistic > highlight the range of the residuals from the regression run > OK
• Test for positive first order autocorrelation: n = 20, k = 2. From the Durbin-Watson table, dL = 1.10 and dU = 1.54. The statistic is d = 0.5931.
• Conclusion: Because d < dL, there is sufficient evidence to infer that positive first order autocorrelation exists. (A computational sketch of d follows.)
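The d statistic itself is a one-line computation from the residual series. A minimal sketch with hypothetical residuals (the slide's actual residuals are in the data file; there, d = 0.5931):

```python
import numpy as np

def durbin_watson(e):
    """d = sum_{i=2..n} (e_i - e_{i-1})^2 / sum_{i=1..n} e_i^2."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residuals with a positive first-order pattern:
# consecutive values are similar, so d comes out well below 2
e = [1.1, 0.9, 0.7, 0.2, -0.3, -0.8, -1.0, -0.6, 0.1, 0.8]
print(round(durbin_watson(e), 3))  # ~0.38
```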