Welcome to BUAD 310 Instructor: Kam Hamidieh Lecture 25, Monday April 28, 2014
Agenda & Announcement Today: • Go over exam 2 briefly • Continue with Multiple Regression • Extra office hours this week: Thursday May 1, 3-5 PM. Reminder: • Case Study due on Wednesday April 30 by 5 PM. • Homework 7, now posted, is due on Friday, May 2, 2014 • Final Exam on Thursday May 8th, 11 AM – 1:00 PM, in room THH 101. See http://web-app.usc.edu/maps/ BUAD 310 - Kam Hamidieh
From Last Time
Checking Assumptions:
• Y is linear in each predictor:
  - Look at the plot of Y against each X.
  - Plot the residuals versus each X and the fitted values.
• Constant Variance Assumption:
  - Plot the residuals versus each X and the fitted values.
• Normality Assumption:
  - Look at the histogram & Q-Q plot of the residuals.
• Independence Assumption:
  - If the data have a time or spatial ordering, plot the residuals in that order and look for patterns.
Also: Watch for unusual points.
Transformations
• Transformation: re-expression of a variable by applying a function to each observation.
• Transformations allow the use of regression analysis to describe a curved pattern and can improve your model (better residual analysis).
• You can transform Y or X or both, but the interpretation becomes more difficult.
• Looking at the plots of Y vs X and the distributions of your variables can help you pick the right transformation.
• A nonlinear transformation useful in business applications: logarithms.
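To see why logarithms can straighten a curved pattern, here is a minimal sketch on made-up data. The power-law form y = 3·x² and the helper `fit_line` are assumptions for illustration, not part of the lecture. Taking logs of both variables turns the curve into a straight line whose slope is the exponent.

```python
import math

def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a simple regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical curved data: y = 3 * x^2 (no noise, to keep the point visible).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3 * xi ** 2 for xi in x]

# Re-express both variables on the log scale:
# log(y) = log(3) + 2 * log(x), which is linear in log(x).
log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]

intercept, slope = fit_line(log_x, log_y)
print(f"slope on the log-log scale: {slope:.3f}")
print(f"exp(intercept): {math.exp(intercept):.3f}")
```

With real data the fit will not be exact, and the residual plots should still be checked on the transformed scale.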
Collinearity - Example
The file “testdata.txt” on our website contains the data shown on the right.
The data were synthetically generated from:
Y = 25 – 5 X1 + N(0, sd = 10)
X1 = 1, 2, …, 20
X2 is the same as X1 except for the last point.
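A rough sketch of how data like this could be generated. The seed, the helper names, and in particular the value 21 for the last point of X2 are assumptions; the slide does not say what the last point is, and the numbers will not match “testdata.txt”. With this assumed value, the squared correlation between X1 and X2 comes out at roughly 0.9988.

```python
import random

def corr_squared(a, b):
    """Squared sample correlation between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab ** 2 / (saa * sbb)

def simple_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

random.seed(310)  # arbitrary seed so the sketch is reproducible

x1 = list(range(1, 21))   # X1 = 1, 2, ..., 20
x2 = x1[:-1] + [21]       # ASSUMPTION: only the last point differs (21 instead of 20)
y = [25 - 5 * x + random.gauss(0, 10) for x in x1]  # Y = 25 - 5 X1 + N(0, sd = 10)

r2_x1_x2 = corr_squared(x1, x2)
slope_y_on_x1 = simple_slope(x1, y)
print(f"squared correlation of X1 and X2: {r2_x1_x2:.4f}")
print(f"slope of Y on X1 alone: {slope_y_on_x1:.2f}")
```

Because X1 and X2 carry almost identical information, each one alone predicts Y well, which is what sets up the surprise in the multiple regression.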
Pairs Plot
• Comments?
• Do you think there is a linear relationship between Y and X1?
• How about Y and X2?
• Will the R² be small or large?
• Will the p-value for the F-stat be small or large?
• Do you expect each of the predictors X1 and X2 to be statistically significant?
Results of Simple Linear Regressions
The results of the simple linear regressions are “good”.
Multiple Regression Results
What?!?! Do the results make sense?
Multicollinearity
• The problem of multicollinearity: if two or more predictor variables are highly correlated with each other, the regression results can become very unreliable.
• How to detect it:
  - Examine the pairs plot and correlation matrix; if you see correlations of 0.9 or higher, you should suspect multicollinearity.
  - High F-stat value but none of the predictors is significant.
  - Standard errors seem very large.
  - More quantitative approach: compute the variance inflation factor, or VIF.
  - Others…
Variance Inflation Factor
• Suppose you have a multiple regression model with k predictors.
• The variance inflation factor (VIF) for Xj, j = 1, 2, …, k, is defined as:
  VIF(Xj) = 1/(1 – Rj²)
  where Rj² is the R² in the regression of Xj on all of the other predictor variables. (No Y involved.)
• Why is this a good idea?
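The definition can be sketched directly in code: regress each predictor on the others, take that R², and plug it into 1/(1 – R²). This is a minimal illustration using NumPy; the function name `vif` and the small data set are hypothetical, and the sketch assumes no predictor column is constant.

```python
import numpy as np

def vif(X):
    """VIF of each column of X (rows = observations, columns = predictors)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    result = []
    for j in range(k):
        target = X[:, j]                      # predictor Xj plays the role of the response
        others = np.delete(X, j, axis=1)      # all the other predictors (Y is never used)
        design = np.column_stack([np.ones(n), others])  # add an intercept column
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        sse = float(resid @ resid)
        sst = float(((target - target.mean()) ** 2).sum())
        r2_j = 1.0 - sse / sst
        # A perfect fit means Xj is an exact linear combination of the others.
        result.append(float("inf") if r2_j >= 1.0 else 1.0 / (1.0 - r2_j))
    return result

# Hypothetical data: x_b tracks x_a closely, x_c is unrelated to both.
x_a = [1, 2, 3, 4, 5, 6, 7, 8]
x_b = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.8, 8.2]
x_c = [1, -1, -1, 1, 1, -1, -1, 1]
print(vif(np.column_stack([x_a, x_b, x_c])))
```

With only two predictors, Rj² is just their squared correlation, so both predictors get the same VIF.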
More on VIF
• It can be shown that se(bj) is proportional to the square root of VIF(Xj) = 1/(1 – Rj²).
• As Rj² gets close to 1, se(bj) gets bigger and bigger… gets inflated!
• …hence the name variance inflation factor.
• Don’t need to put this formula on your cheat sheet.
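For reference, one standard way to write the relationship between the standard error of a slope and the VIF is given below. The symbols are assumptions introduced here rather than taken from the slide: s_e is the residual standard deviation, n is the sample size, and s_{x_j} is the sample standard deviation of X_j.

```latex
\operatorname{se}(b_j)
  = \frac{s_e}{\sqrt{n-1}\; s_{x_j}} \cdot \sqrt{\mathrm{VIF}(X_j)},
\qquad
\mathrm{VIF}(X_j) = \frac{1}{1 - R_j^2}
```

The first factor is what the standard error would be if X_j were uncorrelated with the other predictors; the second factor is the inflation due to collinearity.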
Guidelines on VIF
• What is the range of values for VIF?
• What does a VIF near 1 mean?
• A cutoff of VIF = 5 or 10 is most often used to identify danger.
Our Example
R² = 0.9988
What is the VIF here?
VIF = 1/(1 – 0.9988) ≈ 833!
This is extremely large!!!
What to do then?
Here are some solutions:
• Amputation! Remove the redundant variables. Variable selection methods can help a lot.
• Re-express the predictors. For example: if it makes sense, you can create a new predictor by averaging two highly correlated predictors. Example?
From Our Real Estate Example
VIF for LTV: 1/(1 – 0.2369) ≈ 1.3
VIF for HomeValue: 1/(1 – 0.0558) ≈ 1.1
VIF for StatedIncome: 1/(1 – 0.0451) ≈ 1.0
VIF for CreditScore: 1/(1 – 0.2446) ≈ 1.3
Multicollinearity is not a problem here.
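The VIF arithmetic on these slides is easy to check with a few lines of code. This is only a sketch: the helper `vif_from_r2` and the choice of 5 as the flagging cutoff are illustrative, and the R² values are the ones quoted on the slides.

```python
def vif_from_r2(r2_j):
    """VIF = 1 / (1 - Rj^2), where Rj^2 comes from regressing Xj on the other predictors."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("Rj^2 must be in [0, 1)")
    return 1.0 / (1.0 - r2_j)

CUTOFF = 5  # one of the commonly used cutoffs (5 or 10)

# Rj^2 values quoted for the real estate example.
r2_by_predictor = {
    "LTV": 0.2369,
    "HomeValue": 0.0558,
    "StatedIncome": 0.0451,
    "CreditScore": 0.2446,
}

for name, r2 in r2_by_predictor.items():
    v = vif_from_r2(r2)
    flag = "check for collinearity" if v >= CUTOFF else "ok"
    print(f"{name}: VIF = {v:.1f} ({flag})")

# The synthetic collinearity example, with R^2 = 0.9988.
print(f"Collinearity example: VIF = {vif_from_r2(0.9988):.0f}")
```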
In Class Exercise 1
• Comment on the following statements. Do you agree or disagree and why?
  - The presence of multicollinearity violates an assumption of the multiple regression model.
  - In order to calculate the VIF for a predictor, we need to use the values of the response.
• An analyst would like to build a regression model to predict Y from X1, X2, X3, and X4. She looks at the correlation matrix below:
  - Do you see a pair of variables that could potentially cause a problem in her regression? Why?
  - What is the VIF for X2?
Confidence and Prediction Intervals
• The fitted value of the response corresponding to a particular combination of values x1, …, xk of the independent variables X1, …, Xk is
  ŷ = b0 + b1 x1 + … + bk xk
• We use this value as an estimate for the mean (or a future value) of y when X1 = x1, …, Xk = xk, but our estimate will not be exactly right.
• Therefore, we need to place bounds on how far this guess might be from the truth.
• We can do this by calculating a confidence interval for the mean value of y and a prediction interval for an individual value of y.
Which to Choose?
• Use the prediction interval (PI) when you want to predict an individual value of the response variable.
• Use the confidence interval (CI) when you want to estimate the mean value of the response.
• Note: the prediction interval will always be wider than the confidence interval (given the same values of the explanatory variables).
Example: Women’s Clothing Stores
Our variables of interest:
Y = sales at stores in a chain of women’s apparel (annually, in dollars per square foot of retail space)
X1 = median household income in the area (thousands of dollars)
X2 = number of competing apparel stores in the same mall
Goal: Predict sales at the stores of this chain.
Data: “23_mall_sales.txt”, n = 65
Regression Results
Estimated Mean Sales = 60 + 7.96 (Income) – 24.17 (Competitors), Se = 68.03
Prediction Intervals
• Suppose you want to create a prediction interval at a location with a median income of $70,000 and 3 competitors nearby.
• Your best point estimate for the mean sales and for an individual value will be the same:
  ŷ = 60 + 7.96 (70) – 24.17 (3) ≈ 545 $/(square foot)
• However, the widths of the intervals will be different.
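The point estimate is just the fitted equation evaluated at the new location. A small sketch follows; the function name `estimated_sales` is made up for illustration. Income is entered in thousands of dollars, to match how X1 is defined.

```python
def estimated_sales(income_thousands, competitors):
    """Fitted equation: Estimated Mean Sales = 60 + 7.96 (Income) - 24.17 (Competitors)."""
    return 60 + 7.96 * income_thousands - 24.17 * competitors

# Median income of $70,000 (entered as 70) and 3 competitors.
y_hat = estimated_sales(70, 3)
print(f"{y_hat:.2f} dollars per square foot")  # 544.69, which rounds to 545
```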
Using Software to Get CI & PI
• Your book gives an approximate (95%) formula for the prediction interval (see page 614; you can use the result on that page for your project as well).
• However, in practice, let reliable software do it.
• Software gives:
  - 95% CI for mean response: (527, 564)
  - 95% PI: (409, 683)
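As a rough check on the software output, the sketch below assumes the approximate formula is the common rule of thumb ŷ ± 2·Se. That is an assumption, since the formula itself is not reproduced in these notes, and the function name is hypothetical.

```python
def approx_prediction_interval(y_hat, s_e, multiplier=2.0):
    """Rule-of-thumb 95% prediction interval: y_hat +/- multiplier * s_e (assumed form)."""
    half_width = multiplier * s_e
    return y_hat - half_width, y_hat + half_width

# Fitted value at income = 70 (thousands of dollars) and 3 competitors, with Se = 68.03.
y_hat = 60 + 7.96 * 70 - 24.17 * 3
low, high = approx_prediction_interval(y_hat, 68.03)
print(f"approximate 95% PI: ({low:.0f}, {high:.0f})")  # (409, 681)
```

The result is close to the wider of the two software intervals, (409, 683). The software interval is slightly wider because it also accounts for the uncertainty in the estimated coefficients.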