1. Summer 2009
Clayton State University
School of Business
Dr. Reza Kheirandish
Managerial Statistics: A Case-Based Approach, by Klibanoff, Sandroni, Moselle, and Saraniti
2. Chapter Seven Objective: To identify and understand multicollinearity and its impact on multiple regression analysis
Topics:
Multicollinearity
Generalized F-test
Hidden Extrapolation
Omitted Variable Bias
Case: The Hot Dog Case
3. Hot Dog: Background Your company: Dubuque.
Ball Park: a leading brand.
Ball Park may reduce hot dog price.
Problem: Impact on Dubuque’s market share.
Some argue that the impact will be small because Oscar Mayer is Dubuque’s leading competitor.
4. Hot Dog: More Background Ball Park produces two Hot Dogs.
Regular and All-beef Hot Dogs.
Current Prices:
Ball Park 1.79 and 1.89 (regular and beef)
Dubuque 1.49
Oscar Mayer 1.69
Ball Park new pricing:
Regular 1.45, All-beef 1.55
5. Hot Dog: Questions How does Dubuque’s own price affect Dubuque’s market share?
How does Oscar Mayer’s price affect Dubuque’s market share?
How does Ball Park’s price affect Dubuque’s market share?
Who is Dubuque’s leading competitor, Ball Park or Oscar Mayer?
6. Hot Dog: Further Questions What will happen to Dubuque’s market share if Dubuque does not respond to Ball Park’s new campaign?
How much should Dubuque charge for its hot dog?
7. Hot Dog: Regression
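The regression on this slide is a Kstat screenshot. For readers following along outside Kstat, here is a minimal sketch of the same model in Python with statsmodels; the file name hotdog.csv and the column names share, pdub, and poscar are assumptions, while pbpreg and pbpbeef are the variable names the deck uses later.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Load the case data (file name and several column names are assumed).
df = pd.read_csv("hotdog.csv")

# Dubuque's market share regressed on its own price and the competitors' prices
# (pdub = Dubuque, poscar = Oscar Mayer, pbpreg/pbpbeef = Ball Park regular/all-beef).
model = smf.ols("share ~ pdub + poscar + pbpreg + pbpbeef", data=df).fit()

# Note the large p-values on pbpreg and pbpbeef in this output.
print(model.summary())
```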
8. Ball Park’s P-values
9. P-values Do the high p-values indicate that neither of Ball Park’s prices has a significant relationship to Dubuque’s share?
At first, you might think so, BUT the answer is NO!
This regression is suffering from a serious multicollinearity problem.
10. Multicollinearity Multicollinearity is the term used to describe the correlation among the independent variables.
A multicollinearity problem occurs when this correlation is high.
11. Correlations Kstat can generate this correlation table (a code sketch reproducing it follows below).
Note the high correlation (0.97938) between the two Ball Park prices.
Ball Park seems to be coordinating the pricing of its two hot dogs.
The following scatterplot shows this to be true.
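Outside Kstat, the same correlation table can be reproduced with pandas; a minimal sketch under the same assumed file and column names:

```python
import pandas as pd

df = pd.read_csv("hotdog.csv")  # assumed file name

# Correlation matrix of the independent variables only.
predictors = ["pdub", "poscar", "pbpreg", "pbpbeef"]
print(df[predictors].corr())

# A correlation near 1 between pbpreg and pbpbeef (about 0.98 in the case data)
# is the warning sign: the two Ball Park prices move almost in lockstep.
```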
12. Ball Park Prices
13. P-values revisited Individually, neither of Ball Park’s prices is significant. This may suggest dropping them from the regression.
This is NOT necessarily correct.
The high correlation makes us unable to separate the two “Ball Park effects” on our market share.
To decide, we must test whether the two Ball Park prices, taken together, are significant.
We can do this with an F-test.
14. The F-test
H0: βbpreg = βbpbeef = 0
Ha: at least one of the coefficients βbpreg and βbpbeef is not equal to zero
The t-tests reported by Kstat ask whether each coefficient, in isolation, is significantly different from zero.
This F-test asks: are they jointly significant?
15. The F-test First, click Statistics > Analysis of variance. You will obtain the following dialog box.
16. The F-test
17. The F-test The p-value is small (0.0000003).
We reject the null hypothesis H0: βbpreg = βbpbeef = 0
and accept the alternative hypothesis,
concluding that at least one of the Ball Park prices has an effect on Dubuque’s market share.
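The same joint test can be run outside Kstat; a hedged sketch with statsmodels, reusing the assumed file and column names from above:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hotdog.csv")  # assumed file and column names

full = smf.ols("share ~ pdub + poscar + pbpreg + pbpbeef", data=df).fit()

# Joint null hypothesis: the two Ball Park coefficients are both zero.
f_result = full.f_test("pbpreg = 0, pbpbeef = 0")
print(f_result)  # a tiny p-value says the Ball Park prices matter jointly
```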
18. How to Detect Multicollinearity Suppose that the correlation between two independent variables is 0.65 or 0.75. Is that a multicollinearity problem?
The variance inflation factor (VIF) is an indicator of a multicollinearity problem.
If a VIF is above 10, there is a serious multicollinearity problem.
Kstat computes VIFs automatically for you in the Model Analysis menu.
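The VIF rule of thumb can also be checked by hand; a sketch with statsmodels, under the same assumptions about the data file and column names:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("hotdog.csv")  # assumed file and column names
X = sm.add_constant(df[["pdub", "poscar", "pbpreg", "pbpbeef"]])

# One VIF per predictor (column 0 is the constant, so start at 1).
for i, name in enumerate(X.columns[1:], start=1):
    print(name, variance_inflation_factor(X.values, i))

# Values above roughly 10 flag a serious multicollinearity problem;
# the two Ball Park prices are the ones expected to blow up here.
```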
19. What Next? Let’s run a multiple regression without the Ball Park (all-beef) hot dog.
Just an experiment. NOT the final model.
20. Regression without BP Beef
21. What to do… Once we remove pbpbeef from the regression, the t-ratio of pbpreg skyrockets.
The same would happen to the t-ratio of pbpbeef if we removed pbpreg.
Keep the regression with BOTH prices, but interpret the p-values and coefficients with care (see the comparison sketch below).
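The "experiment" of dropping one Ball Park price can be reproduced the same way; this is only an illustration of why the t-ratio jumps, not the recommended final model (same assumed file and column names):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("hotdog.csv")  # assumed file and column names

full = smf.ols("share ~ pdub + poscar + pbpreg + pbpbeef", data=df).fit()
no_beef = smf.ols("share ~ pdub + poscar + pbpreg", data=df).fit()

# With pbpbeef removed, pbpreg has to carry both Ball Park effects,
# so its t-ratio jumps even though the data have not changed.
print(full.tvalues[["pbpreg", "pbpbeef"]])
print(no_beef.tvalues["pbpreg"])
```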
22. Conclusion (even more) Ball Park is Dubuque’s leading competitor.
Dubuque’s market share falls by an estimated 0.045% for each cent that both of Ball Park’s prices decrease.
Ball Park’s planned cut is 34 cents on each hot dog (1.79 → 1.45 and 1.89 → 1.55), so Dubuque’s market share is expected to fall by about 1.5% (0.045% × 34).
To maintain its market share, Dubuque should reduce its own price by about 20 cents, since 0.076% × 20 ≈ 1.5%.
23. Summary Ball Park’s two prices are highly correlated.
This creates a multicollinearity problem.
We cannot accurately estimate separate effects for the two Ball Park prices.
But jointly they do have an effect.
24. Other Issues The case also asks us to consider a new pricing strategy for Ball Park.
They might be planning to charge 1.45 for the regular hot dog and 1.95 for the all-beef version.
Can we just plug both of these values into Kstat’s prediction menu and make a forecast for our market share?
Nope.
25. Extrapolation We might want to check these new prices to see whether or not we are extrapolating.
That is, does our data set include prices like these within it, or are we making estimates beyond the domain of our existing experience?
Let’s start by looking at the univariate statistics in Kstat.
26. Univariate Statistics This looks okay.
A price of 145 falls between the min and max values for pbpreg.
A price of 195 falls between the min and max values for pbpbeef.
Taken one at a time, prices of 145 and 195 are nothing new to us.
27. But Wait! Consider the Pair of Prices
28. Hidden Extrapolation The X values we are considering are jointly far from the data set that we are using.
We just won’t see it by looking at them one at a time.
Ball Park’s prices have always been within 10 cents of one another. They have never been 50 cents apart in the data set that we are using.
Be aware of this possibility whenever you make predictions using sets of highly correlated variables (one way to check is sketched below).
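One common numerical check for hidden extrapolation (not necessarily what Kstat does) is to compare the leverage of the proposed price combination with the largest leverage in the data: if the new point's leverage is higher, it lies outside the region the regression has actually seen. A sketch, assuming prices are recorded in cents and that Dubuque and Oscar Mayer stay at 149 and 169:

```python
import numpy as np
import pandas as pd

df = pd.read_csv("hotdog.csv")  # assumed file and column names
X = df[["pdub", "poscar", "pbpreg", "pbpbeef"]].to_numpy(dtype=float)
X = np.column_stack([np.ones(len(X)), X])            # add an intercept column

XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # hat values of the sample

# Proposed scenario: Dubuque 149, Oscar Mayer 169 (assumed unchanged),
# Ball Park regular 145 and all-beef 195.
x0 = np.array([1.0, 149.0, 169.0, 145.0, 195.0])
h0 = x0 @ XtX_inv @ x0

# h0 well above leverage.max() signals hidden extrapolation.
print(h0, leverage.max(), h0 > leverage.max())
```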
29. Omitted Variable Bias Multicollinearity creates a bias because of the variables that are INCLUDED in the regression
Omitted variable bias is caused by the variables LEFT OUT of the regression 29
30. Strike Outs
31. Strike Outs: OVB Do more strike-outs lead to higher salaries?
No
But something is up. This isn’t spurious
On average, players with more strike-outs DO make more money than others…
They also hit more home runs and play more games and get more hits and …
Omitting other variables biases the coefficient on strike-outs.
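A small simulation makes the bias concrete. The data below are synthetic and purely illustrative (not the case's salary data): strike-outs lower salary directly, but players who hit more home runs also strike out more, and home runs pay.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Players who hit more home runs also strike out more.
home_runs = rng.poisson(15, n).astype(float)
strike_outs = 3 * home_runs + rng.normal(0, 10, n)

# "True" model: salary rises with home runs, falls with strike-outs.
salary = 50 * home_runs - 5 * strike_outs + rng.normal(0, 30, n)

short = sm.OLS(salary, sm.add_constant(strike_outs)).fit()
both = sm.OLS(salary, sm.add_constant(np.column_stack([strike_outs, home_runs]))).fit()

print(short.params[1])   # positive: the omitted home-run effect leaks in
print(both.params[1:])   # close to the true values (-5, 50)
```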
32. Strike-outs with Home runs added
33. Two Interpretations of Two Coefficients The first regression (w/out Home runs) answers the following question:
On average, how much does salary change for every strike-out?
The second regression with both variables measures the effect of strike-outs on salary holding the number of home runs fixed.
That is, on average, for a player with a given number of home runs, how much does salary change for every strike-out?
34. Influence Diagram Strike-outs have a direct (negative) correlation with salary.
Strike-outs also have a positive correlation with home runs which have a positive correlation with salary.
This creates a second, indirect effect, which in this case dominates the direct one.
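The direct and indirect arrows in the influence diagram add up according to the standard omitted-variable-bias identity; in the notation of this example (the β's are the coefficients of the regression with both variables, and δ is the slope from regressing home runs on strike-outs):

```latex
b^{\text{short}}_{\text{SO}}
  \;=\;
  \underbrace{\beta_{\text{SO}}}_{\text{direct effect}}
  \;+\;
  \underbrace{\beta_{\text{HR}}\,\delta}_{\text{indirect effect via home runs}}
```

When the indirect term is large and positive, it can swamp a negative direct effect, which is exactly the strike-out story.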
35. Conclusions Omitted Variable Bias can distort coefficients
Leaving out correlated variables forces the variables that are present to carry the weight of both direct and indirect effects
Including the relevant variables isolates their effects, so each coefficient measures only its direct effect.
Building models is hard. You can’t include everything, so there will always be some OVB.
Including some highly correlated variables can create other problems like multicollinearity