Class 23. The most over-rated statistic. The four assumptions. The most important hypothesis test yet. Using yes/no variables in regressions. Adjusted R-square. Pg 9-12 Pfeifer note.
Adjusted R-square (Pg 9-12 Pfeifer note) • Our better method of forecasting hours would use a mean of 7.9 and standard deviation of 3.89 (and the t-distribution with 14 dof). • The sample variance is s² = 3.89² ≈ 15.1. • This is the variation in Hours that regression will try to explain.
Adjusted R-square (Pg 9-12 Pfeifer note) • Our better method of forecasting hours for job A would use a mean of 10.51 and standard deviation of 2.77 (and the t-distribution with 13 dof). • The squared standard error is (standard error)² ≈ 7.69. • This is the variation in Hours that regression leaves unexplained.
Adjusted R-square (Pg 9-12 Pfeifer note) • Adjusted R-square is the percentage of variation explained. • The initial variation is s² = 15.1. • The variation left unexplained (after using MSF in a regression) is (standard error)² = 7.69. • Adjusted R-square = (s² − (standard error)²) / s² • Adjusted R-square = (15.1 − 7.69)/15.1 = 0.49 • The regression using MSF explained 49% of the variation in hours. • The “adjusted” happened in the calculation of s and standard error, which use 14 and 13 degrees of freedom rather than n = 15.
From the Pfeifer note • Standard error = 0 gives Adj R-square = 1.0 • Standard error = s gives Adj R-square = 0.0 • Adj R-square = 0.5 falls between these extremes (when (standard error)² is half of s²).
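The adjusted R-square arithmetic and its two extreme cases can be sketched in a few lines of Python. This is only an illustration: the function name `adjusted_r_square` is an assumption made here, and the inputs are the rounded values 15.1 and 7.69 quoted on the slides.

```python
def adjusted_r_square(s, standard_error):
    """Share of the initial variance s**2 that the regression explains."""
    return (s**2 - standard_error**2) / s**2


# Values from the slides: s^2 = 15.1 and (standard error)^2 = 7.69.
s = 15.1 ** 0.5
se = 7.69 ** 0.5

hours = adjusted_r_square(s, se)
print(round(hours, 2))  # 0.49

# The two extremes from the Pfeifer note.
perfect = adjusted_r_square(s, 0.0)  # standard error = 0 -> 1.0
useless = adjusted_r_square(s, s)    # standard error = s -> 0.0
print(perfect, useless)
```

For a fixed s, a lower standard error always gives a higher adjusted R-square, which is why the two statistics rank competing models the same way.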
Why Pfeifer says R² is over-rated • There is no standard for how large it should be. • In some situations an adjusted R² of 0.05 would be FANTASTIC. In others, an adjusted R² of 0.96 would be DISAPPOINTING. • It has no real use. • Unlike “standard error”, which is needed to make probability forecasts. • It is usually redundant. • When comparing models, lower standard errors mean higher adj R². • The correlation coefficient (which shares the same sign as b) ≈ the square root of adj R².
The Coal Pile Example • The firm needed a way to estimate the weight (W) of a coal pile based on its dimensions. • We just used MULTIPLE regression. • 96% of the variation in W is explained by this regression.
The Coal Pile Example • Engineer Bob calculated the Volume of each pile and used simple regression. • 100% of the variation in W is explained by this regression. • Standard error went from 20.6 to 2.8!!!
The Four Assumptions (Sec 5 of Pfeifer note, Sec 12.4 of EMBS) • Linearity • Independence • The n observations were sampled independently from the same population. • Homoskedasticity • All Y’s given X share a common σ. • Normality • The probability distribution of Y│X is normal. • Errors are normal. Y’s don’t have to be.
The four assumptions (Sec 5 of Pfeifer note, Sec 12.4 of EMBS) • Our better method of forecasting hours for job A would use a mean of 10.51 and standard deviation of 2.77 (and the t-distribution with 13 dof). • Linearity • Independence (all 15 points count equally) • Homoskedasticity • Normality
Hypotheses (P 13 of Pfeifer note, Sec 12.5 of EMBS) • H0: P=0.5 (LTT, wunderdog) • H0: Independence (supermarket job and response, treatment and heart attack, light and myopia, tosser and outcome) • H0: μ=100 (IQ) • H0: μM = μF (heights, weights, batting average) • H0: μcompact = μmid = μlarge (displacement)
H0: b=0 (P 13 of Pfeifer note, Sec 12.5 of EMBS) • b=0 means X and Y are independent. • In this way it’s like the chi-squared independence test, but for numerical variables. • b=0 means don’t use X to forecast Y. • Don’t put X in the regression equation. • b=0 means just use the sample mean of Y (Ȳ) to forecast Y. • b=0 means the “true” adj R-square is zero.
Testing b=0 is EASY!!! (P 13 of Pfeifer note, Sec 12.5 of EMBS) • H0: μ=100 • t-stat = (sample mean − 100)/(s/√n) • P-value from the t.dist with n-1 dof. • H0: b=0 • t-stat = (estimated b − 0)/(se of coef) • P-value from t.dist using n-2 dof. • In the regression output, look for the standard error of the coefficient, the t-stat to test b=0, and the 2-tailed p-value.
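As a rough sketch of where those regression-output numbers come from, the Python below computes the slope, the standard error of the coefficient, and the t-stat for H0: b=0 by hand. The function name `slope_t_test` and the five data points are made up for illustration and are not from the class data. The p-value step is left out; it would come from a t-distribution with n-2 dof (for example, Excel's T.DIST.2T).

```python
import math


def slope_t_test(x, y):
    """Return (b, se_of_coef, t_stat, dof) for simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx            # estimated slope
    a = y_bar - b * x_bar    # estimated intercept

    # Standard error of the regression: residual variation with n-2 dof.
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    standard_error = math.sqrt(sse / (n - 2))

    # Standard error of the coefficient, then the t-stat for H0: b = 0.
    se_of_coef = standard_error / math.sqrt(sxx)
    t_stat = (b - 0) / se_of_coef
    return b, se_of_coef, t_stat, n - 2


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, se_of_coef, t_stat, dof = slope_t_test(x, y)
print(round(b, 2), round(se_of_coef, 3), round(t_stat, 2), dof)
```

A t-stat far from zero (relative to the t-distribution with `dof` degrees of freedom) is evidence against b=0, meaning X is worth keeping in the regression equation.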
Using Yes/No variables in Regression (Sec 8 of Pfeifer note, Sec 13.7 of EMBS) • [Table: numerical vs. categorical variable types] • Does MPG “depend” on fuel type? • n=60
Fuel type (yes/no) and mpg (numerical) (Sec 8 of Pfeifer note, Sec 13.7 of EMBS) • H0: μP = μR, or H0: μP – μR = 0 • Un-stack the data so there are two columns of MPG data. • Data Analysis, t-test two sample.
Using Yes/No variables in Regression (Sec 8 of Pfeifer note, Sec 13.7 of EMBS) • Convert the categorical variable into a 1/0 DUMMY variable. • Use an if statement to do this. • It won’t matter which is assigned 1 and which is assigned 0. • It doesn’t even matter what 2 numbers you assign to the two categories (regression will adjust). • Regress MPG (numerical) on DUMMY (1/0 numerical). • Test H0: b=0 using the regression output.
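The dummy-variable steps above can be sketched in Python. The six MPG values and the helper `simple_regression` are hypothetical, chosen only so the arithmetic is easy to follow; they are not the class's n=60 data set. The sketch shows that the slope on a 1/0 dummy is the difference between the two group means, and that swapping or changing the two codes leaves the forecasts unchanged.

```python
def simple_regression(x, y):
    """Return (intercept a, slope b) from least-squares regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical data: "R" = regular, "P" = premium.
fuel = ["R", "R", "R", "P", "P", "P"]
mpg = [28.0, 27.0, 29.0, 24.0, 25.0, 23.0]

# The "if statement": premium -> 1, regular -> 0.
dummy = [1 if f == "P" else 0 for f in fuel]
a, b = simple_regression(dummy, mpg)
print(a, b)  # intercept = mean of regular, slope = premium mean - regular mean

# Swap the codes: the slope changes sign, the forecasts do not.
flipped = [1 - d for d in dummy]
a_flip, b_flip = simple_regression(flipped, mpg)

# Any two distinct numbers work (here regular = 10, premium = 20).
arbitrary = [20 if f == "P" else 10 for f in fuel]
a_arb, b_arb = simple_regression(arbitrary, mpg)
forecast_regular = a_arb + b_arb * 10
forecast_premium = a_arb + b_arb * 20
print(forecast_regular, forecast_premium)
```

Because b equals the difference in group means, testing H0: b=0 in this regression asks the same question as H0: μP – μR = 0.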
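Regression with one Dummy variable • H0: μP = μR, or H0: μP – μR = 0, or H0: b = 0 • When D=0 (Regular), the forecast MPG is 27.7. • When D=1 (Premium), the forecast MPG is 24.3. • The coefficient on D is the difference between the two: 24.3 − 27.7 = −3.4.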
What we learned today • We learned about “adjusted R square” • The most over-rated statistic of all time. • We learned the four assumptions required to use regression to make a probability forecast of Y│X. • And how to check each of them. • We learned how to test H0: b=0. • And why this is such an important test. • We learned how to use a yes/no variable in a regression. • Create a dummy variable.