Statistics and Data Analysis. Professor William Greene Stern School of Business IOMS Department Department of Economics. Statistics and Data Analysis. Part 18 – Regression Modeling. Linear Regression Models. Least squares results Regression model Sample statistics
Statistics and Data Analysis Professor William Greene Stern School of Business IOMS Department Department of Economics
Statistics and Data Analysis Part 18 – Regression Modeling
Linear Regression Models • Least squares results • Regression model • Sample statistics • Estimates of population parameters • How good is the model? • In the abstract • Statistical measures of model fit • Assessing the validity of the relationship
Regression Model • Regression relationship: yi = α + β xi + εi • Random εi implies random yi • Observed random yi has two unobserved components: • Explained: α + β xi • Unexplained: εi • Random component εi has zero mean, standard deviation σ, and a normal distribution.
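A minimal sketch of how the model generates data, assuming made-up values for α, β and σ (the function name `simulate` and all numbers are illustrative, not from the course):

```python
import random

def simulate(alpha, beta, sigma, xs, seed=0):
    """Draw y_i = alpha + beta * x_i + eps_i with eps_i ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    rows = []
    for x in xs:
        eps = rng.gauss(0.0, sigma)    # unexplained component
        explained = alpha + beta * x   # explained component
        rows.append((x, explained + eps, eps))
    return rows

# Hypothetical parameter values, chosen only for illustration.
sample = simulate(alpha=10.0, beta=2.0, sigma=3.0, xs=[1, 2, 3, 4, 5])
for x, y, eps in sample:
    print(f"x={x}  y={y:.2f}  explained={10.0 + 2.0 * x:.2f}  eps={eps:.2f}")
```

In real data only x and y are observed; the split of y into explained and unexplained parts is visible here only because the data are simulated.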
Using the Regression Model • Prediction: Use xi as information to predict yi. • Without xi, the natural predictor is the sample mean of y. • xi provides more information. • With xi, the predictor is a + b xi.
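The comparison between the two predictors can be sketched as follows. The least squares formulas for a and b are the standard ones; the rooms and fuel bill numbers are invented for illustration only.

```python
def least_squares(xs, ys):
    """Return the least squares intercept a and slope b for y = a + b x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up sample: number of rooms and fuel bill.
rooms = [3, 4, 5, 6, 7, 8]
bill = [200, 310, 390, 560, 640, 800]

a, b = least_squares(rooms, bill)
y_bar = sum(bill) / len(bill)

# Sum of squared prediction errors for each predictor.
sse_mean = sum((y - y_bar) ** 2 for y in bill)
sse_reg = sum((y - (a + b * x)) ** 2 for x, y in zip(rooms, bill))

print(f"predictor a + b x: a = {a:.1f}, b = {b:.1f}")
print(f"squared error using the mean:     {sse_mean:.1f}")
print(f"squared error using a + b x:      {sse_reg:.1f}")
```

The regression predictor never has a larger sum of squared errors than the mean, which is the sense in which xi provides more information.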
Regression Fits • Regression of salary vs. years of experience • Regression of fuel bill vs. number of rooms for a sample of homes
Explained Variation • The proportion of variation “explained” by the regression is called R-squared (R2) • It is also called the Coefficient of Determination
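For reference, the usual definition can be written out as below. The notation is an assumption, not taken from the slides: the fitted value is written as \hat{y}_i = a + b x_i, the residual as e_i = y_i - \hat{y}_i, and the sample mean as \bar{y}.

```latex
R^2 \;=\; \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
    \;=\; 1 - \frac{\sum_{i=1}^{N} e_i^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}
```

The numerator of the first ratio is the regression sum of squares and the denominator is the total sum of squares, so R2 lies between 0 and 1.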
Regression Fits R2 = 0.924 R2 = 0.522 R2 = 0.424 R2 = 0.880
R2 is still positive even if the correlation is negative. R2 = 0.338
R Squared Benchmarks • Aggregate time series: expect .9+ • Cross sections, .5 is good. Sometimes we do much better. • Large survey data sets, .2 is not bad. R2 = 0.924 in this cross section.
Correlations rxy = 0.723 rxy = +1.000 rxy = -.402
R-Squared is the Square of rxy • R-squared is the square of the correlation between yi and the predicted yi, which is a + b xi. • The correlation between yi and (a + b xi) is the same, up to sign, as the correlation between yi and xi. • Therefore,…. • A regression with a high R2 predicts yi well.
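A quick numerical check of this identity, using a small invented data set with a negative slope (the function name `fit_summary` and the data are illustrative assumptions):

```python
from math import sqrt

def fit_summary(xs, ys):
    """Return (r, R2): the x-y correlation and the R-squared of the fitted line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)   # correlation between x and y
    r2 = 1.0 - sse / syy        # proportion of variation explained
    return r, r2

# Made-up data with a downward-sloping relationship.
xs = [1, 2, 3, 4, 5, 6]
ys = [9, 7, 8, 4, 5, 2]
r, r2 = fit_summary(xs, ys)
print(f"r = {r:.4f}, r squared = {r * r:.4f}, R2 = {r2:.4f}")
```

The correlation is negative here, yet R2 is positive and matches the squared correlation.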
Adjusted R-Squared • As we will discover when we study regression with more than one variable, a researcher can increase R2 just by adding variables to a model, even if those variables do not really explain y or have any real relationship to it at all. • To have a fit measure that accounts for this, “Adjusted R2” is a number that increases with the correlation, but decreases with the number of variables.
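The slides do not state the formula. A sketch using the common definition, 1 - (1 - R2)(N - 1)/(N - K - 1) with K explanatory variables, is shown here as an assumption. It reproduces the R-Sq(adj) value reported in the Internet Buzz output in this deck (R-Sq = 42.4%, 62 observations, one variable).

```python
def adjusted_r_squared(r2, n, k):
    """Common adjusted R-squared: penalizes R2 for the number of variables k."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Values read from the Internet Buzz regression output: R2 = 0.424, N = 62, K = 1.
adj = adjusted_r_squared(0.424, 62, 1)
print(f"adjusted R2 = {adj:.3f}")  # about 0.414, i.e. 41.4%
```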
Is R2 Large? • Is there really a relationship between x and y? • We cannot be 100% certain. • We can be “statistically certain” (within limits) by examining R2. • F is used for this purpose.
Is R2 Large? • Since F = (N-2)R2/(1 – R2), if R2 is “large,” then F will be large. • For a model with one explanatory variable in it, the standard benchmark value for a ‘large’ F is 4.
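As a worked example of this formula, the sketch below plugs in the R-Sq = 42.4% and N = 62 reported for the Internet Buzz regression in this deck. The function name is illustrative.

```python
def f_from_r_squared(r2, n):
    """F statistic for a one-variable regression: (N - 2) R2 / (1 - R2)."""
    return (n - 2) * r2 / (1.0 - r2)

f_buzz = f_from_r_squared(0.424, 62)
print(f"F = {f_buzz:.2f}")                  # close to the 44.16 in the output
print("larger than 4?", f_buzz > 4.0)
```

The small gap from the reported 44.16 comes from using the rounded R2 of 0.424.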
Movie Madness Fit R2 F
Why Use F and not R2? • When is R2 “large”? We have no benchmarks to decide. • How large is “large”? We have a table for F statistics to determine when F is statistically large: yes or no.
F Table • In the table, n2 is N-2. • The “critical value” depends on the number of observations. • If F is larger than the appropriate value in the table, conclude that there is a “statistically significant” relationship. • There is a huge F table on pages 732-742 of your text. • Analysts now use computer programs, not tables like this, to find the critical values of F for their model/data.
Internet Buzz Regression
Note: n2 is N-2.
Regression Analysis: BoxOffice versus Buzz
The regression equation is BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        1   7913.6  7913.6  44.16  0.000
Residual Error   60  10751.5   179.2
Total            61  18665.1
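The pieces of this output are linked by simple arithmetic. The sketch below recomputes them from the two sums of squares and their degrees of freedom as printed; small differences in the last digit are expected because the printed inputs are rounded.

```python
from math import sqrt

# Values copied from the Analysis of Variance table.
ss_regression, df_regression = 7913.6, 1
ss_residual, df_residual = 10751.5, 60

ss_total = ss_regression + ss_residual        # total sum of squares
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual       # mean squared residual
f_stat = ms_regression / ms_residual          # F statistic
r_squared = ss_regression / ss_total          # R-Sq
s = sqrt(ms_residual)                         # S, the standard error of the regression

print(f"Total SS = {ss_total:.1f}")
print(f"MS residual = {ms_residual:.1f}")
print(f"F = {f_stat:.2f}")
print(f"R-Sq = {100 * r_squared:.1f}%")
print(f"S = {s:.3f}")
```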
$135 Million Klimt, to Ronald Lauder http://www.nytimes.com/2006/06/19/arts/design/19klim.html?ex=1308369600&en=37eb32381038a749&ei=5088&partner=rssnyt&emc=rss
$100 Million … sort of Stephen Wynn with a Prized Possession, 2007
An Enduring Art Mystery • Why do larger paintings command higher prices? • Graphics show relative sizes of the two works: The Persistence of Memory, Salvador Dali, 1931, and The Persistence of Statistics, Hildebrand, Ott and Gray, 2005.
Monet in Large and Small • Sale prices of 328 signed Monet paintings • Model: log of $price = a + b log surface area + e • The residuals do not show any obvious patterns that seem inconsistent with the assumptions of the model.
The Data • Note: Logs are used in this context. This is common when analyzing financial measurements (e.g., price) and when percentage changes are more interesting than unit changes (e.g., what is the % premium when the painting is 10% larger?).
Monet Regression: There seems to be a regression. Is there a theory?
Conclusions about F • R2 answers the question of how well the model fits the data • F answers the question of whether there is a statistically valid fit (as opposed to no fit). • What remains is the question of whether there is a valid relationship – i.e., is β different from zero.
The Regression Slope • The model is yi = α + β xi + εi • The “relationship” depends on β. • If β equals zero, there is no relationship. • The least squares slope, b, is the estimate of β based on the sample. • It is a statistic based on a random sample. • We cannot be sure it equals the true β. • To accommodate this view, we form a range of uncertainty around b, i.e., a confidence interval.
Uncertainty About the Regression Slope
Hypothetical Regression: Fuel Bill vs. Number of Rooms
The regression equation is Fuel Bill = -252 + 136 Number of Rooms

Predictor     Coef  SE Coef      T      P
Constant    -251.9    44.88  -5.20  0.000
Rooms        136.2     7.09   19.9  0.000

S = 144.456   R-Sq = 72.2%   R-Sq(adj) = 72.0%

• The Rooms coefficient, 136.2, is b, the estimate of β.
• Its “Standard Error” (SE), 7.09, is the measure of uncertainty about the true value.
• The “range of uncertainty” is b ± 2 SE(b). (Actually 1.96, but people use 2.)
Internet Buzz Regression
Range of uncertainty for b is 72.72 - 1.96(10.94) to 72.72 + 1.96(10.94) = [51.27 to 94.17]
Regression Analysis: BoxOffice versus Buzz
The regression equation is BoxOffice = - 14.4 + 72.7 Buzz

Predictor     Coef  SE Coef      T      P
Constant   -14.360    5.546  -2.59  0.012
Buzz         72.72    10.94   6.65  0.000

S = 13.3863   R-Sq = 42.4%   R-Sq(adj) = 41.4%

Analysis of Variance
Source           DF       SS      MS      F      P
Regression        1   7913.6  7913.6  44.16  0.000
Residual Error   60  10751.5   179.2
Total            61  18665.1
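The range of uncertainty can be recomputed from the printed coefficient and standard error. This is a sketch; the function name `slope_interval` is illustrative, and the result agrees with the interval on the slide up to rounding in the second decimal.

```python
def slope_interval(b, se, multiplier=1.96):
    """Range of uncertainty for the slope: b plus or minus multiplier * SE(b)."""
    half_width = multiplier * se
    return b - half_width, b + half_width

# Buzz coefficient and its standard error from the regression output.
lower, upper = slope_interval(72.72, 10.94)
print(f"[{lower:.2f} to {upper:.2f}]")
print("includes zero?", lower <= 0.0 <= upper)
```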
Elasticity in the Monet Regression • b = 1.7246. This is the elasticity of price with respect to area. • The confidence interval would be 1.7246 ± 1.96(0.1908) = [1.3506 to 2.0986]. • The fact that this does not include 1.0 is an important result: prices for Monet paintings are extremely elastic with respect to the area.
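To make the elasticity concrete, the sketch below answers the earlier question about a painting that is 10% larger. It assumes the log-log model holds exactly, so that price is proportional to area raised to the power b; the function name is illustrative.

```python
def price_premium(elasticity, area_increase=0.10):
    """Proportional price change implied by a log-log model for a given area change."""
    return (1.0 + area_increase) ** elasticity - 1.0

b = 1.7246
se = 0.1908
low, high = b - 1.96 * se, b + 1.96 * se   # [1.3506 to 2.0986]

print(f"premium at b:           {100 * price_premium(b):.1f}%")
print(f"premium at lower bound: {100 * price_premium(low):.1f}%")
print(f"premium at upper bound: {100 * price_premium(high):.1f}%")
```

With an elasticity above 1.0, a 10% larger painting is predicted to sell for more than 10% more.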
Conclusion about b • So, should we conclude the slope is not zero? • Does the range of uncertainty include zero? • No, then you should conclude the slope is not zero. • Yes, then you can’t be very sure that β is not zero. • Tying it together. If the range of uncertainty does not include 0.0 then, • The ratio b/SE is larger than 2 in absolute value. • The square of the ratio is larger than 4. • The square of the ratio is F. • F larger than 4 gave the same conclusion. • They are looking at the same thing.
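The link between the ratio and F can be checked with the Internet Buzz numbers from this deck. This is a sketch using the rounded values as printed, so the squared ratio matches the reported F of 44.16 only approximately.

```python
# Buzz coefficient, its standard error, and the F statistic from the output.
b, se, f_reported = 72.72, 10.94, 44.16

t_ratio = b / se              # the T column in the output
t_squared = t_ratio ** 2      # should be close to F

print(f"b / SE = {t_ratio:.2f}")
print(f"(b / SE) squared = {t_squared:.2f}, reported F = {f_reported}")
print("ratio larger than 2 in absolute value?", abs(t_ratio) > 2.0)
print("F larger than 4?", f_reported > 4.0)
```

Both tests give the same answer, as the slide argues.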
Summary • The regression model – theory • Least squares results, a, b, s, R2 • The fit of the regression model to the data • ANOVA and R2 • The F statistic and R2 • Uncertainty about the regression slope