Lecture 9: Explaining Variation in Y BUEC 333 Summer 2009 Simon Woodcock
Explaining Variation in Y
• We’ve said several times that the goal of regression analysis is to “explain” variation in the dependent variable Yi on the basis of variation in the independent variables X1i, X2i, ..., Xki
• What does this mean?
• And how do we know whether we’re doing a good job?
• These are today’s topics.
The Total Sum of Squares
• When we talk about the “variation” in Yi to be “explained” by the independent variables, we’re talking about how Yi varies around its mean.
• That is, we want to explain departures of Yi from its population mean μY
• why? because we can already “explain” the mean pretty well using the sample mean $\bar{Y}$
• Of course, we don’t know μY, so we look at departures of Yi from its sample mean (i.e., $Y_i - \bar{Y}$)
• However, we always have $\sum_{i=1}^{n}(Y_i - \bar{Y}) = 0$ (why?), so trying to “explain” the total of these departures is pretty useless
• Instead, we focus on what’s usually called the Total Sum of Squares (TSS), $TSS = \sum_{i=1}^{n}(Y_i - \bar{Y})^2$, which isn’t zero unless there’s no variation in Yi at all (see the sketch below).
• When TSS is big, there is lots of variation in Yi around its mean – and this is what we want to explain using the independent variables.
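The two facts above are easy to verify numerically. Here is a minimal sketch in Python/NumPy (not part of the original slides; the data are made up for illustration): the deviations of Yi from the sample mean always sum to (essentially) zero, while their squares add up to the TSS.

```python
import numpy as np

# Hypothetical sample of the dependent variable (values are made up for illustration)
Y = np.array([12.0, 18.5, 9.0, 22.0, 15.5])

Y_bar = Y.mean()                # sample mean of Y
deviations = Y - Y_bar          # departures of each Y_i from the sample mean

print(deviations.sum())         # always ~0 (up to floating-point error)
print((deviations ** 2).sum())  # TSS: zero only if Y_i never varies at all
```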
The Decomposition of Variance
• We can always write: $Y_i - \bar{Y} = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$ (all I’ve done is add and subtract the predicted value $\hat{Y}_i$ – draw a picture)
• It follows that (you should be able to show this yourself – and I recommend you try it): $\underbrace{\sum_i (Y_i - \bar{Y})^2}_{TSS} = \underbrace{\sum_i (\hat{Y}_i - \bar{Y})^2}_{ESS} + \underbrace{\sum_i (Y_i - \hat{Y}_i)^2}_{RSS}$ where ESS is the explained sum of squares and RSS is the residual sum of squares.
• We’ve decomposed the total (squared) variation in Yi around its mean into a component that our regression model explains (ESS) and a component that our regression model cannot explain (RSS) – see the numerical check below.
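A short simulation sketch of the decomposition, assuming an OLS regression with an intercept (Python/NumPy; the data and variable names are illustrative, not from the slides): after fitting, ESS + RSS matches TSS up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)                    # one independent variable
Y = 2.0 + 1.5 * x + rng.normal(size=n)    # simulated dependent variable

# OLS with an intercept via least squares
X = np.column_stack([np.ones(n), x])      # design matrix: intercept + regressor
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ beta_hat                      # predicted values

Y_bar = Y.mean()
TSS = np.sum((Y - Y_bar) ** 2)            # total variation around the mean
ESS = np.sum((Y_hat - Y_bar) ** 2)        # variation the regression explains
RSS = np.sum((Y - Y_hat) ** 2)            # variation left in the residuals

print(TSS, ESS + RSS)                     # equal up to floating-point error
```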
The Proportion of Variance Explained: R²
• When we build a regression model, we frequently want to know how well it “fits” the data.
• Does our model do a good job of explaining the variation in Yi?
• We can use our decomposition of TSS into ESS and RSS to measure the proportion of the variation in Yi that is explained by our model.
• We call the proportion of the variation in Yi that is explained by the regression model R²: $R^2 = \frac{ESS}{TSS} = 1 - \frac{RSS}{TSS}$
• Notice that 0 ≤ R² ≤ 1 (see the sketch below)
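In the same spirit, a self-contained sketch (Python/NumPy, simulated data; not from the original slides) that computes R² both ways and confirms the two expressions agree for an OLS fit with an intercept:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=50)
Y = 1.0 + 0.8 * x + rng.normal(size=50)  # simulated data

slope, intercept = np.polyfit(x, Y, 1)   # simple OLS line (degree-1 polynomial fit)
Y_hat = intercept + slope * x            # fitted values

TSS = np.sum((Y - Y.mean()) ** 2)
ESS = np.sum((Y_hat - Y.mean()) ** 2)
RSS = np.sum((Y - Y_hat) ** 2)

print(ESS / TSS, 1 - RSS / TSS)          # both expressions give R^2; it lies between 0 and 1
```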
Using R² to Assess Model Fit
• R² is a useful measure to assess how well our model “fits” the data – that is, how well it explains the variation in Yi.
• When R² = 0, the regression explains none of the variation in Yi
• the regression model explains variation in Yi no better than the sample mean does (draw a picture)
• When R² = 1, the regression explains all of the variation in Yi
• this means there is an exact relationship between Yi and the independent variables (no errors – draw a picture)
• Typically, we don’t encounter either of these extremes in real data (draw a picture)
• Usually, bigger values of R² are “better” in the sense that our regression model does a “better” job of predicting Yi
• but all it tells us is that there is a strong linear relationship between Yi and the independent variables – it doesn’t imply anything causal.
More About R²
• How big should R² be to be confident in our model?
• that depends on the context
• in wage regressions (regress wage on education, experience, etc.) there are so many things that affect what a person earns that are hard to measure (luck, ability, motivation, etc.) that we are happy when R² is above 0.4
• in “macro” or financial regressions (e.g., regress the unemployment rate on inflation, economic growth, etc.) we are suspicious if R² is below 0.9
• There is a temptation to build a model (i.e., choose your independent variables) to maximize R²
• avoid this temptation!
• if you add another independent variable to your model, R² never decreases – even if the new variable has no “real” relationship with the dependent variable! (see the sketch below)
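A quick simulated illustration of that last bullet (Python/NumPy; the variable names and data are hypothetical): adding a regressor of pure noise cannot lower R², even though the extra variable has no real relationship with Y.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
noise_var = rng.normal(size=n)              # pure noise, unrelated to Y by construction
Y = 3.0 + 2.0 * x1 + rng.normal(size=n)

def ols_r_squared(Y, X_cols):
    """Fit OLS with an intercept on the given columns and return R^2 = 1 - RSS/TSS."""
    X = np.column_stack([np.ones(len(Y))] + list(X_cols))
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    RSS = np.sum((Y - X @ beta) ** 2)
    TSS = np.sum((Y - Y.mean()) ** 2)
    return 1 - RSS / TSS

r2_small = ols_r_squared(Y, [x1])           # relevant variable only
r2_big = ols_r_squared(Y, [x1, noise_var])  # add a variable with no "real" relationship to Y
print(r2_small, r2_big)                     # r2_big >= r2_small, even though the extra variable is junk
```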
Motivating Adjusted R²
• There are other reasons to avoid building a model to maximize R²
• Occam’s Razor: “one should not increase, beyond what is necessary, the number of entities required to explain anything” (all else equal, we prefer smaller, simpler models)
• losing degrees of freedom: a model’s degrees of freedom is the number of observations (n) minus the number of parameters you estimate (k slope parameters + 1 intercept), i.e., n − k − 1. When we add independent variables to the model, we lose degrees of freedom and (we’ll see soon) our parameter estimates become less precise.
• So if we add extra variables to the model, we need to trade off a better fit (in terms of R²) against parsimony (having a small, simple model).
• An alternative to R² that takes this into account is adjusted R².
Adjusted R²
• Another way to measure the quality of a model’s fit is adjusted R²: $\bar{R}^2 = 1 - \frac{RSS/(n-k-1)}{TSS/(n-1)}$
• Adjusted R² (pronounced “R-bar-squared”) penalizes for having lots of independent variables (or few degrees of freedom)
• It can increase, decrease, or stay the same when we add an extra regressor to the model.
• If we add an extra independent variable that is only weakly related to the dependent variable, adjusted R² will decrease (see the sketch below)
• Like R², adjusted R² is less than 1, but it is not necessarily positive (if R² is very close to zero, adjusted R² can be negative)
• It’s not the “be all and end all” – to assess whether a regression model is “good” we need to look at plenty of other things: do regression coefficients have plausible sign & magnitude? does the model give sensible predictions? is it missing independent variables that we know matter? etc.
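Finally, a sketch of the adjusted R² formula at work (Python/NumPy, simulated data, illustrative names; not from the original slides): adding a junk regressor nudges R² up, while adjusted R² typically falls because of the degrees-of-freedom penalty.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                     # unrelated to Y by construction
Y = 1.0 + 0.5 * x1 + rng.normal(size=n)

def r2_and_adjusted_r2(Y, X_cols):
    """OLS with an intercept; returns (R^2, adjusted R^2) with k = number of slope coefficients."""
    n, k = len(Y), len(X_cols)
    X = np.column_stack([np.ones(n)] + list(X_cols))
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    RSS = np.sum((Y - X @ beta) ** 2)
    TSS = np.sum((Y - Y.mean()) ** 2)
    r2 = 1 - RSS / TSS
    adj_r2 = 1 - (RSS / (n - k - 1)) / (TSS / (n - 1))  # penalizes lost degrees of freedom
    return r2, adj_r2

print(r2_and_adjusted_r2(Y, [x1]))            # baseline model
print(r2_and_adjusted_r2(Y, [x1, junk]))      # R^2 cannot fall, but adjusted R^2 typically does here
```

The penalty comes from dividing RSS by n − k − 1 rather than by n − 1, so a new regressor has to reduce RSS by enough to offset the lost degree of freedom before adjusted R² goes up.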