Lecture 17: Advanced model building. March 19, 2014
Administrative • Problem set 7 due next Wednesday • Exam 2 over everything through next week • Talk today at 12pm: Rema Padman (Heinz) will talk on Healthcare Analytics – new uses of data+analysis for medical decision making.
Penultimate section of course Upcoming material (today) • Regression Diagnostics • Can we quantify how much any one observation changes our model (outliers)? Yes. • Can we quantify which models are better? Yes, to some degree • Are there other/better ways to evaluate our model? Sometimes.
Penultimate section of course Upcoming material: next few weeks • Specification (no time in the course to really do causal inference, although we'll talk about it) • Fundamental question for the social sciences, even business: 'does advertising have an effect on sales?', etc. • Very hard to answer. That doesn't mean it's not answerable. • Techniques could be the topic of multiple courses. We'll be very brief (it gets tricky/hard quickly). • Forecasting • What will demand be next summer? How will costs grow over the next 5 years? • Essentially predictions with a time component. • Typically interested in minimizing forecast error (what is it?) • Forecast error = RMSE. • Two of the most common uses of regression • But how you should specify the model (variable choice) is different!
Inclusion / Exclusion • We've focused almost exclusively on estimating a model and then interpreting the results. • This is very important, but… • It takes the model specification itself for granted. • Specification: which variables to include in a model? • We've touched on this but not in an exceptionally rigorous way • You'll see: it depends on what we want to do. • It's tempting to focus on R2 • It's important. It's useful. • But it's not the only useful thing. The standard error of the regression (aka: RSE, RSD, RMSE, etc.) is very important.
Partial F-test • F-test • Conceptually: is there a significant amount of variation explained by the model? • Partial F-test • Similar: going from model M1 (reduced) to model M2 (complete), has there been a significant increase in the amount of variation explained?
Partial F-test • Two equivalent formulas to calculate the partial F statistic: • F = [(SSE_reduced - SSE_complete) / (k - j)] / [SSE_complete / (n - k - 1)] • F = [(R2_complete - R2_reduced) / (k - j)] / [(1 - R2_complete) / (n - k - 1)] • k = # of variables in the complete model • j = # of variables in the reduced model • To get the p-value in Excel: =FDIST(F, k-j, n-k-1) • This is different from the F.DIST function!!! • FDIST is the "old" Excel function. F.DIST.RT() is the new Excel version. FDIST = 1 – F.DIST. A confusing change from MS.
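If you'd rather check the arithmetic outside Excel, here is a minimal Python sketch of the same calculation. The SSE values and sample size below are made-up placeholders, purely to show the steps; scipy.stats.f.sf gives the right-tail probability, i.e. the same quantity as Excel's FDIST.

```python
from scipy import stats

# Placeholder inputs (assumed values, for illustration only)
sse_reduced, sse_complete = 120.0, 100.0   # SSE from the reduced and complete models
n, k, j = 200, 5, 3                        # sample size, # vars in complete, # vars in reduced

# Partial F statistic: extra variation explained per added variable,
# relative to the unexplained variation left in the complete model
F = ((sse_reduced - sse_complete) / (k - j)) / (sse_complete / (n - k - 1))

# Right-tail p-value, the analogue of Excel's =FDIST(F, k-j, n-k-1)
p_value = stats.f.sf(F, k - j, n - k - 1)
print(F, p_value)
```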
Partial F test • Using the data TeachingRating.xls • Fit two models: • Predict course evaluations using age, beauty, non-native speaker • Predict course evaluations using age, beauty, non-native speaker, female, and an interaction between female and age. • The partial F-statistic from going from model 1 to model 2 is what? • 9.52 • 10.36 • 2.34 • 0.34
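A rough sketch of how you might run this comparison in Python with statsmodels. The column names (course_eval, age, beauty, nonnative, female) are assumptions about the workbook's headers, not verified against TeachingRating.xls; adjust them to match the actual file.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Column names below are assumptions; check the actual headers in TeachingRating.xls
df = pd.read_excel("TeachingRating.xls")

reduced = smf.ols("course_eval ~ age + beauty + nonnative", data=df).fit()
complete = smf.ols("course_eval ~ age + beauty + nonnative + female + female:age",
                   data=df).fit()

# Partial F-test: does the complete model explain significantly more variation?
F, p_value, df_diff = complete.compare_f_test(reduced)
print(F, p_value)
```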
Diagnostics • What did we do about outliers in simple regression? • We did another analysis without the observation and checked whether any of the estimates changed. • The same can be done for multiple regression • sometimes it's much harder to identify which observation might be problematic because we're dealing with a multi-dimensional space. • Leverage: a statistic we can calculate for each observation • A measure of the influence of the observation on the model. • Ranges from 0 to 1 (low to high influence). • Observations with leverage values larger than 3(k+1)/n are potentially problematic • k is the # of explanatory variables and n is the sample size • Why is it a function of k and n?
Leverage • Unfortunately, calculating the leverage of each observation is problematic in Excel. You can do it, but it's a royal pain • That doesn't mean you don't need to know about leverage, but I would give you the calculated leverage values on an exam, etc. • Leverage is just the potential for being problematic. Once you've identified potentially troubling observations, try re-estimating the model without that data point. • Even if your results change, it doesn't mean you should drop the data point(s). It might be completely legitimate. But it helps you understand your data. • A data point with high leverage isn't necessarily an outlier. • Outliers have "unusual" values (relative to the rest of the distribution).
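As an alternative to wrestling with Excel, here is a short Python sketch that pulls the leverage values from a fitted OLS model and applies the 3(k+1)/n rule of thumb. The data is synthetic, generated purely for illustration; in practice X_raw and y would come from your dataset.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
n, k = 100, 3
X_raw = rng.normal(size=(n, k))
y = X_raw @ [1.0, -2.0, 0.5] + rng.normal(size=n)

model = sm.OLS(y, sm.add_constant(X_raw)).fit()
leverage = model.get_influence().hat_matrix_diag   # one value per observation, between 0 and 1

threshold = 3 * (k + 1) / n                        # rule-of-thumb cutoff from the slide
flagged = np.flatnonzero(leverage > threshold)     # indices of potentially problematic points
print(flagged)
```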
Cook’s Distance • To look at potential outliers with high leverage, we can calculate Cook’s Distance for an observation i: • Di = [ei^2 / ((k+1) · MSE)] · [hi / (1 - hi)^2] • Cook’s Di < 0.5: usually fine. • 0.5 < Di < 1: might be problematic • Di > 1: probably problematic. where: ei = residual for observation i k = number of explanatory variables in the model MSE = Mean Square Error = SSE / (n-k-1) = se^2 hi = leverage of observation i • So I could provide the leverage values for each obs and have you calculate D, or identify high-leverage observations, etc.
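A small sketch, again on synthetic data for illustration, that computes Cook's D directly from the residuals, leverage, and MSE in the formula above, and checks it against the value statsmodels reports.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
n, k = 100, 3
X = sm.add_constant(rng.normal(size=(n, k)))
y = X @ [0.5, 1.0, -2.0, 0.5] + rng.normal(size=n)

model = sm.OLS(y, X).fit()
infl = model.get_influence()

e = model.resid                          # residuals e_i
h = infl.hat_matrix_diag                 # leverage h_i
mse = model.ssr / (n - k - 1)            # MSE = SSE / (n - k - 1) = se^2

# Cook's distance from the slide's formula
D = (e**2 / ((k + 1) * mse)) * (h / (1 - h)**2)

# Same quantity straight from statsmodels, as a check
D_sm, _ = infl.cooks_distance
print(np.allclose(D, D_sm))              # should print True
print(np.flatnonzero(D > 0.5))           # observations worth a second look
```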
Model Validation • Assuming the data is OK, how do we tell if the model is good? • We've looked at R2 and se, as we should. • It's always a good idea to minimize se (if we're interested in forecasting or predicting), or to maximize R2 if we're interested in understanding changes in Y. • But both are functions of the data • With different data, they might change. • And the data we have is just a sample. What makes this sample so much better than another one? Nothing… • Another common, and very good, approach is to perform some subsampling and model validation • Split the data: • Group 1 (the bigger group): use the data to fit a model • Group 2: use the regression from Group 1 to predict the "out of sample" observations from Group 2. How well does it do? It'll be worse (why?) but how much worse? • Calculate residuals: R2 (correlation^2 between fitted and observed) and se (stdev of residuals). • How do you choose the groups?
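A rough sketch of this split-and-validate idea in Python, on synthetic data for illustration: fit on the larger group, predict the held-out group, and compare the out-of-sample R2 and se to the in-sample values.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data purely for illustration; in practice this would be your full sample
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2 + 1.5 * df.x1 - 0.8 * df.x2 + rng.normal(size=n)

# Random split: the bigger group fits the model, the smaller group checks it
train = df.sample(frac=0.7, random_state=1)
test = df.drop(train.index)

model = smf.ols("y ~ x1 + x2", data=train).fit()

# Out-of-sample performance on Group 2
pred = model.predict(test)
resid = test.y - pred
oos_r2 = np.corrcoef(pred, test.y)[0, 1] ** 2    # correlation^2 between fitted and observed
oos_se = resid.std(ddof=1)                       # stdev of out-of-sample residuals

print(oos_r2, oos_se)
print(model.rsquared, np.sqrt(model.mse_resid))  # in-sample R2 and se, for comparison
```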