
Lecture 17: Advanced model building




  1. Lecture 17: Advanced model building March 19, 2014

  2. Administrative • Problem set 7 due next Wednesday • Exam 2 covers everything through next week • Talk today at 12pm: Rema Padman (Heinz) will talk on Healthcare Analytics – new uses of data+analysis for medical decision making.

  3. Penultimate section of course Upcoming material (today) • Regression Diagnostics • Can we quantify how much any one observation changes our model (outliers)? Yes. • Can we quantify which models are better? Yes, to some degree • Are there other/better ways to evaluate our model? Sometimes.

  4. Penultimate section of course Upcoming material: next few weeks • Specification (no time in the course to really do causal inference, although we’ll talk about it) • Fundamental question for the social sciences, even business: ‘does advertising have an effect on sales’, etc. • Very hard to answer. Doesn’t mean that it’s not answerable. • Techniques could be the topic of multiple courses. We’ll be very brief (it gets tricky/hard quickly). • Forecasting • What will demand be next summer? How will costs grow over the next 5 years? • Essentially predictions with a time component. • Typically interested in minimizing forecast error (what is it?) • Forecast error = RMSE. • Two of the most common uses of regression • But how you should specify the model (variable choice) is different!

  5. Inclusion / Exclusion • We’ve focused almost exclusively on estimating a model and then interpreting the results. • This is very important, but… • It takes the model specification itself for granted. • Specification: which variables to include in a model? • We’ve touched on this but not in an exceptionally rigorous way • You’ll see: it depends on what we want to do. • It’s tempting to focus on R2 • It’s important. It’s useful. • But it’s not the only useful thing. The standard error of the regression (aka: RSE, RSD, RMSE, etc.) is very important.

  6. Partial F-test • F-test • Conceptually: is there a significant amount of variation explained by the model? • Partial F-test • Similar: going from model M1 (reduced) to model M2 (complete), has there been a significant increase in the amount of variation explained?

  7. Partial F-test • Two equivalent formulas to calculate the partial F statistic: • F = [(SSE_reduced - SSE_complete) / (k - j)] / [SSE_complete / (n - k - 1)] • F = [(R2_complete - R2_reduced) / (k - j)] / [(1 - R2_complete) / (n - k - 1)] • k = # of variables in the complete model • j = # of variables in the reduced model • To get the p-value in Excel: =FDIST(F, k-j, n-k-1) • This is different than the F.DIST function! • FDIST is the “old” Excel function; F.DIST.RT() is the new version, and FDIST = 1 – F.DIST. A confusing change from MS.
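The partial F formula above can be sketched as a short function. The SSE numbers below are made up purely for illustration (they are not from any dataset in the course):

```python
def partial_f(sse_reduced, sse_complete, k, j, n):
    """Partial F statistic for comparing a reduced model (j variables)
    against a complete model (k variables) fit on n observations:
    F = [(SSE_r - SSE_c) / (k - j)] / [SSE_c / (n - k - 1)]."""
    numerator = (sse_reduced - sse_complete) / (k - j)
    denominator = sse_complete / (n - k - 1)
    return numerator / denominator

# Hypothetical SSE values: adding 2 variables drops SSE from 120 to 100
F = partial_f(sse_reduced=120.0, sse_complete=100.0, k=5, j=3, n=100)
print(round(F, 3))  # compare against an F(k-j, n-k-1) critical value
```

The p-value would then come from the F distribution with (k − j, n − k − 1) degrees of freedom, exactly as the =FDIST(F, k-j, n-k-1) call on the slide does.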

  8. Partial F test • Using the data TeachingRating.xls • Fit two models: • Predict course evaluations by using age, beauty, non-native speaker • Predict course evaluations by using age, beauty, non-native speaker, female, and an interaction between female and age. • The partial F-statistic from going from model 1 to model 2 is what? • 9.52 • 10.36 • 2.34 • 0.34

  9. Diagnostics • What did we do about outliers in simple regression? • Ran the analysis again without the observation and looked for differences in the various estimates. • The same can be done for multiple regression • sometimes it’s much harder to identify which observation might be problematic because we’re dealing with a multi-dimensional space. • Leverage: a statistic we can calculate for each observation • A measure of influence of the observation on the model. • Ranges from 0 to 1 (low to high influence). • Observations with leverage values larger than 3(k+1) / n are potentially problematic • k is the # of explanatory variables and n is sample size • Why is it a function of k and n?

  10. Leverage • Unfortunately, calculating the leverage of each observation is problematic in Excel. You can do it, but it’s a royal pain. • That doesn’t mean you don’t need to know them, but I would give you the calculated leverage values on an exam, etc. • Leverage is just the potential for being problematic. Once you’ve identified potentially troubling observations, try re-estimating the model without that data point. • Even if your results change, it doesn’t mean you should drop the data point(s). It might be completely legitimate. But it helps you understand your data. • A data point with high leverage isn’t necessarily an outlier. • Outliers have “unusual” values (relative to the rest of the dist).
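For simple (one-variable) regression there is a closed form, h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)², which makes the 3(k+1)/n rule of thumb easy to try outside Excel. A minimal sketch with made-up data (not the TeachingRating data):

```python
def leverage(x):
    """Leverage of each observation in a simple (one-x) regression:
    h_i = 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]   # hypothetical data; 30 sits far out
h = leverage(x)
threshold = 3 * (1 + 1) / len(x)      # 3(k+1)/n with k = 1, n = 10
flagged = [xi for xi, hi in zip(x, h) if hi > threshold]
print(flagged)                        # only the x = 30 point exceeds the cutoff
```

A useful sanity check: the leverages always sum to k+1 (the number of estimated parameters), which is why the average leverage is (k+1)/n and the rule of thumb flags points at three times that average.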

  11. Cook’s Distance • To look at potential outliers with high leverage, we can calculate the Cook’s Distance for an observation i: • Di = [ei2 / ((k+1) MSE)] × [hi / (1 − hi)2] • Cook’s Di < 0.5, usually fine. • 0.5 < Di < 1, might be problematic • Di > 1, probably problematic. where: ei = residual for observation i k = number of explanatory variables (so k+1 parameters) MSE = Mean Square Error = SSE / (n-k-1) = se2 hi = leverage of observation i • So I could provide the leverage values for each obs and have you calculate D, or identify leveraging observations, etc.

  12. Model Validation • Assuming the data is OK, how do we tell if the model is good? • We’ve looked at R2 and se, as we should. • It’s always a good idea to minimize se (if we’re interested in forecasting or predicting), or to maximize R2 if we want to understand changes in Y. • But both are functions of the data • With different data, they might change. • And the data we have is just a sample. What makes this sample so much better than another one? Nothing… • Another common, and very good, approach is to perform some subsampling and model validation • Split the data: • Group 1 (the bigger group): use the data to fit a model • Group 2: use the regression from Group 1 to predict “out of sample” observations from Group 2. How well does it do? It’ll be worse (why?) but how much worse? • Calculate residuals: R2 (squared correlation between fitted and observed), and se (stdev of residuals). • How do you choose the groups?
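The split-and-validate procedure above can be sketched end to end. Everything here is made up for illustration: a simulated dataset (y = 2 + 0.5x + noise), a random 30/10 split, and an out-of-sample R2 computed as the squared correlation between fitted and observed values, matching the slide’s definition:

```python
import random

def fit_simple(x, y):
    """OLS intercept and slope for a one-variable regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

def holdout_quality(a, b, x, y):
    """Out-of-sample R^2 (squared correlation of fitted vs. observed)
    and se (stdev of the holdout residuals)."""
    fitted = [a + b * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    n = len(y)
    mf, my = sum(fitted) / n, sum(y) / n
    cov = sum((f - mf) * (yi - my) for f, yi in zip(fitted, y))
    vf = sum((f - mf) ** 2 for f in fitted)
    vy = sum((yi - my) ** 2 for yi in y)
    r2 = cov ** 2 / (vf * vy)
    se = (sum(e ** 2 for e in resid) / (n - 1)) ** 0.5
    return r2, se

# Simulated data: y = 2 + 0.5x + noise (hypothetical, not course data)
random.seed(0)
x = list(range(40))
y = [2 + 0.5 * xi + random.gauss(0, 1) for xi in x]

idx = list(range(40))
random.shuffle(idx)                 # randomly assign observations to groups
train, test = idx[:30], idx[30:]    # Group 1 (bigger) fits the model

a, b = fit_simple([x[i] for i in train], [y[i] for i in train])
r2, se = holdout_quality(a, b, [x[i] for i in test], [y[i] for i in test])
```

This also answers "how do you choose the groups?" in the simplest way, by random assignment; with time-series data (the forecasting case) you would instead hold out the most recent observations.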
