Stat 324 – Day 25

gguthrie

Presentation Transcript


  1. Stat 324 – Day 25: Variable Selection Techniques (cont.)

  2. Variable selection
     • Want to find the combination of variables that explains the most variability in the simplest possible model
     • Look for variables that explain a higher percentage of the remaining unexplained variation (partial correlation coefficients)
     • Can use automated procedures … with caution

  3. Variable Selection
     • Forward selection: Bring in the variable most highly correlated with the response, then the variable with the highest “partial correlation”, and so on
     • Backward elimination: Take out the least significant variable, refit the model, repeat
     • Stepwise regression: Forward selection, but at each step consider removing any variables that are now insignificant
     • Assumes variables are appropriate
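
The forward selection idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's software output: the function names `rss` and `forward_select`, the F-to-enter cutoff of 4.0, and the simulated data are all hypothetical. At each step, the candidate that lowers the residual sum of squares the most is the same one with the largest partial correlation with the response given the variables already in the model.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X (X already holds the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=4.0):
    """Greedy forward selection; returns column indices in the order they entered.

    f_to_enter is an assumed stopping rule (partial F statistic), not taken from the slides.
    """
    n, p = X.shape
    selected = []
    remaining = list(range(p))
    current = np.ones((n, 1))              # start from the intercept-only model
    current_rss = rss(current, y)
    while remaining:
        # Try each remaining variable; keep the one that lowers RSS the most.
        trials = [(rss(np.column_stack([current, X[:, j]]), y), j) for j in remaining]
        best_rss, best_j = min(trials)
        df_resid = n - (len(selected) + 2)  # intercept + variables already in + the new one
        f_stat = (current_rss - best_rss) / (best_rss / df_resid)
        if f_stat < f_to_enter:
            break                           # no candidate explains enough of what is left
        selected.append(best_j)
        remaining.remove(best_j)
        current = np.column_stack([current, X[:, best_j]])
        current_rss = best_rss
    return selected

# Simulated data (hypothetical): y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
chosen = forward_select(X, y)
print(chosen)
```

Backward elimination and stepwise regression follow the same pattern, with a removal check on the variables already in the model.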

  4. Stepwise Regression (Mixed)

  5. Best Subsets

  6. Last Time

  7. Last Time: AIC vs. BIC
     • AIC: tyer: 311.1, tiyer: 311.9, typer: 312.7, tiyper: 313.9
     • BIC: tyer: 322.4, te: 322.7, tye: 324.2, ter: 324.6
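
To make the AIC vs. BIC comparison concrete, here is a minimal sketch of how both criteria can be computed for a least-squares fit. It assumes one common form of the formulas, AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), with k the number of estimated coefficients. Statistical packages may add constants or count the error variance as a parameter, so the numbers will not necessarily match the values on the slide. The function name `aic_bic` and the data are hypothetical.

```python
import numpy as np

def aic_bic(X, y):
    """Return (AIC, BIC) for a least-squares fit of y on X plus an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    rss = float(resid @ resid)
    k = design.shape[1]                 # number of estimated coefficients
    fit_term = n * np.log(rss / n)      # goodness-of-fit part shared by both criteria
    aic = fit_term + 2 * k              # penalty of 2 per coefficient
    bic = fit_term + k * np.log(n)      # penalty of ln(n) per coefficient
    return aic, bic

# Simulated data (hypothetical): 50 cases, only the first two columns matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=50)

for cols in ([0, 1], [0, 1, 2], [0, 1, 2, 3]):
    aic, bic = aic_bic(X[:, cols], y)
    print(cols, round(aic, 1), round(bic, 1))
```

With n = 50, ln(n) is about 3.9, so BIC charges almost twice as much per added term as AIC. That is why BIC tends to favor smaller models.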

  8. Practice problem • I chose to use the model with 4 variables, where SAT score is predicted by YEARS, EXPEND, RANK, and ln(takers), because this model had the lowest AIC. The states that are doing the best are the ones with the largest positive residuals, because a positive residual means the state performed better than the model predicted. New Hampshire is the state that is doing the best. It had the highest residual, about 59, which means it is performing 59 points better than the model predicted it would given YEARS, EXPEND, RANK, and ln(takers).

  9. Other notes
     • Insignificant terms
       • Doesn’t really hurt to leave them in the model as long as you clarify that they are not significant
       • vs. parsimony, adjusted R²
       • Could keep them in at the request of a subject matter expert or for the sake of completeness (e.g., lower-order terms of a polynomial, a set of indicator variables, indicators in the presence of interactions)

  10. Want to adjust for the model size
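
One common size-adjusted measure is the adjusted R² mentioned under "Other notes". The notation here is an assumption, since the slide's own formulas did not survive in the transcript: n is the number of observations, p the number of predictors (not counting the intercept), SSE the residual sum of squares, and SST the total sum of squares.

```latex
R^2_{\text{adj}} = 1 - \frac{\mathrm{SSE}/(n-p-1)}{\mathrm{SST}/(n-1)}
```

Unlike R², this quantity can go down when a new variable reduces SSE by less than the lost degree of freedom is worth.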

  11. PRESS and R² predicted
     • Want the model that does the best job of predicting future observations.
     • But what if you don’t have future observations?
     • Internal validation: PRESS statistic – want the smallest value
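
As a rough sketch of the internal validation idea: PRESS sums the squared leave-one-out prediction errors. For least squares it can be computed without refitting, because the error for case i equals its ordinary residual divided by (1 − h_ii), where h_ii is the leverage. R² predicted is then 1 − PRESS/SST. The function name `press_and_r2_pred` and the simulated data are hypothetical.

```python
import numpy as np

def press_and_r2_pred(X, y):
    """Return (PRESS, predicted R^2) for a least-squares fit of y on X plus an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    hat = design @ np.linalg.pinv(design)   # hat matrix H, so fitted values are H @ y
    leverage = np.diag(hat)                 # h_ii for each case
    resid = y - hat @ y                     # ordinary residuals
    loo_errors = resid / (1.0 - leverage)   # leave-one-out prediction errors
    press = float(np.sum(loo_errors ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    return press, 1.0 - press / sst

# Simulated data (hypothetical): 40 cases, two predictors.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(40, 2))
y_demo = 5.0 + 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=40)

press, r2_pred = press_and_r2_pred(X_demo, y_demo)
print(round(press, 2), round(r2_pred, 3))
```

When comparing candidate models on the same data, the one with the smallest PRESS (equivalently, the largest R² predicted) is the one expected to predict new cases best.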

  12. Recap • Adding more variables to model
