Stat 324 – Day 25
Variable Selection Techniques cont.
Variable selection
• Want to find the combination of variables that explains the most variability in the simplest possible model
• Look for variables that explain a higher percentage of the remaining unexplained variation (partial correlation coefficients)
• Can use automated procedures … with caution
Variable Selection
• Forward selection: Bring in the variable most highly correlated with the response, then the variable with the highest “partial correlation”, and so on
• Backward elimination: Take out the least significant variable, refit the model, repeat
• Stepwise regression: Forward selection, but at each step consider removing any variables that are now insignificant
• Assumes variables are appropriate
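The forward selection bullet can be sketched in a few lines. This is a minimal illustration, not the procedure any particular package uses. It relies on the fact that the squared partial correlation of a candidate, given the variables already in the model, equals the fraction of the current residual sum of squares (RSS) that the candidate removes, so "highest partial correlation" is the same as "largest drop in RSS". The helper names, the stopping threshold `min_partial_r2`, and the noise-free toy data are all assumptions made for the example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def rss(columns, y):
    """RSS of y regressed on an intercept plus the given predictor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, X[i]))) ** 2 for i in range(n))


def forward_selection(data, y, min_partial_r2=0.05):
    """Greedy forward selection on squared partial correlation.

    min_partial_r2 is a hypothetical stopping threshold chosen for this sketch.
    """
    chosen, remaining = [], list(data)
    sst = rss([], y)          # intercept-only model
    current = sst
    while remaining:
        if current <= 1e-12 * sst:   # nothing left to explain
            break
        trial = {v: rss([data[c] for c in chosen + [v]], y) for v in remaining}
        best = min(trial, key=trial.get)
        partial_r2 = (current - trial[best]) / current
        if partial_r2 < min_partial_r2:
            break
        chosen.append(best)
        remaining.remove(best)
        current = trial[best]
    return chosen


# Toy data (made up): y depends on x1 and x2 only; x3 is unrelated.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, 1, -1, 1, -1, 1, -1],
    "x3": [3, 1, 4, 1, 5, 9, 2, 6],
}
y = [2 * a + 3 * b for a, b in zip(data["x1"], data["x2"])]
chosen = forward_selection(data, y)
print(chosen)
```

Backward elimination and stepwise regression reuse the same `rss` helper; only the direction of the search and the rule for dropping a variable change.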
Last Time: AIC vs. BIC
AIC
• tyer: 311.1
• tiyer: 311.9
• typer: 312.7
• tiyper: 313.9
BIC
• tyer: 322.4
• te: 322.7
• tye: 324.2
• ter: 324.6
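As a reminder of why the two criteria can rank models differently, here is a minimal sketch of AIC and BIC for a least-squares fit with Gaussian errors. Both start from the same −2 log-likelihood; AIC adds 2 per estimated parameter and BIC adds ln(n) per parameter, so BIC penalizes extra variables more heavily whenever n ≥ 8. The function name and the values of `rss`, `n`, and `n_coef` below are hypothetical, and software packages may drop additive constants, so only differences between models are comparable.

```python
import math


def aic_bic(rss, n, n_coef):
    """AIC and BIC for a Gaussian linear model fitted by least squares.

    n_coef counts regression coefficients including the intercept;
    the error variance is one more estimated parameter.
    """
    k = n_coef + 1
    neg2_loglik = n * (math.log(2 * math.pi * rss / n) + 1)
    return neg2_loglik + 2 * k, neg2_loglik + k * math.log(n)


# Hypothetical inputs: 50 observations, 4 predictors plus intercept.
aic, bic = aic_bic(rss=26000.0, n=50, n_coef=5)
print(round(bic - aic, 2))  # the gap depends only on k and n: k * (ln n - 2)
```

For the model tyer on the slide, the two reported values differ by 322.4 − 311.1 = 11.3, which is the size of gap this penalty difference produces for a model with a handful of parameters.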
Practice problem
• I chose the model with 4 variables, in which SAT score is predicted by YEARS, EXPEND, RANK, and ln(takers), because this model had the lowest AIC. The states doing the best are the ones with the largest positive residuals, because a positive residual means a state performed better than the model predicted. New Hampshire is doing the best: it had the highest residual, about 59, which means it scored about 59 points higher than the model predicted from YEARS, EXPEND, RANK, and ln(takers).
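The ranking step in this answer is just "residual = observed − fitted, then sort". A minimal sketch follows; the state labels and scores are placeholders, not the actual SAT data.

```python
# Placeholder observed and model-fitted scores (not the real data).
observed = {"State A": 1010, "State B": 950, "State C": 905}
fitted = {"State A": 980, "State B": 960, "State C": 900}

# Positive residual: the state scored higher than the model predicted.
residuals = {s: observed[s] - fitted[s] for s in observed}
ranked = sorted(residuals, key=residuals.get, reverse=True)
best = ranked[0]
print(best, residuals[best])
```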
Other notes
• Insignificant terms
  • Doesn’t really hurt to leave them in the model as long as you clarify that they are not significant
  • vs. parsimony, adjusted R²
  • Could keep in by request of subject matter expert or for sake of completeness (e.g., lower order terms of polynomial, set of indicator variables, indicators in presence of interactions)
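The trade-off with parsimony shows up in adjusted R², which charges for each extra term. A small sketch using the standard formula, with made-up R² values, shows an insignificant term raising R² slightly while lowering adjusted R²:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors (plus intercept) and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical values: a 5th predictor lifts R^2 only from 0.800 to 0.801.
small = adjusted_r2(0.800, n=50, p=4)
large = adjusted_r2(0.801, n=50, p=5)
print(round(small, 4), round(large, 4))
```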
PRESS and R²predicted
• Want the model that does the best job of predicting future observations.
• But what if you don’t have future observations?
• Internal validation: PRESS statistic – want the smallest value
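PRESS can be computed by brute force: leave each observation out, refit, predict the held-out point, and sum the squared prediction errors. Predicted R² is then 1 − PRESS/SST. The sketch below does this for a one-predictor model to keep it short; the function names and data are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - slope * xbar, slope


def press(x, y):
    """Sum of squared leave-one-out prediction errors."""
    total = 0.0
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (b0 + b1 * x[i])) ** 2
    return total


# Made-up data, roughly linear.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

b0, b1 = fit_line(x, y)
rss_full = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ybar = sum(y) / len(y)
sst = sum((yi - ybar) ** 2 for yi in y)

press_value = press(x, y)
r2_pred = 1 - press_value / sst
print(press_value > rss_full, r2_pred)
```

PRESS is never smaller than the ordinary RSS, because each point is predicted by a model that did not see it; a large gap between the two suggests the model is fitting noise.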
Recap
• Adding more variables to model