Stat 324 – Day 25 Penalized Regression
Last Time - Variable selection
• Want to find the combination of variables that explains the most variability in the simplest possible model
• Look for variables that explain a higher percentage of the remaining unexplained variation (partial correlation coefficients)
• Can use automated procedures … with caution (a minimal sketch follows)
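Below is a minimal Python sketch of one such automated procedure, a forward-selection loop driven by adjusted R². The data frame, column names (housing, health, crime, arts, rating), and simulated data are all hypothetical stand-ins, not the course dataset.

```python
# Forward selection by adjusted R^2: hypothetical data, for illustration only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)),
                  columns=["housing", "health", "crime", "arts"])
df["rating"] = 2 * df["arts"] + df["health"] + rng.normal(size=100)

def adj_r2(y, X):
    """Adjusted R^2 of the OLS fit of y on X (with intercept)."""
    return sm.OLS(y, sm.add_constant(X)).fit().rsquared_adj

selected = []
remaining = ["housing", "health", "crime", "arts"]
best = -np.inf
while remaining:
    # Try adding each remaining variable; keep the one that helps most
    scores = {v: adj_r2(df["rating"], df[selected + [v]]) for v in remaining}
    var = max(scores, key=scores.get)
    if scores[var] <= best:     # nothing improves adjusted R^2: stop
        break
    best = scores[var]
    selected.append(var)
    remaining.remove(var)

print("selected:", selected, "adj R^2:", round(best, 3))
```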
Principal components
• Example: Communities have been ranked on 9 variables. What best distinguishes the communities?
• Climate and Terrain (higher scores are better)
• Housing (lower scores are better)
• Health Care & the Environment (higher)
• Crime (lower scores are better)
• Transportation (higher)
• Education (higher)
• The Arts (higher)
• Recreation (higher)
• Economics (higher)
Example
• The first principal component is a linear combination of the (standardized) variables, PC1 = a1·x1 + a2·x2 + ⋯ + a9·x9, with weights chosen to capture as much of the variability as possible
• PC1 could then be used as an explanatory variable in a regression model to predict rating (see the sketch after this list)
• The second component can also be used, with the bonus of being orthogonal to the first
• *The variables should probably be standardized first
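As a rough illustration of that workflow (not the slide's actual data or loadings), here is a Python sketch that standardizes nine variables, extracts the first two principal components, and regresses a rating on them; all names and data are simulated.

```python
# PCA scores as regression inputs: simulated stand-in for the ratings data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 9))                 # 50 communities, 9 rated variables
rating = X @ rng.normal(size=9) + rng.normal(size=50)

Z = StandardScaler().fit_transform(X)        # standardize first
pcs = PCA(n_components=2).fit_transform(Z)   # columns: PC1 and PC2 scores

model = LinearRegression().fit(pcs, rating)  # rating ~ PC1 + PC2
print("R^2:", round(model.score(pcs, rating), 3))
```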
Example
• Here is how the original variables correlate with the first three principal components (the correlation table itself appears on the slide):
• Five variables have a strong correlation with PC1 (communities with better housing tend to have better health, etc.)
• PC1 is really about quality of the arts
• PC2 is about health
• PC3 suggests places with high crime tend to also have better recreation facilities
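One plausible way to compute such a variable-by-component correlation table in Python, again with simulated stand-in data:

```python
# Correlations between standardized variables and the first three PC scores
# (simulated data; rows are variables, columns are PC1-PC3).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
Z = StandardScaler().fit_transform(rng.normal(size=(50, 9)))
scores = PCA(n_components=3).fit_transform(Z)

corr = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                  for k in range(3)] for j in range(9)])
print(np.round(corr, 2))
```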
Last Time: AIC vs. BIC

  AIC              BIC
  tyer: 311.1      tyer: 322.4
  tiyer: 311.9     te: 322.7
  typer: 312.7     tye: 324.2
  tiyper: 313.9    ter: 324.6

The idea behind these measures is similar, but BIC has a larger penalty for the number of variables, so it tends to be a bit more conservative (often choosing smaller, less complex models).
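A small statsmodels sketch of the comparison: the slide's model labels (e.g., tyer) abbreviate particular predictor sets, so this example simply contrasts a smaller and a larger generic model on simulated data.

```python
# AIC/BIC comparison with statsmodels; the models here are generic, not
# the slide's "tyer"/"te" etc. specifications.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=100)

small = sm.OLS(y, sm.add_constant(X[:, :2])).fit()   # 2 predictors
large = sm.OLS(y, sm.add_constant(X)).fit()          # all 5 predictors
print(f"small: AIC={small.aic:.1f}  BIC={small.bic:.1f}")
print(f"large: AIC={large.aic:.1f}  BIC={large.bic:.1f}")
# BIC's penalty grows with log(n), so it punishes the extra (useless)
# predictors in `large` more heavily than AIC does.
```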
Other notes
• Insignificant terms
• It doesn't really hurt to leave them in the model, as long as you clarify that they are not significant
• This trades off against parsimony and adjusted R² (R2adj)
• Terms could be kept in at the request of a subject-matter expert or for the sake of completeness (e.g., lower-order terms of a polynomial, a set of indicator variables, indicators in the presence of interactions)
Today
• Penalized regression: another method, originally developed to deal with multicollinearity, that is increasingly popular as a form of variable selection as well (a sketch follows)
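As a preview, here is a minimal scikit-learn sketch (simulated data; the alpha penalty values are chosen arbitrarily) contrasting ridge regression, which shrinks coefficients to cope with multicollinearity, with the lasso, which can shrink some coefficients exactly to zero.

```python
# Ridge vs. lasso on standardized predictors (simulated, nearly collinear data).
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 8))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=100)   # nearly collinear pair
y = 3 * X[:, 0] + rng.normal(size=100)

Z = StandardScaler().fit_transform(X)             # penalties need a common scale
print("ridge:", np.round(Ridge(alpha=1.0).fit(Z, y).coef_, 2))
print("lasso:", np.round(Lasso(alpha=0.1).fit(Z, y).coef_, 2))
# The lasso typically zeroes out the redundant/noise coefficients, which is
# why it doubles as a variable-selection method.
```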
To Do
• Practice problem
• Wednesday/Thursday: Lab Assignment
• Email Dr. Chance questions!