190 likes | 376 Views
Marketing Analysts, LLC. Variable Selection in R. Fun with carets, elasticnets, and the Reverend Thomas Bayes …. Charles Ellis, MAi Research Mitchell Killian, Ipsos Marketing. Why do Variable Selection?. "Pluralitas non est ponenda sine neccesitate"
E N D
Marketing Analysts, LLC Variable Selection in R Fun with carets, elasticnets, and the Reverend Thomas Bayes … Charles Ellis, MAi Research Mitchell Killian, Ipsos Marketing
Why do Variable Selection? "Pluralitas non est ponenda sine neccesitate" • Overcoming the “Curse of Dimensionality” and developing more efficient data mining activities • Identifying relevant features & discarding those that are not • Enhancing the performance of data mining algorithms • Better prediction/classification • This applies to almost all fields, but especially those that are “data rich and theory poor” (e.g., Marketing)
Options for Tackling the Problem • Many different approaches have been suggested … it is a growing field, and many of them are implemented in R code. • BMA • rfe • glmnet • stepPlr • subselect • varselectRF • WilcoxCV • clustvarsel • Party • Boruta • PenalizedSVM • spikeslab • glmulti • BMS
Options for Tackling the Problem • Today we focus on three approaches, which range in degree of complexity and applicability. • Recursive Feature Elimination (package: caret [Kuhn]) • Bayesian Model Averaging (package: bma [Raftery et al.]) • Penalized regression (package: glmnet [Friedman et al.])
Recursive Feature Elimination(with resampling) • Implemented in the package caret • The basic idea (from Kuhn, 2009) • For each resampling unit (default is 10-fold cross-validation) do the following: • Partition the data into training & test sets • Train the model on the training set using all predictors • Predict outcomes using the test data • Calculate variable importance for all predictors
Recursive Feature Elimination (cont’d)(with resampling) • For each subset size (Si)to be considered keep the Si most important variables • Train the model on the training set using the Si predictors • Predict outcomes using the test data [Optional] Recalculate the rankings for each predictor • Calculate the performance profile over the Si predictors using the held-back samples • Determine the appropriate number of predictors • Fit the final model based on the optimal Si using the original training set
Recursive Feature Elimination An Example • Data set up (same across all examples) • Hot Breakfast Cereal Category • N = 310 consumers • Outcome Variable – Overall Liking • 5 point scale • Predictors – 31 Agree-Disagree statements measuring attitudes toward package and its components • 5 point scale • Outcome and predictors are treated as continuous (although they need not be)
Recursive Feature Elimination • The top 5 variables are: q4b_2, q4b_28, q4b_21, q4b_24, q4b_9 Results:
Recursive Feature Elimination Results (cont’d):
Recursive Feature Elimination Results (cont’d):
Bayesian Model Averaging • Implemented in the package BMA (also BMS) • The basic idea (from Hoeting et al., 1999) • All models are wrong, some are useful (Box, 1987) • Approach is to average over model uncertainty • Average over the posterior distribution of any statistic (e.g., parameter estimates) • Can be problematic for models with a large number of potential predictors • For “r” predictors, the set of potential models is 2r • Occam’s Window – Average over the subset of models that are supported by the data
Bayesian Model Averaging Note the similarity in the predictors chosen across the 5 best models (compared to the rfe algorithm) Results:
Bayesian Model Averaging Results (cont’d):
Penalized Regression • Implemented in the package glmnet • glmnet is an extension/application of the elasticnet package (Zou & Hastie, 2008) • The basic idea (from Friedman et al., 2010) • Ridge regression – applies an adjustment (the “ridge”) to the coefficient estimates, allowing them to borrow from each other, and shrinks the coefficients values. • However, Ridge aggressively shrinks coefficients to be equal to each other, allowing for no meaningful interpretation • Additionally, there is no easy way to determine how to set the penalization parameter
Penalized Regression (cont’d) • Lasso regression also adjusts the coefficients but tends to be “somewhat indifferent to very correlated predictors” • Essentially turns coefficients on/off, elevating one variable over another • Elastic Net – a compromise between Ridge and Lasso • Averages the effects of highly correlated predictors to create a “weighted” contribution of each variable • Lambda, a ridge regression penalty, shrinks coefficients toward each other • Alpha influences the number of non-zero coefficients in the model. • Alpha=0 is Ridge Regression and Alpha=1 is Lasso
Penalized Regression (cont’d) Alpha=0 Alpha=1 Alpha=0.2 At each step there is a unique value of lambda
Penalized Regression α= 0.75 α= 0.10 Results: The impact of different parameterizations of alpha
Penalized Regression Again, notice the similarity wrt predictors chosen with the other two algorithms Results: