Bolasso: Model Consistent Lasso Estimation through the Bootstrap (Bach, ICML 2008). Presented by Sheehan Khan to the "Beyond Lasso" reading group, April 9, 2009.
Outline • Follows the structure of the paper • Define Lasso • Comments on scaling the penalty • Bootstrapping/Bolasso • Results • A few afterthoughts on the paper • Synopsis of the 2009 extended tech report • Discussion
Problem formulation • Standard Lasso formulation: $\hat{w} = \arg\min_{w \in \mathbb{R}^p} \frac{1}{2n}\|Y - Xw\|_2^2 + \mu_n \|w\|_1$ • New notation (consistent with ICML08) • Response vector $Y \in \mathbb{R}^n$ (n samples) • Design matrix $X \in \mathbb{R}^{n \times p}$ (n samples × p features) • Generative model $Y = X\mathbf{w} + \varepsilon$ with i.i.d. noise
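Not from the paper's own code: a minimal sketch of this formulation using scikit-learn, whose `Lasso` minimizes $\frac{1}{2n}\|Y - Xw\|_2^2 + \alpha\|w\|_1$, so `alpha` plays the role of μn. The noise level and seed are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 1000, 16                     # sample/feature sizes used in the experiments
w_true = np.zeros(p)
w_true[:8] = rng.normal(size=8)     # first 8 features active
X = rng.normal(size=(n, p))
Y = X @ w_true + 0.5 * rng.normal(size=n)   # generative model Y = Xw + noise

mu_n = 1.0 / np.sqrt(n)             # penalty on the critical n^{-1/2} scale
lasso = Lasso(alpha=mu_n, fit_intercept=False).fit(X, Y)
print("estimated support:", np.flatnonzero(lasso.coef_))
```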
How should we set μn? • The paper distinguishes five mutually exclusive regimes for $\mu_n$ • $\mu_n \to \infty$ implies $\hat{w} \to 0$: the penalty dominates the fit • $\mu_n \to \mu_0 > 0$ means we minimize the limiting objective $\frac{1}{2}(w-\mathbf{w})^\top Q (w-\mathbf{w}) + \mu_0\|w\|_1$ (with $Q = \mathbb{E}[xx^\top]$), which implies $\hat{w}$ is not a consistent estimate of $\mathbf{w}$
How should we set μn? • $\mu_n \to 0$ slower than $n^{-1/2}$: model consistency requires the condition $\|Q_{J^c J} Q_{JJ}^{-1} \operatorname{sign}(\mathbf{w}_J)\|_\infty \le 1$, where $J$ denotes the active set • $\mu_n \to 0$ faster than $n^{-1/2}$: we lose the sparsifying effect of the $\ell_1$ penalty • We saw similar arguments in Adaptive Lasso
How should we set μn? • In the remaining regime, $\mu_n = \mu_0 n^{-1/2}$, we can state: • Prop 1: every sign pattern $s$ with $s_J = \operatorname{sign}(\mathbf{w}_J)$ is selected with strictly positive limiting probability • Prop 2: the probability of selecting a pattern that disagrees with $\operatorname{sign}(\mathbf{w}_J)$ on $J$ vanishes • *Dependence on Q omitted in the body of the paper but appears in the appendix
So what? • Props 1 & 2 tell us that asymptotically: • We have positive probability of selecting the active features • We have vanishing probability of missing active features • We may or may not get additional non-active features, depending on the dataset • With many independent datasets, the features common to all selections must be exactly the active set
Bootstrap • In practice we do not get many datasets • We can use m bootstrap replications of the given set • For now we use bootstrap pairs; later we will use centered residuals
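A sketch of Bolasso with bootstrap pairs, under the same scikit-learn assumption as the earlier snippet; m = 128 and the hard intersection mirror the paper's experiments, while `threshold` is included to allow the Bolasso-S soft intersection mentioned later.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_pairs(X, Y, m=128, mu_n=None, threshold=1.0, seed=0):
    """Intersect lasso supports over m bootstrap-pairs replications.

    threshold=1.0 is the hard intersection; threshold=0.9 gives the
    Bolasso-S soft intersection (keep features in >= 90% of supports)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if mu_n is None:
        mu_n = 1.0 / np.sqrt(n)      # critical scaling mu_0 * n^{-1/2}
    counts = np.zeros(p)
    for _ in range(m):
        idx = rng.integers(0, n, size=n)   # resample (x_i, y_i) pairs with replacement
        coef = Lasso(alpha=mu_n, fit_intercept=False).fit(X[idx], Y[idx]).coef_
        counts += (coef != 0)
    return np.flatnonzero(counts >= threshold * m)
```

A final unregularized least-squares fit restricted to the returned columns then gives the Bolasso coefficients.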
Asymptotic Error • Prop 3: with $\mu_n = \mu_0 n^{-1/2}$ and $m$ bootstrap replications, the probability that Bolasso does not exactly select the active set $J$ is bounded above by $m A_1 e^{-A_2 n} + A_3 \frac{\log n}{\sqrt{n}} + A_4 \frac{\log m}{m}$ for positive constants $A_1,\dots,A_4$ • Can be tightened if the lasso sign-consistency condition holds
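Note the bound is not monotone in m: the $m A_1 e^{-A_2 n}$ term grows with m while the $\log(m)/m$ term shrinks. Plugging in purely hypothetical constants (the paper only asserts $A_1,\dots,A_4 > 0$) makes the trade-off concrete:

```python
import numpy as np

# Hypothetical constants -- the paper only guarantees they are positive.
A1, A2, A3, A4 = 1.0, 0.01, 1.0, 1.0
n = 1000

def prop3_bound(m):
    return (m * A1 * np.exp(-A2 * n)        # grows linearly in m
            + A3 * np.log(n) / np.sqrt(n)   # independent of m
            + A4 * np.log(m) / m)           # shrinks in m

for m in (2, 8, 32, 128, 512):
    print(f"m={m:4d}  bound={prop3_bound(m):.4f}")
```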
Results on Synthetic Data • 1000 samples • 16 features (first 8 active) • Averaged over 256 datasets • [Figure: probability of selecting each variable vs. regularization; lasso (black) vs. bolasso (red, m = 128), and bolasso for m = 2, 4, 8, …, 256]
Results on Synthetic Data • 64 features (8 active) • Error is the squared distance between sparsity pattern vectors, averaged over 32 datasets • [Figure legend: lasso (black), bolasso (green), forward greedy (magenta), thresholded LS (red), adaptive lasso (blue)]
Results on Synthetic Data • 64 samples • 32 features (8 active) • Bolasso-S uses a soft intersection: keep features appearing in at least 90% of the bootstrap supports • [Table: prediction MSE; the reported value of 1.24 looks questionable to the presenter]
Results on UCI data • [Table: prediction MSE on UCI datasets]
Some thoughts • Why do they compare bolasso variable-selection error to lasso, forward greedy, thresholded LS, and adaptive lasso, but then compare mean squared prediction error to lasso, ridge, and bagging? • All these results use low-dimensional data; we are interested in large numbers of features • This is considered in the 2009 tech report • Based on the plots it seems best to use m as large as possible (in contrast to Prop 3) • Is there any insight into the size of the positive constants, which have a huge impact? • Based on the results it seems we really want to use bolasso on problems where we know this bound to be loose
2009 Tech Report • Main extensions • Fills in the math details omitted previously • Discusses bootstrapping pairs vs. residuals • Proves both consistent for low-dimensional data • Shows empirical results favouring residuals in high-dimensional data • New upper and lower bounds for selecting active components in low-dimensional data • Proposes a similar method for high dimensions: run lasso with a high regularization parameter, then bootstrap within the supports (see the sketch below) • Discusses implementation details
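A loose sketch of the two-stage high-dimensional recipe from that bullet (screen with a heavily regularized lasso, then bootstrap within the surviving support). The screening penalty `mu_big` is a made-up knob and `bolasso_pairs` is the function from the earlier sketch; the tech report's exact procedure may differ.

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_highdim(X, Y, mu_big=0.5, m=128):
    # Stage 1: a large penalty screens the p features down to a small support.
    screen = np.flatnonzero(
        Lasso(alpha=mu_big, fit_intercept=False).fit(X, Y).coef_)
    # Stage 2: run the bootstrap intersection within the screened columns only.
    kept = bolasso_pairs(X[:, screen], Y, m=m)   # defined in the earlier sketch
    return screen[kept]
```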
Bootstrap Recap • Previously we sampled uniformly from the given dataset with replacement to generate each bootstrap set • Done in parallel • Bootstrapping can also be done sequentially • We saw this when reviewing Boosting
Bootstrap Residuals • Compute residuals from the lasso fit on the current dataset: $e_i = y_i - x_i^\top \hat{w}$ • Center the residuals: $\tilde{e}_i = e_i - \frac{1}{n}\sum_j e_j$ • Create a new dataset from the pairs $(x_i,\ x_i^\top \hat{w} + \tilde{e}_{\sigma(i)})$, where $\sigma$ samples indices uniformly with replacement
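A sketch of one centered-residual replication following the three bullets above (same scikit-learn and no-intercept assumptions as before):

```python
import numpy as np
from sklearn.linear_model import Lasso

def residual_bootstrap_dataset(X, Y, mu_n, rng):
    """Build one bootstrap dataset by resampling centered lasso residuals."""
    w_hat = Lasso(alpha=mu_n, fit_intercept=False).fit(X, Y).coef_
    e = Y - X @ w_hat                    # residuals of the current lasso fit
    e_centered = e - e.mean()            # centered residuals
    sigma = rng.integers(0, len(Y), size=len(Y))   # resample indices with replacement
    Y_star = X @ w_hat + e_centered[sigma]         # new responses; X stays fixed
    return X, Y_star
```

Repeating this m times and intersecting the lasso supports fitted on the resulting datasets gives the residual-bootstrap variant of Bolasso.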
Synthetic Results in High Dimensional Data • 64 samples, 128 features (8 active)
Varying Replications in High-Dimensional Data • [Figure legend: lasso (black), bolasso with m = {2, 4, 8, 16, 32, 64, 128, 256} (red), m = 512 (blue)]
The End • Thanks for your attention and participation • Questions/Discussion???