
Bolasso : Model Consistent Lasso Estimation through the Bootstrap


Presentation Transcript


  1. Bolasso: Model Consistent Lasso Estimation through the Bootstrap. Bach, ICML 2008. Presented by Sheehan Khan to the “Beyond Lasso” reading group, April 9, 2009

  2. Outline • Follows the structure of the paper • Define Lasso • Comments on scaling the penalty • Bootstrapping/Bolasso • Results • A few afterthoughts on the paper • Synopsis of the 2009 extended tech report • Discussion

  3. Problem formulation • Standard Lasso formulation: ŵ ∈ argmin_w (1/2n)||Y − Xw||₂² + μn||w||₁ • New notation (consistent with ICML08) • Response vector Y ∈ R^n (n samples) • Design matrix X ∈ R^(n×p) (n samples × p features) • Generative model Y = Xw* + ε, with active set J = {j : w*_j ≠ 0}
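
A minimal sketch of this setup in Python (the coefficient values, noise level, and seed are assumptions; only the dimensions match the synthetic experiments later in the talk). Note that scikit-learn's Lasso uses the same 1/(2n) scaling of the squared error as the formulation above, so its alpha plays the role of μn.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 1000, 16, 8                      # samples, features, active features
w_star = np.zeros(p)
w_star[:k] = rng.uniform(0.5, 1.5, size=k) * rng.choice([-1.0, 1.0], size=k)
X = rng.standard_normal((n, p))            # design matrix (n samples x p features)
y = X @ w_star + 0.5 * rng.standard_normal(n)   # generative model Y = Xw* + eps (no intercept)

mu_n = 1.0 / np.sqrt(n)                    # one choice of regularization; see the next slides
lasso = Lasso(alpha=mu_n, fit_intercept=False).fit(X, y)
print("selected features:", np.flatnonzero(lasso.coef_ != 0))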

  4. How should we set μn? • The paper considers 5 mutually exclusive possibilities for how μn behaves as n grows • μn → ∞ implies ŵ → 0 • μn → μ0 ∈ (0, ∞) means we minimize the limiting objective (1/2)(w − w*)ᵀQ(w − w*) + μ0||w||₁, where Q is the limit of XᵀX/n, which implies ŵ is not a consistent estimate of w*

  5. How should we set μn? • μn → 0 slower than n^(−1/2): sign consistency requires the condition ||Q_{JᶜJ} Q_{JJ}⁻¹ sign(w*_J)||_∞ ≤ 1, where J denotes the active set • μn → 0 faster than n^(−1/2): we lose the sparsifying effect of the ℓ1 penalty • We saw similar arguments in Adaptive Lasso

  6. How should we set μn? • In the remaining regime, μn = μ0 n^(−1/2) with μ0 > 0, we can state: • Prop 1: for any sign pattern s with s_J = sign(w*_J), P(sign(ŵ) = s) tends to a limit ρ(s, μ0) ∈ (0, 1) • Prop 2: for any sign pattern s with s_J ≠ sign(w*_J), there exist constants A1, A2 > 0 such that log P(sign(ŵ) = s) ≤ −A1 n + A2 log n • *Dependence on Q omitted in the body of the paper but appears in the appendix

  7. So what? • Props 1 & 2 tell us that asymptotically: • We have strictly positive probability of selecting any pattern that contains the active features • We have vanishing probability of missing active features • We may or may not get additional non-active features, depending on the dataset • With many independent datasets, the features common to all of them must be the active set

  8. Bootstrap • In practice we do not get many datasets • We can use m bootstrap replications of the given set • For now we use pairs; later we will use centered residuals

  9. Bolasso
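
Slide 9 presents the Bolasso procedure itself; below is a minimal Python sketch of the pairs-bootstrap version under the assumptions of the earlier sketch (the function name, the fixed regularization level, and the OLS refit on the consensus support are illustrative choices, not the paper's reference implementation; a threshold below 1 corresponds to the soft-intersection variant Bolasso-S mentioned later).

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def bolasso(X, y, mu_n, m=128, threshold=1.0, seed=0):
    """Run the Lasso on m bootstrap replicates and intersect the supports.

    threshold=1.0 is the hard intersection; e.g. threshold=0.9 keeps features
    selected in at least 90% of replicates (a Bolasso-S-style soft intersection).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(m):
        idx = rng.integers(0, n, size=n)            # bootstrap pairs: sample rows with replacement
        coef = Lasso(alpha=mu_n, fit_intercept=False).fit(X[idx], y[idx]).coef_
        counts += (coef != 0)
    support = np.flatnonzero(counts >= threshold * m)
    w = np.zeros(p)
    if support.size:                                # refit ordinary least squares on the consensus support
        w[support] = LinearRegression(fit_intercept=False).fit(X[:, support], y).coef_
    return support, w

# e.g. support, w = bolasso(X, y, mu_n=1.0 / np.sqrt(len(y)), m=128)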

  10. Asymptotic Error • Prop 3: given μn = μ0 n^(−1/2) with μ0 > 0, the probability that Bolasso does not exactly select the active set satisfies P(Ĵ ≠ J) ≤ m A1 e^(−A2 n) + A3 log(n)/n^(1/2) + A4 log(m)/m • The bound can be tightened under additional assumptions

  11. Results on Synthetic Data • 1000 samples • 16 features (first 8 active) • Averaged over 256 datasets • [Plots: probability of selecting the correct sign pattern vs. regularization; lasso (black) vs. bolasso (red), with bolasso shown for m = 128 and for m = 2, 4, 8, …, 256]

  12. Results on Synthetic Data • 1000 samples • 16 features (first 8 active) • Averaged over 256 datasets • [Plots: same setup; lasso (black) vs. bolasso (red), with bolasso shown for m = 128 and for m = 2, 4, 8, …, 256]

  13. Results on Synthetic Data • 64 features (8 active) • Error is the squared distance between sparsity pattern vectors, averaged over 32 datasets • [Plot legend: lasso (black), bolasso (green), forward greedy (magenta), thresholded LS (red), adaptive lasso (blue)]
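
The error metric on slide 13 is compact enough to state in code; a small helper, assuming sparsity patterns are encoded as 0/1 vectors derived from the estimated and true coefficients:

import numpy as np

def sparsity_pattern_error(w_hat, w_star):
    # squared Euclidean distance between the 0/1 sparsity pattern vectors
    return float(np.sum(((w_hat != 0).astype(float) - (w_star != 0).astype(float)) ** 2))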

  14. Results on Synthetic Data • 64 samples • 32 features (8 active) • Bolasso-S uses a soft intersection (features selected in at least 90% of replicates) • [Table: MSE prediction; 1.24???]

  15. Results on UCI data • [Table: mean squared prediction error]

  16. Some thoughts • Why do they compare bolasso variable-selection error against lasso, forward greedy, thresholded LS, and adaptive lasso, but then compare mean squared prediction error against lasso, ridge, and bagging? • All these results use low-dimensional data, whereas we are interested in large numbers of features • This is considered in the 2009 tech report • Based on the plots it seems best to use as many bootstrap replications m as possible (in contrast to the m-dependence of the bound in Prop 3) • Is there any insight into the size of the positive constants, which have a huge impact? • Based on the results it seems that we really want to use bolasso in problems where we know this bound to be loose

  17. 2009 Tech Report • Main extensions • Fills in the math details omitted previously • Discusses bootstrapping pairs vs. residuals • Proves both consistent for low-dimensional data • Shows empirical results favouring residuals in high-dimensional data • New upper and lower bounds for selecting active components in low-dimensional data • Proposes a similar method for high dimensions • Lasso with a high regularization parameter • Then bootstrap within the supports • Discusses implementation details

  18. Bootstrap Recap • Previously we sampled uniformly from the given dataset with replacement to generate each bootstrap set • Done in parallel • Bootstrapping can also be done sequentially • We saw this when reviewing Boosting

  19. Bootstrap Residuals • Compute residual errors from the lasso fit on the current dataset: e_i = y_i − x_iᵀŵ • Compute centered residuals: ẽ_i = e_i − (1/n)Σ_j e_j • Resample the centered residuals with replacement and create a new dataset from the pairs (x_i, x_iᵀŵ + ẽ*_i)
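
A minimal sketch of one such replicate, following the three steps on slide 19 (the function name is illustrative, and the fit deliberately omits an intercept to match the earlier formulation):

import numpy as np
from sklearn.linear_model import Lasso

def residual_bootstrap_replicate(X, y, mu_n, rng):
    w_hat = Lasso(alpha=mu_n, fit_intercept=False).fit(X, y).coef_
    fitted = X @ w_hat
    resid = y - fitted                       # residual errors from the current lasso fit
    resid_centered = resid - resid.mean()    # centered residuals
    resid_star = rng.choice(resid_centered, size=len(y), replace=True)  # resample with replacement
    return X, fitted + resid_star            # new pairs (x_i, x_i^T w_hat + e*_i)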

  20. Synthetic Results in High Dimensional Data • 64 samples, 128 features (8 active)

  21. Varying Replications in High Dimensional Data • [Plot legend: lasso (black), bolasso with m = 2, 4, 8, 16, 32, 64, 128, 256 (red), m = 512 (blue)]

  22. The End • Thanks for your attention and participation • Questions/Discussion???
