
Differentiable Sparse Coding


Presentation Transcript


  1. Differentiable Sparse Coding David Bradley and J. Andrew Bagnell NIPS 2008

  2. 100,000 ft View • Complex Systems, Joint Optimization • [System diagram: Cameras and Ladar → Voxel Features → Voxel Classifier → Voxel Grid → Column Cost → 2-D Planner → Y (path to goal); initialize with “cheap” data]

  3. 10,000 ft view • Sparse Coding = Generative Model • Semi-supervised learning • KL-Divergence Regularization • Implicit Differentiation • [Block diagram: unlabeled data, latent variable, optimization, classifier, loss, gradient]

  4. Sparse Coding Understand X

  5. As a combination of factors B

  6. Sparse coding uses optimization • Minimize a reconstruction loss function over some weight vector w, which we want to use to classify x • Contrast with projection (feed-forward): compute w directly from x in one step
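
The contrast on this slide can be sketched in a few lines of Python; this is a toy illustration under assumed shapes and an assumed simple smooth prior term, not the paper's implementation:

    import numpy as np
    from scipy.optimize import minimize

    # Toy contrast: two ways to turn an input x into a code w given a basis B.
    # The dimensions, the random basis, and the L2 prior term are assumptions.
    rng = np.random.default_rng(0)
    d, k = 64, 128
    B = rng.standard_normal((d, k))
    x = rng.standard_normal(d)

    # Projection (feed-forward): a single linear map, no optimization.
    w_proj = B.T @ x

    # Optimization: w is itself the argmin of a reconstruction objective.
    def objective(w, lam=0.1):
        resid = x - B @ w
        return 0.5 * resid @ resid + lam * w @ w   # reconstruction loss + prior term

    w_opt = minimize(objective, np.zeros(k), method="L-BFGS-B").x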

  7. Sparse vectors: Only a few significant elements

  8. Example: X = Handwritten Digits • Basis vectors colored by w

  9. Optimization vs. Projection • [Figure: input, basis, projection outputs]

  10. Optimization vs. Projection • [Figure: input, basis, KL-regularized optimization outputs] • Outputs are sparse for each example

  11. Generative Model • Latent variables are independent • Examples are independent • [Plate diagram: likelihood and prior for example i]
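
In symbols, the factorization sketched on this slide is roughly (notation assumed from the slide's labels, with x_i the i-th example and w_i its latent code):

    p(X, W \mid B) = \prod_i \Big[\, p(x_i \mid w_i, B) \prod_j p(w_{ij}) \Big]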

  12. Sparse Approximation • The MAP estimate of w combines the likelihood and the prior

  13. Sparse Approximation • Objective: distance between reconstruction and input, plus a regularization constant times the distance between the weight vector and the prior mean
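
Written out, the objective described on this slide has the general form (a reconstruction from the slide's labels; L is the reconstruction loss, D the distance to the prior mean w_0, and λ the regularization constant):

    w^*(x, B) = \arg\min_{w} \; L(x, Bw) + \lambda\, D(w, w_0)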

  14. Example: Squared Loss + L1 • Convex + sparse (widely studied in engineering) • Sparse coding solves for B as well (non-convex for now…) • Shown to generate useful features on diverse problems • Tropp, Signal Processing, 2006; Donoho and Elad, Proceedings of the National Academy of Sciences, 2002; Raina, Battle, Lee, Packer, Ng, ICML, 2007
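
For a single example this is exactly a Lasso problem with the basis B as the design matrix, so any standard solver applies; a minimal scikit-learn sketch (the random data and the alpha value are placeholders):

    import numpy as np
    from sklearn.linear_model import Lasso

    # Sparse approximation of one example x against a fixed basis B:
    #   min_w  1/(2d) ||x - B w||^2 + alpha ||w||_1   (scikit-learn's scaling)
    rng = np.random.default_rng(1)
    d, k = 64, 128
    B = rng.standard_normal((d, k))
    x = rng.standard_normal(d)

    w = Lasso(alpha=0.05, fit_intercept=False, max_iter=10000).fit(B, x).coef_
    print(np.count_nonzero(w), "of", k, "coefficients are nonzero")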

  15. L1 Sparse Coding • Shown to generate useful features on diverse problems • Optimize B over all examples
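
A minimal alternating-minimization sketch of the full L1 sparse coding problem: solve for the codes W with B fixed, then for B with W fixed. The basis update and normalization here are simplifications for illustration, not the authors' algorithm:

    import numpy as np
    from sklearn.linear_model import Lasso

    def l1_sparse_coding(X, k, lam=0.05, iters=10, seed=0):
        """X: (n, d) examples. Returns basis B (d, k) and codes W (n, k)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        B = rng.standard_normal((d, k))
        B /= np.linalg.norm(B, axis=0)                   # unit-norm basis vectors
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        for _ in range(iters):
            # Sparse approximation of every example, holding B fixed.
            W = np.stack([lasso.fit(B, x).coef_ for x in X])
            # Least-squares update of B, holding the codes W fixed.
            B = np.linalg.lstsq(W, X, rcond=None)[0].T
            B /= np.linalg.norm(B, axis=0) + 1e-12
        return B, W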

  16. Differentiable Sparse Coding • [Diagram: unlabeled data X feeds sparse coding with a reconstruction loss (Raina, Battle, Lee, Packer, Ng, ICML, 2007); labeled data (X, Y) feeds an optimization module (B) whose output W goes to a learning module (θ) and a loss function (Bradley & Bagnell, “Differentiable Sparse Coding”, NIPS 2008)]

  17. L1 Regularization is Not Differentiable • [Same diagram as slide 16]

  18. Why is this unsatisfying?

  19. Problem #1: Instability • L1 MAP estimates are discontinuous • Outputs are not differentiable • Instead use KL-divergence regularization, proven to compete with L1 in online learning

  20. Problem #2: No closed-form equation • At the MAP estimate, the gradient of the objective with respect to w is zero, but w cannot be written in closed form

  21. Solution: Implicit Differentiation • Since w* is a function of B, differentiate both sides of the optimality condition with respect to an element of B • Solve for the derivative of w* with respect to that element
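
For a smooth regularizer the resulting gradient needs only one linear solve against the Hessian of the sparse-approximation objective. The sketch below assumes a squared reconstruction loss plus a smooth, separable regularizer R; the names, shapes, and derivation details are assumptions consistent with the slide, not code from the paper:

    import numpy as np

    # Objective:  f(w; B) = 1/2 ||x - B w||^2 + lam * sum_i R(w_i)
    # Optimality: g(w*, B) = B^T (B w* - x) + lam * R'(w*) = 0
    # Implicit differentiation of g = 0, chained with a downstream loss L(w*):
    #   dL/dB = -( (B w* - x) v^T + (B v) w*^T ),  where  H v = dL/dw*
    # and H = B^T B + lam * diag(R''(w*)) is the Hessian of f at w*.

    def grad_wrt_basis(B, x, w_star, dL_dw, lam, d2R):
        """Backpropagate dL/dw* through w* = argmin_w f(w; B) via one linear solve."""
        H = B.T @ B + lam * np.diag(d2R)       # Hessian of f at the optimum
        v = np.linalg.solve(H, dL_dw)          # no unrolling of the inner solver
        resid = B @ w_star - x
        return -(np.outer(resid, v) + np.outer(B @ v, w_star))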

  22. Example: Squared Loss, KL Prior • The regularizer is the KL-divergence between the weight vector and the prior mean
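
One plausible way to write this example out, using the generalized (unnormalized) KL divergence between a nonnegative code w and the prior mean w_0; the exact form here is an assumption based on the slide's labels, not quoted from the paper:

    w^* = \arg\min_{w \ge 0}\; \tfrac{1}{2}\,\lVert x - Bw \rVert^2 + \lambda \sum_i \Big( w_i \log \tfrac{w_i}{w_{0,i}} - w_i + w_{0,i} \Big)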

  23. Handwritten Digit Recognition • 50,000 digit training set • 10,000 digit validation set • 10,000 digit test set

  24. Handwritten Digit Recognition • Step #1: Unsupervised sparse coding on the training set (L2 loss and L1 prior) • Raina, Battle, Lee, Packer, Ng, ICML, 2007
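
scikit-learn's dictionary learner optimizes the same L2-reconstruction-plus-L1-penalty objective, so it can stand in for this step; a sketch on the small built-in digits set rather than the 50,000-digit training set used in the paper:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import MiniBatchDictionaryLearning

    X = load_digits().data                     # (1797, 64) 8x8 digit images
    coder = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, random_state=0)
    W = coder.fit_transform(X)                 # sparse codes, one row per digit
    B = coder.components_                      # learned basis, shape (100, 64)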

  25. Handwritten Digit Recognition • Step #2: Train a maxent classifier (loss function) on the sparse approximation codes
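
Step #2 can be sketched as a multinomial logistic regression (maxent) classifier trained on those sparse codes; the snippet recomputes the codes so it stands alone and is only an illustration of the pipeline, not the paper's setup:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import MiniBatchDictionaryLearning
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    digits = load_digits()
    coder = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, random_state=0)
    W = coder.fit_transform(digits.data)                       # sparse codes
    W_tr, W_te, y_tr, y_te = train_test_split(W, digits.target, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(W_tr, y_tr)    # maxent classifier
    print("held-out accuracy:", clf.score(W_te, y_te))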

  26. Handwritten Digit Recognition • Step #3: Supervised sparse coding with the maxent classifier loss function

  27. KL Maintains Sparsity • Weight is concentrated in a few elements • [Plot, log scale]

  28. KL adds Stability • [Figure: inputs X with corresponding codes W under L1, KL, and backprop] • Duda, Hart, Stork, Pattern Classification, 2001

  29. Performance vs. Prior • [Results chart]

  30. Classifier Comparison • [Results chart]

  31. Comparison to other algorithms • [Results chart]

  32. Transfer to English Characters • 24,000 character training set • 12,000 character validation set • 12,000 character test set

  33. Transfer to English Characters • Step #1: Sparse approximation with the digits basis, fed to a maxent classifier loss function • Raina, Battle, Lee, Packer, Ng, ICML, 2007

  34. Transfer to English Characters • Step #2: Supervised sparse coding with the maxent classifier loss function

  35. Transfer to English Characters • [Results chart]

  36. Text Application • Step #1: Unsupervised sparse coding (KL loss + KL prior) on X = 5,000 movie reviews • 10-point sentiment scale: 1 = hated it, 10 = loved it • Pang, Lee, Proceedings of the ACL, 2005

  37. Text Application • Step #2: Sparse approximation of X, then linear regression (L2 loss) with 5-fold cross-validation • 10-point sentiment scale: 1 = hated it, 10 = loved it
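
Step #2 in sketch form: ordinary least-squares regression (L2 loss) on the review codes, scored with 5-fold cross-validation. The code matrix W and sentiment vector y below are random placeholders standing in for the 5,000 review codes and their 1-10 labels:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    W = rng.random((5000, 200))                          # placeholder review codes
    y = rng.integers(1, 11, size=5000).astype(float)     # placeholder 1-10 sentiment

    mse = -cross_val_score(LinearRegression(), W, y, cv=5,
                           scoring="neg_mean_squared_error")
    print("5-fold CV mean squared error:", mse.mean())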

  38. Text Application • Step #3: Supervised sparse coding, then linear regression (L2 loss) with 5-fold cross-validation • 10-point sentiment scale: 1 = hated it, 10 = loved it

  39. Movie Review Sentiment • [Results chart: unsupervised basis vs. supervised basis vs. state-of-the-art graphical model] • Blei, McAuliffe, NIPS, 2007

  40. Future Work • [System diagram: Ladar, RGB camera, and NIR camera inputs; sparse coding and engineered features feed a voxel classifier; MMP with example paths and labeled training data]

  41. Future Work: Convex Sparse Coding • Sparse approximation is convex • Sparse coding is not, because a fixed-size basis is a non-convex constraint • Sparse coding ↔ sparse approximation on an infinitely large basis + a non-convex rank constraint • Relax the rank constraint to a convex L1 constraint • Use boosting to do sparse approximation directly on the infinitely large basis • Bengio, Le Roux, Vincent, Delalleau, Marcotte, NIPS, 2005; Zhao, Yu, Feature Selection for Data Mining, 2005; Rifkin, Lippert, Journal of Machine Learning Research, 2007

  42. Questions?
