Laplace Maximum Margin Markov Networks


Presentation Transcript


  1. Laplace Maximum Margin Markov Networks
  Jun Zhu (jun-zhu@mails.tsinghua.edu.cn), Dept. of Comp. Sci. & Tech., Tsinghua University.
  This work was done while I was a visiting researcher at CMU. Joint work with Eric P. Xing and Bo Zhang.
  ICML 2008 @ Helsinki, Finland

  2. Outline
  • Introduction
  • Structured Prediction
  • Max-margin Markov Networks
  • Max-Entropy Discrimination Markov Networks
  • Basic Theorems
  • Laplace Max-margin Markov Networks
  • Experimental Results
  • Summary

  3. Classical Classification Models
  • Inputs: a set of training samples D = {(x_i, y_i)}, i = 1, ..., N, where x_i is a feature vector and y_i is a single (e.g., binary) label
  • Outputs: a predictive function h mapping an input x to a label y
  • Examples (see the sketch below):
    • Logistic Regression: max-likelihood estimation
    • Support Vector Machine (SVM): max-margin learning
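Not from the slides: a minimal runnable sketch of the two example classifiers on a toy binary problem, using scikit-learn. The data, random seed, and hyperparameters are made up for illustration.

```python
# Toy comparison of the two classical classifiers named on this slide.
import numpy as np
from sklearn.linear_model import LogisticRegression  # max-likelihood estimation
from sklearn.svm import LinearSVC                     # max-margin learning

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),        # class -1
               rng.normal(+1.0, 1.0, (50, 2))])       # class +1
y = np.array([-1] * 50 + [+1] * 50)

lr = LogisticRegression().fit(X, y)   # maximizes the conditional log-likelihood
svm = LinearSVC(C=1.0).fit(X, y)      # minimizes hinge loss + L2 penalty (max margin)

print("LR  train accuracy:", lr.score(X, y))
print("SVM train accuracy:", svm.score(X, y))
```

Both criteria return a single point estimate of the weight vector w; the rest of the talk is about learning a distribution over w instead.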

  4. Structured Prediction
  • Complicated prediction examples:
    • Part-of-speech (POS) tagging: "Do you want fries with that?" -> <verb pron verb noun prep pron>
    • Image segmentation
  • Inputs: a set of training samples {(x_i, y_i)}, where each label y_i is itself structured (e.g., a sequence of POS tags)
  • Outputs: a predictive function h mapping an input x to a structured label y, chosen from an exponentially large output space (see the decoding sketch below)
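Not from the slides: a small sketch of what structured prediction means computationally for the POS example, assuming a linear-chain scoring model. The tag set, emission scores, and transition scores are fabricated; the point is that the argmax over all |tags|^length tag sequences is computed by dynamic programming (Viterbi) rather than by enumeration.

```python
import numpy as np

tags = ["verb", "pron", "noun", "prep"]
# Fabricated per-position tag scores for "Do you want fries with that?"
emission = np.log(np.array([
    [0.7, 0.1, 0.1, 0.1],   # Do    -> verb
    [0.1, 0.7, 0.1, 0.1],   # you   -> pron
    [0.7, 0.1, 0.1, 0.1],   # want  -> verb
    [0.1, 0.1, 0.7, 0.1],   # fries -> noun
    [0.1, 0.1, 0.1, 0.7],   # with  -> prep
    [0.1, 0.7, 0.1, 0.1],   # that  -> pron
]))
transition = np.zeros((4, 4))  # uniform transition scores, for simplicity

def viterbi(emission, transition):
    """argmax over all tag sequences of the summed emission + transition scores."""
    T, K = emission.shape
    score = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = emission[0]
    for t in range(1, T):
        cand = score[t - 1][:, None] + transition + emission[t][None, :]
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0)
    path = [int(score[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [tags[k] for k in reversed(path)]

print(viterbi(emission, transition))
# ['verb', 'pron', 'verb', 'noun', 'prep', 'pron']
```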

  5. Structured Prediction Models
  • Conditional Random Fields (CRFs) (Lafferty et al., 2001)
    • Based on Logistic Regression
    • Max-likelihood estimation (point estimate)
  • Max-margin Markov Networks (M3Ns) (Taskar et al., 2003)
    • Based on SVM
    • Max-margin learning (point estimate)
  • Markov properties are encoded in the feature functions (the two objectives are contrasted below)
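The two training criteria, written out in their standard form (reconstructed from the cited papers, not transcribed from the slide): f(x, y) is the joint feature vector, Δf_i(y) = f(x_i, y_i) - f(x_i, y), and Δℓ_i(y) is a label loss such as the Hamming distance.

```latex
% CRFs (Lafferty et al., 2001): conditional max-likelihood, a point estimate of w
p(y \mid x; w) = \frac{1}{Z(x; w)} \exp\!\big( w^\top f(x, y) \big),
\qquad
\hat{w} = \arg\max_w \sum_i \log p(y_i \mid x_i; w)

% M3Ns (Taskar et al., 2003): max-margin learning, also a point estimate of w
\min_{w,\,\xi \ge 0} \; \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
w^\top \Delta f_i(y) \ge \Delta\ell_i(y) - \xi_i \;\; \forall i,\, \forall y
```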

  6. Between MLE and Max-margin Learning
  • Likelihood-based estimation
    • Probabilistic (joint/conditional likelihood model)
    • Easy to adapt to Bayesian learning, and to incorporate prior knowledge and missing data
  • Max-margin learning
    • Non-probabilistic (concentrates on the input-output mapping)
    • Not obvious how to perform Bayesian learning or handle priors and missing data
    • Sound theoretical guarantees with limited samples
  • Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999)
    • A Bayesian learning approach
    • The optimization problem (binary classification) is sketched below
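The MED optimization problem for binary classification, as I recall it from Jaakkola et al. (1999); the slide's own formula did not survive the transcript. F(x; w) is a discriminant function, p_0 is a prior over the parameters and slack/margin variables, and prediction uses the sign of the posterior-averaged discriminant.

```latex
\min_{p(w,\,\xi)} \; \mathrm{KL}\big( p(w, \xi) \,\|\, p_0(w, \xi) \big)
\quad \text{s.t.} \quad
\int p(w, \xi) \big[\, y_i F(x_i; w) - \xi_i \,\big] \, dw \, d\xi \;\ge\; 0
\quad \forall i

% Prediction: \hat{y} = \mathrm{sign}\; \mathbb{E}_{p(w)}\big[ F(x; w) \big]
```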

  7. MaxEnt Discrimination Markov Networks
  • MaxEnt Discrimination Markov Networks (MaxEntNet):
    • Generalized maximum entropy, or regularized KL-divergence, over distributions of model parameters
    • A subspace of distributions defined by expected margin constraints
    • Bayesian prediction by model averaging
  • The resulting problem is written out below
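A reconstruction of the MaxEntNet problem whose formulas were lost from this slide (my reading of the framework; the exact statement is in the technical report linked on the final slide). U(ξ) is a convex penalty on the slacks, e.g. C Σ_i ξ_i, and the constraints are M3N-style margin constraints taken in expectation under p(w).

```latex
\min_{p(w),\,\xi} \; \mathrm{KL}\big( p(w) \,\|\, p_0(w) \big) + U(\xi)
\quad \text{s.t.} \quad
\mathbb{E}_{p(w)}\big[\, w^\top \Delta f_i(y) \,\big] \;\ge\; \Delta\ell_i(y) - \xi_i
\quad \forall i,\, \forall y

% Bayesian prediction by averaging over the learned distribution:
h(x) = \arg\max_{y} \; \mathbb{E}_{p(w)}\big[\, w^\top f(x, y) \,\big]
```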

  8. Solution to MaxEntNet
  • Theorem 1 (Solution to MaxEntNet) gives:
    • the posterior distribution over the weights, and
    • the dual optimization problem
  • The dual involves the convex conjugate of the slack penalty (a closed proper convex function)
  • The definition and an example of the convex conjugate are given below
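The lost formulas, reconstructed as I understand them (the exact statements are in the linked technical report): the posterior is the prior reweighted by an exponential of the margin terms, the dual is a problem in the Lagrange multipliers α_i(y), and the convex conjugate handles the slack penalty U.

```latex
% Theorem 1: posterior distribution and dual problem
p(w) = \frac{1}{Z(\alpha)}\, p_0(w)\,
\exp\Big\{ \sum_{i,\,y} \alpha_i(y)\,\big[\, w^\top \Delta f_i(y) - \Delta\ell_i(y) \,\big] \Big\},
\qquad
\max_{\alpha \ge 0} \; -\log Z(\alpha) - U^*(\alpha)

% Convex conjugate of a closed proper convex function \varphi:
\varphi^*(\mu) = \sup_{\xi}\, \big( \mu^\top \xi - \varphi(\xi) \big)

% Example: for U(\xi) = C \sum_i \xi_i on \xi \ge 0, U^*(\mu) = 0 if \mu_i \le C
% for all i and +\infty otherwise, which yields box constraints on the dual variables.
```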

  9. Reduction to M3Ns
  • Theorem 2 (Reduction of MaxEntNet to M3Ns):
    • Assume a standard normal prior over the weights
    • Then the posterior distribution, the dual optimization, and the predictive rule recover those of M3Ns (sketched below)
  • Thus, MaxEntNet subsumes M3Ns and admits all the merits of max-margin learning
  • Furthermore, MaxEntNet has at least three advantages …
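The reduction, sketched under a standard normal prior p_0(w) = N(0, I); again a reconstruction rather than a transcription of the slide.

```latex
p_0(w) = \mathcal{N}(w \mid 0, I)
\;\Rightarrow\;
p(w) = \mathcal{N}(w \mid \mu, I),
\qquad
\mu = \sum_{i,\,y} \alpha_i(y)\, \Delta f_i(y)

% Dual optimization: exactly the M3N dual
% (with the box constraints coming from U^* on the previous slide)
\max_{\alpha \ge 0} \;
\sum_{i,\,y} \alpha_i(y)\, \Delta\ell_i(y)
\;-\; \tfrac{1}{2} \Big\| \sum_{i,\,y} \alpha_i(y)\, \Delta f_i(y) \Big\|^2

% Predictive rule: h(x) = \arg\max_{y} \mu^\top f(x, y), i.e., the M3N predictor
```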

  10. The Three Advantages
  • A PAC-Bayes prediction error guarantee
  • Regularization effects, such as a sparsity bias, introduced through the prior
  • An elegant way to incorporate latent variables and structures


  12. Generalization Guarantee
  • MaxEntNet is an averaging model
    • We also call it a Bayesian Max-Margin Markov Network
  • Theorem 3 (PAC-Bayes bound) gives a generalization error guarantee for this averaging model

  13. Laplace M3Ns (LapM3N)
  • The prior in MaxEntNet can be designed to introduce useful regularization effects, such as a sparsity bias
  • A normal prior is not so good for sparse data …
  • Instead, we use the Laplace prior …
    • It admits a hierarchical representation as a scale mixture of normals (see below)
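The Laplace prior and its hierarchical (scale-mixture) representation, written with the standard identity; the λ parameterization is the usual one, but the slide's own notation was lost.

```latex
p_0(w) = \prod_k \frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\,|w_k|},
\qquad
\frac{\sqrt{\lambda}}{2}\, e^{-\sqrt{\lambda}\,|w_k|}
= \int_0^\infty \mathcal{N}(w_k \mid 0, \tau_k)\,
  \frac{\lambda}{2}\, e^{-\lambda \tau_k / 2}\, d\tau_k

% i.e., each weight has a normal prior whose variance \tau_k is itself drawn
% from an exponential distribution; integrating out \tau_k gives the Laplace prior.
```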

  14. Posterior Shrinkage Effect in LapM3N
  • Exact integration over the Laplace prior in LapM3N yields a posterior mean with a shrinkage effect
  • Alternatively, the same quantity can be computed through the hierarchical representation
  • A similar calculation for M3Ns (a standard normal prior) produces no such shrinkage; the difference amounts to a functional transformation of the estimates

  15. Variational Bayesian Learning
  • The exact dual function is hard to optimize
  • Using the hierarchical representation, we obtain an upper bound and optimize that instead
  • Why is it easier? Alternating minimization leads to nicer optimization problems:
    • with the scale distribution fixed, the step over the weight posterior is an M3N optimization problem!
    • with the weight posterior fixed, the scale update has a closed-form solution!
  • A numerical check of the underlying bound follows below
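Not from the slides: a numeric check of the standard variational bound behind this alternation (not necessarily the paper's exact bound). For fixed scales τ the bound on the negative log Laplace prior is quadratic in w, so the weight step looks like an M3N problem; for fixed w, the minimizing τ is available in closed form. The λ value and test points below are arbitrary.

```python
import numpy as np

lam = 4.0                                   # Laplace prior parameter (assumed)
w = np.array([-2.0, -0.5, -0.1, 0.3, 1.0])  # a few nonzero weight values

def bound(w, tau, lam):
    """Upper bound on sqrt(lam)*|w|, the negative log Laplace prior up to constants."""
    return w**2 / (2 * tau) + lam * tau / 2   # quadratic in w for fixed tau

tau_star = np.abs(w) / np.sqrt(lam)           # closed-form minimizer over tau > 0

# The bound is tight at tau_star and valid elsewhere:
print(np.allclose(bound(w, tau_star, lam), np.sqrt(lam) * np.abs(w)))        # True
print((bound(w, tau_star + 0.1, lam) >= np.sqrt(lam) * np.abs(w)).all())     # True
```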

  16. Variational Bayesian Learning (cont.)

  17. Experiments
  • Compare LapM3N with:
    • CRFs: MLE
    • L2-CRFs: L2-norm penalized MLE
    • L1-CRFs: L1-norm penalized MLE (sparse estimation)
    • M3N: max-margin learning

  18. Experimental Results on Synthetic Datasets
  • Datasets with 100 i.i.d. features, of which 10, 30, or 50 are relevant
    • For each setting, we generate 10 datasets, each with 1,000 samples; true labels are assigned via a Gibbs sampler with 5,000 iterations
  • Datasets with 30 correlated relevant features + 70 i.i.d. irrelevant features

  19. Experimental Results on OCR Datasets
  • We randomly construct OCR100, OCR150, OCR200, and OCR250 for 10-fold CV

  20. Sensitivity to Regularization Constants
  • Regularization constants tried:
    • L1-CRF and L2-CRF: 0.001, 0.01, 0.1, 1, 4, 9, 16
    • M3N and LapM3N: 1, 4, 9, 16, 25, 36, 49, 64, 81
  • L1-CRFs are very sensitive to the regularization constant; the other models are more stable
  • LapM3N is the most stable of all

  21. Summary
  • We propose MaxEntNet, a general framework for Bayesian max-margin structured prediction
    • MaxEntNet subsumes the standard M3Ns
    • It comes with a PAC-Bayes theoretical error bound
  • We propose Laplace max-margin Markov networks (LapM3N)
    • LapM3N enjoys a posterior shrinkage effect
    • It performs as well as sparse models on synthetic data, and better on real data sets
    • It is more stable with respect to the regularization constant

  22. Thanks!
  Detailed proof: http://166.111.138.19/junzhu/MaxEntNet_TR.pdf
  http://www.sailing.cs.cmu.edu/pdf/2008/zhutr1.pdf
