Laplace Maximum Margin Markov Networks Jun Zhu jun-zhu@mails.tsinghua.edu.cn Dept. of Comp. Sci. & Tech., Tsinghua University This work was done when I was a visiting researcher at CMU. Joint work with Eric P. Xing and Bo Zhang.
Outline • Introduction • Structured Prediction • Max-margin Markov Networks • Max-Entropy Discrimination Markov Networks • Basic Theorems • Laplace Max-margin Markov Networks • Experimental Results • Summary
Classical Classification Models • Inputs: a set of training samples D = {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^d and y_i ∈ {+1, −1} • Outputs: a predictive function h: x ↦ y • Examples (both learn a linear discriminant; see the sketch below): • Logistic Regression • Max-likelihood estimation • Support Vector Machine (SVM) • Max-margin learning
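For concreteness, the two example models can be written in their textbook form (a standard restatement, not copied from the slide; here x_i ∈ R^d, y_i ∈ {+1, −1}, and the discriminant is linear, F(x; w) = w^T x):

  % Logistic regression: conditional likelihood, fit by MLE
  p(y \mid x; w) = \frac{1}{1 + \exp(-y\, w^\top x)}, \qquad
  \max_w \ \sum_{i=1}^N \log p(y_i \mid x_i; w)

  % SVM: max-margin learning with slack variables
  \min_{w,\ \xi \ge 0} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^N \xi_i
  \quad \text{s.t.} \quad y_i\, w^\top x_i \ \ge\ 1 - \xi_i, \ \ \forall i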
Structured Prediction • Complicated Prediction Examples: • Part-of-speech (POS) Tagging: "Do you want fries with that?" -> <verb pron verb noun prep pron> • Image segmentation • Inputs: a set of training samples D = {(x_i, y_i)}_{i=1}^N, where each y_i is a structured output (e.g., a sequence of tags, one per word) • Outputs: a predictive function h: x ↦ y
Structured Prediction Models • Conditional Random Fields (CRFs) (Lafferty et al., 2001) • Based on Logistic Regression • Max-likelihood estimation (point-estimate) • Max-margin Markov Networks (M3Ns) (Taskar et al., 2003) • Based on SVM • Max-margin learning (point-estimate) • Markov properties are encoded in the feature functions; the standard forms of both models are sketched below
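As a reference point for what follows, the two structured models can be sketched in their standard form (notation assumed rather than taken from the slide: f(x, y) is the vector of feature functions encoding the Markov structure, \Delta f_i(y) = f(x_i, y_i) - f(x_i, y), and \Delta\ell_i(y) is the structured loss of labeling y against the truth y_i):

  % CRF: conditional likelihood over the whole labeling, fit by (penalized) MLE
  p(y \mid x; w) = \frac{1}{Z(x; w)} \exp\big( w^\top f(x, y) \big)

  % M3N: structured max-margin learning (point estimate of w)
  \min_{w,\ \xi \ge 0} \ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
  \quad \text{s.t.} \quad w^\top \Delta f_i(y) \ \ge\ \Delta\ell_i(y) - \xi_i, \ \ \forall i,\ \forall y

  % Both predict with the same rule
  h(x) = \arg\max_{y} \ w^\top f(x, y)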
Between MLE and max-margin learning • Likelihood-based estimation • Probabilistic (joint/conditional likelihood model) • Easy to adapt to Bayesian learning, and to incorporate prior knowledge and missing data • Max-margin learning • Non-probabilistic (concentrates on the input-output mapping) • Not obvious how to perform Bayesian learning or handle priors and missing data • Sound theoretical guarantees with limited samples • Maximum Entropy Discrimination (MED) (Jaakkola et al., 1999) • A Bayesian learning approach that combines the two • The optimization problem (binary classification) is sketched below
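The MED program referred to in the last bullet can be sketched as follows for binary classification (a hedged restatement of Jaakkola et al., 1999; the slack penalty U(\xi) can be chosen in several equivalent ways, e.g. U(\xi) = C \sum_i \xi_i):

  \min_{p(w),\ \xi \ge 0} \ KL\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
  \quad \text{s.t.} \quad \int p(w)\, y_i\, F(x_i; w)\, dw \ \ge\ 1 - \xi_i, \ \ \forall i

That is, the SVM margin constraints are imposed only in expectation under a distribution p(w), and the \ell_2 regularizer is replaced by a KL-divergence to a prior p_0(w); prediction then averages the discriminant over p(w).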
MaxEnt Discrimination Markov networks • MaxEnt Discrimination Markov Networks (MaxEntNet): • Generalized maximum entropy, or regularized KL-divergence, objective • Feasible set: a subspace of distributions defined by expected margin constraints • Bayesian prediction by model averaging • The full problem is sketched below
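Putting the three bullets together, the MaxEntNet problem can be sketched as follows (notation as above: \Delta F_i(y; w) = F(x_i, y_i; w) - F(x_i, y; w) is the margin of the true labeling over y, and \Delta\ell_i(y) is the structured loss):

  \min_{p(w),\ \xi \ge 0} \ KL\big(p(w)\,\|\,p_0(w)\big) + U(\xi)
  \quad \text{s.t.} \quad \int p(w)\, \Delta F_i(y; w)\, dw \ \ge\ \Delta\ell_i(y) - \xi_i, \ \ \forall i,\ \forall y

  % Bayesian (averaging) prediction
  h(x) = \arg\max_{y} \ \int p(w)\, F(x, y; w)\, dw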
Solution to MaxEntNet • Theorem 1 (Solution to MaxEntNet): • Posterior distribution: an exponential tilting of the prior by the dual variables • Dual optimization problem: see the sketch below • Convex conjugate (of a closed proper convex function) • Def: φ*(μ) = sup_ν [⟨ν, μ⟩ − φ(ν)] • Ex: the conjugate of the slack penalty U(ξ), which supplies the constraints of the dual below
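Written out against the primal above, the solution has roughly the following form (a sketch, not a verbatim copy of the theorem; the \alpha_i(y) \ge 0 are the Lagrange multipliers of the expected-margin constraints):

  % Posterior: an exponential tilting of the prior by the dual variables
  p(w) = \frac{1}{Z(\alpha)}\; p_0(w)\, \exp\Big\{ \sum_{i, y} \alpha_i(y)\, \big[\, \Delta F_i(y; w) - \Delta\ell_i(y) \,\big] \Big\}

  % Dual problem, with U^* the convex conjugate of the slack penalty U
  \max_{\alpha} \ -\log Z(\alpha) \ -\ U^*(\alpha)

For the common choice U(\xi) = C \sum_i \xi_i (an assumption for illustration), U^* is the indicator of box-type constraints on the \alpha_i(y), which is how the familiar M3N dual constraints arise.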
Reduction to M3Ns • Theorem 2 (Reduction of MaxEntNet to M3Ns): • Assume a standard normal prior p0(w) = N(0, I) and a linear discriminant F(x, y; w) = w^T f(x, y) • Posterior distribution: a Gaussian whose mean is the M3N solution • Dual optimization: exactly the M3N dual • Predictive rule: exactly the M3N rule (see the sketch below) • Thus, MaxEntNet subsumes M3Ns and inherits all the merits of max-margin learning • Furthermore, MaxEntNet has at least three advantages …
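Under the stated assumptions (standard normal prior, linear discriminant), plugging into the solution of Theorem 1 gives the familiar M3N quantities; a sketch:

  % Posterior: a unit-variance Gaussian around the M3N-style mean
  p(w) = \mathcal{N}\big(w \mid \mu,\, I\big), \qquad \mu = \sum_{i, y} \alpha_i(y)\, \Delta f_i(y)

  % Dual objective: exactly the M3N dual
  \max_{\alpha} \ \sum_{i, y} \alpha_i(y)\, \Delta\ell_i(y) \ -\ \tfrac{1}{2}\, \Big\| \sum_{i, y} \alpha_i(y)\, \Delta f_i(y) \Big\|^2

  % Predictive rule: exactly the M3N rule
  h(x) = \arg\max_{y} \ \mu^\top f(x, y)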
The Three Advantages • Provides a PAC-Bayes prediction error guarantee • Introduces regularization effects, such as a sparsity bias • Provides an elegant way to incorporate latent variables and structures
Generalization Guarantee • MaxEntNet is an averaging model • We also call it a Bayesian Max-Margin Markov Network • Theorem 3 (PAC-Bayes Bound): with high probability, the expected structured error of the averaging predictor is bounded by its empirical margin error plus a complexity term involving KL(p(w) || p0(w)); the generic form of such bounds is sketched below
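The bound belongs to the PAC-Bayes family; a generic member of that family (not the paper's exact Theorem 3, whose margin terms and constants are specific to the structured setting) reads:

  % For any prior P, with probability at least 1 - \delta over an i.i.d. sample of size N,
  % simultaneously for all posteriors Q:
  kl\big(\hat{e}_Q \,\|\, e_Q\big) \ \le\ \frac{KL(Q \,\|\, P) + \ln\frac{2\sqrt{N}}{\delta}}{N}

Here \hat{e}_Q and e_Q are the empirical and true errors of the Q-averaged predictor and kl is the binary KL divergence; specialized to MaxEntNet, the complexity term becomes KL(p(w) || p_0(w)), which is exactly the quantity the learning objective controls.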
Laplace M3Ns (LapM3N) • The prior in MaxEntNet can be designed to introduce useful regularization effects, such as a sparsity bias • A normal prior is not well suited to sparse data … • Instead, we use the Laplace prior … • with a hierarchical representation (scale mixture of Gaussians), sketched below
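Concretely, the Laplace prior and its scale-mixture (hierarchical) representation are (standard identities; \lambda controls the strength of the sparsity bias):

  p_0(w) = \prod_{k} \frac{\sqrt{\lambda}}{2}\, \exp\big( -\sqrt{\lambda}\, |w_k| \big)

  % Scale mixture: a Gaussian with an exponential hyper-prior on its variance
  p_0(w_k) = \int_0^{\infty} \mathcal{N}\big(w_k \mid 0,\ \tau_k\big)\; \frac{\lambda}{2}\, \exp\Big( -\frac{\lambda \tau_k}{2} \Big)\, d\tau_k

Integrating out the per-weight variance \tau_k recovers the Laplace density; keeping \tau_k explicit is what makes the variational treatment on the later slides tractable.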
Posterior shrinkage effect in LapM3N • Exact integration over the Laplace prior is possible in LapM3N, and it shrinks the components of the posterior mean of w toward zero • Alternatively, the shrinkage can be viewed as a functional transformation of the corresponding dual mean • A similar calculation for M3Ns (with a standard normal prior) produces no such shrinkage
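A rough way to see why the shrinkage appears (an intuition sketch under simplifying assumptions, not the exact derivation behind this slide): the KL term against the prior acts as a regularizer on the posterior mean \mu, and the two priors penalize \mu very differently:

  % Standard normal prior: quadratic (ridge-like) penalty, recovering M3N
  KL\big( \mathcal{N}(\mu, I) \,\|\, \mathcal{N}(0, I) \big) = \tfrac{1}{2}\, \|\mu\|^2

  % Laplace prior: a penalty that grows linearly in |\mu_k| (lasso-like),
  % so small components are pushed toward zero much more aggressively
  -\log p_0(\mu) = \sqrt{\lambda}\, \sum_k |\mu_k| + \text{const}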
Variational Bayesian Learning • The exact dual function is hard to optimize • Using the hierarchical representation, we obtain an upper bound and optimize it instead (see the sketch below) • Why is it easier? • Alternating minimization leads to nicer optimization problems: • the step over p(w) is an M3N optimization problem! • the step over the variational distribution has a closed-form solution!
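The upper bound comes from Jensen's inequality applied to the scale-mixture form of the prior; a sketch, with q(\tau) a variational distribution over the per-weight variances (the exact factorization used is an assumption here):

  -\log p_0(w) \;=\; -\log \int p(w \mid \tau)\, p(\tau)\, d\tau
  \;\le\; -\,\mathbb{E}_{q(\tau)}\Big[ \log \frac{p(w \mid \tau)\, p(\tau)}{q(\tau)} \Big]

  % Hence an upper bound L on the KL term, minimized by alternating steps
  KL\big(p(w) \,\|\, p_0(w)\big) \;\le\; \mathbb{E}_{p(w)}[\log p(w)]
  \;-\; \mathbb{E}_{p(w)}\mathbb{E}_{q(\tau)}\big[\log p(w \mid \tau)\, p(\tau)\big] \;-\; H\big(q(\tau)\big) \;=:\; L\big(p(w), q(\tau)\big)

With q(\tau) fixed, \mathbb{E}_{q(\tau)}[\log p(w \mid \tau)] is quadratic in w, so minimizing L over p(w) is a Gaussian-prior problem of exactly the M3N form; with p(w) fixed, the minimizing q(\tau) is available in closed form, which gives the alternating scheme described above.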
Experiments • Compare LapM3N with: • CRFs: MLE • L2-CRFs: L2-norm penalized MLE • L1-CRFs: L1-norm penalized MLE (sparse estimation) • M3N: max-margin learning
Experimental results on synthetic datasets • Datasets with 100 i.i.d. features, of which 10, 30, or 50 are relevant • For each setting, we generate 10 datasets, each with 1000 samples; true labels are assigned via a Gibbs sampler run for 5000 iterations • Datasets with 30 correlated relevant features + 70 i.i.d. irrelevant features
Experimental results on OCR datasets • We randomly construct OCR100, OCR150, OCR200, and OCR250 for 10-fold CV
Sensitivity to Regularization Constants • L1-CRFs are much more sensitive to the regularization constant; the others are more stable • LapM3N is the most stable of all • Regularization constants tried for L1-CRFs and L2-CRFs: 0.001, 0.01, 0.1, 1, 4, 9, 16 • For M3N and LapM3N: 1, 4, 9, 16, 25, 36, 49, 64, 81
Summary • We propose MaxEntNet, a general framework for Bayesian max-margin structured prediction • MaxEntNet subsumes the standard M3Ns • It comes with a PAC-Bayes theoretical error bound • We propose Laplace max-margin Markov networks (LapM3N) • LapM3N enjoys a posterior shrinkage effect • It can perform as well as sparse models on synthetic data, and better on real datasets • It is more stable with respect to the regularization constant
Thanks! Detailed Proof: http://166.111.138.19/junzhu/MaxEntNet_TR.pdf http://www.sailing.cs.cmu.edu/pdf/2008/zhutr1.pdf