
A Framework For Tuning Posterior Entropy

Rajhans Samdani. Joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth, University of Illinois at Urbana-Champaign. Workshop on Inferning, ICML 2012, Edinburgh.


Presentation Transcript


  1. A Framework For Tuning Posterior Entropy • Rajhans Samdani • Joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth • University of Illinois at Urbana-Champaign • Workshop on Inferning, ICML 2012, Edinburgh

  2. Inference: Predicting Structures • Predict the output variable y from the space of allowed outputs Y given input variable x, using parameters (weight vector) w • E.g. • predict POS tags given a sentence, • predict word alignments given sentences in two different languages, • predict the entity-relation structure from a document • Prediction expressed as $y^* = \arg\max_{y \in Y} P(y \mid x; w)$
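The argmax prediction rule on this slide can be illustrated with a brute-force sketch for POS tagging. Everything concrete here is a hypothetical assumption made for illustration: the tag set `TAGS`, the toy weight tables `EMIT` and `TRANS`, and the function names. Real structured predictors would use dynamic programming (e.g. Viterbi) rather than enumerating Y.

```python
from itertools import product

# Hypothetical toy weights w (not from the talk): emission and transition log-scores.
EMIT = {("the", "DET"): 2.0, ("dog", "NOUN"): 2.0, ("barks", "VERB"): 2.0}
TRANS = {("DET", "NOUN"): 1.0, ("NOUN", "VERB"): 1.0}
TAGS = ("DET", "NOUN", "VERB")


def log_score(x, y):
    """Unnormalized log P(y | x; w) for a tag sequence y over the sentence x."""
    s = sum(EMIT.get((word, tag), 0.0) for word, tag in zip(x, y))
    s += sum(TRANS.get(pair, 0.0) for pair in zip(y, y[1:]))
    return s


def predict(x):
    """Brute-force y* = argmax over every tag sequence in the output space Y."""
    return max(product(TAGS, repeat=len(x)), key=lambda y: log_score(x, y))


best = predict(("the", "dog", "barks"))
```

Enumeration is exponential in the sentence length, which is why it only serves as a readable stand-in for the inference step here.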

  3. Learning: Weakly Supervised Learning • Labeled data is scarce and difficult to obtain • A lot of work on learning with a small amount of labeled data • Expectation Maximization (EM) algorithm is the de facto standard • More recently: significant work on injecting weak supervision or domain knowledge via constraints into EM • Constraint-driven Learning (CoDL; Chang et al, 07) • Posterior regularization (PR; Ganchev et al, 10)

  4. Learning Using EM: a Quick Primer • Given unlabeled data x, estimate w; hidden: y • for t = 1 … T do • E-step: infer a posterior distribution q over y: $q_t(y) = \arg\min_q KL(q(y), P(y \mid x; w_t))$ (Neal and Hinton, 99), which gives $q_t(y) = P(y \mid x; w_t)$, the conditional distribution of y given w_t (the posterior distribution) • M-step: estimate the parameters w w.r.t. q: $w_{t+1} = \arg\max_w E_q \log P(x, y; w)$ • The E-step is an inference step, and the M-step learns w.r.t. the inferred distribution
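The E-step/M-step loop above can be sketched on a small generative model. The model is an assumption chosen for brevity and is not from the talk: each observation x is the number of heads in `n` flips of one of two coins, the hidden y is which coin was used, and w consists of the mixing weights `pi` and head probabilities `theta`. All function names are illustrative.

```python
def e_step(data, n, pi, theta):
    """E-step: q_t(y) = P(y | x; w_t) for every observation x."""
    posteriors = []
    for h in data:
        # The binomial coefficient is the same for both coins and cancels out.
        joint = [pi[k] * theta[k] ** h * (1 - theta[k]) ** (n - h) for k in range(2)]
        z = sum(joint)
        posteriors.append([j / z for j in joint])
    return posteriors


def m_step(data, n, q):
    """M-step: argmax_w E_q log P(x, y; w), available in closed form here."""
    pi, theta = [], []
    for k in range(2):
        resp = sum(qi[k] for qi in q)                     # expected count of coin k
        heads = sum(qi[k] * h for qi, h in zip(q, data))  # expected heads from coin k
        pi.append(resp / len(data))
        theta.append(heads / (resp * n))
    return pi, theta


def em(data, n, pi, theta, iters=50):
    """Alternate E- and M-steps for a fixed number of iterations."""
    for _ in range(iters):
        q = e_step(data, n, pi, theta)
        pi, theta = m_step(data, n, q)
    return pi, theta


data = [9, 8, 2, 1, 9, 2]  # heads out of n = 10 flips per trial
pi, theta = em(data, n=10, pi=[0.5, 0.5], theta=[0.6, 0.4])
```

Replacing the soft posterior in `e_step` with a one-hot vector on the most probable coin would turn this loop into hard EM.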

  5. Different EM Variations • Hard EM changes the E-step of the EM algorithm • Which version to use: EM (PR) vs hard EM (CoDL) (Spitkovsky et al, 10) (Pedro’s talk)? • Or is there something better out there? • OUR CONTRIBUTION: a unified framework for EM algorithms, Unified EM (UEM) (Samdani et al, 12) • A framework which explicitly provides a handle on the entropy of the inferred distribution during the E-step • Includes existing EM algorithms • Pick the most suitable EM algorithm in a simple, adaptive, and principled way

  6. Outline • Background: Expectation Maximization (EM) • Unified Expectation Maximization (UEM) • Motivation • Formulation and mathematical intuition • Experiments

  7. Different Versions Of EM • EM/Posterior Regularization (Ganchev et al, 10) • E-step: $\arg\min_q KL(q(y), P(y \mid x; w_t))$ subject to $E_q[Uy] \le b$ • M-step: $\arg\max_w E_q \log P(x, y; w)$ • Hard EM/Constraint-driven learning (Chang et al, 07) • E-step: $y^* = \arg\max_y P(y \mid x; w)$ subject to $Uy \le b$ • M-step: $\arg\max_w E_q \log P(x, y; w)$ • Not clear which version to use!

  8. Motivation: Unified Expectation Maximization (UEM) • [Figure labels: EM, Hard EM] • EM (PR) and hard EM (CoDL) differ mostly in the entropy of the posterior distribution • UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter γ

  9. Unified EM (UEM) • Changes the entropy of the posterior • EM (PR) minimizes the KL divergence $KL(q, P(y \mid x; w))$, where $KL(q, p) = \sum_y \left[ q(y) \log q(y) - q(y) \log p(y) \right]$ • UEM changes the E-step of standard EM and minimizes a modified KL divergence $KL(q, P(y \mid x; w); \gamma)$, where $KL(q, p; \gamma) = \sum_y \left[ \gamma\, q(y) \log q(y) - q(y) \log p(y) \right]$ • Different γ values ⇒ different EM algorithms
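A short worked derivation may help show why the single entropy parameter (written γ here) interpolates between EM and hard EM. This is a sketch for the unconstrained case only, assuming p(y) > 0 for all y; it is a standard Lagrangian argument and is not taken from the slide itself.

```latex
% Unconstrained UEM E-step over the probability simplex (assumes p(y) > 0 for all y).
\begin{align*}
q^\star &= \arg\min_{q}\; \sum_y \big[\gamma\, q(y)\log q(y) - q(y)\log p(y)\big]
  \quad \text{s.t. } \sum_y q(y) = 1,\; q(y) \ge 0 \\
% Case gamma > 0: stationarity of the Lagrangian with multiplier lambda
0 &= \gamma\,(\log q(y) + 1) - \log p(y) + \lambda
  \;\Longrightarrow\; q^\star(y) = \frac{p(y)^{1/\gamma}}{\sum_{y'} p(y')^{1/\gamma}} \\
% Special cases
\gamma = 1 &: \; q^\star = p \quad \text{(standard EM)} \\
\gamma \to \infty &: \; q^\star \to \text{uniform} \\
\gamma \le 0 &: \; \text{objective concave in } q,\ \text{so a minimizer is a point mass on } \arg\max_y p(y) \quad \text{(hard EM)}
\end{align*}
```

With linear constraints on q the minimizer generally has no such closed form, so this derivation covers only the constraint-free row of the comparison.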

  10. Effect of Changing γ • $KL(q, p; \gamma) = \sum_y \left[ \gamma\, q(y) \log q(y) - q(y) \log p(y) \right]$ • [Figure: the original distribution p alongside the minimizing q for γ = ∞, γ = 1, γ = 0, and γ = -∞]
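The effect of the entropy parameter (written γ, `gamma` in code) can be reproduced numerically with a minimal sketch. It assumes the unconstrained case with p(y) > 0 everywhere, where the minimizer of the modified divergence is q(y) ∝ p(y)^(1/γ) for γ > 0 and a point mass on the most probable y for γ ≤ 0. The function names `uem_posterior` and `entropy` are illustrative, not from the talk.

```python
import math


def uem_posterior(p, gamma):
    """Minimize sum_y [gamma*q*log q - q*log p] over distributions q.

    Assumes no constraints and p[i] > 0 for every i.
    """
    if gamma <= 0:
        # Concave objective: a minimizer is the point mass on argmax p (hard EM).
        best = max(range(len(p)), key=lambda i: p[i])
        return [1.0 if i == best else 0.0 for i in range(len(p))]
    if math.isinf(gamma):
        # Only the entropy term matters in the limit: uniform distribution.
        return [1.0 / len(p)] * len(p)
    # q(y) proportional to p(y)^(1/gamma), computed in log-space for stability.
    logs = [math.log(pi) / gamma for pi in p]
    m = max(logs)
    unnorm = [math.exp(v - m) for v in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def entropy(q):
    """Shannon entropy in nats, treating 0*log 0 as 0."""
    return -sum(qi * math.log(qi) for qi in q if qi > 0)


p = [0.5, 0.3, 0.2]
for gamma in (math.inf, 1.0, 0.5, 0.0, -1.0):
    q = uem_posterior(p, gamma)
    print(gamma, [round(v, 3) for v in q], round(entropy(q), 3))
```

Running the loop shows the entropy of q shrinking as `gamma` decreases from infinity toward zero, which is the "hardness" knob the slide describes.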

  11. Unifying Existing EM Algorithms • $KL(q, p; \gamma) = \sum_y \left[ \gamma\, q(y) \log q(y) - q(y) \log p(y) \right]$ • Changing γ essentially changes the “hardness of inference” • [Figure: γ axis marked at -∞, 0, 1, ∞ • No constraints: Hard EM, EM (γ = 1), Deterministic Annealing (Smith and Eisner, 04; Hofmann, 99) • With constraints: CoDL (γ = -∞), PR (γ = 1)]

  12. Outline • Setting up the problem • Introduction to Unified Expectation Maximization • Experiments • POS tagging • Entity-Relation Extraction • Word Alignment

  13. Experiments: exploring the role of γ • Test if changing the inference step by tuning γ helps improve the performance over baselines • Compare against: • Posterior Regularization (PR), which corresponds to γ = 1.0 • Constraint-driven Learning (CoDL), which corresponds to γ = -∞ • Study the relation between the quality of initialization and γ (or “hardness” of inference)

  14. Unsupervised POS Tagging • Model as first order HMM • Try varying qualities of initialization: • Uniform initialization: initialize with equal probability for all states • Supervised initialization: initialize with parameters trained on varying amounts of labeled data • Test the “conventional wisdom” that hard EM does well with good initialization and EM does better with a weak initialization

  15. Unsupervised POS tagging: Different EM instantiations • [Figure: performance relative to EM (y-axis) versus γ (x-axis), with EM and hard EM marked; one curve each for uniform initialization and for initialization with 5, 10, 20, and 40-80 examples]

  16. Experiments: Entity-Relation Extraction • Example: “Dole’s wife, Elizabeth, is a resident of N.C.” with entities E1, E2, E3 and relations R12, R23 • Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. Lives-in, Org-based-in, Killed) between pairs of entities • Add constraints: • Type constraints between entity and relations • Expected count constraints to regularize the counts of ‘None’ relation • Semi-supervised learning with a small amount of labeled data

  17. Result on Relations • UEM is statistically significantly better than PR • [Figure: macro-F1 scores (y-axis) versus % of labeled data (x-axis)]

  18. Experiments: Word Alignment • Word alignment from a language S to language T • We try En-Fr and En-Es pairs • We use an HMM-based model with agreement constraints for word alignment • PR with agreement constraints is known to give huge improvements over HMM (Ganchev et al, 08; Graca et al, 08) • Use our efficient algorithm to decompose the E-step into individual HMMs

  19. Word Alignment: EN-FR with 10k Unlabeled Data • [Figure: alignment error rate]

  20. Word Alignment: EN-FR • [Figure: alignment error rate]

  21. Word Alignment: FR-EN • [Figure: alignment error rate]

  22. Word Alignment: EN-ES • [Figure: alignment error rate]

  23. Word Alignment: ES-EN • [Figure: alignment error rate]

  24. Experiments Summary • In different settings, different baselines work better • Entity-Relation extraction: CoDL does better than PR • Word Alignment: PR does better than CoDL • Unsupervised POS tagging: depends on the initialization • UEM allows us to choose the best algorithm in all of these cases • Best version of EM: a new version with 0 < γ < 1

  25. Unified EM: Summary • UEM: a unified framework for EM algorithms which tunes the entropy of the posterior by a single parameter γ • γ: adaptively changes the entropy of the posterior based on the data, initialization, and constraints • Experimentally: the best γ corresponds to neither EM (PR) nor hard EM (CoDL) and is found through the UEM framework • Shows the role of inference in learning: learned parameters seem to be sensitive to the entropy of the inferred posterior • Open question: What is actually going on? • How is the entropy of the E-step actually changing the learnt model? • Questions?
