A Framework For Tuning Posterior Entropy
Rajhans Samdani
Joint work with Ming-Wei Chang (Microsoft Research) and Dan Roth
University of Illinois at Urbana-Champaign
Workshop on Inferning, ICML 2012, Edinburgh
Inference: Predicting Structures
• Predict the output variable y from the space of allowed outputs $\mathcal{Y}$, given the input variable x, using parameters (weight vector) w
• E.g.
  • predict POS tags given a sentence,
  • predict word alignments given sentences in two different languages,
  • predict the entity-relation structure from a document
• Prediction expressed as $y^* = \arg\max_{y \in \mathcal{Y}} P(y \mid x; w)$
Learning: Weakly Supervised Learning
• Labeled data is scarce and difficult to obtain
• A lot of work on learning with a small amount of labeled data
• Expectation Maximization (EM) algorithm is the de facto standard
• More recently: significant work on injecting weak supervision or domain knowledge via constraints into EM
  • Constraint-driven Learning (CoDL; Chang et al, 07)
  • Posterior Regularization (PR; Ganchev et al, 10)
Learning Using EM: a Quick Primer
• Given unlabeled data x, estimate w; hidden: y
• for t = 1 … T do
  • E-step: infer a posterior distribution, q, over y: $q_t(y) = \arg\min_q KL(q(y), P(y \mid x; w_t))$ (Neal and Hinton, 99), which gives $q_t(y) = P(y \mid x; w_t)$, the conditional distribution of y given w
  • M-step: estimate the parameters w w.r.t. q: $w_{t+1} = \arg\max_w E_q \log P(x, y; w)$
• The E-step is an inference step, and the M-step learns w.r.t. the distribution inferred
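The loop above can be illustrated on a toy problem. The following is a minimal sketch, not the model used in the talk: it assumes a mixture of two biased coins, where each observation x_i is the number of heads in a fixed number of flips and the hidden y_i says which coin was used. The function name `em_two_coins` and the toy model are assumptions made for illustration.

```python
import numpy as np

def em_two_coins(heads, n_flips, theta_init, n_iters=50):
    """EM for a mixture of two coins (hypothetical toy model).

    heads[i] is the number of heads in n_flips flips of an unknown coin;
    the hidden variable y_i in {0, 1} is the identity of that coin.
    """
    heads = np.asarray(heads, dtype=float)
    theta = np.asarray(theta_init, dtype=float)  # P(heads) for each coin
    pi = np.array([0.5, 0.5])                    # mixing weights P(y)
    for _ in range(n_iters):
        # E-step: q_i(y) = P(y | x_i; w_t), computed in log space.
        # The binomial coefficient is the same for both coins and cancels.
        log_joint = (np.log(pi)[None, :]
                     + heads[:, None] * np.log(theta)[None, :]
                     + (n_flips - heads)[:, None] * np.log1p(-theta)[None, :])
        log_joint -= log_joint.max(axis=1, keepdims=True)
        q = np.exp(log_joint)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: w_{t+1} = argmax_w E_q[log P(x, y; w)], closed form here.
        pi = q.mean(axis=0)
        theta = (q * heads[:, None]).sum(axis=0) / (q.sum(axis=0) * n_flips)
        theta = np.clip(theta, 1e-6, 1 - 1e-6)   # keep the logs finite
    return pi, theta, q

pi, theta, q = em_two_coins([9, 8, 9, 1, 2, 1, 8, 2], 10, [0.4, 0.6])
print(pi, theta)
```

Replacing the soft posterior `q` with a one-hot vector on each row's argmax would turn this into hard EM, which is the variation the following slides discuss.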
Different EM Variations
• Hard EM changes the E-step of the EM algorithm
• Which version to use: EM (PR) vs hard EM (CoDL) (Spitkovsky et al, 10) (Pedro’s talk)?
• Or is there something better out there?
• OUR CONTRIBUTION: a unified framework for EM algorithms, Unified EM (UEM) (Samdani et al, 12)
  • A framework which explicitly provides a handle on the entropy of the inferred distribution during the E-step
  • Includes existing EM algorithms
  • Pick the most suitable EM algorithm in a simple, adaptive, and principled way
Outline
• Background: Expectation Maximization (EM)
• Unified Expectation Maximization (UEM)
  • Motivation
  • Formulation and mathematical intuition
• Experiments
Different Versions Of EM
EM / Posterior Regularization (Ganchev et al, 10)
• E-step: $\arg\min_q KL(q_t(y), P(y \mid x; w_t))$, with constraints $E_q[Uy] \le b$
• M-step: $\arg\max_w E_q \log P(x, y; w)$
Hard EM / Constraint-driven Learning (Chang et al, 07)
• E-step: $y^* = \arg\max_y P(y \mid x, w)$, with constraints $Uy \le b$
• M-step: $\arg\max_w E_q \log P(x, y; w)$
Not clear which version to use!
Motivation: Unified Expectation Maximization (UEM)
[Figure: posterior distributions labeled "EM" and "Hard EM"]
• EM (PR) and hard EM (CoDL) differ mostly in the entropy of the posterior distribution
• UEM tunes the entropy of the posterior distribution q and is parameterized by a single parameter $\gamma$
Unified EM (UEM)
• EM (PR) minimizes the KL-divergence $KL(q, P(y \mid x; w))$, where $KL(q, p) = \sum_y q(y) \log q(y) - q(y) \log p(y)$
• UEM changes the E-step of standard EM and minimizes a modified KL divergence $KL(q, P(y \mid x; w); \gamma)$, where $KL(q, p; \gamma) = \sum_y \gamma\, q(y) \log q(y) - q(y) \log p(y)$
• The factor $\gamma$ changes the entropy of the posterior
• Different $\gamma$ values → different EM algorithms
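To make the role of the entropy weight concrete, the modified E-step can be worked out by hand in the simplest setting. The derivation below is a sketch that assumes no constraints on q beyond normalization and assumes p(y) > 0 for every y; the multiplier $\lambda$ is introduced here only for the derivation.

```latex
\begin{align*}
\min_{q}\;& \sum_{y} \gamma\, q(y)\log q(y) - q(y)\log p(y)
  \quad \text{s.t.} \quad \sum_{y} q(y) = 1,\; q(y) \ge 0 \\
\text{stationarity:}\;& \gamma\,\bigl(\log q(y) + 1\bigr) - \log p(y) + \lambda = 0 \\
\Rightarrow\;& q(y) = \frac{p(y)^{1/\gamma}}{\sum_{y'} p(y')^{1/\gamma}}
  \qquad (\gamma > 0)
\end{align*}
```

Under these assumptions, setting the weight to 1 returns q = p, which is the standard EM posterior. For a weight of zero or below, the objective is linear or concave in q, so its minimum over the simplex sits at a vertex, and the best vertex is the point mass on $\arg\max_y p(y)$, which is the hard EM prediction.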
Effect of Changing $\gamma$
$KL(q, p; \gamma) = \sum_y \gamma\, q(y) \log q(y) - q(y) \log p(y)$
[Figure: the original distribution p shown alongside q for $\gamma = \infty$, $\gamma = 1$, $\gamma = 0$, and $\gamma = -\infty$]
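The effect can also be checked numerically. The sketch below covers only the unconstrained case and assumes p(y) > 0 everywhere: for a positive entropy weight the minimizer of the modified KL is q(y) ∝ p(y)^(1/γ), and for a weight of zero or below all mass moves to the argmax of p. The names `uem_posterior` and `entropy` are hypothetical helpers, not part of any released UEM code.

```python
import numpy as np

def uem_posterior(p, gamma):
    """Unconstrained minimizer of sum_y gamma*q*log(q) - q*log(p) over the simplex.

    Assumes every entry of p is strictly positive.
    """
    p = np.asarray(p, dtype=float)
    if gamma <= 0:
        # Objective is linear (gamma = 0) or concave (gamma < 0) in q,
        # so the minimum is a point mass on the most probable output.
        q = np.zeros_like(p)
        q[np.argmax(p)] = 1.0
        return q
    if np.isinf(gamma):
        # Limiting case: the entropy term dominates and q becomes uniform.
        return np.full_like(p, 1.0 / p.size)
    logits = np.log(p) / gamma       # q is proportional to p ** (1 / gamma)
    logits -= logits.max()           # numerical stability
    q = np.exp(logits)
    return q / q.sum()

def entropy(q):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    q = np.asarray(q, dtype=float)
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

p = [0.5, 0.3, 0.2]
for gamma in [-1.0, 0.0, 0.5, 1.0, 2.0, np.inf]:
    q = uem_posterior(p, gamma)
    print(gamma, np.round(q, 3), round(entropy(q), 3))
```

In this sketch the entropy of q grows with the weight: zero for the point mass, equal to the entropy of p at a weight of 1, and maximal for the uniform limit.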
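Unifying Existing EM Algorithms
$KL(q, p; \gamma) = \sum_y \gamma\, q(y) \log q(y) - q(y) \log p(y)$
Changing $\gamma$ essentially changes the “hardness of inference”
[Figure: a $\gamma$ axis marked at $-\infty$, 0, 1, and $\infty$, with a "No Constraints" row (Hard EM, EM), a "With Constraints" row (CoDL, PR), and Deterministic Annealing (Smith and Eisner, 04; Hofmann, 99) placed along the axis]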
Outline
• Setting up the problem
• Introduction to Unified Expectation Maximization
• Experiments
  • POS tagging
  • Entity-Relation Extraction
  • Word Alignment
Experiments: exploring the role of $\gamma$
• Test if changing the inference step by tuning $\gamma$ helps improve the performance over baselines
• Compare against:
  • Posterior Regularization (PR), which corresponds to $\gamma = 1.0$
  • Constraint-driven Learning (CoDL), which corresponds to $\gamma = -\infty$
• Study the relation between the quality of initialization and $\gamma$ (or “hardness” of inference)
Unsupervised POS Tagging
• Model as a first-order HMM
• Try varying qualities of initialization:
  • Uniform initialization: initialize with equal probability for all states
  • Supervised initialization: initialize with parameters trained on varying amounts of labeled data
• Test the “conventional wisdom” that hard EM does well with good initialization and EM does better with a weak initialization
Unsupervised POS tagging: Different EM instantiations
[Figure: performance relative to EM (y-axis) versus $\gamma$ (x-axis), with EM and Hard EM marked; one curve each for uniform initialization and for initialization with 5, 10, 20, and 40-80 examples]
Experiments: Entity-Relation Extraction
[Figure: the sentence "Dole’s wife, Elizabeth, is a resident of N.C." annotated with entities E1, E2, E3 and relations R12, R23]
• Extract entity types (e.g. Loc, Org, Per) and relation types (e.g. Lives-in, Org-based-in, Killed) between pairs of entities
• Add constraints:
  • Type constraints between entity and relations
  • Expected count constraints to regularize the counts of the ‘None’ relation
• Semi-supervised learning with a small amount of labeled data
Result on Relations
• UEM is statistically significantly better than PR
[Figure: macro-F1 scores (y-axis) versus % of labeled data (x-axis)]
Experiments: Word Alignment
• Word alignment from a language S to a language T
• We try En-Fr and En-Es pairs
• We use an HMM-based model with agreement constraints for word alignment
• PR with agreement constraints is known to give HUGE improvements over HMM (Ganchev et al, 08; Graca et al, 08)
• Use our efficient algorithm to decompose the E-step into individual HMMs
Word Alignment: EN-FR with 10k Unlabeled Data [Figure: Alignment Error Rate]
Word Alignment: EN-FR [Figure: Alignment Error Rate]
Word Alignment: FR-EN [Figure: Alignment Error Rate]
Word Alignment: EN-ES [Figure: Alignment Error Rate]
Word Alignment: ES-EN [Figure: Alignment Error Rate]
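The slides report Alignment Error Rate without defining it. A minimal sketch of the commonly used definition follows, assuming gold alignments are split into "sure" and "possible" links and that links are represented as (source index, target index) pairs; the function name and the toy data are hypothetical, and the slides do not say which exact variant was used.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|).

    A: predicted links, S: sure gold links, P: possible gold links.
    Sure links are treated as possible links too.
    Undefined (division by zero) if both A and S are empty.
    """
    a = set(predicted)
    s = set(sure)
    p = set(possible) | s
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Toy example with made-up links: two correct sure links and one wrong link.
predicted = {(0, 0), (1, 1), (2, 3)}
sure = {(0, 0), (1, 1)}
possible = {(2, 2)}
print(alignment_error_rate(predicted, sure, possible))
```

Lower is better: a prediction that matches the sure links exactly and adds nothing else scores 0.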
Experiments Summary
• In different settings, different baselines work better
  • Entity-Relation extraction: CoDL does better than PR
  • Word Alignment: PR does better than CoDL
  • Unsupervised POS tagging: depends on the initialization
• UEM allows us to choose the best algorithm in all of these cases
• Best version of EM: a new version with $0 < \gamma < 1$
Unified EM: Summary
• UEM: a unified framework for EM algorithms which tunes the entropy of the posterior by a single parameter $\gamma$
• $\gamma$: adaptively changes the entropy of the posterior based on the data, initialization, and constraints
• Experimentally: the best $\gamma$ corresponds to neither EM (PR) nor hard EM (CoDL) and is found through the UEM framework
• Shows the role of inference in learning: learned parameters seem to be sensitive to the entropy of the inferred posterior
• Open question: What is actually going on?
  • How is the entropy of the E-step actually changing the learnt model?
Questions?