
Penalized EP for Graphical Models Over Strings

This presentation explores the probabilistic reconstruction of missing entries in an infinite multilingual word table, combining linguistics, graphical models, and statistical inference. It covers Expectation Propagation for approximate inference, predicting pronunciations of novel words (morpho-phonology), log-linear approximations for efficient inference, and belief propagation over large state spaces.


Presentation Transcript


  1. Penalized EP for Graphical Models Over Strings Ryan Cotterell and Jason Eisner

  2. Natural Language is Built from Words

  3. Can store info about each word in a table

  4. Problem: Too Many Words! • Technically speaking, # words = ∞ • Really, the set of (possible) words is ∑* • Names • Neologisms • Typos • Productive processes: • friend → friendless → friendlessness → friendlessnessless → … • hand + bag → handbag (sometimes can iterate)

  5. Solution: Don’t model every cell separately [Slide: labels “Noble gases” and “Positive ions”, suggesting a periodic-table analogy in which whole groups of related cells share structure.]

  6. Can store info about each word in a table

  7. Can store info about each word in a table Ultimate goal: Probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text. Approach: Linguistics + generative modeling + statistical inference. Modeling ingredients: Finite-state machines + graphical models. Inference ingredients: Expectation Propagation (this talk).

  8. Can store info about each word in a table Ultimate goal: Probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text. Approach: Linguistics + generative modeling + statistical inference. Modeling ingredients: Finite-state machines + graphical models. Inference ingredients: Expectation Propagation (this talk).

  9. Predicting Pronunciations of Novel Words (Morpho-Phonology) [Slide diagram: the morphemes /dæmn/, /rizajgn/, /eɪʃən/, /z/ concatenate into underlying forms /dæmneɪʃən/, /rizajgneɪʃən/, /dæmnz/, /rizajgnz/, which surface as damnation [dˌæmnˈeɪʃən], resignation [rˌɛzɪgnˈeɪʃən], and resigns [rizˈajnz]; the surface pronunciation of damns is left as ???? with the question "How do you pronounce this word?"]

  10. Predicting Pronunciations of Novel Words (Morpho-Phonology) [Slide diagram: the same picture, now with the answer filled in: damns surfaces as [dˌæmz].]

  11. Graphical Models over Strings • Use the graphical model framework to model many strings jointly! [Slide diagram: string-valued variables X1 and X2 connected by factors ψ1, with their distributions drawn as weighted finite-state acceptors.]

  12. Zooming in on a WFSA • Compactly represents an (unnormalized) probability distribution over all strings in ∑* • Marginal belief: How do we pronounce damns? • Possibilities: /damz/, /dams/, /damnIz/, etc. [Slide diagram: a small WFSA with arcs d/1, a/1, m/1, then n/.25, z/.5, s/.25, and further arcs z/1, I/1, z/1.]
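
For concreteness, here is a minimal sketch (not from the slides) of such a WFSA as a plain Python dictionary, with a scorer that multiplies arc weights along the accepting path. The states, symbols, and weights are illustrative, loosely following the damns example; a real system would use a weighted FSA toolkit such as OpenFst.

```python
# A minimal WFSA sketch: deterministic, so each (state, symbol) has at most
# one outgoing arc.  Weights are unnormalized; the acceptor assigns every
# string in Sigma* a score (0 for strings with no accepting path).
WFSA = {
    "start": 0,
    "final": {4},
    # (state, symbol) -> (next_state, weight)
    "arcs": {
        (0, "d"): (1, 1.0),
        (1, "a"): (2, 1.0),
        (2, "m"): (3, 1.0),
        (3, "z"): (4, 0.5),   # /damz/
        (3, "s"): (4, 0.25),  # /dams/
        (3, "n"): (3, 0.25),  # allows /damnz/, /damnnz/, ... (toy choice)
    },
}

def score(wfsa, string):
    """Product of arc weights along the unique path for `string`, or 0."""
    state, weight = wfsa["start"], 1.0
    for sym in string:
        if (state, sym) not in wfsa["arcs"]:
            return 0.0
        state, w = wfsa["arcs"][(state, sym)]
        weight *= w
    return weight if state in wfsa["final"] else 0.0

print(score(WFSA, "damz"))   # 0.5
print(score(WFSA, "dams"))   # 0.25
print(score(WFSA, "damnz"))  # 0.125
print(score(WFSA, "cat"))    # 0.0
```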

  13. Log-Linear Approximation • Given a WFSA distribution p, find a log-linear approximation q • min KL(p || q) “inclusive KL divergence” • q corresponds to a smaller/tidier WFSA • Two Approaches: • Gradient-Based Optimization (Discussed Here) • Closed Form Optimization

  14. ML Estimation = Moment Matching • Broadcast n-gram counts • Fit a model that predicts the same counts [Slide: example n-gram counts fo = 3, foo = 1, bar = 2, az = 4.]

  15. FSA Approx. = Moment Matching • Fit a model that predicts the same counts • Compute with forward-backward! [Slide: a WFSA alongside its n-gram counts, e.g. fo = 3, foo = 1, bar = 2, az = 4, zz = 0.1, xx = 0.1.]

  16. Gradient-Based Minimization • Objective: KL(p || qθ), where qθ(x) ∝ exp(θ · f(x)) • Gradient with respect to θ: E_qθ[f(x)] - E_p[f(x)] • Difference between two expectations of feature counts, which are computed from the weighted DFA q • Features are just n-gram counts! Arc weights are determined by the parameter vector θ, just like a log-linear model
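
A minimal sketch of this moment-matching fit (not the paper's implementation): instead of running forward-backward over a WFSA, it brute-forces the expectation E_q[f] over a small candidate set of strings. The candidate set, learning rate, and feature order below are made up for illustration.

```python
import math
from collections import Counter

def ngram_features(s, n_max=2):
    """Count all n-grams of s up to length n_max (with boundary symbols)."""
    padded = "^" + s + "$"
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats

def fit_moment_matching(candidates, target_counts, n_max=2, lr=0.5, iters=500):
    """Fit weights theta so that E_q[f] matches target_counts (= E_p[f]).

    q(x) is proportional to exp(theta . f(x)) over the candidate strings;
    the gradient of the inclusive KL(p || q) w.r.t. theta is E_q[f] - E_p[f].
    """
    feats = {x: ngram_features(x, n_max) for x in candidates}
    theta = Counter()
    for _ in range(iters):
        # Unnormalized scores and partition function.
        scores = {x: math.exp(sum(theta[k] * v for k, v in f.items()))
                  for x, f in feats.items()}
        Z = sum(scores.values())
        # Expected feature counts under q.
        expected = Counter()
        for x, f in feats.items():
            w = scores[x] / Z
            for k, v in f.items():
                expected[k] += w * v
        # Gradient step: move E_q[f] toward E_p[f].
        for k in set(expected) | set(target_counts):
            theta[k] -= lr * (expected[k] - target_counts[k])
    return theta

# Toy usage: make q put most of its mass on "noon".
cands = ["noon", "non", "no", "onon"]
target = ngram_features("noon")   # pretend p is a point mass on "noon"
theta = fit_moment_matching(cands, target)
```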

  17. Does q need a lot of features? • Game: what order of n-grams do we need to put probability 1 on a string? • Word 1: noon • Bigram model? No - Trigram model • Word 2: papa • Trigram model? No - 4-gram model - very big! • Word 3: abracadabra • 6-gram model - way too big!

  18. Variable Order Approximations • Intuition: In NLP, marginals are often peaked • Probability mass mostly on a few similar strings! • q should reward a few long n-grams • also need short n-gram features for backoff [Slide: a full 6-gram table ("Too big!") contrasted with a variable-order table ("Very small!").]

  19. Variable Order Approximations • Moral: Use only the n-grams you really need!
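
A minimal sketch of the flavor of a variable-order feature set, assuming a peaked belief given as (string, probability) pairs: keep all short n-grams for backoff, and keep a longer n-gram only if the distribution actually uses it. The threshold and orders are invented, and this is not the paper's penalty.

```python
from collections import Counter

def variable_order_features(strings_with_probs, max_order=6,
                            backoff_order=2, threshold=0.1):
    """Select a small, variable-order n-gram feature set.

    Keep every n-gram up to `backoff_order`, and keep a longer n-gram
    (up to `max_order`) only if its expected count under the peaked
    distribution exceeds `threshold`.
    """
    expected = Counter()
    for s, p in strings_with_probs:
        padded = "^" + s + "$"
        for n in range(1, max_order + 1):
            for i in range(len(padded) - n + 1):
                expected[padded[i:i + n]] += p
    return {g for g, c in expected.items()
            if len(g) <= backoff_order or c > threshold}

# Toy usage: mass concentrated on a few similar strings.
beliefs = [("abracadabra", 0.8), ("abrakadabra", 0.15), ("abracadabro", 0.05)]
feats = variable_order_features(beliefs)
# Long n-grams shared by the high-probability strings survive; rare long
# n-grams are dropped, while their short pieces remain as backoff features.
```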

  20. Belief Propagation (BP) in a Nutshell [Slide: a factor graph over string-valued variables X1 through X6.]

  21. Belief Propagation (BP) in a Nutshell [Slide: the same factor graph; one variable's distribution is drawn as the small damns WFSA from slide 12 (arcs d/1, a/1, m/1, n/.25, z/.5, s/.25, z/1, I/1).]

  22. Belief Propagation (BP) in a Nutshell [Slide: the factor graph again, with messages passed along its edges.]

  23. Computing Marginal Beliefs [Slide: a factor graph over string-valued variables.]

  24. Computing Marginal Beliefs [Slide: the same factor graph, focusing on the messages arriving at one variable.]

  25. Belief Propagation (BP) in a Nutshell [Slide: the factor graph with each message drawn as a WFSA.]

  26. Computing Marginal Beliefs [Slide: the factor graph with the incoming WFSA messages at one variable.]

  27. Computing Marginal Beliefs • Computation of the belief results in a large state space [Slide: the product of the incoming WFSA messages, drawn as one large tangled WFSA.]

  28. Computing Marginal Beliefs • Computation of the belief results in a large state space • What a hairball! [Slide: the same large product WFSA.]

  29. Computing Marginal Beliefs • Approximation Required!!! [Slide: the incoming WFSA messages at one variable.]
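
Why the blow-up: the exact belief is the pointwise product, i.e. the weighted intersection, of the incoming WFSA messages, and the standard product construction multiplies state counts, so k messages with m states each can yield on the order of m^k states. A minimal sketch of that construction, assuming deterministic, epsilon-free acceptors in the dictionary format sketched after slide 12:

```python
def intersect(fsa_a, fsa_b):
    """Weighted intersection of two deterministic, epsilon-free WFSAs
    via the product construction.

    States of the result are pairs of states; arc weights multiply
    (they would add in the log semiring).  The result can have up to
    |states_a| * |states_b| states: this is the blow-up.
    """
    arcs = {}
    for (qa, sym_a), (ra, wa) in fsa_a["arcs"].items():
        for (qb, sym_b), (rb, wb) in fsa_b["arcs"].items():
            if sym_a == sym_b:
                arcs[((qa, qb), sym_a)] = ((ra, rb), wa * wb)
    return {
        "start": (fsa_a["start"], fsa_b["start"]),
        "final": {(fa, fb) for fa in fsa_a["final"] for fb in fsa_b["final"]},
        "arcs": arcs,
    }

# The belief at a variable is the running intersection of all its incoming
# messages; each intersection multiplies the state count.
# belief = messages[0]
# for msg in messages[1:]:
#     belief = intersect(belief, msg)
```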

  30. BP over String-Valued Variables • In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! [Slide: a two-variable cycle between X1 and X2 through factors ψ1 and ψ2; the messages are small WFSAs over "a"s and ε.]

  31. BP over String-Valued Variables • In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! [Slide: the same cycle after another round of message passing; the messages have grown.]

  32. BP over String-Valued Variables • In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! [Slide: another round; the messages keep growing.]

  33. BP over String-Valued Variables • In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! [Slide: yet another round; the messages keep growing.]

  34. BP over String-Valued Variables • In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex! [Slide: after several more rounds, the messages have grown much larger still.]

  35. Expectation Propagation (EP) in a Nutshell [Slide: the factor graph with full WFSA messages at one variable.]

  36. Expectation Propagation (EP) in a Nutshell [Slide: the same picture, with the messages progressively simplified.]

  37. Expectation Propagation (EP) in a Nutshell [Slide: the messages simplified further.]

  38. Expectation Propagation (EP) in a Nutshell [Slide: the messages simplified further still.]

  39. Expectation Propagation (EP) in a Nutshell [Slide: the bare factor graph; all messages have been replaced by simple approximations.]

  40. EP in a Nutshell • The approximate belief is now a table of n-grams. The pointwise product is now super easy! [Slide: the factor graph with n-gram tables as messages.]
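
Concretely: if each approximate message is a table of n-gram feature weights, i.e. a log-linear model, then the pointwise product of messages is just the sum of their weight tables. A minimal sketch with made-up feature names and weights:

```python
def pointwise_product(*messages):
    """Multiply log-linear n-gram messages by adding their weight tables."""
    product = {}
    for msg in messages:
        for feat, weight in msg.items():
            product[feat] = product.get(feat, 0.0) + weight
    return product

def log_score(weights, s, max_order=3):
    """Unnormalized log-score of s: sum of weights of its n-gram occurrences."""
    padded = "^" + s + "$"
    return sum(weights.get(padded[i:i + n], 0.0)
               for n in range(1, max_order + 1)
               for i in range(len(padded) - n + 1))

# Toy messages into one variable (feature names and weights are invented):
msg1 = {"^d": 1.2, "da": 0.7, "mz$": 0.4}
msg2 = {"^d": 0.3, "am": 0.9, "ms$": -0.2}
belief = pointwise_product(msg1, msg2)   # {"^d": 1.5, "da": 0.7, ...}
print(log_score(belief, "damz"))
```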

  41. How to approximate a message? • Minimize KL(message || qθ) with respect to the parameters θ [Slide: a WFSA message on the left, its n-gram approximation qθ on the right.]

  42. Results • Question 1: Does EP work in general (comparison to baseline)? • Question 2: Do variable order approximations improve over fixed n-grams? • Unigram EP (Green) – fast, but inaccurate • Bigram EP (Blue) – also fast and inaccurate • Trigram EP (Cyan) – slow and accurate • Penalized EP (Red) – fast and accurate • Baseline (Black) – accurate and slow (pruning based)

  43. Fin Thanks for your attention! For more information on structured models and belief propagation, see the Structured Belief Propagation Tutorial at ACL 2015 by Matt Gormley and Jason Eisner.
