
Learning Approximate Inference Policies for Fast Prediction

This workshop talk discusses "trainable hacks": approximate inference policies that are tuned automatically for fast prediction, with applications to natural language processing problems. It explores speed-accuracy tradeoffs and the resulting need for fast, accurate approximate inference algorithms.


Presentation Transcript


  1. Learning Approximate Inference Policies for Fast Prediction. Jason Eisner, ICML "Inferning" Workshop, June 2012.

  2. Beware: Bayesians in Roadway. A Bayesian is the person who writes down the function you wish you could optimize.

  3. [Diagram: a web of correlated linguistic variables: lexicon (word types), semantics, sentences, discourse context, resources, inflection, cognates, transliteration, abbreviation, neologism, language evolution, entailment, tokens, translation, alignment, editing, quotation, speech, misspellings/typos, formatting, annotation.] The theme: to recover variables, model and exploit their correlations.

  4-5. Motivating Tasks • Structured prediction (e.g., for NLP problems): parsing (→ trees), machine translation (→ word strings), word variants (→ letter strings, phylogenies, grids) • Unsupervised learning via Bayesian generative models: given a few verb conjugation tables and a lot of text, find/organize/impute all verb conjugation tables of the language; given some facts and a lot of text, discover more facts through information extraction and reasoning

  6. Current Methods • Dynamic programming: exact but slow • Approximate inference in graphical models: are the approximations any good? May use dynamic programming as a subroutine (structured BP) • Sequential classification
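
As a concrete reference point for "dynamic programming: exact but slow," here is a minimal Viterbi decoder for a linear-chain model (an illustrative sketch, not code from the talk; the function and its toy inputs are hypothetical). Even this exact algorithm costs O(n·T²) per sentence, and the grammar-based analogues used for parsing are far more expensive:

```python
import numpy as np

def viterbi(unary, trans):
    """Exact MAP decoding for a linear-chain model.

    unary: (n, T) array of per-position, per-tag log-scores.
    trans: (T, T) array of tag-transition log-scores.
    Returns the highest-scoring tag sequence; cost is O(n * T^2).
    """
    n, T = unary.shape
    score = unary[0].copy()              # best log-score of a prefix ending in each tag
    back = np.zeros((n, T), dtype=int)   # backpointers
    for i in range(1, n):
        cand = score[:, None] + trans    # cand[s, t] = best prefix ending in s, then s -> t
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0) + unary[i]
    tags = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        tags.append(int(back[i, tags[-1]]))
    return tags[::-1]

# Tiny demo: 3 tokens, 2 tags.
unary = np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]))
trans = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
print(viterbi(unary, trans))   # -> [0, 0, 0]: the transition scores override token 1's preference
```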

  7. Speed-Accuracy Tradeoffs • Inference requires lots of computation • Is some computation going to waste? • Sometimes the best prediction is overdetermined … • Quick ad hoc methods sometimes work: how to respond? • Is some computation actively harmful? • In approximate inference, passing a message can hurt • Frustrating to simplify model just to fix this • Want to keep improving our models! • But need good fast approximate inference • Choose approximations automatically • Tuned to data distribution & loss function • “Trainable hacks” – more robust

  8-9. This talk is about "trainable hacks." [Diagram, built over two slides: training data feeds a prediction device (suitable for the domain), which is tuned via a feedback signal: likelihood on slide 8, replaced by loss + runtime on slide 9.]

  10. Bayesian Decision Theory. [Diagram: loss, data distribution, and prediction rule combine to give optimized parameters of the prediction rule.] • What prediction rule? (approximate inference + beyond) • What loss function? (can include runtime) • How to optimize? (backprop, RL, …) • What data distribution? (may have to impute)
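
For reference, these four questions instantiate the standard decision-theoretic objective (a textbook formulation, not copied from the slides): choose the prediction that minimizes expected loss under the posterior, and tune the rule's parameters against the data distribution:

$$\hat y(x) = \operatorname*{argmin}_{\hat y}\ \mathbb{E}_{y \sim p(y \mid x)}\big[L(y, \hat y)\big], \qquad \theta^* = \operatorname*{argmin}_\theta\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[L\big(y, f_\theta(x)\big)\big]$$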

  11. This talk is about "trainable hacks." [Diagram: as before, but now a probabilistic domain model completes partial data into complete training data, which feeds the prediction device (suitable for the domain); the feedback signal is loss + runtime.]

  12. Part 1: Your favorite approximate inference algorithm is a trainable hack

  13. General CRFs: Unrestricted model structure. [Diagram: a loopy factor graph over output variables Y1-Y4 and inputs X1-X3.] Add edges to model the conditional distribution well. But exact inference is intractable. So use loopy sum-product or max-product BP.
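
The model family here is the standard conditional random field (standard definition, added for reference): a product of potential functions over the factors of the graph, with a global normalizer whose computation is what becomes intractable once the graph is loopy:

$$p_\theta(y \mid x) = \frac{1}{Z_\theta(x)} \prod_{\alpha} \psi_\alpha(y_\alpha, x; \theta), \qquad Z_\theta(x) = \sum_{y'} \prod_\alpha \psi_\alpha(y'_\alpha, x; \theta)$$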

  14. General CRFs: Unrestricted model structure. [Diagram: per-token marginal distributions over POS tags for "The cat sat on the mat ." — e.g., The: DT .9, NN .05; cat: NN .8, JJ .1; sat: VBD .7, VB .1; on: IN .9, NN .01; the: DT .9, NN .05; mat: NN .4, JJ .3; .: "." .99, "," .001.] Inference: compute properties of the posterior distribution.
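
Loopy sum-product BP approximates these per-variable marginals by iterating the usual message updates to a fixed point (standard equations, included for reference):

$$\mu_{f \to v}(y_v) \propto \sum_{y_\alpha:\, y_\alpha[v] = y_v} \psi_f(y_\alpha) \prod_{u \in N(f)\setminus v} \mu_{u \to f}(y_\alpha[u])$$

$$\mu_{v \to f}(y_v) \propto \prod_{g \in N(v)\setminus f} \mu_{g \to v}(y_v), \qquad b_v(y_v) \propto \prod_{g \in N(v)} \mu_{g \to v}(y_v)$$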

  15. General CRFs: Unrestricted model structure. [Diagram: the decoded tag sequence DT NN VBD IN DT NN . for "The cat sat on the mat ."] Decoding: coming up with predictions from the results of inference.
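
A common decoding rule, and presumably the one illustrated (an assumption on my part), is to output each variable's most probable label under its belief, which would minimize expected Hamming loss if the beliefs were exact marginals:

$$\hat y_v = \operatorname*{argmax}_{y_v} b_v(y_v)$$

Max-product BP instead approximates a single jointly most probable assignment.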

  16. General CRFs: Unrestricted model structure. In practice one uses CRFs with several approximations (these could be present in linear-chain CRFs as well): • Approximate inference. • Approximate decoding. • Mis-specified model structure. • MAP training (vs. Bayesian). So why are we still maximizing data likelihood? Our system is really more like a Bayes-inspired neural network that makes predictions.

  17. Train directly to minimize task loss (Stoyanov, Ropson, & Eisner 2011; Stoyanov & Eisner 2012). [Diagram: a black-box decision function parameterized by ϴ: x → (approx.) inference → p(y|x) → (approx.) decoding → ŷ, evaluated by L(y*, ŷ).] • Adjust ϴ to (locally) minimize training loss • E.g., via back-propagation (+ annealing) • "Empirical Risk Minimization under Approximations" (ERMA)
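
A minimal sketch of this training loop (a hypothetical toy, not the ERMA implementation): inference is a fixed number of unrolled loopy-BP iterations on a cyclic model, the loss is computed from the resulting beliefs, and the parameter is adjusted to reduce that loss. ERMA back-propagates through the unrolled updates; for brevity this sketch estimates the gradient by finite differences:

```python
import numpy as np

K = 5                              # fixed number of BP iterations = the inference policy
EDGES = [(0, 1), (1, 2), (2, 0)]   # a 3-variable cycle, so BP is genuinely approximate

def beliefs(theta, unary):
    """Run K rounds of pairwise sum-product BP; theta[0] is the log agreement strength."""
    coupling = np.array([[np.exp(theta[0]), 1.0], [1.0, np.exp(theta[0])]])
    directed = EDGES + [(j, i) for (i, j) in EDGES]
    msgs = {e: np.ones(2) / 2 for e in directed}
    for _ in range(K):
        new = {}
        for (i, j) in msgs:
            # product of i's unary and all incoming messages at i, except the one from j
            incoming = unary[i].copy()
            for (k, i2) in msgs:
                if i2 == i and k != j:
                    incoming *= msgs[(k, i)]
            m = coupling.T @ incoming        # sum out y_i against the pairwise potential
            new[(i, j)] = m / m.sum()
        msgs = new
    b = []
    for i in range(3):
        bi = unary[i].copy()
        for (k, i2) in msgs:
            if i2 == i:
                bi *= msgs[(k, i)]
        b.append(bi / bi.sum())
    return np.array(b)

def loss(theta, unary, gold):
    """MSE between beliefs and one-hot gold labels: a decomposable task loss."""
    return np.mean((beliefs(theta, unary) - np.eye(2)[gold]) ** 2)

# One training example: noisy unaries; the gold labels all agree.
unary = np.array([[0.6, 0.4], [0.45, 0.55], [0.7, 0.3]])
gold = np.array([0, 0, 0])

theta, lr, eps = np.zeros(1), 1.0, 1e-5
for step in range(50):
    g = (loss(theta + eps, unary, gold) - loss(theta - eps, unary, gold)) / (2 * eps)
    theta -= lr * g                # gradient descent on the post-inference loss
print("learned coupling:", theta[0], "final loss:", loss(theta, unary, gold))
```

Because the loss is measured after the same approximate inference that will run at test time, the learned parameters compensate for the approximation rather than fighting it.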

  18-21. Optimization Criteria. [Equations not captured in the transcript; the slides build up from the MLE criterion.]
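
The contrast being drawn is presumably between the likelihood criterion and direct risk minimization (my reconstruction; the slides' exact equations were not captured):

$$\theta_{\text{MLE}} = \operatorname*{argmax}_\theta \sum_i \log p_\theta(y_i \mid x_i) \qquad \text{vs.} \qquad \theta_{\text{ERMA}} = \operatorname*{argmin}_\theta \sum_i L\big(y_i,\ \hat y_\theta(x_i)\big)$$

where the prediction $\hat y_\theta(x_i)$ is produced by the actual approximate inference and decoding pipeline.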

  22. Experimental Results • 3 NLP problems; also synthetic data • We show that general CRFs work better when they match dependencies in the data, and that minimum-risk training results in more accurate models • ERMA software package available at www.clsp.jhu.edu/~ves/software

  23. ERMA software package: http://www.clsp.jhu.edu/~ves/software • Includes syntax for describing general CRFs • Supports sum-product and max-product BP • Can optimize several commonly used loss functions: MSE, accuracy, F-score • The package is generic: little effort to model new problems; about 1-3 days to express each problem in our formalism
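
To make losses like these trainable by back-propagation, they can be defined on the inferred beliefs rather than on a hard decoding; e.g., an MSE-style loss against one-hot gold labels (consistent with Stoyanov et al. 2011, though the package's exact definitions may differ):

$$L_{\text{MSE}}(\theta) = \sum_{v} \sum_{y_v} \big(b_v(y_v; \theta) - \mathbb{1}[y_v = y_v^*]\big)^2$$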

  24-27. Modeling Congressional Votes. The ConVote corpus [Thomas et al., 2006] pairs floor-debate speeches with the speakers' votes. E.g., Mr. Sensenbrenner: "First, I want to commend the gentleman from Wisconsin (Mr. Sensenbrenner), the chairman of the committee on the judiciary, not just for the underlying bill…" (voted Yea). Another speaker: "Had it not been for the heroic actions of the passengers of United flight 93 who forced the plane down over Pennsylvania, congress's ability to serve …" (voted Yea).

  28-30. Modeling Congressional Votes. Task: predict each representative's vote (Y/N) based on the debates. [Diagram, built over three slides: each vote variable Y/N is connected to a Text factor on the speaker's own speech, and Context links connect pairs of votes, e.g., when one speaker refers to another.] An example from the ConVote corpus [Thomas et al., 2006].
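
One plausible way to write the model sketched above (my schematic reconstruction, not an equation from the slides): a unary text factor per vote, plus pairwise agreement/disagreement factors on the context links, which make the graph loopy:

$$p(y \mid x) \propto \prod_i \psi_{\text{text}}(y_i, x_i) \prod_{(i,j) \in \text{links}} \psi_{\text{context}}(y_i, y_j, x_{ij})$$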

  31-35. Modeling Congressional Votes: results. [Results tables not captured in the transcript.] *Boldfaced results are significantly better than all others (p < 0.05).

  36-37. Information Extraction from Semi-Structured Text. CMU Seminar Announcement Corpus [Freitag, 2000]. Example announcement: What: Special Seminar / Who: Prof. Klaus Sutner, Computer Science Department, Stevens Institute of Technology / Topic: "Teaching Automata Theory by Computer" / Date: 12-Nov-93 / Time: 12:00 pm / Place: WeH 4623 / Host: Dana Scott (Asst: Rebecca Clark x8-6737) / ABSTRACT: We will demonstrate the system "automata" that implements finite state machines… After the lecture, Prof. Sutner will be glad to demonstrate and discuss the use of MathLink and his "automata" package. The second slide highlights the fields to extract (speaker, start time, location); note the speaker is mentioned in both the header and the body.

  38. Skip-Chain CRF for Info Extraction [Sutton and McCallum, 2005; Finkel et al., 2005]. Extract speaker, location, stime, and etime from seminar announcement emails (CMU Seminar Announcement Corpus [Freitag, 2000]). [Diagram: a linear-chain CRF with O/S labels over "… Who: Prof. Klaus Sutner …" and "… Prof. Sutner will …", plus skip edges linking the repeated tokens "Prof." and "Sutner".]
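
Schematically, the skip-chain model adds factors between repeated tokens to the usual chain factors, so evidence that one mention of "Sutner" is a speaker can flow to the other mention; these skip edges create cycles, which is why inference uses loopy BP (standard formulation of the cited model):

$$p(y \mid x) \propto \prod_t \psi_{\text{chain}}(y_t, y_{t+1}, x) \prod_{(t,u):\ x_t = x_u} \psi_{\text{skip}}(y_t, y_u, x)$$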

  39-42. Semi-Structured Information Extraction: results. [Results tables not captured in the transcript.] *Boldfaced results are significantly better than all others (p < 0.05).

  43-46. Collective Multi-Label Classification [Ghamrawi and McCallum, 2005; Finley and Joachims, 2008]. Reuters Corpus Version 2 [Lewis et al., 2004]. Example document: "The collapse of crude oil supplies from Libya has not only lifted petroleum prices, but added a big premium to oil delivered promptly. Before protests began in February against Muammer Gaddafi, the price of benchmark European crude for imminent delivery was $1 a barrel less than supplies to be delivered a year later. …" Candidate labels shown include Oil, Libya, and Sports; labels like Oil and Libya tend to co-occur, which a collective model can exploit.
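
The collective model couples each label to the document and the labels to one another pairwise (schematic form of the cited approach), so the label graph is densely connected and inference is again approximate:

$$p(y \mid x) \propto \prod_{\ell} \psi_\ell(y_\ell, x) \prod_{\ell < m} \psi_{\ell m}(y_\ell, y_m)$$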

  47-49. Multi-Label Classification: results. [Results tables not captured in the transcript.]
