
Declarative Specification of NLP Systems


Presentation Transcript


  1. Declarative Specification of NLP Systems. Jason Eisner. Student co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble. IBM, May 2006

  2. An Anecdote from ACL’05 - Michael Jordan

  3. An Anecdote from ACL’05: “Just draw a model that actually makes sense for your problem. Just do Gibbs sampling. Um, it’s only 6 lines in Matlab…” - Michael Jordan

  4. Conclusions to draw from that talk • Mike & his students are great. • Graphical models are great (because they’re flexible). • Gibbs sampling is great (because it works with nearly any graphical model). • Matlab is great (because it frees up Mike and his students to doodle all day and then execute their doodles).

  5. Could NLP be this nice? • Mike & his students are great. • Graphical models are great (because they’re flexible). • Gibbs sampling is great (because it works with nearly any graphical model). • Matlab is great (because it frees up Mike and his students to doodle all day and then execute their doodles).

  6. Could NLP be this nice? Parts of it already are: language modeling, binary classification (e.g., SVMs), finite-state transductions, linear-chain graphical models. Toolkits are available; you don’t have to be an expert. But other parts aren’t: context-free and beyond, machine translation. Efficient parsers and MT systems are complicated and painful to write.

  7. Could NLP be this nice? Other parts aren’t: context-free and beyond, machine translation. Efficient parsers and MT systems are complicated and painful to write. This talk: “Dyna”, a toolkit that’s general enough for these cases (it stretches from finite-state to Turing machines).

  8. Warning • Lots more beyond this talk: see the EMNLP’05 and FG’06 papers; see http://dyna.org (download + documentation); sign up for updates by email; wait for the totally revamped next version.

  9. The case for Little Languages: declarative programming; small is beautiful.

  10. Sapir-Whorf hypothesis • Language shapes thought • At least, it shapes conversation • Computer language shapes thought • At least, it shapes experimental research • Lots of cute ideas that we never pursue • Or if we do pursue them, it takes 6-12 months to implement on large-scale data • Have we turned into a lab science?

  11. Declarative Specifications • State what is to be done • (How should the computer do it? Turn that over to a general “solver” that handles the specification language.) • Hundreds of domain-specific “little languages” out there. Some have sophisticated solvers.

  12. dot (www.graphviz.org) digraph g { graph [rankdir = "LR"]; node [fontsize = "16", shape = "ellipse"]; edge []; "node0" [label = "<f0> 0x10ba8 | <f1>", shape = "record"]; "node1" [label = "<f0> 0xf7fc4380 | <f1> | <f2> | -1", shape = "record"]; … "node0":f0 -> "node1":f0 [id = 0]; "node0":f1 -> "node2":f0 [id = 1]; "node1":f0 -> "node3":f0 [id = 2]; … } The spec just declares nodes and edges. What’s the hard part? Making a nice layout! Actually, it’s NP-hard …

  13. dot (www.graphviz.org) [slide shows the rendered graph layout]

  14. LilyPond (www.lilypond.org) c4 <<c4 d4 e4>> { f4 <<c4 d4 e4>> } << g2 \\ { f4 <<c4 d4 e4>> } >>

  15. LilyPond (www.lilypond.org) [slide shows the engraved score]

  16. Declarative Specs in NLP • Regular expression (for a FST toolkit) • Grammar (for a parser) • Feature set (for a maxent distribution, SVM, etc.) • Graphical model (DBNs for ASR, IE, etc.) Claim of this talk: Sometimes it’s best to peek under the shiny surface. Declarative methods are still great, but should be layered: we need them one level lower, too.

  17. Declarative Specs in NLP. Existing toolkits: • Regular expression (for a FST toolkit) • Grammar (for a parser) • Feature set (for a maxent distribution, SVM, etc.) These are not always flexible enough: you need to open up the parser and rejigger it, or open up the learner and rejigger it. Hence a new toolkit: declarative specification of algorithms, and declarative specification of objective functions.

  18. Declarative Specification of Algorithms

  19. How you build a system (“big picture” slide): cool model → practical equations (e.g., a PCFG) → pseudocode (execution order) → tuned C++ implementation (data structures, etc.). The pseudocode level looks like: • for width from 2 to n • for i from 0 to n-width • k = i+width • for j from i+1 to k-1 • …

  20. Wait a minute … Didn’t I just implement something like this last month? • chart management / indexing • cache-conscious data structures • prioritization of partial solutions (best-first, A*) • parameter management • inside-outside formulas • different algorithms for training and decoding • conjugate gradient, annealing, ... • parallelization? I thought computers were supposed to automate drudgery

  21. How you build a system (“big picture” slide, pipeline as on slide 19). • The Dyna language specifies the practical equations. • Most programs just need to compute some values from other values. Any order is ok. • Some programs also need to update the outputs if the inputs change: spreadsheets, makefiles, email readers; dynamic graph algorithms; EM and other iterative optimization; leave-one-out training of smoothing params.

  22. How you build a system (“big picture” slide, pipeline as on slide 19). Compilation strategies take the equations down to pseudocode and a tuned C++ implementation (we’ll come back to this).

  23. Writing equations in Dyna • int a. • a = b * c. (a will be kept up to date if b or c changes.) • b += x. b += y. (equivalent to b = x+y: b is a sum of two variables, also kept up to date.) • c += z(1). c += z(2). c += z(3). c += z("four"). c += z(foo(bar,5)). c += z(N). (The capitalized N is a “pattern” that matches anything, so c is a sum of all nonzero z(…) values. At compile time, we don’t know how many!)

  24. More interesting use of patterns • a = b * c. (scalar multiplication) • a(I) = b(I) * c(I). (pointwise multiplication) • a += b(I) * c(I). (dot product, summing b(I)*c(I) over all I; could be sparse, e.g. the sparse dot product of query & document is … + b("yetis")*c("yetis") + b("zebra")*c("zebra")) • a(I,K) += b(I,J) * c(J,K). (matrix multiplication; could be sparse; J is free on the right-hand side, so we sum over it) A concrete sketch follows.
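A minimal end-to-end sketch of the sparse dot product (hedged: the item names score, query, and doc are hypothetical; the ground-value facts are written directly into the program in the style of the need("s",0) = true. fact on the Earley slide later, and the declaration line follows slide 26):

:- double item = 0.
score += query(W) * doc(W).
query("zebra") = 2.
query("yetis") = 1.
doc("zebra") = 3.

Only keys present in both vectors contribute: score converges to 2*3 = 6, since doc("yetis") keeps its default value of 0.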

  25. Dyna vs. Prolog By now you may see what we’re up to! Prolog has Horn clauses: a(I,K) :- b(I,J), c(J,K). Dyna has “Horn equations”: a(I,K) += b(I,J) * c(J,K). Each item has a value (e.g., a real number), defined from other values. Like Prolog: allows nested terms; syntactic sugar for lists, etc.; Turing-complete. Unlike Prolog: charts, not backtracking! Compiles to efficient C++ classes and integrates with your C++ code.
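The Horn-equation shape covers more than parsing. As a hedged illustration (hypothetical item names pathto and edge; a min= aggregator is assumed by analogy with the max= used for Viterbi parsing below), single-source shortest distances take two lines:

pathto(start) min= 0.
pathto(B) min= pathto(A) + edge(A,B).

Each pathto(B) converges to the cost of a cheapest path from start to B; as with the other programs here, the solver rather than the programmer decides the order of relaxations.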

  26. The CKY inside algorithm in Dyna
:- double item = 0.
:- bool length = false.
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit("s",0,N) if length(N).
Put in axioms (values not defined by the above program), and the theorem pops out:
using namespace cky;
chart c;
c[rewrite("s","np","vp")] = 0.7;
c[word("Pierre",0,1)] = 1;
c[length(30)] = true; // 30-word sentence
cin >> c; // get more axioms from stdin
cout << c[goal]; // print total weight of all parses
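One natural extension, shown only as a sketch (it is not part of the program above): unary nonterminal rules would follow the same pattern, assuming rewrite(X,Y) axioms for unary rewrites. Since terminal symbols never appear as constit items, the same two-argument rewrite relation can hold both the preterminal and the unary rules, and chains of unary rules are then handled by the solver’s usual iteration to a fixpoint (compare the “unary rule closure” analogy on slide 41).

constit(X,I,J) += constit(Y,I,J) * rewrite(X,Y).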

  27. Visual debugger: browse the proof forest. [slide shows a proof forest, highlighting ambiguity and shared substructure]

  28. Related algorithms in Dyna? constit(X,I,J) += word(W,I,J) * rewrite(X,W). constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z). goal += constit("s",0,N) if length(N). • Viterbi parsing? • Logarithmic domain? • Lattice parsing? • Incremental (left-to-right) parsing? • Log-linear parsing? • Lexicalized or synchronous parsing? • Earley’s algorithm? • Binarized CKY?

  29. Viterbi parsing? Just change each += to max=; the program is otherwise unchanged: constit(X,I,J) max= word(W,I,J) * rewrite(X,W). constit(X,I,J) max= constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z). goal max= constit("s",0,N) if length(N).

  30. Logarithmic domain? Each += becomes log+= and each * becomes +: constit(X,I,J) log+= word(W,I,J) + rewrite(X,W). constit(X,I,J) log+= constit(Y,I,Mid) + constit(Z,Mid,J) + rewrite(X,Y,Z). goal log+= constit("s",0,N) if length(N). (Viterbi parsing in the logarithmic domain likewise uses max= with +.)

  31. Lattice parsing? The rules don’t change; only the word axioms do. Instead of sentence-position axioms like c[word("Pierre",0,1)] = 1, assert weighted arcs between lattice states, e.g. c[word("Pierre",state(5),state(9))] = 0.2. [slide shows a lattice fragment with arcs such as Pierre/0.2 leaving state 5, P/0.5, and air/0.3 from state 8 to state 9]

  32. Incremental (left-to-right) parsing? Just add words one at a time to the chart, and check at any time what can be derived from the words so far. Similarly, dynamic grammars.

  33. Log-linear parsing? Again, no change to the Dyna program.

  34. Lexicalized or synchronous parsing? Basically, just add extra arguments to the terms above; see the sketch below.
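A minimal sketch of what “extra arguments” means for lexicalization (hedged: the signature is hypothetical; each constituent carries a head word H, a preterminal takes the word itself as its head, and the rewrite axioms may now condition on the heads of both children):

constit(X,W,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,H,I,J) += constit(Y,H,I,Mid) * constit(Z,H2,Mid,J) * rewrite(X,Y,Z,H,H2).
goal += constit("s",H,0,N) if length(N).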

  35. Earley’s algorithm? See the next slide.

  36. Earley’s algorithm in Dyna. Start from the CKY program:
constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
goal += constit("s",0,N) if length(N).
The magic templates transformation (as noted by Minnen 1996) turns it into Earley’s algorithm:
need("s",0) = true.
need(Nonterm,J) |= ?constit(_/[Nonterm|_],_,J).
constit(Nonterm/Needed,I,I) += rewrite(Nonterm,Needed) if need(Nonterm,I).
constit(Nonterm/Needed,I,K) += constit(Nonterm/[W|Needed],I,J) * word(W,J,K).
constit(Nonterm/Needed,I,K) += constit(Nonterm/[X|Needed],I,J) * constit(X/[],J,K).
goal += constit("s"/[],0,N) if length(N).

  37. Program transformations (pipeline as on slide 19). Blatz & Eisner (FG 2006): there are lots of equivalent ways to write a system of equations, and transforming from one to another may improve efficiency. Many parsing “tricks” can be generalized into automatic transformations that help other programs, too!

  38. Binarized CKY? See the next slide.

  39. Rule binarization. The three-way rule constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z). builds an X over I to J from a Y over I to Mid and a Z over Mid to J. A folding transformation (asymptotic speedup!) splits it in two, introducing a slashed constituent X\Y that spans Mid to J: constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z). constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).

  40. Rule binarization. The folding transformation regroups the three-way product constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z) as constit(Y,I,Mid) * (constit(Z,Mid,J) * rewrite(X,Y,Z)), computing the parenthesized part once as constit(X\Y,Mid,J). The same regrouping idea appears in graphical models, constraint programming, and multi-way database joins.
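Putting the pieces together, the binarized parser is just the slide-26 program with its three-way rule replaced by the two folded rules; the word and goal rules carry over unchanged (a sketch assembled from the rules above):

constit(X,I,J) += word(W,I,J) * rewrite(X,W).
constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z).
constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).
goal += constit("s",0,N) if length(N).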

  41. More program transformations • Examples that add new semantics • Compute gradient (e.g., derive the outside algorithm from the inside algorithm) • Compute upper bounds for A* (e.g., Klein & Manning ACL’03) • Coarse-to-fine (e.g., Johnson & Charniak NAACL’06) • Examples that preserve semantics • On-demand computation, by analogy with Earley’s algorithm • On-the-fly composition of FSTs • Left-corner filter for parsing • Program specialization as unfolding, e.g., compile out the grammar (see the sketch below) • Rearranging computations, by analogy with categorial grammar • Folding reinterpreted as slashed categories • “Speculative computation” using slashed categories • Abstract away repeated computation to do it once only, by analogy with unary rule closure or epsilon-closure; this derives the Eisner & Satta ACL’99 O(n³) bilexical parser
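To illustrate one entry in the list above, a hedged sketch of program specialization as unfolding (“compile out the grammar”): if a grammar axiom such as rewrite("s","np","vp") = 0.7 (slide 26) is fixed at compile time, the binary rule can be specialized so that the grammar lookup disappears:

constit("s",I,J) += 0.7 * constit("np",I,Mid) * constit("vp",Mid,J).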

  42. How you build a system (“big picture” slide, pipeline as on slide 19). To get from the practical equations to pseudocode, use a general method: propagate updates from right-to-left through the equations, a.k.a. the “agenda algorithm”, “forward chaining”, “bottom-up inference”, or “semi-naïve bottom-up”.

  43. Bottom-up inference. An agenda of pending updates feeds the rules of the program, which write into a chart of derived items with current values. Rules: pp(I,K) += prep(I,J) * np(J,K). s(I,K) += np(I,J) * vp(J,K). Example: we updated np(3,5) from 0.1 by += 0.3; what else must therefore change? The query prep(I,3)? finds prep(2,3) = 1.0, giving pp(2,5) += 0.3. The query vp(5,K)? finds vp(5,9) = 0.5 and vp(5,7) = 0.7, giving s(3,9) += 0.15 and s(3,7) += 0.21; then there are no more matches to this query. (If np(3,5) hadn’t been in the chart already, we would have added it.)

  44. How you build a system (“big picture” slide, pipeline as on slide 19). What’s going on under the hood?

  45. Compiler provides … • an agenda of pending updates, kept in an efficient priority queue • hard-coded pattern matching for the rules of the program (e.g., matching the update np(3,5) += 0.3 against s(I,K) += np(I,J) * vp(J,K) and issuing the query vp(5,K)?) • a chart of derived items with current values, with automatic indexing for O(1) lookup • fast copy, compare, & hash of terms via integerization (interning) • efficient storage of terms (use of native C++ types, “symbiotic” storage, garbage collection, serialization, …)

  46. Beware double-counting! With a rule like n(I,K) += n(I,J) * n(J,K), an epsilon constituent such as n(5,5) (currently 0.2 in the chart) can combine with itself to make another copy of itself: the pending update n(5,5) += 0.3 matches the query n(5,K)? against the very item being updated, so the propagated update n(5,5) += ? must be computed carefully to avoid double-counting. (And if n(5,5) hadn’t been in the chart already, we would have added it.)

  47. DynaMITE: training toolkit. Model parameters (and the input sentence) go in as axiom values; the objective function comes out as a theorem’s value (e.g., the inside algorithm computes the likelihood of the sentence). Parameter training: • Maximize some objective function. • Use Dyna to compute the function. • Then how do you differentiate it … for gradient ascent, conjugate gradient, etc.? The gradient also tells us the expected counts for EM! • Two approaches: • Program transformation: automatically derive the “outside” formulas. • Back-propagation: run the agenda algorithm “backwards”; works even with pruning, early stopping, etc.

  48. What can Dyna do beyond CKY?

  49. Some examples from my lab … • Parsing using … • factored dependency models (Dreyer, Smith, & Smith CoNLL’06) • with annealed risk minimization (Smith & Eisner EMNLP’06) • constraints on dependency length (Eisner & Smith IWPT’05) • unsupervised learning of deep transformations (see Eisner EMNLP’02) • lexicalized algorithms (see Eisner & Satta ACL’99, etc.) • Grammar induction using … • partial supervision (Dreyer & Eisner EMNLP’06) • structural annealing (Smith & Eisner ACL’06) • contrastive estimation (Smith & Eisner GIA’05) • deterministic annealing (Smith & Eisner ACL’04) • Machine translation using … • very large neighborhood search of permutations (Eisner & Tromble NAACL-W’06) • loosely syntax-based MT (Smith & Eisner in prep.; see also Eisner ACL’03) • synchronous cross-lingual parsing (Smith & Smith EMNLP’04) • Finite-state methods for morphology, phonology, IE, even syntax … • unsupervised cognate discovery (Schafer & Yarowsky ’05, ’06) • unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL’05) • context-based morphological disambiguation (Smith, Smith & Tromble EMNLP’05) • trainable (in)finite-state machines (see Eisner ACL’02, EMNLP’02, …) • finite-state machines with very large alphabets (see Eisner ACL’97) • finite-state machines over weird semirings (see Eisner ACL’02, EMNLP’03) • Teaching (Eisner JHU’05-06; Smith & Tromble JHU’04) Easy to try stuff out! Programs are very short & easy to change!

  50. Can it express everything in NLP? • Remember, it integrates tightly with C++, so you only have to use it where it’s helpful, and write the rest in C++. Small is beautiful. • We’re currently extending the class of allowed formulas “beyond the semiring” (cf. Goodman 1999); this will be able to express smoothing, neural nets, etc. • Of course, it is Turing complete …
