Declarative Specification of NLP Systems
Jason Eisner
student co-authors on various parts of this work: Eric Goldlust, Noah A. Smith, John Blatz, Roy Tromble
IBM, May 2006
An Anecdote from ACL’05
“Just draw a model that actually makes sense for your problem. Just do Gibbs sampling. Um, it’s only 6 lines in Matlab…” – Michael Jordan
Conclusions to draw from that talk
• Mike & his students are great.
• Graphical models are great. (because they’re flexible)
• Gibbs sampling is great. (because it works with nearly any graphical model)
• Matlab is great. (because it frees up Mike and his students to doodle all day and then execute their doodles)
Could NLP be this nice?
• Parts of it already are …
  • Language modeling
  • Binary classification (e.g., SVMs)
  • Finite-state transductions
  • Linear-chain graphical models
  Toolkits are available; you don’t have to be an expert.
• But other parts aren’t …
  • Context-free and beyond
  • Machine translation
  Efficient parsers and MT systems are complicated and painful to write.
This talk: a toolkit, “Dyna,” that’s general enough for those harder cases (it stretches from finite-state machines up to Turing machines).
Warning
Lots more beyond this talk:
• see the EMNLP’05 and FG’06 papers
• see http://dyna.org (download + documentation)
• sign up for updates by email
• wait for the totally revamped next version
The case for little languages: declarative programming. Small is beautiful.
Sapir-Whorf hypothesis • Language shapes thought • At least, it shapes conversation • Computer language shapes thought • At least, it shapes experimental research • Lots of cute ideas that we never pursue • Or if we do pursue them, it takes 6-12 months to implement on large-scale data • Have we turned into a lab science?
Declarative Specifications • State what is to be done • (How should the computer do it? Turn that over to a general “solver” that handles the specification language.) • Hundreds of domain-specific “little languages” out there. Some have sophisticated solvers.
dot (www.graphviz.org)

  digraph g {
    graph [rankdir = "LR"];
    node [fontsize = "16", shape = "ellipse"];
    edge [];
    "node0" [label = "<f0> 0x10ba8 | <f1>", shape = "record"];
    "node1" [label = "<f0> 0xf7fc4380 | <f1> | <f2> | -1", shape = "record"];
    …
    "node0":f0 -> "node1":f0 [id = 0];
    "node0":f1 -> "node2":f0 [id = 1];
    "node1":f0 -> "node3":f0 [id = 2];
    …
  }

You just declare the nodes and edges. What’s the hard part? Making a nice layout! Actually, that’s NP-hard …
LilyPond (www.lilypond.org)

  c4 <<c4 d4 e4>> { f4 <<c4 d4 e4>> } << g2 \\ { f4 <<c4 d4 e4>> } >>
Declarative Specs in NLP
• Regular expression (for an FST toolkit)
• Grammar (for a parser)
• Feature set (for a maxent distribution, SVM, etc.)
• Graphical model (DBNs for ASR, IE, etc.)
Claim of this talk: sometimes it’s best to peek under the shiny surface. Declarative methods are still great, but should be layered: we need them one level lower, too.
Declarative Specs in NLP (annotated)
• Regular expression (for an FST toolkit) and grammar (for a parser): not always flexible enough … sometimes you need to open up the parser and rejigger it. Remedy: declarative specification of algorithms (a new toolkit).
• Feature set (for a maxent distribution, SVM, etc.): not always flexible enough … sometimes you need to open up the learner and rejigger it. Remedy: declarative specification of objective functions (existing toolkits).
How you build a system (“big picture” slide)
cool model (e.g., a PCFG)
  → practical equations
  → pseudocode (execution order), e.g.:
      for width from 2 to n
        for i from 0 to n-width
          k = i + width
          for j from i+1 to k-1
            …
  → tuned C++ implementation (data structures, etc.)
Wait a minute … Didn’t I just implement something like this last month? • chart management / indexing • cache-conscious data structures • prioritization of partial solutions (best-first, A*) • parameter management • inside-outside formulas • different algorithms for training and decoding • conjugate gradient, annealing, ... • parallelization? I thought computers were supposed to automate drudgery
The same picture, annotated: the Dyna language specifies the practical equations.
• Most programs just need to compute some values from other values. Any order is OK.
• Some programs also need to update the outputs if the inputs change:
  • spreadsheets, makefiles, email readers
  • dynamic graph algorithms
  • EM and other iterative optimization
  • leave-one-out training of smoothing parameters
Getting from the practical equations down to pseudocode and tuned C++ is a matter of compilation strategies (we’ll come back to this).
Writing equations in Dyna
• int a.  a = b * c.
  a will be kept up to date if b or c changes.
• b += x.  b += y.   (equivalent to b = x + y.)
  b is a sum of two variables, also kept up to date.
• c += z(1).  c += z(2).  c += z(3).  c += z("four").  c += z(foo(bar,5)).
  c += z(N).
  c is a sum of all nonzero z(…) values. At compile time, we don’t know how many! The capitalized N is a “pattern” that matches anything.
More interesting use of patterns
• a = b * c.
  scalar multiplication
• a(I) = b(I) * c(I).
  pointwise multiplication
• a += b(I) * c(I).
  dot product; could be sparse. a is a single value: the sum of b(I)*c(I) over all I (e.g., a sparse dot product of query & document: … + b("yetis")*c("yetis") + b("zebra")*c("zebra"))
• a(I,K) += b(I,J) * c(J,K).
  matrix multiplication; could be sparse. J is free on the right-hand side, so we sum over it.
Dyna vs. Prolog
By now you may see what we’re up to! Prolog has Horn clauses:
  a(I,K) :- b(I,J) , c(J,K).
Dyna has “Horn equations,” where each item has a value (e.g., a real number) defined from other values:
  a(I,K) += b(I,J) * c(J,K).
Like Prolog: allows nested terms; syntactic sugar for lists, etc.; Turing-complete.
Unlike Prolog: charts, not backtracking! Compiles to efficient C++ classes; integrates with your C++ code.
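To make the contrast concrete, here is a small sketch in the same notation (my example, not from this talk; it assumes the min= aggregator that appears elsewhere in the Dyna literature):

  % Prolog can only say which nodes are reachable:
  %   reachable(V) :- reachable(U), edge(U,V).
  % The Dyna version computes the cost of the cheapest path to each node,
  % and keeps it up to date if edge costs change:
  pathto(start) min= 0.
  pathto(V) min= pathto(U) + edge(U,V).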
The CKY inside algorithm in Dyna

  :- double item = 0.
  :- bool length = false.

  constit(X,I,J) += word(W,I,J) * rewrite(X,W).
  constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
  goal += constit("s",0,N) if length(N).

Put in the axioms (values not defined by the program above), and the theorem pops out:

  using namespace cky;
  chart c;
  c[rewrite("s","np","vp")] = 0.7;
  c[word("Pierre",0,1)] = 1;
  c[length(30)] = true;   // 30-word sentence
  cin >> c;               // get more axioms from stdin
  cout << c[goal];        // print total weight of all parses
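For instance, if the grammar also contained a terminal rule (a hypothetical axiom, added here just to show the first rule firing):

  c[rewrite("np","Pierre")] = 0.003;   // hypothetical: np rewrites as "Pierre"

then the first Dyna rule would derive constit("np",0,1) with value 1 * 0.003, and the second rule would combine that with neighboring constituents to build larger ones.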
[Screenshot: a visual debugger lets you browse the proof forest, showing ambiguity and shared substructure.]
Related algorithms in Dyna?

  constit(X,I,J) += word(W,I,J) * rewrite(X,W).
  constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
  goal += constit("s",0,N) if length(N).

• Viterbi parsing? Change each += above to max=.
• Logarithmic domain? Change each += to log+= (for Viterbi in the log domain, max= is unchanged).
• Lattice parsing? No change to the program; the word axioms just use lattice states instead of string positions, e.g., c[word("Pierre", state(5), state(9))] = 0.2 in place of c[word("Pierre",0,1)] = 1. [Figure: a small word lattice with weighted arcs such as Pierre/0.2, P/0.5, air/0.3.]
• Incremental (left-to-right) parsing? Just add words one at a time to the chart; check at any time what can be derived from the words so far. Similarly, dynamic grammars.
• Log-linear parsing? Again, no change to the Dyna program.
• Lexicalized or synchronous parsing? Basically, just add extra arguments to the terms above.
• Earley’s algorithm? See the next slide.
• Binarized CKY? See the folding transformation below.
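Spelling out the first two answers, the following is just the slide’s max=/log+= annotations applied to the program (for the log domain I have also turned the products into sums, since values are now log-probabilities; the slide marks only the aggregators):

  % Viterbi: goal becomes the weight of the best parse instead of the total weight.
  constit(X,I,J) max= word(W,I,J) * rewrite(X,W).
  constit(X,I,J) max= constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
  goal max= constit("s",0,N) if length(N).

  % Log domain: log+= aggregates by log(exp(...) + exp(...)).
  constit(X,I,J) log+= word(W,I,J) + rewrite(X,W).
  constit(X,I,J) log+= constit(Y,I,Mid) + constit(Z,Mid,J) + rewrite(X,Y,Z).
  goal log+= constit("s",0,N) if length(N).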
Earley’s algorithm in Dyna
Start from CKY:

  constit(X,I,J) += word(W,I,J) * rewrite(X,W).
  constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).
  goal += constit("s",0,N) if length(N).

and apply the magic templates transformation (as noted by Minnen 1996):

  need("s",0) = true.
  need(Nonterm,J) |= ?constit(_/[Nonterm|_],_,J).
  constit(Nonterm/Needed,I,I) += rewrite(Nonterm,Needed) if need(Nonterm,I).
  constit(Nonterm/Needed,I,K) += constit(Nonterm/[W|Needed],I,J) * word(W,J,K).
  constit(Nonterm/Needed,I,K) += constit(Nonterm/[X|Needed],I,J) * constit(X/[],J,K).
  goal += constit("s"/[],0,N) if length(N).
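A gloss of how this lines up with Earley’s three classical inference rules (my reading of the program above; constit(Nonterm/Needed, I, J) plays the role of a dotted rule whose still-unmatched right-hand-side symbols form the list Needed):
• predict: the rewrite(Nonterm,Needed) rule starts an empty-span dotted rule constit(Nonterm/Needed, I, I), but only at positions I where need(Nonterm, I) says a Nonterm could be useful.
• scan: the word(W,J,K) rule moves the dot over the next needed word W.
• complete: the constit(X/[],J,K) rule moves the dot over a finished constituent, i.e., one that needs nothing more.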
Program transformations
Blatz & Eisner (FG 2006): there are lots of equivalent ways to write a system of equations! Transforming from one to another may improve efficiency. Many parsing “tricks” can be generalized into automatic transformations that help other programs, too!
Rule binarization

  constit(X,I,J) += constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z).

becomes, via a folding transformation (an asymptotic speedup!):

  constit(X\Y,Mid,J) += constit(Z,Mid,J) * rewrite(X,Y,Z).
  constit(X,I,J) += constit(Y,I,Mid) * constit(X\Y,Mid,J).

The fold regroups the three-way product constit(Y,I,Mid) * constit(Z,Mid,J) * rewrite(X,Y,Z) as constit(Y,I,Mid) * (constit(Z,Mid,J) * rewrite(X,Y,Z)). The same idea shows up in graphical models, constraint programming, and multi-way database joins.
[Figure: a parse fragment X → Y Z over spans (I,Mid) and (Mid,J), rebracketed using the slashed category X\Y over (Mid,J).]
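Why the fold is an asymptotic speedup (a standard counting argument, not spelled out on the slide): with g nonterminals and n string positions, the original rule has six free variables X, Y, Z, I, Mid, J, so it can be instantiated O(g³n³) ways. After folding, the first new rule ranges over X, Y, Z, Mid, J (O(g³n²) instantiations) and the second over X, Y, I, Mid, J (O(g²n³)), for a total of O(g³n² + g²n³) rather than O(g³n³).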
More program transformations
• Examples that add new semantics
  • Compute gradient (e.g., derive the outside algorithm from the inside algorithm)
  • Compute upper bounds for A* (e.g., Klein & Manning ACL’03)
  • Coarse-to-fine (e.g., Johnson & Charniak NAACL’06)
• Examples that preserve semantics
  • On-demand computation – by analogy with Earley’s algorithm
    • On-the-fly composition of FSTs
    • Left-corner filter for parsing
  • Program specialization as unfolding – e.g., compile out the grammar
  • Rearranging computations – by analogy with categorial grammar
    • Folding reinterpreted as slashed categories
    • “Speculative computation” using slashed categories
      • abstracts away repeated computation to do it once only – by analogy with unary-rule closure or epsilon-closure
      • derives the Eisner & Satta ACL’99 O(n³) bilexical parser
Back to the big picture: from the practical equations to pseudocode, use a general method. Propagate updates from right to left through the equations, a.k.a. the “agenda algorithm,” “forward chaining,” “bottom-up inference,” “semi-naïve bottom-up.”
Bottom-up inference
The solver keeps an agenda of pending updates, the rules of the program, and a chart of derived items with their current values.
Rules:

  pp(I,K) += prep(I,J) * np(J,K)
  s(I,K) += np(I,J) * vp(J,K)

A worked step: we pop the update np(3,5) += 0.3 from the agenda. We updated np(3,5); what else must therefore change?
• Query prep(I,3): the chart has prep(2,3) = 1.0, so push pp(2,5) += 0.3.
• Query vp(5,K): the chart has vp(5,9) = 0.5 and vp(5,7) = 0.7, so push s(3,9) += 0.15 and s(3,7) += 0.21; no more matches to this query.
• Fold the update into the chart: np(3,5) = 0.1 + 0.3 = 0.4. (If np(3,5) hadn’t been in the chart already, we would have added it.)
From pseudocode down to the tuned C++ implementation: what’s going on under the hood?
Compiler provides …
• an efficient priority queue for the agenda of pending updates
• hard-coded pattern matching for the rules of the program (e.g., matching an update like np(3,5) += 0.3 against s(I,K) += np(I,J) * vp(J,K))
• automatic indexing of the chart of derived items for O(1) lookup (e.g., of queries like vp(5,K))
• fast copy, compare, and hash of terms via integerization (interning)
• efficient storage of terms (native C++ types, “symbiotic” storage, garbage collection, serialization, …)
Beware double-counting!
With the rule

  n(I,K) += n(I,J) * n(J,K)

an epsilon constituent such as n(5,5) can combine with itself to make another copy of itself. Suppose the chart has n(5,5) = 0.2 and the agenda pops the update n(5,5) += 0.3: the update matches the rule both as n(I,J) and as n(J,K), so the solver must be careful how much it pushes back onto the agenda for n(5,5).
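Spelled out with the slide’s numbers (my arithmetic): let v = 0.2 be the old chart value and Δ = 0.3 the popped update. The rule’s contribution from the pair (n(5,5), n(5,5)) must grow from v·v to (v+Δ)·(v+Δ), i.e., by 2vΔ + Δ² = 0.21. Joining Δ against the new chart value on both sides would push Δ·(v+Δ) + (v+Δ)·Δ = 0.30, counting Δ² twice; one standard fix is to join the update against the old value on one side and the new value on the other, Δ·v + (v+Δ)·Δ = 0.21, which is exact.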
DynaMITE: training toolkit
Parameter training: maximize some objective function whose value is a theorem, computed by Dyna from the model parameters (and input sentence), which are axiom values. E.g., the inside algorithm computes the likelihood of the sentence.
• Use Dyna to compute the function. Then how do you differentiate it … for gradient ascent, conjugate gradient, etc.? (The gradient also tells us the expected counts for EM!)
• Two approaches:
  • Program transformation: automatically derive the “outside” formulas.
  • Back-propagation: run the agenda algorithm “backwards.” Works even with pruning, early stopping, etc.
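Why the gradient gives EM’s expected counts (my gloss of a standard identity): if goal = Z is the inside probability and θ_r is the weight of rule r, each derivation contributes a product containing one factor of θ_r per use of r, so θ_r · ∂Z/∂θ_r sums the derivation weights multiplied by how many times each uses r. Dividing by Z gives the expected count of r under the posterior, exactly what the E step needs; the “outside” value of an item is likewise ∂Z/∂(that item’s inside value).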
Some examples from my lab …
• Parsing using …
  • factored dependency models (Dreyer, Smith & Smith CoNLL’06)
  • annealed risk minimization (Smith & Eisner EMNLP’06)
  • constraints on dependency length (Eisner & Smith IWPT’05)
  • unsupervised learning of deep transformations (see Eisner EMNLP’02)
  • lexicalized algorithms (see Eisner & Satta ACL’99, etc.)
• Grammar induction using …
  • partial supervision (Dreyer & Eisner EMNLP’06)
  • structural annealing (Smith & Eisner ACL’06)
  • contrastive estimation (Smith & Eisner GIA’05)
  • deterministic annealing (Smith & Eisner ACL’04)
• Machine translation using …
  • very large neighborhood search of permutations (Eisner & Tromble NAACL-W’06)
  • loosely syntax-based MT (Smith & Eisner, in prep.; see also Eisner ACL’03)
  • synchronous cross-lingual parsing (Smith & Smith EMNLP’04)
• Finite-state methods for morphology, phonology, IE, even syntax …
  • unsupervised cognate discovery (Schafer & Yarowsky ’05, ’06)
  • unsupervised log-linear models via contrastive estimation (Smith & Eisner ACL’05)
  • context-based morphological disambiguation (Smith, Smith & Tromble EMNLP’05)
  • trainable (in)finite-state machines (see Eisner ACL’02, EMNLP’02, …)
  • finite-state machines with very large alphabets (see Eisner ACL’97)
  • finite-state machines over weird semirings (see Eisner ACL’02, EMNLP’03)
• Teaching (Eisner JHU’05-06; Smith & Tromble JHU’04)
Easy to try stuff out! Programs are very short & easy to change!
Can it express everything in NLP?
• Remember, it integrates tightly with C++, so you only have to use it where it’s helpful, and write the rest in C++. Small is beautiful.
• We’re currently extending the class of allowed formulas “beyond the semiring” (cf. Goodman 1999); this will be able to express smoothing, neural nets, etc.
• Of course, it is Turing-complete …