Dependency Parsing by Belief Propagation
David A. Smith (JHU → UMass Amherst)    Jason Eisner (Johns Hopkins University)
Outline
• Edge-factored parsing (old)
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
• Higher-order parsing (new!)
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions
Word Dependency Parsing (slide adapted from Yuji Matsumoto)
Raw sentence: He reckons the current account deficit will narrow to only 1.8 billion in September.
↓ part-of-speech tagging
POS-tagged sentence: He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
↓ word dependency parsing
Word-dependency-parsed sentence: the same words, now connected by labeled dependency links (SUBJ, MOD, SPEC, COMP, S-COMP, ROOT).
What does parsing have to do with (loopy) belief propagation?
Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)
• In the beginning, we used generative models: p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * … — each choice depends on a limited part of the history.
• But which dependencies to allow? What if they're all worthwhile? Should it be p(D | A,B,C)? … or p(D | A,B) * p(C | A,B,D)?
Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)
• In the beginning, we used generative models: p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * … — but which dependencies to allow, given limited training data?
• Solution: log-linear (max-entropy) modeling — throw them all in: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
• Features may interact in arbitrary ways.
• Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
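To make the log-linear idea concrete, here is a minimal Python sketch (with hypothetical factor tables, not the slides' actual model): the unnormalized score of an assignment to A, B, C, D is the product of whatever factors Φ we throw in, and Z is the brute-force sum of that product over all assignments. It illustrates only the probability definition, not the iterative-scaling training mentioned above.

from itertools import product

def loglin_prob(assignment, factors, variables, domain):
    """p(x) = (1/Z) * prod_k Phi_k(x) for a tiny discrete log-linear model.
    `factors` is a list of (vars, table) pairs; `table` maps a tuple of values
    of `vars` to a nonnegative potential Phi.  Z is computed by brute force,
    so this toy only works for a handful of variables such as A, B, C, D."""
    def unnorm(x):
        p = 1.0
        for vars_k, table in factors:
            p *= table[tuple(x[v] for v in vars_k)]
        return p
    Z = sum(unnorm(dict(zip(variables, vals)))
            for vals in product(domain, repeat=len(variables)))
    return unnorm(assignment) / Z

# Hypothetical example: two binary variables, one unary and one pairwise factor.
factors = [(("A",), {("0",): 1.0, ("1",): 2.0}),
           (("A", "B"), {("0", "0"): 3.0, ("0", "1"): 1.0,
                         ("1", "0"): 1.0, ("1", "1"): 3.0})]
print(loglin_prob({"A": "1", "B": "1"}, factors, ("A", "B"), ("0", "1")))  # 0.5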
How about structured outputs?
• Log-linear models are great for n-way classification.
• Also good for predicting sequences (e.g., tagging "find preferred tags" as v a n) — but to allow fast dynamic programming, only use n-gram features.
• Also good for dependency parsing ("…find preferred links…") — but to allow fast dynamic programming or MST parsing, only use single-edge features.
Edge-Factored Parsers (McDonald et al. 2005)
• Example sentence: Byl jasný studený dubnový den a hodiny odbíjely třináctou ("It was a bright cold day in April and the clocks were striking thirteen"), with tags V A A A N J N V C.
• Is this a good edge (jasný – den)? Yes — lots of green: many positively weighted features fire, e.g.
  • word pair: jasný den ("bright day")
  • word + tag: jasný N ("bright NOUN")
  • tag pair: A N
  • tag pair in context: A N preceding a conjunction
Edge-Factored Parsers (McDonald et al. 2005)
• How about this competing edge (jasný – hodiny)? Not as good — lots of red: its features have low or negative weights, e.g.
  • word pair: jasný hodiny ("bright clocks") — rare, so this feature is undertrained
  • stem pair: jasn hodi ("bright clock", stems only; stemmed sentence: byl jasn stud dubn den a hodi odbí třin)
  • tags with number: an adjective and a noun that disagree in number (jasný is singular, hodiny plural)
  • tag pair in context: A N where the N follows a conjunction
• So which edge is better — "bright day" or "bright clocks"?
Edge-Factored Parsers (McDonald et al. 2005)
• Which edge is better?
• Score of an edge e = θ ⋅ features(e), where θ is our current weight vector.
• Standard algorithms then find the valid parse with maximum total score.
• A valid parse obeys hard constraints: one parent per word, no crossing links, no cycles — so some edges are mutually incompatible.
• Thus, an edge may lose (or win) because of a consensus of other edges.
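As a concrete (and purely hypothetical) illustration of "score of an edge e = θ ⋅ features(e)", here is a small Python sketch: made-up feature templates in the spirit of the slides (word pair, lemma pair, tag pair, tag pair plus a neighboring tag), a weight vector stored as a dict, and an edge-factored parse score that is just the sum of its edge scores. The actual feature set of McDonald et al. (2005) is richer and differs in detail.

def edge_features(sent, h, m):
    """Feature names for a candidate edge from head position h to modifier m.
    `sent` is a list of (word, lemma, tag) triples; position 0 is the ROOT.
    These templates are illustrative, not the paper's."""
    hw, hl, ht = sent[h]
    mw, ml, mt = sent[m]
    prev_tag = sent[m - 1][2] if m > 0 else "<s>"
    return [f"word:{hw}_{mw}", f"lemma:{hl}_{ml}",
            f"tag:{ht}_{mt}", f"tag+prev:{ht}_{mt}_{prev_tag}"]

def edge_score(weights, sent, h, m):
    # score of one edge = dot product of the weight vector with its features
    return sum(weights.get(f, 0.0) for f in edge_features(sent, h, m))

def parse_score(weights, sent, heads):
    # edge-factored score of a whole parse = sum of its edge scores
    return sum(edge_score(weights, sent, heads[m], m) for m in range(1, len(sent)))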
Finding Highest-Scoring Parse
• Example: The cat in the hat wore a stovepipe. (ROOT) — let's vertically stretch this graph drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a. Each subtree is a linguistic constituent (here, a noun phrase).
• Convert to a context-free grammar (CFG), then use dynamic programming.
• The CKY algorithm for CFG parsing is O(n³) — unfortunately O(n⁵) in this case:
  • to score the "cat–wore" link, it is not enough to know the constituent is an NP;
  • we must know it is rooted at "cat",
  • so the nonterminal set grows by a factor of O(n): {NP_the, NP_cat, NP_hat, ...},
  • and CKY's "grammar constant" is no longer constant.
• Solution: use a different decomposition (Eisner 1996) — back to O(n³). (A sketch of this dynamic program follows below.)
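Here is a minimal sketch of that O(n³) decomposition: a first-order, Eisner-style dynamic program over "complete" and "incomplete" half-spans. To stay short it returns only the best projective parse score (no backpointers) and lets the artificial ROOT take several children; it is an illustration of the idea, not the exact formulation in Eisner (1996).

import numpy as np

def eisner_best_score(score):
    """Best projective, edge-factored parse score.
    score[h, m] = score of dependency h -> m; word 0 is the artificial ROOT.
    O(n^3) time, O(n^2) space."""
    n = score.shape[0]
    NEG = -np.inf
    # C[s, t, d] / I[s, t, d]: best complete / incomplete span over words s..t
    # whose head is s (d = 1) or t (d = 0).
    I = np.full((n, n, 2), NEG)
    C = np.full((n, n, 2), NEG)
    for i in range(n):
        C[i, i, 0] = C[i, i, 1] = 0.0
    for length in range(1, n):
        for s in range(n - length):
            t = s + length
            # incomplete spans: two complete half-spans meet, plus the new edge
            best_split = max(C[s, r, 1] + C[r + 1, t, 0] for r in range(s, t))
            I[s, t, 1] = best_split + score[s, t]   # edge s -> t
            I[s, t, 0] = best_split + score[t, s]   # edge t -> s
            # complete spans: an incomplete span absorbs a complete one
            C[s, t, 1] = max(I[s, r, 1] + C[r, t, 1] for r in range(s + 1, t + 1))
            C[s, t, 0] = max(C[s, r, 0] + I[r, t, 0] for r in range(s, t))
    return C[0, n - 1, 1]   # ROOT (word 0) heads a span covering the whole sentence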
Spans vs. Constituents
Two kinds of substring of "The cat in the hat wore a stovepipe. ROOT":
• Constituent of the tree: links to the rest only through its headword (root).
• Span of the tree: links to the rest only through its endwords.
Decomposing a tree into spans
The cat in the hat wore a stovepipe. ROOT
= The cat  +  cat in the hat wore a stovepipe. ROOT
= The cat  +  cat in the hat wore  +  wore a stovepipe. ROOT
= The cat  +  cat in  +  in the hat wore  +  wore a stovepipe. ROOT
= The cat  +  cat in  +  in the hat  +  hat wore  +  wore a stovepipe. ROOT
(adjacent spans overlap in a single shared endword)
Finding Highest-Scoring Parse
• With the span decomposition we can play the usual dynamic-programming parsing tricks:
  • further refine the constituents or spans, letting the probability model track even more internal information
  • A*, best-first, and coarse-to-fine search
  • training by EM etc., which requires "outside" probabilities of constituents, spans, or links
Hard Constraints on Valid Trees
• Score of an edge e = θ ⋅ features(e), where θ is our current weight vector; standard algorithms find the valid parse with maximum total score.
• The hard constraints: each word has exactly one parent, no crossing links, no cycles.
• Thus, an edge may lose (or win) because of a consensus of other edges.
Non-Projective Parses
ROOT I 'll give a talk tomorrow on bootstrapping — the subtree rooted at "talk" is a discontiguous noun phrase, so two links must cross.
The "projectivity" restriction (no crossing links): do we really want it?
Non-Projective Parses
• Occasional non-projectivity in English: ROOT I 'll give a talk tomorrow on bootstrapping.
• Frequent non-projectivity in Latin, etc.: ROOT ista meam norit gloria canitiem (that-NOM my-ACC may-know glory-NOM going-gray-ACC), i.e. "That glory may-know my going-gray" (it shall last till I go gray).
Finding highest-scoring non-projective tree
• Consider the sentence "John saw Mary": the slide's figure shows a weighted directed graph over {root, saw, John, Mary} and, beside it, its maximum-weight spanning tree.
• The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree — which may be non-projective — in O(n²) time: every node selects its best parent; if there are cycles, contract them and repeat.
(slide thanks to Dragomir Radev)
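The first step of Chu-Liu-Edmonds is easy to sketch. In the hypothetical helper below, every non-root word greedily picks its highest-scoring parent, and we then check whether the chosen edges contain a cycle; the cycle-contraction and recursion of the full algorithm are omitted for brevity.

def greedy_heads_and_cycle(score):
    """score[h][m] = weight of edge h -> m; node 0 is the root.
    Returns (head, cycle): the greedy parent choice for every non-root node,
    and one cycle among those choices (or None if they already form a tree)."""
    n = len(score)
    head = {m: max((h for h in range(n) if h != m), key=lambda h: score[h][m])
            for m in range(1, n)}
    for start in range(1, n):
        seen, v = set(), start
        while v != 0 and v not in seen:   # follow parent pointers toward the root
            seen.add(v)
            v = head[v]
        if v != 0:                        # we revisited v, so v lies on a cycle
            cycle, u = [v], head[v]
            while u != v:
                cycle.append(u)
                u = head[u]
            return head, cycle            # full CLE would contract this cycle and repeat
    return head, None                     # no cycle: head already is the best tree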
Summing over all non-projective trees
• How about the total weight Z of all trees? How about outside probabilities or gradients?
• These can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).
(slide thanks to Dragomir Radev)
Graph Theory to the Rescue!
Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (aka Laplacian) matrix of a directed graph G, with row and column r struck out, equals the sum of scores of all directed spanning trees of G rooted at node r.
Exactly the Z we need — in O(n³) time!
Building the Kirchhoff (Laplacian) Matrix
• Negate the edge scores (the off-diagonal entries).
• Put each column's sum on the diagonal (each column corresponds to a child word).
• Strike out the root row and column.
• Take the determinant.
N.B.: This allows multiple children of the root, but see Koo et al. (2007).
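Here is a minimal NumPy sketch of exactly that construction, using nonnegative multiplicative edge weights (a tree's weight is the product of its edges' weights). In practice one works in log space and with more careful numerics; this just shows the bare recipe from the slide.

import numpy as np

def total_tree_weight(score):
    """Z = sum over all directed spanning trees rooted at node 0 of the product
    of their edge weights, where score[h, m] >= 0 is the weight of edge h -> m.
    Like the slide's construction, this lets the root take multiple children."""
    A = score.astype(float).copy()
    np.fill_diagonal(A, 0.0)              # no self-loops
    L = -A                                # negate the edge scores ...
    np.fill_diagonal(L, A.sum(axis=0))    # ... and put column sums on the diagonal
    minor = L[1:, 1:]                     # strike the root row and column
    return np.linalg.det(minor)           # the determinant is exactly Z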
Why Should This Work?
• Clear for a 1×1 matrix; proceed by induction.
• Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur.
• The classical theorem handles the undirected case; the directed case needs special treatment of the root.
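A quick sanity check on the smallest interesting case makes the induction plausible. With root 0, words 1 and 2, and weight w_{hm} on edge h → m, the Kirchhoff matrix with the root row and column struck out is

K = \begin{pmatrix} w_{01}+w_{21} & -w_{12} \\ -w_{21} & w_{02}+w_{12} \end{pmatrix}, \qquad
\det K = w_{01}w_{02} + w_{01}w_{12} + w_{02}w_{21},

one term for each spanning tree rooted at 0 — {0→1, 0→2}, {0→1, 1→2}, {0→2, 2→1} — while the would-be cycle term w_{12}w_{21} cancels.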
Exactly Finding the Best Parse
• With only single-edge features ("…find preferred links…"), projective parsing takes O(n³) by dynamic programming, and non-projective parsing takes O(n²) by minimum spanning tree.
• With richer features, the runtime of exact parsing blows up: grandparents, grandparent + sibling bigrams, POS trigrams, non-adjacent sibling pairs, … push it to O(n⁴), O(n⁵), O(n³g⁶), even O(2ⁿ).
• Throw in any of the above features, soft penalties for crossing links, or pretty much anything else, and exact parsing becomes NP-hard.
Let's reclaim our freedom (again!) — This paper in a nutshell
• Output probability is a product of local factors: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
• Throw in any factors we want! (log-linear model)
• How could we find the best parse?
  • Integer linear programming (Riedel et al., 2006) — doesn't give us probabilities when training or parsing
  • MCMC — slow to mix? high rejection rate because of the hard TREE constraint?
  • Greedy hill-climbing (McDonald & Pereira 2006)
  • …and none of these exploit the tree structure of parses as the first-order methods do.
Let's reclaim our freedom (again!) — This paper in a nutshell
• Output probability is a product of local factors: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
• Throw in any factors we want! (log-linear model)
• Instead, let the local factors negotiate via "belief propagation": links (and tags) reinforce or suppress one another (see the message-passing sketch below).
• Each iteration takes total time O(n²) or O(n³); the result converges to a pretty good (but approximate) global parse.
• Certain global factors are OK too: each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside).
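To show what "local factors negotiating" looks like mechanically, here is a generic sum-product loopy BP sketch over binary variables with unary and pairwise factors — a toy stand-in for the paper's link and tag variables (whose factor graph also contains higher-order and global TREE factors, handled by embedded parsing algorithms). All names below are illustrative.

import numpy as np

def loopy_bp(unary, pair, edges, iters=20):
    """Sum-product loopy belief propagation over binary variables.
    unary[i]     : length-2 numpy array of unary potentials for variable i
    pair[(i, j)] : 2x2 numpy array of pairwise potentials, indexed [x_i, x_j],
                   for each undirected edge (i, j) listed in `edges`
    Returns approximate marginals ("beliefs") for every variable."""
    msg = {(i, j): np.ones(2) for (i, j) in edges}
    msg.update({(j, i): np.ones(2) for (i, j) in edges})
    nbrs = {}
    for i, j in edges:
        nbrs.setdefault(i, []).append(j)
        nbrs.setdefault(j, []).append(i)
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            # product of i's unary potential and messages into i from all neighbors but j
            incoming = unary[i].astype(float)
            for k in nbrs[i]:
                if k != j:
                    incoming = incoming * msg[(k, i)]
            pot = pair[(i, j)] if (i, j) in pair else pair[(j, i)].T
            out = pot.T @ incoming          # sum out x_i against the pairwise factor
            new[(i, j)] = out / out.sum()   # normalize for numerical stability
        msg = new                           # parallel ("flooding") update
    beliefs = {}
    for i in nbrs:
        b = unary[i].astype(float)
        for k in nbrs[i]:
            b = b * msg[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs

On a graph with cycles the messages are not guaranteed to converge exactly, but a fixed number of iterations usually yields good approximate beliefs — which is how the parser's per-link beliefs are used.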
Local factors in a graphical model
• First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
• The observed input sentence ("… find preferred tags …") is shaded; above it sits one tag variable per word. A possible tagging is an assignment to those remaining variables, e.g. v v v — or another possible tagging, v a n.
• A "binary" factor measures the compatibility of two adjacent tags; the model reuses the same parameters at each position.
• A "unary" factor evaluates the tag at one position. Its values depend on the corresponding word (e.g., it can rule out adj for a word that can't be an adjective) and could be made to depend on the entire observed sentence. There is a different unary factor at each position.
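Because these unary and binary factors define an unnormalized product of potentials over taggings, the familiar forward algorithm computes the normalizer Z for this linear-chain case. The sketch below uses illustrative shapes (K possible tags); the paper's belief propagation generalizes exactly this computation to loopier graphs that also contain parsing variables.

import numpy as np

def chain_crf_z(unary, binary):
    """Z for a linear-chain CRF: the sum over all taggings of the product of factors.
    unary[t] : length-K array; unary[t][y] = potential of tag y at position t
               (a different unary factor at each position, driven by the word there)
    binary   : K x K array; binary[y, y2] = compatibility of adjacent tags y and y2
               (the same binary factor reused at every position, as on the slides)"""
    alpha = np.array(unary[0], dtype=float)
    for t in range(1, len(unary)):
        # alpha'[y2] = sum_y alpha[y] * binary[y, y2] * unary[t][y2]
        alpha = (alpha @ binary) * unary[t]
    return float(alpha.sum())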