Dependency Parsing by Belief Propagation
David A. Smith (JHU → UMass Amherst)    Jason Eisner (Johns Hopkins University)
Outline
• Edge-factored parsing (old)
  • Dependency parses
  • Scoring the competing parses: Edge features
  • Finding the best parse
• Higher-order parsing (new!)
  • Throwing in more features: Graphical models
  • Finding the best parse: Belief propagation
  • Experiments
  • Conclusions
Word Dependency Parsing (slide adapted from Yuji Matsumoto)
Raw sentence: He reckons the current account deficit will narrow to only 1.8 billion in September.
↓ part-of-speech tagging
POS-tagged sentence: He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
↓ word dependency parsing
Word-dependency-parsed sentence: the same words, now connected by labeled dependency links (SUBJ, MOD, SPEC, COMP, S-COMP, ROOT).
What does parsing have to do with (loopy) belief propagation?
Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)
• In the beginning, we used generative models: p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * … — each choice depends on a limited part of the history.
• But which dependencies to allow? What if they're all worthwhile? Should it be p(D | A,B,C)? … or p(D | A,B) * p(C | A,B,D)?
Great ideas in NLP: Log-linear models (Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)
• In the beginning, we used generative models: p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * … — but which dependencies to allow, given limited training data?
• Solution: log-linear (max-entropy) modeling — throw them all in: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
• Features may interact in arbitrary ways.
• Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
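To make the log-linear idea concrete, here is a minimal Python sketch (with hypothetical factor tables, not the slides' actual model): the unnormalized score of an assignment to A, B, C, D is the product of whatever factors Φ we throw in, and Z is the brute-force sum of that product over all assignments. It illustrates only the probability definition, not the iterative-scaling training mentioned above.

from itertools import product

def loglin_prob(assignment, factors, variables, domain):
    """p(x) = (1/Z) * prod_k Phi_k(x) for a tiny discrete log-linear model.
    `factors` is a list of (vars, table) pairs; `table` maps a tuple of values
    of `vars` to a nonnegative potential Phi.  Z is computed by brute force,
    so this toy only works for a handful of variables such as A, B, C, D."""
    def unnorm(x):
        p = 1.0
        for vars_k, table in factors:
            p *= table[tuple(x[v] for v in vars_k)]
        return p
    Z = sum(unnorm(dict(zip(variables, vals)))
            for vals in product(domain, repeat=len(variables)))
    return unnorm(assignment) / Z

# Hypothetical example: two binary variables, one unary and one pairwise factor.
factors = [(("A",), {("0",): 1.0, ("1",): 2.0}),
           (("A", "B"), {("0", "0"): 3.0, ("0", "1"): 1.0,
                         ("1", "0"): 1.0, ("1", "1"): 3.0})]
print(loglin_prob({"A": "1", "B": "1"}, factors, ("A", "B"), ("0", "1")))  # 0.5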
How about structured outputs?
• Log-linear models are great for n-way classification.
• Also good for predicting sequences (e.g., tagging "find preferred tags" as v a n) — but to allow fast dynamic programming, only use n-gram features.
• Also good for dependency parsing ("…find preferred links…") — but to allow fast dynamic programming or MST parsing, only use single-edge features.
Edge-Factored Parsers (McDonald et al. 2005)
• Example sentence: Byl jasný studený dubnový den a hodiny odbíjely třináctou ("It was a bright cold day in April and the clocks were striking thirteen"), with tags V A A A N J N V C.
• Is this a good edge (jasný – den)? Yes — lots of green: many positively weighted features fire, e.g.
  • word pair: jasný den ("bright day")
  • word + tag: jasný N ("bright NOUN")
  • tag pair: A N
  • tag pair in context: A N preceding a conjunction
Edge-Factored Parsers (McDonald et al. 2005)
• How about this competing edge (jasný – hodiny)? Not as good — lots of red: its features have low or negative weights, e.g.
  • word pair: jasný hodiny ("bright clocks") — rare, so this feature is undertrained
  • stem pair: jasn hodi ("bright clock", stems only; stemmed sentence: byl jasn stud dubn den a hodi odbí třin)
  • tags with number: an adjective and a noun that disagree in number (jasný is singular, hodiny plural)
  • tag pair in context: A N where the N follows a conjunction
• So which edge is better — "bright day" or "bright clocks"?
Edge-Factored Parsers (McDonald et al. 2005)
• Which edge is better?
• Score of an edge e = θ ⋅ features(e), where θ is our current weight vector.
• Standard algorithms then find the valid parse with maximum total score.
• A valid parse obeys hard constraints: one parent per word, no crossing links, no cycles — so some edges are mutually incompatible.
• Thus, an edge may lose (or win) because of a consensus of other edges.
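As a concrete (and purely hypothetical) illustration of "score of an edge e = θ ⋅ features(e)", here is a small Python sketch: made-up feature templates in the spirit of the slides (word pair, lemma pair, tag pair, tag pair plus a neighboring tag), a weight vector stored as a dict, and an edge-factored parse score that is just the sum of its edge scores. The actual feature set of McDonald et al. (2005) is richer and differs in detail.

def edge_features(sent, h, m):
    """Feature names for a candidate edge from head position h to modifier m.
    `sent` is a list of (word, lemma, tag) triples; position 0 is the ROOT.
    These templates are illustrative, not the paper's."""
    hw, hl, ht = sent[h]
    mw, ml, mt = sent[m]
    prev_tag = sent[m - 1][2] if m > 0 else "<s>"
    return [f"word:{hw}_{mw}", f"lemma:{hl}_{ml}",
            f"tag:{ht}_{mt}", f"tag+prev:{ht}_{mt}_{prev_tag}"]

def edge_score(weights, sent, h, m):
    # score of one edge = dot product of the weight vector with its features
    return sum(weights.get(f, 0.0) for f in edge_features(sent, h, m))

def parse_score(weights, sent, heads):
    # edge-factored score of a whole parse = sum of its edge scores
    return sum(edge_score(weights, sent, heads[m], m) for m in range(1, len(sent)))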
Finding Highest-Scoring Parse
• Example: The cat in the hat wore a stovepipe. (ROOT) — let's vertically stretch this graph drawing: ROOT → wore; wore → cat, stovepipe; cat → The, in; in → hat; hat → the; stovepipe → a. Each subtree is a linguistic constituent (here, a noun phrase).
• Convert to a context-free grammar (CFG), then use dynamic programming.
• The CKY algorithm for CFG parsing is O(n³) — unfortunately O(n⁵) in this case:
  • to score the "cat–wore" link, it is not enough to know the constituent is an NP;
  • we must know it is rooted at "cat",
  • so the nonterminal set grows by a factor of O(n): {NP_the, NP_cat, NP_hat, ...},
  • and CKY's "grammar constant" is no longer constant.
• Solution: use a different decomposition (Eisner 1996) — back to O(n³). (A sketch of this dynamic program follows below.)
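Here is a minimal sketch of that O(n³) decomposition: a first-order, Eisner-style dynamic program over "complete" and "incomplete" half-spans. To stay short it returns only the best projective parse score (no backpointers) and lets the artificial ROOT take several children; it is an illustration of the idea, not the exact formulation in Eisner (1996).

import numpy as np

def eisner_best_score(score):
    """Best projective, edge-factored parse score.
    score[h, m] = score of dependency h -> m; word 0 is the artificial ROOT.
    O(n^3) time, O(n^2) space."""
    n = score.shape[0]
    NEG = -np.inf
    # C[s, t, d] / I[s, t, d]: best complete / incomplete span over words s..t
    # whose head is s (d = 1) or t (d = 0).
    I = np.full((n, n, 2), NEG)
    C = np.full((n, n, 2), NEG)
    for i in range(n):
        C[i, i, 0] = C[i, i, 1] = 0.0
    for length in range(1, n):
        for s in range(n - length):
            t = s + length
            # incomplete spans: two complete half-spans meet, plus the new edge
            best_split = max(C[s, r, 1] + C[r + 1, t, 0] for r in range(s, t))
            I[s, t, 1] = best_split + score[s, t]   # edge s -> t
            I[s, t, 0] = best_split + score[t, s]   # edge t -> s
            # complete spans: an incomplete span absorbs a complete one
            C[s, t, 1] = max(I[s, r, 1] + C[r, t, 1] for r in range(s + 1, t + 1))
            C[s, t, 0] = max(C[s, r, 0] + I[r, t, 0] for r in range(s, t))
    return C[0, n - 1, 1]   # ROOT (word 0) heads a span covering the whole sentence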
Spans vs. Constituents
Two kinds of substring of "The cat in the hat wore a stovepipe. ROOT":
• Constituent of the tree: links to the rest only through its headword (root).
• Span of the tree: links to the rest only through its endwords.
Decomposing a tree into spans
The cat in the hat wore a stovepipe. ROOT
= The cat  +  cat in the hat wore a stovepipe. ROOT
= The cat  +  cat in the hat wore  +  wore a stovepipe. ROOT
= The cat  +  cat in  +  in the hat wore  +  wore a stovepipe. ROOT
= The cat  +  cat in  +  in the hat  +  hat wore  +  wore a stovepipe. ROOT
(adjacent spans overlap in a single shared endword)
Finding Highest-Scoring Parse
• With the span decomposition we can play the usual dynamic-programming parsing tricks:
  • further refine the constituents or spans, letting the probability model track even more internal information
  • A*, best-first, and coarse-to-fine search
  • training by EM etc., which requires "outside" probabilities of constituents, spans, or links
Hard Constraints on Valid Trees
• Score of an edge e = θ ⋅ features(e), where θ is our current weight vector; standard algorithms find the valid parse with maximum total score.
• The hard constraints: each word has exactly one parent, no crossing links, no cycles.
• Thus, an edge may lose (or win) because of a consensus of other edges.
Non-Projective Parses
ROOT I 'll give a talk tomorrow on bootstrapping — the subtree rooted at "talk" is a discontiguous noun phrase, so two links must cross.
The "projectivity" restriction (no crossing links): do we really want it?
Non-Projective Parses
• Occasional non-projectivity in English: ROOT I 'll give a talk tomorrow on bootstrapping.
• Frequent non-projectivity in Latin, etc.: ROOT ista meam norit gloria canitiem (that-NOM my-ACC may-know glory-NOM going-gray-ACC), i.e. "That glory may-know my going-gray" (it shall last till I go gray).
Finding highest-scoring non-projective tree
• Consider the sentence "John saw Mary": the slide's figure shows a weighted directed graph over {root, saw, John, Mary} and, beside it, its maximum-weight spanning tree.
• The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree — which may be non-projective — in O(n²) time: every node selects its best parent; if there are cycles, contract them and repeat.
(slide thanks to Dragomir Radev)
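The first step of Chu-Liu-Edmonds is easy to sketch. In the hypothetical helper below, every non-root word greedily picks its highest-scoring parent, and we then check whether the chosen edges contain a cycle; the cycle-contraction and recursion of the full algorithm are omitted for brevity.

def greedy_heads_and_cycle(score):
    """score[h][m] = weight of edge h -> m; node 0 is the root.
    Returns (head, cycle): the greedy parent choice for every non-root node,
    and one cycle among those choices (or None if they already form a tree)."""
    n = len(score)
    head = {m: max((h for h in range(n) if h != m), key=lambda h: score[h][m])
            for m in range(1, n)}
    for start in range(1, n):
        seen, v = set(), start
        while v != 0 and v not in seen:   # follow parent pointers toward the root
            seen.add(v)
            v = head[v]
        if v != 0:                        # we revisited v, so v lies on a cycle
            cycle, u = [v], head[v]
            while u != v:
                cycle.append(u)
                u = head[u]
            return head, cycle            # full CLE would contract this cycle and repeat
    return head, None                     # no cycle: head already is the best tree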
Summing over all non-projective trees
• How about the total weight Z of all trees? How about outside probabilities or gradients?
• These can be found in time O(n³) by matrix determinants and inverses (Smith & Smith, 2007).
(slide thanks to Dragomir Radev)
Graph Theory to the Rescue!
Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (aka Laplacian) matrix of a directed graph G, with row and column r struck out, equals the sum of scores of all directed spanning trees of G rooted at node r.
Exactly the Z we need — in O(n³) time!
Building the Kirchhoff (Laplacian) Matrix
• Negate the edge scores (the off-diagonal entries).
• Put each column's sum on the diagonal (each column corresponds to a child word).
• Strike out the root row and column.
• Take the determinant.
N.B.: This allows multiple children of the root, but see Koo et al. (2007).
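Here is a minimal NumPy sketch of exactly that construction, using nonnegative multiplicative edge weights (a tree's weight is the product of its edges' weights). In practice one works in log space and with more careful numerics; this just shows the bare recipe from the slide.

import numpy as np

def total_tree_weight(score):
    """Z = sum over all directed spanning trees rooted at node 0 of the product
    of their edge weights, where score[h, m] >= 0 is the weight of edge h -> m.
    Like the slide's construction, this lets the root take multiple children."""
    A = score.astype(float).copy()
    np.fill_diagonal(A, 0.0)              # no self-loops
    L = -A                                # negate the edge scores ...
    np.fill_diagonal(L, A.sum(axis=0))    # ... and put column sums on the diagonal
    minor = L[1:, 1:]                     # strike the root row and column
    return np.linalg.det(minor)           # the determinant is exactly Z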
Why Should This Work?
• Clear for a 1×1 matrix; proceed by induction.
• Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur.
• The classical theorem handles the undirected case; the directed case needs special treatment of the root.
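A quick sanity check on the smallest interesting case makes the induction plausible. With root 0, words 1 and 2, and weight w_{hm} on edge h → m, the Kirchhoff matrix with the root row and column struck out is

K = \begin{pmatrix} w_{01}+w_{21} & -w_{12} \\ -w_{21} & w_{02}+w_{12} \end{pmatrix}, \qquad
\det K = w_{01}w_{02} + w_{01}w_{12} + w_{02}w_{21},

one term for each spanning tree rooted at 0 — {0→1, 0→2}, {0→1, 1→2}, {0→2, 2→1} — while the would-be cycle term w_{12}w_{21} cancels.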
Exactly Finding the Best Parse
• With only single-edge features ("…find preferred links…"), projective parsing takes O(n³) by dynamic programming, and non-projective parsing takes O(n²) by minimum spanning tree.
• With richer features, the runtime of exact parsing blows up: grandparents, grandparent + sibling bigrams, POS trigrams, non-adjacent sibling pairs, … push it to O(n⁴), O(n⁵), O(n³g⁶), even O(2ⁿ).
• Throw in any of the above features, soft penalties for crossing links, or pretty much anything else, and exact parsing becomes NP-hard.
Let's reclaim our freedom (again!) — This paper in a nutshell
• Output probability is a product of local factors: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
• Throw in any factors we want! (log-linear model)
• How could we find the best parse?
  • Integer linear programming (Riedel et al., 2006) — doesn't give us probabilities when training or parsing
  • MCMC — slow to mix? high rejection rate because of the hard TREE constraint?
  • Greedy hill-climbing (McDonald & Pereira 2006)
  • …and none of these exploit the tree structure of parses as the first-order methods do.
Let's reclaim our freedom (again!) — This paper in a nutshell
• Output probability is a product of local factors: (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * …
• Throw in any factors we want! (log-linear model)
• Instead, let the local factors negotiate via "belief propagation": links (and tags) reinforce or suppress one another (see the message-passing sketch below).
• Each iteration takes total time O(n²) or O(n³); the result converges to a pretty good (but approximate) global parse.
• Certain global factors are OK too: each global factor can be handled fast via some traditional parsing algorithm (e.g., inside-outside).
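To show what "local factors negotiating" looks like mechanically, here is a generic sum-product loopy BP sketch over binary variables with unary and pairwise factors — a toy stand-in for the paper's link and tag variables (whose factor graph also contains higher-order and global TREE factors, handled by embedded parsing algorithms). All names below are illustrative.

import numpy as np

def loopy_bp(unary, pair, edges, iters=20):
    """Sum-product loopy belief propagation over binary variables.
    unary[i]     : length-2 numpy array of unary potentials for variable i
    pair[(i, j)] : 2x2 numpy array of pairwise potentials, indexed [x_i, x_j],
                   for each undirected edge (i, j) listed in `edges`
    Returns approximate marginals ("beliefs") for every variable."""
    msg = {(i, j): np.ones(2) for (i, j) in edges}
    msg.update({(j, i): np.ones(2) for (i, j) in edges})
    nbrs = {}
    for i, j in edges:
        nbrs.setdefault(i, []).append(j)
        nbrs.setdefault(j, []).append(i)
    for _ in range(iters):
        new = {}
        for (i, j) in msg:
            # product of i's unary potential and messages into i from all neighbors but j
            incoming = unary[i].astype(float)
            for k in nbrs[i]:
                if k != j:
                    incoming = incoming * msg[(k, i)]
            pot = pair[(i, j)] if (i, j) in pair else pair[(j, i)].T
            out = pot.T @ incoming          # sum out x_i against the pairwise factor
            new[(i, j)] = out / out.sum()   # normalize for numerical stability
        msg = new                           # parallel ("flooding") update
    beliefs = {}
    for i in nbrs:
        b = unary[i].astype(float)
        for k in nbrs[i]:
            b = b * msg[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs

On a graph with cycles the messages are not guaranteed to converge exactly, but a fixed number of iterations usually yields good approximate beliefs — which is how the parser's per-link beliefs are used.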
Local factors in a graphical model
• First, a familiar example: a Conditional Random Field (CRF) for POS tagging.
• The observed input sentence ("… find preferred tags …") is shaded; above it sits one tag variable per word. A possible tagging is an assignment to those remaining variables, e.g. v v v — or another possible tagging, v a n.
• A "binary" factor measures the compatibility of two adjacent tags; the model reuses the same parameters at each position.
• A "unary" factor evaluates the tag at one position. Its values depend on the corresponding word (e.g., it can rule out adj for a word that can't be an adjective) and could be made to depend on the entire observed sentence. There is a different unary factor at each position.
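Because these unary and binary factors define an unnormalized product of potentials over taggings, the familiar forward algorithm computes the normalizer Z for this linear-chain case. The sketch below uses illustrative shapes (K possible tags); the paper's belief propagation generalizes exactly this computation to loopier graphs that also contain parsing variables.

import numpy as np

def chain_crf_z(unary, binary):
    """Z for a linear-chain CRF: the sum over all taggings of the product of factors.
    unary[t] : length-K array; unary[t][y] = potential of tag y at position t
               (a different unary factor at each position, driven by the word there)
    binary   : K x K array; binary[y, y2] = compatibility of adjacent tags y and y2
               (the same binary factor reused at every position, as on the slides)"""
    alpha = np.array(unary[0], dtype=float)
    for t in range(1, len(unary)):
        # alpha'[y2] = sum_y alpha[y] * binary[y, y2] * unary[t][y2]
        alpha = (alpha @ binary) * unary[t]
    return float(alpha.sum())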