Probabilistic Models of Nonprojective Dependency Trees

Noah A. Smith
Language Technologies Institute
Machine Learning Dept.
School of Computer Science
Carnegie Mellon University

David A. Smith
Center for Language and Speech Processing
Computer Science Dept.
Johns Hopkins University

EMNLP-CoNLL 2007
See Also

• On the Complexity of Non-Projective Data-Driven Dependency Parsing. R. McDonald and G. Satta. IWPT 2007.
• Structured Prediction Models via the Matrix-Tree Theorem. T. Koo, A. Globerson, X. Carreras, and M. Collins. EMNLP-CoNLL 2007. Coming up next!
Nonprojective Syntax

ROOT I 'll give a talk tomorrow on bootstrapping
(The arc from talk to on bootstrapping crosses the arc to tomorrow: the tree is nonprojective.)

ROOT ista meam norit gloria canitiem
     that.NOM my.ACC may-know glory.NOM going-gray.ACC
"That glory shall last till I go gray."
(ista modifies gloria and meam modifies canitiem, so the arcs cross here too.)

How would we parse this?
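To make the crossing-arc criterion concrete, here is a minimal sketch; the head indices are my own assumption for illustration, not taken from the slides. An arc (h, d) is projective iff every word strictly between h and d descends from h.

```python
# Sketch: checking projectivity of a dependency tree given as head indices.
# Hypothetical heads for: I 'll give a talk tomorrow on bootstrapping
#                         1  2   3   4  5      6     7  8
heads = {1: 2, 2: 0, 3: 2, 4: 5, 5: 3, 6: 3, 7: 5, 8: 7}  # 0 = ROOT

def descends_from(d, h, heads):
    """True iff word d lies in the subtree rooted at h."""
    while d != 0:
        if d == h:
            return True
        d = heads[d]
    return h == 0

def nonprojective_arcs(heads):
    """Collect arcs (h, d) with some word between h and d not descended from h."""
    bad = []
    for d, h in heads.items():
        lo, hi = min(h, d), max(h, d)
        if any(not descends_from(w, h, heads) for w in range(lo + 1, hi)):
            bad.append((h, d))
    return bad

print(nonprojective_arcs(heads))  # [(5, 7)]: "talk -> on" crosses "tomorrow"
```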
Edge-Factored Models (McDonald et al., 2005)

• Score edges in isolation: one non-negative score s(i, j) for each parent i and child j (unlabeled for now)
• A tree's score is the sum of its edge scores (equivalently, the product of the non-negative multiplicative scores)
• Decoding: find the maximum spanning tree among legal trees with Chu-Liu-Edmonds
• NP-hard to add sibling or degree constraints, or hidden node variables

What about training?
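What decoding looks like in practice, as a minimal sketch rather than the authors' code: networkx ships an implementation of Chu-Liu-Edmonds (maximum_spanning_arborescence), and the random score matrix below is a stand-in for real model scores.

```python
# Sketch: edge-factored decoding with Chu-Liu-Edmonds via networkx.
# Random scores stand in for real model scores; node 0 is ROOT.
import numpy as np
import networkx as nx

n = 5
rng = np.random.default_rng(0)
s = rng.random((n + 1, n + 1))           # s[i, j]: score of parent i -> child j

G = nx.DiGraph()
for i in range(n + 1):                   # any node may be a parent
    for j in range(1, n + 1):            # but ROOT is never a child
        if i != j:
            G.add_edge(i, j, weight=np.log(s[i, j]))  # maximize the log-score sum

tree = nx.maximum_spanning_arborescence(G)  # Chu-Liu-Edmonds
print(sorted(tree.edges()))              # the highest-scoring dependency tree
```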
If Only It Were Projective…

ROOT I 'll give a talk on bootstrapping tomorrow

An inside-outside algorithm would give us:
• The normalizing constant for globally normalized models
• Posterior probabilities of edges
• Sums over hidden variables

But we can't use inside-outside for nonprojective parsing!
Graph Theory to the Rescue!

Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (a.k.a. Laplacian) matrix of a directed graph G, with row and column r struck out, equals the sum of the scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need, in O(n³) time!
Building the Kirchhoff (Laplacian) Matrix

• Negate the edge scores
• Put column sums on the diagonal (each column collects the scores of all edges into one child)
• Strike out the root's row and column
• Take the determinant

N.B.: This allows multiple children of the root, but see Koo et al. (2007).
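A minimal numpy sketch of the recipe above, assuming rows index parents, columns index children, and node 0 is the root; random scores stand in for exponentiated model scores.

```python
# Sketch: the partition function Z via the Matrix-Tree Theorem.
# Rows index parents, columns index children; node 0 is ROOT.
import numpy as np

def kirchhoff(s):
    """Kirchhoff (Laplacian) matrix with ROOT's row and column struck out."""
    n = s.shape[0] - 1                   # number of words
    s = s.copy()
    np.fill_diagonal(s, 0.0)             # no self-loops
    K = -s[1:, 1:]                       # negated word-to-word edge scores
    K[np.diag_indices(n)] = s[:, 1:].sum(axis=0)  # column sums over all parents
    return K

rng = np.random.default_rng(0)
s = rng.random((6, 6))                   # ROOT + 5 words
Z = np.linalg.det(kirchhoff(s))          # sum of scores of all spanning trees
print(Z)
```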
Why Should This Work?

• Clear for the 1×1 matrix; proceed by induction
• Analogy to Chu-Liu-Edmonds: every node selects its best parent; if there are cycles, contract them and recur
• The classical theorem covers the undirected case; the directed case needs special handling at the root
When You Have a Hammer…

The Matrix-Tree Theorem gives us:
• Sequence-normalized log-linear models (Lafferty et al., 2001)
• Minimum Bayes-risk parsing (cf. Goodman, 1996)
• Hidden-variable models
• O(n) inference with length constraints (cf. N. Smith & Eisner, 2005)
• Minimum-risk training (D. Smith & Eisner, 2006)
• Tree (Rényi) entropy (Hwa, 2001; Smith & Eisner, 2007)
Analogy to Other Models
More Machinery: The Gradient

• Invert the Kirchhoff matrix K in O(n³) time via LU factorization
• Since ∂ log Z / ∂ s(i, j) = (K⁻¹)_{jj} − (K⁻¹)_{ji}, and the marginal p((i, j) ∈ t) = s(i, j) · ∂ log Z / ∂ s(i, j), the edge gradient is also the edge posterior probability
• Use the chain rule to backpropagate into s(i, j), whatever its internal structure may be
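A sketch of the posterior computation under the same conventions as the Kirchhoff sketch above (node 0 is the root; K is 0-indexed after striking the root's row and column):

```python
# Sketch: edge posteriors from the inverse Kirchhoff matrix:
#   p(i -> j) = s[i, j] * ((K^-1)[j, j] - (K^-1)[j, i])  for word parents i,
#   p(0 -> j) = s[0, j] *  (K^-1)[j, j]                  for ROOT,
# with 0-based indices into K after striking ROOT's row and column.
import numpy as np

def edge_posteriors(s):
    n = s.shape[0] - 1
    s = s.copy()
    np.fill_diagonal(s, 0.0)
    K = -s[1:, 1:]
    K[np.diag_indices(n)] = s[:, 1:].sum(axis=0)
    Kinv = np.linalg.inv(K)              # O(n^3), via LU factorization
    P = np.zeros_like(s)
    for j in range(1, n + 1):
        P[0, j] = s[0, j] * Kinv[j - 1, j - 1]
        for i in range(1, n + 1):
            if i != j:
                P[i, j] = s[i, j] * (Kinv[j - 1, j - 1] - Kinv[j - 1, i - 1])
    return P

rng = np.random.default_rng(0)
P = edge_posteriors(rng.random((6, 6)))
print(P.sum(axis=0)[1:])                 # every word has one parent: all 1.0
```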
Nonprojective Conditional Log-Linear Training

• Data: CoNLL 2006 Danish and Dutch; CoNLL 2007 Arabic and Czech
• Features from McDonald et al. (2005)
• Compared with MSTParser's MIRA max-margin training
• Trained the log-linear weights with stochastic gradient descent
• Same number of iterations and stopping criteria as MIRA
• Significance assessed with a paired permutation test
Minimum Bayes-Risk Parsing

Select the tree with not the highest probability but the largest expected number of correct edges: plug the edge posteriors into MST decoding.

MIRA doesn't estimate probabilities (though N.B.: one could do minimum Bayes-risk decoding inside MIRA).
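Putting the pieces together, a minimal sketch of minimum Bayes-risk decoding: compute edge posteriors as above, then hand them to Chu-Liu-Edmonds as edge weights.

```python
# Sketch: minimum Bayes-risk parsing = MST decoding over edge posteriors.
import numpy as np
import networkx as nx

def edge_posteriors(s):                  # as in the gradient sketch above
    n = s.shape[0] - 1
    s = s.copy()
    np.fill_diagonal(s, 0.0)
    K = -s[1:, 1:]
    K[np.diag_indices(n)] = s[:, 1:].sum(axis=0)
    Kinv = np.linalg.inv(K)
    P = np.zeros_like(s)
    for j in range(1, n + 1):
        P[0, j] = s[0, j] * Kinv[j - 1, j - 1]
        for i in range(1, n + 1):
            if i != j:
                P[i, j] = s[i, j] * (Kinv[j - 1, j - 1] - Kinv[j - 1, i - 1])
    return P

def mbr_parse(s):
    """Maximize the expected number of correct edges, not the tree probability."""
    P = edge_posteriors(s)
    G = nx.DiGraph()
    for i in range(s.shape[0]):
        for j in range(1, s.shape[0]):
            if i != j:
                G.add_edge(i, j, weight=P[i, j])
    return sorted(nx.maximum_spanning_arborescence(G).edges())

print(mbr_parse(np.random.default_rng(0).random((6, 6))))
```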
Edge Clustering

(Supervised) labeled dependency parsing: arcs carry labels, e.g. Franz ←SUBJ– loves –OBJ→ Milena.

OR: give each arc a latent cluster (A/B/C on one edge, X/Y/Z on the other) and sum out all possible edge labelings if we don't care about labels per se.

Simple idea: conjoin each model feature with a cluster.
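A minimal sketch of the summing-out step, with a 3-D score array whose layout is my own choice for illustration:

```python
# Sketch: with labeled scores s[i, j, y] (parent, child, label/cluster),
# summing over the label axis gives the unlabeled edge scores; Z and the
# posteriors are then computed from the Kirchhoff matrix exactly as before.
import numpy as np

rng = np.random.default_rng(0)
s_labeled = rng.random((6, 6, 3))        # ROOT + 5 words, 3 clusters
s = s_labeled.sum(axis=2)                # sum out all possible edge labelings
```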
Edge Clustering

No significant gains or losses from clustering.
What's Wrong with Edge Clustering?

• Edge labels don't interact with one another: the clusters A and B on Franz ← loves → Milena are scored independently
• Unlike clusters on PCFG nonterminals (e.g., Matsuzaki et al., 2005), which interact within a rewrite rule such as NP-A → NP-B NP-A
• Cf. the small or no gains in unlabeled accuracy from supervised labeled parsers
Constraints on Link Length

• Maximum left/right child distances L and R (cf. Eisner & N. Smith, 2005)
• The Kirchhoff matrix is band-diagonal once the root row and column are removed
• Inversion in O(min(L³R², L²R³) · n) time

Example with L = 1, R = 2 (see the sketch below).
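A sketch of the structural claim only (the fast banded solver itself is not shown): zeroing scores for arcs that reach more than L words to the left or R to the right confines the struck Kirchhoff matrix to a band, since root arcs survive only on the diagonal.

```python
# Sketch: length-limited arcs give a band-diagonal Kirchhoff matrix.
# Only the banded structure is shown; a banded LU factorization would
# then take the determinant or invert in time linear in n.
import numpy as np

def banded_kirchhoff(s, L, R):
    n = s.shape[0] - 1
    s = s.copy()
    np.fill_diagonal(s, 0.0)
    for i in range(1, n + 1):            # word parents; ROOT arcs only hit
        for j in range(1, n + 1):        # the diagonal, so they stay free
            if j - i > R or i - j > L:   # child too far right / too far left
                s[i, j] = 0.0
    K = -s[1:, 1:]
    K[np.diag_indices(n)] = s[:, 1:].sum(axis=0)
    return K

K = banded_kirchhoff(np.random.default_rng(0).random((8, 8)), L=1, R=2)
i, j = np.nonzero(K)
print((j - i).min(), (j - i).max())      # -1 2: off-diagonals confined to a band
```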
Conclusions

• O(n³) inference for edge-factored nonprojective dependency models
• Performance closely comparable to MIRA
• Learned edge clustering doesn't seem to help unlabeled parsing
• Many other applications to hit
Thanks

Jason Eisner, Keith Hall, Sanjeev Khudanpur, and the anonymous reviewers.
Ryan McDonald, and Michael Collins and colleagues, for sharing drafts.