Probabilistic Models of Nonprojective Dependency Trees

Noah A. Smith
Language Technologies Institute
Machine Learning Dept.
School of Computer Science
Carnegie Mellon University

David A. Smith
Center for Language and Speech Processing
Computer Science Dept.
Johns Hopkins University

EMNLP-CoNLL 2007
See Also

• On the Complexity of Non-Projective Data-Driven Dependency Parsing. R. McDonald and G. Satta. IWPT 2007.
• Structured Prediction Models via the Matrix-Tree Theorem. T. Koo, A. Globerson, X. Carreras, and M. Collins. EMNLP-CoNLL 2007. Coming up next!
Nonprojective Syntax

ROOT I 'll give a talk tomorrow on bootstrapping
(The arc from talk to on bootstrapping crosses the arc to tomorrow: the tree is nonprojective.)

ROOT ista meam norit gloria canitiem
     that.NOM my.ACC may-know glory.NOM going-gray.ACC
"That glory shall last till I go gray."
(ista modifies gloria and meam modifies canitiem, so the arcs cross here too.)

How would we parse this?
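To make the crossing-arc criterion concrete, here is a minimal sketch; the head indices are my own assumption for illustration, not taken from the slides. An arc (h, d) is projective iff every word strictly between h and d descends from h.

```python
# Sketch: checking projectivity of a dependency tree given as head indices.
# Hypothetical heads for: I 'll give a talk tomorrow on bootstrapping
#                         1  2   3   4  5      6     7  8
heads = {1: 2, 2: 0, 3: 2, 4: 5, 5: 3, 6: 3, 7: 5, 8: 7}  # 0 = ROOT

def descends_from(d, h, heads):
    """True iff word d lies in the subtree rooted at h."""
    while d != 0:
        if d == h:
            return True
        d = heads[d]
    return h == 0

def nonprojective_arcs(heads):
    """Collect arcs (h, d) with some word between h and d not descended from h."""
    bad = []
    for d, h in heads.items():
        lo, hi = min(h, d), max(h, d)
        if any(not descends_from(w, h, heads) for w in range(lo + 1, hi)):
            bad.append((h, d))
    return bad

print(nonprojective_arcs(heads))  # [(5, 7)]: "talk -> on" crosses "tomorrow"
```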
Edge-Factored Models (McDonald et al., 2005)

• Score edges in isolation: one non-negative score s(i, j) for each parent i and child j (unlabeled for now)
• A tree's score is the sum of its edge scores (equivalently, the product of the non-negative multiplicative scores)
• Decoding: find the maximum spanning tree among legal trees with Chu-Liu-Edmonds
• NP-hard to add sibling or degree constraints, or hidden node variables

What about training?
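What decoding looks like in practice, as a minimal sketch rather than the authors' code: networkx ships an implementation of Chu-Liu-Edmonds (maximum_spanning_arborescence), and the random score matrix below is a stand-in for real model scores.

```python
# Sketch: edge-factored decoding with Chu-Liu-Edmonds via networkx.
# Random scores stand in for real model scores; node 0 is ROOT.
import numpy as np
import networkx as nx

n = 5
rng = np.random.default_rng(0)
s = rng.random((n + 1, n + 1))           # s[i, j]: score of parent i -> child j

G = nx.DiGraph()
for i in range(n + 1):                   # any node may be a parent
    for j in range(1, n + 1):            # but ROOT is never a child
        if i != j:
            G.add_edge(i, j, weight=np.log(s[i, j]))  # maximize the log-score sum

tree = nx.maximum_spanning_arborescence(G)  # Chu-Liu-Edmonds
print(sorted(tree.edges()))              # the highest-scoring dependency tree
```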
If Only It Were Projective…

ROOT I 'll give a talk on bootstrapping tomorrow

An inside-outside algorithm would give us:
• The normalizing constant for globally normalized models
• Posterior probabilities of edges
• Sums over hidden variables

But we can't use inside-outside for nonprojective parsing!
Graph Theory to the Rescue!

Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (a.k.a. Laplacian) matrix of a directed graph G, with row and column r struck out, equals the sum of the scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need, in O(n³) time!
Building the Kirchhoff (Laplacian) Matrix

• Negate the edge scores
• Put column sums on the diagonal (each column collects the scores of all edges into one child)
• Strike out the root's row and column
• Take the determinant

N.B.: This allows multiple children of the root, but see Koo et al. (2007).
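A minimal numpy sketch of the recipe above, assuming rows index parents, columns index children, and node 0 is the root; random scores stand in for exponentiated model scores.

```python
# Sketch: the partition function Z via the Matrix-Tree Theorem.
# Rows index parents, columns index children; node 0 is ROOT.
import numpy as np

def kirchhoff(s):
    """Kirchhoff (Laplacian) matrix with ROOT's row and column struck out."""
    n = s.shape[0] - 1                   # number of words
    s = s.copy()
    np.fill_diagonal(s, 0.0)             # no self-loops
    K = -s[1:, 1:]                       # negated word-to-word edge scores
    K[np.diag_indices(n)] = s[:, 1:].sum(axis=0)  # column sums over all parents
    return K

rng = np.random.default_rng(0)
s = rng.random((6, 6))                   # ROOT + 5 words
Z = np.linalg.det(kirchhoff(s))          # sum of scores of all spanning trees
print(Z)
```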
Why Should This Work?

• Clear for the 1×1 matrix; proceed by induction
• Analogy to Chu-Liu-Edmonds: every node selects its best parent; if there are cycles, contract them and recur
• The classical theorem covers the undirected case; the directed case needs special handling at the root
When You Have a Hammer…

The Matrix-Tree Theorem gives us:
• Sequence-normalized log-linear models (Lafferty et al., 2001)
• Minimum Bayes-risk parsing (cf. Goodman, 1996)
• Hidden-variable models
• O(n) inference with length constraints (cf. N. Smith & Eisner, 2005)
• Minimum-risk training (D. Smith & Eisner, 2006)
• Tree (Rényi) entropy (Hwa, 2001; Smith & Eisner, 2007)
Analogy to Other Models
More Machinery: The Gradient

• Invert the Kirchhoff matrix K in O(n³) time via LU factorization
• Since ∂ log Z / ∂ s(i, j) = (K⁻¹)_{jj} − (K⁻¹)_{ji}, and the marginal p((i, j) ∈ t) = s(i, j) · ∂ log Z / ∂ s(i, j), the edge gradient is also the edge posterior probability
• Use the chain rule to backpropagate into s(i, j), whatever its internal structure may be
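A sketch of the posterior computation under the same conventions as the Kirchhoff sketch above (node 0 is the root; K is 0-indexed after striking the root's row and column):

```python
# Sketch: edge posteriors from the inverse Kirchhoff matrix:
#   p(i -> j) = s[i, j] * ((K^-1)[j, j] - (K^-1)[j, i])  for word parents i,
#   p(0 -> j) = s[0, j] *  (K^-1)[j, j]                  for ROOT,
# with 0-based indices into K after striking ROOT's row and column.
import numpy as np

def edge_posteriors(s):
    n = s.shape[0] - 1
    s = s.copy()
    np.fill_diagonal(s, 0.0)
    K = -s[1:, 1:]
    K[np.diag_indices(n)] = s[:, 1:].sum(axis=0)
    Kinv = np.linalg.inv(K)              # O(n^3), via LU factorization
    P = np.zeros_like(s)
    for j in range(1, n + 1):
        P[0, j] = s[0, j] * Kinv[j - 1, j - 1]
        for i in range(1, n + 1):
            if i != j:
                P[i, j] = s[i, j] * (Kinv[j - 1, j - 1] - Kinv[j - 1, i - 1])
    return P

rng = np.random.default_rng(0)
P = edge_posteriors(rng.random((6, 6)))
print(P.sum(axis=0)[1:])                 # every word has one parent: all 1.0
```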
Nonprojective Conditional Log-Linear Training

• Data: CoNLL 2006 Danish and Dutch; CoNLL 2007 Arabic and Czech
• Features from McDonald et al. (2005)
• Compared with MSTParser's MIRA max-margin training
• Trained the log-linear weights with stochastic gradient descent
• Same number of iterations and stopping criteria as MIRA
• Significance assessed with a paired permutation test
Minimum Bayes-Risk Parsing

Select the tree with not the highest probability but the largest expected number of correct edges: plug the edge posteriors into MST decoding.

MIRA doesn't estimate probabilities (though N.B.: one could do minimum Bayes-risk decoding inside MIRA).
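Putting the pieces together, a minimal sketch of minimum Bayes-risk decoding: compute edge posteriors as above, then hand them to Chu-Liu-Edmonds as edge weights.

```python
# Sketch: minimum Bayes-risk parsing = MST decoding over edge posteriors.
import numpy as np
import networkx as nx

def edge_posteriors(s):                  # as in the gradient sketch above
    n = s.shape[0] - 1
    s = s.copy()
    np.fill_diagonal(s, 0.0)
    K = -s[1:, 1:]
    K[np.diag_indices(n)] = s[:, 1:].sum(axis=0)
    Kinv = np.linalg.inv(K)
    P = np.zeros_like(s)
    for j in range(1, n + 1):
        P[0, j] = s[0, j] * Kinv[j - 1, j - 1]
        for i in range(1, n + 1):
            if i != j:
                P[i, j] = s[i, j] * (Kinv[j - 1, j - 1] - Kinv[j - 1, i - 1])
    return P

def mbr_parse(s):
    """Maximize the expected number of correct edges, not the tree probability."""
    P = edge_posteriors(s)
    G = nx.DiGraph()
    for i in range(s.shape[0]):
        for j in range(1, s.shape[0]):
            if i != j:
                G.add_edge(i, j, weight=P[i, j])
    return sorted(nx.maximum_spanning_arborescence(G).edges())

print(mbr_parse(np.random.default_rng(0).random((6, 6))))
```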
Edge Clustering

(Supervised) labeled dependency parsing: arcs carry labels, e.g. Franz ←SUBJ– loves –OBJ→ Milena.

OR: give each arc a latent cluster (A/B/C on one edge, X/Y/Z on the other) and sum out all possible edge labelings if we don't care about labels per se.

Simple idea: conjoin each model feature with a cluster.
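A minimal sketch of the summing-out step, with a 3-D score array whose layout is my own choice for illustration:

```python
# Sketch: with labeled scores s[i, j, y] (parent, child, label/cluster),
# summing over the label axis gives the unlabeled edge scores; Z and the
# posteriors are then computed from the Kirchhoff matrix exactly as before.
import numpy as np

rng = np.random.default_rng(0)
s_labeled = rng.random((6, 6, 3))        # ROOT + 5 words, 3 clusters
s = s_labeled.sum(axis=2)                # sum out all possible edge labelings
```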
Edge Clustering

No significant gains or losses from clustering.
What's Wrong with Edge Clustering?

• Edge labels don't interact with one another: the clusters A and B on Franz ← loves → Milena are scored independently
• Unlike clusters on PCFG nonterminals (e.g., Matsuzaki et al., 2005), which interact within a rewrite rule such as NP-A → NP-B NP-A
• Cf. the small or no gains in unlabeled accuracy from supervised labeled parsers
Constraints on Link Length

• Maximum left/right child distances L and R (cf. Eisner & N. Smith, 2005)
• The Kirchhoff matrix is band-diagonal once the root row and column are removed
• Inversion in O(min(L³R², L²R³) · n) time

Example with L = 1, R = 2 (see the sketch below).
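A sketch of the structural claim only (the fast banded solver itself is not shown): zeroing scores for arcs that reach more than L words to the left or R to the right confines the struck Kirchhoff matrix to a band, since root arcs survive only on the diagonal.

```python
# Sketch: length-limited arcs give a band-diagonal Kirchhoff matrix.
# Only the banded structure is shown; a banded LU factorization would
# then take the determinant or invert in time linear in n.
import numpy as np

def banded_kirchhoff(s, L, R):
    n = s.shape[0] - 1
    s = s.copy()
    np.fill_diagonal(s, 0.0)
    for i in range(1, n + 1):            # word parents; ROOT arcs only hit
        for j in range(1, n + 1):        # the diagonal, so they stay free
            if j - i > R or i - j > L:   # child too far right / too far left
                s[i, j] = 0.0
    K = -s[1:, 1:]
    K[np.diag_indices(n)] = s[:, 1:].sum(axis=0)
    return K

K = banded_kirchhoff(np.random.default_rng(0).random((8, 8)), L=1, R=2)
i, j = np.nonzero(K)
print((j - i).min(), (j - i).max())      # -1 2: off-diagonals confined to a band
```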
Conclusions

• O(n³) inference for edge-factored nonprojective dependency models
• Performance closely comparable to MIRA
• Learned edge clustering doesn't seem to help unlabeled parsing
• Many other applications to hit
Thanks

Jason Eisner, Keith Hall, Sanjeev Khudanpur, and the anonymous reviewers.
Ryan McDonald, and Michael Collins and colleagues, for sharing drafts.