This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License. CS 479, section 1: Natural Language Processing. Lecture #31: Dependency Parsing. Thanks to Joakim Nivre and Sandra Kuebler for many of the materials used in this lecture, with additions by Dan Roth.
Announcements • Final Project • Three options: • Propose your own project (possibly as a team) • Project #4 • Project #5 • For Projects #4 and #5, no proposal is needed – just decide • Proposals • Early: today • Due: Friday • Note: you must discuss your project with me before submitting a written proposal
Objectives • Become acquainted with dependency parsing, in contrast to constituent parsing • See the relationship between the two approaches • Understand an algorithm for non-projective dependency parsing • Have a starting point to understand the rest of the dependency parsing literature • Think about uses of dependency parsing
Big Ideas from McDonald et al., 2006? • Dependency parsing • Non-projective vs. projective parse trees • Generalization to other languages • Labeled vs. unlabeled dependencies • Problem: Maximum Spanning Tree • Algorithm: Chu-Liu-Edmonds • Edge scores • Machine Learning: MIRA • Large Margin Learners • Online vs. Batch learning • Feature engineering
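To make the Maximum Spanning Tree view concrete, here is a minimal sketch with made-up edge scores for a tiny sentence. It brute-forces over all head assignments to show the objective; it is not Chu-Liu-Edmonds, which solves the same problem efficiently.

```python
# Brute-force illustration of the MST formulation (hypothetical scores, tiny sentence).
# NOT Chu-Liu-Edmonds; we simply enumerate head assignments to show the objective.
from itertools import product

words = ["ROOT", "John", "saw", "Mary"]          # position 0 is the artificial root
score = {                                        # made-up edge scores s(head, dependent)
    (0, 2): 10, (2, 1): 9, (2, 3): 9,
    (0, 1): 3, (0, 3): 3, (1, 2): 2, (3, 2): 2, (1, 3): 1, (3, 1): 1,
}

def is_tree(heads):
    """heads[d] = head of word d; valid iff every word's head chain reaches ROOT (0)."""
    for d in range(1, len(words)):
        seen, h = {d}, heads[d]
        while h != 0:
            if h in seen:                        # cycle -> not a tree
                return False
            seen.add(h)
            h = heads[h]
    return True

best_score, best_heads = float("-inf"), None
for assignment in product(range(len(words)), repeat=len(words) - 1):
    heads = (0,) + assignment                    # pad index 0 so heads[d] works for d >= 1
    if any(heads[d] == d for d in range(1, len(words))):
        continue                                 # no self-loops
    if not is_tree(heads):
        continue
    total = sum(score[(heads[d], d)] for d in range(1, len(words)))
    if total > best_score:
        best_score, best_heads = total, heads

print(best_heads[1:], best_score)                # (2, 0, 2) 28: ROOT -> saw -> {John, Mary}
```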
Outline • Dependency Parsing: • Formalism • Dependency Parsing algorithms • Semantic Role Labeling • Dependency Formalism
Formalization by Lucien Tesniere [Tesniere, 1959] • The idea was known long before (e.g., Panini, India, more than 2000 years ago) • Studied extensively in the Prague School approach to syntax • (In the US, research focused more on the constituent formalism)
Constituent vs. Dependency • Dependency structures have advantages: • better suited to free (or semi-free) word-order languages • easier to convert to predicate-argument structure • ... • But there are drawbacks too... • You can try to convert one representation into the other • but, in general, these formalisms are not equivalent
Dependency structures for NLP tasks • Most approaches have focused on constituent-tree-based features • But dependency parsing is now in the spotlight: • Machine Translation (e.g., Menezes & Quirk, 07) • Summarization and sentence compression (e.g., Filippova & Strube, 08) • Opinion mining (e.g., Lerman et al., 08) • Information extraction, Question Answering (e.g., Bouma et al., 06)
All of these conditions will be violated by the semantic dependency graphs we will consider later.
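For concreteness, here is a minimal sketch checking the conditions usually imposed on syntactic dependency trees (one head per word, no cycles, every word ultimately attached to the artificial root); the exact list of conditions on the original slide may differ.

```python
# Minimal well-formedness check, assuming the usual conditions on syntactic
# dependency trees: one head per word (enforced by the heads-array encoding),
# no cycles, and every word ultimately attached to the artificial root.
def is_well_formed(heads):
    """heads[i] is the head of word i+1 (1-based words); 0 denotes the artificial root."""
    for d in range(1, len(heads) + 1):
        visited, h = {d}, heads[d - 1]
        while h != 0:
            if h in visited:                     # cycle -> violates the tree condition
                return False
            visited.add(h)
            h = heads[h - 1]
    return True                                  # every word reaches the root

print(is_well_formed([2, 0, 2]))                 # "John saw Mary": True
print(is_well_formed([2, 3, 1]))                 # head chain 1 -> 2 -> 3 -> 1 is a cycle: False
```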
Algorithms • We will consider: • graph-based approaches (global inference) • transition-based approaches • We will not consider: • rule-based systems • constraint satisfaction
Converting to the Constituent Formalism • Idea: • convert dependency structures to constituent structures • easy for projective dependency structures (see the sketch below) • then apply algorithms for constituent parsing to them • e.g., CKY / probabilistic CKY
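A minimal sketch of the conversion for projective trees, where each head projects a constituent over itself and its dependents. The "X" labels and this particular scheme are illustrative placeholders, not a specific published conversion.

```python
# Convert a projective dependency tree into a nested bracketing:
# each head projects a constituent over itself and its dependents.
def to_constituents(words, heads):
    """words: tokens; heads[i] = head of word i+1 (1-based), 0 = artificial root."""
    deps = {i: [] for i in range(len(words) + 1)}
    for d, h in enumerate(heads, start=1):
        deps[h].append(d)

    def project(i):
        # For a projective tree, word i and its (recursive) dependents form a
        # contiguous span, so this yields a well-nested constituent.
        if not deps[i]:
            return words[i - 1]
        children = sorted(deps[i] + [i])
        return "(X " + " ".join(project(c) if c != i else words[i - 1]
                                for c in children) + ")"

    return project(deps[0][0])                   # start from the root's single child

print(to_constituents(["the", "dog", "barked"], [2, 3, 0]))
# -> (X (X the dog) barked)
```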
Converting to the Constituent Formalism • Different independence assumptions lead to different statistical models • both accuracy and parsing time (dynamic programming) vary
Features f(i, j) can depend on any words in the sentence, i.e., f(i, j, sent) • But the score still decomposes over the edges of the graph • a strong independence assumption (see the sketch below)
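A minimal sketch of edge-factored scoring with hypothetical binary features and a sparse weight vector: each edge's features may look at the whole sentence, but the tree score is still just a sum of per-edge scores.

```python
# Edge-factored scoring: each edge gets w . f(h, d, sent); the tree score is their sum.
from collections import defaultdict

def edge_features(h, d, sent):
    """Features for edge head h -> dependent d; sent[0] is the artificial "ROOT".
    They may look at any word in the sentence, but they belong to a single edge."""
    return {
        f"head={sent[h]}": 1.0,
        f"dep={sent[d]}": 1.0,
        f"pair={sent[h]}_{sent[d]}": 1.0,
        f"dist={h - d}": 1.0,                    # position information from the whole sentence
    }

def edge_score(h, d, sent, w):
    return sum(w[name] * value for name, value in edge_features(h, d, sent).items())

def tree_score(heads, sent, w):
    """heads[d] = head of word d (1-based); the score decomposes over edges only."""
    return sum(edge_score(heads[d], d, sent, w) for d in range(1, len(sent)))

w = defaultdict(float, {"pair=saw_John": 2.0, "pair=saw_Mary": 2.0, "head=ROOT": 1.0})
sent = ["ROOT", "John", "saw", "Mary"]
print(tree_score([0, 2, 0, 2], sent, w))         # 2.0 + 1.0 + 2.0 = 5.0
```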
Online Learning: Structured Perceptron • Joint feature representation: features are defined over edges only (we will talk about this more later) • Algorithm: repeatedly decode with the current weights (this is where we run the MST or Eisner's algorithm) and update the weights on mistakes, as sketched below
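A minimal sketch of the structured-perceptron update, reusing edge_features from the sketch above. decode is a placeholder for the argmax step, which is where the MST or Eisner's algorithm would actually be run.

```python
# Structured perceptron for dependency parsing (sketch).
from collections import defaultdict

def tree_features(heads, sent):
    """Joint feature vector of a full tree: the sum of its edge feature vectors."""
    phi = defaultdict(float)
    for d in range(1, len(sent)):
        for name, value in edge_features(heads[d], d, sent).items():
            phi[name] += value
    return phi

def perceptron_train(data, decode, epochs=5):
    """data: list of (sent, gold_heads); decode(sent, w) returns the best tree under w."""
    w = defaultdict(float)
    for _ in range(epochs):
        for sent, gold in data:
            pred = decode(sent, w)               # run MST / Eisner's algorithm here
            if pred != gold:                     # update only on mistakes
                gold_phi = tree_features(gold, sent)
                pred_phi = tree_features(pred, sent)
                for name in set(gold_phi) | set(pred_phi):
                    w[name] += gold_phi[name] - pred_phi[name]
    return w
```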
Parsing Algorithms • Here, when we say parsing algorithm (= derivation order), we often mean a mapping: • given a tree, map it to a sequence of actions that creates this tree • The tree T is equivalent to this sequence of actions d1, ..., dn • Therefore P(T) = P(d1, ..., dn) = P(d1) P(d2 | d1) ... P(dn | dn-1, ..., d1) • Ambiguous: sometimes "parsing algorithm" refers to the decoding algorithm that finds the most likely sequence • you can use classifiers to score actions and search for the most likely sequence (an example mapping is sketched below)
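A minimal sketch of one such mapping: a static oracle in the style of arc-standard transition parsing, which turns a projective tree into one action sequence d1, ..., dn that builds it. This is one possible transition system, used purely as an illustration of the idea on the slide.

```python
# Static oracle (arc-standard style): given a projective tree, emit actions that build it.
def tree_to_actions(heads):
    """heads[d] = head of word d (1-based, 0 = artificial root). Assumes a projective tree."""
    n = len(heads)
    pending = [0] * (n + 1)                      # dependents each word still has to collect
    for h in heads:
        if h > 0:
            pending[h] += 1
    stack, queue, actions = [], list(range(1, n + 1)), []
    while queue or len(stack) > 1:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            if heads[below - 1] == top and pending[below] == 0:
                actions.append("LEFT-ARC")       # below is a finished left dependent of top
                stack.pop(-2); pending[top] -= 1
                continue
            if heads[top - 1] == below and pending[top] == 0:
                actions.append("RIGHT-ARC")      # top is a finished right dependent of below
                stack.pop(); pending[below] -= 1
                continue
        actions.append("SHIFT")                  # move the next word from the queue to the stack
        stack.append(queue.pop(0))
    return actions

print(tree_to_actions([2, 0, 2]))                # "John saw Mary"
# -> ['SHIFT', 'SHIFT', 'LEFT-ARC', 'SHIFT', 'RIGHT-ARC']
```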
Most algorithms are restricted to projective structures, but not all
How to learn in this case? • Your training examples are collections of parsing contexts paired with the correct actions • you want to predict the correct action in each context • How do we define a feature representation of a parsing context? • You can think of a context in terms of: • the partial tree built so far • the current contents of the queue (Q) and the stack (S) • The most important features are the top of S and the front of Q (only between them can you potentially create links) • Inference: • greedily • with beam search • (see the sketch below)
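A minimal sketch of the feature representation of a parsing context and of greedy inference. classifier and apply_action are placeholders for a trained multiclass classifier and for whatever transition system is plugged in; training pairs would be (config_features of a context, gold action).

```python
# Context features and greedy inference for a transition-based parser (sketch).
def config_features(stack, queue, sent, arcs):
    """Features of the current context; the most informative ones look at the
    top of the stack S and the front of the queue Q."""
    s0 = sent[stack[-1]] if stack else "<NONE>"
    q0 = sent[queue[0]] if queue else "<NONE>"
    q1 = sent[queue[1]] if len(queue) > 1 else "<NONE>"
    s0_deps = sum(1 for (h, d) in arcs if stack and h == stack[-1])  # partial-tree feature
    return {f"s0={s0}": 1.0, f"q0={q0}": 1.0, f"q1={q1}": 1.0,
            f"s0_q0={s0}_{q0}": 1.0, f"s0_num_deps={s0_deps}": 1.0}

def greedy_parse(sent, classifier, apply_action):
    """At each step pick the single highest-scoring action (no beam) and apply it."""
    stack, queue, arcs = [], list(range(len(sent))), []
    while queue or len(stack) > 1:
        feats = config_features(stack, queue, sent, arcs)
        action = classifier(feats)               # argmax over SHIFT / LEFT-ARC / RIGHT-ARC / ...
        stack, queue, arcs = apply_action(action, stack, queue, arcs)
    return arcs
```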
Results: Transition-Based vs. Graph-Based • CoNLL-2006 shared task, average over 12 languages (Labeled Attachment Score): • McDonald et al. (MST): 80.27 • Nivre et al. (transitions): 80.19 • The results are essentially the same • There is a lot of research in both directions, • e.g., latent-variable models for transition-based parsing (Titov and Henderson, 07) – the best single-model system at CoNLL-2007 (third overall)
Non-Projective Parsing • Graph-based algorithms (McDonald) • Post-processing of projective algorithms (Hall and Novak, 05) • Transition-based algorithms that handle non-projectivity (Attardi, 06; Titov et al., 08; Nivre et al., 08) • Pseudo-projective parsing: removing non-projective (crossing) links and encoding them in the labels (Nivre and Nilsson, 05) • (a projectivity check is sketched below)
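A minimal sketch of detecting non-projectivity, e.g. as a preprocessing step before pseudo-projective encoding. It uses the standard equivalence between projectivity and the absence of crossing arcs when the artificial root sits at position 0.

```python
# Non-projectivity check: a tree (root at position 0) is projective iff no two arcs cross.
def is_projective(heads):
    """heads[d] = head of word d (1-based, 0 = artificial root)."""
    spans = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (a, b) in enumerate(spans):
        for (c, e) in spans[i + 1:]:
            if a < c < b < e or c < a < e < b:   # strictly interleaved endpoints = crossing arcs
                return False
    return True

print(is_projective([2, 0, 2]))                  # "John saw Mary": True
print(is_projective([3, 4, 0, 3]))               # arc 4->2 crosses arc 3->1: False
```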
Next • Document Clustering • Unsupervised learning • Expectation Maximization (EM) • Machine Translation! • Word alignment • Phrase alignment • Semantics • Co-reference