January 7, 2008 Dependency Parsing: Machine Learning Approaches Yuji Matsumoto Graduate School of Information Science Nara Institute of Science and Technology (NAIST, Japan)
Basic Language Analyses (POS tagging, phrase chunking, parsing)
Raw sentence:
  He reckons the current account deficit will narrow to only 1.8 billion in September .
↓ Part-of-speech tagging
POS-tagged sentence:
  He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
↓ Base phrase chunking
Base-phrase-chunked sentence:
  [He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only 1.8 billion]NP [in]PP [September]NP .
↓ Dependency parsing
Dependency-parsed sentence (the dependency tree is drawn as a figure in the original slide)
Word Dependency Parsing (unlabeled)
Raw sentence:
  He reckons the current account deficit will narrow to only 1.8 billion in September .
↓ Part-of-speech tagging
POS-tagged sentence:
  He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
↓ Word dependency parsing
Word-dependency-parsed sentence (an unlabeled dependency tree over the words, drawn as a figure in the original slide)
Word Dependency Parsing (labeled)
Raw sentence:
  He reckons the current account deficit will narrow to only 1.8 billion in September .
↓ Part-of-speech tagging
POS-tagged sentence:
  He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
↓ Word dependency parsing
Word-dependency-parsed sentence (a dependency tree whose arcs carry labels such as SUBJ, MOD, COMP, SPEC, S-COMP, and ROOT, drawn as a figure in the original slide)
Dependency Structure: Terminology
Example arc: This --SUBJ-- is
• "This" is the child, also called the dependent or modifier.
• "is" is the parent, also called the governor or head.
• The arc label (here SUBJ) names the relation.
• The direction of arrows may also be drawn from head to child.
• When there is an arrow from w to v, we write w→v.
• When there is a path (a series of arrows) from w to v, we write w→*v.
Definition of Dependency Trees • Single head: Except for the root (EOS), all words have a single parent • Connected: It should be a connected tree • Acyclic: If wi→wj, then it will never be wj→*wi • Projective: If wi→wj, then for all k between i and j, either wk→*wi or wk→*wj holds (non-crossing between dependencies).
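These conditions can be checked mechanically. Below is a minimal Python sketch (not part of the original lecture) that verifies the single-head, acyclicity, and non-crossing conditions for a tree given as a head array; the function name and representation are illustrative.

```python
def is_projective_tree(heads):
    """heads[i] is the index of word i's parent, or None for the root.
    Returns True iff the structure is a single-rooted, acyclic,
    projective (non-crossing) dependency tree."""
    n = len(heads)
    if sum(h is None for h in heads) != 1:
        return False                      # exactly one root allowed
    # Acyclicity: following parent links from any word must not loop.
    for i in range(n):
        seen, j = set(), i
        while heads[j] is not None:
            if j in seen:
                return False
            seen.add(j)
            j = heads[j]
    # Projectivity: no two dependency arcs may cross.
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h is not None]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:             # properly overlapping spans cross
                return False
    return True
```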
Projective Dependency Tree
(example sentence and tree drawn as a figure in the original slide)
Projectiveness: every word under a dependency arc depends, directly or transitively, on one of the arc's endpoints; in the figure, all the words under the arc depend on either "was" or "." (e.g., light →* was).
Non-projective Dependency Tree
John/NNP saw/VBD a/DT man/NN yesterday/NN who/WP walked/VBD along/IN the/DT river/NN
(tree drawn as a figure in the original slide; the relative clause "who walked along the river" depends on "man" while "yesterday" depends on "saw", so the two arcs cross)
Direction of edges: from a child to the parent.
Non-projective Dependency Tree
(example figure taken from: R. McDonald and F. Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," European Chapter of the Association for Computational Linguistics, 2006)
Direction of edges: from a parent to its children.
Two Different Strategies for Structured Language Analysis • Sentences have structures • Linear sequences: POS tagging, phrase/named entity chunking • Tree structures: phrase structure trees, dependency trees • Two statistical approaches to structure analysis • Global optimization • E.g., Hidden Markov Models and Conditional Random Fields for sequential tagging problems • Probabilistic context-free parsing • Maximum spanning tree parsing (graph-based) • Repetition of local optimization • Chunking with Support Vector Machines • Deterministic parsing (transition-based)
Statistical dependency parsers • Eisner (COLING 96, Penn Technical Report 96) • Kudo & Matsumoto (VLC 00, CoNLL 02) • Yamada & Matsumoto (IWPT 03) • Nivre (IWPT 03, COLING 04, ACL 05) • Cheng, Asahara, Matsumoto (IJCNLP 04) • McDonald-Crammer-Pereira (ACL 05a, EMNLP 05b, EACL 06) (In the original slide these are grouped into "global optimization" and "repetition of local optimization" approaches.)
Dependency Parsing Used as the CoNLL Shared Task • CoNLL (Conference on Natural Language Learning) • Multi-lingual Dependency Parsing Track • 10 languages: Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, Turkish • Domain Adaptation Track • Dependency-annotated data in one domain and large unannotated data in other domains (biomedical/chemical abstracts, parent-child dialogue) are available. • Objective: to use large-scale unannotated target-domain data to enhance the performance of a dependency parser learned in the original domain so that it works well in the new domain. Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D., "The CoNLL 2007 Shared Task on Dependency Parsing," Proceedings of EMNLP-CoNLL 2007, pp. 915-932, June 2007.
Statistical dependency parsers (to be introduced in this lecture) • Kudo & Matsumoto (VLC 00, CoNLL 02): Japanese • Yamada & Matsumoto (IWPT 03) • Nivre (IWPT 03, COLING 04, ACL 05) • McDonald-Crammer-Pereira (EMNLP 05a, ACL 05b, EACL 06) Most of them (except [Nivre 05] and [McDonald 05a]) assume projective dependency parsing.
Japanese Syntactic Dependency Analysis • Analysis of the relationships between phrasal units ("bunsetsu" segments) • Two constraints: • Each segment modifies one of the segments to its right (Japanese is a head-final language) • Dependencies do not cross one another (projectiveness)
An Example of Japanese Syntactic Dependency Analysis
Raw text: 私は彼女と京都に行きます (I go to Kyoto with her.)
↓ Morphological analysis and bunsetsu chunking
私は / 彼女と / 京都に / 行きます
(I / with her / to Kyoto / go)
↓ Dependency analysis
私は / 彼女と / 京都に / 行きます, where each of the first three segments depends on 行きます
Model 1: Probabilistic Model [Kudo & Matsumoto 00]
Input: 私は1 / 彼女と2 / 京都に3 / 行きます4 (I-top / with her / to Kyoto-loc / go)
1. Build a dependency matrix using ME, DT, or SVMs (how probable it is that one segment modifies another):

   modifier \ modifiee |  2    3    4
   1                   | 0.1  0.2  0.7
   2                   |      0.2  0.8
   3                   |           1.0

2. Search for the optimal dependencies that maximize the sentence probability, using CYK or chart parsing.
Output: 私は1, 彼女と2, and 京都に3 all depend on 行きます4.
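To make step 2 concrete, here is a small Python sketch (not the original implementation, which uses CYK or chart search) that exhaustively searches for the head-final, projective assignment of heads maximizing the product of the matrix probabilities; `prob` is an illustrative 0-indexed nested list standing in for the dependency matrix.

```python
import itertools
import math

def best_dependency_tree(prob):
    """prob[i][j] is assumed to hold P(segment i modifies segment j) for i < j.
    Returns the head of each segment except the last (the root) and the score.
    Exhaustive enumeration is used only to keep the sketch short."""
    n = len(prob)
    best_score, best_heads = -1.0, None
    # Every segment except the last modifies some segment to its right.
    for heads in itertools.product(*[range(i + 1, n) for i in range(n - 1)]):
        # Projectivity: no two dependencies may cross.
        crossing = any(k < heads[i] < heads[k]
                       for i in range(n - 1) for k in range(i + 1, n - 1))
        if crossing:
            continue
        score = math.prod(prob[i][heads[i]] for i in range(n - 1))
        if score > best_score:
            best_score, best_heads = score, heads
    return best_heads, best_score
```

For the matrix above (with unused entries set to 0), the search returns heads (3, 3, 3): segments 1-3 all modify 行きます4, with score 0.7 × 0.8 × 1.0 = 0.56.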
Problems of the Probabilistic Model (1) • Selection of training examples: all pairs of segments in a sentence • Depending pairs → positive examples • Non-depending pairs → negative examples • This produces a total of n(n-1)/2 training examples per sentence (n is the number of segments in the sentence) • In Model 1: • All positive and negative examples are used to learn an SVM • A test example is given to the SVM, and its distance from the separating hyperplane is transformed into a pseudo-probability using the sigmoid function
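For reference, converting the SVM's signed distance into a pseudo-probability with a sigmoid can be sketched as follows (Python; the parameter names are illustrative, and in Platt-style scaling the slope and offset would be fitted on held-out data rather than fixed).

```python
import math

def pseudo_probability(margin, slope=1.0, offset=0.0):
    """Map the signed distance from the separating hyperplane to (0, 1).
    slope/offset are illustrative defaults; they are normally estimated
    from held-out data."""
    return 1.0 / (1.0 + math.exp(-(slope * margin + offset)))
```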
Problems of the Probabilistic Model (2) • The number of training examples is large • O(n³) time is necessary for complete parsing • The classification cost of SVMs is much higher than that of other ML algorithms such as Maximum Entropy models and Decision Trees
Model 2: Cascaded Chunking Model [Kudo & Matsumoto 02] • Parses a sentence deterministically, deciding only whether the current segment modifies the segment on its immediate right-hand side • Training examples are extracted using the same parsing algorithm
Example: Training Phase
Annotated sentence: 彼は1 彼女の2 温かい3 真心に4 感動した。5
(he / her / warm / heart / was moved: "He was moved by her warm heart.")
The parser is run over the annotated sentence; in each pass, each remaining segment is tagged D (it modifies the segment on its immediate right) or O, the tags being read off the annotation, and segments whose dependency is determined are removed:
彼は1 彼女の2 温かい3 真心に4 感動した。5
彼は1 彼女の2 真心に4 感動した。5
彼は1 真心に4 感動した。5
彼は1 感動した。5
感動した。5
Pairs of tag (D or O) and context (features) are stored as training data for the SVMs, and SVM learning takes place after accumulation.
Example: Test Phase
Test sentence: 彼は1 彼女の2 温かい3 真心に4 感動した。5
(he / her / warm / heart / was moved: "He was moved by her warm heart.")
The same cascaded process is applied, but now each D/O tag is decided by the SVMs built in the training phase:
彼は1 彼女の2 温かい3 真心に4 感動した。5
彼は1 彼女の2 真心に4 感動した。5
彼は1 真心に4 感動した。5
彼は1 感動した。5
感動した。5
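A minimal Python sketch of the cascaded chunking loop illustrated above (not the original implementation). The function `modifies` stands in for the trained SVMs, answering whether a segment in the current list modifies its right-hand neighbour; the rule used here for when a D-tagged segment may safely be removed (its left neighbour is not itself tagged D) is a simplification that reproduces the example above, and the bookkeeping in the original system may differ.

```python
def cascaded_chunking_parse(segments, modifies):
    """modifies(segments, active, k) -> bool: does the k-th segment of the
    current list `active` modify the (k+1)-th one?  Returns a dict mapping
    each segment index to the index of its head."""
    heads = {}
    active = list(range(len(segments)))
    while len(active) > 1:
        tags = [modifies(segments, active, k) for k in range(len(active) - 1)]
        kept, attached_any = [], False
        for k, idx in enumerate(active):
            left_is_d = k > 0 and tags[k - 1]
            if k < len(tags) and tags[k] and not left_is_d:
                # Safe to attach and drop: the left neighbour does not modify
                # this segment, so no remaining segment can still depend on it.
                heads[idx] = active[k + 1]
                attached_any = True
            else:
                kept.append(idx)
        if not attached_any:          # safeguard against classifier stalls
            heads[active[0]] = active[1]
            kept = active[1:]
        active = kept
    return heads
```

With an oracle classifier on the annotated example, this loop reproduces the progression of segment lists shown in the training-phase slide.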
Advantages of the Cascaded Chunking Model • Efficiency • O(n³) (probabilistic model) vs. O(n²) (cascaded chunking model) • In practice lower than O(n²), since most segments modify the segment on their immediate right-hand side • The number of training examples is much smaller • Independence from ML methods • Can be combined with any ML algorithm that works as a binary classifier • Probabilities of dependency are not necessary
Features Used in the Implementation
Example: 彼の1 友人は2 この本を3 持っている4 女性を5 探している6
(his / friend-top / this book-acc / have / lady-acc / be looking for: "His friend is looking for a lady who has this book.")
The question for each candidate pair of segments (modifier, head) is "modify or not?"; A, B, and C mark surrounding segments in the original figure.
• Static features
  • Modifier/modifiee: head word / functional word: surface form, POS, POS subcategory, inflection type, inflection form, brackets, quotations, punctuation, …
  • Between the segments: distance, case particles, brackets, quotations, punctuation
• Dynamic features
  • A, B: static features of the functional word
  • C: static features of the head word
Settings of Experiments • Kyoto University Corpus 2.0/3.0 • Standard data set • Training: 7,958 sentences / Test: 1,246 sentences • Same data as [Uchimoto et al. 98] and [Kudo, Matsumoto 00] • Large data set • 2-fold cross-validation using all 38,383 sentences • Kernel function: 3rd-degree polynomial • Evaluation measures • Dependency accuracy • Sentence accuracy
Results

                             Standard (8,000 sentences)      Large (20,000 sentences)
                             Cascaded      Probabilistic     Cascaded      Probabilistic
                             Chunking                        Chunking
  Dependency Acc. (%)        89.29         89.09             90.45         N/A
  Sentence Acc. (%)          47.53         46.17             53.16         N/A
  # of training sentences    7,956         7,956             19,191        19,191
  # of training examples     110,355       459,105           251,254       1,074,316
  Training time (hours)      8             336               48            N/A
  Parsing time (sec./sent.)  0.5           2.1               0.7           N/A
Smoothing Effect (in the cascaded chunking model) • No need to cut off low-frequency words
Combination of features • Polynomial kernels are used to take combinations of features into account (tested with a small corpus of 2,000 sentences)
Deterministic Dependency Parser Based on SVMs [Yamada & Matsumoto 03] • Three possible actions: • Right: for the two adjacent words, modification goes from the left word to the right word • Left: for the two adjacent words, modification goes from the right word to the left word • Shift: no action is taken for the pair, and the focus moves to the right • There are two possibilities in the Shift situation: • There is really no modification relation between the pair • There is actually a modification relation between them, but it is necessary to wait until the surrounding analysis has finished • The second situation can be treated as a separate class (called Wait) • This process is applied to the input sentence from the beginning to the end, and repeated until a single word remains
The Features Used in Learning
(the feature set is shown in the original slide)
An SVM is used to make the classification, either in the 3-class model (right, left, shift) or in the 4-class model (right, left, shift, wait).
SVM Learning of Actions • The best action for each configuration is learned by SVMs • Since the problem is a 3-class or 4-class classification problem, either the pairwise or the one-vs-rest method is employed • Pairwise method: for each pair of classes, an SVM is learned; the best class is decided by voting over all the SVMs • One-vs-rest method: for each class, an SVM is learned to discriminate that class from the others; the best class is decided by the SVM that gives the highest value
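The deterministic loop described in the previous slides can be sketched as follows (Python; a 3-class version without the Wait action, and the way the focus moves after an attachment is one simple choice, not necessarily the original). `choose_action` stands in for the trained SVMs.

```python
def yamada_matsumoto_parse(words, choose_action):
    """choose_action(nodes, i) -> 'right', 'left', or 'shift' for the
    adjacent pair (nodes[i], nodes[i+1]).  Returns a dict mapping each
    word index to the index of its head."""
    heads = {}
    nodes = list(range(len(words)))        # words not yet attached to a head
    while len(nodes) > 1:
        i, attached_any = 0, False
        while i < len(nodes) - 1:          # one left-to-right pass
            action = choose_action(nodes, i)
            if action == 'right':          # left word modifies the right word
                heads[nodes[i]] = nodes[i + 1]
                del nodes[i]
                attached_any = True
            elif action == 'left':         # right word modifies the left word
                heads[nodes[i + 1]] = nodes[i]
                del nodes[i + 1]
                attached_any = True
            else:                          # shift: move the focus to the right
                i += 1
        if not attached_any:               # no progress in a full pass: stop
            break
    return heads
```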
An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
Input sentence: the boy hits the dog with a rod
(In the original slides, each step shows the word pair currently being considered, the referred context, and the partial trees built so far, drawn as figures.)
The actions taken at the successive steps are:
right, right, shift, right, shift, shift, right, left, shift, left, left
After the last action only "hits" remains, and parsing ends.
The Accuracy of Parsing Accuracies are reported for: dependency relation, root identification, and complete analysis • Learned with 30,000 English sentences • no children: no child information is considered • word, POS: only word/POS information is used • all: all information is used
Deterministic linear-time dependency parser based on shift-reduce parsing [Nivre 03, 04] • There are a stack S and a queue Q of remaining input words • Initialization: S[w1] [w2, w3, …, wn]Q • Termination: S[…] []Q • Parsing actions: • SHIFT: S[…] [wi, …]Q → S[…, wi] […]Q • Left-Arc: S[…, wi] [wj, …]Q → S[…] [wj, …]Q (wi becomes a dependent of wj and is popped) • Right-Arc: S[…, wi] [wj, …]Q → S[…, wi, wj] […]Q (wj becomes a dependent of wi and is pushed) • Reduce: S[…, wi, wj] […]Q → S[…, wi] […]Q Though the original parser uses memory-based learning, recent implementations use SVMs to select the actions
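A minimal Python sketch of this shift-reduce loop (illustrative, not Nivre's implementation). `choose_action` stands in for the memory-based learner or SVMs; the preconditions keep each action well-formed, and SHIFT serves as the fallback.

```python
def nivre_parse(words, choose_action):
    """choose_action(stack, queue, heads) -> 'shift', 'left-arc',
    'right-arc', or 'reduce'.  Returns a dict mapping each word index
    to the index of its head."""
    heads = {}
    if not words:
        return heads
    # Initialization as on the slide: w1 starts on the stack.
    stack, queue = [0], list(range(1, len(words)))
    while queue:                                   # termination: empty queue
        action = choose_action(stack, queue, heads)
        if action == 'left-arc' and stack and stack[-1] not in heads:
            heads[stack.pop()] = queue[0]          # stack top depends on the next input word
        elif action == 'right-arc' and stack:
            heads[queue[0]] = stack[-1]            # next input word depends on the stack top
            stack.append(queue.pop(0))
        elif action == 'reduce' and stack and stack[-1] in heads:
            stack.pop()                            # stack top already has its head
        else:                                      # shift (also the fallback)
            stack.append(queue.pop(0))
    return heads
```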