Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
Shay B. Cohen, Dipanjan Das, Noah A. Smith (Carnegie Mellon University)
July 27, EMNLP 2011
Goal: learn linguistic structure for a language without any labeled data in that language. Running example, "The Skibo Castle is close by .": part-of-speech tagging assigns the coarse tags DET NOUN NOUN VERB ADJ ADP (and . for the punctuation); dependency parsing attaches the words to their syntactic heads.
Multilingual Unsupervised Learning. Prior work varies along two axes: whether parallel data is used, and whether guidance comes from supervision in source language(s) or from joint learning over multiple languages.
Using parallel data, supervision in source language(s): Yarowsky and Ngai (2001), Xi and Hwa (2005), Smith and Eisner (2009), Das and Petrov (2011), McDonald et al. (2011)
Using parallel data, joint learning for multiple languages: Snyder et al. (2009)
No parallel data (hard), joint learning for multiple languages: Cohen and Smith (2009), Berg-Kirkpatrick and Klein (2010), Naseem et al. (2010)
No parallel data (hard), supervision in source language(s): this work!
In a Nutshell. Annotated data in helper languages (e.g., Spanish and Italian) yields coarse, universal parameters for each helper. These are interpolated, with mixture weights estimated by unsupervised training on unlabeled Portuguese data, into coarse parameters for Portuguese. A coarse-to-fine expansion of those parameters then initializes monolingual unsupervised training in Portuguese, producing the final Portuguese parameters.
Assumptions for a given problem. 1. The underlying model is generative and composed of multinomial distributions: for example, an HMM over "The Skibo Castle is close by" for POS tagging (Merialdo, 1994), or the DMV over the coarse tag sequence DET NOUN NOUN VERB ADJ ADP, headed by ROOT, for dependency parsing (Klein and Manning, 2004).
In general, an unlexicalized parameter is written $\theta_{k,i}$, the probability of the $i$th event of the $k$th multinomial in the model, e.g. the transition from ADJ to NOUN in an HMM. The lexicalized parameters take a similar form (the DMV has no lexicalized parameters). The probability of a derivation is a product of these parameters, unlexicalized and lexicalized, each raised to $f_{k,i}(\mathbf{x},\mathbf{y})$, the number of times event $i$ of multinomial $k$ fires in the derivation:
$p(\mathbf{x},\mathbf{y}) = \prod_k \prod_i \theta_{k,i}^{\,f_{k,i}(\mathbf{x},\mathbf{y})}$
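As a toy illustration of this factorization, the following sketch scores one tagged analysis of the running example under a bigram HMM; the tag inventory follows the slides, but every probability value and the helper code are invented for illustration.

import math
from collections import Counter

# Toy bigram-HMM multinomials over coarse tags (all values are illustrative;
# initial-state probabilities are omitted for brevity).
transition = {("DET", "NOUN"): 0.6, ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.4,
              ("VERB", "ADJ"): 0.3, ("ADJ", "ADP"): 0.2}
emission = {("DET", "The"): 0.3, ("NOUN", "Skibo"): 0.001, ("NOUN", "Castle"): 0.01,
            ("VERB", "is"): 0.4, ("ADJ", "close"): 0.05, ("ADP", "by"): 0.3}

def log_prob(words, tags):
    """log p(words, tags): each multinomial event contributes its log-parameter
    times the number of times the event fires in the derivation."""
    events = Counter()
    for prev, cur in zip(tags, tags[1:]):   # unlexicalized (transition) events
        events[("trans", prev, cur)] += 1
    for tag, word in zip(tags, words):      # lexicalized (emission) events
        events[("emit", tag, word)] += 1
    total = 0.0
    for (kind, a, b), count in events.items():
        theta = transition[(a, b)] if kind == "trans" else emission[(a, b)]
        total += count * math.log(theta)
    return total

print(log_prob("The Skibo Castle is close by".split(),
               ["DET", "NOUN", "NOUN", "VERB", "ADJ", "ADP"]))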
2. A coarse, universal part-of-speech tagset. For each language $\ell$, there is a mapping from that language's treebank tagset to the coarse tagset.
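For concreteness, a small illustrative fragment of such a mapping for the English Penn Treebank tagset (only a handful of tags shown; the real tables cover the full tagset).

# Illustrative fragment of a fine-to-coarse tag mapping for English
# Penn Treebank tags (the real mapping covers the whole tagset).
EN_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",  "JJR": "ADJ",  "JJS": "ADJ",
    "DT": "DET",  "IN": "ADP",
}

def to_coarse(fine_tags, mapping=EN_TO_COARSE):
    return [mapping[t] for t in fine_tags]

print(to_coarse(["DT", "NNP", "NNP", "VBZ", "JJ", "IN"]))
# ['DET', 'NOUN', 'NOUN', 'VERB', 'ADJ', 'ADP']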
3. Helper languages with treebanks. For each helper language: convert its treebank to the coarse tagset, then estimate its coarse unlexicalized parameters by maximum likelihood (MLE).
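A minimal sketch of this MLE step for one kind of unlexicalized parameter (HMM transitions), assuming the treebank sentences have already been converted to coarse tags; the data and function name are illustrative.

from collections import Counter, defaultdict

def mle_transitions(coarse_tagged_sentences):
    """Relative-frequency (MLE) estimates of HMM transition multinomials
    from a treebank already converted to coarse tags."""
    counts = defaultdict(Counter)
    for tags in coarse_tagged_sentences:
        for prev, cur in zip(tags, tags[1:]):
            counts[prev][cur] += 1
    return {prev: {cur: c / sum(nxt.values()) for cur, c in nxt.items()}
            for prev, nxt in counts.items()}

# Toy "coarse treebank" for one helper language (two sentences).
helper = [["DET", "NOUN", "VERB", "ADJ", "ADP"],
          ["DET", "NOUN", "NOUN", "VERB", "ADJ"]]
print(mle_transitions(helper)["NOUN"])   # e.g. {'VERB': 0.666..., 'NOUN': 0.333...}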
Multilingual Modeling
For a target language, each unlexicalized multinomial is a mixture of the helper languages' versions of it:
$\theta_k = \sum_{\ell} \beta_{\ell,k}\,\theta_k^{(\ell)}$
where $\beta_{\ell,k}$ is the mixture weight of the $\ell$th helper language for the $k$th multinomial, and $\theta_k^{(\ell)}$ is the $k$th multinomial in the model (say, the transitions from the ADJ tag in an HMM) as estimated from helper language $\ell$.
For example, with two helper languages, Spanish and Italian, the target's ADJ transition distribution is a weighted combination of the Spanish and Italian ADJ transition distributions, with mixture weights of, say, 0.7 and 0.3. In practice the mixture weights are unknown and must be learned.
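A sketch of the interpolation for a single multinomial (the ADJ transition distribution) with two helper languages; all probability values here are invented.

# Interpolating one multinomial (the ADJ transition distribution) from two
# helper languages with per-multinomial mixture weights (values illustrative).
adj_spanish = {"NOUN": 0.5, "VERB": 0.2, "ADP": 0.3}
adj_italian = {"NOUN": 0.6, "VERB": 0.1, "ADP": 0.3}

def mix(multinomials, weights):
    """theta_k = sum_l beta_{l,k} * theta^{(l)}_k ; weights must sum to 1."""
    events = set().union(*multinomials)
    return {e: sum(w * m.get(e, 0.0) for w, m in zip(weights, multinomials))
            for e in events}

print(mix([adj_spanish, adj_italian], [0.7, 0.3]))
# e.g. {'NOUN': 0.53, 'VERB': 0.17, 'ADP': 0.3}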
Learning and Inference
In normal (monolingual) learning, the likelihood of the unlabeled data is maximized over all of the model's parameters. In multilingual learning, the helper languages' parameters $\theta_k^{(\ell)}$ are fixed: only the mixture coefficients $\beta$ (and the model's lexicalized parameters, where it has any) are estimated.
Multilingual learning uses EM: the E-step computes the expected number of times each $\theta_{k,i}^{(\ell)}$ is used in a derivation, and the M-step re-estimates the mixture coefficients $\beta$ from those expected usage counts.
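The slide only sketches the update, so the code below shows one standard way to fit the mixture weights of a single multinomial by EM, treating the choice of helper language as a latent variable; the details are an assumption rather than the paper's exact derivation, and all numbers are toy values.

def em_update_beta(helper_multinomials, beta, expected_counts):
    """One EM-style update of the mixture weights for a single multinomial,
    treating the choice of helper language as latent: the E-step computes,
    for each event, a posterior over which helper produced it, and the
    M-step renormalizes the expected number of times each helper was used."""
    used = [0.0] * len(beta)
    for event, count in expected_counts.items():
        mixed = sum(b * m[event] for b, m in zip(beta, helper_multinomials))
        for lang, (b, m) in enumerate(zip(beta, helper_multinomials)):
            used[lang] += count * b * m[event] / mixed   # responsibility * count
    total = sum(used)
    return [u / total for u in used]

# Toy expected counts of ADJ-transition events (as produced by the outer E-step).
adj_spanish = {"NOUN": 0.5, "VERB": 0.2, "ADP": 0.3}
adj_italian = {"NOUN": 0.6, "VERB": 0.1, "ADP": 0.3}
beta = [0.5, 0.5]
counts = {"NOUN": 30.0, "VERB": 5.0, "ADP": 10.0}
for _ in range(20):
    beta = em_update_beta([adj_spanish, adj_italian], beta, counts)
print(beta)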
What about feature-rich generative models? Each multinomial can instead be parameterized as a locally normalized log-linear model (Berg-Kirkpatrick et al., 2010).
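In that parameterization, each multinomial entry is a softmax over features of the event (the notation here is generic, not necessarily the paper's):
$\theta_{k,i} = \dfrac{\exp\big(\mathbf{w}^{\top}\mathbf{g}(k,i)\big)}{\sum_{i'} \exp\big(\mathbf{w}^{\top}\mathbf{g}(k,i')\big)}$
so a single weight vector $\mathbf{w}$ is shared across events through the feature function $\mathbf{g}$.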
Returning to the example with Spanish and Italian as helpers: after learning, the mixture weights of the ADJ transition multinomial come out to, e.g., 0.6237 and 0.3763.
Coarse-to-fine expansion (illustrated for English). Step 1: each coarse multinomial is copied identically to the fine tags that map to it, e.g. the ADJ transition distribution becomes the JJ, JJR, and JJS transition distributions. Step 2: the probability mass of each coarse event is divided equally among the corresponding new, fine events, and the result is used as the initializer for monolingual unsupervised training.
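One plausible reading of this expansion as code; the equal-split rule and the toy tagsets below are assumptions based on the slide, not the paper's exact recipe.

def coarse_to_fine(coarse_transitions, fine_to_coarse):
    """Expand coarse transition multinomials to a fine tagset: each fine tag
    receives an identical copy of its coarse tag's distribution (step 1), and
    the mass of each coarse target is split equally among its fine tags (step 2)."""
    fines_of = {}
    for fine, coarse in fine_to_coarse.items():
        fines_of.setdefault(coarse, []).append(fine)
    expanded = {}
    for fine_src, coarse_src in fine_to_coarse.items():
        row = {}
        for coarse_tgt, p in coarse_transitions[coarse_src].items():
            for fine_tgt in fines_of[coarse_tgt]:
                row[fine_tgt] = p / len(fines_of[coarse_tgt])  # equal division
        expanded[fine_src] = row
    return expanded

fine_to_coarse = {"JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ", "NN": "NOUN", "NNS": "NOUN"}
coarse = {"ADJ": {"NOUN": 0.8, "ADJ": 0.2}, "NOUN": {"NOUN": 0.3, "ADJ": 0.7}}
print(coarse_to_fine(coarse, fine_to_coarse)["JJ"])
# the coarse ADJ row copied to JJ: NOUN mass 0.8 split over NN/NNS, ADJ mass 0.2 over JJ/JJR/JJS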
Experiments
Two Problems
• Unsupervised part-of-speech tagging: feature-based HMM (Berg-Kirkpatrick et al., 2010), learned with L-BFGS
• Unsupervised dependency parsing: DMV (Klein and Manning, 2004), learned with EM
Languages. Target languages: Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, and Turkish. Helper languages: English, German, Italian, and Czech (CoNLL treebanks from 2006 and 2007).
Results: POS Tagging (without tag dictionary). [Charts comparing the full model, a variant with uniform mixture parameters (no learning), and the monolingual baseline of Berg-Kirkpatrick et al. (2010).]
Results: Dependency Parsing. [Charts comparing against phylogenetic grammar induction (Berg-Kirkpatrick and Klein, 2010), posterior regularization (Gillenwater et al., 2010), and monolingual EM (Klein and Manning, 2004), and comparing variants of this approach: learned vs. uniform mixture parameters, with and without coarse-to-fine expansion followed by monolingual learning.]
Analyzing with Principal Component Analysis. [Plot of the projection onto two principal components.]
From Words to Dependencies
Use the induced tags to induce dependencies, either in a pipeline or by using the posteriors over tags in a sausage lattice (Cohen and Smith, 2007).
Joint decoding: the DMV parses a lattice in which each position carries the tagger's posterior distribution over coarse tags, e.g. for "The Skibo Castle": The (DET 0.95, ADJ 0.03, NOUN 0.02), Skibo (DET 0.0, ADJ 0.3, NOUN 0.7), Castle (DET 0.01, ADJ 0.1, NOUN 0.89).
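One way to write the objective that joint decoding maximizes; the exact form of the combination is an assumption here, since the slide only shows the DMV parsing a posterior-weighted sausage lattice:
$(\hat{\mathbf{t}}, \hat{\mathbf{y}}) = \arg\max_{\mathbf{t},\mathbf{y}} \; p_{\mathrm{DMV}}(\mathbf{t}, \mathbf{y}) \prod_i q(t_i \mid \mathbf{x})$
where $q(t_i \mid \mathbf{x})$ is the tagger's posterior over the tag at position $i$ (the lattice weight) and $\mathbf{y}$ ranges over dependency trees.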
Results: Words to Dependencies. [Charts of attachment accuracy.] The best average result with gold tags is 62.2. Interestingly, automatically induced tags perform better than gold tags for Turkish and Slovene.
Conclusions
• Improvements for two major tasks using non-parallel multilingual guidance
• In general, gains for grammar induction are larger than for POS tagging
• Joint POS tagging and dependency parsing performs surprisingly well
• For a few languages, results are better than using gold tags
• Joint decoding performs better than a pipeline
Questions?
Additional results: POS tagging without and with a tag dictionary. [Charts.]