This presentation explores machine learning approaches for dealing with limited bilingual data in Statistical Machine Translation (SMT), including semi-supervised learning and active learning techniques. It covers the use of comparable corpora, parallel sentence extraction, and sentence pair selection to improve translation quality.
Machine Learning Approaches for Dealing with Limited Bilingual Data in SMT
Gholamreza Haffari, Simon Fraser University
MT Summit, August 2009
Acknowledgments • Special thanks to: Anoop Sarkar • Some slides are adapted or used from • Chris Callison-Burch • Trevor Cohn • Dragos Stefan Munteanu
Statistical Machine Translation • Translate from a source language to a target language by computer using a statistical model • MFE is a standard log-linear model (diagram: Source Lang. F → MFE → Target Lang. E)
Log-Linear Models • At test time, the best output t* for a given input s is chosen by t* = arg max_t Σ_i w_i · f_i(t, s), where the f_i are feature functions and the w_i are their weights
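As a toy illustration of this decision rule (not part of the original slides), the following Python sketch scores a few candidate translations with made-up feature functions and weights and picks the argmax:

```python
# Hypothetical feature functions f_i(t, s): a language-model-like score and a
# length penalty; real systems use many more features.
def feature_lm(t, s):
    return -len(t.split())          # stand-in for a language model score

def feature_len_ratio(t, s):
    return -abs(len(t.split()) - len(s.split()))

FEATURES = [feature_lm, feature_len_ratio]
WEIGHTS = [0.7, 0.3]                # the w_i, normally tuned (e.g. by MERT)

def score(t, s):
    """Weighted sum of feature values: sum_i w_i * f_i(t, s)."""
    return sum(w * f(t, s) for w, f in zip(WEIGHTS, FEATURES))

def decode(s, candidates):
    """Return t* = argmax_t score(t, s) over a list of candidate translations."""
    return max(candidates, key=lambda t: score(t, s))
```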
Phrase-based SMT • MFE is composed of two main components: • The language model flm : Takes care of the fluency of the generated translation • The phrase table fpt : Takes care of the content of the source sentence in the generated translation A huge bitext is needed to learn a high-quality phrase table
Bilingual Parallel Data (figure: aligned source text and target text)
This Talk What if we don’t have large bilingual text to learn a good phrase table?
Motivations • Low-density language pairs • The population speaking the language is small / limited online resources • Adapting to a new style/domain/topic • Overcome the mismatch between training and test data
Available Resources • Small bilingual parallel corpora • Large amounts of monolingual data • Comparable corpora • Small translation dictionary • Multilingual parallel corpora which include multiple source languages but not the target language
The Map (diagram): starting from a small source-target bitext and an MT system, additional resources feed different techniques: a large comparable source-target bitext (parallel sentence extraction); a large source monotext (semi-supervised / active learning, paraphrasing, bilingual dictionary induction); a source-another language bitext; and source-another / another-target / source-target bitexts (triangulation / co-training)
Learning Problems (I) • Supervised learning: • Given a sample of object-label pairs (xi,yi), find the predictive relationship between objects and labels • Unsupervised learning: • Given a sample consisting of only objects, look for interesting structures in the data, and group similar objects
Learning Problems (II) • Now consider training data consisting of: • Labeled data: object-label pairs (xi,yi) • Unlabeled data: objects xj • This leads to the following learning scenarios: • Semi-supervised learning: Find the best mapping from objects to labels, benefiting from the unlabeled data • Transductive learning: Find the labels of the unlabeled data • Active learning: Find the mapping while actively querying the oracle for the labels of unlabeled data
The Big Picture: Self-Training (diagram) • Labeled data {(xi,yi)} (bitext) and unlabeled data {xj} (monotext) • Train a model M on the labeled data, use it to select and label unlabeled instances, and add them back to the labeled pool
Mining More Bilingual Parallel Data • Comparable corpora are texts which are not parallel in the strict sense but convey overlapping information • Wikipedia pages • News agencies: BBC, CNN • From comparable corpora, we can extract sentence pairs which are (approximately) translations of each other
Extracting Parallel Sentences (Munteanu & Marcu, 2005) (diagram: pipeline from un-matched documents to extracted parallel sentences)
Article Selection • Starting from the un-matched documents, select the n most relevant target-language docs for each source-language document using an information retrieval (IR) system: • Translate each source-lang article into a target-lang query using the bilingual dictionary (Munteanu & Marcu, 2005)
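A minimal sketch of this dictionary-based retrieval step, assuming a simple term-overlap ranking (the function names and the scoring are illustrative, not the paper's IR system):

```python
from collections import Counter

def make_query(source_article, bilingual_dict):
    """Translate a source-language article into a bag-of-words target-language query."""
    query = Counter()
    for word in source_article.split():
        for translation in bilingual_dict.get(word, []):
            query[translation] += 1
    return query

def top_n_articles(source_article, target_articles, bilingual_dict, n=5):
    """Rank target-language articles by simple term overlap with the translated query."""
    query = make_query(source_article, bilingual_dict)
    def overlap(doc):
        words = Counter(doc.split())
        return sum(min(query[w], words[w]) for w in query)
    return sorted(target_articles, key=overlap, reverse=True)[:n]
```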
Candidate Sentence Pair Selection • Consider all sentence pairs formed from the source-lang article and its relevant target-lang articles. Filter out a sentence pair if: • The ratio of their lengths is more than 2 • At least half of the words in each sentence do not have a translation in the other sentence (Munteanu & Marcu, 2005)
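A small Python sketch of these two coarse filters (the word-level translation test uses a hypothetical bilingual dictionary lookup in place of real alignment information):

```python
def half_covered(words, other_side, translation_dict):
    """True if at least half of `words` have some translation appearing on the other side."""
    hits = sum(1 for w in words
               if any(t in other_side for t in translation_dict.get(w, [])))
    return hits >= len(words) / 2

def keep_candidate(src_words, tgt_words, src2tgt, tgt2src):
    """Apply the two coarse filters before the log-linear classifier."""
    # Filter 1: sentence length ratio must not exceed 2.
    longer = max(len(src_words), len(tgt_words))
    shorter = min(len(src_words), len(tgt_words))
    if shorter == 0 or longer / shorter > 2:
        return False
    # Filter 2: at least half the words on each side must have a translation on the other side.
    return (half_covered(src_words, set(tgt_words), src2tgt)
            and half_covered(tgt_words, set(src_words), tgt2src))
```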
Parallel Sentence Selection • Each candidate sentence pair (s,t) is classified as c0 = ‘parallel’ or c1 = ‘not parallel’ by a log-linear model over the features listed on the next slide • The weights are learned during the training phase using training data (Munteanu & Marcu, 2005)
Model Features & Training Data • The features of the log-linear classifier include: • The lengths of the two sentences, as well as their ratio • The percentage of words on one side that do not have a translation on the other side / are not connected by alignment links • Training data can be prepared from a parallel corpus containing K sentence pairs • This gives K positive and K² – K negative examples (which can be filtered further using the previous heuristics) (Munteanu & Marcu, 2005)
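For concreteness, a hedged sketch of how such features might be computed for one candidate pair (the feature set in the paper is richer, e.g. it also uses alignment-link features; this is not the authors' code):

```python
def pair_features(src_words, tgt_words, src2tgt):
    """Toy feature vector for the parallel / not-parallel classifier."""
    tgt_set = set(tgt_words)
    untranslated = sum(1 for w in src_words
                       if not any(t in tgt_set for t in src2tgt.get(w, [])))
    return {
        "src_len": len(src_words),
        "tgt_len": len(tgt_words),
        "len_ratio": len(src_words) / max(1, len(tgt_words)),
        "src_untranslated_frac": untranslated / max(1, len(src_words)),
    }
```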
Improvement in SMT (Arabic to English) (Munteanu & Marcu, 2005) (figure: BLEU curves comparing the initial out-of-domain parallel corpus, the initial corpus + extracted corpus, and the initial corpus + human-translated data)
Outline • Introduction • Semi-supervised Learning for SMT • Background (EM, Self-training, Co-Training) • SSL for Alignments / Phrases / Sentences • Active Learning for SMT • Single-language pair • Multiple Language Pairs
Inductive vs. Transductive • Transductive: Produce labels only for the available unlabeled data • The output of the method is not a classifier • It’s like writing answers for a take-home exam! • Inductive: Not only produce labels for the unlabeled data, but also produce a classifier • It’s like preparing to write answers for an in-class exam!
Self-Training (Yarowsky 1995) • Start from a model trained by supervised learning • Choose unlabeled instances labeled with high confidence by the model • Add them to the pool of current labeled training data and retrain (figure: iterations 0, 1, 2, …)
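A generic self-training loop, sketched in Python (the model interface with predict returning a label and a confidence, and the confidence threshold, are illustrative assumptions):

```python
def self_train(train_model, labeled, unlabeled, threshold=0.9, iterations=5):
    """Generic self-training: repeatedly add confidently self-labeled examples."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(iterations):
        model = train_model(labeled)                  # supervised training step
        confident, remaining = [], []
        for x in unlabeled:
            label, confidence = model.predict(x)      # assumed model interface
            (confident if confidence >= threshold else remaining).append((x, label))
        if not confident:
            break
        labeled.extend(confident)                     # grow the labeled pool
        unlabeled = [x for x, _ in remaining]
    return train_model(labeled)
```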
EM (Dempster et al 1977) • Use EM to maximize the joint log-likelihood of labeled and unlabeled data, i.e. the sum of the log-likelihood of the labeled data and the log-likelihood of the unlabeled data
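The objective the slide refers to can be written out as follows (a standard formulation, reconstructed here since the slide's equation image is not preserved):

```latex
\max_{\theta}\;
\underbrace{\sum_{(x_i, y_i) \in \mathcal{L}} \log p_{\theta}(x_i, y_i)}_{\text{log-likelihood of labeled data}}
\;+\;
\underbrace{\sum_{x_j \in \mathcal{U}} \log \sum_{y} p_{\theta}(x_j, y)}_{\text{log-likelihood of unlabeled data}}
```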
EM • Start from a model trained by supervised learning • Clone new weighted labeled instances from the unlabeled instances using the (probabilistic) model (figure: iterations 0, 1, 2, … with instance weights w+i and w-i) (Yarowsky 1995)
Co-Training (Blum & Mitchell 1998) • Instances contain two sufficient sets of features, i.e. an instance is x = (x1, x2) • Each set of features is called a view • The two views are independent given the label • The two views are consistent (these two assumptions are written out below)
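The two assumptions are usually stated as follows (a standard formulation of the Blum & Mitchell conditions; the slide's own formulas are not preserved):

```latex
% View independence given the label:
P(x_1, x_2 \mid y) \;=\; P(x_1 \mid y)\, P(x_2 \mid y)

% Consistency: each view alone is sufficient to predict the true label,
% i.e. there exist classifiers c_1, c_2 on the two views with
c_1(x_1) \;=\; c_2(x_2) \;=\; y \quad \text{for (almost) all instances } x = (x_1, x_2)
```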
Co-Training (figure: iterations t, t+1) • C1: a classifier trained on view 1; C2: a classifier trained on view 2 • Allow C1 to label some instances, and allow C2 to label some instances • Add these self-labeled instances to the pool of training data
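A compact sketch of this loop (the two-view data layout, the classifier interface with confidence and predict, and the number of instances moved per round are all illustrative assumptions):

```python
def co_train(train1, train2, labeled, unlabeled, per_round=10, rounds=5):
    """Two classifiers, each trained on its own view, teach each other."""
    labeled = list(labeled)          # items: ((view1, view2), label)
    unlabeled = list(unlabeled)      # items: (view1, view2)
    for _ in range(rounds):
        c1 = train1([(x[0], y) for x, y in labeled])   # classifier on view 1
        c2 = train2([(x[1], y) for x, y in labeled])   # classifier on view 2
        added = False
        for clf, view in ((c1, 0), (c2, 1)):
            # Let this classifier label the instances it is most confident about.
            unlabeled.sort(key=lambda x: clf.confidence(x[view]), reverse=True)
            picked, unlabeled = unlabeled[:per_round], unlabeled[per_round:]
            labeled.extend((x, clf.predict(x[view])) for x in picked)
            added = added or bool(picked)
        if not added:
            break
    return c1, c2
```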
Outline • Introduction • Semi-supervised Learning for SMT • Background (EM, Self-training, Co-Training) • SSL for Alignments / Phrases / Sentences • Active Learning for SMT • Single-language pair • Multiple Language Pairs
Word Alignment & Translation Quality • (Fraser & Marcu 2006a) presented an SSL method for learning a better word alignment • A small set of sentence pairs annotated with word alignments (~100) and a large unannotated set (~2-3 million) • They showed that improvements in the word alignment led to improvements in BLEU • The same conclusion was later reached in (Ganchev et al 2008) for other translation tasks
Word Alignment Model • Consider a log-linear model over word alignments (one standard form is sketched below) • The feature functions are sub-models used in IBM Model 4, such as • Translation probability t(f|e) • Fertility probability n(φ|e): the number of words generated by e • …
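The slide's equation is not preserved; one standard way to write such a log-linear alignment model, consistent with the sub-models listed above, is the following (the exact form is an assumption):

```latex
p_{\lambda}(f, a \mid e) \;=\;
\frac{\exp\!\Big(\sum_{m} \lambda_{m}\, h_{m}(f, a, e)\Big)}
     {\sum_{f', a'} \exp\!\Big(\sum_{m} \lambda_{m}\, h_{m}(f', a', e)\Big)}
```

where the h_m are (log) sub-model scores such as translation and fertility probabilities, and the λ_m are their weights.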
SS-Word Alignment • (Fraser & Marcu 2006a) tuned the word alignment model weights λ on the small labeled data in a discriminative fashion: • With the current λ, generate the n-best list of alignments • Adjust λ so that the best alignment stands out, i.e. the one which maximizes a modified F-measure (a MERT-style algorithm) • Use λ to find the word alignments of the big unlabeled data • Estimate the feature functions' parameters from these best (Viterbi) alignments: one iteration of the EM algorithm • Repeat the above two steps
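Expressed as a rough loop (a sketch only; every sub-routine on the `components` object is a placeholder with an assumed interface, not the authors' code):

```python
def semi_supervised_alignment(labeled_pairs, unlabeled_pairs, components, iterations=3):
    """Sketch of the alternation between a discriminative tuning step on
    hand-aligned data and an EM step on the large unlabeled bitext.
    `components` bundles the alignment sub-routines (all placeholders):
      nbest(pair, theta, lam)      -> n-best alignments of a sentence pair
      tune(nbests, labeled_pairs)  -> new weights maximizing modified F-measure (MERT-style)
      viterbi(pair, theta, lam)    -> single best alignment
      reestimate(alignments)       -> new sub-model parameters (one EM iteration)
    """
    theta, lam = components.init_theta(), components.init_lam()
    for _ in range(iterations):
        # Discriminative step on the small labeled set.
        nbests = [components.nbest(pair, theta, lam) for pair in labeled_pairs]
        lam = components.tune(nbests, labeled_pairs)
        # Semi-supervised step on the large unlabeled bitext.
        alignments = [components.viterbi(pair, theta, lam) for pair in unlabeled_pairs]
        theta = components.reestimate(alignments)
    return theta, lam
```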
Outline • Introduction • Semi-supervised Learning for SMT • Background (EM, Self-training, Co-Training) • SSL for Alignments / Phrases / Sentences • Active Learning for SMT • Single-language pair • Multiple Language Pairs
Paraphrasing • If a word is unseen, then SMT will not be able to translate it • Keep/omit/transliterate the source word, or use a regular expression to translate it (dates, …) • If a phrase is unseen, but its individual words are seen, then SMT will be less likely to produce a correct translation • The idea: Use paraphrases in the source language to replace unknown words/phrases • Paraphrases are alternative ways of conveying the same information (Callison-Burch, 2007)
Coverage Problem in SMT (figure: percentage of test item types covered vs. training corpus size) (Callison-Burch, 2007)
Behavior on Unseen Data • A system trained on 10,000 sentences (~200,000 words) may translate: Es positivo llegar a un acuerdo sobre los procedimientos, pero debemos encargarnos de que este sistema no sea susceptible de ser usado como arma política. as It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon. • Since the translations of encargarnos and usado were not learned, they are either reproduced in the translation or omitted entirely (Callison-Burch, 2007)
Substituting Paraphrases then Translating It is good reach an agreement on procedures, but we must encargarnos that this system is not susceptible to be usado as political weapon. (Callison-Burch, 2007)
Substituting Paraphrases then Translating It is good reach an agreement on procedures, but we must guarantee that this system is not susceptible to be used as political weapon. (Callison-Burch, 2007)
Learning paraphrases (I) • From monolingual parallel corpora • Multiple source sentences which convey the same information • Extract paraphrases seen in the same context in the aligned source sentences, e.g. burst into tears = cried, comfort = console • Problems with this approach • Monolingual parallel corpora are relatively uncommon • This limits the number of paraphrases we can generate (Callison-Burch, 2007)
Learning paraphrases (I) • From monolingual source corpora • For each unknown phrase x, build a distributional profile DPx which records the co-occurrences of the surrounding words with x • Select the top-k phrases whose distributional profiles are most similar to DPx • Is position important when building the profile? Should we simply count words, or use TF/IDF, or …? Which vector similarity measure should be used? • Needs smart tricks to make it scalable (a toy sketch of the basic idea follows) (Marton et al 2009)
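A toy sketch of the distributional-profile idea, assuming plain window-based counts and cosine similarity (the paper explores position-aware profiles, TF/IDF weighting, other similarity measures, and scalability tricks; none of that is reproduced here):

```python
from collections import Counter
from math import sqrt

def profile(phrase, corpus_sentences, window=3):
    """Count words co-occurring with `phrase` within a fixed window."""
    counts = Counter()
    plen = len(phrase.split())
    for sent in corpus_sentences:
        words = sent.split()
        if phrase not in sent:
            continue
        for i in range(len(words) - plen + 1):
            if " ".join(words[i:i + plen]) == phrase:
                counts.update(words[max(0, i - window):i])          # left context
                counts.update(words[i + plen:i + plen + window])    # right context
    return counts

def cosine(p, q):
    dot = sum(p[w] * q[w] for w in p if w in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def best_paraphrases(unknown_phrase, candidate_phrases, corpus_sentences, k=5):
    """Rank candidates by similarity of their profiles to the unknown phrase's profile."""
    target = profile(unknown_phrase, corpus_sentences)
    scored = [(cosine(target, profile(c, corpus_sentences)), c) for c in candidate_phrases]
    return [c for _, c in sorted(scored, reverse=True)[:k]]
```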
Learning paraphrases (II) • From bilingual parallel corpora • However, we no longer have access to identical contexts • Adopt techniques from phrase-based SMT • Use aligned foreign-language phrases as a pivot (Callison-Burch, 2007)
Paraphrase Probability • Generate multiple paraphrases for a given phrase • We give them probabilities so they can be ranked • Define it via translation model probabilities (the standard pivot formulation is sketched below)
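The equation itself is not preserved on this slide; the standard pivot formulation from this line of work (Bannard & Callison-Burch, 2005) marginalizes over foreign pivot phrases t:

```latex
p(s_2 \mid s_1) \;=\; \sum_{t} p(s_2 \mid t)\; p(t \mid s_1)
```

where the translation model probabilities p(· | ·) are estimated by relative frequency from the phrase-aligned bilingual corpus.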
Refined Paraphrase Probability • Using multiple bilingual corpora, e.g. English-Spanish, English-German, … • C is the set of bilingual corpora, and each corpus c ∈ C is assigned a weight, e.g. we may put more weight on larger corpora (one possible weighted form is sketched below) • Taking word sense into account • In a paraphrase, replace each word with its word_sense item
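One way the corpus-weighted combination could be written (the exact form is an assumption, since the slide's equation is not preserved; ω_c denotes the weight of corpus c and p_c the probabilities estimated from that corpus):

```latex
p(s_2 \mid s_1) \;=\;
\frac{\sum_{c \in C} \omega_c \sum_{t} p_c(s_2 \mid t)\, p_c(t \mid s_1)}
     {\sum_{c \in C} \omega_c}
```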
Plugging Paraphrases into the SMT Model • For each paraphrase s2 having a translation t, we expand the phrase table by adding new entries (t, s1) • Add a new feature function to the SMT log-linear model to take the paraphrase probabilities into account: f(t, s1) = p(s2 | s1) if the phrase table entry (t, s1) was generated from (t, s2), and 1 otherwise (a small sketch of the table expansion follows)
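A minimal sketch of the phrase-table expansion (the table representation and the way the extra feature is stored are simplified assumptions, not the original implementation):

```python
def expand_phrase_table(phrase_table, paraphrases):
    """phrase_table: dict mapping source phrase -> list of (target, features) entries.
    paraphrases: dict mapping source phrase s1 -> list of (s2, p(s2|s1)) pairs.
    New entries copy the translations of s2 over to s1, with the paraphrase
    probability recorded as an extra feature (1.0 for original entries)."""
    expanded = {s: [(t, feats, 1.0) for t, feats in entries]
                for s, entries in phrase_table.items()}
    for s1, candidates in paraphrases.items():
        for s2, prob in candidates:
            for t, feats in phrase_table.get(s2, []):
                # Only add (t, s1) if the table did not already cover s1 -> t.
                if all(existing_t != t for existing_t, _, _ in expanded.get(s1, [])):
                    expanded.setdefault(s1, []).append((t, feats, prob))
    return expanded
```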
Results of Paraphrasing (Callison-Burch, 2007)
Improvement in Coverage (Callison-Burch, 2007)
Triangulation • We can find additional data by focusing on: • Multi-parallel corpora • Collection of bitexts with some common language(s) (Cohn & Lapata, 2007)