COMS E6998: Topics in Computer Science: Machine Translation
February 7, 2013
Reading Set #1
Morphological Processing for Statistical Machine Translation
Presenter: Nizar Habash
Papers Discussed
• Nizar Habash and Fatiha Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation.
• Nimesh Singh and Nizar Habash. 2012. Hebrew Morphological Preprocessing for Statistical Machine Translation.
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
The Basic Idea
• Reducing word sparsity improves translation quality
• This reduction can be achieved by
  • increasing the training data, or
  • morphologically driven preprocessing
Introduction
• Morphologically rich languages are especially challenging for SMT
  • Model sparsity and high OOV rates, especially under low-resource conditions
• A common solution is to tokenize the source words in a preprocessing step
  • Lower OOV rate → better SMT (in terms of BLEU)
  • Increased token symmetry → better SMT models
    • conj+article+noun :: conj article noun
    • wa+Al+kitAb :: and the book
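As a minimal sketch of why decliticization reduces sparsity, the toy Python snippet below (hypothetical word list and simplified splitting rules that ignore real orthographic contractions such as l+Al → ll) collapses several surface forms of ktAb 'book' onto one shared base token:

```python
# Toy illustration (hypothetical word list, simplified rules) of why
# decliticization reduces sparsity: several surface forms of ktAb 'book'
# collapse onto a small set of reusable morpheme tokens.

surface = ["ktAb", "AlktAb", "wAlktAb", "lktAb", "wlAlktAb"]

def split_clitics(word):
    tokens = []
    for pre in ("w", "l"):                 # conjunction w+, preposition l+
        if word.startswith(pre) and len(word) > 3:
            tokens.append(pre + "+")
            word = word[1:]
    if word.startswith("Al") and len(word) > 4:
        tokens.append("Al+")               # definite article
        word = word[2:]
    tokens.append(word)
    return tokens

vocab_before = set(surface)                                   # 5 types
vocab_after = {t for w in surface for t in split_clitics(w)}
# vocab_after == {"w+", "l+", "Al+", "ktAb"}: 4 types cover all 5 words
```

Five surface types reduce to four morpheme types, and any new clitic combination of ktAb reuses existing vocabulary entries instead of becoming an OOV.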
Introduction
• Different tokenizations can be used
• There is no one "correct" tokenization. Tokenizations vary in terms of:
  • Scheme (what to split) and technique (how to split)
  • Accuracy
  • Consistency
  • Sparsity reduction
• The two papers study SMT from Arabic/Hebrew to English under different preprocessing options and other settings
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Linguistic Issues
• Arabic & Hebrew are Semitic languages
  • Root-and-pattern morphology
  • Extensive use of affixes and clitics
• Rich morphology
  • Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]]
    w+ l+ Al+ mktb → and+ for+ the+ office
  • Morphotactics: w+l+Al+mktb → wllmktb
    و+ل+ال+مكتب → وللمكتب
Linguistic Issues
• Orthographic & morphological ambiguity
  • وجدنا wjdnA
    • wjd+nA wajad+nA (we found)
    • w+jd+nA wa+jad~u+nA (and our grandfather)
  • בשורה bšwrh is similarly ambiguous
Arabic Orthographic Ambiguity
  wdrst AltAlbAt AlErbyAt ktAbA bAlSynyp
  w+drs+t Al+Talb+At Al+Erb+y+At ktAb+A b+Al+Syn+y+p
  and+study+they the+student+f.pl. the+Arab+f.pl. book+a in+the+Chinese
  'The Arab students studied a book in Chinese'
Note the extra w+ and the repeated Al+ relative to the English.
An English analogy for undiacritized, clitic-attached writing:
  the+arab students studied a+book in+chinese
  th+rb stdnts stdd +bk n+chns
  thrb stdnts stdd bk nchns
  to+herb so+too+dents studded bake in chains?
Arabic Morphemes
[Table of Arabic morphemes: circumfixes, affixes for verbs and nominals, clitics on everything]
Clitics are optional, affixes are obligatory!
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Approach (Habash & Sadat 2006 / Singh & Habash 2012)
• Preprocessing scheme
  • What to tokenize
• Preprocessing technique
  • How to tokenize
    • Regular expressions
    • Morphological analysis
    • Morphological tagging / disambiguation
    • Unsupervised morphological segmentation
• Scheme and technique are not always independent
Arabic Preprocessing Schemes
• ST: Simple Tokenization
• D1: Decliticize conjunctions w+/f+
• D2: D1 + decliticize particles b+/l+/k+/s+
• D3: D2 + decliticize article Al+ and pronominal clitics
• BW: Morphological stem and affixes
• EN: D3 + lemmatization, English-like POS tags, subject marking
• ON: Orthographic Normalization
• WA: wa+ decliticization only
• TB: Arabic Treebank tokenization
• L1: Lemmatization, Arabic POS tags
• L2: Lemmatization, English-like POS tags

Example input: wsyktbhA? 'and he will write it?'
  ST: wsyktbhA ?
  D1: w+ syktbhA ?
  D2: w+ s+ yktbhA ?
  D3: w+ s+ yktb +hA ?
  BW: w+ s+ y+ ktb +hA ?
  EN: w+ s+ ktb/VBZ S:3MS +hA ?
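The D-schemes can be made concrete with a small sketch. The greedy splitter below is an illustrative approximation in the spirit of the REGEX technique, not the papers' actual implementation: the clitic inventories are abridged, and no morphological disambiguation is done.

```python
# Illustrative greedy D3-style splitter over Buckwalter-transliterated
# Arabic (abridged clitic inventories; a real system would first
# disambiguate morphologically, e.g. with MADA, to avoid over-splitting).

CONJ = "wf"                                # w+ 'and', f+ 'so'
PART = "blks"                              # b+, l+, k+, s+
PRON = ("hA", "hm", "km", "nA", "h", "k")  # a few enclitic pronouns

def tokenize_d3(word):
    tokens = []
    if word[:1] in CONJ and len(word) > 3:   # D1: conjunctions
        tokens.append(word[0] + "+")
        word = word[1:]
    if word[:1] in PART and len(word) > 3:   # D2: particles
        tokens.append(word[0] + "+")
        word = word[1:]
    if word.startswith("Al") and len(word) > 4:  # D3: article
        tokens.append("Al+")
        word = word[2:]
    suffix = []
    for p in PRON:                           # D3: pronominal enclitics
        if word.endswith(p) and len(word) > len(p) + 2:
            suffix.append("+" + p)
            word = word[: -len(p)]
            break
    return tokens + [word] + suffix
```

On the slide's example, `tokenize_d3("wsyktbhA")` yields `["w+", "s+", "yktb", "+hA"]`, matching the D3 row; the minimum-length guards are ad hoc stand-ins for real disambiguation.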
Arabic Preprocessing Techniques
• REGEX: Regular expressions
• BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004)
  • Pick the first analysis
  • Use TOKAN (Habash 2006)
    • A generalized tokenizer
    • Assumes a disambiguated morphological analysis
    • Declarative specification of any preprocessing scheme
• MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005)
  • Multiple SVM classifiers + combiner
  • Selects a BAMA analysis
  • Use TOKAN
Hebrew Preprocessing Techniques/Schemes
• Regular Expressions
  • RegEx-S1 = conjunctions: ו 'and' and ש 'that/who'
  • RegEx-S2 = RegEx-S1 + prepositions: ב 'in', כ 'like/as', ל 'to/for', and מ 'from'
  • RegEx-S3 = RegEx-S2 + the article ה 'the'
  • RegEx-S4 = RegEx-S3 + pronominal enclitics
• Morfessor (Creutz and Lagus, 2007)
  • Morf: unsupervised splitting into morphemes
• Hebrew Morphological Tagger (Adler, 2009)
  • Htag: Hebrew morphological analysis and disambiguation
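The cumulative RegEx-S1..S4 idea can be sketched as follows (shown through S3; S4's pronominal enclitics are omitted, and the minimum-length guard is an illustrative assumption, not the paper's exact rule):

```python
# Sketch of cumulative Hebrew prefix splitting: each scheme strips one
# more class of proclitic letters from the start of the word.

SCHEMES = {
    "S1": ["וש"],                # conjunctions: ו 'and', ש 'that/who'
    "S2": ["וש", "בכלמ"],        # + prepositions ב כ ל מ
    "S3": ["וש", "בכלמ", "ה"],   # + the article ה 'the'
}

def split_prefixes(word, scheme="S3"):
    tokens = []
    for letters in SCHEMES[scheme]:
        # strip at most one letter per clitic class, left to right
        if word and word[0] in letters and len(word) > 2:
            tokens.append(word[0] + "+")
            word = word[1:]
    return tokens + [word]
```

For example, `split_prefixes("והספר", "S3")` yields `["ו+", "ה+", "ספר"]` ('and+ the+ book'), while S1 would split off only the conjunction. As with any blind regex approach, ambiguous word-initial letters get split whether or not they are really clitics.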
Tokenization System Statistics
• More aggressive tokenization schemes have:
  • More tokens
  • More change from the baseline (untokenized) text
  • Fewer OOVs (the baseline OOV rate is 7%)
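The OOV rate here is simply the fraction of test tokens whose type never occurred in training, which a few lines of Python make concrete (toy data, not the papers' corpora):

```python
# OOV rate = fraction of test tokens whose type was never seen in
# training. Tokenizing both sides with the same scheme is what drives
# the OOV reductions reported on this slide.

def oov_rate(train_tokens, test_tokens):
    vocab = set(train_tokens)
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

train = "w+ Al+ ktAb w+ Al+ qlm".split()
test = "w+ Al+ ktAb w+ Al+ byt".split()
rate = oov_rate(train, test)   # 1/6: only 'byt' is unseen
```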
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Arabic-English Experiments
• Portage phrase-based MT (Sadat et al., 2005)
• Training data: 5 million words of parallel text only
  • All in the news genre
  • Learning curve: 1%, 10% and 100% of the data
• Language modeling: 250 million words
• Development (tuning) data: MT03 evaluation set
• Test data: MT04 evaluation set
  • Mixed genre: news, speeches, editorials
• Metric: BLEU (Papineni et al., 2002)
Arabic-English Experiments
• Each experiment:
  • Select a preprocessing scheme
  • Select a preprocessing technique
• Some combinations do not exist
  • e.g., REGEX cannot produce EN, which requires lemmatization and POS tagging
Arabic-English Results
[Chart: BLEU learning curves at 1%, 10%, and 100% of the training data; MADA > BAMA > REGEX]
Hebrew-English Experiments
• Phrase-based statistical MT
  • Moses (Koehn et al., 2007)
  • MERT (Och, 2003) tuned for BLEU (Papineni et al., 2002)
  • Language models: English Gigaword (5-gram) plus the training data (3-gram)
  • True casing for the English output
• Training data: 850,000 words
Hebrew-English Experiments
• Compare seven systems
  • Vary only the preprocessing
  • Baseline, RegEx-S{1-4}, Morf, and Htag
• Metrics
  • BLEU, NIST (Doddington, 2002)
  • METEOR (Banerjee & Lavie, 2005)
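To make the metrics concrete, here is a minimal sketch of BLEU's core ingredient, modified (clipped) n-gram precision, for a single reference and with no brevity penalty:

```python
from collections import Counter

# Modified n-gram precision: candidate n-gram counts are clipped to the
# maximum count of that n-gram in the reference, so repeating a correct
# word cannot inflate the score.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

cand = "the the the cat".split()
ref = "the cat sat".split()
p1 = modified_precision(cand, ref, 1)   # 0.5: 'the' clipped to 1, plus 'cat'
```

Full BLEU combines these precisions geometrically over n = 1..4 and multiplies by a brevity penalty; NIST and METEOR weigh matches differently, which is why the three metrics can disagree (as they do for Morf below).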
Results
• Htag is consistently best, and Morf consistently second best, in terms of BLEU and NIST
• Morf has a very low OOV rate but still does worse than Htag (and even more poorly by METEOR), indicating that it sometimes over-tokenizes
• Within the RegEx schemes, BLEU peaks at S2/S3, similar to Arabic D2 (Habash & Sadat, 2006)
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Conclusions
• Preprocessing improves Arabic-English and Hebrew-English SMT
  • But its value diminishes as more training data is added
• Tokenization with a morphological tagger does best, but requires substantial linguistic knowledge
• Morfessor does quite well with no linguistic information, and significantly reduces OOV (though perhaps erroneously)
• The optimal scheme/technique choice varies with the amount of training data
  • In Arabic, with large training data, splitting off only conjunctions and particles performs best
  • With small training data, an English-like tokenization performs best
Thank you! Questions? Nizar Habash habash@cs.columbia.edu