NLP Meeting 10/19/2006 Morphological Preprocessing for Statistical Machine Translation Nizar Habash Columbia University habash@cs.columbia.edu
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT (Habash & Sadat, NAACL 2006) • Combination of Preprocessing Schemes (Sadat & Habash, ACL 2006)
Why Hybrid MT? • StatMT and RuleMT have complementary advantages • RuleMT: Handling of possible but unseen word forms • StatMT: Robust translation of seen words • RuleMT: Better global target syntactic structure • StatMT: Robust local phrase-based translation • RuleMT: Cross-genre generalizations/robustness • StatMT: Robust within-genre translation • StatMT and RuleMT use complementary resources • Parallel corpora vs. dictionaries, parsers, analyzers, linguists • Hybrids can potentially improve over either approach
Hybrid MT Challenges • Linguistic phrase versus StatMT phrase, e.g. “. on the other hand , the” • Meaningful probabilities for linguistic resources • Increased system complexity • The potential to produce the combined worst rather than the combined best • Low Arabic parsing performance (~70% Parseval F-score) • Statistical hallucinations
Hybrid MT Continuum • “Hybrid” is a moving target • StatMT systems use some rule-based components • Orthographic normalization, number/date translation, etc. • RuleMT systems nowadays use statistical n-gram language modeling • Hybrid MT systems • Different mixes of statistical/rule-based components • Resource availability • General approach directions • Adding rules/linguistics to StatMT systems • Adding statistics/statistical resources to RuleMT systems • Depth of hybridization • Morphology, syntax, semantics
Columbia MT Projects • Arabic-English MT focus • Different hybrid approaches
System Overview • A hybrid scale from RuleMT to StatMT, positioning the Columbia Contrast and Columbia Primary systems and Koehn’s system along it
Research Directions • Syntactic SMT preprocessing • Syntax-aware phrase extraction • Statistical linearization using richer CFGs • Creation and integration of rule-generated phrase-tables • Lowering dependence on source language resources • Extension to other languages and dialects
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes
Arabic Linguistic Issues • Rich Morphology • Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]] e.g. w+ l+ Al+ mktb ‘and+ for+ the+ office’ • Morphotactics: w+l+Al+mktb surfaces as wllmktb وللمكتب (و+ل+ال+مكتب) • Ambiguity • وجد wjd ‘he found’ vs. و+جد w+ jd ‘and+ grandfather’
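The decliticization sketched above can be illustrated with a toy surface-level splitter in the spirit of the REGEX technique discussed later. The function name and its D2-like behavior are illustrative assumptions, and the ambiguity example (wjd vs. w+ jd) shows exactly why surface-only splitting errs:

```python
import re

# Toy D2-style decliticizer: split a conjunction clitic (w+ or f+)
# and the future particle (s+) off a Buckwalter-transliterated token.
# Surface-only splitting like this is deliberately naive.
def decliticize_d2(token):
    m = re.match(r'^(w|f)(s?)(.+)$', token)
    if m and len(m.group(3)) >= 2:
        conj, part, rest = m.groups()
        pieces = [conj + '+']
        if part:
            pieces.append(part + '+')
        pieces.append(rest)
        return pieces
    return [token]

print(decliticize_d2('wsyktbhA'))  # ['w+', 's+', 'yktbhA']
# The ambiguity bites immediately: wjd 'he found' is wrongly split
# as if it were w+ jd 'and + grandfather'.
print(decliticize_d2('wjd'))       # ['w+', 'jd']
```

This failure mode is what motivates the BAMA and MADA techniques below, which consult a morphological analyzer before splitting.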
Previous Work • Morphological & syntactic preprocessing for SMT • French-English (Berger et al., 1994) • German-English (Nießen and Ney 2000; 2004) • Spanish, Catalan and Serbian to English (Popović and Ney, 2004) • Czech-English (Goldwater and McClosky, 2005) • Arabic-English (Lee, 2004) • We focus on morphological preprocessing • Larger set of conditions: schemes, techniques, learning curve, genre variation • No additional kinds of preprocessing (e.g. dates, numbers)
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes
Preprocessing Schemes • ST Simple Tokenization • D1 Decliticize CONJ+ • D2 Decliticize CONJ+, PART+ • D3 Decliticize all clitics • BW Morphological stem and affixes • EN D3, Lemmatize, English-like POS tags, Subj • ON Orthographic Normalization • WA wa+ decliticization • TB Arabic Treebank • L1 Lemmatize, Arabic POS tags • L2 Lemmatize, English-like POS tags Input: wsyktbhA? ‘and he will write it?’ ST wsyktbhA ? D1 w+ syktbhA ? D2 w+ s+ yktbhA ? D3 w+ s+ yktb +hA ? BW w+ s+ y+ ktb +hA ? EN w+ s+ ktb/VBZ S:3MS +hA ?
Preprocessing Schemes • MT04: 1,353 sentences, 36,000 words
Preprocessing Schemes • Scheme Accuracy • Measured against Penn Arabic Treebank
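One hedged reading of “scheme accuracy measured against the Penn Arabic Treebank” is sentence-level exact match of a scheme’s tokenization against the gold segmentation; the sketch below assumes that reading (the actual metric used may be defined differently):

```python
# Hypothetical sketch: score a scheme's tokenizations against gold
# treebank segmentations by sentence-level exact match.
def scheme_accuracy(predicted, gold):
    """predicted, gold: lists of token lists, one entry per sentence."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Second sentence matches, first does not (+hA was left attached).
pred = [['w+', 's+', 'yktbhA', '?'], ['ktb']]
gold = [['w+', 's+', 'yktb', '+hA', '?'], ['ktb']]
print(scheme_accuracy(pred, gold))  # 0.5
```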
Preprocessing Techniques • REGEX: Regular Expressions • BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004) • Pick first analysis • Use TOKAN (Habash 2006) • A generalized tokenizer • Assumes disambiguated morphological analysis • Declarative specification of any preprocessing scheme • MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005) • Multiple SVM classifiers + combiner • Selects BAMA analysis • Use TOKAN
TOKAN • A generalized tokenizer • Assumes disambiguated morphological analysis • Declarative specification of any tokenization scheme • D1 w+ f+ REST • D2 w+ f+ b+ k+ l+ s+ REST • D3 w+ f+ b+ k+ l+ s+ Al+ REST +P: +O: • TB w+ f+ b+ k+ l+ REST +P: +O: • BW MORPH • L1 LEXEME + POS • ENG w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S: • Uses generator (Habash 2006)
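A minimal sketch of the declarative idea behind TOKAN: a scheme names which clitic slots to split off a disambiguated analysis, and one generic routine applies any scheme. All field names and the scheme table below are assumptions for illustration, not TOKAN’s actual specification language:

```python
# Disambiguated analysis of wsyktbhA 'and he will write it'
# (slot names are illustrative assumptions).
ANALYSIS = {'conj': 'w', 'part': 's', 'base': 'yktb', 'pron': 'hA'}

# Each scheme declares which clitic slots to split off.
SCHEMES = {
    'D1': ['conj'],
    'D2': ['conj', 'part'],
    'D3': ['conj', 'part', 'pron'],
}

def tokenize(analysis, scheme):
    split = SCHEMES[scheme]
    toks = []
    for slot in ('conj', 'part'):  # split proclitics become w+ / s+
        if slot in split and analysis.get(slot):
            toks.append(analysis[slot] + '+')
    base = analysis['base']
    # unsplit proclitics stay attached to the base, in surface order
    if 'part' not in split and analysis.get('part'):
        base = analysis['part'] + base
    if 'conj' not in split and analysis.get('conj'):
        base = analysis['conj'] + base
    if 'pron' in split and analysis.get('pron'):
        toks += [base, '+' + analysis['pron']]
    else:
        toks.append(base + analysis.get('pron', ''))
    return toks

print(tokenize(ANALYSIS, 'D2'))  # ['w+', 's+', 'yktbhA']
print(tokenize(ANALYSIS, 'D3'))  # ['w+', 's+', 'yktb', '+hA']
```

The point of the design is that adding a new scheme is a one-line table entry, not new code.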
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes
Experiments • Portage Phrase-based MT (Sadat et al., 2005) • Training Data: 5 million words of parallel text only • All in News genre • Learning curve: 1%, 10% and 100% • Language Modeling: 250 million words • Development/Tuning Data: MT03 Eval Set • Test Data: • MT04 (Mixed genre: news, speeches, editorials) • MT05 (All news)
Experiments (cont’d) • Metric: BLEU (Papineni et al., 2001) • 4 references, case insensitive • Each experiment • Select a preprocessing scheme • Select a preprocessing technique • Some combinations are not possible, e.g. REGEX with EN
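For intuition about the metric, here is a minimal single-reference BLEU sketch: modified n-gram precision for n = 1..4 plus a brevity penalty. The evaluation in these experiments uses 4 references and the standard tooling; this is only the idea, not the exact scorer:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Single-reference, single-sentence BLEU sketch."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        # clipped (modified) n-gram matches
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = max(1, sum(h.values()))
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    # brevity penalty punishes hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(log_prec)

hyp = 'he will write it'.split()
print(bleu(hyp, hyp))  # an identical hypothesis scores 1.0
```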
MT04 Results • BLEU by training-data size (1%, 10%, 100%) for each technique: MADA > BAMA > REGEX
MT05 Results • BLEU by training-data size (1%, 10%, 100%) for each technique: MADA > BAMA > REGEX
MT04 Genre Variation • Best scheme + technique: EN+MADA @ 1% training, D2+MADA @ 100% • Relative BLEU gains of +2%, +12%, +71% and +105% across conditions
Other Results • Orthographic Normalization generally did better than the baseline ST • statistically significant at 1% training data only • wa+ decliticization was generally similar to D1 • Arabic Treebank scheme was similar to D2 • Full lemmatization schemes behaved like EN but always worse • 50% Training data • D2 @ 50% data >= ST @ 100% data • Larger phrase size (14) did not differ significantly from the size 8 we used
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Combination of Preprocessing Schemes
Oracle Combination • Preliminary study: oracle combination • MT04, 100% data, MADA technique, 11 schemes, sentence-level selection • Achieved 46.0 BLEU • (24% relative improvement over the best single system at 37.1)
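The oracle above can be sketched as follows: for each test sentence, pick the scheme whose one-best output scores highest against the reference. The `overlap` scorer below is a stand-in assumption; the actual study selects by sentence-level MT-metric scoring:

```python
# Hypothetical sketch of sentence-level oracle combination across
# scheme-specific systems.
def oracle_select(outputs_by_scheme, refs, score):
    """outputs_by_scheme: {scheme: [one-best hypothesis per sentence]}."""
    chosen = []
    for i, ref in enumerate(refs):
        best = max(outputs_by_scheme,
                   key=lambda s: score(outputs_by_scheme[s][i], ref))
        chosen.append(outputs_by_scheme[best][i])
    return chosen

def overlap(hyp, ref):  # toy per-sentence score: shared word types
    return len(set(hyp.split()) & set(ref.split()))

systems = {'D2': ['and he will write it ?'], 'ST': ['and writes ?']}
refs = ['and he will write it ?']
print(oracle_select(systems, refs, overlap))  # ['and he will write it ?']
```

An oracle needs the reference, so it only bounds what a real combination method could gain; the ROC and DRC methods next try to realize part of that gap without references.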
System Combination • Exploit scheme complementarity to improve MT quality • Explore two methods of system combination • Rescoring-Only Combination (ROC) • Decoding-plus-Rescoring Combination (DRC) • We use all 11 schemes with the MADA technique
Rescoring-Only Combination (ROC) • Rescore all the one-best outputs generated from separate scheme-specific systems and return the top choice • Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables and decoding weights
Rescoring-Only Combination (ROC) • Standard combo • Trigram language model, phrase translation model, distortion model, and sentence length • IBM model 1 and 2 probabilities in both directions • Other combo: add more features • Perplexity of the source sentence (PPL) against a source LM (in the same scheme) • Number of out-of-vocabulary words in the source sentence (OOV) • Source sentence length (SL) • An encoding of the specific scheme (SC)
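The rescoring step can be sketched as a log-linear model: each candidate one-best hypothesis carries a feature vector, and tuned weights pick the overall winner. The feature names and values below are illustrative assumptions, not the system’s actual features or weights:

```python
# Hypothetical sketch of ROC-style log-linear rescoring over the
# one-best outputs of several scheme-specific systems.
def rescore(candidates, weights):
    """candidates: [(hypothesis, {feature: value})]; returns best hyp."""
    def score(feats):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(candidates, key=lambda c: score(c[1]))[0]

# Toy weights and log-probability-style feature values.
weights = {'lm': 0.5, 'tm': 0.4, 'oov': -1.0}
candidates = [
    ('and he will write it ?', {'lm': -3.1, 'tm': -2.0, 'oov': 0}),
    ('and will writes it ?',   {'lm': -5.7, 'tm': -1.8, 'oov': 1}),
]
print(rescore(candidates, weights))  # and he will write it ?
```

In practice the weights themselves are tuned on held-out data (here, the MT03 development set).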
Decoding-plus-Rescoring Combination (DRC) • Step 1: Decode • For each preprocessing scheme • Use the union of phrase tables from all schemes • Optimize and decode (with the same scheme) • Step 2: Rescore • Rescore the one-best outputs of each preprocessing scheme
Results • MT04 set • Best single scheme (D2) scores 37.1 BLEU
Results • Statistical significance using bootstrap re-sampling (Koehn, 2004)
Conclusions • For large amounts of training data, splitting off conjunctions and particles performs best • For small amounts of training data, an English-like tokenization performs best • A suitable choice of preprocessing scheme and technique yields a substantial increase in BLEU score if • there is little training data • there is a change in genre between training and test • System combination is potentially highly rewarding, especially when combining the phrase tables of different preprocessing schemes
Future Work • Study additional variant schemes that current results support • Factored translation modeling • Decoder extension to use multiple schemes in parallel • Syntactic preprocessing • Investigate combination techniques at the sentence and sub-sentence levels
Thank you! Questions? Nizar Habash habash@cs.columbia.edu