
Morphological Preprocessing for Statistical Machine Translation



Presentation Transcript


  1. NLP Meeting 10/19/2006 Morphological Preprocessing for Statistical Machine Translation Nizar Habash Columbia University habash@cs.columbia.edu

  2. Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT (Habash & Sadat, NAACL 2006) • Combination of Preprocessing Schemes (Sadat & Habash, ACL 2006)

  3. Why Hybrid MT? • StatMT and RuleMT have complementary advantages • RuleMT: Handling of possible but unseen word forms • StatMT: Robust translation of seen words • RuleMT: Better global target syntactic structure • StatMT: Robust local phrase-based translation • RuleMT: Cross-genre generalizations/robustness • StatMT: Robust within-genre translation • StatMT and RuleMT use complementary resources • Parallel corpora vs. dictionaries, parsers, analyzers, linguists • Hybrids can potentially improve over either approach

  4. Hybrid MT Challenges • Linguistic phrase versus StatMT phrase “. on the other hand , the” • Meaningful probabilities for linguistic resources • Increased system complexity • The potential to produce the combined worst rather than the combined best • Low Arabic parsing performance (~70% Parseval F-score) • Statistical hallucinations

  5. Hybrid MT Continuum • “Hybrid” is a moving target • StatMT systems use some rule-based components • Orthographic normalization, number/date translation, etc. • RuleMT systems nowadays use statistical n-gram language modeling • Hybrid MT systems • Different mixes of statistical/rule-based components • Resource availability • General approach directions • Adding rules/linguistics to StatMT systems • Adding statistics/statistical resources to RuleMT systems • Depth of hybridization • Morphology, syntax, semantics

  6. Columbia MT Projects • Arabic-English MT focus • Different hybrid approaches


  8. System Overview • [Diagram: a hybrid scale from RuleMT to StatMT, with the Columbia Primary, Columbia Contrast, and Koehn systems placed along it]

  9. Research Directions • Syntactic SMT preprocessing • Syntax-aware phrase extraction • Statistical linearization using richer CFGs • Creation and integration of rule-generated phrase-tables • Lowering dependence on source language resources • Extension to other languages and dialects

  10. Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes

  11. Arabic Linguistic Issues • Rich Morphology • Clitics [CONJ+ [PART+ [DET+ BASE +PRON]]] w+ l+ Al+ mktb and+ for+ the+ office • Morphotactics w+l+Al+mktb → wllmktb (و+ل+ال+مكتب → وللمكتب) • Ambiguity • وجد wjd ‘he found’ • و+جد w+jd ‘and + grandfather’

  12. Previous Work • Morphological & syntactic preprocessing for SMT • French-English (Berger et al., 1994) • German-English (Nießen and Ney, 2000; 2004) • Spanish, Catalan and Serbian to English (Popović and Ney, 2004) • Czech-English (Goldwater and McClosky, 2005) • Arabic-English (Lee, 2004) • We focus on morphological preprocessing • Larger set of conditions: schemes, techniques, learning curve, genre variation • No additional kinds of preprocessing (e.g. dates, numbers)

  13. Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes

  14. Preprocessing Schemes Input: wsyktbhA? ‘and he will write it?’ • ST wsyktbhA ? • D1 w+ syktbhA ? • D2 w+ s+ yktbhA ? • D3 w+ s+ yktb +hA ? • BW w+ s+ y+ ktb +hA ? • EN w+ s+ ktb/VBZ S:3MS +hA ?

  15. Preprocessing Schemes • ST Simple Tokenization • D1 Decliticize CONJ+ • D2 Decliticize CONJ+, PART+ • D3 Decliticize all clitics • BW Morphological stem and affixes • EN D3, Lemmatize, English-like POS tags, Subj • ON Orthographic Normalization • WA wa+ decliticization • TB Arabic Treebank • L1 Lemmatize, Arabic POS tags • L2 Lemmatize, English-like POS tags • Input: wsyktbhA? ‘and he will write it?’ • ST wsyktbhA ? • D1 w+ syktbhA ? • D2 w+ s+ yktbhA ? • D3 w+ s+ yktb +hA ? • BW w+ s+ y+ ktb +hA ? • EN w+ s+ ktb/VBZ S:3MS +hA ?


  22. Preprocessing Schemes • MT04: 1,353 sentences, 36,000 words

  23. Preprocessing Schemes • Scheme Accuracy • Measured against Penn Arabic Treebank

  24. Preprocessing Techniques • REGEX: Regular Expressions • BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter, 2002; 2004) • Pick first analysis • Use TOKAN (Habash, 2006) • A generalized tokenizer • Assumes disambiguated morphological analysis • Declarative specification of any preprocessing scheme • MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow, 2005) • Multiple SVM classifiers + combiner • Selects a BAMA analysis • Use TOKAN
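The REGEX technique can be illustrated with a minimal sketch (Python, Buckwalter transliteration). The patterns and the `d2_split` helper are hypothetical simplifications, not the expressions actually used; as the ambiguity example earlier shows, surface patterns alone can over-split words such as wjd ‘he found’, which is why the BAMA and MADA techniques rely on morphological analysis instead.

```python
import re

# Toy D2-style decliticization in Buckwalter transliteration.
# Illustrative only: a real system needs more careful patterns, or a
# morphological analyzer, to avoid over-splitting words that merely
# begin with these letters.
CONJ = r"[wf]"    # conjunction proclitics: w+ 'and', f+ 'so'
PART = r"[blks]"  # particle proclitics: b+, l+, k+, s+

def d2_split(token):
    """Peel at most one conjunction and one particle clitic off the
    front of a token (hypothetical helper)."""
    m = re.match(rf"^({CONJ})({PART})?(\w+)$", token)
    if not m or len(m.group(3)) < 2:
        return [token]
    parts = [m.group(1) + "+"]
    if m.group(2):
        parts.append(m.group(2) + "+")
    parts.append(m.group(3))
    return parts

print(d2_split("wsyktbhA"))  # ['w+', 's+', 'yktbhA']
print(d2_split("mktb"))      # ['mktb'] (no clitic pattern at the front)
```

Note that this sketch happily splits wjd into w+ jd, the ambiguous reading, which the analyzer-based techniques can avoid.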

  25. TOKAN • A generalized tokenizer • Assumes disambiguated morphological analysis • Declarative specification of any tokenization scheme • D1 w+ f+ REST • D2 w+ f+ b+ k+ l+ s+ REST • D3 w+ f+ b+ k+ l+ s+ Al+ REST +P: +O: • TB w+ f+ b+ k+ l+ REST +P: +O: • BW MORPH • L1 LEXEME + POS • ENG w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S: • Uses generator (Habash 2006)
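The declarative idea behind TOKAN can be sketched as a toy interpreter over a disambiguated analysis: a scheme is essentially the set of clitics it splits off. The `tokenize` function, the analysis format, and the kind labels below are illustrative assumptions, not TOKAN's actual specification syntax, and morphotactic adjustments (e.g. w+l+Al+mktb being spelled wllmktb) are ignored.

```python
def tokenize(analysis, split):
    """analysis: ordered [(morpheme, kind)] pairs with kind in
    {'pro', 'base', 'enc'}; split: the set of clitics this scheme
    splits off. Unsplit clitics are re-attached to the base form."""
    proclitics, base, enclitics = [], "", []
    for morpheme, kind in analysis:
        if kind == "pro" and morpheme in split:
            proclitics.append(morpheme + "+")
        elif kind == "enc" and morpheme in split:
            enclitics.append("+" + morpheme)
        else:
            base += morpheme
    return proclitics + [base] + enclitics

# disambiguated analysis of wsyktbhA 'and he will write it'
analysis = [("w", "pro"), ("s", "pro"), ("yktb", "base"), ("hA", "enc")]
print(tokenize(analysis, {"w", "s", "hA"}))  # D3: ['w+', 's+', 'yktb', '+hA']
print(tokenize(analysis, {"w"}))             # D1: ['w+', 'syktbhA']
print(tokenize(analysis, set()))             # ST: ['wsyktbhA']
```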

  26. Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes

  27. Experiments • Portage Phrase-based MT (Sadat et al., 2005) • Training Data: parallel 5 Million words only • All in News genre • Learning curve: 1%, 10% and 100% • Language Modeling: 250 Million words • Development Tuning Data: MT03 Eval Set • Test Data: • MT04 (Mixed genre: news, speeches, editorials) • MT05 (All news)

  28. Experiments (cont’d) • Metric: BLEU (Papineni et al., 2001) • 4 references, case insensitive • Each experiment • Select a preprocessing scheme • Select a preprocessing technique • Some combinations do not exist • e.g. REGEX with EN
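For reference, BLEU is a geometric mean of modified n-gram precisions times a brevity penalty. The sketch below is a simplified single-segment version with no smoothing (a zero n-gram match gives BLEU = 0); the experiments above score whole test sets against 4 references per segment.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=4):
    """Simplified BLEU: geometric mean of modified n-gram precisions
    (counts clipped by the maximum count in any reference) times a
    brevity penalty against the closest reference length."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        if not hyp_ngrams:
            return 0.0
        max_ref = Counter()
        for r in refs:
            for ng, c in ngrams(r, n).items():
                max_ref[ng] = max(max_ref[ng], c)
        clipped = sum(min(c, max_ref[ng]) for ng, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / sum(hyp_ngrams.values()))
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("the cat sat on the mat", ["the cat sat on the mat"]), 3))        # 1.0
print(round(bleu("the cat sat on the mat today", ["the cat sat on the mat"]), 3))  # 0.809
```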

  29. MT04 Results • [Chart: BLEU by training-data size (1%, 10%, 100%) for each scheme, by technique: MADA > BAMA > REGEX]

  30. MT05 Results • [Chart: BLEU by training-data size (1%, 10%, 100%) for each scheme, by technique: MADA > BAMA > REGEX]

  31. + 2% + 12% + 71% + 105% MT04 Genre VariationBest Schemes + Technique EN+MADA @ 1%, D2+MADA @ 100% BLEU

  32. Other Results • Orthographic Normalization (ON) generally did better than the baseline ST • statistically significant at 1% training data only • wa+ decliticization (WA) was generally similar to D1 • The Arabic Treebank scheme (TB) was similar to D2 • Full lemmatization schemes (L1, L2) behaved like EN but always worse • 50% training data • D2 @ 50% data >= ST @ 100% data • Larger phrase size (14) did not differ significantly from the size 8 we used

  33. Latest Results (July 2006)

  34. Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Combination of Preprocessing Schemes

  35. Oracle Combination • Preliminary study: oracle combination • On MT04, 100% data, MADA technique, 11 schemes, sentence-level selection • Achieved 46.0 BLEU • (24% improvement over the best single system at 37.1)
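Sentence-level oracle selection amounts to picking, for each source sentence, the output of whichever scheme-specific system scores best against the references. A minimal sketch, assuming some sentence-level metric passed in as `sent_score` (the toy unigram-overlap metric here is purely illustrative):

```python
def oracle_combine(system_outputs, references, sent_score):
    """system_outputs: {scheme: [hyp_1, ..., hyp_N]} one-best per sentence;
    references: [refs_1, ..., refs_N] reference set per sentence.
    Returns the per-sentence best outputs, an upper bound on what any
    combination method could achieve."""
    oracle = []
    for i in range(len(references)):
        best = max(system_outputs,
                   key=lambda s: sent_score(system_outputs[s][i], references[i]))
        oracle.append(system_outputs[best][i])
    return oracle

# toy sentence-level metric: unigram overlap with the first reference
def overlap(hyp, refs):
    return len(set(hyp.split()) & set(refs[0].split()))

outputs = {"D2": ["a b c", "x y"], "EN": ["a b", "x y z"]}
refs = [["a b c"], ["x y z"]]
print(oracle_combine(outputs, refs, overlap))  # ['a b c', 'x y z']
```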

  36. System Combination • Exploit scheme complementarity to improve MT quality • Explore two methods of system combination • Rescoring-Only Combination (ROC) • Decoding-plus-Rescoring Combination (DRC) • We use all 11 schemes with the MADA technique

  37. Rescoring-Only Combination(ROC) • Rescore all the one-best outputs generated from separate scheme-specific systems and return the top choice • Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables and decoding weights

  38. Rescoring-Only Combination (ROC) • Standard combo • Trigram language model, phrase translation model, distortion model, and sentence length • IBM model 1 and 2 probabilities in both directions • Other combo: add more features • Perplexity of the source sentence (PPL) against a source LM (in the same scheme) • Number of out-of-vocabulary words in the source sentence (OOV) • Source sentence length (SL) • An encoding of the specific scheme (SC)
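The rescoring step is a log-linear combination over candidate feature vectors: each pooled one-best output carries its feature scores, and the combination returns the candidate with the highest weighted sum. A minimal sketch, with illustrative feature names and values rather than the exact feature set used:

```python
def rescore(candidates, weights):
    """Return the candidate maximizing the weighted sum of its features.
    Each candidate is a dict with a 'features' map; missing weights
    default to 0 (i.e. the feature is ignored)."""
    def score(cand):
        return sum(weights.get(f, 0.0) * v
                   for f, v in cand["features"].items())
    return max(candidates, key=score)

# two pooled one-best candidates with illustrative log-scores
candidates = [
    {"hyp": "translation A", "features": {"lm": -2.0, "tm": -1.0}},
    {"hyp": "translation B", "features": {"lm": -1.0, "tm": -1.5}},
]
weights = {"lm": 1.0, "tm": 1.0}
print(rescore(candidates, weights)["hyp"])  # translation B
```

In practice the weights themselves are tuned on a development set (MT03 here), not fixed by hand.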

  39. Decoding-plus-Rescoring Combination (DRC) • Step 1: Decode • For each preprocessing scheme • Use the union of phrase tables from all schemes • Optimize and decode (with the same scheme) • Step 2: Rescore • Rescore the one-best outputs of each preprocessing scheme
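The phrase-table union in Step 1 can be sketched as simple pooling. This toy keeps a single score per phrase pair and takes the maximum where schemes overlap, which is an assumption for illustration: real phrase tables carry several feature columns per pair.

```python
def union_phrase_tables(tables):
    """Pool phrase tables from all scheme-specific systems (DRC Step 1).
    Toy version: one score per (source, target) phrase pair, keeping the
    maximum where schemes overlap."""
    merged = {}
    for table in tables:
        for pair, score in table.items():
            merged[pair] = max(score, merged.get(pair, 0.0))
    return merged

# illustrative fragments of two scheme-specific tables
d2_table = {("mktb", "office"): 0.6, ("w+", "and"): 0.9}
d3_table = {("mktb", "office"): 0.7, ("+hA", "it"): 0.8}
merged = union_phrase_tables([d2_table, d3_table])
print(merged[("mktb", "office")])  # 0.7
```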

  40. Results • MT04 set • Best single scheme D2 scores 37.1

  41. Results • Statistical significance using bootstrap re-sampling (Koehn, 2004)

  42. Conclusions • For large amounts of training data, splitting off conjunctions and particles performs best • For small amounts of training data, an English-like tokenization performs best • A suitable choice of preprocessing scheme and technique yields an important increase in BLEU score when • there is little training data • there is a change in genre between training and test • System combination is potentially highly rewarding, especially when combining the phrase tables of different preprocessing schemes

  43. Future Work • Study additional variant schemes that current results support • Factored translation modeling • Decoder extension to use multiple schemes in parallel • Syntactic preprocessing • Investigate combination techniques at the sentence and sub-sentence levels

  44. Thank you! Questions? Nizar Habash habash@cs.columbia.edu
