NLP Meeting 10/19/2006 Morphological Preprocessing for Statistical Machine Translation Nizar Habash Columbia University habash@cs.columbia.edu
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT (Habash & Sadat, NAACL 2006) • Combination of Preprocessing Schemes (Sadat & Habash, ACL 2006)
Why Hybrid MT? • StatMT and RuleMT have complementary advantages • RuleMT: Handling of possible but unseen word forms • StatMT: Robust translation of seen words • RuleMT: Better global target syntactic structure • StatMT: Robust local phrase-based translation • RuleMT: Cross-genre generalizations/robustness • StatMT: Robust within-genre translation • StatMT and RuleMT use complementary resources • Parallel corpora vs. dictionaries, parsers, analyzers, linguists • Hybrids can potentially improve over either approach
Hybrid MT Challenges • Linguistic phrase versus StatMT phrase, e.g. “. on the other hand , the” • Meaningful probabilities for linguistic resources • Increased system complexity • The potential to produce the combined worst rather than the combined best • Low Arabic parsing performance (~70% Parseval F-score) • Statistical hallucinations
Hybrid MT Continuum • “Hybrid” is a moving target • StatMT systems use some rule-based components • Orthographic normalization, number/date translation, etc. • RuleMT systems nowadays use statistical n-gram language modeling • Hybrid MT systems • Different mixes of statistical/rule-based components • Resource availability • General approach directions • Adding rules/linguistics to StatMT systems • Adding statistics/statistical resources to RuleMT systems • Depth of hybridization • Morphology, syntax, semantics
Columbia MT Projects • Arabic-English MT focus • Different hybrid approaches
System Overview • A hybrid scale from RuleMT to StatMT, positioning the Columbia Contrast and Columbia Primary systems and Koehn’s system along it
Research Directions • Syntactic SMT preprocessing • Syntax-aware phrase extraction • Statistical linearization using richer CFGs • Creation and integration of rule-generated phrase-tables • Lowering dependence on source language resources • Extension to other languages and dialects
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes
Arabic Linguistic Issues • Rich Morphology • Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]] e.g. w+ l+ Al+ mktb ‘and+ for+ the+ office’ • Morphotactics: w+l+Al+mktb surfaces as wllmktb وللمكتب (و+ل+ال+مكتب) • Ambiguity • وجد wjd ‘he found’ vs. و+جد w+ jd ‘and+ grandfather’
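The decliticization sketched above can be illustrated with a toy surface-level splitter in the spirit of the REGEX technique discussed later. The function name and its D2-like behavior are illustrative assumptions, and the ambiguity example (wjd vs. w+ jd) shows exactly why surface-only splitting errs:

```python
import re

# Toy D2-style decliticizer: split a conjunction clitic (w+ or f+)
# and the future particle (s+) off a Buckwalter-transliterated token.
# Surface-only splitting like this is deliberately naive.
def decliticize_d2(token):
    m = re.match(r'^(w|f)(s?)(.+)$', token)
    if m and len(m.group(3)) >= 2:
        conj, part, rest = m.groups()
        pieces = [conj + '+']
        if part:
            pieces.append(part + '+')
        pieces.append(rest)
        return pieces
    return [token]

print(decliticize_d2('wsyktbhA'))  # ['w+', 's+', 'yktbhA']
# The ambiguity bites immediately: wjd 'he found' is wrongly split
# as if it were w+ jd 'and + grandfather'.
print(decliticize_d2('wjd'))       # ['w+', 'jd']
```

This failure mode is what motivates the BAMA and MADA techniques below, which consult a morphological analyzer before splitting.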
Previous Work • Morphological & syntactic preprocessing for SMT • French-English (Berger et al., 1994) • German-English (Nießen and Ney 2000; 2004) • Spanish, Catalan and Serbian to English (Popović and Ney, 2004) • Czech-English (Goldwater and McClosky, 2005) • Arabic-English (Lee, 2004) • We focus on morphological preprocessing • Larger set of conditions: schemes, techniques, learning curve, genre variation • No additional kinds of preprocessing (e.g. dates, numbers)
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes
Preprocessing Schemes • ST Simple Tokenization • D1 Decliticize CONJ+ • D2 Decliticize CONJ+, PART+ • D3 Decliticize all clitics • BW Morphological stem and affixes • EN D3, Lemmatize, English-like POS tags, Subj • ON Orthographic Normalization • WA wa+ decliticization • TB Arabic Treebank • L1 Lemmatize, Arabic POS tags • L2 Lemmatize, English-like POS tags Input: wsyktbhA? ‘and he will write it?’ ST wsyktbhA ? D1 w+ syktbhA ? D2 w+ s+ yktbhA ? D3 w+ s+ yktb +hA ? BW w+ s+ y+ ktb +hA ? EN w+ s+ ktb/VBZ S:3MS +hA ?
Preprocessing Schemes • MT04: 1,353 sentences, 36,000 words
Preprocessing Schemes • Scheme Accuracy • Measured against Penn Arabic Treebank
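One hedged reading of “scheme accuracy measured against the Penn Arabic Treebank” is sentence-level exact match of a scheme’s tokenization against the gold segmentation; the sketch below assumes that reading (the actual metric used may be defined differently):

```python
# Hypothetical sketch: score a scheme's tokenizations against gold
# treebank segmentations by sentence-level exact match.
def scheme_accuracy(predicted, gold):
    """predicted, gold: lists of token lists, one entry per sentence."""
    assert len(predicted) == len(gold)
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Second sentence matches, first does not (+hA was left attached).
pred = [['w+', 's+', 'yktbhA', '?'], ['ktb']]
gold = [['w+', 's+', 'yktb', '+hA', '?'], ['ktb']]
print(scheme_accuracy(pred, gold))  # 0.5
```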
Preprocessing Techniques • REGEX: Regular Expressions • BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004) • Pick first analysis • Use TOKAN (Habash 2006) • A generalized tokenizer • Assumes disambiguated morphological analysis • Declarative specification of any preprocessing scheme • MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005) • Multiple SVM classifiers + combiner • Selects BAMA analysis • Use TOKAN
TOKAN • A generalized tokenizer • Assumes disambiguated morphological analysis • Declarative specification of any tokenization scheme • D1 w+ f+ REST • D2 w+ f+ b+ k+ l+ s+ REST • D3 w+ f+ b+ k+ l+ s+ Al+ REST +P: +O: • TB w+ f+ b+ k+ l+ REST +P: +O: • BW MORPH • L1 LEXEME + POS • ENG w+ f+ b+ k+ l+ s+ Al+ LEXEME + BIESPOS +S: • Uses generator (Habash 2006)
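A minimal sketch of the declarative idea behind TOKAN: a scheme names which clitic slots to split off a disambiguated analysis, and one generic routine applies any scheme. All field names and the scheme table below are assumptions for illustration, not TOKAN’s actual specification language:

```python
# Disambiguated analysis of wsyktbhA 'and he will write it'
# (slot names are illustrative assumptions).
ANALYSIS = {'conj': 'w', 'part': 's', 'base': 'yktb', 'pron': 'hA'}

# Each scheme declares which clitic slots to split off.
SCHEMES = {
    'D1': ['conj'],
    'D2': ['conj', 'part'],
    'D3': ['conj', 'part', 'pron'],
}

def tokenize(analysis, scheme):
    split = SCHEMES[scheme]
    toks = []
    for slot in ('conj', 'part'):  # split proclitics become w+ / s+
        if slot in split and analysis.get(slot):
            toks.append(analysis[slot] + '+')
    base = analysis['base']
    # unsplit proclitics stay attached to the base, in surface order
    if 'part' not in split and analysis.get('part'):
        base = analysis['part'] + base
    if 'conj' not in split and analysis.get('conj'):
        base = analysis['conj'] + base
    if 'pron' in split and analysis.get('pron'):
        toks += [base, '+' + analysis['pron']]
    else:
        toks.append(base + analysis.get('pron', ''))
    return toks

print(tokenize(ANALYSIS, 'D2'))  # ['w+', 's+', 'yktbhA']
print(tokenize(ANALYSIS, 'D3'))  # ['w+', 's+', 'yktb', '+hA']
```

The point of the design is that adding a new scheme is a one-line table entry, not new code.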
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Linguistic Issues • Previous Work • Schemes and Techniques • Evaluation • Combination of Preprocessing Schemes
Experiments • Portage Phrase-based MT (Sadat et al., 2005) • Training Data: 5 million words of parallel text only • All in News genre • Learning curve: 1%, 10% and 100% • Language Modeling: 250 million words • Development/Tuning Data: MT03 Eval Set • Test Data: • MT04 (Mixed genre: news, speeches, editorials) • MT05 (All news)
Experiments (cont’d) • Metric: BLEU (Papineni et al., 2001) • 4 references, case insensitive • Each experiment • Select a preprocessing scheme • Select a preprocessing technique • Some combinations are not possible, e.g. REGEX with EN
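For intuition about the metric, here is a minimal single-reference BLEU sketch: modified n-gram precision for n = 1..4 plus a brevity penalty. The evaluation in these experiments uses 4 references and the standard tooling; this is only the idea, not the exact scorer:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Single-reference, single-sentence BLEU sketch."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        # clipped (modified) n-gram matches
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = max(1, sum(h.values()))
        log_prec += math.log(max(overlap, 1e-9) / total) / max_n
    # brevity penalty punishes hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return bp * math.exp(log_prec)

hyp = 'he will write it'.split()
print(bleu(hyp, hyp))  # an identical hypothesis scores 1.0
```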
MT04 Results • BLEU by training-data size (1%, 10%, 100%) for each technique: MADA > BAMA > REGEX
MT05 Results • BLEU by training-data size (1%, 10%, 100%) for each technique: MADA > BAMA > REGEX
MT04 Genre Variation • Best scheme + technique: EN+MADA @ 1% training, D2+MADA @ 100% • Relative BLEU gains of +2%, +12%, +71% and +105% across conditions
Other Results • Orthographic Normalization generally did better than the baseline ST • statistically significant at 1% training data only • wa+ decliticization was generally similar to D1 • Arabic Treebank scheme was similar to D2 • Full lemmatization schemes behaved like EN but always worse • 50% Training data • D2 @ 50% data >= ST @ 100% data • Larger phrase size (14) did not differ significantly from the size 8 we used
Road Map • Hybrid MT Research @ Columbia • Morphological Preprocessing for SMT • Combination of Preprocessing Schemes
Oracle Combination • Preliminary study: oracle combination • MT04, 100% data, MADA technique, 11 schemes, sentence-level selection • Achieved 46.0 BLEU • (24% relative improvement over the best single system at 37.1)
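The oracle above can be sketched as follows: for each test sentence, pick the scheme whose one-best output scores highest against the reference. The `overlap` scorer below is a stand-in assumption; the actual study selects by sentence-level MT-metric scoring:

```python
# Hypothetical sketch of sentence-level oracle combination across
# scheme-specific systems.
def oracle_select(outputs_by_scheme, refs, score):
    """outputs_by_scheme: {scheme: [one-best hypothesis per sentence]}."""
    chosen = []
    for i, ref in enumerate(refs):
        best = max(outputs_by_scheme,
                   key=lambda s: score(outputs_by_scheme[s][i], ref))
        chosen.append(outputs_by_scheme[best][i])
    return chosen

def overlap(hyp, ref):  # toy per-sentence score: shared word types
    return len(set(hyp.split()) & set(ref.split()))

systems = {'D2': ['and he will write it ?'], 'ST': ['and writes ?']}
refs = ['and he will write it ?']
print(oracle_select(systems, refs, overlap))  # ['and he will write it ?']
```

An oracle needs the reference, so it only bounds what a real combination method could gain; the ROC and DRC methods next try to realize part of that gap without references.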
System Combination • Exploit scheme complementarity to improve MT quality • Explore two methods of system combination • Rescoring-Only Combination (ROC) • Decoding-plus-Rescoring Combination (DRC) • We use all 11 schemes with the MADA technique
Rescoring-Only Combination (ROC) • Rescore all the one-best outputs generated from separate scheme-specific systems and return the top choice • Each scheme-specific system uses its own scheme-specific preprocessing, phrase tables and decoding weights
Rescoring-Only Combination (ROC) • Standard combo • Trigram language model, phrase translation model, distortion model, and sentence length • IBM model 1 and 2 probabilities in both directions • Other combo: add more features • Perplexity of the source sentence (PPL) against a source LM (in the same scheme) • Number of out-of-vocabulary words in the source sentence (OOV) • Source sentence length (SL) • An encoding of the specific scheme (SC)
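The rescoring step can be sketched as a log-linear model: each candidate one-best hypothesis carries a feature vector, and tuned weights pick the overall winner. The feature names and values below are illustrative assumptions, not the system’s actual features or weights:

```python
# Hypothetical sketch of ROC-style log-linear rescoring over the
# one-best outputs of several scheme-specific systems.
def rescore(candidates, weights):
    """candidates: [(hypothesis, {feature: value})]; returns best hyp."""
    def score(feats):
        return sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return max(candidates, key=lambda c: score(c[1]))[0]

# Toy weights and log-probability-style feature values.
weights = {'lm': 0.5, 'tm': 0.4, 'oov': -1.0}
candidates = [
    ('and he will write it ?', {'lm': -3.1, 'tm': -2.0, 'oov': 0}),
    ('and will writes it ?',   {'lm': -5.7, 'tm': -1.8, 'oov': 1}),
]
print(rescore(candidates, weights))  # and he will write it ?
```

In practice the weights themselves are tuned on held-out data (here, the MT03 development set).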
Decoding-plus-Rescoring Combination (DRC) • Step 1: Decode • For each preprocessing scheme • Use the union of phrase tables from all schemes • Optimize and decode (with the same scheme) • Step 2: Rescore • Rescore the one-best outputs of each preprocessing scheme
Results • MT04 set • Best single scheme (D2) scores 37.1 BLEU
Results • Statistical significance using bootstrap re-sampling (Koehn, 2004)
Conclusions • For large amounts of training data, splitting off conjunctions and particles performs best • For small amounts of training data, an English-like tokenization performs best • A suitable choice of preprocessing scheme and technique yields a substantial increase in BLEU score if • there is little training data • there is a change in genre between training and test • System combination is potentially highly rewarding, especially when combining the phrase tables of different preprocessing schemes
Future Work • Study additional variant schemes that current results support • Factored translation modeling • Decoder extension to use multiple schemes in parallel • Syntactic preprocessing • Investigate combination techniques at the sentence and sub-sentence levels
Thank you! Questions? Nizar Habash habash@cs.columbia.edu