COMS E6998: Topics in Computer Science: Machine Translation
February 7, 2013
Reading Set #1
Morphological Processing for Statistical Machine Translation
Presenter: Nizar Habash
Papers Discussed
• Nizar Habash and Fatiha Sadat. 2006. Arabic Preprocessing Schemes for Statistical Machine Translation.
• Nimesh Singh and Nizar Habash. 2012. Hebrew Morphological Preprocessing for Statistical Machine Translation.
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
The Basic Idea
• Reducing word sparsity improves translation quality
• This reduction can be achieved by
  • increasing the training data, or
  • morphologically driven preprocessing
Introduction
• Morphologically rich languages are especially challenging for SMT
  • Model sparsity and high OOV rates, especially under low-resource conditions
• A common solution is to tokenize the source words in a preprocessing step
  • Lower OOV rate → better SMT (in terms of BLEU)
  • Increased token symmetry → better SMT models
    • conj+article+noun :: conj article noun
    • wa+Al+kitAb :: and the book
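As a minimal sketch of why decliticization reduces sparsity, the toy Python snippet below (hypothetical word list and simplified splitting rules that ignore real orthographic contractions such as l+Al → ll) collapses several surface forms of ktAb 'book' onto one shared base token:

```python
# Toy illustration (hypothetical word list, simplified rules) of why
# decliticization reduces sparsity: several surface forms of ktAb 'book'
# collapse onto a small set of reusable morpheme tokens.

surface = ["ktAb", "AlktAb", "wAlktAb", "lktAb", "wlAlktAb"]

def split_clitics(word):
    tokens = []
    for pre in ("w", "l"):                 # conjunction w+, preposition l+
        if word.startswith(pre) and len(word) > 3:
            tokens.append(pre + "+")
            word = word[1:]
    if word.startswith("Al") and len(word) > 4:
        tokens.append("Al+")               # definite article
        word = word[2:]
    tokens.append(word)
    return tokens

vocab_before = set(surface)                                   # 5 types
vocab_after = {t for w in surface for t in split_clitics(w)}
# vocab_after == {"w+", "l+", "Al+", "ktAb"}: 4 types cover all 5 words
```

Five surface types reduce to four morpheme types, and any new clitic combination of ktAb reuses existing vocabulary entries instead of becoming an OOV.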
Introduction
• Different tokenizations can be used
• There is no one "correct" tokenization. Tokenizations vary in terms of:
  • Scheme (what to split) and technique (how to split)
  • Accuracy
  • Consistency
  • Sparsity reduction
• The two papers study SMT from Arabic/Hebrew to English under different preprocessing options and other settings
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Linguistic Issues
• Arabic & Hebrew are Semitic languages
  • Root-and-pattern morphology
  • Extensive use of affixes and clitics
• Rich morphology
  • Clitics: [CONJ+ [PART+ [DET+ BASE +PRON]]]
    w+ l+ Al+ mktb → and+ for+ the+ office
  • Morphotactics: w+l+Al+mktb → wllmktb
    و+ل+ال+مكتب → وللمكتب
Linguistic Issues
• Orthographic & morphological ambiguity
  • وجدنا wjdnA
    • wjd+nA wajad+nA (we found)
    • w+jd+nA wa+jad~u+nA (and our grandfather)
  • בשורה bšwrh is similarly ambiguous
Arabic Orthographic Ambiguity
  wdrst AltAlbAt AlErbyAt ktAbA bAlSynyp
  w+drs+t Al+Talb+At Al+Erb+y+At ktAb+A b+Al+Syn+y+p
  and+study+they the+student+f.pl. the+Arab+f.pl. book+a in+the+Chinese
  'The Arab students studied a book in Chinese'
Note the extra w+ and the repeated Al+ relative to the English.
An English analogy for undiacritized, clitic-attached writing:
  the+arab students studied a+book in+chinese
  th+rb stdnts stdd +bk n+chns
  thrb stdnts stdd bk nchns
  to+herb so+too+dents studded bake in chains?
Arabic Morphemes
[Table of Arabic morphemes: circumfixes, affixes for verbs and nominals, clitics on everything]
Clitics are optional, affixes are obligatory!
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Approach (Habash & Sadat 2006 / Singh & Habash 2012)
• Preprocessing scheme
  • What to tokenize
• Preprocessing technique
  • How to tokenize
    • Regular expressions
    • Morphological analysis
    • Morphological tagging / disambiguation
    • Unsupervised morphological segmentation
• Scheme and technique are not always independent
Arabic Preprocessing Schemes
• ST: Simple Tokenization
• D1: Decliticize conjunctions w+/f+
• D2: D1 + decliticize particles b+/l+/k+/s+
• D3: D2 + decliticize article Al+ and pronominal clitics
• BW: Morphological stem and affixes
• EN: D3 + lemmatization, English-like POS tags, subject marking
• ON: Orthographic Normalization
• WA: wa+ decliticization only
• TB: Arabic Treebank tokenization
• L1: Lemmatization, Arabic POS tags
• L2: Lemmatization, English-like POS tags

Example input: wsyktbhA? 'and he will write it?'
  ST: wsyktbhA ?
  D1: w+ syktbhA ?
  D2: w+ s+ yktbhA ?
  D3: w+ s+ yktb +hA ?
  BW: w+ s+ y+ ktb +hA ?
  EN: w+ s+ ktb/VBZ S:3MS +hA ?
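The D-schemes can be made concrete with a small sketch. The greedy splitter below is an illustrative approximation in the spirit of the REGEX technique, not the papers' actual implementation: the clitic inventories are abridged, and no morphological disambiguation is done.

```python
# Illustrative greedy D3-style splitter over Buckwalter-transliterated
# Arabic (abridged clitic inventories; a real system would first
# disambiguate morphologically, e.g. with MADA, to avoid over-splitting).

CONJ = "wf"                                # w+ 'and', f+ 'so'
PART = "blks"                              # b+, l+, k+, s+
PRON = ("hA", "hm", "km", "nA", "h", "k")  # a few enclitic pronouns

def tokenize_d3(word):
    tokens = []
    if word[:1] in CONJ and len(word) > 3:   # D1: conjunctions
        tokens.append(word[0] + "+")
        word = word[1:]
    if word[:1] in PART and len(word) > 3:   # D2: particles
        tokens.append(word[0] + "+")
        word = word[1:]
    if word.startswith("Al") and len(word) > 4:  # D3: article
        tokens.append("Al+")
        word = word[2:]
    suffix = []
    for p in PRON:                           # D3: pronominal enclitics
        if word.endswith(p) and len(word) > len(p) + 2:
            suffix.append("+" + p)
            word = word[: -len(p)]
            break
    return tokens + [word] + suffix
```

On the slide's example, `tokenize_d3("wsyktbhA")` yields `["w+", "s+", "yktb", "+hA"]`, matching the D3 row; the minimum-length guards are ad hoc stand-ins for real disambiguation.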
Arabic Preprocessing Techniques
• REGEX: Regular expressions
• BAMA: Buckwalter Arabic Morphological Analyzer (Buckwalter 2002; 2004)
  • Pick the first analysis
  • Use TOKAN (Habash 2006)
    • A generalized tokenizer
    • Assumes a disambiguated morphological analysis
    • Declarative specification of any preprocessing scheme
• MADA: Morphological Analysis and Disambiguation for Arabic (Habash & Rambow 2005)
  • Multiple SVM classifiers + combiner
  • Selects a BAMA analysis
  • Use TOKAN
Hebrew Preprocessing Techniques/Schemes
• Regular Expressions
  • RegEx-S1 = conjunctions: ו 'and' and ש 'that/who'
  • RegEx-S2 = RegEx-S1 + prepositions: ב 'in', כ 'like/as', ל 'to/for', and מ 'from'
  • RegEx-S3 = RegEx-S2 + the article ה 'the'
  • RegEx-S4 = RegEx-S3 + pronominal enclitics
• Morfessor (Creutz and Lagus, 2007)
  • Morf: unsupervised splitting into morphemes
• Hebrew Morphological Tagger (Adler, 2009)
  • Htag: Hebrew morphological analysis and disambiguation
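The cumulative RegEx-S1..S4 idea can be sketched as follows (shown through S3; S4's pronominal enclitics are omitted, and the minimum-length guard is an illustrative assumption, not the paper's exact rule):

```python
# Sketch of cumulative Hebrew prefix splitting: each scheme strips one
# more class of proclitic letters from the start of the word.

SCHEMES = {
    "S1": ["וש"],                # conjunctions: ו 'and', ש 'that/who'
    "S2": ["וש", "בכלמ"],        # + prepositions ב כ ל מ
    "S3": ["וש", "בכלמ", "ה"],   # + the article ה 'the'
}

def split_prefixes(word, scheme="S3"):
    tokens = []
    for letters in SCHEMES[scheme]:
        # strip at most one letter per clitic class, left to right
        if word and word[0] in letters and len(word) > 2:
            tokens.append(word[0] + "+")
            word = word[1:]
    return tokens + [word]
```

For example, `split_prefixes("והספר", "S3")` yields `["ו+", "ה+", "ספר"]` ('and+ the+ book'), while S1 would split off only the conjunction. As with any blind regex approach, ambiguous word-initial letters get split whether or not they are really clitics.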
Tokenization System Statistics
• More aggressive tokenization schemes have:
  • More tokens
  • More change from the baseline (untokenized) text
  • Fewer OOVs (the baseline OOV rate is 7%)
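The OOV rate here is simply the fraction of test tokens whose type never occurred in training, which a few lines of Python make concrete (toy data, not the papers' corpora):

```python
# OOV rate = fraction of test tokens whose type was never seen in
# training. Tokenizing both sides with the same scheme is what drives
# the OOV reductions reported on this slide.

def oov_rate(train_tokens, test_tokens):
    vocab = set(train_tokens)
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

train = "w+ Al+ ktAb w+ Al+ qlm".split()
test = "w+ Al+ ktAb w+ Al+ byt".split()
rate = oov_rate(train, test)   # 1/6: only 'byt' is unseen
```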
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Arabic-English Experiments
• Portage phrase-based MT (Sadat et al., 2005)
• Training data: 5 million words of parallel text only
  • All in the news genre
  • Learning curve: 1%, 10% and 100% of the data
• Language modeling: 250 million words
• Development (tuning) data: MT03 evaluation set
• Test data: MT04 evaluation set
  • Mixed genre: news, speeches, editorials
• Metric: BLEU (Papineni et al., 2002)
Arabic-English Experiments
• Each experiment:
  • Select a preprocessing scheme
  • Select a preprocessing technique
• Some combinations do not exist
  • e.g., REGEX cannot produce EN, which requires lemmatization and POS tagging
Arabic-English Results
[Chart: BLEU learning curves at 1%, 10%, and 100% of the training data; MADA > BAMA > REGEX]
Hebrew-English Experiments
• Phrase-based statistical MT
  • Moses (Koehn et al., 2007)
  • MERT (Och, 2003) tuned for BLEU (Papineni et al., 2002)
  • Language models: English Gigaword (5-gram) plus the training data (3-gram)
  • True casing for the English output
• Training data: 850,000 words
Hebrew-English Experiments
• Compare seven systems
  • Vary only the preprocessing
  • Baseline, RegEx-S{1-4}, Morf, and Htag
• Metrics
  • BLEU, NIST (Doddington, 2002)
  • METEOR (Banerjee & Lavie, 2005)
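To make the metrics concrete, here is a minimal sketch of BLEU's core ingredient, modified (clipped) n-gram precision, for a single reference and with no brevity penalty:

```python
from collections import Counter

# Modified n-gram precision: candidate n-gram counts are clipped to the
# maximum count of that n-gram in the reference, so repeating a correct
# word cannot inflate the score.

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

cand = "the the the cat".split()
ref = "the cat sat".split()
p1 = modified_precision(cand, ref, 1)   # 0.5: 'the' clipped to 1, plus 'cat'
```

Full BLEU combines these precisions geometrically over n = 1..4 and multiplies by a brevity penalty; NIST and METEOR weigh matches differently, which is why the three metrics can disagree (as they do for Morf below).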
Results
• Htag is consistently best, and Morf consistently second best, in terms of BLEU and NIST
• Morf has a very low OOV rate but still does worse than Htag (and even more poorly by METEOR), indicating that it sometimes over-tokenizes
• Within the RegEx schemes, BLEU peaks at S2/S3, similar to Arabic D2 (Habash & Sadat, 2006)
Outline
• Introduction
• Arabic and Hebrew Morphology
• Approach
• Experimental Settings
• Results
• Conclusions
Conclusions
• Preprocessing improves Arabic-English and Hebrew-English SMT
  • But its value diminishes as more training data is added
• Tokenization with a morphological tagger does best, but requires substantial linguistic knowledge
• Morfessor does quite well with no linguistic information, and significantly reduces OOV (though perhaps erroneously)
• The optimal scheme/technique choice varies with the amount of training data
  • In Arabic, with large training data, splitting off only conjunctions and particles performs best
  • With small training data, an English-like tokenization performs best
Thank you! Questions? Nizar Habash habash@cs.columbia.edu