Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi Graduate School of Informatics, Kyoto University IJCNLP 2013 (2013/10/17)

Outline • Background • Related Work • Proposed Method • Experiments • Conclusion

Bilingual Corpora [Fung+ 2004] • Parallel corpora are scarce • Parallel sentences can be extracted from noisy and comparable corpora • Quasi-comparable corpora are more widely available, but they contain few parallel sentences

Parallel Fragments • In quasi-comparable corpora, comparable sentences may still contain parallel fragments • Parallel fragments are also helpful for SMT • We aim to accurately extract parallel fragments from comparable sentences Zh: 应用/铅/离子/选择/电极/电位/滴定/法/测定/甘草/及/其/制品/中/的/甘草/酸 (Applying lead ion selective electrode potentiometric titration method to determine licorice and its products 's glycyrrhizic acid) Ja: ＜/原/報/＞/鉛/イオン/選択/性/電極を/用いる/混合/試料/中/の/…/と/電位/差/滴定/法/の/比較 (<Original Report> lead ion selective electrode used mixed sample 's … and potentiometric titration method 's comparison)

Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006] • Extract a translation lexicon from a parallel corpus • Apply a lexicon filter to comparable sentences in two directions independently • Assign initial scores according to the lexicon • Smooth the scores to gain new knowledge that does not exist in the lexicon • Extract sub-sentential (not exactly parallel) fragments

Lexicon Filter on Ja-to-Zh Direction Zh: 应用铅离子选择电极电位滴定法测定甘草及其制品中的甘草酸 Ja: ＜原報＞鉛イオン選択性電極を用いる混合試料中のと電位差滴定法の比較

Lexicon Filter on Zh-to-Ja Direction Ja: ＜原報＞鉛イオン選択性電極を用いる混合試料中のと電位差滴定法の比較 Zh: 应用铅离子选择电极电位滴定法测定甘草及其制品中的甘草酸

System Overview • Use a more accurate lexicon filter • Use an alignment model to locate the source and target fragment candidates simultaneously [Figure: pipeline diagram with numbered steps (1) to (5). Labeled components: Source corpora, SMT, Translated sentences, IR: top N results, Target corpora, Classifier, Comparable sentences, Alignment, Parallel fragment candidates, Lexicon filter, Parallel fragments, Parallel corpus]

Parallel Fragment Candidate Detection by Alignment • Candidates are the monotonic, non-NULL, longest aligned fragments of more than 3 tokens
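The detection rule on this slide can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a one-to-one word alignment given as a hypothetical dict from source position to target position (a missing key or None stands for a NULL alignment), and it reads "monotonic" as strictly increasing target positions. The name candidate_fragments and the parameter min_len are illustrative.

```python
def candidate_fragments(alignment, src_len, min_len=4):
    """Return maximal runs of source positions that are aligned (non-NULL)
    and whose target positions strictly increase (monotonic).

    alignment: dict {source position: target position}; assumed one-to-one.
    min_len=4 encodes "more than 3 tokens".
    """
    fragments = []
    run = []  # list of (src_idx, tgt_idx) pairs in the current run
    for i in range(src_len):
        j = alignment.get(i)  # None means the source token is NULL-aligned
        if j is not None and (not run or j > run[-1][1]):
            run.append((i, j))  # the token extends the current run
            continue
        # The run is broken by a NULL alignment or a non-monotonic link.
        if len(run) >= min_len:
            fragments.append(run)
        run = [(i, j)] if j is not None else []
    if len(run) >= min_len:
        fragments.append(run)
    return fragments


# Hypothetical alignment: source tokens 0-3 map monotonically to targets 2-5,
# token 4 is NULL-aligned, and tokens 5-6 form a run that is too short.
example = {0: 2, 1: 3, 2: 4, 3: 5, 5: 0, 6: 1}
print(candidate_fragments(example, src_len=7))
```

Because each run is extended greedily until it breaks, every returned fragment is the longest one at its position.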

Lexicon Filter − Assign Initial Scores • Assign scores in two directions to the aligned word pairs in the candidates according to the translation lexicon
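A minimal sketch of two-directional scoring follows. The slide does not state the score values, so this sketch assumes that a pair found in the lexicon receives its lexicon value and that any other pair receives -1.0. The function score_both_directions, the two lexicon dicts, and all numeric values are hypothetical.

```python
def score_both_directions(pairs, lex_src2tgt, lex_tgt2src, missing=-1.0):
    """Score each aligned (source, target) word pair in both directions.

    lex_src2tgt is keyed by (source, target), lex_tgt2src by (target, source).
    A pair that is absent from a lexicon gets the assumed penalty `missing`.
    """
    forward = [lex_src2tgt.get((s, t), missing) for s, t in pairs]
    backward = [lex_tgt2src.get((t, s), missing) for s, t in pairs]
    return forward, backward


# Aligned pairs taken from the Zh/Ja example sentences; lexicon values are made up.
pairs = [("铅", "鉛"), ("离子", "イオン"), ("选择", "選択"), ("电极", "電極")]
zh2ja = {("铅", "鉛"): 0.9, ("离子", "イオン"): 0.8, ("电极", "電極"): 0.7}
ja2zh = {("鉛", "铅"): 0.9, ("イオン", "离子"): 0.6, ("選択", "选择"): 0.5}
forward, backward = score_both_directions(pairs, zh2ja, ja2zh)
print(forward, backward)
```

Keeping the two directions as separate lists makes it possible to require agreement between them when fragments are extracted.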

Lexicon Filter − Score Smoothing • Only smooth a word with a negative score when both the left and right words around it have positive scores
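The smoothing condition can be sketched as below. The slide states only when a word is smoothed, not the value it receives, so replacing the score with the mean of its two neighbours is an assumption of this sketch. The name smooth_scores is illustrative.

```python
def smooth_scores(scores):
    """Smooth a negative score only if both neighbours are positive.

    Neighbour checks use the original scores, so one smoothed word never
    enables the smoothing of the next. The replacement value (mean of the
    two neighbours) is an assumed choice, not taken from the slides.
    """
    smoothed = list(scores)
    for i in range(1, len(scores) - 1):
        left, mid, right = scores[i - 1], scores[i], scores[i + 1]
        if mid < 0 and left > 0 and right > 0:
            smoothed[i] = (left + right) / 2
    return smoothed


# Index 1 is smoothed; indices 3 and 4 are not, since each has a negative neighbour.
print(smooth_scores([0.75, -1.0, 0.25, -1.0, -1.0, 0.5]))
```

Words at the sentence boundary are never smoothed here because they lack one neighbour.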

Fragment Extraction • Extract fragments of more than 3 tokens with continuous positive scores in both directions
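The extraction criterion can be sketched as a scan for runs of positive scores. This is a sketch under assumptions: the scores of the two directions are given as parallel lists over the same aligned pairs, and the names extract_fragments, forward, and backward are illustrative.

```python
def extract_fragments(pairs, forward, backward, min_len=4):
    """Keep maximal runs of aligned pairs whose scores are positive in both
    directions; min_len=4 encodes "more than 3 tokens"."""
    fragments, run = [], []
    for pair, f, b in zip(pairs, forward, backward):
        if f > 0 and b > 0:
            run.append(pair)
            continue
        # A non-positive score in either direction ends the current run.
        if len(run) >= min_len:
            fragments.append(run)
        run = []
    if len(run) >= min_len:
        fragments.append(run)
    return fragments


# Hypothetical scores: the first four pairs pass, the fifth fails backward.
pairs = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D"), ("e", "E"), ("f", "F")]
forward = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
backward = [0.9, 0.8, 0.7, 0.6, -1.0, 0.4]
print(extract_fragments(pairs, forward, backward))
```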

Outline • Background • Related Work • Proposed Method • Experiments • Parallel Fragment Extraction • Translation • Conclusion

Experimental settings (Parallel Fragment Extraction 1/2) • Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain) • Quasi-Comparable Corpora • Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain) • Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain) • Comparable sentences: 30k chemistry domain sentences were extracted

Experimental settings (Parallel Fragment Extraction 2/2) • Alignment: GIZA++ with symmetrization heuristics • Only: only use the extracted comparable sentences • External: together with 11k chemistry domain data in the parallel corpus • Translation lexicon • IBM Model 1 [Brown+ 1993] • Log-Likelihood-Ratio (LLR) [Munteanu+ 2006] • Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012] • Compare with [Munteanu+ 2006]
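For reference, the LLR association score is commonly computed as the G² statistic over a 2x2 co-occurrence table. The slides do not give the formula or the counting scheme, so the sketch below shows only the standard statistic and assumes the counts are co-occurrence counts of a source word and a target word over sentence pairs. The function name llr and the cell names k11 to k22 are illustrative.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: pairs containing both words      k12: source word only
    k21: target word only                 k22: neither word
    """
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2),
             (k21, row2, col1), (k22, row2, col2))
    total = 0.0
    for k, r, c in cells:
        if k > 0:  # an empty cell contributes nothing to the sum
            total += k * math.log(k * n / (r * c))
    return 2.0 * total


# Independent words score about 0; perfectly co-occurring words score high.
print(llr(10, 10, 10, 10), llr(10, 0, 0, 10))
```

The statistic measures only the strength of association, so a lexicon built from it still needs a separate test for whether the association is positive or negative.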

Results ※ Accuracy: manually evaluated 100 fragments based on exact match

Experimental Settings (Translation) • Baseline: Zh-Ja paper abstract corpus (680k with 11k chemistry domain sentences) • Tuning: 368 sentences of chemistry domain • Testing: 367 sentences of chemistry domain • Decoder: Moses • Language model: 5-gram language model trained on the Ja side of the parallel corpus using SRILM • Compare MT performance by appending the extracted fragments to the baseline training data

BLEU-4 for Different Systems ※ “*” denotes that the result is significantly better than “Baseline” at p < 0.05

Conclusion • We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon • Future Work • A method to deal with ordering • A method that is independent of the parallel corpus • Try other language pairs and domains

Thank you for your attention!

Examples of Extracted Fragment Pairs ※ Noise is written in red font • Most noise is due to the noisy translation lexicon (Examples 5-7) • Score smoothing also produces some noise (Example 8)
