Accurate Parallel Fragment Extraction from Quasi-Comparable Corpora using Alignment Model and Translation Lexicon. Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi. Graduate School of Informatics, Kyoto University. IJCNLP2013 (2013/10/17).
Outline • Background • Related Work • Proposed Method • Experiments • Conclusion
Bilingual Corpora [Fung+ 2004] • Lack of parallel corpora • Parallel sentences can be extracted from noisy and comparable corpora • Quasi-comparable corpora are more widely available, but they contain few parallel sentences
Parallel Fragments • In quasi-comparable corpora, comparable sentences can still contain parallel fragments • Parallel fragments are also helpful for SMT • We aim to accurately extract parallel fragments from comparable sentences Zh: 应用/铅/离子/选择/电极/电位/滴定/法/测定/甘草/及/其/制品/中/的/甘草/酸 (Applying lead ion selective electrode potentiometric titration method to determine licorice and its products 's glycyrrhizic acid) Ja: </原/報/>/鉛/イオン/選択/性/電極を/用いる/混合/試料/中/の/…/と/電位/差/滴定/法/の/比較 (<Original Report> lead ion selective electrode used mixed sample 's … and potentiometric titration method 's comparison)
Parallel Sub-sentential Fragment Extraction [Munteanu+ 2006] • Extract a translation lexicon from a parallel corpus • Apply a lexicon filter to comparable sentences in two directions independently • Assign initial scores according to the lexicon • Smooth the scores to gain new knowledge that does not exist in the lexicon • Extract sub-sentential (not exactly parallel) fragments
Lexicon Filter on Ja-to-Zh Direction Zh: 应用铅离子选择电极电位滴定法测定甘草及其制品中的甘草酸 Ja: <原報>鉛イオン選択性電極を用いる混合試料中の…と電位差滴定法の比較
Lexicon Filter on Zh-to-Ja Direction Ja: <原報>鉛イオン選択性電極を用いる混合試料中の…と電位差滴定法の比較 Zh: 应用铅离子选择电极电位滴定法测定甘草及其制品中的甘草酸
System Overview • Use an alignment model to locate the source and target fragment candidates simultaneously • Use a more accurate lexicon filter [Figure: pipeline diagram with steps numbered (1) to (5); labeled components: source corpora, SMT, translated sentences, IR (top N results), target corpora, classifier, comparable sentences, alignment, parallel fragment candidates, lexicon filter, parallel fragments, parallel corpus]
Parallel Fragment Candidate Detection by Alignment • Candidates are the longest monotonic, non-NULL aligned fragments of more than 3 tokens
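The candidate criterion above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the alignment is assumed to be one-to-one and given as a list mapping each source position to a target position (or `None` for NULL), "monotonic" is read as strictly increasing target positions, and the function name `candidate_fragments` is hypothetical.

```python
from typing import List, Optional, Tuple


def candidate_fragments(
    alignment: List[Optional[int]], min_len: int = 4
) -> List[Tuple[int, int]]:
    """Return source spans (start, end), end exclusive, of maximal runs where
    every token is aligned (non-NULL) and target positions strictly increase.

    min_len=4 encodes "more than 3 tokens". The one-to-one alignment format
    is an assumption made for this sketch.
    """
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, tgt in enumerate(alignment):
        aligned = tgt is not None
        # A run continues only if this token is aligned and keeps the order.
        if aligned and start is not None and tgt > alignment[i - 1]:
            continue
        # Otherwise the current run (if any) ends just before position i.
        if start is not None and i - start >= min_len:
            spans.append((start, i))
        start = i if aligned else None
    if start is not None and len(alignment) - start >= min_len:
        spans.append((start, len(alignment)))
    return spans


# Toy alignment: a NULL at position 4 and an order break at position 6.
spans = candidate_fragments([0, 1, 2, 3, None, 7, 5, 6, 8, 9])
```

The matching target span of a run can be read off as `alignment[start]` to `alignment[end - 1]`, which is how a single pass locates source and target candidates together.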
Lexicon Filter − Assign Initial Scores • Assign scores in two directions to the aligned word pairs in the candidates according to the translation lexicon
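A minimal sketch of two-directional initial scoring follows. The slide does not give the score values, so the details here are assumptions: each lexicon is a dictionary from a word pair to a positive score, a pair missing from a lexicon receives -1.0, and the toy entries and the name `initial_scores` are made up for illustration.

```python
from typing import Dict, List, Tuple

Lexicon = Dict[Tuple[str, str], float]


def initial_scores(
    pairs: List[Tuple[str, str]],
    lex_s2t: Lexicon,
    lex_t2s: Lexicon,
    miss: float = -1.0,
) -> Tuple[List[float], List[float]]:
    """Score each aligned (source, target) pair in both directions.

    A pair found in a lexicon gets that lexicon's score; a missing pair gets
    `miss` (a negative value chosen for this sketch).
    """
    s2t = [lex_s2t.get((s, t), miss) for s, t in pairs]
    t2s = [lex_t2s.get((t, s), miss) for s, t in pairs]
    return s2t, t2s


# Toy lexicons with invented scores, for illustration only.
pairs = [("铅", "鉛"), ("离子", "イオン"), ("测定", "比較")]
zh_ja = {("铅", "鉛"): 0.9, ("离子", "イオン"): 0.8}
ja_zh = {("鉛", "铅"): 0.7}
s2t, t2s = initial_scores(pairs, zh_ja, ja_zh)
```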
Lexicon Filter − Score Smoothing • Only smooth a word with a negative score when both the left and right words around it have positive scores
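The smoothing condition can be sketched as below. The slide states only when a word is smoothed, not what its new score becomes, so replacing the score with the average of its two neighbors is an assumption of this sketch, as is the function name `smooth`.

```python
from typing import List


def smooth(scores: List[float]) -> List[float]:
    """Smooth a negative score only if both neighbors are positive.

    The replacement value (mean of the two neighbors) is an assumption.
    Neighbors are read from the original scores, so one smoothed word never
    triggers smoothing of the next one.
    """
    out = list(scores)
    for i in range(1, len(scores) - 1):
        if scores[i] < 0 and scores[i - 1] > 0 and scores[i + 1] > 0:
            out[i] = (scores[i - 1] + scores[i + 1]) / 2
    return out


# Position 1 is smoothed; positions 3 and 4 are not, because each has a
# negative neighbor.
smoothed = smooth([0.5, -1.0, 0.25, -1.0, -1.0, 0.5])
```

Under this rule the first and last words are never smoothed, since each has only one neighbor.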
Fragment Extraction • Extract fragments of more than 3 tokens with continuous positive scores in both directions
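As a rough sketch of this extraction rule, the function below scans two per-position score lists and keeps maximal runs that are positive in both. The input format (two equal-length lists indexed by position in the candidate) and the name `extract_fragments` are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def extract_fragments(
    s2t: List[float], t2s: List[float], min_len: int = 4
) -> List[Tuple[int, int]]:
    """Return spans (start, end), end exclusive, where scores are positive in
    both directions for at least min_len consecutive positions.

    min_len=4 encodes "more than 3 tokens".
    """
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None
    n = min(len(s2t), len(t2s))
    for i in range(n):
        if s2t[i] > 0 and t2s[i] > 0:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and n - start >= min_len:
        spans.append((start, n))
    return spans


# The first run has 4 tokens and is kept; the second has 3 and is dropped.
fragments = extract_fragments(
    [0.5, 0.5, 0.5, 0.5, -1.0, 0.5, 0.5, 0.5],
    [0.5] * 8,
)
```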
Outline • Background • Related Work • Proposed Method • Experiments • Parallel Fragment Extraction • Translation • Conclusion
Experimental Settings (Parallel Fragment Extraction 1/2) • Parallel corpus: Zh-Ja abstract corpus (680k sentences, scientific domain) • Quasi-Comparable Corpora • Chinese corpora: CNKI (90k articles, 420k sentences, chemistry domain) • Japanese corpora: CiNii (880k articles, 5M sentences, scientific domain) • Comparable sentences: 30k chemistry domain sentences were extracted
Experimental Settings (Parallel Fragment Extraction 2/2) • Alignment: GIZA++ with symmetrization heuristics • Only: use only the extracted comparable sentences • External: use them together with the 11k chemistry domain sentences in the parallel corpus • Translation lexicon • IBM Model 1 [Brown+ 1993] • Log-Likelihood-Ratio (LLR) [Munteanu+ 2006] • Sub-corpora sampling lexicon (SampLEX) [Vulic+ 2012] • Compare with [Munteanu+ 2006]
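For readers unfamiliar with the LLR lexicon, the standard log-likelihood-ratio (G²) statistic for a 2×2 co-occurrence table can be sketched as follows. The slide only names LLR; how the counts are collected and how the statistic is converted into lexicon scores is not stated here, so the function `llr` and its count arguments are illustrative assumptions rather than the exact procedure of [Munteanu+ 2006].

```python
import math


def llr(k11: int, k12: int, k21: int, k22: int) -> float:
    """G^2 statistic, 2 * sum(O * ln(O / E)), for a 2x2 contingency table.

    Assumed meaning of the cells for a word pair (s, t):
      k11: s and t co-occur      k12: s occurs without t
      k21: t occurs without s    k22: neither occurs
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


independent = llr(10, 10, 10, 10)   # words co-occur exactly as chance predicts
associated = llr(20, 0, 0, 20)      # words always occur together
```

A value near zero indicates that the two words co-occur about as often as chance predicts; larger values indicate stronger association.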
Results ※ Accuracy: 100 fragments were manually evaluated based on exact match
Experimental Settings (Translation) • Baseline: Zh-Ja paper abstract corpus (680k with 11k chemistry domain sentences) • Tuning: 368 sentences of chemistry domain • Testing: 367 sentences of chemistry domain • Decoder: Moses • Language model: 5-gram language model trained on the Ja side of the parallel corpus using SRILM • Compare MT performance by appending the extracted fragments to the baseline training data
BLEU-4 for Different Systems ※ “*” denotes that the result is significantly better than “Baseline” at p < 0.05
Conclusion • We proposed an accurate parallel fragment extraction system using an alignment model and a translation lexicon • Future Work • A method to deal with ordering • Parallel corpus independent method • Try other language pairs and domains
Examples of Extracted Fragment Pairs ※ Noise is written in red font • Most noise is due to the noisy translation lexicon (Examples 5-7) • Score smoothing also produces some noise (Example 8)