Exploiting Shared Chinese Characters in Chinese Word Segmentation Optimization for Chinese-Japanese Machine Translation Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, Sadao Kurohashi Graduate School of Informatics, Kyoto University EAMT2012 (2012/05/28)
Outline • Word Segmentation Problems • Common Chinese Characters • Chinese Word Segmentation Optimization • Experiments • Discussion • Related Work • Conclusion and Future Work
Word Segmentation for Chinese-Japanese MT
Zh: 小坂先生是日本临床麻醉学会的创始者。
Ja: 小坂先生は日本臨床麻酔学会の創始者である。
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
Word Segmentation Problems in Chinese-Japanese MT • Unknown words affect segmentation accuracy and consistency • Word segmentation granularity affects word alignment
Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
Chinese Characters • Chinese characters are used in both Chinese (Hanzi) and Japanese (Kanji) • There are many common Chinese characters between Hanzi and Kanji • We made a common Chinese character mapping table for the 6,355 JIS Kanji (Chu et al., 2012)
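A rough sketch of how such a mapping table might be applied, assuming a simple character-by-character lookup. The table below is a toy subset for illustration; the full table (Chu et al., 2012) covers 6,355 JIS Kanji.

```python
# Toy subset of a Kanji -> simplified Hanzi mapping table
# (illustrative only; the real table covers 6,355 JIS Kanji).
KANJI_TO_HANZI = {
    "臨": "临",  # variant simplification
    "酔": "醉",
    "創": "创",
    "学": "学",  # many characters are identical in both scripts
}

def kanji_to_hanzi(text: str) -> str:
    """Map each character via the table; keep unmapped characters as-is."""
    return "".join(KANJI_TO_HANZI.get(ch, ch) for ch in text)

print(kanji_to_hanzi("臨床"))  # -> 临床
```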
Related Studies on Common Chinese Characters • Automatic sentence alignment (Tan et al., 1995) • Dictionary construction (Goh et al., 2005) • Investigation of word-level semantic relations (Huang et al., 2008) • Phrase alignment (Chu et al., 2011) • This study exploits common Chinese characters in Chinese word segmentation optimization for MT
Reason for Chinese Word Segmentation Optimization • Word segmentation for Japanese is easier than for Chinese, because Japanese uses Kana in addition to Chinese characters • The F-score for Japanese segmentation is nearly 99% (Kudo et al., 2004), while that for Chinese is still about 95% (Wang et al., 2011) • Therefore, we only optimize word segmentation for Chinese, and keep the Japanese segmentation results as-is
Optimization Pipeline
① Chinese lexicon extraction: parallel training corpus + common Chinese characters → Chinese lexicons
② Chinese lexicon incorporation: Chinese lexicons → system dictionary of the Chinese segmenter
③ Short unit transformation: Chinese annotated corpus → short unit Chinese corpus
Training: short unit Chinese corpus + system dictionary with Chinese lexicons → optimized Chinese segmenter
① Chinese Lexicons Extraction • Step 1: Segment Chinese and Japanese sentences in the parallel training corpus • Step 2: Convert Japanese Kanji tokens into Chinese using the mapping table we made (Chu et al., 2012) • Step 3: Extract the converted tokens as Chinese lexicons if they exist in the corresponding Chinese sentence
Extraction Example
Ja: 小坂/先生/は/日本/臨床/麻酔/学会/の/創始/者/である/。
↓ Kanji token conversion
Ja: 小坂/先生/は/日本/临床/麻醉/学会/の/创始/者/である/。
↓ Check against Zh: 小/坂/先生/是/日本/临床/麻醉/学会/的/创始者/。
↓ Extraction
Chinese lexicons: 小坂, 先生, 日本, 临床, 麻醉, 学会, 创始, 者
Ref: Mr. Kosaka is the founder of The Japan Society for Clinical Anesthesiologists.
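The three extraction steps could be sketched as follows. This is a minimal illustration assuming pre-segmented token lists and a toy character mapping table, not the paper's actual implementation:

```python
import unicodedata

def is_kanji(ch: str) -> bool:
    """True for CJK unified ideographs (Kanji/Hanzi)."""
    return "CJK UNIFIED" in unicodedata.name(ch, "")

def extract_lexicons(ja_tokens, zh_sentence, table):
    """Convert Kanji-only Japanese tokens to Hanzi via the mapping table
    (Step 2), then keep those found in the parallel Chinese sentence (Step 3)."""
    lexicons = []
    for tok in ja_tokens:
        if not tok or not all(is_kanji(c) for c in tok):
            continue  # skip Kana, punctuation, and mixed tokens
        converted = "".join(table.get(c, c) for c in tok)
        if converted in zh_sentence:
            lexicons.append(converted)
    return lexicons

table = {"臨": "临", "酔": "醉", "創": "创"}  # toy mapping subset
ja = ["小坂", "先生", "は", "臨床", "の", "創始", "者", "である", "。"]
zh = "小坂先生是日本临床麻醉学会的创始者。"
print(extract_lexicons(ja, zh, table))  # -> ['小坂', '先生', '临床', '创始', '者']
```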
② Chinese Lexicons Incorporation • Using a system dictionary is helpful for Chinese word segmentation (Low et al., 2005; Wang et al., 2011) • We incorporate the extracted lexicons into the system dictionary of a Chinese segmenter • POS tags are assigned by converting the POS tags given by the Japanese segmenter, using a POS tag mapping table between Chinese and Japanese
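A hypothetical sketch of the incorporation step. The Japanese and Chinese tag names below are assumed for illustration; they are not the paper's actual mapping table:

```python
# Hypothetical Japanese -> Chinese POS tag mapping (tag names assumed,
# not taken from the paper's actual table).
POS_MAP = {"名詞": "NN", "動詞": "VV", "形容詞": "VA"}

def incorporate(lexicons_with_ja_pos, system_dict):
    """Add extracted lexicons to the segmenter's system dictionary,
    converting each Japanese POS tag to a Chinese one via POS_MAP."""
    for word, ja_pos in lexicons_with_ja_pos:
        zh_pos = POS_MAP.get(ja_pos)
        if zh_pos is not None and word not in system_dict:
            system_dict[word] = zh_pos
    return system_dict

print(incorporate([("临床", "名詞"), ("创始", "動詞")], {}))
# -> {'临床': 'NN', '创始': 'VV'}
```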
③ Short Unit Transformation • Adjusting Chinese word segmentation to make token mappings between parallel sentences 1-to-1 as often as possible can improve alignment accuracy (Bai et al., 2008) • Wang et al. (2010) proposed a short unit standard for Chinese word segmentation, which can reduce the number of 1-to-n alignments and improve MT performance
Our Method • We transform the annotated training data of the Chinese segmenter using the extracted lexicons
CTB: 从_P/有效性_NN/高_VA/的_DEC/格要素_NN/…
Lexicons: 有效 (effective), 要素 (element)
Short: 从_P/有效_NN/性_NN/高_VA/的_DEC/格_NN/要素_NN/…
Ref: From case element with high effectiveness …
Constraints • We do not use extracted lexicons that consist of only one Chinese character; e.g., using the single-character lexicon 歌 (song) would wrongly split the long token 歌颂 (praise) into the short unit tokens 歌 (song)/颂 (praise)
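A minimal greedy sketch of the transformation, applying the constraint above (single-character lexicons are ignored). The paper's actual procedure, including how ambiguous splits are resolved, may differ:

```python
def short_unit(token, lexicons):
    """Greedily split a long annotated token into short units using
    extracted lexicons; single-character lexicons are ignored."""
    usable = {lex for lex in lexicons if len(lex) > 1}
    units, i = [], 0
    while i < len(token):
        # Try the longest lexicon match starting at position i.
        for j in range(len(token), i, -1):
            if token[i:j] in usable:
                units.append(token[i:j])
                i = j
                break
        else:
            units.append(token[i])  # no match: emit a single character
            i += 1
    return units

print(short_unit("格要素", {"要素"}))  # -> ['格', '要素']
print(short_unit("有效性", {"有效"}))  # -> ['有效', '性']
print(short_unit("歌颂", {"歌"}))      # -> ['歌', '颂'] only char-by-char:
                                       # the 1-char lexicon is not used
```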
Two Kinds of Experiments • Experiments on Moses • Experiments on the Kyoto example-based machine translation (EBMT) system (Nakazawa and Kurohashi, 2011) • A dependency tree-based decoder
Experimental Settings on MOSES (2/2) • Baseline: only using the lexicons extracted from the Chinese annotated corpus • Incorporation: incorporate the extracted Chinese lexicons • Short unit: incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data
Results of Chinese-to-Japanese Translation Experiments on MOSES • CTB 7 shows better performance because it is more than 3 times the size of the NICT Chinese Treebank • Lexicons extracted from the paper abstract domain also work well on other domains (i.e., CTB 7)
Results of Japanese-to-Chinese Translation Experiments on MOSES • The improvement is not significant compared to Zh-to-Ja, because our proposed approach does not change the segmentation results of the input Japanese sentences
Experimental Settings on EBMT (2/2) • Baseline: only using the lexicons extracted from the Chinese annotated corpus • Short unit: incorporate the extracted Chinese lexicons and train the Chinese segmenter on the short unit training data
Results of Translation Experiments on EBMT • Translation performance is worse than with MOSES, because EBMT suffers from the low accuracy of the Chinese parser • The improvement by short unit is not significant because the Chinese parser is not trained on short unit segmented training data
Short Unit Effectiveness on MOSES
Baseline (BLEU=49.38)
Input: 本/论文/中/,/提议/考虑/现存/实现/方式/的/功能/适应性/决定/对策/目标/的/保密/基本/设计法/。
Output: 本/論文/で/は/,/提案/する/適応/的/対策/を/決定/する/セキュリティ/基本/設計/法/を/考える/現存/の/実現/方式/の/機能/を/目標/と/して/いる/.
Short unit (BLEU=56.33)
Input: 本/论文/中/,/提议/考虑/现存/实现/方式/的/功能/适应/性/决定/对策/目标/的/保密/基本/设计/法/。
Output: 本/論文/で/は/,/提案/する/考え/現存/の/実現/方式/の/機能/的/適応/性/を/決定/する/対策/目標/の/セキュリティ/基本/設計/法/を/提案/する/.
Reference: 本/論文/で/は/,/対策/目標/を/現存/の/実現/方式/の/機能/的/適合/性/も/考慮/して/決定/する/セキュリティ/基本/設計/法/を/提案/する/.
(In this paper, we propose a basic security design method that determines countermeasure targets while also considering the functional suitability of the existing implementation method.)
Number of Extracted Lexicons • The number of extracted lexicons decreased after short unit transformation because the number of duplicate lexicons increased
Short Unit Transformation Percentage • NICT Chinese Treebank • 6,623 tokens out of 257,825 were transformed into 13,469 short unit tokens, a percentage of 2.57% • CTB 7 • 19,983 tokens out of 718,716 were transformed into 41,336 short unit tokens, a percentage of 2.78%
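The reported percentages can be verified directly from the token counts:

```python
# Sanity-check the reported transformation percentages.
nict = 6623 / 257825    # NICT Chinese Treebank
ctb7 = 19983 / 718716   # CTB 7
print(f"NICT: {nict:.2%}, CTB 7: {ctb7:.2%}")  # -> NICT: 2.57%, CTB 7: 2.78%
```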
Short Unit Transformation Problems (1/3) • Improper transformation problem
Long token: 不好意思 (sorry) → extracted lexicon: 好意 (favor) → short unit tokens: 不 (not)/好意 (favor)/思 (think)
Short Unit Transformation Problems (2/3) • Transformation ambiguity problem
Long token: 充电器 (charger) → extracted lexicons: 充电 (charge), 电器 (electric equipment) → possible short unit tokens: 充电 (charge)/器 (device) or 充 (charge)/电器 (electric equipment)
Short Unit Transformation Problems (3/3) • POS tag assignment problem
Long token: 被实验者_NN (test subject) → extracted lexicon: 实验 (test) → short unit tokens: 被_NN (be)/实验_NN (test)/者_NN (person)
The correct POS tag for 被 (be) should be LB (被 in a long bei-construction)
Bai et al., 2008 • Proposed a method of learning affix rules from an aligned Chinese-English bilingual terminology bank to directly adjust Chinese word segmentation in the parallel corpus
Wang et al., 2010 • Proposed a method based on transfer rules and a transfer database. • The transfer rules are extracted from alignment results of annotated Chinese and segmented Japanese training data • The transfer database is constructed using external lexicons, and is manually modified
Conclusion • We proposed an approach that exploits common Chinese characters to optimize Chinese word segmentation for Chinese-Japanese MT • Experimental results of Chinese-Japanese MT on a phrase-based SMT system indicated that our approach can significantly improve MT performance
Future Work • Solve the short unit transformation problems • Evaluate the proposed approach on parallel corpora from other domains