Chinese Word Segmentation and Statistical Machine Translation
Presenter: Wu, Jia-Hao
Authors: RUIQIANG ZHANG, KEIJI YASUDA, EIICHIRO SUMITA
國立雲林科技大學 National Yunlin University of Science and Technology
TOSLP (2008)
Outline
• Motivation
• Objective
• Methodology
  • Dictionary-based
  • CRF-based
• Experiments
• Conclusion
• Personal Comments
Motivation
• Chinese word segmentation (CWS) is a necessary step in Chinese-English statistical machine translation (SMT).
• However, building a CWS system involves many choices, such as the segmentation specification and the CWS method.
• Example: 我們要發展中國家用電器 can be segmented in two ways:
  • 我們 / 要 / 發展 / 中國 / 家用電器: We / want / to develop / China's / home electrical appliances.
  • 我們 / 要 / 發展中國家 / 用 / 電器: We / want / developing countries / to use / electrical appliances.
Motivation (cont.)
• Pipeline: Chinese word segmentation, followed by statistical machine translation.
• Chinese names are rendered by romanized phonetic transcription.
Objective
• The authors created 16 CWS schemes under different settings to examine the relationship between CWS and SMT.
• They also tested two CWS methods: the dictionary-based and the CRF-based approaches.
• They propose two approaches for combining the advantages of different specifications:
  • A simple concatenation of training data.
  • Linear interpolation of multiple translation models.
Methodology-Dictionary-based
• Pure dictionary-based CWS does not recognize out-of-vocabulary (OOV) words.
• The authors combined an N-gram language model with dictionary-based word segmentation.
• The input is a given Chinese character sequence, C = c0 c1 c2 … cN.
• The output is the word sequence, W = wt0 wt1 wt2 … wtM, that satisfies the segmentation criterion, in which δ(u,v) equals 1 if both arguments are the same and 0 otherwise.
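The idea of pairing a dictionary with a language model can be illustrated with a small dynamic-programming sketch. It is not the authors' implementation: it assumes a unigram model instead of a general N-gram model, and the toy lexicon, its log-probabilities, `max_word_len`, and the `oov_logprob` fallback for unknown single characters are all hypothetical values chosen for illustration.

```python
import math

def segment(chars, unigram_logprob, max_word_len=5, oov_logprob=-20.0):
    """Pick the word sequence with the highest total unigram log-probability
    among all sequences that concatenate back to `chars`."""
    n = len(chars)
    best = [-math.inf] * (n + 1)   # best[i]: best score for chars[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = chars[start:end]
            if word in unigram_logprob:
                score = unigram_logprob[word]
            elif end - start == 1:
                # Assumption: unknown single characters get a fixed penalty
                # so that every input can still be segmented.
                score = oov_logprob
            else:
                continue
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow the back-pointers from the end of the sentence.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(chars[start:end])
        end = start
    return words[::-1]

# Hypothetical toy lexicon with made-up log-probabilities.
toy_lexicon = {
    "我們": -3.0, "要": -3.0, "發展": -4.0, "中國": -4.0,
    "家用電器": -6.0, "發展中國家": -9.0, "用": -4.0, "電器": -5.0,
}
print(segment("我們要發展中國家用電器", toy_lexicon))
```

With these made-up scores the sketch prefers 我們 / 要 / 發展 / 中國 / 家用電器; raising the score of 發展中國家 flips it to the other reading from the Motivation slide, which is exactly the ambiguity the language model is meant to resolve.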
Methodology-CRF-based IOB Tagging
• Each character of a word is labeled:
  • B if it is the first character of a multiple-character word.
  • O if the character functions as an independent word.
  • I otherwise.
• Example: 全北京市 is labeled 全/O 北/B 京/I 市/I.
• The CRF models the probability of an IOB tag sequence, T = t0 t1 … tM, given the word sequence W = w0 w1 … wM, using unigram and bigram features.
• Unigram features: w0, w-1, w1, w-2, w2, w0w-1, w0w1, w-1w1, w-2w-1, w2w0.
• Feature selection: the authors simply used absolute counts for each feature in the training data and defined a cutoff value for each feature type.
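The labeling rule and the unigram feature templates above can be sketched as follows. This is an illustrative sketch only: the function names, the `<PAD>` symbol for positions outside the sequence, and the `/` separator in combined features are assumptions, and no CRF training is shown.

```python
def words_to_iob(words):
    """Convert a segmented sentence into (character, tag) pairs."""
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "O"))                    # independent word
        else:
            tagged.append((word[0], "B"))                 # first character
            tagged.extend((ch, "I") for ch in word[1:])   # remaining characters
    return tagged

def iob_to_words(tagged):
    """Rebuild the word sequence from (character, tag) pairs."""
    words = []
    for ch, tag in tagged:
        if tag == "I" and words:
            words[-1] += ch
        else:
            words.append(ch)
    return words

def unigram_features(units, i):
    """Instantiate the ten templates from the slide at position i."""
    def w(k):
        j = i + k
        return units[j] if 0 <= j < len(units) else "<PAD>"  # assumed padding
    return {
        "w0": w(0), "w-1": w(-1), "w1": w(1), "w-2": w(-2), "w2": w(2),
        "w0w-1": f"{w(0)}/{w(-1)}", "w0w1": f"{w(0)}/{w(1)}",
        "w-1w1": f"{w(-1)}/{w(1)}", "w-2w-1": f"{w(-2)}/{w(-1)}",
        "w2w0": f"{w(2)}/{w(0)}",
    }

print(words_to_iob(["全", "北京市"]))
```

Applied to the segmentation 全 / 北京市, `words_to_iob` reproduces the tagging from the slide, and `iob_to_words` inverts it.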
Methodology-Achilles
• Achilles is an in-house CWS that includes both dictionary-based and CRF-based approaches.
• Dictionary-based:
  • Zero OOV recognition rate.
  • Higher in-vocabulary rate.
• CRF-based:
  • Higher OOV recognition rate than the dictionary-based approach.
  • Best F-scores.
Methodology-Phrase-Based SMT
• The method uses a framework of log-linear models to integrate multiple features.
• In this model, fi(F,E) is the logarithmic value of the i-th feature, and λi is the weight of the i-th feature.
• The target sentence candidate that maximizes P(E|F) is the solution.
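A minimal sketch of log-linear scoring follows, assuming the standard form in which P(E|F) is proportional to exp(Σ λi·fi(F,E)). The feature names `tm` and `lm`, the candidate sentences, and all numeric values are hypothetical, and normalizing over a short candidate list stands in for the decoder's real search space.

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of log-valued features: sum_i lambda_i * f_i(F, E)."""
    return sum(weights[name] * value for name, value in features.items())

def best_candidate(candidates, weights):
    """Return the (sentence, features) pair with the highest score."""
    return max(candidates, key=lambda c: loglinear_score(c[1], weights))

def posterior(candidates, weights):
    """Normalize exp(score) over the candidate list (assumed search space)."""
    scores = [loglinear_score(f, weights) for _, f in candidates]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]   # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidates with made-up feature values.
candidates = [
    ("we want to develop China's home appliances",
     {"tm": math.log(0.2), "lm": math.log(0.05)}),
    ("we want developing countries to use appliances",
     {"tm": math.log(0.3), "lm": math.log(0.01)}),
]
weights = {"tm": 1.0, "lm": 1.0}
print(best_candidate(candidates, weights)[0])
```

Because the normalizer is the same for every candidate, the argmax only needs the unnormalized weighted sum, which is why `best_candidate` never calls `posterior`.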
Experiments
• The data used in the experiments were provided by LDC. The authors also used the English sentences of the data plus the Xinhua news portion of the LDC Gigaword English corpus.
• Implementation of CWS schemes:
  • Tokens: the total number of words in the training data.
  • Unique words: the lexicon size of the segmented training data.
  • OOVs: the unknown words in the test data.
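The three statistics can be computed from whitespace-segmented text roughly as below. This is a sketch under assumptions: the function name and sample sentences are hypothetical, and OOVs are counted here as running test words missing from the training lexicon, since the slide does not say whether tokens or distinct words are counted.

```python
def corpus_stats(train_sentences, test_sentences):
    """Count tokens, unique words, and OOVs for one CWS scheme.

    Each sentence is a string whose words are separated by spaces,
    i.e. the output of a segmenter.
    """
    train_tokens = [w for s in train_sentences for w in s.split()]
    lexicon = set(train_tokens)
    # Assumption: an OOV is a test token absent from the training lexicon.
    oovs = [w for s in test_sentences for w in s.split() if w not in lexicon]
    return {
        "tokens": len(train_tokens),
        "unique_words": len(lexicon),
        "oovs": len(oovs),
    }

train = ["我們 要 發展 中國 家用電器", "我們 要 用 電器"]
test = ["我們 要 發展 北京"]
print(corpus_stats(train, test))
```

Running the same function on the output of each segmenter gives one row of statistics per CWS scheme.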
Experiment
• The effect of CWS specifications on SMT.
Experiment - Combining multiple CWS schemes
• Effect of combining training data from multiple CWS specifications.
• The authors created a new CWS scheme called dict-hybrid by combining AS, CITYU, MSR, and PKU.
• Training data: 49,546,231 tokens and 112,072 unique words. Test data: 693 OOVs.
Experiment
• Effect of feature interpolation of translation models.
• The authors generated multiple translation models by using different word segmenters.
• The phrase translation model $p(e|f)$ can be linearly interpolated as
  $p(e|f) = \sum_{i=1}^{S} \alpha_i \, p_i(e|f)$
• Here $p_i(e|f)$ is the phrase translation model corresponding to the i-th CWS, $\alpha_i$ is its weight, and $S$ is the total number of models.
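The interpolation can be sketched with phrase tables stored as dictionaries. Several details are assumptions, not taken from the slides: the tables map a hypothetical (source phrase, target phrase) pair to p(e|f), a pair missing from a model contributes zero, the weights are required to sum to 1, and all probabilities are made up.

```python
def interpolate(models, alphas):
    """Linearly interpolate phrase tables: p(e|f) = sum_i alpha_i * p_i(e|f)."""
    if len(models) != len(alphas):
        raise ValueError("need exactly one weight per model")
    if abs(sum(alphas) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    combined = {}
    for model, alpha in zip(models, alphas):
        for pair, prob in model.items():
            # A pair absent from a model simply adds nothing for that model.
            combined[pair] = combined.get(pair, 0.0) + alpha * prob
    return combined

# Hypothetical phrase tables from two differently segmented corpora.
model_a = {("中國", "China"): 0.8, ("中國", "Chinese"): 0.2}
model_b = {("中國", "China"): 0.6, ("發展中國家", "developing countries"): 1.0}
table = interpolate([model_a, model_b], [0.5, 0.5])
print(table[("中國", "China")])
```

A phrase pair that only one segmentation produces, such as the entry for 發展中國家, still survives in the combined table with a reduced probability, which is how interpolation lets one system draw on several CWS schemes.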
Conclusion
• The authors analyzed multiple CWS specifications and built a CWS for each one to examine how they affected translation.
• They proposed a new approach based on linear interpolation of translation features, which improved translation and achieved the best BLEU score of all the CWS schemes.
Comments
• Advantage
  • The authors ran many experiments to evaluate performance.
• Drawback
  • Some interpretations of the experiments are complex.
• Application
  • Chinese word segmentation.
  • Statistical machine translation.