Optimizing Chinese Word Segmentation for MT performance Pi-Chuan Chang, Michel Galley and Chris Manning Stanford NLP group
Very brief introduction to Chinese word segmentation
• Why segment "words"?
• No spaces between characters
• Words are better semantic units
• Example: 天 (sky) + 花 (flower) combine into the word 天花 (smallpox)
• Most commonly used standard: Chinese Treebank (CTB), good for POS tagging & parsing
Word segmentation and MT
• The Chinese segmentation task has been defined as optimizing for a given segmentation standard
• Performance measured by F measure
• Better segmentation performance = better MT performance?
• Our goal: understand what is important for a segmenter to perform well in MT, and optimize the segmenter for MT
Things we looked at
• Better segmentation performance = better MT performance?
Segmentation performance vs. MT performance
• A segmenter that performs much better on segmentation can be worse for MT than a simple max-match segmenter
• Why is max-match better on MT? Is a lower OOV rate the reason?
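For reference, the max-match baseline mentioned above can be sketched in a few lines: greedily take the longest lexicon word starting at the current position, falling back to a single character. The lexicon contents and the maximum word length are illustrative, not the ones used in the experiments.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy maximum matching: at each position, take the longest
    lexicon word starting there; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + L]
            if L == 1 or cand in lexicon:
                words.append(cand)
                i += L
                break
    return words

lexicon = {"內政部", "內政", "人民"}
print(max_match("內政部人民", lexicon))  # → ['內政部', '人民']
```

Despite its simplicity, this greedy heuristic gives the deterministic, highly consistent segmentations that turn out to matter for MT.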
Things we looked at
• Better segmentation performance = better MT performance?
• Should we minimize the OOV rate of the MT test data?
Is OOV rate a good predictor of a good segmenter for MT?
• Not really; here's a counterexample (* CRF-Lex is a CRF segmenter with strong external lexicon features)
• A segmenter that yields a much larger MT training-set vocabulary and a higher OOV rate on the MT test data can still be better for MT
• Let's take a look at CRF-Lex
CRF-Lex: sequence model + lexicon
• We build a CRF-based hybrid between a character sequence model and a lexicon-based segmentation model
• For each character, we look up external lexicons to see if the character is a prefix/infix/suffix of words in the lexicon
• Our lexicon features are conjunctions of this information over the sequence (current, previous, next character, etc.)
• Example: with 沃尔玛 (Walmart) in the lexicon, the feature "a length-3 word starts here" fires at 沃 in the sequence 在 沃 尔 玛 不 (segmentation labels 1 1 0 0 1)
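The prefix/infix/suffix lookup above can be sketched as follows. This is a minimal illustration of the idea, not the paper's exact feature templates; feature names and the maximum word length are assumptions.

```python
def lexicon_features(chars, i, lexicon, max_len=4):
    """Binary lexicon features for the character at position i: does it
    start (prefix), sit strictly inside (infix), or end (suffix) some
    word in the external lexicon? Feature names are illustrative."""
    feats = {}
    for L in range(2, max_len + 1):
        # character i is the first character of a length-L lexicon word
        if i + L <= len(chars) and "".join(chars[i:i + L]) in lexicon:
            feats[f"begins_word_len{L}"] = 1
        # character i is the last character of a length-L lexicon word
        if i - L + 1 >= 0 and "".join(chars[i - L + 1:i + 1]) in lexicon:
            feats[f"ends_word_len{L}"] = 1
        # character i is strictly inside a length-L lexicon word
        for j in range(1, L - 1):
            if i - j >= 0 and "".join(chars[i - j:i - j + L]) in lexicon:
                feats[f"inside_word_len{L}"] = 1
    return feats

chars = list("在沃尔玛不")
lex = {"沃尔玛"}
print(lexicon_features(chars, 1, lex))  # 沃 begins the 3-char word 沃尔玛
print(lexicon_features(chars, 3, lex))  # 玛 ends it
```

In the real model, these per-character indicators are conjoined across neighboring positions so the CRF can learn, e.g., that a "begins length-3 word" signal followed by two "inside/ends" signals strongly suggests labels 1 0 0.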
Things we looked at
• Better segmentation performance = better MT performance?
• Should we minimize the OOV rate of the MT test data?
• What else could be good predictors? Consistency of segmentation
Another measure: segmentation consistency
• Intuition: if a word is always segmented the same way, we say it's consistent
• Suppose the gold segmentation data contains several occurrences of a word "ABC"
• If the segmenter segments all the occurrences as "ABC", it is consistent
• If the segmenter segments all the occurrences as "A" "BC", it is still consistent (but this will be penalized in the segmentation F1 score)
• If some occurrences are segmented one way and others another way, the segmentation is inconsistent
A concrete example: consistency measured by conditional entropy
• There are 177 occurrences of 人民 "ren min" (people); let's see how consistent each segmenter is
• In context c₁ _ c₂, the observed segmentation variants are: c₁人民 c₂, c₁ 人民c₂, c₁ 人民 c₂, and c₁人 民 c₂
• For wᵢ = 人民, CRF-Lex is the most consistent
• Consistency is measured by the conditional entropy of a word's segmentation given the word, H(S | wᵢ) = −Σₛ p(s | wᵢ) log p(s | wᵢ), and the overall measure sums this entropy over all words
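The per-word conditional entropy can be computed directly from the observed segmentations. The 150/15/8/4 split over the four variants below is hypothetical, invented only to illustrate the computation; the real per-segmenter counts are in the paper.

```python
import math
from collections import Counter

def consistency_entropy(segmentations):
    """Conditional entropy H(segmentation | word) over the observed
    segmentations of one word: 0 bits when every occurrence is segmented
    the same way; larger values mean less consistency."""
    counts = Counter(segmentations)
    n = sum(counts.values())
    # + 0.0 normalizes IEEE -0.0 in the fully consistent case
    return -sum(c / n * math.log2(c / n) for c in counts.values()) + 0.0

# Hypothetical split of the 177 occurrences of 人民 across four variants
observed = ["人民"] * 150 + ["人 民"] * 15 + ["c人民"] * 8 + ["人民c"] * 4
print(round(consistency_entropy(observed), 3))   # → 0.83
print(consistency_entropy(["人民"] * 177))        # → 0.0 (perfectly consistent)
```

A segmenter that always produces the same variant scores 0 bits for that word, matching the intuition that even an "incorrect" but uniform segmentation is consistent.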
Consistency and MT performance of 3 segmenters
• The strong lexicon features greatly improve consistency
Consistency is not enough
• Character-based segmentation is completely consistent, but not the best for MT
• Another factor: appropriate word size for the target language
• Prediction: we can further improve MT performance by finding the optimal word size for translatability
Things we looked at
• Better segmentation performance = better MT performance?
• Should we minimize the OOV rate of the MT test data?
• What else could be good predictors? Consistency of segmentation; finding the optimal word size for translatability
Tuning word length in CRF
• Our CRF model makes a binary prediction: 1 means separated from the previous character, 0 means continuing. It is trained on Chinese Treebank 6 data
• We change the weight λ₀ of a feature that fires only when the prediction is 1
• Extreme case: a very large positive λ₀ causes the segmenter to predict only 1s (character-based)
• We find an optimal middle ground
• Example, with labels over 內 政 部:
  – CTB segmentation: 內政部 "Department of Internal Affairs" (1 0 0)
  – One that optimizes for MT: 內政 | 部 "Internal Affairs | Department" (1 0 1)
  – Character-based: 內 | 政 | 部 "inside | politics | department" (1 1 1)
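The effect of the length-tuning weight can be illustrated with a toy per-position decision rule (the real model does sequence-level CRF inference, and the scores below are invented for illustration): adding a bias to the "start new word" label shifts the output toward shorter words.

```python
def segment_with_bias(char_scores, lam0):
    """Toy illustration: each position has a model score favoring label 1
    ('start new word'); adding the bias lam0 shifts predictions toward 1.
    Larger lam0 -> shorter words, character-based in the limit."""
    return [1 if s + lam0 > 0 else 0 for s in char_scores]

# hypothetical scores for 內 政 部 (the first character always starts a word)
scores = [5.0, -1.2, -0.4]
print(segment_with_bias(scores, 0.0))  # → [1, 0, 0]  CTB-style: 內政部
print(segment_with_bias(scores, 1.0))  # → [1, 0, 1]  MT-tuned: 內政 | 部
print(segment_with_bias(scores, 2.0))  # → [1, 1, 1]  character-based
```

Sweeping the bias from 0 upward traces exactly the continuum the next slide describes, and the MT-optimal point sits strictly between the two extremes.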
Continuum between word- and character-based segmentation
• Sweeping λ₀ traces a continuum from the CTB segmentation 內政部 "Department of Internal Affairs" (λ₀ = 0, which optimizes segmentation performance on CTB) through 內政 | 部 "Internal Affairs | Department" to character-based 內 | 政 | 部 "inside | politics | department"
• The best performance for MT (31.47 BLEU, vs. 30.95 for the CTB-optimized setting) lies in the middle of the continuum
Things we looked at
• Better segmentation performance = better MT performance?
• Should we minimize the OOV rate of the MT test data?
• What else could be good predictors? Consistency of segmentation; finding the optimal word size for translatability
• Identifying proper nouns can help the MT system
Joint training of word segmentation and proper noun tagging
• Unknown words are a problem for MT (and segmentation) systems; proper nouns are an important source of unknown words
• Approach: expand the label set of the CRF to jointly predict word segmentation and detect proper nouns
• Previously the label set was { 0, 1 }; we expand it to { NR, OTHER } × { 0, 1 }
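The expanded label set is just the cross product of the two decisions, so each character gets one joint label. The "NER-SEG" string encoding and the decode helper below are illustrative, not the paper's notation.

```python
from itertools import product

# Original CRF label set: 1 = start a new word, 0 = continue the previous word
SEG_LABELS = ("0", "1")
# Joint label set crosses segmentation with proper-noun (NR) tagging
JOINT_LABELS = tuple(f"{ner}-{seg}"
                     for ner, seg in product(("NR", "OTHER"), SEG_LABELS))

def decode(chars, labels):
    """Recover (word, is_proper_noun) pairs from a joint label sequence."""
    words, is_nr = [], []
    for ch, lab in zip(chars, labels):
        ner, seg = lab.split("-")
        if seg == "1" or not words:   # start a new word
            words.append(ch)
            is_nr.append(ner == "NR")
        else:                         # continue the previous word
            words[-1] += ch
    return list(zip(words, is_nr))

print(JOINT_LABELS)  # → ('NR-0', 'NR-1', 'OTHER-0', 'OTHER-1')
# 他在沃尔玛工作 "He works at Walmart"; 沃尔玛 (Walmart) is the proper noun
print(decode(list("他在沃尔玛工作"),
             ["OTHER-1", "OTHER-1", "NR-1", "NR-0", "NR-0",
              "OTHER-1", "OTHER-0"]))
# → [('他', False), ('在', False), ('沃尔玛', True), ('工作', False)]
```

Because the CRF now scores the two decisions jointly, evidence that a span is a proper noun can also change where the word boundaries go, and vice versa.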
Results of joint segmentation and proper noun tagging
• Joint segmentation and proper noun tagging gives a better OOV recall rate, and thus a better overall F1 score
• 0.32 BLEU improvement
Conclusions
• Segmentation performance, vocabulary size, and OOV rate do not accurately predict MT performance
• Two important factors for MT: segmenting things consistently, and having the appropriate word size for the target language (+0.73 BLEU)
• Incorporating proper noun detection helps (+0.32 BLEU)