Optimizing Chinese Word Segmentation for MT performance Pi-Chuan Chang, Michel Galley and Chris Manning Stanford NLP group
Very brief introduction to Chinese word segmentation
• Why segment "words"?
• No spaces between characters
• Words are better semantic units
• Example: 天 (sky) + 花 (flower) combine into the word 天花 (smallpox)
• Most commonly used standard: Chinese Treebank (CTB), good for POS tagging & parsing
Word segmentation and MT
• The Chinese segmentation task has been defined as optimizing for a given segmentation standard
• Performance measured by F measure
• Better segmentation performance = better MT performance?
• Our goal: understand what is important for a segmenter to perform well in MT, and optimize the segmenter for MT
Things we looked at
• Better segmentation performance = better MT performance?
Segmentation performance vs. MT performance
• A segmenter that performs much better on segmentation can be worse for MT than a simple max-match segmenter
• Why is max-match better on MT? Is a lower OOV rate the reason?
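For reference, the max-match baseline mentioned above can be sketched in a few lines: greedily take the longest lexicon word starting at the current position, falling back to a single character. The lexicon contents and the maximum word length are illustrative, not the ones used in the experiments.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy maximum matching: at each position, take the longest
    lexicon word starting there; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + L]
            if L == 1 or cand in lexicon:
                words.append(cand)
                i += L
                break
    return words

lexicon = {"內政部", "內政", "人民"}
print(max_match("內政部人民", lexicon))  # → ['內政部', '人民']
```

Despite its simplicity, this greedy heuristic gives the deterministic, highly consistent segmentations that turn out to matter for MT.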
Things we looked at
• Better segmentation performance = better MT performance?
• Should we minimize the OOV rate of the MT test data?
Is OOV rate a good predictor of a good segmenter for MT?
• Not really; here's a counterexample (* CRF-Lex is a CRF segmenter with strong external lexicon features)
• A segmenter that yields a much larger MT training-set vocabulary and a higher OOV rate on the MT test data can still be better for MT
• Let's take a look at CRF-Lex
CRF-Lex: sequence model + lexicon
• We build a CRF-based hybrid between a character sequence model and a lexicon-based segmentation model
• For each character, we look up external lexicons to see if the character is a prefix/infix/suffix of words in the lexicon
• Our lexicon features are conjunctions of this information over the sequence (current, previous, next character, etc.)
• Example: with 沃尔玛 (Walmart) in the lexicon, the feature "a length-3 word starts here" fires at 沃 in the sequence 在 沃 尔 玛 不 (segmentation labels 1 1 0 0 1)
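The prefix/infix/suffix lookup above can be sketched as follows. This is a minimal illustration of the idea, not the paper's exact feature templates; feature names and the maximum word length are assumptions.

```python
def lexicon_features(chars, i, lexicon, max_len=4):
    """Binary lexicon features for the character at position i: does it
    start (prefix), sit strictly inside (infix), or end (suffix) some
    word in the external lexicon? Feature names are illustrative."""
    feats = {}
    for L in range(2, max_len + 1):
        # character i is the first character of a length-L lexicon word
        if i + L <= len(chars) and "".join(chars[i:i + L]) in lexicon:
            feats[f"begins_word_len{L}"] = 1
        # character i is the last character of a length-L lexicon word
        if i - L + 1 >= 0 and "".join(chars[i - L + 1:i + 1]) in lexicon:
            feats[f"ends_word_len{L}"] = 1
        # character i is strictly inside a length-L lexicon word
        for j in range(1, L - 1):
            if i - j >= 0 and "".join(chars[i - j:i - j + L]) in lexicon:
                feats[f"inside_word_len{L}"] = 1
    return feats

chars = list("在沃尔玛不")
lex = {"沃尔玛"}
print(lexicon_features(chars, 1, lex))  # 沃 begins the 3-char word 沃尔玛
print(lexicon_features(chars, 3, lex))  # 玛 ends it
```

In the real model, these per-character indicators are conjoined across neighboring positions so the CRF can learn, e.g., that a "begins length-3 word" signal followed by two "inside/ends" signals strongly suggests labels 1 0 0.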
Things we looked at
• Better segmentation performance = better MT performance?
• Should we minimize the OOV rate of the MT test data?
• What else could be good predictors? Consistency of segmentation
Another measure: segmentation consistency
• Intuition: if a word is always segmented the same way, we say it's consistent
• Suppose the gold segmentation data contains several occurrences of a word "ABC"
• If the segmenter segments all the occurrences as "ABC", it is consistent
• If the segmenter segments all the occurrences as "A" "BC", it is still consistent (but this will be penalized in the segmentation F1 score)
• If some occurrences are segmented one way and others another way, the segmentation is inconsistent
A concrete example: consistency measured by conditional entropy
• There are 177 occurrences of 人民 "ren min" (people); let's see how consistent each segmenter is
• In context c₁ _ c₂, the observed segmentation variants are: c₁人民 c₂, c₁ 人民c₂, c₁ 人民 c₂, and c₁人 民 c₂
• For wᵢ = 人民, CRF-Lex is the most consistent
• Consistency is measured by the conditional entropy of a word's segmentation given the word, H(S | wᵢ) = −Σₛ p(s | wᵢ) log p(s | wᵢ), and the overall measure sums this entropy over all words
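The per-word conditional entropy can be computed directly from the observed segmentations. The 150/15/8/4 split over the four variants below is hypothetical, invented only to illustrate the computation; the real per-segmenter counts are in the paper.

```python
import math
from collections import Counter

def consistency_entropy(segmentations):
    """Conditional entropy H(segmentation | word) over the observed
    segmentations of one word: 0 bits when every occurrence is segmented
    the same way; larger values mean less consistency."""
    counts = Counter(segmentations)
    n = sum(counts.values())
    # + 0.0 normalizes IEEE -0.0 in the fully consistent case
    return -sum(c / n * math.log2(c / n) for c in counts.values()) + 0.0

# Hypothetical split of the 177 occurrences of 人民 across four variants
observed = ["人民"] * 150 + ["人 民"] * 15 + ["c人民"] * 8 + ["人民c"] * 4
print(round(consistency_entropy(observed), 3))   # → 0.83
print(consistency_entropy(["人民"] * 177))        # → 0.0 (perfectly consistent)
```

A segmenter that always produces the same variant scores 0 bits for that word, matching the intuition that even an "incorrect" but uniform segmentation is consistent.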
Consistency and MT performance of 3 segmenters
• The strong lexicon features greatly improve consistency
Consistency is not enough
• Character-based segmentation is completely consistent, but not the best for MT
• Another factor: appropriate word size for the target language
• Prediction: we can further improve MT performance by finding the optimal word size for translatability
Things we looked at
• Better segmentation performance = better MT performance?
• Should we minimize the OOV rate of the MT test data?
• What else could be good predictors? Consistency of segmentation; finding the optimal word size for translatability
Tuning word length in CRF
• Our CRF model makes a binary prediction: 1 means separated from the previous character, 0 means continuing. It is trained on Chinese Treebank 6 data
• We change the weight λ₀ of a feature that fires only when the prediction is 1
• Extreme case: a very large positive λ₀ causes the segmenter to predict only 1s (character-based)
• We find an optimal middle ground
• Example, with labels over 內 政 部:
  – CTB segmentation: 內政部 "Department of Internal Affairs" (1 0 0)
  – One that optimizes for MT: 內政 | 部 "Internal Affairs | Department" (1 0 1)
  – Character-based: 內 | 政 | 部 "inside | politics | department" (1 1 1)
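The effect of the length-tuning weight can be illustrated with a toy per-position decision rule (the real model does sequence-level CRF inference, and the scores below are invented for illustration): adding a bias to the "start new word" label shifts the output toward shorter words.

```python
def segment_with_bias(char_scores, lam0):
    """Toy illustration: each position has a model score favoring label 1
    ('start new word'); adding the bias lam0 shifts predictions toward 1.
    Larger lam0 -> shorter words, character-based in the limit."""
    return [1 if s + lam0 > 0 else 0 for s in char_scores]

# hypothetical scores for 內 政 部 (the first character always starts a word)
scores = [5.0, -1.2, -0.4]
print(segment_with_bias(scores, 0.0))  # → [1, 0, 0]  CTB-style: 內政部
print(segment_with_bias(scores, 1.0))  # → [1, 0, 1]  MT-tuned: 內政 | 部
print(segment_with_bias(scores, 2.0))  # → [1, 1, 1]  character-based
```

Sweeping the bias from 0 upward traces exactly the continuum the next slide describes, and the MT-optimal point sits strictly between the two extremes.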
Continuum between word- and character-based segmentation
• Sweeping λ₀ traces a continuum from the CTB segmentation 內政部 "Department of Internal Affairs" (λ₀ = 0, which optimizes segmentation performance on CTB) through 內政 | 部 "Internal Affairs | Department" to character-based 內 | 政 | 部 "inside | politics | department"
• The best performance for MT (31.47 BLEU, vs. 30.95 for the CTB-optimized setting) lies in the middle of the continuum
Things we looked at
• Better segmentation performance = better MT performance?
• Should we minimize the OOV rate of the MT test data?
• What else could be good predictors? Consistency of segmentation; finding the optimal word size for translatability
• Identifying proper nouns can help the MT system
Joint training of word segmentation and proper noun tagging
• Unknown words are a problem for MT (and segmentation) systems; proper nouns are an important source of unknown words
• Approach: expand the label set of the CRF to jointly predict word segmentation and detect proper nouns
• Previously the label set was { 0, 1 }; we expand it to { NR, OTHER } × { 0, 1 }
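The expanded label set is just the cross product of the two decisions, so each character gets one joint label. The "NER-SEG" string encoding and the decode helper below are illustrative, not the paper's notation.

```python
from itertools import product

# Original CRF label set: 1 = start a new word, 0 = continue the previous word
SEG_LABELS = ("0", "1")
# Joint label set crosses segmentation with proper-noun (NR) tagging
JOINT_LABELS = tuple(f"{ner}-{seg}"
                     for ner, seg in product(("NR", "OTHER"), SEG_LABELS))

def decode(chars, labels):
    """Recover (word, is_proper_noun) pairs from a joint label sequence."""
    words, is_nr = [], []
    for ch, lab in zip(chars, labels):
        ner, seg = lab.split("-")
        if seg == "1" or not words:   # start a new word
            words.append(ch)
            is_nr.append(ner == "NR")
        else:                         # continue the previous word
            words[-1] += ch
    return list(zip(words, is_nr))

print(JOINT_LABELS)  # → ('NR-0', 'NR-1', 'OTHER-0', 'OTHER-1')
# 他在沃尔玛工作 "He works at Walmart"; 沃尔玛 (Walmart) is the proper noun
print(decode(list("他在沃尔玛工作"),
             ["OTHER-1", "OTHER-1", "NR-1", "NR-0", "NR-0",
              "OTHER-1", "OTHER-0"]))
# → [('他', False), ('在', False), ('沃尔玛', True), ('工作', False)]
```

Because the CRF now scores the two decisions jointly, evidence that a span is a proper noun can also change where the word boundaries go, and vice versa.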
Results of joint segmentation and proper noun tagging
• Joint segmentation and proper noun tagging gives a better OOV recall rate, and thus a better overall F1 score
• 0.32 BLEU improvement
Conclusions
• Segmentation performance, vocabulary size, and OOV rate do not accurately predict MT performance
• Two important factors for MT: segmenting things consistently, and having the appropriate word size for the target language (+0.73 BLEU)
• Incorporating proper noun detection helps (+0.32 BLEU)