
Optimizing Chinese Word Segmentation for MT performance

Presentation Transcript


  1. Optimizing Chinese Word Segmentation for MT performance. Pi-Chuan Chang, Michel Galley and Chris Manning, Stanford NLP group

  2. Very brief introduction to Chinese word segmentation • Why segment into “words”? • No spaces between characters • Words are better semantic units: 天 = sky, 花 = flower, but 天花 = smallpox • Most commonly used standard: Chinese Treebank (CTB) • Good for POS tagging & parsing

  3. Word segmentation and MT • The Chinese segmentation task has been defined as optimizing for a given segmentation standard • Performance is measured by F measure • Better segmentation performance = better MT performance? • Our goal: understand what is important for a segmenter to perform well in MT, and optimize the segmenter for MT.
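
As an illustration of the F measure mentioned on this slide, here is a minimal sketch that scores a predicted segmentation against a gold one by comparing word spans. The function names `spans` and `segmentation_f1` are hypothetical, and scoring by exact span match is an assumption about the usual evaluation setup rather than something the slides spell out.

```python
def spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    out = set()
    start = 0
    for w in words:
        out.add((start, start + len(w)))
        start += len(w)
    return out


def segmentation_f1(gold_words, pred_words):
    """Word-level F1: a predicted word counts only if its span matches exactly."""
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # One of three predicted words and one of two gold words match.
    print(segmentation_f1(["在", "沃尔玛"], ["在", "沃", "尔玛"]))
```

Under this measure a segmenter is rewarded only for reproducing the standard, which is why the question on this slide is whether that reward lines up with MT quality.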

  4. Things we looked at • Better segmentation performance = better MT performance?

  5. Segmentation performance ≠ MT performance • A segmenter that performs much better on segmentation can be worse for MT than a simple max-match segmenter • Why is Max-Match better on MT?

  6. Segmentation performance ≠ MT performance • A segmenter that performs much better on segmentation can be worse for MT than a simple max-match segmenter • Why is Max-Match better on MT? Is lower OOV rate the reason?
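
For readers unfamiliar with the max-match baseline named on this slide, the following is a minimal sketch of greedy forward maximum matching. The slides do not describe the exact variant used, so the forward direction, the single-character fallback, and the name `max_match` are assumptions for illustration.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon word that starts there;
    fall back to a single character when nothing in the lexicon matches.
    """
    if max_len is None:
        max_len = max(1, max((len(w) for w in lexicon), default=1))
    words = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


if __name__ == "__main__":
    print(max_match("在沃尔玛", {"在", "沃尔玛"}))
```

Because every multi-character word it emits comes from the lexicon, a segmenter of this kind produces a limited vocabulary, which is what motivates the OOV question on this slide.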

  7. Things we looked at • Better segmentation performance = better MT performance? • Should we minimize the OOV rate of MT test data?

  8. Is OOV rate a good predictor of a good segmenter for MT? Not really. Here’s a counterexample: a system with a much larger MT training set vocabulary and a much larger MT test OOV rate can be better. (* CRF-Lex is a CRF segmenter with strong external lexicon features.) Let’s take a look at CRF-Lex.
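
The OOV rate discussed on this slide can be sketched as the fraction of test tokens that never appear in the training data. Whether the original comparison counted tokens or types is not stated in the transcript, so the token-level definition and the name `oov_rate` below are assumptions.

```python
def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose word form is absent from the training data."""
    vocab = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


if __name__ == "__main__":
    # Toy data: two of the four test tokens are unseen in training.
    print(oov_rate(["a", "b"], ["a", "c", "c", "b"]))
```

Note that the rate depends on the segmenter itself: the same raw text yields different training vocabularies and different test tokens under different segmentations.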

  9. CRF-Lex: sequence model + lexicon • We build a CRF-based hybrid between a character sequence model and a lexicon-based segmentation model • For each character, we look up external lexicons to see if the character is a prefix/infix/suffix of words in the lexicon • Our lexicon features are conjunctions of this information on the sequence (current, previous, next character, etc.) • Example: in 在 沃 尔 玛 不, a length 3 word, 沃尔玛 (Walmart), starts at 沃; the values shown under the five characters are 1 1 0 0 1
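
The lexicon lookup on this slide can be sketched as follows. This is only one plausible reading: it marks, for every position, whether a lexicon word of a given length begins, continues, or ends there. The feature names (`B3`, `M3`, `E3`), the `max_len` cutoff, and the function `lexicon_features` are hypothetical, and the conjunctions with neighbouring characters that the slide mentions are left out.

```python
def lexicon_features(text, lexicon, max_len=4):
    """Return one set of lexicon-match features per character.

    B<n>: a lexicon word of length n begins at this character
    M<n>: this character is inside a lexicon word of length n
    E<n>: a lexicon word of length n ends at this character
    """
    feats = [set() for _ in text]
    for i in range(len(text)):
        for n in range(2, max_len + 1):
            if i + n > len(text):
                break
            if text[i:i + n] in lexicon:
                feats[i].add(f"B{n}")
                for j in range(i + 1, i + n - 1):
                    feats[j].add(f"M{n}")
                feats[i + n - 1].add(f"E{n}")
    return feats


if __name__ == "__main__":
    for ch, f in zip("在沃尔玛不", lexicon_features("在沃尔玛不", {"沃尔玛"})):
        print(ch, sorted(f))
```

On the slide's example the character 沃 receives the feature for "a length 3 word starts here", which a CRF could then weigh alongside its character-sequence features.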

  10. Things we looked at • Better segmentation performance = better MT performance? • Should we minimize the OOV rate of MT test data? • What else could be good predictors? • Consistency of segmentation

  11. Another measure: Segmentation consistency • Intuition: if a word is always segmented the same way, we say it’s consistent • Suppose that, in the gold segmentation data, there are some occurrences of a word “ABC” • It is consistent if the segmenter segments all the occurrences as “ABC” • If the segmenter segments all the occurrences as “A” “BC”, it is still consistent (but this will be penalized in the Segmentation F1 score) • [Figure: occurrences of “ABC” scattered through running text, illustrating consistent and inconsistent segmentation]

  12. A concrete example: There are 177 occurrences of 人民 “ren min” (people). Let’s see how consistent each segmenter is • Consistency: measured by conditional entropy, the entropy of the ways each word is segmented, summed over all words • Possible segmentations of 人民 between context characters c1 and c2: c1人民 c2, c1 人民c2, c1 人民 c2, c1人 民 c2 • For w_i = “人民”, CRF-Lex is the most consistent
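
The formula on this slide did not survive in the transcript, so the following is a sketch under an assumed definition: H(S|W) = -Σ_w P(w) Σ_s P(s|w) log2 P(s|w), where w ranges over gold words and s over the ways the segmenter rendered them. The base-2 logarithm, the weighting by word frequency, the function name `conditional_entropy`, and the toy counts are all assumptions, not figures from the talk.

```python
import math
from collections import Counter


def conditional_entropy(observations):
    """Entropy of the segmentation variant given the gold word.

    observations: iterable of (gold_word, observed_segmentation) pairs.
    Returns 0.0 when every word is always segmented the same way.
    """
    by_word = {}
    total = 0
    for word, variant in observations:
        by_word.setdefault(word, Counter())[variant] += 1
        total += 1
    entropy = 0.0
    for counts in by_word.values():
        n_w = sum(counts.values())
        h_w = -sum((c / n_w) * math.log2(c / n_w) for c in counts.values())
        entropy += (n_w / total) * h_w
    return entropy


if __name__ == "__main__":
    # Toy data (not from the slides): 人民 is split in one of four occurrences.
    toy = [("人民", "人民")] * 3 + [("人民", "人 民")]
    print(conditional_entropy(toy))
```

A lower value means a more consistent segmenter. A segmenter that always outputs “人 民” would score 0.0 here as well, matching the point of the previous slide that consistency is separate from agreement with the gold standard.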

  13. Consistency and MT performance of 3 segmenters • The strong lexicon features greatly improve consistency

  14. Consistency is not enough • Character-based segmentation is completely consistent, but not the best for MT • Another factor: appropriate word size for the target language • Prediction: we can further improve MT performance by finding the optimal word size for translatability

  15. Things we looked at • Better segmentation performance = better MT performance? • Should we minimize the OOV rate of MT test data? • What else could be good predictors? • Consistency of segmentation • Finding optimal word size for translatability

  16. Tuning word length in CRF • Our CRF model makes a binary prediction: 1 means separated from the previous character; 0 means continuing • Trained on Chinese Treebank 6 data • We change the weight λ0 of a feature that fires only when the prediction is 1 • Extreme case: a very large positive λ0 causes the segmenter to predict only 1 (character-based) • We find the optimal middle ground • Example: 內 政 部 (1 1 1), inside | politics | department: character-based • 內政 部 (1 0 1), Internal Affairs | Department: the one that optimizes for MT • 內政部 (1 0 0), Department of Internal Affairs: CTB segmentation
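
The effect of the bias weight on this slide can be illustrated with a deliberately simplified sketch. It replaces the CRF with independent per-character decisions and uses made-up scores, so it shows only the mechanism: adding a constant to the score of label 1 shifts the output toward shorter words. The name `segment_with_bias`, the decision threshold of 0, and the score values are hypothetical.

```python
def segment_with_bias(chars, boundary_scores, lambda0):
    """Label each character 1 (starts a new word) or 0 (continues the word).

    boundary_scores[i] stands in for the model's preference for label 1 at
    position i; lambda0 is added only to the label-1 side, as on the slide.
    Simplification: decisions are independent, not decoded jointly as in a CRF.
    """
    words = []
    for i, (ch, score) in enumerate(zip(chars, boundary_scores)):
        starts_word = i == 0 or score + lambda0 > 0
        if starts_word:
            words.append(ch)
        else:
            words[-1] += ch
    return words


if __name__ == "__main__":
    scores = [5.0, -2.0, -0.5]  # invented numbers for illustration only
    for lam in (0.0, 1.0, 10.0):
        print(lam, segment_with_bias("內政部", scores, lam))
```

With these invented scores the three settings reproduce the three segmentations on the slide: the whole word, the intermediate split, and the character-based split.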

  17. Continuum between word- and character-based segmentation • [Figure: MT performance along the continuum; values shown include 31.47 and 30.95] • Point labeled 0, which optimizes segmentation performance on CTB: 內政部 (Department of Internal Affairs) • Best performance for MT: 內政 部 (Internal Affairs | Department) • Character-based: 內 政 部 (inside | politics | department)

  18. Things we looked at • Better segmentation performance = better MT performance? • Should we minimize the OOV rate of MT test data? • What else could be good predictors? • Consistency of segmentation • Finding optimal word size for translatability • Identifying proper nouns can help the MT system

  19. Joint training of Word Segmentation and Proper Noun Tagging • Unknown words are a problem for MT (and segmentation) systems; proper nouns are an important source of unknown words • Approach: expand the label set of the CRF to jointly predict word segmentation and detect proper nouns • Previously the label set was { 0, 1 } • Expanding it to be { NR, OTHER } × { 0, 1 }
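
The expanded label set on this slide is a Cartesian product, which can be sketched directly. The helper `decode_joint` and the rule that a word takes the tag of its first character are assumptions made for illustration; the slides do not say how the joint labels are read back out.

```python
from itertools import product

TAGS = ("NR", "OTHER")  # NR: proper noun, as on the slide
BOUNDARIES = (0, 1)     # 1: starts a new word, 0: continues the word

# The four joint labels the expanded CRF would choose among.
JOINT_LABELS = list(product(TAGS, BOUNDARIES))


def decode_joint(chars, labels):
    """Turn per-character (tag, boundary) labels into (word, tag) pairs.

    Assumption: a word takes the tag of its first character.
    """
    words = []
    for i, (ch, (tag, boundary)) in enumerate(zip(chars, labels)):
        if i == 0 or boundary == 1:
            words.append([ch, tag])
        else:
            words[-1][0] += ch
    return [(w, t) for w, t in words]


if __name__ == "__main__":
    print(JOINT_LABELS)
    labels = [("OTHER", 1), ("NR", 1), ("NR", 0), ("NR", 0)]
    print(decode_joint("在沃尔玛", labels))
```

A single sequence model over these four labels lets evidence for a proper noun influence where word boundaries are placed, and the other way around.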

  20. Results of Joint Segmentation and Proper Noun Tagging • Joint segmentation and proper noun tagging gives: • Better OOV recall rate • Thus a better overall F1 score • 0.32 BLEU improvement

  21. Conclusions • Segmentation performance, vocab size or OOV rate do not accurately predict MT performance • Two important factors for MT: • Segmenting things consistently • Having the appropriate word size for the target language (+0.73 BLEU) • Incorporating proper noun detection helps (+0.32 BLEU)

  22. Thank you!
