Chinese Word Segmentation Method for Domain-Special Machine Translation

Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an
Beijing Jiaotong University

Outline
• Motivation
• Method of combining multiple segmentation results
• Experiment & Evaluation
• Conclusion

Motivation 1/2
• Out-of-vocabulary (OOV) rate with respect to the CTB training data:
  • CTB test data: OOV 3.47%
  • Science annotated data: OOV 22.4%

Motivation 2/2
• Background: development of a domain-specific Chinese-English machine translation system.
• Problem: the accuracy of Chinese Word Segmentation (CWS) often decreases on the large amounts of training text.
  • This causes many errors in translation knowledge extraction.
  • It therefore seriously affects translation quality.

Our resolution
• Related work
  • Domain-adapted Chinese word segmentation based on statistical features
    • Previous work generally adopts only the 1-best result and ignores the lower-ranking results.
  • Bilingually motivated domain-adapted word segmentation
    • Many characters are aligned to NULL, which decreases the accuracy of Chinese segmentation.
• Our goal: extend these methods to augment the domain adaptation of CWS.

Our approach
• We propose a linear model that combines multiple Chinese word segmentation results from two segmenters to augment domain adaptation:
  • a segmenter based on n-gram features of a Chinese raw corpus;
  • a segmenter based on bilingually motivated features.

Framework
[Diagram. Labels: Chinese raw corpus; annotated corpus; training CRF model; CRF segmenter; Chinese sentences; English sentences; word alignment result; bilingual segmenter; segmentation results of the two segmenters; linear model for combining multiple results; final results.]

CRF segmenter pipeline
[Diagram. Labels: raw corpus; annotated corpus; test data; n-gram statistical features; extracting statistical features (shown for both the annotated corpus and the test data); training CRF model; CRF decoding; CRF segmenter; segmentation result.]

CRF segmenter
• Explores statistical features of a large-scale domain-specific Chinese raw corpus:
  • n-gram frequency feature
  • n-gram AV (Accessor Variety) feature
• Output of the CRF model:
  • N-best list of segmentation results
  • corresponding probability scores
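The two statistical features can be illustrated with a minimal Python sketch. It assumes the common definition of Accessor Variety as the smaller of the number of distinct left-neighbor and distinct right-neighbor characters of an n-gram. The function name `ngram_stats`, the boundary markers `<s>` and `</s>`, and the toy corpus are illustrative assumptions, and the slides do not say how the values are turned into CRF features.

```python
from collections import Counter, defaultdict

def ngram_stats(sentences, max_n=3):
    """Count character n-grams and compute their Accessor Variety (AV).

    Assumption: AV(g) = min(#distinct left neighbors, #distinct right neighbors),
    with sentence boundaries counted as the single markers "<s>" and "</s>".
    """
    freq = Counter()
    left = defaultdict(set)   # distinct characters seen immediately before each n-gram
    right = defaultdict(set)  # distinct characters seen immediately after each n-gram
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                freq[gram] += 1
                left[gram].add(sent[i - 1] if i > 0 else "<s>")
                right[gram].add(sent[i + n] if i + n < len(sent) else "</s>")
    av = {g: min(len(left[g]), len(right[g])) for g in freq}
    return freq, av

# Toy raw corpus (hypothetical, for illustration only).
corpus = ["光纤通信", "光纤传感器", "使用光纤"]
freq, av = ngram_stats(corpus)
print(freq["光纤"], av["光纤"])
```

An n-gram that occurs often and in varied contexts, such as 光纤 in the toy corpus, gets a high frequency and a high AV, which is the kind of evidence a domain-adapted segmenter can exploit.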

Observation
• Some segments that are wrong in the 1-best result are segmented correctly in lower-ranking results.
• We intend to utilize the correct parts within the 10-best results, together with the corresponding probability scores.

Bilingual segmenter
• The boundaries of Chinese words can be inferred from a parallel corpus:
  • word boundaries are marked in the English sentences;
  • alignments link English words to Chinese words.

Inference step
• Conduct word alignment using GIZA++, regarding each character of the Chinese sentence as one word.
• For each alignment a_i = <e_i, C>, if the characters in C are consecutive in the sentence:
  • take C as a word;
  • calculate its confidence score (refer to the paper).
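The consecutiveness test in the inference step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `infer_words` and the input format (a dict from English word index to the Chinese character positions aligned to it) are assumptions, parsing of GIZA++ output is not shown, and the confidence score is omitted because its definition is only given in the paper.

```python
def infer_words(chinese, alignments):
    """Return (start, end, word) for each English word whose aligned
    Chinese characters form one consecutive run in the sentence.

    `alignments` is a hypothetical format: {english_index: [char positions]}.
    `end` is exclusive.
    """
    words = []
    for _e_idx, positions in alignments.items():
        if not positions:
            continue  # English word aligned to nothing
        pos = sorted(set(positions))
        # Consecutive means the positions fill the whole range with no gap.
        if pos[-1] - pos[0] + 1 == len(pos):
            start, end = pos[0], pos[-1] + 1
            words.append((start, end, chinese[start:end]))
    return sorted(words)

# Toy example: "optical fiber" -> 光纤, "sensor" -> 传感器 (hypothetical alignment).
sentence = "光纤传感器"
candidates = infer_words(sentence, {0: [0, 1], 1: [2, 3, 4]})
rejected = infer_words(sentence, {0: [0, 2]})  # gap at position 1, so no word
print(candidates, rejected)
```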

Linear model
• Calculate the score of C_ij being a word by combining multiple segmentation results.
• F(i, j) denotes the score of the characters from i to j being a word.
• λ_k (1≤k≤K) are the weights of the K segmentation results.
• Conf_k(i, j) (1≤k≤K) is the confidence score of the k-th segmentation result.
• seg_k(i, j) (1≤k≤K) is a two-valued function.
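The formula itself is not reproduced in this transcript. The sketch below therefore assumes one plausible reading of the definitions: F(i, j) is the weighted sum over k of λ_k · seg_k(i, j) · Conf_k(i, j), where seg_k(i, j) is 1 if the k-th segmentation result contains the span (i, j) as a word and 0 otherwise. Both this form and the data layout (each result as a dict from span to confidence) are assumptions; the paper should be consulted for the exact definition.

```python
def combine_scores(results, weights):
    """Combine K segmentation results into span scores F(i, j).

    Assumed form: F(i, j) = sum_k lambda_k * seg_k(i, j) * Conf_k(i, j).
    `results[k]` maps each span (i, j) proposed by result k to Conf_k(i, j);
    spans absent from a result have seg_k(i, j) = 0 and contribute nothing.
    """
    scores = {}
    for lam, result in zip(weights, results):
        for span, conf in result.items():
            scores[span] = scores.get(span, 0.0) + lam * conf
    return scores

# Two hypothetical results over a 5-character sentence.
results = [
    {(0, 2): 0.5, (2, 5): 0.5},  # e.g. one entry of the CRF n-best list
    {(0, 2): 1.0},               # e.g. the bilingual segmenter
]
scores = combine_scores(results, weights=[0.5, 0.25])
print(scores)
```

Spans proposed by several results accumulate score, so a word found in a lower-ranking CRF result and confirmed by the bilingual segmenter can outrank a 1-best error.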

Decoding
• C_ij and F(i, j) are represented in a lattice.
• The best sequence is found by a dynamic programming algorithm.
• The search looks for the sequence of words with the maximum product of their scores.
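A minimal sketch of such a search, assuming the lattice is given as a dict from spans (i, j) to positive scores F(i, j), with j exclusive. The function name `decode` and the handling of missing spans (score 0, never chosen) are assumptions made for illustration.

```python
def decode(n, scores):
    """Find the segmentation of characters 0..n with the maximum product of span scores.

    best[j] is the best product over segmentations of the first j characters;
    back[j] is the start of the last word in that segmentation.
    """
    best = [0.0] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 1.0  # empty prefix: neutral element of the product
    for j in range(1, n + 1):
        for i in range(j):
            cand = best[i] * scores.get((i, j), 0.0)
            if cand > best[j]:
                best[j], back[j] = cand, i
    if back[n] is None:
        return [], 0.0  # no path through the lattice covers the sentence
    spans, j = [], n
    while j > 0:
        spans.append((back[j], j))
        j = back[j]
    spans.reverse()
    return spans, best[n]

# Toy lattice over a 4-character sentence (hypothetical scores).
lattice = {(0, 2): 0.5, (2, 4): 0.5, (0, 1): 0.25, (1, 4): 0.5, (0, 4): 0.125}
path, score = decode(4, lattice)
print(path, score)
```

For long sentences, summing log scores instead of multiplying raw scores avoids numerical underflow and selects the same path.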

Training parameter λ
• Initial point λ_l (1≤l≤K): a point in the K-dimensional parameter space is selected at random.
• The parameters λ_l are optimized through an iterative process.
• In each step, only one parameter is optimized while all other parameters are kept fixed.
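The procedure described above resembles coordinate-wise optimization. The sketch below is an assumption-laden illustration: the objective (for instance, segmentation F-measure on the training sentences), the candidate grid, the stopping rule, and the function name `coordinate_ascent` are not specified on the slides.

```python
import random

def coordinate_ascent(objective, k, candidates, rounds=10, seed=0):
    """Optimize k weights one at a time, keeping the others fixed.

    `objective` is a hypothetical callable mapping a weight list to a score
    to maximize. `candidates` is an assumed grid of values tried per weight.
    """
    rng = random.Random(seed)
    lam = [rng.choice(candidates) for _ in range(k)]  # random initial point
    best = objective(lam)
    for _ in range(rounds):
        improved = False
        for l in range(k):               # optimize one parameter at a time
            for value in candidates:
                trial = lam[:l] + [value] + lam[l + 1:]
                score = objective(trial)
                if score > best:
                    lam, best, improved = trial, score, True
        if not improved:                 # a full sweep changed nothing: stop
            break
    return lam, best

# Toy objective with a known maximum at (0.5, 0.25), for illustration only.
def toy_objective(lam):
    return -((lam[0] - 0.5) ** 2 + (lam[1] - 0.25) ** 2)

weights, value = coordinate_ascent(toy_objective, k=2,
                                   candidates=[0.0, 0.25, 0.5, 0.75, 1.0])
print(weights, value)
```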

Experiment setting
• Experimental data: NTCIR-10 Chinese-English parallel patent description sentences.
• Annotation set: 300 randomly selected sentence pairs.
  • 150 sentences are used for training the lattice parameters.
  • 150 sentences are used for evaluation.

Evaluation
• We conduct evaluations from two aspects:
  • Evaluation (1): accuracy of Chinese word segmentation (F-measure)
  • Evaluation (2): translation quality of the MT system (BLEU)

Evaluation (1): accuracy of Chinese word segmentation
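For reference, word-level F-measure is commonly computed by comparing the character spans of predicted words with those of gold words. The sketch below follows that common definition; the slides do not state which scoring script was used, and the function names are illustrative.

```python
def to_spans(words):
    """Convert a word list to the set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def f_measure(gold_words, pred_words):
    """Return (precision, recall, F1) of predicted words against gold words."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # a word is correct only if both boundaries match
    if correct == 0:
        return 0.0, 0.0, 0.0
    p = correct / len(pred)
    r = correct / len(gold)
    return p, r, 2 * p * r / (p + r)

# Toy example: one of three predicted words matches one of two gold words.
p, r, f = f_measure(["光纤", "传感器"], ["光纤", "传感", "器"])
print(round(p, 3), round(r, 3), round(f, 3))
```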

Evaluation (2)
• We build a phrase-based SMT system with Moses, using different Chinese segmenters:
  • 1-best of the CRF segmenter (baseline)
  • Linear model (our approach)
  • Stanford Chinese segmenter
  • NLPIR Chinese segmenter

Evaluation (2): result
• Our approach improves the BLEU score by 0.62% over the baseline.
• Our approach also outperforms the two popular segmenters (Stanford and NLPIR).

Result Analysis

Conclusion
• We propose a linear model that combines multiple segmentation results from two segmenters to augment domain adaptation:
  • one based on n-gram statistical features of a large Chinese raw corpus;
  • the other based on bilingually motivated features of a parallel corpus.
• The experimental results show that both the F-measure of the CWS results and the BLEU score of the SMT system are improved.

Thanks! Q&A
