Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University
Outline • Motivation • Method of combining multiple segmentation results • Experiment & Evaluation • Conclusion
Motivation 1/2 • OOV rate relative to CTB training data • CTB test data: 3.47% • Science annotated data: 22.4%
Motivation 2/2 • Background: development of a domain-specific Chinese-English machine translation system. • Problem: the accuracy of Chinese Word Segmentation (CWS) often decreases on the large amounts of training text. • This causes many errors in translation knowledge extraction • and therefore seriously affects translation quality.
Our resolution • Related work • Domain-adapted Chinese word segmentation based on statistical features • Previous work generally adopts only the 1-best result and ignores the lower-ranking results. • Bilingually motivated domain-adapted word segmentation • Many characters are aligned to NULL, which decreases the accuracy of Chinese segmentation. • Our goal: extend these methods to augment domain adaptation of CWS
Our approach • We propose a linear model that combines multiple Chinese word segmentation results from two segmenters to augment domain adaptation. • A segmenter based on n-gram features of a Chinese raw corpus. • A segmenter based on bilingually motivated features.
Framework [Figure: framework diagram. Components: Chinese raw corpus, annotated corpus, training CRF model, CRF segmenter, Chinese sentences, English sentences, word alignment result, bilingual segmenter, segmentation results, linear model for combining multiple results, results.]
[Figure: CRF segmenter diagram. Components: raw corpus, annotated corpus, test data, extracting statistical features, n-gram statistical features, training CRF model, CRF decoding, segmentation result.]
CRF segmenter • Exploring statistical features of a large-scale domain-specific Chinese raw corpus • N-gram frequency feature • N-gram AV (Accessor Variety) feature • Output of CRF models • N-best list of segmentation results • Corresponding probability scores
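To make the two statistical features concrete, here is a minimal sketch of how n-gram frequency and accessor variety could be counted from raw text. The function name `ngram_stats`, the toy corpus, and the treatment of sentence boundaries as a single shared marker are assumptions for illustration; the slides do not specify these details. Accessor variety is taken here as the smaller of the number of distinct left neighbours and distinct right neighbours.

```python
from collections import Counter, defaultdict

def ngram_stats(sentences, max_n=3):
    """Return (frequency, accessor variety) for every character n-gram up to max_n."""
    freq = Counter()
    left = defaultdict(set)   # distinct characters seen immediately before each n-gram
    right = defaultdict(set)  # distinct characters seen immediately after each n-gram
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                freq[gram] += 1
                # Assumption: each sentence boundary counts as one shared marker.
                left[gram].add(sent[i - 1] if i > 0 else "<s>")
                right[gram].add(sent[i + n] if i + n < len(sent) else "</s>")
    # AV is the smaller of the left and right neighbour counts.
    av = {g: min(len(left[g]), len(right[g])) for g in freq}
    return freq, av

# Hypothetical three-sentence raw corpus.
corpus = ["半导体器件", "半导体材料", "该半导体层"]
freq, av = ngram_stats(corpus)
print(freq["半导体"], av["半导体"], av["导体"])
```

In this toy corpus "半导体" has a higher accessor variety than "导体", because "导体" is always preceded by the same character, which hints that "导体" alone is less likely to be a word boundary here.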
Observation • Some erroneous segmentations in 1-best result are segmented correctly in the low-ranking results. • We intend to utilize correct parts within the 10-best results and the corresponding probability scores.
Bilingual segmenter • The boundaries of Chinese words can be inferred from a parallel corpus. • Word boundaries are marked in English sentences. • Alignments from English words to Chinese words.
Inference step • Conduct word alignment using GIZA++, regarding each character of the Chinese sentence as one word. • For each alignment ai = <ei, C>, if the characters in C are consecutive in the sentence: • Take C as a word • Calculate its confidence score (refer to paper)
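The inference step can be sketched as follows, assuming the alignment has already been read into a list of (English word index, Chinese character index) pairs. The function name `candidate_words` and the example links are hypothetical, GIZA++ output parsing is not shown, and the confidence score is omitted because the slides defer its definition to the paper.

```python
from collections import defaultdict

def candidate_words(chinese, links):
    """chinese: sentence string; links: (english_index, chinese_char_index) pairs."""
    by_english = defaultdict(list)
    for e, c in links:
        by_english[e].append(c)
    words = []
    for e, idxs in sorted(by_english.items()):
        idxs = sorted(set(idxs))
        # Keep the character set C only when it forms one unbroken span.
        if idxs[-1] - idxs[0] + 1 == len(idxs):
            words.append((idxs[0], idxs[-1], chinese[idxs[0]:idxs[-1] + 1]))
    return words

# Hypothetical alignment: English word 0 links to characters 0-2,
# English word 1 links to characters 3 and 5 (not consecutive, so skipped).
links = [(0, 0), (0, 1), (0, 2), (1, 3), (1, 5)]
print(candidate_words("半导体的器件", links))
```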
Linear model • Calculate the score of Cij being a word by combining multiple segmentation results • λk (1≤k≤K) are the weights of the K segmentation results. • F(i, j) denotes the score of characters from i to j being a word. • Confk(i, j) (1≤k≤K) is the confidence score of the kth segmentation result. • segk(i, j) (1≤k≤K) is a two-valued function.
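The formula itself did not survive on this slide. A plausible reading of the definitions, stated here as an assumption and not as the authors' exact equation, is F(i, j) = Σk λk · Confk(i, j) · segk(i, j), where segk(i, j) is 1 if the kth result treats characters i to j as a word and 0 otherwise. The sketch below follows that reading; `combine_scores`, the spans, and all numbers are illustrative.

```python
def combine_scores(results, weights):
    """results[k] maps a span (i, j) to Conf_k(i, j).

    A span missing from results[k] means seg_k(i, j) = 0 (assumed form).
    """
    scores = {}
    for lam, spans in zip(weights, results):
        for span, conf in spans.items():
            scores[span] = scores.get(span, 0.0) + lam * conf
    return scores

# Hypothetical inputs: two CRF n-best entries and one bilingual result.
crf_1best = {(0, 2): 0.6, (3, 4): 0.6}
crf_2best = {(0, 1): 0.3, (2, 2): 0.3, (3, 4): 0.3}
bilingual = {(0, 2): 0.8}
F = combine_scores([crf_1best, crf_2best, bilingual], [0.5, 0.25, 0.25])
print(F)
```

Spans proposed by several results accumulate score, which is how a word that is wrong in the 1-best result but right in a lower-ranking or bilingual result can still win.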
Decoding • Cij and F(i, j) are represented in a lattice • The best sequence is found by a dynamic programming algorithm. • Search for the sequence of words with the maximum product of their scores.
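A minimal sketch of the lattice search, assuming the lattice is given as a dictionary from inclusive character spans (i, j) to positive scores F(i, j). The function name `decode` and the example scores are hypothetical; the slides only state that dynamic programming finds the word sequence with the maximum product of scores.

```python
def decode(n, F):
    """n: sentence length; F: inclusive span (i, j) -> positive score.

    Returns the best list of word spans, or None if no full segmentation exists.
    """
    best = [0.0] * (n + 1)   # best[p]: highest product over segmentations of chars 0..p-1
    back = [None] * (n + 1)  # back[p]: start index of the last word in that segmentation
    best[0] = 1.0
    for end in range(1, n + 1):
        for start in range(end):
            score = F.get((start, end - 1), 0.0)
            if best[start] * score > best[end]:
                best[end] = best[start] * score
                back[end] = start
    if n > 0 and back[n] is None:
        return None
    spans, p = [], n
    while p > 0:
        spans.append((back[p], p - 1))
        p = back[p]
    return spans[::-1]

# Hypothetical lattice over a 5-character sentence.
lattice = {(0, 2): 0.5, (0, 1): 0.075, (2, 2): 0.075, (3, 4): 0.375}
print(decode(5, lattice))
```

For long sentences, summing log scores in place of multiplying raw scores would avoid floating-point underflow while selecting the same path.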
Training parameter λ • Initial point λl (1≤l≤K): a point in the K-dimensional parameter space is randomly selected. • The parameters λl are optimized through an iterative process. • In each step, only one parameter is optimized, while all other parameters are kept fixed.
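The one-parameter-at-a-time procedure resembles coordinate ascent. The sketch below is one way it might look, assuming a grid search over candidate values for each weight and a caller-supplied objective (in this setting, presumably segmentation accuracy on the training sentences). The grid, the stopping rule, the name `coordinate_ascent`, and the toy objective are all assumptions, not details from the slides.

```python
import random

def coordinate_ascent(objective, K, grid, rounds=10, seed=0):
    """Maximize objective(weights) by tuning one weight at a time."""
    rng = random.Random(seed)
    lam = [rng.choice(grid) for _ in range(K)]  # random starting point
    best = objective(lam)
    for _ in range(rounds):
        improved = False
        for l in range(K):  # optimize weight l, all others fixed
            for value in grid:
                trial = lam[:l] + [value] + lam[l + 1:]
                score = objective(trial)
                if score > best:
                    lam, best, improved = trial, score, True
        if not improved:  # stop once a full pass changes nothing
            break
    return lam, best

# Toy objective with a known optimum at (0.5, 0.25), for illustration only.
def toy_objective(w):
    return -(w[0] - 0.5) ** 2 - (w[1] - 0.25) ** 2

weights, score = coordinate_ascent(toy_objective, K=2, grid=[0.0, 0.25, 0.5, 0.75, 1.0])
print(weights)
```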
Experiment setting • Experimental data: NTCIR-10 Chinese-English parallel patent description sentences • Annotation set: 300 randomly selected sentence pairs. • 150 sentences used for training the lattice parameters. • 150 sentences used for evaluation.
Evaluation • We conduct evaluations from two aspects: • Evaluation (1): accuracy of Chinese word segmentation (F-measure) • Evaluation (2): translation quality of MT system (BLEU)
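For reference, segmentation F-measure is commonly computed by comparing word spans between the gold and predicted segmentations. The sketch below uses that standard definition; the slides do not spell out the exact scoring script, and the function names and example words are illustrative.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans

def f_measure(gold_words, pred_words):
    """Harmonic mean of word-level precision and recall."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)  # words whose both boundaries match
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one of three predicted words matches the gold standard.
gold = ["半导体", "器件"]
pred = ["半", "导体", "器件"]
print(f_measure(gold, pred))
```

Here precision is 1/3 and recall is 1/2, giving an F-measure of 0.4.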
Evaluation (1): accuracy of Chinese word segmentation
Evaluation (2) • We develop a phrase-based SMT system with Moses, using different Chinese segmenters • 1-best of CRF segmenter (baseline) • Linear model (our approach) • Stanford Chinese segmenter • NLPIR Chinese segmenter
Evaluation (2): result • The BLEU score of our approach increased by 0.62% compared to the baseline. • Our approach performs better than the two popular segmenters.
Conclusion • We propose a linear model that combines multiple segmentation results from two segmenters to augment domain adaptation. • One is based on n-gram statistical features of a large Chinese raw corpus. • The other is based on bilingually motivated features of a parallel corpus. • The experimental results show that both the F-measure of the CWS results and the BLEU score of SMT are improved.