Adapting EBMT to Chinese
Joy (Ying Zhang)
Joy@cs.cmu.edu
Jan 26, 2001

Topics
• Project overview
• EBMT outline
• Chinese language
• Improved Segmenter
• English phrase recognizing and bracketing
• Statistical dictionary
• Results
• Ongoing and future work

Project Overview
• Part of Lingwear, TIDES
• Adapting the existing multi-engine Pangloss MT system to Chinese-English
• Quick-deploy MT system: develop MT with the smallest amount of human effort and knowledge

Multi-engine MT system
• There are three translation engines in the current system:
  • EBMT: Example-Based Machine Translation
  • DICTionary: provides coverage for words not otherwise covered by EBMT; it can be constructed automatically from a bilingual corpus
  • GLOSSaries: built from hand-crafted bilingual word/phrase glossaries

EBMT outline
• Concepts
  • An Example-Based Machine Translation (EBMT) system is given a set of sentences in the source language (from which one is translating) and their corresponding translations in the target language, and uses those examples to translate other, similar source-language sentences into the target language. The basic premise is that, if a previously translated sentence occurs again, the same translation is likely to be correct again. (Ralf Brown)
  • Other EBMT systems operate on parse trees, or find the most similar complete sentence and modify its translation based on the differences between the sentence to be translated and the matched example. (Ralf Brown)
• Our system is a shallow EBMT system
  • Bilingual corpus
  • Indexing (using dictionary), then matching
• One of the most important issues: increasing the performance of MATCHING
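
The matching step of a shallow EBMT system can be illustrated with a small sketch. It assumes the corpus is a list of (source, target) sentence pairs whose words are separated by spaces; the function names `build_index` and `longest_matches` are hypothetical, and the step that aligns a matched chunk to its target-side translation is left out.

```python
from collections import defaultdict

def build_index(examples):
    """Map each source word to the (sentence_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for sid, (source, _target) in enumerate(examples):
        for pos, word in enumerate(source.split()):
            index[word].append((sid, pos))
    return index

def longest_matches(sentence, examples, index):
    """For each start position in the input, find the longest run of words
    that also occurs contiguously in some example source sentence.
    Returns (start, length, sentence_id) triples."""
    words = sentence.split()
    sources = [src.split() for src, _ in examples]
    matches = []
    for start, word in enumerate(words):
        best_len, best_sid = 0, None
        for sid, pos in index.get(word, []):
            length = 0
            while (start + length < len(words)
                   and pos + length < len(sources[sid])
                   and words[start + length] == sources[sid][pos + length]):
                length += 1
            if length > best_len:
                best_len, best_sid = length, sid
        if best_len:
            matches.append((start, best_len, best_sid))
    return matches

examples = [("the cat sat", "le chat s'est assis"),
            ("a dog sat down", "un chien s'est assis")]
index = build_index(examples)
print(longest_matches("the cat sat down", examples, index))
```

Longer matched runs carry more context, which is why the rest of the deck concentrates on making the units being matched longer.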

Chinese language
• Character
  • The unit from which words are built; almost every character has a meaning of its own. When a character combines with other characters to form a word, the meaning of the word may differ from the meaning of the character
• Word
  • Usually a bigram (two-character word); can also be a unigram, trigram or 4-gram. N-grams with n>4 are specific idioms (Data from FDMC 1986)

Chinese language (cont.)
• Problems with words
  • Vague definition of words
  • E.g. People’s Republic of China (all of these words can be considered legal words)

Chinese language (cont.)
• Unknown words
  • New words
  • Words unique to a certain domain, e.g. legal code

Chinese language (cont.)
• Segmentation
  • Segmenting words from the sequence of characters
  • The LDC segmenter uses a dynamic programming algorithm and depends on a frequency dictionary
• Problem of LDC segmentation
  • The frequency dictionary cannot cover the corpus (mis-segmentation)
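
A frequency-dictionary segmenter of this kind can be sketched as follows. This is not the LDC segmenter's code: it assumes the best segmentation is the one that maximizes the product of relative word frequencies, and that a character missing from the dictionary is kept as a single-character word with a small assumed count (`unknown_freq`). The function name and parameters are hypothetical.

```python
import math

def segment(text, freq, max_len=4, unknown_freq=0.5):
    """Split text into words by dynamic programming over a frequency dictionary."""
    total = sum(freq.values())
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: best log-score for text[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in freq:
                count = freq[word]
            elif len(word) == 1:
                count = unknown_freq  # assumed floor for unknown characters
            else:
                continue
            score = best[start] + math.log(count / total)
            if score > best[end]:
                best[end] = score
                back[end] = start
    words = []
    end = n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]

freq = {"中国": 50, "人民": 40, "中": 5, "国": 5, "人": 5, "民": 5}
print(segment("中国人民", freq))
```

When a word is absent from `freq`, the sketch can only break it into shorter known pieces or single characters, which is the mis-segmentation problem named above.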

Chinese language (cont.)
• Consequence of mis-segmentation
  • Match??
  • The longer the word, the better the coverage for EBMT (the context is encapsulated in the word)

Improved Segmenter
• Basic idea: use statistical lexical acquisition to augment the frequency dictionary for the segmenter
• Steps:
  • Use a sliding window to extract repeating patterns (sequences of characters) from the corpus
  • Refine the patterns to construct longer words/terms
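
The two steps can be sketched in a few lines. The slides do not say how patterns are refined, so the rule in `refine` (drop a pattern when a longer pattern containing it occurs equally often) is an assumption, as are the function names and the window sizes.

```python
from collections import Counter

def repeated_patterns(text, min_len=2, max_len=4, min_count=2):
    """Slide windows of min_len..max_len characters over the text and
    keep every character sequence seen at least min_count times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {p: c for p, c in counts.items() if c >= min_count}

def refine(patterns):
    """Assumed refinement rule: a pattern is dropped when a longer pattern
    that contains it occurs exactly as often, so the longer term wins."""
    kept = {}
    for p, c in patterns.items():
        subsumed = any(len(q) > len(p) and p in q and patterns[q] == c
                       for q in patterns)
        if not subsumed:
            kept[p] = c
    return kept

patterns = repeated_patterns("abcdXabcdYabZ")
print(refine(patterns))
```

In the example, "bc", "cd", "abc" and "bcd" only ever occur inside "abcd", so they are dropped, while "ab" is kept because it also occurs on its own. The surviving patterns would then be added to the segmenter's frequency dictionary.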

Improved Segmenter (cont.)
• Assumptions:
  • Localization: words of the same type appear more frequently near each other, rather than being distributed evenly across the whole corpus

Improved Segmenter
• Assumption 2: if a pattern is going to appear again, it should appear within a range related to the average distance between its previous occurrences

Improved Segmenter
• Results:
  • Hard to evaluate because of the vague definition of words
  • The effect of the improved segmenter can be seen in the improvement of EBMT coverage

English phrase bracketing
• Match:
  • Since we increased the average length of Chinese words, we did a similar thing for English so that the Chinese and English parts of the corpus still match
  • Recognize English phrases and bracket the corpus (replacing the blanks with underscores), e.g. the_people’s_republic_of_china (it will be treated as one word)
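
Bracketing itself is a simple rewrite once the phrase list is known. The sketch below assumes the recognized phrases are already available as a list of strings and prefers the longest phrase at each position; the function name `bracket` is hypothetical.

```python
def bracket(sentence, phrases):
    """Join the words of each recognized phrase with underscores,
    preferring the longest phrase that starts at each position."""
    words = sentence.split()
    phrase_set = {tuple(p.split()) for p in phrases}
    max_len = max((len(p) for p in phrase_set), default=1)
    out, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in phrase_set:
                out.append("_".join(words[i:i + n]))
                i += n
                break
        else:
            # No multi-word phrase starts here; keep the single word.
            out.append(words[i])
            i += 1
    return " ".join(out)

phrases = ["the people's republic of china"]
print(bracket("delegates from the people's republic of china arrived", phrases))
```

After this rewrite, a whitespace tokenizer sees the whole phrase as one token, which mirrors the longer Chinese words produced by the improved segmenter.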

Statistical dictionary
• Step 1: collapse the inflected forms of English phrases/words into one class
  • Algorithm: two phrases are put in the same class when their longest common substring is long enough
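
A minimal sketch of the longest-common-substring test follows. The slides do not define "long enough", so the ratio threshold `min_ratio` (common substring length divided by the longer string's length) is an assumption, as are the function names.

```python
def longest_common_substring(a, b):
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def same_class(a, b, min_ratio=0.7):
    """Assumed criterion: the shared substring must cover at least
    min_ratio of the longer of the two strings."""
    if not a or not b:
        return False
    return longest_common_substring(a, b) / max(len(a), len(b)) >= min_ratio

print(same_class("translate", "translated"))
print(same_class("law", "lawyer"))
```

A purely string-based criterion can also merge unrelated words: with the default threshold, "nation" and "station" share "ation" and fall into the same class, so the threshold has to be tuned against the corpus.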

Statistical dictionary
• Step 2: building the statistical dictionary
  • Algorithm (with help from Benjamin)
  • S: source language word
  • T: target language word
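
The scoring formula for this step did not survive in the slide text. As an illustration only, the sketch below builds a dictionary from sentence-level co-occurrence of S and T using the Dice coefficient; this is a common stand-in, not necessarily the algorithm used in the project, and the function name and threshold are hypothetical.

```python
from collections import Counter
from itertools import product

def build_stat_dict(pairs, threshold=0.6):
    """pairs: (source_sentence, target_sentence) tuples with space-separated words.
    Scores each (S, T) pair with Dice = 2 * count(S, T) / (count(S) + count(T)),
    counting each word at most once per sentence."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words, t_words = set(src.split()), set(tgt.split())
        src_count.update(s_words)
        tgt_count.update(t_words)
        joint.update(product(s_words, t_words))
    dictionary = {}
    for (s, t), c in joint.items():
        score = 2 * c / (src_count[s] + tgt_count[t])
        if score >= threshold:
            dictionary.setdefault(s, []).append((t, score))
    for s in dictionary:
        dictionary[s].sort(key=lambda entry: -entry[1])
    return dictionary

pairs = [("中国 人民", "china people"),
         ("中国 政府", "china government"),
         ("人民 政府", "people government")]
print(build_stat_dict(pairs))
```

Because the corpus has already been segmented and bracketed, S and T can be multi-character Chinese terms and underscore-joined English phrases as well as single words.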

Statistical dictionary
• Iteration
  • The improved segmenter and the phrase extraction both work monolingually, so an extracted Chinese term may not be found with a translation
  • Use only the Chinese words and English phrases that are found with a translation to re-segment/re-bracket the corpus
  • Build the statistical dictionary again
  • Repeat this loop several times; the size of the statistical dictionary increases
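
The loop can be outlined as below. Everything here is a sketch under assumptions: `build_dictionary` is a hypothetical callable that re-segments and re-brackets the corpus with the given lexicons and returns a mapping from each Chinese term to a set of English translations, and `refine_lexicons` is an illustrative name.

```python
def refine_lexicons(corpus, zh_terms, en_phrases, build_dictionary, rounds=3):
    """Repeatedly rebuild the dictionary, keeping only the terms and
    phrases that were found with a translation in the previous round."""
    stat_dict = {}
    for _ in range(rounds):
        # Hypothetical step: re-segment, re-bracket, then rebuild the dictionary.
        stat_dict = build_dictionary(corpus, zh_terms, en_phrases)
        # Keep only Chinese terms that received a translation.
        zh_terms = {w for w in zh_terms if w in stat_dict}
        # Keep only English phrases that appear as someone's translation.
        translated = set().union(*stat_dict.values())
        en_phrases = {p for p in en_phrases if p in translated}
    return stat_dict, zh_terms, en_phrases
```

Terms dropped in one round no longer interfere with segmentation in the next, which is one plausible reason the dictionary grows over the iterations.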

Results
• Exp16: Baseline system
• Exp15: Base system + improved segmenter
• Exp18: Base system + improved segmenter + StatDict
• Exp14: Base system + improved segmenter + bracketer + statistical dictionary (3 iterations)

Ongoing and future work
• Feedback from the statistical dictionary to the segmenter and bracketer
• Topic detection, corpus clustering
• Related work ongoing:
  • Ralf: Generalization, word clustering
  • Erik: Relative clause detection and reordering