210 likes | 325 Views
The current status of Chinese-English EBMT research -where are we now. Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001. Overview of Ch-En EBMT. Adapting EBMT to Chinese Segmentation of Chinese Corpus used Hong Kong legal code (from LDC) Hong Kong news articles (from LDC)
E N D
The current status of Chinese-English EBMT research-where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001
Overview of Ch-En EBMT • Adapting EBMT to Chinese • Segmentation of Chinese • Corpus used • Hong Kong legal code (from LDC) • Hong Kong news articles (from LDC) • In this project: • Robert Frederking, Ralf Brown, Joy, Erik Peterson, Stephan Vogel, Alon Lavie, Lori Levin, Language Technologies Institute School of Computer Science, Carnegie Mellon University
Corpus Statistics • Hong Kong Legal Code: Chinese: 23 MB English: 37.8 MB • Hong Kong News (After cleaning): 7622 Documents Dev-test: Size: 1,331,915 byte , 4,992 sentence pairs Final-test: Size: 1,329,764 byte, 4,866 sentence pairs Training: Size: 25,720,755 byte, 95,752 sentence pairs • Corpus Cleaning • Converted from Big5 to GB • Divided into Training set (90%), Dev-test (5%) and test set (5%) • Sentence level alignment, using Church & Gale Method (by ISI) • Cleaned • Convert two-byte Chinese characters to their cognates Language Technologies Institute School of Computer Science, Carnegie Mellon University
Chinese Segmentation • There are no spaces between Chinese words in written Chinese. • The segmentation problem: Given a sentence with no spaces, break it into words. Definition of Chinese word is vague. Language Technologies Institute School of Computer Science, Carnegie Mellon University
Our Definition of Words/Phrases/Terms • Chinese Characters • The smallest unit in written Chinese is a character, which is represented by 2 bytes in GB-2312 code. • Chinese Words • A word in natural language is the smallest reusable unit which can be used in isolation. • Chinese Phrases • We define a Chinese phrase as a sequence of Chinese words. For each word in the phrase, the meaning of this word is the same as the meaning when the word appears by itself. • Terms • A term is a meaningful constituent. It can be either a word or a phrase. Language Technologies Institute School of Computer Science, Carnegie Mellon University
Complicated Constructions • Transliterated foreign words and names • Abbreviations • Chinese Names • Chinese Numbers Language Technologies Institute School of Computer Science, Carnegie Mellon University
Segmenter • Approaches • Statistical approaches: • Idea: Building collocation models for Chinese characters, such as first-order HMM. Place the space at the place where two characters rarely co-occur. • Cons: • Data sparseness • Cross boundary Language Technologies Institute School of Computer Science, Carnegie Mellon University
Segmenter (2) • Dictionary-based approaches • Idea: Use a dictionary to find the words in the sentence • Forward maximum match / backward maximum match/ or both direction • Cons: • The size and quality of the dictionary used are of great importance: New words, Named-entity • Maximum (greedy) match may cause mis-segmentations Language Technologies Institute School of Computer Science, Carnegie Mellon University
Segmenter (3) • A combination of dictionary and linguistic knowledge • Ideas: Using morphology, POS, grammar and heuristics to aid disambiguation • Pros: high accuracy (possible) • Cons: • Require a dictionary with POS and word-frequency • Computationally expensive Language Technologies Institute School of Computer Science, Carnegie Mellon University
Segmenter (4) • We first used LDC’s segmenter • Currently we are using a forward/backward maximummatch segmenter for baseline. The word frequency dictionary is from LDC • The word frequency dictionary from LDC: 43,959 entries • For HLT 2001, we augmented the frequency dictionary with new words found from the corpus by statistical method Language Technologies Institute School of Computer Science, Carnegie Mellon University
Two-threshold method • Two-threshold for tokenization (finding new words from the corpus) : for MT Summit VIII Language Technologies Institute School of Computer Science, Carnegie Mellon University
For PI Meeting • Baseline System • Using LDC’s frequency word dictionary • Full System • Tokenize new words from the pre-segmented corpus using two-threshold method, augment the frequency dictionary with new words to re-segment the corpus • Bracket English • Using feedback from statDict to adjust segmentation/bracketing • Baseline + Named-Entity • Named-entity tagger by Erik Peterson • Multi-corpora System • Cluster the documents into sub-corpora according to their topics Language Technologies Institute School of Computer Science, Carnegie Mellon University
Evaluation Issues • Automatic Measures • EBMT Source Match • EBMT Source Coverage • EBMT Target Coverage • MEMT (EBMT+DICT) Unigram Coverage • MEMT (EBMT+DICT) PER Language Technologies Institute School of Computer Science, Carnegie Mellon University
Evaluation Issues (2) • Human Evaluations • 4-5 graders each time • 6 categories Language Technologies Institute School of Computer Science, Carnegie Mellon University
After PI Meeting (0) • Study of results reported in PI meeting (http://pizza.is.cs.cmu.edu/research/internal/ebmt/tokenLen/index.htm) • The quality of Named-Entity (Cleaned by Erik) • Performance difference of EBMT while changing the average length of Chinese word token (by changing segmentation) • How to evaluate the performance of the system • Experiment of G-EBMT • Word clustering Language Technologies Institute School of Computer Science, Carnegie Mellon University
After PI Meeting (1) • Changing the average length of Chinese token • No bracket on English • Use a subset of LDC’s frequency dictionary for segmentation • Study the performance of EBMT system on different average Chinese token length Language Technologies Institute School of Computer Science, Carnegie Mellon University
After PI Meeting (2) • Avg. Token Len. Vs. PER Language Technologies Institute School of Computer Science, Carnegie Mellon University
After PI Meeting (3) • Type-Token curve of Chinese and English Language Technologies Institute School of Computer Science, Carnegie Mellon University
Future Research Plan • Generalized EBMT • Word-clustering • Grammar Induction • Using Machine Learning to optimize the parameters used in MEMT • Better Alignment Model: Integrating segmentation, brackting and alignment Language Technologies Institute School of Computer Science, Carnegie Mellon University
New Alignment Model (1) • Using both monolingual and bilingual collocation information to segment and align corpus Language Technologies Institute School of Computer Science, Carnegie Mellon University
References • Tom Emerson, “Segmentation of Chinese Text”. In #38 Volume 12 Issue2 of MultiLingual Computing & Technology published by MultiLingual Computing, Inc. • Ying Zhang, Ralf D. Brown, and Robert E. Frederking. "Adapting an Example-Based Translation System to Chinese". To appear in Proceedings of Human Language Technology Conference 2001(HLT-2001). • Ying Zhang, Ralf D. Brown, Robert E. Frederking and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". Accepted in MT Summit VIII (Santiago de Compostela, Spain, Sep. 2001) Language Technologies Institute School of Computer Science, Carnegie Mellon University