170 likes | 330 Views
Pre-processing of Bilingual Corpora for Mandarin-English EBMT. Ying Zhang, Ralf Brown, Robert Frederking, Alon Lavie (http://www.cs.cmu.edu/~joy). Background . The Example Based Machine Translation System – EBMT (Brown 96; Brown 99) A shallow match system
E N D
Pre-processing of Bilingual Corpora for Mandarin-English EBMT Ying Zhang, Ralf Brown, Robert Frederking, Alon Lavie (http://www.cs.cmu.edu/~joy) Language Technologies Institute School of Computer Science Carnegie Mellon University
Background • The Example Based Machine Translation System – EBMT (Brown 96; Brown 99) • A shallow match system • Extract statistical dictionary from bitext • Word-level alignment • Dictionary and glossary are used to fill the gaps • Using target language trigram to generate the “best” translaton (Hogan & Frederking 1998) Language Technologies Institute School of Computer Science Carnegie Mellon University
Data Used • Hong Kong Legal Code: • Chinese: 23 MB • English: 37.8 MB • Hong Kong News (After cleaning): 7622 Documents • Dev-test: Size: 1,331,915 byte , 4,992 sentence pairs • Final-test: Size: 1,329,764 byte, 4,866 sentence pairs • Training: Size: 25,720,755 byte, 95,752 sentence pairs • Corpus Cleaning • Converted from Big5 to GB • Divided into Training set (90%), Dev-test (5%) and test set (5%) • Sentence level alignment, using Church & Gale Method (by ISI) • Cleaned • Convert two-byte Chinese characters to their cognates Language Technologies Institute School of Computer Science Carnegie Mellon University
Chinese Segmentation • Our EBMT system is word based • Written Chinese has no spaces between words Language Technologies Institute School of Computer Science Carnegie Mellon University
Chinese Segmentation (2) • Why not just using characters? • Mis-match between Chinese and English will be worse Language Technologies Institute School of Computer Science Carnegie Mellon University
Chinese Segmentation (3) • Segmentation Problem: • Given a sentence with no spaces, break it into words. • Segmentation Approaches: • Statistical approach • Dictionary based approach • Combination of dictionary and linguistic knowledge • We used forward/backward maximum match, with LDC’s frequency dictionary for baseline • Suffered from the incomplete coverage of the dictionary on corpus Language Technologies Institute School of Computer Science Carnegie Mellon University
Goal • Extract Chinese terms from the corpus and add them to the frequency dictionary for segmentation • Result of pre-processing: • A segmented/bracketed bilingual corpus • A statistical dictionary Language Technologies Institute School of Computer Science Carnegie Mellon University
Definitions • Vague definitions of Chinese words • Definition used in this paper • Chinese Characters • The smallest unit in written Chinese is a character, which is represented by 2 bytes in GB-2312 code. • Chinese Words • A word in natural language is the smallest reusable unit which can be used in isolation. • Chinese Phrases • We define a Chinese phrase as a sequence of Chinese words. For each word in the phrase, the meaning of this word is the same as the meaning when the word appears by itself. • Terms • A term is a meaningful constituent. It can be either a word or a phrase. Language Technologies Institute School of Computer Science Carnegie Mellon University
Tokenization Techniques (1) • Collocation measure For two adjacent terms: w1 and w2 Where VMI(w1:w2) is a variant of average mutual information: Language Technologies Institute School of Computer Science Carnegie Mellon University
Tokenization Techniques (2) • Dual-threshold for segmenting Language Technologies Institute School of Computer Science Carnegie Mellon University
Tokenization Procedure • Tokenizing on character level cannot produce a highly accurate segmentation • Cross-boundary problem • Instead, tokenize on the segmented corpus using LDC’s segmenter Language Technologies Institute School of Computer Science Carnegie Mellon University
Feedback from Statistical Dictionary • Monolingual tokenization may lead to over segmentation • The statistical dictionary was built from segmented corpus • Using the results of statistical dictionary to adjust the segmentation Language Technologies Institute School of Computer Science Carnegie Mellon University
Flowchart of Pre-processing Language Technologies Institute School of Computer Science Carnegie Mellon University
Results • With proper parameters for two thresholds: • Average length of Chinese terms increased by 60%, 10% for English • Statistical dictionary gained 30% increase in coverage (with the same precision) • Small boost in EBMT overall performance • Automatic evaluation metrics • Human evaluations Language Technologies Institute School of Computer Science Carnegie Mellon University
Ongoing and Future work • Adding word-clustering and grammar induction features • Improving the sub-sentential alignment model by utilizing the bilingual collocation information • Change threshold dynamically according to the current segmentation Language Technologies Institute School of Computer Science Carnegie Mellon University
References (partial) • Ralf D. Brown. 1996. Example-Based Machine Translation in the PanGloss System. In Proceedings of the Sixteenth International Conference on Computational Linguistics, Pages 169-174, Copenhagen, Denmark. http://www.cs.cmu.edu/~ralf/papers.html • Ralf D. Brown. 1997. "Automated Dictionary Extraction for ``Knowledge-Free'' Example-Based Translation". In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, p. 111-118. Santa Fe, July 23-25, 1997 • Ralf D. Brown. 1999. Adding Linguistic Knowledge to a Lexical Example-Based Translation System. In Proceedings of the Eighth International Conferences on Theoretical and Methodological Issues in Machine Transaltion (TMI-99), pages 22-32, Chester, England, August. http://www.cs.cmu.edu/~ralf/papers.html • Ralf D. Brown. 2000. Automated Generalization of Translation Examples. In Proceedings of the Eighteenth International Conferences on Computational Linguistics (COLING-2000), pages 125-131 • Tom Emerson, “Segmentation of Chinese Text”. In #38 Volume 12 Issue2 of MultiLingual Computing & Technology published by MultiLingual Computing, Inc. • Christopher Hogan and Robert E. Frederking. 1998. An Evaluation of the Multi-engine MT Architecture. In Machine Translation and the Information Soup: Proceedings of the Third Conference of the Association for Machine Translation in Americas (AMTA ’98), volume 1529 of Lecture Notes in Artificial Intelligence, pages 113-123. Springer-Verlag, Berlin, October Language Technologies Institute School of Computer Science Carnegie Mellon University
The End • Questions and Comments? Language Technologies Institute School of Computer Science Carnegie Mellon University