Pre-processing of Bilingual Corpora for Mandarin-English EBMT

Pre-processing of Bilingual Corpora for Mandarin-English EBMT Ying Zhang, Ralf Brown, Robert Frederking, Alon Lavie (http://www.cs.cmu.edu/~joy) Language Technologies Institute School of Computer Science Carnegie Mellon University

Background • The Example Based Machine Translation System – EBMT (Brown 96; Brown 99) • A shallow match system • Extract statistical dictionary from bitext • Word-level alignment • Dictionary and glossary are used to fill the gaps • Using target language trigram to generate the “best” translaton (Hogan & Frederking 1998) Language Technologies Institute School of Computer Science Carnegie Mellon University

Data Used • Hong Kong Legal Code: • Chinese: 23 MB • English: 37.8 MB • Hong Kong News (After cleaning): 7622 Documents • Dev-test: Size: 1,331,915 byte , 4,992 sentence pairs • Final-test: Size: 1,329,764 byte, 4,866 sentence pairs • Training: Size: 25,720,755 byte, 95,752 sentence pairs • Corpus Cleaning • Converted from Big5 to GB • Divided into Training set (90%), Dev-test (5%) and test set (5%) • Sentence level alignment, using Church & Gale Method (by ISI) • Cleaned • Convert two-byte Chinese characters to their cognates Language Technologies Institute School of Computer Science Carnegie Mellon University

Chinese Segmentation • Our EBMT system is word based • Written Chinese has no spaces between words Language Technologies Institute School of Computer Science Carnegie Mellon University

Chinese Segmentation (2) • Why not just using characters? • Mis-match between Chinese and English will be worse Language Technologies Institute School of Computer Science Carnegie Mellon University

Chinese Segmentation (3) • Segmentation Problem: • Given a sentence with no spaces, break it into words. • Segmentation Approaches: • Statistical approach • Dictionary based approach • Combination of dictionary and linguistic knowledge • We used forward/backward maximum match, with LDC’s frequency dictionary for baseline • Suffered from the incomplete coverage of the dictionary on corpus Language Technologies Institute School of Computer Science Carnegie Mellon University

Goal • Extract Chinese terms from the corpus and add them to the frequency dictionary for segmentation • Result of pre-processing: • A segmented/bracketed bilingual corpus • A statistical dictionary Language Technologies Institute School of Computer Science Carnegie Mellon University

Definitions • Vague definitions of Chinese words • Definition used in this paper • Chinese Characters • The smallest unit in written Chinese is a character, which is represented by 2 bytes in GB-2312 code. • Chinese Words • A word in natural language is the smallest reusable unit which can be used in isolation. • Chinese Phrases • We define a Chinese phrase as a sequence of Chinese words. For each word in the phrase, the meaning of this word is the same as the meaning when the word appears by itself. • Terms • A term is a meaningful constituent. It can be either a word or a phrase. Language Technologies Institute School of Computer Science Carnegie Mellon University

Tokenization Techniques (1) • Collocation measure For two adjacent terms: w1 and w2 Where VMI(w1:w2) is a variant of average mutual information: Language Technologies Institute School of Computer Science Carnegie Mellon University

Tokenization Techniques (2) • Dual-threshold for segmenting Language Technologies Institute School of Computer Science Carnegie Mellon University

Tokenization Procedure • Tokenizing on character level cannot produce a highly accurate segmentation • Cross-boundary problem • Instead, tokenize on the segmented corpus using LDC’s segmenter Language Technologies Institute School of Computer Science Carnegie Mellon University

Feedback from Statistical Dictionary • Monolingual tokenization may lead to over segmentation • The statistical dictionary was built from segmented corpus • Using the results of statistical dictionary to adjust the segmentation Language Technologies Institute School of Computer Science Carnegie Mellon University

Flowchart of Pre-processing Language Technologies Institute School of Computer Science Carnegie Mellon University

Results • With proper parameters for two thresholds: • Average length of Chinese terms increased by 60%, 10% for English • Statistical dictionary gained 30% increase in coverage (with the same precision) • Small boost in EBMT overall performance • Automatic evaluation metrics • Human evaluations Language Technologies Institute School of Computer Science Carnegie Mellon University

Ongoing and Future work • Adding word-clustering and grammar induction features • Improving the sub-sentential alignment model by utilizing the bilingual collocation information • Change threshold dynamically according to the current segmentation Language Technologies Institute School of Computer Science Carnegie Mellon University

References (partial) • Ralf D. Brown. 1996. Example-Based Machine Translation in the PanGloss System. In Proceedings of the Sixteenth International Conference on Computational Linguistics, Pages 169-174, Copenhagen, Denmark. http://www.cs.cmu.edu/~ralf/papers.html • Ralf D. Brown. 1997. "Automated Dictionary Extraction for ``Knowledge-Free'' Example-Based Translation". In Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, p. 111-118. Santa Fe, July 23-25, 1997 • Ralf D. Brown. 1999. Adding Linguistic Knowledge to a Lexical Example-Based Translation System. In Proceedings of the Eighth International Conferences on Theoretical and Methodological Issues in Machine Transaltion (TMI-99), pages 22-32, Chester, England, August. http://www.cs.cmu.edu/~ralf/papers.html • Ralf D. Brown. 2000. Automated Generalization of Translation Examples. In Proceedings of the Eighteenth International Conferences on Computational Linguistics (COLING-2000), pages 125-131 • Tom Emerson, “Segmentation of Chinese Text”. In #38 Volume 12 Issue2 of MultiLingual Computing & Technology published by MultiLingual Computing, Inc. • Christopher Hogan and Robert E. Frederking. 1998. An Evaluation of the Multi-engine MT Architecture. In Machine Translation and the Information Soup: Proceedings of the Third Conference of the Association for Machine Translation in Americas (AMTA ’98), volume 1529 of Lecture Notes in Artificial Intelligence, pages 113-123. Springer-Verlag, Berlin, October Language Technologies Institute School of Computer Science Carnegie Mellon University

The End • Questions and Comments? Language Technologies Institute School of Computer Science Carnegie Mellon University

Pre-processing of Bilingual Corpora for Mandarin-English EBMT

Pre-processing of Bilingual Corpora for Mandarin-English EBMT

Presentation Transcript

Pre-processing of NIR

Better Together: Large Monolingual, Bilingual and Multimodal Corpora in Natural Language Processing

Data Pre-processing

Sign Language corpora for analysis, processing and evaluation

Pre-Processing of CCD Data

Extracting bilingual terminologies from comparable corpora

Learning Bilingual Lexicons from Monolingual Corpora

English and Chinese (Mandarin) Passive Constructions

EBMT

Using corpora to study Classifiers in Mandarin Chinese

English vs. Mandarin: A Phonetic Comparison

Pre-processing OpenURLs

Alignment of Bilingual Named Entities in Parallel Corpora Using Statistical Model

Pre-processing

Image Processing Pre - Processing

English Mandarin Translation

SPM Pre-Processing

Pre-processing

Pre-processing OpenURLs