Constructing Bilingual Resources for Digital Libraries Rim, Hae-Chang Korea University 2000.8.10
Contents • Introduction • Bilingual resources • bilingual dictionary • bilingual corpus • bilingual thesaurus • Our experience • bilingual dictionary • bilingual corpus • bilingual thesaurus • Summary
Introduction • What is the problem? • the language barrier in multilingual digital libraries • How to solve the problem? • machine translation (MT) • cross-language information retrieval (CLIR) • Why bilingual resources? • MT and CLIR are both built on bilingual resources. • What shall we do? • construct Korean-English bilingual resources • Korean-English bilingual dictionary • Korean-English bilingual corpus • Korean-English bilingual thesaurus
Overview • [Diagram labels: DL, language barrier, DL; MT and CLIR; bilingual resources]
Bilingual Resources • Bilingual dictionary • Bilingual corpus • Bilingual thesaurus
Bilingual Dictionary • Definition • a dictionary containing words and their translations • Application field • CLIR • [Oard 98], [Fujii et al. 99], [Myaeng et al. 99] • MT • Utilization • the word “대기” is looked up in the bilingual dictionary • dictionary entries: “대기1” – “atmosphere”, “대기2” – “waiting” • the translated words “atmosphere” and “waiting” are passed to CLIR and MT
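The lookup on this slide can be sketched in a few lines of Python. This is only an illustration: the dictionary contents come from the slide's own example, while the names `BILINGUAL_DICT` and `translate_query` are hypothetical and do not describe the authors' system.

```python
from typing import Dict, List

# Toy Korean-English dictionary (assumption: entries taken from the slide's example).
BILINGUAL_DICT: Dict[str, List[str]] = {
    "대기": ["atmosphere", "waiting"],
    "오염": ["pollution"],
}


def translate_query(query: str, dictionary: Dict[str, List[str]]) -> List[List[str]]:
    """Return the candidate translations of each whitespace-separated query word.

    Words missing from the dictionary are kept untranslated.
    """
    return [dictionary.get(word, [word]) for word in query.split()]


if __name__ == "__main__":
    print(translate_query("대기", BILINGUAL_DICT))
```

Because an ambiguous word such as “대기” yields every translation in its entry, a dictionary alone cannot choose between “atmosphere” and “waiting”.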
Bilingual Corpus (1) • Definition • comparable corpus • a collection of similar texts in different languages • parallel corpus • a collection of texts that have been translated into one or more other languages • e.g., the Canadian Hansard corpus • Application field • CLIR • [Yang et al. 98] • MT • Example-Based Machine Translation • [Brown 96], [Murata et al. 99], [Shirai et al. 97] • [Turcato et al. 99]
Bilingual Corpus (2) • Utilization • translated words: “대기” – “atmosphere”, “waiting”; “오염” – “pollution” • candidate translations of “대기 오염”: “atmosphere pollution”? “waiting pollution”? • the bilingual corpus contains the sentence pair “the sources of atmosphere pollution may have a global, regional and local character.” / “대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.” • translated phrase passed to CLIR and MT: “대기 오염” – “atmosphere pollution”
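One way to picture this step is to generate every word-by-word candidate and keep the one that the parallel corpus supports. The sketch below makes that assumption explicit: the simple count-based `support` score and the names `WORD_TRANSLATIONS`, `PARALLEL_CORPUS`, and `best_phrase_translation` are illustrative choices, not the method used by the authors.

```python
from itertools import product
from typing import Dict, List, Tuple

# Toy data taken from the slide's example (assumption: a one-pair corpus).
WORD_TRANSLATIONS: Dict[str, List[str]] = {
    "대기": ["atmosphere", "waiting"],
    "오염": ["pollution"],
}
PARALLEL_CORPUS: List[Tuple[str, str]] = [
    (
        "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
        "the sources of atmosphere pollution may have a global, regional and local character.",
    ),
]


def best_phrase_translation(
    phrase: str,
    translations: Dict[str, List[str]],
    corpus: List[Tuple[str, str]],
) -> str:
    """Pick the candidate phrase that co-occurs most often with the source phrase."""
    words = phrase.split()
    # Every combination of per-word translations, e.g. "waiting pollution".
    candidates = [
        " ".join(combo)
        for combo in product(*(translations.get(w, [w]) for w in words))
    ]

    def support(candidate: str) -> int:
        # Count sentence pairs whose Korean side has the phrase
        # and whose English side has the candidate.
        return sum(1 for ko, en in corpus if phrase in ko and candidate in en.lower())

    # On ties, max() keeps the first candidate in dictionary order.
    return max(candidates, key=support)


if __name__ == "__main__":
    print(best_phrase_translation("대기 오염", WORD_TRANSLATIONS, PARALLEL_CORPUS))
```

With a realistic corpus the raw count would normally be replaced by a statistical association measure, but the selection logic stays the same.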
Bilingual Thesaurus (1) • Definition • a collection of words in two languages that are grouped according to the connections between their meanings • e.g., EuroWordNet • Application field • CLIR • concept-based CLIR • [Gonzalo et al. 98], [Gilarranz et al. 97]
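Bilingual Thesaurus (2) • Utilization • the word “대기” is looked up in the bilingual thesaurus • first sense: {atmosphere, 대기}, shown with {region, part} and {air} • second sense: {wait, waiting, 대기}, shown with {inactivity} and {pause} • the word's concepts (“region”, “inactivity”) are passed to CLIR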
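Concept-based lookup can be sketched as a mapping from words to language-independent concept identifiers and back. The data structure below is a deliberately small assumption (two concepts, with identifiers `atmosphere` and `waiting` chosen only for this example); it is not the EuroWordNet format.

```python
from typing import Dict, List

# Toy bilingual thesaurus: concept id -> words per language (hypothetical layout).
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "atmosphere": {"en": ["atmosphere"], "ko": ["대기"]},
    "waiting": {"en": ["wait", "waiting"], "ko": ["대기"]},
}


def concepts_of(word: str, lang: str) -> List[str]:
    """Return the ids of all concepts that contain `word` in language `lang`."""
    return sorted(cid for cid, entry in THESAURUS.items() if word in entry.get(lang, []))


def cross_language_terms(word: str, src: str, tgt: str) -> List[str]:
    """Map a source-language word to target-language words via shared concepts."""
    terms: List[str] = []
    for cid in concepts_of(word, src):
        for term in THESAURUS[cid].get(tgt, []):
            if term not in terms:
                terms.append(term)
    return terms


if __name__ == "__main__":
    print(concepts_of("대기", "ko"))
    print(cross_language_terms("대기", "ko", "en"))
```

Indexing documents and queries by concept id rather than by surface word is what lets a query in one language match documents in the other.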
Our Experience • Bilingual dictionary • Bilingual corpus • Bilingual thesaurus
Bilingual Dictionary • Korean-English bilingual dictionary • size • 2 million entries • application • bilingual biographical dictionary, e.g. “링컨” – “Lincoln” • the person's name “링컨” is translated to “Lincoln” and passed to CLIR and MT
Bilingual Corpus • Korean-English bilingual corpus • parallel corpus containing 250,000 words • based on CES (Corpus Encoding Standard) • Corpus construction tools • corpus refining tools • corpus annotating tools • bilingual concordancer
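As a rough idea of what a bilingual concordancer does, the sketch below searches one side of a sentence-aligned corpus and returns each hit together with its aligned translation. It assumes the corpus is already available as a list of (Korean, English) sentence pairs; the names `SentencePair` and `concordance` are hypothetical, and the example does not read CES-encoded files.

```python
from typing import List, Tuple

SentencePair = Tuple[str, str]  # (Korean sentence, English sentence)

# Toy sentence-aligned corpus (assumption: the first pair is the slide deck's example).
CORPUS: List[SentencePair] = [
    (
        "대기 오염의 원인은 전세계적, 국부적, 그리고 지역적인 특징을 가진다.",
        "the sources of atmosphere pollution may have a global, regional and local character.",
    ),
    ("대기 시간이 길다.", "The waiting time is long."),
]


def concordance(keyword: str, corpus: List[SentencePair], side: int = 0) -> List[SentencePair]:
    """Return every aligned pair whose `side` sentence contains `keyword`.

    side=0 searches the Korean sentences, side=1 the English ones
    (English matching is case-insensitive).
    """
    needle = keyword.lower()
    return [pair for pair in corpus if needle in pair[side].lower()]


if __name__ == "__main__":
    for ko, en in concordance("대기", CORPUS):
        print(ko, "|", en)
```

Viewing both hits for “대기” side by side shows the two senses of the word in context, which is the main use of such a tool when checking translations.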
Bilingual Thesaurus (1) • Goal • constructing a Korean-English bilingual thesaurus • Approach • assigning Korean words to corresponding English words in WordNet • [Diagram: the Korean word “대기” is assigned to WordNet, shown as {region, part}, {atmosphere}, {air}; the resulting Korean-English bilingual thesaurus is shown as {region, part}, {atmosphere, 대기}, {air}]
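The assignment step can be illustrated with a toy version of the slide's diagram. The sketch assumes that candidate English words for a Korean word come from a bilingual dictionary and that a synset is represented as a plain list of words; `TOY_WORDNET` and `attach_korean_word` are hypothetical names, and no real WordNet API is used.

```python
from typing import List

# Toy stand-in for WordNet synsets, following the slide's diagram.
TOY_WORDNET: List[List[str]] = [
    ["region", "part"],
    ["atmosphere"],
    ["air"],
]


def attach_korean_word(
    korean_word: str,
    english_translations: List[str],
    synsets: List[List[str]],
) -> List[List[str]]:
    """Return a copy of `synsets` with `korean_word` added to every synset
    that contains at least one of its English translations."""
    result: List[List[str]] = []
    for synset in synsets:
        if any(t in synset for t in english_translations) and korean_word not in synset:
            result.append(synset + [korean_word])
        else:
            result.append(list(synset))
    return result


if __name__ == "__main__":
    print(attach_korean_word("대기", ["atmosphere", "waiting"], TOY_WORDNET))
```

In this toy inventory only {atmosphere} matches. With a full WordNet a translation such as "waiting" would also match its own synsets, and deciding which of the matched synsets are correct is the hard part of the task, which this sketch does not attempt.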
Bilingual Thesaurus (2) • Current status of the task • under construction
Summary • Surmounting the language barrier • using bilingual resources • Korean-English bilingual resources • Korean-English bilingual dictionary • Korean-English bilingual corpus • Korean-English bilingual thesaurus • Our experience • Korean-English bilingual dictionary • Korean-English bilingual corpus • Korean-English bilingual thesaurus
reference(1) • [Oard 98] Douglas W. Oard, “A Comparative Study of Query and Document Translation for Cross-Language Information Retrieval”, the Third Conference of the Association for Machine Translation in the Americas (AMTA), Philadelphia, PA, October, 1998. • [Fujii et al. 99] Atsushi Fujii, Tetsuya Ishikawa, "Cross-Language Information Retrieval for Technical Documents", Proceedings of the joint ACL SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pp.29-37, 1999. • [Myaeng et al. 99] Sung Hyon Myaeng and Myung-gil Jang, "Complementing Dictionary-Based Query Translations with Corpus Statistics for Cross-Language IR", Machine Translation Summit VII, 1999.
reference(2) • [Yang et al. 98] Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking, "Translingual Information Retrieval: Learning from Bilingual Corpora", Artificial Intelligence (Special issue: Best of IJCAI-97), Vol. 103, pp. 323-345, 1998. • [Brown 96] Ralf D. Brown, “Example-Based Machine Translation in the Pangloss System”, In Proceedings of the 16th International Conference on Computational Linguistics (COLING-96), pp. 169-174, Copenhagen, Denmark, August 5-9, 1996. • [Murata et al. 99] M. Murata, Q. Ma, K. Uchimoto, H. Isahara, "An Example-Based Approach to Japanese-to-English Translation of Tense, Aspect, and Modality", in TMI'99, Chester, UK, August 23, 1999.
reference(3) • [Shirai et al. 97] S. Shirai, F. Bond, and Y. Takahashi, “A Hybrid Rule and Example based Method for Machine Translation”, In Natural Language Processing Pacific Rim Symposium '97 (NLPRS-97), 1997. • [Turcato et al. 99] Davide Turcato, Paul McFetridge, Fred Popowich, Janine Toole, "A Unified Example-Based and Lexicalist Approach to Machine Translation", at the 8th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-99). • [Gonzalo et al. 98] Julio Gonzalo, Felisa Verdejo, Carol Peters and Nicoletta Calzolari, “Applying EuroWordNet to Cross-Language Text Retrieval”, Computers and the Humanities, Vol. 32, Nos. 2-3, pp. 73-89, 1998.
reference(4) • [Gilarranz et al. 97] Julio Gilarranz, Julio Gonzalo and Felisa Verdejo, "An Approach to Conceptual Text Retrieval Using the EuroWordNet Multilingual Semantic Database", AAAI 97.