國立雲林科技大學 National Yunlin University of Science and Technology. Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs. Advisor : Dr. Hsu Graduate : Chien-Shing Chen Author : Eric Brill Gary Kacmarcik Chris Brockett.
國立雲林科技大學 National Yunlin University of Science and Technology • Automatically Harvesting Katakana-English Term Pairs from Search Engine Query Logs • Advisor: Dr. Hsu • Graduate: Chien-Shing Chen • Authors: Eric Brill, Gary Kacmarcik, Chris Brockett (Microsoft Research) • NLPRS, 2001
Outline • N.Y.U.S.T. • I.M. • Motivation • Objective • Introduction • Noisy Channel Error Model • Experiments • Conclusions • Opinion
Motivation • Out-of-vocabulary (OOV) words are a notorious headache for human translators • They are a stumbling block for quality machine translation (MT) and multilingual IR <本拉登, Bin Laden>
Objective • Back-transliteration from katakana to English • Find <katakana, English> pairs • Apply them at runtime in machine translation (MT) and multilingual IR
Introduction • Katakana script is used to represent foreign loan words • Search engines must deal with the flood of OOV words
Noisy Channel Error Model • |a|, |b| <= 4, English / romanized-katakana non-match (noise) is:
Noisy Channel Error Model • Example alignment: akgsual / actual • |a|, |b| <= 2, English / romanized-katakana non-match is:
Noisy Channel Error Model • Some high-probability edits learned to map English to romanized katakana:
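A minimal sketch of how a substring-edit error model of this kind might score an English word against a romanized katakana string. It assumes edits a -> b with |a|, |b| <= 2 and takes the best-scoring partition by dynamic programming. The probabilities in `EDIT_PROBS`, the `MATCH_PROB` default, and the function names are hypothetical and chosen only for illustration; they are not the values learned in the paper.

```python
import math

# Hypothetical edit probabilities P(a -> b); illustrative values only.
EDIT_PROBS = {
    ("i", "ai"): 0.3,   # English "i" rendered as "ai"
    ("er", "aa"): 0.5,  # word-final "er" rendered as a long vowel
    ("v", "b"): 0.7,    # "v" rendered as "b"
    ("l", "r"): 0.6,    # "l" rendered as "r"
    ("", "u"): 0.2,     # epenthetic vowel insertion
}
MATCH_PROB = 0.9        # assumed probability of copying a single character
MAX_LEN = 2             # |a|, |b| <= 2


def edit_prob(a, b):
    """Return the assumed probability of rewriting substring a as b."""
    if (a, b) in EDIT_PROBS:
        return EDIT_PROBS[(a, b)]
    if len(a) == 1 and a == b:
        return MATCH_PROB
    return 0.0


def channel_log_prob(english, romaji):
    """Best log-probability over all partitions of english into edits yielding romaji."""
    n, m = len(english), len(romaji)
    neg_inf = float("-inf")
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg_inf:
                continue
            for di in range(MAX_LEN + 1):
                for dj in range(MAX_LEN + 1):
                    if (di == 0 and dj == 0) or i + di > n or j + dj > m:
                        continue
                    p = edit_prob(english[i:i + di], romaji[j:j + dj])
                    if p > 0.0:
                        score = best[i][j] + math.log(p)
                        if score > best[i + di][j + dj]:
                            best[i + di][j + dj] = score
    return best[n][m]


# Under these made-up numbers "fiber" scores higher than "fiver" for "faibaa".
fiber_score = channel_log_prob("fiber", "faibaa")
fiver_score = channel_log_prob("fiver", "faibaa")
```

A score of negative infinity means no sequence of known edits maps the English word onto the romanized string.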
Harvesting Training Data • Extract a database of English and Japanese queries from the MSN Search query logs • Query logs are preferred over an encyclopedia because: 1. The set of katakana strings in an encyclopedia is small and does not grow over time. 2. Encyclopedias are static and do not contain new names and phrases.
Harvesting Training Data • 461,567 sentences contain only 40,127 unique katakana strings • Acquire 10,000 new katakana strings each day
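A minimal sketch of how unique katakana strings might be pulled out of a list of queries, assuming the queries are available as Unicode strings. The function name and the sample queries are hypothetical; the only fact relied on is that katakana occupies the Unicode block U+30A0 to U+30FF.

```python
import re

# Runs of characters from the Katakana block U+30A0..U+30FF.
# The block also contains the long-vowel mark (U+30FC) and the middle dot (U+30FB).
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")


def unique_katakana_strings(queries):
    """Collect the distinct katakana runs that appear in the given queries."""
    found = set()
    for query in queries:
        found.update(KATAKANA_RUN.findall(query))
    return found


# Hypothetical sample queries, for illustration only.
sample = ["シングル 価格", "single price", "ファイバー", "シングル"]
katakana_terms = unique_katakana_strings(sample)
```

Using a set keeps only distinct strings, which mirrors the slide's count of unique katakana strings rather than total occurrences.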
Harvesting from Non-Aligned Query Databases
Growing a Bilingual Lexicon • shinguru: shingle & single • faibaa: fiver & fiber • pakkuman: packman & pac-man • maimu: maim -> mime • rainingu: raining -> lining • purizumu: purism -> prism • posuto: past & post • retasu: retrace & lettuce • bangaroo: kangaroo & bungalow
Improving the Noisy Channel Error Model • Extract katakana-English word pairs culled from: 1. Kenkyusha New College Japanese-English Dictionary 2. Terms extracted from in-house localization databases 3. Iwanami Kokugojiten pocket dictionary • These consist of a rather general collection of terms and proper names, mostly geographical names
Improving the Noisy Channel Error Model • The baseline plot uses a fairly conservative filter • More aggressive filters allow more "noisy" data through
Improving the Noisy Channel Error Model • Keep only strings that occur at least 100 times (threshold of 100)
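A minimal sketch of a frequency filter of the kind the slide describes, assuming each occurrence of a katakana string in the logs is available as one list element. The function name and the sample counts are hypothetical; only the threshold of 100 comes from the slide.

```python
from collections import Counter


def filter_by_frequency(katakana_occurrences, threshold=100):
    """Keep the strings that occur at least `threshold` times."""
    counts = Counter(katakana_occurrences)
    return {term for term, count in counts.items() if count >= threshold}


# Hypothetical log: one string seen 100 times, another seen 99 times.
log = ["シングル"] * 100 + ["ファイバー"] * 99
frequent_terms = filter_by_frequency(log)
```

Lowering `threshold` corresponds to a more aggressive filter in the slide's sense, since it lets more infrequent and potentially noisy strings through.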
Conclusions • The approach is robust and useful for acquiring <katakana, English> pairs • Applicable to IR and MT
Opinion • Can assist with our research • 本拉登: romanization (pinyin): ben la deng; loanword: Bin Laden