The Soul Lost in the Blue Bridge soul decided blue bridge . Mining Key Phrase Translations from Web Corpora
Mining Translations for Key Phrases from Web Corpora
Ying Zhang (Joy), Fei Huang, Stephan Vogel
CMU/LTI
MT Lunch Presentation, April 19, 2005
Outline
• Motivation
• Cross-lingual query expansion
• Key phrase translation extraction
• Experiments
• Conclusion and future work
Key Phrase
• Definition
  • Named entities: person, organization and location
  • Book/movie titles
  • Terminology (Medical, Sci&Tech, Military, …)
• Most of them are compound nouns
• The meaning cannot be directly derived from the components
• Translating them requires more world knowledge
• Important for NLP applications:
  • Machine Translation (MT)
  • Cross-lingual Information Retrieval (CLIR)
  • Question Answering (QA)
• Most of them are out-of-vocabulary (OOV)
Searching the web for the translation?
• Searching the parallel data on the web (e.g. STRAND; Resnik and Smith 2003)
Bilingual Information on the Web
• Searching the parallel data on the web (Resnik and Smith 2003)
• Searching the comparable corpus on the web (Fung and Yee 1998)
• Anchor texts pointing to the same page (Lu et al. 2004)
Bilingual Information on the Web
• Parallel and comparable bilingual resources on the web are limited
  • STRAND: 3,500 English-Chinese document pairs and fewer than 2,500 for English-French (Resnik and Smith 2003)
  • Comparable corpora: from 10 years of Xinhua Chinese and English stories (2GB), only 110K sentence pairs (44MB) are found to be “parallel” (Zhao & Vogel 2002)
  • Anchor text mining: from 2M web pages, 2.8MB of Chinese text and 3.1MB of English text are found as potential translations
• More bilingual information is available on the web in the form of mixed-language web pages
  • Parallel text is not needed in most cases
  • Chinese authors usually include the original English for key phrases
    • For consistency
    • To give the readers more information
    • If they are not sure about the translation in Chinese
Web pages of mixed languages
Mining translations from mixed-lang. pages
• Crawling the Chinese web pages that contain English text (Zhang and Vines, SIGIR 2004)
  • Use Google to locate the web pages containing the Chinese terms
  • English expressions occurring next to the Chinese terms are considered their translations
  • Crawled 2GB of web data; 1,168 distinct English terms found, 61% of which are correct translations
• Searching for the Chinese terms among the English pages (Cheng et al., SIGIR 2004)
  • Use Google to retrieve “English” pages containing the Chinese terms
  • Extract translations from the snippets
  • LiveTrans system
Mining translations from mixed-lang pages
Pros and cons of these approaches
Our approach: cross-lingual query expansion
• Query expansion: expanding the original query to better represent the user’s “information need”
  • E.g. expand the query “cmu” to “cmu pittsburgh” if the user wants to find pages about “Carnegie Mellon University” instead of “Central Michigan University”
• Cross-lingual query expansion
  • Information need: not pages relevant to query Q, but pages containing the translation of Q
  • How to represent this “need”?
  • Observation: if Q and Q’ are two related Chinese terms, then a web page that contains Q and its translation E is very likely to also contain Q’ and its translation E’
  • If we know Q’ and its translation E’, expand query Q with E’
  • Q’ and E’ are called hint words
Cross-lingual query expansion
• A good Chinese hint word should be:
  • Relevant to the term to be translated
  • Easy to translate given the current resources
• How to find this Chinese hint word?
  • Search Google
  • Select Chinese words Q’ with high frequency
  • Use only those having translations in the LDC lexicon
  • Search Google again with Q+E’ pairs
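The selection steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the snippets from the first Google search have already been fetched and segmented into word lists, and that the LDC lexicon is available as a plain dictionary with one English translation per Chinese word. The names `select_hint_words` and `expanded_queries` are hypothetical.

```python
from collections import Counter


def select_hint_words(query, segmented_snippets, lexicon, top_k=3):
    """Return up to top_k (Q', E') pairs: frequent co-occurring words with a lexicon entry.

    segmented_snippets: list of word lists (assumed output of a word segmenter).
    lexicon: dict mapping a Chinese word to one English translation (assumed format).
    """
    counts = Counter(
        word
        for snippet in segmented_snippets
        for word in snippet
        if word != query and word in lexicon  # keep only words we can translate
    )
    return [(word, lexicon[word]) for word, _ in counts.most_common(top_k)]


def expanded_queries(query, hints):
    """Build one expanded query string 'Q E'' per hint pair."""
    return [f"{query} {english}" for _, english in hints]


# Toy data for illustration only.
snippets = [
    ["廊桥遗梦", "电影", "导演", "小说"],
    ["廊桥遗梦", "电影", "导演"],
    ["廊桥遗梦", "电影", "的"],
]
toy_lexicon = {"电影": "movie", "导演": "director", "小说": "novel"}

hints = select_hint_words("廊桥遗梦", snippets, toy_lexicon, top_k=2)
queries = expanded_queries("廊桥遗梦", hints)
```

Filtering by the lexicon before counting keeps frequent but untranslatable words (such as "的" in the toy data) from taking up hint slots.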
Cross-lingual Query Expansion
Comparing with other approaches
Next step …
Outline
• Motivation
• Cross-lingual query expansion
• Key phrase translation extraction
  • Preprocessing
  • Multiple features
    • Transliteration model
    • Translation model
    • Frequency-distance model
  • Feature combination
• Experiments
• Conclusion and future work
Preprocessing
• HTML tag filtering
• Chinese word segmentation
• Character replacement
  • Replacing punctuation with the separator “|”
  • Replacing non-query Chinese words with “+”
• Grouping continuous English words into a phrase
Example (raw snippet, then after segmentation, character replacement, and English phrase grouping):
《<b>廊桥遗梦</b>》(the bridges of madison county) [review]. 发布者:anjing | 发布时间:2004-01-25 星期日02:13 | 最新更新时间
《<b>廊 桥 遗 梦</b>》(the bridges of madison county) [review]. 发布 者:anjing | 发布 时间:2004-01-25 星期日 02:13 | 最新 更新 时间
| <b>廊 桥 遗 梦</b> | the bridges of madison county | review | ++ +| anjing | ++ ++ | 2004-01-25 +++ 02:13 | ++ ++ ++
| <b>廊桥遗梦</b> | the_bridges_of_madison_county | review | ++ + | anjing | ++ ++ | 2004-01-25 +++ 02 13 | + + ++ ++
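The preprocessing steps above can be approximated with a few regular expressions. This is a simplified sketch under stated assumptions, not the authors' pipeline: it skips Chinese word segmentation (so each run of non-query Chinese characters becomes an unbroken run of "+"), drops the `<b>` markers instead of keeping them, and uses a hand-picked punctuation set. The function name `preprocess` is hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")                      # HTML tags such as <b>...</b>
PUNCT_CHARS = r"《》()\[\]().,:;:,。!?!?|"    # assumed punctuation set
PUNCT = re.compile(rf"\s*[{PUNCT_CHARS}][\s{PUNCT_CHARS}]*")
CJK = re.compile(r"[\u4e00-\u9fff]")              # common Chinese characters
ENGLISH_RUN = re.compile(r"[A-Za-z]+(?: [A-Za-z]+)+")


def preprocess(snippet: str, query: str) -> str:
    """Filter tags, unify separators, mask non-query Chinese, group English phrases."""
    text = TAG.sub("", snippet)
    text = PUNCT.sub(" | ", text)                 # any punctuation run becomes one "|"
    pieces = []
    for piece in re.split(f"({re.escape(query)})", text):
        if piece == query:
            pieces.append(piece)                  # leave the query term untouched
            continue
        piece = CJK.sub("+", piece)               # mask every other Chinese character
        piece = ENGLISH_RUN.sub(lambda m: m.group(0).replace(" ", "_"), piece)
        pieces.append(piece)
    return " ".join("".join(pieces).split())      # normalise whitespace


raw = "《<b>廊桥遗梦</b>》(the bridges of madison county) [review]. 发布者:anjing"
cleaned = preprocess(raw, "廊桥遗梦")
# cleaned == "| 廊桥遗梦 | the_bridges_of_madison_county | review | +++ | anjing"
```

Joining each English run with underscores makes it a single token, so later steps can treat a whole phrase as one translation candidate.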
Phrase Alignment Features
• Transliteration model
  • Captures phonetic similarity
    • Person, location and brand names
  • Probabilistic surface string alignment
    • Romanized source phrase vs. target phrase
    • Letters are aligned according to their pronunciation similarity (not their orthographic forms)
    • Letter pronunciation similarities are automatically learned from bilingual NE lists using EM
Figure: key phrase alignment path (雅诗兰黛 vs. Estee Lauder)
Huang, Vogel and Waibel, Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization, ACL 03 Multilingual NE Recognition Workshop
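To make the idea of letter-level alignment concrete, here is a sketch that swaps in a plain weighted edit distance for the EM-trained probabilistic model described above. It is not the model from the cited paper: the substitution costs in `sub_cost` are assumed to be given (in the paper they would be learned), and the function name `alignment_cost` and the default costs are hypothetical.

```python
def alignment_cost(source, target, sub_cost, gap=1.0):
    """Minimum cost of aligning two letter strings by dynamic programming.

    sub_cost: dict mapping (source_letter, target_letter) to a substitution cost;
    identical letters cost 0, unknown pairs default to 1.0 (assumed values).
    """
    n, m = len(source), len(target)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap                        # delete all source letters
    for j in range(1, m + 1):
        dp[0][j] = j * gap                        # insert all target letters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = source[i - 1], target[j - 1]
            sub = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub,           # align a with b
                dp[i - 1][j] + gap,               # skip a
                dp[i][j - 1] + gap,               # skip b
            )
    return dp[n][m]


# Toy costs: pretend "c" and "k" sound alike.
toy_costs = {("c", "k"): 0.2}
similar = alignment_cost("cola", "kola", toy_costs)
dissimilar = alignment_cost("cola", "xola", toy_costs)
```

A lower cost means the two strings are closer under the given letter similarities, which is the role the transliteration feature plays when ranking candidate translations.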
Phrase Alignment Features
• Translation model
  • Captures semantic similarity
    • Organization names
    • Science and technical terms
    • Military terms
  • Calculate phrase translation probability using IBM models
Example: 简 氏 防务 周刊 vs. Jane’s defense weekly
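The slide says only "IBM models", so the following sketch assumes the simplest one, IBM Model 1, with the length constant dropped. The lexical probabilities in `toy_t` are invented for illustration and are not taken from any trained model; `model1_prob` and the `floor` smoothing value are hypothetical.

```python
def model1_prob(english, chinese, t, floor=1e-6):
    """IBM Model 1 style score for P(english | chinese), up to a constant.

    t: dict mapping (english_word, chinese_word) to a lexical probability.
    Unseen pairs get a small floor probability (assumed smoothing).
    """
    source = ["NULL"] + list(chinese)             # Model 1 adds an empty source word
    prob = 1.0
    for e in english:
        # Each English word may align to any source word with uniform probability.
        prob *= sum(t.get((e, f), floor) for f in source) / len(source)
    return prob


# Toy lexical table, for illustration only.
toy_t = {
    ("jane's", "简"): 0.5,
    ("jane's", "氏"): 0.4,
    ("defense", "防务"): 0.8,
    ("weekly", "周刊"): 0.7,
}
zh = ["简", "氏", "防务", "周刊"]
good = model1_prob(["jane's", "defense", "weekly"], zh, toy_t)
bad = model1_prob(["the", "review"], zh, toy_t)
```

Because the score is a product over English words, longer candidates are penalised; a practical system would normalise for length before comparing candidates.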
Phrase Alignment Features
• Frequency-distance model
  • Frequency of co-occurrence
  • Distance within a snippet
  • s_i: returned snippet containing (f, e)
  • f_i: ith occurrence of the source phrase in s_i
  • d: distance, i.e., how many words lie in between
Example (distances d1 and d2 are marked in the original figure):
<b> 马 语 者 </b> | the_horse_whisperer | the_review | ++ ++ ++ ++ + | <b> 马 语 者 </b> | horse_whisperer | ac3 | cd1_verycd_com |+ ++ | peter_hewitt
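The formula on this slide did not survive extraction, so the sketch below is not the authors' definition. It only illustrates one plausible way to combine the two ingredients named above: every occurrence of the source phrase adds a vote for a candidate, and the vote shrinks with the distance d to the nearest occurrence of that candidate. The weighting 1 / (1 + d) and the name `frequency_distance_score` are assumptions.

```python
def frequency_distance_score(snippets, source, candidate):
    """Sum distance-discounted votes over all source-phrase occurrences.

    snippets: list of token lists (preprocessed snippets, source phrase as one token).
    d is the number of tokens strictly between the source phrase and the
    nearest occurrence of the candidate in the same snippet.
    """
    score = 0.0
    for tokens in snippets:
        cand_pos = [i for i, tok in enumerate(tokens) if tok == candidate]
        if not cand_pos:
            continue                              # candidate absent: no vote
        for i, tok in enumerate(tokens):
            if tok == source:
                d = min(abs(i - j) for j in cand_pos) - 1
                score += 1.0 / (1 + d)            # closer candidates count more
    return score


# Toy snippets for illustration only.
toy_snippets = [
    ["马语者", "the_horse_whisperer", "the_review"],
    ["马语者", "the_horse_whisperer", "ac3"],
]
near = frequency_distance_score(toy_snippets, "马语者", "the_horse_whisperer")
far = frequency_distance_score(toy_snippets, "马语者", "the_review")
```

In the toy data the candidate that appears in both snippets, directly next to the source phrase, ends up with the higher score.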
Phrase Alignment Features
• Feature combination
  • Confidence measure of the transliteration model
  • Confidence measure of the translation model
  • Overall combined feature cost
Outline
• Motivation
• Cross-lingual query expansion
• Key phrase translation extraction
• Experiments
• Conclusion and future work
Experiment
• Test set
  • 310 key phrases manually selected from 12 domains
  • Manual translations as reference
  • One phrase may have several correct translations
Inclusion Rate
Define inclusion rate as: (# of phrases whose translations are included in the returned snippets) / (total # of phrases)
Chart settings compared: With Hint, Whole Web; No Hint, English Pages; No Hint, Whole Web
Alignment Accuracy
Define alignment accuracy as: (# of correct phrase translations) / (# of phrases whose translations can be retrieved in snippets)
Overall Translation Accuracy
LiveTrans: an OOV translator using web corpora
http://livetrans.iis.sinica.edu.tw/lt.html
Sample Translation Results
Conclusion and Future Work
• Find key phrase translations via Web mining
  • Cross-lingual query expansion finds more relevant web page snippets
  • Transliteration, translation and frequency-distance features extract the correct translation
  • Significant improvements over several existing systems
• Future work
  • Experiment on other language pairs, e.g. Arabic
  • Select effective hint words based on richer features
  • Flexible phrase boundary detection
  • Apply to MT tasks
References
• P. Fung and L. Y. Yee. An IR Approach for Translating New Words from Nonparallel, Comparable Texts. In Proc. of COLING-ACL, pp. 414-420, 1998.
• F. Huang, S. Vogel and A. Waibel. Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization. In Proceedings of the 41st ACL, Workshop on Multilingual and Mixed-Language Named Entity Recognition, Sapporo, Japan, July 2003.
• W.-H. Lu, L.-F. Chien and H.-J. Lee. Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach. ACM Transactions on Information Systems, 22(2), pp. 242-269, 2004.
• P. Resnik and N. A. Smith. The Web as a Parallel Corpus. Computational Linguistics, 29(3), pp. 349-380, 2003.
• Y. Zhang and P. Vines. Detection and Translation of OOV Terms Prior to Query Time. In SIGIR ’04, pp. 524-525. ACM Press, 2004.
• Y. Zhang, F. Huang and S. Vogel. Mining Translations of OOV Terms from the Web through Cross-lingual Query Expansion. In SIGIR ’05.