  1. The Soul Lost in the Blue Bridge soul decided blue bridge . Mining Key Phrase Translations from Web Corpora

  2. Mining Translations for Key Phrases from Web Corpora. Ying Zhang (Joy), Fei Huang, Stephan Vogel. CMU/LTI MT Lunch Presentation, April 19, 2005.

  3. Outline • Motivation • Crosslingual query expansion • Key phrase translation extraction • Experiments • Conclusion and future work

  4. Key Phrase • Definition: named entities (person, organization and location), book/movie titles, terminology (medical, sci&tech, military, …) • Most of them are compound nouns • The meaning cannot be directly derived from the components • Requires more world knowledge to translate • Important for NLP applications: Machine Translation (MT), Cross-lingual Information Retrieval (CLIR), Question Answering (QA) • Most of them are out-of-vocabulary (OOV)

  5. Searching the web for the translation? • Searching the parallel data on the web (e.g. STRAND: Resnik 2003)

  6. Bilingual Information on the Web • Searching the parallel data on the web (Resnik 2003) • Searching the comparable corpus on the web (Fung 1998)

  7. Bilingual Information on the Web • Searching the parallel data on the web (Resnik 2003) • Searching the comparable corpus on the web (Fung 1998) • Anchor texts pointing to the same page (Lu 2004)

  8. Bilingual Information on the Web • Parallel/comparable bilingual resources on the web are limited • STRAND: 3,500 English-Chinese document pairs and fewer than 2,500 for English-French (Resnik 2003) • Comparable corpora: from 10 years of Xinhua Chinese and English stories (2GB), only 110K sentence pairs (44MB) are found to be “parallel” (Zhao & Vogel 2002) • Anchor text mining: from 2M web pages, 2.8MB of Chinese text and 3.1MB of English text are found as potential translations • More bilingual information is on the web in the form of mixed-language web pages • Parallel text is not needed in most cases • Chinese authors usually include the original English for the key phrases • For consistency • To give the readers more information • If they are not sure about the translation in Chinese

  9. Web pages of mixed languages

  10. Web pages of mixed languages

  11. Mining translations from mixed-lang. pages • Crawling the Chinese web pages that contain English text (Zhang and Vines, SIGIR 2004) • Use Google to locate the web pages containing the Chinese terms • English expressions that occur next to the Chinese terms are considered to be their translations • Crawled 2GB of web data; 1,168 distinct English terms found, of which 61% are correct translations • Searching for the Chinese terms among the English pages (Cheng et al., SIGIR 2004) • Use Google to retrieve “English” pages containing the Chinese terms • Extract translations from the snippets • LiveTrans system

  12. Mining translations from mixed-lang pages

  13. Pros and cons of these approaches

  14. Our approach: cross-lingual query expansion • Query expansion: expanding the original query to better represent the user’s “information need” • E.g. expand the query “cmu” to “cmu pittsburgh” if the user wants to find pages about “Carnegie Mellon University” instead of “Central Michigan University” • Cross-lingual query expansion • Information need: not pages relevant to query Q, but pages containing the translation of Q • How to represent this “need”? • Observation: if Q and Q’ are two related Chinese terms and a webpage contains Q and its translation E, then Q’ and its translation E’ are very likely to appear on the same page • If we know Q’ and its translation E’, expand query Q with E’ • Q’ and E’ are called hint words

  15. Cross-lingual query expansion • A good Chinese hint word should be: • Relevant to the term to be translated • Easy to translate given the current resources • How to find this Chinese hint word? • Search Google • Select Chinese words Q’ with high frequency • Use only those having translations in the LDC lexicon • Search Google again with Q+E’ pairs
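
The hint-word procedure on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `select_hint_words` and `expand_queries` are hypothetical, the snippets are assumed to be already word-segmented, and a plain dictionary stands in for the LDC lexicon. The web search itself is left out.

```python
from collections import Counter


def select_hint_words(segmented_snippets, query, lexicon, top_n=3):
    """Pick frequent Chinese words Q' that have a translation in the lexicon.

    segmented_snippets: list of token lists (assumed output of a word segmenter).
    lexicon: dict mapping a Chinese word to one English translation
             (a stand-in for the LDC lexicon; an assumption of this sketch).
    """
    counts = Counter(
        token
        for tokens in segmented_snippets
        for token in tokens
        if token != query and token in lexicon
    )
    return [word for word, _ in counts.most_common(top_n)]


def expand_queries(query, hint_words, lexicon):
    """Build one expanded query string Q + E' per hint word."""
    return [f"{query} {lexicon[hint]}" for hint in hint_words]


# Toy data: snippets returned for the first search on Q.
snippets = [
    ["廊桥遗梦", "电影", "小说"],
    ["电影", "廊桥遗梦", "导演"],
    ["电影", "小说", "下载"],
]
toy_lexicon = {"电影": "movie", "小说": "novel"}

hints = select_hint_words(snippets, "廊桥遗梦", toy_lexicon, top_n=2)
expanded = expand_queries("廊桥遗梦", hints, toy_lexicon)
```

Each string in `expanded` would then be sent as the second search, so that the returned pages are biased toward ones that mix Chinese and English text about the same topic.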

  16. Cross-lingual Query Expansion

  17. Cross-lingual Query Expansion

  18. Cross-lingual Query Expansion

  19. Cross-lingual Query Expansion

  20. Comparing with other approaches • Next step …

  21. Outline • Motivation • Crosslingual query expansion • Key phrase translation extraction • Preprocessing • Multiple features • Transliteration model • Translation model • Frequency-distance model • Feature combination • Experiments • Conclusion and future work

  22. Preprocessing • HTML tag filtering • Chinese word segmentation • Character replacement • Replacing punctuation with the separator “|” • Replacing non-query Chinese words with “+” • Grouping continuous English words into a phrase
      Example:
      Original snippet: 《<b>廊桥遗梦</b>》(the bridges of madison county) [review]. 发布者:anjing | 发布时间:2004-01-25 星期日02:13 | 最新更新时间
      After word segmentation: 《<b>廊 桥 遗 梦</b>》(the bridges of madison county) [review]. 发布 者:anjing | 发布 时间:2004-01-25 星期日 02:13 | 最新 更新 时间
      After character replacement: | <b>廊 桥 遗 梦</b> | the bridges of madison county | review | ++ +| anjing | ++ ++ | 2004-01-25 +++ 02:13 | ++ ++ ++
      After grouping English words: | <b>廊桥遗梦</b> | the_bridges_of_madison_county | review | ++ + | anjing | ++ ++ | 2004-01-25 +++ 02 13 | + + ++ ++
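
The preprocessing steps can be approximated with regular expressions. The sketch below is a simplification under stated assumptions: the function name `preprocess` is hypothetical, word segmentation is skipped (so each non-query Chinese character becomes one “+” with no spaces between words), and all HTML tags are dropped, including the `<b>` markers that the slide example keeps around the query.

```python
import re

CJK_RANGE = "\u4e00-\u9fff"  # basic CJK Unified Ideographs block


def preprocess(snippet, query):
    """Return the snippet as a list of fields separated at punctuation."""
    text = re.sub(r"<[^>]+>", "", snippet)  # HTML tag filtering
    pieces = []
    # Split on the query (kept via the capture group) so it is never masked.
    for part in re.split(f"({re.escape(query)})", text):
        if part == query:
            pieces.append(f" {query} ")
            continue
        part = re.sub(f"[{CJK_RANGE}]", "+", part)  # non-query Chinese -> "+"
        part = re.sub(r"[^\w+\s]", "|", part)       # punctuation -> "|"
        pieces.append(part)
    text = "".join(pieces)
    # Group runs of space-separated English words into one phrase.
    text = re.sub(
        r"[A-Za-z]+(?: [A-Za-z]+)+",
        lambda match: match.group().replace(" ", "_"),
        text,
    )
    fields = [field.strip() for field in text.split("|")]
    return [field for field in fields if field]


fields = preprocess(
    "《<b>廊桥遗梦</b>》(the bridges of madison county) [review]. 发布者:anjing",
    "廊桥遗梦",
)
# fields == ["廊桥遗梦", "the_bridges_of_madison_county", "review", "+++", "anjing"]
```

Masking non-query Chinese text keeps its length visible, so later distance-based features can still count how far a candidate English phrase is from the query.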

  23. Phrase Alignment Features • Transliteration model • Captures phonetic similarity • Person, location and brand names • Probabilistic surface string alignment • Romanized source phrase vs. target phrase • Letters are aligned according to their pronunciation similarity (not orthographic forms) • Letter pronunciation similarities are automatically learned from bilingual NE lists using EM • Figure: key phrase alignment path (雅诗兰黛 vs. Estee Lauder) • Reference: Huang, Vogel and Waibel, Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization, ACL 03 Multilingual NE Recognition Workshop
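
To make the idea of letter-level alignment concrete, here is a weighted edit-distance sketch. It is a stand-in, not the model on the slide: the slide describes a probabilistic alignment whose letter similarities are learned with EM, whereas this sketch uses a small hand-written table of similar letter pairs (`similar_pairs`) and a fixed, assumed cost (`similar_cost`). The function name `transliteration_cost` is hypothetical.

```python
def transliteration_cost(source_romanized, target, similar_pairs, similar_cost=0.2):
    """Length-normalized weighted edit distance between two letter strings.

    similar_pairs: set of frozensets such as frozenset("bp"), marking letters
    treated as phonetically close (an illustrative assumption; the slide's
    model learns these similarities from bilingual NE lists).
    """
    s, t = source_romanized.lower(), target.lower()
    previous = [float(j) for j in range(len(t) + 1)]
    for i, s_char in enumerate(s, start=1):
        current = [float(i)]
        for j, t_char in enumerate(t, start=1):
            if s_char == t_char:
                substitution = 0.0
            elif frozenset((s_char, t_char)) in similar_pairs:
                substitution = similar_cost
            else:
                substitution = 1.0
            current.append(
                min(
                    previous[j] + 1.0,               # skip a source letter
                    current[j - 1] + 1.0,            # skip a target letter
                    previous[j - 1] + substitution,  # align the two letters
                )
            )
        previous = current
    return previous[-1] / max(len(s), len(t), 1)


similar = {frozenset("bp")}
close = transliteration_cost("ba", "pa", similar)  # b/p treated as similar
far = transliteration_cost("ba", "pa", set())      # no similarity knowledge
```

A lower cost means the candidate is a more plausible transliteration, so `close` is smaller than `far` for the same pair of strings.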

  24. Phrase Alignment Features • Translation model • Captures semantic similarity • Organization names • Science and technical terms • Military terms • Calculate phrase translation probability using IBM models • Example: 简 氏 防务 周刊 vs. Jane’s defense weekly

  25. Phrase Alignment Features • Frequency-distance model • Frequency of co-occurrence • Distance within a snippet • si: returned snippet containing (f,e) • fi: ith occurrence of the source phrase in si • d: distance, i.e., how many words lie in between • Example snippet, with distances d1 and d2: <b> 马 语 者 </b> | the_horse_whisperer | the_review | ++ ++ ++ ++ + | <b> 马 语 者 </b> | horse_whisperer | ac3 | cd1_verycd_com |+ ++ | peter_hewitt
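
The scoring formula for this model did not survive in the transcript, so the following is only a sketch under an assumed score: each co-occurrence of the source phrase f and a candidate e contributes 1/(d+1), where d is the number of tokens between them, and contributions are summed over all occurrences and snippets. The function name `frequency_distance_scores` and the inverse-distance weighting are assumptions of this sketch, not the authors' definition.

```python
from collections import defaultdict


def frequency_distance_scores(snippets, query):
    """Score English candidates by how often and how close they appear to the query.

    snippets: list of token lists, e.g. preprocessed fields in page order.
    Assumed score: sum over snippets and query occurrences of 1 / (d + 1),
    where d is the number of tokens between the query and the candidate.
    """
    scores = defaultdict(float)
    for tokens in snippets:
        query_positions = [i for i, token in enumerate(tokens) if token == query]
        for j, token in enumerate(tokens):
            # Candidates are ASCII tokens starting with a letter; this skips
            # the query itself and masked Chinese text such as "++".
            if not token or not token.isascii() or not token[0].isalpha():
                continue
            for i in query_positions:
                scores[token] += 1.0 / abs(i - j)  # abs(i - j) == d + 1
    return dict(scores)


toy_snippets = [
    ["马语者", "the_horse_whisperer", "the_review"],
    ["马语者", "horse_whisperer", "ac3"],
    ["the_horse_whisperer", "+", "马语者"],
]
scores = frequency_distance_scores(toy_snippets, "马语者")
best = max(scores, key=scores.get)
```

With this weighting a candidate that appears often and right next to the query accumulates the highest score, which matches the two signals named on the slide.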

  26. Phrase Alignment Features • Feature combination • Confidence measure of the transliteration model • Confidence measure of the translation model • Overall combined feature cost

  27. Outline • Motivation • Crosslingual query expansion • Key phrase translation extraction • Experiments • Conclusion and future work

  28. Experiment • Test set • 310 key phrases manually selected from 12 domains • Manual translation as reference • One phrase may have several correct translations

  29. Inclusion Rate • Chart series: With Hint, Whole Web; No Hint, English Pages; No Hint, Whole Web • Define inclusion rate as: (# of phrases whose translations are included in the returned snippets) / (total # of phrases)
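
The inclusion-rate definition translates directly into code. This is a minimal sketch: the function name `inclusion_rate`, the dictionary layout, and the case-insensitive substring match are assumptions made for illustration.

```python
def inclusion_rate(references, snippets_by_phrase):
    """Fraction of phrases with at least one reference translation in their snippets.

    references: dict mapping each test phrase to its list of correct translations.
    snippets_by_phrase: dict mapping each test phrase to its returned snippets.
    """
    if not references:
        return 0.0
    included = 0
    for phrase, translations in references.items():
        text = " ".join(snippets_by_phrase.get(phrase, [])).lower()
        if any(translation.lower() in text for translation in translations):
            included += 1
    return included / len(references)


toy_references = {
    "廊桥遗梦": ["the bridges of madison county"],
    "马语者": ["the horse whisperer"],
}
toy_snippets_by_phrase = {
    "廊桥遗梦": ["《廊桥遗梦》(The Bridges of Madison County) [review]"],
    "马语者": ["马语者 下载"],
}
rate = inclusion_rate(toy_references, toy_snippets_by_phrase)  # 0.5
```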

  33. Alignment Accuracy • Define alignment accuracy as: (# of correct phrase translations) / (# of phrases whose translation can be retrieved in snippets)
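
Alignment accuracy is computed only over the phrases whose translation was retrievable in the snippets. The sketch below assumes one predicted translation per phrase and a case-insensitive exact match against the reference list; the function name `alignment_accuracy` and the argument layout are hypothetical.

```python
def alignment_accuracy(predictions, references, retrievable):
    """Fraction of retrievable phrases whose predicted translation is correct.

    predictions: dict mapping a phrase to the extracted translation.
    references: dict mapping a phrase to its list of correct translations.
    retrievable: phrases whose translation appears in the returned snippets.
    """
    if not retrievable:
        return 0.0
    correct = 0
    for phrase in retrievable:
        accepted = {reference.lower() for reference in references[phrase]}
        if predictions.get(phrase, "").lower() in accepted:
            correct += 1
    return correct / len(retrievable)


toy_refs = {
    "马语者": ["the horse whisperer"],
    "雅诗兰黛": ["Estee Lauder"],
    "廊桥遗梦": ["the bridges of madison county"],
}
toy_predictions = {
    "马语者": "the horse whisperer",
    "雅诗兰黛": "estee lauder",
    "廊桥遗梦": "review",
}
accuracy = alignment_accuracy(toy_predictions, toy_refs, list(toy_refs))
```

Separating this measure from the inclusion rate shows whether an error comes from the search step (the translation was never retrieved) or from the extraction step (it was retrieved but not selected).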

  34. Overall Translation Accuracy • LiveTrans: an OOV translator using web corpora, http://livetrans.iis.sinica.edu.tw/lt.html

  35. Sample Translation Results

  36. Conclusion and Future Work • Find key phrase translations via Web mining • Crosslingual query expansion finds more relevant webpage snippets • Transliteration, translation and frequency-distance features extract the correct translation • Significant improvements over several existing systems • Future work • Experiment on other language pairs, e.g. Arabic • Select effective hint words based on richer features • Flexible phrase boundary detection • Apply to MT tasks

  38. References • P. Fung and L. Y. Yee. An IR Approach for Translating New Words from Nonparallel, Comparable Texts. In Proc. of COLING-ACL, pp. 414-420, 1998. • F. Huang, S. Vogel and A. Waibel. Automatic Extraction of Named Entity Translingual Equivalence Based on Multi-feature Cost Minimization. In Proc. of the 41st ACL, Workshop on Multilingual and Mixed-Language Named Entity Recognition, Sapporo, Japan, July 2003. • W.-H. Lu, L.-F. Chien and H.-J. Lee. Anchor Text Mining for Translation of Web Queries: A Transitive Translation Approach. ACM Transactions on Information Systems 22(2), pp. 242-269, 2004. • P. Resnik and N. A. Smith. The Web as a Parallel Corpus. Computational Linguistics 29(3), pp. 349-380, 2003. • Y. Zhang and P. Vines. Detection and Translation of OOV Terms Prior to Query Time. In SIGIR ’04, pp. 524-525, ACM Press, 2004. • Y. Zhang, F. Huang and S. Vogel. Mining Translations of OOV Terms from the Web through Cross-lingual Query Expansion. In SIGIR ’05.
