
Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval

This paper explores a segmentation-free method for extracting translations of out-of-vocabulary terms in Chinese-English and English-Chinese cross-language information retrieval (CLIR). By leveraging web text extraction and co-occurrence statistics, the proposed method significantly improves upon previous manual intervention approaches.

chrisstone


Presentation Transcript


  1. Using the Web for Automated Translation Extraction in Cross-Language Information Retrieval • Advisor: Dr. Hsu • Presenter: Zih-Hui Lin • Authors: Ying Zhang and Phil Vines

  2. Outline • Motivation • Objective • Previous work • Methodology • Experiments and results • Conclusions

  3. Motivation • One of the major remaining reasons that CLIR does not perform as well as monolingual retrieval is the presence of out-of-vocabulary (OOV) terms. • An OOV term is not recognized by the segmenter and is instead split into smaller sequences of characters or individual characters. • Example: 北野武 (Kitano Takeshi) is translated character by character as "north limit military". • Previous work has either relied on manual intervention or has only been partially successful in solving this problem.

  4. Objective • We propose a segmentation-free method that can be applied to both Chinese-English and English-Chinese CLIR. It automatically extracts correct translations of OOV terms from the Web and is thus a significant improvement on earlier work.

  5. English translation extraction in Chinese-English CLIR • Chinese OOV term detection: for a query such as 北野武 (mistranslated as "north limit military"), the Pvalue given by the HMM will be very low; if Pvalue < Pmin, the query is taken to contain OOV terms. • Web text extraction: we extract strings that contain the Chinese query terms and some English text from the Web. • Collection of co-occurrence statistics. • Translation selection. • Example strings, where c2 ... c6 label the Chinese characters that precede the English text e1: 北野武 (Kitano Takeshi) gives c4 c5 c6 e1; 導演北野武 (Kitano Takeshi), "director Kitano Takeshi", gives c2 c3 c4 c5 c6 e1.
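The web text extraction and co-occurrence steps on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the authors' implementation: the `PATTERN` regex, the `collect_pairs` helper, and the sample snippets are hypothetical, and the sketch assumes that the English text appears in parentheses right after the Chinese characters, as in the slide's example.

```python
import re
from collections import Counter

# Assumption: a run of CJK ideographs followed by a parenthesised Latin-script
# string, e.g. 北野武(Kitano Takeshi). Both ASCII and full-width parentheses match.
PATTERN = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'-]*?)\s*[)）]"
)

def collect_pairs(snippets, query):
    """Count English strings that directly follow the Chinese query term."""
    counts = Counter()
    for text in snippets:
        for chinese, english in PATTERN.findall(text):
            # Keep only matches whose Chinese run ends with the query term,
            # so extra leading characters (such as 導演) do not matter.
            if chinese.endswith(query):
                counts[english] += 1
    return counts

# Hypothetical snippets standing in for text fetched from the Web.
snippets = [
    "北野武(Kitano Takeshi)",
    "導演北野武 (Kitano Takeshi) 的電影",
    "北野武（Takeshi Kitano）",
]
counts = collect_pairs(snippets, "北野武")
best, _ = counts.most_common(1)[0]
```

Choosing the most frequent English string is a simplification made for this sketch; the slide does not spell out the selection rule for this direction.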

  6. Chinese translation extraction in English-Chinese CLIR • Extraction of web text: use Google to fetch the top 100 Chinese documents with the English OOV term eoov as the query. • Collection of co-occurrence statistics: accumulate the frequency foov; consider all substrings in Sleft and Sright, collecting the frequency fn and the length |sn| of each Chinese substring. • Translation selection: exclude any substring that is already in the translation dictionary or does not occur in the document collection. Selection criteria shown on the slide: longest length, highest frequency. • Example layout: Sleft (20 characters), eoov, Sright (20 characters), illustrated with 與區域貿易………兩岸經貿關係(Canada and Cross Straits Economic Relations)三組發表九………英茂、加
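A minimal Python sketch of the statistics collection and selection steps follows. It rests on several assumptions: the function names, the sample documents, and the ranking (frequency first, length as tie-breaker) are hypothetical, since the slide names the two criteria without saying how they are combined. The check against the document collection is left out.

```python
from collections import Counter

WINDOW = 20  # characters kept on each side of the OOV term, as on the slide

def is_cjk(ch):
    return "\u4e00" <= ch <= "\u9fff"

def cjk_substrings(window, min_len=2):
    """Yield every substring of `window` that consists only of CJK characters."""
    for i in range(len(window)):
        j = i
        while j < len(window) and is_cjk(window[j]):
            j += 1
            if j - i >= min_len:
                yield window[i:j]

def candidate_stats(documents, e_oov):
    """Return (f_oov, freq): occurrences of e_oov and substring frequencies."""
    f_oov = 0
    freq = Counter()
    for doc in documents:
        start = doc.find(e_oov)
        while start != -1:
            f_oov += 1
            end = start + len(e_oov)
            s_left = doc[max(0, start - WINDOW):start]
            s_right = doc[end:end + WINDOW]
            for window in (s_left, s_right):
                freq.update(cjk_substrings(window))
            start = doc.find(e_oov, end)
    return f_oov, freq

def select_translation(freq, dictionary):
    """Drop known dictionary entries, then rank by frequency, then by length."""
    candidates = [s for s in freq if s not in dictionary]
    if not candidates:
        return None
    return max(candidates, key=lambda s: (freq[s], len(s)))

# Hypothetical documents standing in for the fetched Chinese web pages.
documents = [
    "報導指出戴奧辛(Dioxin)污染嚴重",
    "食品中的戴奧辛 Dioxin 含量",
    "戴奧辛（Dioxin）是有毒物質",
]
f_oov, freq = candidate_stats(documents, "Dioxin")
translation = select_translation(freq, dictionary=set())
```

In this toy data several short substrings are equally frequent, so the length tie-breaker is what prefers the full term over its fragments.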

  7. Experiments and results • Chinese-English CLIR • English-Chinese CLIR

  8. Introduction • When translating from Chinese to English, a standard first step is to segment the text into words based on an existing segmentation dictionary. • However, where an OOV term occurs, it is not recognized and is instead segmented into smaller sequences of characters or individual characters. • We propose a segmentation-free method based on frequency and length analysis and corpus-based disambiguation.
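The segmentation failure described above can be illustrated with a short Python sketch. It assumes forward maximum matching, one common dictionary-based segmentation strategy; the slides do not say which segmenter the authors used, and the `segment` function and toy dictionary are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

# Toy segmentation dictionary that lacks the name 北野武.
dictionary = {"導演", "電影"}
tokens = segment("導演北野武", dictionary)
# 導演 is found in the dictionary, but 北野武 falls apart into single characters.
```

Each leftover character would then be translated on its own, which is how a name ends up as unrelated words.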

  9. Previous work • Dictionary-based translation schemes need to address three major issues: • Phrase identification and translation, e.g. "non proliferation treaty" and "cross straits". • Translation ambiguity, addressed using techniques such as term co-occurrence, mutual information, or language modeling. • Out-of-vocabulary (OOV) terms, e.g. "Dioxin".

  10. Previous work: existing approaches to the OOV problem • Depending on the language, it may be possible to deduce appropriate transliterated translations automatically; this approach has been applied successfully in English-Arabic CLIR. • However, the issue is more difficult in Chinese, as many characters have the same sound and many English syllables do not have equivalent sounds in Chinese, so selecting the correct characters to represent a transliterated word can be problematic. • Examples: cross straits (兩岸); 北野武 ("north limit military").

  11. Previous work: segmentation-free translation extraction • It is common to find a small amount of English text in Chinese web documents, but extremely rare to find Chinese text in English web documents. • We therefore rely on Chinese web documents to extract translations in both directions. • The problem is that the Chinese OOV term we are looking for is unknown in advance, and thus we have no information about how it should be segmented. • In previous work, this problem was overcome by manual intervention to provide appropriate segmentation.

  12. Experiments and results • Chinese-English CLIR: retrieving English documents using Chinese queries.

  13. Experiments and results (cont.) • English-Chinese CLIR: retrieving Chinese documents using English queries. • The aim of our work is to find appropriate Chinese translations of English OOV terms.

  14. Conclusions • We have also described improved ways to extract the translation of OOV terms from the Web in a way that does not rely on prior segmentation. • Although the Web is constantly changing, we were able to find most OOV terms, many of which relate to news events from up to 10 years ago.

  15. My opinion • Advantage: segmentation-free translation extraction • Disadvantage: • Application: online translation (線上翻譯) ………..
