What’s different with Chinese in cross-language IR?

Jian-Yun Nie, University of Montreal, Canada.

Presentation Transcript


  1. What’s different with Chinese in cross-language IR? Jian-Yun Nie University of Montreal, Canada IRF

  2. Outline • General characteristics of Chinese • Monolingual IR in Chinese • CLIR with Chinese • OOV: an important problem in Chinese IR • Solutions?

  3. 1. General characteristics of Chinese • Sentence = ideograms with no separation: 它是一种适于在拖拉机使用的转向球接头,… • Words? 它/是/一种/适于/在/拖拉机/使用/的/转向/球/接头/,…

  4. Word formation • Each character can be a word (人 - person) • Most words are composed of two or more characters (人群 - mass) • However • No clear definition of the notion of word • 办公楼 (office building) → /办公楼/ or /办公/楼/? • Inconsistency in manual segmentation • Many new words are created (abbreviations) • E.g. 网络 (network) + 管理员 (administrator) → 网管 (webmaster)

  5. 2. IR using word segmentation • Segmentation uses rules, dictionaries and/or statistics • Problems for information retrieval • Segmentation ambiguity: more than one possible segmentation, e.g. “发展中国家” → 发展中 (developing) / 国家 (country); 发展 (development) / 中 (middle) / 国家 (country); 发展 (development) / 中国 (China) / 家 (family) • Different words have similar meanings: 接头 (connector, plug) ↔ 插头 (plug) ↔ 插座 (plug) • New words can be formed quite freely • 接 (reception) 桶 (bucket): not a common word, but can be used • 网 (network) 店 (store): increasingly used • 的 (of, taxi) 车 (car): taxi car (?), car of (someone)…
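
The segmentation ambiguity on slide 5 can be reproduced with a short sketch. It enumerates every way to split a string into entries of a lexicon. The function name `all_segmentations` and the six-entry toy lexicon are assumptions made for illustration; the slides do not describe a specific segmenter.

```python
def all_segmentations(text, lexicon):
    """Return every split of `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for tail in all_segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Toy lexicon (hypothetical, chosen to cover the slide's example only).
LEXICON = {"发展", "发展中", "中", "中国", "国家", "家"}

segmentations = all_segmentations("发展中国家", LEXICON)
for seg in segmentations:
    print("/".join(seg))
```

With this lexicon the sketch yields the three readings listed on the slide, which is why a dictionary alone cannot decide the segmentation and rules or statistics are needed to rank the alternatives.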

  6. Alternative: n-grams • Usually unigrams and bigrams • As effective as using word segmentation • Account for some flexibility • However • Noise: non-meaningful combinations • Wrong combinations • Example: 非酿造型啤酒 (non-brewed beer) • Words: 非/酿造/型/啤酒 • Bigrams: 非酿/酿造/造型/型啤/啤酒, which include non-meaningful bigrams (e.g. 非酿, 型啤) and a wrong combination, 造型 (style, appearance, …)
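
Overlapping character n-grams as used on slide 6 can be produced with a few lines of code. This is a minimal sketch; the helper name `char_ngrams` is an assumption, not something named in the slides.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


phrase = "非酿造型啤酒"  # example phrase from the slide
unigrams = char_ngrams(phrase, 1)
bigrams = char_ngrams(phrase, 2)

print("/".join(unigrams))
print("/".join(bigrams))  # 非酿/酿造/造型/型啤/啤酒
```

No dictionary is involved, so the indexing never fails on new words, but every adjacent pair is kept, including the noisy ones the slide points out.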

  7. Possible approach: combining words and n-grams • Chinese monolingual IR, example: 前年收入有所下降 • Word: 前年/收入/有所/下降 or 前/年收入/有所/下降 • Unigram: 前/年/收/入/有/所/下/降 • Bigram: 前年/年收/收入/入有/有所/所下/下降 • Score function in language modeling similar to other languages • Previous results: word ≈ bigram > unigram

  8. Our recent tests

  9. Why is this useful? • NTCIR 5 Topic 18: 烟草商诉讼赔偿 (tobacco company, suit, compensation) • Word: 烟草商 (tobacco company) / 诉讼 (suit) / 赔偿 (compensation) • Unigram (0.7659) > Word (0.1625) • The relevant documents include the words 烟草, 公司, 业者, 香烟, 烟商, none of which matches “烟草商” • NTCIR 5 Topic 24: 经济舱综合症候群航班 (economy class, syndrome, flight) • Word: 经济 (economy) / 综合症 (syndrome) / 候 (wait) / 航班 (flight) • Unigram (0.7607) > Word (0.0002) • “..综合症候..” is segmented into “../综合症/候/..”, so it cannot match “症候” (syndrome) • Combining words with unigrams or bigrams helps
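
The Topic 18 effect on slide 9 can be illustrated with a small sketch. It scores documents with a smoothed query likelihood over two representations (words and character unigrams) and mixes the two scores linearly. The slides do not give the actual scoring formula or weights, so the Jelinek-Mercer smoothing, the weights `lam` and `alpha`, the probability `floor`, and the two-document toy collection are all assumptions.

```python
import math
from collections import Counter


def log_likelihood(query_units, doc_units, coll_units, lam=0.5, floor=1e-9):
    """Query log-likelihood with Jelinek-Mercer smoothing (assumed model)."""
    doc_tf, coll_tf = Counter(doc_units), Counter(coll_units)
    total = 0.0
    for unit in query_units:
        p_doc = doc_tf[unit] / len(doc_units)
        p_coll = coll_tf[unit] / len(coll_units)
        # The floor avoids log(0) for units unseen in the whole collection.
        total += math.log(max((1 - lam) * p_doc + lam * p_coll, floor))
    return total


# Hand-segmented toy collection (hypothetical, for illustration only).
docs_words = {"d1": ["烟草", "公司", "赔偿"], "d2": ["航班", "延误"]}
query_words = ["烟草商"]

coll_words = [w for words in docs_words.values() for w in words]
coll_chars = [c for w in coll_words for c in w]
query_chars = [c for w in query_words for c in w]


def combined_score(doc_id, alpha=0.5):
    """Mix the word-based and unigram-based scores with weight `alpha`."""
    words = docs_words[doc_id]
    chars = [c for w in words for c in w]
    s_word = log_likelihood(query_words, words, coll_words)
    s_char = log_likelihood(query_chars, chars, coll_chars)
    return alpha * s_word + (1 - alpha) * s_char


ranking = sorted(docs_words, key=combined_score, reverse=True)
print(ranking)
```

With words alone (`alpha=1.0`) the two documents tie, because the query word 烟草商 occurs in neither. The unigram component separates them through the shared characters 烟 and 草.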

  10. Also works for Korean and Japanese?

  11. 3. CLIR: query translation • Machine translation: rules + dictionaries • Statistical translation model: • Parallel texts • Automatically extract possible translations • Comparison • A statistical TM does not produce human-readable translations • But it can include related words • Usually, word-based translation

  12. Our recent tests: also translate into n-grams • Parallel sentence pair: “history and civilization” || “历史文明” … • English side: words; Chinese side: words, unigrams, bigrams, or bigrams & unigrams • Bigram training data: history / and / civilization || 历史/史文/文明 … • Unigram training data: history / and / civilization || 历/史/文/明 … • GIZA++ training on each version • TM (word-to-bigram): p(历史|history), p(史文|history), p(文明|history) • TM (word-to-unigram): p(历|history), p(史|history), p(文|history)
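
The data preparation implied by slide 12 can be sketched as follows: the English side stays word-tokenized while the Chinese side is rewritten as character n-grams before alignment training. The helper names and the whitespace-separated output are assumptions; the slide does not specify the exact input files given to GIZA++, and the sketch does not call GIZA++ itself.

```python
def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def to_training_pair(english, chinese, n):
    """Build one (source, target) line pair: English words, Chinese n-grams."""
    source = english.lower().split()
    target = char_ngrams(chinese, n)
    return " ".join(source), " ".join(target)


# Sentence pair taken from the slide.
pair_bigram = to_training_pair("history and civilization", "历史文明", 2)
pair_unigram = to_training_pair("history and civilization", "历史文明", 1)

print(pair_bigram)
print(pair_unigram)
```

Training an alignment model on the bigram version yields word-to-bigram probabilities such as p(历史|history), and the unigram version yields word-to-unigram probabilities such as p(历|history).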

  13. Combining different translations (diagram labels: English query, Chinese documents)

  14. Bilingual linguistic resources for CLIR • An English-Chinese parallel corpus mined from the Web: about 281,000 parallel sentence pairs • LDC English-Chinese bilingual dictionaries: 42,000 entries • Translation model: combination of the 2 translation models
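
Slide 14 does not say how the two translation models are combined. One common option is linear interpolation of their translation probabilities, sketched below. The function name `combine`, the weight `lam`, and every probability value in the example are invented for illustration and are not results from the talk.

```python
def combine(p_corpus, p_dict, lam=0.7):
    """Linearly interpolate two translation distributions p(target | source)."""
    targets = set(p_corpus) | set(p_dict)
    mixed = {
        t: lam * p_corpus.get(t, 0.0) + (1 - lam) * p_dict.get(t, 0.0)
        for t in targets
    }
    total = sum(mixed.values())
    # Renormalize so the result is again a probability distribution.
    return {t: p / total for t, p in mixed.items()} if total else {}


# Made-up distributions for the source word "history".
p_from_corpus = {"历史": 0.6, "史文": 0.3, "文明": 0.1}
p_from_dictionary = {"历史": 0.5, "史": 0.5}

combined = combine(p_from_corpus, p_from_dictionary)
best = max(combined, key=combined.get)
print(best, round(combined[best], 2))
```

Interpolation keeps candidates that only one resource proposes, so the corpus-derived model can contribute related terms while the dictionary reinforces standard translations.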

  15. CLIR results

  16. General observations for Chinese IR • Using both words and n-grams for Chinese IR and Chinese query translation • N-grams can account for flexibility in Chinese words • CLIR with Chinese can also benefit from translations into Chinese n-grams

  17. 4. OOV problem in Chinese • OOV (out-of-vocabulary) problem • TREC queries: 63% of named entities are OOV • Even more on the Web • Specialized terms (abbreviations) • New words • Impossible to collect all terms manually • Solutions • Parallel texts (translations by n-grams) • Monolingual corpus

  18. Translation of named entities • Statistical transliteration • Frances Taylor → 弗朗西斯泰勒, 茀琅希思泰勒, 弗郎西丝泰勒 …

  20. Candidate extraction • Four templates to extract candidates: • c1c2..cn (En) • c1c2..cn, En, c’1c’2..c’m • c1c2..cn: En • c1c2..cn 是/即 (is / namely) En • Comparing the four templates • Template 1 is used in the following experiments
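
Template 1 on slide 20, a Chinese string followed by an English term in parentheses, can be approximated with a regular expression. This is only a sketch: the character range used for Chinese, the `max_len` cutoff, and the choice to emit every suffix of the Chinese run as a candidate are assumptions. The sample sentence is also invented, reusing the name from slide 18.

```python
import re

# Chinese run, then an English term in ASCII or full-width parentheses.
TEMPLATE_1 = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'-]*?)\s*[)）]"
)


def extract_candidates(text, max_len=8):
    """Return (english, candidates) pairs found by template 1 in `text`."""
    results = []
    for match in TEMPLATE_1.finditer(text):
        chinese, english = match.group(1), match.group(2)
        run = chinese[-max_len:]  # only the characters nearest the bracket
        # Every suffix of the run is a possible translation of the English term.
        candidates = [run[i:] for i in range(len(run))]
        results.append((english, candidates))
    return results


sample = "演员弗朗西斯泰勒(Frances Taylor)出席"
extracted = extract_candidates(sample)
for english, candidates in extracted:
    print(english, candidates)
```

The template only locates where a translation probably ends. The candidate list still contains strings with extra leading characters, which is why a later step has to pick the best candidate.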

  21. Translation model • Train a translation model • Candidate list

  22. Dictionary mining results • Processed more than 300 GB of Chinese web pages • 161,117 translation pairs mined

  23. Coverage of the dictionary on query log data • 9,065 popular English terms from the MSN Chinese search engine

  24. CLIR experiment

  25. Conclusions • In addition to the general approaches, Chinese IR should also consider the characteristics of the language • (also for other Asian languages: Japanese and Korean) • Difficulty in translating new (technical) words and proper names • Exploit parallel/comparable or monolingual texts • Additional problem: making the retrieved documents readable • Full-text translation • Running sentences in patents: relatively easy • Technical terms: may be difficult with Chinese • Gisting: translation assistance tool, useful for a user with some knowledge of the document language
