What’s different with Chinese in cross-language IR? Jian-Yun Nie University of Montreal, Canada IRF
Outline • General characteristics of Chinese • Monolingual IR in Chinese • CLIR with Chinese • OOV: important problem in Chinese IR • Solutions?
1. General characteristics of Chinese • Sentence = ideograms with no separation: 它是一种适于在拖拉机使用的转向球接头,… (It is a steering ball joint suitable for use in tractors, …) • Words? 它/是/一种/适于/在/拖拉机/使用/的/转向/球/接头/,…
Word formation • Each character can be a word (人: person) • Most words are composed of two or more characters (人群: mass) • However • No clear definition of the notion of word • 办公楼 (office building): /办公楼/ or /办公/楼/? • Inconsistency in manual segmentation • Many new words are created (abbreviations) • E.g. 网络 (network) + 管理员 (administrator) → 网管 (webmaster)
2. IR using word segmentation • Using rules, dictionaries and/or statistics • Problems for information retrieval • Segmentation ambiguity: more than one possible segmentation, e.g. “发展中国家”: 发展中 (developing) / 国家 (country); 发展 (development) / 中 (middle) / 国家 (country); 发展 (development) / 中国 (China) / 家 (family) • Different words have similar meanings: 接头 (connector, plug) ↔ 插头 (plug) ↔ 插座 (socket) • New words can be formed quite freely • 接 (reception) 桶 (bucket): not a common word, but can be used • 网 (network) 店 (store): more and more used… • 的 (of, taxi) 车 (car): taxi car (?), car of (someone)…
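The ambiguity of “发展中国家” can be reproduced with a few lines of Python. This is a minimal sketch, not the segmenter used in the experiments: the function name `segmentations` and the toy lexicon are assumptions made for illustration, and the lexicon contains only the six words glossed on the slide.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this prefix as a word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


# Toy lexicon (an assumption): only the words glossed on the slide.
LEXICON = {"发展中", "发展", "中", "中国", "国家", "家"}

for seg in segmentations("发展中国家", LEXICON):
    print("/".join(seg))
```

With this lexicon the function returns exactly the three segmentations listed on the slide. A real dictionary yields many more candidates, which is why rule-based or statistical disambiguation is needed.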
Alternative: n-grams • Usually unigrams and bigrams • As effective as using a word segmentation • Account for some flexibility • However • Noise: non-meaningful combinations • Wrong combinations • Example: 非酿造型啤酒 (non-brewed beer) • Words: 非/酿造/型/啤酒 • Bigrams: 非酿/酿造/造型/型啤/啤酒 • Non-meaningful: 非酿, 型啤 • Wrong: 造型 (style, appearance, …)
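Overlapping character n-grams are simple to generate. The sketch below (the helper name `char_ngrams` is an assumption) reproduces the bigram sequence of the beer example, including the noisy bigrams.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


phrase = "非酿造型啤酒"
unigrams = char_ngrams(phrase, 1)
bigrams = char_ngrams(phrase, 2)

print("/".join(unigrams))
print("/".join(bigrams))  # includes 造型, which is a real word but wrong here
```

No dictionary is involved, so every adjacent pair of characters becomes an index term whether or not it is a word.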
Chinese monolingual IR • Possible approach: combining words and n-grams • Example: 前年收入有所下降 • Word: 前年/收入/有所/下降 or 前/年收入/有所/下降 • Unigram: 前/年/收/入/有/所/下/降 • Bigram: 前年/年收/收入/入有/有所/所下/下降 • Score function in language modeling similar to other languages • Previous results: Word ~ bigram > unigram
Our recent tests
Why is this useful? • NTCIR 5 Topic 18: 烟草商诉讼赔偿 (Tobacco company, suit, compensation) • Word: 烟草商 (tobacco company) / 诉讼 (suit) / 赔偿 (compensation) • Unigram (0.7659) > Word (0.1625) • The relevant documents include the words 烟草, 公司, 业者, 香烟, 烟商, but the query word “烟草商” matches none of them. • NTCIR 5 Topic 24: 经济舱综合症候群航班 (Economy class, syndrome, flight) • Word: 经济 (economy) / 综合症 (syndrome) / 候 (wait) / 航班 (flight) • Ubigram (0.7607) > Word (0.0002) • “..综合症候..” is segmented into “../综合症/候/..”, so it cannot match “症候” (syndrome). • The combination of words with unigrams or bigrams helps
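The effect on Topic 18 can be illustrated with a toy matcher. The slides do not give the combination formula, so everything here is an assumption: the score is a plain term-overlap fraction rather than the language-modeling score used in the experiments, the two representations are combined by linear interpolation with weight `lam`, and the document is an invented, hand-segmented sentence.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def overlap_score(query_terms, doc_terms):
    """Fraction of query terms that occur in the document (toy score)."""
    doc_set = set(doc_terms)
    return sum(term in doc_set for term in query_terms) / len(query_terms)


def combined_score(word_score, ngram_score, lam=0.5):
    """Linear interpolation of the word-based and n-gram-based scores."""
    return lam * word_score + (1 - lam) * ngram_score


query_words = ["烟草商", "诉讼", "赔偿"]          # segmentation from the slide
doc_words = ["烟草", "公司", "同意", "赔偿"]      # invented toy document

word_score = overlap_score(query_words, doc_words)
unigram_score = overlap_score(
    char_ngrams("".join(query_words), 1),
    char_ngrams("".join(doc_words), 1),
)
print(word_score, unigram_score, combined_score(word_score, unigram_score))
```

At the word level only 赔偿 matches, because 烟草商 and 烟草 are different index terms. At the unigram level 烟 and 草 match as well, and the interpolated score keeps that evidence.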
3. CLIR: query translation • Machine translation: rules + dictionaries • Statistical translation model: • Parallel texts • Automatically extract possible translations • Comparison • Stat. TM does not produce human-readable translations • But can include related words • Usually, word-based translation
Our recent tests: also translate into n-grams • Parallel sentence pair: “history and civilization” || “历史文明” • Chinese side represented as: Chinese Word, Chinese Unigram, Chinese Bigram, Bigram&Unigram • English word to Chinese bigram: history / and / civilization || 历史/史文/文明 … • English word to Chinese unigram: history / and / civilization || 历/史/文/明 … • GIZA++ training on each version • TM (word-to-bigram): p(历史|history), p(史文|history), p(文明|history) • TM (word-to-unigram): p(历|history), p(史|history), p(文|history)
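To make the word-to-bigram model concrete, the sketch below tokenizes the Chinese side into bigrams and estimates p(bigram | English word) with a minimal IBM Model 1 EM loop. It is a stand-in for illustration, not GIZA++ itself: the function names are assumptions, there is no NULL word, and the second and third sentence pairs are invented toy data added so that EM has something to disambiguate with.

```python
def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train_ibm1(pairs, iterations=10):
    """Estimate t[(target, source)] = p(target | source) with IBM Model 1 EM."""
    # Start from a constant for every co-occurring pair (uniform after the E-step).
    t = {(tw, sw): 1.0 for src, tgt in pairs for sw in src for tw in tgt}
    for _ in range(iterations):
        count = {key: 0.0 for key in t}
        total = {}
        for src, tgt in pairs:
            for tw in tgt:
                # E-step: spread one count for `tw` over the source words.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] = total.get(sw, 0.0) + frac
        # M-step: renormalize so that p(. | source word) sums to 1.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


# First pair is from the slide; the other two are invented toy data.
raw_pairs = [
    ("history and civilization", "历史文明"),
    ("history", "历史"),
    ("civilization", "文明"),
]
pairs = [(en.split(), char_ngrams(zh, 2)) for en, zh in raw_pairs]
tm = train_ibm1(pairs)

for bigram in ["历史", "史文", "文明"]:
    print(bigram, round(tm[(bigram, "history")], 3))
```

Replacing `char_ngrams(zh, 2)` with `char_ngrams(zh, 1)` gives the word-to-unigram variant. Both models can then be used to translate an English query into Chinese index terms without segmenting the Chinese side.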
Combining different translations (diagram: English query, Chinese documents)
Bilingual linguistic resources for CLIR • An English-Chinese parallel corpus mined from the Web: about 281,000 parallel sentence pairs • LDC English-Chinese bilingual dictionaries: 42,000 entries • Translation model • Combination of the 2 translation models
CLIR results
General observations for Chinese IR • Use both words and n-grams for Chinese IR and Chinese query translation • N-grams can account for flexibility in Chinese words • CLIR with Chinese can also benefit from translations into Chinese n-grams
4. OOV problem in Chinese • OOV (Out-Of-Vocabulary) problem • TREC queries: 63% of named entities are OOV • Even more on the Web • Specialized terms (abbreviations) • New words • Impossible to collect all terms manually • Solutions • Parallel texts (translations by n-grams) • Mono-lingual corpus
Translation of named entities • Statistical transliteration • Frances Taylor → 弗朗西斯泰勒, 茀琅希思泰勒, 弗郎西丝泰勒, …
Candidate extraction • Four templates to extract candidates: (1) c1c2..cn (En); (2) c1c2..cn, En, c’1c’2..c’m; (3) c1c2..cn: En; (4) c1c2..cn 是/即 (is / namely) En • Comparing four templates • Use template 1 in the following experiments
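Template 1, a Chinese string followed by an English term in parentheses, can be matched with a regular expression. The sketch below is one possible reading of that template, not the extraction code used for the experiments: the pattern, the function name `extract_candidates`, the `max_len` cut-off and the choice to emit every suffix of the Chinese run as a candidate are all assumptions, and the sample sentence is invented.

```python
import re

# A run of CJK ideographs, then an English term in ASCII or full-width parentheses.
TEMPLATE_1 = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'-]*?)\s*[)）]"
)


def extract_candidates(text, max_len=8):
    """Return (english, chinese_candidate) pairs for every template-1 match.

    The left boundary of the translation is unknown, so each suffix of the
    Chinese run (up to `max_len` characters) is kept as a candidate.
    """
    pairs = []
    for match in TEMPLATE_1.finditer(text):
        chinese_run, english = match.group(1), match.group(2)
        for size in range(1, min(max_len, len(chinese_run)) + 1):
            pairs.append((english, chinese_run[-size:]))
    return pairs


sample = "他的同事弗朗西斯泰勒 (Frances Taylor) 也参加了会议"  # invented sentence
for english, candidate in extract_candidates(sample):
    print(english, candidate)
```

The candidate list contains the correct string 弗朗西斯泰勒 alongside shorter and longer suffixes. Choosing among them is the job of a translation or transliteration model applied afterwards.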
Translation model • Train a translation model • Candidate list
Dictionary mining results • Processed more than 300 GB of Chinese web pages • 161,117 translation pairs mined
Coverage of the dictionary on query log data • 9,065 popular English terms from the MSN Chinese search engine
CLIR experiment
Conclusions • In addition to the general approaches, Chinese IR should also consider the characteristics of the language • (also for other Asian languages: Japanese and Korean) • Difficulty in translating new (technical) words and proper names • Exploit parallel/comparable or monolingual texts • Additional problem: make the retrieved document readable • Full text translation • Running sentences in patents: relatively easy • Technical terms: may be difficult with Chinese • Gisting: translation assistance tool, useful for a user with some knowledge of the document language