Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR
Lixin Shi and Jian-Yun Nie
Dept. d'Informatique et de Recherche Opérationnelle, Université de Montréal
Outline
1. Motivation
2. Related Work
3. Using Different Indexing and Translation Units
4. Experiments
5. Conclusion and Future Work
The difference between East-Asian and most European languages
• A common problem in processing East-Asian languages (Chinese, Japanese) is the lack of natural word boundaries.
• For information retrieval, we first have to determine the indexing unit, either by:
  • using word segmentation, or
  • cutting sentences into n-grams.
Word segmentation
• Based on rules and/or statistical approaches
• Problems for information retrieval:
  • Segmentation ambiguity: if a document and a query are segmented into different words, it is impossible to match them with each other. For example, "发展中国家" (developing countries) can be segmented as:
    发展中 (developing) / 国家 (country)
    发展 (development) / 中 (middle) / 国家 (country)
    发展 (development) / 中国 (China) / 家 (family)
  • Two different words may have the same or related meanings, especially when they share some common characters:
    办公室 (office) ↔ 办公楼 (office building)
Cutting sentences into n-grams
• No linguistic resources needed
• The use of unigrams and bigrams has been investigated in several previous studies
• As effective as word segmentation
• Limitations of previous studies:
  • Only applied to monolingual IR
  • It is not clear whether bigrams and characters can also be used in CLIR
We focus on
• The use of words and n-grams as indexing units for monolingual IR within the LM framework
• Comparing words and n-grams as translation units in CLIR
  • We only tested English-Chinese CLIR
Cross-Language IR
• Translation between the query and document languages
• Basic approach: query translation, using
  • an MT system
  • a bilingual dictionary
  • a parallel corpus:
    • Query_src → retrieval on the parallel corpus → parallel documents → Query_trg → IR (Davis & Ogden '97, Yang et al. '98)
    • Train a probabilistic translation model from the parallel corpus, then use the TM for CLIR (Nie et al. '99; Gao et al. '01, '02; Jin & Chai '05)
LM approach to IR
• Query-likelihood retrieval model: rank documents by the probability that document D generates query Q (Ponte & Croft '98, Croft '03)
• KL-divergence model: relative entropy between the query language model and the document language model (Lafferty & Zhai '01, '02)
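The slide gives only the model names; for reference, these are the standard forms of the two models (with Jelinek-Mercer smoothing shown as one common choice):

```latex
% Query likelihood: probability of the document model generating the query
P(Q \mid \theta_D) = \prod_{i=1}^{m} P(q_i \mid \theta_D)

% KL-divergence: rank-equivalent to the cross entropy of the query model
\mathrm{score}(Q, D) = -KL(\theta_Q \parallel \theta_D)
  \;\overset{rank}{=}\; \sum_{w \in V} P(w \mid \theta_Q) \log P(w \mid \theta_D)

% Jelinek-Mercer smoothing of the document model against the collection C
P(w \mid \theta_D) = (1 - \lambda)\, P_{ML}(w \mid D) + \lambda\, P(w \mid C)
```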
LM approach to CLIR
• Statistical translation model: a query as a distillation or translation of a document (Berger & Lafferty '99)
• KL-divergence model for CLIR (Kraaij et al. '03), where t is a term in the document (target) language and s a term in the query (source) language
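A sketch of the translated query model in this framework, in the notation of the bullet above (following the general form in Kraaij et al. '03):

```latex
% Build a target-language query model by marginalizing over translations
P(t \mid \theta_Q) = \sum_{s} p(t \mid s)\, P(s \mid \theta_Q^{src})

% Then rank with the same KL/cross-entropy score as in monolingual IR
\mathrm{score}(Q, D) \;\overset{rank}{=}\; \sum_{t} P(t \mid \theta_Q) \log P(t \mid \theta_D)
```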
Chinese IR
• Segment the Chinese sentence into indexing units:
  • n-grams (Chien '95, Liang & Lee '96)
  • words (Kwok '97)
  • words + n-grams (Luk et al. '02; Nie et al. '96, '00)
• For Chinese CLIR, words are used as the translation units
Using different indexing units
• Single index:
  • Unigram (U)
  • Bigram (B)
  • Word (W)
  • Word + Unigram (WU)
  • Bigram + Unigram (BU)
• Example: "国企研发投资" (a code sketch follows below)
  U: 国/企/研/发/投/资
  B: 国企/企研/研发/发投/投资
  W: 国企/研发/投资
  WU: 国企/研发/投资/国/企/研/发/投/资
  BU: 国企/企研/研发/发投/投资/国/企/研/发/投/资
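A minimal Python sketch of the five schemes on the slide's example; the word segmentation (W) is supplied by hand here, where a real system would call a Chinese word segmenter:

```python
def unigrams(text):
    """U: every character becomes an index unit."""
    return list(text)

def bigrams(text):
    """B: every pair of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

sentence = "国企研发投资"
words = ["国企", "研发", "投资"]   # assumed output of a word segmenter

U = unigrams(sentence)             # 国/企/研/发/投/资
B = bigrams(sentence)              # 国企/企研/研发/发投/投资
W = words                          # 国企/研发/投资
WU = W + U                         # words plus unigrams
BU = B + U                         # bigrams plus unigrams

for name, units in [("U", U), ("B", B), ("W", W), ("WU", WU), ("BU", BU)]:
    print(name + ":", "/".join(units))
```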
Combining different scores
• Multiple indexes, one per unit type:
  • Unigram
  • Bigram
  • Word
• The retrieval scores from the separate indexes are then interpolated (see below)
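One plausible form of this score combination, written as a linear interpolation (the weights are tuning parameters; the exact scheme is not shown on the slide):

```latex
\mathrm{score}(Q, D) = \lambda_U\, \mathrm{score}_U(Q, D)
                     + \lambda_B\, \mathrm{score}_B(Q, D)
                     + \lambda_W\, \mathrm{score}_W(Q, D),
\qquad \lambda_U + \lambda_B + \lambda_W = 1
```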
Using different translation units
• Translation models are trained with GIZA++ on the parallel corpus, pairing English words with the Chinese side segmented into words, unigrams, bigrams, or bigrams & unigrams
• Example (bigrams):
  Parallel corpus: "history and civilization" || "历史文明"
  Segmented: history / and / civilization || 历史 / 史文 / 文明
  GIZA++ training → TM: p(历史|history), p(史文|history), p(文明|history), …
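A minimal sketch of preparing one aligned sentence pair for this kind of training, with Chinese bigrams as translation units (the file handling and the GIZA++ invocation itself are assumptions, not taken from the paper):

```python
def to_bigrams(zh):
    """Segment a Chinese sentence into overlapping character bigrams."""
    chars = [c for c in zh if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]

en, zh = "history and civilization", "历史文明"
en_tokens = en.split()        # history / and / civilization
zh_tokens = to_bigrams(zh)    # 历史 / 史文 / 文明

# GIZA++ reads one tokenized sentence per line on each side; training on
# such pairs yields lexical probabilities such as p(历史|history).
print(" ".join(en_tokens))
print(" ".join(zh_tokens))
```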
Using different translation units (cont.)
• Translate the English query with the trained TM, then retrieve Chinese documents
• Use the best translation and indexing unit
• Combine multiple index units in the same way as in monolingual IR
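A minimal sketch of query translation with a trained lexical TM, following P(t|Q) = Σ_s p(t|s) P(s|Q); the translation probabilities below are hypothetical, for illustration only:

```python
from collections import defaultdict

tm = {                                   # p(t|s): hypothetical values
    "history":      {"历史": 0.6, "史文": 0.2},
    "civilization": {"文明": 0.7, "历史": 0.1},
}

def translate_query(terms, tm):
    """Build a target-language query model from a lexical translation table."""
    p_s = 1.0 / len(terms)               # uniform P(s|Q)
    p_t = defaultdict(float)
    for s in terms:
        for t, p in tm.get(s, {}).items():
            p_t[t] += p * p_s            # marginalize over source terms
    z = sum(p_t.values())
    return {t: p / z for t, p in p_t.items()}   # renormalized query model

print(translate_query(["history", "civilization"], tm))
```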
Combining a bilingual dictionary and a parallel corpus
• Combining a translation model with a bilingual dictionary can greatly improve CLIR effectiveness (Nie & Simard '01)
• Our experiments show that treating a bilingual dictionary as a parallel text and using the TM trained from it works much better than directly using the dictionary entries.
• Once the dictionary is transformed into a TM, it can be combined with a parallel corpus by combining their TMs (see below).
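A sketch of the TM combination as a linear interpolation; the 0.7/0.3 weights are the ones reported with the experiments below:

```latex
p(t \mid s) = 0.7\, p_{corpus}(t \mid s) + 0.3\, p_{dict}(t \mid s)
```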
Experiment Setting
Chinese linguistic resources
• The English-Chinese parallel corpus was mined from the Web by our automatic mining tool:
  • From 6 websites: United Nations, Hong Kong, Taiwan, and Mainland China
  • About 4,000 pairs of pages
  • After sentence alignment, 281,000 parallel sentence pairs
• The English-Chinese bilingual dictionaries are from LDC; the final merged dictionary contains 42,000 entries.
Using different indexing units for C/J/K monolingual IR on NTCIR-4/5
• U outperforms B and W for Chinese
• Interpolating unigrams and bigrams (B+U) gives the best performance for Chinese and Japanese
• However, BU and B are best for Korean
Results of CJK monolingual IR on NTCIR-6
• Our submissions: Chinese & Japanese: U+B; Korean: K-K-T: BU, K-K-D: U
• Our submitted results are lower than the average MAPs of NTCIR-6
• After applying a simple pseudo-relevance feedback, the results become more comparable to the average MAPs
Interpolating the parallel corpus and the bilingual dictionary
Int(C, D) = 0.7 * Corpus + 0.3 * Dict
English-to-Chinese CLIR results on NTCIR-3/4/5
• Two-step interpolation of the retrieval scores: Dict(B+U) + Corpus(B+U)
• Using bigrams and unigrams as translation units is a reasonable alternative to words
• Combinations of bigrams and unigrams usually produce higher effectiveness
Conclusion
• We focused on the comparison between words, n-grams, and their combinations, both for indexing and for query translation.
• Our experimental results showed that n-grams are generally as effective as words for monolingual IR and CLIR in Chinese. For Japanese and Korean, the n-gram approaches are close to the average results of NTCIR-6.
• We tested the approach of creating different types of indexes separately, then combining them during the retrieval process. We found this approach slightly more effective for Chinese and Japanese.
• Overall, n-grams can be an interesting alternative or complement to words as indexing and translation units.
Future work
• We noticed that indexing units are sensitive to the query: one indexing unit is better for some queries, while another is better for others.
  • How to determine the correct combination?
• Combine words and n-grams using semantic information