Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR
Lixin Shi and Jian-Yun Nie
Dept. d'Informatique et de Recherche Opérationnelle, Université de Montréal
Outline
1. Motivation
2. Related Work
3. Using Different Indexing and Translation Units
4. Experiments
5. Conclusion and Future Work
The difference between East-Asian and most European languages
• A common problem in processing East-Asian languages (Chinese, Japanese) is the lack of natural word boundaries.
• For information retrieval, we first have to determine the indexing unit, either by:
  • using word segmentation, or
  • cutting sentences into n-grams.
Word segmentation
• Based on rules and/or statistical approaches
• Problems for information retrieval:
  • Segmentation ambiguity: if a document and a query are segmented into different words, it is impossible to match them with each other. For example, "发展中国家" (developing countries) can be segmented as:
    发展中 (developing) / 国家 (country)
    发展 (development) / 中 (middle) / 国家 (country)
    发展 (development) / 中国 (China) / 家 (family)
  • Two different words may have the same or related meanings, especially when they share some common characters:
    办公室 (office) ↔ 办公楼 (office building)
Cutting sentences into n-grams
• No linguistic resources needed
• The use of unigrams and bigrams has been investigated in several previous studies
• As effective as word segmentation
• Limitations of previous studies:
  • Only applied to monolingual IR
  • It is not clear whether bigrams and characters can also be used in CLIR
We focus on
• The use of words and n-grams as indexing units for monolingual IR within the LM framework
• Comparing words and n-grams as translation units in CLIR
  • We only tested English-Chinese CLIR
Cross-Language IR
• Translation between the query and document languages
• Basic approach: query translation, using
  • an MT system
  • a bilingual dictionary
  • a parallel corpus:
    • Query_src → retrieval on the parallel corpus → parallel documents → Query_trg → IR (Davis & Ogden '97, Yang et al. '98)
    • Train a probabilistic translation model from the parallel corpus, then use the TM for CLIR (Nie et al. '99; Gao et al. '01, '02; Jin & Chai '05)
LM approach to IR
• Query-likelihood retrieval model: rank documents by the probability that document D generates query Q (Ponte & Croft '98, Croft '03)
• KL-divergence model: relative entropy between the query language model and the document language model (Lafferty & Zhai '01, '02)
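The slide gives only the model names; for reference, these are the standard forms of the two models (with Jelinek-Mercer smoothing shown as one common choice):

```latex
% Query likelihood: probability of the document model generating the query
P(Q \mid \theta_D) = \prod_{i=1}^{m} P(q_i \mid \theta_D)

% KL-divergence: rank-equivalent to the cross entropy of the query model
\mathrm{score}(Q, D) = -KL(\theta_Q \parallel \theta_D)
  \;\overset{rank}{=}\; \sum_{w \in V} P(w \mid \theta_Q) \log P(w \mid \theta_D)

% Jelinek-Mercer smoothing of the document model against the collection C
P(w \mid \theta_D) = (1 - \lambda)\, P_{ML}(w \mid D) + \lambda\, P(w \mid C)
```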
LM approach to CLIR
• Statistical translation model: a query as a distillation or translation of a document (Berger & Lafferty '99)
• KL-divergence model for CLIR (Kraaij et al. '03), where t is a term in the document (target) language and s a term in the query (source) language
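A sketch of the translated query model in this framework, in the notation of the bullet above (following the general form in Kraaij et al. '03):

```latex
% Build a target-language query model by marginalizing over translations
P(t \mid \theta_Q) = \sum_{s} p(t \mid s)\, P(s \mid \theta_Q^{src})

% Then rank with the same KL/cross-entropy score as in monolingual IR
\mathrm{score}(Q, D) \;\overset{rank}{=}\; \sum_{t} P(t \mid \theta_Q) \log P(t \mid \theta_D)
```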
Chinese IR
• Segment the Chinese sentence into indexing units:
  • n-grams (Chien '95, Liang & Lee '96)
  • words (Kwok '97)
  • words + n-grams (Luk et al. '02; Nie et al. '96, '00)
• For Chinese CLIR, words are used as the translation units
Using different indexing units
• Single index:
  • Unigram (U)
  • Bigram (B)
  • Word (W)
  • Word + Unigram (WU)
  • Bigram + Unigram (BU)
• Example: "国企研发投资" (a code sketch follows below)
  U: 国/企/研/发/投/资
  B: 国企/企研/研发/发投/投资
  W: 国企/研发/投资
  WU: 国企/研发/投资/国/企/研/发/投/资
  BU: 国企/企研/研发/发投/投资/国/企/研/发/投/资
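A minimal Python sketch of the five schemes on the slide's example; the word segmentation (W) is supplied by hand here, where a real system would call a Chinese word segmenter:

```python
def unigrams(text):
    """U: every character becomes an index unit."""
    return list(text)

def bigrams(text):
    """B: every pair of adjacent characters."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

sentence = "国企研发投资"
words = ["国企", "研发", "投资"]   # assumed output of a word segmenter

U = unigrams(sentence)             # 国/企/研/发/投/资
B = bigrams(sentence)              # 国企/企研/研发/发投/投资
W = words                          # 国企/研发/投资
WU = W + U                         # words plus unigrams
BU = B + U                         # bigrams plus unigrams

for name, units in [("U", U), ("B", B), ("W", W), ("WU", WU), ("BU", BU)]:
    print(name + ":", "/".join(units))
```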
Combining different scores
• Multiple indexes, one per unit type:
  • Unigram
  • Bigram
  • Word
• The retrieval scores from the separate indexes are then interpolated (see below)
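One plausible form of this score combination, written as a linear interpolation (the weights are tuning parameters; the exact scheme is not shown on the slide):

```latex
\mathrm{score}(Q, D) = \lambda_U\, \mathrm{score}_U(Q, D)
                     + \lambda_B\, \mathrm{score}_B(Q, D)
                     + \lambda_W\, \mathrm{score}_W(Q, D),
\qquad \lambda_U + \lambda_B + \lambda_W = 1
```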
Using different translation units
• Translation models are trained with GIZA++ on the parallel corpus, pairing English words with the Chinese side segmented into words, unigrams, bigrams, or bigrams & unigrams
• Example (bigrams):
  Parallel corpus: "history and civilization" || "历史文明"
  Segmented: history / and / civilization || 历史 / 史文 / 文明
  GIZA++ training → TM: p(历史|history), p(史文|history), p(文明|history), …
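A minimal sketch of preparing one aligned sentence pair for this kind of training, with Chinese bigrams as translation units (the file handling and the GIZA++ invocation itself are assumptions, not taken from the paper):

```python
def to_bigrams(zh):
    """Segment a Chinese sentence into overlapping character bigrams."""
    chars = [c for c in zh if not c.isspace()]
    return [a + b for a, b in zip(chars, chars[1:])]

en, zh = "history and civilization", "历史文明"
en_tokens = en.split()        # history / and / civilization
zh_tokens = to_bigrams(zh)    # 历史 / 史文 / 文明

# GIZA++ reads one tokenized sentence per line on each side; training on
# such pairs yields lexical probabilities such as p(历史|history).
print(" ".join(en_tokens))
print(" ".join(zh_tokens))
```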
Using different translation units (cont.)
• Translate the English query with the trained TM, then retrieve Chinese documents
• Use the best translation and indexing unit
• Combine multiple index units in the same way as in monolingual IR
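A minimal sketch of query translation with a trained lexical TM, following P(t|Q) = Σ_s p(t|s) P(s|Q); the translation probabilities below are hypothetical, for illustration only:

```python
from collections import defaultdict

tm = {                                   # p(t|s): hypothetical values
    "history":      {"历史": 0.6, "史文": 0.2},
    "civilization": {"文明": 0.7, "历史": 0.1},
}

def translate_query(terms, tm):
    """Build a target-language query model from a lexical translation table."""
    p_s = 1.0 / len(terms)               # uniform P(s|Q)
    p_t = defaultdict(float)
    for s in terms:
        for t, p in tm.get(s, {}).items():
            p_t[t] += p * p_s            # marginalize over source terms
    z = sum(p_t.values())
    return {t: p / z for t, p in p_t.items()}   # renormalized query model

print(translate_query(["history", "civilization"], tm))
```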
Combining a bilingual dictionary and a parallel corpus
• Combining a translation model with a bilingual dictionary can greatly improve CLIR effectiveness (Nie & Simard '01)
• Our experiments show that treating a bilingual dictionary as a parallel text and using the TM trained from it works much better than directly using the dictionary entries.
• Once the dictionary is transformed into a TM, it can be combined with a parallel corpus by combining their TMs (see below).
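A sketch of the TM combination as a linear interpolation; the 0.7/0.3 weights are the ones reported with the experiments below:

```latex
p(t \mid s) = 0.7\, p_{corpus}(t \mid s) + 0.3\, p_{dict}(t \mid s)
```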
Experiment Setting
Chinese linguistic resources
• The English-Chinese parallel corpus was mined from the Web by our automatic mining tool:
  • From 6 websites: United Nations, Hong Kong, Taiwan, and Mainland China
  • About 4,000 pairs of pages
  • After sentence alignment, 281,000 parallel sentence pairs
• The English-Chinese bilingual dictionaries are from LDC; the final merged dictionary contains 42,000 entries.
Using different indexing units for C/J/K monolingual IR on NTCIR-4/5
• U outperforms B and W for Chinese
• Interpolating unigrams and bigrams (B+U) gives the best performance for Chinese and Japanese
• However, BU and B are best for Korean
Results of CJK monolingual IR on NTCIR-6
• Our submissions: Chinese & Japanese: U+B; Korean: K-K-T: BU, K-K-D: U
• Our submitted results are lower than the average MAPs of NTCIR-6
• After applying a simple pseudo-relevance feedback, the results become more comparable to the average MAPs
Interpolating the parallel corpus and the bilingual dictionary
Int(C, D) = 0.7 * Corpus + 0.3 * Dict
English-to-Chinese CLIR results on NTCIR-3/4/5
• Two-step interpolation of the retrieval scores: Dict(B+U) + Corpus(B+U)
• Using bigrams and unigrams as translation units is a reasonable alternative to words
• Combinations of bigrams and unigrams usually produce higher effectiveness
Conclusion
• We focused on the comparison between words, n-grams, and their combinations, both for indexing and for query translation.
• Our experimental results showed that n-grams are generally as effective as words for monolingual IR and CLIR in Chinese. For Japanese and Korean, the n-gram approaches are close to the average results of NTCIR-6.
• We tested the approach of creating different types of indexes separately, then combining them during the retrieval process. We found this approach slightly more effective for Chinese and Japanese.
• Overall, n-grams can be an interesting alternative or complement to words as indexing and translation units.
Future work
• We noticed that indexing units are sensitive to the query: one indexing unit is better for some queries, while another is better for others.
  • How to determine the correct combination?
• Combine words and n-grams using semantic information