
Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR



  1. Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR. Lixin Shi and Jian-Yun Nie, Dept. d'Informatique et de Recherche Opérationnelle, Université de Montréal

  2. Outline: 1. Motivation 2. Related Work 3. Using Different Indexing and Translation Units 4. Experiments 5. Conclusion and Future Work

  3. The difference between East-Asian and most European languages • A common problem in processing East-Asian languages (Chinese, Japanese) is the lack of natural word boundaries. • For information retrieval, we first have to determine the indexing unit: • using word segmentation • cutting the sentence into n-grams

  4. Word segmentation • Based on rules and/or statistical approaches • Problems for information retrieval: • Segmentation ambiguity: if a document and a query are segmented into different words, it is impossible to match them with each other. For example, "发展中国家" (developing country) can be segmented as:
发展中 (developing) / 国家 (country)
发展 (development) / 中 (middle) / 国家 (country)
发展 (development) / 中国 (China) / 家 (family)
• Two different words may have the same or related meanings, especially when they share common characters: 办公室 (office) ↔ 办公楼 (office building)

  5. Cutting the sentence into n-grams • No need for any linguistic resources • The utilization of unigrams and bigrams has been investigated in several previous studies • As effective as word segmentation • The limitations of previous studies: • Only used in monolingual IR • It is not clear whether one can use bigrams and characters in CLIR

  6. We focus on • The utilization of words and n-grams as index units for monolingual IR within the LM framework • Comparing the utilization of words and n-grams as translation units in CLIR • We only tested English-Chinese CLIR

  7. 2. Related work

  8. Cross-Language IR • Translation between query and document languages • Basic approach: query translation • MT system • Bilingual dictionary • Parallel corpus • Query (source) → retrieval on parallel corpus → matched documents → query (target) → IR (Davis & Ogden '97, Yang et al. '98) • Train a probabilistic translation model from a parallel corpus, then use the TM for CLIR (Nie et al. '99, Gao et al. '01, '02, Jin & Chai '05)

  9. LM approach to IR • Query-likelihood retrieval model: rank documents by the probability that document D generates query Q (Ponte & Croft '98, Croft '03) • KL-divergence model: relative entropy between the query language model and the document language model (Lafferty & Zhai '01, '02)
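The slide's equations did not survive transcription; in the notation of the cited papers, the two ranking functions take the standard forms:

P(Q | D) = \prod_{w \in Q} P(w | \theta_D)

score(Q, D) = -KL(\theta_Q \| \theta_D) \propto \sum_{w} P(w | \theta_Q) \log P(w | \theta_D)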

  10. LM approach to CLIR • Statistical translation model: a query is viewed as a distillation or translation of a document (Berger & Lafferty '99) • KL-divergence model for CLIR (Kraaij et al. '03), where t is a term in the document (target) language and s a term in the query (source) language
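The equation here is also lost; in the Kraaij et al. '03 formulation, the query model is mapped into the target language through the translation probabilities p(t | s), roughly:

P(t | \theta_Q) = \sum_{s} p(t | s) \, P(s | \theta_Q^{src}), \qquad score(Q, D) \propto \sum_{t} P(t | \theta_Q) \log P(t | \theta_D)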

  11. Chinese IR • Segment Chinese sentences into index units • N-grams (Chien '95, Liang & Lee '96) • Words (Kwok '97) • Words + n-grams (Luk et al. '02, Nie et al. '96, '00) • For Chinese CLIR, words are used as the translation units

  12. 3. Using different indexing and translation units

  13. Using different indexing units • Single index: • Unigram (U) • Bigram (B) • Word (W) • Word+Unigram (WU) • Bigram+Unigram (BU) • Example for "国企研发投资" (state-enterprise R&D investment), with a sketch of the cutting below:
U: 国/企/研/发/投/资
B: 国企/企研/研发/发投/投资
W: 国企/研发/投资
WU: 国企/研发/投资/国/企/研/发/投/资
BU: 国企/企研/研发/发投/投资/国/企/研/发/投/资
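A minimal Python sketch of the n-gram cutting above (the function names are ours, for illustration):

    def unigrams(s: str) -> list[str]:
        # U: every character is an index unit
        return list(s)

    def bigrams(s: str) -> list[str]:
        # B: overlapping two-character windows
        return [s[i:i + 2] for i in range(len(s) - 1)]

    sentence = "国企研发投资"
    print(unigrams(sentence))                      # 国/企/研/发/投/资
    print(bigrams(sentence))                       # 国企/企研/研发/发投/投资
    print(bigrams(sentence) + unigrams(sentence))  # BU index: bigrams plus unigrams

Word (W) units, by contrast, require a segmenter rather than a fixed-length cut.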

  14. Combining different scores • Multiple indexes built separately: • Unigram • Bigram • Word • The retrieval scores from the separate indexes are then interpolated (see the mixture formula below)
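The mixture formula on the original slide is missing; a standard linear interpolation, with the mixing weights \lambda_i as an assumption, would be:

score(Q, D) = \sum_{i \in \{U, B, W\}} \lambda_i \, score_i(Q, D), \qquad \sum_i \lambda_i = 1

where score_i(Q, D) is the KL-divergence score computed on index i.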

  15. Using different translation units • Candidate Chinese translation units: words, unigrams, bigrams, bigrams & unigrams (English side: words) • Example with bigrams, starting from the parallel corpus: … "history and civilization" || "历史文明" … • Segmented into training bigrams: … history / and / civilization || 历史/史文/文明 … • GIZA++ training yields the TM: p(历史|history), p(史文|history), p(文明|history), … (a data-preparation sketch follows)
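A minimal sketch of preparing such training data for GIZA++ (the file names, the "||" pair format, and the helper are our assumptions):

    def zh_bigrams(s: str) -> list[str]:
        return [s[i:i + 2] for i in range(len(s) - 1)]

    with open("pairs.txt", encoding="utf-8") as pairs, \
         open("corpus.en", "w", encoding="utf-8") as en_out, \
         open("corpus.zh", "w", encoding="utf-8") as zh_out:
        for line in pairs:
            if "||" not in line:
                continue                    # skip malformed pairs
            en, zh = line.split("||", 1)
            en_out.write(en.strip().lower() + "\n")                # English: word tokens
            zh_out.write(" ".join(zh_bigrams(zh.strip())) + "\n")  # Chinese: bigram tokens

GIZA++ is then trained on corpus.en / corpus.zh to estimate p(bigram | English word).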

  16. Using different translation units (English queries against Chinese documents) • Use the best translation and index unit • Combine multiple index units in the same way as in monolingual IR

  17. Combining bilingual dictionary and parallel corpus • Combining a translation model with a bilingual dictionary can greatly improve CLIR effectiveness (Nie & Simard '01) • Our experiments show that treating a bilingual dictionary as a parallel text and using the TM trained from it is much better than directly using the dictionary entries • Once a dictionary is transformed into a TM, it can be combined with a parallel corpus by combining their TMs (sketched below)
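A minimal sketch of combining the two TMs by linear interpolation (the dict-of-dicts representation and function name are our assumptions; alpha = 0.7 matches the Int(C, D) = 0.7*Corpus + 0.3*Dict setting reported in the experiments):

    def interpolate_tm(tm_corpus, tm_dict, alpha=0.7):
        # p(t|s) = alpha * p_corpus(t|s) + (1 - alpha) * p_dict(t|s)
        combined = {}
        for s in set(tm_corpus) | set(tm_dict):
            pc = tm_corpus.get(s, {})
            pd = tm_dict.get(s, {})
            combined[s] = {t: alpha * pc.get(t, 0.0) + (1 - alpha) * pd.get(t, 0.0)
                           for t in set(pc) | set(pd)}
        return combined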

  18. 4. Experiments

  19. Experiment Setting

  20. Chinese Linguistic Resources • The English-Chinese parallel corpus was mined from the Web by our automatic mining tool • From 6 websites: United Nations, Hong Kong, Taiwan, and Mainland China • About 4,000 pairs of pages • After sentence alignment, we have 281,000 parallel sentence pairs • The English-Chinese bilingual dictionaries are from LDC; the final merged dictionary contains 42,000 entries

  21. Using different indexing units for C/J/K monolingual IR on NTCIR4/5 • U outperforms B and W in Chinese • Interpolating unigrams and bigrams (B+U) gives the best performance for Chinese and Japanese • However, BU and B are best for Korean

  22. Results of CJK monolingual IR on NTCIR6 • Our submission: Chinese & Japanese: U+B; Korean: K-K-T: BU, K-K-D: U • Our submitted results are lower than the average MAPs of NTCIR6 • After applying a simple pseudo-relevance feedback, the results become more comparable to the average MAPs

  23. Interpolating parallel corpus and bilingual dictionary: Int(C, D) = 0.7 * Corpus + 0.3 * Dict

  24. English-to-Chinese CLIR results on NTCIR 3/4/5 • Two-step interpolation of the retrieval scores: Dict(B+U) + Corpus(B+U) • Using bigrams and unigrams as translation units is a reasonable alternative to words • Combinations of bigrams and unigrams usually produce higher effectiveness
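Reading "two-step" as first mixing bigram and unigram scores within each resource and then mixing the two resources (the weights \alpha and \lambda are assumptions; the slide's exact values are not preserved):

score(Q, D) = \alpha \, score_{Dict}^{B+U}(Q, D) + (1 - \alpha) \, score_{Corpus}^{B+U}(Q, D), \qquad score_X^{B+U} = \lambda \, score_X^{B} + (1 - \lambda) \, score_X^{U}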

  25. 5. Conclusion and future work

  26. Conclusion • We focused on the comparison between words, n-grams, and their combinations, both for indexing and for query translation • Our experimental results showed that n-grams are generally as effective as words for monolingual IR and CLIR in Chinese; for Japanese and Korean, the n-gram approaches are close to the average results of NTCIR6 • We tested the approach that creates different types of index separately, then combines them during the retrieval process, and found it slightly more effective for Chinese and Japanese • Overall, n-grams can be an interesting alternative or complement to words as indexing and translation units

  27. Future work • We noticed that index units are sensitive to the query: one indexing unit is better for some queries, while another is better for others • How to determine the correct combination? • Combine words and n-grams using semantic information

  28. Thanks
