Explore the use of word pairs in language modeling for information retrieval, comparing unigram and bigram models, discussing lexical affinities, and proposing a model that combines both single words and pairs for improved performance. Results show the potential of this approach in enhancing retrieval accuracy.
Word Pairs in Language Modeling for Information Retrieval Carmen Alvarez, Philippe Langlais, Jian-Yun Nie Université de Montréal
Presentation Outline
• Introduction
• Previous language modeling approaches in IR
• Unigram model
• Bigram model
• Lexical affinities in language models for IR
• Conclusion
Introduction: General approach to language modeling in IR
Unigram model
The relevance score of a document d for a query q = w1, …, wN is the probability of generating the query from the document's language model:
p(solar powered cars) = p(solar) × p(powered) × p(cars)
How to determine the probability of a word given a document?
• MLE approach: relative frequency, p(w | d) = tf(w, d) / |d|
• Problem: score = 0 if the document does not contain one of the query terms
• Smoothing: combine the document language model with a corpus language model, e.g. p(w | d) = λ pML(w | d) + (1 − λ) p(w | C) (sketch below)
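As a concrete illustration, here is a minimal Python sketch of this scoring scheme, assuming Jelinek-Mercer (linear) interpolation for the smoothing step; the function name and the weight `lam` are illustrative, not from the paper.

```python
from collections import Counter

def unigram_score(query_terms, doc_terms, corpus_counts, corpus_size, lam=0.5):
    """Smoothed unigram relevance score. `lam` is an illustrative weight."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 1.0
    for w in query_terms:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0  # MLE: tf(w, d) / |d|
        p_corpus = corpus_counts.get(w, 0) / corpus_size     # corpus language model
        score *= lam * p_doc + (1 - lam) * p_corpus          # interpolation avoids zero scores
    return score

# Usage: score one toy document for the query "solar powered cars"
corpus = Counter("solar powered cars are powered by solar panels".split())
print(unigram_score("solar powered cars".split(),
                    "solar powered cars".split(),
                    corpus, sum(corpus.values())))
```

Without the corpus term, any document missing a single query word would score exactly zero; the interpolation is what makes partial matches rankable.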
Word dependencies
• Independence hypothesis: an oversimplification. Ex: ‘Information Retrieval’, ‘Middle East’, ‘Intellectual Property’
• One possible solution: a conditional bigram model, p(wi | wi−1)
• Smooth the MLE bigram probabilities with a unigram model (see the sketch below)
• Our implementation: the bigram model provides a slight improvement over the unigram model
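A hedged sketch of this conditional bigram probability, again assuming linear interpolation for the smoothing; `mu` and `lam` are illustrative weights, not values from the paper.

```python
def bigram_prob(w_prev, w, doc_bigrams, doc_counts, doc_len,
                corpus_counts, corpus_size, mu=0.7, lam=0.5):
    """p(w | w_prev, d): MLE bigram smoothed with the unigram model.

    doc_bigrams holds counts of adjacent (w_prev, w) pairs in the document;
    doc_counts holds single-word counts.
    """
    # MLE bigram probability within the document
    prev_count = doc_counts.get(w_prev, 0)
    p_bigram = doc_bigrams.get((w_prev, w), 0) / prev_count if prev_count else 0.0
    # Fallback: the smoothed unigram model from the previous slide
    p_unigram = (lam * (doc_counts.get(w, 0) / doc_len if doc_len else 0.0)
                 + (1 - lam) * corpus_counts.get(w, 0) / corpus_size)
    return mu * p_bigram + (1 - mu) * p_unigram
```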
Problems with the bigram model
• Useful for other LM applications (ex: speech recognition) but not necessarily for IR
• Problem 1, word order: ‘apartment rentals’ vs. ‘renting apartments’
• Problem 2, adjacency: ‘solar powered cars’ vs. ‘solar cars’
• Previous approaches addressing these problems:
  • Srikanth and Srihari (2002): biterm model, slight improvements over the conditional bigram approach
  • Pickens (2000): non-adjacent bigrams; order is still important for the domain (musical documents)
Lexical affinities
• Word pairs within a window of n words
• Martin et al. (1983): 98% of lexical relations hold between words separated by a distance of at most 5 words
• Maarek et al. (1991): search and clustering project for a software documentation library
• Resolving power ρ of a pair for a document d, comparable to tf × idf (sketch below):
  • frequency of the pair in the document d (tf)
  • global frequency of the individual words (the information of the pair, comparable to idf)
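A sketch of pair extraction and resolving power. The slide's formula did not survive extraction, so the scoring below is one plausible tf × idf-style instantiation (pair frequency in the document times the negative log of the product of the words' global probabilities); the exact definition is in the paper.

```python
import math
from collections import Counter

def affinity_pairs(tokens, window=5):
    """Unordered word pairs co-occurring within `window` words (Martin et al.:
    a distance of at most 5 covers 98% of lexical relations)."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                pairs[tuple(sorted((w, v)))] += 1  # order-independent pair
    return pairs

def resolving_power(pair, pair_tf, corpus_counts, corpus_size):
    """Assumed tf x "idf" form: pair frequency in the document times the
    information of the pair, estimated from its words' global frequencies."""
    w1, w2 = pair
    p1 = corpus_counts.get(w1, 1) / corpus_size  # floor of 1 avoids log(0)
    p2 = corpus_counts.get(w2, 1) / corpus_size
    return pair_tf * -math.log(p1 * p2)
```

Sorting the pair and counting within a window is what sidesteps the bigram model's word-order and adjacency problems noted above.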
Lexical affinities: examples
Observations:
• Resolving power does not correspond directly to frequency
• Certain words appear in the majority of the pairs (court, union)
Filtering pairs by resolving power
Filtering the pairs by window size
• Certain insignificant word pairs are removed: “intense senate”
• Certain significant word pairs are also removed: “house senate”
• Some insignificant word pairs are still present: “senate thursday”
Model MLA
• Represent word pairs as single tokens. Ex: “prime minister” → “minister,prime”
• Modify the observed word pair counts as a function of the resolving power and a parameter β
• Construct a “unigram” model that contains both the single words and the pairs (represented as single tokens)
Example: a document “solar powered car”, with β = 0.5 (see the sketch below).
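A sketch of the MLA count table, reusing `affinity_pairs` and `resolving_power` from the earlier sketch. The slide's worked example (the modified counts for “solar powered car” at β = 0.5) did not survive extraction, and the count-modification function is paraphrased here as β · ρ, an assumption rather than the paper's exact form.

```python
from collections import Counter

def mla_counts(tokens, corpus_counts, corpus_size, beta=0.5, window=5):
    """Mixed "unigram" count table: single words plus affinity pairs encoded
    as single tokens ("minister,prime"). Pair counts scaled by beta * rho
    (an assumed paraphrase of the paper's count modification)."""
    counts = Counter(tokens)                        # single-word counts
    for (w1, w2), tf in affinity_pairs(tokens, window).items():
        rho = resolving_power((w1, w2), tf, corpus_counts, corpus_size)
        counts[f"{w1},{w2}"] = beta * rho           # pair stored as one token
    return counts

# Usage on the slide's toy document
corpus = Counter("solar powered car".split())
print(mla_counts("solar powered car".split(), corpus, sum(corpus.values())))
```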
Model MLA: smoothing and relevance score
• Smoothing (comparable to the unigram approach): smooth with a corpus model that also contains both single words and pairs
• Relevance score similar to the classical unigram model, now computed over both kinds of tokens (formula below)
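In the notation used above, the relevance score keeps the unigram product form, ranging over both the query's single words and its pair tokens. A hedged reconstruction of the slide's missing formula:

```latex
\mathrm{score}(d, q) \;=\; \prod_{t \,\in\, \mathrm{words}(q) \,\cup\, \mathrm{pairs}(q)}
\bigl( \lambda\, p_{\mathrm{ML}}(t \mid d) + (1 - \lambda)\, p(t \mid C) \bigr)
```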
Results with model MLA
• Model MLA outperforms the unigram and bigram models for certain configurations
• The best results are obtained when the contribution of pairs to the relevance score is low (β between 0.0001 and 0.04)
Conclusion
• Lexical affinities improve the performance of unigram and bigram models, but their contribution to the score must remain low relative to the contribution of single words
• Future work:
  • Part-of-speech tagger to extract the word pairs
  • Parser
  • Choose between pairs and single terms (reduce redundancy)
  • Learn the parameters per document (or per word) with an EM algorithm, rather than globally/empirically