Explore the use of word pairs in language modeling for information retrieval, comparing unigram and bigram models, discussing lexical affinities, and proposing a model that combines both single words and pairs for improved performance. Results show the potential of this approach in enhancing retrieval accuracy.
Word Pairs in Language Modeling for Information Retrieval Carmen Alvarez, Philippe Langlais, Jian-Yun Nie Université de Montréal
Presentation Outline
• Introduction
• Previous language modeling approaches in IR
• Unigram model
• Bigram model
• Lexical affinities in language models for IR
• Conclusion
Introduction: General approach to language modeling in IR
Unigram model
The relevance score of a document d for a query q = w1, …, wN is the probability of generating the query from the document's language model:
p(solar powered cars) = p(solar) × p(powered) × p(cars)
How to determine the probability of a word given a document?
• MLE approach: relative frequency, p(w | d) = tf(w, d) / |d|
• Problem: score = 0 if the document does not contain one of the query terms
• Smoothing: combine the document language model with a corpus language model, e.g. p(w | d) = λ pML(w | d) + (1 − λ) p(w | C) (sketch below)
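As a concrete illustration, here is a minimal Python sketch of this scoring scheme, assuming Jelinek-Mercer (linear) interpolation for the smoothing step; the function name and the weight `lam` are illustrative, not from the paper.

```python
from collections import Counter

def unigram_score(query_terms, doc_terms, corpus_counts, corpus_size, lam=0.5):
    """Smoothed unigram relevance score. `lam` is an illustrative weight."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 1.0
    for w in query_terms:
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0  # MLE: tf(w, d) / |d|
        p_corpus = corpus_counts.get(w, 0) / corpus_size     # corpus language model
        score *= lam * p_doc + (1 - lam) * p_corpus          # interpolation avoids zero scores
    return score

# Usage: score one toy document for the query "solar powered cars"
corpus = Counter("solar powered cars are powered by solar panels".split())
print(unigram_score("solar powered cars".split(),
                    "solar powered cars".split(),
                    corpus, sum(corpus.values())))
```

Without the corpus term, any document missing a single query word would score exactly zero; the interpolation is what makes partial matches rankable.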
Word dependencies
• Independence hypothesis: an oversimplification. Ex: ‘Information Retrieval’, ‘Middle East’, ‘Intellectual Property’
• One possible solution: a conditional bigram model, p(wi | wi−1)
• Smooth the MLE bigram probabilities with a unigram model (see the sketch below)
• Our implementation: the bigram model provides a slight improvement over the unigram model
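A hedged sketch of this conditional bigram probability, again assuming linear interpolation for the smoothing; `mu` and `lam` are illustrative weights, not values from the paper.

```python
def bigram_prob(w_prev, w, doc_bigrams, doc_counts, doc_len,
                corpus_counts, corpus_size, mu=0.7, lam=0.5):
    """p(w | w_prev, d): MLE bigram smoothed with the unigram model.

    doc_bigrams holds counts of adjacent (w_prev, w) pairs in the document;
    doc_counts holds single-word counts.
    """
    # MLE bigram probability within the document
    prev_count = doc_counts.get(w_prev, 0)
    p_bigram = doc_bigrams.get((w_prev, w), 0) / prev_count if prev_count else 0.0
    # Fallback: the smoothed unigram model from the previous slide
    p_unigram = (lam * (doc_counts.get(w, 0) / doc_len if doc_len else 0.0)
                 + (1 - lam) * corpus_counts.get(w, 0) / corpus_size)
    return mu * p_bigram + (1 - mu) * p_unigram
```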
Problems with the bigram model
• Useful for other LM applications (ex: speech recognition) but not necessarily for IR
• Problem 1, word order: ‘apartment rentals’ vs. ‘renting apartments’
• Problem 2, adjacency: ‘solar powered cars’ vs. ‘solar cars’
• Previous approaches addressing these problems:
  • Srikanth and Srihari (2002): biterm model, slight improvements over the conditional bigram approach
  • Pickens (2000): non-adjacent bigrams; order is still important for the domain (musical documents)
Lexical affinities
• Word pairs within a window of n words
• Martin et al. (1983): 98% of lexical relations hold between words separated by a distance of at most 5 words
• Maarek et al. (1991): search and clustering project for a software documentation library
• Resolving power ρ of a pair for a document d, comparable to tf × idf (sketch below):
  • frequency of the pair in the document d (tf)
  • global frequency of the individual words (the information of the pair, comparable to idf)
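A sketch of pair extraction and resolving power. The slide's formula did not survive extraction, so the scoring below is one plausible tf × idf-style instantiation (pair frequency in the document times the negative log of the product of the words' global probabilities); the exact definition is in the paper.

```python
import math
from collections import Counter

def affinity_pairs(tokens, window=5):
    """Unordered word pairs co-occurring within `window` words (Martin et al.:
    a distance of at most 5 covers 98% of lexical relations)."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                pairs[tuple(sorted((w, v)))] += 1  # order-independent pair
    return pairs

def resolving_power(pair, pair_tf, corpus_counts, corpus_size):
    """Assumed tf x "idf" form: pair frequency in the document times the
    information of the pair, estimated from its words' global frequencies."""
    w1, w2 = pair
    p1 = corpus_counts.get(w1, 1) / corpus_size  # floor of 1 avoids log(0)
    p2 = corpus_counts.get(w2, 1) / corpus_size
    return pair_tf * -math.log(p1 * p2)
```

Sorting the pair and counting within a window is what sidesteps the bigram model's word-order and adjacency problems noted above.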
Lexical affinities: examples
Observations:
• Resolving power does not correspond directly to frequency
• Certain words appear in the majority of the pairs (court, union)
Filtering pairs by resolving power
Filtering the pairs by window size
• Certain insignificant word pairs are removed: “intense senate”
• Certain significant word pairs are also removed: “house senate”
• Some insignificant word pairs are still present: “senate thursday”
Model MLA
• Represent word pairs as single tokens. Ex: “prime minister” → “minister,prime”
• Modify the observed word pair counts as a function of the resolving power and a parameter β
• Construct a “unigram” model that contains both the single words and the pairs (represented as single tokens)
Example: a document “solar powered car”, with β = 0.5 (see the sketch below).
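A sketch of the MLA count table, reusing `affinity_pairs` and `resolving_power` from the earlier sketch. The slide's worked example (the modified counts for “solar powered car” at β = 0.5) did not survive extraction, and the count-modification function is paraphrased here as β · ρ, an assumption rather than the paper's exact form.

```python
from collections import Counter

def mla_counts(tokens, corpus_counts, corpus_size, beta=0.5, window=5):
    """Mixed "unigram" count table: single words plus affinity pairs encoded
    as single tokens ("minister,prime"). Pair counts scaled by beta * rho
    (an assumed paraphrase of the paper's count modification)."""
    counts = Counter(tokens)                        # single-word counts
    for (w1, w2), tf in affinity_pairs(tokens, window).items():
        rho = resolving_power((w1, w2), tf, corpus_counts, corpus_size)
        counts[f"{w1},{w2}"] = beta * rho           # pair stored as one token
    return counts

# Usage on the slide's toy document
corpus = Counter("solar powered car".split())
print(mla_counts("solar powered car".split(), corpus, sum(corpus.values())))
```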
Model MLA: smoothing and relevance score
• Smoothing (comparable to the unigram approach): smooth with a corpus model that also contains both single words and pairs
• Relevance score similar to the classical unigram model, now computed over both kinds of tokens (formula below)
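In the notation used above, the relevance score keeps the unigram product form, ranging over both the query's single words and its pair tokens. A hedged reconstruction of the slide's missing formula:

```latex
\mathrm{score}(d, q) \;=\; \prod_{t \,\in\, \mathrm{words}(q) \,\cup\, \mathrm{pairs}(q)}
\bigl( \lambda\, p_{\mathrm{ML}}(t \mid d) + (1 - \lambda)\, p(t \mid C) \bigr)
```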
Results with model MLA
• Model MLA outperforms the unigram and bigram models for certain configurations
• The best results are obtained when the contribution of pairs to the relevance score is low (β between 0.0001 and 0.04)
Conclusion
• Lexical affinities improve the performance of unigram and bigram models, but their contribution to the score must remain low relative to the contribution of single words
• Future work:
  • Part-of-speech tagger to extract the word pairs
  • Parser
  • Choose between pairs and single terms (reduce redundancy)
  • Learn the parameters per document (or per word) with an EM algorithm, rather than globally/empirically