
Word Pairs in Language Modeling for Information Retrieval


Presentation Transcript


  1. Word Pairs in Language Modeling for Information Retrieval Carmen Alvarez, Philippe Langlais, Jian-Yun Nie Université de Montréal

  2. Presentation Outline
  • Introduction
  • Previous language modeling approaches in IR
  • Unigram model
  • Bigram model
  • Lexical affinities in language models for IR
  • Conclusion
  RIAO 2004

  3. Introduction: General approach to language modeling in IR

  4. Unigram model
  The relevance score of a document d for a query q = w1, …, wN is the product of the query word probabilities.
  Ex: p(solar powered cars) = p(solar) × p(powered) × p(cars)
  How do we determine the probability of a word given a document?
  • MLE approach: relative frequency, p(w|d) = count(w, d) / |d|
  • Problem: the score is 0 if the document does not contain one of the query terms
  • Smoothing: combine the document language model with a corpus language model
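The smoothed unigram scoring described on this slide can be sketched as follows. This is a minimal illustration using simple linear (Jelinek-Mercer) interpolation; the interpolation weight `lam` and the function name are illustrative choices, not values from the paper.

```python
from collections import Counter

def unigram_score(query, doc_tokens, corpus_tokens, lam=0.5):
    """Score a document for a query under a smoothed unigram model.

    p(w|d) is interpolated with the corpus model p(w|C), so a query term
    missing from the document no longer zeroes out the whole score.
    """
    doc_counts, corpus_counts = Counter(doc_tokens), Counter(corpus_tokens)
    doc_len, corpus_len = len(doc_tokens), len(corpus_tokens)
    score = 1.0
    for w in query:
        p_doc = doc_counts[w] / doc_len          # MLE: relative frequency in d
        p_corpus = corpus_counts[w] / corpus_len  # corpus (background) model
        score *= lam * p_doc + (1 - lam) * p_corpus
    return score
```

With smoothing, a query term that appears only in the corpus still contributes a small non-zero probability instead of making the product zero.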

  5. Word dependencies
  • Independence hypothesis: an oversimplification
  • Ex: ‘Information Retrieval’, ‘Middle East’, ‘Intellectual Property’
  • One possible solution: a conditional bigram model, p(wi|wi-1)
  • Smooth the MLE bigram probabilities with a unigram model
  • Our implementation: the bigram model provides a slight improvement over the unigram model
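A conditional bigram scorer of the kind described above might look like this. As a simplification, this sketch smooths the MLE bigram probabilities with the document's own unigram model only; the paper's smoothing also involves a corpus model.

```python
from collections import Counter

def bigram_score(query, doc_tokens, lam=0.7):
    """Conditional bigram relevance score, smoothed with a unigram model:
    p(w_i|w_{i-1}) ~ lam * p_ML(w_i|w_{i-1}) + (1 - lam) * p(w_i).
    """
    unigrams = Counter(doc_tokens)
    bigrams = Counter(zip(doc_tokens, doc_tokens[1:]))
    n = len(doc_tokens)
    score = unigrams[query[0]] / n  # first query word: unigram probability
    for prev, w in zip(query, query[1:]):
        p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
        p_uni = unigrams[w] / n
        score *= lam * p_bi + (1 - lam) * p_uni
    return score
```

Note that the score is order-sensitive: a document containing "solar powered" scores higher for that query than for "powered solar", which is exactly the limitation the next slide discusses.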

  6. Problems with the bigram model
  • Useful for other LM applications (ex: speech recognition) but not necessarily for IR
  • Problem 1, word order: ‘apartment rentals’ vs. ‘renting apartments’
  • Problem 2, adjacency: ‘solar powered cars’ vs. ‘solar cars’
  • Previous approaches addressing these problems:
  • Srikanth and Srihari (2002): biterm model, slight improvements over the conditional bigram approach
  • Pickens (2000): non-adjacent bigrams; order is still important for the domain (musical documents)
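The word-order problem can be illustrated with an order-free pair estimate. The following is one plausible reading of the biterm idea, not Srikanth and Srihari's exact formula: the pair is counted in either order, so (after stemming) 'apartment rentals' and 'renting apartments' contribute to the same estimate.

```python
from collections import Counter

def biterm_prob(w_prev, w, doc_tokens):
    """Order-free "biterm" probability sketch: count adjacent occurrences
    of the pair in either order, normalized by the counts of both words."""
    bigrams = Counter(zip(doc_tokens, doc_tokens[1:]))
    unigrams = Counter(doc_tokens)
    pair = bigrams[(w_prev, w)] + bigrams[(w, w_prev)]
    denom = unigrams[w_prev] + unigrams[w]
    return pair / denom if denom else 0.0
```

Unlike the conditional bigram, this estimate is symmetric in its two words, so query word order no longer changes the score.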

  7. Lexical affinities
  • Word pairs within a window of n words
  • Martin et al. (1983): 98% of lexical relations are between words separated by a maximum distance of 5 words
  • Maarek et al. (1991): search and clustering project for a software documentation library
  • Resolving power ρ of a pair for a document d, comparable to tf × idf:
  • Frequency of the pair in the document d (the tf component)
  • Global frequency of the individual words (the information of the pair, comparable to idf)
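Pair extraction within a window, and a resolving-power score, can be sketched as below. The window extraction follows the 5-word distance observation of Martin et al. (1983); the resolving-power formula shown is an illustrative tf × idf-like stand-in (frequent pairs of globally rare words score high), since the paper's exact formula is not reproduced on this slide.

```python
from collections import Counter
import math

def lexical_affinities(tokens, window=5):
    """Collect unordered word pairs co-occurring within `window` words."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w:
                pairs[tuple(sorted((w, v)))] += 1  # unordered: sort the pair
    return pairs

def resolving_power(pair, pair_freq, word_freqs, total_words):
    """tf*idf-like stand-in: pair frequency (tf) times the information
    content of the two individual words (comparable to idf)."""
    w1, w2 = pair
    info = (-math.log(word_freqs[w1] / total_words)
            - math.log(word_freqs[w2] / total_words))
    return pair_freq * info
```

Because pairs are stored unordered and within a window rather than strictly adjacent, this directly addresses both the word-order and the adjacency problems of the bigram model.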

  8. Lexical affinities: examples
  Observations:
  • Resolving power does not correspond directly to frequency
  • Certain words appear in the majority of the pairs (court, union)

  9. Filtering pairs by resolving power

  10. Filtering the pairs by window size
  • Certain insignificant word pairs are removed: “intense senate”
  • Certain significant word pairs are also removed: “house senate”
  • Some insignificant word pairs are still present: “senate thursday”

  11. Model MLA
  • Represent word pairs as single words. Ex: “prime minister” → “minister,prime”
  • Modify the observed word pair counts as a function of the resolving power and a parameter β
  • Construct a “unigram” model that contains both the single words and the pairs (represented as single words)
  Example: a document “solar powered car”, with β = 0.5.
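The construction of the extended "unigram" vocabulary can be sketched as follows, using the slide's own example ("solar powered car", β = 0.5). As a simplification, pair counts here are scaled by β alone; the paper additionally weights them by resolving power.

```python
from collections import Counter

def build_mla_counts(tokens, pairs, beta=0.5):
    """Extended "unigram" counts over single words plus word pairs encoded
    as single tokens in "minister,prime" style. Pair counts are
    down-weighted by beta."""
    counts = Counter({w: float(c) for w, c in Counter(tokens).items()})
    for (w1, w2), c in pairs.items():
        counts[f"{w1},{w2}"] = beta * c  # pair encoded as one token
    return counts
```

For the document "solar powered car" with β = 0.5, each single word keeps count 1.0 while each encoded pair gets count 0.5, so pairs contribute to the model without dominating it.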

  12. Model MLA: smoothing and relevance score
  • Smoothing (comparable to the unigram approach): smooth with a corpus model that also contains both single words and pairs
  • Relevance score: computed as in the classical unigram model, over the extended vocabulary
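Given count tables over the extended vocabulary (single words plus encoded pairs), the relevance score is computed exactly like a smoothed unigram score. A minimal sketch, assuming linear interpolation with an illustrative weight `lam`:

```python
def mla_score(query_terms, doc_counts, corpus_counts, lam=0.5):
    """Score under the MLA model: the query is expanded to its single
    words plus encoded pairs, then scored like a smoothed unigram model
    over the extended vocabulary."""
    d_len = sum(doc_counts.values())
    c_len = sum(corpus_counts.values())
    score = 1.0
    for t in query_terms:
        p_d = doc_counts.get(t, 0.0) / d_len    # document model
        p_c = corpus_counts.get(t, 0.0) / c_len  # corpus model (smoothing)
        score *= lam * p_d + (1 - lam) * p_c
    return score
```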

  13. Results with model MLA
  • Model MLA outperforms the unigram and bigram models in certain configurations
  • The best results are obtained when the contribution of the pairs to the relevance score is low (β between 0.0001 and 0.04)

  14. Conclusion
  • Lexical affinities improve the performance of unigram and bigram models, but their contribution to the score must remain low compared with the contribution of single words
  • Future work:
  • Use a part-of-speech tagger to extract the word pairs
  • Use a parser
  • Choose between pairs and single terms (reduce redundancy)
  • Learn the parameters per document (or per word) with an EM algorithm, rather than globally/empirically
