
Relevance Models [draft]



  1. Relevance Models [draft] Ciprian Raileanu (raileanu@mpi-inf.mpg.de)

  2. Two Approaches to Information Retrieval • Query Q = (q1,…,qk) • Probabilistic Approach to IR (Robertson & Sparck Jones, 1976 and 2001): "Given this query, what is the probability that this document is relevant?" • Language Modeling Approach (Ponte & Croft, 1998): "Given this document, what is a query to which this document is relevant?"

  3. Overview • Two papers: • Probabilistic Relevance Models Based on Document and Query Generation (John Lafferty & ChengXiang Zhai, CMU) • Relevance-Based Language Models (Victor Lavrenko & Bruce Croft, U Mass Amherst)

  4. Intro • Recently, language models have enjoyed a lot of attention and have performed quite well in practice • However, the underlying semantics of the model have been unclear for a while, since it seems to ignore the important notion of relevance • Lafferty and Zhai from CMU have proposed a unified framework; they showed that the traditional probabilistic approach and the language modeling approach are in fact equivalent from a probabilistic point of view. This also resolves the relevance issue behind language models

  5. The Robertson-Sparck Jones Model (1) • Rank documents by the log-odds of relevance, with r denoting relevance and $\bar{r}$ irrelevance:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} = \log \frac{P(D,Q \mid r)\,P(r)/P(D,Q)}{P(D,Q \mid \bar{r})\,P(\bar{r})/P(D,Q)} = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})}$$

$$= \log \frac{P(D \mid Q,r)\,P(Q \mid r)\,P(r)}{P(D \mid Q,\bar{r})\,P(Q \mid \bar{r})\,P(\bar{r})} = \log \frac{P(D \mid Q,r)\,\big[P(Q \mid r)\,P(r)/P(Q)\big]}{P(D \mid Q,\bar{r})\,\big[P(Q \mid \bar{r})\,P(\bar{r})/P(Q)\big]}$$

$$= \log \frac{P(D \mid Q,r)\,P(r \mid Q)}{P(D \mid Q,\bar{r})\,P(\bar{r} \mid Q)} = \log \frac{P(D \mid Q,r)}{P(D \mid Q,\bar{r})} + \log \frac{P(r \mid Q)}{P(\bar{r} \mid Q)} \;\stackrel{rank}{\approx}\; \log \frac{P(D \mid Q,r)}{P(D \mid Q,\bar{r})}$$

(the last term does not depend on D, so it can be dropped for ranking)

  6. The Robertson-Sparck Jones Model (2) • Assume that a document D is a collection of attributes (words), D = (A1, …, An) • Further assuming the independence of these attributes given the query and the relevance variable R, P(D|Q,R) can be rewritten as $P(D \mid Q,R) = \prod_{i=1}^{n} P(A_i \mid Q,R)$ • The Robertson-Sparck Jones ranking formula then becomes $\sum_{i=1}^{n} \log \frac{P(A_i \mid Q,r)}{P(A_i \mid Q,\bar{r})}$
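To make this concrete, here is a minimal sketch of computing the Robertson-Sparck Jones sum from a set of judged documents. The function name, the presence/absence representation, and the add-0.5 smoothing are illustrative assumptions, not from the slides:

```python
import math

def rsj_score(doc_terms, query_terms, rel_docs, nonrel_docs):
    """Score a document by sum_i log[ P(Ai|r) / P(Ai|r_bar) ] over query
    terms present in the document. Term probabilities are estimated from
    judged relevant/non-relevant document sets with add-0.5 smoothing."""
    R, N = len(rel_docs), len(nonrel_docs)
    score = 0.0
    for t in query_terms:
        if t not in doc_terms:
            continue
        r = sum(1 for d in rel_docs if t in d)     # relevant docs containing t
        n = sum(1 for d in nonrel_docs if t in d)  # non-relevant docs containing t
        p = (r + 0.5) / (R + 1.0)   # estimated P(t present | relevant)
        q = (n + 0.5) / (N + 1.0)   # estimated P(t present | non-relevant)
        score += math.log(p / q)
    return score
```

A term common in the relevant set and rare in the non-relevant set contributes a positive log-odds weight; documents lacking all query terms score zero.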

  7. The Language Modeling Approach (1) • Starting from the same log-odds ratio, but factoring the joint the other way:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} = \log \frac{P(D,Q \mid r)\,P(r)/P(D,Q)}{P(D,Q \mid \bar{r})\,P(\bar{r})/P(D,Q)} = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})}$$

$$= \log \frac{P(Q \mid D,r)\,P(D \mid r)\,P(r)}{P(Q \mid D,\bar{r})\,P(D \mid \bar{r})\,P(\bar{r})} = \log \frac{P(Q \mid D,r)\,\big[P(D \mid r)\,P(r)/P(D)\big]}{P(Q \mid D,\bar{r})\,\big[P(D \mid \bar{r})\,P(\bar{r})/P(D)\big]}$$

$$= \log \frac{P(Q \mid D,r)\,P(r \mid D)}{P(Q \mid D,\bar{r})\,P(\bar{r} \mid D)} = \log \frac{P(Q \mid D,r)}{P(Q \mid D,\bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)}$$

  8. The Language Modeling Approach (2) • Assumption 1: A document D is independent of the query Q, given irrelevance: $P(Q \mid D,\bar{r}) = P(Q \mid \bar{r})$

  9. The Language Modeling Approach (3) • Under the previous assumption the ranking formula becomes:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} = \log \frac{P(Q \mid D,r)}{P(Q \mid D,\bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)} = \log \frac{P(Q \mid D,r)}{P(Q \mid \bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)}$$

$$\stackrel{rank}{\approx}\; \log P(Q \mid D,r) + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)}$$

(since $P(Q \mid \bar{r})$ does not depend on D, it can be dropped for ranking)

  10. The Language Modeling Approach (4) • Assumption 2: A document D is independent of the relevance variable R: $P(r \mid D) = P(r)$

  11. The Language Modeling Approach (5) • Under the previous assumption, and using the ranking formula derived after making Assumption 1, we obtain:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} \approx \log P(Q \mid D,r) + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)} = \log P(Q \mid D,r) + \log \frac{P(r)}{P(\bar{r})} \;\stackrel{rank}{\approx}\; \log P(Q \mid D,r)$$

  12. The Language Modeling Approach (6) • Now interpreting a query Q as a collection of attributes (query terms), Q = (A1, …, Am), and furthermore assuming attribute independence, the latter ranking formula becomes:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} \;\stackrel{rank}{\approx}\; \log P(Q \mid D,r) \approx \log \prod_{i=1}^{m} P(A_i \mid D,r) = \sum_{i=1}^{m} \log P(A_i \mid D,r)$$
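In practice each term P(Ai|D,r) is estimated from a smoothed unigram model of the document. A minimal sketch, assuming Dirichlet smoothing against the collection; the function name and the default mu are illustrative choices, not from the slides:

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, mu=2000.0):
    """Score log P(Q|D) = sum_i log P(q_i|D) under a unigram document
    model, Dirichlet-smoothed with the collection language model."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    doc_len, coll_len = len(doc), len(collection)
    score = 0.0
    for q in query:
        p_coll = coll_tf[q] / coll_len                    # background P(q)
        p = (doc_tf[q] + mu * p_coll) / (doc_len + mu)    # smoothed P(q|D)
        if p == 0.0:
            return float("-inf")    # term unseen in document and collection
        score += math.log(p)
    return score
```

Documents containing the query terms receive higher log-likelihood than documents that match only through the background model.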

  13. Comparing the Two Methods • Both derivations start from the same log-odds ratio:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})}$$

and differ only in how the joint is factored:

$$= \log \frac{P(D \mid Q,r)\,P(Q \mid r)\,P(r)}{P(D \mid Q,\bar{r})\,P(Q \mid \bar{r})\,P(\bar{r})} \;\;\text{(document generation)} \qquad = \log \frac{P(Q \mid D,r)\,P(D \mid r)\,P(r)}{P(Q \mid D,\bar{r})\,P(D \mid \bar{r})\,P(\bar{r})} \;\;\text{(query generation)}$$

  14. Outro • The probabilistic approach to information retrieval and the language modeling approach are equivalent from a probabilistic point of view • However, the two models are still different from a statistical point of view; this becomes apparent when we need to estimate the model parameters • One particular difficulty with the Robertson & Sparck Jones model is estimating, without any training data, the probability P(w|r), that is, the probability of seeing the word w in the relevant documents • If we have a good estimate of this probability, we can expect excellent performance, since the model is in fact a Naïve Bayes classifier (optimal under its independence assumptions from a statistical point of view)

  15. Probability Estimation • Recall the Robertson-Sparck Jones ranking formula under the feature independence assumption: $\sum_{i=1}^{n} \log \frac{P(A_i \mid Q,r)}{P(A_i \mid Q,\bar{r})}$, where a document D is modeled as a collection of independent words Ai: D = (A1, …, An) • This prompts us, again, to find a reasonable estimate for probabilities of the form P(w|r) • This is hard to estimate, since in practice we have no information about the relevant documents (they are what we are actually trying to find) • Heuristic methods have been proposed for approximating this probability; we seek a well-founded theoretical method as an alternative

  16. A New Approach (1) • Is there a result grounded in probability theory that can help us approximate this probability without prior training data? • Yes, one proposed by Lavrenko and Croft (U Massachusetts Amherst) • Model: queries and relevant documents are random samples from an underlying relevance model R • A relevance model (as defined by Lavrenko and Croft) is a mechanism that determines the probability P(w|r) of observing a word w in a document relevant to a particular information need • It also assigns the probabilities P(Q|r) to the various queries that might be issued by the user for that specific information need

  17. A New Approach (2) • The approach previously described can be summarized pictorially: • Note that this is different from the language modeling framework: • We don’t assume that the query is a random sample from a specific document, but instead we assume that both the query and the documents are samples from an unknown relevance model R • Also note that in this approach the sampling process can be different for queries and documents

  18. A New Approach (3) • Let Q be a query of the form Q = (q1,…,qk), where each qi is a word • Assume we have an unknown process R (a black box) from which we repeatedly sample words; after k samplings, we observe the words q1,…,qk • What is the probability that the next word we pull out of R will be w? $P(w \mid q_1,\ldots,q_k) = \frac{P(w, q_1,\ldots,q_k)}{P(q_1,\ldots,q_k)}$ • To ensure the proper additivity of the model, we normalize by summing over all words v in the vocabulary: $P(w \mid q_1,\ldots,q_k) = \frac{P(w, q_1,\ldots,q_k)}{\sum_{v} P(v, q_1,\ldots,q_k)}$ • Now the challenge lies in estimating the joint probability P(w,q1,…,qk) • For this purpose we use two techniques: independent and identically distributed (i.i.d.) sampling and conditional sampling

  19. Method I: I.i.d. Sampling • Assume the query words q1,…,qk and the words in relevant documents are sampled independently and identically (i.i.d.) from a unigram distribution M • The sampling process proceeds as follows: we choose a distribution M with probability P(M) and sample from it k+1 times. The total probability of observing w together with q1,…,qk is then the weighted sum: $P(w, q_1,\ldots,q_k) = \sum_{M \in \mathcal{M}} P(M)\, P(w, q_1,\ldots,q_k \mid M)$ • Since we assumed that the qi and w are sampled i.i.d., this factors as: $P(w, q_1,\ldots,q_k) = \sum_{M \in \mathcal{M}} P(M)\, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$ • Plugging this into the initial equation yields the final estimate of $P(w \mid q_1,\ldots,q_k)$
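A minimal sketch of Method I over a toy collection, assuming each document induces one unigram model M with Jelinek-Mercer smoothing and a uniform prior P(M); the function name and the smoothing weight are illustrative assumptions:

```python
from collections import Counter

def relevance_model_iid(query, docs, lam=0.6):
    """Method I: P(w,q1..qk) = sum_M P(M) P(w|M) prod_i P(qi|M), where each
    document induces a smoothed unigram model M and P(M) is uniform.
    Returns the normalized distribution P(w|q1..qk) over the vocabulary."""
    coll = Counter(w for d in docs for w in d)
    coll_len = sum(coll.values())
    def p_w_given_m(w, d, tf):
        # Jelinek-Mercer mix of document and collection probabilities
        return lam * tf[w] / len(d) + (1 - lam) * coll[w] / coll_len
    joint = Counter()
    for d in docs:
        tf = Counter(d)
        prior = 1.0 / len(docs)            # uniform P(M)
        q_lik = 1.0
        for q in query:
            q_lik *= p_w_given_m(q, d, tf)  # prod_i P(qi|M)
        for w in coll:
            joint[w] += prior * p_w_given_m(w, d, tf) * q_lik
    total = sum(joint.values())             # normalize over the vocabulary
    return {w: v / total for w, v in joint.items()}
```

Words that co-occur with the query words in the same documents end up with more probability mass than words that appear only in documents unlikely to have generated the query.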

  20. Method II: Conditional Sampling • Now consider a different approach: we fix the value of w according to some prior probability P(w) • We sample each query word qi from a distribution Mi with probability P(qi|Mi). In essence we consider the query words to be independent of each other, but we keep their dependence on w: $P(w, q_1,\ldots,q_k) = P(w) \prod_{i=1}^{k} P(q_i \mid w)$ • An expected-value calculation over the universe of unigram models yields: $P(q_i \mid w) = \sum_{M_i \in \mathcal{M}} P(q_i \mid M_i)\, P(M_i \mid w)$ • Plugging this into the initial equation yields the final estimate of $P(w \mid q_1,\ldots,q_k)$
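A matching sketch of Method II over the same kind of toy collection, assuming one smoothed unigram model per document, a uniform P(M), and $P(M \mid w)$ obtained by Bayes' rule; names and the smoothing weight are illustrative:

```python
from collections import Counter

def relevance_model_conditional(query, docs, lam=0.6):
    """Method II: P(w,q1..qk) = P(w) prod_i sum_M P(qi|M) P(M|w), with
    P(M|w) proportional to P(w|M) P(M) (uniform P(M)). Query words are
    assumed to be in-vocabulary. Returns normalized P(w|q1..qk)."""
    coll = Counter(w for d in docs for w in d)
    coll_len = sum(coll.values())
    models = []
    for d in docs:
        tf = Counter(d)
        models.append({w: lam * tf[w] / len(d) + (1 - lam) * coll[w] / coll_len
                       for w in coll})      # smoothed P(.|M) per document
    joint = {}
    for w in coll:
        pw = sum(m[w] for m in models) / len(models)        # prior P(w)
        post = [m[w] / (len(models) * pw) for m in models]  # P(M|w), sums to 1
        prob = pw
        for q in query:
            prob *= sum(m[q] * pm for m, pm in zip(models, post))  # P(q|w)
        joint[w] = prob
    total = sum(joint.values())
    return {w: v / total for w, v in joint.items()}
```

Unlike Method I, each query word may effectively be explained by a different document model, which is why this variant is less constrained.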

  21. Comparison of the Two Methods • I.i.d. sampling (left) vs. conditional sampling (right) • The i.i.d. sampling model makes a stronger independence assumption • The conditional sampling model is less constrained, since it allows the query words to come from different distributions • In practice the second model performs better; it is more robust and less sensitive to the choice of the universe of distributions • From this point on the focus will be on the conditional sampling model; it will also be the method of choice for benchmarking

  22. Conditional Sampling: Estimation Details • Recall the estimate provided by this method: $P(w, q_1,\ldots,q_k) = P(w) \prod_{i=1}^{k} \sum_{M_i \in \mathcal{M}} P(q_i \mid M_i)\, P(M_i \mid w)$ • The terms involved are computed explicitly from the document models: the prior P(w) from the universe of unigram models, and $P(M_i \mid w)$ by Bayes' rule from $P(w \mid M_i)$ and $P(M_i)$ • With this estimation, the classic probabilistic approach to information retrieval outperforms language models and other sophisticated approaches
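One common way to put the resulting estimate of P(w|R) to work is to rank documents by cross entropy against a smoothed document model; this ranking rule, the function name, and the smoothing weight are assumptions for illustration rather than details from the slides:

```python
import math
from collections import Counter

def rank_by_relevance_model(rel_model, docs, lam=0.6):
    """Score each document by sum_w P(w|R) * log P(w|D), i.e. negative
    cross entropy between the relevance model and a Jelinek-Mercer
    smoothed document model (smoothing keeps log arguments positive)."""
    coll = Counter(w for d in docs for w in d)
    coll_len = sum(coll.values())
    scores = []
    for d in docs:
        tf = Counter(d)
        s = sum(p * math.log(lam * tf[w] / len(d) + (1 - lam) * coll[w] / coll_len)
                for w, p in rel_model.items() if coll[w] > 0)  # skip OOV words
        scores.append(s)
    return scores
```

Documents whose language models sit close to the relevance model receive higher (less negative) scores.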

  23. Experimental Results (1)

  24. Experimental Results (2)

  25. Experimental Results (3)

  26. Experimental Results (4)

  27. Experimental Results (5)

  28. Summary • Probabilistic models and language models share the same underlying probabilistic foundation • However, the statistical estimation of the parameters differentiates them • Traditional probabilistic models have had difficulty estimating the model parameters without any training data • A novel approach proposed by Lavrenko and Croft allows for a good estimation of the probabilistic model parameters • With this estimation, probabilistic methods perform better in practice than models based on language modeling or other more complicated models
