Relevance Models [draft] Ciprian Raileanu (raileanu@mpi-inf.mpg.de)
Two Approaches to Information Retrieval
• Probabilistic Approach to IR (Robertson & Sparck Jones, 1976 and 2001): given a query Q = (q1,…,qk), "what is the probability that this document is relevant?"
• Language Modeling Approach (Ponte & Croft, 1998): "given this document, what is a query to which this document is relevant?"
Overview • Two papers: • Probabilistic Relevance Models Based on Document and Query Generation (John Lafferty & Cheng Xiang Zhai, CMU) • Relevance-Based Language Models (Victor Lavrenko & Bruce Croft, U Mass Amherst)
Intro
• Recently language models have enjoyed a lot of attention and have performed quite well in practice
• However, the underlying semantics of the model have been unclear for a while, as it seems to ignore the important notion of relevance
• Lafferty and Zhai from CMU have proposed a unified framework; they showed that the traditional probabilistic approach and the language modeling approach are in fact equivalent from a probabilistic point of view, which also resolves the relevance issue behind language models
The Robertson-Sparck Jones Model (1)

$$
\begin{aligned}
\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)}
&= \log \frac{P(D,Q \mid r)\,P(r)\,/\,P(D,Q)}{P(D,Q \mid \bar{r})\,P(\bar{r})\,/\,P(D,Q)}
 = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})} \\
&= \log \frac{P(D \mid Q,r)\,P(Q \mid r)\,P(r)}{P(D \mid Q,\bar{r})\,P(Q \mid \bar{r})\,P(\bar{r})}
 = \log \frac{P(D \mid Q,r)\,\bigl[P(Q \mid r)\,P(r)/P(Q)\bigr]}{P(D \mid Q,\bar{r})\,\bigl[P(Q \mid \bar{r})\,P(\bar{r})/P(Q)\bigr]} \\
&= \log \frac{P(D \mid Q,r)\,P(r \mid Q)}{P(D \mid Q,\bar{r})\,P(\bar{r} \mid Q)}
 = \log \frac{P(D \mid Q,r)}{P(D \mid Q,\bar{r})} + \log \frac{P(r \mid Q)}{P(\bar{r} \mid Q)} \\
&\approx \log \frac{P(D \mid Q,r)}{P(D \mid Q,\bar{r})}
\end{aligned}
$$

The last step drops $\log \frac{P(r \mid Q)}{P(\bar{r} \mid Q)}$, which does not depend on $D$ and therefore does not affect the ranking.
The Robertson-Sparck Jones Model (2)
• Assume that a document D is a collection of attributes (words), D = (A1, …, An)
• Further assuming that these attributes are independent given the query and relevance, P(D|Q,R) can be rewritten as

$$P(D \mid Q,R) = \prod_{i=1}^{n} P(A_i \mid Q,R)$$

where R is a random variable that denotes relevance.
• The Robertson-Sparck Jones ranking formula then becomes

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} \approx \sum_{i=1}^{n} \log \frac{P(A_i \mid Q,r)}{P(A_i \mid Q,\bar{r})}$$
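A minimal sketch of this ranking sum (the function name and the toy per-word probabilities are illustrative assumptions, not from the papers), assuming the per-word estimates P(A|r) and P(A|r̄) are already available:

```python
import math

def rsj_score(doc_words, p_rel, p_nonrel):
    """Sum over document attributes of log P(A_i|r)/P(A_i|r_bar),
    the rewritten Robertson-Sparck Jones ranking formula."""
    return sum(math.log(p_rel[w] / p_nonrel[w]) for w in doc_words)

# Hypothetical per-word probabilities: "fox" is more likely in relevant docs.
p_rel = {"fox": 0.4, "the": 0.5}
p_nonrel = {"fox": 0.1, "the": 0.5}
assert rsj_score(["fox"], p_rel, p_nonrel) > 0      # evidence for relevance
assert rsj_score(["the"], p_rel, p_nonrel) == 0.0   # uninformative word
```

Words that are equally likely under both hypotheses contribute nothing to the score, which is exactly why estimating P(w|r) well matters.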
The Language Modeling Approach (1)

$$
\begin{aligned}
\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)}
&= \log \frac{P(D,Q \mid r)\,P(r)\,/\,P(D,Q)}{P(D,Q \mid \bar{r})\,P(\bar{r})\,/\,P(D,Q)}
 = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})} \\
&= \log \frac{P(Q \mid D,r)\,P(D \mid r)\,P(r)}{P(Q \mid D,\bar{r})\,P(D \mid \bar{r})\,P(\bar{r})}
 = \log \frac{P(Q \mid D,r)\,\bigl[P(D \mid r)\,P(r)/P(D)\bigr]}{P(Q \mid D,\bar{r})\,\bigl[P(D \mid \bar{r})\,P(\bar{r})/P(D)\bigr]} \\
&= \log \frac{P(Q \mid D,r)\,P(r \mid D)}{P(Q \mid D,\bar{r})\,P(\bar{r} \mid D)}
 = \log \frac{P(Q \mid D,r)}{P(Q \mid D,\bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)}
\end{aligned}
$$
The Language Modeling Approach (2)
• Assumption 1: A document D is independent of the query Q, given irrelevance:

$$P(Q \mid D,\bar{r}) = P(Q \mid \bar{r})$$
The Language Modeling Approach (3)
• Under the previous assumption the ranking formula becomes:

$$
\begin{aligned}
\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)}
&= \log \frac{P(Q \mid D,r)}{P(Q \mid D,\bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)} \\
&= \log \frac{P(Q \mid D,r)}{P(Q \mid \bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)} \\
&\approx \log P(Q \mid D,r) + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)}
\end{aligned}
$$

The term $P(Q \mid \bar{r})$ does not depend on $D$, so dropping it preserves the ranking.
The Language Modeling Approach (4)
• Assumption 2: A document D is independent of the relevance variable R:

$$P(r \mid D) = P(r), \qquad P(\bar{r} \mid D) = P(\bar{r})$$
The Language Modeling Approach (5)
• Under Assumption 2, and starting from the ranking formula derived after Assumption 1, we obtain:

$$
\begin{aligned}
\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)}
&\approx \log P(Q \mid D,r) + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)} \\
&= \log P(Q \mid D,r) + \log \frac{P(r)}{P(\bar{r})} \\
&\approx \log P(Q \mid D,r)
\end{aligned}
$$

The constant $\log \frac{P(r)}{P(\bar{r})}$ does not depend on $D$ and can be dropped for ranking.
The Language Modeling Approach (6)
• Interpreting a query Q as a collection of attributes (query terms), Q = (A1, …, Am), and further assuming attribute independence, the ranking formula becomes:

$$\log \frac{P(r \mid D,Q)} {P(\bar{r} \mid D,Q)} \approx \log P(Q \mid D,r) = \log \prod_{i=1}^{m} P(A_i \mid D,r) = \sum_{i=1}^{m} \log P(A_i \mid D,r)$$
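A minimal sketch of this query-likelihood scorer (names and the additive smoothing scheme are illustrative assumptions; real systems use more careful smoothing):

```python
import math
from collections import Counter

def query_likelihood_score(query_terms, doc_terms, vocab_size, mu=1.0):
    """Rank score: sum_i log P(A_i | D), with additive smoothing so that
    unseen query terms do not drive the score to -infinity."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p = (counts[t] + mu) / (n + mu * vocab_size)
        score += math.log(p)
    return score

# A document that contains the query terms should outrank one that does not.
doc_a = "the quick brown fox jumps".split()
doc_b = "lorem ipsum dolor sit amet".split()
query = ["quick", "fox"]
assert query_likelihood_score(query, doc_a, 20) > query_likelihood_score(query, doc_b, 20)
```

Smoothing is essential here: without it, a single query term absent from a document would send the log-probability to minus infinity.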
Comparing the Two Methods
• Both derivations start from the same log-odds ratio:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})}$$

• They differ only in how the joint $P(D,Q \mid R)$ is factored:
• Document generation (Robertson-Sparck Jones): $P(D,Q \mid R) = P(D \mid Q,R)\,P(Q \mid R)$
• Query generation (language modeling): $P(D,Q \mid R) = P(Q \mid D,R)\,P(D \mid R)$
Outro
• The probabilistic approach to information retrieval and the language modeling approach are equivalent from a probabilistic point of view
• However, the two models still differ from a statistical point of view; this becomes apparent when we need to estimate the model parameters
• One particular difficulty with the Robertson & Sparck Jones model is estimating, without any training data, the probability P(w|r), i.e. the probability of seeing the word w in the relevant documents
• With a good estimate of this probability we can expect excellent performance, since the model is in fact a Naïve Bayes classifier, which is statistically optimal when its independence assumptions hold
Probability Estimation
• Recall the Robertson-Sparck Jones ranking formula under the feature-independence assumption:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} \approx \sum_{i=1}^{n} \log \frac{P(A_i \mid Q,r)}{P(A_i \mid Q,\bar{r})}$$

where a document D is modeled as a collection of independent words Ai: D = (A1, …, An)
• This again prompts us to find a reasonable estimate for probabilities of the form P(w|r)
• These are hard to estimate, since in practice we have no information about the relevant documents (finding them is precisely the task)
• Heuristic methods have been proposed for approximating this probability; we seek a well-founded theoretical alternative
A New Approach (1) • Is there a result grounded in probability theory that can help us approximate this probability without prior training data? • Yes. Proposed by Lavrenko and Croft (U Massachusetts Amherst) • Model: queries and relevant documents are random samples from an underlying relevance model R • A relevance model (as defined by Lavrenko and Croft), is a mechanism that determines the probability P(w|r) of observing a word w in a document relevant to a particular information need • It also assigns the probabilities P(Q|r) to the various queries that might be issued by the user for that specific information need
A New Approach (2)
• The approach just described can be summarized as follows: both the query and the relevant documents are generated by sampling from the same underlying relevance model R
• Note that this differs from the language modeling framework: we do not assume that the query is a random sample from a specific document; instead, both the query and the documents are samples from an unknown relevance model R
• Also note that in this approach the sampling process can differ between queries and documents
A New Approach (3)
• Let Q be a query of the form Q = (q1,…,qk), where each qi is a word
• Assume we have an unknown process R (a black box) from which we repeatedly sample words; after k samplings we observe the words q1,…,qk
• What is the probability that the next word we pull out of R will be w?

$$P(w \mid q_1,\dots,q_k) = \frac{P(w, q_1,\dots,q_k)}{P(q_1,\dots,q_k)}$$

• To ensure the proper additivity of the model, we normalize by summing over all words v in the vocabulary V:

$$P(w \mid q_1,\dots,q_k) = \frac{P(w, q_1,\dots,q_k)}{\sum_{v \in V} P(v, q_1,\dots,q_k)}$$

• The challenge now lies in estimating the joint probability P(w, q1,…,qk)
• For this purpose we use two techniques: i.i.d. sampling and conditional sampling
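The normalization step above can be sketched directly (names and toy joint values are illustrative assumptions):

```python
def normalize(joint):
    """Turn unnormalized joint probabilities P(w, q1..qk), one entry per
    vocabulary word w, into a proper distribution P(w | q1..qk)."""
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}

# Hypothetical joint values for a 3-word vocabulary.
joint = {"apple": 0.02, "banana": 0.01, "cherry": 0.01}
post = normalize(joint)
assert abs(sum(post.values()) - 1.0) < 1e-12  # proper additivity
assert abs(post["apple"] - 0.5) < 1e-12       # 0.02 / 0.04
```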
Method I: Independent and Identically Distributed (i.i.d.) Sampling
• Assume the query words q1,…,qk and the words in relevant documents are sampled independently and identically (i.i.d.) from a unigram distribution M
• The sampling process proceeds as follows: we choose a distribution M with probability P(M) and sample from it k+1 times; the total probability of observing w together with q1,…,qk is then the weighted sum:

$$P(w, q_1,\dots,q_k) = \sum_{M} P(M)\, P(w, q_1,\dots,q_k \mid M)$$

• Since qi and w are sampled i.i.d. from M, this factorizes as:

$$P(w, q_1,\dots,q_k) = \sum_{M} P(M)\, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$$

• Plugging this into the normalization equation yields the final result:

$$P(w \mid q_1,\dots,q_k) = \frac{\sum_{M} P(M)\, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)}{\sum_{v \in V} \sum_{M} P(M)\, P(v \mid M) \prod_{i=1}^{k} P(q_i \mid M)}$$
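A minimal sketch of the Method I joint estimate (the two unigram models and their prior are hypothetical toy data):

```python
def joint_iid(w, query, models, priors):
    """Method I: P(w, q1..qk) = sum_M P(M) * P(w|M) * prod_i P(qi|M),
    where w and every qi are drawn i.i.d. from the same unigram M."""
    total = 0.0
    for name, M in models.items():
        prod = M.get(w, 0.0)
        for q in query:
            prod *= M.get(q, 0.0)
        total += priors[name] * prod
    return total

# Two hypothetical unigram models with a uniform prior over them.
models = {
    "m1": {"cat": 0.5, "dog": 0.5},
    "m2": {"cat": 0.1, "dog": 0.9},
}
priors = {"m1": 0.5, "m2": 0.5}
p = joint_iid("dog", ["dog"], models, priors)
# 0.5*(0.5*0.5) + 0.5*(0.9*0.9) = 0.125 + 0.405 = 0.53
assert abs(p - 0.53) < 1e-12
```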
Method II: Conditional Sampling
• Now consider a different approach: we fix the value of w according to some prior probability P(w)
• We sample each query word qi from a unigram model Mi with probability P(qi|Mi); in essence, the query words are independent of each other, but they retain their dependence on w:

$$P(w, q_1,\dots,q_k) = P(w) \prod_{i=1}^{k} P(q_i \mid w)$$

• Taking an expectation over the universe $\mathcal{M}$ of unigram models yields:

$$P(q_i \mid w) = \sum_{M \in \mathcal{M}} P(q_i \mid M)\, P(M \mid w)$$

• Plugging this into the previous equation gives the final result:

$$P(w, q_1,\dots,q_k) = P(w) \prod_{i=1}^{k} \sum_{M \in \mathcal{M}} P(q_i \mid M)\, P(M \mid w)$$
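A minimal sketch of the Method II joint estimate (the models, the word prior, and P(M|w) below are hypothetical toy data):

```python
def joint_conditional(w, query, p_w, models, p_model_given_w):
    """Method II: P(w, q1..qk) = P(w) * prod_i sum_M P(qi|M) * P(M|w);
    each query word may come from a different unigram model M."""
    result = p_w[w]
    for q in query:
        result *= sum(M.get(q, 0.0) * p_model_given_w[w][name]
                      for name, M in models.items())
    return result

models = {"m1": {"cat": 0.5, "dog": 0.5}, "m2": {"cat": 0.1, "dog": 0.9}}
p_w = {"dog": 0.4}
p_model_given_w = {"dog": {"m1": 0.25, "m2": 0.75}}
p = joint_conditional("dog", ["dog"], p_w, models, p_model_given_w)
# 0.4 * (0.5*0.25 + 0.9*0.75) = 0.4 * 0.8 = 0.32
assert abs(p - 0.32) < 1e-12
```

Unlike Method I, the inner sum is taken per query word, which is exactly the weaker independence assumption discussed next.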
Comparison of the Two Methods: i.i.d. Sampling vs. Conditional Sampling
• The i.i.d. sampling model makes the stronger independence assumption: w and all query words must come from the same unigram distribution M
• The conditional sampling model is less constrained, allowing the query words to come from different distributions
• In practice the second model performs better; it is more robust and less sensitive to the choice of the universe of distributions
• From this point on the focus is on the conditional sampling model; it is also the method of choice for benchmarking
Conditional Sampling: Estimation Details
• Recall the estimate provided by this method:

$$P(w, q_1,\dots,q_k) = P(w) \prod_{i=1}^{k} \sum_{M \in \mathcal{M}} P(q_i \mid M)\, P(M \mid w)$$

• The individual terms are estimated from the retrieved collection: P(qi|M) as a smoothed maximum-likelihood estimate over the document underlying M, and P(M|w) via Bayes' rule from P(w|M) and a prior P(M) over the models
• With this estimation, the classic probabilistic approach to information retrieval outperforms language models and other more sophisticated approaches
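A minimal sketch of the smoothed per-document estimate P(q|M) (Jelinek-Mercer-style interpolation with the collection background; the names and the mixing weight lam are illustrative assumptions):

```python
from collections import Counter

def smoothed_p(word, doc_terms, coll_counts, coll_size, lam=0.6):
    """Smoothed estimate P(w | M_D): interpolate the document's maximum-
    likelihood estimate with the collection-wide background model."""
    ml = Counter(doc_terms)[word] / len(doc_terms)
    bg = coll_counts.get(word, 0) / coll_size
    return lam * ml + (1 - lam) * bg

doc = "fox fox dog".split()
coll = {"fox": 10, "dog": 40, "cat": 50}
p = smoothed_p("fox", doc, coll, 100, lam=0.6)
# 0.6 * (2/3) + 0.4 * (10/100) = 0.4 + 0.04 = 0.44
assert abs(p - 0.44) < 1e-12
```

The background term keeps P(q|M) nonzero for words absent from the document, which the product over query words in the conditional sampling formula requires.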
Summary
• Probabilistic models and language models share the same underlying probabilistic foundation
• However, the statistical estimation of their parameters differentiates them
• Traditional probabilistic models have had difficulty estimating the model parameters without any training data
• A novel approach proposed by Lavrenko and Croft allows a good estimation of the probabilistic model's parameters
• With this estimation, probabilistic methods perform better in practice than models based on language modeling or other more complicated models