Relevance Models [draft] Ciprian Raileanu (raileanu@mpi-inf.mpg.de)
Two Approaches to Information Retrieval
• Probabilistic Approach to IR (Robertson & Sparck Jones, 1976 and 2001): given a query Q = (q1,…,qk), "what is the probability that this document is relevant?"
• Language Modeling Approach (Ponte & Croft, 1998): "given this document, what is a query to which this document is relevant?"
Overview • Two papers: • Probabilistic Relevance Models Based on Document and Query Generation (John Lafferty & Cheng Xiang Zhai, CMU) • Relevance-Based Language Models (Victor Lavrenko & Bruce Croft, U Mass Amherst)
Intro
• Recently language models have enjoyed a lot of attention and have performed quite well in practice
• However, the underlying semantics of the model have been unclear for a while, as it seems to ignore the important notion of relevance
• Lafferty and Zhai from CMU have proposed a unified framework; they showed that the traditional probabilistic approach and the language modeling approach are in fact equivalent from a probabilistic point of view, which also resolves the relevance issue behind language models
The Robertson-Sparck Jones Model (1)

$$
\begin{aligned}
\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)}
&= \log \frac{P(D,Q \mid r)\,P(r)\,/\,P(D,Q)}{P(D,Q \mid \bar{r})\,P(\bar{r})\,/\,P(D,Q)}
 = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})} \\
&= \log \frac{P(D \mid Q,r)\,P(Q \mid r)\,P(r)}{P(D \mid Q,\bar{r})\,P(Q \mid \bar{r})\,P(\bar{r})}
 = \log \frac{P(D \mid Q,r)\,\bigl[P(Q \mid r)\,P(r)/P(Q)\bigr]}{P(D \mid Q,\bar{r})\,\bigl[P(Q \mid \bar{r})\,P(\bar{r})/P(Q)\bigr]} \\
&= \log \frac{P(D \mid Q,r)\,P(r \mid Q)}{P(D \mid Q,\bar{r})\,P(\bar{r} \mid Q)}
 = \log \frac{P(D \mid Q,r)}{P(D \mid Q,\bar{r})} + \log \frac{P(r \mid Q)}{P(\bar{r} \mid Q)} \\
&\approx \log \frac{P(D \mid Q,r)}{P(D \mid Q,\bar{r})}
\end{aligned}
$$

The last step drops $\log \frac{P(r \mid Q)}{P(\bar{r} \mid Q)}$, which does not depend on $D$ and therefore does not affect the ranking.
The Robertson-Sparck Jones Model (2)
• Assume that a document D is a collection of attributes (words), D = (A1, …, An)
• Further assuming that these attributes are independent given the query and relevance, P(D|Q,R) can be rewritten as

$$P(D \mid Q,R) = \prod_{i=1}^{n} P(A_i \mid Q,R)$$

where R is a random variable that denotes relevance.
• The Robertson-Sparck Jones ranking formula then becomes

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} \approx \sum_{i=1}^{n} \log \frac{P(A_i \mid Q,r)}{P(A_i \mid Q,\bar{r})}$$
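A minimal sketch of this ranking sum (the function name and the toy per-word probabilities are illustrative assumptions, not from the papers), assuming the per-word estimates P(A|r) and P(A|r̄) are already available:

```python
import math

def rsj_score(doc_words, p_rel, p_nonrel):
    """Sum over document attributes of log P(A_i|r)/P(A_i|r_bar),
    the rewritten Robertson-Sparck Jones ranking formula."""
    return sum(math.log(p_rel[w] / p_nonrel[w]) for w in doc_words)

# Hypothetical per-word probabilities: "fox" is more likely in relevant docs.
p_rel = {"fox": 0.4, "the": 0.5}
p_nonrel = {"fox": 0.1, "the": 0.5}
assert rsj_score(["fox"], p_rel, p_nonrel) > 0      # evidence for relevance
assert rsj_score(["the"], p_rel, p_nonrel) == 0.0   # uninformative word
```

Words that are equally likely under both hypotheses contribute nothing to the score, which is exactly why estimating P(w|r) well matters.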
The Language Modeling Approach (1)

$$
\begin{aligned}
\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)}
&= \log \frac{P(D,Q \mid r)\,P(r)\,/\,P(D,Q)}{P(D,Q \mid \bar{r})\,P(\bar{r})\,/\,P(D,Q)}
 = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})} \\
&= \log \frac{P(Q \mid D,r)\,P(D \mid r)\,P(r)}{P(Q \mid D,\bar{r})\,P(D \mid \bar{r})\,P(\bar{r})}
 = \log \frac{P(Q \mid D,r)\,\bigl[P(D \mid r)\,P(r)/P(D)\bigr]}{P(Q \mid D,\bar{r})\,\bigl[P(D \mid \bar{r})\,P(\bar{r})/P(D)\bigr]} \\
&= \log \frac{P(Q \mid D,r)\,P(r \mid D)}{P(Q \mid D,\bar{r})\,P(\bar{r} \mid D)}
 = \log \frac{P(Q \mid D,r)}{P(Q \mid D,\bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)}
\end{aligned}
$$
The Language Modeling Approach (2)
• Assumption 1: A document D is independent of the query Q, given irrelevance:

$$P(Q \mid D,\bar{r}) = P(Q \mid \bar{r})$$
The Language Modeling Approach (3)
• Under the previous assumption the ranking formula becomes:

$$
\begin{aligned}
\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)}
&= \log \frac{P(Q \mid D,r)}{P(Q \mid D,\bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)} \\
&= \log \frac{P(Q \mid D,r)}{P(Q \mid \bar{r})} + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)} \\
&\approx \log P(Q \mid D,r) + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)}
\end{aligned}
$$

The term $P(Q \mid \bar{r})$ does not depend on $D$, so dropping it preserves the ranking.
The Language Modeling Approach (4)
• Assumption 2: A document D is independent of the relevance variable R:

$$P(r \mid D) = P(r), \qquad P(\bar{r} \mid D) = P(\bar{r})$$
The Language Modeling Approach (5)
• Under Assumption 2, and starting from the ranking formula derived after Assumption 1, we obtain:

$$
\begin{aligned}
\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)}
&\approx \log P(Q \mid D,r) + \log \frac{P(r \mid D)}{P(\bar{r} \mid D)} \\
&= \log P(Q \mid D,r) + \log \frac{P(r)}{P(\bar{r})} \\
&\approx \log P(Q \mid D,r)
\end{aligned}
$$

The constant $\log \frac{P(r)}{P(\bar{r})}$ does not depend on $D$ and can be dropped for ranking.
The Language Modeling Approach (6)
• Interpreting a query Q as a collection of attributes (query terms), Q = (A1, …, Am), and further assuming attribute independence, the ranking formula becomes:

$$\log \frac{P(r \mid D,Q)} {P(\bar{r} \mid D,Q)} \approx \log P(Q \mid D,r) = \log \prod_{i=1}^{m} P(A_i \mid D,r) = \sum_{i=1}^{m} \log P(A_i \mid D,r)$$
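A minimal sketch of this query-likelihood scorer (names and the additive smoothing scheme are illustrative assumptions; real systems use more careful smoothing):

```python
import math
from collections import Counter

def query_likelihood_score(query_terms, doc_terms, vocab_size, mu=1.0):
    """Rank score: sum_i log P(A_i | D), with additive smoothing so that
    unseen query terms do not drive the score to -infinity."""
    counts = Counter(doc_terms)
    n = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p = (counts[t] + mu) / (n + mu * vocab_size)
        score += math.log(p)
    return score

# A document that contains the query terms should outrank one that does not.
doc_a = "the quick brown fox jumps".split()
doc_b = "lorem ipsum dolor sit amet".split()
query = ["quick", "fox"]
assert query_likelihood_score(query, doc_a, 20) > query_likelihood_score(query, doc_b, 20)
```

Smoothing is essential here: without it, a single query term absent from a document would send the log-probability to minus infinity.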
Comparing the Two Methods
• Both derivations start from the same log-odds ratio:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} = \log \frac{P(D,Q \mid r)\,P(r)}{P(D,Q \mid \bar{r})\,P(\bar{r})}$$

• They differ only in how the joint $P(D,Q \mid R)$ is factored:
• Document generation (Robertson-Sparck Jones): $P(D,Q \mid R) = P(D \mid Q,R)\,P(Q \mid R)$
• Query generation (language modeling): $P(D,Q \mid R) = P(Q \mid D,R)\,P(D \mid R)$
Outro
• The probabilistic approach to information retrieval and the language modeling approach are equivalent from a probabilistic point of view
• However, the two models still differ from a statistical point of view; this becomes apparent when we need to estimate the model parameters
• One particular difficulty with the Robertson & Sparck Jones model is estimating, without any training data, the probability P(w|r), i.e. the probability of seeing the word w in the relevant documents
• With a good estimate of this probability we can expect excellent performance, since the model is in fact a Naïve Bayes classifier, which is statistically optimal when its independence assumptions hold
Probability Estimation
• Recall the Robertson-Sparck Jones ranking formula under the feature-independence assumption:

$$\log \frac{P(r \mid D,Q)}{P(\bar{r} \mid D,Q)} \approx \sum_{i=1}^{n} \log \frac{P(A_i \mid Q,r)}{P(A_i \mid Q,\bar{r})}$$

where a document D is modeled as a collection of independent words Ai: D = (A1, …, An)
• This again prompts us to find a reasonable estimate for probabilities of the form P(w|r)
• These are hard to estimate, since in practice we have no information about the relevant documents (finding them is precisely the task)
• Heuristic methods have been proposed for approximating this probability; we seek a well-founded theoretical alternative
A New Approach (1) • Is there a result grounded in probability theory that can help us approximate this probability without prior training data? • Yes. Proposed by Lavrenko and Croft (U Massachusetts Amherst) • Model: queries and relevant documents are random samples from an underlying relevance model R • A relevance model (as defined by Lavrenko and Croft), is a mechanism that determines the probability P(w|r) of observing a word w in a document relevant to a particular information need • It also assigns the probabilities P(Q|r) to the various queries that might be issued by the user for that specific information need
A New Approach (2)
• The approach just described can be summarized as follows: both the query and the relevant documents are generated by sampling from the same underlying relevance model R
• Note that this differs from the language modeling framework: we do not assume that the query is a random sample from a specific document; instead, both the query and the documents are samples from an unknown relevance model R
• Also note that in this approach the sampling process can differ between queries and documents
A New Approach (3)
• Let Q be a query of the form Q = (q1,…,qk), where each qi is a word
• Assume we have an unknown process R (a black box) from which we repeatedly sample words; after k samplings we observe the words q1,…,qk
• What is the probability that the next word we pull out of R will be w?

$$P(w \mid q_1,\dots,q_k) = \frac{P(w, q_1,\dots,q_k)}{P(q_1,\dots,q_k)}$$

• To ensure the proper additivity of the model, we normalize by summing over all words v in the vocabulary V:

$$P(w \mid q_1,\dots,q_k) = \frac{P(w, q_1,\dots,q_k)}{\sum_{v \in V} P(v, q_1,\dots,q_k)}$$

• The challenge now lies in estimating the joint probability P(w, q1,…,qk)
• For this purpose we use two techniques: i.i.d. sampling and conditional sampling
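The normalization step above can be sketched directly (names and toy joint values are illustrative assumptions):

```python
def normalize(joint):
    """Turn unnormalized joint probabilities P(w, q1..qk), one entry per
    vocabulary word w, into a proper distribution P(w | q1..qk)."""
    total = sum(joint.values())
    return {w: p / total for w, p in joint.items()}

# Hypothetical joint values for a 3-word vocabulary.
joint = {"apple": 0.02, "banana": 0.01, "cherry": 0.01}
post = normalize(joint)
assert abs(sum(post.values()) - 1.0) < 1e-12  # proper additivity
assert abs(post["apple"] - 0.5) < 1e-12       # 0.02 / 0.04
```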
Method I: Independent and Identically Distributed (i.i.d.) Sampling
• Assume the query words q1,…,qk and the words in relevant documents are sampled independently and identically (i.i.d.) from a unigram distribution M
• The sampling process proceeds as follows: we choose a distribution M with probability P(M) and sample from it k+1 times; the total probability of observing w together with q1,…,qk is then the weighted sum:

$$P(w, q_1,\dots,q_k) = \sum_{M} P(M)\, P(w, q_1,\dots,q_k \mid M)$$

• Since qi and w are sampled i.i.d. from M, this factorizes as:

$$P(w, q_1,\dots,q_k) = \sum_{M} P(M)\, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)$$

• Plugging this into the normalization equation yields the final result:

$$P(w \mid q_1,\dots,q_k) = \frac{\sum_{M} P(M)\, P(w \mid M) \prod_{i=1}^{k} P(q_i \mid M)}{\sum_{v \in V} \sum_{M} P(M)\, P(v \mid M) \prod_{i=1}^{k} P(q_i \mid M)}$$
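A minimal sketch of the Method I joint estimate (the two unigram models and their prior are hypothetical toy data):

```python
def joint_iid(w, query, models, priors):
    """Method I: P(w, q1..qk) = sum_M P(M) * P(w|M) * prod_i P(qi|M),
    where w and every qi are drawn i.i.d. from the same unigram M."""
    total = 0.0
    for name, M in models.items():
        prod = M.get(w, 0.0)
        for q in query:
            prod *= M.get(q, 0.0)
        total += priors[name] * prod
    return total

# Two hypothetical unigram models with a uniform prior over them.
models = {
    "m1": {"cat": 0.5, "dog": 0.5},
    "m2": {"cat": 0.1, "dog": 0.9},
}
priors = {"m1": 0.5, "m2": 0.5}
p = joint_iid("dog", ["dog"], models, priors)
# 0.5*(0.5*0.5) + 0.5*(0.9*0.9) = 0.125 + 0.405 = 0.53
assert abs(p - 0.53) < 1e-12
```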
Method II: Conditional Sampling
• Now consider a different approach: we fix the value of w according to some prior probability P(w)
• We sample each query word qi from a unigram model Mi with probability P(qi|Mi); in essence, the query words are independent of each other, but they retain their dependence on w:

$$P(w, q_1,\dots,q_k) = P(w) \prod_{i=1}^{k} P(q_i \mid w)$$

• Taking an expectation over the universe $\mathcal{M}$ of unigram models yields:

$$P(q_i \mid w) = \sum_{M \in \mathcal{M}} P(q_i \mid M)\, P(M \mid w)$$

• Plugging this into the previous equation gives the final result:

$$P(w, q_1,\dots,q_k) = P(w) \prod_{i=1}^{k} \sum_{M \in \mathcal{M}} P(q_i \mid M)\, P(M \mid w)$$
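A minimal sketch of the Method II joint estimate (the models, the word prior, and P(M|w) below are hypothetical toy data):

```python
def joint_conditional(w, query, p_w, models, p_model_given_w):
    """Method II: P(w, q1..qk) = P(w) * prod_i sum_M P(qi|M) * P(M|w);
    each query word may come from a different unigram model M."""
    result = p_w[w]
    for q in query:
        result *= sum(M.get(q, 0.0) * p_model_given_w[w][name]
                      for name, M in models.items())
    return result

models = {"m1": {"cat": 0.5, "dog": 0.5}, "m2": {"cat": 0.1, "dog": 0.9}}
p_w = {"dog": 0.4}
p_model_given_w = {"dog": {"m1": 0.25, "m2": 0.75}}
p = joint_conditional("dog", ["dog"], p_w, models, p_model_given_w)
# 0.4 * (0.5*0.25 + 0.9*0.75) = 0.4 * 0.8 = 0.32
assert abs(p - 0.32) < 1e-12
```

Unlike Method I, the inner sum is taken per query word, which is exactly the weaker independence assumption discussed next.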
Comparison of the Two Methods: i.i.d. Sampling vs. Conditional Sampling
• The i.i.d. sampling model makes the stronger independence assumption: w and all query words must come from the same unigram distribution M
• The conditional sampling model is less constrained, allowing the query words to come from different distributions
• In practice the second model performs better; it is more robust and less sensitive to the choice of the universe of distributions
• From this point on the focus is on the conditional sampling model; it is also the method of choice for benchmarking
Conditional Sampling: Estimation Details
• Recall the estimate provided by this method:

$$P(w, q_1,\dots,q_k) = P(w) \prod_{i=1}^{k} \sum_{M \in \mathcal{M}} P(q_i \mid M)\, P(M \mid w)$$

• The individual terms are estimated from the retrieved collection: P(qi|M) as a smoothed maximum-likelihood estimate over the document underlying M, and P(M|w) via Bayes' rule from P(w|M) and a prior P(M) over the models
• With this estimation, the classic probabilistic approach to information retrieval outperforms language models and other more sophisticated approaches
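A minimal sketch of the smoothed per-document estimate P(q|M) (Jelinek-Mercer-style interpolation with the collection background; the names and the mixing weight lam are illustrative assumptions):

```python
from collections import Counter

def smoothed_p(word, doc_terms, coll_counts, coll_size, lam=0.6):
    """Smoothed estimate P(w | M_D): interpolate the document's maximum-
    likelihood estimate with the collection-wide background model."""
    ml = Counter(doc_terms)[word] / len(doc_terms)
    bg = coll_counts.get(word, 0) / coll_size
    return lam * ml + (1 - lam) * bg

doc = "fox fox dog".split()
coll = {"fox": 10, "dog": 40, "cat": 50}
p = smoothed_p("fox", doc, coll, 100, lam=0.6)
# 0.6 * (2/3) + 0.4 * (10/100) = 0.4 + 0.04 = 0.44
assert abs(p - 0.44) < 1e-12
```

The background term keeps P(q|M) nonzero for words absent from the document, which the product over query words in the conditional sampling formula requires.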
Summary
• Probabilistic models and language models share the same underlying probabilistic foundation
• However, the statistical estimation of their parameters differentiates them
• Traditional probabilistic models have had difficulty estimating the model parameters without any training data
• A novel approach proposed by Lavrenko and Croft allows a good estimation of the probabilistic model's parameters
• With this estimation, probabilistic methods perform better in practice than models based on language modeling or other more complicated models