Probabilistic Model
• Objective: to capture the IR problem using a probabilistic framework;
• Given a user query, there is an ideal answer set;
• Querying is seen as specifying the properties of this ideal answer set (think of it as clustering);
• But what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set);
• Improve this description by iteration.
Probabilistic Model
• An initial set of documents is retrieved somehow;
• The user inspects these documents looking for the relevant ones (only the top 10-20 need to be inspected);
• The IR system uses this information to refine its description of the ideal answer set;
• By repeating this process, it is expected that the description of the ideal answer set will improve;
• Keep in mind that the description of the ideal answer set has to be guessed at the very beginning;
• The description of the ideal answer set is modeled in probabilistic terms.
Probabilistic Ranking Principle
• Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant);
• The model assumes that this probability of relevance depends on the query and the document representations only;
• The ideal answer set is referred to as R and should maximize the probability of relevance;
• Documents in the set R are predicted to be relevant.
Probabilistic Ranking Principle • But, • how to compute probabilities? • what is the sample space?
The Ranking
• The probabilistic ranking is computed as:
  sim(dj, q) = P(dj relevant to q) / P(dj non-relevant to q)
• This is the odds of the document dj being relevant;
• Taking the odds minimizes the probability of an erroneous judgment;
• Definitions:
  wij ∈ {0, 1} (binary index term weights);
  P(R | dj): probability that document dj is relevant;
  P(R̄ | dj): probability that document dj is not relevant.
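To make the binary weights wij concrete, here is a minimal sketch in Python; the function name, the whitespace tokenizer, and the explicit vocabulary are illustrative assumptions, not part of the model itself:

```python
def binary_terms(text, vocabulary):
    """Binary index-term weights: w_ij = 1 iff term k_i occurs in document d_j.

    Returning the set of vocabulary terms present in the text is equivalent
    to a 0/1 weight vector over the whole vocabulary.
    """
    tokens = set(text.lower().split())  # toy tokenizer; real systems normalize/stem terms
    return {k for k in vocabulary if k in tokens}
```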
The Ranking
• sim(dj, q) = P(R | dj) / P(R̄ | dj)
             = [P(dj | R) × P(R)] / [P(dj | R̄) × P(R̄)]   (Bayes' rule)
             ~ P(dj | R) / P(dj | R̄)   (P(R) and P(R̄) are the same for all documents)
• P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents.
The Ranking
• Assuming independence of the index terms:
  sim(dj, q) ~ P(dj | R) / P(dj | R̄)
             ~ [∏ P(ki | R) × ∏ P(k̄i | R)] / [∏ P(ki | R̄) × ∏ P(k̄i | R̄)]
  where, in each product pair, the first product runs over the index terms present in dj and the second over the terms absent from dj;
• P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents;
• P(k̄i | R): probability that ki is not present in such a document.
The Ranking
• Taking logarithms and dropping factors that are constant for all documents:
  sim(dj, q) ~ Σi wiq × wij × ( log [ P(ki | R) / P(k̄i | R) ] + log [ P(k̄i | R̄) / P(ki | R̄) ] )
  where:
  P(k̄i | R) = 1 - P(ki | R)
  P(k̄i | R̄) = 1 - P(ki | R̄)
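A minimal sketch of this ranking formula in Python; the function and parameter names are illustrative, and the probability estimates p_rel ≈ P(ki | R) and p_nonrel ≈ P(ki | R̄) are assumed to be supplied by the estimation steps on the following slides:

```python
import math

def prob_score(query_terms, doc_terms, p_rel, p_nonrel):
    """Score of document d_j for query q under the classic probabilistic model.

    query_terms, doc_terms : sets of index terms (binary weights, as above)
    p_rel[k]    ~ P(k_i | R)     : prob. that term k occurs in a relevant doc
    p_nonrel[k] ~ P(k_i | R-bar) : prob. that term k occurs in a non-relevant doc
    """
    score = 0.0
    for k in query_terms & doc_terms:  # w_iq * w_ij = 1 only for shared terms
        p, u = p_rel[k], p_nonrel[k]
        # log[ P(k|R) / (1 - P(k|R)) ] + log[ (1 - P(k|R-bar)) / P(k|R-bar) ]
        score += math.log(p / (1.0 - p)) + math.log((1.0 - u) / u)
    return score
```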
The Initial Ranking
• How can we compute the probabilities P(ki | R) and P(ki | R̄) at the start, when R is unknown?
• Initial estimates based on simplifying assumptions:
  P(ki | R) = 0.5
  P(ki | R̄) = ni / N
  where ni is the number of documents that contain ki and N is the total number of documents;
• Use this initial guess to retrieve an initial ranking;
• Improve upon this initial ranking.
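A sketch of these initial estimates, under the same illustrative conventions as above (doc_freq is assumed to map each term to its document frequency ni):

```python
def initial_estimates(vocabulary, doc_freq, N):
    """Initial guesses made before anything is known about the relevant set R."""
    p_rel = {k: 0.5 for k in vocabulary}                 # P(k_i | R) = 0.5
    p_nonrel = {k: doc_freq[k] / N for k in vocabulary}  # P(k_i | R-bar) = n_i / N
    return p_rel, p_nonrel
```

Feeding these estimates into prob_score above yields the initial ranking that the next slides refine.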
Improving the Initial Ranking
• Let:
  V : the set of documents initially retrieved (e.g., the top-ranked documents inspected by the user)
  Vi : the subset of V whose documents contain the term ki
  (below, V and Vi also denote the sizes of these sets)
• Re-evaluate the estimates:
  P(ki | R) = Vi / V
  P(ki | R̄) = (ni - Vi) / (N - V)
• Repeat this process recursively.
Improving the Initial Ranking
• To avoid problems with small values of V and Vi (e.g., V = 1 or Vi = 0), an adjustment factor is added:
  P(ki | R) = (Vi + 0.5) / (V + 1)
  P(ki | R̄) = (ni - Vi + 0.5) / (N - V + 1)
• Alternatively, ni/N can be used as the adjustment factor:
  P(ki | R) = (Vi + ni/N) / (V + 1)
  P(ki | R̄) = (ni - Vi + ni/N) / (N - V + 1)
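The re-estimation described on the last two slides, including the +0.5 adjustment, could look like the following sketch; term_docs is an assumed inverted index mapping each term to the set of documents that contain it:

```python
def refined_estimates(vocabulary, term_docs, N, retrieved):
    """Re-estimate the probabilities from the retrieved set V (treated as relevant).

    term_docs[k] : documents in the whole collection that contain term k
    retrieved    : V, the documents retrieved (or inspected by the user) so far
    The +0.5 / +1 adjustments avoid degenerate values when V = 1 or V_i = 0.
    """
    V = len(retrieved)
    p_rel, p_nonrel = {}, {}
    for k in vocabulary:
        ni = len(term_docs[k])                       # n_i
        Vi = len(term_docs[k] & retrieved)           # V_i
        p_rel[k] = (Vi + 0.5) / (V + 1)              # P(k_i | R)
        p_nonrel[k] = (ni - Vi + 0.5) / (N - V + 1)  # P(k_i | R-bar)
    return p_rel, p_nonrel
```

Alternating prob_score and refined_estimates implements the iterative loop described earlier.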
Pluses and Minuses
• Advantages:
  Documents are ranked in decreasing order of their probability of being relevant;
• Disadvantages:
  the need to guess the initial estimates for P(ki | R);
  the method does not take into account tf and idf factors (all index term weights are binary).
Brief Comparison of Classic Models
• The Boolean model does not provide for partial matching and is considered the weakest of the classic models;
• Salton and Buckley carried out a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections;
• This also seems to be the view of the research community.
Alternative Models
• Models based on fuzzy sets;
• Extensions of the Boolean model: continuous weights in the [0, 1] interval;
• Models based on latent semantic analysis (LSA);
• Models based on neural networks;
• Models based on Bayesian networks;
• Models for structured documents;
• Models for browsing;
• …