IR Models: Review (Vector Model and Probabilistic)

IR Model Taxonomy, organized by user task:
• Retrieval (ad hoc, filtering):
• Classic Models: Boolean, Vector, Probabilistic
• Set Theoretic: Fuzzy, Extended Boolean
• Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
• Probabilistic: Inference Network, Belief Network
• Structured Models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
Classic IR Models - Basic Concepts
• Each document is represented by a set of representative keywords, or index terms
• An index term is a document word useful for capturing the document's main themes
• The importance of each index term is represented by a weight associated with it
• Let
• ki be an index term
• dj be a document
• wij be the weight associated with the pair (ki, dj)
• The weight wij quantifies how well the index term describes the document's contents
Vector Model Similarity [figure: the angle θ between vec(dj) and vec(q)]
• sim(q,dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| × |q|) = [Σi wij × wiq] / (|dj| × |q|)
• TF-IDF term-weighting scheme for document terms:
• wij = [freq(i,j) / maxl freq(l,j)] × log(N/ni)
• Default query term weights:
• wiq = (0.5 + 0.5 × [freq(i,q) / maxl freq(l,q)]) × log(N/ni)
• where N is the number of documents in the collection and ni is the number of documents containing ki (a code sketch follows)
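A minimal runnable sketch of these formulas, assuming a toy corpus of term lists; all names and data here are illustrative, not from the slides:

```python
import math
from collections import Counter

# Toy corpus: each document is a list of index terms (illustrative data)
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "dog"]]
N = len(docs)
# ni: number of documents containing term i
df = Counter(t for d in docs for t in set(d))

def doc_weights(doc):
    """w_ij = [freq(i,j) / max_l freq(l,j)] * log(N / n_i)"""
    freq = Counter(doc)
    max_f = max(freq.values())
    return {t: (f / max_f) * math.log(N / df[t]) for t, f in freq.items()}

def query_weights(query):
    """w_iq = (0.5 + 0.5 * freq(i,q) / max_l freq(l,q)) * log(N / n_i)
    Assumes at least one query term occurs in the corpus."""
    freq = Counter(t for t in query if t in df)
    max_f = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_f) * math.log(N / df[t])
            for t, f in freq.items()}

def cosine_sim(wd, wq):
    """sim(q,dj) = sum_i(w_ij * w_iq) / (|dj| * |q|)"""
    dot = sum(w * wq[t] for t, w in wd.items() if t in wq)
    norm_d = math.sqrt(sum(w * w for w in wd.values()))
    norm_q = math.sqrt(sum(w * w for w in wq.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

wq = query_weights(["cat", "dog"])
for j, d in enumerate(docs):
    print(f"sim(q, d{j+1}) = {cosine_sim(doc_weights(d), wq):.3f}")
```

Note that terms occurring in every document get weight log(N/ni) = 0, which is exactly the idf intuition: such terms do not discriminate between documents.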
Example 1: no weights [figure: documents d1-d7 plotted in the term space spanned by k1, k2, k3]
Example 2: query weights [figure: the same documents d1-d7 in the k1, k2, k3 term space]
Example 3: query and document weights [figure: the same documents d1-d7 in the k1, k2, k3 term space]
Summary of Vector Space Model
• Advantages:
• term-weighting improves the quality of the answer set
• partial matching allows retrieval of documents that approximate the query conditions
• the cosine ranking formula sorts documents according to their degree of similarity to the query
• Disadvantages:
• assumes index terms are mutually independent; it is not clear, however, that this assumption actually hurts retrieval quality
Probabilistic Model
• Objective: capture the IR problem in a probabilistic framework
• Given a user query, there is an ideal answer set
• Querying is then the specification of the properties of this ideal answer set (akin to clustering)
• But what are these properties?
• Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
• Improve the description by iteration
Probabilistic Model
• An initial set of documents is retrieved somehow
• The user inspects these documents, looking for the relevant ones (in practice, only the top 10-20 need to be inspected)
• The IR system uses this relevance information to refine the description of the ideal answer set
• By repeating this process, the description of the ideal answer set is expected to improve
• The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle
• Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find dj relevant
• The model assumes that this probability of relevance depends only on the query and document representations
• The ideal answer set is referred to as R and should maximize the probability of relevance; documents in R are predicted to be relevant
• But how are these probabilities computed?
The Ranking
• The probabilistic ranking is computed as:
• sim(q,dj) = P(dj relevant to q) / P(dj not relevant to q)
• This is the odds of the document dj being relevant
• Ranking by the odds minimizes the probability of an erroneous judgement
• Definitions:
• wij ∈ {0,1}, wiq ∈ {0,1} (index term weights are binary)
• P(R | vec(dj)): probability that the given document is relevant
• P(¬R | vec(dj)): probability that the given document is not relevant
The Ranking
• sim(dj,q) = P(R | vec(dj)) / P(¬R | vec(dj))
• = [P(vec(dj) | R) × P(R)] / [P(vec(dj) | ¬R) × P(¬R)]   (by Bayes' rule)
• ~ P(vec(dj) | R) / P(vec(dj) | ¬R)   (P(R) and P(¬R) are the same for all documents, so they do not affect the ranking)
• P(vec(dj) | R): probability of randomly selecting the document dj from the set R of relevant documents
The Ranking
• sim(dj,q) ~ P(vec(dj) | R) / P(vec(dj) | ¬R)
• ~ ( [Π ki∈dj P(ki | R)] × [Π ki∉dj P(¬ki | R)] ) / ( [Π ki∈dj P(ki | ¬R)] × [Π ki∉dj P(¬ki | ¬R)] )
• P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents; P(¬ki | R): probability that it is absent
• Assumes the index terms are independent
The Ranking
• sim(dj,q) ~ ( [Π ki∈dj P(ki | R)] × [Π ki∉dj P(¬ki | R)] ) / ( [Π ki∈dj P(ki | ¬R)] × [Π ki∉dj P(¬ki | ¬R)] )
• math happens (take logarithms, use P(¬ki | R) = 1 − P(ki | R) and P(¬ki | ¬R) = 1 − P(ki | ¬R), and drop factors that are constant for all documents; spelled out below) ...
• ~ Σi wiq × wij × ( log [ P(ki | R) / (1 − P(ki | R)) ] + log [ (1 − P(ki | ¬R)) / P(ki | ¬R) ] )
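The "math happens" step can be sketched in LaTeX as follows (taking the log is safe because it is monotonic and preserves the ranking; \bar{R} denotes the complement of R):

```latex
\begin{align*}
\mathrm{sim}(d_j,q)
 &\sim \log \frac{\prod_{k_i \in d_j} P(k_i \mid R)\,\prod_{k_i \notin d_j} \bigl(1 - P(k_i \mid R)\bigr)}
                 {\prod_{k_i \in d_j} P(k_i \mid \bar{R})\,\prod_{k_i \notin d_j} \bigl(1 - P(k_i \mid \bar{R})\bigr)} \\
 &= \sum_{k_i \in d_j} \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
  + \sum_{k_i \in d_j} \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})}
  + \underbrace{\sum_{i} \log \frac{1 - P(k_i \mid R)}{1 - P(k_i \mid \bar{R})}}_{\text{constant for all } d_j} \\
 &\sim \sum_i w_{iq}\, w_{ij}
    \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
         + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
\end{align*}
```

The last step restricts the sum to terms occurring in the query (terms not in the query are assumed to be distributed identically in relevant and non-relevant documents), which is exactly what the binary weights wiq enforce.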
The Initial Ranking
• sim(dj,q) ~ Σi wiq × wij × ( log [ P(ki | R) / (1 − P(ki | R)) ] + log [ (1 − P(ki | ¬R)) / P(ki | ¬R) ] )
• How do we obtain the probabilities P(ki | R) and P(ki | ¬R)?
• Estimates based on initial assumptions:
• P(ki | R) = 0.5
• P(ki | ¬R) = ni / N, where ni is the number of documents that contain ki and N is the collection size
• Use this initial guess to retrieve an initial ranking (a sketch follows)
• Then improve upon this initial ranking
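A sketch of the initial ranking under these assumptions; the bir_score helper and the toy corpus are illustrative inventions, only the two estimation formulas come from the slide:

```python
import math

# Toy corpus: documents as sets of index terms (illustrative data)
docs = [{"cat", "sat", "mat"}, {"dog", "sat", "log"},
        {"cat", "dog"}, {"sat", "mat"}, {"log", "mat"}]
query = {"cat", "dog"}
N = len(docs)
n = {t: sum(t in d for d in docs) for t in query}  # ni per query term

def bir_score(doc, p_rel, p_nonrel):
    """sim(dj,q) = sum over query terms present in dj of
       log[p/(1-p)] + log[(1-u)/u], with p = P(ki|R), u = P(ki|notR).
       Assumes 0 < p < 1 and 0 < u < 1 (guaranteed after smoothing)."""
    score = 0.0
    for t in query & doc:
        p, u = p_rel[t], p_nonrel[t]
        score += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return score

# Initial assumptions: P(ki|R) = 0.5, P(ki|notR) = ni/N
p_rel = {t: 0.5 for t in query}
p_nonrel = {t: n[t] / N for t in query}

ranked = sorted(range(N),
                key=lambda j: bir_score(docs[j], p_rel, p_nonrel),
                reverse=True)
print(ranked)  # doc indices in decreasing order of the initial score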
Improving the Initial Ranking
• sim(dj,q) ~ Σi wiq × wij × ( log [ P(ki | R) / (1 − P(ki | R)) ] + log [ (1 − P(ki | ¬R)) / P(ki | ¬R) ] )
• Let
• V: the set of documents initially retrieved and ranked highly; these are treated as relevant even though we are not sure
• Vi: the subset of V whose documents contain ki
• Re-evaluate the estimates (writing V and Vi also for the set sizes):
• P(ki | R) = Vi / V
• P(ki | ¬R) = (ni − Vi) / (N − V)
• Repeat recursively
Improving the Initial Ranking
• sim(dj,q) ~ Σi wiq × wij × ( log [ P(ki | R) / (1 − P(ki | R)) ] + log [ (1 − P(ki | ¬R)) / P(ki | ¬R) ] )
• We need to avoid problems with small retrieved sets (e.g., when V = 1 and Vi = 0, the estimate P(ki | R) becomes 0):
• P(ki | R) = (Vi + 0.5) / (V + 1)
• P(ki | ¬R) = (ni − Vi + 0.5) / (N − V + 1)
• Alternatively, we can use the term's frequency in the corpus instead of the constant 0.5 (see the sketch below):
• P(ki | R) = (Vi + ni/N) / (V + 1)
• P(ki | ¬R) = (ni − Vi + ni/N) / (N − V + 1)
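Continuing the sketch above (it reuses docs, query, n, N, ranked, and bir_score), one refinement iteration with the smoothed estimates; the cutoff of 2 defining V is an arbitrary illustration, where in practice user judgements or the top 10-20 documents would define it:

```python
def refine_estimates(ranked, cutoff=2):
    """Treat the top-ranked docs as the set V and re-estimate:
       P(ki|R)    = (Vi + 0.5) / (V + 1)
       P(ki|notR) = (ni - Vi + 0.5) / (N - V + 1)
       The +0.5/+1 smoothing keeps all probabilities strictly
       between 0 and 1, even when V is tiny."""
    top = [docs[j] for j in ranked[:cutoff]]
    V = len(top)
    Vi = {t: sum(t in d for d in top) for t in query}
    p_rel = {t: (Vi[t] + 0.5) / (V + 1) for t in query}
    p_nonrel = {t: (n[t] - Vi[t] + 0.5) / (N - V + 1) for t in query}
    return p_rel, p_nonrel

# One feedback iteration; repeat until the ranking stabilizes
p_rel, p_nonrel = refine_estimates(ranked)
ranked = sorted(range(N),
                key=lambda j: bir_score(docs[j], p_rel, p_nonrel),
                reverse=True)
print(ranked)
```

The ni/N variant from the slide is a one-line change: replace the constant 0.5 with n[t]/N in both estimates.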
Probabilistic Model
• Advantages:
• documents are ranked in decreasing order of their probability of being relevant
• Disadvantages:
• the initial estimates for P(ki | R) must be guessed
• the method does not take into account tf and idf factors
• A newer use for the model:
• meta-search: use probabilities to model the value of different search engines for different topics
Brief Comparison of Classic Models
• The Boolean model does not provide for partial matches and is considered the weakest of the classic models
• Salton and Buckley ran a series of experiments indicating that, in general, the vector model outperforms the probabilistic model on general collections
• This also seems to be the prevailing view in the research community