Probabilistic Information Retrieval CSE6392 - Database Exploration Gautam Das Thursday, March 29, 2006 Z.M. Joseph Spring 2006, CSE, UTA
Basic Rules of Probability • Recall the product rule: P(A, B) = P(A|B) P(B) = P(B|A) P(A) • Bayes' Theorem: P(A|B) = P(B|A) P(A) / P(B)
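As a quick refresher (the original slide showed these formulas as images), Bayes' Theorem follows directly from writing the product rule both ways; a minimal LaTeX rendering of that step:

```latex
\begin{align*}
P(A \cap B) &= P(A \mid B)\,P(B) = P(B \mid A)\,P(A)
  && \text{(product rule, both factorings)}\\
\Rightarrow\quad P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
  && \text{(divide by } P(B) > 0\text{)}
\end{align*}
```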
Basic Assumptions • Assume a database D consisting of a set of objects: documents, tuples, etc. • Q : query • R : 'relevant set' of objects for Q • Goal: find R for each Q, given D • Instead of a deterministic answer set, consider a probabilistic ordering • The ranking/scoring function should reflect each document's degree of relevance • Thus, given a document d: Score(d) = P(R|d), the probability that d is relevant [1] • According to this, if the relevance set were known exactly, the members of R would get probability 1 (the maximum score) and all other documents would get probability 0.
Simplification • From [1], take the ratio of the probability that a document is in R to the probability that it is not: Score(d) = P(R|d) / P(¬R|d) • This monotone transformation retains the old ordering, while also factoring in the elements outside R which are part of D.
Applying Bayes' Theorem • Applying Bayes' Theorem to the numerator and denominator simplifies the ratio as follows: P(R|d) / P(¬R|d) = [P(d|R) P(R)] / [P(d|¬R) P(¬R)] • Since P(R) / P(¬R) is the same for every document, we can rank by P(d|R) / P(d|¬R).
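A compact LaTeX version of the derivation up to this point (a sketch of the standard odds-ratio argument, reconstructed from the surrounding slides):

```latex
\begin{align*}
\mathrm{Score}(d) = P(R \mid d)
  &\;\propto\; \frac{P(R \mid d)}{P(\bar{R} \mid d)}
  && \text{(rank by odds; monotone, so ordering is unchanged)}\\
&= \frac{P(d \mid R)\,P(R)}{P(d \mid \bar{R})\,P(\bar{R})}
  && \text{(Bayes' Theorem on numerator and denominator)}\\
&\;\propto\; \frac{P(d \mid R)}{P(d \mid \bar{R})}
  && \text{(}P(R)/P(\bar{R})\text{ is constant across documents)}
\end{align*}
```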
Observations • This ratio forms the scoring function • The equation still retains R, which we do not know • The ordering will nevertheless be the same when this equation is used as a scoring function, since only the document-dependent factors affect the ranking.
Derivation for Keyword Queries • Now assume that a query is a vector of words, with zero probability assigned to a word that does not occur in a document • Assuming the query words occur independently, applying the previous equation to each word w (instead of to the whole document) and multiplying over the words of the query gives: Score(d) = ∏_{w ∈ Q ∩ d} P(w|R) / P(w|D), where P(w|¬R) is approximated by P(w|D) because most documents are not relevant.
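The same formula in LaTeX, together with the log form in which such products are usually evaluated (the log step is a standard convenience, not from the slides):

```latex
\begin{align*}
\mathrm{Score}(d) &= \prod_{w \in Q \cap d} \frac{P(w \mid R)}{P(w \mid D)}
  && \text{(term independence; } P(w \mid \bar{R}) \approx P(w \mid D)\text{)}\\
\log \mathrm{Score}(d) &= \sum_{w \in Q \cap d} \bigl[\log P(w \mid R) - \log P(w \mid D)\bigr]
  && \text{(log form avoids underflow; same ordering)}
\end{align*}
```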
Search for “Microsoft Corporation” • For this query the expression becomes: Score(d) = ∏_{w ∈ {Microsoft, Corporation} ∩ d} P(w|R) / P(w|D) • Assume you had two documents: • D1 : contains ‘Microsoft’ but not ‘Corporation’ • D2 : contains ‘Corporation’ but not ‘Microsoft’ • Thus: Score(D1) ∝ P(Microsoft|R) / P(Microsoft|D) and Score(D2) ∝ P(Corporation|R) / P(Corporation|D).
Search for “Microsoft Corporation” • Because ‘Corporation’ is more common in the database D, P(Corporation|D) will be far higher than P(Microsoft|D) • Since these terms appear in the denominator, Score(D1) will be higher than Score(D2) • Thus the document containing ‘Microsoft’ gets the higher ranking, because that word is more specific than ‘Corporation’ • This behavior is similar to vector-space ranking, where rarer terms carry more weight.
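A minimal Python sketch of this scoring scheme, assuming term independence and approximating P(w|¬R) by the corpus frequency P(w|D); the document contents and the flat P(w|R) = 0.5 estimate are illustrative assumptions, not from the slides:

```python
from math import log

# Toy database: each document is a set of words (illustrative data only).
database = {
    "D1": {"microsoft", "windows", "software"},
    "D2": {"corporation", "annual", "report"},
    "D3": {"corporation", "tax", "filing"},
}

def p_w_given_db(word):
    """Fraction of documents in D containing the word: estimate of P(w|D)."""
    matches = sum(1 for doc in database.values() if word in doc)
    return matches / len(database)

def score(doc_words, query_words, p_w_given_r=0.5):
    """log Score(d) = sum over matching query words of log P(w|R) - log P(w|D).

    P(w|R) is unknown without relevance feedback, so a flat 0.5 is used as a
    starting assumption; only words present in the document contribute,
    mirroring the zero-probability rule on the slide.
    """
    total = 0.0
    for w in query_words & doc_words:
        total += log(p_w_given_r) - log(p_w_given_db(w))
    return total

query = {"microsoft", "corporation"}
for name, words in database.items():
    print(name, round(score(words, query), 3))
# 'corporation' appears in 2 of 3 documents, 'microsoft' in only 1,
# so D1 (the 'microsoft' document) gets the higher score.
```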
Relevance Feedback • Can keep fine-tuning R by getting user feedback on initial rankings. • Once a better R is known, better scoring and ranking of matches is possible.
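One way relevance feedback could plug into the sketch above: re-estimate P(w|R) from the documents the user marks as relevant. The helper below is a hypothetical illustration with add-one smoothing to avoid zero probabilities, not something from the slides:

```python
def estimate_p_w_given_r(word, relevant_docs):
    """Estimate P(w|R) as the smoothed fraction of user-marked
    relevant documents that contain the word."""
    matches = sum(1 for doc in relevant_docs if word in doc)
    # Add-one (Laplace) smoothing keeps unseen words at a nonzero probability.
    return (matches + 1) / (len(relevant_docs) + 2)

# Example: the user marks D1 as relevant for the query "microsoft corporation".
relevant = [{"microsoft", "windows", "software"}]
print(estimate_p_w_given_r("microsoft", relevant))    # 2/3, about 0.667
print(estimate_p_w_given_r("corporation", relevant))  # 1/3, about 0.333
```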
PIR Applied to Databases • Originally PIR was applied to documents, not to databases • Applying PIR to databases is not easy, because several aspects are hard to capture • These include: • The different values an attribute can take: PIR is based on words in a document, but in a database the fact that a car is blue, black, etc. is not easily captured. Would you assign each color as a keyword? • What to sacrifice in ranking is also not easy to capture: if a user's preference is black cars, how is PIR applied when listing results that do not match the query entirely?