CS 430: Information Discovery Lecture 17 Probabilistic Information Retrieval
Course Administration Midterm Examination Kimball B11, 7:30 to 9:00 pm on Wednesday, October 31. Assignment 3 Revised version now online: • Clarifies requirements, e.g., precedence of operators, stemming with wild cards, etc. • Detailed submission requirements, so that we can better grade and comment on your work.
Three Approaches to Information Retrieval Many authors divide the methods of information retrieval into three categories: Boolean (based on set theory) Vector space (based on linear algebra) Probabilistic (based on Bayesian statistics) In practice, the latter two have considerable overlap.
Probability Ranking Principle "If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately a possible on the basis of whatever data made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data." W.S. Cooper
Probabilistic Ranking Basic concept: "For a given query, if we know some documents that are relevant, terms that occur in those documents should be given greater weighting in searching for other relevant documents. By making assumptions about the distribution of terms and applying Bayes Theorem, it is possible to derive weights theoretically." Van Rijsbergen
Probability Theory -- Bayesian Formulas

Notation
Let a, b be two events. P(a | b) is the probability of a given b. ā is the event not a.

Bayes Theorem
P(a | b) = P(b | a) P(a) / P(b)
P(ā | b) = P(b | ā) P(ā) / P(b)

Derivation
P(a | b) P(b) = P(a ∩ b) = P(b | a) P(a)
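As a quick numeric check of these formulas (a minimal sketch in Python; the probability values are invented for illustration):

```python
# Minimal numeric check of Bayes Theorem; the probabilities are invented.
p_a = 0.3              # P(a)
p_b_given_a = 0.8      # P(b | a)
p_b_given_not_a = 0.1  # P(b | not a)

# Total probability: P(b) = P(b | a) P(a) + P(b | not a) P(not a)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes Theorem: P(a | b) = P(b | a) P(a) / P(b)
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))  # 0.774
```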
Concept

R is a set of documents that are guessed to be relevant; R̄ is the complement of R.

1. Guess a preliminary probabilistic description of R and use it to retrieve a first set of documents.
2. Interact with the user to refine the description.
3. Repeat, thus generating a succession of approximations to R.
Probabilistic Principle

Given a user query q and a document dj, the model estimates the probability that the user finds dj relevant, i.e., P(R | dj).

similarity (dj, q) = P(R | dj) / P(R̄ | dj)

                   = [P(dj | R) x P(R)] / [P(dj | R̄) x P(R̄)]     by Bayes Theorem

                   = [P(dj | R) / P(dj | R̄)] x constant
Binary Independence Retrieval Model (BIR)

Suppose that the weights for term i in document dj and query q are wi,j and wi,q, where all weights are 0 or 1.

Let P(ki | R) be the probability that index term ki is present in a document randomly selected from the set R.

If the index terms are independent, after some mathematical manipulation, taking logs and ignoring factors that are constant for all documents:

similarity (dj, q) = Σi wi,q x wi,j x ( log [P(ki | R) / (1 - P(ki | R))] + log [(1 - P(ki | R̄)) / P(ki | R̄)] )
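A minimal sketch of this score in Python, assuming binary weights are represented as sets of terms; the function name and the probability dictionaries are illustrative, not from the lecture:

```python
import math

def bir_similarity(doc_terms, query_terms, p_rel, p_nonrel):
    """Binary Independence Retrieval score (a sketch).

    doc_terms, query_terms: sets of index terms (binary weights).
    p_rel[k]:    estimate of P(ki | R), term present in a relevant document.
    p_nonrel[k]: estimate of P(ki | not R), term in a non-relevant document.
    """
    score = 0.0
    # wi,q x wi,j is 1 only for terms present in both query and document.
    for k in query_terms & doc_terms:
        score += math.log(p_rel[k] / (1 - p_rel[k]))
        score += math.log((1 - p_nonrel[k]) / p_nonrel[k])
    return score
```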
Estimates of P(ki | R)

Initial guess, with no information to work from:

P(ki | R) = c
P(ki | R̄) = ni / N

where:
c is an arbitrary constant, e.g., 0.5
ni is the number of documents that contain ki
N is the total number of documents in the collection
Improving the Estimates of P(ki | R)

Human feedback -- relevance feedback

Automatically:
(a) Run query q using the initial values. Consider the t top-ranked documents. Let r be the number of these documents that contain the term ki.
(b) The new estimates are:
P(ki | R) = r / t
P(ki | R̄) = (ni - r) / (N - t)

Note: The ratio of these two terms, with minor changes of notation and taking logs, gives w2 on page 368 of Frakes.
Continuation

similarity (dj, q)

= Σi wi,q x wi,j x ( log [P(ki | R) / (1 - P(ki | R))] + log [(1 - P(ki | R̄)) / P(ki | R̄)] )

= Σi wi,q x wi,j x ( log r/(t - r) + log (N + r - t - ni)/(ni - r) )

= Σi wi,q x wi,j x log [ {r/(t - r)} / {(ni - r)/(N + r - t - ni)} ]

Note: With a minor change of notation, this is w4 on page 368 of Frakes.
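As a sketch, the resulting per-term weight can be computed directly from the four counts. One detail not covered above: in practice 0.5 is commonly added to each cell of the contingency table so that a zero count does not produce division by zero or the log of zero.

```python
import math

def feedback_weight(r, t, n_i, N):
    """Term weight from the derivation above, with the common +0.5
    smoothing added to each cell so no count is zero (a sketch).

    r:   top-ranked documents containing the term
    t:   number of top-ranked documents examined
    n_i: documents in the collection containing the term
    N:   total number of documents
    """
    num = (r + 0.5) / (t - r + 0.5)                   # relevant: with / without term
    den = (n_i - r + 0.5) / (N + r - t - n_i + 0.5)   # non-relevant: with / without
    return math.log(num / den)

# e.g. feedback_weight(r=8, t=10, n_i=50, N=1000) ≈ 4.33
```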
Probabilistic Weighting

w = log [ (r / (R - r)) / ((n - r) / (N - R)) ]

i.e., w = log of (number of relevant documents with term t / number of relevant documents without term t) divided by (number of non-relevant documents with term t / number of non-relevant documents in the collection)

where:
N  number of documents in the collection
R  number of relevant documents for query q
n  number of documents with term t
r  number of relevant documents with term t
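The same weight in code form, in the slide's notation (a sketch that assumes every count is strictly positive; the smoothing note from the previous sketch applies here too):

```python
import math

def relevance_weight(r, R, n, N):
    """w = log [ (r/(R-r)) / ((n-r)/(N-R)) ] in the slide's notation (a sketch)."""
    return math.log((r / (R - r)) / ((n - r) / (N - R)))

# e.g. N=1000 documents, R=10 relevant, n=50 with the term, r=8 relevant with it:
# relevance_weight(8, 10, 50, 1000) -> log(4 / 0.0424...) ≈ 4.55
```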
Discussion of Probabilistic Model

Advantages
• Based on a firm theoretical foundation

Disadvantages
• The initial definition of R has to be guessed.
• Weights ignore term frequency within a document.
• Assumes index terms are independent (as does the vector model).
Review of Weighting

The objective is to measure the similarity between a document and a query using statistical (not linguistic) methods. The concept is to weight terms by some factor based on the distribution of terms within and between documents.

In general:
(a) Weight is an increasing function of the number of times that the term appears in the document.
(b) Weight is a decreasing function of the number of documents that contain the term (or of the total number of occurrences of the term).
(c) Weight needs to be adjusted for documents that differ greatly in length.
Normalization of Within Document Frequency (Term Frequency)

Normalization moderates the effect of high-frequency terms. Croft's normalization:

cfij = K + (1 - K) fij / mi     (fij > 0)

fij is the frequency of term j in document i
cfij is Croft's normalized frequency
mi is the maximum frequency of any term in document i
K is a constant between 0 and 1 that is adjusted for the collection

K should be set to low values (e.g., 0.3) for collections with long documents (35 or more terms). K should be set to higher values (greater than 0.5) for collections with short documents.
Normalization of Within Document Frequency (Term Frequency) -- Examples

Croft's normalization: cfij = K + (1 - K) fij / mi     (fij > 0)

document length   K     mi    weight (most      weight (least
                              frequent term)    frequent term)
20                0.3    5    1.00              0.44
20                0.3    2    1.00              0.65
100               0.5   25    1.00              0.52
100               0.5    2    1.00              0.75
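A one-line implementation reproduces the table above (a sketch; the least frequent term is assumed to occur once, so fij = 1, while the most frequent term has fij = mi):

```python
def croft_tf(f_ij, m_i, K):
    """Croft's normalized within-document frequency, for f_ij > 0."""
    return K + (1 - K) * f_ij / m_i

print(croft_tf(5, 5, 0.3))   # most frequent term: 1.00
print(croft_tf(1, 5, 0.3))   # least frequent term: 0.44
print(croft_tf(1, 2, 0.3))   # 0.65
print(croft_tf(1, 25, 0.5))  # 0.52
print(croft_tf(1, 2, 0.5))   # 0.75
```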
Measures of Within Document Frequency

(c) Salton and Buckley recommend using different weightings for documents and queries:

documents:  fik for terms in collections of long documents
            1 for terms in collections of short documents

queries:    cfik with K = 0.5 for general use
            fik for long queries (cfik with K = 0)
Ranking -- Practical Experience

1. The basic method is the inner (dot) product with no weighting.
2. Cosine (dividing by the product of the vector lengths) normalizes for vectors of different lengths.
3. Term weighting using the frequency of terms in a document usually improves ranking.
4. Term weighting using an inverse function of the number of documents in the entire collection that contain the term (e.g., IDF) improves ranking.
5. Weightings for document structure improve ranking.
6. Relevance weightings after initial retrieval improve ranking.

The effectiveness of these methods depends on the characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.
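A compact sketch combining points 1 to 4: term frequencies weighted by IDF and scored with a length-normalized dot product (cosine). The function name and the dictionary representation are illustrative assumptions:

```python
import math

def cosine_score(doc_tf, query_tf, idf):
    """Cosine similarity of tf-idf vectors, given term-frequency dicts (a sketch)."""
    # Weight each term frequency by its inverse document frequency.
    d = {t: f * idf.get(t, 0.0) for t, f in doc_tf.items()}
    q = {t: f * idf.get(t, 0.0) for t, f in query_tf.items()}
    # Dot product over shared terms, normalized by the two vector lengths.
    dot = sum(d[t] * q[t] for t in d.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in d.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0
```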
Inverse Document Frequency (IDF)

(a) Simplest to use is 1 / dk (Salton)

dk  number of documents that contain term k

(b) Normalized forms:

IDFi = log2 (N / ni) + 1    or    IDFi = log2 (maxn / ni) + 1     (Sparck Jones)

N     number of documents in the collection
ni    total number of occurrences of term i in the collection
maxn  maximum frequency of any term in the collection
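As a sketch of the first normalized form, with invented counts:

```python
import math

def idf(N, n_i):
    """Normalized inverse document frequency: log2(N / n_i) + 1 (a sketch)."""
    return math.log2(N / n_i) + 1

# A term appearing in 1 of 1024 documents scores far higher
# than one appearing in half of them:
print(idf(1024, 1))    # 11.0
print(idf(1024, 512))  # 2.0
```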