Understand probabilistic information retrieval models, such as the binary independence model, and the optimal retrieval principle in information retrieval systems. Learn about relevance probability, similarity functions, and the history of the models. Compute term relevance weights and estimate relevance probabilities. Explore the Robertson and Sparck Jones model and Croft’s probabilistic model.
CS533 Information Retrieval Dr. Michal Cutler Lecture #7 February 15, 2000
Probabilistic information retrieval • The model • Binary independence model • Non-binary independence models
Optimal Retrieval • Given a query Q and a collection • Optimal document retrieval principle • Arrange documents in descending order of probability of relevance to Q • Let OP-list denote the resulting list • If k documents are required • Take the first k documents from the OP-list
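The retrieval principle above can be sketched in a few lines of Python. This is a minimal illustration, assuming the probabilities of relevance are already available as a dictionary; the function name `top_k` and the sample values are hypothetical and not part of the lecture.

```python
def top_k(prob_of_relevance, k):
    """Return the first k document ids from the OP-list.

    prob_of_relevance maps a document id to its (assumed known)
    probability of relevance to the query Q.
    """
    # OP-list: documents in descending order of probability of relevance.
    op_list = sorted(prob_of_relevance, key=prob_of_relevance.get, reverse=True)
    return op_list[:k]


# Illustrative values only.
scores = {"D1": 0.2, "D2": 0.9, "D3": 0.5}
print(top_k(scores, 2))
```

In practice the probabilities themselves are not available, which is why the model replaces them with a similarity function that preserves the same ordering.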
The model • The probabilities of relevance of each document to the query are not available in practice • The following slide shows that they are not needed
The model cont. • Let rel denote the event that a document is relevant to the user • P(rel| Di) is the probability that Di is relevant to the user • We need a similarity function s so that: • P(rel|Di)>P(rel|Dj) iff s(Q, Di)>s(Q, Dj)
The similarity function cont. • For every document D we need to compute:
Some History • Maron and Kuhns 1960 • Robertson and Sparck Jones 1976 • Croft and Harper 1979 • Yu, Meng, and Park 1989 • Other models (Robertson and Walker, Kwok)
The model • The set of all documents is partitioned with respect to the query Q into the sets rel and nonrel. • The sets rel and nonrel change from query to query
An Independence Assumption • Terms are distributed independently of one another in both rel and nonrel • This is a very strong assumption • Example: Q = “What is happening with the impeachment trial?” • The assumption says that the occurrence of “impeachment” in relevant documents is independent of the occurrence of “trial”
The model • xi = di is the event that D has di occurrences of term i. • From independence assumption:
The model • Let g(x)=log(P(x|rel)/P(x|nonrel)) • The logarithm is used to make the calculations simpler by changing multiplications to sums
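Written out, the independence assumption and the definition of g(x) combine as follows. This is a reconstruction from the definitions on these slides, not a formula copied from them; t denotes the number of terms and x = (x_1, ..., x_t).

```latex
\begin{align*}
P(x \mid \mathit{rel}) &= \prod_{i=1}^{t} P(x_i = d_i \mid \mathit{rel}), &
P(x \mid \mathit{nonrel}) &= \prod_{i=1}^{t} P(x_i = d_i \mid \mathit{nonrel}) \\[4pt]
g(x) &= \log \frac{P(x \mid \mathit{rel})}{P(x \mid \mathit{nonrel})}
      = \sum_{i=1}^{t} \log \frac{P(x_i = d_i \mid \mathit{rel})}{P(x_i = d_i \mid \mathit{nonrel})}
\end{align*}
```

The logarithm turns the product over terms into a sum, so each term contributes to g(x) separately.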
Binary independence model • In this model a term occurs or does not occur in a document • Let x = (x1,…,xt) denote any document in the collection, where xi is 1 or 0.
Term relevance weights tri • The first sum in the formula for g(x = D): • Depends only on pi and qi which are the probabilities that the ith term occurs in the relevant and the non relevant documents • It is independent of the occurrence of terms in document D
Term relevance weights tri • The second sum depends on the actual terms which appear in D.
Term relevance weights tri • tri can be interpreted as the power of term i to discriminate between the relevant and the nonrelevant documents
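The two sums referred to above can be reconstructed from the binary model as follows. This is a sketch under the independence assumption, with p_i = P(x_i = 1 | rel) and q_i = P(x_i = 1 | nonrel); the lecture's own slide may order or group the terms differently.

```latex
\begin{align*}
g(x) &= \sum_{i=1}^{t} \log
        \frac{p_i^{x_i}\,(1-p_i)^{1-x_i}}{q_i^{x_i}\,(1-q_i)^{1-x_i}} \\
     &= \underbrace{\sum_{i=1}^{t} \log \frac{1-p_i}{1-q_i}}_{\text{independent of } D}
      \;+\; \underbrace{\sum_{i=1}^{t} x_i \, tr_i}_{\text{terms that occur in } D},
\qquad
tr_i = \log \frac{p_i}{1-p_i} + \log \frac{1-q_i}{q_i}
\end{align*}
```

Because the first sum is the same for every document, ranking documents by the second sum alone gives the same order as ranking by g(x).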
Notation • N is the total number of documents • R is the total number of relevant documents • ri is the number of relevant documents containing term i • dfi is the number of documents in which term i occurs
Term frequencies

Occurrence | Relevant documents | Non relevant documents | Total
xi = 1     | ri                 | dfi - ri               | dfi
xi = 0     | R - ri             | N - R - dfi + ri       | N - dfi
Total      | R                  | N - R                  | N
Computing tri • We can use the previous foil to compute: • pi = ri / R • qi = (dfi - ri) / (N - R) • 1 - pi = (R - ri) / R • 1 - qi = (N - R - dfi + ri) / (N - R)
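The computation above can be sketched in Python. This is a minimal example, assuming the counts N, R, ri and dfi are known; the function name and the sample counts are hypothetical.

```python
import math


def term_relevance_weight(N, R, r_i, df_i):
    """Compute tr_i = log(p_i / (1 - p_i)) + log((1 - q_i) / q_i).

    N: total documents, R: relevant documents,
    r_i: relevant documents containing term i,
    df_i: documents containing term i.
    """
    p_i = r_i / R                    # P(term i occurs | relevant)
    q_i = (df_i - r_i) / (N - R)     # P(term i occurs | non relevant)
    # Undefined when p_i or q_i is exactly 0 or 1 (division by zero or log of 0).
    return math.log(p_i / (1 - p_i)) + math.log((1 - q_i) / q_i)


# Illustrative counts only: p_i = 0.5 and q_i = 45/990, so tr_i = log(21).
print(term_relevance_weight(N=1000, R=10, r_i=5, df_i=50))
```

The sketch does no smoothing, so it fails for terms with ri = 0, ri = R, or dfi = ri; a real system would need some correction for those cases.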
Estimating pi and qi • ri and R are not known until the system has been used extensively and has collected relevance results • Various proposals have been made for estimating pi and qi • We discuss some of them
Estimating qi • Most documents in which term i occurs are non relevant to the average query • N is large
Estimating qi • qi can be estimated by the occurrence probability of the term in the entire collection • qi= dfi / N
Estimating tri • tri = log(pi / (1 - pi)) + log((1 - qi) / qi) = C + log((1 - qi) / qi) = C + log((1 - dfi / N) / (dfi / N)) = C + log((N - dfi) / dfi), where C = log(pi / (1 - pi)) • When pi is close to 1, C is very large
Estimating pi • Assume no relevance information • The probability of term i either occurring or not occurring in the smaller set of relevant documents can be assumed to be equal. • So pi=1/2
Estimating tri • In this case the term-relevance weight is: • tri = log 1 + log((N - dfi) / dfi) = log((N - dfi) / dfi) • This formula is a form of idf
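A short Python sketch of this idf-like weight, assuming pi = 1/2 and qi = dfi / N as above. The function name and the sample numbers are hypothetical.

```python
import math


def idf_like_weight(N, df_i):
    """tr_i = log((N - df_i) / df_i), valid for 0 < df_i < N."""
    return math.log((N - df_i) / df_i)


# Illustrative values only: rarer terms get larger weights.
for df in (10, 100, 500):
    print(df, idf_like_weight(1000, df))
```

Note that the weight is zero for a term that occurs in exactly half of the documents and negative for a term that occurs in more than half.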
Estimating r • If term i is an ideal indexing term, it occurs only in relevant documents. • In this case ri = dfi, • pi = dfi/ R and • qi= 0.
Estimating r • If term i is a poor indexing term, it is sprinkled evenly among the relevant and non relevant documents. In this case, ri can be estimated by • ri = (R / N) dfi
Estimating r • We can assume that an index term is in-between an ideal and a poor one • r is then estimated by a piecewise-linear function of df whose constants a, b, and c take intermediate values
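Estimating r • r = a df for 0 <= df <= R, with R/N < a < 1 • r = b + c df for R < df < N, with 0 < c < R/N • [Figure: r plotted against df, showing the lines r = df and r = (R/N) df; R is marked on the r axis, R and N on the df axis]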
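The piecewise estimate can be sketched in Python as follows. The function name and the sample constants are hypothetical, and choosing b so that the two segments meet at df = R is an assumption made here for illustration; the slides do not say how b is chosen.

```python
def estimate_r(df, R, a, b, c):
    """Piecewise-linear estimate of r for a term with document frequency df."""
    if df <= R:
        return a * df      # lies between r = (R/N) df and r = df, since R/N < a < 1
    return b + c * df      # flatter segment for terms with df > R


# Illustrative values only (not from the slides).
N, R = 1000, 100
a, c = 0.5, 0.05           # satisfy R/N < a < 1 and 0 < c < R/N
b = (a - c) * R            # assumption: the two segments meet at df = R

print(estimate_r(50, R, a, b, c))
print(estimate_r(200, R, a, b, c))
```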
Croft’s probabilistic model • Introduced term frequency into probabilistic model • Relevance estimated by including probability that a term appears in a document
Croft’s probabilistic model • Initial search • wijk = (C + idfi)fik where • i denotes the ith term in query j and document k, • C is a constant
Croft’s probabilistic model • fik=K+(1-K)freqik/max freqik • freqik is the frequency of occurrence of term i in document k • max freqik is the maximum frequency of any term in document k. • K is a constant
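The initial-search weight can be sketched in Python. This is a minimal example: the function names are hypothetical, idfi is passed in as a number because these slides do not fix its exact form, and the values chosen for C and K are illustrative assumptions.

```python
def normalized_tf(freq_ik, max_freq_k, K):
    """f_ik = K + (1 - K) * freq_ik / max_freq_k, with 0 <= K <= 1."""
    return K + (1 - K) * freq_ik / max_freq_k


def initial_weight(idf_i, freq_ik, max_freq_k, C, K):
    """w_ijk = (C + idf_i) * f_ik for term i of query j in document k."""
    return (C + idf_i) * normalized_tf(freq_ik, max_freq_k, K)


# Illustrative values only: the term occurs 2 times and the most
# frequent term in the document occurs 4 times.
print(normalized_tf(2, 4, K=0.5))
print(initial_weight(idf_i=2.0, freq_ik=2, max_freq_k=4, C=1.0, K=0.5))
```

With K = 1 the within-document frequency is ignored and the weight depends only on C + idfi; with K = 0 the weight scales linearly with the normalized frequency.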
Croft’s probabilistic model • Search using feedback • pij (qij) is the probability that term i occurs in the set of relevant (non relevant) documents for query j
Croft’s probabilistic model • It is assumed that nonretrieved documents are not relevant • So both R and r can be estimated from the provided feedback • A problem arises when the user indicates that no relevant documents were retrieved