Relevance Information: A Loss of Entropy but a Gain for IDF?
Arjen P. de Vries, arjen@acm.org
Thomas Roelleke, thor@dcs.qmul.ac.uk
SIGIR 2005
Motivation
• How should relevance information be incorporated in systems using TF*IDF term weighting?
• TF*IDF combines frequent occurrence (TF) with term discriminativeness (IDF)
• Adding relevance information to a retrieval system corresponds to a loss of entropy; how does this affect IDF, the measure of term discriminativeness?
Overview
• PART I: IDF, relevance, and BIR
• PART II: Alternative estimation for IDF
IDF
• A robust summary statistic of term occurrence that helps identify 'good' terms
• Follows naturally from the Binary Independence Retrieval model (BIR), as the ranking that results when no relevance information is available
• Related to the occurrence probability P(t|C); see the sketch below
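A minimal sketch of the document-frequency-based IDF just described (the function name and the use of the natural logarithm are illustrative choices; the log base only scales all weights by a constant):

```python
import math

def idf(n_t: int, N: int) -> float:
    """IDF(t,C) = -log P(t|C), with P(t|C) = n/N estimated from
    n_t = number of documents containing t and N = collection size."""
    return -math.log(n_t / N)

# Example: a term occurring in 10 of 1,000 documents
print(idf(10, 1000))  # ~4.61
```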
BIR
• Represents documents by binary term presence/absence
• Ranks documents by their probability of relevance
• Using the odds of relevance avoids estimating document-independent factors, without affecting the ranking
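The odds-based ranking the slide refers to, written out in the standard textbook BIR form (a restatement, not copied from the slides):

\[
O(R \mid d) \;=\; \frac{P(R \mid d)}{P(\bar R \mid d)}
\;\stackrel{\mathrm{rank}}{=}\;
\sum_{t \,\in\, d \cap q} \log
\frac{P(t \mid R)\, P(\bar t \mid \bar R)}{P(t \mid \bar R)\, P(\bar t \mid R)}
\]

Each summand is the per-term weight h(t) discussed on the next slide.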
BIR
• Without relevance information, h(t,C) = −log n/(N − n)
• (Almost) the discriminativeness of term t in collection C!
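A quick numeric check of the "(almost)", with illustrative numbers and the natural logarithm: for rare terms n ≪ N, so N − n ≈ N and h(t,C) nearly coincides with IDF(t,C):

\[
N = 1000,\; n = 10:\qquad
h(t,C) = -\log\tfrac{10}{990} \approx 4.60,
\qquad
\mathrm{IDF}(t,C) = -\log\tfrac{10}{1000} \approx 4.61
\]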
BIR and IDF
• View IDF as a term statistic in a set of documents, R or ¬R
• Then, the BIR probability estimation known as F4 corresponds to IDF(t,¬R) − IDF(t,R) + IDF(¬t,R) − IDF(¬t,¬R)
• IDF(t,R) can be interpreted as the discriminativeness of term presence among the relevant documents, etc.; a numeric check follows below
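A sketch verifying that decomposition numerically (variable names are illustrative, and the usual +0.5 smoothing of F4 is omitted so the identity is exact):

```python
import math

def idf(k: int, M: int) -> float:
    # IDF within a document set of size M in which k documents qualify
    return -math.log(k / M)

def f4_weight(r: int, R: int, n: int, N: int) -> float:
    """F4/RSJ term weight (unsmoothed), written as the IDF combination
    from the slide. r of the R relevant documents contain t;
    n of the N collection documents contain t."""
    return (idf(n - r, N - R)             # IDF(t, ¬R)
            - idf(r, R)                   # IDF(t, R)
            + idf(R - r, R)               # IDF(¬t, R)
            - idf(N - n - R + r, N - R))  # IDF(¬t, ¬R)

# Sanity check against the direct odds form of the F4 weight:
r, R, n, N = 8, 20, 50, 1000
direct = math.log(r * (N - n - R + r) / ((R - r) * (n - r)))
assert abs(f4_weight(r, R, n, N) - direct) < 1e-9
```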
BIR and IDF
• In practice, the 'complement method' gives ¬R = C\R ≈ C; so, usually, updating IDF under relevance information corresponds to subtracting IDF(t,R)!
• The BIR modifies h(t,C) most for terms that are rare in the relevant set, since such terms do not help identify good documents
Implication for TF*IDF systems
• A system using IDF(t,C) uses presence weighting only, implicitly assuming that term t occurs in all relevant documents (such that IDF(t,R) = −log R/R = 0)
• Systems using TF*IDF term weighting can thus incorporate relevance feedback in accordance with the binary independence retrieval model, as written out below
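Combining this with the complement method from the previous slide, the feedback update amounts to a single subtraction (a restatement of the slides' argument):

\[
\mathrm{IDF}'(t) \;=\; \mathrm{IDF}(t,C) - \mathrm{IDF}(t,R)
\;=\; -\log\frac{n}{N} + \log\frac{r}{R}
\]

where r of the R known relevant documents contain t.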
Estimating IDF
• Recall that IDF(t,C) = −log P(t|C), with P(t|C) the occurrence probability of t in C
• Assuming the events d are disjoint and exhaustive, we obtain P(t|C) = n/N (derivation below)
• Q: Is this the best method for estimation?
• Notice that, in the BIR formulation, the sets R and ¬R have very different cardinality…
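The step the second bullet alludes to, written out (assuming binary within-document occurrence P(t|d) ∈ {0,1} and a uniform prior P(d|C) = 1/N):

\[
P(t \mid C) \;=\; \sum_{d \in C} P(t \mid d)\, P(d \mid C)
\;=\; \sum_{d:\; t \in d} \frac{1}{N} \;=\; \frac{n}{N}
\]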
Estimating TF
• For TF weights, we know that estimating P(t|d) by a Poisson approximation (e.g., as applied in BM25) or by lifting (e.g., as applied in Inquery) leads to superior retrieval results
• The motivation for these different estimates is to better handle the influence of varying document lengths
Poisson Estimate
• The 'Poisson estimate' approximates the (Poisson-based) probability that term t occurs at least once; a sketch follows below
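A minimal sketch of a Poisson-based IDF, assuming P(t|C) = 1 − e^(−λ) with λ = cf(t)/K, where cf(t) is the total number of occurrences of t in the collection and K is a free parameter (this parameterization is an assumption; the results later report K = N/10 as working best):

```python
import math

def idf_poisson(cf_t: int, K: float) -> float:
    """Poisson-based IDF: -log P(t|C), where P(t|C) = 1 - exp(-lam)
    is the Poisson probability that t occurs at least once.
    lam = cf_t / K is an assumed parameterization; cf_t is the
    total occurrence count of t in the collection."""
    lam = cf_t / K
    return -math.log(1.0 - math.exp(-lam))

# Illustration with N = 1000 documents and K = N/10: rare terms
# keep large weights, frequent terms are pushed towards zero.
N = 1000
for cf in (1, 10, 100, 5000):
    print(cf, round(idf_poisson(cf, K=N / 10), 3))
```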
Experimental Setup
• Ad-hoc retrieval
  • TREC-7 and TREC-8 (topics 351-400)
  • No stemming
• Routing
  • LA Times articles for training (1989/1990)
  • Remainder for testing (1991-1994)
• BM25 constants: [values shown in the original slide]
Results: IDF vs. IDFp
[Result plots for IDF and IDFp omitted]
IDF vs. IDFp
• For the short T queries, the user carefully selects the most discriminative terms with respect to relevance
• The longer TD and TDN queries, however, also contain noisy, non-discriminative terms
IDF vs. IDFp
• IDFp orders terms by their discriminativeness in the same order as IDF, but reduces the influence of the non-discriminative terms on the ranking
• It differentiates more between rare terms, and less between frequent terms
• As a result, the effect of the Poisson-based estimation is much stronger for the longer queries
TF*IDF vs. TF*IDFp
• Estimation with IDFp results in better mean average precision than the 'traditional' estimate
• A strong emphasis on discriminativeness (Poisson approximation IDFp using large values of K) improves effectiveness
• Best overall performance for K = N/10
Routing experiment
• The TF*IDFp results without feedback are better than all TF*IDF results
• But the TF*IDFp results without feedback are also better than all TF*IDFp results with feedback
• Finally, the TF*IDF results improve only marginally with feedback
• Is the LA Times training data not representative?
Conclusions
• PART I
  • IDF and the Binary Independence Retrieval model are very closely related
  • Relevance information can be incorporated into TF*IDF by revising IDF
• PART II
  • A different estimation of the occurrence probability in IDF leads to improved retrieval effectiveness
Open Questions
• Can we derive the choice of K = N/10 analytically?
• Is the observed improvement in effectiveness really due to a better (frequentist) model of the occurrence probability, or is it a qualitative argument for informativeness?
• More questions from the audience?!