Lower-Bounding Term Frequency Normalization

Yuanhua Lv and ChengXiang Zhai, University of Illinois at Urbana-Champaign
CIKM 2011 Best Student Award Paper
Speaker: Tom, Nov 8th, 2011

It is very difficult to improve retrieval models
• BM25 [Robertson et al. 1994]: 17 years
• Pivoted length normalization (PIV) [Singhal et al. 1996]: 15 years
• Query likelihood with Dirichlet prior (DIR) [Ponte & Croft 1998; Zhai & Lafferty 2001]: 10 years
• PL2 [Amati & Rijsbergen 2002]: 9 years
All these models remain strong baselines today, after so many years!

1. Why does it seem to be so hard to beat these state-of-the-art retrieval models {BM25, PIV, DIR, PL2 …}?
2. Are they hitting the ceiling?

Key heuristic in all effective retrieval models: term frequency (TF) normalization by document length [Singhal et al. 96; Fang et al. 04]
• BM25: [formula]
• DIR: Query likelihood with Dirichlet prior: [formula]
[Formula annotations: term frequency, document length, term discrimination]
PIV and PL2 implement similar retrieval heuristics
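For reference, the commonly cited textbook forms of the two scoring functions named on this slide are sketched below. The notation is an assumption and may differ from the slide's: c(q, D) is the count of q in D, |D| the document length, avdl the average document length, p(q|C) the collection language model, and idf(q) stands for whichever IDF variant is in use. DIR is given in its rank-equivalent form.

```latex
\[
S_{\mathrm{BM25}}(Q, D) = \sum_{q \in Q \cap D}
  \frac{(k_3 + 1)\, c(q, Q)}{k_3 + c(q, Q)} \cdot
  \frac{(k_1 + 1)\, c(q, D)}{k_1 \left(1 - b + b \frac{|D|}{avdl}\right) + c(q, D)} \cdot
  \mathrm{idf}(q)
\]
\[
S_{\mathrm{DIR}}(Q, D) = \sum_{q \in Q \cap D}
  c(q, Q) \log\left(1 + \frac{c(q, D)}{\mu\, p(q \mid C)}\right)
  + |Q| \log \frac{\mu}{|D| + \mu}
\]
```

In both forms the document length |D| appears only in a term that shrinks the score as the document grows, which is the TF normalization by document length that the slide refers to.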

However, the component of TF normalization by document length is NOT lower-bounded properly
• BM25: [formula]
• DIR: Query likelihood with Dirichlet prior: [formula]
When a document is very long, its score from matching a query term could be too small!
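The missing lower bound can be seen numerically. The sketch below assumes the standard BM25 TF component with the common defaults k1 = 1.2 and b = 0.75 (the function name and the example lengths are hypothetical) and prints the score of a single match as the document grows.

```python
def bm25_tf(count, doc_len, avdl, k1=1.2, b=0.75):
    """Standard BM25 TF component for one query term (IDF factor omitted)."""
    norm = 1.0 - b + b * doc_len / avdl  # document length normalization
    return (k1 + 1.0) * count / (k1 * norm + count)


if __name__ == "__main__":
    avdl = 500  # hypothetical average document length
    for doc_len in (500, 5_000, 50_000, 500_000):
        # One occurrence of the query term in an increasingly long document.
        print(doc_len, round(bm25_tf(1, doc_len, avdl), 4))
```

A document with zero occurrences always gets 0 from this component, so for very long documents the gap between matching and not matching the term becomes arbitrarily small.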

As a result, long documents could be overly penalized
D2 matches the query term, while D1 does not
[Two plots, PL2 and DIR (y-axis: Score), each showing S(D2) < S(D1)]

Empirical evidence: long documents indeed overly penalized
[Two plots of relevance and retrieval curves against document length]
Prob. of relevance/retrieval: the probability of a randomly selected relevant/retrieved document having a certain document length [Singhal et al. 96]

Bug: TF normalization not lower-bounded properly, and long documents overly penalized
White-box testing: functionality analysis of retrieval models
Do these retrieval models share this bug because they all violate some necessary retrieval heuristics? Can we formally capture these necessary heuristics?

Two novel heuristics for regulating the interactions between TF and doc. length
• There should be a sufficiently large gap between the presence and absence of a query term
  • LB1: Document length normalization should not cause a very long document with a non-zero TF to receive a score too close to or even lower than a short document with a zero TF
  • LB2: A short document that only covers a very small subset of the query terms should not easily dominate over a very long document that contains many distinct query terms

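Lower-bounding constraint 1 (LB1): Occurrence > Non-Occurrence
[Diagram: query Q and documents D1, D2 matching query terms w; Q' adds a new term q to Q, and q occurs in D2 but not in D1]
If Score(Q, D1) = Score(Q, D2), then Score(Q', D1) < Score(Q', D2)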

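Lower-bounding constraint 2 (LB2): First Occurrence > Repeated Occurrence
[Diagram: query Q = {q1, q2}; D1 and D2 contain q1 only; D1' has one more occurrence of q1 than D1, while D2' gains a first occurrence of q2]
If Score(Q, D1) = Score(Q, D2), then Score(Q, D1') < Score(Q, D2')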

BM25 satisfies LB1 but violates LB2
• LB1 is satisfied unconditionally
• LB2 is equivalent to: [formula] (Parameters: k1 > 0 && 0 < b < 1)
Long documents tend to violate LB2
Large b or k1 violates LB2 easily
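A concrete LB2 violation can be constructed by hand. The sketch below assumes the standard BM25 TF component with k1 = 1.2 and b = 0.75, equal IDF for both query terms, and hypothetical documents chosen so that D1 and D2 tie on the original query; document lengths stay fixed because the added query term is taken to replace a non-query term.

```python
def bm25_tf(count, doc_len, avdl=300.0, k1=1.2, b=0.75):
    """Standard BM25 TF component for one query term."""
    norm = 1.0 - b + b * doc_len / avdl
    return (k1 + 1.0) * count / (k1 * norm + count)


def score(counts, doc_len, idf=1.0):
    """Score for the query {q1, q2}; both terms are assumed to share one IDF."""
    return sum(idf * bm25_tf(c, doc_len) for c in counts)


# Each document is ((count of q1, count of q2), document length).
d1 = ((1, 0), 300)         # average-length document, q1 occurs once
d2 = ((5, 0), 1900)        # long document, q1 occurs five times
d1_prime = ((2, 0), 300)   # D1 with a repeated occurrence of q1
d2_prime = ((5, 1), 1900)  # D2 with a first occurrence of q2

if __name__ == "__main__":
    print("S(D1), S(D2):  ", score(*d1), score(*d2))
    print("S(D1'), S(D2'):", score(*d1_prime), score(*d2_prime))
    # LB2 asks for S(D1') < S(D2'); here the long document loses instead.
    print("LB2 violated:", score(*d1_prime) >= score(*d2_prime))
```

With these numbers the repeated occurrence of q1 in the short document is worth more than the first occurrence of q2 in the long one, which is the behavior LB2 is meant to rule out.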

DIR satisfies LB2 but violates LB1
• LB2 is equivalent to: [formula], which is satisfied unconditionally!
• LB1 is equivalent to: [formula]
Long documents tend to violate LB1
Large µ or non-discriminative terms violate LB1 easily
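Because the Dirichlet-prior score is additive over query terms, and D1 and D2 tie on the original query, LB1 reduces to comparing what the added term q contributes to each document. The sketch below assumes the rank-equivalent form of the model; the function name, µ = 2000, and the document lengths and collection probabilities are hypothetical.

```python
import math


def dir_term_gain(count, doc_len, p_coll, mu=2000.0):
    """Score added to a document when one more term q joins the query.

    Rank-equivalent Dirichlet-prior form: every query term contributes
    log(mu / (|D| + mu)), and a matched term additionally contributes
    log(1 + c(q, D) / (mu * p(q|C))).
    """
    gain = math.log(mu / (doc_len + mu))
    if count > 0:
        gain += math.log(1.0 + count / (mu * p_coll))
    return gain


if __name__ == "__main__":
    # D1 is short and lacks q; D2 is long and contains q once.
    for p_coll in (0.001, 0.0001):  # common term vs. more discriminative term
        d1 = dir_term_gain(0, 100, p_coll)
        d2 = dir_term_gain(1, 2000, p_coll)
        verdict = "LB1 holds" if d2 > d1 else "LB1 violated"
        print(p_coll, round(d1, 4), round(d2, 4), verdict)
```

In this example the common term (p = 0.001) leaves the long matching document below the short non-matching one, while the more discriminative term (p = 0.0001) does not, in line with the slide's remark about non-discriminative terms.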

No retrieval model satisfies both constraints
Can we "fix" this problem for all the models in a general way?

Solution: a general approach to lower-bounding TF normalization
• The score of a document D from matching a query term t: [formula, including a term discrimination argument; the BM25 and DIR components are shown as instances]
PIV and PL2 also have their corresponding components

Solution: a general approach to lower-bounding TF normalization (Cont.)
• Objective: an improved version [formula] that does not hurt other retrieval heuristics, but [formula]
• A heuristic solution: [formula] (l can be absorbed into δ), which satisfies all retrieval heuristics that are satisfied by [formula]

Example: BM25+, a lower-bounded version of BM25
BM25: [formula]
BM25+: [formula]
BM25+ incurs almost no additional computational cost
Similarly, we can also improve PIV, DIR, and PL2, leading to PIV+, DIR+, and PL2+ respectively
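A minimal sketch of BM25+ scoring follows, assuming the lower bound enters as a constant δ added to the TF component of every matched query term. The function name, the IDF variant log((N + 1) / df), the omission of the query-term-frequency factor (each query term is taken to occur once), and all numbers in the usage are assumptions for illustration.

```python
import math


def bm25_plus_score(query_terms, doc_counts, doc_len, avdl, df, n_docs,
                    k1=1.2, b=0.75, delta=1.0):
    """BM25+-style score; delta=0.0 falls back to plain BM25."""
    norm = 1.0 - b + b * doc_len / avdl
    score = 0.0
    for term in query_terms:
        count = doc_counts.get(term, 0)
        if count == 0:
            continue  # unmatched terms contribute nothing, as in BM25
        tf = (k1 + 1.0) * count / (k1 * norm + count)
        idf = math.log((n_docs + 1.0) / df[term])  # assumed IDF variant
        score += idf * (tf + delta)  # the only change: one extra addition
    return score


if __name__ == "__main__":
    query = ["q1", "q2"]
    df = {"q1": 100, "q2": 100}       # equal term discrimination
    short_doc = ({"q1": 2}, 300)      # repeats q1, misses q2
    long_doc = ({"q1": 5, "q2": 1}, 1900)  # covers both query terms
    for delta in (0.0, 1.0):
        s_short = bm25_plus_score(query, *short_doc, 300.0, df, 10_000, delta=delta)
        s_long = bm25_plus_score(query, *long_doc, 300.0, df, 10_000, delta=delta)
        print(delta, round(s_short, 4), round(s_long, 4))
```

The extra work per matched term is a single addition, which fits the slide's point about computational cost. In the usage above, the long document that covers both query terms ranks below the short one with δ = 0 and above it with δ = 1.0.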

BM25+ can satisfy both LB1 and LB2
• Similarly to BM25, BM25+ satisfies LB1
• LB2 can also be satisfied unconditionally if: [formula]
Experiments later show that setting δ = 1.0 works very well

The proposed approach can fix or alleviate the problem of all these retrieval models
[Table: whether current retrieval models and improved retrieval models satisfy LB1 and LB2]

Experiment Setup
• Standard TREC document collections
  • Web: WT2G, WT10G, and Terabyte
  • News: Robust04
• Standard TREC query sets:
  • Short (the title field): e.g., “Iraq foreign debt reduction”
  • Verbose (the description field): e.g., “Identify any efforts, proposed or undertaken, by world governments to seek reduction of Iraq's foreign debt”
• 2-fold cross validation for parameter tuning
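The tuning protocol can be sketched as follows. How the slides' folds were actually built is not stated, so the split by query order, the `evaluate` callback (assumed to return a mean effectiveness score such as MAP for a parameter setting on a set of queries), and the parameter grid are all hypothetical.

```python
def two_fold_cv(queries, param_grid, evaluate):
    """Tune on one half of the queries, test on the other half, then swap.

    evaluate(params, queries) is a caller-supplied (hypothetical) function
    returning a mean effectiveness score; higher is better.
    """
    half = len(queries) // 2
    folds = [queries[:half], queries[half:]]
    results = []
    for i in (0, 1):
        train, test = folds[i], folds[1 - i]
        # Pick the setting that does best on the training half.
        best = max(param_grid, key=lambda params: evaluate(params, train))
        # Report its effectiveness on the held-out half.
        results.append((best, evaluate(best, test)))
    return results


if __name__ == "__main__":
    # Toy stand-in for a real retrieval run: effectiveness peaks at delta = 1.0.
    def toy_evaluate(delta, queries):
        return 1.0 - abs(delta - 1.0)

    print(two_fold_cv(list(range(10)), [0.5, 1.0, 1.5], toy_evaluate))
```

Reported numbers would then be averaged over the two held-out halves, so no query is evaluated with parameters tuned on itself.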

BM25+ improves over BM25 significantly
[Results table: Short and Verbose queries; column annotations in slide order: σ = 2.31, σ = 2.63, σ = 1.19; Web, Web, News]
Superscripts 1/2/3/4 indicating significance at the 0.05/0.02/0.01/0.001 level
• BM25+ performs better on Web data than on News data
• δ = 1.0 works well, confirming constraint analysis that [formula]
• BM25+ performs better on verbose queries

BM25 overly penalizes long documents more seriously for verbose queries
• The “condition” that BM25 violates LB2 is [formula] (monotonically decreasing with b & k1)
• The optimal settings of b & k1 are larger for verbose queries

The improvement indeed comes from alleviating the problem of overly-penalizing long docs
[Figure; legend: BM25 (short), BM25 (verbose), BM25+ (short), BM25+ (verbose)]

DIR+ improves over DIR significantly
[Results table: Short and Verbose queries]
Superscripts 1/2/3/4 indicating significance at the 0.05/0.02/0.01/0.001 level
• Fixing δ = 0.05 works very well
• DIR+ performs better on verbose than on short queries
• DIR can only satisfy LB1 if [formula]
Optimal µ settings: [table]

PL2+ improves over PL2 significantly
[Results table: Short and Verbose queries]
Superscripts 1/2/3/4 indicating significance at the 0.05/0.02/0.01/0.001 level
• Fixing δ = 0.8 works very well
• PL2+ performs better on verbose than on short queries
• Optimal settings of c: the smaller, the more dangerous

PIV+ works as we expected
[Results table]
Superscript 1 indicates significance at the 0.05 level
• PIV+ does not consistently outperform PIV, as we expected
• PIV can satisfy LB2 if [formula]
• It’s fine, as the optimal settings of s are very small

1. Why does it seem to be so hard to beat these state-of-the-art retrieval models {BM25, PIV, DIR, PL2 …}? We weren’t able to figure out their deficiency analytically.
2. Are they hitting the ceiling? No, they haven’t hit the ceiling yet!

Conclusions
• Reveal a common deficiency of current retrieval models
• Propose two novel formal constraints
• Show that current retrieval models do not satisfy both constraints, and that retrieval performance tends to be poor if either constraint is violated
• Develop a general and efficient solution, which has been shown analytically to fix/alleviate the problem of current retrieval models
• Demonstrate the effectiveness of the proposed algorithms across different collections for different types of queries

Our models {BM25+, DIR+, PL2+} can potentially replace current state-of-the-art retrieval models {BM25, DIR, PL2}
BM25: [formula]
BM25+: [formula]

Future work
• This work has demonstrated the power of doing axiomatic analysis to fix deficiencies of retrieval models. Are there any other deficiencies of current retrieval models? If so, can we solve them with axiomatic analysis?
• Can we go beyond bag of words with constraint analysis?
• Can we find a comprehensive set of constraints that are sufficient for deriving a unique (optimal) retrieval function?

Thanks!
