CLEF’10: Conference on Multilingual and Multimodal Information Access Evaluation, September 20-23, Padua, Italy. Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation. Guillaume Cabanac, Gilles Hubert, Mohand Boughanem, Claude Chrisment
Effect of the Tie-Breaking Bias, G. Cabanac et al. Outline • Motivation: A tale about two TREC participants • Context: IRS effectiveness evaluation • Issue: Tie-breaking bias effects • Contribution: Reordering strategies • Experiments: Impact of the tie-breaking bias • Conclusion and Future Work
1. Motivation: Tie-breaking bias illustration, G. Cabanac et al. A tale about two TREC participants (1/2) • Topic 031 “satellite launch contracts”, 5 relevant documents • Chris and Ellen submit runs with one single difference: C = (N, 0.8), (R, 0.8), (N, 0.5) vs. E = (N, 0.8), (R, 0.8), (N, 0.5) • Chris is unlucky, Ellen is lucky. Why such a huge difference?
1. Motivation: Tie-breaking bias illustration, G. Cabanac et al. A tale about two TREC participants (2/2) • After 15 days of hard work, Chris’s and Ellen’s runs, C = (N, 0.8), (R, 0.8), (N, 0.5) and E = (N, 0.8), (R, 0.8), (N, 0.5), differ in one single way • The only difference: the name of one document
Effect of the Tie-Breaking Bias, G. Cabanac et al. Outline • Motivation: A tale about two TREC participants • Context: IRS effectiveness evaluation • Issue: Tie-breaking bias effects • Contribution: Reordering strategies • Experiments: Impact of the tie-breaking bias • Conclusion and Future Work
2. Context & issue: Tie-breaking bias, G. Cabanac et al. Measuring the effectiveness of IRSs • User-centered vs. system-focused evaluation [Spärck Jones & Willett, 1997] • Evaluation campaigns: 1958 Cranfield (UK), 1992 TREC Text Retrieval Conference (USA), 1999 NTCIR NII Test Collection for IR Systems (Japan), 2001 CLEF Cross-Language Evaluation Forum (Europe), … • “Cranfield” methodology: task, test collection (corpus, topics, qrels), measures (MAP, P@X, …) computed with trec_eval [Voorhees, 2007]
2. Context & issue: Tie-breaking bias, G. Cabanac et al. Runs are reordered prior to their evaluation • Qrels = ⟨qid, iter, docno, rel⟩; Run = ⟨qid, iter, docno, rank, sim, run_id⟩; rel ∈ [1 ; 127] marks relevant documents • Example: (N, 0.8), (R, 0.8), (N, 0.5) is reordered by trec_eval (qid asc, sim desc, docno desc) into (R, 0.8), (N, 0.8), (N, 0.5) • Effectiveness measure (MAP, P@X, MRR, …) = f(intrinsic_quality, luck)
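To make the reordering concrete, here is a minimal sketch of the sort trec_eval applies before scoring (qid ascending, sim descending, docno descending). Only the sort keys come from the slide; the sample records and docnos are hypothetical.

```python
# Hypothetical run records: (qid, docno, sim); qrels: {(qid, docno): rel}.
run = [
    ("031", "AP880101-0001", 0.8),    # not relevant
    ("031", "FT911-2345", 0.8),       # relevant, tied with the document above
    ("031", "WSJ870101-0002", 0.5),   # not relevant
]
qrels = {("031", "FT911-2345"): 1}

# Chained stable sorts emulate the multi-key ordering: qid asc, sim desc, docno desc.
reordered = sorted(run, key=lambda r: r[1], reverse=True)        # docno desc (least significant key)
reordered = sorted(reordered, key=lambda r: r[2], reverse=True)  # sim desc
reordered = sorted(reordered, key=lambda r: r[0])                # qid asc (most significant key)

for rank, (qid, docno, sim) in enumerate(reordered, start=1):
    rel = qrels.get((qid, docno), 0)
    print(rank, docno, sim, "R" if rel else "N")
```

With these hypothetical docnos the relevant document happens to win the tie at 0.8, which is exactly the element of luck the slide points out.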
Effect of the Tie-Breaking Bias, G. Cabanac et al. Outline • Motivation: A tale about two TREC participants • Context: IRS effectiveness evaluation • Issue: Tie-breaking bias effects • Contribution: Reordering strategies • Experiments: Impact of the tie-breaking bias • Conclusion and Future Work
3. Contribution: Reordering strategies, G. Cabanac et al. Consequences of run reordering • Measures of effectiveness for an IRS s, all sensitive to document rank: RR(s,t) = 1/rank of the 1st relevant document for topic t; P(s,t,d) = precision at document d for topic t; AP(s,t) = average precision for topic t; MAP(s) = mean average precision • Tie-breaking bias: is the Wall Street Journal collection more relevant than Associated Press? • Problem 1: comparing 2 systems, AP(s1, t) vs. AP(s2, t) • Problem 2: comparing 2 topics, AP(s, t1) vs. AP(s, t2)
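The measures listed above can be written compactly. The sketch below (illustrative, not trec_eval's implementation) computes RR, AP and MAP from ranked lists of binary relevance judgements, with R the total number of relevant documents for the topic.

```python
def reciprocal_rank(rels):
    """RR: 1 / rank of the first relevant document (0 if none is retrieved)."""
    return next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0)

def average_precision(rels, R):
    """AP: mean of the precision values observed at each relevant document."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels):
        if r:
            hits += 1
            total += hits / (i + 1)   # precision at the rank of this relevant document
    return total / R if R else 0.0

def mean_average_precision(ap_values):
    """MAP: mean of AP over all topics."""
    return sum(ap_values) / len(ap_values)

# The Chris/Ellen tale in these terms: same documents, different tie-break.
print(reciprocal_rank([0, 1, 0]))   # unlucky ordering: RR = 0.5
print(reciprocal_rank([1, 0, 0]))   # lucky ordering:   RR = 1.0
```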
3. Contribution: Reordering strategies, G. Cabanac et al. Alternative unbiased reordering strategies • Conventional reordering (TREC): ties sorted Z→A (qid asc, sim desc, docno desc) • Realistic reordering: relevant documents last among ties (qid asc, sim desc, rel asc, docno desc) • Optimistic reordering: relevant documents first among ties (qid asc, sim desc, rel desc, docno desc)
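A possible reading of the three strategies as code, assuming each result is a (docno, sim, rel) tuple for a single topic (qid omitted) with rel in {0, 1}; chained stable sorts reproduce the multi-key orderings listed above. This is a sketch, not the paper's implementation.

```python
def reorder(results, strategy="conventional"):
    """Order results by sim desc, breaking score ties according to the strategy."""
    out = sorted(results, key=lambda r: r[0], reverse=True)      # docno desc (least significant key)
    if strategy == "realistic":
        out = sorted(out, key=lambda r: r[2])                    # rel asc: relevant documents last
    elif strategy == "optimistic":
        out = sorted(out, key=lambda r: r[2], reverse=True)      # rel desc: relevant documents first
    return sorted(out, key=lambda r: r[1], reverse=True)         # sim desc (most significant key)

# Hypothetical results: only the tie-break among the two 0.8-scored documents changes.
results = [("WSJ900101-0001", 0.8, 0), ("FT911-2345", 0.8, 1), ("AP880212-0099", 0.5, 0)]
print([d for d, _, _ in reorder(results, "realistic")])    # relevant document demoted within its tie group
print([d for d, _, _ in reorder(results, "optimistic")])   # relevant document promoted within its tie group
```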
Effect of the Tie-Breaking Bias, G. Cabanac et al. Outline • Motivation: A tale about two TREC participants • Context: IRS effectiveness evaluation • Issue: Tie-breaking bias effects • Contribution: Reordering strategies • Experiments: Impact of the tie-breaking bias • Conclusion and Future Work
4. Experiments: Impact of the tie-breaking bias, G. Cabanac et al. Effect of the tie-breaking bias • Study of 4 TREC tasks (adhoc, routing, filtering, web) over 22 editions from 1993 to 2009: 1,360 runs, 3 GB of data from trec.nist.gov • Assessing the effect of tie-breaking: proportion of document ties (how frequent is the bias?), effect on measure values (top 3 observed differences, observed difference in %), significance of the observed difference: Student’s t-test (paired, one-tailed)
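The significance test mentioned on the slide, a paired one-tailed Student's t-test on per-topic values, could be run as below. This is only a sketch with placeholder numbers, not the paper's data or tooling; the `alternative` keyword of scipy.stats.ttest_rel requires SciPy 1.6 or later.

```python
from scipy import stats

# Hypothetical per-topic AP values under two reorderings of the same run.
ap_conventional = [0.31, 0.42, 0.18, 0.55, 0.27]
ap_realistic    = [0.29, 0.42, 0.15, 0.53, 0.27]

# Paired, one-tailed test: is conventional AP significantly greater than realistic AP?
t_stat, p_value = stats.ttest_rel(ap_conventional, ap_realistic, alternative="greater")
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")
```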
4. Experiments: Impact of the tie-breaking bias, G. Cabanac et al. Tie demographics • 89.6% of the runs contain ties • Ties are present all along the result lists
4. Experiments: Impact of the tie-breaking bias, G. Cabanac et al. Proportion of tied documents in submitted runs • On average, 25.2% of a result list consists of tied documents • On average, a group of tied documents contains 10.6 documents
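The two statistics on this slide could be computed per result list roughly as follows; the scores are made up and the grouping criterion (identical sim values) is an assumption.

```python
from collections import Counter

def tie_stats(sims):
    """Return (proportion of tied documents, average size of a tied group)."""
    group_sizes = [n for n in Counter(sims).values() if n > 1]   # groups of identical scores
    tied_docs = sum(group_sizes)
    prop_tied = tied_docs / len(sims) if sims else 0.0
    avg_group = tied_docs / len(group_sizes) if group_sizes else 0.0
    return prop_tied, avg_group

# Example with hypothetical scores: 4 of 6 documents are tied, in groups of 2.
print(tie_stats([0.9, 0.8, 0.8, 0.5, 0.5, 0.1]))   # -> (0.666..., 2.0)
```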
4. Experiments: Impact of the tie-breaking bias, G. Cabanac et al. Effect on Reciprocal Rank (RR)
4. Experiments: Impact of the tie-breaking bias, G. Cabanac et al. Effect on Average Precision (AP)
4. Experiments: Impact of the tie-breaking bias, G. Cabanac et al. Effect on Mean Average Precision (MAP) • Difference between system rankings computed on MAP is not significant (Kendall’s τ)
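The ranking comparison can be sketched with Kendall's τ over per-system MAP scores under two reorderings; the MAP values below are placeholders, not the paper's results.

```python
from scipy import stats

# Hypothetical MAP values, one per system, under two reorderings.
map_conventional = [0.212, 0.305, 0.287, 0.198, 0.331]
map_realistic    = [0.208, 0.301, 0.285, 0.198, 0.327]

# Kendall's tau on the scores measures how similarly the two reorderings rank the systems.
tau, p_value = stats.kendalltau(map_conventional, map_realistic)
print(f"tau = {tau:.3f}, p = {p_value:.4f}")
```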
4. Experiments: Impact of the tie-breaking bias, G. Cabanac et al. What we learnt: beware of tie-breaking for AP • Small effect on MAP, larger effect on AP • Measure bounds: AP_realistic ≤ AP_conventional ≤ AP_optimistic • Failure analysis for the ranking process: the error bar represents an element of chance, i.e., a potential for improvement (run padre1, adhoc’94)
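Combining the earlier sketches (reorder from the reordering-strategies slide and average_precision from the measures slide), the bound can be illustrated for a single topic with one relevant document among hypothetical results.

```python
# Reuses reorder() and average_precision() defined in the sketches above.
results = [("DOC-3", 0.8, 0), ("DOC-2", 0.8, 1), ("DOC-1", 0.5, 0)]   # (docno, sim, rel), hypothetical
R = 1   # number of relevant documents for the topic

bounds = {strategy: average_precision([rel for _, _, rel in reorder(results, strategy)], R)
          for strategy in ("realistic", "conventional", "optimistic")}

assert bounds["realistic"] <= bounds["conventional"] <= bounds["optimistic"]
print(bounds)   # e.g. {'realistic': 0.5, 'conventional': 0.5, 'optimistic': 1.0}
```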
4. Experiments: Impact of the tie-breaking bias, G. Cabanac et al. Related works in IR evaluation • Topics reliability? [Buckley & Voorhees, 2000] 25, [Voorhees & Buckley, 2002] error rate, [Voorhees, 2009] n collections • Qrels reliability? [Voorhees, 1998] quality, [Al-Maskari et al., 2008] TREC vs. TREC, [Voorhees, 2007] • Measures reliability? [Buckley & Voorhees, 2000] MAP, [Sakai, 2008] ‘system bias’, [Moffat & Zobel, 2008] new measures, [Raghavan et al., 1989] Precall, [McSherry & Najork, 2008] tied scores • Pooling reliability? [Zobel, 1998] approximation, [Sanderson & Joho, 2004] manual, [Buckley et al., 2007] size adaptation • [Cabanac et al., 2010] tie-breaking bias
Effect of the Tie-Breaking Bias, G. Cabanac et al. Outline • Motivation: A tale about two TREC participants • Context: IRS effectiveness evaluation • Issue: Tie-breaking bias effects • Contribution: Reordering strategies • Experiments: Impact of the tie-breaking bias • Conclusion and Future Work
Effect of the Tie-Breaking Bias, G. Cabanac et al. Conclusions and future work • Context: IR evaluation (TREC and other campaigns based on trec_eval) • Contributions: Measure = f(intrinsic_quality, luck), i.e., the tie-breaking bias; measure bounds (realistic ≤ conventional ≤ optimistic); study of the tie-breaking bias effect, comparing conventional and realistic reorderings for RR, AP and MAP: strong correlation, yet significant difference; no difference in system rankings (based on MAP) • Future work: study of other / more recent evaluation campaigns; reordering-free measures; finer-grained analyses (finding vs. ranking)
CLEF’10: Conference on Multilingual and Multimodal Information Access Evaluation, September 20-23, Padua, Italy. Thank you