A new measure of retrieval effectiveness (Or: What's wrong with precision and recall)
Stefano Mizzaro
Department of Mathematics and Computer Science, University of Udine
mizzaro@dimi.uniud.it
http://www.dimi.uniud.it/~mizzaro
Outline
• Introduction: measures of retrieval effectiveness, and the motivation for...
• ...a new measure: the Average Distance Measure (ADM)
• Discussion
  • Theoretical and practical adequacy of ADM
  • ADM vs. precision and recall
  • Problems with P & R
• Conclusions and future work
From binary to continuous relevance & retrieval
[Figure: the document database partitioned by retrieval (not retrieved, "less" retrieved, retrieved, "more" retrieved) and by relevance (not relevant, "less" relevant, relevant, "more" relevant); adapted from Salton & McGill, 84]
Continuous relevance & retrieval
• SRE = System Relevance Estimate (a.k.a. RSV), a value in [0, 1]
• URE = User Relevance Estimate, a value in [0, 1]
[Figure: each document is a point in the URE x SRE unit square; the URE axis runs from "less" relevant (0) to "more" relevant (1), the SRE axis from "less" retrieved (0) to "more" retrieved (1)]
Thresholds on URE & SRE: why?
• P & R are defined on binary categories (retrieved & relevant? retrieved & nonrelevant? nonretrieved & relevant? nonretrieved & nonrelevant?):
  • R = RetRel / (RetRel + NRetRel)
  • P = RetRel / (RetRel + RetNRel)
• ... and historical reasons
[Figure: the URE x SRE square cut at URE = 0.5 and SRE = 0.5 into the four quadrants above, each marked with a question mark]
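To make the dependence on thresholds concrete, here is a minimal sketch (not part of the original slides; the 0.5 cut-offs and the sample values are assumptions) that turns continuous URE/SRE values into the quadrant counts behind P and R:

```python
def precision_recall(ure, sre, t_rel=0.5, t_ret=0.5):
    """Threshold continuous URE/SRE values and compute P and R.

    A document counts as relevant if URE > t_rel and as retrieved
    if SRE > t_ret; the cut-off points are arbitrary.
    """
    ret_rel  = sum(1 for u, s in zip(ure, sre) if u > t_rel and s > t_ret)
    ret_nrel = sum(1 for u, s in zip(ure, sre) if u <= t_rel and s > t_ret)
    nret_rel = sum(1 for u, s in zip(ure, sre) if u > t_rel and s <= t_ret)
    p = ret_rel / (ret_rel + ret_nrel) if ret_rel + ret_nrel else 0.0
    r = ret_rel / (ret_rel + nret_rel) if ret_rel + nret_rel else 0.0
    return p, r

# Hypothetical URE/SRE values, just to exercise the formulas.
print(precision_recall([0.8, 0.4, 0.1], [0.9, 0.5, 0.2]))  # (1.0, 1.0)
```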
Average Distance Measure (ADM)
• SRE(d): the system's relevance estimate for document d, a value in [0, 1]
• URE(d): the user's relevance estimate for document d, a value in [0, 1]
• ADM = average "distance" between URE and SRE values:
  ADM = 1 - (1/|D|) ∑ |URE(d) - SRE(d)|, summing over the evaluated documents d
ADM: graphical representation
[Figure: the URE x SRE unit square; documents on the diagonal URE = SRE are exactly evaluated, and the farther a document lies from that diagonal, the more it lowers ADM]
ADM: an example

Docs    d1    d2    d3    ADM
URE     0.8   0.4   0.1
IRS1    0.9   0.5   0.2   0.9
IRS2    1.0   0.6   0.3   0.8
IRS3    0.8   0.4   1.0   0.7
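A minimal sketch of the ADM computation (my code, not from the slides), using the example values above; it reproduces the 0.9 / 0.8 / 0.7 column:

```python
def adm(ure, sre):
    """ADM = 1 minus the mean absolute difference between URE and SRE."""
    return 1 - sum(abs(u - s) for u, s in zip(ure, sre)) / len(ure)

ure = [0.8, 0.4, 0.1]
for name, sre in [("IRS1", [0.9, 0.5, 0.2]),
                  ("IRS2", [1.0, 0.6, 0.3]),
                  ("IRS3", [0.8, 0.4, 1.0])]:
    print(name, round(adm(ure, sre), 2))  # IRS1 0.9, IRS2 0.8, IRS3 0.7
```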
Adequacy of ADM
• One single number
• Allows a complete ordering of different performances
• ...
• ADM vs. P & R:
  • No hyper-sensitivity to small variations close to the borders
  • No lack of sensitivity to big variations inside the "equivalence" regions
  • No wrong thresholds
Hyper-sensitivity: three very similar IRSs

        P     R     E     ADM
IRS1    0.67  1     0.84  0.83
IRS2    1     0.5   0.75  0.83
IRS3    0.5   0.5   0.5   0.826
[Figure: documents whose URE and SRE values sit just above or below the 0.5 thresholds (0.49 vs. 0.5); P, R and E are unstable under such tiny shifts, ADM stays stable]
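A small sketch of the hyper-sensitivity point (the document values are invented, not the ones behind the table above): nudging one SRE across the 0.5 threshold flips precision, while ADM barely moves.

```python
def adm(ure, sre):
    return 1 - sum(abs(u - s) for u, s in zip(ure, sre)) / len(ure)

def precision(ure, sre, t=0.5):
    retrieved = [(u, s) for u, s in zip(ure, sre) if s > t]
    return sum(u > t for u, s in retrieved) / len(retrieved) if retrieved else 0.0

ure   = [0.51, 0.51, 0.49]   # hypothetical user estimates near the border
sre_a = [0.51, 0.51, 0.49]   # system A keeps the nonrelevant doc just below 0.5
sre_b = [0.51, 0.51, 0.51]   # system B nudges it just above 0.5

print(precision(ure, sre_a), precision(ure, sre_b))          # 1.0 vs. ~0.67
print(round(adm(ure, sre_a), 3), round(adm(ure, sre_b), 3))  # 1.0 vs. 0.993
```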
Lack of sensitivity: two very different IRSs

        P   R   E   ADM
IRS1    1   1   1   1
IRS2    1   1   1   0.5
[Figure: both systems put every document on the correct side of the 0.5 thresholds, so P, R and E cannot tell them apart, but their SRE values differ widely and ADM does]
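A companion sketch for the lack-of-sensitivity point (again with made-up values): two systems that put every document on the same side of the 0.5 thresholds get identical P and R, yet very different ADM scores.

```python
def adm(ure, sre):
    return 1 - sum(abs(u - s) for u, s in zip(ure, sre)) / len(ure)

ure   = [1.0, 1.0, 0.0, 0.0]   # hypothetical user estimates
sre_1 = [1.0, 1.0, 0.0, 0.0]   # IRS1 matches the user exactly
sre_2 = [0.6, 0.6, 0.4, 0.4]   # IRS2 is barely on the right side of 0.5 everywhere

# Both systems retrieve exactly the two relevant documents (SRE > 0.5),
# so P = R = 1 for both; ADM still separates them.
print(round(adm(ure, sre_1), 2), round(adm(ure, sre_2), 2))  # 1.0 vs. 0.6
```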
Again on the thresholds...
[Figure: the URE x SRE square cut at 0.5 into the four quadrants again, each labelled with a question mark: retrieved & relevant? retrieved & nonrelevant? nonretrieved & relevant? nonretrieved & nonrelevant?]
The "right" thresholds
• E = CE / (OE + UE)
[Figure: the URE x SRE square split by the diagonal URE = SRE into an OverEvaluated region (SRE above URE), an UnderEvaluated region (SRE below URE), and the Correctly Evaluated documents along the diagonal]
ADM in practice
• How to get URE values? Either
  • ask the judge(s) to express continuous relevance judgments directly (feasible; there is evidence in the literature), or
  • average dichotomous/discrete relevance judgments (sketched below)
• UREs for all the documents in the database? Impossible!!
  • Sampling (which takes place with P & R too, anyway)
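A minimal sketch of the averaging option (the judgment matrix is invented for illustration): several judges' binary relevance judgments, averaged per document, give continuous UREs.

```python
# Binary judgments from hypothetical judges: rows = judges, columns = documents.
judgments = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 0],
]

# URE of each document = fraction of judges who marked it relevant.
ure = [sum(col) / len(judgments) for col in zip(*judgments)]
print(ure)  # [1.0, 0.5, 0.25]
```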
Conclusions
• ADM, a new measure of retrieval effectiveness
• Adequacy
• Improvements w.r.t. P & R: avoids hyper-sensitivity and lack of sensitivity
• Practical usability (continuous relevance judgments, sampling)
• Very preliminary work
Future work
• Theoretical variations and improvements
  • Standard deviation instead of the mean of absolute differences?
  • Which sampling?
• Re-examine the data of some evaluation experiments (any volunteers?)
• Use ADM in real life