User Performance versus Precision Measures for Simple Search Tasks (Don’t bother improving MAP)
Andrew Turpin, Falk Scholer
{aht,fscholer}@cs.rmit.edu.au
People in glass houses should not throw stones http://www.hartley-botanic.co.uk/hartley_images/victorian_range/victorian_range_09.jpg
Scientists should not live in glass houses. Nor straw, nor wood… http://www-math.uni-paderborn.de/~odenbach/pics/pigs/pig2.jpg
Scientists should do more than throw stones www.worth1000.com/entries/161000/161483INPM_w.jpg
Overview
• How are IR systems compared? Mean Average Precision: MAP
• Do metrics match user experience?
• First grain (Turpin & Hersh SIGIR 2000)
• Second pebble (Turpin & Hersh SIGIR 2001)
• Third stone (Allan et al SIGIR 2005)
• This golf ball (Turpin & Scholer SIGIR 2006)
[Worked example: two ranked lists with binary relevance judgements and the precision at each rank. P@5: 1/5 = 0.20 versus 2/5 = 0.40. P@1: 0/1 = 0.00 versus 0/1 = 0.00. AP (average of the precision values at the relevant documents): 0.25 versus 0.54.]
AP = (sum of all precision values at relevant documents) / (number of relevant docs in the list)
List 1: AP = 0.25 / 1 = 0.25    List 2: AP = (0.67 + 0.40) / 2 = 0.54
Alternatively, dividing by the number of relevant docs in all lists (here 3):
List 1: AP = 0.25 / 3 = 0.08    List 2: AP = (0.67 + 0.40) / 3 = 0.36
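To make the arithmetic above concrete, here is a minimal Python sketch of P@k and AP (not code from the talk). The relevance vector list_a is an assumption chosen to reproduce the first list's values above (P@5 = 0.20, P@1 = 0.00, AP = 0.25); the second list in the original figure cannot be fully recovered from this page, so it is omitted.

```python
def precision_at_k(rels, k):
    """Fraction of the top k documents that are relevant (P@k)."""
    return sum(rels[:k]) / k

def average_precision(rels, num_relevant=None):
    """Average of the precision values taken at each relevant document.

    num_relevant defaults to the relevant documents present in this list;
    pass the count of relevant docs across all lists to get the second
    variant shown above.
    """
    if num_relevant is None:
        num_relevant = sum(rels)
    if num_relevant == 0:
        return 0.0
    precisions = [precision_at_k(rels, i + 1) for i, r in enumerate(rels) if r]
    return sum(precisions) / num_relevant

list_a = [0, 0, 0, 1, 0]   # assumed: one relevant document at rank 4
print(precision_at_k(list_a, 5))                  # 0.20
print(precision_at_k(list_a, 1))                  # 0.00
print(average_precision(list_a))                  # 0.25
print(average_precision(list_a, num_relevant=3))  # ~0.08
```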
Mean Average Precision (MAP)
• The previous example showed precision for one query
• Ideally need many queries (50 or more)
• Take the mean of the AP values over all queries: MAP
• Do a paired t-test, Wilcoxon, Tukey HSD, … (a sketch of this comparison follows below)
• Compares systems on the same collection and same queries
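As a hedged sketch of how such a comparison might be run, the snippet below computes MAP for two systems over the same 50 queries and applies paired significance tests. The per-query AP values are synthetic placeholders, not data from any of the experiments; numpy and scipy are assumed to be available.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

rng = np.random.default_rng(0)

# Hypothetical per-query AP scores for 50 queries on two systems,
# evaluated on the same collection and the same queries.
ap_a = rng.uniform(0.0, 0.6, size=50)
ap_b = np.clip(ap_a + rng.normal(0.05, 0.10, size=50), 0.0, 1.0)

print("MAP A =", round(ap_a.mean(), 3), " MAP B =", round(ap_b.mean(), 3))

# Paired tests, since each query contributes a matched pair of AP scores.
print(ttest_rel(ap_a, ap_b))
print(wilcoxon(ap_a, ap_b))
```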
Typical IR empirical systems paper: Turpin & Moffat, SIGIR 1999
Monz et al, SIGIR 2005
Fang et al, SIGIR 2004
Shi et al, SIGIR 2005
Jordan et al, JCDL, June 2006
Implicit assumption: more relevant documents high in the list is good
• Do users generally want more than one relevant document?
• Do users read lists top to bottom?
• Who determines relevance? Binary? Conditional or state-based?
• While MAP is tractable, does it reflect user experience?
• Is Yahoo! really better than Google, or vice versa?
General Experiment
• Get a collection, set of queries, relevance judgments
• Compare System A and System B using MAP (Cranfield)
• Get users to do queries with System A or System B (balanced design…)
• Did the users do better with A or B?
• Did the users prefer A or B?
Experiment 2000 (24 users, 6 queries)
Engine A: MAP 0.275, IR 0.330
Engine B: MAP 0.324, IR 0.390
Experiment 2001 (32 users, 8 queries)
Engine A: MAP 0.270, QA 66%
Engine B: MAP 0.354, QA 60%
Experiment 2005
• James Allan et al, UMass, SIGIR 2005
• Passage retrieval and a recall task
• Used bpref, which “tracks MAP” (a sketch of bpref follows below)
• Small benefit to users when bpref goes from 0.50 to 0.60 and from 0.90 to 0.95
• No benefit in the mid range, 0.60 to 0.90
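For reference, here is a minimal sketch of one common formulation of bpref (from Buckley & Voorhees, SIGIR 2004); the exact variant used in the 2005 study may differ, and the document IDs and judgment sets below are made-up placeholders.

```python
def bpref(ranked_docs, relevant, nonrelevant):
    """One common bpref formulation: each retrieved relevant document is
    penalised by the fraction of judged nonrelevant documents ranked above
    it (capped at min(R, N)), averaged over all R judged relevant docs."""
    R, N = len(relevant), len(nonrelevant)
    if R == 0:
        return 0.0
    denom = min(R, N)
    score, nonrel_above = 0.0, 0
    for doc in ranked_docs:
        if doc in nonrelevant:
            nonrel_above += 1
        elif doc in relevant:
            score += 1.0 if denom == 0 else 1.0 - min(nonrel_above, denom) / denom
    return score / R

# Hypothetical example: 3 judged relevant, 3 judged nonrelevant documents.
run = ["d2", "d7", "d1", "d9", "d4", "d5"]
print(bpref(run, relevant={"d1", "d4", "d5"}, nonrelevant={"d2", "d7", "d9"}))
```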
Experiments 2000, 2001, 2005
[Figure: summary of the user-performance results from these experiments plotted against MAP.]
Experiment 2006 (32 users, 50 queries, 100 documents)
Five engines with controlled result quality:
A: MAP 0.55, B: MAP 0.65, C: MAP 0.75, D: MAP 0.85, E: MAP 0.95
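The original slides do not show how result lists with these exact MAP values were assembled. As one hedged possibility (my own construction, not necessarily the authors' method, and assuming the 100 documents are the length of each result list), a binary relevance vector can be driven to a target AP by starting with all relevant documents at the top and demoting the lowest-placed one step by step:

```python
def _ap(rels):
    """Average precision of a binary relevance vector."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def list_with_target_ap(target, length=100, num_relevant=10):
    """Start with AP = 1.0 (all relevant docs on top) and repeatedly demote
    the lowest-placed relevant document one position until AP <= target."""
    rels = [1] * num_relevant + [0] * (length - num_relevant)
    while _ap(rels) > target:
        for i in range(length - 2, -1, -1):   # lowest relevant doc that can move down
            if rels[i] == 1 and rels[i + 1] == 0:
                rels[i], rels[i + 1] = 0, 1
                break
        else:
            break                             # nothing left to demote
    return rels

print(round(_ap(list_with_target_ap(0.75)), 3))   # lands just below 0.75
```

Each demotion lowers AP by only a small amount, so the loop stops within about 0.01 of the requested target.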
Time required to find first relevant document
[Figure: time in seconds (0-300) plotted against MAP (0.55, 0.65, 0.75, 0.85, 0.95).]
Failures
[Figure: % of queries with no relevant answer plotted against MAP.]
Conclusion
• MAP does allow us to compare IR systems, but the assumption that an increase in MAP translates into an increase in user performance or satisfaction is not true
• Supported by 4 different experiments
• Don’t automatically choose MAP as a metric
• P@1 for Web style tasks?
P@1
[Figure: time to find the first relevant document in seconds (0-300) for P@1 = 0 versus P@1 = 1.]