26-01-2012 Zoekmachines
Gertjan van Noord 2013
Lecture 5: Evaluation
Retrieval performance
General performance of a system:
• speed
• security
• usability
• …
Retrieval performance and evaluation:
• is the system presenting documents related to the query?
• is the user satisfied? is the information need met?
User Queries
Same words, different intent:
• "Check my cash"
• "Cash my check"
Different words, same intent:
• "Can I open an interest-bearing account?"
• "open savings account"
• "start account for saving"
Gap between users' language and official terminology:
• "daylight lamp"
• "random reader", "edentifier", "log in machine", "card reader"
What is relevancy?
• What if you have to search through a 2-page doc to find what you need?
• What if the doc is 35 pages? Or 235?
• What if you need to click through once to get to the answer?
• Or 2 times? 3 times?
• Is relevancy a characteristic of a single result, or of a result set?
• What is the effect of an irrelevant result in an otherwise good result set?
Determining relevancy is complex!
Translation of info need
Each information need has to be translated into the "language" of the IR system.
[Diagram: reality is represented by documents, the information need by a query; relevance relates the query to the documents]
General retrieval evaluation: batch-mode (automatic) testing
Test set consisting of:
• a set of documents
• a set of queries
• a file with the relevant document numbers for each query (human evaluation!)
Experimental test sets (among others): ADI, CACM, Cranfield, TREC test sets
Example CACM data files

query.text
.I 1
.W
What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?

cacm.all
.I 1410
.T
Interarrival Statistics for Time Sharing Systems
.W
The optimization of time-shared system performance requires the description of the stochastic processes governing the user inputs and the program activity. This paper provides a statistical description of the user input process in the SDC-ARPA general-purpose […]

qrels.text
01 1410 0 0
01 1572 0 0
01 1605 0 0
01 2020 0 0
01 2358 0 0
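A minimal sketch of reading such a qrels file into a dictionary, assuming that the first two whitespace-separated fields on each line are the query id and a relevant document id (the trailing columns are ignored here); file name and layout follow the CACM example above.

```python
# Minimal sketch: read a CACM-style qrels.text into {query_id: set(doc_ids)}.
# Assumption: the first two fields are query id and relevant document id;
# any trailing fields are ignored.
from collections import defaultdict

def read_qrels(path):
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2:
                query_id, doc_id = int(fields[0]), int(fields[1])
                relevant[query_id].add(doc_id)
    return relevant

# e.g. read_qrels("qrels.text")[1] -> {1410, 1572, 1605, 2020, 2358, ...}
```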
Exercise
Test set:
• 10,000 documents in the database
• for query Q: 50 relevant docs available
Result set for query Q:
• 100 documents retrieved
• 20 of them relevant
What is the recall? 20/50 = 0.4
What is the precision? 20/100 = 0.2
What is the generality? 50/10,000 = 0.005
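The exercise numbers can be checked directly; this is just the arithmetic above written out in Python, with the variable names chosen for readability.

```python
# Recall, precision and generality from the counts in the exercise.
n_docs = 10_000           # documents in the database
n_relevant = 50           # relevant documents for query Q
n_retrieved = 100         # documents in the result set
n_retrieved_relevant = 20 # relevant documents in the result set

recall = n_retrieved_relevant / n_relevant       # 20/50  = 0.4
precision = n_retrieved_relevant / n_retrieved   # 20/100 = 0.2
generality = n_relevant / n_docs                 # 50/10000 = 0.005
print(recall, precision, generality)
```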
Harmonic mean F
• Difficult to compare systems on two numbers: if one system has P=0.4 and R=0.6 and another has P=0.35 and R=0.65, which one is better?
• Combine both numbers with the harmonic mean.
• You go to school by bike: 10 kilometers. In the morning you bike 30 km/h; in the afternoon 20 km/h. What is your average speed?
• 20 + 30 minutes for 20 km: 24 km/h!
• Harmonic mean: (2 * v1 * v2) / (v1 + v2)
Harmonic mean F
• For precision P and recall R: F = (2 * P * R) / (P + R)
• If precision is higher, the F-score is higher too
• If recall is higher, the F-score is higher too
• The F-score is only maximal when precision AND recall are both high
Harmonic mean F
• What if P=0.1 and R=0.9?
• What if P=0.4 and R=0.6?
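As a quick check of these questions (and of the bike example above), here is the harmonic-mean formula in a few lines of Python; the helper name f_score is just for illustration.

```python
# Harmonic mean (F-score with equal weight on P and R): F = 2*P*R / (P + R).
# The same formula reproduces the 24 km/h average speed from the bike example.
def f_score(p, r):
    return 2 * p * r / (p + r)

print(f_score(30, 20))      # bike example: 24.0 km/h
print(f_score(0.1, 0.9))    # 0.18 -- one very low value drags F down
print(f_score(0.4, 0.6))    # 0.48
print(f_score(0.35, 0.65))  # 0.455, so P=0.4/R=0.6 scores slightly higher
```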
Retrieval performance
Retrieval of all relevant docs first.
[Graph: recall (up to 100%) against N, the number of retrieved docs, with curves labelled ideal, parabolic recall, random, and perverse]
Recall and precision on rank x
A set of 80 docs, 4 of them relevant for query Q.
Ordered answer set for Q: NRNNNNNRN..NRN…NRN……N
• Is recall always rising? Rising or staying equal.
• Is precision always falling? Falling or staying equal, but it can also rise: for example, if d3 had rank 9, precision would rise from 2/8 = 0.25 at rank 8 to 3/9 ≈ 0.33 at rank 9.
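A small sketch that walks down a ranked answer set and prints recall and precision at every rank; the string below is an illustrative stand-in for a prefix of the 80-document answer set, not the exact ranking from the slide.

```python
# Recall and precision at every rank of an ordered answer set.
# 'R' marks a relevant document, 'N' a non-relevant one; 4 relevant in total.
ranking = "NRNNNNNRNNRNNNRN"   # illustrative prefix of the answer set
total_relevant = 4

seen = 0
for rank, label in enumerate(ranking, start=1):
    if label == "R":
        seen += 1
    recall = seen / total_relevant   # never decreases
    precision = seen / rank          # can fall, stay equal, or rise
    print(f"rank {rank:2d}: recall={recall:.2f} precision={precision:.2f}")
```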
Recall-precision graph
[Graph: precision (0-100%) plotted against recall (0-100%)]
Precision at the different recall levels:
• for 1 query
• averaged over queries
• for comparing systems
Interpolation
Interpolation: if a higher recall level has a higher precision, use that precision for the lower recall level as well, so the curve shows no spikes.
Single value summaries for ordered result lists (1)
11-pt average precision: the average of the (interpolated) precision values at the 11 recall levels 0%, 10%, …, 100%.
3-pt average precision: the same at the recall levels 20%, 50%, 80%.
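A sketch of interpolation and the 11-point and 3-point averages, assuming the measured (recall, precision) points for one query are available as a list of pairs; the function names are chosen here for illustration.

```python
# Interpolated precision at recall level r is the highest precision measured
# at any recall >= r, so the interpolated curve never spikes.
def interpolated_precision(points, r):
    return max((p for rec, p in points if rec >= r), default=0.0)

def eleven_point_average(points):
    levels = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
    return sum(interpolated_precision(points, r) for r in levels) / 11

def three_point_average(points):
    return sum(interpolated_precision(points, r) for r in (0.2, 0.5, 0.8)) / 3

# 'points' would be the (recall, precision) values measured at the ranks of
# the relevant documents, e.g. [(0.25, 0.50), (0.50, 0.25), (0.75, 0.33), (1.0, 0.10)]
```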
Single value summaries for ordered result lists (2)
p@n: precision at a document cut-off value (n = 5, 10, 20, 50, …); the usual measure for web retrieval (why?)
r@n: recall at a document cut-off value
R-precision: precision at rank R, where R = the total number of relevant docs for the query (why??)
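These cut-off measures are straightforward to compute from a ranked result list and the set of relevant documents; a minimal sketch, with hypothetical function names:

```python
# ranked: list of document ids, best document first
# relevant: set of relevant document ids for the query
def precision_at(n, ranked, relevant):
    return sum(1 for d in ranked[:n] if d in relevant) / n

def recall_at(n, ranked, relevant):
    return sum(1 for d in ranked[:n] if d in relevant) / len(relevant)

def r_precision(ranked, relevant):
    # precision at rank R, where R is the number of relevant documents
    return precision_at(len(relevant), ranked, relevant)
```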
Single value summaries for ordered result lists (3)
Average precision: the average of the precision values measured at the ranks of the relevant docs seen for a query (non-interpolated).
MAP: the mean of the average precisions over a set of queries.
Example average precision

doc   rank   recall   precision
d1      2     0.25      0.50
d2      8     0.50      0.25
d3      9     0.75      0.33
d4     40     1.00      0.10
                        -----
                         1.18

Average precision = 1.18 / 4 = 0.295
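The same calculation in a few lines, assuming the ranks of all relevant documents for the query are known, as in the table above:

```python
# Non-interpolated average precision from the ranks of the relevant documents.
# Assumption: 'relevant_ranks' lists the rank of every relevant document.
def average_precision(relevant_ranks):
    ranks = sorted(relevant_ranks)
    precisions = [(i + 1) / rank for i, rank in enumerate(ranks)]
    return sum(precisions) / len(ranks)

print(average_precision([2, 8, 9, 40]))   # ≈ 0.296 (the 0.295 above uses rounded values)
```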
Experimentation & evaluation in Smart
• Smart can be used interactively and experimentally
• For both, the weighting schemes to use can be defined in the makefile used for indexing
• For experimentation an additional command smart retrieve spec.xxx is needed for each weighting scheme
• This runs the test set and saves the results in a set of files in the index directory
• The evaluation results are available via the experimental commands (Eeval_rr, …) in Smart
Smart evaluation: Eeval_rr
• test set used: CACM
• 2 indexing methods
• default cut-off 15 docs: 15 x 52 (queries)
…continued
• precision at the 11 recall levels 0%, 10%, …, 100%
• and at the 3 recall levels 20%, 50%, 80%
Ranks of relevant docs: Eind_rr
1. Doc weight == Query weight == nnn (term frequency)
2. Doc weight == Query weight == ntc (tf.idf, cosine-normalised)

Query  Doc_id      nnn    ntc
1      1410   :    138      5
1      1572   :     23      4
1      1605   :     31     19
1      2020   :    158    583
1      2358   :     16     17
2      2434   :     82      2
2      2863   :     65      8
2      3078   :     53      3
3      1134   :     35      1
3      1613   :    163     52
3      1807   :     76     51

What is the R-precision for query 1 for nnn? And for ntc?
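Answering that question only needs the rank lists: with R = 5 relevant documents for query 1, R-precision is the fraction of them ranked at position 5 or better. A small sketch:

```python
# R-precision from the ranks of the relevant documents of one query.
def r_precision_from_ranks(relevant_ranks):
    R = len(relevant_ranks)
    return sum(1 for rank in relevant_ranks if rank <= R) / R

print(r_precision_from_ranks([138, 23, 31, 158, 16]))  # nnn for query 1: 0.0
print(r_precision_from_ranks([5, 4, 19, 583, 17]))     # ntc for query 1: 0.4
```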
What do these numbers really mean?
• How do we know the user was happy with the result?
• A click?
• What determines whether you click or not?
  • the snippet
  • term highlighting
• What determines whether you click back and try another result?
Annotator agreement
• To be sure that judgements are reliable, more than one judge should give a rating
• A common measure for the agreement between judges is the kappa statistic
• Of course there will always be some agreement by chance; this expected chance agreement is factored into kappa
Kappa measure
• P(A): the proportion of cases in which the judges agree
• P(E): the expected proportion of agreement by chance
• kappa = (P(A) - P(E)) / (1 - P(E))
• For more than 2 judges, kappa is calculated between each pair and the outcomes are averaged
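A sketch of the kappa calculation for two judges making binary (relevant / non-relevant) judgements, using a 2x2 table of counts; the counts in the example call are made up for illustration.

```python
# Kappa for two judges from a 2x2 contingency table of counts:
#                       judge 2: relevant   judge 2: non-relevant
# judge 1: relevant           a                     b
# judge 1: non-relevant       c                     d
def kappa(a, b, c, d):
    n = a + b + c + d
    p_agree = (a + d) / n                        # observed agreement P(A)
    p_rel_1, p_rel_2 = (a + b) / n, (a + c) / n  # marginal "relevant" rates
    p_chance = p_rel_1 * p_rel_2 + (1 - p_rel_1) * (1 - p_rel_2)  # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

print(kappa(a=70, b=10, c=5, d=15))   # hypothetical counts, kappa ≈ 0.57
```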
Kappa interpretation
• kappa = 1: complete agreement
• kappa > 0.8: good agreement
• kappa > 0.67: fair agreement
• kappa < 0.67: dubious
• kappa = 0: just chance
• kappa < 0: worse than chance
For binary relevance decisions, the agreement is generally not higher than fair.