Evaluation
Types of Evaluation
• Might evaluate several aspects:
  • Assistance in formulating queries
  • Speed of retrieval
  • Resources required
  • Presentation of documents
  • Ability to find relevant documents
• Evaluation is generally comparative
  • System A vs. B
  • System A vs. A′
• Most common evaluation: retrieval effectiveness
The Concept of Relevance
• Relevance of a document D to a query Q is subjective
  • Different users will have different judgments
  • The same user may judge differently at different times
• Degree of relevance of different documents may vary
The Concept of Relevance
• In evaluating IR systems it is assumed that:
  • A subset of the documents in the database (DB) is relevant
  • A document is either relevant or not
Relevance
• In a small collection, the relevance of each document can be checked
• With real collections, we never know the full set of relevant documents
• Any retrieval model includes an implicit definition of relevance
  • Satisfiability of a FOL expression
  • Distance
  • P(Relevance | query, document)
  • P(query | document)
Evaluation
• Set of queries
• Collection of documents (corpus)
• Relevance judgments: which documents are correct and incorrect for each query
  • Example query: "Potato farming and nutritional value of potatoes", with candidate documents (potato blight, growing potatoes, Mr. Potato Head, nutritional info for spuds) judged relevant or not
• If the collection is small, every document can be reviewed
• Not practical for large collections
• Any ideas about how we might approach collecting relevance judgments for very large collections?
Finding Relevant Documents
• Pooling
  • Retrieve documents using several automatic techniques
  • Judge the top n documents for each technique
  • Relevant set is the union of the judged-relevant documents
  • A subset of the true relevant set
  • Possible to estimate the size of the relevant set by sampling
• When testing:
  • How should un-judged documents be treated?
  • How might this affect results?
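A minimal sketch of how pooling might be implemented, assuming each system's run is available as a ranked list of document IDs; the function name, pool depth, and example runs are illustrative, not from any particular toolkit:

```python
def build_pool(runs, depth=100):
    """Form a judgment pool: the union of the top-`depth` documents
    from each system's ranked run."""
    pool = set()
    for ranked_docs in runs:          # each run is a ranked list of doc IDs
        pool.update(ranked_docs[:depth])
    return pool

# Three hypothetical systems' runs for one query
runs = [
    ["d3", "d7", "d1", "d9"],
    ["d7", "d2", "d3", "d8"],
    ["d5", "d3", "d6", "d7"],
]
pool = build_pool(runs, depth=3)      # assessors judge only these documents
print(sorted(pool))                   # un-judged documents are commonly treated as non-relevant
```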
Test Collections
• To compare the performance of two techniques:
  • each technique is used to run the same queries
  • results (set or ranked list) are compared using a metric
  • most common measures: precision and recall
• Usually use multiple measures to get different views of performance
• Usually test with multiple collections
  • performance is collection dependent
Evaluation
• (Venn diagram: retrieved documents, relevant documents, and their overlap, the relevant & retrieved set)
• Let retrieved = 100, relevant = 25, relevant & retrieved = 10
• Recall = 10/25 = 0.40 (ability to return ALL relevant items)
• Precision = 10/100 = 0.10 (ability to return ONLY relevant items)
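The slide's numbers follow directly from the set definitions; a small sketch using the counts given above:

```python
retrieved = 100        # documents returned by the system
relevant = 25          # relevant documents in the collection
rel_and_ret = 10       # relevant documents that were retrieved

recall = rel_and_ret / relevant       # 10 / 25  = 0.40
precision = rel_and_ret / retrieved   # 10 / 100 = 0.10
print(f"Recall = {recall:.2f}, Precision = {precision:.2f}")
```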
Precision and Recall
• Precision and recall are well-defined for sets
• For ranked retrieval:
  • Compute value at fixed recall points (e.g., precision at 20% recall)
  • Compute a P/R point for each relevant document, interpolate
  • Compute value at fixed rank cutoffs (e.g., precision at rank 20)
Average Precision for a Query
• Often want a single-number effectiveness measure
• Average precision is widely used in IR
• Calculated by averaging the precision values obtained each time a relevant document is retrieved (i.e., each time recall increases)
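As a sketch, average precision for one query can be computed from the ranked result list and the set of relevant documents (names are illustrative); here the sum is divided by the total number of relevant documents, the common convention, so relevant documents that are never retrieved pull the average down:

```python
def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant
    document is retrieved (i.e., each point where recall increases)."""
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical example: relevant docs found at ranks 1 and 3
print(average_precision(["d1", "d4", "d2"], {"d1", "d2"}))   # (1/1 + 2/3) / 2 ≈ 0.83
```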
Averaging Across Queries
• Hard to compare P/R graphs or tables for individual queries (too much data)
• Need to average over many queries
• Two main types of averaging:
  • Micro-average: each relevant document is a point in the average (most common)
  • Macro-average: each query is a point in the average
• Also done with average precision values
  • Average of many queries' average precision values
  • Called mean average precision (MAP)
  • "Average average precision" sounds weird
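Mean average precision is then just the mean of the per-query average precision values. A sketch, assuming the `average_precision` function from the previous block and a hypothetical `results` mapping from query IDs to (ranking, relevant set) pairs:

```python
def mean_average_precision(results):
    """`results` maps each query ID to a (ranking, relevant_set) pair."""
    ap_values = [average_precision(ranking, relevant)
                 for ranking, relevant in results.values()]
    return sum(ap_values) / len(ap_values)

# Hypothetical two-query example
results = {
    "q1": (["d1", "d4", "d2"], {"d1", "d2"}),   # AP ≈ 0.83
    "q2": (["d9", "d3", "d7"], {"d3"}),         # AP = 0.50
}
print(mean_average_precision(results))          # ≈ 0.67
```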
Averaging and Interpolation
• Interpolation
  • Actual recall levels of individual queries are seldom equal to the standard levels
  • Interpolation estimates the best possible performance value between two known values
  • e.g.) assume 3 relevant docs are retrieved at ranks 4, 9, 20
    • their precision at the actual recall points is .25, .22, and .15
• On average, as recall increases, precision decreases
Averaging and Interpolation
• Actual recall levels of individual queries are seldom equal to the standard levels
• Interpolated precision at the ith recall level, Ri, is the maximum precision at all points p such that Ri ≤ p ≤ Ri+1
• e.g.) assume only 3 relevant docs are retrieved, at ranks 4, 9, 20
  • their actual recall points are: .33, .67, and 1.0
  • their precision is .25, .22, and .15
  • what is the interpolated precision at the standard recall points?

Recall level           Interpolated precision
0.0, 0.1, 0.2, 0.3     0.25
0.4, 0.5, 0.6          0.22
0.7, 0.8, 0.9, 1.0     0.15
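A sketch of the interpolation step for the example above. It uses the widely used "maximum precision at any recall level at or above R" rule, which reproduces the table shown; the function name and the (recall, precision) point format are illustrative:

```python
def interpolated_precision(recall_precision, levels=None):
    """Interpolated precision at standard recall levels: for each level R,
    take the maximum precision observed at any recall >= R."""
    if levels is None:
        levels = [i / 10 for i in range(11)]        # 0.0, 0.1, ..., 1.0
    interp = {}
    for level in levels:
        candidates = [p for r, p in recall_precision if r >= level]
        interp[level] = max(candidates) if candidates else 0.0
    return interp

# Observed (recall, precision) points: relevant docs at ranks 4, 9, 20
points = [(0.33, 0.25), (0.67, 0.22), (1.0, 0.15)]
for level, p in interpolated_precision(points).items():
    print(f"{level:.1f}: {p:.2f}")
```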
Interpolated Average Precision
• Average precision at standard recall points
  • For a given query, compute a P/R point for every relevant doc
  • Interpolate precision at standard recall levels
    • 11-pt is usually 100%, 90, 80, ..., 10, 0% (yes, 0% recall)
    • 3-pt is usually 75%, 50%, 25%
  • Average over all queries to get average precision at each recall level
  • Average across the interpolated recall levels to get a single result
    • Called "interpolated average precision"
• Not used much anymore; "mean average precision" is more common
• Values at specific interpolated points are still commonly used
Micro-averaging: 1 Query
• Let Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
  • |Rq| = 10, the number of relevant docs for q
• Ranking of retrieved docs in the answer set of q:
• Find precision given the total number of docs retrieved at a given recall value
  • 10% recall: .1 * 10 rel docs = 1 rel doc retrieved
    • 1 doc retrieved to get 1 rel doc: precision = 1/1 = 100%
  • 20% recall: .2 * 10 rel docs = 2 rel docs retrieved
    • 3 docs retrieved to get 2 rel docs: precision = 2/3 = 0.667
  • 30% recall: .3 * 10 rel docs = 3 rel docs retrieved
    • 6 docs retrieved to get 3 rel docs: precision = 3/6 = 0.5
• What is the precision at recall values from 40-100%?
Recall/Precision Curve
• (Plot of the recall/precision curve for this query: precision on the y-axis, recall on the x-axis)
• |Rq| = 10, the number of relevant docs for q
• Ranking of retrieved docs in the answer set of q:

Recall    Precision
0.1       1/1 = 100%
0.2       2/3 = 67%
0.3       3/6 = 50%
0.4       4/10 = 40%
0.5       5/15 = 33%
0.6       0%
...       ...
1.0       0%
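The table above can be generated mechanically. A sketch, assuming the relevant documents of Rq appear at ranks 1, 3, 6, 10, and 15 of the ranking (the positions implied by the precision values shown):

```python
relevant_ranks = [1, 3, 6, 10, 15]   # ranks where relevant docs appear (implied by the table)
total_relevant = 10                  # |Rq|

for i, rank in enumerate(relevant_ranks, start=1):
    recall = i / total_relevant
    print(f"recall {recall:.1f}: precision = {i}/{rank} = {i / rank:.0%}")
# Only 5 of the 10 relevant docs are ever retrieved, so precision at
# recall levels 0.6 through 1.0 is reported as 0%.
```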
Averaging and Interpolation
• Macro-average: each query is a point in the average
  • can be independent of any parameter
  • average of precision values across several queries at standard recall levels
• e.g.) assume 3 relevant docs are retrieved at ranks 4, 9, 20
  • their actual recall points are: .33, .67, and 1.0 (why?)
  • their precision is .25, .22, and .15 (why?)
• Average over all relevant docs: (.25 + .22 + .15)/3 = 0.21
  • rewards systems that retrieve relevant docs at the top
Document Level Averages
• Precision after a given number of docs has been retrieved
  • e.g.) 5, 10, 15, 20, 30, 100, 200, 500, & 1000 documents
• Reflects the actual system performance as a user might see it
• Each precision average is computed by summing the precisions at the specified doc cutoff and dividing by the number of queries
  • e.g., average precision over all queries at the point where n docs have been retrieved
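A sketch of document-level averages, assuming the same hypothetical `results` structure as before (query ID mapped to a ranking and a relevant set); names and cutoffs are illustrative:

```python
def precision_at(ranking, relevant, k):
    """Precision after k documents have been retrieved."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision_at(results, cutoffs=(5, 10, 15, 20, 30, 100, 200, 500, 1000)):
    """For each cutoff k, average precision@k over all queries.
    `results` maps query IDs to (ranking, relevant_set) pairs."""
    return {k: sum(precision_at(ranking, relevant, k)
                   for ranking, relevant in results.values()) / len(results)
            for k in cutoffs}
```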
R-Precision
• Precision after R documents are retrieved
  • R = number of relevant docs for the query
• Average R-Precision
  • mean of the R-Precisions across all queries
• e.g.) Assume 2 queries with 50 & 10 relevant docs; the system retrieves 17 and 7 relevant docs in the top 50 and top 10 documents retrieved, respectively
  • Average R-Precision = (17/50 + 7/10) / 2 = (0.34 + 0.70) / 2 = 0.52
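A sketch of the computation, with the slide's two-query example checked directly from the counts given:

```python
def r_precision(ranking, relevant):
    """Precision after R documents are retrieved, where R = |relevant|."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r if r else 0.0

# The slide's example, using the counts directly:
# query 1: 17 relevant in the top 50 (R = 50); query 2: 7 relevant in the top 10 (R = 10)
average_r_precision = (17 / 50 + 7 / 10) / 2
print(average_r_precision)   # 0.52
```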
Evaluation
• Recall-Precision value pairs may co-vary in ways that are hard to understand
• Would like to find composite measures
  • A single-number measure of effectiveness
  • primarily ad hoc and not theoretically justifiable
• Some attempts to invent measures that combine parts of the contingency table into a single-number measure
Contingency Table

                    Relevant      Not relevant
Retrieved           A             B
Not retrieved       C             D

Miss = C/(A+C)
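Assuming the cell labels above (A = relevant and retrieved, B = non-relevant but retrieved, C = relevant but not retrieved, D = non-relevant and not retrieved), the usual measures can be read off the table. A small sketch; the counts are hypothetical, chosen to mirror the earlier example (100 retrieved, 25 relevant, 10 relevant and retrieved) in a collection of 1000 documents:

```python
# Cell labels from the contingency table:
#   A = relevant and retrieved        B = non-relevant but retrieved
#   C = relevant but not retrieved    D = non-relevant and not retrieved
A, B, C, D = 10, 90, 15, 885          # hypothetical counts

recall    = A / (A + C)   # 0.40: fraction of relevant docs that were retrieved
precision = A / (A + B)   # 0.10: fraction of retrieved docs that are relevant
miss      = C / (A + C)   # 0.60: relevant docs the system failed to retrieve
fallout   = B / (B + D)   # ~0.09: non-relevant docs incorrectly retrieved
```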
Symmetric Difference
• A is the retrieved set of documents
• B is the relevant set of documents
• A Δ B (the symmetric difference) is the shaded area in the Venn diagram: the documents in exactly one of A and B
E measure (van Rijsbergen)
• Used to emphasize precision or recall
  • like a weighted average of precision and recall: E = 1 - 1/(α/P + (1-α)/R)
  • a large α increases the importance of precision
• Can transform by α = 1/(β² + 1), with β = P/R
  • this gives E = 1 - (β² + 1)PR / (β²P + R)
  • when α = 1/2 (β = 1), precision and recall are equally important
• E is a normalized symmetric difference of the retrieved and relevant sets
  • E(β=1) = |A Δ B| / (|A| + |B|)
• F = 1 - E is typical (good results mean larger values of F)
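A sketch of the E and F computations in the weighted form E = 1 - (β²+1)PR / (β²P + R); the function names are illustrative, and the example call reuses the P and R values from the earlier set-based example:

```python
def e_measure(precision, recall, beta=1.0):
    """van Rijsbergen's E measure: 0 is best, 1 is worst."""
    if precision == 0 or recall == 0:
        return 1.0
    b2 = beta ** 2
    return 1 - (b2 + 1) * precision * recall / (b2 * precision + recall)

def f_measure(precision, recall, beta=1.0):
    """F = 1 - E; with beta = 1 this is the harmonic mean of P and R."""
    return 1 - e_measure(precision, recall, beta)

print(f_measure(0.10, 0.40))   # P and R from the earlier example: F = 0.16
```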