This lecture discusses evaluation measures such as precision, recall, accuracy, mean average precision, and more for assessing the performance of information retrieval systems.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 25 BENCHMARKS FOR THE EVALUATION OF IR SYSTEMS
ACKNOWLEDGEMENTS The presentation of this lecture has been drawn from the following sources • “Introduction to Information Retrieval” by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze • “Managing Gigabytes” by Ian H. Witten, Alistair Moffat, and Timothy C. Bell • “Modern Information Retrieval” by Ricardo Baeza-Yates • “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla
Outline • Evaluation Measures • Precision and Recall • Unranked retrieval evaluation • Trade-off between Recall and Precision • Computing Recall/Precision Points
Evaluation Measures • Precision • Recall • Accuracy • Mean Average Precision • F-Measure/E-Measure • Non-Binary Relevance • Discounted Cumulative Gain • Normalized Discounted Cumulative Gain
Precision and Recall • [Figure: the entire document collection split into relevant and irrelevant documents, with the retrieved set overlapping both, yielding four regions: retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, and not retrieved & irrelevant]
Unranked retrieval evaluation: Precision and Recall • Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved) • Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant) • Precision P = tp/(tp + fp) • Recall R = tp/(tp + fn)
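As a quick sanity check of these formulas, here is a minimal Python sketch (the counts and function names are illustrative, not part of the lecture) that computes precision and recall from the confusion-matrix counts tp, fp, and fn:

```python
def precision(tp, fp):
    # Fraction of retrieved docs that are relevant: tp / (tp + fp)
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    # Fraction of relevant docs that are retrieved: tp / (tp + fn)
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

# Hypothetical query: 20 relevant docs retrieved, 10 irrelevant docs retrieved,
# 30 relevant docs missed.
print(precision(20, 10))  # 0.666... = 20 / 30
print(recall(20, 30))     # 0.4      = 20 / 50
```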
Should we instead use the accuracy measure for evaluation? • Given a query, an engine classifies each doc as “Relevant” or “Nonrelevant” • The accuracy of an engine: the fraction of these classifications that are correct • ACCURACY = (tp + tn) / ( tp + fp + fn + tn) • Accuracy is a commonly used evaluation measure in machine learning classification work • Why is this not a very useful evaluation measure in IR?
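One standard answer, illustrated below with invented numbers: in IR the overwhelming majority of documents are nonrelevant to any given query, so an engine that retrieves nothing at all still scores near-perfect accuracy, while precision and recall immediately expose the failure.

```python
# Hypothetical collection of 1,000,000 docs, only 50 of them relevant to the query.
# An engine that labels every doc "Nonrelevant" (i.e., retrieves nothing):
tp, fp, fn, tn = 0, 0, 50, 999_950

accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy)  # 0.99995 -- looks excellent, yet no relevant doc was returned

# Recall = tp / (tp + fn) = 0, revealing that the engine is useless for this query.
```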
Precision and Recall • Precision • The ability to retrieve top-ranked documents that are mostly relevant. • Recall • The ability of the search to find all of the relevant items in the corpus.
Determining Recall is Difficult • The total number of relevant items is sometimes not available, so it is estimated: • Sample across the database and perform relevance judgments on the sampled items. • Apply different retrieval algorithms to the same database for the same query; the aggregate of relevant items found is taken as the total relevant set.
Trade-off between Recall and Precision • [Figure: plot of precision (0 to 1) against recall (0 to 1). The high-precision end returns relevant documents but misses many useful ones; the high-recall end returns most relevant documents but includes lots of junk; the ideal system sits at high precision and high recall.]
Computing Recall/Precision Points • For a given query, produce the ranked list of retrievals. • Adjusting a threshold on this ranked list produces different sets of retrieved documents, and therefore different recall/precision measures. • Mark each document in the ranked list that is relevant according to the gold standard. • Compute a recall/precision pair for each position in the ranked list that contains a relevant document.
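A minimal Python sketch of this procedure, assuming the ranked list is given as a sequence of document IDs and the gold standard as a set of relevant IDs (both names are illustrative):

```python
def recall_precision_points(ranked_docs, relevant_docs):
    """Return (recall, precision) pairs at each rank holding a relevant document."""
    points = []
    hits = 0
    total_relevant = len(relevant_docs)
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant_docs:
            hits += 1
            # Recall and precision measured at this position in the ranking.
            points.append((hits / total_relevant, hits / rank))
    return points
```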
Computing Recall/Precision Points: Example 1 • Let total # of relevant docs = 6 • Check each new recall point: • R=1/6=0.167; P=1/1=1 • R=2/6=0.333; P=2/2=1 • R=3/6=0.5; P=3/4=0.75 • R=4/6=0.667; P=4/6=0.667 • R=5/6=0.833; P=5/13=0.385 • One relevant document is never retrieved, so 100% recall is never reached.
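Feeding the sketch above a ranking consistent with Example 1 (hypothetical document IDs, with relevant documents assumed at ranks 1, 2, 4, 6, and 13, and one relevant document never retrieved) reproduces these points:

```python
ranked = [f"d{i}" for i in range(1, 14)]            # 13 retrieved docs: d1 .. d13
relevant = {"d1", "d2", "d4", "d6", "d13", "d99"}   # d99 is never retrieved

for r, p in recall_precision_points(ranked, relevant):
    print(f"R={r:.3f}  P={p:.3f}")
# R=0.167 P=1.000
# R=0.333 P=1.000
# R=0.500 P=0.750
# R=0.667 P=0.667
# R=0.833 P=0.385
```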