Lecture 3: Retrieval Evaluation Maya Ramanath
Benchmarking IR Systems
Result Quality
• Data collection
  – e.g., the archives of the NYTimes
• Query set
  – provided by experts, identified from real search logs, etc.
• Relevance judgements
  – for a given query, is the document relevant?
Evaluation for Large Collections
• Cranfield/TREC paradigm
  – Pooling of results
• A/B testing
  – Possible for search engines
• Crowdsourcing
  – Let users decide
Precision and Recall
• Relevance judgements are binary: "relevant" or "not relevant".
• For a given query, they partition the collection into two parts: relevant and non-relevant documents.
• Precision: the fraction of retrieved documents that are relevant.
• Recall: the fraction of relevant documents that are retrieved.
Can a search engine guarantee 100% recall?
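A minimal sketch (not from the lecture) of how precision and recall can be computed for a single query from sets of document IDs; the `retrieved` and `relevant` inputs below are hypothetical examples.

```python
# Sketch: precision and recall from sets of document IDs.
# `retrieved` and `relevant` are hypothetical example inputs.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given binary relevance."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)              # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant; 6 documents
# in the collection are relevant in total.
p, r = precision_recall(retrieved={1, 2, 3, 4, 5}, relevant={2, 3, 5, 8, 9, 10})
print(p, r)   # 0.6  0.5
```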
F-measure
• F-measure: weighted harmonic mean of Precision (P) and Recall (R):
  F_beta = (1 + beta^2) · P · R / (beta^2 · P + R), with beta = 1 giving F1 = 2PR / (P + R)
Why use the harmonic mean instead of the arithmetic mean?
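A small sketch of the weighted F-measure, assuming the standard formulation given above; the example values are the hypothetical precision/recall numbers from the previous sketch.

```python
# Sketch: weighted F-measure from precision and recall.
# beta > 1 weights recall more heavily, beta < 1 weights precision.

def f_measure(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_measure(0.6, 0.5))    # F1 = 2*0.6*0.5 / (0.6+0.5) ~ 0.545
print(f_measure(1.0, 0.01))   # ~ 0.0198 -- the harmonic mean punishes the low recall
```

The second call hints at the answer to the slide's question: unlike the arithmetic mean, the harmonic mean stays low when either precision or recall is very low.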
Precision-Recall Curves
• Using precision and recall to evaluate ranked retrieval
Source: Introduction to Information Retrieval. Manning, Raghavan and Schuetze, 2008.
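A sketch of how the (recall, precision) points of such a curve might be computed for one ranked result list with binary judgements; `ranking` and `relevant` are hypothetical inputs, and plotting is left out.

```python
# Sketch: (recall, precision) points along a ranked result list,
# given binary relevance judgements. Inputs are hypothetical examples.

def pr_curve(ranking, relevant):
    """Return a list of (recall, precision) pairs, one per rank position."""
    relevant = set(relevant)
    hits = 0
    points = []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / i))
    return points

for recall, precision in pr_curve(ranking=[3, 7, 1, 9, 4], relevant={3, 1, 4, 12}):
    print(f"recall={recall:.2f}  precision={precision:.2f}")
```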
Single-value Measures
• Precision at k: P@10, P@100, etc.
• …and others
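A one-function sketch of precision at cutoff k, reusing the hypothetical ranking from the previous sketch.

```python
# Sketch: precision at cutoff k (e.g. P@10), with hypothetical inputs.

def precision_at_k(ranking, relevant, k):
    top_k = ranking[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

print(precision_at_k(ranking=[3, 7, 1, 9, 4], relevant={3, 1, 4, 12}, k=3))  # 2/3
```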
Graded Relevance – NDCG
• Highly relevant documents should carry more weight than marginally relevant ones
• The higher a relevant document is ranked (i.e., the closer it is to the top of the list), the more valuable it is to the user
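A sketch of NDCG@k assuming the common gain/discount choice DCG@k = sum_i (2^rel_i − 1) / log2(i + 1); the lecture may use a different DCG variant, and the relevance grades in the example are hypothetical.

```python
import math

# Sketch of NDCG@k with graded relevance.
# Assumes the common form  DCG@k = sum_i (2^rel_i - 1) / log2(i + 1);
# other gain/discount variants exist.

def dcg(grades, k):
    return sum((2 ** g - 1) / math.log2(i + 1)
               for i, g in enumerate(grades[:k], start=1))

def ndcg(grades, k):
    ideal = sorted(grades, reverse=True)   # best possible ordering of the same grades
    best = dcg(ideal, k)
    return dcg(grades, k) / best if best > 0 else 0.0

# Relevance grades of the top 5 results (hypothetical), on a 0-3 scale.
print(ndcg([3, 2, 0, 1, 2], k=5))
```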
Inter-judge Agreement – Fleiss' Kappa
• N – number of results
• n – number of ratings per result
• k – number of grades
• n_ij – number of judges who agree that the i-th result should have grade j
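A sketch of the standard Fleiss' kappa computation written in the slide's notation (N results, n ratings per result, k grades, n_ij counts); the rating counts in the example are hypothetical.

```python
# Sketch of Fleiss' kappa in the slide's notation:
# N results, n ratings per result, k grades,
# n_ij[i][j] = number of judges assigning grade j to result i.

def fleiss_kappa(n_ij):
    N = len(n_ij)                    # number of results
    n = sum(n_ij[0])                 # ratings per result (assumed constant)
    k = len(n_ij[0])                 # number of grades

    # P_i: extent of agreement on result i
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in n_ij]
    P_bar = sum(P_i) / N             # mean observed agreement

    # p_j: proportion of all ratings that used grade j
    p_j = [sum(row[j] for row in n_ij) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)    # agreement expected by chance

    return (P_bar - P_e) / (1 - P_e)

# 3 results rated by 4 judges on a 2-grade scale (hypothetical counts).
print(fleiss_kappa([[4, 0], [2, 2], [3, 1]]))
```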
Tests of Statistical Significance
• Wilcoxon signed-rank test
• Student's paired t-test
• …and more
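A hedged sketch of how two systems might be compared with these tests on their per-query scores (e.g. average precision), using SciPy's wilcoxon and ttest_rel; the per-query values are hypothetical and SciPy is not part of the lecture material.

```python
from scipy import stats

# Sketch: comparing two systems on the same query set using their
# per-query scores (e.g. average precision). Values are hypothetical.
system_a = [0.42, 0.55, 0.31, 0.60, 0.48, 0.52, 0.39, 0.44]
system_b = [0.40, 0.58, 0.35, 0.66, 0.47, 0.57, 0.45, 0.49]

# Wilcoxon signed-rank test (non-parametric, paired)
w_stat, w_p = stats.wilcoxon(system_a, system_b)

# Student's paired t-test
t_stat, t_p = stats.ttest_rel(system_a, system_b)

print(f"Wilcoxon p = {w_p:.3f}, paired t-test p = {t_p:.3f}")
```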