Course contents • 5 Evaluation in information retrieval
The Principle of Information Retrieval Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics, 2011
How to measure user happiness • The key utility measure is user happiness • Relevance of results (effectiveness) • Speed of response • Size of the index • User interface design • Independent of the quality of the results • Utility, success, completeness, satisfaction, worth, value, time, cost, …
The role of evaluation • Evaluation plays an important role in information retrieval research • Evaluation has always been at the core of the research and development of information retrieval systems, to the point that an algorithm and the way its effectiveness is evaluated are treated as one and the same (Saracevic, SIGIR 1995)
The origins of IR system evaluation • Kent et al. were the first to propose the concepts of Precision and Recall (initially termed relevance) (Kent, 1955)
The origins of IR system evaluation • Cranfield-like evaluation methodology • From the late 1950s to the early 1960s, the Cranfield experiments established an evaluation methodology based on a set of sample queries, a set of standard answers (relevance judgments), and a document corpus; it is known as the "grand-daddy" of IR evaluation • It established the central position of evaluation in information retrieval research • Cranfield is a place name and also the name of a research institute
The origins of IR system evaluation • Gerard Salton and the SMART system • Gerard Salton was the principal developer of the SMART system. SMART was the first to provide a research platform on which one could concentrate on retrieval algorithms without worrying about indexing and other infrastructure; it also provided evaluation facilities: once the relevance judgments were supplied, it could compute the commonly used metrics
The origins of IR system evaluation • Sparck Jones's book "Information Retrieval Experiment" • It mainly discusses IR experiments and evaluation
How to measure information retrieval effectiveness (1/2) • We need a test collection consisting of three things • A document collection • A test suite of information needs, expressible as queries • A set of relevance judgments, standardly a binary assessment of either relevant or not relevant for each query-document pair
How to measure information retrieval effectiveness (2/2) • In this test collection • A document is given a binary classification as either relevant or not relevant • The collection and the suite of information needs have to be of a reasonable size • Results are highly variable over different documents and information needs • As a rule of thumb, at least 50 information needs are needed
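To make the three components concrete, here is a minimal sketch in Python of how such a test collection might be represented; the field names and toy data are hypothetical and not drawn from any standard collection.

```python
# Minimal sketch of a test collection: documents, information needs (queries),
# and binary relevance judgments (qrels). All identifiers and texts are made up.

documents = {
    "d1": "aerodynamic heating of wings at high speed",
    "d2": "boundary layer theory for compressible flow",
    "d3": "history of the university library",
}

queries = {
    "q1": "high speed aerodynamic heating",
}

# Binary assessment: a (query_id, doc_id) pair is present iff judged relevant.
qrels = {
    ("q1", "d1"),
}
```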
Difficulties • The difference between the stated information need and the query • Relevance is assessed relative to an information need, not a query • The subjectivity of relevance decisions • Many systems contain various parameters that can be adjusted to tune system performance • The correct procedure is to have one or more development test collections for such tuning
Difficulties • Voorhees estimated that judging the relevance of a collection of 8 million documents for a single query topic would take one assessor about nine months • TREC introduced the pooling method, which greatly reduces the judging workload while keeping the evaluation results reliable • Drawback: only a small number of queries can be handled, and even for a small query set it still takes more than ten assessors one to two months
5.2 Standard test collections • The Cranfield collection • Text Retrieval Conference (TREC) • GOV2 • NII Test Collections for IR Systems (NTCIR) • Reuters-21578 and Reuters-RCV1
The Cranfield collection • Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles and a set of 225 queries • It allows precise quantitative measures • But it is nowadays too small for more than pilot experiments
Text Retrieval Conference (TREC) (1/2) • The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992 • Over a range of different test collections • But the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations from 1992 to 1999 • Comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles)
Text Retrieval Conference (TREC) (2/2) • TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles • Relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation
GOV2 • Contains 25 million web pages crawled from the .gov domain • GOV2 is one of the largest Web test collections, but still more than 3 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies
NII Test Collections for IR Systems (NTCIR) • Similar sizes to the TREC collections • Focusing on East Asian language and cross-language information retrieval
Reuters-21578 and Reuters-RCV1 • Most often used for text classification • The Reuters-21578 collection contains 21578 newswire articles • Reuters Corpus Volume 1 (RCV1) is much larger, consisting of 806,791 documents
5.3 Evaluation of unranked retrieval sets • Precision • Recall • Accuracy • F measure
Precision • Precision (P) is the fraction of retrieved documents that are relevant: P = #(relevant items retrieved) / #(retrieved items)
Recall • Recall (R) is the fraction of relevant documents that are retrieved: R = #(relevant items retrieved) / #(relevant items)
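A minimal sketch of how both set-based measures can be computed, assuming the retrieved and relevant document IDs for one query are available as Python sets (the IDs below are hypothetical):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Compute set-based precision and recall for a single query."""
    true_positives = retrieved & relevant  # relevant documents that were retrieved
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 3 of the 4 retrieved documents are relevant,
# and 3 of the 5 relevant documents were found.
p, r = precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d5", "d9"})
print(p, r)  # 0.75 0.6
```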
The comparison of P and R • Typical web surfers prefer P to R • Various professional searchers such as paralegals and intelligence analysts prefer R to P • Individuals searching their hard disks prefer R to P
The comparison of P and R • The two quantities clearly trade off against one another • Recall is a non-decreasing function of the number of documents retrieved • Can always get a recall of 1 (but very low precision) by retrieving all documents for all queries • Precision usually decreases as the number of documents retrieved is increased
Which is more difficult to measure? • Precision • Recall
Questioning recall • The denominator (the number of relevant documents in the whole collection) cannot be determined, so a recall value built on it is not practical • Why were relevant documents not retrieved? • Is it the quality of the database design or the user's search skills? • Is it an indexing problem or a retrieval problem? • Or does every database have some inherent relevance coefficient?
Questioning precision • When a database contains many documents that should have been retrieved but were not, is it meaningful that 100% of the retrieved documents are relevant? • How do the non-relevant documents in the denominator of precision arise? • Are they caused by the system? • By the user's poorly expressed query? • Or by the user's final selection of results?
Questioning the relationship between recall and precision (1/2) • [Diagram: the document collection divided into four regions] a: relevant and retrieved; b: irrelevant and retrieved; c: relevant and unretrieved; remainder: irrelevant and unretrieved • In this notation, Recall = a/(a+c) and Precision = a/(a+b)
Questioning the relationship between recall and precision (2/2) • They are generally assumed to be inversely related • If a/(c+a) increases, then c must decrease • c can decrease for two reasons: the boundary of c moves down or the boundary of b moves right; either way b increases, so a/(b+a) decreases, and vice versa • But this rests on the assumption that a is a constant • In fact the constant is not a but (c+a): if c decreases, a must increase, so a is a variable • Therefore there is no necessary link between b and c; one can imagine the b and c boundaries moving toward the edges at the same time. Having both b and c equal to 0 is unlikely, but not impossible • In practice, a retrieval system can improve both measures at the same time
The other way to define P and R • P = tp/(tp + fp) • R = tp/(tp + fn)
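The tp/fp/fn/tn notation comes from the standard 2x2 contingency table for a single query; a LaTeX rendering of that table, consistent with the regions a, b, c above, is:

```latex
% Standard 2x2 contingency table for one query
% (region a above corresponds to tp, b to fp, c to fn).
\[
\begin{array}{l|cc}
                     & \text{Relevant} & \text{Nonrelevant} \\ \hline
\text{Retrieved}     & tp \;(=a)       & fp \;(=b)          \\
\text{Not retrieved} & fn \;(=c)       & tn                 \\
\end{array}
\]
```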
F measure (1/4) • A single measure that trades off precision versus recall • It is the weighted harmonic mean of P and R, with the weighting controlled by α ∈ [0, 1] and thus β² ∈ [0, ∞] • The default, balanced F measure uses β = 1
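The formula itself does not survive in the extracted slide text; the standard definition, consistent with the α and β mentioned above, is:

```latex
\[
F \;=\; \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}}
  \;=\; \frac{(\beta^{2}+1)PR}{\beta^{2}P + R},
\qquad \beta^{2} = \frac{1-\alpha}{\alpha},
\qquad F_{\beta=1} = \frac{2PR}{P+R}
\]
```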
F measure (2/4) • Values of β < 1 emphasize precision, while values of β > 1 emphasize recall • It is a harmonic mean rather than the simpler arithmetic average
F measure (3/4) • The harmonic mean is always less than or equal to the arithmetic and geometric means, and is often quite close to the minimum of the two numbers • This strongly suggests that the arithmetic mean is an unsuitable measure, because it is closer to the maximum of the two numbers than the harmonic mean is
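A small sketch of the general F measure, together with a comparison against the arithmetic mean for a degenerate run that returns everything (recall 1, near-zero precision); the numbers are illustrative only:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives balanced F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Degenerate run: return every document -> recall = 1, precision is tiny.
p, r = 0.0001, 1.0
print((p + r) / 2)        # arithmetic mean ~0.5, looks deceptively good
print(f_measure(p, r))    # F1 ~0.0002, stays close to the smaller value
```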
Accuracy (1/2) • Accuracy is the fraction of the system's classifications that are correct • An information retrieval system can be thought of as a two-class classifier (relevant vs. not relevant) • Accuracy = (tp + tn) / (tp + fp + fn + tn)
Accuracy (2/2) • Often used for evaluating machine learning classification problems • Not an appropriate measure for information retrieval problems • Normally over 99.9% of the documents are in the not-relevant category • Accuracy can therefore be maximized by simply deeming all documents non-relevant to all queries, that is, tn is overwhelmingly large
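A toy illustration of why accuracy is misleading here: with a hypothetical collection of 1,000,000 documents of which only 100 are relevant, a system that retrieves nothing still scores over 99.9% accuracy.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of query-document classification decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical collection: 1,000,000 documents, 100 of them relevant to the query.
# A "retrieve nothing" system: tp = 0, fp = 0, fn = 100, tn = 999_900.
print(accuracy(tp=0, fp=0, fn=100, tn=999_900))  # 0.9999 — yet precision and recall are 0
```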
5.4 Evaluation of ranked retrieval results • Precision, recall, and the F measure are set-based measures and are computed using unordered sets of documents • In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents
The main measures • Precision-recall curve • Mean Average Precision (MAP) • Precision at k • R-precision • ROC curve
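As a preview of two of these measures, here is a minimal sketch of precision at k and non-interpolated average precision for a single ranked result list; MAP is then the mean of average precision over all queries. The ranking and judgments below are made up for illustration.

```python
def precision_at_k(ranking: list, relevant: set, k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking: list, relevant: set) -> float:
    """Mean of precision@rank taken at each rank where a relevant document appears,
    divided by the total number of relevant documents (retrieved or not)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]   # ranked output for one query
relevant = {"d1", "d2", "d5"}              # relevance judgments for that query
print(precision_at_k(ranking, relevant, 3))   # 1 relevant doc in top 3 -> 0.333...
print(average_precision(ranking, relevant))   # (1/2 + 2/4) / 3 = 0.333...
```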