Course contents • 5 Evaluation in information retrieval
The Principle of Information Retrieval Department of Information Management, School of Information Engineering, Nanjing University of Finance & Economics, 2011
How to measure user happiness • The key utility measure is user happiness • Relevance of results (effectiveness) • Speed of response • Size of the index • User interface design • Independent of the quality of the results • Utility, success, completeness, satisfaction, worth, value, time, cost, …
The role of evaluation • Evaluation plays an important role in information retrieval research • Evaluation has always been at the core of the research and development of information retrieval systems, to the point that an algorithm and the way its effectiveness is evaluated are treated as one and the same (Saracevic, SIGIR 1995)
The origins of IR system evaluation • Kent et al. were the first to propose the concepts of Precision and Recall (initially termed relevance) (Kent, 1955)
The origins of IR system evaluation • Cranfield-like evaluation methodology • From the late 1950s to the early 1960s, the Cranfield experiments established an evaluation methodology based on a set of sample queries, a set of standard answers (relevance judgments), and a document corpus; it is known as the "grand-daddy" of IR evaluation • It established the central position of evaluation in information retrieval research • Cranfield is a place name and also the name of a research institute
The origins of IR system evaluation • Gerard Salton and the SMART system • Gerard Salton was the principal developer of the SMART system. SMART was the first to provide a research platform on which one could concentrate on retrieval algorithms without worrying about indexing and other infrastructure; it also provided evaluation facilities: once the relevance judgments were supplied, it could compute the commonly used metrics
The origins of IR system evaluation • Sparck Jones's book "Information Retrieval Experiment" • It mainly discusses IR experiments and evaluation
How to measure information retrieval effectiveness (1/2) • We need a test collection consisting of three things • A document collection • A test suite of information needs, expressible as queries • A set of relevance judgments, standardly a binary assessment of either relevant or not relevant for each query-document pair
How to measure information retrieval effectiveness (2/2) • In this test collection • A document is given a binary classification as either relevant or not relevant • The collection and the suite of information needs have to be of a reasonable size • Results are highly variable over different documents and information needs • As a rule of thumb, at least 50 information needs are needed
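To make the three components concrete, here is a minimal sketch in Python of how such a test collection might be represented; the field names and toy data are hypothetical and not drawn from any standard collection.

```python
# Minimal sketch of a test collection: documents, information needs (queries),
# and binary relevance judgments (qrels). All identifiers and texts are made up.

documents = {
    "d1": "aerodynamic heating of wings at high speed",
    "d2": "boundary layer theory for compressible flow",
    "d3": "history of the university library",
}

queries = {
    "q1": "high speed aerodynamic heating",
}

# Binary assessment: a (query_id, doc_id) pair is present iff judged relevant.
qrels = {
    ("q1", "d1"),
}
```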
Difficulties • The difference between the stated information need and the query • Relevance is assessed relative to an information need, not a query • The subjectivity of relevance decisions • Many systems contain various parameters that can be adjusted to tune system performance • The correct procedure is to have one or more development test collections for such tuning
Difficulties • Voorhees estimated that judging the relevance of a collection of 8 million documents for a single query topic would take one assessor about nine months • TREC introduced the pooling method, which greatly reduces the judging workload while keeping the evaluation results reliable • Drawback: only a small number of queries can be handled, and even for a small query set it still takes more than ten assessors one to two months
5.2 Standard test collections • The Cranfield collection • Text Retrieval Conference (TREC) • GOV2 • NII Test Collections for IR Systems (NTCIR) • Reuters-21578 and Reuters-RCV1
The Cranfield collection • Collected in the United Kingdom starting in the late 1950s, it contains 1398 abstracts of aerodynamics journal articles and a set of 225 queries • It allows precise quantitative measures • But it is nowadays too small for more than pilot experiments
Text Retrieval Conference (TREC) (1/2) • The U.S. National Institute of Standards and Technology (NIST) has run a large IR test bed evaluation series since 1992 • Over a range of different test collections • But the best known test collections are the ones used for the TREC Ad Hoc track during the first 8 TREC evaluations from 1992 to 1999 • Comprise 6 CDs containing 1.89 million documents (mainly, but not exclusively, newswire articles)
Text Retrieval Conference (TREC) (2/2) • TRECs 6–8 provide 150 information needs over about 528,000 newswire and Foreign Broadcast Information Service articles • Relevance judgments are available only for the documents that were among the top k returned for some system which was entered in the TREC evaluation
GOV2 • Contains 25 million web pages crawled from the .gov domain • GOV2 is one of the largest Web test collections, but still more than 3 orders of magnitude smaller than the current size of the document collections indexed by the large web search companies
NII Test Collections for IR Systems (NTCIR) • Similar sizes to the TREC collections • Focusing on East Asian language and cross-language information retrieval
Reuters-21578 and Reuters-RCV1 • Most often used for text classification • The Reuters-21578 collection contains 21578 newswire articles • Reuters Corpus Volume 1 (RCV1) is much larger, consisting of 806,791 documents
5.3 Evaluation of unranked retrieval sets • Precision • Recall • Accuracy • F measure
Precision • Precision (P) is the fraction of retrieved documents that are relevant: P = #(relevant items retrieved) / #(retrieved items)
Recall • Recall (R) is the fraction of relevant documents that are retrieved: R = #(relevant items retrieved) / #(relevant items)
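A minimal sketch of how both set-based measures can be computed, assuming the retrieved and relevant document IDs for one query are available as Python sets (the IDs below are hypothetical):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Compute set-based precision and recall for a single query."""
    true_positives = retrieved & relevant  # relevant documents that were retrieved
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 3 of the 4 retrieved documents are relevant,
# and 3 of the 5 relevant documents were found.
p, r = precision_recall({"d1", "d2", "d3", "d7"}, {"d1", "d2", "d3", "d5", "d9"})
print(p, r)  # 0.75 0.6
```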
The comparison of P and R • Typical web surfers prefer P to R • Various professional searchers such as paralegals and intelligence analysts prefer R to P • Individuals searching their hard disks prefer R to P
The comparison of P and R • The two quantities clearly trade off against one another • Recall is a non-decreasing function of the number of documents retrieved • Can always get a recall of 1 (but very low precision) by retrieving all documents for all queries • Precision usually decreases as the number of documents retrieved is increased
Which is more difficult to measure? • Precision • Recall
Questioning recall • The denominator (the number of relevant documents in the whole collection) cannot be determined, so a recall value built on it is not practical • Why were relevant documents not retrieved? • Is it the quality of the database design or the user's search skills? • Is it an indexing problem or a retrieval problem? • Or does every database have some inherent relevance coefficient?
Questioning precision • When a database contains many documents that should have been retrieved but were not, is it meaningful that 100% of the retrieved documents are relevant? • How do the non-relevant documents in the denominator of precision arise? • Are they caused by the system? • By the user's poorly expressed query? • Or by the user's final selection of results?
Questioning the relationship between recall and precision (1/2) • [Diagram: the document collection divided into four regions] a: relevant and retrieved; b: irrelevant and retrieved; c: relevant and unretrieved; remainder: irrelevant and unretrieved • In this notation, Recall = a/(a+c) and Precision = a/(a+b)
Questioning the relationship between recall and precision (2/2) • They are generally assumed to be inversely related • If a/(c+a) increases, then c must decrease • c can decrease for two reasons: the boundary of c moves down or the boundary of b moves right; either way b increases, so a/(b+a) decreases, and vice versa • But this rests on the assumption that a is a constant • In fact the constant is not a but (c+a): if c decreases, a must increase, so a is a variable • Therefore there is no necessary link between b and c; one can imagine the b and c boundaries moving toward the edges at the same time. Having both b and c equal to 0 is unlikely, but not impossible • In practice, a retrieval system can improve both measures at the same time
The other way to define P and R • P = tp/(tp + fp) • R = tp/(tp + fn)
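The tp/fp/fn/tn notation comes from the standard 2x2 contingency table for a single query; a LaTeX rendering of that table, consistent with the regions a, b, c above, is:

```latex
% Standard 2x2 contingency table for one query
% (region a above corresponds to tp, b to fp, c to fn).
\[
\begin{array}{l|cc}
                     & \text{Relevant} & \text{Nonrelevant} \\ \hline
\text{Retrieved}     & tp \;(=a)       & fp \;(=b)          \\
\text{Not retrieved} & fn \;(=c)       & tn                 \\
\end{array}
\]
```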
F measure (1/4) • A single measure that trades off precision versus recall • It is the weighted harmonic mean of P and R, with the weighting controlled by α ∈ [0, 1] and thus β² ∈ [0, ∞] • The default, balanced F measure uses β = 1
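The formula itself does not survive in the extracted slide text; the standard definition, consistent with the α and β mentioned above, is:

```latex
\[
F \;=\; \frac{1}{\alpha\frac{1}{P} + (1-\alpha)\frac{1}{R}}
  \;=\; \frac{(\beta^{2}+1)PR}{\beta^{2}P + R},
\qquad \beta^{2} = \frac{1-\alpha}{\alpha},
\qquad F_{\beta=1} = \frac{2PR}{P+R}
\]
```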
F measure (2/4) • Values of β < 1 emphasize precision, while values of β > 1 emphasize recall • It is a harmonic mean rather than the simpler arithmetic average
F measure (3/4) • The harmonic mean is always less than or equal to the arithmetic and geometric means, and is often quite close to the minimum of the two numbers • This strongly suggests that the arithmetic mean is an unsuitable measure, because it is closer to the maximum of the two numbers than the harmonic mean is
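A small sketch of the general F measure, together with a comparison against the arithmetic mean for a degenerate run that returns everything (recall 1, near-zero precision); the numbers are illustrative only:

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=1 gives balanced F1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# Degenerate run: return every document -> recall = 1, precision is tiny.
p, r = 0.0001, 1.0
print((p + r) / 2)        # arithmetic mean ~0.5, looks deceptively good
print(f_measure(p, r))    # F1 ~0.0002, stays close to the smaller value
```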
Accuracy (1/2) • Accuracy is the fraction of the system's classifications that are correct • An information retrieval system can be thought of as a two-class classifier (relevant vs. not relevant) • Accuracy = (tp + tn) / (tp + fp + fn + tn)
Accuracy (2/2) • Often used for evaluating machine learning classification problems • Not an appropriate measure for information retrieval problems • Normally over 99.9% of the documents are in the not-relevant category • Accuracy can therefore be maximized by simply deeming all documents non-relevant to all queries, that is, tn is overwhelmingly large
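A toy illustration of why accuracy is misleading here: with a hypothetical collection of 1,000,000 documents of which only 100 are relevant, a system that retrieves nothing still scores over 99.9% accuracy.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of query-document classification decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical collection: 1,000,000 documents, 100 of them relevant to the query.
# A "retrieve nothing" system: tp = 0, fp = 0, fn = 100, tn = 999_900.
print(accuracy(tp=0, fp=0, fn=100, tn=999_900))  # 0.9999 — yet precision and recall are 0
```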
5.4 Evaluation of ranked retrieval results • Precision, recall, and the F measure are set-based measures and are computed using unordered sets of documents • In a ranked retrieval context, appropriate sets of retrieved documents are naturally given by the top k retrieved documents
The main measures • Precision-recall curve • Mean Average Precision (MAP) • Precision at k • R-precision • ROC curve
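As a preview of two of these measures, here is a minimal sketch of precision at k and non-interpolated average precision for a single ranked result list; MAP is then the mean of average precision over all queries. The ranking and judgments below are made up for illustration.

```python
def precision_at_k(ranking: list, relevant: set, k: int) -> float:
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking: list, relevant: set) -> float:
    """Mean of precision@rank taken at each rank where a relevant document appears,
    divided by the total number of relevant documents (retrieved or not)."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

ranking = ["d3", "d1", "d7", "d2", "d9"]   # ranked output for one query
relevant = {"d1", "d2", "d5"}              # relevance judgments for that query
print(precision_at_k(ranking, relevant, 3))   # 1 relevant doc in top 3 -> 0.333...
print(average_precision(ranking, relevant))   # (1/2 + 2/4) / 3 = 0.333...
```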