Retrieval Evaluation

Retrieval Evaluation J. H. Wang Mar. 18, 2008

Outline • Chap. 3, Retrieval Evaluation • Retrieval Performance Evaluation • Reference Collections

Introduction • Types of evaluation: functional analysis phase (including error analysis), and performance evaluation • Performance evaluation: response time and space required • Retrieval performance evaluation: the evaluation of how precise the answer set is

Retrieval Performance Evaluation • Query in batch mode vs. interactive sessions • Given a collection, R is the set of relevant docs, A is the answer set (sorted by relevance), and Ra is the set of relevant docs in the answer set • Recall=|Ra|/|R| • Precision=|Ra|/|A|

Precision versus Recall Curve • Rq={d3,d5,d9,d25,d39,d44,d56,d71,d89,d123} • Ranking for query q (* marks a relevant document): 1.d123* 2.d84 3.d56* 4.d6 5.d8 6.d9* 7.d511 8.d129 9.d187 10.d25* 11.d38 12.d48 13.d250 14.d11 15.d3* • P=100% at R=10% • P= 66% at R=20% • P= 50% at R=30% • Usually based on 11 standard recall levels: 0%, 10%, ..., 100%
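The recall and precision figures on this slide can be reproduced with a short sketch. The function name `precision_recall_points` is hypothetical; the ranking and Rq are the ones listed above. Note that 2/3 prints as 67% when rounded, where the slide truncates it to 66%.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) after each relevant document in the ranking."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            # recall uses |R|, precision uses the number of docs seen so far
            points.append((hits / len(relevant), hits / rank))
    return points


ranking_q = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
             "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
relevant_q = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

points_q = precision_recall_points(ranking_q, relevant_q)
for recall, precision in points_q:
    print(f"R={recall:.0%}  P={precision:.0%}")
```

Only five of the ten relevant documents appear in the ranking, so the observed recall never exceeds 50% for this query.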

Precision versus Recall Curve • For a single query Fig3.2

Average Over Multiple Queries • P(r)=average precision at the recall level r • Nq= Number of queries used • Pi(r)=The precision at recall level r for the i-th query
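The formula itself did not survive in the slide text. Given the three symbols defined above, it is presumably the plain arithmetic mean over the queries, which would read as follows (a reconstruction, not a quotation from the slide):

```latex
\overline{P}(r) = \sum_{i=1}^{N_q} \frac{P_i(r)}{N_q}
```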

Interpolated Precision • Rq={d3,d56,d129} • Ranking for query q (* marks a relevant document): 1.d123 2.d84 3.d56* 4.d6 5.d8 6.d9 7.d511 8.d129* 9.d187 10.d25 11.d38 12.d48 13.d250 14.d11 15.d3* • P=33% at R=33% • P=25% at R=66% • P=20% at R=100% • P(r_j) = max_{r_j ≦ r ≦ r_{j+1}} P(r)

Interpolated Precision • Let r_j, j ∈ {0, 1, 2, …, 10}, be a reference to the j-th standard recall level • P(r_j) = max_{r_j ≦ r ≦ r_{j+1}} P(r) • R=30%: range r_3 to r_4, P=33% • R=40%: range r_4 to r_5 • R=50%: range r_5 to r_6 • R=60%: range r_6 to r_7, P=25%
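A minimal sketch of interpolation at the 11 standard recall levels, using the three-document example (Rq={d3,d56,d129}). It assumes the common convention that the value at level r_j is the highest precision observed at any recall greater than or equal to r_j. This is an assumption, slightly broader than the interval written on the slide, chosen so that every level gets a value. The function name is hypothetical.

```python
def interpolated_precision(ranking, relevant, levels=11):
    """Interpolated precision at the recall levels 0%, 10%, ..., 100%.

    Assumption: the value at level r_j is the maximum precision observed
    at any recall >= r_j (0.0 if no such point exists).
    """
    total = len(relevant)
    observed = []  # (relevant docs seen so far, precision at that rank)
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            observed.append((hits, hits / rank))

    result = []
    for j in range(levels):
        # recall hits/total >= j/(levels-1), compared in integers
        # to avoid floating-point edge cases
        candidates = [p for h, p in observed if h * (levels - 1) >= j * total]
        result.append(max(candidates, default=0.0))
    return result


ranking_q = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
             "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
interp = interpolated_precision(ranking_q, {"d3", "d56", "d129"})
for j, p in enumerate(interp):
    print(f"R={j * 10}%  P={p:.0%}")
```

Under this convention the levels from 0% to 30% take the value 33%, 40% to 60% take 25%, and 70% to 100% take 20%, which agrees with the two values given on the slide (33% at R=30%, 25% at R=60%).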

Average Recall vs. Precision Figure

Single Value Summaries • Average precision versus recall • Compares retrieval algorithms over a set of example queries • Sometimes we need to compare performance on individual queries • Averaging precision over many queries might disguise important anomalies in the retrieval algorithms • We might be interested in whether one of them outperforms the other for each query • Need a single value summary • The single value should be interpreted as a summary of the corresponding precision versus recall curve

Single Value Summaries • Average Precision at Seen Relevant Documents • Averaging the precision figures obtained after each new relevant document is observed • Example: Figure 3.2, (1+0.66+0.5+0.4+0.3)/5=0.57 • This measure favors systems which retrieve relevant documents quickly (i.e., early in the ranking) • R-Precision • The precision at the R-th position in the ranking • R: the total number of relevant documents of the current query (number of documents in Rq) • Fig3.2: R=10, value=0.4 • Fig3.3: R=3, value=0.33
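Both summaries can be sketched in a few lines. The function names are hypothetical; the rankings and relevant sets are those of Fig3.2 and Fig3.3. With unrounded precision values the first average comes to 0.58; the slide's 0.57 results from truncating the individual figures (0.66, 0.3) before averaging.

```python
def average_precision_at_seen_relevant(ranking, relevant):
    """Mean of the precision values taken at each relevant document retrieved."""
    precisions = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def r_precision(ranking, relevant):
    """Precision at position R, where R is the number of relevant documents."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


ranking_q = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
             "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
rq_fig32 = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
rq_fig33 = {"d3", "d56", "d129"}

avg_fig32 = average_precision_at_seen_relevant(ranking_q, rq_fig32)
rp_fig32 = r_precision(ranking_q, rq_fig32)
rp_fig33 = r_precision(ranking_q, rq_fig33)
print(round(avg_fig32, 2), round(rp_fig32, 2), round(rp_fig33, 2))
```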

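Precision Histograms • Use R-precision measures to compare the retrieval history of two algorithms through visual inspection • RP_A/B(i) = RP_A(i) - RP_B(i), where RP_A(i) and RP_B(i) are the R-precision values of algorithms A and B for the i-th query • A positive value means A does better on query i, a negative value means B does better, and zero means a tie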
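The histogram values are simple per-query differences. In the sketch below the R-precision lists are made-up illustrative numbers, not results from any real system, and the function name is hypothetical.

```python
def r_precision_differences(rp_a, rp_b):
    """RP_A/B(i) = RP_A(i) - RP_B(i) for each query i."""
    return [a - b for a, b in zip(rp_a, rp_b)]


# Hypothetical R-precision values for three queries (illustration only).
rp_a = [0.4, 0.3, 0.5]
rp_b = [0.2, 0.5, 0.5]

diffs = r_precision_differences(rp_a, rp_b)
for i, d in enumerate(diffs, start=1):
    better = "A" if d > 0 else "B" if d < 0 else "tie"
    print(f"query {i}: {d:+.2f} ({better})")
```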

Summary Table Statistics • Single value measures can be stored in a table regarding the set of all queries • the number of queries • total number of documents retrieved by all queries • total number of relevant documents which were effectively retrieved when all queries are considered • total number of relevant documents which could have been retrieved by all queries • …

Precision and Recall Appropriateness • Proper estimation of maximum recall for a query requires detailed knowledge of all documents in the collection • Recall and precision are related measures which capture different aspects of the set of retrieved documents • Measures which quantify the informativeness of the retrieval process might be more appropriate • Recall and precision are easy to define when a linear ordering of the retrieved documents is enforced

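Alternative Measures • The Harmonic Mean: F(j) = 2 / (1/r(j) + 1/P(j)), where r(j) and P(j) are the recall and precision at the j-th document in the ranking • Values in [0,1] • The E Measure: E(j) = 1 - (1+b^2) / (b^2/r(j) + 1/P(j)) • b sets the relative importance of recall and precision • b=1: E(j) = 1 - F(j) • b>1: recall weighs more heavily • b<1: precision weighs more heavily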
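A small sketch of both measures, assuming the usual definitions F = 2 / (1/r + 1/P) and E = 1 - (1+b^2) / (b^2/r + 1/P). These formulas are not in the extracted slide text, so treat them as an assumption. The function names are hypothetical. The example uses rank 10 of the Fig3.2 ranking, where recall and precision are both 40%.

```python
def harmonic_mean(recall, precision):
    """F: harmonic mean of recall and precision, 0 if either is 0."""
    if recall == 0 or precision == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)


def e_measure(recall, precision, b=1.0):
    """E measure under the assumed formula; lower values are better."""
    if recall == 0 or precision == 0:
        return 1.0
    return 1 - (1 + b ** 2) / (b ** 2 / recall + 1 / precision)


f_value = harmonic_mean(0.4, 0.4)
e_value = e_measure(0.4, 0.4, b=1.0)
print(round(f_value, 2), round(e_value, 2))

# Low recall, high precision: a larger b gives a worse (higher) E,
# i.e. under this formula larger b puts more weight on recall.
e_b2 = e_measure(0.2, 0.8, b=2.0)
e_b05 = e_measure(0.2, 0.8, b=0.5)
```

With b=1 the two values sum to 1, so under this formula E is the complement of F.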

User-Oriented Measure • Assumption: different users might have a different interpretation of which document is relevant

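User-Oriented Measure • U: the relevant documents known to the user • Rk: the retrieved relevant documents already known to the user • Ru: the retrieved relevant documents previously unknown to the user • Coverage=|Rk|/|U| • Novelty=|Ru|/(|Ru|+|Rk|) • A high coverage ratio indicates that the system is finding most of the relevant documents that the user expected to see • A high novelty ratio indicates that the system is revealing many new relevant documents which were previously unknown to the user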
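The two ratios are easy to compute with sets. The sketch assumes U is the set of relevant documents known to the user, Rk the retrieved relevant documents the user already knew, and Ru the retrieved relevant documents that are new to the user. The document ids are made up for illustration and the function name is hypothetical.

```python
def coverage_and_novelty(answer_set, relevant, known_to_user):
    """Return (coverage, novelty) for one user and one query."""
    u = relevant & known_to_user              # U: relevant docs the user knows
    retrieved_relevant = answer_set & relevant
    rk = retrieved_relevant & known_to_user   # Rk: retrieved, already known
    ru = retrieved_relevant - known_to_user   # Ru: retrieved, new to the user
    coverage = len(rk) / len(u) if u else 0.0
    novelty = len(ru) / (len(ru) + len(rk)) if retrieved_relevant else 0.0
    return coverage, novelty


# Hypothetical example data.
relevant = {"d1", "d2", "d3", "d4", "d5", "d6"}
known_to_user = {"d1", "d2", "d3", "d4"}
answer_set = {"d1", "d2", "d5", "d9", "d10"}

coverage, novelty = coverage_and_novelty(answer_set, relevant, known_to_user)
print(coverage, round(novelty, 2))
```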

Other Measures • Relative recall: the ratio between the number of relevant documents found and the number of relevant documents the user expected to find • Recall effort: the ratio between the number of relevant documents the user expected to find and the number of documents examined • Others: expected search length, satisfaction, frustration

Reference Collections • Reference test collections for the evaluation of IR systems • TIPSTER/TREC: large size, thorough experimentation • CACM, ISI: historical importance • Cystic Fibrosis: a small collection, extensively studied by specialists before the sets of relevant documents were generated

Criticisms of IR Research • Lacks a solid formal framework as a basic foundation • This criticism is difficult to dismiss because of the subjectiveness involved in deciding on the relevance of a document • Lacks robust and consistent testbeds and benchmarks • Early experimentation was based on relatively small test collections, and there were no widely accepted benchmarks • In the early 1990s, the TREC conference, led by Donna Harman (NIST), was dedicated to experimentation with a large test collection

TREC (Text REtrieval Conference) • Initiated under the National Institute of Standards and Technology (NIST) • Goals: • Providing a large test collection • Uniform scoring procedures • Forum for comparing results • 7th TREC conference in 1998 • Document collection: test collections, example information requests (topics), relevant docs • The benchmark tasks

The Documents Collection • Tagged with SGML to allow easy parsing
<doc>
<docno>WSJ880406-0090</docno>
<hl>AT&T Unveils Services to Upgrade Phone Networks Under Global Plan</hl>
<author>Janet Guyon (WSJ Staff)</author>
<dateline>New York</dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new generation of phone service with broad…
</text>
</doc>

TREC1-6 Documents

The Example Information Requests (Topics) • Each request (topic) is a description of an information need in natural language • Each topic is identified by its own topic number
<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: …..
<nar> Narrative: A …..
</top>

TREC～Topics

TREC～Relevance Assessment • Relevance assessment • Pooling Method • The documents in the pool are shown to human assessors, who decide on their relevance • Two assumptions • The vast majority of the relevant documents is collected in the assembled pool • Documents that are not in the pool can be considered to be not relevant

Pooling Method • The set of relevant documents for each example information request is obtained from a pool of possibly relevant documents • This pool is created by taking the top K documents (usually, K=100) in the rankings generated by the various participating retrieval systems
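Pool construction amounts to a union of top-K prefixes. In the sketch below the system names, document ids, and the function name `build_pool` are hypothetical, and K is set to 2 only to keep the example small.

```python
def build_pool(rankings, k=100):
    """Union of the top-k documents from each participating system's ranking."""
    pool = set()
    for ranking in rankings:
        pool.update(ranking[:k])
    return pool


# Hypothetical rankings from three systems for one topic.
runs = {
    "sysA": ["d7", "d2", "d9", "d4"],
    "sysB": ["d2", "d5", "d7", "d1"],
    "sysC": ["d3", "d7", "d8", "d2"],
}

pool = build_pool(runs.values(), k=2)
print(sorted(pool))
```

Only the documents in `pool` would be shown to the assessors. A document outside the pool (here d9, d4, d1, d8) is never judged and is treated as not relevant.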

The (Benchmark) Tasks at the TREC Conferences • Ad hoc task • Receive new requests and execute them on a pre-specified document collection • Routing task • Receive test information requests and two document collections • First collection: training and tuning the retrieval algorithm • Second collection: testing the tuned retrieval algorithm

Other Tracks • *Chinese • Filtering • Interactive • *NLP (natural language processing) • Cross languages • High precision • Spoken document retrieval • Query (TREC-7) • Others: Web, Terabyte, SPAM, Blog, Novelty, Question Answering, HARD, …

TREC～Evaluation

Evaluation Measures at the TREC Conferences • Summary table statistics • Recall-precision • Document level averages* • Average precision histogram

The CACM Collection • A small collection of computer science literature (1958-1979) • Text of 3,204 documents • Structured subfields • Word stems from the title and abstract sections • Categories • Direct references between articles: a list of document pairs [da,db] • Bibliographic coupling connections: a list of triples [d1,d2,ncited] • Number of co-citations for each pair of articles: [d1,d2,nciting] • A unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns
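The two derived subfields can in principle be computed from the direct-reference pairs. The sketch assumes a pair [da,db] means "da cites db", that ncited counts the documents cited by both d1 and d2 (bibliographic coupling), and that nciting counts the documents citing both d1 and d2 (co-citation). The sample pairs and the function name are made up. This is not a claim about how the CACM files were actually produced.

```python
from collections import defaultdict
from itertools import combinations


def coupling_and_cocitation(references):
    """Derive coupling and co-citation counts from (citing, cited) pairs."""
    cites = defaultdict(set)     # document -> documents it cites
    cited_by = defaultdict(set)  # document -> documents that cite it
    for da, db in references:
        cites[da].add(db)
        cited_by[db].add(da)

    coupling = {}  # (d1, d2) -> number of documents cited by both
    for d1, d2 in combinations(sorted(cites), 2):
        ncited = len(cites[d1] & cites[d2])
        if ncited:
            coupling[(d1, d2)] = ncited

    cocitation = {}  # (d1, d2) -> number of documents citing both
    for d1, d2 in combinations(sorted(cited_by), 2):
        nciting = len(cited_by[d1] & cited_by[d2])
        if nciting:
            cocitation[(d1, d2)] = nciting

    return coupling, cocitation


# Hypothetical direct references: a and b each cite c and d; e cites c.
refs = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("e", "c")]
coupling, cocitation = coupling_and_cocitation(refs)
print(coupling)
print(cocitation)
```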

The CACM collection also includes a set of 52 test information requests • Ex: “What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?” • Each request also comes with two Boolean query formulations and a set of relevant documents • Since the requests are fairly specific, the average number of relevant documents for each request is small (around 15) • Precision and recall tend to be low

The ISI Collection • The 1,460 documents in the ISI test collection were selected from a previous collection assembled by Small at ISI (Institute for Scientific Information) • The documents selected were those most cited in a cross-citation study done by Small • The main purpose is to support investigation of similarities based on terms and on cross-citation patterns

The Cystic Fibrosis (CF) Collection • 1,239 documents indexed with the term “cystic fibrosis” in the Medline database • Information requests were generated by an expert in cystic fibrosis • Relevance scores were provided by subject experts • 0: non-relevance • 1: marginal relevance • 2: high relevance

Characteristics of CF collection • Relevance score was generated directly by human experts • It includes a good number of information requests (relative to the collection size) • The respective query vectors present overlap among themselves • This allows experimentation with retrieval strategies which take advantage of past query sessions to improve retrieval performance

Trends and Research Issues • Interactive user interface • A general belief: effective retrieval is highly dependent on obtaining proper feedback from the user • Deciding which evaluation measures are most appropriate in this scenario • Ex: informativeness measure in 1992 • The proposal, study, and characterization of alternative measures to recall and precision
