
Presentation Transcript


  1. University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2nd Semester 2008-2009

  2. Chapter 3 Retrieval Evaluation

  3. Why is System Evaluation Needed? • There are many retrieval systems on the market; which one is the best? • When the system is in operation, is the performance satisfactory? Does it deviate from the expectation? • To fine-tune a query to obtain the best result (for a particular set of documents and application) • To provide inputs to a cost-benefit analysis of an information system (e.g., time saved compared to a manual system) • To determine the effects of changes made to an existing system (system A versus system B) • Efficiency: speed • Effectiveness: how good is the result?

  4. Retrieval Evaluation • Before the final implementation of an information retrieval system, an evaluation of the system is carried out. The type of evaluation to be considered depends on the objectives of the retrieval system. Any software system has to provide the functionality it was conceived for. Thus, the first type of evaluation which should be considered is a functional analysis.

  5. Cont… • Functional analysis: Does the system provide most of the functions that the user expects? What are the unique functions of this system? How user-friendly is the system? • Error analysis: How often does the system fail?

  6. Retrieval Performance Evaluation The most common measures of system performance are time and space. The shorter the response time and the smaller the space used, the better the system is considered to be. There is a tradeoff between space complexity and time complexity which frequently allows trading one for the other.

  7. Retrieval Performance Evaluation • In a system designed for providing information retrieval, other metrics besides time and space are also of interest, such as recall and precision. Since the user's query request is vague, the retrieved documents are not exact answers and have to be ranked according to their relevance to the query. • Such a relevance ranking concept is not present in data retrieval systems. • IR systems therefore require an evaluation of how precise the answer set is.

  8. Relevance for IR • The capability of an information retrieval system to select and retrieve data appropriate to a user’s needs • A measurement of the outcome of a search • The judgment on what should or should not be retrieved • There are no simple answers to what is relevant and what is not relevant (difficult to define, subjective, depending on knowledge, needs, time, situation, etc.) • Relevance is a central concept of information retrieval.

  9. Effectiveness of Retrieval System Effectiveness is a measure of the ability of the system to retrieve relevant documents while at the same time holding back non-relevant ones. It can be measured by recall and precision.

  10. Difficulties in Evaluating IR System • Effectiveness is related to the relevancy of items retrieved • Relevancy is not a binary evaluation but a continuous function • Even if the relevancy judgement is binary, it is difficult to make the judgement • Relevancy, from a human judgement standpoint, is • subjective - depends upon a specific user’s judgement • situational - relates to the user’s requirement • cognitive - depends on human perception and behavior • temporal - changes over time

  11. Retrieval Performance Evaluation • The retrieval performance evaluation for information retrieval systems is usually based on a test reference collection and on an evaluation measure. • The test reference collection consists of: • A collection of documents. • A set of example information requests. • A set of relevant documents (provided by specialists) for each example information request.
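
For illustration, a minimal Python sketch of how such a test reference collection could be represented. The document ids, query text, and field names below are illustrative assumptions, not taken from the slides:

    # A test reference collection has three parts (per the slide above):
    test_collection = {
        # 1. A collection of documents, keyed by document id (ids are hypothetical).
        "documents": {
            "d1": "text of document 1 ...",
            "d2": "text of document 2 ...",
        },
        # 2. A set of example information requests (queries).
        "queries": {
            "q1": "example information request",
        },
        # 3. For each query, the set of relevant documents judged by specialists.
        "relevant": {
            "q1": {"d2"},
        },
    }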

  12. Recall and Precision • Given a query, how many documents should a system retrieve? • Are all the retrieved documents relevant? • Have all the relevant documents been retrieved? • Measures for system performance: • The first question is about the precision of the search • The second is about the completeness (recall) of the search.

  13. Retrieval Effectiveness - Precision and Recall [Figure: the entire document collection divided into retrieved and not-retrieved documents on one axis and relevant and irrelevant documents on the other, giving four regions: retrieved & relevant, retrieved & irrelevant, not retrieved but relevant, and not retrieved & irrelevant.]

  14. Cont…
                        Relevant      Not relevant
      Retrieved            a               b
      Not retrieved        c               d

      P = a / (a + b)          R = a / (a + c)
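
A small Python sketch applying these two formulas to the contingency table above; the counts in the example call are hypothetical:

    def precision_recall(a, b, c):
        """Precision and recall from the table above.

        a = retrieved and relevant
        b = retrieved and not relevant
        c = not retrieved but relevant
        """
        precision = a / (a + b) if (a + b) else 0.0   # undefined when nothing is retrieved
        recall    = a / (a + c) if (a + c) else 0.0   # undefined when nothing is relevant
        return precision, recall

    # Hypothetical counts: 20 retrieved-and-relevant, 80 retrieved-but-irrelevant,
    # 180 relevant documents that were not retrieved.
    print(precision_recall(20, 80, 180))   # (0.2, 0.1)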

  15. Precision and Recall • Precision • evaluates the correlation of the query to the database • an indirect measure of the completeness of the indexing algorithm • Recall • the ability of the search to find all of the relevant items in the database • Of the three numbers involved, • only two are always available • total number of items retrieved • number of relevant items retrieved • the total number of relevant items is usually not available • Unfortunately, precision and recall affect each other in opposite directions! Given a system: • Broadening a query will increase recall but lower precision • Increasing the number of documents returned has the same effect

  16. Relationship between Recall and Precision [Figure: precision plotted against recall, each ranging from 0 to 1. The ideal is the corner where both equal 1. Returning most of the relevant documents but including much junk gives high recall and low precision; returning mostly relevant documents but missing many relevant ones gives high precision and low recall.]

  17. Recall and Precision... Examples • If you knew that there were 1000 relevant documents in a database (R) and your search retrieved 100 of these relevant documents (Ra), your recall would be 10%. • If your search retrieves 100 documents (A) and 20 of these are relevant (Ra), your precision is 20%.
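
The same two examples, checked in a short Python snippet; the variable names follow the slide's R, Ra, and A:

    R  = 1000   # total relevant documents in the database
    Ra = 100    # relevant documents actually retrieved (first example)
    recall = Ra / R                 # 100 / 1000 = 0.10 -> 10 %

    A  = 100    # documents retrieved by the search (second example)
    Ra = 20     # of those retrieved, 20 are relevant
    precision = Ra / A              # 20 / 100 = 0.20 -> 20 %

    print(f"recall = {recall:.0%}, precision = {precision:.0%}")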

  18. Fallout Measure • Fallout is just everything that is left over: all the junk that came up in your search that was irrelevant. • If you retrieve 100 documents and 20 are relevant, then your fallout is 80%. • Fallout becomes a bigger problem as the size of your database grows and your retrieval gets larger.
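
A quick check of the 80% figure, following the slide's simplified reading of fallout as the share of retrieved documents that are irrelevant (the numbers are the slide's own example):

    retrieved  = 100                    # documents returned by the search
    relevant   = 20                     # of those, 20 are relevant
    irrelevant = retrieved - relevant   # 80 "junk" documents

    fallout = irrelevant / retrieved    # 80 / 100 = 0.80 -> 80 %
    print(f"fallout = {fallout:.0%}")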

  19. Fallout Rate • Problems with precision and recall: • A query on “Hong Kong” will return most relevant documents, but that doesn’t tell you how good or how bad the system is! (What is the chance that a randomly picked document is relevant to the query?) • the number of irrelevant documents in the collection is not taken into account • recall is undefined when there is no relevant document in the collection • precision is undefined when no document is retrieved • A good system should have high recall and low fallout

  20. Example 1 • Consider an example information request for which a query q is formulated. • Assume that a set Rq containing the relevant documents for q has been defined, composed of the following documents: Rq = { d3, d5, d9, d25, d39, d44, d56, d71, d89, d123 }

  21. Example 1 • Consider now a new retrieval algorithm which has just been designed. • Assume that this algorithm returns, for query q, a ranking of the documents in the answer set as follows.

  22. Example 1 • The documents that are relevant to the query q are marked with a bullet.

  23. Cont….. • Compute : • Recall • Precision • Fallout rate

  24. Cont….. Precision = number of relevant documents retrieved / total number of documents retrieved. Recall = number of relevant documents retrieved / total number of relevant documents.

  25. Cont… Precision = 5 / 15 ≈ 33 % Recall = 5 / 10 = 50 %
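
A short Python sketch reproducing this calculation. The answer-set size of 15 and the 5 relevant hits follow the slides; the exact ranking from slide 21 is not reproduced here:

    # Rq: the relevant documents for query q, as defined in Example 1.
    Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}

    retrieved          = 15   # documents in the ranked answer set (per the slides)
    relevant_retrieved = 5    # of those, 5 appear in Rq

    precision = relevant_retrieved / retrieved    # 5 / 15 ≈ 33 %
    recall    = relevant_retrieved / len(Rq)      # 5 / 10 = 50 %
    print(f"precision = {precision:.0%}, recall = {recall:.0%}")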
