IR System Evaluation Farhad Oroumchian
IR System Evaluation • System-centered strategy • Given documents, queries, and relevance judgments • Try several variations on the retrieval system • Measure which gets more good docs near the top • User-centered strategy • Given several users, and at least 2 retrieval systems • Have each user try the same task on both systems • Measure which system works the “best”
Which is the Best Rank Order?
[Figure: eight candidate rankings, labeled a through h, with relevant documents marked "R" at different positions in each ranked list]
Measures of Effectiveness • Good measures of effectiveness should • Capture some aspect of what the user wants • Have predictive value for other situations • Different queries, different document collection • Be easily replicated by other researchers • Be expressed as a single number • Allows two systems to be easily compared • No measures of effectiveness are that good!
Some Assumptions • Unchanging, known queries • The same queries are used by each system • Binary relevance • Every document is either relevant or it is not • Unchanging, known relevance • The relevance of each doc to each query is known • But only used for evaluation, not retrieval! • Focus on effectiveness, not efficiency
Exact Match MOE • Precision • How much of what was found is relevant? • Often of interest, particularly for interactive searching • Recall • How much of what is relevant was found? • Particularly important for law and patent searches • Fallout • How much of what was irrelevant was rejected? • Useful when different size collections are compared
The Contingency Table

                        Action
Doc              Retrieved               Not Retrieved
Relevant         Relevant Retrieved      Relevant Rejected
Not relevant     Irrelevant Retrieved    Irrelevant Rejected
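As a small illustration of how the precision, recall, and fallout of the previous slide fall out of these four cells, here is a minimal Python sketch; the counts are made up for the example and do not come from the slides.

# Hypothetical counts for the four cells of the contingency table
relevant_retrieved = 30       # relevant docs the system returned
relevant_rejected = 20        # relevant docs the system missed
irrelevant_retrieved = 70     # non-relevant docs the system returned
irrelevant_rejected = 880     # non-relevant docs correctly left out

retrieved = relevant_retrieved + irrelevant_retrieved
relevant = relevant_retrieved + relevant_rejected
irrelevant = irrelevant_retrieved + irrelevant_rejected

precision = relevant_retrieved / retrieved    # how much of what was found is relevant
recall = relevant_retrieved / relevant        # how much of what is relevant was found
fallout = irrelevant_retrieved / irrelevant   # how much of what was irrelevant got through

print(f"precision={precision:.2f}  recall={recall:.2f}  fallout={fallout:.2f}")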
MOE for Ranked Retrieval • Start with the first relevant doc on the list • Compute recall and precision at that point • Move down the list, 1 relevant doc at a time • Computing recall and precision at each point • Plot precision for every value of recall • Interpolate with a nonincreasing step function • Repeat for several queries • Average the plots at every point
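A minimal sketch of that procedure for a single query, assuming the ranked list is given as a list of 0/1 relevance flags; the ranking below is invented, and only the 10-document, 4-relevant split mirrors the example on the next slide.

def precision_recall_points(ranking, num_relevant):
    """(recall, precision) measured at each relevant document in the ranked list."""
    points, found = [], 0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            found += 1
            points.append((found / num_relevant, found / rank))
    return points

def interpolate(points):
    """Nonincreasing step: precision at a recall level is the max precision there or beyond."""
    return [(r, max(p for _, p in points[i:])) for i, (r, _) in enumerate(points)]

# Hypothetical ranking of 10 documents, 4 of them relevant (1 = relevant)
ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
print(interpolate(precision_recall_points(ranking, num_relevant=4)))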
The Precision-Recall Curve
[Figure: a ranked list of Doc = 10 documents (Relevant = 4, Not relevant = 6), its contingency table, and the precision-recall point plotted after each relevant document "R"]
Single-Number MOE • Precision at a fixed number of documents • Precision at 10 docs is the “AltaVista measure” • Precision at a given level of recall • Adjusts for the total number of relevant docs • Average precision • Average of precision at recall=0.0, 0.1, …, 1.0 • Area under the precision/recall curve • Breakeven point • Point where precision = recall
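A sketch of a few of these single numbers under the same assumptions (a ranked list of 0/1 relevance flags); the ranking is hypothetical and not taken from the slides.

def interpolated_precision(ranking, num_relevant, recall_level):
    """Max precision at any relevant document whose recall is >= recall_level."""
    best, found = 0.0, 0
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            found += 1
            if found / num_relevant >= recall_level:
                best = max(best, found / rank)
    return best

def precision_at_k(ranking, k):
    """Fraction of the top k documents that are relevant (the 'AltaVista measure' for k=10)."""
    return sum(ranking[:k]) / k

def eleven_point_average_precision(ranking, num_relevant):
    """Average of interpolated precision at recall = 0.0, 0.1, ..., 1.0."""
    return sum(interpolated_precision(ranking, num_relevant, r / 10) for r in range(11)) / 11

ranking = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]            # hypothetical: 10 docs, 4 relevant
print(precision_at_k(ranking, 10))                  # precision at 10 docs
print(eleven_point_average_precision(ranking, 4))   # area-like summary of the whole curve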
Single-Number MOE
[Figure: a precision-recall curve annotated with precision at recall=0.1, precision at 10 docs, the breakeven point, and the average precision]
Single-Number MOE Weaknesses • Precision at 10 documents • Pays no attention to recall • Precision at constant recall • A specific recall fraction is rarely the user’s goal • Breakeven point • Nobody ever searches at the breakeven point • Average precision • Users typically operate near an extreme of the curve • So the average is not very informative
Why Choose Average Precision? • It is easy to trade between recall and precision • Adding related query terms improves recall • But naive query expansion techniques kill precision • Limiting matches by part-of-speech helps precision • But it almost always hurts recall • Comparisons should give some weight to both • Average precision is a principled way to do this • Rewards improvements in either factor
How Much is Enough? • The maximum average precision is 1.0 • But inter-rater reliability is 0.8 or less • So 0.8 is a practical upper bound at every point • Precision ≈ 0.8 is sometimes seen at low recall • Two goals • Achieve a meaningful amount of improvement • This is a judgment call, and depends on the application • Achieve that improvement reliably across queries • This can be verified using statistical tests
Statistical Significance Tests • How sure can you be that an observed difference doesn’t simply result from the particular queries you chose?

Experiment 1                            Experiment 2
Query    System A   System B           Query    System A   System B
1        0.20       0.40               1        0.02       0.76
2        0.21       0.41               2        0.39       0.07
3        0.22       0.42               3        0.16       0.37
4        0.19       0.39               4        0.58       0.21
5        0.17       0.37               5        0.04       0.02
6        0.20       0.40               6        0.09       0.91
7        0.21       0.41               7        0.12       0.46
Average  0.20       0.40               Average  0.20       0.40
The Sign Test • Compare the average precision for each query • Note which system produces the bigger value • Assume that either system is equally likely to produce the bigger value for any query • Compute the probability of the outcome you got • Any statistics package contains the formula for this • Probabilities<0.05 are “statistically significant” • But they still need to pass the “meaningful” test!
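A minimal sketch of the sign test on per-query average precision, using only the standard library; the scores are the Experiment 1 values from the earlier slide.

from math import comb

def sign_test(scores_a, scores_b):
    """Two-sided sign test: probability of a split at least this lopsided under a fair coin."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b                     # ties are dropped
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Per-query average precision from Experiment 1 on the earlier slide
system_a = [0.20, 0.21, 0.22, 0.19, 0.17, 0.20, 0.21]
system_b = [0.40, 0.41, 0.42, 0.39, 0.37, 0.40, 0.41]
print(sign_test(system_a, system_b))        # B wins all 7 queries: p = 2 * (1/128) < 0.05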
The Student's t-Test • More powerful than the sign test • If the assumptions are satisfied • Compute the average precision difference • On a query-by-query basis, for enough queries to approximate a normal distribution • Assume that the queries are independent • Compute the probability of the outcome you got • Again, any statistics package can be used • A probability < 0.05 is “statistically significant”
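A sketch of the paired t-test using SciPy's scipy.stats.ttest_rel, applied to the Experiment 2 numbers from the earlier table (Experiment 1 is a degenerate case for this test, since every query-by-query difference there is exactly 0.20).

from scipy.stats import ttest_rel

# Per-query average precision from Experiment 2 on the earlier slide
system_a = [0.02, 0.39, 0.16, 0.58, 0.04, 0.09, 0.12]
system_b = [0.76, 0.07, 0.37, 0.21, 0.02, 0.91, 0.46]

# Paired t-test on the query-by-query average precision differences
result = ttest_rel(system_b, system_a)
print(result.statistic, result.pvalue)
# Despite the same 0.20 average gap as Experiment 1, the noisy per-query
# differences leave the p-value well above 0.05, so the improvement is not
# statistically significant.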
Obtaining Relevance Judgments • Exhaustive assessment can be too expensive • TREC has 50 queries for >1 million docs each year • Random sampling won’t work either • If relevant docs are rare, none may be found! • IR systems can help focus the sample • Each system finds some relevant documents • Different systems find different relevant documents • Together, enough systems will find most of them
Pooled Assessment Methodology • Each system submits top 1000 documents • Top 100 documents for each are judged • All are placed in a single pool • Duplicates are eliminated • Placed in an arbitrary order to avoid bias • Evaluated by the person that wrote the query • Assume unevaluated documents not relevant • Overlap evaluation shows diminishing returns • Compute average precision over all 1000 docs
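A sketch of how such a judgment pool could be assembled from several ranked runs; the run format and the toy document ids are assumptions for illustration, not TREC's actual tooling.

import random

def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from each run, deduplicated and shuffled."""
    pool = set()
    for ranked_doc_ids in runs:
        pool.update(ranked_doc_ids[:depth])   # top documents contributed by this system
    pool = list(pool)
    random.shuffle(pool)                      # arbitrary order, to avoid biasing the assessor
    return pool

# Hypothetical runs: each is a list of document ids ranked by one system
run_a = ["d3", "d7", "d1", "d9", "d4"]
run_b = ["d7", "d2", "d3", "d8", "d5"]
print(build_pool([run_a, run_b], depth=3))
# The assessor judges only the pooled documents; anything outside the pool
# is assumed not relevant when average precision is computed over all 1000.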
TREC Overview • Documents are typically distributed in April • Topics are distributed June 1 • Queries are formed from topics using standard rules • Top 1000 selections are due August 15 • Special interest track results due 2 weeks later • Cross-language IR, Spoken Document Retrieval, … • Relevance judgments available in October • Results presented in late November each year
Concerns About Precision/Recall • Statistical significance may be meaningless • Average precision won’t reveal curve shape • Averaging over recall washes out information • How can you know the quality of the pool? • How to extrapolate to other collections?
Project Test Collection • Legal articles from the Qavanin legal database of Dadeh Pardazi • 41 queries from the Qavanin system • Relevance judgments keyed to ItemID • Relevance is on a 0-4 scale • 0 = completely irrelevant • 1 = irrelevant • 2 = somewhat relevant • 3 = relevant • 4 = fully relevant • Usually grades 3 and 4 are treated as relevant and 0, 1, 2 as not relevant
Project Overview • Install or write the software • Choose the parameters • Stopwords, stemming, term weights, etc. • Index the document collection • This may require some format-specific tweaking • Run the 20 queries • Compute average precision and other measures • Test the query-length effect for statistical significance
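A sketch of the evaluation step, assuming each query gives you a ranked list of ItemIDs plus the 0-4 judgments from the previous slide, with grades 3 and 4 treated as relevant; this uses the non-interpolated form of average precision, and all names and numbers here are illustrative placeholders.

def average_precision(ranked_ids, judgments, threshold=3):
    """Non-interpolated average precision, treating grades >= threshold as relevant."""
    relevant = {item for item, grade in judgments.items() if grade >= threshold}
    if not relevant:
        return 0.0
    found, total = 0, 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            found += 1
            total += found / rank
    return total / len(relevant)

# Hypothetical ranked output for one query, keyed by ItemID, with its graded judgments
run = ["i12", "i07", "i33", "i05"]
judgments = {"i12": 4, "i07": 1, "i33": 3, "i05": 0}
print(average_precision(run, judgments))   # (1/1 + 2/3) / 2 ≈ 0.83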
Team Project User Studies • Measure value of some part of the interface • e.g., selection interface with and without titles • Choose a dependent variable to measure • e.g., number of documents examined • Run a pilot study with users from your team • Fine tune your experimental procedure • Run the experiment with at least 3 subjects • From outside your team (may be in the class)