Explore the methodologies and parameters used to evaluate retrieval systems in the field of computer science. Learn about models, user satisfaction, precision and recall, and more.
Special Topics in Computer Science. The Art of Information Retrieval. Chapter 3: Retrieval Evaluation. Alexander Gelbukh, www.Gelbukh.com
Previous chapter • Modeling is needed for formal operations • Boolean model is the simplest • Vector model is the best combination of quality and simplicity • TF-IDF term weighting • This (or similar) weighting is used in all further models • Many interesting and not well-investigated variations • possible future work
Previous chapter: Research issues • How do people judge relevance? • ranking strategies • How to combine different sources of evidence? • What interfaces can help users understand and formulate their Information Need? • user interfaces: an open issue • Meta-search engines: combine results from different Web search engines • Their results hardly intersect • How to combine the rankings?
Evaluation! • How do you measure whether your system is good or bad? • To go in the right direction, you need to know where you want to get. • “We can do it this way” vs. “This way it performs better” • I think it is better... • We do it this way... • Our method takes into account syntax and semantics... • I like the results... • Criterion of truth. Crucial for any science. • Enables competition → financial policy → attracts people • TREC international competitions
Methodology • Formally define your task and constraints • Formally define your evaluation criterion (argue for it if needed) • One numerical value, not several! • Demonstrate that your method gives a better value than • the baseline (the simple, obvious way) • Retrieve all. Retrieve none. Retrieve at random. Use Google. • the state of the art (the best reported method) • Demonstrate that your parameter settings are optimal • Consider singular (extreme) settings • Set your parameters to 0. To infinity.
Methodology: The only valid way of reasoning • “But we want the clusters to be non-trivial” • Then add this as a penalty to your criterion, or as a constraint • Divide your “acceptability considerations”: • Constraints: yes/no. • Evaluation: better/worse. • Check that your evaluation criteria are well justified • “My formula gives it this way” • “My result is correct since this is what my algorithm gives” • Reason in terms of the user’s task, not your algorithm / formulas • Are your good/bad judgments in accord with intuition?
Evaluation? • IR: “user satisfaction” • Difficult to model formally • Expensive to measure directly (experiments with human subjects) • At least two conflicting parameters • Completeness vs. quality • No good way to combine them into one single numerical value • Some “user-defined” “weights of importance” for the two • Not formal, depends on the situation • Art
Parameters to evaluate • Performance evaluation • Speed • Space • Tradeoff between them • Common to all systems. Not discussed here. • Retrieval performance (quality?) evaluation • = goodness of a retrieval strategy • A test reference collection: docs and queries. • The “correct” set (or ordering) provided by “experts” • A similarity measure to compare the system output with the “correct” one.
Evaluation: Model User Satisfaction • User task • Batch query processing? Interaction? Mixed? • Way of use • Real-life situation: what factors matter? • Interface type • This chapter: laboratory settings • Repeatability • Scalability
Precision & Recall • Tradeoff (as with time and space) • Assumes the retrieval results are sets • as Boolean; in Vector use threshold • Measures closeness between two sets • Recall:Of relevant docs, how many (%) were retrieved? Others are lost. • Precision:Of retrieved docs, how many (%) are relevant? Others are noise. • Nowadays with huge collections Precision is more important!
Precision & Recall Recall = Precision =
Ranked Output... • “Truth”: an unordered “relevant” set • Output: an ordered guess • We must compare an ordered set with an unordered one
...Ranked Output • Plot the precision vs. recall curve • In the initial part of the list that contains n% of all relevant docs, what is the precision? • 11 standard recall levels: 0%, 10%, ..., 90%, 100%. • 0%: interpolated
Many experiments • Average precision and recall • Ranked output: average precision at each recall level • To get equal (standard) recall levels, interpolate • e.g., with only 3 relevant docs there is no exact 10% level! • Interpolated value at level n = maximum known value between levels n and n + 1 • If none is known, use the nearest known value.
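A sketch of the 11-point interpolated precision for a single query (names are illustrative; the interpolation here is the usual ceiling rule, taking the maximum precision observed at any recall at or above each standard level):

```python
def interpolated_precision_11pt(relevant, ranking):
    """Interpolated precision at the 11 standard recall levels for one query.

    relevant -- set of relevant doc ids
    ranking  -- list of doc ids in the order the system returned them
    """
    # (recall, precision) observed at each rank where a relevant doc appears
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    levels = [i / 10 for i in range(11)]         # 0.0, 0.1, ..., 1.0
    interpolated = {}
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        interpolated[level] = max(candidates) if candidates else 0.0
    return interpolated

# 3 relevant docs, found at ranks 1, 3 and 6:
# levels 0.0-0.3 -> 1.0, levels 0.4-0.6 -> 0.667, levels 0.7-1.0 -> 0.5
print(interpolated_precision_11pt({"d1", "d2", "d3"},
                                  ["d1", "x", "d2", "y", "z", "d3"]))
```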
Precision vs. Recall Figures • Alternative method: document cutoff values • Precision at first 5, 10, 15, 20, 30, 50, 100 docs • Used to compare algorithms. • Simple • Intuitive • NOT a one-value comparison!
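A possible implementation of the cutoff method (the cutoff list follows the slide; dividing by k even when fewer documents were returned is one common convention, not the only one):

```python
def precision_at_cutoffs(relevant, ranking, cutoffs=(5, 10, 15, 20, 30, 50, 100)):
    """Precision after the first k retrieved documents, for several cutoffs k."""
    result = {}
    for k in cutoffs:
        hits = sum(1 for doc in ranking[:k] if doc in relevant)
        result[k] = hits / k          # divided by k even if fewer docs were returned
    return result
```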
Single-value summaries • Performance for an individual query • Can also be averaged over several queries • A histogram over several queries can be made • Tables can be made • Curves cannot be used for this! • Precision at the first relevant doc? • Average precision at (each) seen relevant doc • Favors systems that return several relevant docs first • R-precision • precision at the Rth retrieved doc (R = total number of relevant docs)
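A sketch of these two single-value summaries, assuming the common convention that relevant documents never retrieved contribute a precision of 0 (i.e., the sum is divided by the total number of relevant docs):

```python
def average_precision(relevant, ranking):
    """Average of the precision values observed at each seen relevant document."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def r_precision(relevant, ranking):
    """Precision after exactly R retrieved documents, where R = number of relevant docs."""
    r = len(relevant)
    hits = sum(1 for doc in ranking[:r] if doc in relevant)
    return hits / r if r else 0.0
```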
Precision histogram • Two algorithms, A and B • For each query, plot the difference of their R-precision values, RP(A) − RP(B) • Which is better? Positive bars favor A, negative bars favor B.
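A sketch of how the histogram values could be computed, reusing the r_precision helper from the sketch above (the argument names are illustrative):

```python
def r_precision_histogram(queries, relevant_by_query, rankings_a, rankings_b):
    """Per-query difference RP_A(i) - RP_B(i).

    Positive values favor algorithm A on that query, negative values favor B.
    """
    return {q: r_precision(relevant_by_query[q], rankings_a[q])
               - r_precision(relevant_by_query[q], rankings_b[q])
            for q in queries}
```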
Alternative measures • Problems with the Precision & Recall measures: • Recall cannot be estimated exactly on large collections • Two values, but we need one value to compare • Designed for batch mode, not interactive use. Informativeness! • Designed for a linear ordering of docs (not a weak ordering) • Alternative measures combine both into one: • F-measure (harmonic mean): F = 2 / (1/recall + 1/precision) • E-measure: E = 1 − (1 + b²) / (b²/recall + 1/precision), where b expresses the user’s preference between recall and precision (b = 1 gives E = 1 − F)
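A sketch of the two combined measures, transcribing the formulas above (the handling of zero precision or recall is an illustrative choice):

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (higher is better)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) / (b^2/recall + 1/precision); lower is better.

    b encodes the user's relative preference between recall and precision;
    b = 1 makes E the complement of the F-measure.
    """
    if precision == 0 or recall == 0:
        return 1.0
    return 1 - (1 + b * b) / (b * b / recall + 1 / precision)
```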
User-oriented measures • Definitions (figure): R = relevant docs, A = answer set, U = relevant docs already known to the user; among the retrieved relevant docs, Rk = those previously known to the user, Ru = those previously unknown
User-oriented measures • Coverage ratio: how many of the docs the user expected (already knew about) were retrieved • high coverage = many expected docs found • Novelty ratio: how many of the retrieved relevant docs are new to the user • high novelty = many new docs • Relative recall: # found / # expected • Recall effort: # expected / # examined until those are found • Others: • expected search length (good for weak orderings) • satisfaction (considers only relevant docs) • frustration (considers only non-relevant docs)
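A sketch of the coverage and novelty ratios under the set definitions above (argument names are illustrative):

```python
def coverage_and_novelty(relevant, known_to_user, retrieved):
    """User-oriented measures for one query.

    relevant      -- all docs judged relevant (R)
    known_to_user -- relevant docs the user already knew about (U)
    retrieved     -- answer set returned by the system (A)
    """
    retrieved_relevant = relevant & retrieved
    known_found = known_to_user & retrieved_relevant     # Rk
    new_found = retrieved_relevant - known_to_user       # Ru
    coverage = len(known_found) / len(known_to_user) if known_to_user else 0.0
    novelty = (len(new_found) / len(retrieved_relevant)
               if retrieved_relevant else 0.0)
    return coverage, novelty
```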
Reference collections • Texts with queries and relevant docs known • TREC • Text REtrieval Conference. Different in different years • Wide variety of topics. Document structure marked up. • 6 GB. See the NIST website: available at small cost • Not all relevant docs are marked! • Pooling method: • top 100 docs in the rankings of many search engines • manually verified • It has been verified that this is a good approximation to the “real” set
...TREC tasks • Ad-hoc (conventional: query → answer) • Routing (ranked filtering of a changing collection) • Chinese ad-hoc • Filtering (changing collection; no ranking) • Interactive (no ranking) • NLP: does it help? • Cross-language (ad-hoc) • High precision (only 10 docs in the answer) • Spoken document retrieval (written transcripts) • Very large corpus (ad-hoc, 20 GB = 7.5 M docs) • Query task (several query versions; does the strategy depend on them?) • Query transformation: automatic, manual
...TREC evaluation • Summary table statistics • # of requests used in the task • # of retrieved docs; # of relevant docs retrieved and not retrieved • Recall-precision averages • 11 standard points. Interpolated (and not) • Document-level averages • Can also include the average R-precision value • Average precision histogram • By topic. • E.g., the difference between the R-precision of this system and the average over all systems
Smaller collections • Simpler to use • Can include info that TREC does not • Can be of specialized type (e.g., include co-citations) • Less sparse, greater overlap between queries • Examples: • CACM • ISI • there are others
CACM collection • Communications of ACM, 1958-1979 • 3204 articles • Computer science • Structure info (author, date, citations, ...) • Stems (only title and abstract) • Good for algorithms relying on cross-citations • If a paper cites another one, they are related • If two papers cite the same ones, they are related • 52 queries with Boolean form and answer sets
ISI collection • On information sciences • 1460 docs • For similarity in terms and cross-citation • Includes: • Stems (title and abstracts) • Number of cross-citations • 35 natural-language queries with Boolean form and answer sets
Cystic Fibrosis (CF) collection • Medical • 1239 docs • MEDLINE data • keywords assigned manually! • 100 requests • 4 judgments for each doc • Good to see agreement • Degrees of relevance, from 0 to 2 • Good answer set overlap • can be used for learning from previous queries
Research issues • Different types of interfaces; interactive systems: • What measures to use? • Such as informativeness
Conclusions • Main measures: Precision & Recall. • Defined for sets • Rankings are evaluated through initial subsets • There are measures that combine the two into one • They involve user-defined preferences • Many (other) characteristics • An algorithm can be good at some and bad at others • Averages are used, but are not always meaningful • Reference collections with known answers exist for evaluating new algorithms
Thank you! Till October 9. October 23: midterm exam.