INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 24 BENCHMARKS FOR THE EVALUATION OF IR SYSTEMS
ACKNOWLEDGEMENTS The presentation of this lecture draws on the following sources • “Introduction to Information Retrieval” by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze • “Managing Gigabytes” by Ian H. Witten, Alistair Moffat, and Timothy C. Bell • “Modern Information Retrieval” by Ricardo Baeza-Yates • “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla
Outline • Happiness: elusive to measure • Human-labeled corpora (gold standard) • Measuring the quality of a new search algorithm • Relevance judgments • Standard relevance benchmarks • Evaluating an IR system
Happiness: elusive to measure • Most common proxy: relevance of search results • But how do you measure relevance? • We will detail a methodology here, then examine its issues • Relevance measurement requires 3 elements: • A benchmark document collection • A benchmark suite of queries • A usually binary assessment of Relevant or Nonrelevant for each query-document pair • Some work uses more-than-binary judgments, but binary is the standard
Human Labeled Corpora (Gold Standard) • Start with a corpus of documents. • Collect a set of queries for this corpus. • Have one or more human experts exhaustively label the relevant documents for each query. • Typically assumes binary relevance judgments. • Requires considerable human effort for large document/query corpora.
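To make the gold-standard idea concrete, here is a minimal sketch (in Python, with hypothetical document IDs, queries, and judgments) of how such human labels are typically stored: a per-query set of relevant document IDs, with every other document implicitly non-relevant.

```python
# Minimal sketch of a human-labeled gold standard (all IDs and data hypothetical).
# For each query we store the set of document IDs judged relevant; every other
# document in the corpus is implicitly non-relevant (binary judgments).

corpus = {
    "d1": "how to clean a swimming pool",
    "d2": "pool cleaner reviews and prices",
    "d3": "history of competitive swimming",
}

queries = {
    "q1": "pool cleaner",
    "q2": "swimming history",
}

# Gold standard: query ID -> set of relevant document IDs (labeled by human experts).
qrels = {
    "q1": {"d1", "d2"},
    "q2": {"d3"},
}

def is_relevant(query_id: str, doc_id: str) -> bool:
    """Binary relevance lookup against the gold standard."""
    return doc_id in qrels.get(query_id, set())

print(is_relevant("q1", "d2"))  # True
print(is_relevant("q1", "d3"))  # False
```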
So you want to measure the quality of a new search algorithm • Benchmark documents – nozama’s products • Benchmark query suite – more on this below • Judgments of document relevance for each query • Do we really need every query-doc pair? [Diagram: relevance-judgment matrix over 5 million nozama.com products and 50,000 sample queries]
Relevance judgments • Binary (relevant vs. non-relevant) in the simplest case, more nuanced (0, 1, 2, 3, …) in others • What are some issues already? • 5 million products times 50K queries takes us into the range of a quarter trillion judgments • If each judgment took a human 2.5 seconds, that is over 6 × 10^11 seconds of assessor time, well over a billion dollars if you pay people $10 per hour to assess • And 10K new products arrive per day, so the judgments are never finished
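A back-of-the-envelope sketch of that arithmetic, with the per-judgment time and hourly rate as assumed parameters (the document and query counts come from the slide above):

```python
# Back-of-the-envelope cost of exhaustively judging every query-document pair.
# The per-judgment time and pay rate are illustrative assumptions.

num_docs = 5_000_000        # benchmark documents (nozama's products)
num_queries = 50_000        # sample queries
seconds_per_judgment = 2.5  # assumed time for one human judgment
dollars_per_hour = 10.0     # assumed assessor pay rate

pairs = num_docs * num_queries                     # every query-document pair
hours = pairs * seconds_per_judgment / 3600
cost = hours * dollars_per_hour

print(f"{pairs:.2e} judgments")                    # 2.50e+11 (a quarter trillion)
print(f"{hours:.2e} assessor-hours, about ${cost:,.0f}")
```

However the numbers are rounded, the conclusion is the same: exhaustively judging every pair is hopeless, which motivates sampling and cheaper labeling.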
Crowdsource relevance judgments? • Present query-document pairs to low-cost labor on online crowd-sourcing platforms • Hope that this is cheaper than hiring qualified assessors • There is a large literature on using crowd-sourcing for such tasks • Main takeaway – you get some signal, but the variance in the resulting judgments is very high
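One common way to cope with that variance (a generic sketch, not tied to any particular crowd-sourcing platform) is to collect several labels per query-document pair and aggregate them, for example by majority vote:

```python
from collections import Counter

# Hypothetical crowd labels: (query_id, doc_id) -> binary judgments from
# several workers. Majority voting reduces, but does not remove, the noise.
crowd_labels = {
    ("q1", "d1"): [1, 1, 0],
    ("q1", "d2"): [1, 0, 0],
    ("q1", "d3"): [0, 0, 0],
}

def majority_vote(labels):
    """Return the majority binary label; ties default to non-relevant (0)."""
    counts = Counter(labels)
    return 1 if counts[1] > counts[0] else 0

aggregated = {pair: majority_vote(labels) for pair, labels in crowd_labels.items()}
print(aggregated)  # {('q1', 'd1'): 1, ('q1', 'd2'): 0, ('q1', 'd3'): 0}
```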
What else? (IIR Sec. 8.5) • We still need test queries • They must be germane to the documents available • They must be representative of actual user needs • Random query terms drawn from the documents are generally not a good idea • Sample from query logs if available (a small sampling sketch follows) • Classically (non-Web IR): query rates were low and query logs scarce, so experts hand-crafted “user needs”
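Where a query log is available, a frequency-weighted sample keeps the test suite representative of what users actually ask. A small sketch with a made-up log (sampling with replacement for simplicity):

```python
import random
from collections import Counter

# Hypothetical query log: one entry per query issued by real users.
query_log = ["pool cleaner"] * 50 + ["pool ladder"] * 20 + ["swimming history"] * 5

# Frequency-weighted sampling (with replacement): popular queries are more
# likely to be picked, so the benchmark suite mirrors actual user needs.
freq = Counter(query_log)
queries, weights = zip(*freq.items())
benchmark_queries = random.choices(queries, weights=weights, k=10)
print(benchmark_queries)
```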
Standard relevance benchmarks • TREC (Text REtrieval Conference) – the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years • Reuters and other benchmark document collections are used • “Retrieval tasks” are specified, sometimes as queries • Human experts mark, for each query and each document, Relevant or Nonrelevant • or at least for the subset of documents that some system returned for that query
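TREC distributes these judgments as plain-text “qrels” files, where each line gives a topic ID, an (unused) iteration field, a document ID, and a relevance grade. A parser sketch, assuming that whitespace-separated format (the file name in the usage comment is hypothetical):

```python
from collections import defaultdict

def load_qrels(path):
    """Parse a TREC-style qrels file into {topic_id: {doc_id: relevance}}.

    Each non-empty line is expected to look like:
        401 0 SOMEDOC-1234 1
    i.e. topic, iteration, document ID, binary or graded relevance.
    """
    qrels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            topic, _iteration, doc_id, rel = line.split()
            qrels[topic][doc_id] = int(rel)
    return qrels

# Usage (hypothetical file):
# qrels = load_qrels("qrels.adhoc.txt")
# relevant_docs = {d for d, r in qrels["401"].items() if r > 0}
```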
Evaluation: benchmarks • A benchmark collection contains: • A set of standard documents and queries/topics • A list of relevant documents for each query • Standard collections for traditional IR: • Smart collection: ftp://ftp.cs.cornell.edu/pub/smart • TREC: http://trec.nist.gov/ [Diagram: the algorithm under test runs the standard queries against the standard document collection; the retrieved result is compared with the standard result to compute precision and recall]
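Given the standard result (the judged-relevant set) and a system’s retrieved result, set-based precision and recall fall out directly. A minimal sketch with hypothetical document IDs:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# The system returned 4 documents, 2 of the 3 relevant ones among them.
print(precision_recall({"d1", "d2", "d5", "d7"}, {"d1", "d2", "d3"}))
# (0.5, 0.6666666666666666)
```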
Some public test collections [Table: typical TREC and other public test collections]
Evaluating an IR system • Note: the user need is translated into a query • Relevance is assessed relative to the user need, not the query • E.g., information need: my swimming pool bottom is becoming black and needs to be cleaned • Query: pool cleaner • Assess whether the document addresses the underlying need, not whether it contains these words