The Evolution of Shared-Task Evaluation Douglas W. Oard College of Information Studies and UMIACS University of Maryland, College Park, USA FIRE
The Story • Evaluation-guided research • The three C’s • Five examples • Thinking forward
Evaluation-Guided Research • Information Retrieval • Text classification • Automatic Speech Recognition • Optical Character Recognition • Named Entity Recognition • Machine Translation • Extractive summarization • …
Key Elements • Task model • Single-valued evaluation measure • Affordable evaluation process
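To make the "single-valued evaluation measure" element concrete, here is a minimal sketch of mean average precision (MAP), the kind of single-number summary that campaigns such as TREC report. The ranked lists and relevance judgments in the example are toy data, not from any actual test collection.

```python
# A minimal sketch of mean average precision (MAP), one common
# single-valued IR evaluation measure. The run and judgments below
# are toy examples, not from any actual test collection.

def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one topic's ranked list."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_doc_ids) if relevant_doc_ids else 0.0

def mean_average_precision(run, qrels):
    """Mean of per-topic average precision over all judged topics."""
    return sum(average_precision(run[t], qrels[t]) for t in qrels) / len(qrels)

# Toy run (topic -> ranked documents) and judgments (topic -> relevant set).
run = {"T1": ["d3", "d1", "d7"], "T2": ["d2", "d9", "d4"]}
qrels = {"T1": {"d1", "d7"}, "T2": {"d4"}}
print(mean_average_precision(run, qrels))  # one number summarizing the run
```

Collapsing a whole run to one number is exactly what makes progress easy to track, and also what drives the "privileging the measurable" critique above.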
Critiques • Early convergence • Duplicative ($) • Incrementalism • Privileging the measurable
The Big Four • TREC • NTCIR • CLEF • FIRE
10 More • TDT • Amaryllis • INEX • TRECVid • TAC • MediaEval • STD • OAEI • CoNLL • WePS
What We Create • Collections • Comparison points • Baseline results • Communities • Competition?
Elsewhere in the Ecosystem … • Capacity • From universities, industry, individuals, and funding agencies • Completed work • Often requires working outside our year-long innovation cycles with rigid timelines • Culling • Conferences and journals are the guardians of community standards
A Typical Task Life Cycle • Year 1: • Task definition • Evaluation design • Community building • Year 2: • Creating training data • Year 3: • Reusable test collection • Establishing strong baselines
Some Sea Stories • TDT • CLIR • Speech Retrieval • E-Discovery
Topic Detection and Tracking • Cultures • Speech, sponsor • Event-based relevance • Document boundary discovery • Complexity • 5 tasks, 3 languages, 2 modalities • Lasting influence
Cross-Language IR • TREC CLIR (Arabic) • Standard resources • Light stemming • Problematic task model • CLEF Interactive CLIR • Controlled user studies • Problematic evaluation design • Qualitative vs. quantitative
Speech Retrieval • TREC Spoken Document Retrieval • The “solved problem” • CLEF Cross-Language Speech Retrieval • Grounded queries • Start time error evaluation measure • FIRE QA for the Spoken Web
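The "start time error evaluation measure" bullet refers to scoring systems on how close their proposed playback start times fall to reference start times, rather than on whole-document retrieval. The sketch below illustrates that idea only; the linear decay and the 60-second window are my own illustrative assumptions, not the CLEF Cross-Language Speech Retrieval track's actual scoring rules.

```python
# A rough sketch of start-time-based credit for speech retrieval:
# full credit when the proposed playback start matches the reference,
# decaying as the error grows. The linear decay and 60-second window
# are illustrative assumptions, not the track's actual definition.

def start_time_credit(proposed_start, reference_start, window=60.0):
    """Score in [0, 1]: 1.0 at the reference start time,
    falling linearly to 0.0 once the error reaches `window` seconds."""
    error = abs(proposed_start - reference_start)
    return max(0.0, 1.0 - error / window)

print(start_time_credit(125.0, 120.0))  # 5 s off  -> ~0.92
print(start_time_credit(200.0, 120.0))  # 80 s off -> 0.0
```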
TREC Legal Track • Iterative task design • Sampling • Measurement error • Families • Cultures
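The "sampling" and "measurement error" bullets point to estimating effectiveness from judged random samples rather than exhaustive review, and reporting the uncertainty that sampling introduces. Below is a minimal sketch of that idea: a proportion estimated from a sample with a normal-approximation confidence interval. The estimator and the sample figures are illustrative, not the Legal Track's actual estimation protocol.

```python
# A minimal sketch of sample-based estimation with measurement error:
# estimate the fraction of sampled retrieved documents judged relevant,
# and attach a 95% confidence interval. The normal-approximation interval
# and the figures are illustrative, not the TREC Legal Track's protocol.
import math

def proportion_with_ci(successes, sample_size, z=1.96):
    """Point estimate and 95% normal-approximation confidence interval."""
    p = successes / sample_size
    half_width = z * math.sqrt(p * (1.0 - p) / sample_size)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# e.g. 140 of 400 sampled retrieved documents judged relevant
estimate, low, high = proportion_with_ci(140, 400)
print(f"precision ~ {estimate:.2f} (95% CI {low:.2f} to {high:.2f})")
```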
What’s in a Test Collection? • Queries • Documents • Relevance judgments
What’s in a Test Collection? • Queries • Content • Units of judgment • Relevance judgments • Evaluation measure(s)
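The refined list above can be read as a data-model claim: a reusable test collection bundles queries, content, units of judgment with relevance judgments, and an evaluation measure, so new systems can be scored without new judging. Here is a minimal, hypothetical sketch of that bundle; the field names are my own, not any campaign's actual format.

```python
# A minimal, hypothetical sketch of what a reusable test collection bundles:
# topics (queries), content, relevance judgments over units of judgment,
# and a hook for applying an evaluation measure to a system run.
# Field names are illustrative, not any campaign's actual format.
from dataclasses import dataclass

@dataclass
class TestCollection:
    topics: dict      # topic_id -> query text
    content: dict     # unit_id -> document or segment text
    judgments: dict   # topic_id -> set of unit_ids judged relevant

    def evaluate(self, run, measure):
        """Score a run (topic_id -> ranked unit_ids) with a per-topic measure."""
        scores = [measure(run[t], self.judgments[t]) for t in self.judgments]
        return sum(scores) / len(scores)
```

Passing the earlier `average_precision` sketch as the `measure` argument would reproduce the MAP computation shown above; swapping in a different per-topic measure is what lets the same judgments serve new evaluation questions.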
Personality Types • Innovators • Organizers • Optimizers • Deployers • Resourcers
Some Takeaways • Progressive invalidation • Social engineering • Innovation from outside
A Final Thought It isn’t what you don’t know that limits your thinking. Rather, it is what you know that isn’t true.