CLEF 2009, Corfu. Question Answering Track Overview. A. Peñas P. Forner R. Sutcliffe Á. Rodrigo C. Forascu I. Alegria D. Giampiccolo N. Moreau P. Osenova D. Santos L.M. Cabral J. Turmo P.R. Comas S. Rosset O. Galibert N. Moreau D. Mostefa P. Rosso D. Buscaldi
2009 campaign • ResPubliQA: QA on European Legislation • GikiCLEF: QA requiring geographical reasoning on Wikipedia • QAST: QA on Speech Transcriptions of European Parliament Plenary sessions
ResPubliQA 2009: QA on European Legislation • Additional Assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru • Advisory Board: Donna Harman, Maarten de Rijke, Dominique Laurent • Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Objectives • Move towards a domain of potential users • Compare systems working in different languages • Compare QA Tech. with pure IR • Introduce more types of questions • Introduce Answer Validation Tech.
Collection • Subset of JRC-Acquis (10,700 docs per language) • Parallel at the document level • EU treaties, EU legislation, agreements and resolutions • Economy, health, law, food, … • Between 1950 and 2006 • XML-TEI.2 encoding • Unfortunately, not parallel at the paragraph level -> extra work
500 questions • REASON • Why did a commission expert conduct an inspection visit to Uruguay? • PURPOSE/OBJECTIVE • What is the overall objective of the eco-label? • PROCEDURE • How are stable conditions in the natural rubber trade achieved? • In general, any question that can be answered in a paragraph
500 questions • Also • FACTOID • In how many languages is the Official Journal of the Community published? • DEFINITION • What is meant by “whole milk”? • No NIL questions
Selection of the final pool of 500 questions out of the 600 produced
Systems response: No Answer ≠ Wrong Answer • Decide whether to give the answer or not [ YES | NO ] • A classification problem: Machine Learning, provers, Textual Entailment, etc. • Provide the paragraph (ID + Text) that answers the question • Aim: leaving a question unanswered is worth more than giving a wrong answer
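As an illustration of the answer-or-abstain decision above, here is a minimal Python sketch; the validation score, threshold, and function names are hypothetical, not taken from any participating system.

```python
from typing import Optional, Tuple

Paragraph = Tuple[str, str]  # (paragraph_id, paragraph_text)

def respond(candidate: Paragraph, validation_score: float,
            threshold: float = 0.5) -> Optional[Paragraph]:
    """Return the candidate paragraph only if the validation component
    (e.g. an ML classifier or a textual-entailment check) is confident
    enough; otherwise leave the question unanswered (None), which the
    track rewards over returning a likely-wrong paragraph."""
    return candidate if validation_score >= threshold else None
```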
Assessments R: The question is answered correctly W: The question is answered incorrectly NoA: The question is not answered • NoA R: NoA, but the candidate answer was correct • NoA W: NoA, and the candidate answer was incorrect • NoA Empty: NoA and no candidate answer was given Evaluation measure: c@1 • Extension of the traditional accuracy (proportion of questions correctly answered) • Takes unanswered questions into account
Evaluation measure n: Number of questions nR: Number of correctly answered questions nU: Number of unanswered questions
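The c@1 formula itself appeared as a figure in the original slides; it is reconstructed here from the definitions above, and it matches the special cases listed on the next slide:

$$ c@1 = \frac{n_R + n_U \cdot \frac{n_R}{n}}{n} $$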
Evaluation measure properties: If nU = 0 then c@1 = nR/n (plain accuracy). If nR = 0 then c@1 = 0. If nU = n then c@1 = 0. • Leaving a question unanswered adds value only if it avoids returning a wrong answer • The added value is the performance shown on the answered questions: accuracy
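A minimal sketch of the measure in Python, following the formula above; the function and argument names are ours, not from the track guidelines.

```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1: accuracy extended so that unanswered questions earn credit
    in proportion to the accuracy achieved over all questions."""
    if n_total == 0:
        return 0.0
    accuracy = n_correct / n_total          # nR / n
    # Unanswered questions count accuracy instead of zero, so abstaining
    # beats answering wrongly but never beats answering correctly.
    return (n_correct + n_unanswered * accuracy) / n_total

# Example: 50 correct, 30 unanswered out of 100 questions
# -> c@1 = (50 + 30 * 0.5) / 100 = 0.65 (plain accuracy would be 0.50)
```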
Detecting wrong answers: while maintaining the number of correct answers, the candidate answer was indeed incorrect for 83% of the unanswered questions. A very good step towards improving the systems.
Many systems below the IR baselines • IR is important, but not enough • Feasible task • The perfect combination is 50% better than the best system
Comparison across languages • Same questions • Same documents • Same baseline systems • A strict comparison, affected only by the language variable • It is thus feasible to detect the most promising approaches across languages
Comparison across languages Systems above the baselines Icia, Boolean + intensive NLP + ML-based validation & very good knowledge of the collection (Eurovoc terms…) Baseline, Okapi-BM25 tuned for paragraph retrieval
Comparison across languages Systems above the baselines nlel092, ngram-based retrieval, combining evidence from several languages Baseline, Okapi-BM25 tuned for paragraph retrieval
Comparison across languages Systems above the baselines Uned, Okapi-BM25 + NER + paragraph validation + ngram based re-ranking Baseline, Okapi-BM25 tuned for paragraph retrieval
Comparison across languages Systems above the baselines nlel091, ngram-based paragraph retrieval Baseline, Okapi-BM25 tuned for paragraph retrieval
Comparison across languages Systems above the baselines Loga, Lucene + deep NLP + Logic + ML-based validation Baseline, Okapi-BM25 tuned for paragraph retrieval
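For reference, a minimal sketch of Okapi BM25 paragraph scoring, the kind of ranking function behind the baseline runs above; k1 and b are common defaults here, not the values actually tuned for the track.

```python
import math
from collections import Counter

def bm25_score(query_terms, paragraph_terms, doc_freq, n_paragraphs,
               avg_paragraph_len, k1=1.2, b=0.75):
    """Okapi BM25 score of one paragraph for a query.
    doc_freq maps a term to the number of paragraphs that contain it."""
    tf = Counter(paragraph_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        # Inverse document frequency (kept non-negative by the +1 in the log)
        idf = math.log(1.0 + (n_paragraphs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with paragraph-length normalisation
        denom = tf[term] + k1 * (1.0 - b + b * len(paragraph_terms) / avg_paragraph_len)
        score += idf * tf[term] * (k1 + 1.0) / denom
    return score
```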
Conclusion • Compare systems working in different languages • Compare QA technology with pure IR • Pay more attention to paragraph retrieval • An old issue, late-90s state of the art (English) • Pure IR performance: 0.38 - 0.58 • Highest differences with respect to the IR baselines: 0.44 – 0.68 • Intensive NLP • ML-based answer validation • Introduce more types of questions • Some types are difficult to distinguish • Any question that can be answered in a paragraph • Analysis of results by question type (in progress)
Conclusion • Introduce Answer Validation technology • Evaluation measure: c@1 • Rewards reducing the number of wrong answers • Detecting wrong answers is feasible • Feasible task • 90% of the questions have been answered • Room for improvement: best systems around 60% • Even with fewer participants we have • More comparison • More analysis • More learning • ResPubliQA proposal for 2010 • SC and breakout session
Interest in ResPubliQA 2010? But we need more. You already have a Gold Standard of 500 questions & answers to play with…