CLEF 2009, Corfu. Question Answering Track Overview. A. Peñas P. Forner R. Sutcliffe Á. Rodrigo C. Forascu I. Alegria D. Giampiccolo N. Moreau P. Osenova D. Santos L.M. Cabral J. Turmo P.R. Comas S. Rosset O. Galibert N. Moreau D. Mostefa P. Rosso D. Buscaldi
2009 campaign • ResPubliQA: QA on European Legislation • GikiCLEF: QA requiring geographical reasoning on Wikipedia • QAST: QA on Speech Transcriptions of European Parliament Plenary sessions
ResPubliQA 2009: QA on European Legislation • Additional Assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru • Advisory Board: Donna Harman, Maarten de Rijke, Dominique Laurent • Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova
Objectives • Move towards a domain of potential users • Compare systems working in different languages • Compare QA Tech. with pure IR • Introduce more types of questions • Introduce Answer Validation Tech.
Collection • Subset of JRC-Acquis (10,700 docs per language) • Parallel at the document level • EU treaties, EU legislation, agreements and resolutions • Economy, health, law, food, … • Between 1950 and 2006 • XML-TEI.2 encoding • Unfortunately, not parallel at the paragraph level -> extra work
500 questions • REASON • Why did a commission expert conduct an inspection visit to Uruguay? • PURPOSE/OBJECTIVE • What is the overall objective of the eco-label? • PROCEDURE • How are stable conditions in the natural rubber trade achieved? • In general, any question that can be answered in a paragraph
500 questions • Also • FACTOID • In how many languages is the Official Journal of the Community published? • DEFINITION • What is meant by “whole milk”? • No NIL questions
Selection of the final pool of 500 questions out of the 600 produced
Systems response: No Answer ≠ Wrong Answer • Decide whether to give the answer or not [ YES | NO ] • A classification problem: Machine Learning, provers, Textual Entailment, etc. • Provide the paragraph (ID + Text) that answers the question • Aim: leaving a question unanswered is worth more than giving a wrong answer
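As an illustration of the answer-or-abstain decision above, here is a minimal Python sketch; the validation score, threshold, and function names are hypothetical, not taken from any participating system.

```python
from typing import Optional, Tuple

Paragraph = Tuple[str, str]  # (paragraph_id, paragraph_text)

def respond(candidate: Paragraph, validation_score: float,
            threshold: float = 0.5) -> Optional[Paragraph]:
    """Return the candidate paragraph only if the validation component
    (e.g. an ML classifier or a textual-entailment check) is confident
    enough; otherwise leave the question unanswered (None), which the
    track rewards over returning a likely-wrong paragraph."""
    return candidate if validation_score >= threshold else None
```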
Assessments R: The question is answered correctly W: The question is answered incorrectly NoA: The question is not answered • NoA R: NoA, but the candidate answer was correct • NoA W: NoA, and the candidate answer was incorrect • NoA Empty: NoA and no candidate answer was given Evaluation measure: c@1 • Extension of the traditional accuracy (proportion of questions correctly answered) • Takes unanswered questions into account
Evaluation measure n: Number of questions nR: Number of correctly answered questions nU: Number of unanswered questions
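The c@1 formula itself appeared as a figure in the original slides; it is reconstructed here from the definitions above, and it matches the special cases listed on the next slide:

$$ c@1 = \frac{n_R + n_U \cdot \frac{n_R}{n}}{n} $$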
Evaluation measure properties: If nU = 0 then c@1 = nR/n (plain accuracy). If nR = 0 then c@1 = 0. If nU = n then c@1 = 0. • Leaving a question unanswered adds value only if it avoids returning a wrong answer • The added value is the performance shown on the answered questions: accuracy
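A minimal sketch of the measure in Python, following the formula above; the function and argument names are ours, not from the track guidelines.

```python
def c_at_1(n_correct: int, n_unanswered: int, n_total: int) -> float:
    """c@1: accuracy extended so that unanswered questions earn credit
    in proportion to the accuracy achieved over all questions."""
    if n_total == 0:
        return 0.0
    accuracy = n_correct / n_total          # nR / n
    # Unanswered questions count accuracy instead of zero, so abstaining
    # beats answering wrongly but never beats answering correctly.
    return (n_correct + n_unanswered * accuracy) / n_total

# Example: 50 correct, 30 unanswered out of 100 questions
# -> c@1 = (50 + 30 * 0.5) / 100 = 0.65 (plain accuracy would be 0.50)
```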
Detecting wrong answers: while maintaining the number of correct answers, the candidate answer was indeed incorrect for 83% of the unanswered questions. A very good step towards improving the systems.
Many systems below the IR baselines • IR is important, but not enough • Feasible task • The perfect combination is 50% better than the best system
Comparison across languages • Same questions • Same documents • Same baseline systems • A strict comparison, affected only by the language variable • It is thus feasible to detect the most promising approaches across languages
Comparison across languages Systems above the baselines Icia, Boolean + intensive NLP + ML-based validation & very good knowledge of the collection (Eurovoc terms…) Baseline, Okapi-BM25 tuned for paragraph retrieval
Comparison across languages Systems above the baselines nlel092, ngram-based retrieval, combining evidence from several languages Baseline, Okapi-BM25 tuned for paragraph retrieval
Comparison across languages Systems above the baselines Uned, Okapi-BM25 + NER + paragraph validation + ngram based re-ranking Baseline, Okapi-BM25 tuned for paragraph retrieval
Comparison across languages Systems above the baselines nlel091, ngram-based paragraph retrieval Baseline, Okapi-BM25 tuned for paragraph retrieval
Comparison across languages Systems above the baselines Loga, Lucene + deep NLP + Logic + ML-based validation Baseline, Okapi-BM25 tuned for paragraph retrieval
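For reference, a minimal sketch of Okapi BM25 paragraph scoring, the kind of ranking function behind the baseline runs above; k1 and b are common defaults here, not the values actually tuned for the track.

```python
import math
from collections import Counter

def bm25_score(query_terms, paragraph_terms, doc_freq, n_paragraphs,
               avg_paragraph_len, k1=1.2, b=0.75):
    """Okapi BM25 score of one paragraph for a query.
    doc_freq maps a term to the number of paragraphs that contain it."""
    tf = Counter(paragraph_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        # Inverse document frequency (kept non-negative by the +1 in the log)
        idf = math.log(1.0 + (n_paragraphs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation with paragraph-length normalisation
        denom = tf[term] + k1 * (1.0 - b + b * len(paragraph_terms) / avg_paragraph_len)
        score += idf * tf[term] * (k1 + 1.0) / denom
    return score
```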
Conclusion • Compare systems working in different languages • Compare QA technology with pure IR • Pay more attention to paragraph retrieval • An old issue, late-90s state of the art (English) • Pure IR performance: 0.38 - 0.58 • Highest differences with respect to the IR baselines: 0.44 – 0.68 • Intensive NLP • ML-based answer validation • Introduce more types of questions • Some types are difficult to distinguish • Any question that can be answered in a paragraph • Analysis of results by question type (in progress)
Conclusion • Introduce Answer Validation technology • Evaluation measure: c@1 • Rewards reducing the number of wrong answers • Detecting wrong answers is feasible • Feasible task • 90% of the questions have been answered • Room for improvement: best systems around 60% • Even with fewer participants we have • More comparison • More analysis • More learning • ResPubliQA proposal for 2010 • SC and breakout session
Interest in ResPubliQA 2010? But we need more. You already have a Gold Standard of 500 questions & answers to play with…