Answer Validation Exercise (AVE) - QA subtrack at the Cross-Language Evaluation Forum
Thanks to… Bernardo Magnini, Danilo Giampiccolo, Pamela Forner, Petya Osenova, Christelle Ayache, Bogdan Sacaleanu, Diana Santos, Juan Feu, Ido Dagan, …
UNED (coord.): Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo
What? Answer Validation Exercise
• Validate the correctness of the answers given by real QA systems...
• ...the answers of participants at CLEF QA 2006
Why?
• Give feedback on a single QA module
• Improve QA systems' performance
• Improve systems' self-scoring
• Help humans in the assessment of QA systems' output
• Develop criteria for collaborative QA systems
• ...
How? By turning it into an RTE exercise
• The QA system receives a Question and returns an Exact Answer together with a Supporting snippet & doc ID (several sentences, <500 bytes)
• The answer is turned into affirmative form to build the Hypothesis; the supporting snippet is the Text
• If the text semantically entails the hypothesis, then the answer is expected to be correct
Example
• Question: Who is the President of Mexico?
• Answer (obsolete): Vicente Fox
• Hypothesis: Vicente Fox is the President of Mexico
• Supporting Text: “...President Vicente Fox promises a more democratic Mexico...”
• Exercise: does the Text entail the Hypothesis? Answer: YES | NO
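The hypothesis construction step can be pictured with a small sketch. The snippet below is only an illustration, not the AVE tooling: the `EntailmentPair` class, the `build_hypothesis` helper and its single rewrite rule are hypothetical, and real systems used more general question-to-statement rewriting.

```python
# Illustrative sketch only: build a text-hypothesis pair from a question,
# a candidate answer and its supporting snippet (names are hypothetical,
# not part of the official AVE tooling).
import re
from dataclasses import dataclass


@dataclass
class EntailmentPair:
    text: str        # supporting snippet returned by the QA system
    hypothesis: str  # question rewritten in affirmative form with the answer


def build_hypothesis(question: str, answer: str) -> str:
    """Naive rewrite that only handles 'Who/What is X?' questions."""
    m = re.match(r"(?i)(?:who|what)\s+is\s+(.+?)\??\s*$", question.strip())
    if m:
        return f"{answer} is {m.group(1)}"
    # No rewrite rule for this question type: fall back to a plain template.
    return f"{answer} ({question})"


pair = EntailmentPair(
    text="...President Vicente Fox promises a more democratic Mexico...",
    hypothesis=build_hypothesis("Who is the President of Mexico?", "Vicente Fox"),
)
print(pair.hypothesis)  # Vicente Fox is the President of Mexico
```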
Looking for robust systems
• Hypotheses are built semi-automatically from the systems' answers
• Some answers are correct and exact
• Many are too long, too short, or simply wrong
• Many hypotheses have
  • Wrong syntax but are understandable
  • Wrong syntax and are not understandable
  • Wrong semantics
So, the exercise
• Return an entailment value (YES | NO) for each given text-hypothesis pair
• Results were evaluated against the QA human assessments
• Subtasks: English, Spanish, Italian, Dutch, French, German, Portuguese and Bulgarian
Collections Available for CLEF participants at nlp.uned.es/QA/ave/
Evaluation
• Collections are not balanced
• Approach: detect whether there is enough evidence to accept an answer
• Measures: precision, recall and F over the YES pairs (those where the text entails the hypothesis)
• Baseline system: accept all answers (always answer YES)
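To make the measures concrete, here is a rough sketch of the evaluation just described; the function names and toy data are assumptions for illustration, not the official AVE scorer. It computes precision, recall and F1 over the YES pairs and applies the accept-all baseline.

```python
# Sketch of the evaluation over YES pairs (hypothetical helper, not the AVE scorer).
from typing import List


def f_over_yes(gold: List[str], predicted: List[str]) -> dict:
    """Precision, recall and F1 taking YES (text entails hypothesis) as positive."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == "YES" and p == "YES")
    fp = sum(1 for g, p in zip(gold, predicted) if g == "NO" and p == "YES")
    fn = sum(1 for g, p in zip(gold, predicted) if g == "YES" and p == "NO")
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def accept_all_baseline(pairs: List[str]) -> List[str]:
    """Baseline from the slide: always answer YES."""
    return ["YES" for _ in pairs]


# Toy, unbalanced gold labels (far more NO than YES pairs, as in the real collections).
gold = ["YES", "NO", "NO", "NO", "YES", "NO"]
print(f_over_yes(gold, accept_all_baseline(gold)))
# precision = 2/6 ≈ 0.33, recall = 1.0, F1 = 0.5
```

Because precision, recall and F are computed only over the YES class, the accept-all baseline gets perfect recall but low precision on such unbalanced data, which is exactly what participating systems had to beat.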
Participants and runs (languages: DE, EN, ES, FR, IT, NL, PT)
• Fernuniversität in Hagen: 2 runs
• Language Computer Corporation: 2 runs (one run in each of two languages)
• U. Rome "Tor Vergata": 2 runs
• U. Alicante (Kozareva): 13 runs (DE 2, EN 2, ES 2, FR 2, IT 2, NL 2, PT 1)
• U. Politecnica de Valencia: 1 run
• U. Alicante (Ferrández): 2 runs
• LIMSI-CNRS: 1 run
• U. Twente: 10 runs (DE 1, EN 2, ES 2, FR 1, IT 1, NL 2, PT 1)
• UNED (Herrera): 2 runs
• UNED (Rodrigo): 1 run
• ITC-irst: 1 run
• R2D2 project: 1 run
• Total: 38 runs (DE 5, EN 11, ES 9, FR 4, IT 3, NL 4, PT 2)
Conclusions
• Developed methodologies to
  • Build collections from QA responses
  • Evaluate in a chain with the QA Track
• New testing collections for the QA and RTE communities
  • In 7 languages, not only English
• Evaluation in a real environment
  • Real systems' outputs -> AVE input
Conclusions
• Reformulating Answer Validation as a Textual Entailment problem is feasible
  • The semi-automatic generation of the collection introduces about 4% error
• Good participation
  • 11 systems, 38 runs, 7 languages
• Systems that reported the use of logic obtained the best results