Answer Validation Exercise (AVE) - QA subtrack at the Cross-Language Evaluation Forum 2007. Thanks to the Main Task organizing committee, UNED (coord.): Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo
What? Answer Validation Exercise: validate the correctness of the answers given by the participants at CLEF QA 2007
AVE 2006: an RTE exercise • The QA system returns an exact answer plus a supporting snippet and doc ID • The question and the answer, turned into affirmative form, become the Hypothesis • The supporting snippet is the Text • If the Text semantically entails the Hypothesis, then the answer is expected to be correct
Answer Validation Exercise pipeline • Question Answering (treated as a black box): Question → Candidate answer + Supporting Text • Answer Validation: Question + Candidate answer → Automatic Hypothesis Generation → Hypothesis; Hypothesis + Supporting Text → Textual Entailment • Decision: the answer is correct, or the answer is not correct / there is not enough evidence • AVE 2006 evaluated the Textual Entailment step; AVE 2007 evaluates the whole Answer Validation module
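The hypothesis-generation step above can be sketched as follows. This is a minimal, pattern-based illustration (the function name `build_hypothesis` and the single "What is X?" pattern are assumptions for the example; participating systems used richer syntactic or semantic analysis):

```python
import re

def build_hypothesis(question: str, answer: str) -> str:
    """Combine a question and a candidate answer into an
    affirmative-form hypothesis, AVE-style.

    Illustrative sketch: handles only one question pattern and
    falls back to simple concatenation otherwise.
    """
    q = question.rstrip("?").strip()
    # Pattern "What is X?" + answer "Y" -> hypothesis "X is Y"
    m = re.match(r"(?i)what is (.+)", q)
    if m:
        return f"{m.group(1)} is {answer}"
    # Fallback: attach the answer to the question stem.
    return f"{q}: {answer}"

print(build_hypothesis("What is Zanussi?",
                       "an Italian producer of home appliances"))
# -> Zanussi is an Italian producer of home appliances
```

The resulting hypothesis is then paired with the supporting snippet (the Text) and passed to a textual-entailment component.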
Answer Validation Exercise • AVE 2006: it was not possible to quantify the potential gain that AV modules give to QA systems • Change in the AVE 2007 methodology: • Answers are grouped by question • Systems must validate all of them • But select only one
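The validate-all-but-select-one requirement can be sketched as a simple decision rule. The threshold value and the entailment scores are hypothetical; only the labelling scheme follows the AVE 2007 setup:

```python
def decide(answers):
    """Label a question group of (answer_id, score) pairs.

    `score` is assumed to come from an entailment module in [0, 1].
    All answers above a threshold are VALIDATED, the rest REJECTED,
    and exactly one validated answer (the best-scoring one) is
    promoted to SELECTED, per the AVE 2007 methodology.
    """
    THRESHOLD = 0.5  # illustrative value, not the official one
    labels = {aid: ("VALIDATED" if score >= THRESHOLD else "REJECTED")
              for aid, score in answers}
    validated = [(aid, s) for aid, s in answers if labels[aid] == "VALIDATED"]
    if validated:
        best = max(validated, key=lambda pair: pair[1])[0]
        labels[best] = "SELECTED"
    return labels
```

For example, `decide([("116_1", 0.9), ("116_2", 0.6), ("116_4", 0.1)])` selects `116_1`, validates `116_2`, and rejects `116_4`.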
AVE 2007 Collections (example entry, English):

<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str>who had also been in Cassibile since August 31</a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
Collections • Remove duplicated answers inside the same question group • Discard NIL answers, void answers, and answers with too long a supporting snippet • This processing led to a reduction in the number of answers to be validated
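The filtering steps above can be sketched for a single question group. The 700-character snippet cap and the function name `clean_group` are illustrative assumptions, not the official thresholds:

```python
def clean_group(answers, max_snippet_chars=700):
    """Filter one question group of (answer_str, snippet) pairs.

    Drops NIL and void (empty) answers, answers whose supporting
    snippet is too long, and case-insensitive duplicate answers,
    mirroring the AVE 2007 collection preprocessing.
    """
    seen, kept = set(), []
    for ans, snippet in answers:
        a = ans.strip()
        if not a or a.upper() == "NIL":
            continue  # void or NIL answer
        if len(snippet) > max_snippet_chars:
            continue  # supporting snippet too long
        key = a.lower()
        if key in seen:
            continue  # duplicate within the question group
        seen.add(key)
        kept.append((ans, snippet))
    return kept
```

Each surviving pair is then written out as an `<a>` element inside its `<q>` group.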
Collections (# answers to validate): available for CLEF participants at nlp.uned.es/QA/ave/
Evaluation • The collections are not balanced • Approach: detect whether there is enough evidence to accept an answer • Measures: precision, recall, and F over ACCEPTED answers • Baseline system: accept all answers
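These measures can be computed with a few lines. A minimal sketch, assuming gold correctness labels per answer id (the function name and data layout are illustrative); note that the accept-all baseline gets recall 1.0, so its F score depends entirely on the proportion of correct answers:

```python
def prf_over_accepted(gold, accepted):
    """Precision, recall, and F1 over ACCEPTED answers.

    gold: dict answer_id -> True if the answer is correct.
    accepted: set of answer ids the validation system accepted.
    """
    tp = sum(1 for aid in accepted if gold.get(aid, False))
    precision = tp / len(accepted) if accepted else 0.0
    total_correct = sum(gold.values())
    recall = tp / total_correct if total_correct else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = {"a1": True, "a2": False, "a3": True, "a4": False}
# Accept-all baseline: recall is 1.0, precision is the
# proportion of correct answers in the collection.
print(prf_over_accepted(gold, set(gold)))
```

With this toy gold set, the baseline scores precision 0.5, recall 1.0.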
Evaluation Precision, Recall and F measure over correct answers for English
Techniques reported at AVE 2007 • 10 reports; all describe an RTE-based approach
Conclusion • Evaluation in a real environment • Real system outputs -> AVE input • Developed methodologies • Build collections from QA responses • Evaluate in chain with a QA track • Compare results with QA systems • New testing collections for the QA and RTE communities • In 7 languages, not only English
Conclusion • 9 groups, 16 systems, 4 languages • All systems based on textual entailment • 5 out of 9 groups also participated in QA • Introduction of RTE techniques into QA • More NLP • More machine learning • Systems based on syntactic or semantic analysis perform automatic hypothesis generation • Combination of the question and the answer • In some cases, directly in logic form