Answer Validation Exercise (AVE) - QA subtrack at the Cross-Language Evaluation Forum 2007. Thanks to the Main Task organizing committee, UNED (coord.): Anselmo Peñas, Álvaro Rodrigo, Valentín Sama, Felisa Verdejo
What? Answer Validation Exercise: validate the correctness of the answers given by the participants at CLEF QA 2007
AVE 2006: an RTE exercise • The QA system returns an exact answer plus a supporting snippet and doc ID • The question and the answer, turned into affirmative form, become the Hypothesis • The supporting snippet is the Text • If the Text semantically entails the Hypothesis, then the answer is expected to be correct
Answer Validation Exercise pipeline • Question Answering (treated as a black box): Question → Candidate answer + Supporting Text • Answer Validation: Question + Candidate answer → Automatic Hypothesis Generation → Hypothesis; Hypothesis + Supporting Text → Textual Entailment • Decision: the answer is correct, or the answer is not correct / there is not enough evidence • AVE 2006 evaluated the Textual Entailment step; AVE 2007 evaluates the whole Answer Validation module
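The hypothesis-generation step above can be sketched as follows. This is a minimal, pattern-based illustration (the function name `build_hypothesis` and the single "What is X?" pattern are assumptions for the example; participating systems used richer syntactic or semantic analysis):

```python
import re

def build_hypothesis(question: str, answer: str) -> str:
    """Combine a question and a candidate answer into an
    affirmative-form hypothesis, AVE-style.

    Illustrative sketch: handles only one question pattern and
    falls back to simple concatenation otherwise.
    """
    q = question.rstrip("?").strip()
    # Pattern "What is X?" + answer "Y" -> hypothesis "X is Y"
    m = re.match(r"(?i)what is (.+)", q)
    if m:
        return f"{m.group(1)} is {answer}"
    # Fallback: attach the answer to the question stem.
    return f"{q}: {answer}"

print(build_hypothesis("What is Zanussi?",
                       "an Italian producer of home appliances"))
# -> Zanussi is an Italian producer of home appliances
```

The resulting hypothesis is then paired with the supporting snippet (the Text) and passed to a textual-entailment component.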
Answer Validation Exercise • AVE 2006: it was not possible to quantify the potential gain that AV modules give to QA systems • Change in the AVE 2007 methodology: • Answers are grouped by question • Systems must validate all of them • But select only one
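The validate-all-but-select-one requirement can be sketched as a simple decision rule. The threshold value and the entailment scores are hypothetical; only the labelling scheme follows the AVE 2007 setup:

```python
def decide(answers):
    """Label a question group of (answer_id, score) pairs.

    `score` is assumed to come from an entailment module in [0, 1].
    All answers above a threshold are VALIDATED, the rest REJECTED,
    and exactly one validated answer (the best-scoring one) is
    promoted to SELECTED, per the AVE 2007 methodology.
    """
    THRESHOLD = 0.5  # illustrative value, not the official one
    labels = {aid: ("VALIDATED" if score >= THRESHOLD else "REJECTED")
              for aid, score in answers}
    validated = [(aid, s) for aid, s in answers if labels[aid] == "VALIDATED"]
    if validated:
        best = max(validated, key=lambda pair: pair[1])[0]
        labels[best] = "SELECTED"
    return labels
```

For example, `decide([("116_1", 0.9), ("116_2", 0.6), ("116_4", 0.1)])` selects `116_1`, validates `116_2`, and rejects `116_4`.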
AVE 2007 Collections (example entry, English):

<q id="116" lang="EN">
  <q_str>What is Zanussi?</q_str>
  <a id="116_1" value="">
    <a_str>was an Italian producer of home appliances</a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str>who had also been in Cassibile since August 31</a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str>3</a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
Collections • Remove duplicated answers inside the same question group • Discard NIL answers, void answers, and answers with too long a supporting snippet • This processing led to a reduction in the number of answers to be validated
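The filtering steps above can be sketched for a single question group. The 700-character snippet cap and the function name `clean_group` are illustrative assumptions, not the official thresholds:

```python
def clean_group(answers, max_snippet_chars=700):
    """Filter one question group of (answer_str, snippet) pairs.

    Drops NIL and void (empty) answers, answers whose supporting
    snippet is too long, and case-insensitive duplicate answers,
    mirroring the AVE 2007 collection preprocessing.
    """
    seen, kept = set(), []
    for ans, snippet in answers:
        a = ans.strip()
        if not a or a.upper() == "NIL":
            continue  # void or NIL answer
        if len(snippet) > max_snippet_chars:
            continue  # supporting snippet too long
        key = a.lower()
        if key in seen:
            continue  # duplicate within the question group
        seen.add(key)
        kept.append((ans, snippet))
    return kept
```

Each surviving pair is then written out as an `<a>` element inside its `<q>` group.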
Collections (# answers to validate): available for CLEF participants at nlp.uned.es/QA/ave/
Evaluation • The collections are not balanced • Approach: detect whether there is enough evidence to accept an answer • Measures: precision, recall, and F over ACCEPTED answers • Baseline system: accept all answers
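These measures can be computed with a few lines. A minimal sketch, assuming gold correctness labels per answer id (the function name and data layout are illustrative); note that the accept-all baseline gets recall 1.0, so its F score depends entirely on the proportion of correct answers:

```python
def prf_over_accepted(gold, accepted):
    """Precision, recall, and F1 over ACCEPTED answers.

    gold: dict answer_id -> True if the answer is correct.
    accepted: set of answer ids the validation system accepted.
    """
    tp = sum(1 for aid in accepted if gold.get(aid, False))
    precision = tp / len(accepted) if accepted else 0.0
    total_correct = sum(gold.values())
    recall = tp / total_correct if total_correct else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold = {"a1": True, "a2": False, "a3": True, "a4": False}
# Accept-all baseline: recall is 1.0, precision is the
# proportion of correct answers in the collection.
print(prf_over_accepted(gold, set(gold)))
```

With this toy gold set, the baseline scores precision 0.5, recall 1.0.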
Evaluation Precision, Recall and F measure over correct answers for English
Techniques reported at AVE 2007 • 10 reports; all describe an RTE-based approach
Conclusion • Evaluation in a real environment • Real system outputs -> AVE input • Developed methodologies • Build collections from QA responses • Evaluate in chain with a QA track • Compare results with QA systems • New testing collections for the QA and RTE communities • In 7 languages, not only English
Conclusion • 9 groups, 16 systems, 4 languages • All systems based on textual entailment • 5 out of 9 groups also participated in QA • Introduction of RTE techniques into QA • More NLP • More machine learning • Systems based on syntactic or semantic analysis perform automatic hypothesis generation • Combination of the question and the answer • In some cases, directly in logic form