The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008) Tokyo, 16 December 2008 Evaluating Answer Validation in multi-stream Question Answering Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo UNED NLP & IR group nlp.uned.es
Content • Context and motivation • Question Answering at CLEF • Answer Validation Exercise at CLEF • Evaluating the validation of answers • Evaluating the selection of answers • Correct selection • Correct rejection • Analysis and discussion • Conclusion
Evolution of Results 2003 - 2006 (Spanish)
• Overall: best result <60%
• Definition questions: best result >80% (not an IR approach)
Pipeline Upper Bounds
Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer
0.8 x 0.8 x 1.0 = 0.64
Use Answer Validation to break the pipeline when there is not enough evidence
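The arithmetic on the slide is the usual pipeline bound: errors multiply along the chain, so end-to-end accuracy can be no better than the product of the per-module accuracies. With the example figures shown:

\[
\mathrm{acc}_{\mathrm{pipeline}} \le \prod_i \mathrm{acc}_i = 0.8 \times 0.8 \times 1.0 = 0.64
\]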
Results in CLEF-QA 2006 (Spanish)
• Best system: 52.5%
• Different systems were best with ORGANIZATION, PERSON and TIME questions
• Perfect combination of systems: 81%
Evaluation Framework
Question → QA sys1 … QA sysn → Candidate answers → Answer Validation & Selection → Answer
Collaborative architectures: different systems answer different types of questions better
• Specialisation
• Collaboration
Collaborative architectures
How to select the right answer?
• Redundancy
• Voting
• Confidence score
• Performance history
Why not a deeper analysis? (a baseline selection sketch follows below)
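A minimal sketch of the shallow selection strategies named above (redundancy/voting with a confidence-score fallback); the function and field names are illustrative, not from any AVE system:

    from collections import Counter

    def select_answer(candidates):
        # candidates: list of dicts such as
        #   {"answer": "Rome", "confidence": 0.7, "system": "qa_sys1"}
        # Returns the chosen answer string, or None to leave the question unanswered.
        if not candidates:
            return None
        # Redundancy / voting: prefer the answer proposed by the most streams.
        votes = Counter(c["answer"] for c in candidates)
        answer, count = votes.most_common(1)[0]
        if count > 1:
            return answer
        # No agreement: fall back to each stream's own confidence score.
        return max(candidates, key=lambda c: c["confidence"])["answer"]

None of these strategies look inside the supporting text, which is exactly the limitation the deeper, entailment-based analysis of AVE addresses.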
Answer Validation Exercise (AVE)
Objective: validate the correctness of the answers given by real QA systems, i.e. the participants at CLEF QA
Answer Validation Exercise (AVE)
Question + Candidate answer (from Question Answering) → Automatic Hypothesis Generation → Hypothesis
Hypothesis + Supporting Text → Textual Entailment → Answer is correct / Answer is not correct or not enough evidence
AVE 2006: Textual Entailment pairs; AVE 2007 - 2008: question, candidate answer and supporting text
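A sketch of the validation decision as drawn above, assuming a placeholder textual-entailment component passed in as a function; the hypothesis-generation rule is deliberately naive and only illustrative:

    def build_hypothesis(question, answer):
        # Naive automatic hypothesis generation: turn question + answer into a
        # declarative statement, e.g. "What is Zanussi?" + "an Italian producer
        # of home appliances" -> "Zanussi is an Italian producer of home appliances".
        q = question.strip().rstrip("?")
        if q.lower().startswith("what is "):
            return q[len("what is "):] + " is " + answer.strip()
        return q + " " + answer.strip()

    def validate(question, answer, supporting_text, entails):
        # entails(text, hypothesis) stands in for any textual-entailment module.
        hypothesis = build_hypothesis(question, answer)
        if entails(supporting_text, hypothesis):
            return "answer is correct"
        return "answer is not correct or not enough evidence"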
Techniques in AVE 2007 (figure from the AVE 2007 Overview)
Questions → Question Answering Track → Systems' answers + Systems' Supporting Texts → Answer Validation Exercise → Systems' Validation (YES, NO)
Human Judgements (R, W, X, U) → Mapping to (YES, NO) → Evaluation
QA Track results | AVE Track results
• Evaluation linked to the main QA task
• Reuse of human assessments
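The mapping step can be as small as a lookup table. A sketch under the assumption that only answers judged R (Right) map to YES, while W (Wrong), X (ineXact) and U (Unsupported) map to NO; the slide does not spell out how X and U are actually treated:

    # CLEF QA human judgements: R = Right, W = Wrong, X = ineXact, U = Unsupported.
    JUDGEMENT_TO_GOLD = {"R": "YES", "W": "NO", "X": "NO", "U": "NO"}  # assumed mapping

    def gold_label(judgement):
        return JUDGEMENT_TO_GOLD[judgement]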
Content • Context and motivation • Evaluating the validation of answers • Evaluating the selection of answers • Analysis and discussion • Conclusion
Evaluation proposed
Question → QA sys1 … QA sysn (participant systems in a CLEF QA campaign) → Candidate answers → Answer Validation & Selection → Answer
Evaluation of Answer Validation & Selection
Collections
<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
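A minimal sketch of reading such a collection with the Python standard library; anything about the file layout beyond the <q>/<a> elements shown above is an assumption:

    import xml.etree.ElementTree as ET

    def load_collection(path):
        # Yield (question_id, question, answer_id, answer, supporting_text) tuples.
        root = ET.parse(path).getroot()
        for q in root.iter("q"):
            q_text = q.findtext("q_str", default="").strip()
            for a in q.findall("a"):
                yield (q.get("id"), q_text, a.get("id"),
                       a.findtext("a_str", default="").strip(),
                       a.findtext("t_str", default="").strip())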
Evaluating the Validation
Validation: decide whether each candidate answer is correct or not
• YES | NO
• Collections are not balanced
• Approach: detect whether there is enough evidence to accept an answer
• Measures: precision, recall and F over correct answers (spelled out below)
• Baseline system: accept all answers
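Read literally from the bullet "precision, recall and F over correct answers", and taking F to be the balanced F-measure, the validation measures are:

\[
P = \frac{|\text{answers validated (YES) and actually correct}|}{|\text{answers validated (YES)}|},\qquad
R = \frac{|\text{answers validated (YES) and actually correct}|}{|\text{answers actually correct}|},\qquad
F = \frac{2PR}{P+R}
\]

Computing them over the correct answers only is what makes the "accept all answers" baseline meaningful on collections that are mostly negative.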
Evaluating the Selection • Quantify the potential gain of Answer Validation in Question Answering • Compare AV systems with QA systems • Develop measures more comparable to QA accuracy
Evaluating the selection
Given a question with several candidate answers, there are two options:
• Selection
• Select an answer ≡ try to answer the question
• Correct selection: the answer was correct
• Incorrect selection: the answer was incorrect
• Rejection
• Reject all candidate answers ≡ leave the question unanswered
• Correct rejection: all candidate answers were incorrect
• Incorrect rejection: not all candidate answers were incorrect
(a small sketch of these outcomes follows below)
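A small sketch of the four outcomes defined above; the function and argument names are illustrative:

    def selection_outcome(decision, selected_is_correct, any_candidate_correct):
        # decision: "SELECTED" (an answer was chosen) or "REJECTED" (all candidates rejected).
        # selected_is_correct: whether the chosen answer is correct (only meaningful if SELECTED).
        # any_candidate_correct: whether at least one candidate answer was correct.
        if decision == "SELECTED":
            return "correct_selection" if selected_is_correct else "incorrect_selection"
        # Rejection is correct only if every candidate really was incorrect.
        return "correct_rejection" if not any_candidate_correct else "incorrect_rejection"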
Evaluating the Selection
Not comparable to qa_accuracy
Evaluating the Selection
Rewards rejection (the collections are not balanced)
Interpretation for QA: all questions correctly rejected by AV will be answered correctly
Evaluating the Selection
Interpretation for QA: questions correctly rejected by AV will be answered correctly in the qa_accuracy proportion
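The slides give the interpretations without the formulas themselves; a hedged reconstruction from those interpretations, writing n for the number of questions, n_CS for correct selections, n_CR for correct rejections and qa_accuracy = n_CS / n (the symbol and measure names are mine, not necessarily the paper's):

\[
\frac{n_{CS}+n_{CR}}{n} \quad\text{(every correct rejection counted as a correct answer)}
\]
\[
\frac{n_{CS}}{n} + \frac{n_{CR}}{n}\cdot\frac{n_{CS}}{n} \quad\text{(correct rejections counted at the qa\_accuracy rate)}
\]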
Content • Context and motivation • Evaluating the validation of answers • Evaluating the selection of answers • Analysis and discussion • Conclusion
Analysis and discussion (AVE 2007 English)
Validation | Selection (results table on the slide)
• qa_accuracy is correlated to R
• The "estimated" measure adjusts it
Analysis and discussion (AVE 2007 Spanish)
Validation | Selection (results table on the slide)
• Comparing AV & QA systems
Conclusion
• Evaluation framework for Answer Validation & Selection systems
• Measures that reward not only Correct Selection but also Correct Rejection
• Promote the improvement of QA systems
• Allow comparison between AV and QA systems
• Under what conditions multi-stream QA performs better
• Room for improvement just by using multi-stream QA
• Potential gain that AV systems can provide to QA
Thanks! http://nlp.uned.es/clef-qa/ave http://www.clef-campaign.org Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)