The Second International Workshop on Evaluating Information Access (EVIA-NTCIR 2008) Tokyo, 16 December 2008 Evaluating Answer Validation in multi-stream Question Answering Álvaro Rodrigo, Anselmo Peñas, Felisa Verdejo UNED NLP & IR group nlp.uned.es
Content • Context and motivation • Question Answering at CLEF • Answer Validation Exercise at CLEF • Evaluating the validation of answers • Evaluating the selection of answers • Correct selection • Correct rejection • Analysis and discussion • Conclusion
Evolution of Results 2003 - 2006 (Spanish) • Overall: best result <60% • Definitions: best result >80% (not an IR approach)
Pipeline Upper Bounds • Typical QA pipeline: Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer • Upper bound of the pipeline: 0.8 x 0.8 x 1.0 = 0.64 • Use Answer Validation to break the pipeline when there is not enough evidence
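As a quick illustration of where the 0.64 upper bound comes from, here is a minimal sketch; the slide only gives the three factors, so mapping them to concrete stages is an assumption made for the example.

```python
from math import prod

# Illustrative per-stage accuracies; the slide only states the factors 0.8 x 0.8 x 1.0.
stage_accuracy = {
    "question analysis": 0.8,
    "passage retrieval": 0.8,
    "answer extraction & ranking": 1.0,
}

# In a strict pipeline every stage can only lose questions, so end-to-end
# accuracy is bounded by the product of the stage accuracies.
upper_bound = prod(stage_accuracy.values())
print(f"pipeline upper bound: {upper_bound:.2f}")  # 0.64
```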
Results in CLEF-QA 2006 (Spanish) • Best system: 52.5% • Perfect combination of systems: 81% • Different systems were best with ORGANIZATION, PERSON and TIME questions
Evaluation Framework • Multi-stream architecture: several QA systems (QA sys1 … QA sysn) give candidate answers to a question, and an Answer Validation & Selection module returns the final answer • Collaborative architectures: different systems answer different types of questions better • Specialisation • Collaboration
Collaborative architectures • How to select the correct answer? • Redundancy • Voting • Confidence score • Performance history • Why not a deeper analysis?
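To make the contrast concrete, here is a minimal sketch (not the authors' code) of two of the shallow combination strategies listed above, redundancy/voting and confidence scores, applied to hypothetical multi-stream candidates.

```python
from collections import Counter

# Hypothetical multi-stream candidates: (stream, answer, confidence score)
candidates = [
    ("QA sys1", "an Italian producer of home appliances", 0.7),
    ("QA sys2", "an Italian producer of home appliances", 0.4),
    ("QA sys3", "a Polish film director", 0.9),
]

def select_by_voting(cands):
    """Redundancy/voting: pick the answer proposed by the most streams."""
    votes = Counter(ans for _stream, ans, _conf in cands)
    return votes.most_common(1)[0][0]

def select_by_confidence(cands):
    """Pick the answer whose stream reported the highest confidence."""
    return max(cands, key=lambda c: c[2])[1]

print(select_by_voting(candidates))      # favours the repeated answer
print(select_by_confidence(candidates))  # favours the most confident stream
```

The slide's point is that such criteria ignore the supporting text: an answer validation step can perform a deeper analysis of each candidate.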
Answer Validation Exercise (AVE) • Objective: validate the correctness of the answers given by real QA systems, the participants at CLEF QA
Answer Validation Exercise (AVE) • A QA system takes a question and returns a candidate answer with a supporting text • The question and the candidate answer are turned into a hypothesis by automatic hypothesis generation • Textual entailment between the supporting text and the hypothesis decides whether the answer is correct, or not correct / not enough evidence • AVE 2006: participants received hypothesis-text pairs • AVE 2007 - 2008: participants received the question, the candidate answer and the supporting text
Techniques in AVE 2007 (table from the AVE 2007 Overview)
Evaluation linked to the main QA task, reusing human assessments • Questions, systems' answers and systems' supporting texts come from the Question Answering Track • AVE participants return a validation decision (YES, NO) for each answer • The human judgements of the QA Track (R, W, X, U) are mapped to (YES, NO) and used to evaluate the AVE systems • The same judgements yield the QA Track results and the AVE Track results
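A small sketch of the judgement mapping mentioned above; the exact mapping is an assumption (only answers judged Right count as correct; Wrong, ineXact and Unsupported map to NO), so consult the AVE overview papers for the official one.

```python
# Assumed mapping from CLEF QA human judgements to AVE gold validation labels.
JUDGEMENT_TO_GOLD = {
    "R": "YES",  # Right
    "W": "NO",   # Wrong
    "X": "NO",   # ineXact
    "U": "NO",   # Unsupported
}

def gold_label(judgement: str) -> str:
    """Map one QA Track judgement to the AVE gold label."""
    return JUDGEMENT_TO_GOLD[judgement]
```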
Content • Context and motivation • Evaluating the validation of answers • Evaluating the selection of answers • Analysis and discussion • Conclusion
Proposed evaluation of Answer Validation & Selection • The candidate answers for each question come from the participant systems in a CLEF QA campaign (QA sys1 … QA sysn) • The Answer Validation & Selection module under evaluation selects the final answer
Collections
<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_2" value="">
    <a_str> who had also been in Cassibile since August 31 </a_str>
    <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
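A minimal sketch of reading such a collection, assuming the <q>/<a> element structure shown above and a hypothetical file name (this is not an official AVE script).

```python
import xml.etree.ElementTree as ET

# Hypothetical file name; element names follow the sample collection above.
root = ET.parse("ave_collection_en.xml").getroot()

for q in root.iter("q"):
    question = q.findtext("q_str", "").strip()
    for a in q.iter("a"):
        answer = a.findtext("a_str", "").strip()
        support = a.findtext("t_str", "").strip()
        print(f'{q.get("id")}/{a.get("id")}: "{question}" -> "{answer}"')
        # `support` holds the snippet that should entail the hypothesis.
```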
Evaluating the Validation • Validation: decide if each candidate answer is correct or not (YES | NO) • Collections are not balanced • Approach: detect whether there is enough evidence to accept an answer • Measures: precision, recall and F over correct answers • Baseline system: accept all answers
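A minimal sketch of those measures (variable names are mine): precision, recall and F computed over the correct (gold YES) answers, plus the accept-all baseline from the slide.

```python
def validation_scores(gold, pred):
    """Precision, recall and F1 over the answers whose gold label is YES."""
    tp = sum(g == "YES" and p == "YES" for g, p in zip(gold, pred))
    predicted_yes = pred.count("YES")
    gold_yes = gold.count("YES")
    precision = tp / predicted_yes if predicted_yes else 0.0
    recall = tp / gold_yes if gold_yes else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy unbalanced collection and the accept-all baseline.
gold = ["YES", "NO", "NO", "YES", "NO", "NO"]
baseline = ["YES"] * len(gold)
print(validation_scores(gold, baseline))  # recall 1.0, low precision
```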
Evaluating the Selection • Quantify the potential gain of Answer Validation in Question Answering • Compare AV systems with QA systems • Develop measures more comparable to QA accuracy
Evaluating the Selection • Given a question with several candidate answers, there are two options: • Selection: select an answer ≡ try to answer the question • Correct selection: the answer was correct • Incorrect selection: the answer was incorrect • Rejection: reject all candidate answers ≡ leave the question unanswered • Correct rejection: all candidate answers were incorrect • Incorrect rejection: not all candidate answers were incorrect
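The four outcomes above can be tracked per question; here is a small bookkeeping sketch (the data layout is assumed for illustration, not taken from AVE).

```python
def selection_outcome(selected_index, candidate_is_correct):
    """Classify one question given the AV module's decision.

    selected_index: index of the chosen candidate answer, or None for rejection.
    candidate_is_correct: list of booleans, one per candidate answer.
    """
    if selected_index is None:
        # Rejection is correct only if every candidate answer was incorrect.
        return "correct rejection" if not any(candidate_is_correct) else "incorrect rejection"
    return "correct selection" if candidate_is_correct[selected_index] else "incorrect selection"

print(selection_outcome(0, [True, False]))      # correct selection
print(selection_outcome(None, [False, False]))  # correct rejection
```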
Evaluating the Selection • First measure (formula shown on the slide): not comparable to qa_accuracy
Evaluating the Selection • Second measure (formula shown on the slide): rewards rejection (collections are not balanced) • Interpretation for QA: all questions correctly rejected by the AV system will be answered correctly
Evaluating the Selection • Third measure, estimated performance (formula shown on the slide) • Interpretation for QA: questions correctly rejected by the AV system will be answered correctly in qa_accuracy proportion
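The formulas themselves appear only as images in the original slides, so they are not reproduced verbatim here. Based on the interpretations stated above, with n questions, n_CS correct selections and n_CR correct rejections, the last two measures can be reconstructed roughly as follows; treat this as a tentative reading, not a copy of the slides, and note that the first slide's formula is left unreconstructed.

```latex
% Tentative reconstruction from the interpretations stated on the slides.
% n: number of questions, n_{CS}: correct selections, n_{CR}: correct rejections.
\[
  \mathit{qa\_accuracy} = \frac{n_{CS}}{n}
\]
% "All questions correctly rejected will be answered correctly":
\[
  \frac{n_{CS} + n_{CR}}{n}
\]
% "Questions correctly rejected will be answered correctly in qa_accuracy proportion":
\[
  \mathit{estimated\_qa\_performance} = \frac{n_{CS} + n_{CR}\cdot \mathit{qa\_accuracy}}{n}
\]
```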
Content • Context and motivation • Evaluating the validation of answers • Evaluating the selection of answers • Analysis and discussion • Conclusion
Analysis and discussion (AVE 2007 English) • Validation and Selection results (tables on the slide) • qa_accuracy is correlated to R • The "estimated" measure adjusts it
Analysis and discussion (AVE 2007 Spanish) • Validation and Selection results (tables on the slide) • Comparing AV & QA systems
Conclusion • Evaluation framework for Answer Validation & Selection systems • Measures that reward not only Correct Selection but also Correct Rejection • Promote the improvement of QA systems • Allow comparison between AV and QA systems • Show in what conditions multi-stream QA performs better • Room for improvement just by using multi-stream QA • Potential gain that AV systems can provide to QA
Thanks! http://nlp.uned.es/clef-qa/ave http://www.clef-campaign.org Acknowledgement: EU project T-CLEF (ICT-1-4-1 215231)