Evaluating Question Answering Validation

Information Science Institute Marina del Rey, December 11, 2009. Evaluating Question Answering Validation. Anselmo Peñas (and Alvaro Rodrigo) NLP & IR group UNED nlp.uned.es. Old friends. Question Answering Nothing else than answering a question Natural Language Understanding


Presentation Transcript


  1. Information Science Institute Marina del Rey, December 11, 2009 Evaluating Question Answering Validation Anselmo Peñas (and Alvaro Rodrigo) NLP & IR group UNED nlp.uned.es

  2. Old friends Question Answering • Nothing else than answering a question Natural Language Understanding • Something there, if you are able to answer a question QA: extrinsic evaluation for NLU Suddenly… (See the track?) …The QA Track at TREC

  3. Question Answering at TREC • Object of evaluation itself • Redefined (roughly speaking) as: • Highly precision-oriented IR task • Where NLP was necessary • Especially for Answer Extraction

  4. What’s this story about?

  5. Outline • Motivation and goals • Definition and general framework • AVE 2006 • AVE 2007 & 2008 • QA 2009

  6. Outline (evaluation cycle) • 1. Analysis of current systems performance • 2. Mid term goals and strategy • 3. Evaluation Task definition • 4. Analysis of the evaluation cycle • Other diagram labels: Generation of methodology and evaluation resources; Task activation and development; Result analysis; Methodology analysis; Long cycle; Short cycle

  7. Systems performance 2003 - 2006 (Spanish) Overall Best result <60% Definitions Best result >80% NOT IR approach

  8. Pipeline Upper Bounds • Question → Question analysis → Passage Retrieval → Answer Extraction → Answer Ranking → Answer • 0.8 x 0.8 x 1.0 = 0.64 • Not enough evidence • SOMETHING to break the pipeline
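The arithmetic on this slide can be sketched as follows. The idea is that in a strict pipeline every stage must succeed, so the best achievable end-to-end accuracy is the product of the stage accuracies. The function name is hypothetical, and the slide does not say which stage each of the three numbers belongs to.

```python
from math import prod


def pipeline_upper_bound(stage_accuracies):
    """Upper bound on end-to-end accuracy when each stage must succeed in turn."""
    return prod(stage_accuracies)


# The three values come from the slide; their assignment to stages is not stated there.
bound = pipeline_upper_bound([0.8, 0.8, 1.0])
print(round(bound, 2))
```

Even with a perfect final stage, two stages at 0.8 cap the whole system well below either stage taken alone, which is the motivation for "something to break the pipeline".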

  9. Results in CLEF-QA 2006 (Spanish) • Best system 52.5% • Perfect combination 81% • Best with ORGANIZATION • Best with PERSON • Best with TIME

  10. Collaborative architectures Different systems respond better to different types of questions • Specialization • Collaboration Question → QA sys1, QA sys2, QA sys3, … QA sysn → Candidate answers → SOMETHING for combining / selecting → Answer

  11. Collaborative architectures How to select the good answer? • Redundancy • Voting • Confidence score • Performance history Why not deeper content analysis?
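As a point of comparison for the question the slide ends on, here is a minimal sketch of the redundancy / voting strategy. The function name and the lowercase-and-strip normalization are assumptions made for illustration; the slides do not describe any particular implementation.

```python
from collections import Counter


def select_by_voting(candidate_answers):
    """Pick the answer string returned most often across QA streams.

    Returns None when there are no usable candidates.
    Ties go to the answer that appeared first.
    """
    normalized = [a.strip().lower() for a in candidate_answers if a.strip()]
    if not normalized:
        return None
    answer, _count = Counter(normalized).most_common(1)[0]
    return answer


# Invented candidates from three hypothetical QA systems.
print(select_by_voting(["Paris", "paris ", "Lyon"]))
```

Note that this never looks at the supporting text: an answer wins because it is frequent, not because anything supports it. That gap is what answer validation is meant to fill.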

  12. Mid Term Goal • Goal: Improve QA systems performance • New mid term goal: Improve the devices for Rejecting / Accepting / Selecting Answers • The new task (2006): Validate the correctness of the answers given by real QA systems... ...the participants at CLEF QA

  13. Outline • Motivation and goals • Definition and general framework • AVE 2006 • AVE 2007 & 2008 • QA 2009

  14. Define Answer Validation • Decide whether an answer is correct or not • More precisely: • The Task: • Given • Question • Answer • Supporting Text • Decide if the answer is correct according to the supporting text • Let’s call it Answer Validation Exercise (AVE)

  15. Wish list • Test collection • Questions • Answers • Supporting Texts • Human assessments • Evaluation measures • Participants

  16. Questions Question Answering Track Answer Validation Exercise Systems’ answers (ACCEPT / REJECT) Systems’ Supporting Texts Human Judgements (R,W,X,U) Mapping Evaluation (ACCEPT / REJECT) QA Track results AVE Track results Evaluation linked to main QA task Reuse human assessments

  17. Answer Validation Exercise (AVE) Answer Validation Question Automatic Hypothesis Generation Hypothesis Textual Entailment Candidate answer Answer is correct Supporting Text Answer is not correct or not enough evidence AVE 2006 AVE 2007 - 2008

  18. Outline • Motivation and goals • Definition and general framework • AVE 2006 • Underlying architecture: pipeline • Evaluating the validation • As RTE exercise: pairs text-hypothesis • AVE 2007 & 2008 • QA 2009

  19. AVE 2006: An RTE exercise Question QA system Exact Answer Supporting snippet Hypothesis Text Entailment? If the text semantically entails the hypothesis, then the answer is expected to be correct. Is this true? Yes, 95% with current QA systems (J LOG COMP 2009)

  20. Collections AVE 2006 Available at: nlp.uned.es/clef-qa/ave/

  21. Evaluating the Validation Validation Decide if each candidate answer is correct or not • YES | NO • Not balanced collections • Approach: Detect if there is enough evidence to accept an answer • Measures: Precision, recall and F over correct answers • Baseline system: Accept all answers
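The measures named on this slide can be sketched as below: precision, recall and F computed over the correct (YES) answers only, with the accept-all baseline as a reference point. The function name and the toy judgements are invented for illustration.

```python
def precision_recall_f(gold, predicted):
    """Precision, recall and F1 over the positive class.

    gold:      list of bools, True if the answer was judged correct.
    predicted: list of bools, True if the validator accepted the answer.
    """
    true_pos = sum(1 for g, p in zip(gold, predicted) if g and p)
    accepted = sum(predicted)
    correct = sum(gold)
    precision = true_pos / accepted if accepted else 0.0
    recall = true_pos / correct if correct else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented, unbalanced toy collection: 2 correct answers out of 5.
gold = [True, False, False, False, True]
accept_all = [True] * len(gold)
print(precision_recall_f(gold, accept_all))
```

Because the collections are not balanced, the accept-all baseline gets perfect recall but a precision equal to the share of correct answers, so a validator has to beat that F value to show it detects evidence at all.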

  22. Evaluating the Validation

  23. Results AVE 2006

  24. Outline • Motivation and goals • Definition and general framework • AVE 2006 • AVE 2007 & 2008 • Underlying architecture: multi-stream • Quantify the potential benefit of AV in QA • Evaluating the correct selection of one answer • Evaluating the correct rejection of all answers • QA 2009

  25. AVE 2007 & 2008 Evaluation of Answer Validation & Selection Question → QA sys1, QA sys2, QA sys3, … QA sysn (participant systems in a CLEF – QA) → Candidate answers + Supporting Texts → Answer Validation & Selection → Answer

  26. Collections
      <q id="116" lang="EN">
        <q_str> What is Zanussi? </q_str>
        <a id="116_1" value="">
          <a_str> was an Italian producer of home appliances </a_str>
          <t_str doc="Zanussi">Zanussi For the Polish film director, see Krzysztof Zanussi. For the hot-air balloon, see Zanussi (balloon). Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
        </a>
        <a id="116_2" value="">
          <a_str> who had also been in Cassibile since August 31 </a_str>
          <t_str doc="en/p29/2998260.xml">Only after the signing had taken place was Giuseppe Castellano informed of the additional clauses that had been presented by general Ronald Campbell to another Italian general, Zanussi, who had also been in Cassibile since August 31.</t_str>
        </a>
        <a id="116_4" value="">
          <a_str> 3 </a_str>
          <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
        </a>
      </q>
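A collection entry in this format can be read with the Python standard library, as in the sketch below. The element and attribute names (`q`, `q_str`, `a`, `a_str`, `t_str`, `id`, `doc`) are taken from the slide; the sample is shortened, and the function name and returned dictionary layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example on the slide.
SAMPLE = """
<q id="116" lang="EN">
  <q_str> What is Zanussi? </q_str>
  <a id="116_1" value="">
    <a_str> was an Italian producer of home appliances </a_str>
    <t_str doc="Zanussi">Zanussi was an Italian producer of home appliances that in 1984 was bought</t_str>
  </a>
  <a id="116_4" value="">
    <a_str> 3 </a_str>
    <t_str doc="1618911.xml">(1985) 3 Out of 5 Live (1985) What Is This?</t_str>
  </a>
</q>
"""


def read_question(xml_text):
    """Return the question string and its candidate answers with supporting texts."""
    q = ET.fromstring(xml_text.strip())
    question = q.findtext("q_str").strip()
    candidates = []
    for a in q.findall("a"):
        support = a.find("t_str")
        candidates.append({
            "id": a.get("id"),
            "answer": a.findtext("a_str").strip(),
            "doc": support.get("doc"),
            "text": (support.text or "").strip(),
        })
    return question, candidates


question, candidates = read_question(SAMPLE)
print(question, [c["id"] for c in candidates])
```

Each candidate keeps its own supporting text, which is what lets a validator judge the answer against the evidence rather than against the other candidates.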

  27. Evaluating the Selection Goals • Quantify the potential gain of Answer Validation in Question Answering • Compare AV systems with QA systems • Develop measures more comparable to QA accuracy

  28. Evaluating the selection Given a question with several candidate answers Two options: • Selection • Select an answer ≡ try to answer the question • Correct selection: answer was correct • Incorrect selection: answer was incorrect • Rejection • Reject all candidate answers ≡ leave question unanswered • Correct rejection: All candidate answers were incorrect • Incorrect rejection: Not all candidate answers were incorrect
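The four outcomes defined on this slide can be expressed as a small decision function. This is only a sketch: the function name, the use of `None` to signal rejection, and the string labels are choices made here, not part of the AVE specification.

```python
def selection_outcome(candidate_is_correct, selected_index):
    """Classify one question's outcome.

    candidate_is_correct: list of bools, one per candidate answer (human judgement).
    selected_index:       index of the selected candidate, or None if all were rejected.
    """
    if selected_index is None:
        # Rejecting everything is right only when no candidate was correct.
        if any(candidate_is_correct):
            return "incorrect rejection"
        return "correct rejection"
    if candidate_is_correct[selected_index]:
        return "correct selection"
    return "incorrect selection"


print(selection_outcome([False, True, False], 1))
print(selection_outcome([False, False], None))
```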

  29. Evaluating the Selection

  30. Evaluating the Selection Rewards rejection (not balanced cols) Interpretation for QA: all questions correctly rejected by AV will be answered correctly

  31. Evaluating the Selection Interpretation for QA: questions correctly rejected have value as if they were answered correctly in qa_accuracy proportion
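The formulas on the "Evaluating the Selection" slides were images and did not survive in this transcript. The sketch below is one reading of the two interpretations given in the text: first counting every correct rejection as a correct answer, then counting it only in qa_accuracy proportion. The function names other than `qa_accuracy` are hypothetical, and the exact form of the original measures should be checked against the AVE task overviews.

```python
def qa_accuracy(n_correct_selected, n_questions):
    """Proportion of questions for which a correct answer was selected."""
    return n_correct_selected / n_questions


def estimated_performance_optimistic(n_correct_selected, n_correct_rejected, n_questions):
    """Assumes every correctly rejected question would later be answered correctly."""
    return (n_correct_selected + n_correct_rejected) / n_questions


def estimated_performance_weighted(n_correct_selected, n_correct_rejected, n_questions):
    """Correct rejections count only in qa_accuracy proportion."""
    acc = qa_accuracy(n_correct_selected, n_questions)
    return (n_correct_selected + n_correct_rejected * acc) / n_questions


# Invented counts: 10 questions, 4 correct selections, 2 correct rejections.
print(estimated_performance_optimistic(4, 2, 10))
print(estimated_performance_weighted(4, 2, 10))
```

The weighted variant rewards a system for rejecting correctly without letting it score well by rejecting everything, since with no correct selections the rejections contribute nothing.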

  32. Analysis and discussion (AVE 2007 Spanish) Validation Comparing AV & QA Selection

  33. Techniques in AVE 2007

  34. Conclusion of AVE Answer Validation before • It was assumed to be a QA module • But with no space for its own development The new devices should help to improve QA. They: • Introduce more content analysis • Use Machine Learning techniques • Are able to break pipelines or combine streams Let’s transfer them to the QA main task

  35. Outline • Motivation and goals • Definition and general framework • AVE 2006 • AVE 2007 & 2008 • QA 2009

  36. CLEF QA 2009 campaign ResPubliQA: QA on European Legislation GikiCLEF: QA requiring geographical reasoning on Wikipedia QAST: QA on Speech Transcriptions of European Parliament Plenary sessions

  37. CLEF QA 2009 campaign

  38. ResPubliQA 2009: QA on European Legislation • Organizers: Anselmo Peñas, Pamela Forner, Richard Sutcliffe, Álvaro Rodrigo, Corina Forascu, Iñaki Alegria, Danilo Giampiccolo, Nicolas Moreau, Petya Osenova • Advisory Board: Donna Harman, Maarten de Rijke, Dominique Laurent • Additional Assessors: Fernando Luis Costa, Anna Kampchen, Julia Kramme, Cosmina Croitoru

  39. Evolution of the task

  40. Collection • Subset of JRC-Acquis (10,700 docs x lang) • Parallel at document level • EU treaties, EU legislation, agreements and resolutions • Economy, health, law, food, … • Between 1950 and 2006

  41. 500 questions • REASON • Why did a commission expert conduct an inspection visit to Uruguay? • PURPOSE/OBJECTIVE • What is the overall objective of the eco-label? • PROCEDURE • How are stable conditions in the natural rubber trade achieved? • In general, any question that can be answered in a paragraph

  42. 500 questions • Also • FACTOID • In how many languages is the Official Journal of the Community published? • DEFINITION • What is meant by “whole milk”? • No NIL questions

  43. Systems response No Answer ≠ Wrong Answer • Decide if they answer or not • [ YES | NO ] • Classification Problem • Machine Learning, Provers, etc. • Textual Entailment • Provide the paragraph (ID+Text) that answers the question Aim To leave a question unanswered has more value than to give a wrong answer

  44. Assessments • R: The question is answered correctly • W: The question is answered incorrectly • NoA: The question is not answered • NoA R: NoA, but the candidate answer was correct • NoA W: NoA, and the candidate answer was incorrect • NoA Empty: NoA and no candidate answer was given Evaluation measure: c@1 • Extension of the traditional accuracy (as proportion of questions correctly answered) • Considering unanswered questions

  45. Evaluation measure n: Number of questions nR: Number of correctly answered questions nU: Number of unanswered questions

  46. Evaluation measure • If nU = 0 then c@1 = nR/n (Accuracy) • If nR = 0 then c@1 = 0 • If nU = n then c@1 = 0 • Leaving a question unanswered adds value only if it avoids returning a wrong answer • The added value is the performance shown on the answered questions: Accuracy
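The c@1 formula itself was on a slide image and is missing from this transcript. The sketch below uses the definition as commonly published for the measure, c@1 = (nR + nU · nR/n) / n, which agrees with the three boundary cases listed above. The function name and the input validation are additions made here.

```python
def c_at_1(n_right, n_unanswered, n_questions):
    """c@1: accuracy extended to give partial credit for unanswered questions.

    Each unanswered question is credited with the accuracy observed
    over the whole set (n_right / n_questions).
    """
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    if n_right + n_unanswered > n_questions:
        raise ValueError("n_right + n_unanswered cannot exceed n_questions")
    return (n_right + n_unanswered * n_right / n_questions) / n_questions


# Boundary cases from the slide, on an invented set of 10 questions.
print(c_at_1(5, 0, 10))   # nothing left unanswered: plain accuracy
print(c_at_1(0, 3, 10))   # no correct answers
print(c_at_1(0, 10, 10))  # everything left unanswered
```

A system that answers 4 of 10 correctly and withholds 2 wrong candidates scores higher than one that returns all 10 with the same 4 correct, which is the behaviour the task aims to reward.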

  47. List of Participants

  48. Value of reducing wrong answers

  49. Detecting wrong answers Maintaining the number of correct answers, the candidate answer was not correct for 83% of unanswered questions Very good step towards improving the system

  50. Many systems under the IR baselines IR important, not enough Achievable Task Perfect combination is 50% better than best system
