Presentation Transcript


  1. Evaluating Answers to Definition Questions (HLT-NAACL 2003) & Overview of the TREC 2003 Question Answering Track (TREC 2003). Ellen Voorhees, NIST

  2. QA Tracks at NIST • Pilot evaluation in the ARDA AQUAINT program (fall 2002) • The purpose of each pilot is to develop an effective evaluation methodology for systems that answer a certain kind of question. • The HLT-NAACL 2003 paper describes the Definition Pilot.

  3. QA Tracks at NIST (Cont.) • TREC 2003 QA Track (August 2003) • Passage task • Systems returned a single text snippet in response to each factoid question. • Main task • Contains factoid, list, and definition questions. • The final score is a combination of the scores for the separate question types.

  4. Definition Questions • Ask for the definition or explanation of a term, or an introduction to a person or organization • "What is mold?" and "Who is Colin Powell?" • Answers are longer passages of text • Answers vary widely, so system performance is not easy to evaluate • Precision? Recall? Exactness?

  5. Example Responses to a Definition Question • "Who is Christopher Reeve?" System responses: • Actor • the actor who was paralyzed when he fell off his horse • the name attraction • stars on Sunday in ABC's remake of "Rear Window" • was injured in a show jumping accident and has become a spokesman for the cause

  6. First Round of Def Pilot • 8 runs (ABCDEFGH); multiple answers allowed for each question in one run; no length limit • Two assessors (the author of the questions and one other person) • Two kinds of scores (0-10 points) • Content score: higher for more useful and less misleading information • Organization score: higher if useful information appears earlier • The final score combines the two, with more emphasis on the content score.

  7. Result of 1st Round of Def Pilot • Ranking of runs: • Author: FADEBGCH • Other: FAEGDBHC • Scores varied across assessors. • The assessors interpreted the "organization score" differently. • But the organization score was strongly correlated with the content score. • A consistent relative ranking of the runs still emerged.

  8. Second Round of Def Pilot • Goal: develop a more quantitative evaluation of system responses • “Information nuggets”: pieces of (atomic) information about the target of the question • What assessors do: • Create a list of info nuggets • Decide which nuggets are vital (must appear in a good definition) • Mark which nuggets appear in a system response
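As a rough illustration (not NIST's actual assessment tooling), the nugget list and the per-response judgments described on slide 8 could be represented like this; the nuggets shown are hypothetical:

```python
# Rough sketch of an assessor's nugget list and a per-response judgment.
from dataclasses import dataclass

@dataclass
class Nugget:
    text: str    # an atomic piece of information about the target
    vital: bool  # True if it must appear in a good definition

# Hypothetical nugget list for "Who is Christopher Reeve?"
nuggets = [
    Nugget("actor", vital=True),
    Nugget("paralyzed after falling off his horse", vital=True),
    Nugget("spokesman for spinal-cord research funding", vital=False),
]

# For each system response, the assessor marks which nuggets it conveys;
# here that judgment is just a set of indices into the list above.
response_judgment = {0, 1}  # this response conveyed nuggets 0 and 1
```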

  9. Example of Assessment • Concept recall is quite straightforward: the ratio of concepts retrieved. • Precision is hard to define. (It is hard to divide text into concepts, so the denominator is unknown.) • Using recall alone to evaluate systems is untenable. (Returning entire documents would yield full recall.)

  10. Approximation to Precision • Borrowed from DUC (Harman and Over, 2002) • An allowance of 100 (non-space) characters for each nugget retrieved • A response is penalized only if its length exceeds the allowance: • precision = 1 - (length - allowance) / length • In the previous example, allowance = 4 * 100 = 400 and length = 175, so precision = 1.
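A minimal sketch of this length-based precision, assuming the 100 non-space characters of allowance per retrieved nugget stated above:

```python
def length_precision(nuggets_returned: int, length: int,
                     allowance_per_nugget: int = 100) -> float:
    """Length-based approximation to precision: each retrieved nugget earns
    an allowance of 100 non-space characters; a response within its total
    allowance gets precision 1, a longer one is penalized."""
    allowance = allowance_per_nugget * nuggets_returned
    if length <= allowance:
        return 1.0
    return 1.0 - (length - allowance) / length

# The example above: 4 nuggets retrieved, 175 non-space characters.
print(length_precision(4, 175))  # -> 1.0, since 175 <= 400
```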

  11. Final Score • Recall is computed only over vital nuggets (2/3 in the previous example). • Precision is computed over all nuggets. Let r be the number of vital nuggets returned in a response; a be the number of acceptable (non-vital but on the list) nuggets returned in a response; R be the total number of vital nuggets in the assessor's list; len be the number of non-white-space characters in an answer string, summed over all answer strings in the response. Then recall = r / R, allowance = 100 * (r + a), precision = 1 if len < allowance and 1 - (len - allowance) / len otherwise, and the final score is F(β) = (β² + 1) * precision * recall / (β² * precision + recall).
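Putting the pieces together, a sketch of the per-question definition score, assuming the standard β-weighted F-measure form given above; the example numbers reuse the 2-of-3 vital nuggets and the 175-character, 4-nugget response from the earlier slides:

```python
def definition_f(r: int, a: int, R: int, length: int, beta: float = 5.0) -> float:
    """Per-question definition score: nugget recall over vital nuggets,
    length-based precision over all returned nuggets, combined with a
    beta-weighted F-measure (beta = 5 emphasizes recall)."""
    recall = r / R
    allowance = 100 * (r + a)
    if length < allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length
    if precision + recall == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

# 2 of 3 vital nuggets, 2 acceptable nuggets, 175 non-space characters.
print(definition_f(r=2, a=2, R=3, length=175))  # -> about 0.68
```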

  12. Result of 2nd Round of Def Pilot • F-measure • Different β values result in different F-measure rankings. • β = 5 approximates the ranking of the first round.

      author      other       length
      F 0.688     F 0.757     F  935.6   more verbose
      A 0.606     A 0.687     A 1121.2   more verbose
      D 0.568     G 0.671     D  281.8
      G 0.562     D 0.669     G  164.5   relatively terse
      E 0.555     E 0.657     E  533.9
      B 0.467     B 0.522     B 1236.5   complete sentences
      C 0.349     C 0.384     C   84.7
      H 0.330     H 0.365     H   33.7   single snippet

      Rankings are stable!

  13. Def Task in TREC QA • 50 questions • 30 about people (e.g. Andrea Bocelli, Ben Hur) • 10 about organizations (e.g. Friends of the Earth) • 10 about other things (e.g. TB, feng shui) • Scenario • The questioner is an adult, a native speaker of English, and an "average" reader of US newspapers. In reading an article, the user has come across a term that they would like to find out more about. They may have some basic idea of what the term means, either from the context of the article (for example, a bandicoot must be a type of animal) or from basic background knowledge (Ulysses S. Grant was a US president). They are not experts in the domain of the target, and therefore are not seeking esoteric details (e.g., not a zoologist looking to distinguish the different species in genus Perameles).

  14. Result of Def Task, QA Track

  15. Analysis of TREC QA Track • Fidelity: the extent to which the evaluation measures what it is intended to measure. • In TREC: the extent to which the abstraction captures (some of) the issues of the real task. • Reliability: the extent to which an evaluation result can be trusted. • In TREC: the extent to which the evaluation ranks a better system ahead of a worse system.

  16. Definition Task Fidelity • It is unclear whether the average user really prefers recall as strongly as β = 5 implies. • It also seems that longer responses receive higher scores. • To determine how selective a system needs to be: • Baseline: return every sentence in the corpus that contains the target • Smarter baseline (BBN): like the baseline, but keeping the overlap between returned sentences small
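A minimal sketch of the first baseline (return every corpus sentence that mentions the target); the corpus format and the naive sentence splitter are assumptions for illustration:

```python
import re

def baseline_response(target: str, documents: list[str]) -> str:
    """Selectivity baseline: return every corpus sentence that mentions
    the target string (naive sentence splitting, for illustration only)."""
    kept = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            if target.lower() in sentence.lower():
                kept.append(sentence.strip())
    return " ".join(kept)

docs = [
    "Christopher Reeve is an actor. He was paralyzed in a riding accident.",
    "The weather in Washington was mild on Tuesday.",
]
print(baseline_response("Christopher Reeve", docs))
```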

  17. Definition Task Fidelity (Cont.) • No firm conclusion about the β value can be drawn. • At least β = 5 matches the user need expressed in the pilot.

  18. Definition Task Reliability • Sources of noise or error: • Human mistakes in judgment • Different opinions from different assessors • The question set • Evaluating the effect of different opinions: • Two assessors create two different nugget sets. • Runs are scored using both nugget lists. • The stability of the resulting rankings is measured by Kendall's τ. • The τ score is 0.848 (rankings are considered stable if τ > 0.9) • Not good enough
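Kendall's τ can be computed directly from two rankings; the sketch below (assuming no tied ranks) is applied to the first-round rankings from slide 7 purely for illustration, not to the second-round data behind the 0.848 figure:

```python
from itertools import combinations

def kendall_tau(ranking_a: list, ranking_b: list) -> float:
    """Kendall's tau between two rankings of the same items (no ties)."""
    pos_a = {run: i for i, run in enumerate(ranking_a)}
    pos_b = {run: i for i, run in enumerate(ranking_b)}
    concordant = discordant = 0
    for x, y in combinations(ranking_a, 2):
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) / 2
    return (concordant - discordant) / n_pairs

# First-round rankings by the two assessors (slide 7).
print(kendall_tau(list("FADEBGCH"), list("FAEGDBHC")))
```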

  19. Example of Different Nugget Lists • "What is a golden parachute?"

      Assessor 1:
      1 vital  Agreement between companies and top executives
      2 vital  Provides remuneration to executives who lose jobs
      3 vital  Remuneration is usually very generous
      4        Encourages execs not to resist takeover beneficial to shareholders
      5        Incentive for execs to join companies
      6        Arrangement for which IRS can impose excise tax

      Assessor 2:
      1 vital  provides remuneration to executives who lose jobs
      2 vital  assures officials of rich compensation if lose job due to takeover
      3 vital  contract agreement between companies and their top executives
      4        aids in hiring and retention
      5        encourages officials not to resist a merger
      6        IRS can impose taxes

  20. Definition Task Reliability (Cont.) • If a system is evaluated on two large question sets of the same size, its F-measure scores should be similar. • Simulation of such an evaluation: • Randomly create two question sets of the required size • Define the error rate as the percentage of rank swaps • Group run comparisons by the difference in F(β=5)
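A rough sketch of this simulation, with made-up per-question scores and omitting the grouping by score difference for brevity:

```python
import random

def rank_swap_error_rate(per_question_f: dict, set_size: int, trials: int = 1000) -> float:
    """Fraction of run pairs whose relative order differs between two
    randomly drawn, equal-sized question sets."""
    runs = list(per_question_f)
    n_questions = len(next(iter(per_question_f.values())))
    swaps = comparisons = 0
    for _ in range(trials):
        sample = random.sample(range(n_questions), 2 * set_size)
        set1, set2 = sample[:set_size], sample[set_size:]

        def mean_f(run, qset):
            return sum(per_question_f[run][q] for q in qset) / len(qset)

        for i, r1 in enumerate(runs):
            for r2 in runs[i + 1:]:
                d1 = mean_f(r1, set1) - mean_f(r2, set1)
                d2 = mean_f(r1, set2) - mean_f(r2, set2)
                comparisons += 1
                if d1 * d2 < 0:  # the two sets order this pair differently
                    swaps += 1
    return swaps / comparisons

# Made-up per-question F(beta=5) scores for eight runs over 50 questions.
fake_scores = {run: [random.random() for _ in range(50)] for run in "ABCDEFGH"}
print(rank_swap_error_rate(fake_scores, set_size=25))
```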

  21. Definition Task Reliability (Cont.) • Most errors (rank swaps) occur in the groups with small score differences. • A difference greater than 0.123 is required to have confidence in F(β=5). • More questions are needed in the test set to increase sensitivity while remaining equally confident in the result.

  22. List Task • List questions have multiple possible answers • "List the names of chewing gums" • No target number of answers is specified. • The final answer list for a question is the collection of correct answers found in the corpus. • Instance precision (IP) and instance recall (IR) • F = 2 * IP * IR / (IP + IR)
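The list-task score in code form; the answer sets below are illustrative only:

```python
def list_task_f(system_answers: set, final_answer_list: set) -> float:
    """Instance precision, instance recall, and their F combination."""
    correct = len(system_answers & final_answer_list)
    if correct == 0:
        return 0.0
    ip = correct / len(system_answers)      # instance precision
    ir = correct / len(final_answer_list)   # instance recall
    return 2 * ip * ir / (ip + ir)

# Hypothetical: the assessor's final list has four gums, the system found two.
gold = {"Orbit", "Trident", "Dentyne", "Chiclets"}
print(list_task_f({"Orbit", "Trident"}, gold))  # IP = 1.0, IR = 0.5, F = 0.667
```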

  23. Example of Final Answer List • 1915: List the names of chewing gums. • Stimorol, Orbit, Winterfresh, Double Bubble, Dirol, Trident, Spearmint, Bazooka, Doublemint, Dentyne, Freedent, Hubba Bubba, Juicy Fruit, Big Red, Chiclets, Nicorette

  24. Other Tasks • Passage task: • Return a short (<250 character) span of text containing an answer. • Passages must be extracted from a single document. • Factoid task: • Exact answers are required. • The passage task is evaluated separately. • The final score of the main task is FinalScore = 1/2 * FactoidScore + 1/4 * ListScore + 1/4 * DefScore
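The main-task combination as a small sketch, with hypothetical component scores:

```python
def main_task_score(factoid: float, list_score: float, definition: float) -> float:
    """Weighted combination of the three per-type scores."""
    return 0.5 * factoid + 0.25 * list_score + 0.25 * definition

# Hypothetical component scores.
print(main_task_score(0.70, 0.30, 0.50))  # -> 0.55
```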
