Evaluating Answers to Definition Questions in HLT-NAACL 2003 & Overview of TREC 2003 Question Answering Track in TREC 2003 Ellen Voorhees NIST
QA Tracks in NIST • Pilot evaluation in ARDA AQUAINT program (fall, 2002) • The purpose of each pilot is to develop an effective evaluation methodology for systems that answer a certain kind of question. • The paper in HLT-NAACL 2003 is about the Definition Pilot.
QA Tracks in NIST (Cont.) • TREC 2003 QA Track (August, 2003) • Passage task • Systems returned a single text snippet in response to factoid questions. • Main task • The task contains factoid, list, and definition questions. • The final score is a combination of the scores for the separate question types.
Definition Questions • Asking for the definition or explanation of a term, or an introduction of a person or an organization • “What is mold?” and “Who is Colin Powell?” • Longer answer text • Many different answers are acceptable, so system performance is not easy to evaluate • Precision? Recall? Exactness?
Example of Responses to a Def Q • “Who is Christopher Reeve?” System responses: • Actor • the actor who was paralyzed when he fell off his horse • the name attraction • stars on Sunday in ABC’s remake of “Rear Window” • was injured in a show jumping accident and has become a spokesman for the cause
First Round of Def Pilot • 8 runs (ABCDEFGH); allowing multiple answers for each question in one run; no length limit • Two assessors (author of the questions and the other person) • Two kinds of scores (0-10 pt.) • Content score: higher if more useful and less misleading information • Organization score: higher if useful information appears earlier • Final score is the combination with more emphasis on content score.
Result of 1st Round of Def Pilot • Ranking of runs: • Author: FADEBGCH • Other: FAEGDBHC • Scores varied across assessors. • The assessors interpreted “organization score” differently. • But organization score was strongly correlated with content score. • Despite the differences, the scores still produced a broadly similar relative ranking of the runs.
Second Round of Def Pilot • Goal: develop a more quantitative evaluation of system responses • “Information nuggets”: pieces of (atomic) information about the target of the question • What assessors do: • Create a list of info nuggets • Decide which nuggets are vital (must appear in a good definition) • Mark which nuggets appear in a system response
Example of Assessment • Concept recall is quite straightforward: ratio of concepts retrieved. • Precision is hard to define. (Hard to divide text into concepts. Denominator is unknown.) • Using only recall to evaluate systems is untenable. (Entire documents get full recall.)
Approximation to Precision • Borrowed from DUC (Harman and Over, 2002) • An allowance of 100 (non-space) characters for each nugget retrieved • A penalty applies if the response is longer than the allowance • Precision=1 if length<=allowance; otherwise Precision=1-(length-allowance)/length • In the previous example, allowance=4*100, length=175, thus precision=1.
Final Score • Recall is computed only over vital nuggets. (2/3 in the previous example.) • Precision is computed over all nuggets.
Let r be the number of vital nuggets returned in a response; a be the number of acceptable (non-vital but in the list) nuggets returned in a response; R be the total number of vital nuggets in the assessor’s list; len be the number of non-white-space characters in an answer string, summed over all answer strings in the response. Then
recall = r/R
allowance = 100*(r+a)
precision = 1 if len < allowance; otherwise 1-(len-allowance)/len
F(β) = (β^2+1)*precision*recall / (β^2*precision+recall)
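The scoring described on the last two slides can be sketched in a few lines of Python. This is a minimal illustration, not NIST's scoring script: the function name `nugget_scores` and its parameter names are made up here, the F(β) form is the standard weighted F-measure, and R is assumed to be greater than zero.

```python
def nugget_scores(r, a, R, length, beta=5.0, per_nugget=100):
    """Return (recall, precision, F) for one definition response.

    r: vital nuggets returned, a: acceptable (non-vital) nuggets returned,
    R: vital nuggets in the assessor's list (assumed > 0),
    length: non-white-space characters over all answer strings.
    """
    recall = r / R
    allowance = per_nugget * (r + a)
    if length <= allowance:
        # No length penalty while the response fits in its allowance.
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length
    denom = beta ** 2 * precision + recall
    # Define F as 0 when both precision and recall are 0.
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return recall, precision, f


# The example from the slides: 4 nuggets matched (2 of the 3 vital ones),
# response length 175 characters.
recall, precision, f = nugget_scores(r=2, a=2, R=3, length=175)
print(round(recall, 3), precision, round(f, 3))  # 0.667 1.0 0.675
```

With β=5 the score stays close to recall: doubling the length penalty moves F far less than losing one vital nugget does.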
Result of 2nd Round of Def Pilot • F-measure • Different β values result in different F-measure rankings. • β=5 approximates the ranking of the first round.
author      other       length
F 0.688     F 0.757     F  935.6   more verbose
A 0.606     A 0.687     A 1121.2   more verbose
D 0.568     G 0.671     D  281.8
G 0.562     D 0.669     G  164.5   relatively terse
E 0.555     E 0.657     E  533.9
B 0.467     B 0.522     B 1236.5   complete sentence
C 0.349     C 0.384     C   84.7
H 0.330     H 0.365     H   33.7   single snippet
... Rankings are stable!
Def Task in TREC QA • 50 questions • 30 for persons (e.g. Andrea Bocelli, Ben Hur) • 10 for organizations (e.g. Friends of the Earth) • 10 for other things (e.g. TB, feng shui) • Scenario • The questioner is an adult, a native speaker of English, and an “average” reader of US newspapers. In reading an article, the user has come across a term that they would like to find out more about. They may have some basic idea of what the term means either from the context of the article (for example, a bandicoot must be a type of animal) or basic background knowledge (Ulysses S. Grant was a US president). They are not experts in the domain of the target, and therefore are not seeking esoteric details (e.g., not a zoologist looking to distinguish the different species in genus Perameles).
Analysis of TREC QA Track • Fidelity: the extent to which the evaluation measures what it is intended to measure. • TREC: the extent to which the abstraction captures (some of) the issues of the real task • Reliability: the extent to which an evaluation result can be trusted. • TREC: an evaluation ranks a better system ahead of a worse system
Definition Task Fidelity • It is unclear whether the average user strongly prefers recall (as β=5 assumes). • And it seems longer responses receive higher scores. • Determine how selective a system is by comparing against baselines • Baseline: returns all sentences in the corpus containing the target • Smarter baseline (BBN): as the baseline, but keeps only sentences with small overlap between them
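The two baselines can be sketched as follows. This is only an illustration of the idea on the slide: the function names, the word-overlap measure, and the 0.7 threshold are assumptions made for this example, not the details of the BBN run.

```python
import re


def sentence_baseline(sentences, target):
    """Return every sentence that mentions the target (case-insensitive)."""
    t = target.lower()
    return [s for s in sentences if t in s.lower()]


def low_overlap_baseline(sentences, target, max_overlap=0.7):
    """Same as sentence_baseline, but skip a sentence when too many of its
    words already appear in a single previously kept sentence.
    max_overlap is a hypothetical threshold chosen for illustration."""
    kept, kept_words = [], []
    for s in sentence_baseline(sentences, target):
        words = set(re.findall(r"\w+", s.lower()))
        if not words:
            continue
        if all(len(words & prev) / len(words) < max_overlap for prev in kept_words):
            kept.append(s)
            kept_words.append(words)
    return kept


corpus = [
    "Mold is a fungus.",
    "Mold is a fungus!",
    "Bread can grow mold in damp air.",
    "Cheese is tasty.",
]
print(sentence_baseline(corpus, "mold"))
print(low_overlap_baseline(corpus, "mold"))
```

The plain baseline maximizes recall at the cost of length; the overlap filter shortens the response without discarding distinct sentences, which is why it is a useful reference point for a recall-heavy measure.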
Definition Task Fidelity (Cont.) • No conclusion about the right β value can be drawn. • At least β=5 matches the user need in the pilot.
Definition Task Reliability • Noise or error: • Human mistakes in judgment • Different opinions from different assessors • Question set • Evaluating the effect of different opinions • Two assessors create two different nugget sets. • Runs are scored using both nugget lists. • The stability of rankings is measured by Kendall’s τ. • The τ score is 0.848 (considered stable if τ>0.9) • Not good enough
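Kendall’s τ between two rankings without ties is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. A minimal sketch follows; the function name is chosen here, and the input rankings are the second-round pilot rankings from the earlier table, used only as sample data (the 0.848 above comes from the TREC runs, which are not listed in these slides).

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau for two rankings of the same items, no ties assumed."""
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    concordant = discordant = 0
    # combinations() yields pairs (x, y) with x ranked ahead of y in ranking_a.
    for x, y in combinations(ranking_a, 2):
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Pilot rankings by the two assessors: only D and G swap places.
tau = kendall_tau("FADGEBCH", "FAGDEBCH")
print(round(tau, 3))  # 0.929
```

One swapped pair out of 28 already lowers τ to about 0.93, so a value of 0.848 corresponds to several pairwise disagreements between the two nugget lists.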
Example of Different Nugget Lists • “What is a golden parachute?”
First nugget list:
1 vital Agreement between companies and top executives
2 vital Provides remuneration to executives who lose jobs
3 vital Remuneration is usually very generous
4 Encourages execs not to resist takeover beneficial to shareholders
5 Incentive for execs to join companies
6 Arrangement for which IRS can impose excise tax
Second nugget list:
1 vital provides remuneration to executives who lose jobs
2 vital assures officials of rich compensation if lose job due to takeover
3 vital contract agreement between companies and their top executives
4 aids in hiring and retention
5 encourages officials not to resist a merger
6 IRS can impose taxes
Definition Task Reliability (Cont.) • If a system is evaluated on two large question sets of the same size, its F-measure scores should be similar. • Simulation of such an evaluation • Randomly create two question sets of the required size • Define error rate as the percentage of rank swaps • Group system pairs by their difference in F(β=5)
Definition Task Reliability (Cont.) • Most errors (rank swaps) happen in the small-difference groups. • A difference > 0.123 is required to have confidence in F(β=5) • More questions are needed in the test set to increase the sensitivity while remaining equally confident in the result.
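The simulation can be sketched as below. This is an illustration under stated assumptions, not the procedure used to obtain the 0.123 figure: the function name, the number of trials, the bin width, and the choice to draw two disjoint question sets are all made up for this example.

```python
import random
from collections import defaultdict
from itertools import combinations


def swap_error_rates(scores, set_size, trials=200, bin_width=0.05, seed=0):
    """scores maps each system to its per-question F scores (equal lengths).

    Returns {lower edge of difference bin: fraction of rank swaps}.
    """
    rng = random.Random(seed)
    n_questions = len(next(iter(scores.values())))
    if 2 * set_size > n_questions:
        raise ValueError("need at least 2 * set_size questions")
    swaps, totals = defaultdict(int), defaultdict(int)
    for _ in range(trials):
        # Draw two disjoint random question sets of the required size.
        drawn = rng.sample(range(n_questions), 2 * set_size)
        set1, set2 = drawn[:set_size], drawn[set_size:]
        mean1 = {s: sum(v[q] for q in set1) / set_size for s, v in scores.items()}
        mean2 = {s: sum(v[q] for q in set2) / set_size for s, v in scores.items()}
        for x, y in combinations(scores, 2):
            d1 = mean1[x] - mean1[y]
            d2 = mean2[x] - mean2[y]
            bin_index = int(abs(d1) / bin_width)
            totals[bin_index] += 1
            if d1 * d2 < 0:  # the two sets disagree on which system is better
                swaps[bin_index] += 1
    return {round(b * bin_width, 10): swaps[b] / totals[b] for b in sorted(totals)}


# Synthetic per-question scores, for illustration only.
demo_rng = random.Random(1)
demo = {name: [min(1.0, max(0.0, demo_rng.gauss(mu, 0.25))) for _ in range(50)]
        for name, mu in [("S1", 0.55), ("S2", 0.50), ("S3", 0.30)]}
for diff, rate in swap_error_rates(demo, set_size=25).items():
    print(diff, round(rate, 3))
```

Reading off the smallest difference bin whose error rate falls below a chosen tolerance gives the kind of threshold quoted on the slide.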
List Task • List questions with multiple possible answers • “List the names of chewing gums” • No target number is specified. • Final answer list of a question is the collection of correct answers in the corpus. • Instance precision (IP) and instance recall (IR) • F=2*IP*IR/(IP+IR)
Example of Final Answer List 1915: List the names of chewing gums. • Stimorol • Orbit • Winterfresh • Double Bubble • Dirol • Trident • Spearmint • Bazooka • Doublemint • Dentyne • Freedent • Hubba Bubba • Juicy Fruit • Big Red • Chiclets • Nicorette
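A small sketch of the list-task score, using the chewing gum list above. It assumes the usual reading of the two measures: IP is the fraction of returned instances that are in the final answer list, and IR is the fraction of the final answer list that was returned. The function name and the sample system response are invented for illustration.

```python
def list_scores(returned, final_answers):
    """Return (IP, IR, F) for one list question, counting distinct instances."""
    returned, final_answers = set(returned), set(final_answers)
    correct = len(returned & final_answers)
    ip = correct / len(returned) if returned else 0.0
    ir = correct / len(final_answers) if final_answers else 0.0
    f = 2 * ip * ir / (ip + ir) if ip + ir else 0.0
    return ip, ir, f


GUMS = ["Stimorol", "Orbit", "Winterfresh", "Double Bubble", "Dirol", "Trident",
        "Spearmint", "Bazooka", "Doublemint", "Dentyne", "Freedent",
        "Hubba Bubba", "Juicy Fruit", "Big Red", "Chiclets", "Nicorette"]

# Hypothetical response: three correct instances and one wrong one.
ip, ir, f = list_scores(["Orbit", "Trident", "Wrigley", "Dentyne"], GUMS)
print(ip, ir, round(f, 3))  # 0.75 0.1875 0.3
```

Because no target number is given, a system cannot know when to stop; the balanced F rewards neither a very short nor a very long list.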
Other Tasks • Passage Task: • Return a short (<250 characters) span of text containing an answer. • The text must be extracted from a document. • Factoid Task: • Exact answers • The passage task is evaluated separately. • The final score of the main task is FinalScore=1/2*FactoidScore+1/4*ListScore+1/4*DefScore
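The main-task combination is a fixed weighted average, shown here as a short sketch. The function name and the sample component scores are hypothetical.

```python
def final_score(factoid_score, list_score, def_score):
    """Weighted combination used for the main task: 1/2, 1/4, 1/4."""
    return 0.5 * factoid_score + 0.25 * list_score + 0.25 * def_score


# Hypothetical component scores for one run.
print(round(final_score(0.6, 0.4, 0.2), 3))  # 0.45
```

Factoid accuracy carries as much weight as the list and definition scores together, so a run cannot rank well on the main task through definition questions alone.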