
Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises

This presentation explores different types of evaluation, pitfalls, fairness of metrics, research progress, benefits, and the feel of evaluation in summarization systems.


Presentation Transcript


  1. Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises Kathleen McKeown, Department of Computer Science, Columbia University Major contributors: Ani Nenkova, Becky Passonneau

  2. Questions • What kinds of evaluation are possible? • What are the pitfalls? • Are evaluation metrics fair? • Is real research progress possible? • What are the benefits? • Should we evaluate our systems?

  3. What is the feel of the evaluation? • Is it competitive? • Does it foster a feeling of community? • Are the guidelines clearly established ahead of time? • Are the metrics fair? Do they measure what you want to measure?

  4. The night Max wore his wolf suit and made mischief of one kind

  5. and another and another

  6. His mother called him “WILD THING” and he said “I’LL EAT YOU UP!” so he was sent to bed without eating anything.

  7. DARPA GALE: Global Autonomous Language Environment • Three large teams: BBN, IBM, SRI • SRI: UC Berkeley, U Washington, UCSD, Columbia, NYU, UMASS, NCRI, Systran, Fair Isaac, Ohio State • Generate responses to open-ended questions • 17 templates: definitions, biographies, events, relationships, reactions, etc. • Using English, Chinese, and Arabic text and speech, blogs to news • Find all instances when a fact is mentioned (redundancy)

  8. GALE Evaluation • Can systems do at least 50% as well as a human? • If not, the GALE program will not continue • The team that does worst may be cut • Independent evaluator: BAE • Has never done text evaluation before • Has experience with task based evaluation • Gold Standard • System responses graded by two judges • Relevant facts added to the pool • Granularity of scoring: nuggets • Metrics • Variants of precision/recall weighted • Document citations • Redundancy

  9. Year 1: Sample Q&A LIST FACTS ABOUT [The Trial of Saddam Hussein] • The judge , however, that all people should have heard voices, the order of a court to solve technical problems. (Chi) • His account of events surrounding the torture and execution of more than 140 men and teenage boys from the Dujail , appeared to do little to advance the prosecution's goal of establishing Saddam 's "command responsibility" for the deaths. • A trial without Saddam could be an embarrassment for the U.S. government, which has worked hard to help create a tribunal that would be perceived by Iraqis as independent and fair. • As the trial got under way, a former secret police officer testified that he had not received any orders from Saddam during the investigations that followed an assassination attempt against him in Dujail in 1982 .

  10. Year 1: Results • F-value (Beta of 1) • Machine average: 0.230 • Human average: 0.353 • Machine to Human average: 0.678
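For reference, the "F-value (Beta of 1)" above is the standard harmonic mean of precision and recall. The sketch below shows the generic F-beta computation only; the actual GALE scoring (nugget weighting, citation and redundancy handling) is more involved, and the example numbers are purely illustrative.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Generic F-measure; beta = 1 weights precision and recall equally."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative only: precision 0.25 and recall 0.21 give F1 ~ 0.228,
# in the neighborhood of the machine average of 0.230 reported above.
print(f_beta(0.25, 0.21))
```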

  11. DUC – Document Understanding Conference • Established and funded by DARPA TIDES • Run by independent evaluator NIST • Open to summarization community • Annual evaluations on common datasets • 2001-present • Tasks • Single document summarization • Headline summarization • Multi-document summarization • Multi-lingual summarization • Focused summarization • Update summarization

  12. DUC is changing direction again • DARPA GALE effort cutting back participation in DUC • Considering co-locating with TREC QA • Considering new data sources and tasks

  13. DUC Evaluation • Gold Standard • Human summaries written by NIST • From 2 to 9 summaries per input set • Multiple metrics • Manual • Coverage (early years) • Pyramids (later years) • Responsiveness (later years) • Quality questions • Automatic • Rouge (-1, -2, -skipbigrams, LCS, BE) • Granularity • Manual: sub-sentential elements • Automatic: sentences

  14. TREC definition pilot • Long answer to request for a definition • As a pilot, less emphasis on results • Part of TREC QA

  15. Evaluation Methods • Pool system responses and break into nuggets • A judge scores nuggets as vital, OK or invalid • Measure information precision and recall • Can a judge reliably determine which facts belong in a definition?
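A simplified sketch of nugget-style scoring in the spirit of the TREC definition pilot: recall is computed over vital nuggets, and precision is approximated through a character-length allowance earned by each returned nugget. The structure follows the commonly cited TREC definition scoring, but the constants and names here should be treated as illustrative rather than as the official scorer.

```python
def nugget_f(returned_vital: int, returned_okay: int,
             total_vital: int, answer_length_chars: int,
             beta: float = 5.0, allowance_per_nugget: int = 100) -> float:
    """Nugget-style F-score sketch (beta and allowance are illustrative)."""
    # Recall is computed over vital nuggets only.
    recall = returned_vital / total_vital if total_vital else 0.0
    # Precision is approximated by a length allowance: each returned
    # nugget (vital or okay) "buys" a fixed number of characters.
    allowance = allowance_per_nugget * (returned_vital + returned_okay)
    if answer_length_chars <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length_chars - allowance) / answer_length_chars
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```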

  16. Considerations Across Evaluations • Independent evaluator • Not always as knowledgeable as researchers • Impartial determination of approach • Extensive collection of resources • Determination of task • Appealing to a broad cross-section of community • Changes over time • DUC 2001-2002: single and multi-document • DUC 2003: headlines, multi-document • DUC 2004: headlines, multilingual and multi-document, focused • DUC 2005: focused summarization • DUC 2006: focused and a new task, up for discussion • How long do participants have to prepare? • When is a task dropped? • Scoring of text at the sub-sentential level

  17. Task-based Evaluation • Use the summarization system as browser to do another task • Newsblaster: write a report given a broad prompt • DARPA utility evaluation: given a request for information, use question answering to write report

  18. Task Evaluation • Hypothesis: multi-document summaries enable users to find information efficiently • Task: fact-gathering given topic and questions • Resembles intelligence analyst task

  19. User Study: Objectives • Does multi-document summarization help? • Do summaries help the user find information needed to perform a report writing task? • Do users use information from summaries in gathering their facts? • Do summaries increase user satisfaction with the online news system? • Do users create better quality reports with summaries? • How do full multi-document summaries compare with minimal 1-sentence summaries such as Google News?

  20. User Study: Design • Compared 4 parallel news browsing systems • Level 1: Source documents only • Level 2: One sentence multi-document summaries (e.g., Google News) linked to documents • Level 3: Newsblaster multi-document summaries linked to documents • Level 4: Human written multi-document summaries linked to documents • All groups write reports given four scenarios • A task similar to analysts • Can only use Newsblaster for research • Time-restricted

  21. User Study: Execution • 4 scenarios • 4 event clusters each • 2 directly relevant, 2 peripherally relevant • Average 10 documents/cluster • 45 participants • Balance between liberal arts, engineering • 138 reports • Exit survey • Multiple-choice and open-ended questions • Usage tracking • Each click logged, on or off-site

  22. “Geneva” Prompt • The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the “road map for peace”, a diplomatic effort sponsored by …… • Who participated in the negotiations that produced the Geneva Accord? • Apart from direct participants, who supported the Geneva Accord preparations and how? • What has the response been to the Geneva Accord by the Palestinians?

  23. Measuring Effectiveness • Score report content and compare across summary conditions • Compare user satisfaction per summary condition • Compare where subjects took report content from

  24. Newsblaster

  25. User Satisfaction • More effective than a web search when using Newsblaster • Not true with documents only or single-sentence summaries • Easier to complete the task with summaries than with documents only • More likely to report having enough time with summaries than with documents only • Summaries rated as helping most: • 5% single-sentence summaries • 24% Newsblaster summaries • 43% human summaries

  26. User Study: Conclusions • Summaries measurably improve a news browser’s effectiveness for research • Users are more satisfied with Newsblaster summaries than with single-sentence summaries like those of Google News • Users want search • Not included in evaluation

  27. Potential Problems

  28. That very night in Max’s room a forest grew

  29. And grew

  30. And grew until the ceiling hung with vines and the walls became the world all around

  31. And an ocean tumbled by with a private boat for Max and he sailed all through the night and day

  32. And he sailed in and out of weeks and almost over a year to where the wild things are

  33. And when he came to where the wild things are they roared their terrible roars and gnashed their terrible teeth

  34. Comparing Text Against Text • Which human summary makes a good gold standard? Many summaries are good • At what granularity is the comparison made? • When can we say that two pieces of text match?

  35. Measuring variation • Types of variation between humans, by application: • Translation: same content, different wording • Summarization: different content?? different wording • Generation: different content?? different wording

  36. Human variation: content words (Ani Nenkova) • Summaries differ in vocabulary • Differences cannot be explained by paraphrase • Comparison: 7 translations of 20 documents vs. 7 summaries of 20 document sets • Faster vocabulary growth in summarization
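One way to visualize the vocabulary-growth claim is to plot the number of distinct word types as a function of tokens read. The sketch below is a minimal illustration with a plain whitespace tokenizer, not the analysis pipeline used in the study.

```python
def vocabulary_growth(texts: list[str], step: int = 100) -> list[tuple[int, int]]:
    """Distinct word types seen after every `step` tokens,
    concatenating the texts in order."""
    tokens = [w for t in texts for w in t.lower().split()]
    seen, curve = set(), []
    for i, tok in enumerate(tokens, 1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve
```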

  37. Variation impacts evaluation • Comparing content is hard • All kinds of judgment calls • Paraphrases • VP vs. NP • Ministers have been exchanged • Reciprocal ministerial visits • Length and constituent type • Robotics assists doctors in the medical operating theater • Surgeons started using robotic assistants

  38. Nightmare: only one gold standard • System may have chosen an equally good sentence but not in the one gold standard • Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile. • Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government • In DUC 2001 (one gold standard), human model had significant impact on scores (McKeown et al.) • Five human summaries needed to avoid changes in rank (Nenkova and Passonneau) • DUC 2003 data • 3 topic sets, 1 highest scoring and 2 lowest scoring • 10 model summaries

  39. How many summaries are enough?

  40. Scoring • Two main approaches used in DUC • ROUGE (Lin and Hovy) • Pyramids (Nenkova and Passonneau) • Problems: • Are the results stable? • How difficult is it to do the scoring?

  41. ROUGE: Recall-Oriented Understudy for Gisting Evaluation • N-gram co-occurrence metrics measuring content overlap • ROUGE-N = (count of n-gram overlaps between candidate and model summaries) / (total n-grams in the model summaries)

  42. ROUGE • Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements • Automatic and thus easy to apply • Important to consider confidence intervals when determining differences between systems • Scores falling within the same interval are not significantly different • ROUGE scores place systems into large groups: it can be hard to definitively say one is better than another • Sometimes results are unintuitive: • Multilingual scores as high as English scores • Use in speech summarization shows no discrimination • Good for training regardless of intervals: can see trends
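A minimal sketch of ROUGE-N as clipped n-gram recall against one or more model summaries, matching the ratio on the previous slide. The official ROUGE toolkit adds options this sketch omits (stemming, stopword removal, jackknifing over models, bootstrap confidence intervals), and the tokenizer here is just a whitespace split.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate: str, models: list[str], n: int = 2) -> float:
    """ROUGE-N as n-gram recall: overlapping n-grams over model n-grams."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    overlap, total = 0, 0
    for model in models:
        model_counts = Counter(ngrams(model.lower().split(), n))
        total += sum(model_counts.values())
        # Clipped counts: a model n-gram matches at most as often as it
        # appears in the candidate summary.
        overlap += sum(min(c, cand_counts[g]) for g, c in model_counts.items())
    return overlap / total if total else 0.0
```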

  43. Pyramids • Uses multiple human summaries • Information is ranked by its importance • Allows for multiple good summaries • A pyramid is created from the human summaries • Elements of the pyramid are content units • System summaries are scored by comparison with the pyramid

  44. Content units: better study of variation than sentences • Semantic units • Link different surface realizations with the same meaning • Emerge from the comparison of several texts

  45. Content unit example • S1: Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile. • S2: Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government. • S3: Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.

  46. SCU: A cable car caught fire (Weight = 4) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

  47. SCU: The cause of the fire is unknown (Weight = 1) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.

  48. Tiers of differentially weighted SCUs • Top: few SCUs, high weight • Bottom: many SCUs, low weight • Idealized representation of the pyramid with tiers W=3, W=2, W=1
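A minimal sketch of the original pyramid score described above: the summed weight of the SCUs a system summary expresses, divided by the weight of an ideally informative summary that draws the same number of SCUs from the top tiers. Function and variable names are illustrative, not from the Pyramid scoring tools.

```python
def pyramid_score(matched_scu_weights: list[int],
                  pyramid_weights: list[int]) -> float:
    """Original pyramid score sketch.

    matched_scu_weights: weights of the SCUs found in the system summary.
    pyramid_weights: weights of all SCUs in the pyramid built from the
    model summaries.
    """
    observed = sum(matched_scu_weights)
    # An ideal summary with the same number of SCUs takes them from the
    # highest-weight tiers first.
    k = len(matched_scu_weights)
    ideal = sum(sorted(pyramid_weights, reverse=True)[:k])
    return observed / ideal if ideal else 0.0

# Illustrative example: a summary expressing SCUs of weight 4 and 1,
# scored against a pyramid whose two heaviest SCUs weigh 4 and 3,
# gets (4 + 1) / (4 + 3) ~ 0.71.
print(pyramid_score([4, 1], [4, 3, 2, 1, 1]))
```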
