This talk explores different types of evaluation, common pitfalls, the fairness of metrics, whether real research progress is possible, the benefits of evaluation, and the overall feel of evaluation in summarization systems.
Lessons Learned from Evaluation of Summarization Systems: Nightmares and Pleasant Surprises Kathleen McKeown Department of Computer Science Columbia University Major contributors: Ani Nenkova, Becky Passonneau
Questions • What kinds of evaluation are possible? • What are the pitfalls? • Are evaluation metrics fair? • Is real research progress possible? • What are the benefits? • Should we evaluate our systems?
What is the feel of the evaluation? • Is it competitive? • Does it foster a feeling of community? • Are the guidelines clearly established ahead of time? • Are the metrics fair? Do they measure what you want to measure?
The night Max wore his wolf suit and made mischief of one kind
and another
His mother called him “WILD THING” and he said “I’LL EAT YOU UP!” so he was sent to bed without eating anything.
DARPA GALE: Global Autonomous Language Environment • Three large teams: BBN, IBM, SRI • SRI: UC Berkeley, U Washington, UCSD, Columbia, NYU, UMASS, NCRI, Systran, Fair Isaacs, Ohio State • Generate responses to open-ended questions • 17 templates: definitions, biographies, events, relationships, reactions, etc. • Using English, Chinese, and Arabic text and speech, blogs to news • Find all instances when a fact is mentioned (redundancy)
GALE Evaluation • Can systems do at least 50% as well as a human? • If not, the GALE program will not continue • The team that does worst may be cut • Independent evaluator: BAE • Has never done text evaluation before • Has experience with task-based evaluation • Gold Standard • System responses graded by two judges • Relevant facts added to the pool • Granularity of scoring: nuggets • Metrics • Weighted variants of precision/recall • Document citations • Redundancy
Year 1: Sample Q&A LIST FACTS ABOUT [The Trial of Saddam Hussein] • The judge, however, that all people should have heard voices, the order of a court to solve technical problems. (Chi) • His account of events surrounding the torture and execution of more than 140 men and teenage boys from the Dujail, appeared to do little to advance the prosecution's goal of establishing Saddam's "command responsibility" for the deaths. • A trial without Saddam could be an embarrassment for the U.S. government, which has worked hard to help create a tribunal that would be perceived by Iraqis as independent and fair. • As the trial got under way, a former secret police officer testified that he had not received any orders from Saddam during the investigations that followed an assassination attempt against him in Dujail in 1982.
Year 1: Results • F-value (Beta of 1) • Machine average: 0.230 • Human average: 0.353 • Machine to Human average: 0.678
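For reference, the F-value above is the weighted harmonic mean of precision and recall; with a beta of 1 the two are weighted equally. The general form below is the standard definition, not specific to the GALE scoring details:

```latex
F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R},
\qquad
F_1 = \frac{2PR}{P + R}
```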
DUC – Document Understanding Conference • Established and funded by DARPA TIDES • Run by independent evaluator NIST • Open to summarization community • Annual evaluations on common datasets • 2001-present • Tasks • Single document summarization • Headline summarization • Multi-document summarization • Multi-lingual summarization • Focused summarization • Update summarization
DUC is changing direction again • DARPA GALE effort cutting back participation in DUC • Considering co-locating with TREC QA • Considering new data sources and tasks
DUC Evaluation • Gold Standard • Human summaries written by NIST • From 2 to 9 summaries per input set • Multiple metrics • Manual • Coverage (early years) • Pyramids (later years) • Responsiveness (later years) • Quality questions • Automatic • Rouge (-1, -2, -skipbigrams, LCS, BE) • Granularity • Manual: sub-sentential elements • Automatic: sentences
TREC definition pilot • Long answer to request for a definition • As a pilot, less emphasis on results • Part of TREC QA
Evaluation Methods • Pool system responses and break into nuggets • A judge scores nuggets as vital, OK or invalid • Measure information precision and recall • Can a judge reliably determine which facts belong in a definition?
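A rough sketch of how such nugget-based scoring can be computed, in the style of the TREC definition pilot (assumptions: only vital nuggets count toward recall, precision is approximated by a length allowance per matched nugget, and a recall-favoring beta; the exact constants and details of the official scoring may differ):

```python
# Nugget-based F-score sketch (hypothetical constants; see lead-in above).
def nugget_f(vital_matched, vital_total, okay_matched, answer_length, beta=5.0):
    """Recall over vital nuggets; precision approximated by a length
    allowance of 100 non-whitespace characters per matched nugget."""
    recall = vital_matched / vital_total if vital_total else 0.0
    allowance = 100 * (vital_matched + okay_matched)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision + recall == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)

# Example: 6 of 10 vital nuggets found, 2 okay nuggets, 1200-character answer.
print(nugget_f(vital_matched=6, vital_total=10, okay_matched=2, answer_length=1200))
```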
Considerations Across Evaluations • Independent evaluator • Not always as knowledgeable as researchers • Impartial determination of approach • Extensive collection of resources • Determination of task • Appealing to a broad cross-section of community • Changes over time • DUC 2001-2002: single and multi-document • DUC 2003: headlines, multi-document • DUC 2004: headlines, multilingual and multi-document, focused • DUC 2005: focused summarization • DUC 2006: focused and a new task, up for discussion • How long do participants have to prepare? • When is a task dropped? • Scoring of text at the sub-sentential level
Task-based Evaluation • Use the summarization system as browser to do another task • Newsblaster: write a report given a broad prompt • DARPA utility evaluation: given a request for information, use question answering to write report
Task Evaluation • Hypothesis: multi-document summaries enable users to find information efficiently • Task: fact-gathering given topic and questions • Resembles intelligence analyst task
User Study: Objectives • Does multi-document summarization help? • Do summaries help the user find information needed to perform a report writing task? • Do users use information from summaries in gathering their facts? • Do summaries increase user satisfaction with the online news system? • Do users create better quality reports with summaries? • How do full multi-document summaries compare with minimal 1-sentence summaries such as Google News?
User Study: Design • Compared 4 parallel news browsing systems • Level 1: Source documents only • Level 2: One-sentence multi-document summaries (e.g., Google News) linked to documents • Level 3: Newsblaster multi-document summaries linked to documents • Level 4: Human-written multi-document summaries linked to documents • All groups write reports given four scenarios • A task similar to an analyst's • Can only use Newsblaster for research • Time-restricted
User Study: Execution • 4 scenarios • 4 event clusters each • 2 directly relevant, 2 peripherally relevant • Average 10 documents/cluster • 45 participants • Balance between liberal arts, engineering • 138 reports • Exit survey • Multiple-choice and open-ended questions • Usage tracking • Each click logged, on or off-site
“Geneva” Prompt • The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the “road map for peace”, a diplomatic effort sponsored by …… • Who participated in the negotiations that produced the Geneva Accord? • Apart from direct participants, who supported the Geneva Accord preparations and how? • What has the response been to the Geneva Accord by the Palestinians?
Measuring Effectiveness • Score report content and compare across summary conditions • Compare user satisfaction per summary condition • Compare where subjects took report content from
User Satisfaction • With Newsblaster, more effective than a web search • Not true with documents only or single-sentence summaries • Easier to complete the task with summaries than with documents only • More likely to have enough time with summaries than with documents only • Summaries helped most • 5% single-sentence summaries • 24% Newsblaster summaries • 43% human summaries
User Study: Conclusions • Summaries measurably improve a news browser's effectiveness for research • Users are more satisfied with Newsblaster summaries than with single-sentence summaries like those of Google News • Users want search • Not included in evaluation
And grew until the ceiling hung with vines and the walls became the world all around
And an ocean tumbled by with a private boat for Max and he sailed all through the night and day
And he sailed in and out of weeks and almost over a year to where the wild things are
And when he came to where the wild things are they roared their terrible roars and gnashed their terrible teeth
Comparing Text Against Text • Which human summary makes a good gold standard? Many summaries are good • At what granularity is the comparison made? • When can we say that two pieces of text match?
Measuring variation • Types of variation between humans: • Translation: same content, different wording • Applications: • Summarization: different content??, different wording • Generation: different content??, different wording
Human variation: content words (Ani Nenkova) • Summaries differ in vocabulary • Differences cannot be explained by paraphrase • [Chart: vocabulary growth for 7 translations of 20 documents vs. 7 summaries of 20 document sets] • Faster vocabulary growth in summarization
Variation impacts evaluation • Comparing content is hard • All kinds of judgment calls • Paraphrases • VP vs. NP • Ministers have been exchanged • Reciprocal ministerial visits • Length and constituent type • Robotics assists doctors in the medical operating theater • Surgeons started using robotic assistants
Nightmare: only one gold standard • System may have chosen an equally good sentence but not in the one gold standard • Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile. • Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government • In DUC 2001 (one gold standard), human model had significant impact on scores (McKeown et al.) • Five human summaries needed to avoid changes in rank (Nenkova and Passonneau) • DUC 2003 data • 3 topic sets, 1 highest scoring and 2 lowest scoring • 10 model summaries
Scoring • Two main approaches used in DUC • ROUGE (Lin and Hovy) • Pyramids (Nenkova and Passonneau) • Problems: • Are the results stable? • How difficult is it to do the scoring?
ROUGE: Recall-Oriented Understudy for Gisting Evaluation • N-gram co-occurrence metrics measuring content overlap • ROUGE-n = (count of n-grams co-occurring in the candidate and the model summaries) / (total n-grams in the model summaries)
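A minimal sketch of ROUGE-n recall computed directly from that definition (assumptions: simple whitespace tokenization, lower-casing, clipped counts, no stemming or stopword removal; the official ROUGE toolkit supports many more options):

```python
# ROUGE-n recall sketch: overlapping n-grams between a candidate summary and
# the model (reference) summaries, divided by total n-grams in the models.
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n=1):
    cand = ngram_counts(candidate.lower().split(), n)
    overlap, total = 0, 0
    for ref in references:
        ref_counts = ngram_counts(ref.lower().split(), n)
        overlap += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return overlap / total if total else 0.0

# Example with made-up sentences:
print(rouge_n_recall("pinochet was arrested in london",
                     ["former dictator pinochet arrested in london"], n=1))
```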
ROUGE • Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements • Automatic and thus easy to apply • Important to consider confidence intervals when determining differences between systems • Scores falling within same interval not significantly different • ROUGE scores place systems into large groups: can be hard to definitively say one is better than another • Sometimes results unintuitive: • Multilingual scores as high as English scores • Use in speech summarization shows no discrimination • Good for training regardless of intervals: can see trends
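One way to obtain such confidence intervals is to bootstrap over per-document scores (a sketch assuming simple resampling of documents with replacement; the official ROUGE toolkit computes its intervals with its own resampling procedure):

```python
# Bootstrap confidence interval over per-document ROUGE scores (sketch).
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(scores) for _ in scores) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Two systems whose intervals overlap are hard to tell apart:
print(bootstrap_ci([0.38, 0.41, 0.35, 0.44, 0.40]))
```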
Pyramids • Uses multiple human summaries • Information is ranked by its importance • Allows for multiple good summaries • A pyramid is created from the human summaries • Elements of the pyramid are content units • System summaries are scored by comparison with the pyramid
Content units: better study of variation than sentences • Semantic units • Link different surface realizations with the same meaning • Emerge from the comparison of several texts
Content unit example S1: Pinochet arrested in London on Oct 16 at a Spanish judge's request for atrocities against Spaniards in Chile. S2: Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government. S3: Britain caused international controversy and Chilean turmoil by arresting former Chilean dictator Pinochet in London.
SCU: A cable car caught fire (Weight = 4) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
SCU: The cause of the fire is unknown (Weight = 1) A. The cause of the fire was unknown. B. A cable car caught fire just after entering a mountainside tunnel in an alpine resort in Kaprun, Austria on the morning of November 11, 2000. C. A cable car pulling skiers and snowboarders to the Kitzsteinhorn resort, located 60 miles south of Salzburg in the Austrian Alps, caught fire inside a mountain tunnel, killing approximately 170 people. D. On November 10, 2000, a cable car filled to capacity caught on fire, trapping 180 passengers inside the Kitzsteinhorn mountain, located in the town of Kaprun, 50 miles south of Salzburg in the central Austrian Alps.
Tiers of differentially weighted SCUs • Top: few SCUs, high weight • Bottom: many SCUs, low weight • [Idealized pyramid representation with tiers of weight W=3, W=2, W=1]
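A minimal sketch of the pyramid score as originally defined (assumption: the score is the sum of the weights of the SCUs found in the system summary, divided by the maximum sum achievable by a summary expressing the same number of SCUs drawn from the pyramid; later "modified" variants normalize differently):

```python
# Pyramid score sketch: observed SCU weight over the best achievable weight
# for a summary expressing the same number of SCUs.
def pyramid_score(peer_scu_weights, pyramid_weights):
    observed = sum(peer_scu_weights)
    k = len(peer_scu_weights)
    ideal = sum(sorted(pyramid_weights, reverse=True)[:k])
    return observed / ideal if ideal else 0.0

# Example with the two SCUs above (weights 4 and 1) plus hypothetical others:
print(pyramid_score([4, 1], [4, 3, 2, 1, 1]))  # 5 / 7 ≈ 0.71
```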