Do Summaries Help? A Task-Based Evaluation of Multi-Document Summarization Kathleen McKeown, Rebecca Passonneau, David Elson, Ani Nenkova, Julia Hirschberg Department of Computer Science Columbia University
Status of Multi-Document Summarization • Robust • Many existing systems (e.g., DUC 2004) • http://newsblaster.cs.columbia.edu • http://www.newsinessence.com • Extensive quantitative evaluation (intrinsic) • DUC 2001 – DUC 2005 • Comparison of system summary content against human models • Do system-generated summaries help end-users make better use of the news?
Extrinsic Evaluation • Task-based evaluation of single-document summarization using IR • TIPSTER-II, Brandow et al., Mani et al., Mochizuki & Okumura • Other factors can determine the result (Jing et al.) • Evaluation of evaluation metrics using a task similar to ours • Amigo et al.
Task Evaluation • Hypothesis: multi-document summaries enable users to find information efficiently • Task: fact-gathering given topic and questions • Resembles intelligence analyst task • Compared 4 parallel news browsing systems • Level 1: Source documents only • Level 2: One sentence multi-document summaries (e.g., Google News) linked to documents • Level 3: Newsblaster multi-document summaries linked to documents • Level 4: Human written multi-document summaries linked to documents
Results Preview • Quality of facts gathered significantly better • Newsblaster vs. documents alone • User satisfaction higher • Newsblaster and human summaries vs. documents and 1-sentence summaries • Summaries contributed important facts • Newsblaster and human summaries vs. 1-sentence summaries • Full multi-document summarization more powerful than documents alone or single-sentence summaries
Outline • Study design and execution • Scoring • Results
Evaluation Goals • Do summaries help users find information needed to perform a fact gathering task? • Do users use information from the summary in gathering their facts? • Do summaries increase user satisfaction with the online news system? • Do users create better fact sets with an online news system that includes summaries than one without? • How does type of summary (i.e., 1-sentence, system generated, human generated) affect quality of task output and user satisfaction?
Experimental Design • Subjects performed four 30-minute fact-gathering scenarios • Prompt: topic description plus three questions • Given a web page as sole resource • Space in which to compose response • Instructed to cut and paste from summary or article • Four event clusters per page • Two centrally relevant, two less relevant • 10 documents per cluster on average • Complete survey after each scenario
Prompt • The conflict between Israel and the Palestinians has been difficult for government negotiators to settle. Most recently, implementation of the "road map for peace," a diplomatic effort sponsored by the United States, Russia, the E.U. and the U.N., has suffered setbacks. However, unofficial negotiators have developed a plan known as the Geneva Accord for finding a permanent solution to the conflict. • Who participated in the negotiations that produced the Geneva Accord? • Apart from direct participants, who supported the Geneva Accord preparations and how? • What has the response been to the Geneva Accord by the Palestinians and Israelis?
Experimental Design • Subjects performed four 30-minute fact-gathering scenarios • Prompt: topic description plus three questions • Produced a report containing a list of facts • Given a web page as sole resource • Space in which to compose response • Instructed to cut and paste from summary or article and make citation • Four event clusters per page • Two centrally relevant, two less relevant • 10 documents per cluster on average • Complete survey after each scenario
Level 2: 1-sentence summary for each event cluster, 1-sentence summary for each article
Full multi-document summaries (neither humans nor systems had access to the prompt) • Level 3: Generated by Newsblaster for each event cluster • Level 4: Human-written summary for each event cluster • Summary writers hired to write summaries • English or Journalism students with high verbal SAT scores
Study Execution • 45 subjects with varied backgrounds • 73% students (BS, BA, journalism, law) • Native speakers of English • Paid, with promise of a monetary prize for the best report • 3 studies, controlling for scenario and level order; ~11 subjects per scenario per level
Results – What was Measured • Report content across summary conditions: levels 1-4 • User satisfaction per summary condition based on user surveys • Source of report content (summary or article) by counting fact citations
Scoring Report Content • Compare subject reports against a gold standard • Used the Pyramid method [HLT2004] • Avoids postulating an ideal exhaustive report • Predicts multiple equally good reports • Provides a metric for comparison • Gold standard for report x = pyramid of facts constructed from all reports except x • Relative importance of facts determined by report writers • 34 reports per pyramid on average -> very stable
Pyramid representation • Tiers of differentially weighted facts • Top tier: few facts, high weight (W = 34) • Bottom tier: many facts, low weight (W = 1) • Report facts that don't appear in the pyramid get weight 0 • Duplicate report facts get weight 0
Ideally informative report • Does not include a fact from a lower tier unless all facts from higher tiers are included as well
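To make the pyramid scoring concrete, here is a minimal sketch (not the authors' implementation) of how a report's pyramid score can be computed: each fact's weight is the number of reports it appears in, and the score normalizes the report's observed weight by the maximum weight achievable with the same number of facts, i.e., an ideally informative report of that size. The fact identifiers and the helper names `build_pyramid` and `pyramid_score` are illustrative assumptions.

```python
from collections import Counter

def build_pyramid(reports):
    """Weight each fact by the number of reports that contain it;
    the leave-one-out gold standard excludes the report being scored."""
    weights = Counter()
    for facts in reports:
        for fact in set(facts):          # count each fact at most once per report
            weights[fact] += 1
    return weights

def pyramid_score(report_facts, weights):
    """Observed weight of the report's facts divided by the maximum weight
    achievable with the same number of facts drawn from the top tiers."""
    unique_facts = set(report_facts)      # duplicate facts get weight 0
    observed = sum(weights.get(f, 0) for f in unique_facts)
    top_weights = sorted(weights.values(), reverse=True)[:len(unique_facts)]
    ideal = sum(top_weights)
    return observed / ideal if ideal else 0.0

# Hypothetical usage: score one subject's report against a pyramid built
# from all other subjects' reports on the same scenario.
other_reports = [["f1", "f2"], ["f1", "f3"], ["f1", "f2", "f4"]]
pyramid = build_pyramid(other_reports)
print(pyramid_score(["f1", "f5", "f5"], pyramid))   # -> 0.6
```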
Report Length • Wide variation in length impacts scores • We restricted report length to less than one standard deviation above the mean by truncating question answers
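As a rough illustration of the length cutoff (the exact truncation procedure and units are not specified here, so the word counts and helper below are assumptions), the threshold is simply the mean report length plus one standard deviation:

```python
from statistics import mean, stdev

report_lengths = [320, 410, 275, 500, 390, 820, 360]    # hypothetical word counts
cutoff = mean(report_lengths) + stdev(report_lengths)   # mean + 1 standard deviation

def truncate(report_words, cutoff):
    """Drop material beyond the cutoff, mimicking truncation of question answers."""
    return report_words[: int(cutoff)]
```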
Results - Content • Report quality improves from level 1 to level 3 • (One scenario was dropped from the results because it was problematic for subjects)
Statistical Analysis • ANOVA shows summary level is a marginally significant factor • Bonferroni method applied to determine differences among summary levels • Difference between Newsblaster and documents-only is significant (p = .05) • Differences between Newsblaster and 1-sentence or human summaries are not significant • ANOVA shows that scenario, question, and subject are also significant factors
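A minimal sketch of this style of analysis (not the authors' code), assuming per-subject pyramid scores grouped by summary level: scipy's one-way ANOVA tests for an overall effect, and pairwise t-tests are Bonferroni-corrected by multiplying each p-value by the number of comparisons. The score values are invented for illustration.

```python
from itertools import combinations
from scipy import stats

# Hypothetical pyramid scores per summary level (1 = documents only ... 4 = human summaries).
scores = {
    1: [0.41, 0.38, 0.45, 0.40],
    2: [0.47, 0.50, 0.44, 0.49],
    3: [0.52, 0.55, 0.49, 0.53],
    4: [0.54, 0.51, 0.56, 0.50],
}

# One-way ANOVA across the four summary conditions.
f_stat, p_anova = stats.f_oneway(*scores.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

# Pairwise comparisons with a Bonferroni correction.
pairs = list(combinations(scores, 2))
for a, b in pairs:
    t, p = stats.ttest_ind(scores[a], scores[b])
    p_corrected = min(p * len(pairs), 1.0)
    print(f"level {a} vs level {b}: corrected p = {p_corrected:.3f}")
```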
Results - User Satisfaction • 6 questions in the exit survey required a response on a 1-5 scale • Average ratings increase with summary type
With summaries, subjects found it easier to write the report and tended to have more time
Usefulness improves with summary quality; human summaries help most with having enough time
Citation Patterns • Report writers were significantly more likely to extract facts from summaries when using Newsblaster or human summaries
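One way to quantify this pattern (a sketch using assumed counts, not the study's actual data) is to tabulate cited facts by source for each summary condition and test the association with a chi-square test:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of cited facts by source; rows are summary conditions,
# columns are [cited from summary, cited from article].
citations = [
    [12, 188],   # level 2: 1-sentence summaries
    [85, 115],   # level 3: Newsblaster summaries
    [110, 90],   # level 4: human summaries
]

chi2, p, dof, expected = chi2_contingency(citations)
print(f"chi-square = {chi2:.1f}, p = {p:.4f}")

# Proportion of cited facts drawn from summaries in each condition.
for row, label in zip(citations, ["1-sentence", "Newsblaster", "human"]):
    print(f"{label}: {row[0] / sum(row):.0%} of cited facts came from summaries")
```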
What We Learned • With summaries, a significant increase in report quality • We hypothesized summaries would reduce reading time • As summary quality increases, users draw facts from the summary significantly more often, with no decrease in report quality • Users report reading fewer full documents with level 3 and 4 summaries • Full multi-document summarization better than 1-sentence summaries • The proportion of subjects saying summaries were helpful was almost 5 times higher with Newsblaster summaries than with 1-sentence summaries
Need for Follow-on Studies • Why no significant increase in report quality from level 2 to level 3? • Interface differences • Level 2 had summary for each article, level 3 did not • Level 3 required extra clicks to see list of articles • Studies to investigate controlling report length • Studies to investigate impact of scenario and question
Conclusions • Do summaries help? • Yes • Our task-based, extrinsic evaluation yielded significant conclusions • Full multi-document summarization (Newsblaster, human summaries) helps users perform better at fact-gathering than documents only • Users are more satisfied with full multi-document summarization than Google News style 1-sentence summaries