Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems • Paul Over • Retrieval Group, Information Access Division • National Institute of Standards and Technology
Document Understanding Conferences (DUC) • Summarization has always been a major TIDES component • An evaluation roadmap was completed in the summer of 2000 following the spring TIDES PI meeting • DUC-2000 occurred in November 2000: • research reports • planning for first evaluation using the roadmap
Summarization road map • Specifies a series of annual cycles, with • progressively more demanding text data • both direct (intrinsic) and indirect (extrinsic, task-based) evaluations • increasing challenge in tasks • Year 1 (September 2001) • Intrinsic evaluation of generic summaries, • of newswire/paper stories • for single and multiple documents; • with fixed lengths of 50, 100, 200, and 400 words • 60 sets of 10 documents used • 30 for training • 30 for test
DUC-2001 schedule • Preliminary call out via ACL; • over 55 responses • 25 groups signed up • Creation/Distribution of training and test data • 30 training reference sets released March 1 • 30 test sets of documents released June 15 • System development • System testing • Evaluation at NIST • 15 sets of summaries submitted July 1 • Human judgments of submissions at NIST – July 9-31 • Analysis of results • Discussion of results and plans • DUC-2001 at SIGIR in New Orleans – Sept. 13-14
Goals of the talk • Provide an overview of the: • Data • Tasks • Evaluation • Experience with implementing the evaluation procedure • Feedback from NIST assessors • Introduce the results: • Sanity checking the results and measures • Effect of reassessment with a different model summary (Phase 2) • Emphasize: • Exploratory data analysis • Attention to evaluation fundamentals over “final” conclusions • Improving future evaluations
Data: Formation of training/test document sets • Each of 10 NIST information analysts chose one set of newswire/paper articles of each of the following types: • A single event with causes and consequences • Multiple distinct events of a single type • Subject (discuss a single subject) • One of the above in the domain of natural disasters • Biographical (discuss a single person) • Opinion (different opinions about the same subject) • Each set contains about 10 documents (mean=10.2, std=2.1) • All documents in a set to be mainly about a specific “concept”
Human summary creation • A: Read hardcopy of documents. • B: Create a 100-word softcopy summary for each document using the document author’s perspective. • C: Create a 400-word softcopy multi-document summary of all 10 documents written as a report for a contemporary adult newspaper reader. • D, E, F: Cut, paste, and reformulate to reduce the size of the summary by half at each step (D: 400 to 200 words, E: 200 to 100, F: 100 to 50). • Outputs: single-document summaries (B) and multi-document summaries (C through F)
Training and test document sets • For each of the 10 authors, • 3 docsets were chosen at random to be training sets • the 3 remaining sets were reserved for testing • Counts of docsets by type:
Example training and test document sets (TR = training, TE = test; document count in brackets) • Assessor A: • 1. TR - D01: Clarence Thomas’s nomination to the Supreme Court [11] • 2. TR - D06: Police misconduct [16] • 3. TR - D05: Mad cow disease [11] • 4/1. TE - D04: Hurricane Andrew [11] • 5. TE - D02: Rise and fall of Michael Milken [11] • 6. TE - D03: Sununu resignation [11] • Assessor B: • 1. TR - D09: America’s response to the Iraqi invasion of Kuwait [16] • 2. TE - D08: Solar eclipses [11] • 3. TR - D07: Antarctica [9] • 4/2. TE - D11: Tornadoes [8] • 5. TR - D10: Robert Bork [12] • 6. TE - D12: Welfare reform [8]
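The per-author random split into 3 training and 3 test docsets can be sketched as follows. This is only an illustration, not the procedure NIST actually ran: the function name `split_docsets`, the `seed` parameter, and the use of Python's `random` module are assumptions.

```python
import random


def split_docsets(docsets, n_train=3, seed=None):
    """Randomly pick n_train docsets for training; the rest become test sets.

    Hypothetical helper: the slides only say that 3 of each author's
    docsets were chosen at random for training.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(docsets, n_train))
    # Preserve the original ordering inside each partition.
    training = [d for d in docsets if d in chosen]
    test = [d for d in docsets if d not in chosen]
    return training, test


# Docset IDs taken from the Assessor A example; the resulting split is random.
training, test = split_docsets(["D01", "D02", "D03", "D04", "D05", "D06"], seed=0)
```

With 10 authors and 6 docsets each, repeating this per author gives the 30 training and 30 test sets mentioned in the road map.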
Automatic baselines • NIST created 3 baselines automatically, based roughly on algorithms suggested by Daniel Marcu from earlier work • Single-document summaries: • Take the first 100 words in the document • Multi-document summaries: • Take the first 50, 100, 200, 400 words in the most recent document. • 23.3% of the 400-word summaries were shorter than the target. • Take the first sentence in the 1st, 2nd, 3rd,… document in chronological sequence until you have the target summary size. Truncate the last sentence if target size is exceeded. • 86.7% of the 400-word summaries and 10% of the 200-word summaries were shorter than the target.
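A minimal sketch of the baseline ideas above, under stated assumptions: words are whitespace-separated tokens, and sentences end at ".", "!" or "?" followed by whitespace. The function names and the naive sentence splitter are illustrative only and are not the code NIST used.

```python
import re


def lead_baseline(document: str, target_words: int) -> str:
    """First `target_words` words of one document.

    Covers the single-document baseline (target 100) and, when applied to
    the most recent document of a set, the first multi-document baseline.
    """
    return " ".join(document.split()[:target_words])


def first_sentence(document: str) -> str:
    """Naive splitter (assumption): cut at the first ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", document.strip(), maxsplit=1)[0]


def first_sentences_baseline(documents_in_date_order, target_words: int) -> str:
    """Concatenate first sentences in chronological order up to the target size.

    The last sentence is truncated if the target is exceeded; the result is
    shorter than the target when the set runs out of documents.
    """
    words = []
    for doc in documents_in_date_order:
        words.extend(first_sentence(doc).split())
        if len(words) >= target_words:
            break
    return " ".join(words[:target_words])
```

Both functions cut at a word count rather than a sentence boundary, and the second can run out of documents before reaching the target, which would explain summaries that fall short of it.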
Submitted summaries
System code   Group name               Multi-doc.   Single-doc.
L             Columbia University      120          -----
M             Cogentex                 112          -----
N             USC/ISI – Webclopedia    120          -----
O             Univ. of Ottawa          120          307
P             Univ. of Michigan        120          308
Q             Univ. of Lethbridge      -----        308
R             SUNY at Albany           120          308
S             TNO/TPD                  118          308
T             SMU                      120          307
U             Rutgers Univ.            120          -----
V             NYU                      -----        308
W             NSA                      120          279
X             NIJL                     -----        296
Y             USC/ISI                  120          308
Z             Baldwin Lang. Tech.      120          308
Total                                  1430         3345
Evaluation basics • Intrinsic evaluation by humans using special version of SEE (thanks to Chin-Yew Lin, ISI) • Compare: • a model summary - authored by a human • a peer summary - system-created, baseline, or human • Produce judgments of: • Peer grammaticality, cohesion, organization • Coverage of each model unit by the peer (recall) • Characteristics of peer-only material
Phases: summary evaluation and evaluation evaluation • Phase 1: Assessor judged peers against his/her own models. • Phase 2: Assessor judged subset of peers for a subset of docsets twice - against two other humans’ summaries • Phase 3 (not implemented): 2 different assessors judge same peers using same models.
Models • Source: • Authored by a human • Phase 1: assessor is document selector and model author • Phase 2: assessor is neither document selector nor model author • Formatting: • Divided into model units (MUs) (EDUs - thanks to William Wong at ISI) • Lightly edited by authors to integrate uninterpretable fragments • Flowed together with HTML tags for SEE
Peers • Formatting: • Divided into peer units (PUs) – • simple automatically determined sentences • tuned slightly to documents and submissions • Abbreviations list • Submission ending most sentences with “…” • Submission formatted as lists of titles • Flowed together with HTML tags for SEE • 3 Sources: • Automatically generated by research systems • For single-document summaries: 5 “randomly” selected, common • No multi-document summaries for docset 31 (model error) • Automatically generated by baseline algorithms • Authored by a human other than the assessor
Origins of the evaluation framework: SEE+++ • Evaluation framework builds on ISI work embodied in original SEE software • Challenges for DUC-2001 • Better explain questions posed to the NIST assessors • Modify the software to reduce sources of error/distraction • Get agreement from DUC program committee • Three areas of assessment in SEE: • Overall peer quality • Per-unit content • Unmarked peer units
Overall peer quality: difficult to define operationally • Grammaticality: “Do the sentences, clauses, phrases, etc. follow the basic rules of English? • Don’t worry here about style or the ideas. • Concentrate on grammar.” • Cohesion: “Do the sentences fit in as they should with the surrounding sentences? • Don’t worry about the overall structure of the ideas. • Concentrate on whether each sentence naturally follows the preceding one and leads into the next.” • Organization: “Is the content expressed and arranged in an effective manner? • Concentrate here on the high-level arrangement of the ideas.”
Overall peer quality: assessor feedback • How much should typos, truncated sentences, obvious junk characters, headlines vs. full sentences, etc. affect the grammaticality score? • Hard to keep all three questions separate – especially cohesion and organization. • The 5-value answer scale is OK. • Good to be able to go back and change judgments for correctness and consistency. • Need a rule for small and single-unit summaries – cohesion and organization as defined don’t make much sense for these.
Counts of peer units (sentences) in submissions • Widely variable
Grammaticality across all summaries • Most scores relatively high • System score range very wide • Medians/means: Baselines < Systems < Humans • But why are baselines (extractions) less than perfect? Notches in box plots indicate 95% confidence intervals around the mean if and only if: - the sample is large (> 30), or - the sample has an approximate normal distribution.
Most baselines contained a sentence fragment • Single-document summaries: • Take the first 100 words in the document • 91.7% of these summaries ended with a sentence fragment. • Multi-document summaries: • Take the first 50, 100, 200, 400 words in the most recent document. • 87.5% of these summaries ended with a sentence fragment. • Take the first sentence in the 1st, 2nd, 3rd,… document in chronological sequence until you have the target summary size. Truncate the last sentence if target size is exceeded. • 69.2% of these summaries ended with a sentence fragment.
Grammaticality: singles vs multis • Single- vs multi-document seems to have little effect
Grammaticality: among multis • Why more lower scores for baseline 50s and human 400s?
Cohesion across all summaries • Median baselines = systems < humans
Cohesion: singles vs multis • Better results on singles than multis • For singles: median baselines = systems = humans
Cohesion: among multis • Why more high scores for system summaries in the 50s?
Organization across all summaries • Median baselines > systems > humans
Organization: singles vs multis Generally lower scores for multi-document summaries than single-document summaries
Organization: among multis • Why more high scores for system summaries in the 50s? • Why are human summaries worse for the 200s?
Cohesion vs Organization • Any real difference for assessors? • Why is organization ever higher than cohesion?
Per-unit content: evaluation details • “First, find all the peer units which tell you at least some of what the current model unit tells you, i.e., peer units which express at least some of the same facts as the current model unit. When you find such a PU, click on it to mark it.” • “When you have marked all such PUs for the current MU, then think about the whole set of marked PUs and answer the question.” • “The marked PUs, taken together, express [ All, Most, Some, Hardly any, or None ] of the meaning expressed by the current model unit.”
Per-unit content: assessor feedback • This is a laborious process and easy to get wrong – loop within a loop. • How to interpret fragments as units, e.g., a date standing alone? • How much and what kind of information (e.g., from context) can/should you add to determine what a peer unit means? • Criteria for marking a PU need to be clear - sharing of what?: • Facts • Ideas • Meaning • Information • Reference
Per-unit content: measures • Recall • Average coverage - average of the per-MU completeness judgments [0..4] for a peer summary • Recall at various threshold levels: • Recall4: # MUs with all information covered / # MUs • Recall3: # MUs with all/most information covered / # MUs • Recall2: # MUs with all/most/some information covered / # MUs • Recall1: # MUs with all/most/some/any information covered / # MUs • Weighted average? • Precision: problems • Peer summary lengths fixed • Insensitive to: • Duplicate information • Partially unused peer units
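The recall measures above can be sketched in a few lines. This assumes the five answers map onto the [0..4] scale as All=4, Most=3, Some=2, Hardly any=1, None=0, and that "any" in Recall1 means "Hardly any"; the function names are illustrative.

```python
# Assumed mapping of the SEE answer scale onto the [0..4] completeness values.
SCALE = {"All": 4, "Most": 3, "Some": 2, "Hardly any": 1, "None": 0}


def average_coverage(judgments):
    """Mean of the per-MU completeness judgments (0..4) for one peer summary."""
    return sum(judgments) / len(judgments)


def recall_at(judgments, threshold):
    """Fraction of MUs judged at or above `threshold`.

    threshold=4 gives Recall4 (all information covered),
    threshold=1 gives Recall1 (at least "Hardly any" covered).
    """
    return sum(1 for j in judgments if j >= threshold) / len(judgments)


# Made-up judgments for a model with five MUs, for illustration only.
example = [SCALE[a] for a in ("All", "Most", "Some", "None", "Hardly any")]
coverage = average_coverage(example)
recalls = {k: recall_at(example, k) for k in (4, 3, 2, 1)}
```

By construction Recall4 <= Recall3 <= Recall2 <= Recall1, and none of these penalizes duplicate or unused peer material, which is the precision problem noted above.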
Average coverage across all summaries • Medians: baselines <= systems < humans • Lots of “outliers” • Best system summaries approach, equal, or exceed human models
Average coverage: singles vs multis • Relatively lower baseline and system summaries for multi-document summaries
Average coverage: among multis • Small improvement as size increases
Average coverage by system for singles • [Plot: groups Base, Humans, Systems; system codes shown: T R O Q P W Y V X S Z]
Average coverage by system for multis • [Plot: groups Bases, Humans, Systems; system codes shown: T N Y L P S R M O Z W U]
Average coverage by docset for 2 systems • Averages hide lots of variation by docset-assessor
Unmarked peer units: evaluation details • “Think of 3 categories of unmarked PUs: • really should be in the model in place of something already there • not good enough to be in the model, but at least relevant to the model’s subject • not even related to the model • Answer the following question for each category: [ All, Most, Some, Hardly any, or None ] of the unmarked PUs belong in this category. • Every PU should be accounted for in some category. • If there are no unmarked PUs, then answer each question with “None”. • If there is only one unmarked PU, then the answers can only be “All” or “None”.”
Unmarked peer units: assessor feedback • Many errors (illogical results) had to be corrected, e.g., if one question is answered “all”, then the others must be answered “none”. • Allow identification of duplicate information in the peer. • Very little peer material that deserved to be in the model in place of something there. • Assessors were possessive of their model formulations.
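The illogical answer combinations mentioned above could be caught automatically. The sketch below encodes only the constraints stated in the instructions to assessors; the function name and the wording of the messages are hypothetical, and this is not part of the SEE software.

```python
SCALE = ("All", "Most", "Some", "Hardly any", "None")


def check_unmarked_answers(answers, n_unmarked):
    """Return a list of problems for the three per-category answers.

    `answers` holds one scale value per category of unmarked PUs;
    `n_unmarked` is the number of unmarked PUs in the peer summary.
    """
    problems = []
    if len(answers) != 3 or any(a not in SCALE for a in answers):
        return ["expected three answers from the 5-value scale"]
    if n_unmarked == 0 and any(a != "None" for a in answers):
        problems.append("no unmarked PUs: every answer must be 'None'")
    if n_unmarked == 1 and any(a not in ("All", "None") for a in answers):
        problems.append("one unmarked PU: answers can only be 'All' or 'None'")
    if "All" in answers and (
        answers.count("All") > 1 or any(a not in ("All", "None") for a in answers)
    ):
        problems.append("one category is 'All': the others must be 'None'")
    if n_unmarked > 0 and all(a == "None" for a in answers):
        problems.append("every unmarked PU must be accounted for in some category")
    return problems
```

Running such a check at entry time, rather than during later correction, would let the assessor fix the answers while the summary is still on screen.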
Phase 2 initial results • Designed to gauge the effect of different models • Restricted to 50- and 200-word multi-document summaries • Assessor used 2 models created by other authors • Within-assessor differences mostly very small: • Mean = 0.020 • Std = 0.55 • Still want to compare to original judgments…
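One way to compute within-assessor differences like those above is sketched here. It assumes each difference is taken between the scores one assessor gave the same peer summary against the two different models; the slides do not spell out the exact definition, and the scores below are invented for illustration.

```python
from statistics import mean, stdev


def paired_differences(scores_model_1, scores_model_2):
    """Per-peer score difference between judgments made against two models."""
    if len(scores_model_1) != len(scores_model_2):
        raise ValueError("both lists must cover the same peer summaries")
    return [a - b for a, b in zip(scores_model_1, scores_model_2)]


# Made-up scores for three peers, NOT DUC-2001 data.
diffs = paired_differences([2.0, 1.5, 3.0], [1.5, 2.0, 3.0])
summary = {"mean": mean(diffs), "std": stdev(diffs)}
```

A mean near zero with a much larger standard deviation, as reported above, means individual judgments can shift noticeably with the model even though the shifts largely cancel on average.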