Multi-Document Summary Space: What do People Agree is Important? John M. Conroy, Institute for Defense Analyses, Center for Computing Sciences, Bowie, MD
Outline • Problem statement. • Human Summaries. • Oracle Estimates. • Algorithms.
Query-Based Multi-document Summarization • User types query. • Relevant documents are retrieved. • Retrieved documents are clustered. • Summaries for each cluster are displayed.
Recent Evaluation and Problem Definition Efforts • Document Understanding Conferences (DUC) • 2001-2004: 100-word generic summaries. • 2005-2006: 250-word “focused” summaries. • http://duc.nist.gov/ • Multi-lingual Summarization Evaluation (MSE), 2005-2006. • Given a cluster of translated documents and English documents, produce a 100-word summary. • http://www.isi.edu/~cyl/MTSE2005/
Overview of Techniques • Linguistic Tools (find sentence boundaries, shorten sentences, extract features). • Part of speech. • Parsing. • Entity Extraction. • Bag of words, position in document. • Statistical Classifier. • Linear classifiers. • Bayesian methods, HMM, SVM, etc. • Redundancy Removal. • Maximum marginal relevance (MMR); see the sketch below. • QR.
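MMR is only named in the list above; purely as a hedged illustration of the idea (not the method used in this talk), here is a minimal sketch that greedily trades query relevance against similarity to already-selected sentences. The cosine helper, the sentence/query vectors, and the lambda value are assumptions.

```python
import numpy as np

def mmr_select(sent_vecs, query_vec, k, lam=0.7):
    """Greedy maximum marginal relevance: balance query relevance
    against similarity to sentences already selected."""
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0

    selected, remaining = [], list(range(len(sent_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cos(sent_vecs[i], query_vec)
            redundancy = max((cos(sent_vecs[i], sent_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```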
Sample Data: DUC 2005 • 50 topics. • 25 to 50 relevant documents per topic. • 4 or 9 human summaries per topic.
Linguistic Processing • Use heuristic patterns to find phrases/clauses/words to eliminate • Shallow processing • Value of full sentence elimination?
Linguistic Processing • Phrase elimination • Gerund phrases Example: “Suicide bombers targeted a crowded open-air market Friday, setting off blasts that killed the two assailants, injured 21 shoppers and passersby and prompted the Israeli Cabinet to put off action on ….”
Example Topic Description Title: Reasons for Train Wrecks Narrative: What causes train wrecks and what can be done to prevent them? Train wrecks are those events that result in actual damage to the trains themselves, not just accidents where people are killed or injured. Type: General
Example Human Summary Train wrecks are caused by a number of factors: human, mechanical and equipment errors, spotty maintenance, insufficient training, load shifting, vandalism, and natural phenomenon. The most common types of mechanical and equipment errors are: brake failures, signal light and gate failures, track defects, and rail bed collapses. Spotty maintenance is characterized by failure to consistently inspect and repair equipment. Lack of electricians and mechanics results in letting equipment run down until someone complains. Engineers are often unprepared to detect or prevent operating problems because of the lack of follow-up training needed to handle updated high technology equipment. Load shiftings derail trains when a curve is taken too fast or there is a track defect. Natural phenomenon such as heavy fog, torrential rain, or floods causes some accidents. Vandalism in the form of leaving switches open or stealing parts from them leads to serious accidents. Human errors may be the most common cause of train accidents. Cars and trucks carelessly crossing or left on tracks cause frequent accidents. Train crews often make inaccurate tonnage measurements that cause derailments or brake failures, fail to heed single-track switching precautions, make faulty car hook-ups, and, in some instances, operate locomotives while under the influence of alcohol or drugs. Some freak accidents occur when moving trains are not warned about other trains stalled on the tracks. Recommendations for preventing accidents are: increase the number of inspectors, improve emergency training procedures, install state-of-the-art warning, control, speed and weight monitoring mechanisms, and institute closer driver fitness supervision.
Another Example Topic Title: Human Toll of Tropical Storms • What has been the human toll in death or injury of tropical storms in recent years? Where and when have each of the storms caused human casualties? What are the approximate total number of casualties attributed to each of the storms? • Granularity: Specific
Example Human Summary • January 1989 through October 1994 tolled 641,257 tropical storm deaths and 5,277 injuries world-wide. • In May 1991, Bangladesh suffered 500,000 deaths; 140,000 in March 1993; and 110 deaths and 5,000 injuries in May 1994. • The Philippines had 29 deaths in July 1989 and 149 in October; 30 in June 1990, 13 in August and 14 in November. • South Carolina had 18 deaths and two injuries in October 1989; 29 deaths in April 1990 and three in October. • North Carolina had one death in July 1989 and three in October 1990. • Louisiana had three deaths in July 1989; and two deaths and 75 injuries in August 1992. • Georgia had three deaths in October 1990 and 19 in July 1994. • Florida had 15 in August 1992. • Alabama had one in July 1994. • Mississippi had five in July 1989. • Texas had four in July 1989 and two in October. • September 1989 Atlantic storms killed three. • The Bahamas had four in August 1992. • The Virgin Islands had five in December 1990. • Mexico had 19 in July 1993. • Martinique had six in October 1990 and 10 injuries in August 1993. • September 1993 Caribbean storms killed three Puerto Ricans and 22 others. • China had 48 deaths and 190 injuries in September 1989, and 216 deaths in August 1990. • Taiwan had 30 in October 1994. • In September 1990, Japan had 15 and Vietnam had 10. • Nicaragua had 116 in January 1989. • Venezuela had 300 in August 1993.
Evaluation of Summaries Ideally each machine summary would be judged by multiple humans for 1. Responsiveness to the query. 2. Cohesiveness, grammar, etc. Reality: This would take too much time! Plan: Use an automatic metric that correlates at 90-97% with human responsiveness judgments (a minimal ROUGE-style sketch follows below).
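The automatic metric in question is a ROUGE variant; as an illustration only (not the official toolkit, which adds stemming, stop-word handling, and multi-reference jackknifing), a minimal unigram-recall sketch might look like this.

```python
from collections import Counter

def rouge1_recall(candidate_tokens, reference_tokens):
    """Clipped unigram overlap with the reference, divided by reference length."""
    cand, ref = Counter(candidate_tokens), Counter(reference_tokens)
    overlap = sum(min(count, cand[tok]) for tok, count in ref.items())
    return overlap / max(sum(ref.values()), 1)
```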
Frequency and Summarization • Ani Nenkova (Columbia) and Lucy Vanderwende (Microsoft) report: • High-frequency content words in the documents correlate with the words humans choose for their summaries. • SumBasic, a simple method based on this principle, produces “state of the art” generic summaries, e.g., DUC 04 and MSE 05; a simplified sketch follows below. • See also Van Halteren and Teufel 2003, Radev et al. 2003, Copeck and Szpakowicz 2004.
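A simplified SumBasic-style sketch, assuming plain whitespace tokenization and no stop-word removal; the published method additionally prefers sentences containing the current highest-probability word, which this sketch omits.

```python
from collections import Counter

def sumbasic(sentences, max_words=100):
    """Greedy frequency-based selection: score sentences by the average
    probability of their words, then down-weight words already covered."""
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for sent in tokenized for w in sent)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    chosen, length = [], 0
    remaining = list(range(len(sentences)))
    while remaining and length < max_words:
        best = max(remaining,
                   key=lambda i: sum(prob[w] for w in tokenized[i]) /
                                 max(len(tokenized[i]), 1))
        chosen.append(sentences[best])
        length += len(tokenized[best])
        remaining.remove(best)
        for w in tokenized[best]:          # squaring discourages redundancy
            prob[w] = prob[w] ** 2
    return chosen
```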
What is Summary Space? • Is there enough information in the documents to approach human performance as measured by ROUGE? • Do humans abstract so much that extracts don’t suffice? • Is a unigram distribution enough?
A Candidate • Suppose an oracle gave us: • Pr(t)=Probability that a human will choose term t to be included in a summary. • t is a non-stop word term. • Estimate based on our data. • E.g., 0, 1/4, 1/2, 3/4, or 1 if 4 human summaries are provided.
A Simple Oracle Score • Generate extracts: • Score sentences by the expected percentage of abstract terms they contain (sketch below). • Discard sentences that are too short or too long. • Pivoted QR to remove redundancy.
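A minimal sketch of the oracle score as described: Pr(t) is estimated as the fraction of the human abstracts containing term t, and each sentence is scored by the expected fraction of its terms that are abstract terms. The tokenization, stop-word handling, and length cutoffs here are assumptions for illustration.

```python
def oracle_pr(human_summaries, stop_words=frozenset()):
    """Pr(t): fraction of human abstracts whose term set contains t."""
    term_sets = [set(s.lower().split()) - stop_words for s in human_summaries]
    vocab = set().union(*term_sets)
    return {t: sum(t in ts for ts in term_sets) / len(term_sets) for t in vocab}

def oracle_score(sentence, pr, stop_words=frozenset(), min_len=8, max_len=60):
    """Expected fraction of a sentence's terms that humans would use;
    sentences outside the length window are discarded (score 0)."""
    terms = [w for w in sentence.lower().split() if w not in stop_words]
    if not (min_len <= len(terms) <= max_len):
        return 0.0
    return sum(pr.get(t, 0.0) for t in terms) / len(terms)
```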
Approximate Pr(t) • Two bits of Information: • Topic Description. • Extract query phrases. • Documents Retrieved. • Extract terms which are indicative or give the “signature” of the documents.
Query Terms • Given the topic description: • Tag it for part of speech. • Take any NN (noun), VB (verb), JJ (adjective), RB (adverb), and multi-word groupings of NNP (proper nouns). • E.g., train, wrecks, train wrecks, causes, prevent, events, result, actual, actual damage, trains, accidents, killed, injured. • A sketch of this step follows below.
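A sketch of the query-term extraction step using NLTK's off-the-shelf tagger; the exact handling of multi-word NNP groupings and the lowercasing are assumptions, not details from the talk (requires the punkt and averaged_perceptron_tagger NLTK data).

```python
import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def query_terms(topic_description):
    """Keep nouns, verbs, adjectives, and adverbs; join runs of proper nouns."""
    tagged = nltk.pos_tag(nltk.word_tokenize(topic_description))
    terms, proper_run = [], []
    for word, tag in tagged:
        if tag.startswith('NNP'):
            proper_run.append(word)
            continue
        if proper_run:                       # close a multi-word NNP grouping
            terms.append(' '.join(proper_run))
            proper_run = []
        if tag.startswith(('NN', 'VB', 'JJ', 'RB')):
            terms.append(word.lower())
    if proper_run:
        terms.append(' '.join(proper_run))
    return terms
```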
Signature Terms • Term: a space-delimited string of characters from {a,b,c,…,z}, after the text is lower-cased and all other characters and stop words are removed. • Need to restrict our attention to indicative terms (signature terms). • Terms that occur more often than expected.
Signature Terms Terms that occur more often than expected • Based on a 2×2 contingency table of relevance counts. • Log-likelihood ratio; equivalent to mutual information. • Dunning 1993; Lin and Hovy 2000.
Hypothesis Testing
H0: P(C | ti) = p = P(C | ~ti)
H1: P(C | ti) = p1 ≠ p2 = P(C | ~ti)
Use maximum-likelihood estimates of p, p1, and p2.
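A sketch of Dunning's log-likelihood ratio statistic on the 2×2 table of term counts inside versus outside the relevant cluster; the chi-square cutoff of 10.83 (roughly p = 0.001 at 1 degree of freedom) is a common choice and an assumption here, not a value taken from the talk.

```python
import math

def log_likelihood_ratio(k1, n1, k2, n2):
    """-2 log lambda for H0: p1 = p2 vs H1: p1 != p2, where k1/n1 are the
    term count and total count in the relevant cluster and k2/n2 in the
    background collection."""
    def binom_ll(k, n, p):
        if p <= 0.0 or p >= 1.0:
            return 0.0                      # convention 0*log(0) = 0
        return k * math.log(p) + (n - k) * math.log(1.0 - p)

    p = (k1 + k2) / (n1 + n2)
    p1, p2 = k1 / n1, k2 / n2
    return 2.0 * (binom_ll(k1, n1, p1) + binom_ll(k2, n2, p2)
                  - binom_ll(k1, n1, p) - binom_ll(k2, n2, p))

def is_signature_term(k1, n1, k2, n2, threshold=10.83):
    return log_likelihood_ratio(k1, n1, k2, n2) > threshold
```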
Example Signature Terms accident accidents ammunition angeles avenue beach bernardino blamed board boulevard boxcars brake brakes braking cab car cargo cars caused cc cd collided collision column conductor coroner crash crew crews crossing curve derail derailed desk driver edition emergency engineer engineers equipment failures fe fog freight ft grade holland injured injuries investigators killed line loaded locomotives los maintenance mechanical metro miles nn ntsb occurred pacific page part passenger path photo pipeline rail railroad railroads railway runaway safety san santa scene seal shells sheriff signals southern speed staff station switch track tracks train trains transportation truck weight westminster words workers wreck yard yesterday
An Approximation of Pr(t) • For a given data set and topic description: • Let Q be the set of query terms. • Let S be the set of signature terms. • Estimate Pr(t) = (χQ(t) + χS(t)) / 2, where χA(t) = 1 if t ∈ A and 0 otherwise.
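The same estimate in code form; query_terms and signature_terms are assumed to be the sets produced by the two preceding steps.

```python
def approximate_pr(term, query_terms, signature_terms):
    """Pr(t) ~ (chi_Q(t) + chi_S(t)) / 2, the average of two indicator functions."""
    return ((term in query_terms) + (term in signature_terms)) / 2.0
```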
Our Approach • Use the expected abstract-word score to select candidate sentences (about 2w words, where w is the target summary length). • Terms as sentence features: • Terms {t1, …, tm} index the rows and sentences {s1, …, sn} the columns, giving a term-sentence matrix in R^(m×n). • Scaling: each sentence column is scaled by its score. • Use pivoted QR to select sentences.
Redundancy Removal • Pivoted QR (sketch below). • Choose the column with maximum norm (aj). • Subtract the components along aj from the remaining columns, i.e., the remaining columns are made orthogonal to the chosen column. • Stop criterion: the chosen sentences (columns) total ~w words, drawn from the ~2w-word candidate set. • Removes semantic redundancy.
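A minimal pivoted-QR sketch of the selection step: pick the column of largest norm, orthogonalize the remaining columns against it, and stop once the chosen sentences reach the word budget. The column scaling by score is assumed to have been applied beforehand, and the word counting is an assumption for illustration.

```python
import numpy as np

def pivoted_qr_select(A, sentence_word_counts, word_budget):
    """A is an m x n term-sentence matrix (columns already scaled by score).
    Returns the indices of the selected sentences in selection order."""
    A = A.astype(float).copy()
    selected, words = [], 0
    available = set(range(A.shape[1]))
    while available and words < word_budget:
        j = max(available, key=lambda c: np.linalg.norm(A[:, c]))
        norm = np.linalg.norm(A[:, j])
        if norm == 0.0:
            break
        q = A[:, j] / norm
        for c in available:                 # orthogonalize remaining columns
            if c != j:
                A[:, c] -= (q @ A[:, c]) * q
        selected.append(j)
        words += sentence_word_counts[j]
        available.remove(j)
    return selected
```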
Conclusions • Pr(t), the oracle score, produces summaries which “please everyone.” • A simple estimate of Pr(t), induced by query and signature terms, gives rise to a top-scoring system.
Future Work • Better estimates for Pr(t). • Pseudo-relevance feedback. • LSI or similar dimension-reduction tricks? • Ordering of sentences for readability is important (with Dianne O’Leary). • A 250-word summary has approximately 12 sentences. • Two directions in linguistic preprocessing: • Eugene Charniak’s parser (with Bonnie Dorr and David Zajic). • Simple rule-based “POS lite” (Judith Schlesinger).