Automatic Summarization using IE Technologies and Pattern Discovery

NYU/CRL system for DUC and Prospect for Single Document Summaries. September 14, 2001, DUC2001 Workshop. Satoshi Sekine (New York University), Chikashi Nobata (CRL – Japan)

Objective • Use IE technologies for Summarization • Named Entity • Automatic pattern discovery: find important phrases (patterns) of the domain • Combine with Summarization technologies • Important Sentence Extraction • Sentence position, length, TF/IDF, Headline

Important Sentence Extraction • Combining 5 scores • Sentence position • Sentence length • TF/IDF • Similarity to Headline • Pattern • Optimize functions/weights on training data
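The slide does not say how the five scores are combined. A minimal sketch, assuming a weighted linear sum and a fixed number `k` of extracted sentences (the names `combined_score`, `extract_top`, and the feature keys are illustrative, not taken from the system):

```python
from typing import Dict, List

# The five per-sentence scores named on the slide.
FEATURES = ("position", "length", "tfidf", "headline", "pattern")


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the five scores (the linear form is an assumption)."""
    return sum(weights[name] * scores[name] for name in FEATURES)


def extract_top(sentences: List[Dict[str, float]],
                weights: Dict[str, float], k: int) -> List[int]:
    """Indices of the k best-scoring sentences, returned in document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: combined_score(sentences[i], weights),
                    reverse=True)
    return sorted(ranked[:k])
```

Returning the selected indices in document order keeps the extracted sentences readable as a summary.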

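Alternative scores for Sentence position • Score = 1 (i < T), 0 (otherwise) • Score = 1/i • Score = max(1/i, 1/(n-i+1)) • (Figure: Score plotted against Sentence position, with positions 1, T, and n marked on the horizontal axis)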
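The three position scores on this slide (a threshold, 1/i, and max(1/i, 1/(n-i+1))) can be written as small functions. This is a sketch that assumes i is the 1-based position of the sentence, n the number of sentences in the document, and T a threshold; the function names are illustrative:

```python
def position_threshold(i: int, T: int) -> float:
    """1 for sentences before position T, 0 otherwise."""
    return 1.0 if i < T else 0.0


def position_inverse(i: int) -> float:
    """Score decays with distance from the start of the document."""
    return 1.0 / i


def position_both_ends(i: int, n: int) -> float:
    """Favors sentences near either the start or the end of the document."""
    return max(1.0 / i, 1.0 / (n - i + 1))
```

For a 10-sentence document, `position_both_ends` gives 1.0 to both the first and the last sentence and its lowest values in the middle.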

Alternative scores for Sentence length & TF/IDF • Sentence length 1. Score = Length 2. Score = Length (if Length > C), Length – C (otherwise) • TF/IDF: TF = tf(w), (tf(w)-1)/tf(w), or tf(w)/(tf(w)+1)
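The length and TF alternatives listed on this slide translate directly into code. A sketch with illustrative function names; the slide does not give the IDF part, so only the TF variants are shown:

```python
def length_plain(length: int) -> float:
    """Alternative 1: the score is the sentence length itself."""
    return float(length)


def length_penalized(length: int, C: int) -> float:
    """Alternative 2: sentences no longer than C get Length - C, which is <= 0."""
    return float(length) if length > C else float(length - C)


def tf_raw(tf: int) -> float:
    """TF = tf(w)."""
    return float(tf)


def tf_minus_one(tf: int) -> float:
    """TF = (tf(w) - 1) / tf(w); a word seen once contributes 0."""
    return (tf - 1) / tf


def tf_saturating(tf: int) -> float:
    """TF = tf(w) / (tf(w) + 1); approaches 1 as tf grows."""
    return tf / (tf + 1)
```

Both non-raw TF variants stay below 1, so a single very frequent word cannot dominate a sentence score.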

Alternative scores for Headline • TF/IDF ratio between the words that overlap with the headline and all words in the sentence • TF ratio between the overlapping Named Entities (NEs) and all NEs in the sentence, with TF = tf(e)/(1+tf(e))
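Both headline scores are ratios of weighted overlap to total weight in the sentence. A sketch, assuming each distinct word or NE is counted once per sentence and that its weight (TF/IDF for words, tf(e)/(1+tf(e)) for NEs) has been computed beforehand; `headline_ratio` and `ne_weight` are illustrative names:

```python
from typing import Dict, Iterable, Set


def headline_ratio(sentence_items: Iterable[str],
                   headline_items: Set[str],
                   weight: Dict[str, float]) -> float:
    """Weight of items shared with the headline over weight of all sentence items."""
    items = set(sentence_items)  # assumption: count each distinct item once
    total = sum(weight.get(x, 0.0) for x in items)
    if total == 0:
        return 0.0
    overlap = sum(weight.get(x, 0.0) for x in items if x in headline_items)
    return overlap / total


def ne_weight(tf: int) -> float:
    """TF weight for a named entity e: tf(e) / (1 + tf(e))."""
    return tf / (1 + tf)
```

The same function serves both variants: pass words with TF/IDF weights for the first score, or NEs with `ne_weight` values for the second.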

Pattern • Assumption: patterns (phrases) that appear often in the domain are important • Strategy • Intended to use IR to find a larger set of documents in the domain, but used the given document set • NEs were treated as classes rather than as literal strings

Pattern discovery • Procedure • Analyze sentences (NE, dependency) • Extract all sub-trees from the dependency trees in the domain • Score the trees based on frequency of the tree and TF/IDF of the words • High score trees are regarded as important patterns
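The procedure above can be sketched as follows. The tree representation (nested `(label, children)` tuples with NEs already replaced by class labels) and the scoring formula (frequency times the summed word weights) are assumptions made for illustration; the slide only says that frequency and TF/IDF are both used.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

Tree = Tuple[str, tuple]  # (label, (child, child, ...))


def rooted_subtrees(node: Tree) -> List[Tree]:
    """All connected subtrees that keep `node` as their root."""
    label, children = node
    # Each child is either dropped (None) or kept as one of its rooted subtrees.
    options = [[None] + rooted_subtrees(child) for child in children]
    return [(label, tuple(c for c in choice if c is not None))
            for choice in product(*options)]


def all_subtrees(node: Tree) -> List[Tree]:
    """Rooted subtrees of every node in the dependency tree."""
    found = rooted_subtrees(node)
    for child in node[1]:
        found.extend(all_subtrees(child))
    return found


def labels(tree: Tree) -> List[str]:
    """All word or NE-class labels in a subtree."""
    out = [tree[0]]
    for child in tree[1]:
        out.extend(labels(child))
    return out


def score_patterns(trees: List[Tree],
                   word_weight: Dict[str, float]) -> Dict[Tree, float]:
    """Assumed score: pattern frequency times the sum of its word weights."""
    freq = Counter(p for t in trees for p in all_subtrees(t))
    return {p: n * sum(word_weight.get(w, 0.0) for w in labels(p))
            for p, n in freq.items()}
```

The number of subtrees grows exponentially with the branching of the tree, so a practical version would need a size limit or pruning; this sketch omits that.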

Optimal weight • Optimal weights are found on training set • Contribution
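The slide does not state how the optimal weights were searched for. One simple possibility is an exhaustive grid search; the sketch below assumes a caller-supplied `evaluate` function (hypothetical) that returns a quality measure on the training set for a given weight setting, higher being better:

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

FEATURES = ("position", "length", "tfidf", "headline", "pattern")


def grid_search(evaluate: Callable[[Dict[str, float]], float],
                grid: Sequence[float]) -> Tuple[Dict[str, float], float]:
    """Try every combination of grid values and keep the best-scoring weights."""
    best_weights, best_value = None, float("-inf")
    for combo in product(grid, repeat=len(FEATURES)):
        weights = dict(zip(FEATURES, combo))
        value = evaluate(weights)
        if value > best_value:  # ties keep the first setting found
            best_weights, best_value = weights, value
    return best_weights, best_value
```

With five features the search costs `len(grid) ** 5` evaluations, so the grid has to stay coarse.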

Evaluation Result • Subjective evaluation (V; out of 12) • Average over all documents

Prospect for Single Document Summaries • Important Sentence Extraction CAN be Summarization, but Summarization is NOT Important Sentence Extraction

DUC • We are aiming for Document understanding • How can understanding be instantiated? • Make summary • Extract essential point, principal relations • Answer questions • Comprehension test

Example Earthquake jolts Los Angeles area LOS ANGELES (AP) — An earthquake shook the greater Los Angeles area Sunday, but there were no immediate reports of damage or injuries. The quake had a preliminary magnitude of 4.2 and was centered about one mile southeast of West Hollywood, said Lucy Jones of the U.S. Geological Survey. The quake was felt in downtown Los Angeles where it rolled for about four seconds and also shook in the suburban areas of Van Nuys, Whittier and Glendale.

Essential points • Event (Earthquake) • When: Sunday, September 9, 2001 • Where: greater Los Angeles area • Magnitude: 4.2 • Injury: No • Death: No • Damage: No
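Essential points like these amount to a filled template. As a rough illustration only, the sketch below fills three of the slots from an excerpt of the example article using two hand-written regular expressions; the class `EarthquakeEvent`, the function `fill_template`, and both patterns are hypothetical and are not the patterns discovered by the system.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EarthquakeEvent:
    when: Optional[str] = None
    where: Optional[str] = None
    magnitude: Optional[float] = None


# Excerpt of the example article.
TEXT = ("An earthquake shook the greater Los Angeles area Sunday, but there were "
        "no immediate reports of damage or injuries. The quake had a preliminary "
        "magnitude of 4.2 and was centered about one mile southeast of West Hollywood.")


def fill_template(text: str) -> EarthquakeEvent:
    """Fill template slots with hand-written patterns (illustrative only)."""
    event = EarthquakeEvent()
    m = re.search(r"magnitude of (\d+(?:\.\d+)?)", text)
    if m:
        event.magnitude = float(m.group(1))
    m = re.search(r"shook the (.+? area) (\w+day)", text)
    if m:
        event.where, event.when = m.group(1), m.group(2)
    return event
```

Slots such as Injury, Death, and Damage would need patterns that handle negation ("no immediate reports of ..."), which this sketch leaves out.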

How can we make it • IE is a hint (a step) • IE is a version of document understanding limited to a specific domain and task which are given in advance • Document understanding can be achieved by upgrading IE technologies by deleting “specific” and “given in advance”

Our approach • Essential points can be found by searching frequently mentioned patterns in the same domain • Strategy • Given a document, find its domain by IR • Find frequently mentioned patterns • Extract information matching those patterns

Single Document Summarization • Has to be continued • To pursue research on “Understanding” • To find something more than sentence extraction • To observe humans in the summarization task • To have newcomers (like us)
