This paper explores the use of Information Extraction (IE) technologies for text summarization by combining Named Entity recognition and automatic pattern discovery. The goal is to extract important phrases and sentences from a document to create a concise summary. The paper also discusses the evaluation and optimization of scores for sentence position, length, TF/IDF, and similarity to headline.
NYU/CRL system for DUC and Prospect for Single Document Summaries • September 14, 2001 • DUC2001 Workshop • Satoshi Sekine (New York University), Chikashi Nobata (CRL – Japan)
Objective • Use IE technologies for Summarization • Named Entity • Automatic pattern discovery: find important phrases (patterns) of the domain • Combine with Summarization technologies • Important Sentence Extraction • Sentence position, length, TF/IDF, Headline
Important Sentence Extraction • Combining 5 scores • Sentence position • Sentence length • TF/IDF • Similarity to Headline • Pattern • Optimize functions/weights on training data
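As a minimal sketch of how the five scores might be combined, the Python below assumes a linear weighted sum. The slide only says that functions and weights are optimized on training data, so the combination form, the function names (`combine_scores`, `extract_top_sentences`), and any weights passed in are illustrative assumptions, not the system's actual values.

```python
from typing import Dict, List

# The five feature names follow the slide; the identifiers are hypothetical.
FEATURES = ("position", "length", "tfidf", "headline", "pattern")


def combine_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Assumed combination: weighted sum of the five per-sentence scores."""
    return sum(weights[name] * scores[name] for name in FEATURES)


def extract_top_sentences(
    per_sentence_scores: List[Dict[str, float]],
    weights: Dict[str, float],
    k: int,
) -> List[int]:
    """Return the indices of the k best sentences, in document order."""
    totals = [combine_scores(s, weights) for s in per_sentence_scores]
    ranked = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:k])
```

Returning the indices in document order keeps the extracted sentences readable as a summary; the weights would be tuned on the training set as the slide describes.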
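Alternative scores for Sentence position • 1. Score = 1 (if i < T), 0 (otherwise) • 2. Score = 1/i • 3. Score = max(1/i, 1/(n-i+1)) • [Figure: Score plotted against Sentence position i, from 1 to n, with threshold T marked]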
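The three position functions can be written out directly. This sketch assumes i is the 1-based position of the sentence and n is the number of sentences in the document, which is how the axis labels read; the function names are hypothetical.

```python
def position_first_t(i: int, t: int) -> float:
    """Alternative 1: score 1 for sentences before threshold T, else 0."""
    return 1.0 if i < t else 0.0


def position_reciprocal(i: int) -> float:
    """Alternative 2: score decays as 1/i from the start of the document."""
    return 1.0 / i


def position_both_ends(i: int, n: int) -> float:
    """Alternative 3: favor sentences near either end of the document."""
    return max(1.0 / i, 1.0 / (n - i + 1))
```

The third variant gives the first and last sentences the same score of 1, so it rewards closing sentences as well as leading ones.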
Alternative scores for Sentence length & TF/IDF • Sentence length 1. Score = Length 2. Score = Length (if Length > C), Length - C (otherwise) • TF/IDF: TF = tf(w), (tf(w)-1)/tf(w), or tf(w)/(tf(w)+1)
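A small sketch of the length and TF alternatives, under the assumption that Length is a count of words (or characters) and C is a tuned threshold. The slide does not show how IDF is computed or how word scores are summed per sentence, so only the TF variants are shown; the function names are hypothetical.

```python
def length_raw(length: int) -> float:
    """Alternative 1: the score is the sentence length itself."""
    return float(length)


def length_penalized(length: int, c: int) -> float:
    """Alternative 2: sentences not longer than C get Length - C,
    which is zero or negative and so acts as a penalty."""
    return float(length) if length > c else float(length - c)


def tf_raw(tf: int) -> float:
    """TF variant 1: the raw term frequency tf(w)."""
    return float(tf)


def tf_minus(tf: int) -> float:
    """TF variant 2: (tf(w) - 1) / tf(w); zero for words seen once."""
    return (tf - 1) / tf


def tf_plus(tf: int) -> float:
    """TF variant 3: tf(w) / (tf(w) + 1); approaches 1 as tf grows."""
    return tf / (tf + 1)
```

Both damped TF variants stay below 1, so a single very frequent word cannot dominate a sentence score the way the raw count can.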
Alternative scores for Headline • TF/IDF ratio between words that overlap with the headline and all words in the sentence • TF ratio between overlapping Named Entities (NEs) and all NEs in the sentence: TF = tf(e)/(1+tf(e))
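Both headline scores are ratios of weighted overlap, so one helper can cover them. The sketch below assumes the ratio is a sum of weights over overlapping items divided by the sum over all items in the sentence; the slide does not spell out that form, and `overlap_ratio` and `damped_tf` are hypothetical names.

```python
from typing import Callable, Iterable, List


def overlap_ratio(
    sentence_items: List[str],
    headline_items: Iterable[str],
    weight_of: Callable[[str], float],
) -> float:
    """Weight of sentence items also found in the headline,
    divided by the weight of all items in the sentence."""
    headline = set(headline_items)
    total = sum(weight_of(x) for x in sentence_items)
    if total == 0:
        return 0.0  # empty sentence or all-zero weights
    shared = sum(weight_of(x) for x in sentence_items if x in headline)
    return shared / total


def damped_tf(tf: int) -> float:
    """NE weight from the slide: TF = tf(e) / (1 + tf(e))."""
    return tf / (1 + tf)
```

For the word-based score, `weight_of` would return a TF/IDF weight; for the NE-based score, it would return `damped_tf` of the entity's frequency.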
Pattern • Assumption: patterns (phrases) that appear often in the domain are important • Strategy • Intended to use IR to find a larger set of documents in the domain, but used the given document set • NEs were treated as classes rather than as literal strings
Pattern discovery • Procedure • Analyze sentences (NE, dependency) • Extract all sub-trees from the dependency trees in the domain • Score the trees based on the frequency of the tree and the TF/IDF of the words • High-scoring trees are regarded as important patterns
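The procedure can be sketched on toy trees. This is not the system's implementation: the tree encoding (a label plus a tuple of child trees), the NE class labels such as `EVENT` and `LOCATION`, and above all the scoring formula (frequency multiplied by the summed word weights) are assumptions, since the slide only says the score depends on tree frequency and word TF/IDF.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

# A tree is (label, children); children is a tuple of trees (hashable).
Tree = Tuple[str, tuple]


def rooted_subtrees(tree: Tree) -> List[Tree]:
    """All connected sub-trees that keep this node as their root."""
    label, children = tree
    # For each child: either drop it (None) or keep one of its rooted sub-trees.
    options = [[None] + rooted_subtrees(child) for child in children]
    result = []
    for choice in product(*options):
        kept = tuple(sorted(c for c in choice if c is not None))
        result.append((label, kept))
    return result


def all_subtrees(tree: Tree) -> List[Tree]:
    """All connected sub-trees rooted at any node of the tree."""
    _, children = tree
    found = rooted_subtrees(tree)
    for child in children:
        found.extend(all_subtrees(child))
    return found


def labels(pattern: Tree) -> List[str]:
    """Flatten a pattern into the list of its node labels."""
    label, children = pattern
    out = [label]
    for child in children:
        out.extend(labels(child))
    return out


def score_patterns(trees: List[Tree], word_weight: Dict[str, float]) -> Dict[Tree, float]:
    """Assumed score: pattern frequency times the sum of its word weights."""
    freq = Counter(p for t in trees for p in all_subtrees(t))
    return {
        p: n * sum(word_weight.get(w, 0.0) for w in labels(p))
        for p, n in freq.items()
    }
```

The number of sub-trees grows exponentially with the branching of the tree, so a real system would need to cap pattern size or prune rare patterns; the sketch omits that.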
Optimal weight • Optimal weights are found on training set • Contribution
Evaluation Result • Subjective evaluation (V; out of 12) • Average over all documents
Prospect for Single Document Summaries Important Sentence Extraction CAN be Summarization but Summarization is NOT Important Sentence Extraction
DUC • We are aiming for Document understanding • How can understanding be instantiated? • Make summary • Extract essential points, principal relations • Answer questions • Comprehension test
Example Earthquake jolts Los Angeles area LOS ANGELES (AP) — An earthquake shook the greater Los Angeles area Sunday, but there were no immediate reports of damage or injuries. The quake had a preliminary magnitude of 4.2 and was centered about one mile southeast of West Hollywood, said Lucy Jones of the U.S. Geological Survey. The quake was felt in downtown Los Angeles where it rolled for about four seconds and also shook in the suburban areas of Van Nuys, Whittier and Glendale.
Essential points • Event (Earthquake) • When: Sunday, September 9, 2001 • Where: greater Los Angeles area • Magnitude: 4.2 • Injury: No • Death: No • Damage: No
How can we achieve it? • IE is a hint (a step) • IE is a version of document understanding limited to a specific domain and task, both given in advance • Document understanding can be achieved by upgrading IE technologies to remove the “specific” and “given in advance” restrictions
Our approach • Essential points can be found by searching for frequently mentioned patterns in the same domain • Strategy • Given a document, find its domain by IR • Find frequently mentioned patterns • Extract information matching those patterns
Single Document Summarization • Has to be continued • To pursue research on “Understanding” • To find something more than sentence extraction • To observe humans in the summary task • To have newcomers (like us)