Some interesting directions in Automatic Summarization
Annie Louis
CIS 430, 12/02/08
Today’s lecture
• Multi-strategy summarization: is one method enough?
• Performance confidence estimation: it would be nice to have an indication of expected system performance on an input
• Evaluation without human models: can we come up with cheap and fast evaluation measures?
• Beyond generic summarization: query-focused, update, blog, meeting, and speech summarization
Relevant papers:
• Lacatusu et al. LCC's GISTexter at DUC 2006: Multi-Strategy Multi-Document Summarization. In Proceedings of the Document Understanding Conference (DUC 2006), 2006.
• McKeown et al. Columbia Multi-Document Summarization: Approach and Evaluation. In Proceedings of the Document Understanding Conference (DUC 2001), 2001.
• Nenkova et al. Can You Summarize This? Identifying Correlates of Input Difficulty for Multi-Document Summarization. In Proceedings of ACL-08: HLT, 2008.
More about the DUC 2002 data…
• /project/cis/nlp/tools/Summarization_Data/Inputs2002
• Newswire texts
• Has 3 categories of inputs
DUC 2002 input categories
• Single event – 30 inputs. E.g., d061, Hurricane Gilbert: same place, roughly the same time, same actions.
• Multiple distinct events – 15 inputs. E.g., d064, openings of McDonald's in Russia, Canada, South Korea, ...: different places, different times, different agents.
• Biographies – 15 inputs. E.g., d065, Dan Quayle, Bush's nominee for vice president: one person, one event, plus background information and events from the past.
Do you think a single method will do well for all of these?
Tf-idf summary – d061
Hurricane Gilbert Heads Toward Dominican Coast . Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night. Gilbert Reaches Jamaican Capital With 110 Mph Winds . Hurricane warnings were posted for the Cayman Islands, Cuba and Haiti. Hurricane Hits Jamaica With 115 mph Winds; Communications. Gilbert reached Jamaica after skirting southern Puerto Rico, Haiti and the Dominican Republic. Gilbert was moving west-northwest at 15 mph and winds had decreased to 125 mph. What Makes Gilbert So Strong? With PM-Hurricane Gilbert, Bjt . Hurricane Gilbert Heading for Jamaica With 100 MPH Winds . Tropical Storm Gilbert
Tf-idf summary – d064
First McDonald's to Open in Communist Country . Police Keep Crowds From Crashing First McDonald's . McDonald's and Genex contribute $1 million each for the flagship restaurant. A Bolshoi Mac Attack in Moscow as First McDonald's Opens . McDonald's Opens First Restaurant in China . McDonald's hopes to open a restaurant in Beijing later. The 500-seat McDonald's restaurant in a three-story building is operated by McDonald's Restaurant Shenzhen Ltd., a wholly owned subsidiary of McDonald's Hong Kong. McDonald's Hong Kong is a 50-50 joint venture with McDonald's in the United States. McDonald's officials say it is not a question that
Tf-idf summary - d065 Tucker was fascinated by the idea, Quayle said. But Dan Quayle's got experience, too. Quayle's Triumph Quickly Tarnished . Quayle's Biography Inflates State Job; Quayle Concedes Error . Her statement was released by the Quayle campaign. But he would go no further in describing what assignments he would give Quayle. ``I will be a very close adviser to the president,'' Quayle said. ``You're never going to see Dan Quayle telling tales out of It was everything Quayle had hoped for. Quayle had said very little and he had said it very well. There are windows into the workings of the
Multi-strategy summarization
• Multiple summarization modules within a single system; better than a single method
• How to employ a multi-strategy system?
  • Use all methods, produce multiple summaries, and choose the best
  • Use a router and summarize with only one specific method
Produce multiple summaries and choose – LCC's GISTexter
• Task: query-focused summarization
• The query is decomposed by 3 methods and sent to a QA system and a multi-document summarizer, giving 6 different summaries
• The best summary is selected using textual entailment and pyramid scoring
Route to a specific module – Columbia's multi-document summarizer
• Features classify an input as: single event, biography, or loosely connected documents
• The result of the classification is used to route the input to one of 3 different summarizers
Features – single event
• To identify:
  • Time span between publication dates < 80 days
  • More than 50% of documents published on the same day
• To summarize:
  • Exploit redundancy: cluster similar sentences into themes
  • Rank themes by size, similarity, and the lexical-chain ranking of the sentences they contain
  • Select phrases from each theme and generate sentences
Features – biographies
• To identify:
  • Frequency of the most frequent capitalized word > X (compensating for named entities)
  • Frequency of personal pronouns > Y
• To summarize, score sentences by:
  • Is the target individual mentioned in the sentence?
  • Is another individual found in the sentence?
  • Position of the most prominent capitalized word in the sentence
Features – weakly related documents
• To identify: neither single event nor biographical
• To summarize:
  • Words likely to be used in first paragraphs, i.e., important words, learned from corpus analysis
  • Verb specificity
  • Semantic themes: WordNet concepts
  • Positional and length features
  • More weight to recent articles
  • Downweight sentences with pronouns
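Taken together, the identification features on the last three slides amount to a small rule-based router. A sketch under stated assumptions: the thresholds X and Y are not given in the slides (the values below are placeholders), and how the two single-event conditions combine is also an assumption.

```python
import re
from collections import Counter

# Placeholder thresholds; the actual values of "X" and "Y" are not
# given in the slides.
CAP_FREQ_THRESHOLD = 10   # X: count of the most frequent capitalized word
PRONOUN_THRESHOLD = 15    # Y: count of personal pronouns

PRONOUNS = {'he', 'she', 'him', 'her', 'his', 'hers'}

def route(documents, pub_dates):
    """Classify an input as 'single_event', 'biography', or
    'weakly_related'; pub_dates is a list of datetime.date objects."""
    # Single event: short span of publication dates, or most
    # documents published on the same day.
    span_days = (max(pub_dates) - min(pub_dates)).days
    same_day_fraction = max(Counter(pub_dates).values()) / len(pub_dates)
    if span_days < 80 or same_day_fraction > 0.5:
        return 'single_event'

    # Biography: one dominant capitalized word (a name) and many
    # personal pronouns.
    text = ' '.join(documents)
    cap_counts = Counter(re.findall(r'\b[A-Z][a-z]+\b', text))
    pronoun_count = sum(1 for w in re.findall(r"[A-Za-z']+", text)
                        if w.lower() in PRONOUNS)
    if cap_counts and max(cap_counts.values()) > CAP_FREQ_THRESHOLD \
            and pronoun_count > PRONOUN_THRESHOLD:
        return 'biography'

    # Everything else: loosely connected documents.
    return 'weakly_related'
```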
Characterizing / classifying inputs
• Important if you want to route to a specialized summarizer
• Classification can be made along several lines:
  • Theme of the input (Columbia's summarizer)
  • Scientific vs. news articles
  • Long vs. short documents
  • News articles about events vs. editorials
  • Difficult vs. easy??
Input difficulty and performance confidence estimation
Some inputs are more difficult than others: most summarizers produce poor summaries for these inputs.
Some inputs are easier than others!
[Figure: average system scores obtained on different inputs for 100-word summaries; x-axis: input to summarizer. Data: DUC 2001, score range 0-4. Mean 0.55, min 0.07, max 1.65.]
Input difficulty & Content coverage scores • Content coverage score • extent of coverage of important content • Poor content selection –> low score • If most summaries for an input get low score.. • Most systems could not identify important content • “ Difficult Input ”
Multi-document inputs were from 5 categories, each a set of documents describing...
• Single event: the Exxon Valdez oil spill
• Subject: mad cow disease
• Biographical: Elizabeth Taylor
• Multiple distinct events: different occasions of police misconduct
• Opinion: views of the Senate, public, Congress, lawyers, etc. on the decision by the Senate to count illegal aliens in the 1990 census
The first three are cohesive / "on topic" inputs; the last two are non-cohesive / "multiple facets" inputs.
Single task: generic summarization. Did system performance vary with DUC 2001 input categories?
Input type influenced the scores obtained
Biographical, single event, and subject inputs are easier to summarize than multiple distinct events and opinions.
Cohesive inputs are easier to summarize
• Cohesive: biographical, single event, subject
• Non-cohesive: multiple distinct events, opinions
Scores for cohesive inputs are significantly* higher than those for non-cohesive inputs at 100, 200, and 400 words.
*One-sided t-tests, 95% significance level
Inputs can be easy or difficult, so...
• Better summarizers: use different methods to summarize different inputs (multi-strategy)
• Enhancing the user experience: the system can flag summaries that are likely to be poor in content (low system confidence on difficult inputs)
First step: what characterizes difficult inputs?
• Find useful features
• Can we identify difficult inputs with high accuracy? A classification task: difficult vs. easy
Features – simple length-based
Smaller inputs ~ less loss of information ~ better summaries
• Number of sentences ~ information to be captured in the summary
• Vocabulary size ~ number of unique words
Features – word distributions in the input
• % of words used only once ~ lexical repetition; less repetition of content ~ difficult inputs
• Type-token ratio ~ lexical variation in the input; fewer types ~ easy inputs
• Entropy of the input ~ descriptive words with high probabilities ~ less entropy ~ easy inputs
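A minimal sketch of these word-distribution features over an already-tokenized input; computing the once-used percentage over the vocabulary (rather than over tokens) and using base-2 entropy are assumptions.

```python
import math
from collections import Counter

def word_distribution_features(tokens):
    """% of word types used only once, type-token ratio, and entropy
    of the input's unigram distribution."""
    counts = Counter(tokens)
    n_tokens = len(tokens)
    pct_once = sum(1 for c in counts.values() if c == 1) / len(counts)
    type_token_ratio = len(counts) / n_tokens
    entropy = -sum((c / n_tokens) * math.log2(c / n_tokens)
                   for c in counts.values())
    return pct_once, type_token_ratio, entropy
```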
Features – document similarity and relatedness
Documents with overlapping content ~ easy input
• Pairwise cosine overlap (average, min, max) ~ similarity of the documents
• High cosine overlap ~ overlapping content ~ easy to summarize
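A sketch of the pairwise cosine feature with raw term-frequency vectors; tf-idf weighting would work equally well here.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(v1, v2):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(v1[w] * v2[w] for w in v1 if w in v2)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

def pairwise_cosine_features(tokenized_docs):
    """Average, min, and max cosine overlap over all document pairs."""
    vectors = [Counter(toks) for toks in tokenized_docs]
    sims = [cosine(a, b) for a, b in combinations(vectors, 2)]
    return sum(sims) / len(sims), min(sims), max(sims)
```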
Features – document similarity and relatedness (contd.)
An input tightly bound by topic ~ easy input
• KL divergence ~ distance from a large collection of random documents: the difference between 2 language models (the input vs. the random collection)
• Greater divergence ~ the input is unlike random documents ~ a tightly bound input
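A sketch of the KL divergence between the input's unigram language model and a background model built from random documents; the smoothing of the background model is an assumption, needed so that input words missing from the background do not make the divergence infinite.

```python
import math
from collections import Counter

def kl_divergence(input_tokens, background_tokens, smoothing=0.5):
    """KL(input || background) over unigram language models, with
    additive smoothing on the background distribution."""
    p = Counter(input_tokens)
    q = Counter(background_tokens)
    n_p = sum(p.values())
    vocab = set(p) | set(q)
    n_q = sum(q.values()) + smoothing * len(vocab)
    return sum((c / n_p) * math.log((c / n_p) / ((q[w] + smoothing) / n_q))
               for w, c in p.items())
```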
Features – log-likelihood ratio based
More topic terms and similar topic terms ~ a topic-oriented, easy input
• Number of topic signature terms
• Percentage of topic signatures in the vocabulary ~ controls for the length of the input
• Pairwise topic signature overlap (average, min, max) ~ similarity between the topic vectors of different documents ~ cosine overlap with reduced, more specific vectors
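Topic signatures are usually extracted with Lin and Hovy's log-likelihood ratio test: a word is a signature term if its frequency in the input is unexpectedly high relative to a background corpus. A sketch; the cutoff of 10.83 (the chi-square value for p < 0.001) is a common convention, not necessarily the one used in the paper.

```python
import math
from collections import Counter

def _binom_loglik(k, n, p):
    """Binomial log-likelihood of k successes in n trials."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def topic_signatures(input_tokens, background_tokens, cutoff=10.83):
    """Words whose -2 log-likelihood ratio exceeds the cutoff and that
    are more frequent in the input than in the background."""
    inp, bg = Counter(input_tokens), Counter(background_tokens)
    n1, n2 = sum(inp.values()), sum(bg.values())
    signatures = []
    for w, k1 in inp.items():
        k2 = bg.get(w, 0)
        p1 = k1 / n1
        p2 = k2 / n2 if n2 else 0.0
        p = (k1 + k2) / (n1 + n2)
        llr = 2 * (_binom_loglik(k1, n1, p1) + _binom_loglik(k2, n2, p2)
                   - _binom_loglik(k1, n1, p) - _binom_loglik(k2, n2, p))
        if llr > cutoff and p1 > p2:
            signatures.append(w)
    return signatures
```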
What makes some inputs easy?
Easy inputs have:
• a smaller vocabulary
• smaller entropy
• greater divergence from a random collection
• a higher % of topic signatures in the vocabulary
• higher average cosine and topic signature overlap
Input difficulty hypothesis for systems
• Indicator of an input's difficulty: the average system coverage score; an input is difficult if most systems select poor content
• Defining the difficulty of inputs: 2 classes, above/below the mean average system score
  • > mean score: easy
  • < mean score: difficult
• This split gives equal classes
Classification results
• Baseline performance: 50% (equal classes)
• Test set: DUC 2002-04; 10-fold cross-validation on 192 observations
• Precision and recall are for the difficult inputs:
  Accuracy 69.27% | Precision 0.696 | Recall 0.674
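A sketch of how such an experiment can be run with scikit-learn; the random feature matrix and the choice of logistic regression are placeholders, since the slides do not name the learner used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# X: one row of difficulty features per input (length, entropy,
# KL divergence, topic signature counts, ...); y: 1 = difficult,
# 0 = easy. Random placeholders stand in for the 192 observations.
rng = np.random.default_rng(0)
X = rng.random((192, 14))
y = rng.integers(0, 2, size=192)

clf = LogisticRegression(max_iter=1000)
pred = cross_val_predict(clf, X, y, cv=10)   # 10-fold cross-validation
print('Accuracy :', accuracy_score(y, pred))
print('Precision:', precision_score(y, pred))  # of the "difficult" class
print('Recall   :', recall_score(y, pred))
```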
Summary evaluation without human models*
Current evaluation measures – recap:
• Content coverage
• Pyramid
• Responsiveness
• ROUGE
*My work with Ani
Need for cheap, fast measures
• All current evaluations require human effort:
  • human summaries (content overlap, Pyramid, ROUGE)
  • manual marking of summaries (responsiveness)
• Human summaries are biased: several summaries for the same input are needed to remove the bias (Pyramid, ROUGE)
Can we come up with cheaper evaluation techniques that produce the same rankings for systems as human evaluations?
Compare with the input – no human models
• Estimate the closeness of the summary to the input: the closer a summary is to the input, the better its content should be
• How do we verify this?
  • Design features that reflect how close a summary is to the input
  • Rank summaries based on the value of each feature
  • Compare the obtained rankings to the rankings given by humans
  • Similar rankings (high correlation): you have succeeded
What features should we use?
We want to know how well a summary reflects the input's content. Guesses?
Features – divergence between input and summary
Smaller divergence ~ better summary
• KL divergence, input to summary
• KL divergence, summary to input
• Jensen-Shannon divergence
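The two KL features can reuse the kl_divergence sketch above with the input and summary swapped. Jensen-Shannon divergence compares each distribution to their average, so it is symmetric and finite without smoothing; a sketch:

```python
import math
from collections import Counter

def _unigram_dist(tokens, vocab):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: counts[w] / total for w in vocab}

def js_divergence(input_tokens, summary_tokens):
    """Jensen-Shannon divergence between the unigram distributions of
    the input and the summary (base-2 logs, so the value is in [0, 1])."""
    vocab = set(input_tokens) | set(summary_tokens)
    p = _unigram_dist(input_tokens, vocab)
    q = _unigram_dist(summary_tokens, vocab)
    m = {w: (p[w] + q[w]) / 2 for w in vocab}

    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in vocab if a[w] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```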
Features – use of topic words from the input
More topic words ~ better summary
• % of the summary composed of topic words
• % of the input's topic words carried over to the summary
Features – similarity between input and summary
More similar to the input ~ better summary
• Cosine similarity: input words vs. summary words
• Cosine similarity: input's topic signatures vs. summary words
Features – summary probability
Higher likelihood of the summary given the input ~ better summary
• Unigram summary probability
• Multinomial summary probability
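A sketch of both probabilities in log space (raw products underflow for 100-word summaries); smoothing the input model is an assumption so that summary words absent from the input do not zero out the probability. The multinomial version differs from the unigram one only by the multinomial coefficient over the summary's word counts.

```python
import math
from collections import Counter

def unigram_log_prob(summary_tokens, input_tokens, smoothing=0.5):
    """Log probability of the summary word sequence under a smoothed
    unigram language model of the input."""
    inp = Counter(input_tokens)
    vocab = set(inp) | set(summary_tokens)
    total = sum(inp.values()) + smoothing * len(vocab)
    return sum(math.log((inp[w] + smoothing) / total)
               for w in summary_tokens)

def multinomial_log_prob(summary_tokens, input_tokens, smoothing=0.5):
    """Same model, but over the bag of summary word counts: adds the
    log multinomial coefficient (lgamma(n + 1) = log n!)."""
    counts = Counter(summary_tokens)
    coeff = (math.lgamma(len(summary_tokens) + 1)
             - sum(math.lgamma(c + 1) for c in counts.values()))
    return coeff + unigram_log_prob(summary_tokens, input_tokens, smoothing)
```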
Analysis of features
• The value of a feature is the score for the summary
• Average the feature values for a particular system over all inputs
• Compare to the average human score using Spearman (rank) correlation
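The correlation step is a one-liner with SciPy; the per-system score lists below are hypothetical.

```python
from scipy.stats import spearmanr

# Hypothetical per-system averages: one automatic feature score and
# one human score (e.g., pyramid) per system, in the same order.
feature_scores = [0.42, 0.31, 0.55, 0.47, 0.29]
human_scores = [0.61, 0.45, 0.72, 0.66, 0.40]

rho, p_value = spearmanr(feature_scores, human_scores)
print(f'Spearman rho = {rho:.3f} (p = {p_value:.3f})')
```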
Results
Data: TAC 2008 query-focused summarization; 48 inputs, 57 systems.
Evaluation without human models
• Comparison with the input correlates well with human judgements
• Cheap, fast, unbiased; no human effort needed
Other summarization tasks of interest
• Update summaries: the user has read a set of documents A; produce a summary of the updates from a set B of documents published later in time
• Query-focused: a topic statement is given to focus content selection
Other summarization tasks of interest (contd.)
• Blog / opinion summarization: mine opinions, good/bad product reviews, etc.
• Meeting / speech summarization: how would you summarize a brainstorming session?
What you have learned today
• How simple features you already know can be put to use in interesting applications
• Beyond a simple sentence-extraction engine: customizing for the input, the user, and the task setting is important
• There are many interesting tasks in summarization and language processing