Explore the evolution of summarization techniques from cue phrases to semantic networks, with a focus on aggregation, weighting, and reference resolution. Understand the impact of template-based approaches and challenges of abstract generation in varying genres.
Summarization and Personal Information Management Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
Announcements • Questions? • Plan for Today • Paice article on Cue Phrases • Next time we’ll again talk about old work, but then starting next Tuesday we’ll get into more recent techniques • Critique of Summary design
Quote from paper • The alternative of picking sentences from here and there in a document is an unattractive proposition
Paice on Cue Phrases • Luhn 1958: consider certain words as keywords, and select sentences with a high density of them • Baxendale 1958: position of a sentence in a document is an important factor
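The keyword-density idea attributed to Luhn above can be sketched roughly as follows. This is a simplified illustration, not Luhn's exact measure: the stopword list, the `top_n` cutoff, and the plain keywords-per-word ratio are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "it", "on", "for", "was"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_set(sentences, top_n=3):
    """Treat the top_n most frequent non-stopwords as keywords."""
    counts = Counter(
        w for s in sentences for w in tokenize(s) if w not in STOPWORDS
    )
    return {w for w, _ in counts.most_common(top_n)}

def density(sentence, keywords):
    """Fraction of the sentence's words that are keywords."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    return sum(w in keywords for w in words) / len(words)

def rank_by_density(sentences, top_n=3):
    """Return sentences ordered from highest to lowest keyword density."""
    keywords = keyword_set(sentences, top_n)
    return sorted(sentences, key=lambda s: density(s, keywords), reverse=True)
```

An extract would then take the first few sentences of `rank_by_density(sentences)`, usually restored to document order.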
Paice on Cue Phrases • Edmundson 1969: compared different strategies • Location method: sentences at beginning or ending of paper, first sentence in a paragraph, or sentence right under a significant heading • Cue method: used cue phrases that mark important sentences • Key method: Luhn 58 approach • Title method: weighted words higher if they were either in the title or some significant heading
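One way to picture how the four methods above could be compared or combined is a weighted sum of per-sentence scores. The sketch below is hypothetical: the scoring functions, the equal default weights, and the inputs (`title`, `keywords`, `cue_phrases`) are placeholders chosen for illustration, not values taken from Edmundson's study.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def location_score(index, total):
    """1.0 for the first or last sentence of a unit, else 0.0."""
    return 1.0 if index in (0, total - 1) else 0.0

def cue_score(sentence, cue_phrases):
    """Number of cue phrases that occur in the sentence."""
    lowered = sentence.lower()
    return float(sum(phrase in lowered for phrase in cue_phrases))

def key_score(sentence, keywords):
    """Number of keyword occurrences in the sentence."""
    return float(sum(w in keywords for w in tokenize(sentence)))

def title_score(sentence, title):
    """Number of distinct words shared with the title or heading."""
    return float(len(set(tokenize(sentence)) & set(tokenize(title))))

def combined_score(sentence, index, total, title, keywords, cue_phrases,
                   weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the location, cue, key and title scores."""
    w_loc, w_cue, w_key, w_title = weights
    return (
        w_loc * location_score(index, total)
        + w_cue * cue_score(sentence, cue_phrases)
        + w_key * key_score(sentence, keywords)
        + w_title * title_score(sentence, title)
    )
```

Setting one weight to 1.0 and the rest to 0.0 isolates a single method, which makes it easy to compare the strategies on the same text.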
Paice on Cue Phrases • Earl 1970: investigated whether syntactic patterns could predict which sentences are important, but her patterns were too specific to generalize • Skorokhod’ko 1972: pointed out that different genres structure their texts differently, and so the approach needs to vary from one genre to another • Taylor 1977: construct a semantic network from the text and then generate a summary from a maximally connected subnetwork
Paice on Cue Phrases • Rush et al. 1971 and Pollock and Zamora 1975: mark words as likely or unlikely to signal importance, and remove the sentences most likely to be unimportant • Care taken to avoid dangling references • Trim off extraneous text • Aggregate sentences where possible to remove redundancies • Karasev 1978: similar cue phrase approach
Types of Abstracts • Indicative Abstracts: just tell you what the article is about • Informative Abstracts: give you an overview of the content • Critical and Comparative Abstracts: like a book review, etc.
Exophoric Links • These are links that show that two sentences “go together” • Extracts that include chunks of text where the sentences were adjacent already in the initial text will be more coherent
4 Stage Process • Identify and weight indicator phrases • Aggregate regions of text with exophoric references • Most highly weighted aggregates are selected • Texts are trimmed to remove extraneous text, etc. • ** Uses techniques from prior work, but only the first two stages have been evaluated
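The aggregation and selection stages can be sketched as below, assuming the sentence weights and the set of sentences with backward-pointing references have already been computed. Scoring an aggregate by the sum of its sentence weights is an assumption for illustration; the slide only says that the most highly weighted aggregates are selected. Trimming is not modeled.

```python
def build_aggregates(num_sentences, backward_refs):
    """Group sentence indices so that a sentence containing a
    backward-pointing reference stays attached to the one before it."""
    aggregates = []
    for i in range(num_sentences):
        if i in backward_refs and aggregates:
            aggregates[-1].append(i)
        else:
            aggregates.append([i])
    return aggregates

def select_aggregates(weights, backward_refs, k=1):
    """Pick the k highest-scoring aggregates and return their
    sentence indices in document order."""
    aggregates = build_aggregates(len(weights), backward_refs)
    # Assumption: an aggregate's score is the sum of its sentence weights.
    ranked = sorted(
        aggregates,
        key=lambda agg: sum(weights[i] for i in agg),
        reverse=True,
    )
    chosen = sorted(ranked[:k], key=lambda agg: agg[0])
    return [i for agg in chosen for i in agg]
```

Because sentence 2 in a call such as `select_aggregates([0.0, 2.0, 1.0, 0.5, 2.5], {2}, k=1)` depends on sentence 1, the two are extracted together rather than leaving a dangling reference.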
Notes on Cue Phrases • If they were listed exhaustively, there would be several thousand • We use templates that represent “paradigm cases” • Templates work like “semantic grammars” • The actual phrases typically include some extra “fluff”, so each template comes with a “skip limit” • Stemming also helps • Some words in a template may carry more weight
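A minimal sketch of template matching with a skip limit follows. It assumes that the skip limit is the maximum number of extra words allowed between consecutive template words, and it uses a deliberately crude suffix stripper in place of a real stemmer. Both choices are assumptions for illustration, and per-word weights are not modeled.

```python
import re

def crude_stem(word):
    """Strip a few common suffixes (a stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Lowercase, split into words, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]

def _match_from(tokens, pos, rest, skip_limit):
    """Try to match the remaining template words after position pos."""
    if not rest:
        return True
    last = min(len(tokens), pos + 2 + skip_limit)
    for nxt in range(pos + 1, last):
        if tokens[nxt] == rest[0] and _match_from(tokens, nxt, rest[1:], skip_limit):
            return True
    return False

def matches_template(sentence, template, skip_limit):
    """True if the template words occur in order in the sentence, with
    at most skip_limit extra words between consecutive template words."""
    tokens = tokenize(sentence)
    pattern = tokenize(template)
    if not pattern:
        return False
    return any(
        tok == pattern[0] and _match_from(tokens, i, pattern[1:], skip_limit)
        for i, tok in enumerate(tokens)
    )
```

With the template `"this paper presents"`, the sentence `"This short paper briefly presented a method."` matches at a skip limit of 1 but not at 0, since one word of "fluff" sits between each pair of template words.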
Aggregating Sentences • Sentences are most strongly related to other sentences within the same paragraph • Less strongly to those in adjacent paragraphs • Even less to those in more distant paragraphs
Exophoric References • Reference resolution is necessary • “this” in “this paper” (not exophoric since it refers to the paper rather than something in the paper) • We did … and this was a good thing (this not exophoric because it’s resolved within the sentence) • Cataphora versus anaphora: both exophoric, but point in different directions • Discourse connectives such as “First”, “However”, “Moreover” are also exophoric
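The distinctions above can be approximated with surface heuristics, sketched below. This is far short of real reference resolution. The word lists are hypothetical, treating sentence-initial connectives as pointing backward is an assumption, and only sentence-initial demonstratives are checked, as a crude way to skip references resolved inside the sentence.

```python
import re

# Hypothetical word lists chosen for illustration.
BACKWARD_CONNECTIVES = ("however", "moreover", "first")
SELF_REFERENCES = ("this paper", "this article", "this study")
FORWARD_MARKERS = ("as follows", "the following")

def exophoric_directions(sentence):
    """Return the set of directions ('backward', 'forward') in which
    the sentence appears to depend on its neighbours."""
    lowered = sentence.lower().strip()
    directions = set()
    # Sentence-initial discourse connectives link to earlier text (assumption).
    if lowered.startswith(BACKWARD_CONNECTIVES):
        directions.add("backward")
    # A sentence-initial demonstrative is treated as anaphoric unless it
    # refers to the document itself, as in "this paper".
    if re.match(r"(this|these|such)\b", lowered) and not lowered.startswith(SELF_REFERENCES):
        directions.add("backward")
    # Cataphoric markers point to the text that follows.
    if any(marker in lowered for marker in FORWARD_MARKERS):
        directions.add("forward")
    return directions
```

A sentence flagged as `"backward"` would be aggregated with its predecessor, and one flagged as `"forward"` with its successor.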
Neutralizers • References to figures and tables • References to other documents • An algorithm of this kind is found in (XXX, 1980) • What makes references hard is that people have such a variety of styles of using them, not all of which conform to any standard of “correctness”
Student Comment • I knew the facts the automatic abstract process is extracting, but they were not necessarily the most salient facts for me when I selected this paper as important for reading. Surprisingly, very little of what made the paper interesting to me in the first place was captured in the abstract by this technique.
Student Comment • I hypothesize this is because this paper contained lots of useful "background facts" which were relevant, while the specific test results were of only secondary importance. If I already knew everything about the subject, the abstracts would probably contain the information that I was most interested in. This could be an example of different perspectives on a document.
Student Comment • After reading the paper on literature abstract generation, I feel that most of the techniques cannot be generalized over a long period of time as the scientific jargon changes. Some of the templates might not be able to keep up with the ever evolving method of scientific article meta-discourse. • How much time was there between when the paper was written and when your example paper was written?
Student Comment • Cue phrases worked pretty well. In fact all the sentences in the abstract matched the templates as presented in the paper. I could still find the cue phrases in other parts of the paper. Hence the generated abstract would contain almost all the sentences of the original abstract. • But what’s the problem?
Student Comment • The cue phrase technique has worked well since the list seems quite exhaustive. The original abstract mentions almost the same points as this abstract. There is a problem in tense usage and repetition of cue phrases. Another major point is that the original abstract has a natural flow of sentences while this one lacks good coherence.
Homework 1 (Due Jan 25, 8pm) • 1 page write-up, posted to Drupal • Feel free to post comments in response to write-ups submitted by your classmates • Select one of the Grand Challenges • Describe the scenario you are targeting • What is the main problem in connection with information overload here? • What is your proposed solution and why do you think it will work? • Mock up an example summary to illustrate your idea
Critique • Feedback from your peers should help you decide how to formulate your project proposal • Due one week from today • Same format as homework assignment