Text summarization. Tutorial, ACM SIGIR, New Orleans, Louisiana, September 9, 2001. Dragomir R. Radev, School of Information, Department of Electrical Engineering and Computer Science, and Department of Linguistics, University of Michigan. http://www.si.umich.edu/~radev Part I: Introduction
The BIG problem • Information overload: 1.39 Billion URLs catalogued by Google • Possible approaches: • information retrieval • document clustering • information extraction • visualization • question answering • text summarization
Some concepts • Abstracts: “a concise summary of the central subject matter of a document” [Paice90]. • Indicative, informative, and critical summaries • Extracts (representative sentences)
Informative summaries
Lines sometimes blurred Net Tax Moratorium Clears House The House passed a bill to extend the current moratorium on new Internet taxes until 2006. The moratorium forbids states from trying to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers.
http://www.nytimes.com/library/tech/00/05/biztech/articles/11tax.html House Votes to Ban Internet Taxes for 5 More Years By LIZETTE ALVAREZ WASHINGTON, May 10 -- In a Republican bid to woo the high-technology industry and please taxpayers, the House today rushed to the floor and then handily passed a bill to extend the current moratorium on new Internet taxes until 2006. The moratorium, which is due to expire in October 2001, forbids states to try to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers. The legislation passed today, which faces an uncertain future in the Senate, does not directly address the question of sales taxes; it would not stop states from trying to collect taxes for goods sold on the Internet. By failing to address sales taxes, however, the measure alarmed some traditional retailers, as well as state governments that say they have found it nearly impossible to collect taxes for goods sold online. "The single largest contributor to our economic prosperity has been the growth of information technology -- the Internet," said Representative John R. Kasich, an Ohio Republican. "Why would we try to tax something, why would we try to abuse something, why would we try to limit something that generates unprecedented growth, wealth, opportunity and unprecedented individual power?" Critics of the bill say the moratorium, while seemingly benign, ignores the thorny question of how state and local governments can best collect taxes on the billions of dollars of merchandise sold over the Internet each year. These taxes are expected to provide a crucial future source of revenue for states, especially as more consumers buy goods online. 
The bill's opponents -- a consortium of retailers, small-business groups and governors -- say that consumers who buy merchandise over the Internet can easily circumvent the sales and "use" taxes that would be collected automatically if the same merchandise is bought at a bricks-and-mortar retail store. The National Governors' Association is working on the best way to collect electronic sales tax. Estimates have put the loss in sales tax revenue to the states at $8 billion a year by 2004.
Retailers and small businesses have complained that the current system unfairly places at a disadvantage the traditional retailers that do not sell their wares online and must charge sales tax. "It's easy to imagine how these kinds of losses can affect state and local governments' ability to provide essential services," said Representative William D. Delahunt, a Massachusetts Democrat, citing the concerns of many governors. "They will be compelled to cut back local services or raise income taxes or property taxes." The bill even drew criticism from a few Republicans. Representative Ernest J. Istook Jr. of Oklahoma circulated a letter stating, "The Internet should not be singled out to be taxed, nor to be freed from tax." Still, the House voted overwhelmingly, 352 to 75, to pass the bill. A number of Democrats approved the measure after they received assurance that Congress would hold hearings concerning sales taxes and would try to come up with a solution. The moratorium "has absolutely nothing to do with the sales tax -- we will have the opportunity to have that debate," said Representative Robert Goodlatte, a Virginia Republican. The House bill faces a murkier future in the Senate. Senator John McCain, chairman of the Commerce Committee, who advocates a permanent tax moratorium, canceled a hearing on the bill last month after Republican senators, some of them former governors, expressed reservations about extending the moratorium. The legislation also faces opposition from the Clinton administration, which signaled support today for a two-year moratorium. The full House today rejected a two-year extension in a separate vote. Gov. George W. Bush, the likely Republican presidential nominee, has said he will support an extension of the moratorium. But the governor must tread carefully around the issue because Texas, which does not have a state income tax, would stand to lose substantial revenue if sales taxes are not made workable on the Internet. 
A spokesman for Al Gore said the vice president supported a two-year extension of the moratorium "at a minimum." If a five-year moratorium is put into place, "it should include flexibility" to adjust federal policies on Internet taxation "to take into account the fast-paced change in the Internet world."
Types of summaries • dimensions • genres • context
Dimensions • Single-document vs. multi-document
Genres • headlines • outlines • minutes • biographies • abridgments • sound bites • movie summaries • chronologies, etc. [Mani and Maybury 1999]
Context • Query-specific • Query-independent
What does summarization involve? • Three stages (typically) • content identification • conceptual organization • realization
Spärck Jones’s three sets of factors • Input factors (source form, subject type, unit) • Purpose factors (situation, audience, use) • Output factors (material, format, style) [Spärck Jones 99]
ProSum http://transend.labs.bt.com/prosum/word/index.html • Profile-based summarization • Control of summarization length • Retention of user-defined text • Customizable heading treatment • Customizable table treatment • Customizable text differentiation
Example (New York Times) Net Tax Moratorium Clears House The House passed a bill to extend the current moratorium on new Internet taxes until 2006. The moratorium forbids states from trying to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers.
Microsoft Autosummarize output (10% summary) House Votes to Ban Internet Taxes for 5 More Years The moratorium, which is due to expire in October 2001, forbids states to try to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers. By failing to address sales taxes, however, the measure alarmed some traditional retailers, as well as state governments that say they have found it nearly impossible to collect taxes for goods sold online.
Microsoft Autosummarize output (25% summary) House Votes to Ban Internet Taxes for 5 More Years The moratorium, which is due to expire in October 2001, forbids states to try to find new ways of taxing Internet use, like imposing taxes on monthly access charges for Internet service providers. The legislation passed today, which faces an uncertain future in the Senate, does not directly address the question of sales taxes; it would not stop states from trying to collect taxes for goods sold on the Internet. By failing to address sales taxes, however, the measure alarmed some traditional retailers, as well as state governments that say they have found it nearly impossible to collect taxes for goods sold online. The National Governors' Association is working on the best way to collect electronic sales tax. Representative Ernest J. Istook Jr. of Oklahoma circulated a letter stating, "The Internet should not be singled out to be taxed, nor to be freed from tax." Senator John McCain, chairman of the Commerce Committee, who advocates a permanent tax moratorium, canceled a hearing on the bill last month after Republican senators, some of them former governors, expressed reservations about extending the moratorium.
Outline • I Introduction • II Traditional approaches • III Multi-document summarization • IV Knowledge-rich techniques • V Evaluation methods • VI The MEAD project • VII Language modeling
Human summarization and abstracting • What professional abstractors do • Ashworth: • “To take an original article, understand it and pack it neatly into a nutshell without loss of substance or clarity presents a challenge which many have felt worth taking up for the joys of achievement alone. These are the characteristics of an art form”.
Borko and Bernier 75 • The abstract and its use: • Abstracts promote current awareness • Abstracts save reading time • Abstracts facilitate selection • Abstracts facilitate literature searches • Abstracts improve indexing efficiency • Abstracts aid in the preparation of reviews
Cremmins 82, 96 • American National Standard for Writing Abstracts: • State the purpose, methods, results, and conclusions presented in the original document, either in that order or with an initial emphasis on results and conclusions. • Make the abstract as informative as the nature of the document will permit, so that readers may decide, quickly and accurately, whether they need to read the entire document. • Avoid including background information or citing the work of others in the abstract, unless the study is a replication or evaluation of their work.
Cremmins 82, 96 • Do not include information in the abstract that is not contained in the textual material being abstracted. • Verify that all quantitative and qualitative information used in the abstract agrees with the information contained in the full text of the document. • Use standard English and precise technical terms, and follow conventional grammar and punctuation rules. • Give expanded versions of lesser known abbreviations and acronyms, and verbalize symbols that may be unfamiliar to readers of the abstract. • Omit needless words, phrases, and sentences.
Cremmins 82, 96 • Original version: There were significant positive associations between the concentrations of the substance administered and mortality in rats and mice of both sexes. There was no convincing evidence to indicate that endrin ingestion induced any of the different types of tumors which were found in the treated animals. • Edited version: Mortality in rats and mice of both sexes was dose related. No treatment-related tumors were found in any of the animals.
Redundancy of English • 75% redundancy of English [Shannon 51] • [Burton & Licklider 55] show that humans are as good at guessing the next letter after seeing 32 letters as after 10,000 letters.
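To make the notion of redundancy concrete: it can be written as one minus the ratio of the observed entropy to the maximum possible entropy. The sketch below is only an illustration under a strong simplifying assumption: it uses single-symbol frequencies over a 27-symbol alphabet (letters plus space), so it ignores context and will report far less redundancy than an estimate based on predicting letters from long preceding text. The function name and alphabet are hypothetical choices, not taken from the cited work.

```python
import math
from collections import Counter

# Assumed alphabet: 26 lowercase letters plus space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "


def unigram_redundancy(text: str) -> float:
    """Return 1 - H / H_max, where H is the single-symbol entropy of the text."""
    symbols = [ch for ch in text.lower() if ch in ALPHABET]
    if not symbols:
        raise ValueError("text contains no symbols from the alphabet")
    counts = Counter(symbols)
    total = len(symbols)
    # Shannon entropy in bits per symbol, from relative frequencies.
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    max_entropy = math.log2(len(ALPHABET))
    return 1.0 - entropy / max_entropy
```

A text that repeats one letter is fully redundant under this measure, while a text that uses all 27 symbols equally often has no redundancy.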
Morris et al. 92 • Reading comprehension of summaries • Compare manual abstracts, Edmundson-style extracts, and full documents • Extracts containing 20% or 30% of original document are effective surrogates of original document • Performance on 20% and 30% extracts is no different than informative abstracts
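Extraction models • Extracts vs. abstracts • Linear model • Text structure based • New techniques • Compression ratio: $\frac{|S|}{|D|}$ (length of summary $S$ over length of document $D$) • Retention ratio: $\frac{i(S)}{i(D)}$ (information content $i$ of the summary over that of the document)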
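The two ratios can be computed as in the sketch below. The compression ratio follows directly from the lengths. The retention ratio needs a measure of information content i(.), which the slide does not define, so `distinct_types` (the number of distinct lowercased words) is a hypothetical stand-in chosen only to keep the example runnable.

```python
from typing import Callable, Sequence


def distinct_types(words: Sequence[str]) -> float:
    """Hypothetical stand-in for i(.): the number of distinct lowercased words."""
    return float(len({w.lower() for w in words}))


def compression_ratio(summary: Sequence[str], document: Sequence[str]) -> float:
    """|S| / |D|, with lengths measured in words."""
    if not document:
        raise ValueError("document is empty")
    return len(summary) / len(document)


def retention_ratio(
    summary: Sequence[str],
    document: Sequence[str],
    info: Callable[[Sequence[str]], float] = distinct_types,
) -> float:
    """i(S) / i(D) for a caller-supplied information measure."""
    total = info(document)
    if total == 0:
        raise ValueError("document has no information content under this measure")
    return info(summary) / total
```

A good summary keeps the compression ratio low while keeping the retention ratio high; any other information measure can be passed as `info`.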
Text compaction techniques Missam ad amicum pro consolatione epistolam, dilectissime, vestram ad me forte quidam nuper attulit. Quam ex ipsa statim tituli fronte vestram esse considerans, tanto ardentius eam cepi legere quanto scriptorem ipsum karius amplector, ut cuius rem perdidi verbis saltem tanquam eius quadam imagine recreer. Erant, memini, huius epistole fere omnia felle et absintio plena, que scilicet nostre conversionis miserabilem hystoriam et tuas, unice, cruces assiduas referebant. Complesti revera in epistola illa quod in exordio eius amico promisisti, ut videlicet in comparatione tuarum suas molestias nullas vel parvas reputaret; ubi quidem expositis prius magistrorum tuorum in te persequutionibus, deinde in corpus tuum summe proditionis iniuria, ad condiscipulorum quoque tuorum Alberici videlicet Remensis et Lotulfi Lumbardi execrabilem invidiam et infestationem nimiam stilum contulisti. Missam ad amicum pro consolatione epistolam, dilectissime, vestram ad me forte quidam nuper attulit. Erant, memini, huius epistole fere omnia felle et absintio plena, que scilicet nostre conversionis miserabilem hystoriam et tuas, unice, cruces assiduas referebant.
Text compaction techniques Missam ad amicum pro consolatione epistolam, dilectissime, vestram ad me forte quidam nuper attulit. Erant, memini, huius epistole fere omnia felle et absintio plena, que scilicet nostre conversionis miserabilem hystoriam et tuas, unice, cruces assiduas referebant. Missam vestram nuper attulit. Erant, scilicet nostre conversionis miserabilem hystoriam referebant.
Luhn 58 • Very first work in automated summarization • Computes measures of significance • Words: • stemming • bag of words • Figure: resolving power of significant words (axes: frequency, words)
Luhn 58 • Sentences: • concentration of high-score words • Cutoff values established in experiments with 100 human subjects • Example: a sentence span of 7 words containing 4 significant words scores $4^2/7 \approx 2.3$
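The sentence score described above (squared number of significant words in a span, divided by the span length) can be sketched as follows. This is a minimal illustration, not Luhn's full procedure: the set of significant words is supplied by the caller, stemming is omitted, and `max_gap` (the largest run of non-significant words allowed inside one span) defaults to 4 as an assumption.

```python
from typing import Sequence, Set


def luhn_sentence_score(
    words: Sequence[str], significant: Set[str], max_gap: int = 4
) -> float:
    """Score a sentence by its best span: (significant count)^2 / span length."""
    positions = [i for i, w in enumerate(words) if w.lower() in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Still inside the current span of significant words.
            count += 1
        else:
            # Close the current span and start a new one at this word.
            best = max(best, count * count / (prev - start + 1))
            start = pos
            count = 1
        prev = pos
    return max(best, count * count / (prev - start + 1))
```

For a span of seven words that starts and ends with a significant word and contains four of them, the function returns 16/7, matching the slide's example.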
Edmundson 69 • Cue method: stigma words (“hardly”, “impossible”), bonus words (“significant”) • Key method: similar to Luhn • Title method: title + headings • Location method: sentences under headings; sentences near beginning or end of document and/or paragraphs (also [Baxendale 58])
Edmundson 69 • Linear combination of four features: $\alpha_1 C + \alpha_2 K + \alpha_3 T + \alpha_4 L$ • Manually labelled training corpus • Key not important! • Figure: performance (0 to 100%) of RANDOM, KEY, TITLE, CUE, LOCATION, C + K + T + L, and C + T + L
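A weighted sum of the four method scores (Cue, Key, Title, Location) can be sketched as below. The weight values are placeholders, not Edmundson's tuned values, and the per-sentence feature scores are assumed to have been computed elsewhere; the names `Weights`, `edmundson_score`, and `rank_sentences` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Weights:
    """Placeholder weights for the four methods; real values would be tuned."""
    cue: float = 1.0
    key: float = 1.0
    title: float = 1.0
    location: float = 1.0


def edmundson_score(c: float, k: float, t: float, l: float, w: Weights = Weights()) -> float:
    """Weighted sum of the Cue, Key, Title, and Location scores of one sentence."""
    return w.cue * c + w.key * k + w.title * t + w.location * l


def rank_sentences(
    features: Sequence[Tuple[float, float, float, float]],
    w: Weights = Weights(),
    top_n: int = 1,
) -> List[int]:
    """Return indices of the top_n sentences, restored to document order."""
    order = sorted(
        range(len(features)),
        key=lambda i: edmundson_score(*features[i], w),
        reverse=True,
    )
    return sorted(order[:top_n])
```

Setting `key=0.0` in `Weights` gives the C + T + L combination that drops the Key method.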
Paice 90 • Survey up to 1990 • Techniques that (mostly) failed: • syntactic criteria [Earl 70] • indicator phrases (“The purpose of this article is to review…”) • Problems with extracts: • lack of balance • lack of cohesion • anaphoric reference • lexical or definite reference • rhetorical connectives
Paice 90 • Lack of balance: later approaches based on text rhetorical structure • Lack of cohesion: recognition of anaphors [Liddy et al. 87] • Example: “that” is nonanaphoric if preceded by a research-verb (e.g., “demonstrat-”), nonanaphoric if followed by a pronoun, article, quantifier,…, external if no later than 10th word, else internal
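The rule for “that” quoted above reads naturally as a small decision procedure. The sketch below is one possible reading of it: the word lists are short hypothetical samples (the slide names only the stem “demonstrat-” and elides part of the follower categories, which are left unfilled here), and "10th word" is interpreted as position within the sentence.

```python
from typing import Sequence

# Hypothetical sample of research-verb stems; the slide gives only "demonstrat-".
RESEARCH_VERB_STEMS = ("demonstrat", "show", "indicat")
# Hypothetical sample of pronouns, articles, and quantifiers.
FOLLOWERS = {"he", "she", "it", "they", "we", "the", "a", "an", "some", "all", "many"}


def classify_that(tokens: Sequence[str], index: int) -> str:
    """Classify the occurrence of 'that' at tokens[index] within one sentence."""
    if tokens[index].lower() != "that":
        raise ValueError("tokens[index] must be the word 'that'")
    prev = tokens[index - 1].lower() if index > 0 else ""
    nxt = tokens[index + 1].lower() if index + 1 < len(tokens) else ""
    if prev.startswith(RESEARCH_VERB_STEMS):
        return "nonanaphoric"
    if nxt in FOLLOWERS:
        return "nonanaphoric"
    # index is 0-based, so index < 10 means "no later than the 10th word".
    return "external" if index < 10 else "internal"
```

For example, in "The results demonstrate that the method works" the word is preceded by a research verb and is classified as nonanaphoric.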
Brandow et al. 95 • ANES: commercial news from 41 publications • “Lead” achieves acceptability of 90% vs. 74.4% for “intelligent” summaries • 20,997 documents • Words selected based on tf*idf • Sentence-based features: signature words, location, anaphora words, length of abstract
Brandow et al. 95 • Sentences with no signature words are included if between two selected sentences • Evaluation done at 60, 150, and 250 word length • Non-task-driven evaluation: “Most summaries judged less-than-perfect would not be detectable as such to a user”
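Selecting signature words by tf*idf can be sketched as follows. The slides do not say which tf*idf variant ANES uses, so this example assumes one common form (raw term frequency times the natural log of the number of documents over the document frequency); the function name and parameters are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def tfidf_signature_words(
    doc_tokens: Sequence[str], corpus: Sequence[Sequence[str]], top_n: int = 5
) -> List[str]:
    """Return the top_n words of doc_tokens ranked by tf * log(N / df).

    The corpus should include the document itself; words that never occur
    in the corpus are skipped.
    """
    n_docs = len(corpus)
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))  # count each word once per document
    tf = Counter(doc_tokens)
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf if df[w] > 0}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

Words that occur in every document get a score of zero and fall to the bottom of the ranking, which is why frequent function words rarely become signature words.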
Lin & Hovy 97 • Optimum position policy • Measuring yield of each sentence position against keywords (signature words) from Ziff-Davis corpus • Preferred order: [(T) (P2,S1) (P3,S1) (P2,S2) {(P4,S1) (P5,S1) (P3,S2)} {(P1,S1) (P6,S1) (P7,S1) (P1,S3) (P2,S3) …]
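Once a preferred order of positions is known, applying it to a new document is straightforward. The sketch below assumes a document is given as a title plus a list of paragraphs, each a list of sentences, and uses only the first four entries of the order shown above (the later entries are grouped and partly elided on the slide, so they are left out). `POLICY_PREFIX` and `select_by_position` are hypothetical names.

```python
from typing import List, Sequence, Tuple, Union

Slot = Union[str, Tuple[int, int]]

# "T" is the title; (p, s) is sentence s of paragraph p, both 1-based.
POLICY_PREFIX: List[Slot] = ["T", (2, 1), (3, 1), (2, 2)]


def select_by_position(
    title: str,
    paragraphs: Sequence[Sequence[str]],
    policy: Sequence[Slot],
    max_sentences: int,
) -> List[str]:
    """Pick sentences in policy order, skipping positions the document lacks."""
    selected: List[str] = []
    for slot in policy:
        if len(selected) >= max_sentences:
            break
        if slot == "T":
            selected.append(title)
            continue
        p, s = slot
        if p <= len(paragraphs) and s <= len(paragraphs[p - 1]):
            selected.append(paragraphs[p - 1][s - 1])
    return selected
```

Short documents simply yield fewer sentences, since missing positions are skipped rather than replaced.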
Kupiec et al. 95 • Extracts of roughly 20% of original text • Feature set: • sentence length: |S| > 5 • fixed phrases: 26 manually chosen • paragraph: sentence position in paragraph • thematic words: binary: whether sentence is included in manual extract • uppercase words: not common acronyms • Corpus: 188 document + summary pairs from scientific journals
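Kupiec et al. 95 • Uses Bayesian classifier: $P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)}$ • Assuming statistical independence: $P(s \in S \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$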
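A Bayesian sentence scorer with the independence assumption can be sketched as below for binary features. This is an illustration rather than the authors' implementation: the add-one smoothing is an addition to avoid zero probabilities on tiny training sets, and the function names and data layout are hypothetical. The returned value is best read as a ranking score.

```python
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[bool], bool]          # (feature values, is in summary)
Model = Tuple[float, List[float], List[float]]  # (prior, P(F_j|summary), P(F_j))


def train(examples: Sequence[Example]) -> Model:
    """Estimate the prior and per-feature probabilities from labelled sentences."""
    n = len(examples)
    n_pos = sum(1 for _, in_summary in examples if in_summary)
    prior = n_pos / n
    cond: List[float] = []
    marg: List[float] = []
    for j in range(len(examples[0][0])):
        pos_true = sum(1 for f, y in examples if y and f[j])
        all_true = sum(1 for f, _ in examples if f[j])
        # Add-one smoothing (an assumption, not part of the slide).
        cond.append((pos_true + 1) / (n_pos + 2))
        marg.append((all_true + 1) / (n + 2))
    return prior, cond, marg


def score(features: Sequence[bool], model: Model) -> float:
    """Prior times the product of P(F_j | summary) / P(F_j) over all features."""
    prior, cond, marg = model
    result = prior
    for j, value in enumerate(features):
        p_cond = cond[j] if value else 1.0 - cond[j]
        p_marg = marg[j] if value else 1.0 - marg[j]
        result *= p_cond / p_marg
    return result
```

Sentences are then ranked by score and the top fraction (for example, 20% or 25% of the document) is kept as the extract.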
Kupiec et al. 95 • Performance: • For 25% summaries, 84% precision • For smaller summaries, 74% improvement over Lead