Automatic Text Summarization: A Solid Base Martijn B. Wieling, Rijksuniversiteit Groningen November 25th, 2004
Outline • Why should we bother at all? (a.k.a. Introduction) • A frequency-based ATS [Luhn, 1958] • An ATS based on multiple features [Edmundson, 1969] • Automatically combining the features (1) [Kupiec et al., 1995] • Automatically combining the features (2) [Teufel & Moens, 1997] • Why should we still bother? (a.k.a. Conclusion)
Why should we bother at all? • Time saving • Large-scale applications possible, e.g.: • 'Google-xtract' • Extract translation • Abstracts will be consistent and objective
And in the beginning there was … • Hans Peter Luhn ("father of Information Retrieval"): The Automatic Creation of Literature Abstracts - 1958 (Image: courtesy IBM)
Luhn's method: basic idea • Target documents: technical literature • The method is based on the following assumptions: • Frequency of word occurrence in an article is a useful measurement of word significance • Relative position of these significant words within a sentence is also a useful measurement of word significance • Based on the limited capabilities of the machines of that time (IBM 704) → no semantic information is used (IBM 704 - courtesy IBM)
Why word frequency? • Important words are repeated throughout the text • examples are given in favor of a certain principle • arguments are given for a certain principle • Technical literature → one word: one notion • Simple and straightforward algorithm → cheap to implement (processing time is costly) • Note that different forms of the same word are counted as the same word
When significant? • Words with too low a frequency are not significant • Words with too high a frequency are also not significant (e.g. "the", "and") • Removing low-frequency words is easy → set a minimum frequency threshold • Removing common (high-frequency) words: • Setting a maximum frequency threshold (statistically obtained) • Comparing to a common-word list (Figure 1 from [Luhn, 1958])
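As an illustration, a minimal Python sketch of this selection step; the threshold value and the common-word list are illustrative assumptions, not values from [Luhn, 1958]:

```python
from collections import Counter

# Toy common-word list; Luhn removes common words either with a
# statistically obtained maximum frequency threshold or with such a list.
COMMON_WORDS = {"the", "and", "of", "a", "in", "to", "is"}

def significant_words(words, min_freq=3):
    """Return the significant words of a document: frequent enough to
    pass the minimum threshold, but not common words. (Luhn also merged
    different forms of the same word; that step is omitted here.)"""
    freq = Counter(w.lower() for w in words)
    return {w for w, f in freq.items()
            if f >= min_freq and w not in COMMON_WORDS}
```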
Using relative position • Where the greatest number of high-frequency words are found closest together → the probability is very high that representative information is given • Based on the characteristic that the explanation of a certain idea is represented by words close together (e.g. sentences – paragraphs – chapters)
The significance factor • The "significance factor" of a sentence reflects the number of occurrences of significant words within the sentence and the linear distance between them due to the non-significant words in between • Only consider the portion of the sentence bracketed by significant words, with a maximum of 5 non-significant words between consecutive significant words, e.g. "(*) - - - [ * - * * - - * - - * ] - - (*)" • Significance factor: (number of significant words in the bracketed portion)² / (length of the bracketed portion) = 5² / 10 = 2.5 in the above example
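A sketch of this computation; the exact bracketing rule (here: greedy clustering of significant words with at most `max_gap` non-significant words between consecutive ones, keeping the best-scoring bracket) is one possible reading of the paper:

```python
def significance_factor(words, significant, max_gap=5):
    """Significance factor of one tokenized sentence:
    (#significant words in a bracket)**2 / (bracket length), where a
    bracket is a run of significant words with at most `max_gap`
    non-significant words between consecutive ones."""
    positions = [i for i, w in enumerate(words) if w.lower() in significant]
    if not positions:
        return 0.0
    best, start, count, prev = 0.0, positions[0], 1, positions[0]
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:    # still inside the current bracket
            count += 1
        else:                            # bracket closed: score it, open a new one
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))
```

For the bracket in the example above (5 significant words in a span of 10 words) this yields 25 / 10 = 2.5.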
Generating the abstract • For every sentence the significance factor is calculated • The sentences with a significance factor higher than a certain cut-off value are returned (alternatively, the N highest-valued sentences can be returned) • For large texts, the method can also be applied to subdivisions of the text • No evaluation of the results is present in the paper!
A new method by Edmundson • H.P. Edmundson: New Methods in Automatic Extracting - 1969 (IBM 7090 - courtesy IBM)
Four methods for weighting • Weighting methods: • Cue Method • Key Method • Title Method • Location Method • The weight of a sentence is a linear combination of the weights obtained with the above four methods • The highest-weighing sentences are included in the abstract • Target documents: technical literature
Cue Method • Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. "Significant", "Greatest", "Impossible", "Hardly") • Three types of Cue words: • Bonus words: positively affecting the relevance of a sentence (e.g. "Significant", "Greatest") • Stigma words: negatively affecting the relevance of a sentence (e.g. "Impossible", "Hardly") • Null words: irrelevant
Obtaining Cue words • The lists were obtained by statistical analysis of 100 documents: • Dispersion (λ): the number of documents in which the word occurred • Selection ratio (η): the ratio of the number of occurrences in extractor-selected sentences to the number of occurrences in all sentences • Bonus words: η > t_high • Stigma words: η < t_low • Null words: λ > t_λ and t_low < η < t_high
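In code, the classification of a word into one of the three lists could look as follows; the threshold names and values are assumptions (Edmundson does not publish them):

```python
def classify_cue_word(dispersion, selection_ratio,
                      t_high=0.6, t_low=0.2, t_disp=20):
    """Classify a word as 'bonus', 'stigma' or 'null' from its
    dispersion (lambda) and selection ratio (eta), following the
    threshold scheme on this slide. Returns None for words that
    end up on no list."""
    if selection_ratio > t_high:
        return "bonus"
    if selection_ratio < t_low:
        return "stigma"
    if dispersion > t_disp:          # t_low <= eta <= t_high holds here
        return "null"
    return None
```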
Resulting Cue lists • Bonus list (783): comparatives, superlatives, adverbs of conclusion, value terms, etc. • Stigma list (73): anaphoric expressions, belittling expressions, etc. • Null list (139): ordinals, cardinals, the verb "to be", prepositions, pronouns, etc.
Cue weight of sentence • Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, all Null words with weight n = 0 • Cue weight of sentence: Σ (Cue weight of each word in sentence)
Key Method • Principle based on [Luhn, 1958]: counting the frequency of words • The algorithm differs: • Create a key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold • The weight of each key word in the key glossary is set to the frequency with which it occurs in the document • Assign the key weight to each word which can be found in the key glossary • If a word is not in the key glossary, its key weight is 0 • No relative position is used (unlike [Luhn, 1958]) • Key weight of sentence: Σ (Key weight of each word in sentence)
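A minimal sketch of the Key Method as described above (the frequency threshold is illustrative):

```python
from collections import Counter

def key_glossary(doc_words, cue_words, min_freq=5):
    """Key glossary: every non-Cue word above the frequency threshold,
    weighted by its frequency in the document."""
    freq = Counter(w.lower() for w in doc_words)
    return {w: f for w, f in freq.items()
            if f >= min_freq and w not in cue_words}

def key_weight(sentence_words, glossary):
    """Key weight of a sentence: sum of the key weights of its words
    (0 for words outside the glossary)."""
    return sum(glossary.get(w.lower(), 0) for w in sentence_words)
```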
Title Method • Based on the hypothesis that an author conceives the title as circumscribing the subject matter of the document (and similarly for headings vs. their paragraphs) • Create a title glossary consisting of all non-Null words in the title, subtitle and headings of the document • Words are given a positive title weight if they appear in this glossary • Title words are given a larger weight than heading words • Title weight of sentence: Σ (Title weight of each word in sentence)
Location Method • Based on the hypotheses that: • Sentences occurring under certain headings are positively relevant • Topic sentences tend to occur very early or very late in a document and its paragraphs • Global idea: • Give each sentence below a heading the same weight as the heading itself (note that this is independent of the Title Method) – Heading weight • Give each sentence a certain weight based on its position – Ordinal weight • Location weight of sentence: Ordinal weight of sentence + Heading weight of sentence
Location Method: Heading weight • Compare each word in a heading with the pre-stored Heading dictionary • If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary • Heading weight of a heading: Σ (heading weight of each word in heading) • Heading weight of a sentence = Heading weight of its heading
Creating the Heading dictionary • The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word: • Selection ratio (η): the ratio of the number of occurrences in extractor-selected sentences to the number of occurrences in all headings • Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.) • Weights proportional to the selection ratio were given to the words in the Heading dictionary • The resulting Heading dictionary contained 90 words
Location Method: Ordinal weight • Sentences of the first paragraph are tagged with weight O1 • Sentences of the last paragraph are tagged with weight O2 • The first sentence of a paragraph is tagged with weight O3 • The last sentence of a paragraph is tagged with weight O4 • Ordinal weight of sentence: O1 + O2 + O3 + O4 (summing only the tags the sentence actually received)
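A sketch of the complete Location weight; the values for O1..O4 are hypothetical (Edmundson does not publish them), and `heading_dict` is an assumed word-to-weight mapping built as on the previous slide:

```python
def ordinal_weight(par_idx, n_pars, sent_idx, n_sents,
                   O1=4, O2=3, O3=2, O4=1):
    """Sum the ordinal tags that apply to this sentence."""
    w = 0
    if par_idx == 0:            w += O1   # sentence in first paragraph
    if par_idx == n_pars - 1:   w += O2   # sentence in last paragraph
    if sent_idx == 0:           w += O3   # paragraph-initial sentence
    if sent_idx == n_sents - 1: w += O4   # paragraph-final sentence
    return w

def heading_weight(heading_words, heading_dict):
    """Heading weight of a heading (and of every sentence under it):
    sum of the dictionary weights of its words."""
    return sum(heading_dict.get(w.lower(), 0) for w in heading_words)

def location_weight(par_idx, n_pars, sent_idx, n_sents,
                    heading_words, heading_dict):
    return (ordinal_weight(par_idx, n_pars, sent_idx, n_sents)
            + heading_weight(heading_words, heading_dict))
```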
Generating the abstract • Calculate the weight of a sentence: aC + bK + cT + dL, with a, b, c, d constant positive integers and C: Cue weight, K: Key weight, T: Title weight, L: Location weight • The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human-made) abstracts • Return the N highest-weighing sentences under their proper headings as the abstract (including the title) • N is calculated by taking a percentage of the size of the original document; in this paper 25% is used
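Putting it together, a sketch of Edmundson's final scoring and selection; the coefficient values are placeholders, since Edmundson tuned them by hand:

```python
def sentence_weight(C, K, T, L, a=1, b=1, c=1, d=1):
    """Linear combination of the four feature weights."""
    return a * C + b * K + c * T + d * L

def extract(sentences, weights, fraction=0.25):
    """Return the highest-weighing fraction of the sentences,
    in original document order."""
    n = max(1, round(fraction * len(sentences)))
    top = sorted(range(len(sentences)),
                 key=lambda i: weights[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]
```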
Which combination is best? • All combinations of C, K, T and L were tried to see which result had (on average) the most overlap with the handmade extracts • As the results show (Figure 4 from [Edmundson, 1969]; only the interesting results are shown), the Key method was omitted: only C, T and L are used to create the best abstract • A surprising result! (Luhn used only keywords to create the abstract)
Evaluation • Evaluation was done on unseen data (40 technical documents), compared with handmade abstracts • Result: 44% of the sentences co-selected, 66% similarity between abstracts (human judge) • Random 'abstract': 25% of the sentences co-selected, 34% similarity between abstracts • Another evaluation criterion: 'extract-worthiness' • Result: 84% of the selected sentences are extract-worthy • Therefore: for one document there are many possible abstracts (differing in length and content)
Comments • [Goldstein e.a., 1999]: it is not good to base the length of the abstract on the length of the document • Summary length is independent of document length • The longer the document, the smaller the compression ratio ( |abstract| / |doc.| ) • Better to use a constant summary length • [Rath e.a., 1961]: human selection of sentences in abstracts is very variable • 6 abstracts of 20 sentences: only 32% overlap between 5 subjects (between all 6: 8%) • Abstracting the same document twice by the same person with 8 weeks in between: only 55% overlap (average over 6 subjects) • Perhaps the Key Method algorithm used here is not that good (Luhn's algorithm could be better)
Time and cost of this system • Speed of extracting: 7,800 words/minute • Cost: $0.015 / word, including keypunching costs of $0.01 / word • Used corpus of 29,500 words → $442.50 total cost • Adjusted for inflation (CPI, 2003): $2,798.00 total cost
A jump in time • 1969: First man on the moon • 1972: Watergate scandal • 1980: John Lennon killed • 1981: First identification of AIDS & birth of me • 1986: Space Shuttle Challenger explodes after launch • 1989: Fall of the Berlin Wall • 1990: Start of the Gulf War & introduction of the WWW • 1991: Soviet Union breaks up • 1992: Formal end of the Cold War • 1993: Creation of the European Union (Maastricht Treaty) • 1994: Nelson Mandela president of South Africa
1995: Trained summarization • Julian Kupiec, Jan Pedersen and Francine Chen: A Trainable Document Summarizer - 1995
Trained weighting • Edmundson used subjective weighting of the features (Cue, Key, Title, Location) to create an abstract • In this paper, generating the abstract is approached as a statistical classification problem • Given a training set of documents with handmade abstracts: • Develop a classification function that estimates the probability that a given sentence is included in the abstract • This requires a training corpus of documents with abstracts • Target documents: technical literature
Features • Five features were used: • Sentence Length Cut-off Feature • Fixed Phrase Feature • Paragraph Feature • Thematic Word Feature • Uppercase Word Feature • The above features were chosen by experimentation
Sentence Length Cut-off Feature • Based on the principle that short sentences are often not included in abstracts • Given a threshold (e.g. 5 words): • SLC-value is true for sentences longer than the threshold • SLC-value is false otherwise • Note that this feature is not similar to any of the features Edmundson used
Fixed-Phrase Feature • Based on the hypothesis that: • Sentences containing any of a list of fixed phrases (mostly 2 words long) are likely to be in the abstract (e.g. "in conclusion", "this result" – in total: 26 elements) • Sentences following a heading containing a certain keyword are more likely to be in the abstract (e.g. "conclusions", "results", "summary") • FP-value is true for sentences in the above situations, false otherwise • Note that this feature is a combination of Edmundson's Location Method and Cue Method, though in reduced form
Paragraph Feature • Each sentence in the first ten and last five paragraphs is tagged based on its location: • Paragraph-initial • Paragraph-final (if the paragraph has more than 1 sentence) • Paragraph-medial (if the paragraph has more than 2 sentences) • Note that this feature is a reduced form of Edmundson's Location Method
Thematic Word Feature • The most frequent words in a document are defined as thematic words • A small number of thematic words is selected and each sentence is scored as a function of the frequency of these thematic words • TW-value is true if the sentence is one of the highest-scoring sentences • TW-value is false otherwise • Note that this feature is an adapted version of Edmundson's Key Method
Uppercase Word Feature • Based on the hypothesis that proper names are often important, for instance as the explanatory text for acronyms (e.g. "… the ISO (International Standards Organization) …") • Count the frequency of each proper name • Constraint: the uppercase thematic word is not sentence-initial and begins with a capital letter • The word must occur several times and may not be an abbreviated measurement unit • Score each sentence based on the number of frequent proper names in it • The score of a sentence in which a frequent proper name appears for the first time is twice as high as that of later occurrences • UW-value is true if the sentence is one of the highest-scoring sentences, false otherwise • Note that this feature is somewhat similar to Edmundson's Key Method
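The five features can be summarized as one discrete feature vector per sentence; a sketch with argument names of my own, where the individual signals are assumed to be computed as described on the previous slides:

```python
def feature_vector(sentence_tokens, paragraph_tag, has_fixed_phrase,
                   is_thematic, is_uppercase, length_threshold=5):
    """Discrete feature vector for one sentence, in the spirit of
    [Kupiec e.a., 1995]."""
    return {
        "length_cutoff": len(sentence_tokens) > length_threshold,
        "fixed_phrase": has_fixed_phrase,
        "paragraph": paragraph_tag,   # 'initial' | 'medial' | 'final'
        "thematic": is_thematic,
        "uppercase": is_uppercase,
    }
```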
Classification • For each sentence s the probability is calculated that it will be included in the summary S, given its k features F_1, …, F_k (Bayes' rule): P(s ∈ S | F_1, …, F_k) = P(F_1, …, F_k | s ∈ S) · P(s ∈ S) / P(F_1, …, F_k) • Assuming statistical independence of the features: P(s ∈ S | F_1, …, F_k) = P(s ∈ S) · Π_j P(F_j | s ∈ S) / Π_j P(F_j) • P(F_1, …, F_k) is constant, and P(F_j | s ∈ S) and P(s ∈ S) can be estimated directly from the training set by counting occurrences • This function assigns each s a score which can be used to select sentences for inclusion in the abstract
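A sketch of the resulting scoring rule in log space; the probability tables are assumed to be counted on the training corpus (smoothing is left out):

```python
import math

def summary_score(features, p_f_given_s, p_f, p_s):
    """log of Kupiec e.a.'s score: P(s in S) * prod_j P(F_j | s in S) / P(F_j).
    `features` is a list of feature values; p_f_given_s[j] and p_f[j] map
    the value of feature j to its (smoothed) estimated probability."""
    score = math.log(p_s)
    for j, value in enumerate(features):
        score += math.log(p_f_given_s[j][value]) - math.log(p_f[j][value])
    return score   # rank all sentences by this score and pick the top ones
```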
The training material • 188 documents with professionally created abstracts from the scientific/technical domain; the average length of an abstract is 3 sentences (3.5% of the total size of the document) • Sentences from the abstracts were matched to the original documents: • 79% direct sentence matches • 3% direct joins (2 sentences combined) • 18% no direct match or join possible • Therefore the maximum performance of the automatic system is 82%
Evaluation (1) • Too little material → cross-validation is used to evaluate • Two evaluation measures: • Fraction of the manually selected sentences which were reproduced correctly: average result 35% • Fraction of the matchable selected sentences which were reproduced correctly: average result 42% • The performance of the individual features was also measured with the 2nd measure
Evaluation (2) • Best combination: Paragraph + Fixed Phrase + Length Cut-off (44% performance) • Adding the frequency-keyword features results in a slight decrease of performance (44% → 42%) • Note that Edmundson also reports a decrease in performance in this case • In the final implementation the frequency-keyword features are retained in favor of robustness • Baseline used in this experiment: selecting N sentences from the beginning (with Length Cut-off, thus positively biased) • The full feature set is an improvement of 74% over the baseline (24% → 42%)
Evaluation (3) • If the size of the generated abstract is increased to 25%, the performance improves to 84% • Edmundson 'only' had a performance of 44%
Comments • The features used in this paper were chosen by experimentation • No results/discussions of these experiments are given in the paper, so the reasons for the choices remain unclear… • The comparison to Edmundson is not very fair: • the handmade reference abstracts of Edmundson had a size of 25% (here: 3.5%) • The comments which were given about [Edmundson] also apply here: • It is not good to base the length of the abstract on the length of the document • Human selection of sentences in abstracts is very variable • Perhaps the Key Method algorithm used here is too simple (Luhn's algorithm could be better)
Revisited: [Kupiec e.a., 1995] • Simone Teufel and Marc Moens: Sentence extraction as a classification task - 1997
Main research questions • Could Kupiec e.a.'s methodology (training a model with a corpus) be used with another evaluation criterion? • What is the difference in extraction performance between both evaluation criteria for different types of documents? • Note that a different set of features is used here than Kupiec e.a. used
Another evaluation method • Kupiec e.a. used the 'match sentences' evaluation criterion • Here the training and test set abstracts were created by the authors themselves (as opposed to Kupiec e.a.) • Hence fewer alignable sentences are available in the documents • 32% on average vs. 79% in Kupiec e.a. • This does not mean there are fewer 'extract-worthy' sentences in the document → another evaluation method is chosen • Evaluation: ask a human to identify abstract-worthy non-matchable sentences in the original document
Features • The features used here differ from those of Kupiec e.a.: • Cue Phrase Method (1670 cue phrases) • Location Method • Sentence Length Method • Thematic Word Method • Title Method
Cue Phrase Method • As in Edmundson, with some differences: • A 5-point scale (-1 … +3) is used instead of 3 classes (Bonus, Null, Stigma) • Cue phrases are used instead of Cue words • If a phrase was entered into the list, syntactically and semantically similar phrases were also manually included • A sentence gets the score of its maximum-scored Cue phrase; if no Cue phrases are present it gets a score of 0 • The list was manually created by inspecting extracted sentences • Also based on the relative frequency in abstracts and the relative frequency in documents • Sentences occurring directly after headings like 'Introduction' or 'Conclusion' are given a prior score of +2 (in Edmundson this is part of the Location Method)
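A sketch of this scoring rule; how the +2 prior combines with a phrase match is not fully specified on this slide, so taking the maximum is an assumption:

```python
def cue_phrase_score(sentence_text, cue_phrases, after_special_heading=False):
    """Score of a sentence under the Cue Phrase Method: the score of its
    maximum-scored cue phrase (on the -1..+3 scale), 0 if none match.
    `cue_phrases` maps phrase -> score."""
    text = sentence_text.lower()
    hits = [score for phrase, score in cue_phrases.items() if phrase in text]
    score = max(hits, default=0)
    if after_special_heading:        # directly after 'Introduction'/'Conclusion'
        score = max(score, 2)        # assumption: prior combines via max
    return score
```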
Location Method • As in Edmundson, with the exception of the sentences directly after headings mentioned previously • Sensitive to certain headings (e.g. "Introduction"); if such headings cannot be found, only the sentences of the first 7 and last 3 paragraphs are tagged (initial, medial, final)
Sentence Length Method • As in Kupiec e.a. • The threshold is set to 15 tokens (including punctuation)
Thematic Word Method • As in Kupiec e.a., with a few differences: • Select (non-Cue) words which occur frequently in this document, but rarely in the overall collection of documents • For each (non-Cue) word the term-frequency * inverse-document-frequency value is calculated: • score(w) = f_loc * log(100 * N / f_glob) • with N: total number of documents, f_loc: frequency of word w in the document, f_glob: number of documents containing word w • The top 10 scoring words are defined as thematic words • The top 40 sentences based on the frequency of thematic words (normalized by sentence length) are given a TW-value of 1, all others 0
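This tf*idf scoring translates directly into code; `doc_freq` is an assumed precomputed mapping from word to the number of documents containing it:

```python
import math
from collections import Counter

def thematic_words(doc_words, doc_freq, n_docs, cue_words, top_k=10):
    """score(w) = f_loc * log(100 * N / f_glob), as on this slide;
    returns the top_k scoring non-Cue words of the document."""
    f_loc = Counter(w.lower() for w in doc_words)
    scores = {w: f * math.log(100 * n_docs / doc_freq[w])
              for w, f in f_loc.items()
              if w not in cue_words and doc_freq.get(w, 0) > 0}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```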