Revision
Chapter 4 in Automatic Summarization
한경수, 2001-11-15
Korea University, Natural Language Processing Lab
Contents
• Introduction
• Shallow coherence smoothing
• Full revision to improve informativeness
• Text compaction
• Conclusion
Introduction
• Human abstractors revise abstracts to improve conciseness, fluency, coherence, etc.
  • Local (intra-sentence) revision
  • Global (inter-sentence) revision
  • Dropping of vague or redundant terms
  • Generalization/specialization operations
  • Reference adjustments
  • Lexical substitutions
• Situations where local revision is especially useful
  • Wireless hand-held devices
  • TV captions
  • Word processors
    • Automatic style checkers suggesting ways of making text more concise
Shallow coherence smoothing: Coherence problems
• Nanba & Okumura (2000)
• Coherence problems
  • Lack of conjunctions between sentences / dangling sentential conjunctions
    • E.g. “however”
  • Lack of / extraneous adverbial particles
    • E.g. “too”
  • Syntactically complex sentences
  • Redundant repetition
    • E.g. repetition of proper names
  • Lack of information
  • Dangling anaphors
    • E.g. “in such a situation”
  • Proper names without adequate descriptions
    • E.g. “CEO Young-In Song”
Shallow coherence smoothing: Dangling anaphors
• Knowing when an anaphor will dangle …
  • Requires identifying which of the referring expressions corresponds to the referent of the anaphor
  • Difficult in the general case
    • Requires both linguistic knowledge and domain knowledge
  • Some systems try to address this problem in shallow ways
• Addressing the dangling-anaphor problem
  • Exclude all sentences containing specific anaphors
  • Include some window of previous sentences for readability
  • Delete an anaphor if the previous sentence is not in the summary
Shallow coherence smoothing: Dangling anaphors (continued)
• Addressing the dangling-anaphor problem (continued)
  • Use within-sentence position [Paice (1990)]
    • “he” might be considered as requiring at least the previous sentences as context if …
      • There is no other occurrence of “he”, “his”, or “him” earlier in the sentence
      • It occurs within the first 10 words of the sentence
  • Present the referent alongside the resolved anaphor
    • Reduces the need to introduce referent-containing sentences
  • Adjust all the references to the new context
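The within-sentence position test attributed to Paice (1990) can be sketched as below. This is only an illustration of the two conditions on the slide; the function name, the tokenizer, and the `window` parameter are assumptions, not part of the original method.

```python
import re

# Forms of the pronoun that count as an "earlier occurrence" (from the slide).
PRONOUN_FORMS = {"he", "his", "him"}

def needs_previous_context(sentence, pronoun="he", window=10):
    """Return True if the first `pronoun` in the sentence is likely to dangle.

    Both slide conditions must hold: no he/his/him earlier in the sentence,
    and the pronoun occurs within the first `window` words.
    """
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", sentence)]
    for i, w in enumerate(words):
        if w == pronoun:
            no_earlier_form = not any(e in PRONOUN_FORMS for e in words[:i])
            return no_earlier_form and i < window
    return False  # the pronoun does not occur at all
```

A sentence flagged by this test would then trigger one of the repairs listed above, such as pulling in the previous sentence.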
Shallow coherence smoothing: Dangling conjunctions
• Nanba & Okumura (2000)
  • Examine the rhetorical structure of the text within 3 sentences of the conjunction in order to find the sentence related by the conjunction
  • If it isn’t found, the conjunction is deleted
• Johnson et al. (1993)
  • Describe an extensive set of rules for excluding sentences
    • Recognize dangling anaphors
    • Recognize comparatives missing the comparand
      • E.g. “greater in 1986”
    • Recognize connectives like “hence”
• Pollock & Zamora (1975)
  • Exclude sentences which contain introductory and parenthetical clauses
    • Pattern: comma, string, comma, verb/“to”
    • Exclude leading strings before that
  • Normalization of vocabulary in output
    • Abbreviation of words or phrases, standardizing non-US spelling
    • Replacement of chemical compound names with formulas
Shallow coherence smoothing: Structured environments & gaps
• Structured environments
  • Document elements such as itemized lists, tables, and logical arguments
    • Whose structural integrity needs to be preserved in the summary
  • It is often very difficult to parse the structure
    • But it may be simpler to recognize that one is in a structured environment
  • In either case, one has the choice of …
    • Either recognizing such an environment and excluding it,
    • Or else attempting to summarize it
• Gaps
  • Include otherwise excluded lower-ranked sentences immediately between two selected sentences
  • Add the first sentence of a paragraph
    • If the 2nd or 3rd sentence has been selected previously
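The two gap-filling heuristics can be sketched as follows. Sentences are identified by their index in the source, and each paragraph is given as a list of sentence indices; this representation, the function name, and the reading of "immediately between" as a one-sentence gap are assumptions made for the example.

```python
def fill_gaps(selected, paragraphs):
    """Return the selected sentence indices after gap filling, in source order.

    selected   -- set of indices chosen by the extractor
    paragraphs -- list of paragraphs, each a list of sentence indices in order
    """
    result = set(selected)

    # Heuristic 1: if sentences i and i+2 are both selected, also include i+1.
    for i in selected:
        if i + 2 in selected:
            result.add(i + 1)

    # Heuristic 2: if the 2nd or 3rd sentence of a paragraph was selected,
    # also include the paragraph's first sentence.
    for para in paragraphs:
        if any(idx in selected for idx in para[1:3]):
            result.add(para[0])

    return sorted(result)
```

For example, with paragraphs `[[0, 1, 2, 3], [4, 5, 6]]` and selected sentences `{1, 3}`, the sketch adds sentence 2 (the gap) and sentence 0 (the paragraph opener).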
Shallow coherence smoothing: Post-processing ‘repair’ step
• Loss of compression
  • Due to introduction of extraneous material
  • The result of coherence-based repairs is normally outside the scope of the scoring metric used for extraction
• Strzalkowski et al. (1999)
  • Every passage to be extracted is tested to see if it requires a background passage by …
    • The presence of anaphors in the first six words of the referential passage
    • The presence of cue phrases in those passages
  • Both main and background passages are scored
    • Based on overlap with terms in the query
  • Groups of passages are then merged into new passages
    • With the merging being extended to include any needed background passages
  • The passages are re-scored
    • With a penalty for passages that contain elided material
  • The merge-and-score steps are repeated
    • Until there is no further change in the top-scoring passage over 2 iterations
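The first two steps of the Strzalkowski et al. procedure (testing for a needed background passage, and scoring by query overlap) might look roughly like this. The anaphor and cue-phrase lists are illustrative placeholders, not the lists used in the original system, and the merge-and-rescore loop is not covered.

```python
import re

# Hypothetical word lists, for illustration only.
ANAPHORS = {"he", "she", "it", "they", "this", "these", "those", "such"}
CUE_PHRASES = ("however", "in addition", "for example")

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def needs_background(passage, window=6):
    """True if an anaphor occurs in the first `window` words, or a cue phrase occurs anywhere."""
    words = tokenize(passage)
    if any(w in ANAPHORS for w in words[:window]):
        return True
    padded = " " + " ".join(words) + " "
    return any(f" {cue} " in padded for cue in CUE_PHRASES)

def query_overlap(passage, query):
    """Score a passage by the number of distinct query terms it contains."""
    return len(set(tokenize(passage)) & set(tokenize(query)))
```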
Full revision to improve informativeness: Case study: Full revision
• Introduction
  • Construct an initial draft summary of a source text
  • Add additional background information from the source to it
  • Information in the draft is combined and excised based on compaction rules
    • Sentence aggregation operation
      • Gather and draw in relevant background information, in the form of descriptions of discourse entities from different parts of the source
    • Elimination operation
      • Increase the amount of compression available
• Hypothesis
  • These operations can pack in more information per unit of compression than is possible by concatenation
Full revision to improve informativeness: Case study: Full revision
• The revision program
  • Input
    • Source document
      • Each input sentence is represented as a syntactic tree
      • Whose nodes are annotated with coreference information
    • Draft summary specification
      • Specified in terms of a task-dependent weighting function
    • Target compression rate
      • The program generates a revised summary draft whose compression rate is no more than a given margin above the target compression rate
  • Revision process
    • A unary rule applies to a single sentence
    • A binary rule applies to a pair of sentences
      • At least one of which must be in the draft
      • Where the first sentence precedes the second in the input
    • Control over sentence complexity is imposed
      • By failing rule application
      • When the draft sentence is too long or the parse tree is too deep
    • The program terminates …
      • When no more rules apply, or when the compression rate would exceed the target by more than that margin
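The control regime of the revision program can be illustrated with a small loop. This is a sketch under several assumptions: sentences are plain strings rather than annotated parse trees, a rule is any function that maps a draft to a new draft (or `None` when it does not apply), compression rate is taken as summary words over source words, and `max_sentence_len` and `max_steps` are made-up parameters.

```python
def compression_rate(draft, source_len):
    """Summary length over source length, both in words (assumed definition)."""
    return sum(len(s.split()) for s in draft) / source_len

def revise(draft, source_len, rules, target_rate, margin,
           max_sentence_len=40, max_steps=100):
    """Apply revision rules until none applies or the budget would be exceeded."""
    limit = target_rate + margin
    for _ in range(max_steps):  # guard against rules that undo each other
        for rule in rules:
            candidate = rule(draft)
            if candidate is None or candidate == draft:
                continue  # rule does not apply
            if any(len(s.split()) > max_sentence_len for s in candidate):
                continue  # complexity control: rule application fails
            if compression_rate(candidate, source_len) > limit:
                continue  # would exceed the target by more than the margin
            draft = candidate
            break  # restart from the first rule on the revised draft
        else:
            return draft  # no rule could be applied
    return draft
```

Elimination rules shrink the draft and so free up room under the limit, which aggregation rules can then spend on background information.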
Full revision to improve informativeness: Case study: Full revision
• Revision rules
  • Sentence compaction operations
    • Eliminate constituents from a sentence
      • Parentheticals, sentence-initial PPs, adverbial phrases satisfying lexical tests (“In particular”, “Accordingly”, “In conclusion”, …)
  • Sentence aggregation operations
    • Combine constituents from two sentences, at least one of which must be a sentence in the draft, into a new constituent which is inserted into the draft sentence
    • The basis for combining sentences is referential identity
      • If there is an NP in sentence i which is coreferential with an NP in sentence j, then sentences i and j are candidates for aggregation
    • The most common form of aggregation is expressed as tree-adjunction
      • The relative clause introduction rule turns a VP of a sentence whose subject is coreferential with an NP of an earlier sentence into a relative clause modifier of the draft sentence NP
        • “The virus, which infected only unclassified computers, destroyed no data.”
      • Other appositive phrase insertion rules include copying and inserting relative clause modifiers, appositive modifiers of proper names, and proper name appositive modifiers of definite NPs
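As a rough illustration of the compaction (elimination) operations, the sketch below removes parentheticals and a few sentence-initial adverbial phrases. The program described on the slide works on parse trees; the regular expressions and string tests here are a simplified stand-in, and the phrase list contains only the three examples from the slide.

```python
import re

# Lexical tests for sentence-initial adverbials (examples from the slide).
INITIAL_ADVERBIALS = ("In particular", "Accordingly", "In conclusion")

def compact(sentence):
    """Drop parentheticals and a leading adverbial phrase, if present."""
    # Remove "( ... )" together with the whitespace before it.
    s = re.sub(r"\s*\([^()]*\)", "", sentence)
    for phrase in INITIAL_ADVERBIALS:
        prefix = phrase + ","
        if s.startswith(prefix):
            s = s[len(prefix):].lstrip()
            s = s[:1].upper() + s[1:]  # re-capitalize the new first word
            break
    return s
```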
Full revision to improve informativeness: Case study: Full revision
• Revision rules (continued)
  • Sentence smoothing operations
    • Apply to a single sentence
    • Perform transformations so as to arrive at more compact, stylistically preferred sentences
    • Coordination reduction operations
      • Ellipsis
        • “The rogue computer program destroyed files over a five month period and infected close to 100 computers at NASA facilities”
      • Relative clause reduction
        • Relative pronoun (RP) + “be” → deleted
        • RP + “have” → “with”
        • RP + V → V+ing
    • Coordination rules
      • Relative clause coordination
  • Reference adjustment operations
    • Fix up the results of other revision operations in order to improve discourse-level coherence; as a result, they are run last
    • Name aliasing
    • Pronoun expansion
    • Indefinitization: a definite NP → a coreferential indefinite
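Two of the relative clause reduction rules can be approximated on surface strings as shown below. This is an assumption-heavy sketch: the original rules operate on parse trees, whereas these regular expressions are blind to syntax (they would also fire on a complementizer "that", or on auxiliary "have"). The RP + V → V+ing rule is left out because it needs morphological generation.

```python
import re

RP = r"(?:which|who|that)"      # relative pronouns considered
BE = r"(?:is|are|was|were)"
HAVE = r"(?:has|have|had)"

def reduce_relative_clause(sentence):
    """Apply RP + be -> (deleted) and RP + have -> "with" on the surface string."""
    s = re.sub(rf"\b{RP}\s+{BE}\s+", "", sentence)
    s = re.sub(rf"\b{RP}\s+{HAVE}\s+", "with ", s)
    return s
```

For instance, "The program, which was written in C, crashed." becomes "The program, written in C, crashed."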
Full revision to improve informativeness: Case study: Full revision
• Coreference rules
  • A singular definite NP is treated as coreferential with the last singular definite or singular indefinite atomic NP with the same head
    • Provided they are within a particular threshold distance from each other
  • “He” is marked as coreferential with the last person name mentioned
    • With gender agreement enforced when the person’s gender is known
  • Proper names are marked as coreferential with each other based on rules for abbreviating names of various entities
    • People, organizations, places
  • These rules don’t disambiguate different word senses of the same word, nor do they address inferential relationships between the definite NP and its antecedent
Full revision to improve informativeness: Case study: Full revision
• Evaluations
  • Evaluation involving revision of topic-focused summaries using informativeness measures from the TIPSTER SUMMAC evaluation
  • Q&A evaluation
    • 3 TREC topics × 30 relevant docs
    • Prepare questions & answer keys
  • Informativeness is measured by automatic scoring based on the overlap measures
  • Readability is measured in terms of word and sentence complexity
  • Gains in informativeness without compromising readability
Full revision to improve informativeness: Related work
• Sentence reduction (Jing 1999)
  • Sentence extraction
    • Text cohesion feature
      • Repetition, synonymy, hypernymy, meronymy links from WordNet
    • Edmundsonian features
  • Syntactically obligatory material is identified
    • By finding required arguments for verbs using a subcategorization lexicon
  • Each word is given a weight based on text cohesion links
  • Each phrase is given a weight by adding up the scores of its child nodes in the parse tree
  • The probability that a phrase should be eliminated, compacted, or left unchanged is computed
    • Using a corpus of articles and the aligned sentences in their abstracts
  • A phrase is eliminated …
    • If it isn’t in a syntactically obligatory role,
    • If it doesn’t have a high cohesion score,
    • And if it has a high probability of being eliminated
  • The advantage of using a corpus to guide the revisions cannot be overemphasized
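The elimination decision described for Jing (1999) combines three tests. A minimal sketch is given below; the `Node` class, the field names, and both thresholds are hypothetical, and the obligatoriness flag and elimination probability are taken as given rather than derived from a lexicon or corpus.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    obligatory: bool = False    # would come from a subcategorization lexicon
    p_eliminate: float = 0.0    # would be estimated from an aligned corpus
    word_weight: float = 0.0    # cohesion weight; meaningful for leaves only
    children: list = field(default_factory=list)

def cohesion(node):
    """Phrase weight = sum of the weights of its child nodes (leaf = word weight)."""
    if not node.children:
        return node.word_weight
    return sum(cohesion(c) for c in node.children)

def should_eliminate(node, min_cohesion=1.0, min_prob=0.5):
    """Eliminate only if not obligatory, low in cohesion, and likely to be dropped."""
    return (not node.obligatory
            and cohesion(node) < min_cohesion
            and node.p_eliminate > min_prob)
```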
Full revision to improve informativeness: Implications
• Emulating some revision processes of abstractors
  • Redundant and vague term dropping, and prescribed lexical substitutions
    • Problem of ambiguous words
  • Reference adjustment
    • Requires accurate resolution of anaphoric references
• Local & global revision
  • Global revision is potentially computationally more expensive
  • Context-dependent lexical paraphrasing is much harder
    • Semantic and discourse-level understanding has to be carried out
Text compaction: Telegraphic text reduction
• Grefenstette (1998)
  • A sentence truncation method which uses a number of syntactic reduction preferences or levels, producing a kind of ‘telegraphese’ at high reduction levels
    • Proper nouns are more important than common nouns
    • Nouns are more important than adjectives
    • Adjectives are more important than articles
    • Nouns are more important than adverbs
    • Main clauses are more important than subordinate clauses
    • Negations are always important
  • Relies exclusively on elimination operations
    • With no use of any sentence weighting measures
  • The notion of informativeness, reflected in the reduction levels, is rather ad hoc
    • It is not clear that the ordering of levels reflects an ordering as to how much information is preserved
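The preference ordering can be turned into a toy reduction function over part-of-speech-tagged tokens. The tag names, the numeric ranks, the rank given to verbs, and the level numbering are all assumptions for illustration; they are not Grefenstette's actual reduction levels, and clause-level preferences are ignored.

```python
# Hypothetical importance ranks that respect the orderings on the slide:
# proper noun > common noun > adjective > article, and noun > adverb.
RANK = {"PROPN": 4, "NOUN": 3, "VERB": 3, "ADJ": 2, "ADV": 1, "DET": 1}

def telegraphic(tagged, level):
    """Keep tokens whose rank is at least `level`; negations are always kept.

    tagged -- list of (word, tag) pairs from some POS tagger
    """
    return " ".join(
        word for word, tag in tagged
        if tag == "NEG" or RANK.get(tag, 0) >= level
    )
```

Raising `level` drops articles and adverbs first, then adjectives, then common nouns and verbs, which yields the telegraphic style described above.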
Text compaction: Sentence compression
• Knight & Marcu (2000)
  • Use a Ziff-Davis corpus
    • Parallel corpus
      • Given this ‘parallel corpus’ of document sentences and their corresponding abstract sentences,
      • They use a vocabulary overlap measure to identify abstract sentences that are similar to, and shorter than, a corresponding document sentence
      • Such abstract sentences are viewed as truncated forms of the corresponding document sentence
    • Each sentence is syntactically parsed using the Collins parser
  • 210 atomic revision operations
    • Stated in terms of the machinery of shift-reduce parsing using a tree transducer
    • Given the syntactic trees of both sentences, the rewriting of one tree into the other is automatically decomposed into a sequence of these operations
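Pairing abstract sentences with longer document sentences by vocabulary overlap could be sketched as follows. The overlap definition (fraction of the abstract sentence's distinct words found in the document sentence) and the threshold are assumptions; the slide does not give the exact measure used by Knight & Marcu.

```python
import re

def words(sentence):
    return re.findall(r"[a-z0-9']+", sentence.lower())

def overlap(short, long):
    """Fraction of distinct words in `short` that also occur in `long`."""
    sw, lw = set(words(short)), set(words(long))
    return len(sw & lw) / len(sw) if sw else 0.0

def compression_pairs(doc_sentences, abstract_sentences, threshold=0.9):
    """Return (document sentence, abstract sentence) pairs usable as training data."""
    pairs = []
    for a in abstract_sentences:
        # Only document sentences longer than the abstract sentence qualify.
        longer = [d for d in doc_sentences if len(words(d)) > len(words(a))]
        if not longer:
            continue
        best = max(longer, key=lambda d: overlap(a, d))
        if overlap(a, best) >= threshold:
            pairs.append((best, a))
    return pairs
```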
Text compaction: Sentence compression
• Decision-based model
  • A decision-tree classifier is trained on this data
  • During testing, the tree for a source sentence is given to the tree transducer in its initial state
    • The classifier identifies which particular operation to apply
    • This moves the transducer to the next state, where the classifier is queried again
• Noisy channel model
  • One can view a long sentence as a garbled version of a short one
  • The task of compaction is to recover the original short sentence
  • Uses a syntactic analysis based on PCFG
• Evaluation
  • Informativeness
    • The extent to which the compacted sentences selected the most important words in the original sentence
  • Grammaticalness
  • Both machine methods were more informative and grammatical than the baseline
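The noisy channel view corresponds to the standard Bayesian decoding formulation. The symbols here are assumptions introduced for this note: t is the observed long sentence and s ranges over candidate short sentences. P(s) is the source model, which scores how well-formed a short sentence is, and P(t | s) is the channel model, which scores how likely the long sentence is as an expansion of s.

```latex
\hat{s} \;=\; \arg\max_{s} P(s \mid t)
        \;=\; \arg\max_{s} \frac{P(s)\, P(t \mid s)}{P(t)}
        \;=\; \arg\max_{s} P(s)\, P(t \mid s)
```

The denominator P(t) can be dropped because it does not depend on s.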
Conclusion
• Different revision approaches may be fruitfully combined (Table 4.3, p. 88)
• No semantic-level analysis is involved here
  • No specialization/generalization operations
  • The coreference methods ignore word-sense distinctions and inferential relationships between referring expressions
    • A severe limitation
• Open question
  • How far can one carry out full revision without understanding what is being talked about?