A Corpus-based Analysis for the Ordering of Clause Aggregation Operators James Shaw Multimedia/Video Technology Department 755 College Road East Siemens Corporate Research, Inc. Princeton, NJ 08540 shaw@scr.siemens.com
Introduction • General goal: Automatic generation of concise and fluent complex sentences. Input propositions: • Jones is a female patient. • Jones has hypertension. • Jones has peptic ulcer. • Jones is 80 years old. • Jones … Output sentence: • Ms. Jones is an 80-year-old hypertensive patient with peptic ulcer who underwent coronary artery bypass grafting. • Syntactic clause aggregation operators: • Paratactic constructions: • Conjunction transformations • Hypotactic constructions: • Adjective, prepositional phrase, reduced relative clause, and relative clause transformations COLING 2002
Specific Goal • What is the correct order in which to apply clause aggregation operators in a domain-independent natural language generation system? • In our first implementation of the MAGIC system (McKeown97), the operators were applied in the following order: • Paratactic operators first (conjunction transformations) • Hypotactic operators • Paratactic operators again • Why are the paratactic operators applied twice and the hypotactic operators only once? • We cannot simply permute all the clause aggregation operators and find an optimal ordering. • Instead of finding an optimal ordering, our goal is to find an ordering that performs well.
Why the ordering should be identified • Clause aggregation operators are not commutative. Applying one of the operators to the input propositions can prevent the application of others. • The ordering affects meaning: • Input propositions: • John drank cider • John ate oranges • (even though) John didn’t like fruits • Potential output sentences: • John drank cider and even though he didn’t like fruits, he ate oranges. • Even though John didn’t like fruits, he drank cider and ate oranges.
Related Work • Syntactically simple expressions of embedding are to be preferred over more complex ones (Scott and de Souza90, Shaw98) • Rhetorical Structure Theory (Mann and Thompson 1988) • Cohesion analysis (Halliday and Hasan76) • Similar to other work in clause aggregation (Moser and Moore95, Rösner and Stede92) • Graphical tools to facilitate discourse annotation (O'Donnell00, Garside and Rayson97) • Automating discourse annotation (Marcu00) • Ordering of repeated applications of the same operator, such as the adjective transformation (Shaw99, Malouf00)
Methodology • Collect a corpus of sentences containing paratactic and hypotactic constructions • De-aggregate those sentences into propositions • Specify rhetorical relations between the propositions • Specify a sequence of transformation operators that combines the de-aggregated propositions into the original sentences • Evaluate how well the proposed operator ordering works by checking each sequence of transformations against our proposed ordering of clause aggregation operators.
Corpus Collection • The corpus is taken from the medical domain and the Wall Street Journal. • Only sentences containing the conjunctor “and” are selected, to increase the likelihood of encountering interactions between paratactic and hypotactic constructions. • Due to the amount of effort needed to annotate complex sentences, only 100 sentences from each domain are annotated.
Corpus Annotation • Use XML as the markup language • Manual de-aggregation by the author • Each annotated sentence entry consists of 5 parts: • The original sentence. • A list of de-aggregated propositions after manual reconstruction of elided constituents. These propositions are enclosed in a propset. • The rhetorical relations that link the de-aggregated propositions or propsets. • The sequence of transformations that can be used to reproduce the original sentence. • The annotator’s comments.
Annotation Sections 1, 2 & 3: The original sentence, the propositions, and the rhetorical relations
• Section 1 (Original sentence): “Local sports fans themselves, long known for their passive demeanor at games and propensity to leave early, don’t resist the image.”
• Section 2 (Propositions):
<propset id="pset32-1">
  <prop id="p32-1">Local sports fans don't resist the image.</prop>
  <prop id="p32-2">Local sports fans are long known for their passive demeanor at games.</prop>
  <prop id="p32-3">Local sports fans are long known for their propensity to leave early.</prop>
</propset>
• Section 3 (Rhetorical relations):
<focus entity="local sports fans"/>
<rst-rel id="r32-1" name="elab" nuc="p32-1" sat="p32-2" />
<rst-rel id="r32-2" name="elab" nuc="p32-1" sat="p32-3" />
Annotation Sections 2 & 4: Propositions and operator sequences
• Section 2 (Propositions):
<propset id="pset32-1">
  <prop id="p32-1">Local sports fans don't resist the image.</prop>
  <prop id="p32-2">Local sports fans are long known for their passive demeanor at games.</prop>
  <prop id="p32-3">Local sports fans are long known for their propensity to leave early.</prop>
</propset>
• Section 4 (Operator sequences):
<trans id="tx32-1" name="conj-simp" nuc="p32-2" sat="p32-3" />
Local sports fans are long known for their passive demeanor at games and local sports fans are long known for their propensity to leave early.
<trans id="tx32-2" name="rel-reduced-del-wh-be" nuc="p32-1" sat="tx32-1" />
Local sports fans themselves, who are long known for their passive demeanor at games and propensity to leave early, don’t resist the image.
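The annotation sections above can be read back with a standard XML parser. The following is a minimal sketch, not the author's tooling: the wrapping <entry> element and the function name load_entry are assumptions introduced here so that the fragments form one well-formed document; the element and attribute names (propset, prop, rst-rel, trans, id, name, nuc, sat) are taken from the slides.

```python
import xml.etree.ElementTree as ET

# The <entry> wrapper is hypothetical; the slides show only the fragments inside it.
ENTRY = """
<entry>
  <propset id="pset32-1">
    <prop id="p32-1">Local sports fans don't resist the image.</prop>
    <prop id="p32-2">Local sports fans are long known for their passive demeanor at games.</prop>
    <prop id="p32-3">Local sports fans are long known for their propensity to leave early.</prop>
  </propset>
  <rst-rel id="r32-1" name="elab" nuc="p32-1" sat="p32-2" />
  <rst-rel id="r32-2" name="elab" nuc="p32-1" sat="p32-3" />
  <trans id="tx32-1" name="conj-simp" nuc="p32-2" sat="p32-3" />
  <trans id="tx32-2" name="rel-reduced-del-wh-be" nuc="p32-1" sat="tx32-1" />
</entry>
"""

def load_entry(xml_text):
    """Return (propositions, rhetorical relations, transformations) for one entry."""
    root = ET.fromstring(xml_text)
    # Map proposition id -> proposition text.
    props = {p.get("id"): p.text.strip() for p in root.iter("prop")}
    # Each relation is (name, nucleus id, satellite id).
    relations = [(r.get("name"), r.get("nuc"), r.get("sat"))
                 for r in root.iter("rst-rel")]
    # Each transformation is (id, name, nucleus id, satellite id), in document order.
    transforms = [(t.get("id"), t.get("name"), t.get("nuc"), t.get("sat"))
                  for t in root.iter("trans")]
    return props, relations, transforms

props, relations, transforms = load_entry(ENTRY)
```

Note that the satellite of tx32-2 is tx32-1, i.e. the output of the earlier conjunction, which is how the sequence records that one transformation feeds another.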
Annotated Section 5: Annotator’s comment
• Our proposed aggregation operator ordering:
  • Adjective (conjunction optional)
  • Prepositional phrase (conjunction optional)
  • Reduced relative clause, including apposition (conjunction optional)
  • Relative clause (conjunction optional)
  • Transformations for other rhetorical relations (conjunction optional)
  • Simple conjunction
  • Complex conjunction
• Section 5 (annotator’s comment):
<seqorder valid="true" />
<conj id="c32-1" type="dist" />
• If the sequence of transformations does not differ from the proposed ordering, seqorder is assigned true.
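One way to picture the seqorder check is as a rank comparison over the annotated transformation sequence. This is a sketch under stated assumptions, not the procedure used in the study (the slides describe the check as part of manual annotation): only the operator names conj-simp and rel-reduced-del-wh-be appear in the slides, so the other name prefixes below are hypothetical labels, and "conjunction optional" is interpreted here as "a conjunction whose result becomes the satellite of a later hypotactic operator counts as a sub-step of that operator".

```python
# Rank of each operator class in the proposed ordering (lower = applied earlier).
# Only "conj-simp" and "rel-reduced-..." occur in the slides; the remaining
# prefixes are hypothetical labels chosen for this sketch.
RANK = [
    ("adj", 0),
    ("pp", 1),
    ("rel-reduced", 2),
    ("rel", 3),
    ("other", 4),
    ("conj-simp", 5),
    ("conj-complex", 6),
]

def rank_of(name):
    # Longest matching prefix wins, so "rel-reduced-..." is not read as "rel".
    matches = [(len(prefix), rank) for prefix, rank in RANK if name.startswith(prefix)]
    if not matches:
        raise ValueError(f"unknown operator: {name}")
    return max(matches)[1]

def follows_ordering(transforms):
    """transforms: list of (id, name, nuc, sat) tuples in application order."""
    ranks = []
    for i, (tid, name, nuc, sat) in enumerate(transforms):
        is_conj = name.startswith("conj")
        # Assumed reading of "conjunction optional": a conjunction whose result
        # is the satellite of a later hypotactic operator is not ranked on its own.
        feeds_hypotactic = any(
            later_sat == tid and not later_name.startswith("conj")
            for _, later_name, _, later_sat in transforms[i + 1:]
        )
        if is_conj and feeds_hypotactic:
            continue
        ranks.append(rank_of(name))
    # The sequence is valid when the remaining ranks never decrease.
    return ranks == sorted(ranks)

example = [
    ("tx32-1", "conj-simp", "p32-2", "p32-3"),
    ("tx32-2", "rel-reduced-del-wh-be", "p32-1", "tx32-1"),
]
valid = follows_ordering(example)
```

For the sports-fans example, the conjunction tx32-1 feeds the reduced relative clause tx32-2 as its satellite, so the sequence is accepted, in line with the seqorder valid="true" annotation.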
The concept of Propset • Issues • De-aggregated propositions alone do not provide sufficient information to reproduce the original sentence. • John drank cider • John ate oranges • (even though) John didn’t like fruits These propositions can be mapped to either of the following sentences: • John drank cider and even though he didn’t like fruits, he ate oranges. • Even though John didn’t like fruits, he drank cider and ate oranges. • The number of rhetorical relations can be greater than the number of propositions, up to … for n propositions.
The Benefits of using Propset
• Propset allows annotators to do the following:
  • Group propositions that are more tightly related, e.g., a smoker who quit 10 years ago.
  • Specify the scope of modifying propositions, as in the earlier example.
  • Simplify annotation for certain constructions, such as “say” and “believe”:
    [propset [prop John believed [propset [prop Tim invested in stock] [prop Tim invested in real estate] ] ] ]
    instead of
    [prop John believed Tim invested in stock] [prop John believed Tim invested in real estate]
  • Minimize redundant specification of multiple modifying rhetorical relations
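The “believe” example above can be pictured as a small nested data structure. The sketch below is illustrative only: the class Prop, its embedded field, and the function expand are hypothetical names, not part of the annotation scheme. It shows that the nested propset form carries the same information as the two longer flat propositions.

```python
from dataclasses import dataclass, field

@dataclass
class Prop:
    """A proposition, optionally embedding a propset (hypothetical structure)."""
    text: str
    embedded: list = field(default_factory=list)  # nested propositions, may be empty

def expand(prop):
    """Expand a nested proposition into the equivalent flat propositions."""
    if not prop.embedded:
        return [prop.text]
    # Prefix every embedded proposition with the embedding clause.
    return [f"{prop.text} {inner}"
            for child in prop.embedded
            for inner in expand(child)]

belief = Prop("John believed", [
    Prop("Tim invested in stock"),
    Prop("Tim invested in real estate"),
])
flat = expand(belief)
```

Here flat holds the two propositions that would otherwise have to be annotated separately, each repeating "John believed".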
Why minimize redundant rhetorical relations?
• The sentence: Even though John didn’t like fruits, he drank cider and ate oranges.
• Input propositions:
<prop id="p1-1">John drank cider.</prop>
<prop id="p1-2">John ate oranges.</prop>
<prop id="p1-3">(even though) John didn’t like fruits.</prop>
• Instead of
<rst-rel id="r1-1" name="elab" nuc="p1-1" sat="p1-3" />
<rst-rel id="r1-2" name="elab" nuc="p1-2" sat="p1-3" />
<rst-rel id="r1-3" name="join" nuc="p1-1" sat="p1-2" />
the annotated relations are
<rst-rel id="r1-1" name="join" nuc="p1-1" sat="p1-2" />
<rst-rel id="r1-2" name="elab" nuc="propset1-1" sat="p1-3" />
• With propset, the number of rhetorical relations is always n-1, where n is the number of propositions.
• This simplifies the clause aggregation algorithm because one transformation maps to one rhetorical relation.
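The n-1 property can be checked mechanically. The following sketch makes two assumptions: the join relation is taken to link p1-1 and p1-2 (the two conjoined propositions), and propset1-1 is taken to contain exactly those two propositions, since the slide does not show its definition. The names leaves and is_minimal are hypothetical.

```python
# Flat annotation: the satellite p1-3 is attached to each nucleus separately.
FLAT = [
    ("elab", "p1-1", "p1-3"),
    ("elab", "p1-2", "p1-3"),
    ("join", "p1-1", "p1-2"),
]

# Propset annotation: p1-1 and p1-2 are joined first, then p1-3 modifies
# the whole propset. The membership of propset1-1 is an assumption.
PROPSETS = {"propset1-1": ["p1-1", "p1-2"]}
WITH_PROPSET = [
    ("join", "p1-1", "p1-2"),
    ("elab", "propset1-1", "p1-3"),
]

def leaves(unit, propsets):
    """Return the proposition ids covered by a proposition id or propset id."""
    if unit in propsets:
        return [p for member in propsets[unit] for p in leaves(member, propsets)]
    return [unit]

def is_minimal(relations, propsets):
    """True when the n propositions involved are linked by exactly n - 1 relations."""
    covered = set()
    for _, nuc, sat in relations:
        covered.update(leaves(nuc, propsets))
        covered.update(leaves(sat, propsets))
    return len(relations) == len(covered) - 1

flat_ok = is_minimal(FLAT, {})
propset_ok = is_minimal(WITH_PROPSET, PROPSETS)
```

With three propositions, the flat annotation needs three relations while the propset annotation needs two, so only the latter satisfies the n-1 property.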
Results • In our evaluation, we assume that operators applied earlier should result in constituents closer to the head than the constituents resulting from operators applied later. • The 200-sentence corpus was de-aggregated manually (average sentence length is 23 words) • 763 propositions • 3.8 propositions per sentence • 2.6 transformations per sentence • 523 rhetorical relations, of which 440 are Elaboration, Joint, or Sequence. • 20% of the annotated transformations cannot be handled by our operators. • 195 out of 200 original sentences can be re-synthesized using the proposed ordering. • Why such a good result? • The use of propset in the annotation removed many potential conflicts.
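The per-sentence figures above follow directly from the reported counts. The short sketch below only redoes that arithmetic; the variable names are illustrative, and the percentage is derived here from 195 out of 200 rather than quoted from the slides.

```python
# Counts reported on the Results slide.
sentences = 200
propositions = 763
relations = 523
resynthesized = 195

props_per_sentence = propositions / sentences        # 763 / 200
relations_per_sentence = relations / sentences       # 523 / 200
success_rate = resynthesized / sentences             # 195 / 200

print(f"{props_per_sentence:.1f} propositions per sentence")
print(f"{relations_per_sentence:.1f} rhetorical relations per sentence")
print(f"{success_rate:.1%} of sentences re-synthesized with the proposed ordering")
```

The relations-per-sentence value comes out close to the reported 2.6 transformations per sentence, which is what one would expect if one transformation maps to one rhetorical relation.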
What We Learned • The use of propset is very useful for discourse annotation. • The first application of the paratactic operators is a sub-step of the hypotactic operation: it combines satellite propositions that have similar syntactic structures and modify the same entity in the nucleus proposition. • The correct ordering of operators is a 2-stage process • hypotactic operators (with the conjunction operator optional as a local optimization) • paratactic operators • Hypotactic operators are applied first because their operations are local in nature: “Bob is a reputable stock-broker [with deep pocket] who is interested in dot-coms.” In contrast, paratactic operators are not local. They are sensitive to the surface position of identical constituents across all the propositions being combined (the directional constraint; Ross70, Shaw98b).
Conclusion • Even though researchers have studied rhetorical relations in conjunction with clause aggregation operations, the explicit use of propset in discourse annotation in such a context is new. • We explained why some paratactic operators are applied before hypotactic operators while others are applied afterward. • By imposing our proposed ordering on de-aggregated propositions and trying to re-synthesize the original sentences, we identified an ordering of clause aggregation operators that works well on a human-written corpus. • Such an ordering can be implemented and reused in domain-independent natural language generation systems to create complex sentences that are also concise and fluent.