Syntactic Aggregation in Bengali Text Generation Sumit Das, Anupam Basu, Sudeshna Sarkar Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur, India
2 Overview • Introduction / Motivation • Role of aggregation in NLG • Language dependency of text aggregation • Our Work • Identification of prevalent syntactic aggregation constructs in Bengali through corpus analysis • Rule-based approach to perform syntactic aggregation • Evaluation of the proposed approach • Summary / Conclusions
3 Introduction: Text Aggregation • Combines coherent simple text spans, removing repeated entities • Improves fluency, conciseness, and coherence • Preserves meaning • Example • Jack went up the hill. Jill went up the hill. • Jack and Jill went up the hill.
4 Text Aggregation • No general consensus regarding the types of text aggregation • Existing theories propose the following categories • Interpretive • Referential • Syntactic • Lexical • Performed either in the sentence planner or in the surface realizer, depending on application requirements
5 Motivation • Syntactic aggregation is the most common form of text aggregation observed in real discourse • Simple linguistic components are combined in accordance with linguistic rules • Language-dependent process, so linguistic knowledge of the target language (e.g., preferred word ordering, special verb form usage) is required
6 Corpus Analysis • Narrative compound sentences were used to identify syntactic aggregation constructs in Bengali • Prevalent constructs are • Conjunction reduction • Right node raising (RNR) • Coordinating one constituent • Non-finite verb generation • Any combination of these constructs is allowed
7 Theoretical Framework • Our work is grounded in the Rhetorical Structure Theory (RST) framework (Mann and Thompson, 1988) • RST uniformly captures semantic, intentional and textual features of a given text • Among the 23 RST relations discussed in the original theory, we consider the following • Conjunction • Disjunction • Contrast • Sequence • Parallel
8 Semantic Representation • Elementary discourse units (e.g., sentences) are represented by recursive frame-based structures • Each frame corresponds to the higher-level syntactic and functional information of a sentence • This information is represented as a set of attribute-value pairs
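A frame of this kind can be pictured as nested attribute-value pairs. The sketch below is only an illustration: the attribute names (`verb`, `root`, `tense`, `args`, `role`, `head`) and the role label `ki` ("what") are assumptions made for this example, not the attribute set used in the actual system.

```python
# Hypothetical clause frame for "rAma bhAta khAbe" (Ram will eat rice).
# Every attribute name below is an illustrative assumption.
clause = {
    "verb": {"root": "khA", "tense": "future"},
    "args": [
        {"role": "ke", "head": "rAma"},   # "who"
        {"role": "ki", "head": "bhAta"},  # "what" (assumed role label)
    ],
}


def heads_by_role(frame):
    """Return a role -> head-word mapping for the arg frames of a clause frame."""
    return {arg["role"]: arg["head"] for arg in frame["args"]}


roles = heads_by_role(clause)
```

The arg frames sit inside the clause frame, which is what makes the structure recursive: an arg frame could itself carry further nested frames.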
9 Problem Specification • Input: Two simple clauses in their semantic representation, the rhetorical relation, and the discourse marker realizing the relation • Output: Surface form of the fluent, concise and coherent compound sentence
10 Our Approach • Step 1: Ordering arguments in the constituent clauses • Step 2: Repeating entity identification • Step 3: Ordering constituent clauses • Step 4: Superfluous word identification and non-finite verb generation • Step 5: Correct surface-form generation
11 Ordering arguments in the constituent clauses • The arg frames in the clause frames are ordered by using a total order among the arg roles • The total order is derived from the Bengali compound sentences used in the corpus analysis, completed using transitivity rules • Example • Ami AgAmIkAla bAbAra sAthe skule yAba. • (I shall go to school with father tomorrow.) • ke (who) kakhana (when) kAra sAthe (with whom) kothAYa (where) • The total order among the roles is • ke < kakhana < kAra sAthe < kothAYa
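Ordering by a total order among roles amounts to sorting the arg frames by the rank of their role. The following is a minimal sketch that uses only the four roles from the example above; the dictionary layout of an arg frame (`role`, `head`) is an assumption made for illustration.

```python
# Total order among roles, taken from the example: ke < kakhana < kAra sAthe < kothAYa
ROLE_ORDER = ["ke", "kakhana", "kAra sAthe", "kothAYa"]
RANK = {role: i for i, role in enumerate(ROLE_ORDER)}


def order_args(args):
    """Sort arg frames by the rank of their role in the total order."""
    return sorted(args, key=lambda arg: RANK[arg["role"]])


# Arg frames of the example sentence, deliberately shuffled.
args = [
    {"role": "kothAYa", "head": "skule"},
    {"role": "ke", "head": "Ami"},
    {"role": "kAra sAthe", "head": "bAbAra sAthe"},
    {"role": "kakhana", "head": "AgAmIkAla"},
]
ordered_heads = [arg["head"] for arg in order_args(args)]
```

Appending the verb `yAba` to the ordered heads gives back the word order of the example sentence.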
12 Repeating entity identification • The entities present in both the input simple clauses with the same syntactic and semantic role are marked as REPEATING • Figure: Case-frame representation of “rAma eba.n shyAma bhAta khAbe”
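The marking step can be sketched as a comparison of role fillers across the two clauses. This is a simplification under stated assumptions: each arg frame is reduced to a single `role` label standing in for both its syntactic and semantic role, and the role label `ki` ("what") is hypothetical.

```python
def repeating_roles(clause_a, clause_b):
    """Return the roles whose filler is identical in both clauses."""
    a = {arg["role"]: arg["head"] for arg in clause_a}
    b = {arg["role"]: arg["head"] for arg in clause_b}
    return {role for role in a if role in b and a[role] == b[role]}


# "rAma bhAta khAbe" and "shyAma bhAta khAbe"
first = [{"role": "ke", "head": "rAma"}, {"role": "ki", "head": "bhAta"}]
second = [{"role": "ke", "head": "shyAma"}, {"role": "ki", "head": "bhAta"}]
marked = repeating_roles(first, second)
```

Here only the object `bhAta` would be marked as REPEATING, since the subjects differ.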
13 Ordering constituent clauses • Constituent clauses are reordered according to rules based on their chronological order and polarity • Increases the fluency and coherence of the generated compound sentences • Example • Ami bA.Di yAba. rAma skule gechhe. (Before ordering) • (I shall go home. Ram has gone to school.) • rAma skule gechhe eba.n Ami bA.Di yAba. (After ordering) • (Ram has gone to school and I shall go home.)
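The slide does not spell out the ordering rules, so the sketch below covers only the chronological part and rests on an assumed ranking of coarse tense labels (past before present before future). The tense tags and the `TENSE_RANK` table are hypothetical, and polarity is ignored.

```python
# Assumed chronological ranking of coarse tense labels (not from the source).
TENSE_RANK = {"past": 0, "present": 1, "future": 2}


def order_clauses(clauses):
    """Stable sort of clauses by the assumed chronological rank of their tense."""
    return sorted(clauses, key=lambda clause: TENSE_RANK[clause["tense"]])


clauses = [
    {"text": "Ami bA.Di yAba", "tense": "future"},
    # "gechhe" is a perfect form; tagged coarsely as "present" for this sketch.
    {"text": "rAma skule gechhe", "tense": "present"},
]
ordered_texts = [clause["text"] for clause in order_clauses(clauses)]
```

Because `sorted` is stable, clauses with the same tense keep their input order.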
14 Superfluous word identification and non-finite verb generation • Superfluous words are identified using the following two methods • Forward deletion: • rAma gatakAla khAbAra kheYechhila eba.n rAma gatakAla skule giYechhila (Ram ate food yesterday and Ram went to school yesterday). • Backward deletion: • rAma bhAta khAbe eba.n shyAma bhAta khAbe (Ram will eat rice and Shyam will eat rice). • For Sequence and Parallel relations, after forward deletion the verb of the first clause is modified to its non-finite form
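The two deletion directions can be approximated at the token level: forward deletion drops from the second clause the words it shares with the start of the first, and backward deletion drops from the first clause the words it shares with the end of the second. This is a rough sketch on whitespace-separated strings, not the frame-based procedure of the system, and it does not generate the non-finite verb form.

```python
def shared_prefix_len(a, b):
    """Length of the common leading run of two token lists."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n


def forward_delete(first, second, marker="eba.n"):
    """Drop from the second clause the tokens repeated from the start of the first."""
    a, b = first.split(), second.split()
    k = shared_prefix_len(a, b)
    return " ".join(a + [marker] + b[k:])


def backward_delete(first, second, marker="eba.n"):
    """Drop from the first clause the tokens repeated at the end of the second."""
    a, b = first.split(), second.split()
    k = shared_prefix_len(a[::-1], b[::-1])  # common suffix length
    return " ".join(a[: len(a) - k] + [marker] + b)


fwd = forward_delete("rAma gatakAla khAbAra kheYechhila",
                     "rAma gatakAla skule giYechhila")
bwd = backward_delete("rAma bhAta khAbe", "shyAma bhAta khAbe")
```

On the two examples above this yields `rAma gatakAla khAbAra kheYechhila eba.n skule giYechhila` and `rAma eba.n shyAma bhAta khAbe`.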
15 Correct surface-form generation • Superfluous word deletion from the surface form takes place in this stage • In case of subject coordination and RNR, the correct form of the common verb is generated • Example • Ami kAla skule yAba. tumi kAla skule yAbe. • (I shall go to school tomorrow. You will go to school tomorrow.) • Ami Ara tumi kAla skule yAba. • (I and you will go to school tomorrow.)
16 Evaluation • We develop a system that performs syntactic aggregation of two simple clauses following the steps described • Evaluation of the system validates our approach • Due to the lack of sufficient gold-standard data, automatic evaluation techniques are not used • We perform user-based evaluation
17 Evaluation • Evaluation is performed on the following two criteria: • Well-formedness • Faithfulness • 250 test sentences • Output sentences are shown to 3 human experts • They are asked to score the outputs on a scale of 1 to 5
18 Results: Well-formedness
19 Results: Faithfulness
20 Conclusions • Our approach generates aggregated and elliptic sentences in Bengali from clause-sized semantic representations using rules • The current system produces paratactic constructions and uses ellipsis to omit repeated entities • It performs all the syntactic aggregation constructs identified during the corpus analysis
21 Future Scope • Anaphoric pronoun generation to preserve meaning and increase fluency • The discourse marker currently comes as input; the system can be extended to select an appropriate discourse marker • The current system can be extended to generate multi-sentential textual output