Gross-grained RST through XML Metadata for Multilingual Document Generation
G. Barrutieta, J. Abaitua & J. Díaz
MT Summit VIII, Santiago de Compostela, Spain
Introduction
• The web is full of documents (web pages) and can be seen as a huge database containing useful information for a wide range of users. But it is difficult to find relevant documents, or relevant information within a document.
• Problem: the web contains a lot of data, but the data is unstructured [Sobrino].
• Text-to-text generation (or text regeneration). Is NLG a selection problem? (8th EWNLG, Toulouse, 2001)
• The web is full of text that can be used to generate “new” text by taking bits from here and there. This approach requires structured data.
• The above is roughly what the CourseViewGenerator does. A “view” is a new document generated from parts of a master document [Hirst et al.].
At least 3 research issues
• Where to generate from? Creation of the corpus, which is the source of the generated documents. This is the main focus of this paper.
• Content selection/determination algorithm – user profiles or aspects to choose the parts of the discourse that are relevant to the students. A minimal sketch follows this list.
• Presentation selection algorithm – user profiles or aspects to help students read and understand, to motivate them and to involve them in the learning process.
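Purely as an illustration (the paper gives no implementation), here is a minimal content-selection sketch in Python, assuming each segment of the master document is annotated with the audiences it is relevant to; the segment texts and audience labels are invented:

# Hypothetical content-selection sketch: keep only the segments of a master
# document whose audience annotation matches the requesting user profile.

def select_content(segments, profile):
    """segments: list of (text, audiences) pairs; profile: e.g. 'student'."""
    return [text for text, audiences in segments if profile in audiences]

master = [
    ("Course schedule and exam dates", {"student", "administration"}),
    ("Grading spreadsheet template", {"professor"}),
    ("Exercise 1: Darwin as a geologist", {"student", "professor"}),
]

print(select_content(master, "student"))
# ['Course schedule and exam dates', 'Exercise 1: Darwin as a geologist']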
Multilingual parallel corpus – Master document
• “Master document contains all the information, including illustrations, that the system might wish to include in any individual brochure, along with annotations as to when each piece of information is relevant.” [Hirst et al.], HealthDoc project.
• The same idea is used here. The master document contains all the information about the subject matter that the students, the administration and the professor might need before, during and after the course.
Data and metadata – Level of segmentation
• How is this information going to be represented in the multilingual parallel corpus?
• The data is text encapsulated in XML tags [Bray et al.].
• The metadata (the tags) is data about the text.
• The metadata takes the form of RST discourse trees [Mann & Thompson].
• What is the size of the text spans (segments) to be encapsulated, i.e. the level of segmentation?
• The text spans are “typically” clauses (minimal text spans or elementary units) with a discourse function [Marcu et al.].
RST in XML
. . .
<EXPLANATION>
  <RST>
    <RST-N>
      <RST>
        <RST-N>
          <S> Darwin as a geologist </S>
        </RST-N>
        <RST-S>
          <CONCESSION>
            <S> He tends to be viewed now as a biologist, </S>
          </CONCESSION>
        </RST-S>
      </RST>
    </RST-N>
    <RST-S>
      <EVIDENCE>
        <S> but in his five years on the Beagle his main work was geology </S>
      </EVIDENCE>
      <EVIDENCE>
        <S> and he saw himself as a geologist. </S>
      </EVIDENCE>
      <EVIDENCE>
        <S> His work contributed significantly to the field. </S>
      </EVIDENCE>
    </RST-S>
  </RST>
</EXPLANATION>
. . .
DTD
. . .
<!ELEMENT EXPLANATION (RST+)>
<!ELEMENT RST (RST-S|RST-N)*>
<!ELEMENT RST-N (S|RST)*>
<!ELEMENT RST-S (EVIDENCE|CONCESSION)*>
<!ELEMENT EVIDENCE (S+)>
<!ELEMENT CONCESSION (S+)>
<!ELEMENT S (#PCDATA)>
. . .
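As a sketch only (this is not the authors' CourseViewGenerator), the fragment above can be processed with Python's standard xml.etree.ElementTree to produce a "view" that keeps the nucleus and EVIDENCE satellites while dropping CONCESSION satellites; the choice of which relation to drop is an arbitrary example:

# Sketch: parse an EXPLANATION fragment and generate a view by pruning
# subtrees rooted at a dropped rhetorical relation.
import xml.etree.ElementTree as ET

FRAGMENT = """\
<EXPLANATION>
  <RST>
    <RST-N>
      <RST>
        <RST-N><S>Darwin as a geologist</S></RST-N>
        <RST-S>
          <CONCESSION><S>He tends to be viewed now as a biologist,</S></CONCESSION>
        </RST-S>
      </RST>
    </RST-N>
    <RST-S>
      <EVIDENCE><S>but in his five years on the Beagle his main work was geology</S></EVIDENCE>
      <EVIDENCE><S>and he saw himself as a geologist.</S></EVIDENCE>
    </RST-S>
  </RST>
</EXPLANATION>
"""

def view(node, drop=("CONCESSION",)):
    # Collect every <S> span, skipping whole subtrees rooted at a dropped relation.
    if node.tag in drop:
        return []
    if node.tag == "S":
        return [node.text]
    spans = []
    for child in node:
        spans.extend(view(child, drop))
    return spans

root = ET.fromstring(FRAGMENT)
print(" ".join(view(root)))
# -> Darwin as a geologist but in his five years on the Beagle his main work
#    was geology and he saw himself as a geologist.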
Non-isomorphism
• Marcu et al. show examples of multilingual text analyses that are not isomorphic and propose 4 possible solutions for further research. One of them is explored later in this presentation.
• Moore & Pollack show examples of monolingual RST text analyses that are not isomorphic either. This ambiguity is due to discrepancies between the intentional and informational levels of analysis.
• This non-isomorphism implies that a separate content selection/determination algorithm might be necessary for each language. This is not workable.
Moore & Pollack's non-isomorphism
• Not a problem in this particular case, since the corpus is created manually. The ambiguities are addressed and resolved by the human author.
Marcu et al.'s non-isomorphism
• They propose 4 possible solutions.
• One of them is to “derive a language-independent discourse structure and then linearize it”.
• This idea in their paper triggered our “gross-grained RST”.
Gross-grained RST
• The language-independent discourse structure is built over bigger segments of text (bigger than clauses).
• Our segments are never smaller than sentences.
• Our segments are groups of sentences (at least one) with a clear communicative goal.
• RST is flexible enough to allow this. Of course, we lose rhetorical information within those bigger chunks of text, but this approach makes sense because we are not going to give the students only a portion of an exercise, for example. A sketch of the idea follows this list.
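Purely as an illustration (the segment identifiers, texts and data layout below are invented, not the paper's format), a Python sketch of the intuition: the discourse structure and the selection decision are defined once, language-independently, over gross-grained segments, and each language only supplies the text of those segments, so one content-selection algorithm serves every language.

# Hypothetical layout: a language-independent segment order plus per-language texts.
structure = ["intro", "exercise-1", "solution-1"]   # language-independent order

texts = {
    "en": {"intro": "Welcome to the course.",
           "exercise-1": "Exercise 1: classify the rock samples.",
           "solution-1": "Solution: samples A and C are igneous."},
    "es": {"intro": "Bienvenidos al curso.",
           "exercise-1": "Ejercicio 1: clasifique las muestras de roca.",
           "solution-1": "Solución: las muestras A y C son ígneas."},
}

def linearize(selected_ids, lang):
    """Turn one selection decision into running text in the chosen language."""
    return " ".join(texts[lang][seg] for seg in structure if seg in selected_ids)

student_view = {"intro", "exercise-1"}   # e.g. solutions withheld from students
print(linearize(student_view, "en"))
print(linearize(student_view, "es"))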
Discussion
• Tests are to be carried out in the near future. The documents generated by the CourseViewGenerator will be presented to the students so that they can judge them.
• A lot of fine-tuning and refining will be required to decide on the final user aspects and the final content selection and presentation selection algorithms. The refining will be guided by the tests.
• Open questions:
  • Is this approach applicable to other domains and contexts?
  • Will this approach work with bigger corpora?
Bibliography
All bibliographical references are in the paper, except the following:
Moore, J.D. and Pollack, M.E. (1992). A Problem for RST: The Need for Multi-Level Discourse Analysis. Computational Linguistics.