Coherence-based strategies for text-to-hypertext-conversion

Angelika Storrer & Anke Holler, January 2004

Overview
• The HyTex-Project: user scenario and approach
• Coherence-based text-to-hypertext-conversion: main types of strategies
• Focus: achieving cohesive closedness in hypertext nodes
• Annotation scheme for co-reference phenomena

About the HyTex Project
HyTex: »Hypertextualisierung auf textgrammatischer Grundlage« (text-to-hypertext conversion on a text-grammatical basis)
Tasks:
• Segmentation: breaking down the document into nodes
• Linking: connecting the nodes through intratextual, intertextual and extratextual hyperlinks
• … on a text-grammatical basis: no simple 1:1 conversion; instead, hypertext nodes are generated from text-grammar-based annotations in the documents.

Conversion Guidelines
• Recoverability: generating hypertext views as additional layers while preserving the original document
• Coherence-based criteria for segmentation and linking
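The recoverability guideline can be pictured as standoff annotation: the source text is never modified, and each hypertext view is a separate layer of character offsets over it. The following is a minimal sketch under that assumption; `Span` and `render_view` are hypothetical names chosen for illustration and are not taken from the HyTex project.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical standoff annotation: offsets into the untouched source text."""
    start: int
    end: int
    label: str


def render_view(text: str, layer: List[Span]) -> List[Tuple[str, str]]:
    """Return (label, substring) pairs for one view; `text` itself is not altered."""
    ordered = sorted(layer, key=lambda span: span.start)
    return [(span.label, text[span.start:span.end]) for span in ordered]


original = "A link is usually coloured. It is therefore easy to see."
second = original.index("It")  # start of the second sentence

# One possible segmentation layer: each sentence becomes a node.
node_layer = [
    Span(second, len(original), "node-2"),
    Span(0, second - 1, "node-1"),
]
view = render_view(original, node_layer)
```

Because a layer only stores offsets, several alternative views can coexist, and discarding all of them returns the original document unchanged.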

User Scenario
A user with some prior knowledge but no expert knowledge of a particular area (a semi-expert) must acquire knowledge from a pool of scientific or expert texts within a given time interval, e.g. in the context of
• interdisciplinary cooperation
• science journalism
• specialised lexicography
Our vision: to make selective reading in this scenario more effective and more convenient than is possible with print media.

Form-based conversion strategies
[Figure: reading path (author) and reading path (user)]

Problems of the form-based approach
Problems on the micro-level:
Solution: generating cohesive closedness based on text-grammatical annotation

Problems of selective reading (macro-level)
[Figure: term occurrence, term definition, reading path (author), reading path (user)]

Three-level architecture
• User Model
• Domain Knowledge Level
• Document Level

Strategies on the micro-level: cohesive closedness
Sequential text (example, translated from German): "Furthermore, according to the number of anchors involved in a link, he distinguishes 1:1 links, in which one source anchor is linked to exactly one target anchor; 1:n links, in which one source anchor is connected to several target anchors; and n:m links, in which several anchors are combined into a linking pattern regardless of the traversal direction. The linking element of HTML provides only for 1:1 links; the above specification and the concept of the 'extended link' (in the sense of the XLink specification) also provide for links with several anchors."
Operations for achieving cohesive closedness:
• anaphora resolution
• linking
• elision
• expansion

Strategies on the micro-level: cohesive closedness
Cohesively autonomous version combined with expansion of the visual field

Prerequisites
1. Annotation of cohesive markers in the corpus documents:
   • coreference and cospecification
   • connectives
   • text-deictic expressions
2. Rules for the automatic transformation of cohesive cues in order to achieve cohesive closedness
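To make the idea of transformation rules concrete, the sketch below applies three of the operations named earlier (anaphora resolution, linking, elision) to annotated cues in a node. It is only an illustration: the `Cue` record, the operation names and the bracketed link notation are assumptions made here, and expansion of the visual field is left out because it concerns presentation rather than the node text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cue:
    """Hypothetical annotated cohesive cue inside a node."""
    start: int
    end: int
    operation: str   # "resolve", "link" or "elide" (names assumed for this sketch)
    value: str = ""  # antecedent text for "resolve", target id for "link"


def close_node(text: str, cues: List[Cue]) -> str:
    """Rewrite cohesive cues so the node can be read without its left context."""
    out = text
    # Work from right to left so that earlier offsets stay valid after each rewrite.
    for cue in sorted(cues, key=lambda c: c.start, reverse=True):
        surface = out[cue.start:cue.end]
        if cue.operation == "resolve":
            replacement = cue.value
        elif cue.operation == "link":
            replacement = "[" + surface + " -> " + cue.value + "]"
        elif cue.operation == "elide":
            replacement = ""
        else:
            raise ValueError("unknown operation: " + cue.operation)
        out = out[:cue.start] + replacement + out[cue.end:]
    return out


node = "Furthermore, it also allows links with several anchors."
connective = "Furthermore, "
pronoun_start = node.index("it ")

closed = close_node(node, [
    Cue(0, len(connective), "elide"),
    Cue(pronoun_start, pronoun_start + 2, "resolve", "XLink"),
])
```

In this toy case the connective is dropped and the pronoun is replaced by its antecedent, so `closed` reads "XLink also allows links with several anchors." A realistic rule set would also have to adjust capitalisation and inflection.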

Part II: Annotation of coreference phenomena
• Objective: strict separation of the annotation of the relation of coreference from the annotation of anaphoric relations
• Motivation: cross-document coreference (cf. Baldwin & Bagga 1998, Mitkov 2002)

Why do existing annotation schemes need to be extended?
• guidelines published by the Text Encoding Initiative (TEI)
• task definition of the Message Understanding Conferences (MUC)
• guidelines published by the project Multilevel Annotation Tools Engineering (MATE)
All three formats are SGML- or XML-based.

Text Encoding Initiative (TEI)
Example 1:
The show was not listed on <name id='nbc'>NBC</name>'s new schedule, although <seg id='network'>the network</seg> says it is still being considered.
<linkGrp type='anaphoric link' targFunc='antecedent anaphor'>
  <link targType='name seg' targets='nbc network'/>
</linkGrp>
Problem:
• No distinction between coreference and anaphora annotation.

Message Understanding Conferences (MUC)
Example 2:
<COREF ID="9" TYPE="IDENT" REF="2" MIN="company">The New Orleans oil and gas exploration and diving operations company</COREF> added that <COREF ID="10" TYPE="IDENT" REF="9">it</COREF> doesn't expect any further adverse financial impact from the restructuring.
(Hirschman/Chinchor 1997)
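In the MUC format every mention points to an earlier mention through `REF`, so coreference is only available as a chain of backward pointers. The sketch below follows those pointers for the excerpt above; the wrapping `<s>` element is an assumption added so the fragment parses as XML, and the mention with ID 2 lies outside the excerpt.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

sample = """<s><COREF ID="9" TYPE="IDENT" REF="2" MIN="company">The New Orleans oil
and gas exploration and diving operations company</COREF> added that
<COREF ID="10" TYPE="IDENT" REF="9">it</COREF> doesn't expect any further adverse
financial impact from the restructuring.</s>"""


def coref_chains(xml_text: str) -> Dict[str, List[str]]:
    """Group COREF mention IDs by the first mention their REF pointers lead to."""
    root = ET.fromstring(xml_text)
    ref_of = {e.get("ID"): e.get("REF") for e in root.iter("COREF")}

    def head(mention_id: str) -> str:
        seen = set()  # guards against accidental cycles in the annotation
        while ref_of.get(mention_id) and mention_id not in seen:
            seen.add(mention_id)
            mention_id = ref_of[mention_id]
        return mention_id

    chains: Dict[str, List[str]] = {}
    for element in root.iter("COREF"):
        chains.setdefault(head(element.get("ID")), []).append(element.get("ID"))
    return chains


chains = coref_chains(sample)
```

Both mentions end up in the chain headed by ID 2. The chain exists only inside one document, which is one reason why this representation does not carry over to cross-document coreference.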

Message Understanding Conferences (MUC)
Problems (Van Deemter & Kibble 2001):
• Elements of genuine coreference are mixed with elements of anaphora and predication.
• Non-referring expressions: (a) Whenever a solution emerged, we embraced it.
• Bound anaphora: (b) Every TV network reported its profits.
• Intensional contexts: (c) Henry Higgins, who was formerly sales director of Sudsy Soaps, became president of Dreamy Detergents.

Multilevel Annotation Tools Engineering (MATE)
Example 3:
The show was not listed on <coref:de ID='de_01'>NBC</coref:de>'s new schedule, although <coref:de ID='de_02'>the network</coref:de> says it is still being considered.
<coref:link type='ident' href='coref.xml#id(de_02)'>
  <coref:anchor href='coref.xml#id(de_01)'/>
</coref:link>
Problem:
• Every coreference relation is marked as an anaphoric relation.

Conclusion
Result:
• None of the presented markups is suitable to account for cross-document coreference phenomena.
Alternative proposal:
• An annotation scheme that encodes coreference as a relation between the document level and the domain knowledge level.

Three-level architecture: coreference annotation
• User Model
• Domain Knowledge Level (TermNet): terminological knowledge (concepts and technical terms); representation format: topic map
• Document Level

Annotation of coreference
Markup:
<corefLink deIDRef = Value tmIDRef = Value />
Example:
Das von <discourseEntity deID="deID_1" deType="nom">Kuhlen 1991</discourseEntity> skizzierte Grundmodell eines <discourseEntity deID="deID_2" deType="nom">Hypertextsystems</discourseEntity> orientiert sich am Vorbild von Datenbankmanagementsystemen.
<semRel><corefLink deIDRef="deID_1" tmIDRef="unknown"/></semRel>
<semRel><corefLink deIDRef="deID_2" tmIDRef="TermNet-inferiert.xtm#Hypertextsystem"/></semRel>
(English: "The basic model of a hypertext system outlined by Kuhlen 1991 is modelled on database management systems.")
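Read this way, each `corefLink` maps a discourse entity in the document to a topic in the domain model (or to `unknown`). A small sketch of that lookup follows; the wrapping `<p>` element and the function name `topics_of_entities` are assumptions made so that the example is self-contained.

```python
import xml.etree.ElementTree as ET
from typing import Dict

doc = (
    '<p>Das von <discourseEntity deID="deID_1" deType="nom">Kuhlen 1991'
    '</discourseEntity> skizzierte Grundmodell eines '
    '<discourseEntity deID="deID_2" deType="nom">Hypertextsystems'
    '</discourseEntity> orientiert sich am Vorbild von Datenbankmanagementsystemen.'
    '<semRel><corefLink deIDRef="deID_1" tmIDRef="unknown"/></semRel>'
    '<semRel><corefLink deIDRef="deID_2" '
    'tmIDRef="TermNet-inferiert.xtm#Hypertextsystem"/></semRel></p>'
)


def topics_of_entities(xml_text: str) -> Dict[str, str]:
    """Map the surface text of each linked discourse entity to its topic reference."""
    root = ET.fromstring(xml_text)
    surface = {
        entity.get("deID"): "".join(entity.itertext()).strip()
        for entity in root.iter("discourseEntity")
    }
    return {
        surface[link.get("deIDRef")]: link.get("tmIDRef")
        for link in root.iter("corefLink")
    }


topics = topics_of_entities(doc)
```

Here "Kuhlen 1991" resolves to `unknown`, while "Hypertextsystems" resolves to the topic `TermNet-inferiert.xtm#Hypertextsystem`.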

Annotation of cospecification
Markup:
<cospecLink relType = Value phorIDRef = Value antecedentIDRefs = Value />
Example:
<discourseEntity deID="deID_3" deType="nom">Ein Link</discourseEntity> ist im Text meist farbig markiert. <discourseEntity deID="deID_4" deType="nom">Er</discourseEntity> ist dadurch gut sichtbar.
<cospecLink relType="substitution" phorIDRef="deID_4" antecedentIDRefs="deID_3"/>
(English: "A link is usually highlighted in colour in the text. It is therefore easy to see.")
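A `cospecLink` carries exactly the information needed for anaphora resolution inside a node: the anaphoric expression and its antecedent(s). The sketch below naively substitutes the first antecedent for the anaphor. The `<p>` wrapper and the function name are assumptions, and a real transformation rule would also adjust definiteness and case (for example "Der Link" rather than a repeated "Ein Link").

```python
import xml.etree.ElementTree as ET

sample = (
    '<p><discourseEntity deID="deID_3" deType="nom">Ein Link</discourseEntity> '
    'ist im Text meist farbig markiert. '
    '<discourseEntity deID="deID_4" deType="nom">Er</discourseEntity> '
    'ist dadurch gut sichtbar. '
    '<cospecLink relType="substitution" phorIDRef="deID_4" '
    'antecedentIDRefs="deID_3"/></p>'
)


def resolve_phors(xml_text: str) -> str:
    """Replace each anaphor by the text of its first antecedent (naive substitution)."""
    root = ET.fromstring(xml_text)
    entities = {e.get("deID"): e for e in root.iter("discourseEntity")}
    for link in root.iter("cospecLink"):
        antecedent_ids = link.get("antecedentIDRefs").split()
        anaphor = entities[link.get("phorIDRef")]
        anaphor.text = entities[antecedent_ids[0]].text
    # Flatten to plain text and normalise whitespace.
    return " ".join("".join(root.itertext()).split())


resolved = resolve_phors(sample)
```

After substitution the second sentence no longer depends on the first, which is the property a cohesively closed node needs.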

Cross-document phenomena
DOC 1
(1) Das von Kuhlen 1991 skizzierte Grundmodell eines <discourseEntity deID="deID_1" deType="nom">Hypertextsystems</discourseEntity> orientiert sich am Vorbild von Datenbankmanagementsystemen.
<semRel><corefLink deIDRef="deID_1" tmIDRef="TermNet-inferiert.xtm#Hypertextsystem"/></semRel>
(English: "The basic model of a hypertext system outlined by Kuhlen 1991 is modelled on database management systems.")
DOC 2
(2) Die Verwaltung dieser Annotationen galt lange als ein wichtiges Desiderat von <discourseEntity deID="deID_2" deType="nom">Hypertextsystemen</discourseEntity>.
<semRel><corefLink deIDRef="deID_2" tmIDRef="TermNet-inferiert.xtm#Hypertextsystem"/></semRel>
(English: "The management of these annotations was long considered an important desideratum of hypertext systems.")
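Because both entities point to the same topic, cross-document coreference falls out of a simple index from topic to mentions; no link between the two documents is required. A sketch under the assumption that each document is available as an XML string (the `<p>` wrappers and the name `mentions_by_topic` are hypothetical):

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List, Tuple

TOPIC = "TermNet-inferiert.xtm#Hypertextsystem"

docs = {
    "DOC 1": (
        '<p>Das von Kuhlen 1991 skizzierte Grundmodell eines '
        '<discourseEntity deID="deID_1" deType="nom">Hypertextsystems'
        '</discourseEntity> orientiert sich am Vorbild von '
        'Datenbankmanagementsystemen.<semRel><corefLink deIDRef="deID_1" '
        'tmIDRef="' + TOPIC + '"/></semRel></p>'
    ),
    "DOC 2": (
        '<p>Die Verwaltung dieser Annotationen galt lange als ein wichtiges '
        'Desiderat von <discourseEntity deID="deID_2" deType="nom">'
        'Hypertextsystemen</discourseEntity>.<semRel><corefLink '
        'deIDRef="deID_2" tmIDRef="' + TOPIC + '"/></semRel></p>'
    ),
}


def mentions_by_topic(documents: Dict[str, str]) -> Dict[str, List[Tuple[str, str]]]:
    """Index (document, deID) pairs by the topic their corefLink points to."""
    index = defaultdict(list)
    for name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        for link in root.iter("corefLink"):
            index[link.get("tmIDRef")].append((name, link.get("deIDRef")))
    return dict(index)


index = mentions_by_topic(docs)
```

The single topic entry lists one mention from each document, which is the cross-document relation that chain-based schemes cannot express.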

Cross-document phenomena
DOC 1 (Annotation = gloss)
(3) Unter <discourseEntity deID="deID_3" deType="nom">"Annotationen"</discourseEntity> werden in der Hypertextliteratur Anmerkungen und Notizen verstanden, die ein Hypertextnutzer während des Rezeptionsvorgangs zu den Inhalten eines Moduls anbringt.
<semRel><corefLink deIDRef="deID_3" tmIDRef="TermNet-inferiert.xtm#Annotation1"/></semRel>
(English: "In the hypertext literature, 'annotations' are understood as comments and notes that a hypertext user attaches to the contents of a module while reading.")
DOC 2 (Annotation = markup)
(4) In der SGML/XML-Terminologie wird der Ausdruck <discourseEntity deID="deID_4" deType="nom">"Annotation"</discourseEntity> allerdings meist in einem anderen Sinne verwendet, nämlich als Bezeichnung für die Auszeichnung von Dokumenten mittels Markup.
<semRel><corefLink deIDRef="deID_4" tmIDRef="TermNet-inferiert.xtm#Annotation2"/></semRel>
(English: "In SGML/XML terminology, however, the expression 'annotation' is mostly used in a different sense, namely to denote the marking up of documents by means of markup.")
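The two cases can be contrasted with a single test: mentions corefer when they share a topic, regardless of how similar their surface forms are. The sketch below assumes a hypothetical `Mention` record and treats `unknown` as never coreferent, which is a choice made for this illustration only.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    """Hypothetical record for one linked discourse entity."""
    doc: str
    surface: str
    topic: str  # value of the tmIDRef attribute


def corefer(a: Mention, b: Mention) -> bool:
    """Mentions corefer iff they point to the same known topic."""
    return a.topic != "unknown" and a.topic == b.topic


m1 = Mention("DOC 1", "Hypertextsystems", "TermNet-inferiert.xtm#Hypertextsystem")
m2 = Mention("DOC 2", "Hypertextsystemen", "TermNet-inferiert.xtm#Hypertextsystem")
m3 = Mention("DOC 1", "Annotationen", "TermNet-inferiert.xtm#Annotation1")
m4 = Mention("DOC 2", "Annotation", "TermNet-inferiert.xtm#Annotation2")

same_concept = corefer(m1, m2)       # different inflections, same topic
different_concept = corefer(m3, m4)  # same lexeme, different topics
```

String matching alone would get the second pair wrong, since both mentions are forms of "Annotation" but denote different concepts.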

Summary
• We have discussed a general approach to text-to-hypertext conversion that is pursued in our HyTex project.
• We have presented strategies
   • for a coherence-based conversion of sequential text into hypertext
   • for achieving cohesive closedness of hypertext nodes
• We have argued for a coreference annotation that relates expressions in the text to a WordNet-like model representing the terminological knowledge of the domain under investigation.
