Relations between multiple annotations: Representation, Inferences, Context Specification, and Unification Andreas Witt Dieter Metzing Jens Pönninghaus Daniela Goecke www.text-technology.de
Contents • Project description • Approaches to Multiple Annotations • multiple Levels • multiple Layers • Representation • Inferences • Context Specification • Unification
Project description • The project Secondary Information Structuring and Comparative Discourse Analysis (Sekimo) is part of the DFG-Forschergruppe 437 Text-technological Modelling of Information • Within this project a corpus is annotated on different (linguistic) levels • Aim of the project: inferring, describing, and modelling relations between these levels
Standard Methodology • A corpus is annotated according to a given tag set • The tag set is defined in a document grammar (e.g. the TEI-DTD) • In general, different tag sets exist for annotating different kinds of documents (e.g. poems, encyclopedias) or different kinds of information (e.g. linguistic information) • In particular, a linguistic annotation can depend on: • theoretical assumptions (constituent structure, functional structure, or a more specific theory) • the language • research questions
Problems of the standard methodology • Levels of description are neglected or • Different levels of annotation are mixed up Difficulties • Multiple hierarchies within one document
General Solutions (cf. TEI Guidelines) • CONCUR: an optional feature of SGML (not available in XML) which allows multiple hierarchies to be marked up concurrently in the same document • milestone elements: empty elements which mark the boundaries between elements in a non-nesting structure • fragmentation of an item: the division of what logically is a single element into two or more parts, each of which nests properly within its context • virtual joins: the recreation of a virtual element from fragments of text (requires a separate interpretation) • redundant encoding of information in multiple forms
Multiple hierarchies and language data • Hypertext linking techniques are used for connecting multiple layers of annotation, e.g.: • Within the EU-Project NITE an annotation format has been developed which allows for specifying links between separate annotation layers • The annotation graphs (AGs) format uses a (possibly abstract) timeline as linking-layer • Modified versions of the AGs are applied by • the TASX-Annotator • the EXMARaLDA-Project
Alternative Methodology • XML-based multi-layer annotation • Technically, each layer becomes a separate and independent XML-document • The same text is annotated several times • Advantages: • seems to be the only way to annotate multiple hierarchies without workarounds • each document instance uses its own DTD (or Schema), i.e. annotation formats are not mixed up • at any time a new annotation can be produced • transformation tools to the NITE and the TASX-format exist (Master’s Thesis by Jan F. Maas)
Layer vs. Level • We distinguish annotation level vs. annotation layer • Annotation level refers to an abstract level of analysis • Annotation layer refers to the realisation of an annotation in e.g. XML • Examples of annotation levels: morphology in a linguistic grammar, text structure (sections, paragraphs,...), layout (lines and pages), thematic structure, rhetorical structure • Sometimes one layer contains several levels (e.g. HTML), but a level can also be distributed over several layers
Annotation Process • Given: • the textual representation of language material (text) • the text is regarded as primary data • For each annotation layer the primary data is copied • The (copy of the) primary text is annotated according to a schema (e.g. a DTD) • Annotation can be prepared • in any XML editor (e.g. XMetaL, XML-Spy, psgml-emacs) • with a special-purpose annotation tool
Sample annotation with a web-based, special purpose annotation tool This tool is used only for flat xml-structures, i.e. xml-annotations with non-nested elements
Example: XML annotation with the Emacs editor (usable for deep and flat annotations)
Multi-layer-annotation tool (master's thesis by Stefan Michel; work in progress)
Multiple Annotations • Drawbacks: • redundant • the separate documents are independent (i.e. not connected) • But: • since the documents contain exactly the same text, the text can function as the link • Solution: • a common representation format for all separate XML-documents
Prolog-Representation • The Prolog representation is based on work by Renear, Huitfeldt, Dubin and Sperberg-McQueen • Original representation of an XML element: node/2, i.e. the predicate node has two arguments • the position in the document tree • a value, e.g. element(corpus) • Extension: node/2 is replaced by node/5 • The 3 new arguments: • annotation layer • starting point of the annotated text • end point of the annotated text
Conversion from XML to Prolog (xml2prolog) • Implemented in Python • Input: 1 or more XML-Documents • Result: Collection of Prolog facts • Example: • the element <Root> is represented as the fact: node(AnnotationLayer, 0, 42332, [1], element(Root)). • the attribute att=val of the Element <Root> is represented as the fact: attr(AnnotationLayer, 0, 42332, [1], 'att', 'val').
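The conversion described on this slide can be sketched roughly as follows. This is a minimal illustration, not the project's xml2prolog.py: the function name xml_to_facts, the numbering of element children only in the tree position, the half-open character offsets, and the quoting of element names (so that capitalised names are not read as Prolog variables) are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def xml_to_facts(xml_text, layer):
    """Return node/5 and attr/6 facts (as strings) for one annotation layer.

    Hypothetical helper: offsets count characters of the primary data,
    and tree positions number element children starting at 1.
    """
    facts = []

    def quote(s):
        # Quote as a Prolog atom, escaping backslashes and single quotes.
        return "'" + s.replace("\\", "\\\\").replace("'", "\\'") + "'"

    def visit(elem, path, start):
        # itertext() yields all text inside elem, excluding elem's own tail.
        end = start + len("".join(elem.itertext()))
        pos = "[" + ",".join(str(p) for p in path) + "]"
        facts.append(
            f"node({layer}, {start}, {end}, {pos}, element({quote(elem.tag)}))."
        )
        for name, value in elem.attrib.items():
            facts.append(
                f"attr({layer}, {start}, {end}, {pos}, {quote(name)}, {quote(value)})."
            )
        # Children start after the element's leading text.
        offset = start + len(elem.text or "")
        for i, child in enumerate(elem, start=1):
            visit(child, path + [i], offset)
            offset += len("".join(child.itertext())) + len(child.tail or "")

    visit(ET.fromstring(xml_text), [1], 0)
    return facts


for fact in xml_to_facts('<s type="x"><w>tree</w> <w>top</w></s>', "morph"):
    print(fact)
```

Because every fact carries the start and end offsets in the shared primary text, facts from different layers can be compared without any explicit links between the XML documents.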
xml2prolog.py • Some options for the transformation process • compare: the primary data of the XML files are compared; if the primary data is not identical, the first difference is shown • pcdata/pcdatanodes: character data can be included • aggressive: whitespace is added or removed anywhere in the document if whitespace is the reason for differences in the primary data • filter: selected elements in selected files are filtered out (including their textual content), e.g. <script> within HTML documents
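The idea behind the compare option can be illustrated with a short sketch. The helper names below (primary_data, first_difference, same_ignoring_whitespace) are hypothetical and do not come from xml2prolog.py; the whitespace check only hints at what the aggressive option has to decide and does not repair anything.

```python
import xml.etree.ElementTree as ET


def primary_data(xml_text):
    """Return the text content of an XML document with all markup removed."""
    return "".join(ET.fromstring(xml_text).itertext())


def first_difference(text_a, text_b):
    """Return the index of the first differing character, or None if equal."""
    for i, (a, b) in enumerate(zip(text_a, text_b)):
        if a != b:
            return i
    if len(text_a) != len(text_b):
        # One text is a prefix of the other; they differ where the shorter ends.
        return min(len(text_a), len(text_b))
    return None


def same_ignoring_whitespace(text_a, text_b):
    """True if the two texts differ at most in whitespace."""
    return "".join(text_a.split()) == "".join(text_b.split())


layer_a = primary_data("<s><w>tree</w> top</s>")
layer_b = primary_data("<p>tree <b>tap</b></p>")
print(first_difference(layer_a, layer_b))
```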
Example: [Figure: the primary text "shucchou no keN" shown once per annotation layer. Labels visible in the figure: NP; COMP, HD; NP.no; VN, PGen, NF; joshi, meishi; bunsetsu[@type=dependent], bunsetsu]
Example (Collection of Prolog-Facts) [Figure: fact base with callouts for the annotation layer, start and end point, node position in the DOM tree, element names, attribute-value pairs, and data contents]
Relations between annotation layers • Relations are inferred automatically • Special Prolog predicates have been implemented to compare the annotation layers • Example (identity): <w>tree</w> <m>tree</m> <syll>tree</syll>
Relations between Annotations Cf. Durusau & Brook O'Donnell (2002) and Durand (1999)
1. <a>....................</a> <b>......</b>
2. <a>....................</a> <b>.........</b>
3. <a>....................</a> <b>.....................</b>
4. <a>....................</a> <b>................................</b>
5. <a>....................</a> <b>..........................................</b>
6. <a>....................</a> <b>......</b>
7. <a>....................</a> <b>....................</b>
8. <a>....................</a> <b>...............................</b>
etc.
Relations between annotation layers • identity • independence • inclusion • start point identity • end point identity • end point is starting point • overlap [Figure: each relation is visualised by the range of element a and the range of element b]
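A rough sketch of how two annotated ranges could be assigned to one of the relations listed on this slide. The slides do not give formal definitions, so the boundary conditions here are assumptions: spans are half-open (start, end) character offsets, the checks are applied in the order shown, and the function name relation is hypothetical.

```python
def relation(a, b):
    """Classify the relation between two (start, end) character ranges.

    Assumption: offsets are half-open, so a range ending at 4 and a
    range starting at 4 touch but share no character.
    """
    a_start, a_end = a
    b_start, b_end = b
    if (a_start, a_end) == (b_start, b_end):
        return "identity"
    if a_end == b_start or b_end == a_start:
        return "end_point_is_starting_point"
    if a_end < b_start or b_end < a_start:
        return "independence"
    if a_start == b_start:
        return "start_point_identity"
    if a_end == b_end:
        return "end_point_identity"
    if (a_start < b_start and b_end < a_end) or (b_start < a_start and a_end < b_end):
        return "inclusion"
    return "overlap"


# Ranges taken from different annotation layers over the same primary text.
print(relation((0, 4), (0, 4)))
print(relation((0, 5), (3, 8)))
```

Start point identity and end point identity are reported before inclusion here; a different ordering of the checks would be just as defensible.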
Comparison of annotation layers • We distinguish two kinds of relations between elements: • relations between single instances of an element (relations) • relations between all instances of an element (meta-relations) • Prolog programs have been developed to infer both kinds of relations
Prolog Implementation • Aims: • statistics on annotation layers • relations between occurrences of elements • meta-relations
Statistics of the annotation according to HPSG
?- get_statistics.
Please enter layer name or type "q" to exit, "h" for help :
|: hpsg.
Statistics for hpsg
Number of Nodes : 14, Number of different Elements : 5
Number of Attributes : 1, Number of different A/V-pairs : 4
------------------------------------------
Different elements and their occurrences :
hpsg 1
nodesAndLabels 3
nonannotated-text 4
phrase 2
punctuation 4
------------------------------------------
Attribute # occurrences # different values
type 5 4
For information on occurrences of Attribute-Value-Pairs enter Attribute name or type q to quit.
|: type.
( edgeCOMP,1 ) , ( edgeHD,2 ) , ( np,1 ) , ( np-no,1 )
Relations between occurrences of elements • Query: How often does a certain relation between elements hold? chk_relation(Relation,Element1,Layer1,Element2,Layer2,L). • Relation: a relation between elements (e.g. identity, overlap, or endA_is_starting_pointB) • Element1: element name of annotation Layer1 • Element2: element name of annotation Layer2 • L: result list • It is also possible to infer examples and counter-examples of a certain relation
Example: Relations between elements of the HPSG annotation and the elements of a dialogue annotation
Ex.: Relations between HPSG-phrases and X
?- chk_relation(Relation,phrase,hpsg,X,dialogue,L).
Relation = identity
X = _G160
L = [] ;
Relation = included_B_in_A
X = _G160
L = [] ;
Relation = included_A_in_B
X = _G160
L = [[[phrase, dialogue, 2], [phrase, 2], [dialogue, 1]]] ;
...
Relation = overlap_A
X = _G160
L = []
Yes
Meta-relations • If a certain relation holds for all instances of an element, we define a meta-relation: • identity: for every occurrence of an element A in Layer1 there is an element B in Layer2 which spans the same range of characters • inclusion: for every occurrence of an element A in Layer1 there is an element B in Layer2 which is included or is identical, and the meta-relation identity does not hold • overlap: for every occurrence of an element A in Layer1 there is an element B in Layer2 which overlaps with A • mixed: no meta-relation holds
Meta-relations (cntd.)
• identity: for all occurrences, the following configuration can be found:
<a>....................</a> <b>....................</b>
• inclusion: for all occurrences, one of the following configurations can be found:
<a>....................</a> <b>................................</b>
<a>....................</a> <b>..........................................</b>
<a>....................</a> <b>.......................................</b>
<a>....................</a> <b>....................</b>
• overlap: for all occurrences, the following configuration can be found:
<a>....................</a> <b>....................</b>
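The meta-relations can be approximated in a few lines once every element occurrence is reduced to its character range. This is only a sketch of the definitions above, not the project's Prolog code: the function name meta_relation is hypothetical, inclusion is read as "A lies inside B or is identical to it" (an assumption based on the configurations shown), and an empty list of A occurrences is reported as mixed.

```python
def meta_relation(spans_a, spans_b):
    """Return the meta-relation between all occurrences of A and the B elements.

    spans_a, spans_b: lists of (start, end) character ranges from two layers.
    """
    def same(a, b):
        return a == b

    def inside(a, b):
        # Assumption: inclusion means A is contained in (or identical to) B.
        return b[0] <= a[0] and a[1] <= b[1]

    def overlaps(a, b):
        # Partial overlap: the ranges cross without one containing the other.
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

    def holds_for_all(test):
        return all(any(test(a, b) for b in spans_b) for a in spans_a)

    if not spans_a:
        return "mixed"
    if holds_for_all(same):
        return "identity"
    if holds_for_all(inside):
        # Reached only when identity does not hold for every occurrence.
        return "inclusion"
    if holds_for_all(overlaps):
        return "overlap"
    return "mixed"


print(meta_relation([(0, 4), (5, 8)], [(0, 8)]))
```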
Context specification 1: Motivation • Often, general meta-relations do not hold • In these cases, the elements can be classified according to structural properties within their layer • This makes it possible to construct specific meta-relations • A format for expressing the structural properties, called "Context Specification Document" (CSD), has been developed
Context specification 2: Realization • Subclassification of element nodes via tree walking automata (TWA) • Underlying path-language for the construction of TWA: Caterpillar-Expressions (cf. Brüggemann-Klein and Wood, 2000): • moves: up, right, left, firstChild, lastChild • tests: isFirst, isLast, isLeaf, isRoot • test for element names • Kleene-star operator ‘*‘
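To make the moves and tests concrete, here is a small sketch of an evaluator for star-free caterpillar expressions. It is an illustration only: the Node class, the matches function, and the sample tree are hypothetical, the Kleene star is left out, and any step that is neither a move nor a test is treated as an element-name test.

```python
class Node:
    """Minimal tree node with a parent link (hypothetical helper class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.parent = None
        self.children = list(children)
        for child in self.children:
            child.parent = self

    def siblings(self):
        return self.parent.children if self.parent else [self]


def _sibling(node, step):
    sibs = node.siblings()
    i = sibs.index(node) + step
    return sibs[i] if 0 <= i < len(sibs) else None


MOVES = {
    "up": lambda n: n.parent,
    "left": lambda n: _sibling(n, -1),
    "right": lambda n: _sibling(n, +1),
    "firstChild": lambda n: n.children[0] if n.children else None,
    "lastChild": lambda n: n.children[-1] if n.children else None,
}

TESTS = {
    "isRoot": lambda n: n.parent is None,
    "isLeaf": lambda n: not n.children,
    "isFirst": lambda n: n.siblings()[0] is n,
    "isLast": lambda n: n.siblings()[-1] is n,
}


def matches(node, expression):
    """Walk the expression from node; True if every move and test succeeds."""
    for step in expression:
        if step in MOVES:
            node = MOVES[step](node)
            if node is None:
                return False
        elif step in TESTS:
            if not TESTS[step](node):
                return False
        elif node.name != step:
            # Anything else is read as a test for the element name.
            return False
    return True


# Hypothetical two-daughter tree: NP dominating COMP and HD.
comp, hd = Node("COMP"), Node("HD")
np = Node("NP", [comp, hd])
print(matches(hd, ["left", "COMP"]), matches(hd, ["up", "NP"]))
```

Elements for which an expression succeeds form a subclass, and meta-relations can then be checked for that subclass instead of for all occurrences of the element.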
Sample application • Caterpillar expressions: • caterpillarToComp: left 'Comp' • caterpillarToNP: up 'NP' [Figure: tree over the text "shucchou no keN" with node labels HD, NP, COMP, NP.NO, NF, VN, PGen]
Context specification 3: Subclassification • Caterpillar expressions: • caterpillarToComp: left 'Comp' • caterpillarToNP: up 'NP' [Figure: tree over the text "shucchou no keN" with node labels HD, NP, COMP, NP.NO, NF, VN, PGen; callouts: "Relation holds for all 'Comp' elements" and "Relation holds only for a subset"]
Unification of annotation layers I • Two document layers can be merged • This process has also been implemented in Prolog • The predicate (semt) receives four arguments: • layer1 (to be unified) • layer2 (to be unified) • a list of elements which should be deleted in the process of unification • the name of a new file, to which the result of the merger (again a collection of Prolog facts) is written • The new database contains a copy of all layers in the input database plus the result layer • If the unification results in a layer whose elements would not be properly nested, a second result layer (a difference list) is created.
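The following sketch shows one way a merge with a difference list might behave. It is not the semt predicate: the function names, the (start, end, name) tuples, and the policy of keeping elements in input order and moving any later element that crosses an already kept one to the difference list are all assumptions. It also returns its results instead of writing them to a file.

```python
def crosses(a, b):
    """True if two ranges partially overlap, so neither can nest in the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def unify(layer1, layer2, delete=()):
    """Merge two layers given as lists of (start, end, name) tuples.

    Returns (result, difference): result is properly nested, difference
    holds the elements that could not be nested into it.
    Assumed policy: layer1 elements are considered before layer2 elements.
    """
    result, difference = [], []
    for span in list(layer1) + list(layer2):
        if span[2] in delete:
            continue
        if any(crosses(span, kept) for kept in result):
            difference.append(span)
        else:
            result.append(span)
    # Outer elements first: sort by start, then by decreasing end.
    result.sort(key=lambda s: (s[0], -s[1]))
    return result, difference


hpsg = [(0, 10, "s"), (0, 6, "phrase")]
dialogue = [(4, 10, "turn"), (0, 4, "w"), (0, 10, "page")]
merged, leftover = unify(hpsg, dialogue, delete=["page"])
print(merged)
print(leftover)
```

A non-empty difference list is the case in which a plain XML linearisation is not possible and milestones or fragmentation are needed.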
Unification of annotation layers II • The result database is re-converted to XML using a Python program • If no difference list exists, the result of merging two layers can be linearised as an XML document straightforwardly • If the result fact base contains a difference list, two different linearisations can be generated: • the default processing uses milestone elements to mark the borders of incompatible elements • alternatively, the technique of fragmentation of elements can be invoked
Architecture [Figure: XML documents, each with its own document grammar, are converted via Python into a Prolog fact base; within Prolog: inference/query, unification of annotation levels, rules, and external information; XML documents are generated from the fact base via Python; a secondary level is covered in the next talk]