Hierarchical XML Layers Representation for Heavily Annotated Corpora Dan Cristea Cristina Butnariu dcristea@infoiasi.ro cris@infoiasi.ro “Al. I. Cuza” University of Iaşi Faculty of Computer Science and Romanian Academy – the Iaşi Branch Institute for Theoretical Computer Science
XML in LR annotation • A de facto framework to support language annotation • Used to: • record experts' views on linguistic phenomena in corpora • store intermediate results in pipeline NLP applications • post NLP results • BUT: • annotation schemes are chaotic and not reusable • many annotations share parts in common • not all layers are useful for the task at hand LREC 2004 – Workshop on Richly Annotated Corpora
Presentation • Motivation for a structural view on annotation schemes • Proposal for a hierarchical representation • circular references • classification within the hierarchy • operations within the hierarchy • Conclusions
An annotation session • a source XML annotated document • a database image of the annotation (figure labels: Annotation session; both; or; DTD file)
A sequence of annotation sessions (figure: two annotation sessions, with labels DTD1 and DTD2)
Mixing human with automatic annotation (figure: an automatic annotation step and a manual annotation step, with labels DTD1 and DTD2)
Multiple parentage of a scheme (figure: animation frames showing elided tag pairs < … >; diagram not recoverable)
The hierarchy: a DAG representation (figure nodes: ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG)
Definition of a scheme
<scheme name="scheme-name" parents="list-of-parents">
  <tag name="tag-name" attributes="list-of-attributes"/>
  …
  <ref source-tag="tag-name" source-attribute="attribute-name" target-tag="tag-name" target-attribute="attribute-name"/>
  …
</scheme>
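A scheme definition of this shape can be loaded into a small in-memory structure. The sketch below is only an illustration: the scheme instance (ST-COREF with parent ST-NP), the space-separated format of the `parents` and `attributes` lists, and the names `Scheme` and `parse_scheme` are all assumptions, not taken from the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical scheme instance. The parent link and the space-separated
# list format are assumptions made for this example.
SCHEME_XML = """
<scheme name="ST-COREF" parents="ST-NP">
  <tag name="NP" attributes="id coref"/>
  <ref source-tag="NP" source-attribute="coref"
       target-tag="NP" target-attribute="id"/>
</scheme>
"""

@dataclass
class Scheme:
    name: str
    parents: list
    tags: dict = field(default_factory=dict)  # tag name -> set of attribute names
    refs: set = field(default_factory=set)    # (src tag, src attr, tgt tag, tgt attr)

def parse_scheme(xml_text):
    """Read one <scheme> element into a Scheme object."""
    root = ET.fromstring(xml_text.strip())
    scheme = Scheme(name=root.get("name"),
                    parents=root.get("parents", "").split())
    for tag in root.findall("tag"):
        scheme.tags[tag.get("name")] = set(tag.get("attributes", "").split())
    for ref in root.findall("ref"):
        scheme.refs.add((ref.get("source-tag"), ref.get("source-attribute"),
                         ref.get("target-tag"), ref.get("target-attribute")))
    return scheme

scheme = parse_scheme(SCHEME_XML)
```

Here the `ref` element records that the `coref` attribute of an NP points at the `id` attribute of another NP, which matches how `coref` is used in the example document on a later slide.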
The subsumption relation (figure: node A above node B). A node A subsumes a node B in the hierarchy (B is a descendant of A) iff: • any tag-name of A is also in B; • any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name of B; • any semantic relation which holds in A also holds in B; • at least one of the following holds: B has at least one tag-name which is not in A; there is at least one tag-name in B such that at least one attribute in its list of attributes is not in the list of attributes of the homonymous tag-name in A; there is at least one semantic relation which holds in B and which doesn't hold in A.
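The four conditions can be checked mechanically once a scheme is reduced to its tag names, attribute lists, and declared semantic relations. The following is a minimal sketch under that assumption; the `Scheme` tuple and the function name `subsumes` are hypothetical, and semantic relations are modelled simply as hashable values.

```python
from typing import NamedTuple

class Scheme(NamedTuple):
    tags: dict        # tag name -> set of attribute names
    refs: frozenset   # declared semantic relations

def subsumes(a, b):
    """Return True if scheme a subsumes scheme b (b is a descendant of a)."""
    # Conditions 1 and 2: every tag of a is in b, with all of its attributes.
    for tag, attrs in a.tags.items():
        if tag not in b.tags or not attrs <= b.tags[tag]:
            return False
    # Condition 3: every semantic relation of a also holds in b.
    if not a.refs <= b.refs:
        return False
    # Condition 4: b adds at least one tag, attribute, or relation.
    extra_tag = any(tag not in a.tags for tag in b.tags)
    extra_attr = any(tag in a.tags and not attrs <= a.tags[tag]
                     for tag, attrs in b.tags.items())
    extra_ref = not b.refs <= a.refs
    return extra_tag or extra_attr or extra_ref

# Hypothetical schemes used only to exercise the function.
st_seg = Scheme({"SEG": {"id"}}, frozenset())
st_seg_np = Scheme({"SEG": {"id"}, "NP": {"id", "head-id"}}, frozenset())
```

With these two schemes, `subsumes(st_seg, st_seg_np)` holds, while the reverse does not. The last condition makes the relation strict, so a scheme never subsumes itself.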
Example
<?xml version="1.0" encoding="ISO-8859-1" ?>
<ROOT>
  <SEG id="0">
    <NP head-id="2" id="0">
      <TOK id="2" pos="N" lemma="Winston">Winston</TOK>
    </NP>
    <TOK id="3" pos="V" lemma="be">was</TOK>
    <TOK id="4" pos="ING" lemma="dream">dreaming</TOK>
    <TOK id="5" pos="PREP" lemma="of">of</TOK>
    <NP head-id="7" id="2">
      <NP head-id="6" id="1" coref="0">
        <TOK id="6" pos="PRON" lemma="he">his</TOK>
      </NP>
      <TOK id="7" pos="N" lemma="mother">mother</TOK>
    </NP>
    <TOK id="8" pos="PUNCT">.</TOK>
  </SEG>
  <SEG id="1">
    <NP head-id="9" id="3" coref="0">
      <TOK id="9" pos="PRON" lemma="he">He</TOK>
    </NP>
    <TOK id="10" pos="V" lemma="must">must</TOK>
    <TOK id="11" pos="PUNCT">,</TOK>
  </SEG>
  <SEG id="2">
    <NP head-id="12" id="4" coref="0">
      <TOK id="12" pos="PRON" lemma="he">he</TOK>
    </NP>
    <TOK id="13" pos="V" lemma="think">thought</TOK>
    <TOK id="14" pos="PUNCT">,</TOK>
  </SEG>
</ROOT>
How can circular references be notated? <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0" in-seg="seg0">was dreaming</VP> of his mother </SEG>
Representing circular references: SEG annotation <SEG id="seg0"> Winston was dreaming of his mother </SEG> (hierarchy nodes: ST-ROOT, ST-SEG)
Representing circular references: VP annotation Winston <VP id="vp0"> was dreaming </VP> of his mother (hierarchy nodes: ST-ROOT, ST-VP)
Representing circular references: SEG refers into VP <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0"> was dreaming </VP> of his mother </SEG> (hierarchy nodes: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP)
Representing circular references: VP refers into SEG <SEG id="seg0"> Winston <VP id="vp0" in-seg="seg0"> was dreaming </VP> of his mother </SEG> (hierarchy nodes: ST-ROOT, ST-VP, ST-SEG, ST-VP-TO-SEG)
Representing circular references: keeping all references <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0" in-seg="seg0"> was dreaming </VP> of his mother </SEG> (hierarchy nodes: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP)
Representing circular references: delete unnecessary layers (figure: the hierarchy before and after, with nodes ST-ROOT, ST-VP, ST-SEG, ST-SEG-VP; layers marked for deletion: ST-SEG-TO-VP, ST-VP-TO-SEG)
In what conditions can a document interact with a hierarchy? • Compatibility of names • Matching of semantic relations
In what conditions can a document interact with a hierarchy? • Compatibility of names = tag and attribute names • simple translation • expanding/shrinking values: msd="Ncmso" expands into a set of elementary features pos="noun" type="common" gender="masculine" number="singular" case="oblique"
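The expanding/shrinking step can be pictured as a positional lookup over the compact code. The sketch below covers only the single value shown on the slide; the table layout and the names `CATEGORIES`, `expand_msd`, and `shrink_features` are assumptions for illustration, and a real morphosyntactic table would hold many more categories and values.

```python
# Positional tables: one (feature name, code -> value) pair per character
# after the category letter. Only the slide's example values are included.
NOUN_POSITIONS = [
    ("type", {"c": "common"}),
    ("gender", {"m": "masculine"}),
    ("number", {"s": "singular"}),
    ("case", {"o": "oblique"}),
]
CATEGORIES = {"N": ("noun", NOUN_POSITIONS)}

def expand_msd(msd):
    """Expand a compact msd value into elementary attribute/value pairs."""
    pos, positions = CATEGORIES[msd[0]]
    features = {"pos": pos}
    for code, (name, values) in zip(msd[1:], positions):
        features[name] = values[code]
    return features

def shrink_features(features):
    """Inverse of expand_msd: rebuild the compact code from the features."""
    for letter, (pos, positions) in CATEGORIES.items():
        if pos == features["pos"]:
            codes = [next(c for c, v in values.items() if v == features[name])
                     for name, values in positions]
            return letter + "".join(codes)
    raise KeyError(features["pos"])

expanded = expand_msd("Ncmso")
```

Keeping both directions in one table means a document using `msd` and a hierarchy using elementary features can be mapped onto each other without loss for the values the table covers.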
In what conditions can a document interact with a hierarchy? • Matching of semantic relations • only by explicit declaration • automatic detection (intersection of attribute value ranges) is prone to errors
Operations on the lattice: classification • Automatic classification of a document on the lattice proceeds in two steps: • the witness-collection is formed: the document is parsed, yielding tag declarations; the semantic-relations declaration in the header yields ref declarations • the witness-collection is “classified” down the hierarchy
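The first step, forming the witness collection, amounts to walking the document and recording which tags and attributes actually occur. A minimal sketch follows; since the slides do not show the header format for semantic relations, the relations are passed in directly as an assumption, and the name `witness_collection` is hypothetical.

```python
import xml.etree.ElementTree as ET

# A shortened fragment of the example document from the earlier slide.
DOC = (
    '<ROOT><SEG id="0">'
    '<NP head-id="2" id="0">'
    '<TOK id="2" pos="N" lemma="Winston">Winston</TOK>'
    '</NP>'
    '<TOK id="8" pos="PUNCT">.</TOK>'
    '</SEG></ROOT>'
)

def witness_collection(xml_text, declared_refs=()):
    """Collect tag declarations (tag -> attribute names) and ref declarations."""
    tags = {}
    for elem in ET.fromstring(xml_text).iter():
        # Union the attributes seen on every occurrence of the tag.
        tags.setdefault(elem.tag, set()).update(elem.attrib)
    return tags, set(declared_refs)

tags, refs = witness_collection(DOC)
```

Attributes are unioned across occurrences, so the second `TOK`, which has no `lemma`, does not remove `lemma` from the collected declaration.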
Operations on the lattice: classification • The “programming by classification” paradigm of Mellish & Reiter (1993) • the witness collection satisfies the restrictions of a node collection (is classified under it) if the features of the node collection are a subset of the features of the witness collection
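The subset test can be applied top-down from the root, descending only through nodes the witness collection satisfies. The sketch below assumes a toy hierarchy whose edges and feature sets are invented for illustration (the slides do not list them), and encodes a feature as either a tag name or `tag@attribute`.

```python
# Hypothetical hierarchy: node name -> (parent names, feature set).
HIERARCHY = {
    "ST-ROOT": ([], set()),
    "ST-SEG": (["ST-ROOT"], {"SEG", "SEG@id"}),
    "ST-NP": (["ST-ROOT"], {"NP", "NP@id"}),
    "ST-VP": (["ST-ROOT"], {"VP", "VP@id"}),
    "ST-NP-SEG": (["ST-NP", "ST-SEG"], {"SEG", "SEG@id", "NP", "NP@id"}),
}

def classify(witness, hierarchy, root="ST-ROOT"):
    """Return (all satisfied nodes, the most specific satisfied nodes)."""
    children = {name: [] for name in hierarchy}
    for name, (parents, _) in hierarchy.items():
        for parent in parents:
            children[parent].append(name)
    satisfied, frontier = set(), [root]
    while frontier:
        node = frontier.pop()
        # A node is satisfied if its features are a subset of the witness.
        if node in satisfied or not hierarchy[node][1] <= witness:
            continue
        satisfied.add(node)
        frontier.extend(children[node])
    most_specific = {n for n in satisfied
                     if not any(c in satisfied for c in children[n])}
    return satisfied, most_specific

witness = {"SEG", "SEG@id", "NP", "NP@id", "NP@head-id", "TOK", "TOK@id"}
satisfied, most_specific = classify(witness, HIERARCHY)
```

In this toy case the document is classified under ST-ROOT, ST-SEG, ST-NP and ST-NP-SEG but not under ST-VP, and ST-NP-SEG is the most specific node it satisfies. The witness still carries features (`NP@head-id`, `TOK`) that no node covers, which is the situation where a new node would be needed.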
Operations on the lattice: classification • Automatic classification of a document on the lattice (figure nodes: ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG)
Operations on the lattice: classification • Automatic classification of a document on the lattice (figure annotation on the same hierarchy: superior borderline)
Operations on the lattice: classification • Automatic classification of a document on the lattice (figure annotations on the same hierarchy: superior borderline, inferior borderline)
Operations on the lattice: classification • Automatic classification of a document on the lattice (figure annotation on the same hierarchy: ST-SEG-NP-VP-1)
Operations on the lattice: classification • Automatic classification of a document on the lattice (figure annotation on the same hierarchy: superior borderline)
Operations on the lattice: classification • Automatic classification of a document on the lattice (figure annotation on the same hierarchy: ST-NP-PP)
Operations on the lattice: merge (figure nodes: ST-ROOT, ST-PAR, ST-TOK, ST-NP, ST-SEG, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG; figure annotation: ST-NP-SEG)
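The slide shows only a node named ST-NP-SEG alongside the hierarchy and does not define the merge operation. One plausible reading, assumed here and not confirmed by the slides, is that merging two schemes yields a scheme carrying the union of their tags and attributes, which both inputs then subsume. The sketch below illustrates that reading at the scheme level only; `merge_schemes` and the sample schemes are hypothetical.

```python
def merge_schemes(a, b):
    """Union of two schemes given as dicts: tag name -> set of attribute names.

    Assumption: merge is modelled as a plain union. The slides do not
    specify how the operation is actually defined.
    """
    merged = {tag: set(attrs) for tag, attrs in a.items()}
    for tag, attrs in b.items():
        merged.setdefault(tag, set()).update(attrs)
    return merged

# Hypothetical contents for the two parent schemes.
st_np = {"NP": {"id", "head-id"}}
st_seg = {"SEG": {"id"}}
st_np_seg = merge_schemes(st_np, st_seg)
```

Because the inputs are copied before the union, neither parent scheme is modified, and attributes of a tag that appears in both inputs are combined rather than overwritten.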
Operations on the lattice: extract (figure nodes: ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-COREF, ST-COREF-IN-SEG; figure annotation: ST-POS)
Conclusions • Propose a data structure facilitating: • Definition and exploitation of annotation schemes • Visualization of the hierarchy • Representation of circular references • Concurrent annotations • Automatic classification • Operations • initialize-hierarchy • classify • merge • extract • System developed in Java, freely available on request
Acknowledgements The research presented in this paper has been partly supported by the EC-funded IST-2000-29388 Balkanet project and by the Balkanet-MEC project, funded by the Romanian Ministry of Education and Research.
Thank you…