
Hierarchical XML Layers Representation for Heavily Annotated Corpora



Presentation Transcript


  1. Hierarchical XML Layers Representation for Heavily Annotated Corpora
     Dan Cristea (dcristea@infoiasi.ro), Cristina Butnariu (cris@infoiasi.ro)
     “Al. I. Cuza” University of Iaşi, Faculty of Computer Science
     Romanian Academy – the Iaşi Branch, Institute for Theoretical Computer Science

  2. XML in LR annotation
     • A de facto framework to support language annotation
     • Used to:
       • record experts' views on linguistic phenomena in corpora
       • store intermediate results in pipeline NLP applications
       • post NLP results
     • BUT:
       • annotation schemes are chaotic and not reusable
       • many annotations share parts in common
       • not all layers are useful for the task at hand
     LREC 2004 – Workshop on Richly Annotated Corpora

  3. Presentation
     • Motivation for a structural view on annotation schemes
     • Proposal for a hierarchical representation
       • circular references
       • classification within the hierarchy
       • operations within the hierarchy
     • Conclusions

  4. An annotation session
     • a source XML annotated document
     • a database image of the annotation
     • or both
     [Diagram labels: Annotation session, DTD file]

  5. A sequence of annotation sessions [Diagram labels: Annotation session, Annotation session, DTD1, DTD2]

  6. Mixing human with automatic annotation [Diagram labels: Automatic annotation, Manual annotation, DTD1, DTD2]

  7. Multiple parentage of a scheme [Diagram]

  8. Multiple parentage [Diagram]

  9. Multiple parentage [Diagram with tag placeholders: < … > < … >]

  10. Multiple parentage [Diagram]

  11. Multiple parentage [Diagram with tag placeholders: < … > < … >]

  12. Multiple parentage [Diagram]

  13. Multiple parentage [Diagram with tag placeholders: < … > < … > < … > < … >]

  14. The hierarchy – a DAG representation [Diagram nodes: ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG]

  15. The hierarchy – a DAG representation [Same diagram as slide 14]

  16. Definition of a scheme
      <scheme name="scheme-name" parents="list-of-parents">
        <tag name="tag-name" attributes="list-of-attributes"/>
        …
        <ref source-tag="tag-name" source-attribute="attribute-name" target-tag="tag-name" target-attribute="attribute-name"/>
        …
      </scheme>
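
A minimal sketch of how a scheme definition in this format might be loaded into memory. It assumes that list-valued attributes (`parents`, `attributes`) are space-separated, which the slide does not specify. The names `Scheme` and `parse_scheme` are hypothetical, and the sample scheme, including the parent it lists, is illustrative rather than taken from the slides.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Scheme:
    name: str
    parents: list
    # tag-name -> set of attribute names declared for that tag
    tags: dict = field(default_factory=dict)
    # (source-tag, source-attribute, target-tag, target-attribute) tuples
    refs: set = field(default_factory=set)


def parse_scheme(xml_text):
    """Parse one <scheme> element (hypothetical helper, not the authors' code)."""
    root = ET.fromstring(xml_text)
    # Assumption: list-valued attributes are whitespace-separated.
    scheme = Scheme(root.get("name"), root.get("parents", "").split())
    for tag in root.findall("tag"):
        scheme.tags[tag.get("name")] = set(tag.get("attributes", "").split())
    for ref in root.findall("ref"):
        scheme.refs.add((ref.get("source-tag"), ref.get("source-attribute"),
                         ref.get("target-tag"), ref.get("target-attribute")))
    return scheme


# Illustrative scheme: NP elements whose coref attribute points at another NP's id.
EXAMPLE = """
<scheme name="ST-COREF" parents="ST-NP">
  <tag name="NP" attributes="id coref"/>
  <ref source-tag="NP" source-attribute="coref"
       target-tag="NP" target-attribute="id"/>
</scheme>
"""

scheme = parse_scheme(EXAMPLE)
```

Storing attributes and references as sets keeps later comparisons between schemes down to plain subset tests.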

  17. The subsumption relation [Diagram labels: A, B]
      A node A subsumes a node B in the hierarchy (B is a descendant of A) iff:
      • any tag-name of A is also in B;
      • any attribute in the list of attributes of a tag-name in A is also in the list of attributes of the same tag-name of B;
      • any semantic relation which holds in A also holds in B;
      • at least one of the following holds: B has a tag-name which is not in A; some tag-name in B has an attribute which is not in the list of attributes of the homonymous tag-name in A; some semantic relation holds in B but not in A.
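
The four conditions of the subsumption relation translate directly into set comparisons. The sketch below assumes a scheme is represented as a dict from tag-name to a set of attribute names, plus a set of semantic relations; that representation and the function name `subsumes` are illustrative assumptions, not the authors' implementation.

```python
def subsumes(a_tags, a_refs, b_tags, b_refs):
    """Return True if scheme A subsumes scheme B.

    a_tags / b_tags: dict mapping tag-name -> set of attribute names.
    a_refs / b_refs: set of semantic relations (any hashable values).
    """
    # 1. Every tag-name of A is also in B.
    if not set(a_tags) <= set(b_tags):
        return False
    # 2. Every attribute of a tag in A is also an attribute of the same tag in B.
    if any(not a_tags[t] <= b_tags[t] for t in a_tags):
        return False
    # 3. Every semantic relation holding in A also holds in B.
    if not a_refs <= b_refs:
        return False
    # 4. B is strictly richer: an extra tag, an extra attribute on a shared tag,
    #    or an extra semantic relation.
    extra_tag = bool(set(b_tags) - set(a_tags))
    extra_attr = any(b_tags[t] - a_tags[t] for t in a_tags)
    extra_ref = bool(b_refs - a_refs)
    return extra_tag or extra_attr or extra_ref


# Illustrative schemes: a token layer and a token layer enriched with POS.
st_tok = {"TOK": {"id"}}
st_pos = {"TOK": {"id", "pos"}}

tok_subsumes_pos = subsumes(st_tok, set(), st_pos, set())
pos_subsumes_tok = subsumes(st_pos, set(), st_tok, set())
tok_subsumes_itself = subsumes(st_tok, set(), st_tok, set())
```

The fourth condition makes the relation strict, so a scheme never subsumes itself.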

  18. Example
      <?xml version="1.0" encoding="ISO-8859-1" ?>
      <ROOT>
        <SEG id="0">
          <NP head-id="2" id="0">
            <TOK id="2" pos="N" lemma="Winston">Winston</TOK>
          </NP>
          <TOK id="3" pos="V" lemma="be">was</TOK>
          <TOK id="4" pos="ING" lemma="dream">dreaming</TOK>
          <TOK id="5" pos="PREP" lemma="of">of</TOK>
          <NP head-id="7" id="2">
            <NP head-id="6" id="1" coref="0">
              <TOK id="6" pos="PRON" lemma="he">his</TOK>
            </NP>
            <TOK id="7" pos="N" lemma="mother">mother</TOK>
          </NP>
          <TOK id="8" pos="PUNCT">.</TOK>
        </SEG>
        <SEG id="1">
          <NP head-id="9" id="3" coref="0">
            <TOK id="9" pos="PRON" lemma="he">He</TOK>
          </NP>
          <TOK id="10" pos="V" lemma="must">must</TOK>
          <TOK id="11" pos="PUNCT">,</TOK>
        </SEG>
        <SEG id="2">
          <NP head-id="12" id="4" coref="0">
            <TOK id="12" pos="PRON" lemma="he">he</TOK>
          </NP>
          <TOK id="13" pos="V" lemma="think">thought</TOK>
          <TOK id="14" pos="PUNCT">,</TOK>
        </SEG>
      </ROOT>

  19. How can circular references be notated?
      <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0" in-seg="seg0">was dreaming</VP> of his mother </SEG>

  20. Representing circular references: SEG annotation [Diagram nodes: ST-ROOT, ST-SEG]
      <SEG id="seg0"> Winston was dreaming of his mother </SEG>

  21. Representing circular references: VP annotation [Diagram nodes: ST-ROOT, ST-VP]
      Winston <VP id="vp0"> was dreaming </VP> of his mother

  22. Representing circular references: SEG refers into VP [Diagram nodes: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP]
      <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0"> was dreaming </VP> of his mother </SEG>

  23. Representing circular references: VP refers into SEG [Diagram nodes: ST-ROOT, ST-VP, ST-SEG, ST-VP-TO-SEG]
      <SEG id="seg0"> Winston <VP id="vp0" in-seg="seg0"> was dreaming </VP> of his mother </SEG>

  24. Representing circular references: Keeping all references [Diagram nodes: ST-ROOT, ST-VP, ST-SEG, ST-SEG-TO-VP, ST-VP-TO-SEG, ST-SEG-VP]
      <SEG id="seg0" head-id="vp0"> Winston <VP id="vp0" in-seg="seg0"> was dreaming </VP> of his mother </SEG>

  25. Representing circular references: Delete unnecessary layers [Diagram: two copies of the hierarchy with nodes ST-ROOT, ST-VP, ST-SEG, ST-SEG-VP; the layers ST-SEG-TO-VP and ST-VP-TO-SEG appear only once]

  26. Under what conditions can a document interact with a hierarchy?
      • Compatibility of names
      • Matching of semantic relations

  27. Under what conditions can a document interact with a hierarchy?
      • Compatibility of names = tag and attribute names
        • simple translation
        • expanding/shrinking values
      msd="Ncmso" expands into a set of elementary features: pos="noun" type="common" gender="masculine" number="singular" case="oblique"
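
The expanding/shrinking step can be sketched as a positional lookup. The table below holds only the five codes that appear in the slide's `Ncmso` example; a real mapping would need the full morphosyntactic specification, which is not given here. The names `MSD_POSITIONS`, `expand_msd` and `shrink_features` are hypothetical.

```python
# Positional decoding table: one (feature name, code -> value) pair per character.
# Only the values from the slide's example are filled in (an assumption of this sketch).
MSD_POSITIONS = [
    ("pos", {"N": "noun"}),
    ("type", {"c": "common"}),
    ("gender", {"m": "masculine"}),
    ("number", {"s": "singular"}),
    ("case", {"o": "oblique"}),
]


def expand_msd(msd):
    """Expand a compact msd code into elementary feature/value pairs."""
    features = {}
    for code, (name, values) in zip(msd, MSD_POSITIONS):
        features[name] = values[code]  # raises KeyError for codes outside the table
    return features


def shrink_features(features):
    """Inverse operation: rebuild the compact msd code from elementary features."""
    code = ""
    for name, values in MSD_POSITIONS:
        inverse = {value: key for key, value in values.items()}
        code += inverse[features[name]]
    return code


expanded = expand_msd("Ncmso")
compact = shrink_features(expanded)
```

Keeping one table for both directions ensures that expanding and then shrinking a value returns the original code.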

  28. Under what conditions can a document interact with a hierarchy?
      • Matching of semantic relations
        • only by explicit declaration
        • automatic detection (intersection of attribute value ranges) is prone to errors

  29. Operations on the lattice: classification
      • Automatic classification of a document on the lattice proceeds in two steps:
        • the witness-collection is formed:
          • the document is parsed → tag declarations
          • semantic-relations declaration in the header → ref declarations
        • the witness-collection is “classified” down the hierarchy
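
The first step, forming the tag-declaration part of the witness collection, can be sketched by walking the parsed document and recording which attributes each tag carries. The sketch covers only that half: the slides do not show how semantic relations are declared in the header, so ref declarations are left out. The function name `collect_witness` and the shortened sample document are assumptions.

```python
import xml.etree.ElementTree as ET


def collect_witness(xml_text):
    """Map each tag name found in the document to the set of attribute names it carries."""
    witness = {}
    for element in ET.fromstring(xml_text).iter():
        # Updating a set with a dict adds the dict's keys, i.e. the attribute names.
        witness.setdefault(element.tag, set()).update(element.attrib)
    return witness


# Shortened fragment in the style of the example slide.
DOC = """<ROOT>
  <SEG id="0">
    <NP head-id="2" id="0"><TOK id="2" pos="N" lemma="Winston">Winston</TOK></NP>
    <TOK id="3" pos="V" lemma="be">was</TOK>
  </SEG>
</ROOT>"""

witness = collect_witness(DOC)
```

The result has the same tag-to-attributes shape as a scheme definition, so it can be compared against hierarchy nodes directly.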

  30. Operations on the lattice: classification
      • The “programming by classification” paradigm of Mellish & Reiter (1993)
      • the witness collection satisfies the restrictions of a node collection (is classified under it) if the features of the node collection represent a subset of the features of the witness collection
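
The subset test above suggests a top-down search: start at the root, descend only through nodes whose features the witness collection satisfies, and report the most specific such nodes. The sketch below is one way to do that under stated assumptions. Features are flattened to strings such as `"TOK@pos"`, and the toy hierarchy and its edges are hypothetical, since the edges of the slides' DAG are not recoverable from the transcript.

```python
def classify(features, children, witness, root="ST-ROOT"):
    """Return the most specific nodes whose features are a subset of the witness.

    features: dict node -> set of feature strings.
    children: dict node -> list of child nodes (DAG edges).
    """
    satisfied = set()
    frontier = [root]
    while frontier:
        node = frontier.pop()
        # Skip nodes already visited or whose restrictions the witness fails.
        if node in satisfied or not features[node] <= witness:
            continue
        satisfied.add(node)
        frontier.extend(children.get(node, []))
    # Keep only satisfied nodes that have no satisfied child.
    return {n for n in satisfied
            if not any(c in satisfied for c in children.get(n, []))}


# Toy hierarchy (illustrative edges and features, not the slides' actual DAG).
features = {
    "ST-ROOT": set(),
    "ST-TOK": {"TOK", "TOK@id"},
    "ST-POS": {"TOK", "TOK@id", "TOK@pos"},
    "ST-NP": {"NP", "NP@id"},
    "ST-COREF": {"NP", "NP@id", "NP@coref"},
}
children = {
    "ST-ROOT": ["ST-TOK", "ST-NP"],
    "ST-TOK": ["ST-POS"],
    "ST-NP": ["ST-COREF"],
}

witness = {"TOK", "TOK@id", "TOK@pos", "NP", "NP@id"}
result = classify(features, children, witness)
```

Because a child's features include its parents' features, pruning at the first unsatisfied node never hides a satisfied descendant.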

  31. Operations on the lattice: classification • Automatic classification of a document on the lattice [Diagram nodes: ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG]

  32. Operations on the lattice: classification [Same diagram as slide 31]

  33. Operations on the lattice: classification [Same diagram as slide 31]

  34. Operations on the lattice: classification [Same diagram as slide 31]

  35. Operations on the lattice: classification [Same diagram as slide 31; added label: superior borderline]

  36. Operations on the lattice: classification [Same diagram as slide 31; added labels: superior borderline, inferior borderline]

  37. Operations on the lattice: classification [Same diagram as slide 31; added label: ST-SEG-NP-VP-1]

  38. Operations on the lattice: classification [Same diagram as slide 31]

  39. Operations on the lattice: classification [Same diagram as slide 31]

  40. Operations on the lattice: classification [Same diagram as slide 31]

  41. Operations on the lattice: classification [Same diagram as slide 31; added label: superior borderline]

  42. Operations on the lattice: classification [Same diagram as slide 31; added label: ST-NP-PP]

  43. Operations on the lattice: merge [Diagram nodes: ST-ROOT, ST-PAR, ST-TOK, ST-NP, ST-SEG, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-VP, ST-POS, ST-COREF, ST-COREF-IN-SEG; added label: ST-NP-SEG]

  44. Operations on the lattice: extract [Diagram nodes: ST-ROOT, ST-SEG, ST-PAR, ST-TOK, ST-SEG-NP-VP, ST-PAR-SEG-NP-VP, ST-NP, ST-VP, ST-COREF, ST-COREF-IN-SEG; separate label: ST-POS]

  45. Operations on the lattice: extract [Same diagram as slide 44]

  46. Conclusions
      • Propose a data structure facilitating:
        • Definition and exploitation of annotation schemes
        • Visualization of the hierarchy
        • Representation of circular references
        • Concurrent annotations
        • Automatic classification
      • Operations
        • initialize-hierarchy
        • classify
        • merge
        • extract
      • System developed in Java, freely available on request

  47. Acknowledgements
      The research presented in this paper has been partly supported by the EC-funded IST-2000-29388 Balkanet project and by the Balkanet-MEC project, funded by the Romanian Ministry of Education and Research.

  48. Thank you…
