osis linguistic annotation

definitions and requirements. osis linguistic annotation. kirk e. lowery, westminster hebrew institute, sbl computer-assisted research group. why osis linguistic annotation? the context. the goal of osis: to exchange electronic bibles in any language, medium, presentation style


Presentation Transcript


  1. definitions and requirements: osis linguistic annotation • kirk e. lowery • westminster hebrew institute • sbl computer-assisted research group

  2. why osis linguistic annotation? the context

  3. the goal of osis • to exchange electronic bibles • any language, medium, presentation style • to add “meta-information” to those texts • keywords: “link”, “hierarchy”, “pyramid” • to easily transform these texts • the target transformation is unknown • to cut costs: production, presentation, distribution of bibles plus “meta-data” • time, money, people

  4. why exchange bible texts? • coordination within organizations • cooperation between organizations and between individuals • publish in multiple formats and media from one “canonical” source • long-term archival • the changing definition of “publish” • documents have a life cycle!

  5. who wants to exchange texts? • bible publishers • commercial publishing houses • denominations & bible societies • bible translators • translation teams & editors • consultants & supervisors • bible scholars • original languages, text criticism • text analysis and commentary

  6. what information needs to be captured? text “meta-data”

  7. translators: managing the translation process • document versions & responsibility • comments & corrections by editors • handling presentation issues • script direction • “rubies” • linking source, relay & target translations • linking supplementary information • notes, glossaries, maps

  8. translators & scholars: focus on the text • manuscript collation & description • text criticism: establishment of the original • linguistic analysis • text segmentation • segment id: from phoneme to text structures • linguistic mapping of source & target • alignment: parallel & synoptic texts

  9. how can we capture the information? linguistic annotation

  10. required • a way to segment the text • a mechanism for associating labels with an arbitrary text-span • a means to declare labels used in analysis • a common linguistic vocabulary • language-specific grammar terms • a protocol for user redefinition

  11. segmenting text
      <seg id="gn1:1,1.1">B.:</seg>
      <seg id="gn1:1,1.2">R")$IYT</seg>
      <seg id="gn1:1,2.1">B.FRF)</seg>
      <seg id="gn1:1,3.1">):ELOHIYM </seg>
      <seg id="gn1:1,4.1">)"T</seg>
      <seg id="gn1:1,5.1">HA</seg>
      <seg id="gn1:1,5.2">$.FMAYIM</seg>
      <seg id="gn1:1,6.1">W:</seg>
      <seg id="gn1:1,6.2">)"T</seg>
      <seg id="gn1:1,7.1">HF</seg>
      <seg id="gn1:1,7.2">)FREC</seg>
      slide labels: start tag • unique identification • hebrew text • end tag
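A minimal Python sketch of reading such segment identifiers, using only the standard library. The reading of `gn1:1,1.1` as book, chapter, verse, word and word-part is an assumption inferred from the sample, not something the slides state; the `<verse>` wrapper and the function names are hypothetical and exist only so the fragment parses.

```python
import re
import xml.etree.ElementTree as ET

# Three segments from the slide, wrapped in a hypothetical <verse> root.
SAMPLE = '''<verse>
<seg id="gn1:1,1.1">B.:</seg>
<seg id="gn1:1,1.2">R")$IYT</seg>
<seg id="gn1:1,2.1">B.FRF)</seg>
</verse>'''

# Assumed id layout: <book><chapter>:<verse>,<word>.<part>
SEG_ID = re.compile(r"^([a-z]+)(\d+):(\d+),(\d+)\.(\d+)$")


def parse_seg_id(seg_id):
    """Split a segment id into its assumed components."""
    match = SEG_ID.match(seg_id)
    if match is None:
        raise ValueError(f"unrecognised segment id: {seg_id!r}")
    book, *numbers = match.groups()
    chapter, verse, word, part = (int(n) for n in numbers)
    return {"book": book, "chapter": chapter, "verse": verse,
            "word": word, "part": part}


def segments(xml_text):
    """Return (parsed id, transliterated text) for every <seg> element."""
    root = ET.fromstring(xml_text)
    return [(parse_seg_id(seg.get("id")), seg.text) for seg in root.iter("seg")]


if __name__ == "__main__":
    for seg_id, text in segments(SAMPLE):
        print(seg_id, text)
```

Keeping the identifier structured rather than opaque is what would let a tool address anything from a single word-part up to a whole verse.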

  12. adding annotation (1)
      <seg id="gn1:1,1.1">B.:<lemma>B.</lemma><particle type="preposition" /></seg>
      <seg id="gn1:1,1.2">R")$IYT<lemma>R")$IYT</lemma><noun type="common" features="fsa" /></seg>
      <seg id="gn1:1,2.1">B.FRF)<lemma homonym="1">B.R)</lemma><verb stem="q" conjugation="p" pgn="3ms" /></seg>
      <seg id="gn1:1,3.1">):ELOHIYM <lemma>):ELOHIYM</lemma><noun type="common" features="mpa" /></seg>
      <seg id="gn1:1,4.1">)"T<lemma homonym="1">)"T</lemma><particle type="object_marker" /></seg>
      slide labels: content tag • “milestone” tag

  13. adding annotation (2)
      <seg id="gn1:1,5.1">HA<lemma>H</lemma><particle type="article" /></seg>
      <seg id="gn1:1,5.2">$.FMAYIM<lemma>$FMAYIM</lemma><noun type="common" features="mpa" /></seg>
      <seg id="gn1:1,6.1">W:<lemma>W</lemma><particle type="conjunction" /></seg>
      <seg id="gn1:1,6.2">)"T<lemma homonym="1">)"T</lemma><particle type="object_marker" /></seg>
      <seg id="gn1:1,7.1">HF<lemma>H</lemma><particle type="article" /></seg>
      <seg id="gn1:1,7.2">)FREC<lemma>)EREC</lemma><noun type="common" features="fsa" /></seg>
      slide labels: content tag • “milestone” tag
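To illustrate how a consumer might read the inline annotation shown in the two slides above, here is a rough Python sketch. It assumes that each `<seg>` holds the surface form as leading text, one `<lemma>` content element, and one empty "milestone" element whose tag names the part of speech; the `<verse>` wrapper and the `read_annotations` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Three annotated segments copied from the slides, in a hypothetical wrapper.
SAMPLE = '''<verse>
<seg id="gn1:1,1.1">B.:<lemma>B.</lemma><particle type="preposition" /></seg>
<seg id="gn1:1,2.1">B.FRF)<lemma homonym="1">B.R)</lemma><verb stem="q" conjugation="p" pgn="3ms" /></seg>
<seg id="gn1:1,7.2">)FREC<lemma>)EREC</lemma><noun type="common" features="fsa" /></seg>
</verse>'''


def read_annotations(xml_text):
    """Collect form, lemma, part of speech and features for each segment."""
    rows = []
    for seg in ET.fromstring(xml_text).iter("seg"):
        row = {"id": seg.get("id"), "form": (seg.text or "").strip()}
        for child in seg:
            if child.tag == "lemma":
                # Content tag: the lemma is the element's text.
                row["lemma"] = child.text
            else:
                # Empty "milestone" tag: the tag name is taken as the
                # part of speech, its attributes as the features.
                row["pos"] = child.tag
                row["features"] = dict(child.attrib)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in read_annotations(SAMPLE):
        print(row)
```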

  14. the hard part: linguistic labels • must be standard • must be applicable to any conceivable language • labels are the “linguistic inventory” • must be compatible with current and future linguistic theories • labels must be linguistic theory-neutral • must be redefinable by the user

  15. standard solutions: labels • expert advisory group on language engineering standards (eagles) • <http://www.ilc.pi.cnr.it/EAGLES/home.html> • an initiative of the european commission (1993) • standard grammar labels of morphology and syntax for european languages • create osis standard labels for hebrew, aramaic and greek

  16. standard solutions: mechanism • the text encoding initiative (tei) guidelines • chapter 14: linking, segmentation, & alignment • chapter 16: feature structures • chapter 26: feature system declaration • “stand-off” markup (xlink) or “up-close-and-personal” (inline)? • separate meta-data about the text from the text itself? • “either-or” or “both-and”?
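The inline versus stand-off question can be made concrete with a small Python sketch that lifts inline annotation out of the text and keeps it as separate records pointing back at the text. This is only an illustration of the idea: it addresses spans by segment id and character offset rather than by XLink, and the record layout and the `to_standoff` name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Inline-annotated segments in the style of the slides (hypothetical wrapper).
INLINE = '''<verse>
<seg id="gn1:1,5.1">HA<lemma>H</lemma><particle type="article" /></seg>
<seg id="gn1:1,5.2">$.FMAYIM<lemma>$FMAYIM</lemma><noun type="common" features="mpa" /></seg>
</verse>'''


def to_standoff(xml_text):
    """Split inline markup into bare text plus stand-off annotation records."""
    text = ""
    annotations = []
    for seg in ET.fromstring(xml_text).iter("seg"):
        form = (seg.text or "").strip()
        start = len(text)
        text += form
        for child in seg:
            annotations.append({
                "target": seg.get("id"),      # which segment is described
                "start": start,               # character offsets into `text`
                "end": start + len(form),
                "label": child.tag,
                # Content elements carry text, milestone elements attributes.
                "value": child.text if child.text else dict(child.attrib),
            })
    return text, annotations


if __name__ == "__main__":
    text, annotations = to_standoff(INLINE)
    print(text)
    for record in annotations:
        print(record)
```

With the annotation held apart from the text, several competing analyses of the same span could coexist without touching the base text, which is one argument for the "both-and" option.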

  17. what we must do, exactly formal requirements

  18. labels • claims made about the data itself vs claims about the claims that can be made! • the linguistic model vs the analysis allowed by the model • example: does Hebrew have “adverbs”? • a library of labels as comprehensive as possible • definitions to clarify what “thing” is being labeled • labels are names for grammatical objects

  19. labels as objects • grammatical “objects” have “attributes” or “features” • features can vary over a range of “values” • objects & features have defaults that could be changed • objects & features could be easily extended • objects & features can be arranged linearly or hierarchically
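One way to picture "labels as objects" is the following Python sketch, in which a grammatical object owns features, each feature ranges over allowed values with a changeable default, and objects inherit features from a parent. All class names are hypothetical, as are the defaults; the value names for number and state are guesses expanded from abbreviations and are not given in the slides.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    name: str
    values: tuple                  # the range of values the feature can take
    default: Optional[str] = None  # default that a user could change


@dataclass
class GrammaticalObject:
    name: str
    features: dict = field(default_factory=dict)
    parent: Optional["GrammaticalObject"] = None  # hierarchical arrangement

    def add_feature(self, feature):
        """Extend the object with a new feature."""
        self.features[feature.name] = feature

    def all_features(self):
        """Own features plus those inherited from parent objects."""
        inherited = self.parent.all_features() if self.parent else {}
        return {**inherited, **self.features}

    def describe(self, **chosen):
        """Return a full feature assignment, filling in defaults."""
        result = {}
        for name, feat in self.all_features().items():
            value = chosen.get(name, feat.default)
            if value not in feat.values:
                raise ValueError(f"{value!r} is not a value of {name}")
            result[name] = value
        return result


# Hypothetical model: a generic noun, specialised by a common noun.
noun = GrammaticalObject("noun")
noun.add_feature(Feature("gender", ("masculine", "feminine", "neuter"), "masculine"))
noun.add_feature(Feature("number", ("singular", "plural", "dual"), "singular"))
common_noun = GrammaticalObject("common noun", parent=noun)
common_noun.add_feature(Feature("state", ("absolute", "construct"), "absolute"))

if __name__ == "__main__":
    print(common_noun.describe(gender="feminine"))
```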

  20. mechanism • user language declaration • all labels and their relationships • done by “exclusion”, not inclusion • sensitive to linguistic theory • levels of language: resolution of ambiguity • lexical, semantic, phonemic, morphologic, phrase-, clause-, discourse-, theological levels • “context-free” and “context-bound” analysis • part-of-speech resolution

  21. tei feature structures • the feature element • the most basic markup • requires a label and any number of values • <f t="feature name" value="feature value"> • the feature structure element • <fs name="feature structure name"> • may contain any number of nested <f> and <fs> • models some grammatical object

  22. tei feature example
      <f name="conjugation">
        <vAlt mutExcl="Y">
          <sym id="pf" value="perfect" />
          <sym id="impf" value="imperfect" />
          <sym id="qppt" value="qal_passive_participle" />
          <sym id="wc" value="wayyiqtol" />
          <sym id="impv" value="imperative" />
          <sym id="inf" value="infinitive" />
          <sym id="pt" value="participle" />
        </vAlt>
      </f>
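A declaration like this could drive validation of tagged text. The Python sketch below reads the feature declaration and checks a proposed value against it, treating `mutExcl="Y"` as meaning that at most one alternative may be chosen; that reading of the attribute, and the `allowed_values` and `check` helpers, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The feature declaration from the slide.
FEATURE = '''<f name="conjugation">
  <vAlt mutExcl="Y">
    <sym id="pf" value="perfect" />
    <sym id="impf" value="imperfect" />
    <sym id="qppt" value="qal_passive_participle" />
    <sym id="wc" value="wayyiqtol" />
    <sym id="impv" value="imperative" />
    <sym id="inf" value="infinitive" />
    <sym id="pt" value="participle" />
  </vAlt>
</f>'''


def allowed_values(feature_xml):
    """Return (feature name, {id: value}, mutually exclusive?)."""
    feature = ET.fromstring(feature_xml)
    alternatives = feature.find("vAlt")
    symbols = {s.get("id"): s.get("value") for s in alternatives.iter("sym")}
    return feature.get("name"), symbols, alternatives.get("mutExcl") == "Y"


def check(feature_xml, chosen_ids):
    """Validate chosen symbol ids and return their full value names."""
    name, symbols, exclusive = allowed_values(feature_xml)
    unknown = [c for c in chosen_ids if c not in symbols]
    if unknown:
        raise ValueError(f"unknown {name} value(s): {unknown}")
    if exclusive and len(chosen_ids) > 1:
        raise ValueError(f"{name} values are mutually exclusive")
    return [symbols[c] for c in chosen_ids]


if __name__ == "__main__":
    print(check(FEATURE, ["wc"]))
```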

  23. tei feature structure example
      <fs type="common noun features">
        <f name="gender" org="set" fVal="gm gf gn" />
        <f name="number" org="set" fVal="ns np nd" />
        <f name="state" org="set" fVal="sa sc" />
      </fs>

  24. tei feature library example
      <fvLib id="g" type="gender feature values">
        <vAlt mutExcl="N">
          <sym id="gm" value="masculine"/>
          <sym id="gf" value="feminine" />
          <sym id="gn" value="neuter" />
        </vAlt>
      </fvLib>
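The feature structure and the feature library examples fit together: the `fVal` attribute lists ids that a library defines. A rough Python sketch of that lookup follows. Only the gender library appears in the slides, so ids for number and state are left unresolved rather than guessed; the `load_library` and `resolve` helpers are hypothetical.

```python
import xml.etree.ElementTree as ET

# Feature value library and feature structure, both taken from the slides.
LIBRARY = '''<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine"/>
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>'''

STRUCTURE = '''<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn" />
  <f name="number" org="set" fVal="ns np nd" />
  <f name="state" org="set" fVal="sa sc" />
</fs>'''


def load_library(*library_xml):
    """Build an id -> value table from one or more feature value libraries."""
    table = {}
    for text in library_xml:
        for sym in ET.fromstring(text).iter("sym"):
            table[sym.get("id")] = sym.get("value")
    return table


def resolve(structure_xml, table):
    """Replace each id in fVal with its library value where one is known."""
    resolved = {}
    for feature in ET.fromstring(structure_xml).iter("f"):
        ids = feature.get("fVal", "").split()
        # Ids with no library entry are kept as-is.
        resolved[feature.get("name")] = [table.get(i, i) for i in ids]
    return resolved


if __name__ == "__main__":
    print(resolve(STRUCTURE, load_library(LIBRARY)))
```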

  25. a different approach: Dictionary of Packard-Style Greek Morphology Codes
      <div type="x-tag" osisID="A_APFC" divTitle="A APFC">
        <p>Part of speech: adjective</p>
        <p>Case: accusative</p>
        <p>Number: plural</p>
        <p>Gender: feminine</p>
        <p>Degree: comparative</p>
      </div>
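In this dictionary approach the analysis lives in prose paragraphs rather than in feature elements, so a program has to recover the features from the text. A short Python sketch of that step, assuming every `<p>` follows the "name: value" pattern seen in the sample entry (the `parse_entry` helper is hypothetical):

```python
import xml.etree.ElementTree as ET

# The dictionary entry from the slide.
ENTRY = '''<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  <p>Part of speech: adjective</p>
  <p>Case: accusative</p>
  <p>Number: plural</p>
  <p>Gender: feminine</p>
  <p>Degree: comparative</p>
</div>'''


def parse_entry(entry_xml):
    """Return the morphology code and its features as a dictionary."""
    div = ET.fromstring(entry_xml)
    features = {}
    for paragraph in div.iter("p"):
        # Split "Case: accusative" at the first colon.
        key, _, value = paragraph.text.partition(":")
        features[key.strip().lower()] = value.strip()
    return div.get("osisID"), features


if __name__ == "__main__":
    code, features = parse_entry(ENTRY)
    print(code, features)
```

Compared with feature structures, the entry is easy for a person to read, but its structure is only implicit in the wording of each paragraph.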

  26. what can we do with feature structure marked up text? • self-organizing topic maps • compare linguistic hypotheses with actual usage • XSLT transforms • automated tagging of new features • comparative linguistic study • source↔target language grammar mapping

  27. where do we go from here? conclusions

  28. in the short-term • complete a first pass of language modeling • mark up real biblical text with annotation • distribute to translators and scholars for feedback • does this meet your needs? • is it practical enough that you will use it? • is it flexible enough for your language(s) and linguistic theories?

  29. in the long-term • determine if tei feature structures are sufficient • decide whether to require “inline” or “standoff” markup, or to allow either • determine the best way of integrating linguistic markup with the osis core tag set • explore ideas for authoring software or, at least, linguistic annotation utility programs