definitions and requirements: osis linguistic annotation. kirk e. lowery, westminster hebrew institute. sbl computer-assisted research group
why osis linguistic annotation? the context
the goal of osis • to exchange electronic bibles • any language, medium, presentation style • to add “meta-information” to those texts • keywords: “link”, “hierarchy”, “pyramid” • to easily transform these texts • the target transformation is unknown • to cut costs: production, presentation, distribution of bibles plus “meta-data” • time, money, people
why exchange bible texts? • coordination within organizations • cooperation between organizations and between individuals • publish in multiple formats and media from one “canonical” source • long-term archival • the changing definition of “publish”: documents have a life cycle!
who wants to exchange texts? • bible publishers • commercial publishing houses • denominations & bible societies • bible translators • translation teams & editors • consultants & supervisors • bible scholars • original languages, text criticism • text analysis and commentary
what information needs to be captured? text “meta-data”
translators: managing the translation process • document versions & responsibility • comments & corrections by editors • handling presentation issues • script direction • “rubies” • linking source, relay & target translations • linking supplementary information • notes, glossaries, maps
translators & scholars: focus on the text • manuscript collation & description • text criticism: establishment of the original • linguistic analysis • text segmentation • segment id: from phoneme to text structures • linguistic mapping of source & target • alignment: parallel & synoptic texts
how can we capture the information? linguistic annotation
required • a way to segment the text • a mechanism for associating labels with an arbitrary text-span • a means to declare labels used in analysis • a common linguistic vocabulary • language-specific grammar terms • a protocol for user redefinition
segmenting text
<seg id="gn1:1,1.1">B.:</seg>
<seg id="gn1:1,1.2">R")$IYT</seg>
<seg id="gn1:1,2.1">B.FRF)</seg>
<seg id="gn1:1,3.1">):ELOHIYM</seg>
<seg id="gn1:1,4.1">)"T</seg>
<seg id="gn1:1,5.1">HA</seg>
<seg id="gn1:1,5.2">$.FMAYIM</seg>
<seg id="gn1:1,6.1">W:</seg>
<seg id="gn1:1,6.2">)"T</seg>
<seg id="gn1:1,7.1">HF</seg>
<seg id="gn1:1,7.2">)FREC</seg>
callouts: start tag • unique identification • hebrew text • end tag
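A short sketch of how segments of this form might be read by a program. It uses Python's standard `xml.etree.ElementTree`. The wrapper element `<w>`, the function names, and the reading of the id as book and chapter, verse, word, and word part are assumptions inferred from the examples on the slide; they are not part of any OSIS specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <w> so the fragment has a single root;
# the slide shows only the bare <seg> elements.
FRAGMENT = """<w>
<seg id="gn1:1,1.1">B.:</seg>
<seg id="gn1:1,1.2">R")$IYT</seg>
<seg id="gn1:1,2.1">B.FRF)</seg>
</w>"""


def parse_seg_id(seg_id):
    """Split an id such as 'gn1:1,2.1' into its four parts.

    Assumed layout (read off the slide, not from a spec):
    book+chapter ':' verse ',' word '.' part-of-word
    """
    chapter_ref, rest = seg_id.split(":")
    verse, position = rest.split(",")
    word, part = position.split(".")
    return chapter_ref, int(verse), int(word), int(part)


def read_segments(xml_text):
    """Return a list of (parsed id, transliterated text) pairs."""
    root = ET.fromstring(xml_text)
    return [(parse_seg_id(seg.get("id")), seg.text) for seg in root.iter("seg")]


segments = read_segments(FRAGMENT)
for seg_id, surface in segments:
    print(seg_id, surface)
```

Because every segment carries a unique id, later annotation layers can refer to a segment without repeating or altering the text itself.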
adding annotation (1)
<seg id="gn1:1,1.1">B.:<lemma>B.</lemma><particle type="preposition" /></seg>
<seg id="gn1:1,1.2">R")$IYT<lemma>R")$IYT</lemma><noun type="common" features="fsa" /></seg>
<seg id="gn1:1,2.1">B.FRF)<lemma homonym="1">B.R)</lemma><verb stem="q" conjugation="p" pgn="3ms" /></seg>
<seg id="gn1:1,3.1">):ELOHIYM<lemma>):ELOHIYM</lemma><noun type="common" features="mpa" /></seg>
<seg id="gn1:1,4.1">)"T<lemma homonym="1">)"T</lemma><particle type="object_marker" /></seg>
callouts: content tag • “milestone” tag
adding annotation (2)
<seg id="gn1:1,5.1">HA<lemma>H</lemma><particle type="article" /></seg>
<seg id="gn1:1,5.2">$.FMAYIM<lemma>$FMAYIM</lemma><noun type="common" features="mpa" /></seg>
<seg id="gn1:1,6.1">W:<lemma>W</lemma><particle type="conjunction" /></seg>
<seg id="gn1:1,6.2">)"T<lemma homonym="1">)"T</lemma><particle type="object_marker" /></seg>
<seg id="gn1:1,7.1">HF<lemma>H</lemma><particle type="article" /></seg>
<seg id="gn1:1,7.2">)FREC<lemma>)EREC</lemma><noun type="common" features="fsa" /></seg>
callouts: content tag • “milestone” tag
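The inline annotation above can be pulled back out of the markup with a few lines of code. The following is a minimal sketch in Python, assuming a hypothetical `<w>` wrapper element and treating every empty child other than `<lemma>` as the part-of-speech "milestone" tag; the function name and the record layout are illustrative only.

```python
import xml.etree.ElementTree as ET

# Two annotated segments copied from the slides, inside a hypothetical <w> root.
SAMPLE = '''<w>
<seg id="gn1:1,1.1">B.:<lemma>B.</lemma><particle type="preposition" /></seg>
<seg id="gn1:1,2.1">B.FRF)<lemma homonym="1">B.R)</lemma><verb stem="q" conjugation="p" pgn="3ms" /></seg>
</w>'''


def annotations(xml_text):
    """Return one record per <seg>: id, surface form, lemma, label, features."""
    rows = []
    for seg in ET.fromstring(xml_text).iter("seg"):
        # seg.text is the text before the first child, i.e. the surface form.
        row = {"id": seg.get("id"), "surface": (seg.text or "").strip()}
        for child in seg:
            if child.tag == "lemma":
                # Content tag: the lemma is the element's text.
                row["lemma"] = child.text
            else:
                # Assumed: any other child is the empty "milestone" tag whose
                # name is the part of speech and whose attributes are features.
                row["pos"] = child.tag
                row["features"] = dict(child.attrib)
        rows.append(row)
    return rows


rows = annotations(SAMPLE)
for row in rows:
    print(row)
```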
the hard part: linguistic labels • must be standard • must be applicable to any conceivable language • labels are the “linguistic inventory” • must be compatible with current and future linguistic theories • labels must be linguistic theory-neutral • must be redefinable by the user
standard solutions: labels • expert advisory group on language engineering standards (eagles) • <http://www.ilc.pi.cnr.it/EAGLES/home.html> • an initiative of the european commission (1993) • standard grammar labels of morphology and syntax for european languages • create osis standard labels for hebrew, aramaic and greek
standard solutions: mechanism • the text encoding initiative (tei) guidelines • chapter 14: linking, segmentation, & alignment • chapter 16: feature structures • chapter 26: feature system declaration • “stand-off” markup (xlink) or “up-close-and-personal” (inline)? • separate meta-data about the text from the text itself? • “either-or” or “both-and”?
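To make the "stand-off" versus inline question concrete, here is a small illustration in Python with plain dictionaries rather than XLink. The text layer and the annotation layer are kept apart and joined only through segment ids. The record layout (`target`, `label`) and the function name are assumptions made for this sketch, not TEI or OSIS constructs.

```python
# Text layer: segment id -> surface form (transliterated Hebrew from the deck).
text = {
    "gn1:1,1.1": "B.:",
    "gn1:1,1.2": 'R")$IYT',
}

# Stand-off layer: each annotation points at a segment id instead of
# wrapping the text. The key names are hypothetical.
standoff = [
    {"target": "gn1:1,1.1", "label": "particle", "type": "preposition"},
    {"target": "gn1:1,1.2", "label": "noun", "type": "common", "features": "fsa"},
]


def to_inline(text, standoff):
    """Merge the two layers into one record per segment (an inline view)."""
    by_target = {}
    for note in standoff:
        payload = {k: v for k, v in note.items() if k != "target"}
        by_target.setdefault(note["target"], []).append(payload)
    return [
        {"id": seg_id, "surface": surface, "annotations": by_target.get(seg_id, [])}
        for seg_id, surface in text.items()
    ]


inline = to_inline(text, standoff)
for record in inline:
    print(record)
```

The merge leaves the text layer untouched, which is the main attraction of stand-off markup: several competing analyses can point at the same segments. The inline view is what the earlier `<seg>` examples show directly.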
what we must do, exactly formal requirements
labels • claims made about the data itself vs claims about the claims that can be made! • the linguistic model vs the analysis allowed by the model • example: does Hebrew have “adverbs”? • a library of labels as comprehensive as possible • definitions to clarify what “thing” is being labeled • labels are names for grammatical objects
labels as objects • grammatical “objects” have “attributes” or “features” • features can vary over a range of “values” • objects & features have defaults that could be changed • objects & features could be easily extended • objects & features can be arranged linearly or hierarchically
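One possible way to model these points in code is sketched below in Python. The class names, the `describe` method, and the choice of `ns` as a default value for number are all assumptions made for illustration; the value ids (`gm`, `gf`, `gn`, `ns`, `np`, `nd`) are the ones used in the TEI examples in this deck.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    """A feature that varies over a fixed range of values."""
    name: str
    values: tuple                   # permitted range of values
    default: Optional[str] = None   # optional default; callers may override it

    def check(self, value):
        if value not in self.values:
            raise ValueError(f"{value!r} is not a value of {self.name}")
        return value


@dataclass
class GrammaticalObject:
    """A labeled grammatical object carrying a set of features."""
    label: str
    features: dict = field(default_factory=dict)

    def add_feature(self, feature):
        # Extension point: new features can be attached at any time.
        self.features[feature.name] = feature

    def describe(self, **chosen):
        """Resolve each feature to the chosen value or to its default."""
        result = {}
        for name, feature in self.features.items():
            value = chosen.get(name, feature.default)
            result[name] = feature.check(value) if value is not None else None
        return result


noun = GrammaticalObject("common noun")
noun.add_feature(Feature("gender", ("gm", "gf", "gn")))
# The default "ns" is a hypothetical choice for this sketch.
noun.add_feature(Feature("number", ("ns", "np", "nd"), default="ns"))

print(noun.describe(gender="gf"))
print(noun.describe(gender="gm", number="np"))
```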
mechanism • user language declaration • all labels and their relationships • done by “exclusion”, not inclusion • sensitive to linguistic theory • levels of language: resolution of ambiguity • lexical, semantic, phonemic, morphologic, phrase-, clause-, discourse-, theological levels • “context-free” and “context-bound” analysis • part-of-speech resolution
tei feature structures • the feature element • the most basic markup • requires a label and any number of values • <f t="feature name" value="feature value"> • the feature structure element • <fs name="feature structure name"> • may contain any number of nested <f> and <fs> • models some grammatical object
tei feature example
<f name="conjugation">
  <vAlt mutExcl="Y">
    <sym id="pf" value="perfect" />
    <sym id="impf" value="imperfect" />
    <sym id="qppt" value="qal_passive_participle" />
    <sym id="wc" value="wayyiqtol" />
    <sym id="impv" value="imperative" />
    <sym id="inf" value="infinitive" />
    <sym id="pt" value="participle" />
  </vAlt>
</f>
tei feature structure example
<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn" />
  <f name="number" org="set" fVal="ns np nd" />
  <f name="state" org="set" fVal="sa sc" />
</fs>
tei feature library example
<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine" />
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>
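The feature structure and the feature library are linked through ids: each `fVal` on an `<f>` lists the ids of `<sym>` values declared in a library. A rough sketch of checking that link in Python follows. The function names are hypothetical, and the check only reflects how the examples in this deck appear to fit together, not a full TEI validation.

```python
import xml.etree.ElementTree as ET

LIBRARY = '''<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine"/>
    <sym id="gf" value="feminine"/>
    <sym id="gn" value="neuter"/>
  </vAlt>
</fvLib>'''

STRUCTURE = '''<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn"/>
  <f name="number" org="set" fVal="ns np nd"/>
</fs>'''


def library_values(xml_text):
    """Map each <sym> id in a feature-value library to its value."""
    return {sym.get("id"): sym.get("value")
            for sym in ET.fromstring(xml_text).iter("sym")}


def unresolved(fs_xml, known):
    """List, per feature, the fVal ids that no loaded library declares."""
    missing = {}
    for f in ET.fromstring(fs_xml).iter("f"):
        absent = [ref for ref in f.get("fVal", "").split() if ref not in known]
        if absent:
            missing[f.get("name")] = absent
    return missing


known = library_values(LIBRARY)
missing = unresolved(STRUCTURE, known)
print(known)
print(missing)
```

Here only the gender library is loaded, so the number values are reported as unresolved; loading a corresponding number library would clear the report.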
a different approach
Dictionary of Packard-Style Greek Morphology Codes
<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  <p>Part of speech: adjective</p>
  <p>Case: accusative</p>
  <p>Number: plural</p>
  <p>Gender: feminine</p>
  <p>Degree: comparative</p>
</div>
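A dictionary entry of this kind is easy to turn into a lookup table. The sketch below, in Python, reads one entry into a code and a mapping of field names to values. It assumes each `<p>` holds exactly one "name: value" pair, as in the example; the function name is illustrative.

```python
import xml.etree.ElementTree as ET

ENTRY = '''<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  <p>Part of speech: adjective</p>
  <p>Case: accusative</p>
  <p>Number: plural</p>
  <p>Gender: feminine</p>
  <p>Degree: comparative</p>
</div>'''


def read_entry(xml_text):
    """Return (code, fields) for one dictionary <div>."""
    div = ET.fromstring(xml_text)
    fields = {}
    for p in div.iter("p"):
        # Split on the first colon only: "Part of speech: adjective".
        key, _, value = p.text.partition(":")
        fields[key.strip()] = value.strip()
    return div.get("osisID"), fields


code, fields = read_entry(ENTRY)
print(code, fields)
```

Unlike the feature-structure approach, the meaning of each code lives in prose paragraphs, so a program has to rely on the wording of the labels staying consistent across entries.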
what can we do with feature structure marked up text? • self-organizing topic maps • compare linguistic hypotheses with actual usage • XSLT transforms • automated tagging of new features • comparative linguistic study • source↔target language grammar mapping
where do we go from here? conclusions
in the short-term • complete a first pass of language modeling • mark up real biblical text with annotation • distribute to translators and scholars for feedback • does this meet your needs? • is it practical enough that you will use it? • is it flexible enough for your language(s) and linguistic theories?
in the long-term • determine if tei feature structures are sufficient • decide whether to require “inline” or “standoff” markup, or to allow either • determine the best way of integrating linguistic markup with the osis core tag set • explore ideas for authoring software or, at least, linguistic annotation utility programs