OSIS Linguistic Annotation

Definitions and Requirements
OSIS Linguistic Annotation
Kirk E. Lowery
Westminster Hebrew Institute
SBL Computer-Assisted Research Group

Why OSIS linguistic annotation?
The context

The goal of OSIS
• to exchange electronic Bibles
  • any language, medium, presentation style
• to add “meta-information” to those texts
  • keywords: “link”, “hierarchy”, “pyramid”
• to easily transform these texts
  • the target transformation is unknown
• to cut costs: production, presentation, distribution of Bibles plus “meta-data”
  • time, money, people

Why exchange Bible texts?
• coordination within organizations
• cooperation between organizations and between individuals
• publish in multiple formats and media from one “canonical” source
• long-term archival
• the changing definition of “publish”
Documents have a life cycle!

Who wants to exchange texts?
• Bible publishers
  • commercial publishing houses
  • denominations & Bible societies
• Bible translators
  • translation teams & editors
  • consultants & supervisors
• Bible scholars
  • original languages, text criticism
  • text analysis and commentary

What information needs to be captured?
Text “meta-data”

Translators: managing the translation process
• document versions & responsibility
• comments & corrections by editors
• handling presentation issues
  • script direction
  • “rubies”
• linking source, relay & target translations
• linking supplementary information
  • notes, glossaries, maps

Translators & scholars: focus on the text
• manuscript collation & description
• text criticism: establishment of the original
• linguistic analysis
• text segmentation
• segment id: from phoneme to text structures
• linguistic mapping of source & target
• alignment: parallel & synoptic texts

How can we capture the information?
Linguistic annotation

Required
• a way to segment the text
• a mechanism for associating labels with an arbitrary text-span
• a means to declare labels used in analysis
• a common linguistic vocabulary
• language-specific grammar terms
• a protocol for user redefinition

Segmenting text
<seg id="gn1:1,1.1">B.:</seg>
<seg id="gn1:1,1.2">R")$IYT</seg>
<seg id="gn1:1,2.1">B.FRF)</seg>
<seg id="gn1:1,3.1">):ELOHIYM </seg>
<seg id="gn1:1,4.1">)"T</seg>
<seg id="gn1:1,5.1">HA</seg>
<seg id="gn1:1,5.2">$.FMAYIM</seg>
<seg id="gn1:1,6.1">W:</seg>
<seg id="gn1:1,6.2">)"T</seg>
<seg id="gn1:1,7.1">HF</seg>
<seg id="gn1:1,7.2">)FREC</seg>
Slide callouts: start tag, unique identification, Hebrew text, end tag
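A minimal Python sketch of how segmented text like the sample above might be read back. The `<verse>` wrapper element and the reading of the id as book, chapter, verse, word, and word-part are assumptions made for illustration; the slide itself only calls the id a "unique identification".

```python
import re
import xml.etree.ElementTree as ET

# Three segments copied from the slide, wrapped in a hypothetical <verse>
# root element so that the fragment parses as a single XML document.
SAMPLE = '''<verse>
<seg id="gn1:1,1.1">B.:</seg>
<seg id="gn1:1,1.2">R")$IYT</seg>
<seg id="gn1:1,2.1">B.FRF)</seg>
</verse>'''

# Assumed id layout: book, chapter ":" verse "," word "." part-of-word.
SEG_ID = re.compile(
    r'^(?P<book>[a-z]+)(?P<chapter>\d+):(?P<verse>\d+),'
    r'(?P<word>\d+)\.(?P<part>\d+)$'
)

def parse_seg_id(seg_id):
    """Split a segment id into its (assumed) components."""
    match = SEG_ID.match(seg_id)
    if match is None:
        raise ValueError(f"unrecognized segment id: {seg_id!r}")
    return {key: (value if key == "book" else int(value))
            for key, value in match.groupdict().items()}

def segments(xml_text):
    """Return (parsed id, transliterated text) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [(parse_seg_id(seg.get("id")), seg.text or "")
            for seg in root.iter("seg")]

if __name__ == "__main__":
    for seg_id, text in segments(SAMPLE):
        print(seg_id, text)
```

Because every segment carries its own id, a segment can be addressed independently of its position in the file, which is what later linking and alignment would rely on.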

Adding annotation (1)
<seg id="gn1:1,1.1">B.:<lemma>B.</lemma><particle type="preposition" /></seg>
<seg id="gn1:1,1.2">R")$IYT<lemma>R")$IYT</lemma><noun type="common" features="fsa" /></seg>
<seg id="gn1:1,2.1">B.FRF)<lemma homonym="1">B.R)</lemma><verb stem="q" conjugation="p" pgn="3ms" /></seg>
<seg id="gn1:1,3.1">):ELOHIYM <lemma>):ELOHIYM</lemma><noun type="common" features="mpa" /></seg>
<seg id="gn1:1,4.1">)"T<lemma homonym="1">)"T</lemma><particle type="object_marker" /></seg>
Slide callouts: content tag, “milestone” tag

Adding annotation (2)
<seg id="gn1:1,5.1">HA<lemma>H</lemma><particle type="article" /></seg>
<seg id="gn1:1,5.2">$.FMAYIM<lemma>$FMAYIM</lemma><noun type="common" features="mpa" /></seg>
<seg id="gn1:1,6.1">W:<lemma>W</lemma><particle type="conjunction" /></seg>
<seg id="gn1:1,6.2">)"T<lemma homonym="1">)"T</lemma><particle type="object_marker" /></seg>
<seg id="gn1:1,7.1">HF<lemma>H</lemma><particle type="article" /></seg>
<seg id="gn1:1,7.2">)FREC<lemma>)EREC</lemma><noun type="common" features="fsa" /></seg>
Slide callouts: content tag, “milestone” tag
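One way the inline annotation above could be consumed, sketched in Python. The `<verse>` wrapper and the rule "any child other than `<lemma>` is the grammatical label" are assumptions for this illustration, not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Two annotated segments copied from the slides, inside a hypothetical
# <verse> root element so that the fragment parses.
SAMPLE = '''<verse>
<seg id="gn1:1,2.1">B.FRF)<lemma homonym="1">B.R)</lemma><verb stem="q" conjugation="p" pgn="3ms" /></seg>
<seg id="gn1:1,5.1">HA<lemma>H</lemma><particle type="article" /></seg>
</verse>'''

def read_annotations(xml_text):
    """Collect surface form, lemma, and grammatical label for each <seg>."""
    rows = []
    for seg in ET.fromstring(xml_text).iter("seg"):
        lemma = seg.find("lemma")
        # Assumption: the first child that is not <lemma> is the empty
        # ("milestone") element carrying the grammatical label.
        label = next((child for child in seg if child.tag != "lemma"), None)
        rows.append({
            "id": seg.get("id"),
            "surface": (seg.text or "").strip(),
            "lemma": lemma.text if lemma is not None else None,
            "category": label.tag if label is not None else None,
            "features": dict(label.attrib) if label is not None else {},
        })
    return rows

if __name__ == "__main__":
    for row in read_annotations(SAMPLE):
        print(row)
```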

The hard part: linguistic labels
• must be standard
• must be applicable to any conceivable language
  • labels are the “linguistic inventory”
• must be compatible with current and future linguistic theories
  • labels must be linguistic theory-neutral
• must be redefinable by the user

Standard solutions: labels
• Expert Advisory Group on Language Engineering Standards (EAGLES)
  • <http://www.ilc.pi.cnr.it/EAGLES/home.html>
  • an initiative of the European Commission (1993)
  • standard grammar labels of morphology and syntax for European languages
• create OSIS standard labels for Hebrew, Aramaic and Greek

Standard solutions: mechanism
• the Text Encoding Initiative (TEI) Guidelines
  • chapter 14: linking, segmentation, & alignment
  • chapter 16: feature structures
  • chapter 26: feature system declaration
• “stand-off” markup (XLink) or “up-close-and-personal” (inline)?
  • separate meta-data about the text from the text itself?
  • “either-or” or “both-and”?
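The stand-off versus inline question can be made concrete with a small Python sketch. Here the base text and the annotation layer are kept in separate structures joined only by segment id, and the inline form is generated from them. The dictionary layout and the `to_inline` helper are hypothetical; a real stand-off scheme would use XLink or a similar pointing mechanism rather than Python dictionaries.

```python
import xml.etree.ElementTree as ET

# Base text: segment ids and surface forms only (from the Genesis 1:1 slides).
BASE = {
    "gn1:1,5.1": "HA",
    "gn1:1,5.2": "$.FMAYIM",
}

# Stand-off layer: stored apart from the text and pointing at it by id.
STANDOFF = {
    "gn1:1,5.1": {"lemma": "H", "tag": "particle",
                  "attrs": {"type": "article"}},
    "gn1:1,5.2": {"lemma": "$FMAYIM", "tag": "noun",
                  "attrs": {"type": "common", "features": "mpa"}},
}

def to_inline(base, standoff):
    """Merge a stand-off layer into inline <seg> markup, one string per segment."""
    merged = []
    for seg_id, surface in base.items():
        seg = ET.Element("seg", {"id": seg_id})
        seg.text = surface
        annotation = standoff.get(seg_id)
        if annotation is not None:
            ET.SubElement(seg, "lemma").text = annotation["lemma"]
            ET.SubElement(seg, annotation["tag"], annotation["attrs"])
        merged.append(ET.tostring(seg, encoding="unicode"))
    return merged

if __name__ == "__main__":
    for line in to_inline(BASE, STANDOFF):
        print(line)
```

Seen this way, "both-and" is at least plausible: if the two layers share stable ids, the inline form can be derived from the stand-off form whenever it is needed.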

What we must do, exactly
Formal requirements

Labels
• claims made about the data itself vs claims about the claims that can be made!
  • the linguistic model vs the analysis allowed by the model
  • example: does Hebrew have “adverbs”?
• a library of labels as comprehensive as possible
• definitions to clarify what “thing” is being labeled
• labels are names for grammatical objects

Labels as objects
• grammatical “objects” have “attributes” or “features”
• features can vary over a range of “values”
• objects & features have defaults that could be changed
• objects & features could be easily extended
• objects & features can be arranged linearly or hierarchically
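The object view of labels can be sketched in Python as follows. The class names, the particular defaults, and the spelled-out number and state values are assumptions chosen for illustration; only the gender values are spelled out in the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass(frozen=True)
class Feature:
    """A feature that varies over a fixed range of values."""
    name: str
    values: Tuple[str, ...]
    default: Optional[str] = None  # hypothetical default, may be overridden

@dataclass
class GrammaticalObject:
    """A labeled grammatical object; children inherit and extend features."""
    label: str
    features: Dict[str, Feature] = field(default_factory=dict)
    parent: Optional["GrammaticalObject"] = None

    def all_features(self):
        """Features of this object plus those inherited up the hierarchy."""
        inherited = self.parent.all_features() if self.parent else {}
        return {**inherited, **self.features}

    def instantiate(self, **chosen):
        """Return feature values for one analysis, applying defaults."""
        known = self.all_features()
        unknown = set(chosen) - set(known)
        if unknown:
            raise KeyError(f"unknown features for {self.label}: {sorted(unknown)}")
        result = {}
        for name, feature in known.items():
            value = chosen.get(name, feature.default)
            if value is not None and value not in feature.values:
                raise ValueError(f"{value!r} is not a value of {name}")
            result[name] = value
        return result

# Hypothetical inventory: a general noun extended by a common noun.
noun = GrammaticalObject("noun", {
    "gender": Feature("gender", ("masculine", "feminine", "neuter")),
    "number": Feature("number", ("singular", "plural", "dual"), default="singular"),
})
common_noun = GrammaticalObject(
    "common noun",
    {"state": Feature("state", ("absolute", "construct"), default="absolute")},
    parent=noun,
)

if __name__ == "__main__":
    print(common_noun.instantiate(gender="feminine"))
```

Extending the inventory means adding a child object or a feature; redefining a default means replacing one `Feature` entry, which is the kind of user redefinition the requirements call for.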

Mechanism
• user language declaration
  • all labels and their relationships
  • done by “exclusion”, not inclusion
  • sensitive to linguistic theory
• levels of language: resolution of ambiguity
  • lexical, semantic, phonemic, morphologic, phrase-, clause-, discourse-, theological levels
• “context-free” and “context-bound” analysis
  • part-of-speech resolution

TEI feature structures
• the feature element
  • the most basic markup
  • requires a label and any number of values
  • <f name="feature name" value="feature value">
• the feature structure element
  • <fs type="feature structure name">
  • may contain any number of nested <f> and <fs>
  • models some grammatical object

TEI feature example
<f name="conjugation">
  <vAlt mutExcl="Y">
    <sym id="pf" value="perfect" />
    <sym id="impf" value="imperfect" />
    <sym id="qppt" value="qal_passive_participle" />
    <sym id="wc" value="wayyiqtol" />
    <sym id="impv" value="imperative" />
    <sym id="inf" value="infinitive" />
    <sym id="pt" value="participle" />
  </vAlt>
</f>

TEI feature structure example
<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn" />
  <f name="number" org="set" fVal="ns np nd" />
  <f name="state" org="set" fVal="sa sc" />
</fs>

TEI feature library example
<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine" />
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>
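The feature structure and feature library examples fit together through the ids listed in `fVal`. A rough Python sketch of that lookup follows. The function names are hypothetical, and leaving unresolved ids unchanged is a design choice of this sketch, not something the TEI Guidelines prescribe.

```python
import xml.etree.ElementTree as ET

# The feature structure from the slide, reduced to the gender feature.
FEATURE_STRUCTURE = '''<fs type="common noun features">
  <f name="gender" org="set" fVal="gm gf gn" />
</fs>'''

# The gender feature-value library from the slide.
LIBRARY = '''<fvLib id="g" type="gender feature values">
  <vAlt mutExcl="N">
    <sym id="gm" value="masculine" />
    <sym id="gf" value="feminine" />
    <sym id="gn" value="neuter" />
  </vAlt>
</fvLib>'''

def symbol_table(library_xml):
    """Map each <sym> id in a feature-value library to its value."""
    root = ET.fromstring(library_xml)
    return {sym.get("id"): sym.get("value") for sym in root.iter("sym")}

def expand(fs_xml, table):
    """Replace the ids in each feature's fVal with the values they name."""
    root = ET.fromstring(fs_xml)
    return {
        f.get("name"): [table.get(ref, ref) for ref in f.get("fVal", "").split()]
        for f in root.iter("f")
    }

if __name__ == "__main__":
    print(expand(FEATURE_STRUCTURE, symbol_table(LIBRARY)))
```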

A different approach
Dictionary of Packard-Style Greek Morphology Codes
<div type="x-tag" osisID="A_APFC" divTitle="A APFC">
  Part of speech: adjective
  Case: accusative
  Number: plural
  Gender: feminine
  Degree: comparative
</div>

What can we do with feature structure marked up text?
• self-organizing topic maps
• compare linguistic hypotheses with actual usage
• XSLT transforms
• automated tagging of new features
• comparative linguistic study
• source↔target language grammar mapping

Where do we go from here?
Conclusions

In the short term
• complete a first pass of language modeling
• mark up real biblical text with annotation
• distribute to translators and scholars for feedback
  • does this meet your needs?
  • is it practical enough that you will use it?
  • is it flexible enough for your language(s) and linguistic theories?

In the long term
• determine if TEI feature structures are sufficient
• decide whether to require “inline” or “standoff” markup, or to allow either
• determine the best way of integrating linguistic markup with the OSIS core tag set
• explore ideas for authoring software or, at least, linguistic annotation utility programs
