Annotation as Algebra: a formal framework for linguistic annotation

Annotation as Algebra: a formal framework for linguistic annotation Mark Liberman, University of Pennsylvania, myl@cis.upenn.edu (joint work with Steven Bird, Melbourne University)

Outline • Motivation • Sketch of the idea • Survey of linguistic annotation • Annotation graphs as a formal framework • Practical implementations and experience • Issues for the future

What linguistic annotation is (and isn’t) • “Linguistic annotation” means symbolic descriptions of specific linguistic signals • e.g. transcriptions, parses, etc. • it does not include things like: • metadata • e.g. information about speakers, recordings, documents, etc. • typically stored in RDB referenced by elements of linguistic annotation • lexicons • but these can be treated in a common framework

Motivation • A jungle of annotation file formats • e.g. more than 20 common formats for time-marked orthographic transcriptions • Many new formats every year • Multiple annotations of the same data • No good way to search annotations • different coding needed for each format • extra difficulty of searches across formats • Problems for: • tool builders • researchers • corpus builders and maintainers

Basic idea #1: what to do • Abstract away from file formats, to the logical structure of linguistic annotation • Replace two-level model with three-level model • as in database technology several decades ago • so many applications can access many kinds of data through a consistent API • Choose a logical structure with good properties • simple, conceptually natural, computationally efficient • algebra to facilitate boolean combination of queries

Two-level model:

Three-level model:

Basic idea #2: how to do it • Three kinds of assertion recur in linguistic annotation • assigning a label “This chunk of stuff has property X” • sequencing labels “chunk B immediately follows chunk A” • anchoring the edges of labels “this chunk boundary has coordinates k” (in time, space, text...) • Formalized as a labeled DAG, these primitives provide a logical structure adequate for all linguistic annotation • The result also defines an algebra useful for searching and in other ways

Basic assertion type 1: Labeling Associate a “label” (typed, structured symbolic information) with a region of a linguistic signal

Basic assertion type 2: sequencing Example: The stretch of signal labeled “this” is followed by a stretch of signal labeled “is”

Basic assertion type 3: anchoring Example: The stretch of signal labeled “this” begins 137.4592 seconds from the start of file XYZ.

Informal formalization • An “annotation graph” (AG) is: • a directed acyclic graph • whose arcs are labeled with fielded records e.g. phoneme=“p” or word=“this” • whose nodes may be labeled with signal coordinates e.g. 3.45692 seconds • Labeling → arc labels • Sequencing → graph structure • Anchoring → signal coordinates on nodes • That’s all!
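The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names (`Node`, `Arc`, `AnnotationGraph`, `add_node`, `add_arc`, `is_acyclic`) and the node identifiers are hypothetical and are not taken from the AG toolkit API. The example encodes the two assertions from the earlier slides: “this” begins at 137.4592 seconds and is followed by “is”.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    ident: str
    time: Optional[float] = None  # anchoring: optional signal coordinate


@dataclass(frozen=True)
class Arc:
    src: str
    dst: str
    record: tuple  # labeling: fielded record as (field, value) pairs


@dataclass
class AnnotationGraph:
    nodes: dict = field(default_factory=dict)
    arcs: list = field(default_factory=list)

    def add_node(self, ident, time=None):
        self.nodes[ident] = Node(ident, time)

    def add_arc(self, src, dst, **fields):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both end nodes must exist before adding an arc")
        self.arcs.append(Arc(src, dst, tuple(sorted(fields.items()))))

    def is_acyclic(self):
        # Kahn's algorithm: repeatedly remove nodes with no incoming arcs.
        indegree = {n: 0 for n in self.nodes}
        for arc in self.arcs:
            indegree[arc.dst] += 1
        ready = [n for n, d in indegree.items() if d == 0]
        removed = 0
        while ready:
            n = ready.pop()
            removed += 1
            for arc in self.arcs:
                if arc.src == n:
                    indegree[arc.dst] -= 1
                    if indegree[arc.dst] == 0:
                        ready.append(arc.dst)
        return removed == len(self.nodes)


ag = AnnotationGraph()
ag.add_node("n0", time=137.4592)     # anchored node
ag.add_node("n1")                    # unanchored nodes are allowed
ag.add_node("n2")
ag.add_arc("n0", "n1", word="this")  # labeling
ag.add_arc("n1", "n2", word="is")    # sequencing: shares node n1 with "this"
print(ag.is_acyclic())
```

Sequencing needs no separate device in this sketch: “is” follows “this” simply because the second arc starts at the node where the first one ends.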

Outcome • API, open source toolkit (C, C++, Tcl, Python) • sample tools • Java version (“ATLAS”) developed by NIST

Annotation formats & tools • Surveyed in 1999 by Liberman and Bird • Documented on web page http://ldc.upenn.edu/annotation • Used in designing annotation graph system & AG software • Survey is updated periodically

Some animals in the annotation zoo • TIMIT • BAS Partitur • CHILDES • LACITO • LDC CALLHOME • NIST UTF • Switchboard (four types of annotation) • ... etc. ...

Sample TIMIT data
train/dr1/fjsp0/sa1.wrd:    train/dr1/fjsp0/sa1.phn:
2360 5200 she               0 2360 h#
5200 9680 had               2360 3720 sh
9680 11077 your             3720 5200 iy
11077 16626 dark            5200 6160 hv
16626 22179 suit            6160 8720 ae
22179 24400 in              8720 9680 dcl
24400 30161 greasy          9680 10173 y
30161 36150 wash            10173 11077 axr
36720 41839 water           11077 12019 dcl
41839 44680 all             12019 12257 d
44680 49066 year
...

TIMIT interpreted graphically [Figure: signal with boundaries marked at 5200, 6160, 8720, 9680]

TIMIT as Annotation Graph
W = word level
5200 9680 had
P = phoneme level
5200 6160 hv
6160 8720 ae
8720 9680 dcl
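A minimal sketch of the mapping shown on this slide, assuming each TIMIT line has the form `start end label` and that each distinct offset becomes one shared node. The function names `timit_arcs` and `spanned` and the tuple layout are illustrative choices, not part of any TIMIT or AG tool.

```python
def timit_arcs(text, tier):
    """Turn 'start end label' lines into (start, end, tier, label) arcs."""
    arcs = []
    for line in text.strip().splitlines():
        start, end, label = line.split()
        arcs.append((int(start), int(end), tier, label))
    return arcs


WRD = """
5200 9680 had
"""

PHN = """
5200 6160 hv
6160 8720 ae
8720 9680 dcl
"""

arcs = timit_arcs(WRD, "W") + timit_arcs(PHN, "P")

# Nodes are the distinct offsets; word and phoneme arcs share them.
nodes = sorted({offset for arc in arcs for offset in arc[:2]})


def spanned(outer, arcs, tier="P"):
    """Labels of arcs on `tier` that lie within the extent of `outer`."""
    start, end = outer[0], outer[1]
    return [a[3] for a in arcs if a[2] == tier and start <= a[0] and a[1] <= end]


had = arcs[0]
print(nodes)               # [5200, 6160, 8720, 9680]
print(spanned(had, arcs))  # ['hv', 'ae', 'dcl']
```

Because the word arc and the phoneme path begin and end on the same nodes, the relation between “had” and its phonemes falls out of the graph without any explicit hierarchy.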

BAS Partitur • Goal: a common format for research results from many German speech projects. • A multi-tier description of speech signals: • KAN - the canonical transcription • ORT - orthographic transcription • TRL - transliteration • MAU - phonetic transcription • DAS - dialogue act transcription

BAS Partitur: example
KAN:0 j'a:       ORT:0 ja        MAU: 4160 1119 0 j
KAN:1 S'2:n@n    ORT:1 schönen   MAU: 5280 2239 0 a:
KAN:2 d'aNk      ORT:2 Dank      MAU: 7520 2399 1 S
KAN:3 das+       ORT:3 das       MAU: 9920 1599 1 2:
KAN:4 vE:r@+     ORT:4 wäre      MAU: 11520 479 1 n
KAN:5 z'e:6      ORT:5 sehr      MAU: 12000 479 1 n
KAN:6 n'Et       ORT:6 nett      MAU: 12480 479 -1
DAS:0,1,2 @(THANK_INIT BA)
DAS:3,4,5,6 @(FEEDBACK_ACKNOWLEDGEMENT BA)

BAS Partitur graphical structure:
KAN:0 j'a:       ORT:0 ja         MAU: 4160 1119 0 j
KAN:1 S'2:n@n    ORT:1 sch"onen   MAU: 5280 2239 0 a:
DAS:0,1,2 @(THANK_INIT BA)
[Figure: the DAS, ORT, KAN and MAU tiers drawn as aligned layers over the time points 4160, 5280, 7520]

Partitur differences from TIMIT • File organization: • everything is in a single file (even metadata) • Time marking: • time anchors are in only one tier (MAU) • time anchors use <start offset, duration-1> • Relationship between the tiers: • KAN tier supplies a set of identifiers • MAU tier: several lines for each KAN line • DAS tier: one line for several KAN lines • Temporal structure: • MAU and DAS define convex intervals
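The <start offset, duration-1> convention can be checked against the example data. The sketch below assumes that an MAU segment ends at `start + (duration-1) + 1`, which is the reading under which consecutive segments in the example are contiguous. The function names are hypothetical, and the final example line (`MAU: 12480 479 -1`), which carries no label in the example, is left out.

```python
MAU_LINES = """
MAU: 4160 1119 0 j
MAU: 5280 2239 0 a:
MAU: 7520 2399 1 S
MAU: 9920 1599 1 2:
MAU: 11520 479 1 n
MAU: 12000 479 1 n
"""


def mau_to_arcs(text):
    """Convert MAU lines to (start, end, kan_id, label) arcs."""
    arcs = []
    for line in text.strip().splitlines():
        _tag, start, dur_minus_1, kan_id, label = line.split()
        start = int(start)
        end = start + int(dur_minus_1) + 1  # assumed: end is the next start
        arcs.append((start, end, int(kan_id), label))
    return arcs


def kan_spans(arcs):
    """Extent of each KAN identifier, derived from its MAU segments."""
    spans = {}
    for start, end, kan_id, _label in arcs:
        lo, hi = spans.get(kan_id, (start, end))
        spans[kan_id] = (min(lo, start), max(hi, end))
    return spans


arcs = mau_to_arcs(MAU_LINES)
contiguous = all(a[1] == b[0] for a, b in zip(arcs, arcs[1:]))
print(contiguous)       # True
print(kan_spans(arcs))  # {0: (4160, 7520), 1: (7520, 12480)}
```

Only the MAU tier carries times, so the extents of KAN and ORT items have to be derived this way through the shared identifiers.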

BAS Partitur: Annotation graph ORT: 0 ja MAU: 4160 1119 0 j ORT: 1 sch"onen MAU: 5280 2239 0 a: MAU: 7520 2399 1 S MAU: 9920 1599 1 2: MAU: 11520 479 1 n DAS:0,1,2 @(THANK_INIT BA)

CHILDES • Child language acquisition data • Archive organized by Brian MacWhinney at CMU • CHAT transcription format • Tools for creating, browsing, searching • Contributions by many researchers around the world

CHILDES Annotation
*ROS: yahoo.
%snd: "boys73a.aiff" 7349 8338
*FAT: you got a lot more to do # don't you?
%snd: "boys73a.aiff" 8607 9999
*MAR: yeah.
%snd: "boys73a.aiff" 10482 10839
*MAR: because I'm not ready to go to <the bathroom> [>] +/.
%snd: "boys73a.aiff" 11621 13784

CHILDES differences from TIMIT • long recordings with multiple speakers • time specified at turn level only • there are gaps between the turns • the transcription contains embedded annotations

CHILDES annotation graph *ROS: yahoo. %snd: "boys73a.aiff" 7349 8338 *FAT: you got a lot more to do # don't you? %snd: "boys73a.aiff" 8607 9999 NB: incomplete time info, disconnected structure

CHILDES: RDB connection “metadata” about speakers, recordings etc. stored separately in relational tables
ID  NAME   ROLE    AGE     SEX   BIRTH
1   Ross   Child   6;3.11  male  23-DEC-1977
2   Mark   Child   4;4.15  male  19-NOV-1979
3   Brian  Father
4   Mary   Mother
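A small sketch of how annotation elements might reference such a table, using Python's built-in `sqlite3` module. The table name `speaker`, the column names, the tuple layout for turns, and the mapping of the codes `*ROS` and `*FAT` to rows 1 and 3 are all assumptions made for illustration; they are not taken from the CHILDES tools.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speaker (id INTEGER PRIMARY KEY, name TEXT, role TEXT, "
    "age TEXT, sex TEXT, birth TEXT)"
)
conn.executemany(
    "INSERT INTO speaker VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "Ross", "Child", "6;3.11", "male", "23-DEC-1977"),
        (2, "Mark", "Child", "4;4.15", "male", "19-NOV-1979"),
        (3, "Brian", "Father", None, None, None),
        (4, "Mary", "Mother", None, None, None),
    ],
)

# Turn-level arcs: (start, end, speaker_id, text). Only the id is stored
# in the annotation; everything else about the speaker lives in the table.
turns = [
    (7349, 8338, 1, "yahoo."),                                 # *ROS (assumed id 1)
    (8607, 9999, 3, "you got a lot more to do # don't you?"),  # *FAT (assumed id 3)
]


def role_of(speaker_id):
    row = conn.execute(
        "SELECT role FROM speaker WHERE id = ?", (speaker_id,)
    ).fetchone()
    return row[0]


child_turns = [t for t in turns if role_of(t[2]) == "Child"]
print([t[3] for t in child_turns])  # ['yahoo.']
```

Keeping speaker facts in one place means a correction to a birth date or role touches a single row rather than every turn.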

LACITO Langues et Civilisations à Tradition Orale • recordings of unwritten languages, collected and transcribed over three decades • preservation and dissemination Based on XML • markup for alignment to audio signal • different XSL style sheets for display • generating HTML • with hyperlinks to audio clips

LACITO example
<S id="s1">
  <AUDIO start="2.3656" end="7.9256"/>
  <TRANSCR>
    <W><FORM>nakpu</FORM> <GLS>deux</GLS></W>
    <W><FORM>nonotso</FORM> <GLS>soeurs</GLS></W>
    <W><FORM>si&x014b;</FORM> <GLS>bois</GLS></W>
    <W><FORM>pa</FORM> <GLS>faire</GLS></W>
    <W><FORM>la&x0294;natshem</FORM> <GLS>allerent</GLS></W>
    <W><FORM>are</FORM> <GLS>dit.on</GLS></W>
    <PONCT>.</PONCT>
  </TRANSCR>
  <TRADUC lang="Francais">On raconte que deux soeurs allerent chercher du bois.</TRADUC>
  <TRADUC lang="Anglais">They say that two sisters went to get firewood.</TRADUC>
</S>

LACITO as AG <AUDIO start="2.3656" end="7.9256"/> <W><FORM>nakpu</FORM> <GLS>deux</GLS></W> <W><FORM>nonotso</FORM> <GLS>soeurs</GLS></W> <W><FORM>si&x014b;</FORM> <GLS>bois</GLS></W> <W><FORM>pa</FORM> <GLS>faire</GLS></W> <TRADUC lang="Francais">On raconte que deux ...</TRADUC> <TRADUC lang="Anglais">They say that two ...</TRADUC>

LACITO discussion Two kinds of partiality for times: • where they are simply unknown • where they are inappropriate Unknown times: • the annotation is incomplete • time-alignment is coarse-grained Inappropriate times: • for word boundaries in the phrasal translation • for punctuation?

LDC Call Home example
980.18 989.56 A: you know, given how he's how far he's gotten, you know, he got his degree at &Tufts and all, I found that surprising that for the first time as an adult they're diagnosing this. %um
989.42 991.86 B: %mm. I wonder about it. But anyway.
991.75 994.65 A: yeah, but that's what he said. And %um
994.19 994.46 B: yeah.
995.21 996.59 A: He %um
996.51 997.61 B: Whatever's helpful.
997.40 1002.55 A: Right. So he found this new job as a financial consultant and seems to be happy with that.
1003.14 1003.45 B: Good.

LDC CallHome as AG 995.21 996.59 A: He %um 996.51 997.61 B: Whatever's helpful. 997.40 1002.55 A: Right. So ...

CallHome discussion Speaker overlap • No special devices, just turn time-marks • Scales for an arbitrary number of speakers • Information about word-level overlap is left ambiguous • Additional time references could easily specify word overlap
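Since overlap is carried only by the turn time-marks, it can be recovered by comparing intervals. The sketch below uses three turns from the example; the function name `overlaps` and the tuple layout are illustrative assumptions.

```python
# (start, end, speaker, text) for three consecutive turns from the example.
turns = [
    (995.21, 996.59, "A", "He %um"),
    (996.51, 997.61, "B", "Whatever's helpful."),
    (997.40, 1002.55, "A", "Right. So ..."),
]


def overlaps(turns):
    """Pairs of turns by different speakers whose time intervals intersect."""
    found = []
    for i, (s1, e1, spk1, _t1) in enumerate(turns):
        for s2, e2, spk2, _t2 in turns[i + 1:]:
            lo, hi = max(s1, s2), min(e1, e2)
            if spk1 != spk2 and lo < hi:
                found.append((spk1, spk2, lo, hi))
    return found


for spk1, spk2, lo, hi in overlaps(turns):
    print(f"{spk1} and {spk2} overlap from {lo} to {hi}")
```

The comparison is pairwise over all turns, so it works unchanged for any number of speakers. What it cannot say is which words fall inside the overlap; that would need additional time references inside the turns.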

NIST UTF (circa 1999) NIST: National Institute of Standards and Technology (USA) UTF: “Universal Transcription Format” • Intended to generalize over several earlier LDC broadcast news and conversation transcription formats Special treatment for: • metadata, time stamps, speaker overlap, contractions N.B. now abandoned in favor of AG-based representations

NIST UTF example (from BN) <turn speaker="Roger_Hedgecock" spkrtype="male" dialect= "native" start="2348.811875" end="2391.606000" mode="spontaneous" fidelity="high"> <time sec="2387.353875"> on welfare and away from real ownership \{breath and <contraction e_form="[that=>that]['s=>is]">that's a real problem in this <b_overlap start="2391.115375" end="2391.606000">country<e_overlap></turn> <turn speaker="Gloria_Allred" spkrtype="female" dialect= "native" start="2391.299625" end="2439.820312" mode="spontaneous" fidelity="high"> <b_overlap start="2391.299625" end="2391.606000"> well i<e_overlap>think the real problem is that %uh these kinds of republican attacks <time sec="2395.462500"> i see as code words for discrimination</turn>

NIST UTF: turn element <turn speaker="Roger_Hedgecock" spkrtype="male" dialect= "native" start="2348.811875" end="2391.606000" mode="spontaneous" fidelity="high">

NIST UTF: Contraction <contraction e_form="[that=>that]['s=>is]"> that's

NIST UTF: overlap <b_overlap start="2391.115375" end="2391.606000"> country <e_overlap>

NIST UTF: discussion • Relational data (e.g. speaker demographics) • is embedded in the annotation (redundantly). • Time stamps • are stored in three different places. • Speaker overlap • is convolved with the speaker turn, • so time relation with an external event disrupts the internal structure of a turn • Contractions • are treated in a way that facilitates link to lexicon, • but may be hard to ignore in a search function

NIST UTF as AG

AG contraction treatment • Additional textual annotations: • e.g. for expanding a contraction • don't complicate the existing representation • --facilitates search

NIST UTF / AG version • Metadata • stored in a separate RDB table (cf. CHILDES) • Time stamps • stored in a single place -- AG nodes • Speaker overlap • not convolved with the speaker turn • so temporal relationship with an external event remains external to the structure of a turn • Contractions • no new device, easily ignored in search • No artificial order on speaker turns

Switchboard • Corpus of 2400 5-minute telephone conversations collected at Texas Instruments in 1991 • Transcribed and aligned on three levels: • conversation, speaker turn, word • Subsequently annotated for: • POS, syntactic structure, • breath groups, disfluencies, • speech acts, • phonetic segments, • etc. • Then re-transcribed with many corrections! • --Proliferation of layers with different tokenizations • --Problem of correction after annotation

SWB example (1, 2)
B 21.86 0.26 Metric
B 22.12 0.26 system,
B 22.38 0.18 no
B 22.56 0.06 one's
B 22.86 0.32 very,
B 23.88 0.14 uh,
B 24.02 0.16 no
B 24.18 0.32 one
B 24.52 0.28 wants
B 24.80 0.06 it
B 24.86 0.12 at
B 24.98 0.22 all
B 25.66 0.22 seems
B 25.88 0.22 like.
[Metric/JJ system/NN ] ,/, [no/DT one/NN ] 's/BES very/RB ,/, [uh/UH ],/, [no/DT one/NN ] wants/VBZ [it/PRP ] at/IN [all/DT ] seems/VBZ like/IN ./.

SWB example (3, 4)
B.22: Yeah, / no one seems to be adopting it. / Metric system, [no one's very, + {F uh, } no one wants] it at all seems like. /
((S (NP-TPC Metric system) , (S-TPC-1 (EDITED (RM [) (S (NP-SBJ no one) (VP 's (ADJP-PRD-UNF very))) , (IP +)) (INTJ uh) , (NP-SBJ no one) (VP wants (RS ]) (NP it) (ADVP at all))) (NP-SBJ *) (VP seems (SBAR like (S *T*-1))) . E_S))

Switchboard: AG

Another multiple annotation It is quite realistic to have this many diverse annotations (and more!) for the same material...

AG formalization: Background Annotation - the basic action: • associate a label with an extent of signal • labels may be of different types • different types may span different amounts of time; need not form a hierarchy Minimal formalization: • directed graph • typed, fielded records on the arcs • optional time references on the nodes
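One way to picture how this minimal formalization supports boolean combination of queries: if each query returns a set of arcs, queries combine with ordinary set operations. The sketch below is an assumption-laden illustration, not the query algebra defined by the authors. The helper names (`by_type`, `within`, `labeled`) are hypothetical, and every node is given a time, although the formalization allows unanchored nodes.

```python
# Arcs as (start, end, type, label), reusing the TIMIT fragment.
arcs = {
    (5200, 9680, "W", "had"),
    (5200, 6160, "P", "hv"),
    (6160, 8720, "P", "ae"),
    (8720, 9680, "P", "dcl"),
}


def by_type(arcs, arc_type):
    return {a for a in arcs if a[2] == arc_type}


def within(arcs, start, end):
    return {a for a in arcs if start <= a[0] and a[1] <= end}


def labeled(arcs, label):
    return {a for a in arcs if a[3] == label}


# AND: phoneme arcs that lie inside the interval 5200..8720
early_phones = by_type(arcs, "P") & within(arcs, 5200, 8720)

# AND NOT: phoneme arcs other than "ae"
not_ae = by_type(arcs, "P") - labeled(arcs, "ae")

# OR: anything labeled "had" or "dcl", regardless of type
had_or_dcl = labeled(arcs, "had") | labeled(arcs, "dcl")

print(sorted(a[3] for a in early_phones))  # ['ae', 'hv']
print(sorted(a[3] for a in not_ae))        # ['dcl', 'hv']
print(sorted(a[3] for a in had_or_dcl))    # ['dcl', 'had']
```

Because every result is again a set of arcs from the same graph, query results can be fed back into further queries, which is the property that makes an algebra useful here.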
