Querying Spoken Language Corpora

Querying Spoken Language Corpora Thomas Schmidt IDS Mannheim

Outline • Background: EXMARaLDA, FOLKER, AGD, DGD2 • Transcription: Data models, data formats, TEI • Corpora: Recordings, transcripts, metadata • Query requirements • Query technologies • Demo • Future directions

Background • EXMARaLDA: System for building and querying spoken language corpora • Used in many individual projects, at the HZSK CLARIN Centre • Transcription editor, Corpus management tool, query tool EXAKT • FOLKER: Transcription tool – same technical basis, optimised for Research and Teaching Corpus of Spoken German (FOLK)

Background • Archive for Spoken German (AGD): central archive for oral corpora in Germany, IDS Mannheim • Dialect corpora, conversation corpora • Database for Spoken German (DGD2): access (browsing and query) for AGD data

Model: Single timeline, multiple tiers (STMT) • Annotation tuples: text label + timeline reference • Timeline: fully ordered, reference to a recording • Tiers: collections of annotations of a specific category and a specific speaker; annotations in a tier do not overlap → Annotation Graph Framework (Bird/Liberman 2001)
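The model on this slide can be sketched as a small set of data classes. This is a minimal illustration, not EXMARaLDA's actual implementation: the class names, the integer timeline indices, and the file name `maptask.wav` are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    start: int   # index of the start point on the shared timeline
    end: int     # index of the end point on the shared timeline
    label: str   # text label


def overlaps(a, b):
    """True if two annotations share more than a boundary point."""
    return a.start < b.end and b.start < a.end


@dataclass
class Tier:
    speaker: str
    category: str
    annotations: list = field(default_factory=list)

    def add(self, ann):
        # Enforce the constraint from the slide: no overlap within a tier.
        if ann.start >= ann.end:
            raise ValueError("annotation must end after it starts")
        if any(overlaps(ann, other) for other in self.annotations):
            raise ValueError("annotations within a tier must not overlap")
        self.annotations.append(ann)
        self.annotations.sort(key=lambda a: a.start)


@dataclass
class Transcription:
    recording: str   # the recording the timeline refers to
    timeline: list   # strictly increasing offsets in seconds
    tiers: list = field(default_factory=list)


timeline = [0.0, 1.2, 2.0, 3.5]
verbal = Tier("SPK0", "verbal")
verbal.add(Annotation(0, 1, "so you go"))
verbal.add(Annotation(1, 2, "left"))
backchannel = Tier("SPK1", "verbal")
backchannel.add(Annotation(1, 2, "hm"))   # overlap across tiers is allowed
doc = Transcription("maptask.wav", timeline, [verbal, backchannel])
```

Overlapping speech is expressed by two tiers referring to the same stretch of the timeline, which is why the non-overlap constraint applies only within a tier.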

EXMARaLDA Basic Transcription: (Flat) hierarchy of events in tiers Use of ID and IDREFS to encode temporal relations No additional markup, no „deep“ semantics
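To illustrate how ID/IDREF pairs encode temporal relations, the sketch below resolves event boundaries against a timeline. The XML is a simplified fragment modelled on a Basic Transcription (a real file carries a header and further attributes), and the function name `resolve_events` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment for illustration only.
SAMPLE = """<basic-transcription>
  <common-timeline>
    <tli id="T0" time="0.0"/>
    <tli id="T1" time="1.2"/>
    <tli id="T2" time="2.0"/>
  </common-timeline>
  <tier id="TIE0" speaker="SPK0" category="v">
    <event start="T0" end="T1">so you go </event>
    <event start="T1" end="T2">left </event>
  </tier>
  <tier id="TIE1" speaker="SPK1" category="v">
    <event start="T1" end="T2">hm </event>
  </tier>
</basic-transcription>"""


def resolve_events(xml_text):
    """Return (speaker, start_sec, end_sec, text) for every event."""
    root = ET.fromstring(xml_text)
    # Timeline items carry the IDs; events point at them via start/end.
    times = {tli.get("id"): float(tli.get("time")) for tli in root.iter("tli")}
    rows = []
    for tier in root.iter("tier"):
        for event in tier.iter("event"):
            rows.append((
                tier.get("speaker"),
                times[event.get("start")],
                times[event.get("end")],
                event.text or "",
            ))
    return rows


rows = resolve_events(SAMPLE)
```

The two events that share `T1` and `T2` across tiers are simultaneous; nothing in the document order says so, only the shared references do.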

EXMARaLDA ELAN

EXMARaLDA ELAN Praat

Data formats • Schmidt, Loehr et al. (2008): An exchange format for multimodal annotations. • XML format for data exchange between seven tools with STMT data models → improves interoperability for data creation • Drawbacks • no document order (non-linear, non-hierarchical) • what is the „full text“ / the „primary data“ / the „character data“? • no explicit representation of dependencies • temporal structure, not linguistic structure → bad for querying?

STMT to OHCO transformation

STMT to OHCO transformation • Segment chain = any temporally connected chain of annotations within one tier • Assumption: all other hierarchical structure beneath the level of segment chains • Correspondence: segment chain ↔ <u>
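The segment chain definition above can be sketched in a few lines. This is an assumption-laden illustration rather than the actual conversion code: events are `(start, end, text)` tuples using integer timeline indices, and the `who`, `start` and `end` attributes on `<u>` are chosen for the example.

```python
import xml.etree.ElementTree as ET


def segment_chains(events):
    """Group the events of ONE tier into temporally connected chains.

    Two events are connected when the first ends at the timeline point
    where the next one starts.
    """
    chains = []
    for event in sorted(events, key=lambda e: e[0]):
        if chains and chains[-1][-1][1] == event[0]:
            chains[-1].append(event)    # continues the current chain
        else:
            chains.append([event])      # gap on the timeline: new chain
    return chains


def chain_to_u(chain, speaker):
    """Map one segment chain to a <u> element with linear text content."""
    u = ET.Element("u", {
        "who": speaker,
        "start": f"T{chain[0][0]}",
        "end": f"T{chain[-1][1]}",
    })
    u.text = "".join(text for _, _, text in chain)
    return u


events = [(0, 1, "so you go "), (1, 2, "left "), (4, 5, "okay ")]
chains = segment_chains(events)
us = [chain_to_u(chain, "SPK0") for chain in chains]
```

Each resulting `<u>` has a well-defined character content, which is what makes it usable as a "full text" for querying.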

Unparsed (EXAKT) Parsed (DGD2)

Free annotation (EXAKT) Token annotation (DGD2)

• Schmidt (2011): A TEI-based Approach to Standardising Spoken Language Transcription. jTEI (1) • Romary, Witt, Schmidt: ISO/DIN PWI 24624: Transcription of Speech

Transcripts, recordings, metadata • Interaction metadata • date, „genre“, place, degree of formality, etc. • pertains to a (set of) transcription(s) • Speaker metadata • age, sex, language biography, speech impediments, etc. • pertains to (a) part(s) of a transcription • Audio and video recordings • for checking transcription quality • for obtaining information not encoded in transcripts • Transcripts • not (the) primary data! • a „convenient index into the recording“? • selective, theory-dependent, …

Corpora

Corpora • AGD Corpora: 8 mill. tokens • CGN Corpus: 9 mill. tokens • BNC Spoken: 10 mill. tokens • MICASE: 2 mill. tokens • Most other corpora: < 1 mill. tokens • (at least) one order of magnitude smaller than written corpora • Query speed is (not that) important

„In informal conversation in Northern Scotland, older female speakers tend to use ‚aye‘ as a backchannel signal with a rising intonation“ • Situational context → Interaction metadata • Speaker metadata • Text data / Surface form → Transcript text • Interactional context → Temporal transcript structure • Prosodic properties → Recording • Requirement #1: Access to all types of context • Requirement #2: (Manual) postprocessing of query results

„After a cut-off word followed by a pause of more than 0.3 seconds, the cut-off word is frequently repeated“ • special word tokens (incomplete words, semi-lexical material, …) • non-word tokens (pauses, non-verbal articulations, …) • temporal measurements (pause length) • Requirement #3: Queries for „special“ tokens • Requirement #4: Queries with special properties (numerical values, repetition)
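The cut-off hypothesis above combines a special word token, a non-word token with a numerical property, and a repetition test. A minimal sketch of such a query over a token sequence follows; the `Token` class, its `kind` values, and the prefix test used to decide "repeated" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Token:
    kind: str              # "word", "cutoff" or "pause" (hypothetical labels)
    form: str = ""         # surface form; empty for pauses
    duration: float = 0.0  # seconds; only meaningful for pauses


def cutoff_repetitions(tokens, min_pause=0.3):
    """Find cut-off + long pause + word; report whether the word repeats.

    Returns (index_of_cutoff, repeated) pairs. "Repeated" is approximated
    as: the following word starts with the cut-off fragment.
    """
    hits = []
    for i in range(len(tokens) - 2):
        first, pause, nxt = tokens[i:i + 3]
        if (first.kind == "cutoff"
                and pause.kind == "pause"
                and pause.duration > min_pause
                and nxt.kind in ("word", "cutoff")):
            repeated = nxt.form.lower().startswith(first.form.lower())
            hits.append((i, repeated))
    return hits


tokens = [
    Token("word", "dann"),
    Token("cutoff", "gera"), Token("pause", duration=0.5), Token("word", "geradeaus"),
    Token("cutoff", "li"), Token("pause", duration=0.2), Token("word", "links"),
    Token("cutoff", "re"), Token("pause", duration=0.4), Token("word", "weiter"),
]
hits = cutoff_repetitions(tokens)
```

The second cut-off is skipped because its pause is too short; the third matches the pattern but is not followed by a repetition. A plain regular expression over the transcript text can only express this if pause lengths and cut-off markers happen to be written into the text in a predictable way.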

„Filled pauses are less frequent in overlapping speech than at the beginning of turns“ • „Modal particles and modal adverbs often occur near one another in an utterance“ vs. „Filled pauses occur more frequently near another speaker’s backchannel“ • Requirement #5: Queries for position in temporal structure • Requirement #6: Multiple distance measures, query scopes […]
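Requirements #5 and #6 can be made concrete with a sketch over time-aligned tokens. The `Tok` tuple and the three helper functions are hypothetical; they only illustrate that "in overlap" is a property of the temporal structure and that "near" can be measured in tokens within one speaker's contribution or in seconds across speakers.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    speaker: str
    start: float   # seconds
    end: float     # seconds
    form: str


def in_overlap(tok, tokens):
    """True if any token of ANOTHER speaker overlaps tok in time."""
    return any(
        other.speaker != tok.speaker
        and tok.start < other.end and other.start < tok.end
        for other in tokens
    )


def time_distance(a, b):
    """Gap in seconds between two tokens; 0.0 if they overlap or touch."""
    return max(0.0, max(a.start, b.start) - min(a.end, b.end))


def token_distance(a, b, tokens):
    """Distance in tokens within the speaker's own token sequence."""
    own = sorted((t for t in tokens if t.speaker == a.speaker),
                 key=lambda t: t.start)
    return abs(own.index(a) - own.index(b))


tokens = [
    Tok("A", 0.0, 0.5, "äh"),
    Tok("B", 0.25, 0.75, "hm"),
    Tok("A", 0.5, 1.0, "links"),
    Tok("A", 2.0, 2.5, "äh"),
]
filled = [t for t in tokens if t.form == "äh"]
overlap_flags = [in_overlap(t, tokens) for t in filled]
```

Here the first filled pause overlaps speaker B's backchannel and the second does not, a distinction that is invisible in the linear text of either speaker's utterance.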

Requirements • Access to all types of context • Manual post-processing of query results • Queries for special tokens • Queries with special properties • Queries for position in temporal structure • Multiple distance measures, query scopes • …

[Diagram; labels: Query, Corpus (Transcripts, Recordings, Metadata), Query result, Context, Postprocessing]

EXAKT • Regular expression on „full text“ of <u> • (XPath on <u> with markup) • (XSL on transcripts) • DGD2 • Oracle full text on documents • SQL on <w> with attributes
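The two approaches listed here can be contrasted in a short sketch. It is illustrative only: SQLite stands in for the Oracle database, and the table layout, column names and sample rows are invented for the example rather than taken from EXAKT or DGD2.

```python
import re
import sqlite3

# Approach 1: regular expression on the "full text" of each <u>.
utterances = {
    "u1": "ja also ähm ich geh dann links",
    "u2": "äh wo ist das",
}
filled_pause = re.compile(r"\b(äh|ähm)\b")
regex_hits = [
    (u_id, match.group(0), match.start())
    for u_id, text in utterances.items()
    for match in filled_pause.finditer(text)
]

# Approach 2: SQL on one row per <w> token, with annotations as columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE w (u_id TEXT, pos INTEGER, form TEXT, lemma TEXT, tag TEXT)"
)
conn.executemany(
    "INSERT INTO w VALUES (?, ?, ?, ?, ?)",
    [
        ("u1", 0, "ja", "ja", "PTK"),
        ("u1", 1, "also", "also", "ADV"),
        ("u1", 2, "ähm", "äh", "FILL"),
        ("u1", 3, "ich", "ich", "PPER"),
        ("u1", 4, "geh", "gehen", "VVFIN"),
        ("u2", 0, "äh", "äh", "FILL"),
        ("u2", 1, "wo", "wo", "PWAV"),
    ],
)
sql_hits = conn.execute(
    "SELECT u_id, form FROM w WHERE lemma = ? AND tag = ? ORDER BY u_id, pos",
    ("gehen", "VVFIN"),
).fetchall()
conn.close()
```

The regular expression needs no prior tokenisation but only sees surface forms; the token table supports queries on lemma and part of speech but presupposes that the transcript has been parsed into `<w>` elements.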

Demo 1: EXAKT with HaMaTaC corpus • HaMaTaC: Hamburg Map Task Corpus • advanced L2 learners of German • solving a map task • Orthographic transcription with lemma, POS, disfluency annotation

Demo 2: DGD2 with FOLK Corpus FOLK: Research & Teaching Corpus of Spoken German

Future directions • Support a „real“ query language: CQL • CQPweb as a test case • User survey DGD2 (approaching 2000 users!) • … • TEI as common ground • for different spoken language corpora query platforms? • for querying spoken and written data side-by-side?
