KONVENS Wien, 15 Sep 2004 • EXMARaLDA – A modeling and visualization framework for the computer-assisted transcription of spoken language • Thomas Schmidt, SFB 538 ‚Mehrsprachigkeit‘, University of Hamburg
Background • Multilingual Database, SFB 538 „Mehrsprachigkeit“, University of Hamburg • EXMARaLDA (Extensible Markup Language for Discourse Annotation) • Dissertation project „Computer-based transcription of spoken language as a modelling and visualisation process“ (Supervisor: Angelika Storrer)
Background • Transcription of spoken language • Interviewer / child interaction • Classroom interaction • Interpreted doctor-patient discourse • for discourse / conversation analysis • for (child) language acquisition studies
Background • Problem: Diversity of Transcription Data • Theoretical diversity: • Entities of transcription (utterances, turns, non-verbal activities etc.) • Relations between entities (temporal, hierarchical, features, ...) • Presentation formats (partitur notation, column notation, ...) • Technological diversity: • Storage formats (text, binary, RDB) • Software (syncWriter, HIAT-DOS, DBM systems, word processors, ...) • Operating systems (Windows, Mac OS)
Background • Problem: Diversity of Transcription Data • Aim: A common platform for computer-assisted transcription • Exchange, reuse, archive transcription data • Merge corpora • Use different software tools with one piece of data
Background • Problem: Diversity of Transcription Data • Aim: A common platform for computer-assisted transcription • (Elements of a) Solution • XML technology • Three level architecture • Separate form from content • Separate logical from physical structure
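The separation of logical from physical structure can be illustrated with a minimal sketch. It assumes a toy logical model (timeline points plus speaker tiers) and serializes it to XML and back with Python's standard library. The element and attribute names (`transcription`, `timeline`, `point`, `tier`, `event`) are hypothetical and chosen for illustration only; they are not taken from the EXMARaLDA format.

```python
import xml.etree.ElementTree as ET

# Hypothetical logical model: timeline points plus speaker tiers,
# each tier holding (start, end, text) events anchored to the timeline.
timeline = ["T0", "T1", "T2"]
tiers = {
    "A": [("T0", "T1", "Hello."), ("T1", "T2", "How are you?")],
    "B": [("T1", "T2", "Fine.")],
}

def to_xml(timeline, tiers):
    """Physical representation: serialize the model to an XML string.
    Element names are illustrative assumptions, not a real schema."""
    root = ET.Element("transcription")
    tl = ET.SubElement(root, "timeline")
    for point in timeline:
        ET.SubElement(tl, "point", id=point)
    for speaker, events in tiers.items():
        tier = ET.SubElement(root, "tier", speaker=speaker)
        for start, end, text in events:
            ev = ET.SubElement(tier, "event", start=start, end=end)
            ev.text = text
    return ET.tostring(root, encoding="unicode")

def from_xml(xml_string):
    """Recover the logical model from its physical XML representation."""
    root = ET.fromstring(xml_string)
    points = [p.get("id") for p in root.find("timeline")]
    recovered = {
        t.get("speaker"): [(e.get("start"), e.get("end"), e.text) for e in t]
        for t in root.findall("tier")
    }
    return points, recovered
```

In this sketch the in-memory model stays the same regardless of how it is stored, so a different storage format could be swapped in without touching the tools that work on the model.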
Topics of this talk • 1. Some methodological considerations: Linguistic methods • Computer science methods • „Computing in the humanities“ • Interdisciplinary communication • 2. Components of the developed system
Methodological considerations • [Diagram; recoverable labels only] • Established view: Transcription as theory and „Verschriftlichung“ (rendering in writing) of a transcript; quality criteria: readability, adequacy • Modified view: Computer transcription as modelling and visualisation • Model theory view: analogue model, symbolic model • Database view: E/R model, application vs. logical layer, view • Text technology view: form vs. content, document
Methodological considerations Transcription as Modeling and Visualization of spoken language • Accordance with text-technological concepts • One model, different visualizations • No tradeoff between readability and adequacy • No tradeoff between human and computer processability • No “Standardization” of models • a common modelling framework, not a common model • no ontological specifications • XML = Standardization of physical representation
Visualization to Model • Structural relations: • Temporal sequence • Simultaneity • Equivalence (Entity Feature) • Hierarchy (Containment)
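The four structural relations above can be sketched as predicates over events anchored to a shared timeline. This is one possible reading, not the talk's definition: the `Event` class, the integer timeline positions, and the interpretation of equivalence as "same span on a different tier" are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """Hypothetical event: a labelled stretch between two timeline positions."""
    tier: str
    start: int
    end: int
    text: str

def precedes(a, b):
    """Temporal sequence: a ends before (or exactly when) b starts."""
    return a.end <= b.start

def simultaneous(a, b):
    """Simultaneity: the two spans share at least part of the timeline."""
    return a.start < b.end and b.start < a.end

def same_span(a, b):
    """Equivalence (assumed reading): a feature on another tier covers
    exactly the same span as the entity it describes."""
    return a.tier != b.tier and (a.start, a.end) == (b.start, b.end)

def contains(a, b):
    """Hierarchy (containment): b lies inside a and is strictly shorter."""
    inside = a.start <= b.start and b.end <= a.end
    return inside and (a.start, a.end) != (b.start, b.end)

utterance = Event("A-verbal", 0, 2, "Okay let's start")
overlap = Event("B-verbal", 1, 3, "Yes")
later = Event("A-verbal", 3, 4, "Good")
gloss = Event("A-translation", 0, 2, "(translation of the utterance)")
word = Event("A-words", 0, 1, "Okay")
```

With these definitions, `utterance` and `overlap` are simultaneous, `utterance` precedes `later`, `gloss` has the same span as `utterance`, and `utterance` contains `word`.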
Modeling framework • Relational model? (Sequence? Simultaneity?) • OHCO (ordered hierarchy of content objects)? (Simultaneity?) • DAG: Annotation Graphs? (Complexity?) • Transcription Graphs
Application: Input tools EXMARaLDA Partitur-Editor
Application: Input tools Simple EXMARaLDA Text file
Application: Input tools TASX annotator
Application: Input tools PRAAT
Application: Input tools EUDICO Linguistic Annotator (ELAN)
Application: Visualization • ... as a wrapped partitur • ... as a line transcript • ... in column notation
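The idea of one model with several visualizations can be sketched by rendering the same event list in two of the formats named above. This is a simplified illustration, not the EXMARaLDA output code: the tuple layout `(speaker, start, end, text)`, the function names, and the assumption that every event spans exactly one timeline interval are all hypothetical.

```python
# Hypothetical model: (speaker, start, end, text) events on a shared timeline.
# Simplifying assumption: each event covers exactly one timeline interval.
events = [
    ("A", 0, 1, "Good morning."),
    ("B", 1, 2, "Morning."),
    ("A", 1, 2, "How are"),
    ("A", 2, 3, "you?"),
]

def line_transcript(events):
    """One line per event, ordered by start position, then by speaker."""
    ordered = sorted(events, key=lambda e: (e[1], e[0]))
    return [f"{speaker}: {text}" for speaker, _start, _end, text in ordered]

def column_notation(events, speakers, width=16):
    """One column per speaker, one row per timeline interval."""
    rows = {}
    for speaker, start, _end, text in events:
        rows.setdefault(start, {})[speaker] = text
    lines = [" | ".join(s.ljust(width) for s in speakers)]
    for start in sorted(rows):
        cells = (rows[start].get(s, "").ljust(width) for s in speakers)
        lines.append(" | ".join(cells))
    return lines

for line in line_transcript(events):
    print(line)
for line in column_notation(events, ["A", "B"]):
    print(line)
```

Both functions read the same `events` list and never modify it, so adding a further visualization (for example a partitur layout) would not require any change to the model.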
Application: Corpus management EXMARaLDA Corpus Manager (COMA)
Application: Query/Analysis Search and Query Instrument for EXMARaLDA (SQUIRREL)
Project status • Software past beta stage • Five projects at our own institution use EXMARaLDA for their corpus work • Around 800 users in research and teaching outside the SFB • Used at the IDS in Mannheim • Submitted a suggestion for integrating the data model into P5 of the TEI Guidelines
Summary • From transcription as theory and „Verschriftlichung“ to computer-assisted transcription as modelling and visualisation • Interdisciplinary bridge / Methodology of computational techniques in „classical“ linguistics • Concrete practical improvements for work with transcription data: EXMARaLDA and Database „Multilingualism“ • Data model, formats and tools building on the separation of model and visualisation