DGD 2.0: A Web-based Navigation Platform for the Visualization, Presentation and Retrieval of German Speech Corpora
Joachim Gasch
E-mail: gasch@ids-mannheim.de
1. Introduction
1.1 The Collection of German Speech Corpora at the IDS
1.2 The Standardization Approach for cross-Corpus Information Management
2. The Online Navigation Platform
2.1 The Navigation Interface – Design Principles
2.2 The Visualization and Presentation of Speech Corpus Content
2.2.1 Generic Visualization of the XML Meta-Information of Speech Corpora
2.2.2 Transcript Visualization and Presentation
2.2.3 Media Presentation
3. Retrieval Strategies for unstructured and structured Speech Corpus Data Components
3.1 The Full-Text Search Module
3.2 XQuery Information Retrieval in structured XML Documents
4. Summary and Outlook
1. Introduction
1.1 The Collection of German Speech Corpora at the IDS
• The IDS hosts a wide range of historical and contemporary German speech corpora
• Many historical corpora can be (partially) accessed online via the Database for Spoken German (DGD)
=> Main objectives of the current DGD 2.0 project:
• Generic, cross-corpus approach to speech corpus management
• Normalized integration of historical and recent speech corpora
• Sustainability of speech corpus data components
• Object-oriented user interface (based on document structures) for corpus exploration and querying
1.2 The Standardization Approach for cross-Corpus Information Management
• The speech corpus system manages meta-information of media source signals
• Different corpora: the information structures of data components may vary considerably due to different linguistic research questions, e.g. represented genres, degree of content restriction, physical data structure, research field (natural vs. elicited speech)
=> Web-based speech corpus navigation platform:
• Standardization concept: cross-corpus solution for large speech corpus collections rather than for particular speech corpus projects
• Definition of a generic, system-wide data model containing the following components (systematically interlinked):
  + structured XML documentation instances on corpus, event and speaker level
  + unstructured, semi-structured or structured transcripts (time-aligned, multi-dimensional)
  + media source files
  + optional: unstructured secondary documents
Interlinked components of the normalized speech corpus data model
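The interlinked components listed above could be sketched roughly as follows. This is only an illustration of the linking idea, not the DGD 2.0 schema: all class names, field names, file paths and the speaker identifier are assumptions; only the corpus name DH and the event identifier DH--_E_00167 appear in the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Transcript:
    path: str
    structuring: str  # "unstructured", "semi-structured" or "structured"


@dataclass
class SpeakerDoc:
    speaker_id: str   # hypothetical identifier format
    xml_path: str     # structured XML documentation instance (speaker level)


@dataclass
class Event:
    event_id: str
    xml_path: str     # structured XML documentation instance (event level)
    speakers: List[SpeakerDoc] = field(default_factory=list)
    transcripts: List[Transcript] = field(default_factory=list)
    media_files: List[str] = field(default_factory=list)
    secondary_docs: List[str] = field(default_factory=list)  # optional component


@dataclass
class Corpus:
    corpus_id: str
    xml_path: str     # structured XML documentation instance (corpus level)
    events: List[Event] = field(default_factory=list)

    def media_for_speaker(self, speaker_id: str) -> List[str]:
        """Follow the links speaker -> events -> media source files."""
        return [
            media
            for event in self.events
            if any(s.speaker_id == speaker_id for s in event.speakers)
            for media in event.media_files
        ]


# Hypothetical file layout and speaker id, for illustration only.
speaker = SpeakerDoc("S_0001", "doc/speakers/S_0001.xml")
event = Event(
    event_id="DH--_E_00167",
    xml_path="doc/events/DH--_E_00167.xml",
    speakers=[speaker],
    transcripts=[Transcript("transcripts/DH--_E_00167.txt", "unstructured")],
    media_files=["media/DH--_E_00167.wav"],
)
corpus = Corpus("DH", "doc/corpora/DH.xml", events=[event])
print(corpus.media_for_speaker("S_0001"))
```

Because every component is reachable from the corpus object, a cross-corpus interface can traverse the same links regardless of how an individual corpus structures its transcripts.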
2. The Online Navigation Platform
2.1 The Navigation Interface - Design Principles
• Object-oriented, document-centric interaction paradigm: based on the document structures managed by the system
• Provision of adaptive views of speech corpus data components
=> The application menu:
• Flat structure of the navigation menu
• Fixed position at the top of the screen
• Permanent, homogeneous access to application components
• Indication of flat / hierarchically subdivided menu entry points by the symbols ► and ▼
=> Classifying icons:
• Intuitive user orientation by marking specific types of corpus data components with their corresponding icons
=> "Breadcrumb" navigation:
• Helps users identify their current position in the navigation tree
2.2 The Visualization and Presentation of Speech Corpus Content
2.2.1 Generic Visualization of the XML Meta-Information
• Native XML database storage of documentation instances
• Use of a generic XML rendering module to avoid corpus-specific instance visualizations, providing:
  + expandable / collapsible document nodes
  + node-level selection functionality
  + direct access to hyperlinks
=> The cross-corpus (single-corpus-independent) display method for corpus, event and speaker documentation offers an ergonomic navigation experience (especially for large data-centric XML instances)
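A generic renderer of this kind can be approximated in a few lines. The sketch below is not the DGD 2.0 rendering module; it assumes that collapsible nodes are mapped to HTML <details> elements, and the sample element names and the example.org URL are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def render_text(text: str) -> str:
    """Escape text; render a value that is a bare http(s) URL as a hyperlink."""
    if text.startswith(("http://", "https://")):
        url = escape(text, quote=True)
        return f'<a href="{url}">{url}</a>'
    return escape(text)


def render_node(elem: ET.Element, depth: int = 0) -> str:
    """Render any XML element as nested, collapsible HTML, independent of its schema."""
    pad = "  " * depth
    attrs = "".join(f' {k}="{v}"' for k, v in elem.attrib.items())
    label = escape(elem.tag + attrs)
    text = (elem.text or "").strip()
    children = list(elem)
    if not children:
        # Leaf node: show the label and its (possibly hyperlinked) value.
        return f"{pad}<div><b>{label}</b> {render_text(text)}</div>"
    lines = [f"{pad}<details open><summary>{label}</summary>"]
    if text:
        lines.append(f"{pad}  <div>{render_text(text)}</div>")
    for child in children:
        lines.append(render_node(child, depth + 1))
    lines.append(f"{pad}</details>")
    return "\n".join(lines)


# Invented sample instance; the real documentation schemas are not shown in the slides.
SAMPLE = """
<event id="DH--_E_00167">
  <location>
    <place>St. Gallen</place>
    <info>https://example.org/doc</info>
  </location>
</event>
""".strip()

html_output = render_node(ET.fromstring(SAMPLE))
print(html_output)
```

Since the function only walks the element tree, the same code serves corpus, event and speaker documentation without any corpus-specific templates.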
=> Documentation of geocodes:
• The geographic coordinates of event locations may be documented in specific speech corpus projects
• A geographic map can be displayed on demand: the example shows the map for the event DH--_E_00167 (latitude 47.423336, longitude 9.377225), which took place in St. Gallen (Switzerland)
Geographic map showing the event location (based on documented geocodes)
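As a minimal sketch of how a documented geocode can be turned into an on-demand map link: the slides do not say which map service DGD 2.0 uses, so the OpenStreetMap marker URL below is an assumption, as are the function name and the zoom level.

```python
from urllib.parse import urlencode


def map_url(lat: float, lon: float, zoom: int = 14) -> str:
    """Build an OpenStreetMap link with a marker at the given coordinates."""
    # Reject coordinates outside the valid geographic ranges.
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    query = urlencode({"mlat": f"{lat:.6f}", "mlon": f"{lon:.6f}"})
    return f"https://www.openstreetmap.org/?{query}#map={zoom}/{lat:.6f}/{lon:.6f}"


# Geocode documented for event DH--_E_00167 (St. Gallen).
print(map_url(47.423336, 9.377225))
```

Validating the ranges first keeps malformed geocodes in the documentation from producing broken map links.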
2.2.2 Transcript Visualization and Presentation
• For larger speech corpus collections, a common concept of "transcript" becomes fuzzy:
  + Annotation of distinct phenomena
  + Use of heterogeneous (transcript-editor-specific) data formats
• Historical speech corpora:
  + Unstructured transcript data formats (layout-oriented only)
• Contemporary speech corpora:
  + Use of current annotation tools: structured data formats, but no cross-corpus structural homogeneity
• Cross-corpus visualization is possible for the transcript-related part of the event documentation via the menu item "Transkripte" ("Transcripts"; corpus-specific transcript access lists)
2.2.3 Media Presentation
• Speech corpora may include different types of interdependent media files:
  + One event is related to one or more source files: the raw material recorded for an event (originating directly from an audio device)
  + An event can be composed of several speech events: further segmentation of the source files into speech-event-specific recordings
• All relevant information on the different media file types is maintained in the meta-documentation of the corresponding event and can be accessed via the list under the menu item "Aufnahmen" ("Recordings")
Corpus-specific list of source recordings for the speech corpus DH
3. Retrieval Strategies for unstructured and structured Speech Corpus Data Components
• Media file content can only be located via descriptive meta-information:
  + metadata (schema-valid XML instances)
  + transcript data (unstructured, semi-structured, structured)
• The transcript data of speech corpus collections varies widely in its degree of structuring
• Retrieval strategies depend on this degree: from simple full-text search to complex layer-aware query processing
• Single-corpus transcript incompatibilities (worst-case scenario):
  + Signal segmentation without precise segmentation guidelines (e.g. phones, words, phrases or turns)
  + No or insufficient naming conventions for the transcript layer descriptors (e.g. no unique descriptor for the orthographic transcription layer)
  + No exact semantic layer definition available, or semantic mix-up of layer content (e.g. orthographic and phonetic markup mixed in a single layer)
  + No exact syntactic definition of layer content available, or syntactic mix-up of layer content (e.g. inconsistent punctuation or capitalization conventions in the orthographic layer)
  + Violation of cross-layer time relations (e.g. caused by interval changes made with multi-layer transcript editors without layer inheritance control)
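The last incompatibility, violated cross-layer time relations, can be detected mechanically. The following is a minimal sketch under stated assumptions: each layer is a list of (start, end) intervals in seconds, and every interval of a finer layer (e.g. words) is expected to lie inside one interval of the coarser layer (e.g. turns). The function name, layer names and sample times are invented for illustration.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def find_violations(parent: List[Interval], child: List[Interval],
                    tol: float = 1e-6) -> List[Interval]:
    """Return the child intervals that are not contained in any single parent interval."""
    violations = []
    for c_start, c_end in child:
        contained = any(
            p_start - tol <= c_start and c_end <= p_end + tol
            for p_start, p_end in parent
        )
        if not contained:
            violations.append((c_start, c_end))
    return violations


# Invented example: the third word interval straddles the turn boundary at 2.5 s,
# as could happen after a boundary edit in an editor without layer inheritance control.
turns = [(0.0, 2.5), (2.5, 5.0)]
words = [(0.0, 1.0), (1.0, 2.5), (2.4, 3.1), (3.1, 5.0)]
print(find_violations(turns, words))
```

A check like this could be run before import to decide whether a transcript qualifies for layer-aware querying or should fall back to full-text search only.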
3.1 The Full-Text Search Module
• No structured data is required (but it can optionally be included)
• Advantages: short query response times, easy-to-use interface
• The full-text search functionality is implemented using Oracle Text
• Examples of the provided full-text query features:
  + The single- and multiple-character wildcards "_" and "%": _ind matches e.g. "Kind" and "Wind"; %wind matches e.g. "Nordwind" or "Südwind"
  + The operators AND and OR build logical relations between search terms: Nordwind AND Südwind matches only documents with occurrences of both terms
  + The NOT operator excludes a specific search term: Nordwind NOT Südwind matches only documents containing "Nordwind" but not "Südwind"
  + The NEAR operator finds documents depending on the word distance between search terms: NEAR((Schule, Kirche), 4, TRUE) matches documents where both search terms occur, in the given order, within a maximum word distance of 4 words
Full-text search in semi-structured transcript data with search results (KWIC-list)
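The query features above can be illustrated with a small emulation. This sketch does not call Oracle Text; it only mimics the described semantics in plain Python, with a simplified notion of word distance (difference of token positions), which is an assumption. All function names are invented.

```python
import re
from typing import List


def wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate '_' (exactly one character) and '%' (any run of characters) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")
        elif ch == "%":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text)


def matches_term(pattern: str, text: str) -> bool:
    """True if any word of the text matches the wildcard pattern completely."""
    rx = wildcard_to_regex(pattern)
    return any(rx.fullmatch(tok) for tok in tokenize(text))


def near(first: str, second: str, text: str, max_span: int, ordered: bool = True) -> bool:
    """True if both terms occur at most max_span token positions apart."""
    tokens = [t.lower() for t in tokenize(text)]
    pos_a = [i for i, t in enumerate(tokens) if t == first.lower()]
    pos_b = [i for i, t in enumerate(tokens) if t == second.lower()]
    for a in pos_a:
        for b in pos_b:
            if ordered and b <= a:
                continue  # second term must follow the first
            if 0 < abs(b - a) <= max_span:
                return True
    return False


doc = "Der Nordwind weht, aber kein Südwind."
print(matches_term("%wind", doc))                                       # wildcard
print(matches_term("Nordwind", doc) and matches_term("Südwind", doc))   # AND
print(matches_term("Nordwind", doc) and not matches_term("Ostwind", doc))  # NOT
print(near("Schule", "Kirche", "Die Schule steht neben der Kirche", 4))    # NEAR
```

The boolean operators map directly onto Python's `and`, `or` and `not` applied to per-term matches, which is why no structured transcript data is needed for this kind of search.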
3.2 XQuery Information Retrieval in structured XML Documents
• Full-text search is not sufficient for retrieval in fine-grained XML instances (such as metadata or time-aligned multi-dimensional transcripts)
• XQuery allows the implementation of context-sensitive queries on the hierarchically interdependent information units of XML-structured data:
  + criteria-specific information selection and filtering
  + joining of data from document selections
  + sorting, grouping, aggregating, transforming and restructuring of data
  + arithmetic calculations on numbers and dates
• Powerful queries can be defined, but detailed knowledge of the underlying information structures is necessary
=> Two different approaches to implementing Web-based XQuery retrieval interfaces:
  + HTML form with a graphical representation of the XML tree (easy to use, but limited flexibility for query definition)
  + HTML form providing a text area for entering the XQuery as plain text (intended for system experts only; complex queries on data-centric instances or cross-structural joins are also possible)
HTML form providing a graphical XQuery composition interface
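To make the selection, join and sorting steps concrete, here is a sketch that uses Python's ElementTree in place of an XQuery engine; it is a stand-in technique, not XQuery itself. The XML structure, element names and all data values are invented for illustration and do not reflect the DGD 2.0 schemas.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Invented sample data; a roughly equivalent XQuery for the selection and
# sorting part might read:
#   for $e in /corpus/event[place = "Mannheim"]
#   order by xs:integer($e/year)
#   return string($e/@id)
CORPUS_XML = """
<corpus id="XX">
  <speaker_doc id="S1"><label>Speaker A</label></speaker_doc>
  <speaker_doc id="S2"><label>Speaker B</label></speaker_doc>
  <event id="E2"><place>Mannheim</place><year>1975</year>
    <speaker ref="S1"/><speaker ref="S2"/></event>
  <event id="E1"><place>Mannheim</place><year>1962</year>
    <speaker ref="S2"/></event>
  <event id="E3"><place>Berlin</place><year>1980</year>
    <speaker ref="S1"/></event>
</corpus>
""".strip()


def events_at(root: ET.Element, place: str) -> List[Tuple[str, int, List[str]]]:
    """Select events by place, join them with speaker labels, sort by year."""
    # Lookup table for the join between event and speaker documentation.
    labels = {s.get("id"): s.findtext("label") for s in root.iter("speaker_doc")}
    # Criteria-specific selection and filtering.
    hits = [e for e in root.iter("event") if e.findtext("place") == place]
    # Sorting on a numeric value.
    hits.sort(key=lambda e: int(e.findtext("year")))
    return [
        (e.get("id"), int(e.findtext("year")),
         [labels[s.get("ref")] for s in e.findall("speaker")])
        for e in hits
    ]


root = ET.fromstring(CORPUS_XML)
print(events_at(root, "Mannheim"))
```

The example also shows why structural knowledge is required: the query only works if the author knows that places, years and speaker references are stored under exactly these element names.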
4. Summary and Outlook
• Media source files become analyzable via their associated meta-information
• Contemporary speech corpus systems have to close the gap between the processing of binary media data and related meta-information
• The need for standardization of speech corpus components is commonly accepted
• But: identifying all the parameters necessary for cross-corpus standardization remains an open goal
• Emerging technologies such as the MPEG-7 standard might provide appropriate logic for the standardized integration of the different audiovisual information types (potentially involved in media corpora):
  + Audio
  + Voice
  + Video
  + Images
  + Graphs
  + 3D models
=> Questions? Suggestions?