DynaSAND Technology: Storing and Generating Transcripts and Metadata for Language Resources

DynaSAND stores transcripts in a relational database, broken down to the word level while preserving context. This makes it relatively easy to generate other formats, such as TEI XML and IMDI metadata. The presentation also covers the Edisyn search engine and other language resources of the Meertens Institute.

Presentation Transcript


  1. DynaSAND: technology • Transcripts are stored in a relational database • Transcripts are broken down into their smallest constituents (words), while the context is preserved, in a structure basically like this: [Diagram: tables interview, speaker, sentence and word, linked through interview_id, speaker_id, sentence_id and word_id; further fields shown: start time, end time, locality]
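A structure of this kind can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the slide names the tables and keys, but which table holds `locality` and the time fields, the `position` and `form` columns, and all sample rows are invented here and are not the actual DynaSAND schema or data.

```python
import sqlite3

# Hypothetical layout: the column placement is an assumption, not the real schema.
SCHEMA = """
CREATE TABLE interview (interview_id INTEGER PRIMARY KEY, locality TEXT);
CREATE TABLE speaker   (speaker_id INTEGER PRIMARY KEY,
                        interview_id INTEGER REFERENCES interview(interview_id));
CREATE TABLE sentence  (sentence_id INTEGER PRIMARY KEY,
                        speaker_id INTEGER REFERENCES speaker(speaker_id),
                        start_time REAL, end_time REAL);
CREATE TABLE word      (word_id INTEGER PRIMARY KEY,
                        sentence_id INTEGER REFERENCES sentence(sentence_id),
                        position INTEGER, form TEXT);
"""

def build_demo_db():
    """Create an in-memory database with one invented interview."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO interview VALUES (1, 'Exampletown')")
    conn.execute("INSERT INTO speaker VALUES (1, 1)")
    conn.execute("INSERT INTO sentence VALUES (1, 1, 0.0, 1.5)")
    conn.executemany("INSERT INTO word VALUES (?, 1, ?, ?)",
                     [(1, 1, "this"), (2, 2, "is"), (3, 3, "invented")])
    return conn

def word_in_context(conn, word_id):
    """Look up a single word and recover its sentence, timing and locality."""
    form, sentence_id, start, end, locality = conn.execute(
        """SELECT w.form, w.sentence_id, s.start_time, s.end_time, i.locality
           FROM word w
           JOIN sentence s   ON s.sentence_id = w.sentence_id
           JOIN speaker sp   ON sp.speaker_id = s.speaker_id
           JOIN interview i  ON i.interview_id = sp.interview_id
           WHERE w.word_id = ?""", (word_id,)).fetchone()
    # Rebuild the sentence from its words, in their original order.
    words = conn.execute(
        "SELECT form FROM word WHERE sentence_id = ? ORDER BY position",
        (sentence_id,)).fetchall()
    return {"word": form, "sentence": " ".join(f for (f,) in words),
            "start": start, "end": end, "locality": locality}
```

Calling `word_in_context(build_demo_db(), 2)` addresses one word by its id and still returns the sentence, time span and locality it belongs to, which is the sense in which the context is preserved.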

  2. DynaSAND: technology • This means that individual words can be addressed, e.g. for POS tagging • The POS tags are themselves stored as separate categories, attributes and values, not as opaque strings: [Diagram: tables category, word and attributes, with fields category_id, word_id, attribute_id and value_id]
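The benefit of storing tags as categories, attributes and values can be shown with a small sketch, again using `sqlite3`. The table names, the linking tables, the labels (`verb`, `number`, `plural`) and the sample words are all assumptions made for illustration; the real DynaSAND tables may be organized differently.

```python
import sqlite3

# Hypothetical layout inspired by the slide; not the actual DynaSAND tables.
SCHEMA = """
CREATE TABLE word           (word_id INTEGER PRIMARY KEY, form TEXT);
CREATE TABLE category       (category_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE attribute      (attribute_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE attr_value     (value_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE word_category  (word_id INTEGER, category_id INTEGER);
CREATE TABLE word_attribute (word_id INTEGER, attribute_id INTEGER, value_id INTEGER);
"""

def build_tag_db():
    """Create an in-memory database with three invented, tagged words."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO word VALUES (?, ?)",
                     [(1, "walks"), (2, "walk"), (3, "house")])
    conn.executemany("INSERT INTO category VALUES (?, ?)",
                     [(1, "verb"), (2, "noun")])
    conn.execute("INSERT INTO attribute VALUES (1, 'number')")
    conn.executemany("INSERT INTO attr_value VALUES (?, ?)",
                     [(1, "singular"), (2, "plural")])
    conn.executemany("INSERT INTO word_category VALUES (?, ?)",
                     [(1, 1), (2, 1), (3, 2)])
    conn.executemany("INSERT INTO word_attribute VALUES (?, ?, ?)",
                     [(1, 1, 1), (2, 1, 2), (3, 1, 1)])
    return conn

def find_words(conn, category, attributes=None):
    """Return word forms of a category, optionally filtered by attribute values."""
    sql = ["SELECT w.form FROM word w",
           "JOIN word_category wc ON wc.word_id = w.word_id",
           "JOIN category c ON c.category_id = wc.category_id",
           "WHERE c.label = ?"]
    params = [category]
    # Each requested attribute/value pair becomes one structured condition.
    for attr, value in (attributes or {}).items():
        sql.append("""AND EXISTS (
            SELECT 1 FROM word_attribute wa
            JOIN attribute a  ON a.attribute_id = wa.attribute_id
            JOIN attr_value v ON v.value_id = wa.value_id
            WHERE wa.word_id = w.word_id AND a.label = ? AND v.label = ?)""")
        params += [attr, value]
    sql.append("ORDER BY w.word_id")
    return [form for (form,) in conn.execute(" ".join(sql), params)]
```

Because the tag is not an opaque string, a query such as `find_words(conn, "verb", {"number": "plural"})` can filter on any combination of category and attribute values without parsing tag strings.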

  3. Generating other formats • The fact that the data is stored in its smallest constituent parts makes it relatively easy to generate other formats • Example: we realize that a binary format like a relational database is not appropriate for long-term archival, so we made the SAND transcriptions available as TEI XML by creating a template and filling that with data from the database with a script • Another example: the IMDI metadata for another corpus (The Goeman-Taeldeman-Van Reenen Project, or GTRP corpus) were created in the same way
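The export idea can be sketched as follows, using `xml.etree.ElementTree` from the Python standard library. This is not the actual SAND template or script: the row layout `(speaker_id, sentence_id, position, form)` is an assumed stand-in for a database query result, and the output is a simplified TEI-like fragment (utterances as `u`, words as `w`) without the `teiHeader` that a complete TEI document requires.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

TEI_NS = "http://www.tei-c.org/ns/1.0"

def rows_to_tei(rows):
    """Turn (speaker_id, sentence_id, position, form) rows into a TEI-like string.

    The rows stand in for a database query result; the layout is hypothetical.
    """
    tei = ET.Element("TEI", xmlns=TEI_NS)
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    # Sort by sentence, then by word position, so the word order is restored.
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    for sentence_id, group in groupby(ordered, key=lambda r: r[1]):
        group = list(group)
        speaker_id = group[0][0]
        # One utterance element per sentence, pointing at its speaker.
        u = ET.SubElement(body, "u", who=f"#speaker{speaker_id}", n=str(sentence_id))
        for _, _, _, form in group:
            ET.SubElement(u, "w").text = form
    return ET.tostring(tei, encoding="unicode")

if __name__ == "__main__":
    # Invented sample rows, deliberately out of order.
    print(rows_to_tei([(1, 1, 2, "is"), (1, 1, 1, "this"), (2, 2, 1, "yes")]))
```

Since every word is a separate row, switching to another output format mainly means writing a different function over the same rows.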

  4. Generating metadata for CLARIN • Previous experience with SAND and GTRP indicates that generating XML metadata for CLARIN from our databases should be doable • The TEI and IMDI for SAND and GTRP were created once and are static; we plan to make the process more dynamic for CLARIN metadata by creating the XML on the fly (and implementing a caching mechanism for performance reasons) so that the metadata is always up to date
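One possible shape for "XML on the fly with caching" is sketched below. The slides do not describe the planned mechanism, so everything here is an assumption: the `MetadataCache` class, the use of a version number to detect changes, and the generic element names, which do not follow any CLARIN or IMDI schema.

```python
import xml.etree.ElementTree as ET

def render_metadata(record):
    """Render a flat dict as a small XML string (hypothetical element names)."""
    root = ET.Element("resource")
    for key in sorted(record):
        ET.SubElement(root, key).text = str(record[key])
    return ET.tostring(root, encoding="unicode")

class MetadataCache:
    """Regenerate XML only when the underlying record has changed.

    A record's `version` stands in for something like a last-modified
    timestamp in the database; this is an assumption for the sketch.
    """

    def __init__(self, render):
        self._render = render
        self._cache = {}      # resource_id -> (version, xml)
        self.renders = 0      # counts how often XML was actually generated

    def get(self, resource_id, version, record):
        cached = self._cache.get(resource_id)
        if cached is not None and cached[0] == version:
            return cached[1]  # still up to date: serve the cached XML
        xml = self._render(record)
        self.renders += 1
        self._cache[resource_id] = (version, xml)
        return xml

if __name__ == "__main__":
    cache = MetadataCache(render_metadata)
    print(cache.get("demo", 1, {"title": "Demo corpus", "size": 3}))
```

Repeated requests for an unchanged record are served from the cache, while a changed version triggers a fresh render, so the metadata stays current without regenerating the XML on every request.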

  5. Edisyn (European Dialect Syntax) • One of the goals of Edisyn is the development of a search engine which uses one tag set to search different corpora, including the SAND, concurrently • Central tag set is being developed by Franca Wesseling; we plan to make it compatible with ISOcat • Search engine translates these tags to the native tag sets of the corpora • Ideal case: corpora are hosted by their own organizations and accessible via a web service • In practice: the Meertens has local copies of the corpora • Participating corpora: SAND, CORDIAL-SIN (Portuguese), ASIS (Italian), EMK (Estonian); more to come
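The translation step can be illustrated with a minimal sketch. The corpus names `corpus_a` and `corpus_b`, the tag labels and the sample tokens are invented placeholders; they do not reflect the Edisyn central tag set or the native tag sets of any participating corpus.

```python
# Hypothetical mappings from central tags to each corpus's native tags.
TAG_MAPPINGS = {
    "corpus_a": {"verb": "V", "noun": "N"},
    "corpus_b": {"verb": "VB"},          # this corpus has no mapping for "noun"
}

def translate_tag(central_tag, mappings):
    """Return {corpus: native_tag} for every corpus that maps the central tag."""
    return {corpus: table[central_tag]
            for corpus, table in mappings.items()
            if central_tag in table}

def search_all(central_tag, corpora, mappings):
    """Search several corpora with one central tag.

    `corpora` maps a corpus name to a list of (form, native_tag) tokens,
    standing in for local copies or web services.
    """
    results = {}
    for corpus, native_tag in translate_tag(central_tag, mappings).items():
        tokens = corpora.get(corpus, [])
        results[corpus] = [form for form, tag in tokens if tag == native_tag]
    return results

if __name__ == "__main__":
    # Invented tokens for the two placeholder corpora.
    corpora = {
        "corpus_a": [("walks", "V"), ("house", "N")],
        "corpus_b": [("runs", "VB"), ("tree", "NN")],
    }
    print(search_all("verb", corpora, TAG_MAPPINGS))
```

Keeping the mappings as data rather than code means a new corpus can be added by supplying one more mapping table, and a corpus that lacks a tag is simply skipped for that query.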

  6. Other Meertens language resources • PLAND (Plant Names in Dutch Dialects) • NVD (Dutch Database of First Names) • NFD (Dutch Database of Family Names) • Corpus of free dialect speech (sound recordings) • Dutch Database of Toponyms (in development) • Dutch Song Database • Dutch Folktale Database

  7. Other Meertens language resources • Apart from part of the sound recordings, all these are web-based and based on the same database technology • We plan to make CLARIN metadata available for these resources in a stepwise manner: first metadata on the corpus level, later also metadata on the record level • The technologies involved (OAI-PMH) are new to us, so we want to do this in close cooperation with a “harvesting” institution to make sure that our stuff is correct

  8. Further in the future • The Meertens Institute wants to be part of CLARIN and in the future we also hope to contribute to the development of tools to work with language resources

  9. Thank you for your attention!
