Customizing the IMDI metadata schema for endangered languages • Heidi Johnson (AILLA) • Arienne Dwyer (DOBES)
Introduction • IMDI: International Standards for Language Engineering Metadata Initiative • DOBES: Volkswagen Foundation’s Documentation of Endangered Languages initiative • AILLA: the Archive of the Indigenous Languages of Latin America
Types of resources • Audio and video recordings in various digital formats • Annotation text files, e.g. transcriptions and translations • Standalone texts, e.g. dictionaries, poetry • Wide range of genres: from verbal art to scholarly analyses
Bundles of resources • Session (IMDI, 2001): resources resulting from a linguistic elicitation session - recordings and annotations. • Only models one kind of resource production - a recording session. • Collections will include a greater variety of resources, in sets of related materials.
Types of bundles • Canonical bundle: the original session. A digitized recording, in different formats, and some textual annotation files, also in different formats. • Minimal bundle: a single file. Examples: dictionary, poem, recording of uninterpretable chants. • Meta-bundle: a bundle containing other bundles. Example: a book about a set of annotated recordings.
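The three bundle types above can be sketched as one recursive data structure. This is only an illustration: the class name `Bundle`, its fields, and the file names are assumptions made for the example, not IMDI element names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bundle:
    """Hypothetical model of a bundle (names are illustrative, not IMDI elements)."""
    name: str
    files: List[str] = field(default_factory=list)             # resources held directly
    sub_bundles: List["Bundle"] = field(default_factory=list)  # non-empty only in a meta-bundle

    def kind(self) -> str:
        """Classify the bundle using the three types named on the slide."""
        if self.sub_bundles:
            return "meta-bundle"
        if len(self.files) == 1:
            return "minimal"
        # Anything else is treated as a canonical session bundle in this sketch.
        return "canonical"

    def all_files(self) -> List[str]:
        """Collect files from this bundle and, recursively, from nested bundles."""
        collected = list(self.files)
        for sub in self.sub_bundles:
            collected.extend(sub.all_files())
        return collected


# Made-up example data: a recording in two formats plus one annotation file.
session = Bundle("session", ["story.wav", "story.mp3", "story_transcript.txt"])
poem = Bundle("poem", ["poem.pdf"])
# A book about the annotated recording: a bundle that contains another bundle.
book = Bundle("book", ["book.pdf"], sub_bundles=[session])

print(session.kind(), poem.kind(), book.kind())
```

Modelling the meta-bundle as a bundle that holds other bundles keeps a single record type for all three cases, which is one possible reading of the proposal rather than the schema's actual definition.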
Bundle elements • Current: • Name of bundle • Date and place of production • Proposed: • Resource relations • Date archived • Last modified
Major subschemas • Project • Collector • Content • Participants • Resources • References
The Content Subschema • Genre is the top-level category: • Interaction: conversation, interview … • Explanation: description, recipe … • Performance: narrative, poem, oratory … • Teaching: primer, textbook … • Analysis: grammar, dictionary …
Other Content categories • Modality: speech, writing, gesture • Communication context: interactivity, planning, involvement • Languages • Task • Description • Keys
AILLA’s Content Keys • Register: a characterization of how the discourse reflects the social context. Example: honorific speech. • Style: a characterization of poetic and stylistic effects. Examples: parallelism, metered verse.
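A minimal sketch of how a Content record might combine a genre with AILLA's Register and Style keys as key/value pairs. The dictionary layout and the helper `genre_category` are assumptions for illustration; the genre table holds only the examples named on the slides, whose lists are open-ended.

```python
from typing import Optional

# Top-level genre categories with the example members listed on the slide.
# The slide's lists end in an ellipsis, so this table is deliberately incomplete.
GENRES = {
    "Interaction": ["conversation", "interview"],
    "Explanation": ["description", "recipe"],
    "Performance": ["narrative", "poem", "oratory"],
    "Teaching": ["primer", "textbook"],
    "Analysis": ["grammar", "dictionary"],
}


def genre_category(subgenre: str) -> Optional[str]:
    """Return the top-level category for a listed subgenre, or None if unlisted."""
    for category, members in GENRES.items():
        if subgenre in members:
            return category
    return None


# Hypothetical Content record: archive-specific keys live in a free-form mapping.
content = {
    "genre": "oratory",
    "modality": "speech",
    "keys": {"Register": "honorific speech", "Style": "parallelism"},
}

print(genre_category(content["genre"]), content["keys"]["Register"])
```

Keeping Register and Style in a separate `keys` mapping mirrors the idea that an archive can add its own descriptors without changing the fixed part of the record.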
The Project subschema • Current elements: • Name: a nickname or acronym • Title: official title • ID: a unique identifier • Contact information • Proposed element: • Funder: name of funding organization
The Collector subschema • AILLA renames this Depositor, since this is the individual we have to keep track of (e.g. for Level 3 access permission). When the Depositor is not also the Collector, Collector can be listed under Participants.
The Participants subschema • Type: functional role, e.g. creator • Role: family relationship • Name/Full name • Language(s) • Ethnic group, age, sex • Education • Anonymous: True if participant’s Full name is reserved; False otherwise
AILLA additions to Participants • Origin: Place (country, region, etc) of origin of the creator of the primary resource in the bundle (e.g. the speaker whose voice is recorded). • Occupation: Can be relevant in assessing accuracy of some kinds of data.
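The Participants elements, including the Anonymous flag and AILLA's Origin and Occupation additions, could be modelled roughly as follows. The class, the field names, and the rule in `public_name` are assumptions for this sketch: it shows the short Name whenever the Full name is reserved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Participant:
    """Hypothetical participant record (field names are illustrative)."""
    type: str                         # functional role, e.g. "creator"
    role: str                         # family relationship
    name: str                         # short name or code
    full_name: str
    anonymous: bool = False           # True if the full name is reserved
    origin: Optional[str] = None      # AILLA addition: place of origin
    occupation: Optional[str] = None  # AILLA addition

    def public_name(self) -> str:
        """Return the name that may be displayed to archive users."""
        return self.name if self.anonymous else self.full_name


# Made-up example data.
speaker = Participant(
    type="creator",
    role="grandmother",
    name="Speaker A",
    full_name="Example Person",
    anonymous=True,
    origin="Mexico",
    occupation="weaver",
)

print(speaker.public_name())
```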
The Resources subschema • Resources contains information about formats and provenance of files in a bundle. • Media Files: audio, video, etc. • Annotation Files: text files. • Proposal: call them all Media Files, to reduce redundancy in the database. (All have URL, size, etc. elements.)
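The proposal to call everything a Media File can be illustrated with a single record type that carries the shared elements (URL, size) plus a discriminator. The names `MediaFile`, `kind`, and `size_bytes`, and the example URLs and sizes, are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MediaFile:
    """Hypothetical unified record for recordings and annotation files alike."""
    url: str
    size_bytes: int
    kind: str                                  # "audio", "video", or "annotation"
    annotation_type: Optional[str] = None      # only meaningful for annotation files
    character_encoding: Optional[str] = None   # only meaningful for text files


# Made-up example data: one recording and its transcription.
files = [
    MediaFile("file:///example/story.wav", 48_000_000, "audio"),
    MediaFile(
        "file:///example/story.txt",
        12_000,
        "annotation",
        annotation_type="phonetic transcription",
        character_encoding="UTF-8",
    ),
]

# One table of files can still be filtered by kind when the distinction matters.
annotations = [f for f in files if f.kind == "annotation"]
total_size = sum(f.size_bytes for f in files)

print(len(annotations), total_size)
```

The trade-off in this design is that text-only elements become optional fields on every record, in exchange for storing the shared elements only once.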
Text resources • Current elements: • Type: type of annotation, e.g. phonetic transcription. • Content encoding: annotation encoding scheme, e.g. EUROTYP. • Character encoding: character set(s) used in a text file.
Text resources 2 • Proposed elements: • Transcription type • Translation (aka Glossing) type • Software: used to produce transcriptions, translations, other annotations (e.g. Shoebox) • Describe Annotator in Participants (along with Translator, etc.)
Proposed subschema • Place: composed of several elements: • Continent • Country • Region • Subregion (address) • Repeated at least twice, in Bundle and in Participants (Origin). • Might also be useful in the Language subschema.
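Because Place recurs in Bundle and in Participants (Origin), it can be defined once and reused. The sketch below assumes a `Place` class with the four elements from the slide; the `label` helper and the example places are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    """Hypothetical reusable Place record with the four proposed elements."""
    continent: str
    country: str
    region: str = ""
    subregion: str = ""  # address-level detail

    def label(self) -> str:
        """Join the non-empty elements from most to least specific."""
        parts = [self.subregion, self.region, self.country, self.continent]
        return ", ".join(p for p in parts if p)


# Made-up example data: the same type serves both uses named on the slide.
recorded_at = Place("North America", "Mexico", "Oaxaca")
speaker_origin = Place("North America", "Mexico", "Chiapas")

bundle = {"name": "example-bundle", "place": recorded_at}
participant = {"name": "Speaker A", "origin": speaker_origin}

print(bundle["place"].label())
print(participant["origin"].label())
```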
Conclusion • IMDI schema is a flexible tool. • Customization through Key/Value pairs allows local modifications. • Most of the proposed changes are terminological, moving from the DOBES in-house terminology to more general usage.