E-MELD Project: Improving Endangered Language Documentation

The E-MELD Project
Helen Aristar Dry & Anthony Aristar
The LINGUIST List, Eastern Michigan U & Wayne State U
University of Texas at Austin

E-MELD: Electronic Metastructure for Endangered Languages Documentation
• 5-year NSF project, 2001-6
• Linguist List, ELF, LDC
• Goal: To aid in
  • …the preservation of endangered languages data
  • …the development of infrastructure for electronic archives

Summary of the problem (2001): EL resources were/are
• Difficult to find
• Difficult to use
• Difficult to preserve
Needed:
• More uniformity in naming, cataloguing, annotating, i.e., interoperable standards
• More knowledge of how to create digital resources that last

Problems with EL resources
• Difficult to find
  • At distributed sites
  • Language names ambiguous
  • No central catalog of resources or cataloging information (metadata)
  • Lack of interoperability among archives
• Difficult to display accurately
  • Idiosyncratic character encoding
  • Specific fonts needed

Problems with EL resources, 2
• Difficult to compare
  • Non-standard terminology
  • Idiosyncratic markup & annotation schemes
• Difficult to manipulate or reuse
  • Specific software needed (incl. specific software version), e.g. MSWord 1.0
  • Meaning represented via formatting, which was not documented
    • e.g. bold represents “headword”
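The "meaning represented via formatting" problem can be illustrated with a small Python sketch. It assumes a hypothetical export in which bold is marked `<b>…</b>` and italics `<i>…</i>`, and an undocumented convention that bold means "headword" and italics means "part of speech". The tag names, the mapping, and the sample entry are illustrative assumptions, not an E-MELD format.

```python
import re

# Hypothetical convention recovered from the original author:
# bold = headword, italics = part of speech.
FORMAT_TO_MEANING = {"b": "headword", "i": "pos"}

# Matches a non-nested <b>...</b> or <i>...</i> span.
FORMAT_SPAN = re.compile(r"<(b|i)>(.*?)</\1>")


def make_explicit(entry: str) -> str:
    """Replace presentational tags with descriptive ones."""

    def rename(match: re.Match) -> str:
        label = FORMAT_TO_MEANING[match.group(1)]
        return f"<{label}>{match.group(2)}</{label}>"

    return "<entry>" + FORMAT_SPAN.sub(rename, entry) + "</entry>"


if __name__ == "__main__":
    print(make_explicit("<b>kitabu</b> <i>n.</i> book"))
```

Once the convention is written into the markup, the entry no longer depends on a particular word processor or font to be interpreted.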

Problems with EL resources, 3
Impermanent; vulnerable to:
• Deterioration of the physical media
• Hardware obsolescence
• Software obsolescence

PHONOGRAMMARCHIV - AUSTRIAN ACADEMY OF SCIENCES
(slide from Dietrich Schüller, Director)

Toward a Solution: E-MELD Components
• Involve linguistics community in developing standards
• Promote consensus about:
  • Language Identification
  • Metadata
  • Annotation and markup
• Teach and facilitate implementation of “best practices” in the creation of digital language documentation

Promoting consensus: annual workshops
• 2001, Santa Barbara, CA: The Need for Standards
• E-MELD 2002, Ann Arbor, MI: Digitizing Lexical Information
• E-MELD 2003, Lansing, MI: Digitizing Texts
• E-MELD 2004, Detroit, MI: Databases and Best Practice
• E-MELD 2005, Cambridge, MA: Linguistic Ontologies & Terminology

2006 E-MELD Workshop on Digital Language Documentation
• Michigan State University
• June 20-22, 2006
• In conjunction with the 2006 Summer Meeting of the Linguistic Society of America
• Topic: Electronic Archiving and Digital Tools: Current State & Future Directions
Please come!

Finding resources: metadata
• OLAC metadata standards (subcommunity of OAI)
• OLAC search engine on LL site:
  • http://linguistlist.org/olac
• OLAC metadata editor on LL site:
  • http://linguistlist.org/olac/ore
• XSL Stylesheets for transformation / presentation of OLAC metadata
• Ethnologue/LL language codes proposed as ISO standard
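As a rough illustration of what a catalog record of this kind might contain, the Python sketch below builds a minimal Dublin Core style record with a language-code subject. The namespace URIs and the `xsi:type="olac:language"` / `olac:code` pattern reflect my understanding of the OLAC metadata format and should be checked against the OLAC specification. The function name `build_record` and all field values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as I understand them; verify against the OLAC documentation.
DC = "http://purl.org/dc/elements/1.1/"
OLAC = "http://www.language-archives.org/OLAC/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

for prefix, uri in (("dc", DC), ("olac", OLAC), ("xsi", XSI)):
    ET.register_namespace(prefix, uri)


def build_record(title: str, creator: str, date: str, language_code: str) -> ET.Element:
    """Build a minimal metadata record describing one language resource."""
    record = ET.Element(f"{{{OLAC}}}olac")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    ET.SubElement(record, f"{{{DC}}}date").text = date
    # The subject language is identified by a code rather than an ambiguous name.
    subject = ET.SubElement(record, f"{{{DC}}}subject")
    subject.set(f"{{{XSI}}}type", "olac:language")
    subject.set(f"{{{OLAC}}}code", language_code)
    return record


if __name__ == "__main__":
    # "xxx" is a placeholder, not a real language code.
    rec = build_record("Sample wordlist", "A. Linguist", "2005", "xxx")
    print(ET.tostring(rec, encoding="unicode"))
```

Identifying the language by code instead of by name is what addresses the "language names ambiguous" problem for search.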

Using resources: comparing and finding annotation
• Ontologies developed (as interlanguage between markups and as search aids)
  • GOLD: General Ontology for Linguistic Description (morphosyntax)
  • OPF: Ontology of Phonetic Features (based on Ladefoged & Maddieson)
• ODIN Project: mining interlinear glossed text on the web (Will Lewis et al.)

Using resources: Tools
• Tools to encourage use of the ontology:
  • OntoElan: text annotation (modification of MPI’s Elan)
  • OntoGloss: stand-off annotation tool
  • FIELD: lexical input
• Tool to encourage use of Unicode
  • CharWrite: input of Unicode characters
• Facility to encourage use of OLAC metadata
  • Stylesheet library
  • ORE

Facilitating ‘Best Practices’ in resource creation
• Creation of reference website
  • School of Best Practices in Digital Language Documentation
  • http://emeld.org/school/
• Addressed to the individual linguist who creates language documentation

What should the linguist do?
• To ensure that digital data endure long into the future:
  • Create an archival copy: Put the materials into an enduring file format.
  • Deposit the materials with an archive that will make a practice of periodically migrating them to new storage media as needed.
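A minimal sketch of the "archival copy" step in Python, assuming the working file is plain text in a known legacy encoding and that UTF-8 plain text is an acceptable enduring format for it (an assumption here; the slides mention Unicode but do not name a format). The function name `make_archival_copy` is hypothetical. The SHA-256 checksum is an addition that lets an archive verify the copy after later media migrations.

```python
import hashlib
from pathlib import Path


def make_archival_copy(source: Path, target: Path, source_encoding: str) -> str:
    """Re-encode a legacy text file as UTF-8 and return its SHA-256 checksum."""
    # Decode with the encoding the file was actually created in;
    # a wrong guess raises UnicodeDecodeError or silently garbles characters.
    text = source.read_text(encoding=source_encoding)
    data = text.encode("utf-8")
    target.write_bytes(data)
    return hashlib.sha256(data).hexdigest()
```

The working file stays untouched; only the copy is deposited, and the checksum can be stored alongside it in the resource's metadata.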

Organization of the School
• Entrance Hall: orientation
• Classroom: lessons & tutorials
• Reading Room: bibliography
• Work Room: online work
• Tool Room: links to tools
• Help (incl. Ask an Expert)
• Case Studies: documentation of 10 ELs digitized according to best practices

Currently School has: Documentation from 12 ELs:

Current Initiatives
• Identify and record metadata for legacy documentation
• Improve the ontology (GOLD) – incorporate suggestions from 2005 E-MELD workshop
• Finish prototyped software

Future: finish prototyped software
• OntoElan: ontology-aware modification of MPI’s Elan annotation tool
• OntoGloss: ontology-aware stand-off annotation tool
• CharWrite: downloadable tool for web-input of Unicode characters
• FIELD: Field Input Environment for Linguistic Data
• All but OntoGloss available through the School of Best Practices website

Current Initiatives: School of BP
• Make the School even more practical
• Distinguish between good, better, best practice
• Emphasize
  • Explicit ‘how-to’ pages
  • Different paths for different user types
  • Advice from experts, e.g. “equipment on a budget” page, Ask-An-Expert

Practices in resource creation
• Good practice: ensure preservation
• Better practice: ensure long-term intelligibility
  • “We don’t want to create another Rosetta Stone” - Whalen, 2003
• Best practice: promote interoperability

School of Best Practices in Digital Language Documentation
http://emeld.org/school/

Future Directions
• MultiTree
• LL-MAP

What is MultiTree?
• 3-year grant
• Database of all hypothesized language relations
• Ultimately linked to GIS database
• Interface to allow linguists to input updates
• Panel of experts to assess input

LL-MAP
• Collect geographically linked linguistic data
• Build this into a GIS system, allowing layers of information to be built into a single map
Then…
• Build tools for querying, annotating and discussing this data
• Build tools which allow new language data from linguists and anthropologists to be incorporated into this system
