The E-MELD Project
Helen Aristar Dry & Anthony Aristar
The LINGUIST List
Eastern Michigan U & Wayne State U
University of Texas at Austin

E-MELD
Electronic Metastructure for Endangered Languages Documentation
• 5-year NSF project, 2001-6
• Linguist List, ELF, LDC
• Goal: To aid in
  • …the preservation of endangered languages data
  • …the development of infrastructure for electronic archives

Summary of the problem (2001)
EL resources were/are:
• Difficult to find
• Difficult to use
• Difficult to preserve
Needed:
• More uniformity in naming, cataloguing, annotating, i.e., interoperable standards
• More knowledge of how to create digital resources that last

Problems with EL resources
• Difficult to find
  • At distributed sites
  • Language names ambiguous
  • No central catalog of resources or cataloging information (metadata)
  • Lack of interoperability among archives
• Difficult to display accurately
  • Idiosyncratic character encoding
  • Specific fonts needed

Problems with EL resources, 2
• Difficult to compare
  • Non-standard terminology
  • Idiosyncratic markup & annotation schemes
• Difficult to manipulate or reuse
  • Specific software needed (incl. specific software version), e.g. MSWord 1.0
  • Meaning represented via formatting, which was not documented
    • bold represents “headword”

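The "meaning represented via formatting" problem can be illustrated with a minimal sketch. It assumes a hypothetical legacy export in which bold survives as `<b>...</b>` and each line has the layout "headword, part of speech, gloss"; the sample entry, the element names (`entry`, `headword`, `pos`, `gloss`), and the function name are all illustrative and not part of any E-MELD tool.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical legacy line: bold (exported here as <b>...</b>) silently means "headword".
LEGACY_LINE = "<b>kitab</b> n. book"

def to_explicit_entry(line: str) -> ET.Element:
    """Turn a bold-marked dictionary line into an entry with named fields."""
    match = re.fullmatch(r"<b>(?P<head>.+?)</b>\s+(?P<pos>\S+)\s+(?P<gloss>.+)", line)
    if match is None:
        # Refuse to guess when the line does not follow the assumed layout.
        raise ValueError(f"unrecognized entry layout: {line!r}")
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = match["head"]
    ET.SubElement(entry, "pos").text = match["pos"]
    ET.SubElement(entry, "gloss").text = match["gloss"]
    return entry

entry_xml = ET.tostring(to_explicit_entry(LEGACY_LINE), encoding="unicode")
```

Once the role of each piece is named in the markup, the entry no longer depends on a particular word processor or on undocumented conventions.
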
Problems with EL resources, 3
Impermanent; vulnerable to:
• Deterioration of the physical media
• Hardware obsolescence
• Software obsolescence

PHONOGRAMMARCHIV - AUSTRIAN ACADEMY OF SCIENCES
(slide from Dietrich Schüller, Director)

Toward a Solution: E-MELD Components
• Involve linguistics community in developing standards
• Promote consensus about:
  • Language Identification
  • Metadata
  • Annotation and markup
• Teach and facilitate implementation of “best practices” in the creation of digital language documentation

Promoting consensus: annual workshops
• 2001, Santa Barbara, CA: The Need for Standards
• E-MELD 2002, Ann Arbor, MI: Digitizing Lexical Information
• E-MELD 2003, Lansing, MI: Digitizing Texts
• E-MELD 2004, Detroit, MI: Databases and Best Practice
• E-MELD 2005, Cambridge, MA: Linguistic Ontologies & Terminology

2006 E-MELD Workshop on Digital Language Documentation
• Michigan State University
• June 20-22, 2006
• In conjunction with the 2006 Summer Meeting of the Linguistic Society of America
• Topic: Electronic Archiving and Digital Tools: Current State & Future Directions
Please come!

Finding resources: metadata
• OLAC metadata standards (subcommunity of OAI)
• OLAC search engine on LL site:
  • http://linguistlist.org/olac
• OLAC metadata editor on LL site:
  • http://linguistlist.org/olac/ore
• XSL Stylesheets for transformation / presentation of OLAC metadata
• Ethnologue/LL language codes proposed as ISO standard

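To give a rough idea of what a catalog record in this style might look like, here is a minimal sketch that builds a Dublin Core based record with Python's standard library. The OLAC namespace URI, the `olac:olac` wrapper element, the `olac:language` type with a `code` attribute, the title, the creator, and the placeholder code `xxx` are assumptions for illustration; the current OLAC schema should be consulted before producing real records.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"               # Dublin Core element set
OLAC = "http://www.language-archives.org/OLAC/1.1/"   # assumed OLAC namespace URI
XSI = "http://www.w3.org/2001/XMLSchema-instance"     # XML Schema instance namespace

# Ask ElementTree to serialize with readable prefixes instead of ns0, ns1, ...
for prefix, uri in (("dc", DC), ("olac", OLAC), ("xsi", XSI)):
    ET.register_namespace(prefix, uri)

def make_record(title: str, creator: str, language_code: str) -> ET.Element:
    """Build a small metadata record: title, creator, and subject language."""
    record = ET.Element(f"{{{OLAC}}}olac")
    ET.SubElement(record, f"{{{DC}}}title").text = title
    ET.SubElement(record, f"{{{DC}}}creator").text = creator
    subject = ET.SubElement(record, f"{{{DC}}}subject")
    subject.set(f"{{{XSI}}}type", "olac:language")   # assumed extension type name
    subject.set(f"{{{OLAC}}}code", language_code)    # placeholder language code
    return record

# All values below are hypothetical.
record = make_record("Field notes (hypothetical)", "A. Linguist", "xxx")
record_xml = ET.tostring(record, encoding="unicode")
```

Identifying the language by a shared code rather than by name is what addresses the "language names ambiguous" problem listed earlier.
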
Using resources: comparing and finding annotation
• Ontologies developed (as interlanguage between markups and as search aids)
  • GOLD: General Ontology for Linguistic Description (morphosyntax)
  • OPF: Ontology of Phonetic Features (based on Ladefoged & Maddieson)
• ODIN Project: mining interlinear glossed text on the web (Will Lewis et al.)

Using resources: Tools
• Tools to encourage use of the ontology:
  • OntoElan: text annotation (modification of MPI’s Elan)
  • OntoGloss: stand-off annotation tool
  • FIELD: lexical input
• Tool to encourage use of Unicode
  • CharWrite: input of Unicode characters
• Facility to encourage use of OLAC metadata
  • Stylesheet library
  • ORE

Facilitating ‘Best Practices’ in resource creation
• Creation of reference website
  • School of Best Practices in Digital Language Documentation
  • http://emeld.org/school/
• Addressed to the individual linguist who creates language documentation

What should the linguist do?
• To ensure that digital data endure long into the future:
  • Create an archival copy: Put the materials into an enduring file format.
  • Deposit the materials with an archive that will make a practice of periodically migrating them to new storage media as needed.

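As one possible reading of "create an archival copy", the sketch below transcodes legacy-encoded text into UTF-8 plain text. Treating UTF-8 plain text as the enduring format, normalizing to NFC, and the function name `to_archival_text` are assumptions made for illustration; the slides do not prescribe a specific format here.

```python
import unicodedata

def to_archival_text(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode legacy text as UTF-8 with normalized characters and line ends."""
    # Strict decoding: fail loudly rather than silently corrupting characters.
    text = raw.decode(source_encoding)
    # NFC gives one canonical form for accented and composed characters.
    text = unicodedata.normalize("NFC", text)
    # Use a single line-ending convention in the archival copy.
    text = text.replace("\r\n", "\n")
    return text.encode("utf-8")

# Hypothetical legacy file content, stored as Latin-1 with Windows line ends.
legacy_bytes = "Schüller\r\n".encode("latin-1")
archival_bytes = to_archival_text(legacy_bytes, "latin-1")
```

The source encoding has to be known and recorded; guessing it is exactly the "idiosyncratic character encoding" problem described earlier.
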
Organization of the School
• Entrance Hall: orientation
• Classroom: lessons & tutorials
• Reading Room: bibliography
• Work Room: online work
• Tool Room: links to tools
• Help (incl. Ask an Expert)
• Case Studies: documentation of 10 ELs digitized according to best practices

Currently School has: Documentation from 12 ELs:
Current Initiatives
• Identify and record metadata for legacy documentation
• Improve the ontology (GOLD): incorporate suggestions from 2005 E-MELD workshop
• Finish prototyped software

Future: finish prototyped software
• OntoElan: ontology-aware modification of MPI’s Elan annotation tool
• OntoGloss: ontology-aware stand-off annotation tool
• CharWrite: downloadable tool for web-input of Unicode characters
• FIELD: Field Input Environment for Linguistic Data
• All but OntoGloss available through the School of Best Practices website

Current Initiatives: School of BP
• Make the School even more practical
• Distinguish between good, better, best practice
• Emphasize
  • Explicit ‘how-to’ pages
  • Different paths for different user types
  • Advice from experts, e.g. “equipment on a budget” page, Ask-An-Expert

Practices in resource creation
• Good practice: ensure preservation
• Better practice: ensure long-term intelligibility
  • “We don’t want to create another Rosetta Stone” - Whalen, 2003
• Best practice: promote interoperability

School of Best Practices in Digital Language Documentation
http://emeld.org/school/

Future Directions
• MultiTree
• LL-MAP

What is MultiTree?
• 3-year grant
• Database of all hypothesized language relations
• Ultimately linked to GIS database
• Interface to allow linguists to input updates
• Panel of experts to assess input

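A database of "all hypothesized language relations" has to store competing classifications side by side rather than a single tree. The sketch below shows one way this could be modeled; the class, field names, and all language and family names are placeholders, and nothing here describes MultiTree's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Hypothesis:
    """One proposed classification: a source label plus child -> parent links."""
    source: str
    parent_of: Dict[str, str] = field(default_factory=dict)

    def lineage(self, language: str) -> List[str]:
        """Walk from a language up to the root of this hypothesis."""
        chain = [language]
        while chain[-1] in self.parent_of:
            parent = self.parent_of[chain[-1]]
            if parent in chain:  # guard against accidental cycles in the input
                raise ValueError(f"cycle at {parent!r} in {self.source}")
            chain.append(parent)
        return chain

# Placeholder names only; no real classification is implied.
hypotheses = [
    Hypothesis("Proposal A", {"LangX": "GroupOne", "GroupOne": "FamilyOne"}),
    Hypothesis("Proposal B", {"LangX": "FamilyTwo"}),
]

# The same language can sit in a different place under each proposal.
lineages = {h.source: h.lineage("LangX") for h in hypotheses}
```

Keeping the source label on every hypothesis is what would let a panel of experts assess each submitted classification on its own terms.
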
LL-MAP
• Collect geographically linked linguistic data
• Build this into a GIS system, allowing layers of information to be built into a single map
Then…
• Build tools for querying, annotating and discussing this data
• Build tools which allow new language data from linguists and anthropologists to be incorporated into this system