The Rosetta Project Digital Language Archive Laura Buszard-Welcher The Long Now Foundation / University of California, Berkeley
The Rosetta Project Archive • A public, Web-based, digital archive of language documentation • Part of the National Science Digital Library (NSF program for dissemination of educational STEM resources) • Over 95,000 pages of resources on over 2,300 languages • Over 3,000 wordlists (Swadesh lists, 500-1500 term lists) • New! Audio files
Project Goals: Resources • We are a digital language archive with comprehensive, global scope: we can and do accept digital resources on any language, dialect, family, or subgroup. • We promote linguistic diversity by broadly disseminating resources on languages with small numbers of speakers, contributing to the effort to document and disseminate resources on endangered languages. • Comprehensive scope both requires and builds communities: global networks of linguists, speakers, educators
Project Goals: Interoperability and Resource Discovery • Supporting metadata standardization and interoperability (OLAC participating archive and individuals, E-MELD, GOLD, LSA Conversation on Endangered Language Archiving) • Promoting resource discovery through open archive search: we serve oai_dc, nsdl_dc, olac_dc metadata
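As a rough illustration of what serving several metadata formats looks like to a harvester, the sketch below builds request URLs for each prefix named on the slide. It assumes the metadata is exposed over OAI-PMH, which the prefix names suggest but the slide does not state. The base URL is a placeholder, not the archive's real endpoint, and `list_records_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): the slides do not give the OAI-PMH base URL.
BASE_URL = "http://example.org/oai"

# Metadata formats named on the slide.
PREFIXES = ("oai_dc", "nsdl_dc", "olac_dc")


def list_records_url(prefix, base_url=BASE_URL):
    """Build an OAI-PMH ListRecords request URL for one metadata format."""
    if prefix not in PREFIXES:
        raise ValueError(f"unsupported metadata prefix: {prefix}")
    query = urlencode({"verb": "ListRecords", "metadataPrefix": prefix})
    return f"{base_url}?{query}"


for prefix in PREFIXES:
    print(list_records_url(prefix))
```

A harvester would fetch each URL and parse the XML response; the same records are simply described in three different schemas.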
Project Goals: Developing tools for collaborative linguistic research • Endangered Language Query Room • DOCS (Digital Online Curation Services) • LangGator • Wordlist tool (collaboration with MPI-EVA) • New Rosetta V2.0 Website
Site Infrastructure • Plone 2.1 content management system, running in the Zope Application Server • Open source, leverages worldwide developer communities • Lots of “plug in” modules for functionality expansion • CMF Bibliography AT, Plone Board, etc. • Heavily modified infrastructure (language node design) and user interface
Nodal Architecture • Languages, language families, family subgroups, dialects all represented by nodes. • A node is a content aggregation page • Nodes and parent-child relationships each have unique IDs • The system currently represents Ethnologue language relationships, but has the flexibility to be agnostic about them, represent relationships from various theoretical perspectives
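The nodal design can be sketched as a small data model. This is not the Plone implementation, only a minimal illustration under stated assumptions: the class names `Node` and `Link`, the `scheme` field, and the sample names ("Family A", "scheme_1", and so on) are all invented here. Giving each parent-child relationship its own ID and a scheme label is one way a system could store competing classifications side by side.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)  # one shared counter, so node and link IDs never collide


@dataclass
class Node:
    name: str
    kind: str  # "family", "subgroup", "language", or "dialect"
    node_id: int = field(default_factory=lambda: next(_next_id))


@dataclass
class Link:
    parent_id: int
    child_id: int
    scheme: str  # which classification asserts this relationship (hypothetical field)
    link_id: int = field(default_factory=lambda: next(_next_id))


def children(links, parent_id, scheme):
    """Return the child node IDs of parent_id under one classification scheme."""
    return [l.child_id for l in links
            if l.parent_id == parent_id and l.scheme == scheme]


# Invented sample data: two schemes disagree about where Language X attaches.
family = Node("Family A", "family")
subgroup = Node("Subgroup A1", "subgroup")
language = Node("Language X", "language")

links = [
    Link(family.node_id, subgroup.node_id, "scheme_1"),
    Link(subgroup.node_id, language.node_id, "scheme_1"),
    Link(family.node_id, language.node_id, "scheme_2"),
]
```

Because the tree lives in the links rather than in the nodes, switching classifications means filtering on a different `scheme` value; the nodes and their aggregated content stay untouched.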
Node Pages • Accessible from a variety of browse and search pages • Browse by language name, family, country, data type • Quick search, advanced search • Node page organization • Node metadata • Descriptive Resources • Navigation: classification tree • Links to people functions, LINGUIST List people search • External links: searches
Content • In-house collection, vetting • Primary focus of collection • Rosetta descriptive categories • Special collections • Endangered Language Fund (ELF) Digital Archives • Alan Lomax Audio Collection • Future collections that come in through DOCS • Future development • Uploaded, peer-reviewed resources • Collaborative content areas (bulletin boards, wiki)
Scanning • Historically, the primary focus of in-house collection • Rosetta serves over 95,000 images from a variety of published resources • Excerpts in data categories (see following slides) • Public domain resources can be scanned in their entirety
Resource Pages • Accessed from node pages • Bibliographic metadata • Links to other resources • Resource bundles • Associated resource files • Scanned images • OCR’ed live text files • Annotated text files • Audio/video files • User comments
Community Functions • Goal: build a network of linguists, speakers, educators • People: • Member pages • Regional and language curators • Collaborative content: • Discussions (nodes, resources) • Resource upload • Vetting by volunteer language/family experts • In the future? Wiki documents (unvetted, but resources produced may go through higher vetting levels)
Member Gallery • Central access to member search and browse • Central access to language forums • Highlighted members
Member Profile Page • User-defined content area • List of recent uploads • Lists of recent forum postings
Audio Digitization • Alan Lomax language audio collection (mostly reel-to-reel, some cassette) • Edirol external digitizer (96 kHz sample rate, 24 bit depth) • Sound Forge 7.0, uncompressed .wav • Now accepting audio deposits (on a limited basis) • We archive and serve digital resources, not physical media
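The storage cost of these settings follows from simple arithmetic: bytes = seconds × sample rate × bytes per sample × channels. The sketch below works that out for the 96 kHz, 24-bit figures on the slide. The channel count is an assumption (the slide does not say whether recordings are mono or stereo), `pcm_bytes` is a hypothetical helper, and the small WAV file header is ignored.

```python
def pcm_bytes(seconds, sample_rate=96_000, bit_depth=24, channels=2):
    """Size in bytes of uncompressed PCM audio, excluding the file header."""
    bytes_per_sample = bit_depth // 8
    return seconds * sample_rate * bytes_per_sample * channels


one_hour = 3600
print(pcm_bytes(one_hour, channels=1))  # mono
print(pcm_bytes(one_hour, channels=2))  # stereo
```

At these settings an hour of audio comes to roughly 1 GB in mono and roughly 2 GB in stereo.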
Rosetta Depositor Consent Form • Prompted by special collections (ELF, Alan Lomax Audio) • Intended to work on paper, or in digital form • Inspired by AILLA’s graded access system • Encourages depositors to see archiving as a kind of publication: assumes dissemination of some or all of resources • “In general, we encourage all depositors to make their resources freely available, and to consider archiving with us as a form of publication. If you feel the need to place an extreme form of restriction on the resource, then our project may not be the most suitable place to archive your resource. We reserve the right to archive only those resources that we deem appropriate to our project, with respect to both content and access.”
Level 1: Open access to recordings Users have full access to recordings after agreeing to our Terms and Conditions. For this level, we assume that depositors have already gained permission for public access from the speakers or authors of the resource. Level 1 access may be applied to the entire deposit, or to parts of the deposit. If portions of the deposit are to be restricted, attach a detailed description that clearly identifies them, and designate one of the following access levels (2-5) for each restricted portion.
Level 2: Access limited by password Users may access recordings only if they know a password that you create. This type of access allows you to keep resources private, or provide access to others by sharing the password with them. Access limited by passwords must be renegotiated with The Rosetta Project every five years, at which time depositors may continue use of a previous password, choose a new password, or select another access level (Rosetta will contact the depositor at the appropriate time). If not renegotiated, access to the resource changes to open access (Level 1).
Level 3: Access protected by a time limit Users may not access the resources until after a specified date. Although we encourage all depositors to make their resources freely available, we understand that some depositors may want to restrict access to resources for a few years (normally five or less) while preparing a publication, such as a dissertation. After the date you specify, access to the resource changes to open access (Level 1).
Levels 4 and 5: Designated Controllers Level 4. The depositor controls access to the resource. The Rosetta Project will provide contact information, and the user will have to contact the depositor directly for permission, and the depositor then will write to The Rosetta Project. If permission is granted, The Rosetta Project will give the user access to the resource. Level 5. The depositor designates another person or organization to control the resource. The Rosetta Project will contact the controller on the user’s behalf. If permission is granted, The Rosetta Project will give the user access to the resource (please attach controller’s contact information).
Depositor/Controller Responsibilities Note: for Levels 2, 3, 4, and 5, the depositor must ensure that the appropriate contact information is up to date. If contact information is not up to date, or documented good faith attempts made by the Rosetta archive or its users to obtain access are not answered, then determinations of permission to access and use the resource revert to the curator of the archive.
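The way Levels 2 and 3 fall back to open access over time can be sketched as a small function. This is an illustration of the rules as worded on the slides, not the archive's software: `effective_level`, `add_years`, and the parameter names are hypothetical, and treating the deadline day itself as still restricted is an assumption.

```python
from datetime import date


def add_years(d, years):
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def effective_level(level, today, last_agreed=None, open_after=None):
    """Return the access level in force on `today`.

    level       -- level chosen by the depositor (1-5)
    last_agreed -- Level 2: date the password terms were last (re)negotiated
    open_after  -- Level 3: date after which the resource opens
    """
    # Level 2 lapses to open access if not renegotiated within five years.
    if level == 2 and today > add_years(last_agreed, 5):
        return 1
    # Level 3 becomes open access after the depositor's chosen date.
    if level == 3 and today > open_after:
        return 1
    # Levels 1, 4, and 5 do not change with time alone.
    return level


print(effective_level(2, date(2012, 1, 1), last_agreed=date(2006, 6, 1)))
print(effective_level(3, date(2008, 1, 1), open_after=date(2010, 1, 1)))
```

The sketch leaves out the case where a depositor or controller cannot be reached and the decision passes to the archive's curator, since that depends on documented contact attempts rather than on dates.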
The Archivist in the Driver’s Seat • Archiving and serving digital resources is a valuable (and expensive) service • Some archives also provide digitization services • For these reasons, archives can be expected to set conditions on what they will archive • Rosetta’s consent forms are intended to ensure that: • The majority of our resources are publicly accessible on the Web (all are available for listening in person) • The archivist is never at the mercy of extreme access restrictions • All access conditions work toward open access (Level 1)
URLs • Electronic Metastructure for Endangered Language Data (E-MELD) http://www.emeld.org (School of Best Practice, FIELD Tool). • Endangered Language Query Rooms http://rosettaproject.org:8080/emeldbase/. • The Ethnologue http://www.ethnologue.com. • General Ontology for Linguistic Description (GOLD) http://www.linguistics-ontology.org or http://emeld.org/school/workroom/terminology/ • LINGUIST List http://www.linguistlist.org • National Science Digital Library (NSDL) http://nsdl.org • ODIN www.csufresno.edu/odin • Open Language Archives Community (OLAC) http://www.language-archives.org. • The Rosetta Project, http://www.rosettaproject.org/live. A preview of the new Web site is available at http://preview.rosettaproject.org.