
Archiving

Archiving. David Nathan ELDP Training Workshop March 2010. Archiving: what do you think of?. What is a language archive, then?. What is a digital language archive?. a forum / platform for data providers and data users to negotiate and exchange




Presentation Transcript


  1. Archiving David Nathan ELDP Training Workshop March 2010

  2. Archiving: what do you think of?

  3. What is a language archive, then?

  4. What is a digital language archive? • a forum / platform for data providers and data users to negotiate and exchange • a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material • has policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats • a collection of managed materials

  5. OAIS model • OAIS archives define three types of ‘packages’: ingestion, archive, dissemination • [Diagram: Producers → Ingestion → Archive → Dissemination → Designated communities]
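
The flow on this slide can be sketched in code. The following is a minimal illustration, not any archive's actual implementation. The OAIS reference model calls the three package types the Submission, Archival, and Dissemination Information Packages (SIP, AIP, DIP); those names come from the standard rather than the slide, and every class, field, and function name below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubmissionPackage:
    """SIP: what a producer (e.g. a depositor) hands to the archive."""
    producer: str
    files: List[str]
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class ArchivalPackage:
    """AIP: what the archive stores and preserves for the long term."""
    identifier: str
    files: List[str]
    metadata: Dict[str, str]


@dataclass
class DisseminationPackage:
    """DIP: what a member of a designated community receives."""
    identifier: str
    files: List[str]


def ingest(sip: SubmissionPackage, identifier: str) -> ArchivalPackage:
    # Ingestion: assign an archive identifier and record who deposited.
    metadata = dict(sip.metadata, producer=sip.producer)
    return ArchivalPackage(identifier, list(sip.files), metadata)


def disseminate(aip: ArchivalPackage, allowed_suffixes: List[str]) -> DisseminationPackage:
    # Dissemination: deliver only the files whose names end in an allowed suffix.
    files = [f for f in aip.files if f.endswith(tuple(allowed_suffixes))]
    return DisseminationPackage(aip.identifier, files)


sip = SubmissionPackage("depositor", ["a.wav", "a.eaf", "notes.doc"])
aip = ingest(sip, "dep-001")
dip = disseminate(aip, [".wav"])
print(dip)
```

Keeping the three packages as separate types reflects the point of the model: what is deposited, what is preserved, and what is delivered need not be identical.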

  6. What is archiving of language materials? • preparing materials in a structured, well-documented, and complete form • building long-term relationships • it is not just backup • it is not just dissemination/publication • it does not define good linguistic practice

  7. What can a language archive offer? • Security - keep your electronic materials safe • Preservation - store your materials for the long term • Discovery - help others to find out about your materials, and you to find out about users • Protocols - respect and implement sensitivities, restrictions • Sharing - share results of your work, if appropriate • Acknowledgement - create citable acknowledgement • Mobilisation - create usable language materials for communities • Quality and standards - advice for assuring your materials are of the highest quality and robust standards

  8. Kinds of language archives • many cross-cutting classifications: • Indigenous and local, eg. Squamish Nation, “language centres” • regional, eg. AILLA, Paradisec • international, eg. DoBeS, ELAR • associated with research institute, eg. AIATSIS, ANLC • grant-driven deposits, eg. DoBeS, ELAR • digital vs physical vs mixed, eg. DoBeS vs Vienna Sound Archive, ANLC

  9. Potential users • depositors – deposit, access or update materials • speakers and their descendants (“majority of users of Berkeley Language Center archive are community members”) • other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc etc • other “stakeholders”, eg educationalists • journalists and the wider public

  10. Archives networks and bodies • foundation concepts and technologies from • library initiatives, eg. D-LIB http://www.dlib.org/ • OAI (Open Archives Initiative) • OAIS (Open Archival Information System; NASA and space agencies incl JAXA) • Open Language Archives Community (OLAC) • Digital Endangered Languages and Archives Network (DELAMAN) • ELAR, DOBES, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori)

  11. Archives networks and bodies • DELAMAN’s interests and activities include: • language archiving training coordination and syllabus • citation of deposits (for academic recognition of deposited corpora) • archive federations (for seamless access to resources across archives)

  12. Citation examples • Courtesy Heidi Johnson of AILLA Collection: Sherzer, Joel. "Kuna Collection." The Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Media: audio, text, image. Access: 0% restricted. File/resource: Sherzer, Joel (Researcher). (1970). "Report of a curing specialist." Kuna Collection. Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Type: transcription&translation. Media: text. Access: public. Resource ID: CUK001R001.
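
As a rough sketch of how a file/resource citation like the one above could be assembled from catalogue fields: the field order follows the AILLA example on this slide, but the `Resource` class, its field names, and the `cite` function are hypothetical and not part of any archive's software.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    creator: str
    role: str
    year: int
    title: str
    collection: str
    archive: str
    archive_url: str
    resource_type: str
    media: str
    access: str
    resource_id: str


def cite(r: Resource) -> str:
    # Assemble the fields in the order used by the slide's file/resource example.
    return (
        f'{r.creator} ({r.role}). ({r.year}). "{r.title}." {r.collection}. '
        f"{r.archive}: {r.archive_url}. Type: {r.resource_type}. "
        f"Media: {r.media}. Access: {r.access}. Resource ID: {r.resource_id}."
    )


example = Resource(
    creator="Sherzer, Joel",
    role="Researcher",
    year=1970,
    title="Report of a curing specialist",
    collection="Kuna Collection",
    archive="Archive of the Indigenous Languages of Latin America",
    archive_url="www.ailla.utexas.org",
    resource_type="transcription&translation",
    media="text",
    access="public",
    resource_id="CUK001R001",
)
print(cite(example))
```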

  13. Why is language archiving different? • what is a language? • the data is not conventionalised (like $, age, year of publication etc) – what and how to code? • varying and competing expectations

  14. And endangered languages archiving? • extremely diverse context – languages, cultures, communities, individuals, projects • typical source is fieldworkers • no established genres • difficult for archive staff to manage • sensitivities and restrictions • extremely high priority

  15. Endangered Languages ARchive (ELAR) • one of 3 semi-autonomous programs of the Hans Rausing Endangered Languages Project • staff of 3: archivist, software developer, technician (plus research assistants etc) • develop policies, preservation infrastructure, cataloguing and dissemination, facilities, training, advice, materials development and publishing

  16. ELAR’s holdings • ELAR currently holds about 50 deposits with a total volume of approx 4 TB. • the average deposit is about 80 GB • sizes vary widely, with a small number of huge deposits. The median size is around 15GB • we expect volume to nearly double over the next 18 months • see next slides for distribution of data types
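
The figures on this slide can be checked with a little arithmetic: about 4 TB across about 50 deposits gives a mean of roughly 80 GB, while a few huge deposits pull that mean far above the median. The sketch below assumes 1 TB = 1000 GB, and the list of deposit sizes is invented purely to illustrate the skew; it is not ELAR's actual data.

```python
from statistics import mean, median

TOTAL_GB = 4000  # approx 4 TB, assuming 1 TB = 1000 GB
DEPOSITS = 50

# Mean deposit size implied by the slide's totals.
print(TOTAL_GB / DEPOSITS)

# Hypothetical distribution: many small deposits plus three huge ones.
sizes_gb = [10] * 25 + [20] * 22 + [1000, 1100, 1210]

print(len(sizes_gb), sum(sizes_gb))  # same count and total as above
print(mean(sizes_gb))                # pulled up by the three huge deposits
print(median(sizes_gb))              # reflects the typical deposit
```

With a distribution this skewed, the median is the better guide to what a typical deposit looks like, and the mean is the better guide to storage planning.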

  17. ELAR holdings by data type • data types for a 25% sample of holdings (early 2008) • data type by volume (MB) and number of files, sorted by volume

  18. The way we were ... ASEDA • Aboriginal Studies Electronic Data Archive, AIATSIS Canberra, founded early 1990s (modelled on Oxford Text Archive) • receive and catalogue electronic materials that were at risk or not accessible • lexica • grammars • texts

  19. How things have changed .. • types of data (modalities and genres) • now predominantly media / documentation • storage methods • now “professional”, mass data systems • standardisation and metadata • now various standards for data and metadata • dissemination • now web-based dissemination • expanded influence into practice and workflow of linguists

  20. Why digital? • preservation: digitisation is the only way that media (audio and video) can be preserved for the future • because it can be copied and transmitted with zero loss • cataloguing, sharing, dissemination all facilitated

  21. Digital disadvantages • digital data is fragile and ephemeral • cost (human, equipment, maintenance) • requires strategy and luck to get infrastructure right • preservation depends on file and data formats • depends on tools and software • depends on formats (prefer standard, open, explicit, long-lasting) • materials may have to be converted and migrated • some formats require particular software (can we archive the software?)

  22. These issues impact on archive policy • how to balance cost of handling and preservation with value of materials? • how to provide long-term preservation when our funding is time-limited?

  23. The archiving process (depositors’ view)

  24. Documenter and archive interactions • grant formulation and application • communications, questions, advice • training • archiving services (transfer, conversion etc) • ongoing management of materials

  25. Documenter & archive interactions

  26. Query/interaction topics • analysis of approx 150 queries from documenters/linguists

  27. ELAR Feedback template ELAR Data Sample Evaluation Prepared for: By: Date: TEXT - xx file Document type Document format/layout/data structures Character/language representation Linking/references Consistency

  28. ELAR Feedback template AUDIO Document type/format Resolution Quality Editing Length Annotation/transcription Consistency

  29. ELAR Feedback template VIDEO Document type/format Resolution Quality Editing Length Annotation/transcription Consistency

  30. ELAR Feedback template GENERAL File naming Data volume Delivery Consistency

  31. Example detail (section: Document format) Use of typography (size, underlining, bold, spaces etc) to make headings and other structures is weak - at least Styles should be used (with complete consistency). Using tables to represent interlinear data is reasonably appropriate, although they would need to be converted later. Is it clear from this document, or somewhere else, where to look up codes etc, such as the speaker initials? While the language is consistently labelled in the interlinear section, it is identified only by the alternation in font in the first section.

  32. Example detail (section: Audio quality) AD-MD03a 4Noe Song thami miya.wav - quality good. AD-MD04b 33Boa Sr. LongNarrativeOnTsunami.wav - quality reasonable, but background hiss is too loud in proportion to the signal. Was this part of your original recording (on what equipment?) or was it introduced by digitisation, in which case it would be a good idea to try re-digitising? AD-MD05b 34Peje Phonetic Variation.wav - quality quite good. Stereo separation of voices is nice. CIILQ Seasons Contd 699-703.wav - suffers a number of faults, including severe clipping (overmodulation), background noise, microphone physical handling, and poor acoustic representation (probably due to poor microphone and/or recorder?).
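
One of the faults named above, clipping (overmodulation), lends itself to a simple automated check. The sketch below is a crude heuristic under stated assumptions (16-bit PCM samples already read into a list of integers); it is not how any particular evaluation tool works, and all function names are hypothetical.

```python
import math

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def clipped_fraction(samples, full_scale=FULL_SCALE):
    """Return the fraction of samples sitting at or beyond full scale."""
    if not samples:
        return 0.0
    clipped = sum(1 for s in samples if abs(s) >= full_scale)
    return clipped / len(samples)


def sine(amplitude, n=1000, cycles=10):
    """Generate a test tone as a list of integer samples."""
    return [int(amplitude * math.sin(2 * math.pi * cycles * i / n)) for i in range(n)]


def hard_clip(samples, full_scale=FULL_SCALE):
    """Simulate a recorder that flattens everything beyond full scale."""
    return [max(-full_scale, min(full_scale, s)) for s in samples]


clean = sine(16000)             # comfortably below full scale
hot = hard_clip(sine(60000))    # recorded far too loud, peaks flattened

print(clipped_fraction(clean))
print(clipped_fraction(hot))
```

A non-trivial fraction of full-scale samples suggests the recording level was too high. The other faults listed, such as handling noise and poor acoustic representation, still need a listener.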

  33. Audio evaluation using Dobbin • software from Cube-Tec who make Quadriga • audio evaluation, conversion and reporting

  34. Dobbin

  35. Dobbin

  36. Dobbin

  37. Dobbin

  38. Dobbin

  39. Dobbin

  40. What can you archive (at ELAR)? • media - sound, video • graphics - images, scans • text - fieldnotes, grammars, description, analysis • structured data - aligned and annotated transcriptions, databases, lexica • metadata - structured, standardised contextual information about the materials

  41. Archive objects • an “object” could be a file, a set of files, a directory, a “session” or a set of files with relationships between them • these are often called “bundles” • like all structures, these should be made explicit • eg through metadata • our new catalogue system will provide a facility to create and label bundles
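
To illustrate what making a bundle "explicit through metadata" could look like, here is a minimal sketch. The schema (a label, a file list, and named relations between files) and all names are assumptions for illustration only and do not describe ELAR's catalogue system.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Relation:
    source: str
    kind: str    # e.g. "transcribes", "translates"
    target: str


@dataclass
class Bundle:
    label: str
    files: List[str]
    relations: List[Relation] = field(default_factory=list)

    def relate(self, source: str, kind: str, target: str) -> None:
        # Only allow relations between files that belong to this bundle.
        for name in (source, target):
            if name not in self.files:
                raise ValueError(f"{name!r} is not part of bundle {self.label!r}")
        self.relations.append(Relation(source, kind, target))

    def to_json(self) -> str:
        # Serialise the bundle so its structure travels with the files.
        return json.dumps(asdict(self), indent=2)


session = Bundle("session-01", ["s01.wav", "s01.eaf", "s01_notes.txt"])
session.relate("s01.eaf", "transcribes", "s01.wav")
print(session.to_json())
```

Writing the relations down, rather than leaving them implied by file names or directory layout, is what lets someone other than the depositor reconstruct the bundle later.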

  42. Data “portability” (Bird & Simons 2003) • data should also be “portable” (Bird & Simons “Seven Dimensions ...”) • complete • explicit • documented • preservable • transferable • accessible • adaptable • not technology-specific • (also appropriate, accurate, useful etc!!)

  43. Archive material should be selected • example: Depositor’s question: How much video can I archive? • answer: ... • however, • unlikely that linguist is in position to plan and consistently create excellent video, so selection is unavoidable • data has always been edited and selected!

  44. (... selection) • in your linguistic work you also: • selected • labeled • transformed/processed/edited • added, corrected, expanded • made links • made or assumed relationships between “whole” and processed units; invented labels, IDs, scope etc • imposed formats
