Archiving LingDy 16 Feb 2012 TUFS, Tokyo David Nathan Endangered Languages Archive Hans Rausing Endangered Languages Project SOAS, University of London
What is a digital language archive? • a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material • has policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats • a platform for building and conducting relationships between data providers and data users
Why is language archiving different? • what is a language? • the data is not conventionalised (like $, age, year of publication etc) – what and how to code? • varying and competing expectations
And endangered languages archiving? • extremely diverse context – languages, cultures, communities, individuals, projects • typical source - fieldworkers • typical materials - documentation • difficult for archive staff to manage • sensitivities and restrictions • extremely high priority
What can a language archive offer? • Security - keep your electronic materials safe • Preservation - store your materials for the long term • Discovery - help others to find out about your materials, and you to find out about users • Protocols - respect and implement sensitivities, restrictions • Sharing - share results of your work, if appropriate • Acknowledgement - create citable acknowledgement • Mobilisation - create usable language materials • Quality and standards - advice for assuring your materials are of the highest quality and robust standards
Different kinds of language archives • different contexts, systems, methods, collection policies • you should consider placing your materials in more than one …
Why digital? • preservation: digitisation is the only way that audio and video (non-symbolic material) can be preserved for the future … because it can be copied and transmitted with zero loss • cataloguing, sharing, dissemination, repurposing
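The "zero loss" property can be checked rather than assumed: a copy is bit-identical to its source when both produce the same checksum. The following is a minimal sketch, assuming SHA-256 as the checksum (the slides do not name one); the function names are illustrative only.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_lossless_copy(original, copy):
    """True when both files have byte-identical content (same digest)."""
    return sha256_of(original) == sha256_of(copy)
```

Reading in chunks keeps memory use flat, which matters for large audio and video files.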
Digital disadvantages • digital data is fragile and ephemeral • cost (human, equipment, maintenance) • requires strategy and luck to get right • preservation depends on file and data formats • depends on tools and software • depends on formats (prefer standard, open, explicit, long-lasting) • materials may have to be converted and migrated • some formats require particular software (can we archive the software?)
What is archiving of language materials? • preparing materials • selecting • structuring • suitable encodings and formats • well-documented • depositing them in a suitable archive(s) • curation and accession by the archive • ongoing management, dissemination • new focus on form, presentation and user interaction/feedback
Users and potential users • depositors – deposit, access or update materials • speakers and their descendants (“majority of users of Berkeley Language Center archive are community members”) • other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc etc • other “stakeholders”, eg educationalists • journalists and the wider public
Archives networks and bodies • foundation concepts and technologies from • library initiatives, e.g. D-LIB http://www.dlib.org/ • OAI (Open Archives Initiative) • OAIS (Open Archival Information System; NASA and space agencies incl. JAXA) • Open Language Archives Community (OLAC) • Digital Endangered Languages and Musics Archives Network (DELAMAN) • ELAR, DOBES, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori)
Citation examples • from Heidi Johnson of AILLA • Collection: Sherzer, Joel. "Kuna Collection." The Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Media: audio, text, image. Access: 0% restricted. • File/resource: Sherzer, Joel (Researcher). (1970). "Report of a curing specialist." Kuna Collection. Archive of the Indigenous Languages of Latin America: www.ailla.utexas.org. Type: transcription & translation. Media: text. Access: public. Resource ID: CUK001R001.
Endangered Languages ARchive (ELAR) • one of 3 programs of the Hans Rausing Endangered Languages Project • develop policies, preservation infrastructure, cataloguing and dissemination, facilities, training, advice, materials development and publishing
ELAR facts and figures • archived collections: 110 • online (published) collections: 50 • average collection size about 60 GB • online data bundles: 9523 • total number of files held: around 200,000 • total volume of files held: around 10 TB • online data bundles unrestricted access: 5298 • registered users: >500 • annual downloads: >1,000 • annual number of website "hits": 230,000
ELAR facts and figures – user accounts • increasing number of community members, including Aleut (Canada), Tai-Ahom, Wadar (India), Burushaski (Pakistan), Serrano, Cahuilla, Arapaho (USA), Iraqi Jewish (Iraq), Saami (Finland), Wabena (Tanzania), Torwali (Pakistan), Hani, Bai (China), Irish • comments: “I found your site while looking up my grandmother, and i found her on your site speaking our language. and i would love for my children her great grandchildren to hear our language coming from her.” • many interdisciplinary researchers, particularly archivists and anthropologists
Archiving and data management • most data-related issues are really part of linguistic data/corpus management • there are now few data-related issues that are archive-specific • metadata formats • video • presentation/exhibition of material
What can you archive (at ELAR)? • media - sound, video • graphics - images, scans • texts - fieldnotes, grammars, description, analysis • structured data - aligned and annotated transcriptions, databases, lexica • metadata - contextual information about the materials, structured and unstructured
Archive objects • an “object” could be a file, a set of files, a directory, or a set of files with their relationships explicitly defined • these are often called “sessions” or “bundles” • they should be made explicit • through metadata • our future catalogue system will provide the ability for depositors to directly create, label and update bundles • See bundles at ELAR
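One way to make bundles explicit is to derive them from filenames and then record the result in metadata. The sketch below assumes a naming convention that the slides do not state: files belonging together share a filename stem (for example `rec001.wav` and `rec001.eaf`). The function name and the example filenames are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePath


def group_into_bundles(filenames):
    """Group filenames by stem, so rec001.wav and rec001.eaf share a bundle.

    Assumption: files that belong together share the same stem.
    """
    bundles = defaultdict(list)
    for name in filenames:
        bundles[PurePath(name).stem].append(name)
    return {bundle_id: sorted(files) for bundle_id, files in bundles.items()}
```

The resulting mapping (bundle ID to file list) can be pasted into a metadata sheet, where any grouping that the naming convention gets wrong can be corrected by hand.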
Archive material should be selected • example: Depositor’s question: How much video can I archive? • answer: ...
What is required to make a deposit? • resource(s) for an endangered language • it could be just one file • inventory / metadata • deposit form view • existing deposits can also be updated, added to, and metadata added/modified
How can I deliver data? • hard disks • we return them • we send them out • email • good for samples for evaluation • OK for most text materials • Dropbox etc • flash cards and USB sticks • a web upload facility may be provided one day • we download from your server
What about CDs and DVDs? • we have found CDs, and especially DVDs, to be very unreliable • DVD fail rate > 10% • cause confusion as files are allocated to fit on disks, not according to corpus structure • create a lot of work for depositors and for ELAR
Protocol • the sensitivities and access restrictions associated with EL resources • need to be discussed, collected and recorded in the field • global protocol (the overall, typical value) is entered into the deposit form • specific protocol (for files, bundles) is entered via metadata (or any other explicit way)
Protocol and access control • principles: • granularity – file, bundle or collection • access is a relation between object and user • protocol values can be changed over time • ELAR’s URCS system • User • Researcher • Community member • Subscriber
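The three principles can be illustrated with a small model. This is only a sketch under stated assumptions, not a description of how ELAR's system works: it assumes a protocol is stored as a set of permitted roles per object ID, and that the most specific level (file, then bundle, then collection) wins. All names and IDs are hypothetical.

```python
def allowed_roles(protocol, file_id, bundle_id, collection_id):
    """Return the roles permitted for a file, using the most specific entry.

    Assumption: a file-level protocol overrides a bundle-level one, which
    overrides the collection-level (global) one.
    """
    for object_id in (file_id, bundle_id, collection_id):
        if object_id in protocol:
            return protocol[object_id]
    return set()  # assumption: no recorded protocol means no access


def can_access(protocol, user_roles, file_id, bundle_id, collection_id):
    """Access is a relation between an object and a user's roles."""
    permitted = allowed_roles(protocol, file_id, bundle_id, collection_id)
    return bool(permitted & set(user_roles))
```

Because the protocol is plain data, changing a value over time is just an update to the mapping; nothing about the files themselves has to change.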
“I have images” • what kinds of images? • what are their sources? • what is their documentation value? what role do they play in the collection? • … these should be reflected in the data structures/metadata
Metadata for images • at least captions • what else? • … • … • … • … • in what form? • narrative • tabular fields • keywords
Integrating images into metadata • get a list of image files • command (DOS) window • in directory • type “dir > list.txt” • open text file (in Notepad++ or MS Word) • change font to Courier • get a “vertical selection” • (or use a file listing utility!) • paste into spreadsheet
Integrating images into metadata • make a new sheet for images • paste in image file list (see previous) • add an ID column • type “1” in first cell • select from first to last cell in ID column • Edit>Fill>Series>OK • add other columns • now you can refer to your images anywhere!
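The manual steps (list the files, paste them into a sheet, fill an ID series) can also be scripted. The following is a minimal sketch, assuming the images sit in a single directory and that a CSV file is an acceptable starting point for the spreadsheet; the set of image extensions, the column names and the function names are assumptions, not ELAR requirements.

```python
import csv
from pathlib import Path

# Assumption: these suffixes cover the image formats in the collection.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".gif", ".bmp"}


def list_images(directory):
    """Return the sorted names of image files directly inside a directory."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def write_image_sheet(directory, csv_path):
    """Write an id/filename/caption sheet as CSV; return the number of images."""
    names = list_images(directory)
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "filename", "caption"])
        for image_id, name in enumerate(names, start=1):
            writer.writerow([image_id, name, ""])  # caption filled in by hand
    return len(names)
```

Opening the resulting CSV in a spreadsheet gives the same ID and filename columns as the manual method, ready for captions and other columns to be added.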
Using a spreadsheet to access data • you can turn a filename into a link to access files directly from a spreadsheet • have the filename in cells • use the formula =HYPERLINK(file, "Message") • examples • =HYPERLINK("E:\archiving\images\"&A2, "click here") • =HYPERLINK(A1&A2, "click here") • =HYPERLINK(A1&A2, A2)
My cells have multiple values! • example: keywords • this is probably OK, as keywords are atomic • just consistently use a suitable delimiter • e.g. use comma - if data values cannot have commas • ELAR recommends double pipe “||”
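A consistently delimited cell is easy to process later. The sketch below uses the double pipe recommended on the slide; the function names are illustrative, and the whitespace trimming is an added assumption.

```python
DELIMITER = "||"


def split_values(cell):
    """Split a multi-valued cell into a list, trimming stray whitespace."""
    return [value.strip() for value in cell.split(DELIMITER) if value.strip()]


def join_values(values):
    """Join values into one cell, refusing values that contain the delimiter."""
    for value in values:
        if DELIMITER in value:
            raise ValueError(f"value contains the delimiter: {value!r}")
    return DELIMITER.join(values)
```

The check in `join_values` enforces the condition stated on the slide: a delimiter is only safe if it can never appear inside a data value.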
My cells have multiple values! • example: speakers in a recording • speakers are probably not atomic – they have other attributes • create a separate “speakers” sheet • give each speaker an ID (number or initials) • use the IDs in the original sheet, with delimiter (implements one to many) • (advanced) or make another sheet to associate recordings with speakers (implements many to many)
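The two layouts can be sketched with plain dictionaries standing in for sheets. Everything here is hypothetical: the column names `recording` and `speakers`, the delimiter default, and the function names are assumptions made for illustration.

```python
def speakers_for(recording_row, speakers, delimiter="||"):
    """Resolve the speaker IDs in a recording's cell against the speakers sheet.

    An ID that is missing from the speakers sheet raises KeyError, which
    flags an inconsistency between the two sheets.
    """
    ids = [s.strip() for s in recording_row["speakers"].split(delimiter) if s.strip()]
    return [speakers[speaker_id] for speaker_id in ids]


def link_rows(recordings, delimiter="||"):
    """Expand delimited cells into (recording, speaker ID) pairs.

    The pairs correspond to the separate association sheet (many to many).
    """
    return [
        (row["recording"], speaker_id.strip())
        for row in recordings
        for speaker_id in row["speakers"].split(delimiter)
        if speaker_id.strip()
    ]
```

With the delimited cell, each recording row carries its own list of IDs; with the link sheet, each pairing gets its own row, which leaves room for attributes of the pairing itself (such as the speaker's role in that recording).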
Expressing “Relation” in spreadsheets • one column is usually insufficient • a “relationship” has 2 parts • the target of the relationship • description of the relationship • how would this work for images?
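For images, one possible answer is a pair of columns per row: the ID of the object the image relates to, and a short description of how. The sketch below checks that every target actually exists; the column names `target` and `description` and the example IDs are hypothetical.

```python
def dangling_relations(rows, known_ids):
    """Return the rows whose relation target is not a known object ID."""
    return [row for row in rows if row["target"] not in known_ids]
```

A row such as `{"id": "img003", "target": "rec001", "description": "photo taken during the recording"}` then records both parts of the relationship, and the check catches targets that were mistyped or never deposited.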
How can I tell if it’s Unicode? • use a browser or Notepad++ • paste text in • examine the encoding (before and after)
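The same check can be done programmatically. This sketch assumes that "Unicode" here means UTF-8 in practice, which the slide does not state; the function name is illustrative.

```python
def looks_like_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8."""
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Two caveats apply: plain ASCII files also pass, since ASCII is a subset of UTF-8, and a clean decode does not prove the text displays as intended, so a visual check of special characters is still worthwhile.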
Can I still use MS Word? • ELAR no longer accepts MS Word files • but Word is still useful • quicker to type up • useful tables, functions, macros etc • solutions • think “text only” • tables as spreadsheets (are they bad too?) • (advanced) complex materials formatted as styles, then export as marked up • PDF/A – but not a perfect solution