Archiving. David Nathan Endangered Languages Archive Hans Rausing Endangered Languages Project SOAS, University of London. Topics. Introducing ELAR and digital language archives Preservation Archive interactions with documentation What and how to archive Protocol Metadata
Archiving David Nathan Endangered Languages Archive Hans Rausing Endangered Languages Project SOAS, University of London
Topics • Introducing ELAR and digital language archives • Preservation • Archive interactions with documentation • What and how to archive • Protocol • Metadata • Evaluation of audio • Archives and revitalisation • Archivism : mobilisation • Video • Conclusions
Endangered Languages ARchive (ELAR) • one of 3 semi-autonomous programs of the Hans Rausing Endangered Languages Project • staff of 3: archivist, software developer, technician (plus research assistants etc) • develops preservation infrastructure, cataloguing and dissemination; policies; facilities; training and advice; materials development and publishing
What is a digital language archive? • a trusted repository created and maintained by an institution with a commitment to the long-term preservation of archived material • will have policies and processes for materials acquisition, cataloguing, preservation, dissemination, migration to new digital formats • a collection of managed materials
What is archiving of language materials? • preparing materials in a structured form suitable for long-term preservation • creating long-term relationships • it is not backup • it is not dissemination/publication • it should not impinge on good linguistic practice
What can a language archive offer? • Security - keep your electronic materials safe • Preservation - store your materials for the long term • Discovery - help others to find out about your materials • Protocols - respect and implement sensitivities, restrictions • Sharing - share results of your work, if appropriate • Acknowledgement - create citable acknowledgement • Mobilisation - create usable language materials for communities • Quality and standards - advice for assuring your materials are of the highest quality and robust standards
Kinds of language archives • many cross-cutting classifications: • Indigenous vs outsider, eg. Squamish Nation • regional vs international, eg. AILLA, Paradisec; DoBeS, ELAR • associated with research institute, eg. AIATSIS, ANLC • granter-funded, eg. DoBeS, ELAR, OTA • digital vs physical vs mixed, eg. DoBeS vs Vienna Sound Archive, ANLC
Potential users • speakers and their descendants - up to 95% of users of UCB are community members • depositors - to create or renew materials • other researchers - comparative/historical linguists, typologists, theoreticians, anthropologists, historians, musicologists etc etc • other “stakeholders”, eg educationalists • journalists and the wider public
Archives networks and bodies • Digital Endangered Languages and Archives Network (DELAMAN) • ELAR, DOBES, ANLC, Paradisec, EMELD, LACITO, AIATSIS, AMPM (Maori) • Open Language Archives Community (OLAC) • others, eg. D-LIB • http://www.dlib.org/ • Open Archives Initiative
Digital archive architectures • OAIS archives define three types of ‘packages’: ingestion, archive, dissemination • (diagram labels: Producers, Ingestion, Archive, Dissemination, Designated communities)
‘Live Archives’ - architecture • Boundary between depositors, users and archive: • users add, update content; customise outputs • (diagram labels: Producers, Ingestion, Archive, Dissemination, Designated communities)
The way we were ... • eg 1993: ASEDA Aboriginal Studies Electronic Data Archive at AIATSIS Canberra (modelled on Oxford Text Archive) • opportunistically collect and catalogue electronic materials that were at risk or not accessible • lexica • grammars • texts • etc
How things have changed .. • types of data (modalities and some genres) • means of storage • standardisation and metadata • dissemination • (most explosive) expanded into practice and workflow of linguists
ELAR’s holdings • ELAR currently holds about 45 deposits with a total volume of approx 1.1 TB • the average deposit is about 25 GB, but sizes vary widely, with a few much larger deposits; the median size is around 10 GB • we expect volume to nearly double over the next year • see next slides for distribution of data types
ELAR holdings by data type • data types for a representative sample (70%) of holdings • data type by volume (MB) and number of files, sorted by volume
If you are a depositor, ELAR will • preserve your deposited materials • provide for making changes where possible • provide web-based metadata management • implement your access restrictions etc • give feedback about materials • provide advice, general and specific • assistance, eg data conversion • provide some equipment and services • on a case by case basis, develop resources
Preservation issues • making materials robust • making storage robust • organisational, ownership and policy issues • changing technologies • refreshing • migrating
Changing technologies • advantages of digital preservation • primarily: copying • items no longer unique • also transmission, dissemination • other implications • robust formats (standard, open, explicit) • formats with long horizons • formats easy to refresh • formats that don’t require particular software (sometimes software is intrinsic!) • may have to describe software or even archive the software
Two preservation models • “preserve the bytestream” • keep the exact original at all costs • LOCKSS • “lots of copies keep stuff safe” • http://lockss.stanford.edu/ • guess which community it came from!
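Both models depend on being able to tell whether a copy is still identical to the original. The sketch below shows one common way to check this: compute a SHA-256 digest for each copy and flag any copy that disagrees with the majority. It is an illustration only, not the LOCKSS protocol and not ELAR's procedure; the function names `sha256_of` and `find_divergent_copies` are hypothetical.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_divergent_copies(copies: list[Path]) -> list[Path]:
    """Return the copies whose digest differs from the most common digest.

    Assumption: the majority of copies is intact. With an even split
    (e.g. two copies that differ) this cannot say which one is correct.
    """
    if not copies:
        return []
    digests = {copy: sha256_of(copy) for copy in copies}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return [copy for copy, digest in digests.items() if digest != majority_digest]
```

With three or more copies held in different locations, a check like this can be run on a schedule so that a damaged copy is noticed and replaced from an intact one.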
Some backup issues • risk management • undetected problems and useless backups • aspects of professional backup: • scheduled frequencies, eg monthly, weekly, daily • retention • media and locations • naming/versions • proven restoration
Top 10 worst ways to collect/manage data • 1. No backup • 2. Divergent versions of same data • 3. Unlabeled disks/media • 4. Non-standard or undocumented filenames • 5. Master recordings used to review/analyse data • 6. Don’t know how characters are encoded • 7. Never tried to convert/export data • 8. Unprocessed or unedited audio and video • 9. Inconsistent recording • 10. Unmonitored recording
Documenter and archive interactions • grant formulation and application • communications, questions, advice • training • archiving services
Query/interaction topics • analysis of approx 150 queries from documenters/linguists over nearly 2 years
What can you archive (at ELAR)? • media - sound, video • graphics - images, scans • text - fieldnotes, grammars, description, analysis • structured data - aligned and annotated transcriptions, databases, lexica • metadata - structured, standardised contextual information about the materials
Archive objects • informed by traditions, eg document archives • sometimes called “resources”, bundles • it could be a file, a set of files, a directory, a “session” or a coherent item with many parts • should have archival qualities eg Bird & Simons “7 Dimensions” (or see Thieberger in LDD2) • may impose standard structures or formats • need deposit event and processes • legal and protocol • verification • accession • ongoing processes
Archive objects should be selected • example: video: How much volume allocated? • answer: ... • however, e.g.: • unlikely that linguist is in position to plan and consistently create excellent video, so selection is unavoidable • data has always been selected!
(... selection) • in your typical work you also: • selected • labeled • transformed/processed/edited • added, corrected, expanded • made links • made or assumed relationships between “whole” and processed units; invented labels, IDs, scope etc • imposed formats
Data portability • Bird and Simons 2003: (for language documentation) our data should have integrity, flexibility, longevity and utility
Data portability • complete • explicit • documented • preservable • transferable • accessible • adaptable • not technology-specific • (also appropriate, accurate, useful etc!!)
Formats - media - preferred • sound - WAV • image - BMP, TIFF, JPEG • video - MPEG2
Formats - documents - preferred • plain text, with or without markup • PDF (PDF/A) • XML, other systematic markup (with description of markup system) • well-structured documents in common Office formats - ELAR will eventually convert them to archive formats • character encoding : • preferred encoding is ASCII or Unicode • clearly document any other encodings used, e.g. ISO 8859-5 • discuss with us if you use font substitution to handle non-Roman characters
Formats - characters - preferred • character encoding : • ASCII or Unicode (UTF-8) • you must clearly document any other encodings used, e.g. ISO 8859-9 • discuss with us if you use font substitution to handle non-Roman characters
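A quick way to find out whether a text file already meets the ASCII or UTF-8 preference is to try decoding its bytes. The sketch below is a rough check under stated assumptions: the function name `classify_encoding` is hypothetical, and a result of "unknown" only means the bytes are not valid UTF-8, so the actual encoding still has to be identified and documented by the depositor.

```python
def classify_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'unknown'."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8: likely a legacy encoding such as an
        # ISO 8859 variant, which must be documented explicitly.
        return "unknown"


if __name__ == "__main__":
    print(classify_encoding("Rarámuri".encode("utf-8")))
    print(classify_encoding("Rarámuri".encode("iso8859_9")))
```

Note that some legacy-encoded files can happen to decode as UTF-8, so a passing check is a good sign, not proof; visually inspecting a sample of the text is still worthwhile.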
Filenames and directories • characters [A-Z], [a-z], [0-9], underscore and a single full stop before the extension • correct MIME extension • favour lower case letters • maximum length 30 characters • maximum directory depth 8 • = ASCII only, no spaces
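The naming rules above can be checked mechanically before deposit. The following is a minimal sketch, not an ELAR tool; `check_path` is a hypothetical name. It assumes that the 30-character limit includes the extension, that directory depth means the number of directories above the file, and that directory names follow the same character rule as filenames. It does not check that the extension matches the file's MIME type.

```python
import re
from pathlib import PurePosixPath

# Letters, digits and underscores, then one full stop and an extension.
FILENAME_PATTERN = re.compile(r"[A-Za-z0-9_]+\.[A-Za-z0-9]+")
DIRECTORY_PATTERN = re.compile(r"[A-Za-z0-9_]+")
MAX_NAME_LENGTH = 30
MAX_DEPTH = 8


def check_path(relative_path: str) -> list[str]:
    """Return a list of problems found in a deposit-relative path."""
    parts = PurePosixPath(relative_path).parts
    if not parts:
        return ["empty path"]
    directories, filename = parts[:-1], parts[-1]
    problems = []

    if len(directories) > MAX_DEPTH:
        problems.append(f"directory depth {len(directories)} exceeds {MAX_DEPTH}")
    for directory in directories:
        if not DIRECTORY_PATTERN.fullmatch(directory):
            problems.append(f"directory {directory!r} has disallowed characters")
    if not FILENAME_PATTERN.fullmatch(filename):
        problems.append(
            f"filename {filename!r} must use only A-Z, a-z, 0-9, underscore "
            "and a single full stop before the extension"
        )
    if len(filename) > MAX_NAME_LENGTH:
        problems.append(f"filename {filename!r} is longer than {MAX_NAME_LENGTH} characters")
    if filename != filename.lower():
        problems.append(f"filename {filename!r}: lower case is preferred")
    return problems
```

For example, `check_path("session01/audio/rec6_lel.wav")` returns an empty list, while `check_path("my file.v2.wav")` reports the space and the extra full stop as one filename problem.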
Semantics of filenames • don’t stuff meaningful information into filenames - use metadata instead • versions • use directory structures wisely
Characters • did my characters come through? • answer: ... • however: • perhaps ELAR should do it? hápa ki hená mázaska wikcémna núpa iyóphe-wa-ye kst DBW wóz?az?a-s?ni yeló DB OK wash things-NEG ASS.M 'he didn't do the wash' wózaza-sni yeló DB OK wash things-NEG ASS.M 'he didn't do the wash'
Preservation • Is my file preservable? • Note: • characters? • inconsistent segmentation • data as comments • conventions/metadata Text transcription: “Korimáka” Language: Choguita Rarámuri Language used for transcription: Spanish Consultant: Luz Elena León Ramírez Linguist: abriela Cabaero Transcription: erth Fuen & Gabrela Cabaero Date recorded: 11/02/2006 Date tranbscribed: 11/02/2006 Recording: rec6-LEL.wav
Knowledge representation 1 - before wama momol chi naron mon chayako (LB) / wama momol chi naron chayako (MD) wama momol chi nan mon chayako (more emphatic(LB) / wama momol chi nan chayako (MD) Why don't you and him do it? + Notes have both of these sentences without the negator mon. OK runon naynangkroy ile ri He ate their sago. * kipin kannangkroy ngolu intended: We ate their cassowary. OK kipin kanangkroy ngolu We ate their cassowary.
Knowledge representation 1 - after
* kipin kannangkroy ngolu intended: We ate their cassowary.
OK kipin kanangkroy ngolu We ate their cassowary.
<sentence.set num="75">
  <version>
    <walman>Kipin kannangkroy ngolu</walman>
    <judgement>*</judgement>
  </version>
  <english>We ate their cassowary. </english>
</sentence.set>
<sentence.set num="76">
  <version>
    <walman>Kipin kanangkroy ngolu</walman>
    <judgement>OK</judgement>
  </version>
  <english>We ate their cassowary.</english>
</sentence.set>
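One payoff of the explicit markup is that the grammaticality judgements become machine-readable. The sketch below reads the two sentence sets shown on this slide and lists each version with its judgement. The wrapping `<sentences>` root element and the function name `list_judgements` are assumptions added so that the fragment parses as a single document; the slide does not show the real root element.

```python
import xml.etree.ElementTree as ET

# The two sentence sets from the slide, wrapped in an assumed root element.
SAMPLE = """<sentences>
<sentence.set num="75">
  <version>
    <walman>Kipin kannangkroy ngolu</walman>
    <judgement>*</judgement>
  </version>
  <english>We ate their cassowary. </english>
</sentence.set>
<sentence.set num="76">
  <version>
    <walman>Kipin kanangkroy ngolu</walman>
    <judgement>OK</judgement>
  </version>
  <english>We ate their cassowary.</english>
</sentence.set>
</sentences>"""


def list_judgements(xml_text: str) -> list[tuple[str, str, str, str]]:
    """Return (set number, Walman text, judgement, English) per version."""
    root = ET.fromstring(xml_text)
    rows = []
    for sentence_set in root.iter("sentence.set"):
        english = (sentence_set.findtext("english") or "").strip()
        for version in sentence_set.iter("version"):
            rows.append((
                sentence_set.get("num"),
                version.findtext("walman"),
                version.findtext("judgement"),
                english,
            ))
    return rows


if __name__ == "__main__":
    for row in list_judgements(SAMPLE):
        print(row)
```

In the "before" notes the same information is only recoverable by a human reader who knows the conventions (`*`, `OK`, `intended:`).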
Knowledge representation 2 • avoid generic software “convert to XML”
<?xml version="1.0" encoding="UTF-8"?>
<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
  <PRODUCT BUILD="06/26/2002" NAME="FileMaker Pro" VERSION="6.0v2"/>
  <DATABASE DATEFORMAT="M/d/yyyy" LAYOUT="" NAME="Videos" RECORDS="13" TIMEFORMAT="h:mm:ss a"/>
  <METADATA>
    <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Index name" TYPE="TEXT"/>
    <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Image desc" TYPE="TEXT"/>
    <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Date" TYPE="TEXT"/>
    <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="Content" TYPE="TEXT"/>
  </METADATA>
  <RESULTSET FOUND="13">
    <ROW MODID="16" RECORDID="40">
      <COL><DATA>Morly Beeta</DATA></COL>
      <COL><DATA>Interview with Morly Beeta</DATA></COL>
      <COL><DATA>Jan/13/05</DATA></COL>
      <COL><DATA>Obu history by Morly Beeta</DATA></COL>
    </ROW>
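The problem with the generic export is that every value sits in an anonymous `COL`/`DATA` pair, and its meaning can only be recovered by counting positions against the `FIELD` list. The sketch below makes that pairing explicit for one row. The sample is the slide's excerpt with the XML declaration and most attributes omitted and the closing tags added as an assumption (the slide's excerpt is truncated); `rows_as_records` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {"fmp": "http://www.filemaker.com/fmpxmlresult"}

# Excerpt from the slide, shortened; closing tags added (assumption).
SAMPLE = """<FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
  <METADATA>
    <FIELD NAME="Index name" TYPE="TEXT"/>
    <FIELD NAME="Image desc" TYPE="TEXT"/>
    <FIELD NAME="Date" TYPE="TEXT"/>
    <FIELD NAME="Content" TYPE="TEXT"/>
  </METADATA>
  <RESULTSET FOUND="13">
    <ROW MODID="16" RECORDID="40">
      <COL><DATA>Morly Beeta</DATA></COL>
      <COL><DATA>Interview with Morly Beeta</DATA></COL>
      <COL><DATA>Jan/13/05</DATA></COL>
      <COL><DATA>Obu history by Morly Beeta</DATA></COL>
    </ROW>
  </RESULTSET>
</FMPXMLRESULT>"""


def rows_as_records(xml_text: str) -> list[dict[str, str]]:
    """Pair each positional COL value with its FIELD name."""
    root = ET.fromstring(xml_text)
    names = [field.get("NAME") for field in root.findall("fmp:METADATA/fmp:FIELD", NS)]
    records = []
    for row in root.findall("fmp:RESULTSET/fmp:ROW", NS):
        values = [
            col.findtext("fmp:DATA", default="", namespaces=NS)
            for col in row.findall("fmp:COL", NS)
        ]
        records.append(dict(zip(names, values)))
    return records


if __name__ == "__main__":
    print(rows_as_records(SAMPLE))
```

Even after this step the field names ("Index name", "Image desc") and the date format ("Jan/13/05") remain undocumented, which is why purpose-designed markup with a description of the markup system is preferable to a generic export.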