Ensuring that digital data last: The priority of archival form over working form and presentation form • Gary Simons, SIL International • Symposium on Best Practice, LSA, Boston, MA
A paradox of writing history • The more advanced the writing technology, the less durable the written product. • From most durable to least durable: • Clay tablets and stone • Vellum • Papyrus • Paper • Digital word processing
Storage media are ephemeral • Life expectancy of digital storage media: • Magnetic tape: 10 to 20 years • CD-R (write once) • Manufacturers say: 100 to 200 years • Independent lab says: 30 years • CD-RW (write many times) • Manufacturers say: 25 years
Hardware devices are ephemeral • Removable media on personal computers advance over 25 years: • 8-inch floppies • 5.25-inch floppies • 3.5-inch floppies • Zip drives • CD-Rs • DVD-Rs
Software formats are ephemeral • Software vendors change file formats and functionality with each version. • When we use a proprietary, single-vendor format, we lose access to the data when the software becomes obsolete. • For instance: • Microsoft Word files from the 1980s cannot be read by current versions of Word.
An impending “Digital Dark Age” • Future historians may see our present age as another Dark Ages since so much information documenting our current civilization is recorded digitally and will have vanished. • If linguists fail to act in time, our digital data records are in danger of dying out before the endangered languages we are seeking to document.
What’s a linguist to do? • Do two things to ensure that digital data endure long into the future: • Put the materials into an enduring file format. • Deposit the materials with an archive that will make a practice of periodically migrating them to new storage media as needed.
Forms contrasted by function • Working form • The form in which information is stored as it is created and edited. • Presentation form • The form in which information is presented to the public. • Archival form • The form in which information is stored for access long into the future.
The problem • Popular working forms (like Microsoft Word or database applications) are not suitable archival forms. • Popular presentation forms (like dynamic web pages) are not suitable archival forms. • Linguists tend to focus on working form and presentation form; they must look beyond these to create enduring work.
Unacceptable practice • The form that is archived is a binary working form that requires a specific piece of software, e.g., • .DOC, .XLS, .PPT, .MDB • A format supported by homemade software • The information will cease to exist when the required software ceases to work on the hardware in use.
Minimally acceptable practice • The form that is archived is a presentation form based on an open format supported by multiple vendors, e.g., • HTML, PDF • The good news • A snapshot of how you presented the information will persist. • The bad news • It is a dead-end format: the information is not repurposeable.
Best practice • The form that is archived preserves all of the information (including its structure) in such a way that it is portable and repurposeable. • Descriptive XML markup • An XML archival form is not a dead end: • It may be reloaded into a working form. • It may regenerate new presentation forms.
A sample presentation form • From a dictionary of Sikaiana, Solomon Islands
aha [na] the shell tool used for measuring the spaces between mesh in nets (seu manu, kupena).
ahaa (from PPN *afaa) [n] a cyclone, a tidal wave.
aaha 1. [vt] to open up, to push apart, as in pushing apart branches in order to look through. 2. [vt] to open up a new settlement or start a new garden. 3. [vt] to start, to begin a new project or way of life. Tapa mai a koe ko hano i mua ki aaha te ala o te taina, 'you called upon me to go first (to school) to open the way for my brother (MS)'.
Unacceptable practice • If you archive a .DOC file, this is what future generations will see when they open it:
Minimally acceptable practice • If you archive an HTML presentation, this is what future generations will see: <P><B>aha</B> <I>[na]</I> the shell tool used for measuring the spaces between mesh in nets (<I>seu manu, kupena</I>).</P><P><B> ahaa</B> (from PPN *afaa) <I>[n]</I> a cyclone, a tidal wave.</P><P><B> aaha</B> 1. <I>[vt]</I> to open up, to push apart, as in pushing apart branches in order to look through. 2. <I>[vt]</I> to open up a new settlement or start a new garden. 3. <I>[vt]</I> to start, to begin a new project or way of life. <I>Tapa mai a koe ko hano i mua ki aaha te ala o te taina,</I> 'you called upon me to go first (to school) to open the way for my brother (MS)'. </P>
Best practice • If you archive descriptive XML markup, this is what future generations will see: • Future generations (though they lack our current working tools) will be able to: • See and understand the information • Load it into their own working tools • Create modern presentation forms
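The XML sample from this slide did not survive in the transcript. As a rough illustration only, the sketch below invents descriptive markup for two of the Sikaiana entries quoted earlier, then uses Python's standard library to reload it into a working form and regenerate an HTML presentation form. The element and attribute names (lexicon, entry, headword, etymology, sense, pos, definition) and the helper functions are assumptions made for this example, not the markup used in the original talk.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical descriptive markup for two of the dictionary entries.
# The tag names are illustrative assumptions, not the author's schema.
ARCHIVAL_XML = """<lexicon language="Sikaiana">
  <entry>
    <headword>ahaa</headword>
    <etymology source="PPN">*afaa</etymology>
    <sense>
      <pos>n</pos>
      <definition>a cyclone, a tidal wave</definition>
    </sense>
  </entry>
  <entry>
    <headword>aaha</headword>
    <sense>
      <pos>vt</pos>
      <definition>to open up, to push apart, as in pushing apart branches in order to look through</definition>
    </sense>
    <sense>
      <pos>vt</pos>
      <definition>to open up a new settlement or start a new garden</definition>
    </sense>
  </entry>
</lexicon>"""


def load_entries(xml_text):
    """Reload the archival form into a working form (plain Python dicts)."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("entry"):
        etym = entry.find("etymology")
        entries.append({
            "headword": entry.findtext("headword"),
            "etymology": None if etym is None else (etym.get("source"), etym.text),
            "senses": [(s.findtext("pos"), s.findtext("definition"))
                       for s in entry.findall("sense")],
        })
    return entries


def to_html(entry):
    """Regenerate a presentation form resembling the HTML snapshot."""
    parts = ["<P><B>%s</B>" % escape(entry["headword"])]
    if entry["etymology"]:
        parts.append("(from %s %s)" % tuple(escape(x) for x in entry["etymology"]))
    numbered = len(entry["senses"]) > 1  # number senses only when there are several
    for i, (pos, definition) in enumerate(entry["senses"], start=1):
        prefix = "%d. " % i if numbered else ""
        parts.append("%s<I>[%s]</I> %s." % (prefix, escape(pos), escape(definition)))
    return " ".join(parts) + "</P>"


entries = load_entries(ARCHIVAL_XML)
html_lines = [to_html(e) for e in entries]
for line in html_lines:
    print(line)
```

Because the markup names what each piece of information is (headword, part of speech, definition) rather than how it looks (bold, italic), the same file could feed a different presentation, such as a reversed English index, without anyone having to guess at the structure.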
Is XML just one more ephemeral format? • No! It’s as rock solid as ASCII. • ASCII was adopted in 1963; 40 years later it is at the heart of operating systems, email, and the web; it won’t change. • XML uses ASCII notation to essentially extend ASCII by solving two of its inherent limitations: • Via Unicode it encodes text in any language • Via tags it encodes the structure of information
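A small sketch of the two extensions named above, using Python's standard XML parser. The one-word document is invented for illustration (it is not from the talk): it is written with ASCII bytes only, yet it carries a non-ASCII letter through a numeric character reference and carries structure through a tag and an attribute.

```python
import xml.etree.ElementTree as ET

# An XML document written with nothing but ASCII characters: the non-ASCII
# letter is spelled as a numeric character reference (&#241; is U+00F1).
ascii_only_xml = b'<word lang="es">ma&#241;ana</word>'
assert all(byte < 128 for byte in ascii_only_xml)

element = ET.fromstring(ascii_only_xml)
print(element.text)         # the parser returns the Unicode text
print(element.get("lang"))  # the attribute on the tag records structure

# Serializing with the default ASCII encoding writes the character
# reference back out, so the stored file remains plain ASCII.
round_trip = ET.tostring(element)
print(round_trip)
```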
Is XML just one more theory? • No! It has become part of the fabric of the global information infrastructure. • It’s a family of open standards from the World Wide Web Consortium. • All major vendors (e.g. Microsoft, IBM, Sun, Oracle) have embraced it. • Hundreds of small vendors and open-source projects have developed tools.
What’s linguistics to do? • The community needs to recognize the fleeting value of digital presentation forms and embrace archival forms. • Grants should require best-practice archiving, not just “dissemination”. • Reward archival language documentation. • Get in league with libraries and archives. • Only by taking steps like these can we ensure that our digital data will endure.