Digital Immortality OR Keeping Digital Data for Ever Dr David Holdsworth <ecldh@leeds.ac.uk> http://www.leeds.ac.uk/cedars/
Obsolete(?) Data • 1 Things that must be kept by law • 2 Things that must be destroyed by law • 3 Things that we choose to keep • 4 Things that we are certain can be thrown away
Obsolete(?) Data • 5 Things that we would like to keep if we have room • 6 Things that we would like to throw away, but are not sure about • 7 Things that we think we have kept but cannot find • 8 Things that we have kept but now cannot decipher • 9 Things that we have not kept but now wish that we had
What to Keep • All of 1 and 3 • 1 Things that must be kept by law • 3 Things that we choose to keep • As much of 5 and 6 as is cost-effective • 5 Things that we would like to keep if we have room • 6 Things that we would like to throw away, but are not sure about • Data discarded from 5 and 6 has the potential to be in 9 in the future • 9 Things that we have not kept but now wish that we had • Minimise cost per item
Some Pitfalls • Errors are usually not correctable • Failure to index adequately puts data into category 7 • 7 Things that we think we have kept but cannot find • Failure to know the format puts data into category 8 • 8 Things that we have kept but now cannot decipher
Personal Involvement CEDARS • CURL Exemplars in Digital ARchiveS • Collaborative project for libraries • Funded by HEFCE/JISC • Oxford, Cambridge and Leeds
Personal Involvement - contd. CAMiLEON • Creative Archiving at Michigan and Leeds: Emulating the Old on the New • Collaborative project on emulation • Funded by NSF/JISC
Challenges to digital preservation • Deteriorating media • Magnetic dropout • Obsolete equipment • Obsolete data formats • EBCDIC • UNICODE has established itself • Machine code software is an extreme example
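The obsolete-format problem can be illustrated with EBCDIC: the bytes survive, but they are only readable if the encoding is known. A minimal sketch, assuming Python's built-in cp037 codec (one of several EBCDIC code pages) and sample bytes chosen purely for illustration:

```python
# Sample bytes are illustrative; cp037 is one EBCDIC code page among several.
ebcdic_bytes = bytes.fromhex("c8c5d3d3d6")   # the text "HELLO" in code page 037

# Read under the wrong encoding, the same bytes are meaningless.
as_latin1 = ebcdic_bytes.decode("latin-1")

# Knowing the original encoding lets us recover the text and re-encode it.
text = ebcdic_bytes.decode("cp037")
utf8_bytes = text.encode("utf-8")

print(repr(as_latin1))
print(text, utf8_bytes)
```

The byte-stream itself never changed; only the recorded knowledge of its format made it usable.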
Challenges to digital preservation contd • Needles in haystacks • ISBN • Meta-data • Deteriorating Institutions • Where are the digital legal deposits? • .. Or even Digital Equipment Corporation • Proprietary systems become obsolete • leaving data inaccessible
Compatibility - Friend or Foe • e.g. z/OS evolves from OS/360 • Windows Vista evolves from 16-bit Windows 3.1 • Modern machines run old software …… but faster • Who keeps old versions? • Computer Museum in California • Microsoft -- ?
Times Change • People don’t always want to process their old data using the tools of yesteryear
THIS IS GEORGE 3 MARK 8.67 ON 31DEC99 10.19.03_ TIMED OUT 10.19.33 THE SYSTEM HAS TEMPORARILY CLOSED DOWN
Times Change • People don’t always want to process their old data using the tools of yesteryear • Need to bridge the gap between data’s origins and the time of access
Use the Past to Illuminate the Future • In 1987 EBCDIC was king • In 2007 UNICODE is heir apparent • In 2027 ……. • In 2038 UNIX time_t overflows 31 bits • What has survived the decades?
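The 2038 date follows from simple arithmetic on a signed 32-bit time_t, which has 31 bits for the magnitude of seconds since 1 January 1970. A short sketch of that calculation; the helper wrap_int32 is a hypothetical name that imitates two's-complement wrap-around:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_TIME_T_32 = 2**31 - 1   # largest value of a signed 32-bit time_t

# Last moment that a signed 32-bit time_t can represent.
last_moment = EPOCH + timedelta(seconds=MAX_TIME_T_32)
print(last_moment.isoformat())   # 2038-01-19T03:14:07+00:00

def wrap_int32(n):
    """Wrap n the way two's-complement 32-bit arithmetic would."""
    return (n + 2**31) % 2**32 - 2**31

# One second later the counter wraps to a large negative number.
wrapped = wrap_int32(MAX_TIME_T_32 + 1)
after_wrap = EPOCH + timedelta(seconds=wrapped)
print(after_wrap.isoformat())    # 1901-12-13T20:45:52+00:00
```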
Survival of the Abstract • Character sets • Bytes • Unstructured Files (stream of bytes) • Hierarchical file tree • Associative mappings • Programming languages
All is not lost • We can keep a byte-stream for ever • The abstract data, separated from the medium, is technology-neutral • i.e. files can be kept for ever • Copies are perfect • File formats do not last for ever • ….. Remember WORDSTAR
Non-File Objects • e.g. CDs, DVDs, magnetic tapes, web sites • Map each digital object into a byte-stream and then preserve • Multiple files (e.g. websites) can go in a ZIP or tar archive
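Mapping a multi-file object into one byte-stream might look like the sketch below, which uses Python's standard tarfile module entirely in memory. The function names and the sample web site contents are hypothetical, chosen only for illustration:

```python
import io
import tarfile

def pack_to_byte_stream(files):
    """Pack a {path: bytes} mapping into one uncompressed tar byte-stream."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for path in sorted(files):
            data = files[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()

def unpack_byte_stream(stream):
    """Recover the {path: bytes} mapping from a tar byte-stream."""
    files = {}
    with tarfile.open(fileobj=io.BytesIO(stream), mode="r") as archive:
        for member in archive.getmembers():
            if member.isfile():
                files[member.name] = archive.extractfile(member).read()
    return files

# Hypothetical two-file web site.
site = {
    "index.html": b"<html><body>Home</body></html>",
    "images/logo.gif": b"GIF89a...",
}
stream = pack_to_byte_stream(site)
restored = unpack_byte_stream(stream)
print(len(stream), sorted(restored))
```

The result is a single stream of bytes that can be copied and preserved like any other file, with the file tree recoverable on demand.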
Abstraction • Identify significant properties of the object • represent them in a byte stream
Example -- magnetic tape • Significant properties • blocks of data • tape marks • start and end of tape • Representation • block -- raw bytes, preceded by 32-bit byte count • tape mark -- 4 bytes all ones • start & end -- ends of stream
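The tape representation above can be sketched in a few lines. The slide does not state the byte order of the 32-bit count, so big-endian is an assumption here, and the function names are hypothetical:

```python
import struct

TAPE_MARK = b"\xff\xff\xff\xff"   # tape mark: 4 bytes, all ones

def encode_tape(items):
    """Encode tape items: bytes for a data block, None for a tape mark."""
    out = bytearray()
    for item in items:
        if item is None:
            out += TAPE_MARK
        else:
            # 32-bit byte count, then the raw bytes (big-endian is assumed).
            # A block of exactly 2**32 - 1 bytes would collide with the mark.
            out += struct.pack(">I", len(item))
            out += item
    return bytes(out)

def decode_tape(stream):
    """Decode a byte-stream back into blocks (bytes) and tape marks (None)."""
    items = []
    pos = 0
    while pos < len(stream):        # start and end of tape = ends of stream
        header = stream[pos:pos + 4]
        if len(header) < 4:
            raise ValueError("truncated header")
        pos += 4
        if header == TAPE_MARK:
            items.append(None)
            continue
        (count,) = struct.unpack(">I", header)
        block = stream[pos:pos + count]
        if len(block) != count:
            raise ValueError("truncated block")
        items.append(block)
        pos += count
    return items

tape = [b"HEADER", None, b"record one", b"record two", None, None]
image = encode_tape(tape)
print(decode_tape(image) == tape)
```

Because the image records block boundaries and tape marks, the structure that mattered on the physical tape survives in an ordinary file.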
When to convert • Conversion is inevitable • a) as soon as the format becomes obsolete • b) only when we want to read the data • c) never - emulate the original system
Convert as soon as Obsolete • Copying to new technology is no longer trivial • Any errors are cast in stone • Digital signatures are lost • Only viable when the number of different formats is small
Convert when we want to read • Preserve the original by simply copying onto current technology • Record the format of each stored object • Keep an index of all the formats held • Maintain access to conversion software from the old to the current • Treasure open-source conversion software
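A minimal sketch of this approach, assuming an in-memory store: the original bytes are kept untouched, the format of each object is recorded, and conversion happens only on access. The class, the format labels and the sample objects are all hypothetical:

```python
class Archive:
    """Keeps original bytes plus a record of each object's format."""

    def __init__(self):
        self.objects = {}      # object id -> (format name, original bytes)
        self.converters = {}   # format name -> function(bytes) -> current form

    def store(self, object_id, format_name, data):
        self.objects[object_id] = (format_name, bytes(data))

    def register_converter(self, format_name, converter):
        self.converters[format_name] = converter

    def formats_held(self):
        """Index of all formats held, with the objects stored in each."""
        index = {}
        for object_id, (format_name, _) in self.objects.items():
            index.setdefault(format_name, []).append(object_id)
        return index

    def formats_without_converter(self):
        """Formats at risk of becoming undecipherable."""
        return sorted(set(self.formats_held()) - set(self.converters))

    def read(self, object_id):
        """Convert at the time of access; the stored original is unchanged."""
        format_name, data = self.objects[object_id]
        return self.converters[format_name](data)

archive = Archive()
archive.store("memo-1", "ebcdic-text", bytes.fromhex("c8c5d3d3d6"))
archive.store("memo-2", "wordstar", b"...")
archive.register_converter("ebcdic-text", lambda data: data.decode("cp037"))

print(archive.read("memo-1"))
print(archive.formats_without_converter())
```

The list of formats lacking a converter is the early warning: those are the objects drifting towards the "kept but cannot decipher" category.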
Format Registries • National Archives PRONOM • Harvard Global Digital Format Registry • OAIS ISO14721:2003 Representation Information
Emulation of Yesteryear • Today’s desktop machine far exceeds the mainframe of the 1970s or even 80s • George3 • Emulate the George3 executive • i.e. order code + system calls + peripherals • BBC micro • Publicly available emulation on WWW
Abstraction for Emulation of 1900 system • George3 sits on 1900 instruction set plus executive calls • Executive sits on 1900 instruction set plus Fancy I/O stuff • George3 provides lots of embellishment of 1900 instruction set • Emulate executive + 1900 instruction set
Malawi Census Data • Data stored on ICL magnetic tapes • Rescued by using emulated ICL 1900
Standards • Open Archival Information System • OAIS ISO14721:2003 • Originated by Space Data Community • Proprietary “standards” • Big enough to be reverse engineered e.g. MS Word • XYZ Software Ltd • Open standards, e.g. RFCs
Really Long-Term • Look back 20 years to see how things have changed • Today’s Vista is not the final scene • Ensure that systems can accommodate new formats • Even the standards are likely to change
Domesday 1986 • 900th anniversary of William the Conqueror’s version • BBC collects data (inc pictures) • Data written on 12" LaserVision discs • Discs last 100 years, but not the drives • Access is via BBC Master computer • That won’t last 100 years either • Can we preserve it until the 1000th anniversary?
Stewardship • Copies of the discs are lodged with: • BBC • British Library • National Archives (ex PRO) • Abstract data held by: • DH / Leeds University • Longlife Data Ltd
Stewardship • Current archival activity stresses retention of media • Retention of digital media is useless • Need digital safe deposits