Bruce R. Barkstrom, Retired NASA/NOAA, Asheville, NC 28804
The Role of Data Formats in Long-Term Earth Science Information Preservation
Outline
• The Difficulty of Preservation
• Threats – Risk of Loss
• Digital Artifacts and Representation Networks
• The OAIS Reference Model – Layered Representation
• Demonstrating Equality of Science Content
• Mechanisms for Formatting Data
• Archivally Safe Transformations
• Strategies for Reducing Risk
• Concluding Comments
The Difficulty of Preservation
• To preserve 98% of an archive's contents for 200 years
  • The average probability of loss per year needs to be below ~0.01% (four 9's in reliability engineering)
• 200 years covers a lot of events
  • 50 administrations
  • 200 annual budgets
  • ~70 cycles of hardware obsolescence (new models roughly every 3 years)
  • 10 generations (one every 20 years)
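The ~0.01% figure can be checked with a short calculation. This is a minimal sketch that assumes a constant, independent probability of loss each year (a simplification the slide does not state); the function name is illustrative.

```python
def max_annual_loss(target_survival: float, years: int) -> float:
    """Largest constant annual loss probability p such that
    (1 - p) ** years still equals the target survival fraction."""
    return 1.0 - target_survival ** (1.0 / years)


# 98% of the contents surviving for 200 years
p = max_annual_loss(0.98, 200)
print(f"maximum annual loss probability: {p:.4%}")
```

Under that assumption the annual loss probability comes out at roughly one part in ten thousand, which matches the "four 9's" reliability target.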
Threats – Risk of Loss
• Media Failure
• Hardware Failure
• Software Failure
• Communication Errors
• Failure of Network Services
• Media & Hardware Obsolescence
• Software Obsolescence
• Operator Error
• Natural Disaster
• External Attack
• Internal Attack
• Economic Failure
• Organizational Failure
List from the LOCKSS Threat Model paper (Rosenthal et al., 2005)
Digital Artifacts and Representation Networks
• Recent work by the EU CASPAR Project on Representation Networks (RN):
  • A network of digital artifacts that the designated user community needs in order to understand archived information
• Examples of digital artifacts in an RN
  • Calibration Data, Reports, Plans, Procedures
  • Satellite/Instrument Coordinate Descriptions and Plans
  • ...
  • Documentation Reading Software
  • Data Format Documentation and Software
The OAIS Reference Model – Layered Representation
• The Object Layer, in which the aggregations identified in the Aggregation Layer are classified into objects that are recognizable and meaningful in the application domain, such as images
• The Aggregation Layer, in which the individual data elements of the Data Element Layer are aggregated into structural groupings that form a tree whose leaves are ADEs
• The Data Element Layer, which consists of a sequence of Atomic Data Element (ADE) types (integers, reals, dates, character strings)
• The Bit Stream Layer, which consists of an array of bits
• The Media Layer, which includes data on disks, tapes, and networks
Scientific Data Only Part of File
• A data file can have four kinds of ADEs:
  • Scientific Data
  • Structural Info (array sizes; XML tags)
  • Context Information
  • Tacit Information (ordering of array elements)
• Aggregation Layer Tree
  • Trees for Data, Structure, Context
  • Subtrees for arrays of records (like relational DB tables)
  • Records contain elements that point to individual ADEs
An Example: NCDC Precip
• Global Historical Climatology Network (GHCN) Monthly Average Precipitation
• ~2100 rain gauge stations at peak
• Earliest data ~1835
• Data recorded in ASCII
• Station ID, lat, long in another file

Station ID    Year   JAN   FEB   MAR   APR   MAY   JUN   JUL   AUG   SEP   OCT   NOV   DEC
Strc          Strc   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat   Dat
211286790001  1891   203   202 -9999   231    96   186   152   646   139   430   169   209
211286790001  1892   150   148    26    81   448   262   328   568 -9999 -9999    66   301
211286790001  1893     0    41   121    46   779   198   701   511   107   192   418   249
211286790001  1894   150   120   310   185   279   672 -9999 -9999 -9999 -9999 -9999 -9999

In the file itself the fields run together without delimiters (e.g. "2112867900011891" and "202-9999").
Missing Obs Value: -9999
Trace Precip Value: -8888
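Reading such a record amounts to applying a template to an undelimited ASCII line. The sketch below is illustrative only: the field widths (12-character station ID, 4-character year, twelve 5-character values) are assumptions inferred from the sample rows on the slide, not taken from the GHCN format documentation, and the trace sentinel (-8888) would need similar handling to the missing-value sentinel.

```python
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")
MISSING = -9999  # missing-observation sentinel from the slide

# Assumed field widths, inferred from the sample rows (not from format docs)
ID_WIDTH, YEAR_WIDTH, VALUE_WIDTH = 12, 4, 5


def parse_record(line: str):
    """Split one fixed-width record into station ID, year, and monthly values."""
    station_id = line[:ID_WIDTH]
    year = int(line[ID_WIDTH:ID_WIDTH + YEAR_WIDTH])
    start = ID_WIDTH + YEAR_WIDTH
    values = {}
    for i, month in enumerate(MONTHS):
        field = line[start + i * VALUE_WIDTH:start + (i + 1) * VALUE_WIDTH]
        raw = int(field)
        # The month name is tacit in the file; it is made explicit here.
        values[month] = None if raw == MISSING else raw
    return station_id, year, values


# Rebuild the first sample row in its assumed on-disk layout
RECORD = "211286790001" + "1891" + "".join(
    f"{v:5d}" for v in (203, 202, -9999, 231, 96, 186, 152, 646, 139, 430, 169, 209)
)
station_id, year, values = parse_record(RECORD)
print(station_id, year, values["JAN"], values["MAR"])
```

Note that the month labels never appear in the record: the parser supplies them from the position of each value, which is exactly the kind of knowledge that lives outside the data file.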
Mechanisms for Formatting Data
Three mechanisms:
• Use templates to impress structure on data streams
• Use delimiters (e.g. XML tags)
• Count bits from the beginning of the array
The mechanisms identify the digital artifacts used in the RN
Where Does Understanding Lie?
• With bit counting (e.g. FORTRAN)
  • Compiler creates instructions for interpretation based on the text of the read program
• With delimiters (e.g. XML)
  • XML parser and compiler create instructions for interpretation based on the DTD or Schema
• Interpretation may reside in different places
  • Array dimensions implicit in the read program or in the language's file-handling conventions
  • Array dimensions included in the file
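The contrast can be sketched with Python's standard library. This is an illustration under assumed layouts, not the format of any particular archive: the element names, the little-endian 32-bit integers, and the 4-byte count header are all hypothetical choices.

```python
import struct
import xml.etree.ElementTree as ET

values = [203, 202, 231]

# Bit counting: the count, type, and byte order live in the reading program.
packed = struct.pack("<3i", *values)
decoded_by_counting = list(struct.unpack("<3i", packed))

# Array dimension included in the file: a count header precedes the data.
with_header = struct.pack("<I", len(values)) + packed
(n,) = struct.unpack_from("<I", with_header, 0)
decoded_with_header = list(struct.unpack_from(f"<{n}i", with_header, 4))

# Delimiters: the structure travels with the data as tags.
root = ET.Element("precip")
for v in values:
    ET.SubElement(root, "value").text = str(v)
xml_text = ET.tostring(root, encoding="unicode")
decoded_by_tags = [int(e.text) for e in ET.fromstring(xml_text).findall("value")]

print(decoded_by_counting, decoded_with_header, decoded_by_tags)
```

In the first case, losing the reading program (the format string `"<3i"`) leaves twelve uninterpretable bytes; in the other two, part of that knowledge is carried in the file, although the meaning of the header or of the tag names still has to be documented somewhere.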
Beware of Tacit Knowledge
• Some fields may reside in other files or be implicit
  • Geolocation of pixels in MODIS: have to consult a file outside the spectral image file
  • Months of year in GHCN precipitation: month numbering implicit in array ordering
• Conventions may not be noted
  • Date conventions (European vs American vs Astronomical Julian Date)
  • Language encodings (Unicode vs ASCII)
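The date-convention hazard is easy to demonstrate. In this sketch the date string is a made-up example, and the Julian Date helper assumes the proleptic Gregorian calendar that Python's `datetime` uses, evaluated at midnight.

```python
from datetime import date, datetime

text = "03/04/1892"  # hypothetical field with no recorded convention

# The same characters yield two different days depending on the convention.
american = datetime.strptime(text, "%m/%d/%Y").date()  # month first
european = datetime.strptime(text, "%d/%m/%Y").date()  # day first


def julian_date_at_midnight(d: date) -> float:
    """Astronomical Julian Date at 00:00 of the given calendar date."""
    return d.toordinal() + 1721424.5


print(american, european, julian_date_at_midnight(date(2000, 1, 1)))
```

Both parses succeed without error, so nothing in the data itself signals which reading was intended; only recorded representation information can settle it.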
Archivally Safe Transformations
• Key question: how could we tell if two files contain the same data?
• Individual data elements may be transformed in type
  • float -> double; byte -> int; ASCII -> Unicode
• Order of data elements in an aggregation may be permuted or indexed differently
  • FORTRAN -> C array order
• Data may be separated into different files or aggregated into one
• Tacit information (e.g. Representation info) may be made explicit
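As an illustration of the key question, the sketch below builds two byte streams that differ in both element type and element order and then checks that they carry the same values. The array contents, the little-endian layouts, and the function name are assumptions made for the example.

```python
import struct

ROWS, COLS = 2, 3
# Values chosen to be exactly representable as 32-bit floats
table = [[1.5, 2.25, 3.0],
         [4.0, 5.5, 6.75]]

# "File A": FORTRAN-style column-major order, 32-bit floats
file_a = struct.pack(f"<{ROWS * COLS}f",
                     *(table[r][c] for c in range(COLS) for r in range(ROWS)))
# "File B": C-style row-major order, 64-bit doubles
file_b = struct.pack(f"<{ROWS * COLS}d",
                     *(table[r][c] for r in range(ROWS) for c in range(COLS)))


def same_science_content(a: bytes, b: bytes, rows: int, cols: int) -> bool:
    """Compare element (r, c) of a column-major float32 stream with
    element (r, c) of a row-major float64 stream."""
    a_vals = struct.unpack(f"<{rows * cols}f", a)
    b_vals = struct.unpack(f"<{rows * cols}d", b)
    return all(a_vals[c * rows + r] == b_vals[r * cols + c]
               for r in range(rows) for c in range(cols))


print(file_a == file_b, same_science_content(file_a, file_b, ROWS, COLS))
```

The byte streams are not equal, and a checksum comparison would report a mismatch, yet every element agrees once the type widening and the index permutation are taken into account. Exact equality works here because float -> double is lossless; a narrowing conversion would need a stated tolerance instead.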
Strategies for Reducing Risk
• Migration
  • Expect to create new files from old on a fairly regular basis, with changes as needed to avoid risk of loss
• Transparency
  • Make transformations explicit and record mappings from one format to another
• Diversity
  • Record data in more than one format
  • Decentralize management and rely on federated authentication (and audit)
Concluding Comments
• Use of slightly modified OAIS RM Layered Model
  • Gives solid basis for precise identification of particular scientific data
• For two files to have identical scientific data
  • Use one-to-one and onto mapping, including necessary order permutations
  • One-to-one and onto mapping guarantees an inverse mapping
  • Rigorous basis for identifying scientifically identical subsets