Format Typing for the Preservation of Data Sets and Databases

Presentation Transcript


    1. Harvard University Library
       Format Typing for the Preservation of Data Sets and Databases
       Stephen Abrams
       Harvard University
       Cambridge, Massachusetts, USA
       stephen_abrams@harvard.edu

    2. Global Digital Format Registry
       “The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.”
       - Centrally organized collection and review
       - Distributed storage, discovery, and delivery via a peer-to-peer network

    3. Format and digital preservation
       - Preservation is concerned with ensuring access to managed digital assets over time
       - Thus, preservation activities are focused on:
         - Viability
         - Fixity
         - Authenticity
         - Interpretability
         - Renderability
       - The last two are primarily a function of format

    4. Without format typing, all content is opaque

    5. Without format typing, all content is opaque

    6. Without format typing, all content is opaque
       Edward Burne-Jones (British, 1833-1898)
       The Days of Creation: the First Day, 1870-1876
       Watercolor and gouache, 102.2×35.5 cm
       Fogg Art Museum, Harvard University, 1943.454
       Bequest of Grenville L. Winthrop

    7. What is a format?
       - Informally, “a serialized encoding of an abstract information model”
       - Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level:
         - IEEE 754 floating point number
         - File system
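At the micro end of that range, an IEEE 754 number already shows what "a serialized encoding of an abstract information model" means in practice. The sketch below uses Python's standard `struct` module; the value 1.5 and the variable names are illustrative choices, not taken from the presentation.

```python
import struct

value = 1.5  # the abstract information: a real number (illustrative choice)

# Serialize as IEEE 754 binary64 in each byte order.
big = struct.pack(">d", value)
little = struct.pack("<d", value)

# The two serializations carry the same encoding; only the byte order differs.
assert big == little[::-1]

# Decoding with the correct format knowledge recovers the value.
decoded = struct.unpack(">d", big)[0]

# Decoding with the wrong byte order silently yields a different number.
misread = struct.unpack("<d", big)[0]
```

Without knowing both the encoding and the byte order, the eight bytes are opaque; with the wrong assumption they decode without error to the wrong value.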

    8. GDFR project
       - Two DLF-sponsored invitational workshops
         - University of Pennsylvania, January 2003
         - Washington, March 2003
       - Provisional data and service models
       - Two independent demonstration projects
         - FRED, Format Registry Demonstrator, TOM (Typed Object Model) [John Ockerbloom, University of Pennsylvania] http://tom.library.upenn.edu/fred/
         - FOCUS, Format Curation Service (LDAP) [Joseph JaJa, University of Maryland] http://www.umiacs.umd.edu/~joseph/focus-archiving06.pdf

    9. The GDFR project
       - Harvard University Library (HUL) funded for 2 years by the Mellon Foundation
       - Staffing and technical work subcontracted by HUL to OCLC (July 2006)
       - Project oversight
         - Steering Committee (SC) for policy oversight
         - Technical Working Group (TWG) for technical oversight
       - Active solicitation of the international stakeholder community for review and comment

    10. General development goals
        - A generalized registry framework, specialized for the GDFR application
        - A network of independent but cooperating registry nodes, capable of synchronizing their holdings
        - Globally fault tolerant
        - Platform independence
        - Open source
        - Re-use well-known products and protocols
        - Human and machine interfaces, with support for localization and accessibility
        - Full information content expressible in XML form, and re-instantiatable from that expression
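The last goal, that a record be expressible in XML and re-instantiatable from that expression, amounts to a lossless round trip. The following is a minimal sketch of that property only; the `FormatRecord` class, the element and attribute names, and the identifier value are all hypothetical and do not reflect the actual GDFR schema.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class FormatRecord:
    """Hypothetical, heavily reduced registry record."""
    identifier: str
    name: str
    version: str


def to_xml(rec: FormatRecord) -> str:
    # Express the full content of the record as an XML string.
    elem = ET.Element("format", {"id": rec.identifier})
    ET.SubElement(elem, "name").text = rec.name
    ET.SubElement(elem, "version").text = rec.version
    return ET.tostring(elem, encoding="unicode")


def from_xml(text: str) -> FormatRecord:
    # Re-instantiate the record from its XML expression.
    elem = ET.fromstring(text)
    return FormatRecord(elem.get("id"), elem.findtext("name"), elem.findtext("version"))


record = FormatRecord("example/tiff-6.0", "TIFF", "6.0")  # made-up identifier
restored = from_xml(to_xml(record))
```

A round trip that returns an equal record is the kind of check that would let independent nodes exchange holdings as XML without loss.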

    11. GDFR network
        - Peer-to-peer network communicating over a common protocol

    12. Data model
        - ISO 11179, Information technology – Metadata registries (MDR)
        - LC Digital Formats Web www.digitalpreservation.gov/formats/
        - OASIS/ebXML Registry Information Model http://www.ebxml.org/specs/ebRIM.pdf
        - PRONOM www.nationalarchives.gov.uk/pronom/
        - Representation Information Registry/Repository dev.dcc.ac.uk/twiki/bin/view/Main/DCCRegRepV04
        Notes:
        - LC: Caroline Arms/Carl Fleischhauer
        - PRONOM: Adrian Brown, TNA
        - RIRR: David Giaretta, JISC DCC

    13. Format properties
        - Canonical (GDFR) and alias identifiers
        - Version
        - Description
        - Relationships – extension, containment, version, etc.
        - Dependencies – hardware, media, software
        - Classification
        - Rights
        - Developer
        - Support
        - Release date
        - Withdrawal date

    14. Format properties
        - Documentation
        - External/internal signatures – e.g. file extension/magic number
        - Byte order – big-endian, little-endian, either, both
        - Orientation – text vs. binary
        - Grammar (typed) – e.g. ABNF, BNF, BSDL, DFDL, EAST
        - Assessment (typed) – e.g. LC SQF, OCLC INFORM, DSTC PANIC, Cornell VRC
        - Processes – using format as input/output
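The signature and byte-order properties are what make automated identification possible. The sketch below checks internal signatures (magic numbers) first and falls back to the external signature (file extension). The magic numbers for GIF, TIFF, and JPEG are the well-known ones; the table layout, the `identify` function, and its return shape are assumptions made for illustration, not a GDFR interface.

```python
import os

# Internal signatures: (leading bytes, format name, byte order if the signature implies one).
INTERNAL_SIGNATURES = [
    (b"GIF87a", "GIF87a", None),
    (b"GIF89a", "GIF89a", None),
    (b"II*\x00", "TIFF", "little-endian"),
    (b"MM\x00*", "TIFF", "big-endian"),
    (b"\xff\xd8\xff", "JPEG", None),
]

# External signatures: file extensions, a weaker hint than a magic number.
EXTERNAL_SIGNATURES = {
    ".gif": "GIF", ".tif": "TIFF", ".tiff": "TIFF", ".jpg": "JPEG", ".jpeg": "JPEG",
}


def identify(data: bytes, filename: str = ""):
    """Return (format, byte_order, evidence), or (None, None, None) if unknown."""
    for magic, name, order in INTERNAL_SIGNATURES:
        if data.startswith(magic):
            return name, order, "internal"
    ext = os.path.splitext(filename)[1].lower()
    if ext in EXTERNAL_SIGNATURES:
        return EXTERNAL_SIGNATURES[ext], None, "external"
    return None, None, None
```

Note that the TIFF magic number also encodes the byte order, so a single signature match can populate two of the properties listed above.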

    15. Format properties – taxonomy
        Ontological CLASSES, abstract families, concrete formats, and relationships

        BYTESTREAM
          IMAGE
            STILL
              RASTER
                GIF
                  GIF87a
                  GIF89a is-new-version-of GIF87a
                JPEG
                  ISO 10918-1
                  JFIF is-extension-of ISO 10918-1
                TIFF
                  TIFF 4.0
                  TIFF 5.0 is-new-version-of TIFF 4.0
                  TIFF 6.0 is-new-version-of TIFF 5.0
                  TIFF/IT is-extension-of TIFF 6.0
                  TIFF/IT/CT is-subtype-of TIFF/IT
                  TIFF/IT/CT/P1 is-subtype-of TIFF/IT/CT

        Notes: TIFF/EP (ISO 12234-2), TIFF/IT (ISO 12639)
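The relationships in this taxonomy can be read as (subject, relationship, object) triples, which makes questions such as "what does this format derive from?" a simple traversal. The sketch below takes its triples directly from the slide; the `lineage` function is hypothetical and assumes each format has at most one outgoing relationship and that the graph has no cycles.

```python
# (subject, relationship, object) triples as listed on the taxonomy slide.
RELATIONSHIPS = [
    ("GIF89a", "is-new-version-of", "GIF87a"),
    ("JFIF", "is-extension-of", "ISO 10918-1"),
    ("TIFF 5.0", "is-new-version-of", "TIFF 4.0"),
    ("TIFF 6.0", "is-new-version-of", "TIFF 5.0"),
    ("TIFF/IT", "is-extension-of", "TIFF 6.0"),
    ("TIFF/IT/CT", "is-subtype-of", "TIFF/IT"),
    ("TIFF/IT/CT/P1", "is-subtype-of", "TIFF/IT/CT"),
]


def lineage(fmt, triples=RELATIONSHIPS):
    """Follow relationships from fmt back to the format it ultimately derives from."""
    parent = {subject: obj for subject, _relationship, obj in triples}
    chain = [fmt]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


tiff_chain = lineage("TIFF/IT/CT/P1")
```

For `TIFF/IT/CT/P1` the chain passes through two subtype links, one extension link, and two version links before ending at TIFF 4.0, which is the kind of dependency a preservation system would need when deciding which specifications to retain.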

    16. Format properties – relationships
        - Subtype: ASCII is-subtype-of UTF-8
        - Extension: DNG is-extension-of TIFF 6.0
        - Containment: WAVE can-contain µ-law
        - Equivalence: DXF (ASCII) is-equivalent-to DXF (binary)
        - Version: TIFF 6.0 is-version-of TIFF 5.0
        - Affinity: SPIFF is-similar-to JPEG

    17. Format properties – documentation
        - Public domain specifications managed and replicated in the network
        - For non-public domain, full bibliographic citation with actionable identifiers
        - Mechanism for agents to register locally-held copy with terms of use

    18. What is a format?
        Four conceptual entities:
        - AIM: Abstract information model
        - CIS: Coded information set (semantic)
        - SIS: Structural information set (syntactic)
        - SBS: Serialized byte stream
        Informed by the Unicode character encoding model www.unicode.org/unicode/reports/tr17/

    19. What is a format?
        Three encodings:
        - FEM: Format encoding model, FEM : AIM → CIS
        - FEF: Format encoding form, FEF : CIS → SIS
        - FES: Format encoding scheme, FES : SIS → SBS
        A format is a triple, F = (FEM, FEF, FES)
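Read this way, encoding an object is the composition of the three encodings named on the slide (FEM, FEF, FES), taking an abstract information model through a coded and a structural form to a byte stream. The sketch below is one toy instance of that composition; the temperature example, the `Format` class, and the specific byte layout are invented for illustration and are not from the presentation.

```python
import struct
from typing import Callable, NamedTuple


class Format(NamedTuple):
    """A format as a triple of encodings (hypothetical helper class)."""
    fem: Callable  # AIM -> CIS (semantic coding)
    fef: Callable  # CIS -> SIS (syntactic structure)
    fes: Callable  # SIS -> SBS (serialized bytes)

    def encode(self, aim):
        # Apply the three encodings in order.
        return self.fes(self.fef(self.fem(aim)))


def fem(readings):
    # Semantic coding: temperatures in degrees Celsius become integer tenths of a degree.
    return [round(r * 10) for r in readings]


def fef(codes):
    # Syntactic structure: a count followed by the coded values.
    return (len(codes), codes)


def fes(structure):
    # Serialization: big-endian unsigned 16-bit count, then signed 16-bit values.
    count, codes = structure
    return struct.pack(f">H{count}h", count, *codes)


toy_format = Format(fem, fef, fes)
stream = toy_format.encode([21.5, -3.0])
```

Swapping only `fes` for a little-endian variant would give a different format that shares the same coded and structural information, which is the separation of concerns the Unicode-inspired model is meant to capture.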

    20. What is a format? – TIFF

    21. What is a format?
        - Most analysis to date has been with regard to the humanities, i.e. formats useful for representing text, still image, moving image, or audio objects
        - An object corresponds to an intellectual or aesthetic work, manifest in one or more bit streams
          - Simple object – single independent bit stream
          - Complex object – logical aggregation of multiple dependent bit streams

    22. What is a format? – Content model
        - A class of complex objects sharing useful formal, structural, and behavioral properties forms a content model
          - Concept introduced informally in Fedora
          - Intended to be given formal expression in a subsequent version
        - A content model is a format

    23. What is a format? – Page turned objects
        - Structural parent (XML/METS)
        - Pre-formed display (PDF)
        - Page images (GIF, JP2, JPEG, TIFF)
        - OCR (UTF-8)

    24. What is a format? – Data set

    25. What is a format? – Data set
        - Semantically defined by the codebook, schema, etc.
        - Syntactically, a tabular (delimited) set of numbers
        - Serialized as text/binary
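The three levels of a data set can be seen by reading one back in reverse order: bytes to text, text to a delimited table, table to typed values as dictated by the codebook. The sketch below uses Python's standard `csv` module; the column names, the sample data, and the dictionary form of the codebook are all hypothetical.

```python
import csv
import io

# Hypothetical codebook: the semantic layer, giving each column its meaning and type.
codebook = {"age": int, "income": float}

# Serialized layer: a made-up byte stream holding a comma-delimited table.
raw = b"age,income\n34,51000.5\n29,48250.0\n"


def read_dataset(stream: bytes, codebook: dict, encoding: str = "utf-8"):
    text = stream.decode(encoding)              # serialization: bytes -> text
    rows = csv.DictReader(io.StringIO(text))    # syntax: delimited rows and columns
    # Semantics: interpret each cell according to the codebook.
    return [{key: codebook[key](cell) for key, cell in row.items()} for row in rows]


records = read_dataset(raw, codebook)
```

Without the codebook the table still parses, but every cell remains an uninterpreted string, which is why the codebook has to be preserved alongside the data file.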

    26. What is a format? – Relational database
        - Semantically, a collection of typed fields, along with constraints (DDL)
        - Syntactically, a collection of tables (tuples)
        - Serialized as data file(s), export file, XML, etc.
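The same three levels can be shown for a relational database with Python's standard `sqlite3` module: DDL carries the typed fields and constraints, query results are tuples, and `Connection.iterdump()` produces one possible export serialization. This is a sketch under stated assumptions; SQLite and the `reading` table are illustrative choices, not systems or schemas named in the presentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Semantic level: typed fields and constraints, expressed as DDL.
conn.execute(
    "CREATE TABLE reading ("
    " id INTEGER PRIMARY KEY,"
    " site TEXT NOT NULL,"
    " value REAL CHECK (value >= 0))"
)
conn.executemany(
    "INSERT INTO reading (site, value) VALUES (?, ?)",
    [("A", 1.5), ("B", 2.25)],
)
conn.commit()

# Syntactic level: a table is a collection of tuples.
rows = conn.execute("SELECT id, site, value FROM reading ORDER BY id").fetchall()

# Serialized level: an export file, here a SQL text dump of schema and data.
export = "\n".join(conn.iterdump())
```

The dump keeps the DDL together with the data, so the constraints travel with the export; a bare data file or a CSV extract of the same table would lose that semantic level.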

    27. For more information