1. Harvard University Library
Format Typing for the Preservation of Data Sets and Databases
Stephen Abrams
Harvard University
Cambridge, Massachusetts, USA
stephen_abrams@harvard.edu
2. Global Digital Format Registry
“The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.”
Centrally-organized collection and review
Distributed storage, discovery, and delivery via a peer-to-peer network
3. Format and digital preservation
Preservation is concerned with ensuring access to managed digital assets over time
Thus, preservation activities are focused on
Viability
Fixity
Authenticity
Interpretability
Renderability
The last two are primarily a function of format
4. Without format typing, all content is opaque
5. Without format typing, all content is opaque
6. Without format typing, all content is opaque
Edward Burne-Jones (British, 1833-1898)
The Days of Creation: the First Day, 1870-1876
Watercolor and gouache, 102.2×35.5 cm
Fogg Art Museum, Harvard University, 1943.454
Bequest of Grenville L. Winthrop
7. What is a format?
Informally, “a serialized encoding of an abstract information model”
Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level
IEEE 754 floating point number
File system
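As a micro-level illustration of "a serialized encoding of an abstract information model", the sketch below serializes one number as an IEEE 754 binary64 value using Python's standard `struct` module. The variable names and the choice of value are assumptions made for the example; they are not part of the GDFR material.

```python
import struct

value = 1.5  # the abstract information: a real number

# Two serializations of the same value: IEEE 754 binary64 in each byte order.
big_endian = struct.pack(">d", value)
little_endian = struct.pack("<d", value)

print(big_endian.hex())     # 3ff8000000000000
print(little_endian.hex())  # 000000000000f83f

# Reading the bytes back correctly requires knowing the format,
# including its byte order.
decoded_correctly = struct.unpack(">d", big_endian)[0]
decoded_wrongly = struct.unpack("<d", big_endian)[0]
print(decoded_correctly == value, decoded_wrongly == value)  # True False
```

The same eight bytes yield a different number when the byte order is assumed wrongly, which is the sense in which untyped content is opaque.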
8. GDFR project
Two DLF-sponsored invitational workshops
University of Pennsylvania, January 2003
Washington, March 2003
Provisional data and service models
Two independent demonstration projects
FRED [John Ockerbloom, University of Pennsylvania]
http://tom.library.upenn.edu/fred/
FOCUS [Joseph JaJa, University of Maryland]
http://www.umiacs.umd.edu/~joseph/focus-archiving06.pdf
FRED: Format Registry Demonstrator, TOM (Typed Object Model)
FOCUS: Format Curation Service (LDAP)
9. The GDFR project
Harvard University Library (HUL) funded for 2 years by the Mellon Foundation
Staffing and technical work subcontracted by HUL to OCLC (July 2006)
Project oversight
Steering Committee (SC) for policy oversight
Technical Working Group (TWG) for technical oversight
Active solicitation of the international stakeholder community for review and comment
10. General development goals
A generalized registry framework, specialized for the GDFR application
A network of independent but cooperating registry nodes, capable of synchronizing their holdings
Globally fault tolerant
Platform independence
Open source
Re-use well-known products and protocols
Human and machine interfaces, with support for localization and accessibility
Full information content expressible in XML form, and re-instantiatable from that expression
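The last goal, that a registry entry can be expressed in XML and re-instantiated from that expression, amounts to a lossless round trip. A minimal sketch follows, assuming a flat record of string properties. The element names (`format`, `name`, `version`, `orientation`) are hypothetical and do not reflect the actual GDFR schema.

```python
import xml.etree.ElementTree as ET

def to_xml(record: dict) -> str:
    """Express a flat record as XML, one child element per property."""
    root = ET.Element("format")  # hypothetical root element name
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")

def from_xml(text: str) -> dict:
    """Re-instantiate the record from its XML expression."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}

record = {"name": "TIFF", "version": "6.0", "orientation": "binary"}
xml_text = to_xml(record)
print(xml_text)
print(from_xml(xml_text) == record)  # True
```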
11. GDFR network
Peer-to-peer network communicating over a common protocol
12. Data model
ISO 11179, Information technology – Metadata registries (MDR)
LC Digital Formats Web
www.digitalpreservation.gov/formats/
OASIS/ebXML Registry Information Model
http://www.ebxml.org/specs/ebRIM.pdf
PRONOM
www.nationalarchives.gov.uk/pronom/
Representation Information Registry/Repository
dev.dcc.ac.uk/twiki/bin/view/Main/DCCRegRepV04
LC: Caroline Arms/Carl Fleischhauer
PRONOM: Adrian Brown, TNA
RIRR: David Giaretta, JISC DCC
13. Format properties
Canonical (GDFR) and alias identifiers
Version
Description
Relationships – extension, containment, version, etc.
Dependencies – hardware, media, software
Classification
Rights
Developer
Support
Release date
Withdrawal date
14. Format properties
Documentation
External/internal signatures – e.g. file extension/magic number
Byte order – big-, little-endian, either, both
Orientation – text vs. binary
Grammar – ABNF, BNF, BSDL, DFDL, EAST
Assessment – LC SQF, OCLC INFORM, DSTC PANIC, VRC
Processes – using format as input/output
Typed grammar, e.g. BNF, ABNF, BSDL, DFDL, EAST
Typed assessment, e.g. LC SQF, OCLC INFORM, DSTC PANIC, Cornell VRC
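The internal-signature property (magic number) can be illustrated with a small lookup over leading bytes. This is a sketch, not the GDFR identification mechanism: the table and the `identify` function are hypothetical, and the table holds only a few widely documented signatures.

```python
# Hypothetical signature table: (leading bytes, format label).
INTERNAL_SIGNATURES = [
    (b"GIF87a", "GIF87a"),
    (b"GIF89a", "GIF89a"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"%PDF-", "PDF"),
]

def identify(data: bytes):
    """Return the label of the first signature matching the start of data, else None."""
    for magic, label in INTERNAL_SIGNATURES:
        if data.startswith(magic):
            return label
    return None

print(identify(b"GIF89a\x01\x00\x01\x00"))        # GIF89a
print(identify(b"MM\x00*\x00\x00\x00\x08"))       # TIFF (big-endian)
print(identify(b"no recognizable signature"))     # None
```

Note that the two TIFF signatures also carry the byte-order property listed on the same slide.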
15. Format properties – taxonomy
Ontological CLASSES, abstract families, concrete formats, and relationships
BYTESTREAM
  IMAGE
    STILL
      RASTER
        GIF
          GIF87a
          GIF89a is-new-version-of GIF87a
        JPEG
          ISO 10918-1
          JFIF is-extension-of ISO 10918-1
        TIFF
          TIFF 4.0
          TIFF 5.0 is-new-version-of TIFF 4.0
          TIFF 6.0 is-new-version-of TIFF 5.0
          TIFF/IT is-extension-of TIFF 6.0
          TIFF/IT/CT is-subtype-of TIFF/IT
          TIFF/IT/CT/P1 is-subtype-of TIFF/IT/CT
TIFF/EP (ISO 12234-2)
TIFF/IT (ISO 12639)
16. Format properties – relationships
Subtype: ASCII is-subtype-of UTF-8
Extension: DNG is-extension-of TIFF 6.0
Containment: WAVE can-contain µ-law
Equivalence: DXF (ASCII) is-equivalent-to DXF (binary)
Version: TIFF 6.0 is-version-of TIFF 5.0
Affinity: SPIFF is-similar-to JPEG
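Typed relationships of this kind can be held as subject-predicate-object triples and traversed. The sketch below uses relationships named on the taxonomy and relationships slides. The data structure and the function names `related` and `version_history` are assumptions for illustration, not the GDFR data model.

```python
# Triples of (subject, relationship, object), taken from the slides.
RELATIONSHIPS = [
    ("TIFF 5.0", "is-new-version-of", "TIFF 4.0"),
    ("TIFF 6.0", "is-new-version-of", "TIFF 5.0"),
    ("TIFF/IT", "is-extension-of", "TIFF 6.0"),
    ("DNG", "is-extension-of", "TIFF 6.0"),
    ("GIF89a", "is-new-version-of", "GIF87a"),
]

def related(subject, predicate):
    """Objects that the subject points to through the given relationship."""
    return [o for s, p, o in RELATIONSHIPS if s == subject and p == predicate]

def version_history(fmt):
    """Follow is-new-version-of links back to the earliest registered version."""
    history = [fmt]
    while True:
        earlier = related(history[-1], "is-new-version-of")
        if not earlier:
            return history
        history.append(earlier[0])

print(version_history("TIFF 6.0"))        # ['TIFF 6.0', 'TIFF 5.0', 'TIFF 4.0']
print(related("DNG", "is-extension-of"))  # ['TIFF 6.0']
```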
17. Format properties – documentation
Public domain specifications managed and replicated in the network
For specifications not in the public domain, a full bibliographic citation with actionable identifiers
Mechanism for agents to register a locally held copy with its terms of use
18. What is a format?
Four conceptual entities
AIM: Abstract information model
CIS: Coded information set (semantic)
SIS: Structural information set (syntactic)
SBS: Serialized byte stream
Informed by the Unicode character encoding model
www.unicode.org/unicode/reports/tr17/
19. What is a format?
Three encodings
FEM: Format encoding model, FEM : AIM → CIS
FEF: Format encoding form, FEF : CIS → SIS
FES: Format encoding scheme, FES : SIS → SBS
A format is a triple, F = (FEM, FEF, FES)
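Read as functions, the three encodings compose: applying the encoding model, then the encoding form, then the encoding scheme takes an abstract information model all the way to a serialized byte stream. The toy format below (a named colour stored as three bytes) is entirely hypothetical and serves only to make the three stages concrete.

```python
# Hypothetical toy format: a named colour serialized as three bytes.
CODED_VALUES = {"red": (255, 0, 0), "green": (0, 255, 0), "blue": (0, 0, 255)}

def fem(name):
    """Format encoding model, AIM -> CIS: map an abstract colour to coded values."""
    return CODED_VALUES[name]

def fef(coded):
    """Format encoding form, CIS -> SIS: arrange the coded values into named fields."""
    red, green, blue = coded
    return [("R", red), ("G", green), ("B", blue)]

def fes(fields):
    """Format encoding scheme, SIS -> SBS: write the fields out as a byte stream."""
    return bytes(value for _, value in fields)

def serialize(name):
    """The format as a whole: the composition of its three encodings."""
    return fes(fef(fem(name)))

print(serialize("red"))  # b'\xff\x00\x00'
```

Changing any one stage (for example, writing the fields in B, G, R order in `fef`) yields a different format over the same abstract model.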
20. What is a format? – TIFF
21. What is a format?
Most analysis to date has been with regard to the humanities, i.e. formats useful for representing text, still image, moving image, or audio objects
An object corresponds to an intellectual or aesthetic work, manifest in one or more bit streams
Simple object – single independent bit stream
Complex object – logical aggregation of multiple dependent bit streams
22. What is a format? – Content model
A class of complex objects sharing useful formal, structural, and behavioral properties forms a content model
Concept introduced informally in Fedora
Intended to be given formal expression in a subsequent version
A content model is a format
23. What is a format? – Page-turned objects
Structural parent (XML/METS)
Pre-formed display (PDF)
Page images (GIF, JP2, JPEG, TIFF)
OCR (UTF-8)
24. What is a format? – Data set
25. What is a format? – Data set
Semantically defined by the codebook, schema, etc.
Syntactically, a tabular (delimited) set of numbers
Serialized as text/binary
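The three layers of a data set can be sketched with one small table serialized two ways. The codebook, the values, and the 16-bit big-endian binary layout are all assumptions chosen for the example.

```python
import csv
import io
import struct

# Semantic layer (hypothetical codebook): what each column means.
codebook = {"year": "calendar year", "count": "number of observations"}

# Syntactic layer: a table of numbers.
rows = [(2005, 12), (2006, 30)]

# Serialization as text: comma-delimited lines.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
as_text = buffer.getvalue()

# Serialization as binary: two big-endian unsigned 16-bit integers per row.
as_binary = b"".join(struct.pack(">HH", *row) for row in rows)

# Both serializations decode to the same table once the format is known.
from_text = [tuple(int(cell) for cell in line) for line in csv.reader(io.StringIO(as_text))]
from_binary = list(struct.iter_unpack(">HH", as_binary))
print(from_text == rows, from_binary == rows)  # True True
```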
26. What is a format? – Relational database
Semantically, a collection of typed fields, along with constraints (DDL)
Syntactically, a collection of tables (tuples)
Serialized as data file(s), export file, XML, etc.
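The same layering can be shown for a relational database using Python's standard `sqlite3` module: the DDL carries the typed fields and constraints, the query result is the collection of tuples, and the SQL dump is one possible export-file serialization. The table name and values are hypothetical, and SQLite stands in for whatever database system is being preserved.

```python
import sqlite3

# Semantic layer: typed fields and a constraint, expressed as DDL.
DDL = "CREATE TABLE reading (station TEXT NOT NULL, temp_c REAL)"
QUERY = "SELECT station, temp_c FROM reading ORDER BY station"

source = sqlite3.connect(":memory:")
source.execute(DDL)
source.executemany("INSERT INTO reading VALUES (?, ?)", [("A", 21.5), ("B", 19.0)])
source.commit()

# Syntactic layer: the table as a collection of tuples.
rows = source.execute(QUERY).fetchall()

# Serialization: an export file, here the SQL text produced by iterdump().
export = "\n".join(source.iterdump())

# Re-instantiating from the export recovers the same tuples.
restored = sqlite3.connect(":memory:")
restored.executescript(export)
restored_rows = restored.execute(QUERY).fetchall()
print(restored_rows == rows)  # True
```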
27. For more information