The Global Digital Format Registry (GDFR) Project

CNI Fall Task Force Meeting, Washington, DC, December 10-11, 2007. The Global Digital Format Registry (GDFR) Project. Stephen Abrams, Harvard University; Andreas Stanescu, OCLC.

Presentation Transcript


  1. CNI Fall Task Force Meeting, Washington, DC, December 10-11, 2007 • The Global Digital Format Registry (GDFR) Project • Stephen Abrams, Harvard University • Andreas Stanescu, OCLC

  2. Digital preservation and format • Preservation is concerned with ensuring access to managed digital assets over time • Thus, preservation activities are focused on • Viability • Fixity • Authenticity • Interpretability • Renderability • The last two are primarily a function of format

  3. Without format typing, all content is opaque ffd8ffe000104a46494600010201 008300830000ffed0fb050686f74 6f73686f7020332e30003842494d 03e90a5072696e7420496e666f00 0000007800000000004800480000 000002f40240ffeeffee03060252 0347052803fc0002000000480048 0000000002d80228000100000064 000000010003030300000001270f 0001000100000000000000000000 0000600800190190000000000000 0000000000000000000000000000 0000000000000000000000003842 494d03ed0a5265736f6c7574696f 6e0000000010008313a3000200...

  4. Without format typing, all content is opaque ffd8ffe000104a46494600010201 008300830000ffed0fb050686f74 6f73686f7020332e30003842494d 03e90a5072696e7420496e666f00 0000007800000000004800480000 000002f40240ffeeffee03060252 0347052803fc0002000000480048 0000000002d80228000100000064 000000010003030300000001270f 0001000100000000000000000000 0000600800190190000000000000 0000000000000000000000000000 0000000000000000000000003842 494d03ed0a5265736f6c7574696f 6e0000000010008313a3000200... SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0 183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2 ...
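As a rough illustration of how format knowledge turns the hex dump into the marker list shown on the slide, the sketch below walks the first 24 bytes of that dump as JPEG marker segments. The marker table is deliberately tiny and the function name `describe_jpeg_header` is an invention for this example, not part of GDFR.

```python
import struct

# First 24 bytes of the hex dump shown on the slide.
HEADER_HEX = "ffd8ffe000104a46494600010201008300830000ffed0fb0"

# Small subset of JPEG marker codes, enough for this header.
MARKER_NAMES = {0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13"}


def describe_jpeg_header(data: bytes) -> list[str]:
    """Name the marker segments found at the start of a JPEG byte stream."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG stream: missing SOI marker")
    notes = ["SOI"]
    pos = 2
    # Each segment is: 0xFF, marker code, 2-byte big-endian length, payload.
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        name = MARKER_NAMES.get(marker, f"0x{marker:02x}")
        payload = data[pos + 4:pos + 2 + length]  # may be truncated
        if marker == 0xE0 and payload[:5] == b"JFIF\x00" and len(payload) >= 7:
            name += f" JFIF {payload[5]}.{payload[6]}"
        notes.append(name)
        pos += 2 + length  # the length field counts itself but not the marker
    return notes


print(describe_jpeg_header(bytes.fromhex(HEADER_HEX)))
```

Without the knowledge that the stream is JPEG/JFIF, none of these segment boundaries could be recovered from the bytes alone, which is the point the slide makes.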

  6. Global Digital Format Registry “The Global Digital Format Registry (GDFR) will provide sustainable services to collect, review, store, discover, and deliver significant representation information about digital formats.” • Centrally-organized collection and review • Distributed storage, discovery, and delivery on a network of independent, but cooperating registries

  7. What is a format? • “A serialized encoding of an abstract information model” • Encompasses the nominal sense of “file format” as well as a range of conceptual entities from the micro to the macro level • IEEE 754 floating point number • File system • In both cases, there are well-defined syntactic and semantic rules for mapping from information to bits, and back again
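The IEEE 754 case can be made concrete with a short sketch of the "information to bits, and back again" mapping, using Python's standard `struct` module. The helper names are chosen for this example only.

```python
import struct


def float_to_bits(value: float) -> str:
    """Serialize a number as a big-endian IEEE 754 double, shown as hex."""
    return struct.pack(">d", value).hex()


def bits_to_float(hex_bits: str) -> float:
    """Recover the number from its 8-byte IEEE 754 encoding."""
    (value,) = struct.unpack(">d", bytes.fromhex(hex_bits))
    return value


encoded = float_to_bits(1.0)
print(encoded)                 # 3ff0000000000000
print(bits_to_float(encoded))  # 1.0
```

The eight bytes are meaningless until the reader knows they follow the IEEE 754 double-precision rules; the same holds for any format at any scale.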

  8. What’s wrong with MIME types?

  9. What’s wrong with MIME types? • Non-standardized documentation • Intended for human, not machine consumption • Coarse granularity • image/tiff vs. TIFF 4.0 – 6.0; Baseline Class B, G, P, R; Extension Class Y; TIFF/EP; TIFF/IT with file types CT, LW, HC, MP, BP, BL, FP; Exif 2.0 – 2.2; GeoTIFF; TIFF/FX; DNG

  10. GDFR project • Two DLF-sponsored invitational workshops • University of Pennsylvania, January 2003 • Washington, March 2003 • Two independent demonstration projects • FRED [John Ockerbloom, University of Pennsylvania] tom.library.upenn.edu/fred/ • FOCUS [Joseph JaJa, University of Maryland] www.umiacs.umd.edu/~joseph/focus-archiving06.pdf

  11. GDFR project • Harvard University Library (HUL) funded for 2 years by the Andrew W. Mellon Foundation • Staffing and technical work subcontracted by HUL to OCLC (July 2006)

  12. GDFR project oversight • Technical Working Group (TWG) • Bibliothèque nationale de France • British Library • California Digital Library • Digital Curation Centre, UK • Library of Congress • National Archives, UK • National Archives and Records Administration • National Library of Australia • National Library of New Zealand • Stanford University • University of Pennsylvania

  13. General development goals • A generalized registry framework, specialized for the distributed GDFR application • Based on well-known products and protocols • Human and machine interfaces • Full information content expressible in XML form, and can be re-instantiated from that expression • Platform independence • Globally fault tolerant • Open source
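The XML goal above amounts to a lossless round trip: a record is written out as XML and rebuilt from it unchanged. A minimal sketch follows, assuming a flat record and made-up element names (`format`, `name`, `mimeType`, `pronom`); the actual GDFR schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET


def to_xml(record: dict[str, str]) -> str:
    """Express a flat record as XML (element names are illustrative)."""
    root = ET.Element("format")
    for key, value in record.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> dict[str, str]:
    """Re-instantiate the record from its XML expression."""
    root = ET.fromstring(text)
    return {child.tag: child.text or "" for child in root}


record = {"name": "TIFF", "mimeType": "image/tiff", "pronom": "info:pronom/fmt/7"}
assert from_xml(to_xml(record)) == record
```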

  14. GDFR data model • Consistent with PRONOM registry

  15. Identifiers • Canonical, GDFR-assigned identifier • “info” URI: info:rfa/gdfr1/Formats/1 • Other well-known identifiers • Common name: “TIFF”, “Tagged Image File Format” • MIME type: image/tiff • PRONOM identifier: info:pronom/fmt/7 • Library of Congress Format Description Document (FDD) identifier: fdd000022
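One way to picture the identifier scheme is an alias index that resolves any well-known identifier to the canonical one. The sketch below uses only the TIFF identifiers listed on the slide; the index structure and function names are assumptions for illustration, not the GDFR implementation.

```python
from typing import Optional

# Canonical identifier mapped to its other well-known identifiers (from the slide).
ALIASES = {
    "info:rfa/gdfr1/Formats/1": [
        "TIFF",
        "Tagged Image File Format",
        "image/tiff",
        "info:pronom/fmt/7",
        "fdd000022",
    ],
}


def build_index(aliases: dict[str, list[str]]) -> dict[str, str]:
    """Map every identifier, lower-cased, to its canonical identifier."""
    index = {}
    for canonical, names in aliases.items():
        index[canonical.lower()] = canonical
        for name in names:
            index[name.lower()] = canonical
    return index


def resolve(index: dict[str, str], identifier: str) -> Optional[str]:
    """Return the canonical identifier, or None if the alias is unknown."""
    return index.get(identifier.strip().lower())


index = build_index(ALIASES)
print(resolve(index, "image/tiff"))  # info:rfa/gdfr1/Formats/1
```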

  16. Classification scheme • Eight facets • Genre (required): text, still-image, sound, aggregate, … • Role (required): family, file-format, encoding, serialization • Composition: unitary, container-bundle, container-wrapper • Form: binary, text • Constraint: structured, unstructured • Basis: sampled, symbolic • Domain: astronomy, cad-cam, gis, web-archive, … • Transform: compression, encryption, message-digest, …

  17. Classification scheme • Examples • TIFF (Tagged Image File Format): genre:still-image role:family composition:container-wrapper form:binary basis:sampled • LZW (Lempel-Ziv-Welch): genre:still-image role:encoding transform:compression • SVG (Scalable Vector Graphics): genre:still-image role:file-format form:text basis:symbolic
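The faceted examples above are regular enough to parse mechanically. The following sketch reads a `facet:value` string, rejects unknown facets, and enforces the two required facets (genre and role). The function name and error handling are assumptions made for the example; value vocabularies are not checked because the slides list them only partially.

```python
FACETS = ("genre", "role", "composition", "form",
          "constraint", "basis", "domain", "transform")
REQUIRED = ("genre", "role")


def parse_classification(text: str) -> dict[str, str]:
    """Parse 'facet:value facet:value ...' into a dict, validating facet names."""
    result = {}
    for token in text.split():
        facet, sep, value = token.partition(":")
        if not sep or not value:
            raise ValueError(f"malformed token: {token!r}")
        if facet not in FACETS:
            raise ValueError(f"unknown facet: {facet!r}")
        result[facet] = value
    missing = [facet for facet in REQUIRED if facet not in result]
    if missing:
        raise ValueError(f"missing required facets: {missing}")
    return result


svg = parse_classification(
    "genre:still-image role:file-format form:text basis:symbolic")
print(svg["basis"])  # symbolic
```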

  18. Signatures • External signatures • File extension • Mac OS type • Mac OS X Uniform Type Identifiers (UTI) • Internal signatures • “Magic numbers” • Required vs. optional • Fixed vs. restricted vs. unrestricted
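To show how external and internal signatures can disagree, the sketch below checks a file name extension against a small table and the leading bytes against a few widely documented magic numbers. The tables are a tiny hand-picked sample and the `identify` function is hypothetical; a registry would supply far richer signature data.

```python
import os
from typing import Optional

# Internal signatures: fixed byte sequences at offset 0 ("magic numbers").
INTERNAL_SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (b"%PDF-", "PDF"),
]

# External signatures: file name extensions.
EXTERNAL_SIGNATURES = {
    ".jpg": "JPEG", ".jpeg": "JPEG", ".png": "PNG",
    ".tif": "TIFF", ".tiff": "TIFF", ".pdf": "PDF",
}


def identify(filename: str, head: bytes) -> tuple[Optional[str], Optional[str]]:
    """Return (format by extension, format by magic number)."""
    ext = os.path.splitext(filename)[1].lower()
    by_extension = EXTERNAL_SIGNATURES.get(ext)
    by_magic = next(
        (name for magic, name in INTERNAL_SIGNATURES if head.startswith(magic)),
        None,
    )
    return by_extension, by_magic


# A JPEG stream saved under a misleading .tif name.
print(identify("scan.tif", bytes.fromhex("ffd8ffe000104a464946")))
```

When the two answers differ, as in the usage line, the internal signature is usually the more trustworthy of the two, since extensions are trivially changed.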

  19. Grammar • Formal description of the syntactic grammar underlying a format, expressed in some formal typed notation • BNF: Backus-Naur Form • BSDL: MPEG-21 Bitstream Syntax Description Language • DFDL: Data Format Description Language • EAST: CCSDS 644.0-B-2 • XCEL: Extensible Characterisation Extraction Language

  20. Assessment • Assessment of a format, expressed in some formal typed notation • Cornell Virtual Remote Control (VRC) • DTSC PANIC • Library of Congress Sustainability, Quality, Function (SQF) • National Library of Australia AONS • OCLC INFORM

  21. Documentation • Specification documents (and software files) can be managed and distributed in the network • Applicable only in cases of public domain resources or if explicit permission is granted by rights holders • Other documents (and software) will be referenced by full citation, including actionable links where possible • Mechanism for individuals or institutions to register locally-held copies, with terms of use

  22. Software • Format role: input, output • Process type: characterize, create, edit, identify, … • Enables discovery of transformative processing chains
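Discovery of a transformative processing chain can be sketched as a shortest-path search over tools, each described by its input and output format. The tool names and the format pairs below are invented for illustration; only the idea of recording input and output roles comes from the slide.

```python
from collections import deque
from typing import Optional

# (tool name, input format, output format); all entries are hypothetical.
TOOLS = [
    ("tiff2png", "TIFF", "PNG"),
    ("png2jpeg", "PNG", "JPEG"),
    ("word2pdf", "Word 97", "PDF 1.4"),
    ("pdf2pdfa", "PDF 1.4", "PDF/A"),
]


def find_chain(tools, source: str, target: str) -> Optional[list[str]]:
    """Breadth-first search for the shortest tool chain from source to target."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for name, tool_in, tool_out in tools:
            if tool_in == fmt and tool_out not in seen:
                seen.add(tool_out)
                queue.append((tool_out, chain + [name]))
    return None  # no chain of registered tools connects the two formats


print(find_chain(TOOLS, "Word 97", "PDF/A"))  # ['word2pdf', 'pdf2pdfa']
```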

  23. Relationships • Modification: BWF → WAVE • Extension: DNG → TIFF 6.0 • Restriction: PDF/A → PDF 1.4 • Definition: NITF → XML DTD • Requisite: XML → Relax NG • Containment: ZIP → * • Equivalence: DXF (ASCII) → DXF (binary) • Version: Word 97 → Word 6.0 • Affinity: SPIFF → JPEG

  24. GDFR node • Based on the OCLC IWSA/RFA framework

  25. GDFR node • Java, Apache/Tomcat, Berkeley DB XML • GNU LGPL license • Including pre-existing OCLC technology and technology newly-developed for the project • Release schedule • v0.1 (alpha): March 23, 2007 • v0.1 (beta): June 14, 2007 • v1.0: June 30, 2007 • v1.1: August 12, 2007 • v1.3: September 17, 2007 • v1.3.1: October 26, 2007

  26. GDFR node

  27. GDFR node

  28. GDFR node

  29. GDFR network • Peer-to-peer network of independent, but cooperating registries communicating over a common protocol

  30. GDFR network • Public notification of the availability of new data • RSS feed available at well-known public address to which remote nodes can subscribe • Remote harvesting of local data • OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) • Initially, a single source (root node) for all new data
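For the harvesting step, a remote node would issue standard OAI-PMH requests against a peer. The sketch below only builds a `ListRecords` request URL; the base URL is a placeholder, and `oai_dc` is used because every OAI-PMH repository must support it, not because the slides say which metadata format GDFR nodes exchange.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None,
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date  # harvest only records changed since
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; a real node address would come from the network.
print(list_records_url("http://node.example.org/oai", from_date="2007-12-01"))
```

A subscriber could trigger such a request whenever the RSS feed announces new data, passing the date of its last successful harvest as `from_date`.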

  31. Project status • Extensive internal testing of GDFR software in a stand-alone mode • Current project activities are focused on • Implementing the distribution and synchronization functions • Building the network • Data acquisition • Succession planning

  32. Initial population • Manual addition is possible, but time consuming • Automated update using Atom • What sources are available for bulk population? • PRONOM registry www.nationalarchives.gov.uk/pronom • Library of Congress Format Description Documents (FDD) www.digitalpreservation.gov/formats/fdd/descriptions.shtml • Unix/Linux magic(4) database

  33. Subsequent population • RFC 2026, Internet Standards Process www.ietf.org/rfc/rfc2026.txt • “Iterations of review by the ... community and revision based upon experience” • Draft distribution and public discussion • Approval by “area” editors • Release to the network for distribution

  34. Sustainability • The technological solution is the (relatively) easy part, but… • The technology is expendable • The important point is for the data to survive, evolve, and expand

  35. Governance and succession • Mellon funding was for technical work only • At the end of the two-year project… • Harvard will continue maintenance for up to two years • Library of Congress has agreed to be a caretaker agency until a permanent body is identified

  36. Governance and succession • NARA GDFR governance investigation • Part of the Electronic Records Archives (ERA) initiative • GDFR Governance Workshop, November 2007 • Bibliothèque et Archives, Canada • Corp. for National Research Initiatives • Digital Curation Centre, UK • Digital Library Federation • General Services Administration • Georgia Institute of Technology • Government Printing Office • Harvard University • IBM Watson Research Center • Koninklijke Bibliotheek, Netherlands • Library of Congress • MIT • NARA • NASA • NIST • National Library of Australia • National Library of New Zealand • San Diego Supercomputer Center • Stanford University • Statens Archiv, Sweden • Tessella Support Services • University of Pennsylvania

  37. Administrative considerations • Policy • Who (and how many) can join the network? • What are the eligibility requirements? • What are the rights and obligations of membership? • Technical • Who will maintain and enhance the data model? • Who will maintain, enhance, distribute, and support the software?

  38. Administrative considerations • Data • Who will contribute data? • Who will vouch for data authenticity? • Who will ensure data integrity? • Financial • What are the real human and system costs associated with GDFR operation? • Who pays, and how?

  39. Summary • The GDFR is an enabling technology that will support digital repository and preservation activities • Supports the strong typing of digital assets at an appropriate level of granularity • Enables the future recovery of the syntax and semantics associated with typed digital assets • A means to pool and redistribute the expertise of the international digital preservation community

  40. For more information… www.formatregistry.org stephen_abrams@harvard.edu andreas_stanescu@oclc.org
