
Global Digital Format Registry

DLF Spring Forum New York, May 14-16, 2003. Global Digital Format Registry. Stephen L. Abrams Harvard University Library MacKenzie Smith Massachusetts Institute of Technology. Why Do We Need a Registry?. Repository functions are performed on a format-specific basis





Presentation Transcript


  1. DLF Spring Forum New York, May 14-16, 2003 Global Digital Format Registry Stephen L. Abrams Harvard University Library MacKenzie Smith Massachusetts Institute of Technology

  2. Why Do We Need a Registry? • Repository functions are performed on a format-specific basis • Interpretation of otherwise opaque content streams is dependent upon knowledge of how typed content is represented • Interchange requires mutual agreement of format syntax and semantics

  3. Potential Use Cases • Identification • “I have a digital object; what format is it?” • Validation • “I have an object purportedly of format F; is it?” • Transformation • “I have an object of format F, but need G; how can I produce it?” • Characterization • “I have an object of format F; what are its significant properties?” • Risk assessment • “I have an object of format F; is it at risk of obsolescence?” • Delivery • “I have an object of format F; how can I render it?”

  4. Repository Format Dependencies • Ingest • Validation • SIP-to-AIP • Access • AIP-to-DIP • Rendering • Preservation planning • Migration • Emulation • UVC

  5. Repository Format Dependencies

  6. Repository Format Dependencies

  7. What’s Wrong with MIME Types? • Insufficient depth of detail • Syntax and semantics • Public and proprietary • Insufficient granularity • Both tiled RGB TIFF with LZW and striped bi-tonal TIFF with Group 4 → image/tiff • All of PDF 1.0 – 1.4, PDF/X-1 – 3, and PDF/A → application/pdf
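The granularity problem on this slide can be made concrete with a small sketch. It is illustrative only: the format descriptions are taken from the slide, while the `FORMATS` table and the `group_by_mime` helper are hypothetical names introduced here, not part of any registry design.

```python
from collections import defaultdict

# Fine-grained format descriptions (from the slide) paired with the single
# MIME type each one is labeled with. The table itself is a hypothetical
# illustration, not registry content.
FORMATS = [
    ("TIFF, tiled RGB, LZW compression", "image/tiff"),
    ("TIFF, striped bi-tonal, Group 4 compression", "image/tiff"),
    ("PDF 1.0", "application/pdf"),
    ("PDF 1.4", "application/pdf"),
    ("PDF/A", "application/pdf"),
]


def group_by_mime(formats):
    """Group fine-grained format descriptions under their MIME type."""
    groups = defaultdict(list)
    for description, mime in formats:
        groups[mime].append(description)
    return dict(groups)


groups = group_by_mime(FORMATS)
for mime, descriptions in groups.items():
    # Several technically distinct formats collapse into one label.
    print(f"{mime}: {len(descriptions)} distinct formats")
```

A repository that records only the MIME type cannot tell these variants apart later, which is the gap a finer-grained identifier is meant to close.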

  8. A Bit of History During summer 2002 the Harvard LDI and MIT DSpace teams met to discuss shared concerns. • DLF-sponsored invitational meetings • Ad-hoc committee • Collected use cases • Working groups on data and governance models

  9. Ad-Hoc Committee • Bibliothèque nationale de France • California Digital Library • Digital Library Federation • Harvard University • IETF • JISC • JSTOR • Library of Congress • MIT • NARA • National Archives of Canada • New York University • NIST • OCLC • Public Record Office, UK • RLG • Stanford University • University of Pennsylvania

  10. Global Digital Format Registry The registry will maintain persistent, unambiguous bindings between public identifiers for digital formats and representation information for those formats.

  11. What is a Format? A format is a fixed, byte-serialized encoding of an information model. • No assumption regarding byte size • An information model is a formal expression of exchangeable knowledge

  12. What is Representation Information? Representation information maps typed formats into more meaningful concepts by capturing the significant syntactic and semantic properties of those formats. • Significant properties are those aspects of a format that are the primary carriers of the format’s intellectual value

  13. Data Model • Registry • Format • Descriptive • General descriptive properties • Characterization • Technical syntactic/semantic properties • Processing • Services and systems using format as input or output • Administrative • Provenance
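One way to picture the four property groups is as fields on a format entry, with the registry binding public identifiers to entries. The sketch below is a minimal illustration under assumptions: the class names, field names, and the `example:` identifier are hypothetical and do not come from the presentation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FormatEntry:
    # Descriptive: general descriptive properties (identifiers here)
    canonical_id: str
    aliases: List[str] = field(default_factory=list)
    # Characterization: technical syntactic/semantic properties
    specifications: List[str] = field(default_factory=list)
    magic_number: Optional[bytes] = None
    # Processing: services and systems using the format as input or output
    services: List[str] = field(default_factory=list)
    # Administrative: provenance of the entry
    provenance: str = ""


class Registry:
    """Binds public identifiers (canonical or alias) to format entries."""

    def __init__(self):
        self._entries = {}
        self._alias_to_canonical = {}

    def register(self, entry):
        if entry.canonical_id in self._entries:
            raise ValueError(f"already registered: {entry.canonical_id}")
        self._entries[entry.canonical_id] = entry
        for alias in entry.aliases:
            self._alias_to_canonical[alias] = entry.canonical_id

    def lookup(self, identifier):
        # Resolve an alias to its canonical identifier, if it is one.
        canonical = self._alias_to_canonical.get(identifier, identifier)
        return self._entries[canonical]


registry = Registry()
registry.register(
    FormatEntry(
        canonical_id="example:format/tiff",  # hypothetical identifier scheme
        aliases=["image/tiff"],
        magic_number=b"II*\x00",  # little-endian TIFF header
        provenance="entered by hand for illustration",
    )
)
```

Looking up either `"example:format/tiff"` or the alias `"image/tiff"` returns the same entry, which is the "unambiguous binding" idea in miniature.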

  14. Informative, not Evaluative The format properties stored in the registry should be factual, not judgmental. • Legal liability • May discourage deposit of proprietary information • Investigate ways to include (by reference?) third party evaluations/recommendations • Insofar as this doesn’t hamper primary goal

  15. Data Model Sources • ISO 14721, Open archival information system -- Reference model • CCSDS OAIS reference model • Representation information • Interpret, or provide “additional meaning” to Data Object • Structure and semantic information • PRONOM • Public Record Office, UK • “information about file formats and the application software needed to open them” • Format, vendor, product

  16. Data Model Sources • Diffuse • EC’s Information Society Technologies programme • “reference and guidance information on available and emerging standards and specifications” • Business Guides • “application of standards and specifications in specific areas” • OCLC/RLG Preservation Metadata Framework • “information necessary to render/display, understand, and interpret the Content Data Object” • Based on CEDARS, NEDLIB, NLA, OAIS, and OCLC metadata

  17. Data Model Sources • NIST National Software Reference Library • File profiles for the NSRL Reference Data Set • Vendor, product, operating system • Used for forensic identification • Media features • Protocol-independent content negotiation • Selection of an “appropriate representation” of a resource • RFCs 2506, 2533, 2534

  18. Data Model Sources • Typed Object Model (TOM) • “model for identifying and describing data formats … distributed system of ‘type brokers’ that maintain and interpret these descriptions” • Format is aggregate of type (attributes, operations, semantics) and encoding • JISC File Format Representation and Rendering Project • Assessment of formats and rendering software • Representation system to track formats and their rendering software

  19. Data Model Sources • Bitstream Syntax Description Language • MPEG-21 content adaptation • XML schema to model multimedia bitstreams • Useful for administrative properties and data types: • ISO/IEC 11179, Specification and standardization of data elements • OASIS/ebXML Registry Information Model

  20. Data Model

  21. High-Level Format Properties

  22. Descriptive Properties • Identifiers • Canonical and alias • Arbitrary relationships • Equivalence • Encapsulation • Sub-typing, with strict substitutability • PDF 1.0 ← … ← PDF 1.4 ← PDF/A • XML ← SVG • Versioning • Ontological classification
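Sub-typing with strict substitutability can be read as a directed relation in which an object of a subtype is acceptable wherever its supertype is expected. The following is a rough sketch under that reading; the `SUBTYPE_OF` table and `is_substitutable` function are hypothetical names, and the versions elided by "…" on the slide are deliberately not filled in.

```python
# Each format maps to its immediate supertype, following the slide's arrows
# (subtype points at supertype). Assumption: the elided chain between
# PDF 1.0 and PDF 1.4 is collapsed into a single edge rather than guessed.
SUBTYPE_OF = {
    "PDF/A": "PDF 1.4",
    "PDF 1.4": "PDF 1.0",
    "SVG": "XML",
}


def is_substitutable(fmt, expected):
    """Return True if an object of `fmt` may be used where `expected` is required."""
    current = fmt
    seen = set()
    # Walk up the supertype chain; `seen` guards against accidental cycles.
    while current is not None and current not in seen:
        if current == expected:
            return True
        seen.add(current)
        current = SUBTYPE_OF.get(current)
    return False
```

Under this table, a PDF/A object is acceptable wherever PDF 1.0 is expected, but not the other way round, and SVG is acceptable wherever XML is expected.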

  23. Format Ontology • Content stream Logical Numeric Scalar Integer Unsigned Real Floating point Complex Text Structured text Mark-up language Programming language Message Mail News Image Still Font Outline Raster Graphic Vector Raster Page description Motion Audio Music Application CAD Communication Database Executable GIS Presentation Spreadsheet Word processing Transformation Compression Lossless Lossy Container File system Transfer 7-bit safe Physical media Magnetic Disk Tape Reel Cartridge Optical Disk CD-ROM DVD Film Paper Card Tape

  24. Characterization Properties • Specification documents • Actionable links • Public identifiers • Hard copy • Public, on-site, license, and escrow access • Signatures • External • File extension, Mac OS data fork type • Internal • Magic number
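The distinction between internal and external signatures can be sketched as follows. This is an assumption-laden illustration: the `SIGNATURES` table and the `identify` function are hypothetical, and the table holds a few well-known magic numbers typed in by hand rather than data retrieved from any registry.

```python
from pathlib import PurePath

# (format name, internal magic number at offset 0, external file extensions)
# Hand-written illustrative table, not registry content.
SIGNATURES = [
    ("PDF", b"%PDF-", {".pdf"}),
    ("TIFF (little-endian)", b"II*\x00", {".tif", ".tiff"}),
    ("TIFF (big-endian)", b"MM\x00*", {".tif", ".tiff"}),
    ("PNG", b"\x89PNG\r\n\x1a\n", {".png"}),
]


def identify(filename, head):
    """Return (format, evidence) pairs for a file name and its leading bytes.

    Internal signatures (magic numbers) are preferred; the file extension
    is consulted only when no magic number matches.
    """
    internal = [
        (name, "internal") for name, magic, _ in SIGNATURES if head.startswith(magic)
    ]
    if internal:
        return internal
    ext = PurePath(filename).suffix.lower()
    return [(name, "external") for name, _, exts in SIGNATURES if ext in exts]
```

Preferring the internal signature means a PDF saved with a `.tif` extension is still reported as PDF, while a file with no readable header falls back to the weaker, possibly ambiguous, extension match.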

  25. Centralized vs. Distributed • Allowing arbitrary granularity may lead to an explosion of registered formats • Versions • Local profiles • Typed relationships support internal and external references • Enable distributed architecture without mandating it

  26. Core Registry Services • Management Services • Approval • Level of review, level of public disclosure • Maintenance • Add, update, delete format entries • Notification • Notify registry clients of new/updated format or trigger events (e.g. obsolescence, new transformation service, etc.) • Introspection • Determine local policies (scope, coverage, implemented services, etc.) of a given registry to identify appropriate registry to use
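The notification service listed above resembles a publish/subscribe arrangement. The sketch below shows one possible shape, assuming clients subscribe per format identifier; the `NotificationService` class, its method names, and the `example:` identifiers are all hypothetical.

```python
from collections import defaultdict


class NotificationService:
    """Minimal publish/subscribe hub keyed by format identifier."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, format_id, callback):
        # Register a client callback for events about one format.
        self._subscribers[format_id].append(callback)

    def notify(self, format_id, event):
        # Deliver the event to every subscriber; return how many were told.
        callbacks = self._subscribers.get(format_id, [])
        for callback in callbacks:
            callback(format_id, event)
        return len(callbacks)


received = []
service = NotificationService()
service.subscribe(
    "example:format/F", lambda fmt, event: received.append((fmt, event))
)
service.notify("example:format/F", "new transformation service")
service.notify("example:format/G", "obsolescence")  # no subscribers for G
```

Only the event for format F reaches the client, since nothing subscribed to G.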

  27. Core Registry Services • Access Services • Description • Representation information returned on request for single format • Export • Entire registry or selected subset sent to external repository

  28. Supported Services • Representation Services • Identification services • Determine format of a specific digital object (DO) by comparing its attributes to the attribute profiles retrieved from the registry • Validation services • Verify format of a specific DO by comparing its attributes to the attribute profile retrieved from the registry for that format.
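Both services compare an object's attributes against attribute profiles: validation checks one named profile, identification tries them all. A minimal sketch, assuming a profile is a mapping from attribute name to its set of allowed values; the profile contents, attribute names, and function names are hypothetical.

```python
# Hypothetical attribute profiles as they might be retrieved from a registry:
# attribute name -> set of allowed values.
PROFILES = {
    "TIFF, bi-tonal, Group 4": {"compression": {"Group 4"}, "bits_per_sample": {1}},
    "TIFF, RGB, LZW": {"compression": {"LZW"}, "bits_per_sample": {8}},
}


def validate(attributes, profile):
    """Return a list of mismatches; an empty list means the object conforms."""
    problems = []
    for name, allowed in profile.items():
        if name not in attributes:
            problems.append(f"missing attribute: {name}")
        elif attributes[name] not in allowed:
            problems.append(f"{name}={attributes[name]!r} not allowed")
    return problems


def identify_by_profile(attributes, profiles):
    """Return the names of all profiles the object's attributes satisfy."""
    return [name for name, profile in profiles.items() if not validate(attributes, profile)]
```

For example, an object with `{"compression": "LZW", "bits_per_sample": 8}` validates against the second profile only, so identification reports that single name.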

  29. Supported Services • Brokerage Services • Rendering service • Identify current rendering conditions for supplied DO • Transformation service • Convert DO from current (source) format to target format • Metadata Extraction services • Registry returns information supporting automated extraction of attribute metadata from a DO of a specific format
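A transformation broker could answer the "I have F, but need G" use case by searching the registered transformation services for a chain from source to target. The sketch below uses breadth-first search, which finds a chain with the fewest steps; the service list, service names, and `find_transformation_path` are hypothetical illustrations.

```python
from collections import deque

# Hypothetical registered transformation services: (source, target, service name)
SERVICES = [
    ("F", "X", "f-to-x"),
    ("X", "G", "x-to-g"),
    ("F", "Y", "f-to-y"),
]


def find_transformation_path(source, target, services):
    """Return the shortest list of service names converting source to target.

    Returns [] when source already equals target, and None when no chain
    of registered services reaches the target.
    """
    graph = {}
    for src, dst, name in services:
        graph.setdefault(src, []).append((dst, name))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        fmt, path = queue.popleft()
        if fmt == target:
            return path
        for next_fmt, name in graph.get(fmt, []):
            if next_fmt not in visited:
                visited.add(next_fmt)
                queue.append((next_fmt, path + [name]))
    return None
```

Fewest steps is only one possible preference; a real broker might instead weigh information loss or cost per service, which this sketch does not model.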

  30. Service Model Sources • ANSI X3.285, Metamodel for Management of Shareable Data • Service model for ISO/IEC 11179 • IANA MIME media type registry • OASIS/ebXML Registry Services Specification

  31. Registry Operation The registry is valuable insofar as it is trustworthy and sustainable. • Trust is necessary to encourage deposit of proprietary information • Sustainability is necessary to justify expense • As for all preservation activities, how do we generate income today, for services not needed until tomorrow?

  32. Registry Operation Is the registry self-populating, or a public bulletin board? • Will registry staff collect and manage representation information, or • Will knowledgeable community members submit information? • What is the level of technical review, and by whom? • IETF model

  33. Governance Model • Can this initiative reasonably be placed under the umbrella of an existing organization? • Is global scope in conflict with national prerogatives? • How to build sufficient trust models • Governance model becomes more important as the operational model becomes more pro-active (distributed and contributory)

  34. Business Model • Costs depend on level of quality and authority required (e.g. wiki vs. OCLC) • Assuming the registry needs to be cost-recovered, options for supporting “common good” services include: • Subsidy • Subscription • Pay to submit • Format registration accompanied by fee • Pay to view • Queries on a for-fee basis • Added-value services

  35. Next Steps • Tell people what we’re doing • National, academic, private libraries/archives • Standards bodies • Commercial • Regulated industries • Software vendors (developers and consumers of formats) • Publishers • Anyone with long-term digital preservation needs • Refine documentation for a general audience • Vision statement and high-level project plan

  36. Next Steps • Look for project funding • Potentially two phases: • Design and implementation • Can be funded through grants, in-kind participation • Operational • Need reliable, sustainable income stream • Planning grant to sustain initial activity • Data and service models • Governance and business model • Development and operations plan • Library of Congress NDIIPP and/or JISC (UK) Digital Curation Centre

  37. Why Is This Important to You? • If you care about the long-term usability of your digital assets: • The registry will allow typing of digital objects at an appropriate level of granularity • The registry will allow the recovery in the future of the syntax and semantics associated with typed digital objects • The registry is an enabling technology underlying digital repository operations and preservation activities

  38. … thanks! hul.harvard.edu/formatregistry stephen_abrams@harvard.edu kenzie@mit.edu
