DLF Spring Forum, New York, May 14-16, 2003. Global Digital Format Registry. Stephen L. Abrams, Harvard University Library; MacKenzie Smith, Massachusetts Institute of Technology
Why Do We Need a Registry? • Repository functions are performed on a format-specific basis • Interpretation of otherwise opaque content streams depends on knowledge of how typed content is represented • Interchange requires mutual agreement on format syntax and semantics
Potential Use Cases • Identification • “I have a digital object; what format is it?” • Validation • “I have an object purportedly of format F; is it?” • Transformation • “I have an object of format F, but need G; how can I produce it?” • Characterization • “I have an object of format F; what are its significant properties?” • Risk assessment • “I have an object of format F; is it at risk of obsolescence?” • Delivery • “I have an object of format F; how can I render it?”
Repository Format Dependencies • Ingest • Validation • SIP-to-AIP • Access • AIP-to-DIP • Rendering • Preservation planning • Migration • Emulation • UVC
What’s Wrong with MIME Types? • Insufficient depth of detail • Syntax and semantics • Public and proprietary • Insufficient granularity • Both tiled RGB TIFF with LZW and striped bi-tonal TIFF with Group 4 → image/tiff • All of PDF 1.0 – 1.4, PDF/X-1 – 3, and PDF/A → application/pdf
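The granularity problem can be made concrete with a small sketch. The `TiffProfile` class and its field names are assumptions made for illustration, not part of any registry design; the point is only that two clearly different TIFF variants receive the same MIME label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TiffProfile:
    """Hypothetical fine-grained description of a TIFF variant."""
    layout: str       # e.g. "tiled" or "striped"
    color: str        # e.g. "RGB" or "bi-tonal"
    compression: str  # e.g. "LZW" or "Group 4"


def mime_type(profile: TiffProfile) -> str:
    # MIME has one label for every TIFF variant, so the profile is ignored.
    return "image/tiff"


tiled_rgb = TiffProfile("tiled", "RGB", "LZW")
striped_bitonal = TiffProfile("striped", "bi-tonal", "Group 4")

# The two profiles differ, yet their MIME types are indistinguishable.
profiles_differ = tiled_rgb != striped_bitonal
same_mime = mime_type(tiled_rgb) == mime_type(striped_bitonal)
```

A repository that stores only the MIME type cannot tell these two objects apart when it later has to choose a decoder or a migration path.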
A Bit of History • During summer 2002 the Harvard LDI and MIT DSpace teams met to discuss shared concerns • DLF-sponsored invitational meetings • Ad-hoc committee • Collected use cases • Working groups on data and governance models
Ad-Hoc Committee • Bibliothèque nationale de France • California Digital Library • Digital Library Federation • Harvard University • IETF • JISC • JSTOR • Library of Congress • MIT • NARA • National Archives of Canada • New York University • NIST • OCLC • Public Records Office, UK • RLG • Stanford University • University of Pennsylvania
Global Digital Format Registry • The registry will maintain persistent, unambiguous bindings between public identifiers for digital formats and representation information for those formats.
What is a Format? • A format is a fixed, byte-serialized encoding of an information model. • No assumption regarding byte size • An information model is a formal expression of exchangeable knowledge
What is Representation Information? • Representation information maps typed formats into more meaningful concepts by capturing the significant syntactic and semantic properties of those formats. • Significant properties are those aspects of a format that are the primary carriers of the format’s intellectual value
Data Model • Registry • Format • Descriptive • General descriptive properties • Characterization • Technical syntactic/semantic properties • Processing • Services and systems using format as input or output • Administrative • Provenance
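One way to picture the four property groups is as nested records under a format entry. This is a minimal sketch: every class name, field name, and the identifier string below are hypothetical, chosen only to mirror the slide's grouping.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Descriptive:
    identifier: str  # hypothetical public identifier
    name: str = ""


@dataclass
class Characterization:
    specifications: List[str] = field(default_factory=list)  # technical documents


@dataclass
class Processing:
    services: List[str] = field(default_factory=list)  # systems using the format


@dataclass
class Administrative:
    provenance: str = ""


@dataclass
class FormatEntry:
    descriptive: Descriptive
    characterization: Characterization = field(default_factory=Characterization)
    processing: Processing = field(default_factory=Processing)
    administrative: Administrative = field(default_factory=Administrative)


@dataclass
class Registry:
    formats: Dict[str, FormatEntry] = field(default_factory=dict)

    def register(self, entry: FormatEntry) -> None:
        # Bind the public identifier to its representation information.
        self.formats[entry.descriptive.identifier] = entry

    def describe(self, identifier: str) -> FormatEntry:
        return self.formats[identifier]


registry = Registry()
registry.register(FormatEntry(Descriptive("example:pdf-1.4", "PDF 1.4")))
entry = registry.describe("example:pdf-1.4")
```

Keeping the groups as separate records lets a registry disclose, review, or export each group independently.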
Informative, not Evaluative • The format properties stored in the registry should be factual, not judgmental. • Legal liability • May discourage deposit of proprietary information • Investigate ways to include (by reference?) third party evaluations/recommendations • Insofar as this doesn’t hamper primary goal
Data Model Sources • ISO 14721, Open archival information system -- Reference model • CCSDS OAIS reference model • Representation information • Interpret, or provide “additional meaning” to Data Object • Structure and semantic information • PRONOM • Public Records Office, UK • “information about file formats and the application software needed to open them” • Format, vendor, product
Data Model Sources • Diffuse • EC’s Information Society Technologies programme • “reference and guidance information on available and emerging standards and specifications” • Business Guides • “application of standards and specifications in specific areas” • OCLC/RLG Preservation Metadata Framework • “information necessary to render/display, understand, and interpret the Content Data Object” • Based on CEDARS, NEDLIB, NLA, OAIS, and OCLC metadata
Data Model Sources • NIST National Software Reference Library • File profiles for the NSRL Reference Data Set • Vendor, product, operating system • Used for forensic identification • Media features • Protocol-independent content negotiation • Selection of an “appropriate representation” of a resource • RFCs 2506, 2533, 2534
Data Model Sources • Typed Object Model (TOM) • “model for identifying and describing data formats … distributed system of ‘type brokers’ that maintain and interpret these descriptions” • Format is aggregate of type (attributes, operations, semantics) and encoding • JISC File Format Representation and Rendering Project • Assessment of formats and rendering software • Representation system to track formats and their rendering software
Data Model Sources • Bitstream Syntax Description Language • MPEG-21 content adaptation • XML schema to model multimedia bitstreams • Useful for administrative properties and data types: • ISO/IEC 11179, Specification and standardization of data elements • OASIS/ebXML Registry Information Model
Data Model
High-Level Format Properties
Descriptive Properties • Identifiers • Canonical and alias • Arbitrary relationships • Equivalence • Encapsulation • Sub-typing, with strict substitutability • PDF 1.0 ← … ← PDF 1.4 ← PDF/A • XML ← SVG • Versioning • Ontological classification
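Strict substitutability means that an object of a sub-type can be used wherever its super-type is expected. A minimal sketch, assuming sub-typing is stored as child-to-parent links: the `SUPERTYPE` table and the function name are hypothetical, and only the two links stated explicitly on the slide are included (the intermediate PDF versions elided with "…" are left out).

```python
from typing import Dict, Optional

# Hypothetical child -> parent links, limited to those named on the slide.
SUPERTYPE: Dict[str, str] = {
    "PDF/A": "PDF 1.4",
    "SVG": "XML",
}


def is_substitutable(candidate: str, expected: str,
                     supertype: Optional[Dict[str, str]] = None) -> bool:
    """Return True if `candidate` is `expected` or a (transitive) sub-type of it."""
    links = SUPERTYPE if supertype is None else supertype
    seen = set()
    current: Optional[str] = candidate
    while current is not None and current not in seen:
        if current == expected:
            return True
        seen.add(current)              # guard against accidental cycles
        current = links.get(current)   # climb one level, or stop at the root
    return False
```

A tool that can process PDF 1.4 can therefore accept a PDF/A object, but not the reverse: `is_substitutable("PDF 1.4", "PDF/A")` is false.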
Format Ontology: Content stream Logical Numeric Scalar Integer Unsigned Real Floating point Complex Text Structured text Mark-up language Programming language Message Mail News Image Still Font Outline Raster Graphic Vector Raster Page description Motion Audio Music Application CAD Communication Database Executable GIS Presentation Spreadsheet Word processing Transformation Compression Lossless Lossy Container File system Transfer 7-bit safe Physical media Magnetic Disk Tape Reel Cartridge Optical Disk CD-ROM DVD Film Paper Card Tape
Characterization Properties • Specification documents • Actionable links • Public identifiers • Hard copy • Public, on-site, license, and escrow access • Signatures • External • File extension, Mac OS data fork type • Internal • Magic number
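The two kinds of signature can be illustrated with a short sketch. The magic numbers shown (`%PDF-` for PDF, `II*\x00` and `MM\x00*` for TIFF) are the well-known leading bytes of those formats; the tables, the function name, and the preference for internal over external evidence are assumptions made for this example.

```python
import os
from typing import Optional, Tuple

# Internal signatures: leading bytes ("magic numbers") of the content stream.
INTERNAL_SIGNATURES = {
    b"%PDF-": "PDF",
    b"II*\x00": "TIFF",   # little-endian TIFF header
    b"MM\x00*": "TIFF",   # big-endian TIFF header
}

# External signatures: hints carried outside the content, here file extensions.
EXTERNAL_SIGNATURES = {".pdf": "PDF", ".tif": "TIFF", ".tiff": "TIFF"}


def identify(data: bytes, filename: str = "") -> Tuple[Optional[str], Optional[str]]:
    """Return (format name, kind of evidence), or (None, None) if unknown."""
    # Prefer internal evidence, since a file extension is easily wrong.
    for magic, name in INTERNAL_SIGNATURES.items():
        if data.startswith(magic):
            return name, "internal"
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTERNAL_SIGNATURES:
        return EXTERNAL_SIGNATURES[extension], "external"
    return None, None
```

Signatures of this kind only identify a broad format family; distinguishing finer-grained variants needs the characterization properties as well.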
Centralized vs. Distributed • Allowing arbitrary granularity may lead to an explosion of registered formats • Versions • Local profiles • Typed relationships support internal and external references • Enable distributed architecture without mandating it
Core Registry Services • Management Services • Approval • Level of review, level of public disclosure • Maintenance • Add, update, delete format entries • Notification • Notify registry clients of new/updated format or trigger events (e.g. obsolescence, new transformation service, etc.) • Introspection • Determine local policies (scope, coverage, implemented services, etc.) of a given registry to identify appropriate registry to use
Core Registry Services • Access Services • Description • Representation information returned on request for single format • Export • Entire registry or selected subset sent to external repository
Supported Services • Representation Services • Identification services • Determine format of a specific digital object (DO) by comparing its attributes to the attribute profiles retrieved from the registry • Validation services • Verify format of a specific DO by comparing its attributes to the attribute profile retrieved from the registry for that format.
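Identification and validation differ only in how many profiles are consulted: identification compares an object's attributes against every profile, validation against the one profile for the claimed format. A minimal sketch, assuming profiles are flat attribute dictionaries; the profile names, attribute keys, and function names are all hypothetical.

```python
from typing import Dict, List

Attributes = Dict[str, str]

# Hypothetical attribute profiles as they might be retrieved from a registry.
PROFILES: Dict[str, Attributes] = {
    "TIFF tiled RGB LZW": {
        "container": "TIFF", "layout": "tiled",
        "color": "RGB", "compression": "LZW",
    },
    "TIFF striped bi-tonal Group 4": {
        "container": "TIFF", "layout": "striped",
        "color": "bi-tonal", "compression": "Group 4",
    },
}


def matches(attributes: Attributes, profile: Attributes) -> bool:
    # Every attribute required by the profile must be present with that value.
    return all(attributes.get(key) == value for key, value in profile.items())


def identify(attributes: Attributes, profiles: Dict[str, Attributes]) -> List[str]:
    """Identification: which registered formats is this object consistent with?"""
    return [name for name, profile in profiles.items() if matches(attributes, profile)]


def validate(attributes: Attributes, claimed: str,
             profiles: Dict[str, Attributes]) -> bool:
    """Validation: is this object really of the format it claims to be?"""
    return claimed in profiles and matches(attributes, profiles[claimed])


scan = {"container": "TIFF", "layout": "striped",
        "color": "bi-tonal", "compression": "Group 4", "width": "2550"}
```

Extra attributes on the object (such as `width` above) do not prevent a match; only the attributes named in a profile are compared.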
Supported Services • Brokerage Services • Rendering service • Identify current rendering conditions for supplied DO • Transformation service • Convert DO from current (source) format to target format • Metadata Extraction services • Registry returns information supporting automated extraction of attribute metadata from a DO of a specific format
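When no single service converts the source format directly to the target, a broker could chain services. The sketch below assumes the registry exposes transformation services as (source, target) pairs and uses a breadth-first search to find the shortest chain; the function name and the placeholder formats F, G, and H are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple


def find_transformation_path(services: Iterable[Tuple[str, str]],
                             source: str, target: str) -> Optional[List[str]]:
    """Return the shortest chain of formats from source to target, or None."""
    # Build an adjacency list from the registered (source, target) services.
    graph: Dict[str, List[str]] = {}
    for src, dst in services:
        graph.setdefault(src, []).append(dst)

    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:        # visit each format at most once
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Placeholder services: F can be converted to H, and H to G.
services = [("F", "H"), ("H", "G")]
path = find_transformation_path(services, "F", "G")
```

Breadth-first search minimizes the number of conversion steps, which matters because each step may lose information; a real broker would probably also weigh the quality of each service.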
Service Model Sources • ANSI X3.285, Metamodel for Management of Shareable Data • Service model for ISO/IEC 11179 • IANA MIME media type registry • OASIS/ebXML Registry Services Specification
Registry Operation • The registry is valuable insofar as it is trustworthy and sustainable. • Trust is necessary to encourage deposit of proprietary information • Sustainability is necessary to justify expense • As for all preservation activities, how do we generate income today, for services not needed until tomorrow?
Registry Operation • Is the registry self-populating, or a public bulletin board? • Will registry staff collect and manage representation information, or • Will knowledgeable community members submit information? • What is the level of technical review, and by whom? • IETF model
Governance Model • Can this initiative reasonably be placed under the umbrella of an existing organization? • Is global scope in conflict with national prerogatives? • How to build sufficient trust models • Governance model becomes more important as the operational model becomes more pro-active (distributed and contributory)
Business Model • Costs depend on level of quality and authority required (e.g. wiki vs. OCLC) • Assuming the registry needs to be cost-recovered, options for supporting “common good” services include: • Subsidy • Subscription • Pay to submit • Format registration accompanied by fee • Pay to view • Queries on a for-fee basis • Added-value services
Next Steps • Tell people what we’re doing • National, academic, private libraries/archives • Standards bodies • Commercial • Regulated industries • Software vendors (developers and consumers of formats) • Publishers • Anyone with long-term digital preservation needs • Refine documentation for a general audience • Vision statement and high-level project plan
Next Steps • Look for project funding • Potentially two phases: • Design and implementation • Can be funded through grants, in-kind participation • Operational • Need reliable, sustainable income stream • Planning grant to sustain initial activity • Data and service models • Governance and business model • Development and operations plan • Library of Congress NDIIPP and/or JISC (UK) Digital Curation Centre
Why Is This Important to You? • If you care about the long-term usability of your digital assets: • The registry will allow typing of digital objects at an appropriate level of granularity • The registry will allow the recovery in the future of the syntax and semantics associated with typed digital objects • The registry is an enabling technology underlying digital repository operations and preservation activities
… thanks! hul.harvard.edu/formatregistry stephen_abrams@harvard.edu kenzie@mit.edu