What will you need to know? The role of metadata in keeping digital content alive Robin Wendler, Harvard University Library November 2, 2005 r_wendler@harvard.edu The Crystal Ball. John William Waterhouse. Private collection
Let me count the ways digital stuff goes bad… • Media become obsolete • Media decay • Formats are superseded • Proprietary formats may be orphaned • Hardware breaks • Software is orphaned • Encryption may hinder preservation • User requirements change
How will you know? • Preservation Planning • Monitor your data through metadata for • Integrity • Renderability • Understandability • Authenticity • Identity • Responsibility • Monitor the community • Format support • Requirements
What will you do? • Identify materials at risk • Analyze options • Categorize objects • Formal characteristics • Purpose • Antecedents • Communicate with owners • Perform preservation actions • Create audit trail All utilize and/or generate metadata
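A minimal sketch of the "identify materials at risk" and "communicate with owners" steps, assuming each object's metadata is a plain dict with hypothetical `id`, `format`, and `owner` keys. The watch list of at-risk formats is invented for illustration; in practice it would come from monitoring the community.

```python
from collections import defaultdict

# Hypothetical watch list; the format names are illustrative only.
AT_RISK_FORMATS = {"WordStar 4.0", "RealAudio 3"}

def materials_at_risk(inventory):
    """Group the identifiers of at-risk objects by owner.

    `inventory` is an iterable of dicts with the assumed keys
    "id", "format", and "owner".
    """
    by_owner = defaultdict(list)
    for obj in inventory:
        if obj["format"] in AT_RISK_FORMATS:
            by_owner[obj["owner"]].append(obj["id"])
    return dict(by_owner)

inventory = [
    {"id": "obj-1", "format": "TIFF 6.0", "owner": "archives"},
    {"id": "obj-2", "format": "WordStar 4.0", "owner": "archives"},
    {"id": "obj-3", "format": "RealAudio 3", "owner": "music-library"},
]
print(materials_at_risk(inventory))
```

The result is a per-owner worklist, which is only possible because format and ownership were recorded as metadata in the first place.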
Gradual understanding • OAIS • (1st workshop 1995; Blue Book 2002) • http://www.ccsds.org/docu/dscgi/ds.py/Get/File-143/650x0b1.pdf • NLA PANDORA (1996-) • http://pandora.nla.gov.au/index.html • CEDARS (1998-2002) • http://www.leeds.ac.uk/cedars/index.htm • NEDLIB (1998-2000) • http://www.kb.nl/coop/nedlib/ • OCLC/RLG Preservation Framework Working Group (2000-2001) • http://www.oclc.org/research/projects/pmwg/wg1.htm • PREMIS (2003-2005) • http://www.oclc.org/research/projects/pmwg/
Preservation Metadata “…the information necessary to carry out, document and evaluate the processes that support the long-term retention and accessibility of digital content.” “Moving digital objects and their metadata across space and time requires standard mechanisms for encoding and exchange” – Brian Lavoie • Viewed from a preservation lens, all metadata is preservation metadata • Categories of metadata overlap; a single piece of metadata can serve many purposes
OAIS Functional Model Archival Information Systems are permeated by metadata. Metadata is the difference between a repository and just files on a disk.
OAIS Content Information Framework, Expanded by OCLC/RLG WG • OAIS Model • OCLC/RLG Extensions • Still a framework, not a set of usable, defined elements
OAIS Preservation Description Information Framework • Reference: provides identifiers and describes the mechanisms by which identifiers are assigned • Context: documents relationships of content to its environment (why created, other formats, editions) • Provenance: documents the history, changes, and custody of content • Fixity: documents data integrity checks or validation and verification keys to ensure no unauthorized changes
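Fixity can be illustrated with a message digest recorded at ingest and recomputed later. This is a minimal sketch using Python's standard `hashlib`; the function names and the choice of SHA-256 are assumptions for illustration, not something OAIS prescribes.

```python
import hashlib

def message_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of `data` under the named algorithm."""
    return hashlib.new(algorithm, data).hexdigest()

def fixity_ok(data: bytes, recorded_digest: str, algorithm: str = "sha256") -> bool:
    """Recompute the digest and compare it with the one recorded at ingest."""
    return message_digest(data, algorithm) == recorded_digest.lower()

original = b"archival master, page 1"
recorded = message_digest(original)              # stored as fixity metadata

print(fixity_ok(original, recorded))             # content unchanged
print(fixity_ok(original + b"\x00", recorded))   # content altered by one byte
```

A matching digest shows the bytes are unchanged; it says nothing about who changed them if they differ, which is why fixity sits alongside provenance.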
Metadata relevant to preservation • Storage management and fixity • Technical characteristics • Structure • Provenance • Rights • Digital signature trail, where applicable • Intellectual access / description
PREMIS: Preservation Metadata Implementation Strategies • Surveyed implementation of digital repositories, assessed adoption of metadata standards (2003/2004) • Defined a core set of implementable preservation metadata elements (2005) • Implementation-independent • Explicit or implicit • Not reinventing the wheel • Descriptive, rights, agents • Privilege automatically-suppliable values • Defined associated XML schemas • Set up ongoing maintenance activity • http://www.loc.gov/standards/premis/
PREMIS Data Model • Intellectual Entities • Rights • Objects • Agents • Events
Metadata must adhere to the “right thing”: • Representation: The set of files, including structural metadata, needed for a complete and reasonable rendition of an Intellectual Entity. • File: A named and ordered sequence of bytes that is known by an operating system. • Bitstream: Contiguous or non-contiguous data within a file that has meaningful common properties for preservation purposes. Importance of object modeling • Any can express an Intellectual Entity • All are kinds of Objects in PREMIS • All can be affected by Events • Rights adhere to all
Core Object Metadata (Yes, this is to make you sweat) • objectIdentifier • preservationLevel • objectCategory • objectCharacteristics • compositionLevel • fixity • messageDigestAlgorithm • messageDigest • messageDigestOriginator • size • format • formatDesignation • formatName • formatVersion • formatRegistry • formatRegistryName • formatRegistryKey • formatRegistryRole • significantProperties • inhibitors • inhibitorType • inhibitorTarget • inhibitorKey • creatingApplication • creatingApplicationName • creatingApplicationVersion • dateCreatedByApplication • originalName • storage • contentLocation • storageMedium • environment • environmentCharacteristic • environmentPurpose • environmentNote • dependency • dependencyName • dependencyIdentifier • software • swName • swVersion • swType • swOtherInformation • swDependency • hardware • hwName • hwType • hwOtherInformation • signatureInformation • signatureInformationEncoding • signer • signatureMethod • signatureValue • signatureValidationRules • signatureProperties • keyInformation
Significant Properties “…objective technical characteristics subjectively considered important, or subjectively determined characteristics.” • Requires identification in advance of what’s crucial, what might be at risk, and how to codify it. Mondrian. Composition with large red plane, yellow, black, gray and blue. 1921. Haags Gemeentemuseum, The Hague. Monet. Waterloo Bridge, London, at Sunset, 1904. Collection of Mr. and Mrs. Paul Mellon. National Gallery of Art.
Rights • Different flavors • Rights • Permissions • Licenses • Submission Agreements • Multiple rights languages • XrML (eXtensible rights Markup Language) • http://www.xrml.org/ • ODRL (Open Digital Rights Language) • http://odrl.net/ • Designed to support DRM • Complex • Patent/licensing issues • PREMIS Rights • Lightweight • Focused on right to preserve • Statements, rather than DRM
PREMIS Permission Statement • permissionStatementIdentifier • linkingObject • grantingAgent • grantingAgreement • permissionGranted • act • restriction • termOfGrant • startDate • endDate • permissionNote
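A statement-style permission, as opposed to DRM, can be checked with a few lines of code. The sketch below models a subset of the elements listed above as a Python dataclass; the class, its `permits` method, the agent name, and the example act name `"migrate"` are all assumptions for illustration, not part of PREMIS.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class PermissionStatement:
    """Subset of the PREMIS permission statement, modeled hypothetically."""
    permission_statement_identifier: str
    linking_object: str
    granting_agent: str
    act: str                          # permissionGranted / act
    start_date: date                  # termOfGrant / startDate
    end_date: Optional[date] = None   # termOfGrant / endDate; None = open-ended
    restriction: str = ""

    def permits(self, act: str, object_id: str, on: date) -> bool:
        """True if `act` on `object_id` falls within the term of the grant."""
        if act != self.act or object_id != self.linking_object:
            return False
        if on < self.start_date:
            return False
        return self.end_date is None or on <= self.end_date

stmt = PermissionStatement(
    permission_statement_identifier="perm-1",
    linking_object="obj-42",
    granting_agent="Depositor (hypothetical)",
    act="migrate",
    start_date=date(2005, 1, 1),
    end_date=date(2015, 12, 31),
)
print(stmt.permits("migrate", "obj-42", date(2010, 6, 1)))
print(stmt.permits("migrate", "obj-42", date(2020, 1, 1)))
```

The statement records what the repository may do; nothing here enforces it, which is the difference from a DRM-oriented rights language.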
Event Metadata • Events in the life of a digital object • What was done • Who did it • When • Who authorized it • What was the outcome • General • PREMIS Events • Specific, e.g. • AES Process History
PREMIS Events • Must be related to one or more objects • Can be related to one or more agents • Consist of • eventIdentifier • eventType • eventDateTime • eventDetail • eventOutcomeInformation • linkingAgentIdentifier • linkingObjectIdentifier
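The event elements above map naturally onto an audit-trail record. This is a minimal sketch, assuming a Python dataclass with hypothetical field names; it enforces only the rule that an event must link to at least one object, while agents stay optional. The example identifiers and event type are illustrative.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class Event:
    """Hypothetical in-memory form of a PREMIS event."""
    event_type: str
    event_detail: str
    event_outcome: str
    linking_object_identifiers: List[str]
    linking_agent_identifiers: List[str] = field(default_factory=list)
    event_identifier: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_date_time: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self):
        # An event must be related to one or more objects.
        if not self.linking_object_identifiers:
            raise ValueError("an event must link to at least one object")

audit_trail = []
audit_trail.append(Event(
    event_type="fixity check",
    event_detail="recomputed message digest and compared with stored value",
    event_outcome="success",
    linking_object_identifiers=["obj-42"],
    linking_agent_identifiers=["agent-nightly-job"],
))
print(audit_trail[0].event_type, audit_trail[0].event_outcome)
```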
Beyond PREMIS • Format-specific technical metadata • Detailed event metadata • Structural metadata / content packaging • Specific descriptive metadata
Technical Metadata • Formally characterizes • a class of objects • an individual object • Some technical metadata applies to all formats, most is specific to a category of formats, e.g. • NISO Z39.87: Technical Metadata for Still Images • http://www.niso.org/standards/resources/Z39_87_trial_use.pdf • MIX (XML schema for Z39.87): http://www.loc.gov/standards/mix// • Audio Engineering Society Core Technical Metadata for Audio (in draft) • TextMD • http://dlib.nyu.edu/METS/textmd.xsd
Structural Metadata • Not only content, but also metadata and ‘binding’ must be preserved • Enables a complex object to be assembled from its constituent parts • Content, Metadata, Relationships, Behaviors
Structural and Packaging Metadata • Many formats developed in different communities, e.g., • Digital library = METS • http://www.loc.gov/standards/mets/ • Commercial media = MPEG 21 DIDL • Available from ISO www.iso.org • Learning objects = IMS Content Packaging • http://www.imsglobal.org/content/packaging/ • Space data = XFDU – still in draft • http://www.ccsds.org/docu/dscgi/ds.py/GetRepr/File-1912/html • Audio-visual = Advanced Authoring Format (AAF) • http://www.aafassociation.org/html/techinfo/index.html • Television = Television Material Exchange Format (MXF) • Available from SMPTE www.smpte.org • No consolidation of formats, but dialog and mapping
METS Basics • METS provides a framework for • Content files • Metadata • Descriptive • Structural • Technical • Provenance • Source • Relationships • Behaviors • Suitable for • Open Archival Information Systems • Archival information package (AIP) • Submission information package (SIP) • Dissemination information package (DIP) • Display and navigation of digital objects • Sharing of digital objects among libraries and archives
RLG’s METS Viewer • Structural Metadata • Descriptive Metadata • Behaviors • Content
Structure of a METS File • metsHdr: Header describing the METS file itself • fileSec: Inventory or manifest of component files • dmdSec: Descriptive metadata • amdSec: Administrative metadata (technical, source, rights, provenance) • structMap: Structure map, the heart of METS • structLink: Structural map linking, i.e., hyperlinks • behaviorSec*: Executable behaviors * Less commonly used
Structure Map <div LABEL="Title page"> <div LABEL="title page" ORDER="1" TYPE= <fptr FILEID="A"> </div> <div LABEL="Preface"> <div LABEL="page i" ORDER="2" ORDERLABEL="i"> <fptr FILEID="B"> </div> <div LABEL="page ii" ORDER="3"> <fptr FILEID="C"> </div> <div LABEL="Chapter 1"> <div LABEL="page 1" ORDER="4"> <fptr FILEID="D"> </div> <div LABEL="page 2" ORDER="5"> <fptr FILEID="E">… Title page Preface page i page ii Chapter 1 page 1 page 2…
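The structure map is what lets software reassemble pages in reading order. A minimal sketch, assuming a completed, well-formed version of the fragment above with the METS namespace and the elided TYPE attribute left out for brevity; `page_sequence` is a hypothetical helper, parsed with Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the slide's fragment (no namespace, tags closed).
STRUCT_MAP = """
<structMap>
  <div LABEL="Title page">
    <div LABEL="title page" ORDER="1"><fptr FILEID="A"/></div>
  </div>
  <div LABEL="Preface">
    <div LABEL="page i" ORDER="2" ORDERLABEL="i"><fptr FILEID="B"/></div>
    <div LABEL="page ii" ORDER="3"><fptr FILEID="C"/></div>
  </div>
  <div LABEL="Chapter 1">
    <div LABEL="page 1" ORDER="4"><fptr FILEID="D"/></div>
    <div LABEL="page 2" ORDER="5"><fptr FILEID="E"/></div>
  </div>
</structMap>
"""

def page_sequence(xml_text):
    """Return (ORDER, LABEL, FILEID) for every div that points at a file."""
    root = ET.fromstring(xml_text)
    pages = []
    for div in root.iter("div"):
        order = div.get("ORDER")
        fptr = div.find("fptr")          # direct child only
        if order is not None and fptr is not None:
            pages.append((int(order), div.get("LABEL"), fptr.get("FILEID")))
    return sorted(pages)

for order, label, file_id in page_sequence(STRUCT_MAP):
    print(order, label, file_id)
```

Each FILEID would then be resolved against the fileSec inventory to find the actual content file.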
Referring to Metadata • METS does not define descriptive or administrative metadata elements. • dmdSec and amdSec are buckets or sockets where externally-defined metadata can be supplied or referenced • Three techniques: • In-line XML • Wrapped base-64 encoded data • Pointers to external information (e.g., URNs, handles) • METS Board endorses a range of recommended “extension schemas” • Diagram: METS file sections (metsHdr, fileSec, dmdSec, amdSec, structMap, structLink, behaviorSec)
Use of MODS Extension Schema for Descriptive Metadata <div LABEL="Reports of the president and treasurer" DMDID="D1"> <div LABEL="Chapter 1" DMDID="CH1"> <div LABEL="page 1" ORDER="3"> <fptr FILEID="D"> <div LABEL="page 2" ORDER="4"> <fptr FILEID="E">… Book Chapter 1 page 1 page 2… <dmdSec ID="D1"> <mdWrap MDTYPE="MODS"> <xmlData> <mods:mods xmlns:mods="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 …"> <mods:name> <mods:displayForm>Radcliffe College</mods:displayForm> </mods:name> <mods:titleInfo> <mods:title>Reports of the president and treasurer for...</mods:title> </mods:titleInfo> </mods:mods> </xmlData> </mdWrap> <mdRef LOCTYPE="URL" MDTYPE="MARC" xlink:href="http://... BNI3165"/> Catalog record
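The three ways of attaching metadata to a dmdSec (in-line XML, wrapped base-64 data, external pointer) can be sketched with Python's standard `xml.etree.ElementTree`. The helper names, the placeholder record bytes, and the `example.org` URL are hypothetical; the element and attribute names (mdWrap, xmlData, binData, mdRef, LOCTYPE, MDTYPE) are those of the METS schema.

```python
import base64
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
MODS = "http://www.loc.gov/mods/v3"
XLINK = "http://www.w3.org/1999/xlink"
for prefix, uri in (("mets", METS), ("mods", MODS), ("xlink", XLINK)):
    ET.register_namespace(prefix, uri)

def q(ns, tag):
    """Build a namespace-qualified tag name in ElementTree's {uri}tag form."""
    return "{%s}%s" % (ns, tag)

def dmd_inline(sec_id, title):
    """Technique 1: MODS supplied as in-line XML inside mdWrap/xmlData."""
    sec = ET.Element(q(METS, "dmdSec"), {"ID": sec_id})
    wrap = ET.SubElement(sec, q(METS, "mdWrap"), {"MDTYPE": "MODS"})
    xml_data = ET.SubElement(wrap, q(METS, "xmlData"))
    mods = ET.SubElement(xml_data, q(MODS, "mods"))
    title_info = ET.SubElement(mods, q(MODS, "titleInfo"))
    ET.SubElement(title_info, q(MODS, "title")).text = title
    return sec

def dmd_wrapped_binary(sec_id, record, mdtype="MARC"):
    """Technique 2: a non-XML record wrapped as base-64 in mdWrap/binData."""
    sec = ET.Element(q(METS, "dmdSec"), {"ID": sec_id})
    wrap = ET.SubElement(sec, q(METS, "mdWrap"), {"MDTYPE": mdtype})
    ET.SubElement(wrap, q(METS, "binData")).text = base64.b64encode(record).decode("ascii")
    return sec

def dmd_reference(sec_id, url, mdtype="MARC"):
    """Technique 3: a pointer (mdRef) to metadata held elsewhere."""
    sec = ET.Element(q(METS, "dmdSec"), {"ID": sec_id})
    ET.SubElement(sec, q(METS, "mdRef"),
                  {"LOCTYPE": "URL", "MDTYPE": mdtype, q(XLINK, "href"): url})
    return sec

print(ET.tostring(dmd_inline("D1", "Reports of the president and treasurer"),
                  encoding="unicode"))
```

In-line XML keeps the package self-contained; a pointer keeps it small but makes the package depend on the external catalog staying reachable.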
Where does all this metadata come from? • Look, Ma, no hands! (as much as possible, that is…) • Don’t make people create it • Machines are faster, cheaper, more accurate • Don’t make people read it • Use controlled values • Expect bulk preservation of like objects • Artisanal preservation is not affordable • Develop and share tools to automate creation, ingest, extraction, exchange
JHOVE: JSTOR/Harvard Object Validation Environment • Format Identification • Format Validation • Well-formedness (Syntactical) • Validity (Semantic) • Format Characterization • Modules for AIFF, ASCII, BYTESTREAM, GIF, HTML, JPEG, JPEG2000, PDF, TIFF, UTF8, WAVE, XML • http://hul.harvard.edu/jhove/
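To make identification and well-formedness concrete, here is a minimal sketch. It is not JHOVE and does not use JHOVE's API. It guesses a format from well-known leading byte signatures and checks XML syntax with Python's standard parser; the function names and the fallback label are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Leading-byte signatures for a few of the formats named above.
SIGNATURES = [
    ("GIF",  lambda b: b[:6] in (b"GIF87a", b"GIF89a")),
    ("PDF",  lambda b: b.startswith(b"%PDF-")),
    ("TIFF", lambda b: b[:4] in (b"II*\x00", b"MM\x00*")),
    ("JPEG", lambda b: b.startswith(b"\xff\xd8\xff")),
    ("WAVE", lambda b: b[:4] == b"RIFF" and b[8:12] == b"WAVE"),
    ("AIFF", lambda b: b[:4] == b"FORM" and b[8:12] in (b"AIFF", b"AIFC")),
]

def identify(head: bytes) -> str:
    """Identification: guess the format from the first bytes of a file."""
    for name, matches in SIGNATURES:
        if matches(head):
            return name
    return "BYTESTREAM"   # nothing matched: treat as an opaque byte stream

def is_well_formed_xml(text: str) -> bool:
    """Well-formedness: does the text obey XML syntax at all?"""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(identify(b"GIF89a\x01\x00"))
print(identify(b"%PDF-1.4\n"))
print(is_well_formed_xml("<a><b/></a>"))
print(is_well_formed_xml("<a><b></a>"))
```

Validity is the harder, semantic step: a well-formed file can still break the rules of its format specification, and checking that needs format-specific logic that this sketch does not attempt.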
Automatic Exposure • RLG initiative advocates for capturing standard technical metadata about digital images automatically as part of image creation: • engage manufacturers in dialog about what technical metadata their products currently capture vs what is required for digital archiving • leverage existing industry efforts • identify and evaluate tools for harvesting technical metadata and explore how those tools can scale to serve the entire community.
Format Registries • Detailed documentation of how typed content is represented • Persistent, unambiguous association between public identifiers for digital formats and their documentation • Lists of systems and services which use or produce the format • Must be inclusive, detailed, rigorous, public, and sustainable • Format Registry projects: • PRONOM • http://www.nationalarchives.gov.uk/pronom/ • Global Digital Format Registry • http://hul.harvard.edu/gdfr/ • TOM • http://tom.library.upenn.edu/ • FRED – demonstration system • http://tom.library.upenn.edu/fred/
Other Registries (Extant and Posited) • Registry of Digital Masters • “I will preserve this digital thing” • http://www.oclc.org/digitalpreservation/why/digitalregistry/default.htm • Profile registries • “I restrict this broader standard in the following ways” • Metadata Element/Schema registries • “I use these elements to mean these things” • http://www.xml.org/xml/registry.jsp • http://www.ukoln.ac.uk/projects/iemsr/ • Etc. • Environment registries • Hardware/software configurations in which given software is known to work
Digital Information Community benefits from metadata cooperation… • Develop common understanding • Crucial metadata • Standards! • Trusted repository certification • Acceptable preservation strategies • Needs and costs • Automate capture/creation of metadata • Work with equipment manufacturers • Develop open source tools • Share burden • Monitor/document digital formats • Avoid duplicate digitization