Even v. More Better Metadata. SAA 2014 Panel: Metadata and Digital Preservation: How Much Do We Really Need? Andrea Goethals, Harvard Library. How much metadata do we really need? That depends on the quality of the metadata. Context of my remarks.
Even v More Better Metadata SAA 2014 Panel: Metadata and Digital Preservation: How Much Do We Really Need? Andrea Goethals, Harvard Library
How much metadata do we really need? That depends on the quality of the metadata...
Context of my remarks
• Experience developing for and now managing Harvard Library’s Digital Repository Service (DRS) (in production from 2000 – present)
• ~47 million files
• Recent multi-year overhaul of the repository to the new DRS
• Provided a chance to analyze metadata & rethink approach
Prior to the new DRS
• Almost all metadata was user-contributed
• Contributor expertise varied widely: professional labs, curators, archivists and other staff
• Very little validation of user-contributed metadata
• Metadata elements had grown organically rather than systematically. For example...
Some elements weren’t specific enough
• File format one of: ICC, GIF, JPEG, TIFF, TDF, TEXT, PCD, AIFF, RealAudio, APP, WAV, WFR, JP2, JPF, ZIP, GZIP, PDF
• Format variations and versions not recorded
Some elements were too specific
• Text abstract character repertoire one of: ‘US-ASCII’, ‘Unicode’
• Text character map one of: ‘ISO_646.irv:1983’, ‘UTF-8’
• These values weren’t validated, so in reality the text could be in any character set but would still be recorded as one of these
Some generic elements were only tracked for certain formats
• For images only:
  • enhancements
  • history
  • methodology
  • producer
  • production software
  • system
• The above elements allowed free text, leading to a variety of interpretations over time
Errors in relationship metadata
• Missing relationships (e.g. referenced in the METS descriptor file but lacking explicit relationships)
• Redundant relationships (files related more than once to the same files)
• Illogical relationships (only discoverable because of redundant metadata) – Examples:
  • Target images related to other target images
  • Non-target images described as target images
  • A METS descriptor file described as a scanned image
  • Objects merged into themselves
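Two of the error classes above (redundant relationships and objects merged into themselves) lend themselves to a simple automated check. The following is a minimal sketch, not the DRS implementation: it assumes relationships can be expressed as `(source, type, target)` triples, and the identifiers and relationship type names are hypothetical.

```python
from collections import Counter


def check_relationships(relationships):
    """Find redundant and self-referential relationships.

    `relationships` is a list of (source, type, target) triples.
    This triple shape is an assumption made for illustration only.
    """
    counts = Counter(relationships)
    # Redundant: the exact same relationship recorded more than once.
    redundant = sorted(rel for rel, n in counts.items() if n > 1)
    # Self-referential: an object related to (e.g. merged into) itself.
    self_referential = sorted(rel for rel in counts if rel[0] == rel[2])
    return redundant, self_referential


# Hypothetical identifiers and relationship types.
rels = [
    ("img-001", "has_target", "target-01"),
    ("img-001", "has_target", "target-01"),  # recorded twice
    ("obj-42", "merged_into", "obj-42"),     # merged into itself
    ("img-002", "has_target", "target-01"),
]
redundant, self_referential = check_relationships(rels)
```

Missing relationships and most illogical ones need knowledge of the content (for example, which images really are targets), so they are not covered by a structural check like this one.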
Strategies in the new DRS for improving metadata
• Pull descriptive metadata from catalogs at ingest or on request
• Automated format identification, validation & metadata extraction at ingest (FITS)
• Validation when files are ingested, added or removed, or when relationship metadata is changed
• Sync with catalogs; check and improve metadata on migration
File Information Tool Set (FITS)
• Identifies many file formats
• Validates a few file formats
• Extracts metadata from files
• Aggregates metadata from many tools
• Calculates basic file info (file size, MD5, etc.)
• Outputs technical metadata
  • Community-standard metadata schemas
• Identifies problem files
  • Conflicting tool opinions on format, metadata values
  • Unidentifiable file formats
  • Encrypted, rights metadata embedded in files
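As an illustration of the "basic file info" item above, here is a minimal sketch of computing a file's size and MD5 checksum with the Python standard library. It is not FITS code; the function name `basic_file_info` and the returned dictionary keys are assumptions made for this example.

```python
import hashlib
import os


def basic_file_info(path, chunk_size=1 << 20):
    """Return the size in bytes and the MD5 hex digest of a file.

    The file is read in chunks so large files do not have to fit in memory.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return {"size": os.path.getsize(path), "md5": md5.hexdigest()}
```

MD5 is used here only because the slide names it; as a fixity value it detects accidental corruption, not deliberate tampering.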
File Information Tool Set (FITS)
[Diagram: any file is passed to a set of wrapped tools (JHOVE, DROID, NLNZ ME, ExifTool, File utility, FFIdent, plus Tika, OIS Audio Information, ADL Tool, OIS File Information, OIS XML Metadata). Each tool's output goes through a FITS wrapper + XSL to produce FITS XML; the consolidator merges these into a single FITS XML document, and the exporter produces standard XML.]
FITS configured to get high quality metadata
• Metadata normalization
  • ‘JPEG2000’ = ‘JPEG 2000’ = ‘JPEG 2000 image’
  • ‘inches’ = ‘2’ = ‘in.’
• Plays to strengths of tools and downplays their weaknesses
  • Overall trust tool x over tool y
  • Don’t run tool x for format z
• Format tree (hierarchy of related formats)
  • ‘OpenDocument’ is more specific than ‘Zip’
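The three ideas above (normalization, tool trust order, format tree) can be combined when reconciling what several tools report about one file. The sketch below is a simplified illustration and not how FITS itself is implemented or configured; the alias tables, the tool names `tool_x` and `tool_y`, and the function names are all hypothetical.

```python
# Hypothetical alias tables: different spellings map to one normalized value.
FORMAT_ALIASES = {
    "jpeg2000": "JPEG 2000",
    "jpeg 2000": "JPEG 2000",
    "jpeg 2000 image": "JPEG 2000",
}
UNIT_ALIASES = {"inches": "in.", "2": "in.", "in.": "in."}

# Hypothetical format tree, stored as child -> parent.
FORMAT_PARENT = {"OpenDocument": "Zip"}

# Hypothetical trust order, most trusted tool first.
TOOL_TRUST = ["tool_x", "tool_y"]


def normalize_format(name):
    """Map a reported format name to its normalized spelling."""
    cleaned = name.strip()
    return FORMAT_ALIASES.get(cleaned.lower(), cleaned)


def normalize_unit(value):
    """Map a reported resolution unit to its normalized spelling."""
    cleaned = value.strip()
    return UNIT_ALIASES.get(cleaned.lower(), cleaned)


def is_more_specific(fmt, other):
    """True if `fmt` is a descendant of `other` in the format tree."""
    parent = FORMAT_PARENT.get(fmt)
    while parent is not None:
        if parent == other:
            return True
        parent = FORMAT_PARENT.get(parent)
    return False


def consolidate(opinions):
    """Reduce {tool: reported format} to (format, conflict_flag)."""
    if not opinions:
        return None, False
    names = {tool: normalize_format(n) for tool, n in opinions.items()}
    distinct = set(names.values())
    # Drop any format that is merely an ancestor of another reported format.
    specific = {
        f for f in distinct
        if not any(is_more_specific(g, f) for g in distinct)
    }
    if len(specific) == 1:
        return specific.pop(), False
    # Genuine disagreement: prefer the most trusted tool, and report a conflict.
    for tool in TOOL_TRUST:
        if names.get(tool) in specific:
            return names[tool], True
    return sorted(specific)[0], True
```

With these tables, `consolidate({"tool_x": "Zip", "tool_y": "OpenDocument"})` resolves to `OpenDocument` without a conflict, because the format tree says one answer is a refinement of the other rather than a contradiction.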
Example of what we know about a file pre- and post-FITS adoption at ingest
Additional strategies in the new DRS
• Move away from overly restrictive metadata elements where needed – Examples:
  • Allow free text for format names
  • Any text character set
• Add elements at the format-agnostic file level when they can apply to files in any format, e.g. producer or methodology
• Flag suspicious metadata (and content) for later analysis
Administrative flags
• Help pinpoint incorrect metadata, problem content or where metadata tools need improvement
• Some examples:
  • FAILED_METADATA_EXTRACTION
  • FORMAT_ID_CONFLICT
  • INCORRECT_METADATA
  • INHIBITOR
  • RIGHTS_METADATA
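A minimal sketch of how flags like these might be assigned automatically at ingest. The flag names come from the slide; everything else is assumed for illustration, including the shape of the `report` dictionary, the function name `flags_for`, and the mapping of encrypted files to INHIBITOR and embedded rights to RIGHTS_METADATA. INCORRECT_METADATA is left out of the automatic rules on the assumption that it is set after review.

```python
from enum import Enum


class AdminFlag(Enum):
    FAILED_METADATA_EXTRACTION = "FAILED_METADATA_EXTRACTION"
    FORMAT_ID_CONFLICT = "FORMAT_ID_CONFLICT"
    INCORRECT_METADATA = "INCORRECT_METADATA"
    INHIBITOR = "INHIBITOR"
    RIGHTS_METADATA = "RIGHTS_METADATA"


def flags_for(report):
    """Derive administrative flags from a per-file report.

    `report` is a hypothetical dict with the optional keys
    'extracted_metadata', 'format_opinions', 'encrypted', 'embedded_rights'.
    """
    flags = set()
    # No metadata came back from any tool.
    if not report.get("extracted_metadata"):
        flags.add(AdminFlag.FAILED_METADATA_EXTRACTION)
    # Tools reported more than one distinct format.
    if len(set(report.get("format_opinions", []))) > 1:
        flags.add(AdminFlag.FORMAT_ID_CONFLICT)
    # Assumed mapping: encryption is treated as an inhibitor.
    if report.get("encrypted"):
        flags.add(AdminFlag.INHIBITOR)
    # Assumed mapping: rights statements embedded in the file.
    if report.get("embedded_rights"):
        flags.add(AdminFlag.RIGHTS_METADATA)
    return flags
```

Because the flags are stored with the file rather than blocking ingest, flagged items can be queried later to find problem content or tools that need improvement.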
They said it better
“It is quality rather than quantity that matters.” – Lucius Annaeus Seneca
“Quality is not an act, it is a habit.” – Aristotle
“Quality is never an accident. It is always the result of intelligent effort.” – John Ruskin