180 likes | 331 Views
Preservation and Archiving Special Interest Group Spring Meeting San Francisco, 27-29 May 2008 Preservation Characterization. Stephen Abrams California Digital Library Stephen.Abrams@ucop.edu. Characterization. /ker-ik-t(ə-)rə-zā ' -shən/ noun 1. The action or result of characterizing.
E N D
Preservation and Archiving Special Interest Group Spring MeetingSan Francisco, 27-29 May 2008Preservation Characterization Stephen Abrams California Digital Library Stephen.Abrams@ucop.edu
Characterization • /ker-ik-t(ə-)rə-zā'-shən/ noun 1. The action or result of characterizing. 2. Description of characteristics or essential features.
Characterization • Knowing what you have, as a stable starting point for iterative preservation analysis, planning, and action Adopted from A. Brown, “Developing Practical Approaches to Active Preservation,” IJDC 1:2 (June 2007).
What? So what? • What do you have? • Identification • Feature extraction • Conformance • What should you do with what you have? • Assessment
Two approaches to characterization • Implicit • Custom grammars defining a single format processed by a generic engine that understands all grammars • Unix file • National Archives (UK) DROID • Open Grid Forum DFDL • Planets XCEL/XCDL • Explicit • Plug-in framework with custom modules that each understand a single format • NLNZ Metadata Extractor • JHOVE
Why choose one over the over? • Implicit • Pro More sustainable in the long term • Con Is the formal notation rich enough to capture all nuances of formats of interest? • Explicit • Pro It’s just programming • Con It’s more programming
JHVE • Extensible framework for format identification, validation, and characterization • Pluggable format-specific modules for: • GIF, JPEG, JPEG 2000, TIFF • AIFF, WAVE • ASCII, HTML, UTF-8, XML • PDF • GUI, command-line, and Java API • Collaborative project of Harvard University and the JSTOR Electronic-Archive Initiative • Funded by Andrew W. Mellon Foundation • GNU LGPL license
JHVE2 • A next generation architecture for format-aware preservation processing • Three-fold goals: • Re-factor the existing architecture to achieve higher performance, simplify system integration, and encourage third-party enhancement • Provide significant new function • (Re-) Implement modules • Collaborative project of CDL, Portico, and Stanford University • Funded by Library of Congress/NDIIPP • Open source BSD license
JHVE2 enhancements • JHOVE assumed 1 object = 1 file = 1 format • But what about… • TIFF with embedded ICC profile and XMP metadata 1 object = 1 file = 3 formats • JPEG 2000 JPX fragmentation 1 object = n files = 1 format • ESRI Shapefile 1 object = 3 files = 3 formats • JHOVE2 will support 1 object = n files = m formats
JHVE2enhancements • Generic plug-in interface • Configurable set of modules iteratively invoked against each object • Inter-module memory structure for stateful processing • Identification de-coupled from conformance • Standardized handling of format profiles and error reporting • Configurable conformance criteria • API level support for limited editing
JHVE2 modules • Identification • Feature extraction and conformance for: • GIF, JPEG, JPEG 2000, TIFF • AIFF, WAVE • ASCII, HTML, SGML, UTF-8, XML • PDF • Shapefile • ICC • Symbolic display of selected binary formats • Assessment based on prior characterization and locally-defined policy rules and heuristics
JHVE2 modules • Identification • Feature extraction and conformance for: • GIF, JPEG, JPEG 2000, TIFF • AIFF, WAVE • ASCII, HTML, SGML, UTF-8, XML • PDF • Shapefile • ICC • Symbolic display of selected binary formats • Assessment based on prior characterization and locally-defined policy rules and heuristics
JHVE2 modules • Identification • Feature extraction and conformance for: • JPEG 2000, TIFF • WAVE • ASCII, SGML, UTF-8, XML • PDF • Shapefile • ICC • Symbolic display of selected binary formats • Assessment based on prior characterization and locally-defined policy rules and heuristics
JHVE2 data abstraction • Determine the “natural” conceptual structures of a format and their component attributes • Each such structure maps to a class with methods for parsing, validating, reporting, and serializing • Each such attribute maps to a field with accessor and mutator methods • UTF-8 Character • TIFF IFH and IFD • JPEG 2000 Box • PDF boolean, number, string, name, array, dictionary, stream, and null
JHVE2 timeline • Months 1-6 Outreach, design, and prototyping • Months 7-9 Core APIs and framework • Months 10-24 Modules
For more information… www.significantproperties.org.uk droid.sourceforge.net forge.gridforum.org/projects/dfdl-wg hki.uni-koeln.de/planets/ meta-extractor.sourceforge.net hul.harvard.edu/jhove www.ucop.edu:8080/display/JHOVE2Info/Home