Using JHOVE2 for Policy Assessment of Files. Richard Anderson Code4LibCon Preconference 2/7/2011 http://code4lib.org/conference/2011/schedule#preconf 13:30-16:30 : Persimmon Room. Agenda 13:30-16:30. What is JHOVE2 ? Characterization of digital objects Validation vs Assessment
Using JHOVE2 for Policy Assessment of Files Richard Anderson Code4LibCon Preconference 2/7/2011 http://code4lib.org/conference/2011/schedule#preconf 13:30-16:30 : Persimmon Room
Agenda 13:30-16:30 What is JHOVE2 ? Characterization of digital objects Validation vs Assessment Examples of JHOVE2 output Source Units, Modules, Reportable Properties Implementation of Assessment Configuration of Assessment Rules
JHOVE2 is … … a project to develop a next-generation open source framework and application for format-aware characterization … a collaborative undertaking of the California Digital Library (CDL), Portico, and Stanford University … funded by a two-year grant from the Library of Congress as part of its National Digital Information Infrastructure and Preservation Program (NDIIPP)
“What? So what?” Characterization is the automated determination of the intrinsic and extrinsic properties of a formatted object • Identification: determining the presumptive format of a digital object based on suggestive extrinsic hints and intrinsic signatures • Feature extraction: reporting the intrinsic properties of an object significant for classification, analysis, and planning • Validation • Assessment
What's new in JHOVE2? Je ne sais quoi ! Processing of multi-file objects as well as embedded objects inside files Recursive processing of container objects Plug-in Format Modules Buffered I/O Internationalized output Clean APIs and modern design patterns
API design idioms Separation of concerns Annotation and Reflection confluence.ucop.edu/display/JHOVE2Info/Background+Papers Inversion of Control (IOC) / Dependency Injection Martin Fowler martinfowler.com/articles/injection.html Spring Framework www.springsource.org/
Project Home Domain name • http://jhove2.org/ Code Repository • https://bitbucket.org/jhove2/main/wiki/Home • Public Wiki/Documentation • Browse/Clone Source Code • Download Release Packages • Changeset History • Issue Tracking Mailing lists • JHOVE2-Announce-L@listserv.ucop.edu • JHOVE2-Techtalk-L@listserv.ucop.edu
JHOVE2 Documentation Complete documentation • User’s guide • Architectural overview • Module specifications • Programmer’s guide
Validation vs. Assessment Validation is the determination of the level of conformance to the normative requirements of a format’s authoritative specification • To the extent that there is community consensus on these requirements, validation is an objective determination – Hard coded in JHOVE2 Modules Assessment is the determination of the level of acceptability for a specific purpose on the basis of locally-defined policy rules • Since these rules are locally configurable, assessment is a subjective determination – Scripted via config files
Putting it another way … Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules
Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules. Source units: • File (UTF-8) • File with embedded ByteStream(s) (TIFF with ICC profile) • Aggregate (Directory, ZIP) • ClumpSource (Shapefile)
Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules. Reportable properties: • Format Identification • Features • Validity
Assessment is the evaluation of a source unit's reportable properties against a set of policy-based rules Is the item acceptable? Is there a preservation risk? What level of preservation service? Should we flag object for future action?
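The questions on this slide can be read as a small decision function over properties that characterization has already reported. The following is a minimal sketch under stated assumptions, not JHOVE2 code: the class name, the `ServiceLevel` values, and the policy itself (reject anything not well formed, fully preserve only valid files in a locally supported format) are all hypothetical. Only the PUID `fmt/101` is taken from the sample output in this deck.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: a local policy that maps reported properties to a service level. */
final class PreservationPolicySketch {

    /** Invented service levels; a real repository would define its own. */
    enum ServiceLevel { FULL_PRESERVATION, BIT_LEVEL_ONLY, REJECT }

    /** Assumed local policy: formats this repository commits to preserve fully. */
    static final Set<String> SUPPORTED_FORMATS = new HashSet<>(Arrays.asList("fmt/101"));

    /**
     * Applies the assumed policy to three reported properties:
     * the format identifier, well-formedness, and validity.
     */
    static ServiceLevel assess(String puid, boolean wellFormed, boolean valid) {
        if (!wellFormed) {
            return ServiceLevel.REJECT;            // not acceptable at all
        }
        if (valid && SUPPORTED_FORMATS.contains(puid)) {
            return ServiceLevel.FULL_PRESERVATION; // acceptable, no flagged risk
        }
        return ServiceLevel.BIT_LEVEL_ONLY;        // keep the bits, flag for future action
    }

    public static void main(String[] args) {
        System.out.println(assess("fmt/101", true, true));
        System.out.println(assess("fmt/101", true, false));
        System.out.println(assess("fmt/101", false, false));
    }
}
```

The point of the sketch is that validity is an input to the policy, not the policy itself: the same valid file could receive a different service level at an institution with a different supported-format list.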
Practical Applications of Assessment Ingest workflows Migration workflows Digitization workflows Publishing workflows
Running JHOVE2 jhove2.sh -d Text -o outfile.txt myfile.xml Display format choices are: Text (default), JSON, and XML. The file argument can be any of: • Filename • Directory name • URL • Set of space-delimited filepaths http://bitbucket.org/jhove2/main/wiki/documents/JHOVE2-Users-Guide.pdf
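When JHOVE2 is driven from a workflow rather than typed by hand, the command line above has to be assembled programmatically. A minimal sketch, assuming only what this slide shows (the `jhove2.sh` launcher, the `-d` and `-o` options, and the three display formats); the class and method names are hypothetical, and the sketch only builds the argument list without launching anything.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that assembles the jhove2.sh command line shown on the slide. */
final class Jhove2CommandSketch {

    /** Display formats listed on the slide. */
    static final List<String> DISPLAY_FORMATS = Arrays.asList("Text", "JSON", "XML");

    /** Builds: jhove2.sh -d DISPLAY -o OUTPUT input1 input2 ... */
    static List<String> buildCommand(String display, String outputFile, String... inputs) {
        if (!DISPLAY_FORMATS.contains(display)) {
            throw new IllegalArgumentException("Unknown display format: " + display);
        }
        if (inputs.length == 0) {
            throw new IllegalArgumentException("At least one file, directory, or URL is required");
        }
        List<String> command = new ArrayList<>();
        command.add("jhove2.sh");
        command.add("-d");
        command.add(display);
        command.add("-o");
        command.add(outputFile);
        command.addAll(Arrays.asList(inputs)); // files, directories, or URLs
        return command;
    }

    public static void main(String[] args) {
        List<String> command = buildCommand("Text", "outfile.txt", "myfile.xml");
        System.out.println(String.join(" ", command));
        // To actually run it (requires JHOVE2 on the PATH):
        // new ProcessBuilder(command).inheritIO().start();
    }
}
```

Passing the arguments as a list rather than one concatenated string avoids quoting problems when file paths contain spaces.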
JHOVE2 Output options • Input File • xml-schemaLocation-cannot-resolve.xml • Text • text-output.txt • XML • xml-output.xml • JSON • json-output.txt
JHOVE2 Output FileSource: Path: E:\samples\xml\schema-sample.xml Size (byte): 9516 LastModified: 2010-10-12T11:55:29-06:00 SourceName: schema-sample.xml StartingOffset (byte): 0 …
Format Identification PresumptiveFormats: PresumptiveFormat {FormatIdentification}: NativeIdentifier {I8R}: Namespace: PUID Value: fmt/101 PRONOM Identifier JHOVE2Identifier {I8R}: Namespace: JHOVE2 Value: http://jhove2.org/terms/format/xml ...
PRONOM Format Registry http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=638 Name Extensible Markup Language Version 1.0 Other names XML (1.0) Identifiers PUID: fmt/101 Apple Uniform Type Identifier: public.xml MIME: text/xml Classification Text (Mark-up) Description The Extensible Markup Language (XML) is a general purpose markup language for creating other, special purpose, markup languages, and is a simplified subset of SGML. …
Agent used for Identification Module {DROIDIdentifier}: SignatureFile: …/DROID_SignatureFile_V20.xml Version: 2.0.0 ReleaseDate: 2010-09-10 WrappedProduct: Name: DROID Version: 4.0.0 ReleaseDate: 2009-07-23 ...
DROID http://sourceforge.net/projects/droid/ DROID (Digital Record Object Identification) is an automatic file format identification tool. It is the first in a planned series of tools developed by The National Archives under the umbrella of its PRONOM technical registry service
XML Module Module {XmlModule}: SaxParser: Parser: org.apache.xerces.parsers.SAXParser XmlDeclaration: Version:1.0 Encoding: UTF-8 Standalone: no RootElement: Name: mets Namespace: http://www.loc.gov/METS/
XML Module (namespaces) NamespaceInformation: NamespaceCount: 2 Namespaces: Namespace: URI: http://www.loc.gov/METS/ Declarations: Prefix: [default] SchemaLocations: SchemaLocation: Location: http://www.loc.gov/standards/mets/version15/mets.xsd Namespace: URI: http://www.loc.gov/mix/v10 Declarations: Prefix: mix
XML Module (cont) ValidationResults: ParserWarnings {ValidationMessageList}: ValidationMessageCount: 0 ParserErrors {ValidationMessageList}: ValidationMessageCount: 0 FatalParserErrors {ValidationMessageList}: ValidationMessageCount: 0 isWellFormed: true isValid: true
Format Modules from JHOVE2 Team • ICC color profile • JPEG 2000 • PDF • SGML • Shapefile • TIFF • UTF-8 • WAVE • XML • Zip JHOVE2 can identify (by DROID) many more formats than it can validate (by modules)
Other Module Development 3rd party development activities • NetCDF and GRIB modules (Wegener Institute) • Integration with DuraCloud (DuraSpace) • ARC module (Bibliothèque nationale de France) • WARC, JPEG, GIF modules (CDL, hopefully ;-) Possible development efforts • Additional format modules • Configuration GUIs • JHOVE2-as-a-service • Integration with DAITSS, DSpace, Fedora, FITS, etc. Suggestions, volunteers and funders welcome
AssessmentModule Module {AssessmentModule}: AssessmentResultSets: AssessmentResultSet: RuleSetName: XmlRuleSet RuleSetDescription: RuleSet for Xml Module ObjectFilter: org.jhove2.module.format.xml.XmlModule BooleanResult: true AssessmentResults: AssessmentResult: RuleName: XmlValidityRule RuleDescription: Is the XML file acceptable? BooleanResult: true NarrativeResult: Acceptable
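The output above pairs each rule with a BooleanResult and a NarrativeResult, and gives the rule set its own BooleanResult. The sketch below models that shape only. It is not the JHOVE2 implementation: the class and method names are hypothetical, the "Not acceptable" narrative is invented, and treating the rule set's result as the conjunction of its rules is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the rule/result shape seen in the AssessmentModule output. */
final class AssessmentResultSketch {

    /** One rule outcome: a boolean plus a human-readable narrative. */
    static final class Result {
        final String ruleName;
        final boolean booleanResult;
        final String narrativeResult;

        Result(String ruleName, boolean booleanResult, String narrativeResult) {
            this.ruleName = ruleName;
            this.booleanResult = booleanResult;
            this.narrativeResult = narrativeResult;
        }
    }

    /** Picks the narrative that matches the outcome of the rule's condition. */
    static Result evaluate(String ruleName, boolean conditionHolds,
                           String narrativeIfTrue, String narrativeIfFalse) {
        return new Result(ruleName, conditionHolds,
                conditionHolds ? narrativeIfTrue : narrativeIfFalse);
    }

    /** Assumption: a rule set passes only when every rule in it passes. */
    static boolean overall(List<Result> results) {
        for (Result result : results) {
            if (!result.booleanResult) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        boolean isValid = true; // would come from the XML module's reported isValid property
        List<Result> results = new ArrayList<>();
        results.add(evaluate("XmlValidityRule", isValid, "Acceptable", "Not acceptable"));

        System.out.println("BooleanResult: " + overall(results));
        for (Result r : results) {
            System.out.println(r.ruleName + ": " + r.booleanResult + " (" + r.narrativeResult + ")");
        }
    }
}
```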
JHOVE2 Abstractions • Source Unit • Module • Reportable • Reportable Property • Message
Source Unit A formatted object about which characterization information can be meaningfully reported • Unitary • File e.g. UTF-8 text file • File inside of a container e.g. TIFF inside a Zip • Byte stream inside a file e.g. ICC inside a TIFF • Aggregate • Directory • Directory inside of a container • Clump e.g. Shapefile • File set e.g. command line arguments For purposes of characterization, directories, file sets, and clumps are considered format types
Source Interface (Java)
public Set<FormatIdentification> getPresumptiveFormats() {
    return presumptiveFormatIdentifications;
}

public List<Module> getModules() {
    return this.modules;
}

public List<Source> getChildSources() {
    return this.children;
}
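Because every source exposes its child sources, a container such as a Zip holding a TIFF with an embedded ICC profile forms a tree that can be walked recursively. A minimal sketch of that traversal, using a hypothetical `Node` class as a stand-in for the real `Source` type; only the `getChildSources()` method name is taken from the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a depth-first walk over nested source units. */
final class SourceTreeSketch {

    /** Stand-in for a source unit; not the JHOVE2 Source class. */
    static final class Node {
        private final String name;
        private final List<Node> children = new ArrayList<>();

        Node(String name) {
            this.name = name;
        }

        /** Adds a child and returns this node so calls can be chained. */
        Node add(Node child) {
            children.add(child);
            return this;
        }

        List<Node> getChildSources() {
            return children;
        }
    }

    /** Returns one line per source, indented two spaces per nesting level. */
    static List<String> outline(Node root) {
        List<String> lines = new ArrayList<>();
        collect(root, 0, lines);
        return lines;
    }

    private static void collect(Node node, int depth, List<String> lines) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            line.append("  ");
        }
        lines.add(line.append(node.name).toString());
        for (Node child : node.getChildSources()) {
            collect(child, depth + 1, lines); // recurse into contained sources
        }
    }

    public static void main(String[] args) {
        Node zip = new Node("archive.zip")
                .add(new Node("image.tif").add(new Node("profile.icc")))
                .add(new Node("notes.txt"));
        for (String line : outline(zip)) {
            System.out.println(line);
        }
    }
}
```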
Format Module • implements Parser • implements Validator • implements Reportable • imports org.jhove2.annotation.ReportableProperty
public long parse(JHOVE2 jhove2, Source source, Input input) {
    // extract features and
    // fill in the reportable properties fields
    . . .
}
Reportables A Reportable is a named set of properties Reportables correspond to Java classes Including classes for sources and modules Also define reportables for the major conceptual structures inherent to a format JPEG 2000: Box TIFF: IFH, IFD, IFD entry (“tag”) UTF-8: Character stream, character WAVE: Chunk
Reportable Interface
package org.jhove2.core;

public interface Reportable {
    public I8R getReportableIdentifier();
    public String getReportableName();
    public void setReportableName(String name);
}

public abstract class AbstractReportable implements Reportable {
    protected I8R reportableIdentifier;
    protected String reportableName;
}
A reportable class implements the Reportable marker interface
ReportableProperties A ReportableProperty is a named, typed value • org.jhove2.annotation.ReportableProperty • Unique formal identifier • Data type • Scalar or collection • Java types, JHOVE2 primitive types, or JHOVE2 reportables • Typed value • Description of correct semantic interpretation • Properties correspond to fields
ReportableProperty Annotation Each reportable property is represented by a field plus accessor and mutator methods. The accessor method must be marked with the @ReportableProperty annotation.
public class MyReportable implements Reportable {
    protected String myProperty;

    @ReportableProperty(order=1, desc="description", ref="reference")
    public String getMyProperty() {
        return this.myProperty;
    }

    public void setMyProperty(String property) {
        this.myProperty = property;
    }
}
Wave Reportable Properties chunks[ ] formatChunkNotBeforeDataChunkMessage missingRequiredFormatChunkMessage missingRequiredDataChunkMessage missingRequiredFactChunkMessage isValid childChunks[ ] hasPadByte identifier isValid size
UTF-8 Reportable Properties byteOrderMark c0Characters c1Characters codeBlocks eOLMarkers invalidCharacters[ ] isValid numCharacters numLines numNonCharacters c0Control c1Control codeBlock codePoint codePointOutOfRange coverage invalidByteValues isByteOrderMark isC0Control isC1Control isNonCharacter isValid size
Fields for the reportable properties
protected String saxParser = "org.apache.xerces.parsers.SAXParser";
protected XmlDeclaration xmlDeclaration = new XmlDeclaration();
protected String xmlRootElementName;
protected List<XmlDTD> xmlDTDs;
protected HashMap<String,XmlNamespace> xmlNamespaceMap;
protected List<XmlNotation> xmlNotations;
protected List<String> xmlCharacterReferences;
protected List<XmlEntity> xmlEntitys;
protected List<XmlProcessingInstruction> xmlProcessingInstructions;
protected List<String> xmlComments;
protected XmlValidationResults xmlValidationResults;
protected boolean wellFormed;
Getter methods for reportable properties
import org.jhove2.annotation.ReportableProperty;

@ReportableProperty(order = 1, value = "Java class used to parse the XML")
public String getSaxParser() {
    return saxParser;
}

@ReportableProperty(order = 2, value = "XML Declaration data")
public XmlDeclaration getXmlDeclaration() {
    return xmlDeclaration;
}

@ReportableProperty(order = 3, value = "Name of the document's root element")
public String getXmlRootElementName() {
    return xmlRootElementName;
}
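Annotated getters like these can be discovered at run time with reflection, which is how a displayer can report properties without knowing each module's class in advance. The sketch below illustrates that general technique and makes no claim about JHOVE2 internals: `Prop` is a hypothetical stand-in for `org.jhove2.annotation.ReportableProperty`, and `XmlSample` and `report` are invented for the example.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: find annotated getters by reflection and report them in order. */
final class ReportableReflectionSketch {

    /** Stand-in annotation; retained at run time so reflection can see it. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    @interface Prop {
        int order() default 1;
        String value() default "";
    }

    /** Toy reportable whose getters are declared out of display order on purpose. */
    static final class XmlSample {
        @Prop(order = 2, value = "Name of the document's root element")
        public String getXmlRootElementName() {
            return "mets";
        }

        @Prop(order = 1, value = "Java class used to parse the XML")
        public String getSaxParser() {
            return "org.apache.xerces.parsers.SAXParser";
        }
    }

    /** Returns "PropertyName: value" lines sorted by the annotation's order element. */
    static List<String> report(Object reportable) {
        List<Method> getters = new ArrayList<>();
        for (Method method : reportable.getClass().getMethods()) {
            if (method.isAnnotationPresent(Prop.class)) {
                getters.add(method);
            }
        }
        getters.sort(Comparator.comparingInt(m -> m.getAnnotation(Prop.class).order()));

        List<String> lines = new ArrayList<>();
        for (Method getter : getters) {
            String name = getter.getName();
            if (name.startsWith("get")) {
                name = name.substring(3); // getSaxParser -> SaxParser
            }
            try {
                lines.add(name + ": " + getter.invoke(reportable));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("Cannot read property " + name, e);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : report(new XmlSample())) {
            System.out.println(line);
        }
    }
}
```

Keeping the display order in the annotation rather than in the displayer means a new format module can control how its properties are listed without any change to the reporting code.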
Messages
if (position == start && ch.isByteOrderMark()) {
    Object[] messageParms = new Object[] {position};
    this.bomMessage = new Message(
        Severity.INFO,
        Context.OBJECT,
        "org.jhove2.module.format.utf8.UTF8Module.bomMessage",
        messageParms);
}
Messages Messages are reportable properties Unique identifier info:jhove2/message/… Context Process Condition arising from the process of characterization Object Condition arising in the object being characterized Severity Error Warning Info Internationalizable
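A message becomes internationalizable when the code carries only a key and parameters, and the text comes from a per-locale template. The sketch below shows that pattern with the standard `java.text.MessageFormat` class. It is an illustration under assumptions, not JHOVE2 code: the class, the `render` method, and the template wording are hypothetical, and an in-memory map stands in for a localized properties file. The message key and the severity and context values come from the slides.

```java
import java.text.MessageFormat;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: render a keyed, parameterized message from a template. */
final class MessageSketch {

    enum Severity { ERROR, WARNING, INFO }

    enum Context { PROCESS, OBJECT }

    /** Stand-in for a localized properties file; the template text is invented. */
    static final Map<String, String> TEMPLATES = new HashMap<>();
    static {
        TEMPLATES.put("org.jhove2.module.format.utf8.UTF8Module.bomMessage",
                "Byte order mark found at byte offset {0}");
    }

    /** Looks up the template for the key and fills in the parameters. */
    static String render(Severity severity, Context context, String key, Object... params) {
        // Fall back to the raw key when no template is registered.
        String template = TEMPLATES.getOrDefault(key, key);
        String text = new MessageFormat(template, Locale.ROOT).format(params);
        return "[" + severity + "/" + context + "] " + text;
    }

    public static void main(String[] args) {
        long position = 0L;
        System.out.println(render(Severity.INFO, Context.OBJECT,
                "org.jhove2.module.format.utf8.UTF8Module.bomMessage", position));
    }
}
```

Swapping the template map for one loaded per locale would change the displayed language without touching the module that raised the message.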