FITS: The File Information Tool Set
Background
• FITS is part of the second-generation Harvard University Library Digital Repository Service (DRS2), which supports content models and METS/PREMIS object descriptors.
• Developed in Fall 2008
• First public release in Spring 2009: http://fits.googlecode.com
Why?
• We needed an automatic way to identify files and extract metadata for a wide range of file types
• No single file analysis tool satisfied our needs
Design Goals
• Act as a wrapper around other open source tools
• Extensible
• Work as a standalone command-line tool and also provide an API
• Allow priority setting for tools
• Open source
The Tools
• Current tools:
  • Jhove 1.5
  • Exiftool
  • National Library of New Zealand Metadata Extractor (NLNZ)
  • DROID
  • FFIdent
  • File Utility
• Three categories:
  • File identification (all of them)
  • Metadata extraction (Jhove, Exiftool, NLNZ)
  • Format validation (Jhove)
Features
• Conflict management
• Value normalization
  • e.g. “inches” vs “2” (the same value as reported by different tools)
• Tool prioritization
• Format tree for understanding more specific format identities
  • e.g. PDF/A is a more specific version of PDF
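The three features above interact, so a small sketch may help. This is not FITS's consolidator: the `Report` record, the `NORMALIZED` table and the `AGREE`/`CONFLICT` labels are hypothetical, and the reported values are made-up data. It assumes “2” is the TIFF resolution-unit code for inches, normalizes values before comparing them, and lets the highest-priority tool win when tools disagree.

```python
from dataclasses import dataclass

# Assumed synonym table: TIFF stores the resolution unit as a code, 2 = inches.
NORMALIZED = {("resolutionUnit", "2"): "inches"}


@dataclass
class Report:
    """One value reported by one tool (hypothetical record type)."""
    tool: str
    field: str
    value: str


def normalize(field, value):
    """Map tool-specific spellings of a value onto one canonical form."""
    value = value.strip()
    return NORMALIZED.get((field, value), value)


def consolidate(reports, priority):
    """Return {field: (value, status, winning_tool)}.

    `priority` lists tool names from most to least preferred.
    """
    rank = {tool: i for i, tool in enumerate(priority)}
    by_field = {}
    # Stable sort: reports from higher-priority tools come first per field.
    for r in sorted(reports, key=lambda r: rank[r.tool]):
        by_field.setdefault(r.field, []).append((normalize(r.field, r.value), r.tool))

    result = {}
    for field, entries in by_field.items():
        distinct = {value for value, _ in entries}
        if len(entries) == 1:
            status = "SINGLE_RESULT"
        elif len(distinct) == 1:
            status = "AGREE"      # hypothetical label
        else:
            status = "CONFLICT"   # hypothetical label
        value, tool = entries[0]  # highest-priority tool wins
        result[field] = (value, status, tool)
    return result


# Made-up reports for illustration only.
reports = [
    Report("Exiftool", "resolutionUnit", "inches"),
    Report("Jhove", "resolutionUnit", "2"),
    Report("Jhove", "height", "1024"),
    Report("Exiftool", "height", "1000"),
]
merged = consolidate(reports, priority=["Jhove", "Exiftool"])
```

Here the two resolution-unit reports agree once normalized, while the two heights conflict and the Jhove value is kept because Jhove is listed first.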
Example Output
<fits>
  <identification>
    <identity format="Graphics Interchange Format" mimetype="image/gif">
      <tool toolname="Jhove" toolversion="1.5" />
      ...
    </identity>
  </identification>
  <fileinfo>
    <size toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">40149</size>
    <md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">265c9345ebf93c89d472766fda095de4</md5checksum>
    ...
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <image>
      <height toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">1024</height>
      ...
    </image>
  </metadata>
</fits>
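Output in this shape is easy to consume with any XML parser. The sketch below uses Python's standard library on a trimmed copy of the example above, with the elided parts left out. The `summarize` helper is hypothetical, and the sketch assumes the document carries no XML namespace, as in the sample; real output may declare one, in which case the paths would need namespace-qualified names.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example output (elided parts omitted).
SAMPLE = """<fits>
  <identification>
    <identity format="Graphics Interchange Format" mimetype="image/gif">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
  </identification>
  <fileinfo>
    <size toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">40149</size>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <image>
      <height toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">1024</height>
    </image>
  </metadata>
</fits>"""


def summarize(fits_xml):
    """Flatten a FITS-style document into {name: value} pairs (hypothetical helper)."""
    root = ET.fromstring(fits_xml)
    identity = root.find("identification/identity")
    summary = {
        "format": identity.get("format"),
        "mimetype": identity.get("mimetype"),
        "identified_by": [t.get("toolname") for t in identity.findall("tool")],
    }
    for section in ("fileinfo", "filestatus", "metadata"):
        node = root.find(section)
        if node is None:
            continue
        # Every reported value carries the tool name and a status attribute.
        for element in node.iter():
            if element.get("toolname") and element.text:
                summary[element.tag] = (
                    element.text.strip(),
                    element.get("toolname"),
                    element.get("status"),
                )
    return summary


summary = summarize(SAMPLE)
```

Each extracted value keeps the name of the tool that reported it and its status, which is what makes conflicts traceable downstream.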
Configuration
• All settings are in the fits.xml config file
• Enable/disable tools (available in the API too)
• Prevent tools from processing files with specific file extensions
• Set tool priority
• Add new tools
• Use your own consolidator code
• Report or ignore conflicts
• Options to display original tool output
Sample Configuration File
<fits_configuration>
  <!-- Order of the tools determines preference -->
  <tools>
    <!-- exclude-exts attribute is a comma delimited list of file extensions that the tool should not try to process -->
    <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" exclude-exts="dng,mbx"/>
    <tool class="edu.harvard.hul.ois.fits.tools.fileutility.FileUtility" exclude-exts="dng,wps"/>
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" exclude-exts="txt,wps,vsd"/>
    <tool class="edu.harvard.hul.ois.fits.tools.droid.Droid" exclude-exts="dng"/>
    <tool class="edu.harvard.hul.ois.fits.tools.nlnz.MetadataExtractor" exclude-exts="dng,zip,odb,ott,odg,otg,odp,otp,ods,ots,odc,otc,odi,oti,odf,otf,odm,oth"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.FileInfo"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.XmlMetadata"/>
    <tool class="edu.harvard.hul.ois.fits.tools.ffident.FFIdent" exclude-exts="dng,wps,vsd"/>
  </tools>
  <output>
    <dataConsolidator class="edu.harvard.hul.ois.fits.consolidation.OISConsolidator"/>
    <display-tool-output>true</display-tool-output>
    <report-conflicts>true</report-conflicts>
    <validate-tool-output>false</validate-tool-output>
    <internal-output-schema>xml/fits_output.xsd</internal-output-schema>
    <external-output-schema>http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd</external-output-schema>
    <fits-xml-namespace>http://hul.harvard.edu/ois/xml/ns/fits/fits_output</fits-xml-namespace>
  </output>
  <!-- file name of the droid signature file to use in tools/droid/-->
  <droid_sigfile>DROID_SignatureFile_V35.xml</droid_sigfile>
</fits_configuration>
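To illustrate how tool order and `exclude-exts` combine, here is a rough sketch that reads a shortened copy of the configuration and lists, in preference order, the tools that would be allowed to process a given file. The `tools_for` helper is hypothetical, and matching extensions case-insensitively is an assumption of the sketch; the source does not say how FITS compares them.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample configuration (four of the eight tools).
CONFIG = """<fits_configuration>
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" exclude-exts="dng,mbx"/>
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" exclude-exts="txt,wps,vsd"/>
    <tool class="edu.harvard.hul.ois.fits.tools.droid.Droid" exclude-exts="dng"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.FileInfo"/>
  </tools>
</fits_configuration>"""


def tools_for(config_xml, filename):
    """Return short tool names, in configured order, not excluded for this file."""
    # Assumption: the extension is whatever follows the last dot, lowercased.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    selected = []
    for tool in ET.fromstring(config_xml).findall("tools/tool"):
        excluded = {
            e.strip().lower()
            for e in tool.get("exclude-exts", "").split(",")
            if e.strip()
        }
        if ext not in excluded:
            # Keep only the class name after the last package separator.
            selected.append(tool.get("class").rsplit(".", 1)[-1])
    return selected


dng_tools = tools_for(CONFIG, "scan.dng")
txt_tools = tools_for(CONFIG, "notes.txt")
```

Because document order is preserved, the returned list doubles as the preference order described in the configuration comment.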
Some Limitations...
• Speed
• Technical metadata is only returned if the tool that reported it is in the first <identity> block
• FITS considers a successful identification to be a combination of the format name and MIME type
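The last two limitations can be pictured with a small sketch. It is an illustration under assumptions, not FITS code: the helper names and the tool reports are made up, and it assumes identity blocks appear in tool priority order, which the slides do not state. Tools are grouped by the (format name, MIME type) pair they report, so two tools that agree on the MIME type but name the format differently land in different blocks.

```python
def group_identities(reports):
    """Group tools by the (format, mimetype) pair they reported.

    `reports` is a list of (tool, format, mimetype) tuples, assumed to be
    in tool priority order. Returns a list of ((format, mimetype), [tools]).
    """
    groups = {}  # dicts keep insertion order, so the first pair seen stays first
    for tool, fmt, mime in reports:
        groups.setdefault((fmt, mime), []).append(tool)
    return list(groups.items())


def tools_allowed_metadata(reports):
    """Tools whose technical metadata would be kept: those in the first block."""
    identities = group_identities(reports)
    return identities[0][1] if identities else []


# Made-up identifications for illustration only.
reports = [
    ("Jhove", "Graphics Interchange Format", "image/gif"),
    ("Exiftool", "Graphics Interchange Format", "image/gif"),
    ("FFIdent", "GIF", "image/gif"),
]
identities = group_identities(reports)
kept = tools_allowed_metadata(reports)
```

In this made-up case the third tool agrees on the MIME type but not on the format name, so it forms a second identity and any metadata it reported would be dropped.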
Future Plans
• More tools
  • Apache Tika (text document formats)
  • Jhove 2
  • Aduna Aperture (text, documents, email formats)
  • Mediainfo (audio and video formats)
• Better audio and video format support as we add object support for them to DRS2
Wrap Up
• http://fits.googlecode.com
• http://ots-schemas.googlecode.com
  • Java library for reading and writing METS (limited support), MODS, PREMIS, MIX, TextMD, DocumentMD, and soon AES audio metadata
• More information on DRS2: http://hul.harvard.edu/ois/systems/drs/enhancements.html