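FITS (File Information Tool Set) is part of the Harvard University Library's second-generation Digital Repository Service (DRS2). It provides file identification, metadata extraction, and format validation for a wide range of file types. Developed in fall 2008; first public release in spring 2009.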
Background • FITS is part of the second-generation Harvard University Library Digital Repository Service (DRS2), which supports content models and METS/PREMIS object descriptors. • Developed Fall 2008 • First public release Spring 2009: http://fits.googlecode.com
Why? • Needed an automatic way to identify and extract metadata for a wide range of file types • No single file analysis tool satisfied our needs
Design Goals • Act as a wrapper around other open source tools • Extensible • Usable as a standalone command-line tool and through an API • Allow priority setting for tools • Open source
The Tools • Current tools: Jhove 1.5, Exiftool, National Library of New Zealand Metadata Extractor (NLNZ), DROID, FFIdent, File Utility • Three categories: • File identification (all of them) • Metadata extraction (Jhove, Exiftool, NLNZ) • Format validation (Jhove)
Features • Conflict management • Value normalization, e.g. one tool reports a unit as “inches” and another as the code “2” • Tool prioritization • Format tree for recognizing more specific format identities, e.g. PDF/A is a more specific version of PDF
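The features above can be illustrated with a minimal Python sketch. It is not FITS code: the function names, the normalization table, the format tree, and the AGREED/CONFLICT labels are all hypothetical. They are chosen only to show how normalizing values before comparison removes false conflicts, how tool order can settle real ones, and how a format tree lets a more specific identity win over a general one.

```python
# Illustrative sketch only; none of these names come from the FITS code base.

# Hypothetical normalization table: (field, raw value) -> canonical value.
# In TIFF, resolution unit code 2 means inches.
NORMALIZE = {
    ("resolution_unit", "2"): "inches",
    ("resolution_unit", "inch"): "inches",
}

# Hypothetical format tree: specific format -> more general parent format.
FORMAT_PARENT = {"PDF/A": "Portable Document Format"}


def normalize(field, value):
    """Map a raw tool value to its canonical form, if one is known."""
    return NORMALIZE.get((field, value), value)


def is_more_specific(candidate, other):
    """True if `candidate` is a descendant of `other` in the format tree."""
    parent = FORMAT_PARENT.get(candidate)
    while parent is not None:
        if parent == other:
            return True
        parent = FORMAT_PARENT.get(parent)
    return False


def consolidate(field, reports):
    """Merge (tool, value) reports that are listed in priority order.

    Returns (value, label). The label is "AGREED" when all tools agree after
    normalization (or one format subsumes the others), else "CONFLICT".
    On conflict the value from the highest-priority tool is returned.
    """
    values = [normalize(field, value) for _, value in reports]
    if field == "format":
        # Prefer a format that equals or is more specific than all the others.
        for v in values:
            if all(v == o or is_more_specific(v, o) for o in values):
                return v, "AGREED"
    label = "AGREED" if len(set(values)) == 1 else "CONFLICT"
    return values[0], label


if __name__ == "__main__":
    print(consolidate("resolution_unit", [("Jhove", "inches"), ("Exiftool", "2")]))
    print(consolidate("format", [("Droid", "Portable Document Format"), ("Jhove", "PDF/A")]))
    print(consolidate("height", [("Jhove", "1024"), ("Exiftool", "1000")]))
```

In this sketch the first call agrees once "2" is normalized, the second resolves to the more specific PDF/A, and the third is a genuine conflict that falls back to the first tool in the list.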
Example Output
<fits>
  <identification>
    <identity format="Graphics Interchange Format" mimetype="image/gif">
      <tool toolname="Jhove" toolversion="1.5" />
      ...
    </identity>
  </identification>
  <fileinfo>
    <size toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">40149</size>
    <md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">265c9345ebf93c89d472766fda095de4</md5checksum>
    ...
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <image>
      <height toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">1024</height>
      ...
    </image>
  </metadata>
</fits>
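A consumer of this output might read it roughly as follows. This is a sketch using Python's standard library, not part of FITS. The `summarize` function and the shape of its result are hypothetical, the sample string is a trimmed copy of the example output with the "..." elisions dropped so that it parses, and no XML namespace is assumed (output that declares a namespace would need namespace-qualified paths).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example output, without the "..." elisions.
SAMPLE = """<fits>
  <identification>
    <identity format="Graphics Interchange Format" mimetype="image/gif">
      <tool toolname="Jhove" toolversion="1.5" />
    </identity>
  </identification>
  <fileinfo>
    <size toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">40149</size>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <image>
      <height toolname="Jhove" toolversion="1.5" status="SINGLE_RESULT">1024</height>
    </image>
  </metadata>
</fits>"""


def summarize(fits_xml):
    """Collect the first identity and every tool-attributed value (hypothetical helper)."""
    root = ET.fromstring(fits_xml)
    identity = root.find("identification/identity")
    summary = {
        "format": identity.get("format"),
        "mimetype": identity.get("mimetype"),
        "identified_by": [t.get("toolname") for t in identity.findall("tool")],
        "values": {},
    }
    for section in ("fileinfo", "filestatus", "metadata"):
        node = root.find(section)
        if node is None:
            continue
        # Leaf elements carry a toolname attribute; containers such as <image> do not.
        for el in node.iter():
            if el.get("toolname") is not None and el.text:
                summary["values"][el.tag] = (el.text.strip(), el.get("toolname"))
    return summary


if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(key, value)
```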
Configuration • All settings are in the fits.xml config file • Enable/disable tools (available in the API too) • Prevent tools from processing files with specific file extensions • Set tool priority • Add new tools • Use your own consolidator code • Report or ignore conflicts • Options to display original tool output
Sample Configuration File
<fits_configuration>
  <!-- Order of the tools determines preference -->
  <tools>
    <!-- exclude-exts attribute is a comma delimited list of file extensions that the tool should not try to process -->
    <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" exclude-exts="dng,mbx"/>
    <tool class="edu.harvard.hul.ois.fits.tools.fileutility.FileUtility" exclude-exts="dng,wps"/>
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" exclude-exts="txt,wps,vsd"/>
    <tool class="edu.harvard.hul.ois.fits.tools.droid.Droid" exclude-exts="dng"/>
    <tool class="edu.harvard.hul.ois.fits.tools.nlnz.MetadataExtractor" exclude-exts="dng,zip,odb,ott,odg,otg,odp,otp,ods,ots,odc,otc,odi,oti,odf,otf,odm,oth"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.FileInfo"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.XmlMetadata"/>
    <tool class="edu.harvard.hul.ois.fits.tools.ffident.FFIdent" exclude-exts="dng,wps,vsd"/>
  </tools>
  <output>
    <dataConsolidator class="edu.harvard.hul.ois.fits.consolidation.OISConsolidator"/>
    <display-tool-output>true</display-tool-output>
    <report-conflicts>true</report-conflicts>
    <validate-tool-output>false</validate-tool-output>
    <internal-output-schema>xml/fits_output.xsd</internal-output-schema>
    <external-output-schema>http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd</external-output-schema>
    <fits-xml-namespace>http://hul.harvard.edu/ois/xml/ns/fits/fits_output</fits-xml-namespace>
  </output>
  <!-- file name of the droid signature file to use in tools/droid/-->
  <droid_sigfile>DROID_SignatureFile_V35.xml</droid_sigfile>
</fits_configuration>
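To make the two rules in this file concrete (tool order sets preference, and exclude-exts keeps a tool away from certain extensions), here is a small Python sketch that reads a shortened copy of the configuration. It does not reproduce how FITS itself loads fits.xml; `load_tools` and `tools_for` are hypothetical helpers, and the case-insensitive extension match is an assumption.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample configuration (four of the tools).
CONFIG = """<fits_configuration>
  <tools>
    <tool class="edu.harvard.hul.ois.fits.tools.jhove.Jhove" exclude-exts="dng,mbx"/>
    <tool class="edu.harvard.hul.ois.fits.tools.fileutility.FileUtility" exclude-exts="dng,wps"/>
    <tool class="edu.harvard.hul.ois.fits.tools.exiftool.Exiftool" exclude-exts="txt,wps,vsd"/>
    <tool class="edu.harvard.hul.ois.fits.tools.oisfileinfo.FileInfo"/>
  </tools>
</fits_configuration>"""


def load_tools(config_xml):
    """Return [(class_name, excluded_extensions)] in document (preference) order."""
    root = ET.fromstring(config_xml)
    tools = []
    for tool in root.findall("tools/tool"):
        raw = tool.get("exclude-exts", "")
        excluded = {ext.strip().lower() for ext in raw.split(",") if ext.strip()}
        tools.append((tool.get("class"), excluded))
    return tools


def tools_for(filename, tools):
    """Short names of the tools allowed to process `filename`, highest preference first."""
    # Assumption: extensions are compared case-insensitively.
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return [cls.rsplit(".", 1)[-1] for cls, excluded in tools if ext not in excluded]


if __name__ == "__main__":
    tools = load_tools(CONFIG)
    print(tools_for("letter.wps", tools))
    print(tools_for("photo.dng", tools))
```

With this shortened tool list, a .wps file is left to Jhove and FileInfo, while a .dng file is left to Exiftool and FileInfo.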
Some Limitations... • Speed • Technical metadata is only returned if the tool that reported it is in the first <identity> block • FITS considers a successful identification to be a combination of the format name and MIME type, so tools must agree on both to share an identity
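The last two limitations interact, which a short Python sketch can show. It is an illustration under stated assumptions, not the FITS consolidator: the helper names and the tool reports are made up, and it assumes identity blocks are ordered by the priority of the first tool that reported them.

```python
# Illustrative sketch only; the data and helper names are hypothetical.

def group_identities(reports):
    """Group (tool, format, mimetype) reports, given in priority order.

    Tools share an identity only if both format name and MIME type match.
    Returns a dict (format, mimetype) -> [tools], in first-seen order.
    """
    groups = {}
    for tool, fmt, mime in reports:
        groups.setdefault((fmt, mime), []).append(tool)
    return groups


def metadata_kept(groups, metadata):
    """Keep (tool, field, value) entries only from tools in the first identity."""
    if not groups:
        return []
    first_tools = set(next(iter(groups.values())))
    return [(tool, field, value) for tool, field, value in metadata
            if tool in first_tools]


if __name__ == "__main__":
    reports = [
        ("Jhove", "Graphics Interchange Format", "image/gif"),
        ("Exiftool", "GIF", "image/gif"),  # same MIME type, different format name
        ("Droid", "Graphics Interchange Format", "image/gif"),
    ]
    metadata = [("Jhove", "height", "1024"), ("Exiftool", "comment", "hello")]
    groups = group_identities(reports)
    print(groups)
    print(metadata_kept(groups, metadata))
```

Here the hypothetical Exiftool report differs only in the format name, yet it lands in a second identity, so its metadata is dropped.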
Future Plans • More tools • Apache Tika (text document formats) • Jhove 2 • Aduna Aperture (text, documents, email formats) • Mediainfo (audio and video formats) • Better audio and video format support as we add object support for them to DRS2
Wrap Up • http://fits.googlecode.com • http://ots-schemas.googlecode.com • Java library for reading and writing METS (limited support), MODS, PREMIS, MIX, TextMD, DocumentMD, and soon AES audio metadata • More information on DRS2: http://hul.harvard.edu/ois/systems/drs/enhancements.html