Manually maintaining a large number of objects in a working archive is infeasible. Software tools are required to extract and maintain significant properties (SPs). Requirements include object analysis tools, support for the requisite formats, batch analysis, flexibility, and conversion/emulation tools. Examples of tools for file, email, audio, and image formats are provided. JHOVE and the PLANETS Interoperability Framework are suggested as partial solutions.
Challenge
Reality: It is infeasible to manually maintain a large number of objects. Software is required that can extract and maintain significant properties (SPs) for large numbers of objects.
Requirements:
• Object analysis tools
• Support requisite formats
• Identify all/some SPs
• Support batch analysis
• Ideally well supported and documented
• Description schemas to record SPs
• Flexible
• Machine and format independent
• Conversion/emulation tools capable of maintaining SPs
Format identification
• File identification through magic number and a 'light touch' scan of the encoding structure.
• Recognises hundreds (potentially thousands) of formats
• Provides basic encoding info, but not detailed structure
• Examples:
  • file(1): Free version created in 1986 and available for all operating systems. http://gnuwin32.sourceforge.net/packages/file.htm (Windows)
  • DROID: Java application developed by The National Archives (TNA). Integrates with PRONOM. Identifies the format and assigns a PUID, which can be linked to preservation planning. http://droid.sourceforge.net/
  • FFIdent: Java library to identify formats and extract basic information. Recognises 27 encoding formats using header information (magic number and common structural information)
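The magic-number approach described on this slide can be illustrated with a minimal Python sketch. It is not how file(1), DROID or FFIdent are implemented; the signature table is a small hand-picked set of well-known magic numbers, and the names `SIGNATURES`, `identify_bytes`, `identify_file` and `identify_batch` are hypothetical.

```python
# Illustrative signature table (an assumption, not taken from any real tool):
# each entry maps the leading bytes of a file to a format label.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"%PDF-", "PDF document"),
    (b"II*\x00", "TIFF image (little-endian)"),
    (b"MM\x00*", "TIFF image (big-endian)"),
    (b"fLaC", "FLAC audio"),
]


def identify_bytes(header: bytes) -> str:
    """Return a format label for the leading bytes of an object."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path) -> str:
    """Read only the first few bytes of a file and identify its format."""
    with open(path, "rb") as fh:
        return identify_bytes(fh.read(16))


def identify_batch(paths) -> dict:
    """Batch analysis: map each path to its identified format."""
    return {str(p): identify_file(p) for p in paths}
```

Because only the first 16 bytes are read, this is a 'light touch' scan in the sense of the slide: it yields a format label but no detail about internal structure.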
Detailed Analysis
Perform detailed analysis of the internal structure of one or more files.
• Email:
  • Aperture: Java framework able to decode structured text and convert it to other formats
  • ReadPST: Open source tool for processing Outlook PSTs
  • XENA: Java tool developed by the National Archives of Australia (NAA)
• Audio:
  • MP3Info: technical info viewer and ID3 1.x tag editor for the MP3 file format
  • SoX/SoXI (Sound eXchange): extracts descriptive metadata and technical info
  • MetaFlac: extractor tool for FLAC audio
• Images:
  • TiffInfo
  • ImageMagick
  • JHOVE
See the InSPECT Testing Reports available at http://www.significantproperties.org.uk/ for further info on these tools
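To give a sense of what detailed analysis of an audio object yields, the following sketch reads technical properties from a WAV file using Python's standard `wave` module. It stands in for the kind of output tools such as SoXI report and does not reproduce any of the tools listed above; the function names and the property names in the returned dictionary are assumptions made for illustration.

```python
import io
import wave


def wav_properties(source) -> dict:
    """Extract basic technical properties from a WAV file.

    `source` may be a file path or a binary file-like object.
    The property names are hypothetical, chosen for this sketch.
    """
    with wave.open(source, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_width_bits": w.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "frame_count": frames,
            "duration_s": frames / rate if rate else 0.0,
        }


def make_silent_wav(seconds: int = 1, rate: int = 8000) -> io.BytesIO:
    """Build a small mono 16-bit WAV in memory so the sketch can be tried out."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16 bits
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * rate * seconds)
    buf.seek(0)
    return buf
```

Calling `wav_properties(make_silent_wav())` returns a dictionary of properties that could then be recorded in a description schema.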
JHOVE 1/2
JHOVE (http://hul.harvard.edu/jhove/)
• Format-specific digital object validation API written in Java
• Functionality: format identification, format validation, format characterisation
• Supports: AIFF, ASCII, Bytestream, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAV, and XML
JHOVE2 (https://confluence.ucop.edu/display/JHOVE2Info/Home)
• Supports: JPEG 2000, PDF, SGML, Shapefile, TIFF, ASCII and UTF-8 encoded text, WAVE, XML, ICC color profile
• Functionality: format identification, validation, feature extraction and policy-based assessment
XCL (eXtensible Characterization Language)
• Content extraction
  • Extracts content and technical properties using XCEL; the results are saved as XCDL.
  • Format support: PNG, TIFF, GIF, BMP, JPEG, JP2, PBM, PCD, PCX, PICT, PPM, PSD, SVG, TGA, XBM and XPM, MS DOC, DocX, PDF
• Content comparison
  • Compares two objects, e.g. TIFF and PNG, or PDF and Doc
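The idea behind content comparison can be sketched as a property-by-property check between two extracted descriptions, for example before and after a TIFF-to-PNG conversion. This is not the XCL comparator and does not use XCDL; the function `compare_properties`, the status labels and the property names in the usage example are hypothetical.

```python
def compare_properties(before: dict, after: dict, tolerances: dict = None) -> dict:
    """Compare two sets of extracted properties.

    Returns a dict mapping each property name to one of:
    "preserved", "changed", "lost" (only in `before`) or "added" (only in `after`).
    `tolerances` optionally gives an allowed absolute difference per numeric property.
    """
    tolerances = tolerances or {}
    report = {}
    for name in sorted(set(before) | set(after)):
        if name not in after:
            report[name] = "lost"
        elif name not in before:
            report[name] = "added"
        else:
            a, b = before[name], after[name]
            tol = tolerances.get(name)
            numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
            if tol is not None and numeric:
                same = abs(a - b) <= tol
            else:
                same = a == b
            report[name] = "preserved" if same else "changed"
    return report


# Usage with made-up property values for a TIFF and its PNG derivative.
tiff = {"width": 640, "height": 480, "bits_per_sample": 8, "compression": "LZW"}
png = {"width": 640, "height": 480, "bits_per_sample": 8, "interlace": "none"}
result = compare_properties(tiff, png)
```

A report of this kind shows which properties survived a conversion, which is the question a conversion or emulation tool has to answer when it claims to maintain SPs.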
Final thoughts
• Analysis tools are useful, but have problems:
  • Limited format support
  • Variable access methods (GUI, CLI, APIs)
  • Inconsistent reporting process
  • Different metrics (e.g. text vs. number)
  • Metric variations (e.g. milliseconds)
• Partial solution: wrap tools into services
  • PLANETS Interoperability Framework
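One way to picture the wrapping idea is a thin adapter layer that maps each tool's report onto a common set of property names and units. The sketch below is an assumption-laden illustration, not the PLANETS Interoperability Framework: the tool names `tool_a` and `tool_b`, their report keys, and the common property names are all invented.

```python
# Number of each unit per second, used to normalise durations to seconds.
UNITS_PER_SECOND = {"ms": 1000, "s": 1}


def to_seconds(value, unit: str) -> float:
    """Convert a duration in the given unit to seconds."""
    return float(value) / UNITS_PER_SECOND[unit]


def channel_count(label: str) -> int:
    """Turn a textual channel description into a number."""
    return {"mono": 1, "stereo": 2}[label.strip().lower()]


# Hypothetical adapters: tool-specific key -> (common name, converter).
ADAPTERS = {
    "tool_a": {
        "Length (ms)": ("duration_s", lambda v: to_seconds(v, "ms")),
        "Channels": ("channels", channel_count),
    },
    "tool_b": {
        "duration": ("duration_s", lambda v: to_seconds(v, "s")),
        "nchannels": ("channels", int),
    },
}


def normalise(tool: str, raw: dict) -> dict:
    """Map one tool's raw report onto the common property names and units."""
    adapter = ADAPTERS[tool]
    return {
        common: convert(raw[key])
        for key, (common, convert) in adapter.items()
        if key in raw
    }


# Two differently shaped reports describing the same object.
report_a = normalise("tool_a", {"Length (ms)": "1500", "Channels": "Stereo"})
report_b = normalise("tool_b", {"duration": 1.5, "nchannels": "2"})
```

Both reports end up with the same keys, types and units, which addresses the text-versus-number and milliseconds problems listed above while leaving the underlying tools untouched.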