The ISDA Tools
Computationally Scalable File Migration Services to Keep Your Files Current
Kenton McHenry, Rob Kooper, Luigi Marini, Michal Ondrejcek
The Problem
• The abundance of file formats is a problem when preserving electronic records
• Why?
  • Will there be software to load the file in the future?
  • If not, will the specification for the format still exist?
  • Was the specification ever available to begin with (closed/proprietary formats)?
*.pdf (*.prc, *.u3d) *.k3d *.ma, *.mb, *.mp *.w3d *.lwo *.blend *.iam *.max, *.3ds *.c4d *.dwg *.vtk, *.vtp *.skp
Converting Formats
• To preserve content for future use, one option is to convert the file to an open/standardized format that is likely to be supported for some time.
• Store both this file and the original for provenance.
• Ideally, with one file format for a particular content type, it will be easy for users to view/use the data.
NCSA Polyglot (2009)
• Conversion service that utilizes any and all available 3rd party software
• Imposed Code Reuse: re-attaching a programmable interface to compiled software
  • Scripted operations within software
  • GUI scripting (e.g. AutoHotkey)
• Created a simple workflow referred to as an Input/Output Graph
• Compared files before/after conversion to measure information loss
• Distributed across multiple machines
• Web access
ISDA File Migration Tools
• Conversion Software Registry
• Software Servers
• Polyglot
• Versus
Software that can Convert between Formats
• There is a lot of software available, each with its own unique capabilities
• A lot of it is not free
• It would be expensive to buy a package just to check whether it can truly convert between a desired pair of formats
• How can someone know what software to get for their needs?
http://isda.ncsa.illinois.edu/NARA/CSR
The Conversion Software Registry (Tool #1)
[Figure: Adobe 3D Reviewer]
Input/Output Graphs
[Figure: Adobe 3D Reviewer]
Input/Output Graphs
[Figure: 3DS Max, Adobe 3D Reviewer, AutoCAD, Blender, Cinema 4D, K-3D, LightWave 3D, Maya, Wings 3D]
Input/Output Graphs
[Figure: Shortest conversion path]
Software Servers (Tool #2)
• Imposed Code Reuse: the process of attaching an API-like interface to software so that its functionality can be called from within new code.
Software Servers (Tool #2)
• Shares the functionality of software over the web
• In contrast to services that share data: ftpd, nfsd, smbd (Samba), httpd
• Similar to services such as: telnetd, sshd, VNC, rdesktop
• The main difference is in the interface:
  • Uniform across all software: http://host:8182/software/<Application>/<Task>/<Output Format>/<InputFile>
  • Simple
  • Widely accessible
  • Capable of being programmed against
• Allows any desktop application to become a cloud-based web service*
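Software Functionality Sharing
#!/bin/bash
# Convert every *.stp file in the current directory to *.igs via a Software Server.
host="http://141.142.224.231:8182"
application="A3DReviewer"
task="convert"
output="igs"
input="stp"
url="$host/software/$application/$task/$output"

for input_file in *."$input"; do
    [ -e "$input_file" ] || continue    # skip when no files match
    # Upload the file; the response is used as the URL of the converted result.
    output_url=$(curl -s -H "Accept:text/plain" -F "file=@$input_file" "$url")
    output_file="${input_file%.*}.$output"
    echo "Converting: $input_file to $output_file"
    # Poll once per second until the converted file can be downloaded.
    while : ; do
        wget -q -O "$output_file" "$output_url"
        if [ $? -eq 0 ]; then
            break
        fi
        sleep 1
    done
done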
Software Server Robustness
• Software:
  • 3D Studio Max, Adobe 3D Reviewer, Blender, Google SketchUp, ImageMagick, IrfanView, Microsoft Paint, Microsoft Word, ParaView, VTK
• Measure throughput of software on a software server
• TRY TO MAKE IT FAIL!!!
• Results:
  • Ideal case: 1395 tasks/hour on a 1-core, 1 GB VM with an average wait of 4.42 s
  • Less-than-ideal case: 945 tasks/hour with an average wait of 11.17 s
  • Server did not crash!
Software Server Robustness
• We are using GUI-based software!
• Consider command-line software as the baseline:
  • ImageMagick: 1871 tasks/hour
  • IrfanView: 3163 tasks/hour
• vs. GUI software:
  • 3DS Max: 355 tasks/hour
  • Microsoft Word: 756 tasks/hour
• How many people would it take using this software to reach the same throughput?
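To put these rates in perspective, a throughput in tasks/hour can be turned into an average time per task by dividing 3600 seconds by the rate. The short sketch below does this for the four figures quoted above; the function name seconds_per_task is ours and is not part of the ISDA tools.

```shell
#!/bin/bash
# Convert a throughput in tasks/hour into the average number of seconds per task.
seconds_per_task() {
    awk -v rate="$1" 'BEGIN { printf "%.2f\n", 3600 / rate }'
}

# Throughput figures quoted on the slide (tasks/hour).
for entry in "ImageMagick:1871" "IrfanView:3163" "3DS Max:355" "Microsoft Word:756"; do
    name=${entry%%:*}    # text before the colon
    rate=${entry##*:}    # text after the colon
    echo "$name: $(seconds_per_task "$rate") s per task"
done
```

With these numbers, the GUI applications take roughly 5 to 10 seconds per task, against 1 to 2 seconds for the command-line tools.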
Polyglot (Tool #3)
• Listens for Software Server broadcasts on the network
• Catalogues available input/output operations and constructs an I/O graph
• Identifies conversion paths between input and output formats
• Carries out CHAINED conversions
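Finding a conversion path in an I/O graph can be sketched as a breadth-first search in which formats are nodes and each conversion offered by an application is a directed edge. The sketch below is only an illustration of that idea, not Polyglot's actual implementation: the edge list is made up, and AppB, AppC and AppD are hypothetical application names (only the A3DReviewer stp-to-igs conversion appears elsewhere in these slides).

```shell
#!/bin/bash
# Requires bash 4+ (associative arrays).

# Illustrative I/O-graph edges, one "input output application" record each.
# AppB, AppC and AppD are hypothetical placeholders.
edges=(
    "stp igs A3DReviewer"
    "igs obj AppB"
    "obj stl AppC"
    "stp obj AppD"
)

# Print the shortest chain of conversions from format $1 to format $2,
# one "application: input -> output" step per line. Returns 1 if no path exists.
shortest_path() {
    local src=$1 dst=$2
    [ "$src" = "$dst" ] && return 0
    local -A prev_format=() prev_app=()
    local queue=("$src")
    local current edge from to app
    prev_format[$src]=$src
    # Breadth-first search: the first time a format is reached is via a shortest path.
    while [ ${#queue[@]} -gt 0 ]; do
        current=${queue[0]}
        queue=("${queue[@]:1}")
        [ "$current" = "$dst" ] && break
        for edge in "${edges[@]}"; do
            read -r from to app <<< "$edge"
            if [ "$from" = "$current" ] && [ -z "${prev_format[$to]:-}" ]; then
                prev_format[$to]=$current
                prev_app[$to]=$app
                queue+=("$to")
            fi
        done
    done
    [ -n "${prev_format[$dst]:-}" ] || return 1
    # Walk back from the target to the source, prepending each step.
    local steps=() node=$dst
    while [ "$node" != "$src" ]; do
        steps=("${prev_app[$node]}: ${prev_format[$node]} -> $node" "${steps[@]}")
        node=${prev_format[$node]}
    done
    printf '%s\n' "${steps[@]}"
}

shortest_path stp stl
```

Each printed step corresponds to one request to a Software Server, so a chained conversion would feed the output file of one step in as the input of the next.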
Versus (Tool #4)
• Java library/framework for comparing file content
• Under development:
  • Framework/API designed
  • Distributed architecture
  • RESTful web interface
    • http://<host>/versus/comparisons
    • dataset1, dataset2
    • adapter, extractor, measure
  • Adding extractors, measures
Which conversion preserved the most?
• Using the light fields measure:
  • Emphasizes shape through silhouettes
  • Adobe 3D Reviewer between *.pdf and *.stp (61.67)
• Using the spin image measure:
  • Emphasizes shape through relative vertex positions
  • Adobe 3D Reviewer between *.obj and *.pdf (59.07)
Which is the best format?
Within the context of preservation, we can define this as the format that retains, on average, the most information when other formats are converted to it.
• Using the light fields measure:
  • Emphasizes shape through silhouettes
  • *.stp (40.73)
• Using the spin image measure:
  • Emphasizes shape through relative vertex positions
  • *.stl (34.89)
  • *.stp, being a CAD format, has more variability in vertex positions due to tessellation
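The "best format" definition above amounts to averaging, for each target format, the scores of all conversions into it and picking the highest mean. A minimal sketch of that calculation follows. It assumes that a higher score means more information retained, and the score table is invented purely for illustration; it is not data from this study.

```shell
#!/bin/bash
# Hypothetical comparison scores, one "source target score" record per line.
# These numbers are made up for illustration only.
scores="obj stp 50
stl stp 30
obj stl 20
stp stl 40"

# Read "source target score" records on stdin and print the target format
# with the highest mean score, followed by that mean.
best_format() {
    awk '
        { sum[$2] += $3; count[$2]++ }
        END {
            for (fmt in sum) {
                mean = sum[fmt] / count[fmt]
                if (best == "" || mean > best_mean) { best = fmt; best_mean = mean }
            }
            printf "%s %.2f\n", best, best_mean
        }'
}

printf '%s\n' "$scores" | best_format
```

With the sample table, conversions into stp average 40 and conversions into stl average 30, so stp is reported.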
ISDA Tools
• Conversion Software Registry
• Software Servers
• Polyglot
• Versus
• 3D Utilities
• Image Utilities
• CyberIntegrator
Acknowledgements
• The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government. This research was partially supported by a National Archives and Records Administration (NARA) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and by NCSA Industrial Partners.
The ISDA Tools (Free and Open Source)
Image, Spatial, and Data Analysis Group
http://isda.ncsa.illinois.edu
Kenton McHenry, Rob Kooper, Michal Ondrejcek, Luigi Marini