A Framework for Relationship Discovery Among Files of Different Types

Michal Ondrejcek, Jason Kastner and Peter Bajcsy
National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign (UIUC)
{mondrejc, jkastner, pbajcsy}@ncsa.uiuc.edu

Abstract

We present a framework for relationship discovery from heterogeneous data systems. The framework consists of modules for automated file system analysis, file content analysis, integration of the results from these analyses, storage of metadata, and data-driven decision support for discovering relationships among files. The file content analysis includes filtering for file type detection (e.g., file format identification using DROID and PRONOM) and type-specific content analysis (such as information extraction from 2D engineering drawings using Optical Character Recognition (OCR), and keyword-based extraction of information from 3D CAD models). The integration component consolidates metadata extracted from the file system and from the file content using Resource Description Framework (RDF)-based metadata representations. These are stored using Tupelo in an underlying content repository. We report our preliminary design of the framework and the performance of prototype modules for a test collection of electronic records documenting the Torpedo Weapon Retriever (TWR 841). This test collection presents a problem of unknown relationships among files that currently include 784 2D image drawings and 22 CAD models.

Framework Design

[Figure: An overall design for discovering relationships among multiple sources of electronic records.]

Content Information Extraction

We study the extraction of content information to discover relationships between engineering drawings (tiff files) with a title block and the corresponding AutoCAD 3D models (dwg files) of the TWR841 ship deck.

[Figure: Engineering Drawing, RELATIONSHIP, 3D CAD Model]

• Information in engineering drawings: The title block is cropped.
Information is extracted using Optical Character Recognition (OCR) software. The extracted information is corrected and encoded into about 15-20 RDF triples using a developed ontology.

File System Information Extraction

• Aperture, a Java framework, has been used for metadata extraction from file systems. It saves the metadata following the Nepomuk ontology.
• We studied the size of the extracted metadata and developed prediction capabilities to estimate additional storage requirements.

[Figure: Metadata size as a function of the number of files in a file system. The test systems were divided by operating system (OS) type into (c1) a Linux-based 8-CPU Intel Xeon system at 2.5 GHz with 8 GB RAM and (c2) a Windows XP 1-CPU Intel system at 2 GHz with 2 GB RAM. The dots correspond to concrete file systems, while the blue line represents the metadata size prediction based on a simulated file system topology.]

[Figure labels: Cropped Title Block; Information from OCR]

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:tdrw="'path'/NARA/titleBlockRDF/">
  <rdf:Description rdf:about="'path'/NARA/titleBlockRDF/">
    <tdrw:drawingTitle>120 TORPEDO WEAPONS RETRIEVER TRANSVERSE BULKHEADS BELOW MAIN DECK</tdrw:drawingTitle>
    <tdrw:isPreparingActivity>OfPAO'MtN* Of »NE **v* NAVAL SEA SYSTEMS COMMAND</tdrw:isPreparingActivity>
    <tdrw:drawingScale>1/2"-1'-0"&amp; AS SHOWN</tdrw:drawingScale>
    <tdrw:drawingSize>H</tdrw:drawingSize>
    <tdrw:drawingNumber>117-6200895</tdrw:drawingNumber>
    <tdrw:drawingNumber>A</tdrw:drawingNumber>
    <tdrw:isDrawnBy>
      <rdf:Description rdf:about="http://purl.org/dc/elements/1.1/">
        <dc:creator>LDOBSON</dc:creator>
      </rdf:Description>
    </tdrw:isDrawnBy>
    <tdrw:isDrawnDate>
      <rdf:Description rdf:about="http://purl.org/dc/elements/1.1/">
        <dc:date>4-I0-86</dc:date>
      </rdf:Description>
    </tdrw:isDrawnDate>
  </rdf:Description>
</rdf:RDF>
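The encoding of title block fields into RDF triples can be illustrated with a minimal Python sketch. It is not the authors' implementation: the namespace http://example.org/titleBlockRDF/ is a hypothetical stand-in for the elided 'path'/NARA/titleBlockRDF/ namespace, the function names are invented for illustration, and the nested isDrawnBy structure of the listing is flattened into a direct dc:creator property. The field values are taken from the RDF listing, and the subject is a UUID-based key, following the poster's remark that a UUID keys the set of triples about one file.

```python
from uuid import uuid4

# Hypothetical namespace: the poster elides the real one as 'path'/NARA/titleBlockRDF/.
TDRW = "http://example.org/titleBlockRDF/"
DC = "http://purl.org/dc/elements/1.1/"

# Mapping from title block field names to predicate URIs (illustrative subset).
PREDICATES = {
    "drawingTitle": TDRW + "drawingTitle",
    "drawingScale": TDRW + "drawingScale",
    "drawingSize": TDRW + "drawingSize",
    "drawingNumber": TDRW + "drawingNumber",
    "creator": DC + "creator",
}


def literal(value):
    """Quote a string as an N-Triples literal, escaping special characters."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def title_block_triples(subject, fields):
    """Turn corrected OCR fields into (subject, predicate, object) triples.

    Unknown field names and empty values are skipped.
    """
    return [
        (subject, PREDICATES[name], value)
        for name, value in fields.items()
        if name in PREDICATES and value
    ]


def to_ntriples(triples):
    """Serialize triples with literal objects as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> {literal(o)} ." for s, p, o in triples)


if __name__ == "__main__":
    # One UUID-based subject per file, so all triples about the file share a key.
    subject = f"urn:uuid:{uuid4()}"
    fields = {
        "drawingTitle": "120 TORPEDO WEAPONS RETRIEVER TRANSVERSE BULKHEADS BELOW MAIN DECK",
        "drawingSize": "H",
        "drawingNumber": "117-6200895",
        "creator": "LDOBSON",
    }
    print(to_ntriples(title_block_triples(subject, fields)))
```

Keeping the field-to-predicate mapping in one dictionary means the ontology can grow without touching the serialization code. A production pipeline would more likely use an RDF library than hand-built strings.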
[Figure labels: RDF representation of information extracted; Editing and Ontology Definition]

• Information in 3D CAD files: The 3D CAD models in STEP file format are searched for any ASCII strings matching an English dictionary. The information is again encoded in about 8-10 RDF triples.

[Table: An example of information extracted from a 3D CAD model of the TWR841 ship deck stored in the STEP file format.]

File Format Identification

This component calls DROID, a file format identification program. The results are metadata about each file, including the registered PRONOM unique ID. PRONOM is a registry of information about file formats, software products and other technical components. Several 3D file formats are not supported by PRONOM, and for these DROID returns the unidentified file format flag. Those files are then checked against an internal list of 3D file types. The results are converted into RDF triples and stored in a metadata context repository.

[Figure: RDF triples generated for two engineering drawings in tiff and AutoCAD formats, with PRONOM Unique IDs highlighted. A UUID is used as a key for storing a set of triples about the same file.]

Conclusions

• We have prototyped a framework for file system and file content metadata extraction. The relationship discovery from metadata is in progress.
• We developed a metadata size prediction capability for file systems.
• We empirically observed that the number of RDF triples generated for relationship discovery is on average about 20-30 per file, leading to a total of 8-12 million RDF triples for an average-size server.

Acknowledgments

This research was partially supported by a National Archives and Records Administration supplement to NSF PACI cooperative agreement CA #SCI-9619019.
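As a rough consistency check on the triple counts quoted above, the short Python sketch below works through the arithmetic. Two points are inferences and not statements from the poster: the server size of about 400,000 files is only implied by dividing the quoted totals by the per-file averages, and the test collection estimate assumes the per-file content counts (15-20 triples per drawing, 8-10 per CAD model) apply uniformly to all 784 drawings and 22 CAD models. The function names are illustrative.

```python
def implied_file_count(total_triples, triples_per_file):
    """Number of files implied by a triple total and a per-file average."""
    return total_triples // triples_per_file


def content_triple_range(n_drawings, n_cad_models):
    """Low and high estimates of content-derived triples for a collection.

    Uses the per-file ranges quoted on the poster: 15-20 triples per
    engineering drawing and 8-10 triples per 3D CAD model.
    """
    low = n_drawings * 15 + n_cad_models * 8
    high = n_drawings * 20 + n_cad_models * 10
    return low, high


if __name__ == "__main__":
    # 8 million triples at 20 per file and 12 million at 30 per file
    # both imply the same server size.
    print(implied_file_count(8_000_000, 20))   # 400000
    print(implied_file_count(12_000_000, 30))  # 400000

    # Test collection: 784 drawings and 22 CAD models.
    print(content_triple_range(784, 22))  # (11936, 15900)
```

Both ends of the quoted 8-12 million range correspond to the same file count, which suggests the range reflects variation in triples per file, not in server size.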