Grid Based Data Integration with Automatic Wrapper Generation Xuan Zhang Gagan Agrawal Ohio State University
Overall Goal • Tools for data integration driven by: • Data explosion • Data size & number of data sources • New analysis tools and need for workflows • Autonomous resources • Heterogeneous data representation & various interfaces
Motivation (Contd.) • Other Issues: • Frequent updates to data formats • Flat-file datasets • Ad-hoc sharing of data
Current Approaches • Manually written wrappers • Problems • O(N^2) wrappers needed, and O(N) wrappers must be rewritten for a single format update • Portability of wrappers in a distributed environment • Mediator-based integration systems • Problems • Need a common intermediate format • Unnecessary data transformation • Integration using web/grid services • Needs all tools to be web services (all data in XML?)
Our Approach • Automatically generate wrappers • One layout descriptor per resource • Stand-alone wrapper programs • For integrated DBs, (grid) workflow systems • Transform data in files of arbitrary formats • No domain- or format-specific heuristics • Layout information provided by users
Our Approach (Contd.) • Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005) • Particularly attractive for • Data grid environments and workflows • Flat-file datasets • Ad hoc data sharing
Our Approach: Advantages • Advantages: • No need to write wrappers while integrating data or creating workflows • Only one descriptor per resource needed • No unnecessary transformations / storage • New resources can be integrated on-the-fly
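The scaling argument behind "only one descriptor per resource" can be made concrete with a little arithmetic. The sketch below assumes that hand-written wrappers are directed (one per ordered pair of formats); the function names are illustrative and do not come from the slides.

```python
def pairwise_wrappers(n: int) -> int:
    """Hand-written wrappers for n formats, assuming one per ordered pair."""
    return n * (n - 1)


def wrappers_touched_by_update(n: int) -> int:
    """Wrappers that read or write one changed format (assumed: both directions)."""
    return 2 * (n - 1)


def layout_descriptors(n: int) -> int:
    """With generated wrappers, each resource needs a single layout descriptor."""
    return n


if __name__ == "__main__":
    for n in (5, 20, 100):
        print(n, pairwise_wrappers(n), wrappers_touched_by_update(n), layout_descriptors(n))
```

Under these assumptions the hand-written count grows quadratically while the descriptor count grows linearly, and a format change touches one descriptor instead of O(N) wrappers.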
Our Approach: Challenges • Description language • Format and logical view of data in flat files • Easy to interpret and write • Wrapper generation and execution • Correspondence between data items • Separating wrapper analysis and execution • Interactive tools for writing layout descriptors • Which data mining techniques to use? (DILS 2005, BIBE 2005)
Wrapper Generation System Overview • [Figure: system architecture diagram. Labeled components: Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer]
Suitability for a Grid Environment • Wrapper analysis can be implemented as a grid service • Very low execution costs • Wrapper execution modules are task-independent • Only three modules need to be ported to each system
Assumptions for the Current Prototype • One dataset is tabular, the other semi-structured • Both datasets are stored record-wise • Order of records is not disturbed • Suitable for bioinformatics • [Figure labels: Semi-structured, tabular]
Layout Description Language • Goal • To describe data in arbitrary flat file format • Easy to interpret and write • Components: • Schema description • Layout description • Example: FASTA
Layout Description Language • Key observations on data layout • Strings of variable length • Delimiters widely used • Data fields divided into variables • Repetitive structures • Key tokens • “constant string” • LINESIZE • [optional] • <repeating> • Example data: … >seq1 comment1 \n ASTPGHTIIYEAVCLHNDRTTIP \n >seq2 comment2 \n ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n >seq3 …
Layout Description Language • Example data: … >seq1 comment1 \n ASTPGHTIIYEAVCLHNDRTTIP \n >seq2 comment2 \n ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n >seq3 … • Component I: Schema Description
    [FASTA]               //Schema Name
    ID = string           //Data type definitions
    DESCRIPTION = string
    SEQ = string
Layout Description Language • Example data: … >seq1 comment1 \n ASTPGHTIIYEAVCLHNDRTTIP \n >seq2 comment2 \n ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n >seq3 … • Component II: Layout Description … LOOP ENTRY 1:EOF:1 { “>” ID “ ” DESCRIPTION < “\n” SEQ > “\n” | EOF } …
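To illustrate what the FASTA layout description expresses, here is a minimal hand-written Python sketch of a reader that follows the same structure: a ">" key token, an ID, a space, a DESCRIPTION, then one or more sequence lines. It is not the generated wrapper; the names `FastaEntry` and `read_entries` are assumptions made for this example.

```python
from typing import Iterator, List, NamedTuple, Optional


class FastaEntry(NamedTuple):
    id: str
    description: str
    seq: str


def read_entries(text: str) -> Iterator[FastaEntry]:
    """Yield one FastaEntry per '>' header, joining the repeated SEQ lines."""
    header: Optional[str] = None
    seq_lines: List[str] = []

    def build() -> FastaEntry:
        # Header layout: ">" ID " " DESCRIPTION
        entry_id, _, description = header[1:].partition(" ")
        return FastaEntry(entry_id, description.strip(), "".join(seq_lines))

    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield build()
            header, seq_lines = line, []
        elif header is not None and line.strip():
            seq_lines.append(line.strip())  # one repetition of < "\n" SEQ >
    if header is not None:
        yield build()


if __name__ == "__main__":
    sample = ">seq1 comment1\nASTPGHTIIYEAVCLHNDRTTIP\n>seq2 comment2\nASQKRPSQRHG\nNMYKDSHH\n"
    for entry in read_entries(sample):
        print(entry)
```

The point of the layout descriptor is that this kind of format-specific code would be derived from the description instead of being written by hand for each format.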
Mapping Cardinality • Example: TRANSFAC to reference table • TRANSFAC entry: … FA factor1_name … RA reference1.1_authors … RA reference1.2_authors … RA reference1.3_authors … • Data field kinds • One-to-one data field • One-to-multiple data field
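The cardinality issue in the TRANSFAC example can be sketched as follows: a factor name that appears once per entry has to be repeated in every reference row produced from that entry. This is a simplified illustration only; the two-column row layout and the function name `to_reference_rows` are assumptions, and the slides do not specify the actual table schema.

```python
from typing import List, Optional, Tuple


def to_reference_rows(entry_lines: List[str]) -> List[Tuple[Optional[str], str]]:
    """Pair each RA (reference authors) value with the entry's single FA value."""
    factor: Optional[str] = None
    rows: List[Tuple[Optional[str], str]] = []
    for line in entry_lines:
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == "FA":
            factor = value                 # appears once per entry
        elif tag == "RA":
            rows.append((factor, value))   # appears once per reference
    return rows


if __name__ == "__main__":
    entry = [
        "FA factor1_name",
        "RA reference1.1_authors",
        "RA reference1.2_authors",
        "RA reference1.3_authors",
    ]
    for row in to_reference_rows(entry):
        print(row)
```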
Analyzing Application • Goals - WRAPINFO • Summarize all application related information necessary for the wrapper • Represent the information in look-up tables and constant parameters • Represent the information in a platform-independent format, XML
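As a rough illustration of storing look-up tables and constant parameters in XML, the sketch below serializes and reloads such information with Python's standard `xml.etree.ElementTree`. The element and attribute names (`wrapinfo`, `constants`, `param`, `table`, `entry`) are invented for this example; the slides do not show the real WRAPINFO schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_wrapinfo(constants: Dict[str, object],
                   lookup_tables: Dict[str, Dict[str, str]]) -> str:
    """Serialize constants and look-up tables into a hypothetical XML layout."""
    root = ET.Element("wrapinfo")
    consts = ET.SubElement(root, "constants")
    for name, value in constants.items():
        ET.SubElement(consts, "param", name=name, value=str(value))
    for table_name, mapping in lookup_tables.items():
        table = ET.SubElement(root, "table", name=table_name)
        for key, val in mapping.items():
            ET.SubElement(table, "entry", key=key, value=val)
    return ET.tostring(root, encoding="unicode")


def load_lookup_table(xml_text: str, table_name: str) -> Dict[str, str]:
    """Read one look-up table back, as an execution module on any platform might."""
    root = ET.fromstring(xml_text)
    for table in root.findall("table"):
        if table.get("name") == table_name:
            return {e.get("key"): e.get("value") for e in table.findall("entry")}
    raise KeyError(table_name)


if __name__ == "__main__":
    xml_text = build_wrapinfo(
        {"linesize": 60},
        {"field_mapping": {"ID": "ID", "DE": "DESCRIPTION", "SQ": "SEQ"}},
    )
    print(xml_text)
    print(load_lookup_table(xml_text, "field_mapping"))
```

Keeping the analysis result in a plain XML document is what lets the analysis step run once, possibly as a grid service, while the execution modules only need an XML parser on the target system.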
Wrapper Generated • [Figure: wrapper execution diagram. Labeled elements: Input dataset, Dataset buffer, DataReader, Value buffer (one_to_one_values, one_to_multiple_values; example fields FA, RA), DataWriter, Output dataset, Synchronizer with control signals run, load, halt]
Wrapper Generated • Suitable for data grid • Three general modules • DataReader • Extract one data field value • Write value to the value buffer if useful • DataWriter • Write one data field value • Remove value from list in the value buffer • Synchronizer • Switch between calling DataReader and DataWriter • Manage dataset buffer • Application specific information in WRAPINFO
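The division of labor among the three modules can be sketched in Python as below. This is a toy illustration under several assumptions: the input is a simplified tagged-record format (tags `ID`, `DE`, `SQ`, records terminated by `//`), the output is FASTA-like text, and the method names are invented. In the actual system the application-specific details would come from WRAPINFO instead of being hard-coded.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Set

END_OF_RECORD = "//"  # assumed record terminator of the toy input format


class DataReader:
    """Extracts one data field value per call; buffers it only if it is useful."""

    def __init__(self, lines: Iterable[str], wanted_tags: Set[str]) -> None:
        self._lines = iter(lines)
        self._wanted = wanted_tags

    def read_field(self, value_buffer: Dict[str, Deque[str]]) -> str:
        line = next(self._lines, None)
        if line is None:
            return "eof"
        if line.strip() == END_OF_RECORD:
            return "record_end"
        tag, _, value = line.partition(" ")
        if tag in self._wanted:
            value_buffer.setdefault(tag, deque()).append(value.strip())
        return "field"


class DataWriter:
    """Writes one data field value per call and removes it from the value buffer."""

    def __init__(self, out: List[str]) -> None:
        self._out = out

    def write_field(self, value_buffer: Dict[str, Deque[str]], tag: str,
                    prefix: str = "", suffix: str = "") -> bool:
        values = value_buffer.get(tag)
        if not values:
            return False
        self._out.append(prefix + values.popleft() + suffix)
        return True


def synchronize(lines: Iterable[str]) -> str:
    """Switches between reader and writer: read a record, then write it out."""
    out: List[str] = []
    value_buffer: Dict[str, Deque[str]] = {}
    reader = DataReader(lines, {"ID", "DE", "SQ"})
    writer = DataWriter(out)
    while True:
        status = reader.read_field(value_buffer)
        if status == "field":
            continue
        if value_buffer.get("ID"):  # a complete record is buffered
            writer.write_field(value_buffer, "ID", prefix=">")
            writer.write_field(value_buffer, "DE", prefix=" ")
            out.append("\n")
            while writer.write_field(value_buffer, "SQ", suffix="\n"):
                pass
        if status == "eof":
            return "".join(out)


if __name__ == "__main__":
    records = ["ID seq1", "DE comment1", "XX ignored", "SQ ASTPG", "SQ HTIIY", "//",
               "ID seq2", "DE comment2", "SQ ASQKR", "//"]
    print(synchronize(records), end="")
```

Because the reader and writer only move one field value at a time through the value buffer, they stay independent of any particular conversion task; only the driving information changes between applications.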
Experimental Results: TRANSFAC-to-Reference Problem • [Figure: performance plot; both axes labeled "in logarithm"]
Experimental Results: SWISSPROT-to-FASTA Problem • [Figure: performance plot]
Summary • Automatically generated wrappers can perform well • Wrapper task analysis and wrapper execution can be separated • Key Open Questions: • How hard is it to write layout descriptors? • Can we make the process semi-automatic? • Data mining techniques seem quite promising (DILS 2005, BIBE 2005)