Grid Based Data Integration with Automatic Wrapper Generation

Xuan Zhang, Gagan Agrawal, Ohio State University

Overall Goal • Tools for data integration driven by: • Data explosion • Data size & number of data sources • New analysis tools and need for workflows • Autonomous resources • Heterogeneous data representation & various interfaces

Motivation (Contd.) • Other Issues: • Frequent updates to data formats • Flat-file datasets • Ad-hoc sharing of data

Current Approaches • Manually written wrappers • Problems • O(N^2) wrappers needed, O(N) for a single update • Portability of wrappers in a distributed environment • Mediator-based integration systems • Problems • Need a common intermediate format • Unnecessary data transformation • Integration using web/grid services • Needs all tools to be web services (all data in XML?)
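The wrapper counts on this slide can be made concrete with a little arithmetic. The sketch below assumes one hand-written wrapper per ordered (source, target) pair of formats, which is one way to arrive at the O(N^2) figure; the function names are illustrative and not part of the system described in the slides.

```python
def pairwise_wrappers(n: int) -> int:
    """Wrappers needed if every ordered (source, target) pair gets its own wrapper."""
    return n * (n - 1)


def wrappers_touched_by_one_update(n: int) -> int:
    """Wrappers that read or write a single changed format:
    n - 1 with it as the source plus n - 1 with it as the target."""
    return 2 * (n - 1)


def descriptors_needed(n: int) -> int:
    """With one layout descriptor per resource, the count grows linearly."""
    return n


if __name__ == "__main__":
    for n in (5, 10, 50):
        print(n, pairwise_wrappers(n), wrappers_touched_by_one_update(n), descriptors_needed(n))
```

Under this assumption, ten formats already require 90 hand-written wrappers, and a change to one format touches 18 of them, compared with a single descriptor edit in the descriptor-per-resource approach.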

Our Approach • Automatically generate wrappers • One layout descriptor per resource • Stand-alone wrapper programs • For integrated DBs, (grid) workflow systems • Transform data in files of arbitrary formats • No domain- or format-specific heuristics • Layout information provided by users

Our Approach (Contd.) • Help write layout descriptors using data mining techniques (DILS 2005, BIBE 2005) • Particularly attractive for • Data grid environments and workflows • Flat-file datasets • Ad hoc data sharing

Our Approach: Advantages • Advantages: • No need to write wrappers while integrating data or creating workflows • Only one descriptor per resource needed • No unnecessary transformations / storage • New resources can be integrated on-the-fly

Our Approach: Challenges • Description language • Format and logical view of data in flat files • Easy to interpret and write • Wrapper generation and execution • Correspondence between data items • Separating wrapper analysis and execution • Interactive tools for writing layout descriptors • Which data mining techniques to use? (DILS 2005, BIBE 2005)

Wrapper Generation System Overview • [Figure: system diagram with the components Layout Descriptor, Schema Descriptors, Parser, Mapping Generator, Data Entry Representation, Schema Mapping, Application Analyzer, WRAPINFO, Source Dataset, Target Dataset, DataReader, DataWriter, Synchronizer]

Suitability for a Grid Environment • Wrapper analysis can be implemented as a grid service • Very low execution costs • Wrapper execution modules are task-independent • Only three modules need to be ported to each system

Assumptions for the Current Prototype • One dataset is tabular, the other semi-structured • Both datasets are stored record-wise • Order of records not disturbed • Suitable for bioinformatics • [Figure: a semi-structured dataset and a tabular dataset]

Layout Description Language • Goal • To describe data in arbitrary flat file format • Easy to interpret and write • Components: • Schema description • Layout description • Example: FASTA

Layout Description Language • Key observations on data layout • Strings of variable length • Delimiters widely used • Data fields divided into variables • Repetitive structures • Key tokens: “constant string”, LINESIZE, [optional], <repeating> • Example data:
…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …

Layout Description Language • Example data:
…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
• Component I: Schema Description
[FASTA] //Schema Name
ID = string //Data type definitions
DESCRIPTION = string
SEQ = string
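To illustrate how little machinery a schema description of this shape needs, here is a minimal sketch of a reader for it. It assumes the syntax is exactly what the slide shows (a bracketed schema name, `NAME = type` lines, and `//` comments); `parse_schema_descriptor` is a hypothetical helper, not the parser used in the actual system.

```python
def parse_schema_descriptor(text):
    """Return (schema_name, {field: type}) from a schema description.

    Assumed syntax (taken from the slide only): "[NAME]" opens the schema,
    "FIELD = type" declares a field, and "//" starts a comment.
    """
    name = None
    fields = {}
    for raw in text.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
        elif "=" in line:
            field, _, type_name = line.partition("=")
            fields[field.strip()] = type_name.strip()
    return name, fields


FASTA_SCHEMA = """\
[FASTA] //Schema Name
ID = string //Data type definitions
DESCRIPTION = string
SEQ = string
"""

if __name__ == "__main__":
    print(parse_schema_descriptor(FASTA_SCHEMA))
```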

Layout Description Language • Example data:
…
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
NMYKDSHHPARTAHYGSLPQKSHGRTQDENPVVHFFKNIVTPRTPPPSQGKGR \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3 …
• Component II: Layout Description
…
LOOP ENTRY 1:EOF:1 {
“>” ID “ ” DESCRIPTION
< “\n” SEQ >
“\n” | EOF
}
…
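The layout description for FASTA can be read as: each entry starts with “>”, then an ID, a space, a DESCRIPTION, and then one or more sequence lines until the next entry or end of file. The sketch below is a hand-written reader that follows that reading; it is not the generated wrapper. Joining the repeated SEQ lines into one string, and the names `FastaEntry` and `read_fasta_entries`, are assumptions made for illustration.

```python
from typing import Iterator, NamedTuple


class FastaEntry(NamedTuple):
    id: str
    description: str
    seq: str


def read_fasta_entries(text: str) -> Iterator[FastaEntry]:
    """Yield one entry per ">" header, mirroring the LOOP ENTRY layout."""
    entry_id = None
    description = ""
    seq_lines = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if entry_id is not None:
                yield FastaEntry(entry_id, description, "".join(seq_lines))
            # ">" ID " " DESCRIPTION: split the header at the first space.
            entry_id, _, description = line[1:].partition(" ")
            seq_lines = []
        elif entry_id is not None:
            # < "\n" SEQ >: a repeated sequence line (joined here by assumption).
            seq_lines.append(line)
    if entry_id is not None:  # EOF closes the last entry
        yield FastaEntry(entry_id, description, "".join(seq_lines))


if __name__ == "__main__":
    sample = ">seq1 comment1\nASTPGHTIIYEAVCLHNDRTTIP\n>seq2 comment2\nASQKR\nNMYKD\n"
    for entry in read_fasta_entries(sample):
        print(entry)
```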

Mapping Cardinality • TRANSFAC to Reference table • TRANSFAC entry: … FA factor1_name … RA reference1.1_authors … RA reference1.2_authors … RA reference1.3_authors … • One-to-multiple data field • One-to-one data field
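The two cardinalities can be illustrated with the TRANSFAC fragment on this slide. The sketch assumes that FA (factor name) is the one-to-multiple field, repeated on every output row, and that each RA (reference authors) line is a one-to-one field producing exactly one row; the slide does not label the fields explicitly, and the two-column row layout is hypothetical.

```python
def transfac_to_reference_rows(lines):
    """Build (factor_name, reference_authors) rows from tagged lines.

    Assumptions: each line is "<TAG> <value>"; FA is one-to-multiple
    (its value is reused for every following RA), RA is one-to-one.
    """
    rows = []
    factor_name = None
    for raw in lines:
        tag, _, value = raw.strip().partition(" ")
        value = value.strip()
        if tag == "FA":
            factor_name = value  # remembered and reused across rows
        elif tag == "RA" and factor_name is not None:
            rows.append((factor_name, value))  # consumed exactly once
        # any other tag is not needed for the reference table
    return rows


if __name__ == "__main__":
    entry = [
        "FA factor1_name",
        "RA reference1.1_authors",
        "RA reference1.2_authors",
        "RA reference1.3_authors",
    ]
    for row in transfac_to_reference_rows(entry):
        print(row)
```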

Analyzing Application • Goals: WRAPINFO • Summarize all application-related information necessary for the wrapper • Represent the information in look-up tables and constant parameters • Represent the information in a platform-independent format, XML
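As a rough picture of what look-up tables and constant parameters in XML might look like, the sketch below writes a small mapping table with Python's standard `xml.etree.ElementTree` and reads it back. The element and attribute names (`wrapinfo`, `mapping_table`, `map`, `constants`, `param`) are invented for illustration; the slides do not show the actual WRAPINFO schema.

```python
import xml.etree.ElementTree as ET


def build_wrapinfo(field_mapping, constants):
    """Serialize a mapping table and constants to an XML string.

    field_mapping: {source_field: (target_field, cardinality)}
    constants:     {name: value}
    All element and attribute names here are hypothetical.
    """
    root = ET.Element("wrapinfo")
    table = ET.SubElement(root, "mapping_table")
    for source_field, (target_field, cardinality) in field_mapping.items():
        ET.SubElement(table, "map", source=source_field,
                      target=target_field, cardinality=cardinality)
    params = ET.SubElement(root, "constants")
    for name, value in constants.items():
        ET.SubElement(params, "param", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


def load_mapping(xml_text):
    """Rebuild the look-up table on the execution side."""
    root = ET.fromstring(xml_text)
    return {m.get("source"): (m.get("target"), m.get("cardinality"))
            for m in root.iter("map")}


if __name__ == "__main__":
    xml_text = build_wrapinfo(
        {"FA": ("factor_name", "one_to_multiple"), "RA": ("authors", "one_to_one")},
        {"line_size": 60},
    )
    print(xml_text)
    print(load_mapping(xml_text))
```

Because the result is a plain XML string, the analysis step that produces it and the execution modules that consume it can run on different machines, which is the separation the slides argue for.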

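Wrapper Generated • [Figure: the Synchronizer (run, load, run, halt) switches between the DataReader, which reads the input dataset through the dataset buffer, and the DataWriter, which writes the output dataset; a value buffer between them holds one_to_one_values and one_to_multiple_values, shown with FA and RA entries]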

Wrapper Generated • Suitable for data grid • Three general modules • DataReader • Extract one data field value • Write value to the value buffer if useful • DataWriter • Write one data field value • Remove value from list in the value buffer • Synchronizer • Switch between calling DataReader and DataWriter • Manage dataset buffer • Application specific information in WRAPINFO
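The division of labour between the three modules can be sketched as follows. This is a simplified, single-process illustration: the input is assumed to be already split into (field, value) pairs, the dataset buffer is not modelled, and the class and method names beyond DataReader, DataWriter, and Synchronizer are hypothetical. The `cardinality` dictionary stands in for the application-specific information that WRAPINFO would supply.

```python
from collections import deque


class ValueBuffer:
    def __init__(self):
        self.one_to_one = deque()   # each value is written once, then removed
        self.one_to_multiple = {}   # values reused across several output rows


class DataReader:
    def __init__(self, tokens, cardinality):
        self._tokens = iter(tokens)
        self._cardinality = cardinality  # field -> "one_to_one" / "one_to_multiple"

    def read_one(self, buf):
        """Extract one field value; store it only if it is useful.
        Returns False when the input is exhausted."""
        try:
            field, value = next(self._tokens)
        except StopIteration:
            return False
        kind = self._cardinality.get(field)
        if kind == "one_to_one":
            buf.one_to_one.append(value)
        elif kind == "one_to_multiple":
            buf.one_to_multiple[field] = value
        return True


class DataWriter:
    def __init__(self, shared_field):
        self._shared_field = shared_field
        self.rows = []

    def write_one(self, buf):
        """Write one row and remove the consumed one-to-one value."""
        value = buf.one_to_one.popleft()
        self.rows.append((buf.one_to_multiple.get(self._shared_field), value))


def synchronize(reader, writer, buf):
    """Switch between reader and writer until the input ends."""
    while reader.read_one(buf):
        while buf.one_to_one:
            writer.write_one(buf)
    return writer.rows


if __name__ == "__main__":
    tokens = [("FA", "f1"), ("RA", "r1"), ("RA", "r2"), ("FA", "f2"), ("RA", "r3")]
    reader = DataReader(tokens, {"FA": "one_to_multiple", "RA": "one_to_one"})
    print(synchronize(reader, DataWriter("FA"), ValueBuffer()))
```

Only the `cardinality` table and the `shared_field` argument depend on the task; the three modules themselves contain no TRANSFAC-specific logic, which is what makes them portable.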

Experimental Results • TRANSFAC-to-Reference Problem • [Figure: two plots, both shown in logarithmic scale]

Experimental Results • SWISSPROT-to-FASTA Problem

Summary • Automatically generated wrappers can perform well • Wrapper task analysis and wrapper execution can be separated • Key Open Questions: • How hard is it to write layout descriptors? • Can we make the process semi-automatic? • Data mining techniques seem quite promising (DILS 2005, BIBE 2005)
