New (Applications of) Compiler Techniques for Data Grids
Gagan Agrawal
Outline
• Automatic Data Virtualization
  • SQL Implementation
  • XML/XQuery
• Automatic Wrapper Generation
  • Data Integration in Bioinformatics
• Compiling XML Query Language XQuery
  • Issues with streaming data
Data Virtualization
[Figure: Data Virtualization provides an abstract view of data over a dataset, exposed through a Data Service]
• Scientific data being shared on Web/Grids
• Low-level layouts
• Need for efficient storage and processing
Our Approach: Automatic Data Virtualization
• Automatically create data services
  • A new application of compiler technology
• A meta-data descriptor describes the layout of data in a repository
• An abstract view is exposed to the users
• Two implementations:
  • Relational/SQL-based (HPDC 2004, LCPC 2004)
  • XML/XQuery-based (ICS 2003, LCPC 2003)
SQL/Relational Implementation
  SELECT <Data Elements>
  FROM <Dataset Name>
  WHERE … AND Filter(<Data Element>);
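As a rough illustration of what a data service has to do to answer a query of this shape, the sketch below evaluates a SELECT/WHERE over a flat binary file by consulting a small layout description. The descriptor format, the field names (`time`, `x`, `pressure`), and the `scan` and `select` helpers are hypothetical and are not taken from the talk; the actual system generates this kind of code automatically from its own meta-data descriptor.

```python
import io
import struct

# Hypothetical layout descriptor: fixed-size records of three little-endian fields.
LAYOUT = {"fields": ["time", "x", "pressure"], "format": "<iif"}


def scan(stream, layout):
    """Yield one dict per fixed-size record, reading the stream once."""
    record = struct.Struct(layout["format"])
    while True:
        chunk = stream.read(record.size)
        if len(chunk) < record.size:
            return  # end of data (or a truncated trailing record)
        yield dict(zip(layout["fields"], record.unpack(chunk)))


def select(stream, layout, columns, where):
    """Rough equivalent of: SELECT columns FROM stream WHERE where(row)."""
    return [
        tuple(row[c] for c in columns)
        for row in scan(stream, layout)
        if where(row)
    ]


def demo():
    """Build a tiny in-memory dataset and run one query against it."""
    record = struct.Struct(LAYOUT["format"])
    rows = [(1, 10, 0.5), (2, 20, 1.5), (3, 30, 2.5)]  # made-up sample data
    raw = b"".join(record.pack(*r) for r in rows)
    return select(
        io.BytesIO(raw),
        LAYOUT,
        ["time", "pressure"],
        lambda r: r["time"] >= 2 and r["pressure"] > 1.0,
    )
```

The user-facing query only names columns and predicates; everything about record size and byte order lives in the descriptor, which is the separation the slide is describing.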
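XML/XQuery Implementation
[Figure: XQuery is written against an XML view; the step from that view to the underlying formats (HDF5, NetCDF, TEXT, RMDB, …) is marked "???"]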
Approach / Contributions
• Use of XML Schemas to provide high-level abstractions on complex datasets
• Using XQuery with these Schemas to specify processing
• Issues in Translation
  • High-level to low-level code
  • Data-centric transformations for locality in low-level codes
• Issues specific to XQuery
  • Recognizing recursive reductions
  • Type inferencing and translation
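To make "recognizing recursive reductions" concrete: a functional query language tends to express an aggregate as a recursive function, and a compiler that recognizes the pattern can replace it with a simple accumulation loop. The two Python functions below only illustrate that before/after relationship; the function names are hypothetical, and this is not the transformation algorithm used in the work.

```python
def sum_recursive(items):
    """Reduction written recursively, in the style a functional query would use."""
    if not items:
        return 0
    return items[0] + sum_recursive(items[1:])


def sum_iterative(items):
    """The same reduction as a single accumulation loop over the data."""
    acc = 0
    for value in items:
        acc += value
    return acc
```

The loop form visits each element once and keeps a single accumulator, which is the shape that data-centric and locality transformations can then work with.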
Wrappers
• Goal: to provide the integration system transparent access to data sources
• Challenges
  • Development cost
  • Performance
    • Scripting languages can be slow
  • Updates
    • Data formats can change frequently
Our Approach
• Machine-interpretable metadata
• A layout descriptor associated with each dataset
• Wrappers generated on the fly
• Applied to several bioinformatics examples
Layout Descriptor

  DATASET "FASTAData" {            (dataset name)
    DATATYPE {FASTA}               (schema name)
    DATASPACE LINESIZE=80 {        (file layout)
      LOOP ENTRY 1:EOF:1 {
        ">" ID " " DESCRIPTION
        < "\n" SEQ >
        "\n" | EOF
      }
    }
    DATA {osu/fasta}               (file location)
  }

Sample file (each ">" line holds ID and DESCRIPTION; the lines that follow are SEQ):

  >Example1 envelope protein
  ELRLRYCAPAGFALLKCNDA
  DYDGFKTNCSNVSVVHCTNL
  MNTTVTTGLLLNGSYSENRT
  QIWQKHRTSNDSALILLNKH
  >Example2 synthetic peptide
  HITREPLKHIPKERYRGTNDT…
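For a sense of what a wrapper for this layout has to do, here is a hand-written Python sketch that reads entries of the form `">" ID " " DESCRIPTION` followed by one or more sequence lines. In the approach described here such code would be generated from the descriptor rather than written by hand; the `read_fasta` name and the sample data are made up for illustration.

```python
import io


def read_fasta(stream):
    """Yield (id, description, sequence) for each entry in a FASTA text stream."""
    entry_id, description, seq_lines = None, None, []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if entry_id is not None:
                yield entry_id, description, "".join(seq_lines)
            # The header is ">" ID " " DESCRIPTION: split at the first space.
            entry_id, _, description = line[1:].partition(" ")
            seq_lines = []
        elif line and entry_id is not None:
            seq_lines.append(line)  # one SEQ line of the current entry
    if entry_id is not None:
        yield entry_id, description, "".join(seq_lines)


# Made-up sample input, not data from the talk.
SAMPLE = ">seq1 first entry\nACGT\nTTGA\n>seq2 second entry\nGGCC\n"
```

Calling `list(read_fasta(io.StringIO(SAMPLE)))` gives one tuple per entry, with the sequence lines of each entry joined into a single string.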
XQuery on Streaming Data
• Infinite data streams
  • All processing must be single pass
• Interesting compiler questions:
  • How can code be transformed to execute in a single pass?
  • How can we tell whether it can be executed correctly in a single pass?
• Addressed this problem for XML streams and the XML query language XQuery
• Appears in VLDB 2005
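The single-pass constraint can be illustrated with a small Python sketch that computes an average over an XML stream while keeping only a running sum and count. It uses the standard library's `xml.etree.ElementTree.iterparse`; the `average_price` function, the `price` tag, and the sample document are assumptions made for this example and do not come from the VLDB 2005 work.

```python
import io
import xml.etree.ElementTree as ET


def average_price(stream, tag="price"):
    """Average the values of all <tag> elements in one pass over the stream."""
    total, count = 0.0, 0
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            total += float(elem.text)
            count += 1
        # Drop the element's text and children once it has been seen.
        # The root still holds references to the emptied children, so a
        # version meant for unbounded streams would also detach them.
        elem.clear()
    return total / count if count else None


# Made-up sample document standing in for a stream.
SAMPLE = (
    b"<quotes>"
    b"<q><price>10.0</price></q>"
    b"<q><price>20.0</price></q>"
    b"<q><price>30.0</price></q>"
    b"</quotes>"
)
```

An aggregate like this fits in one pass because its state is bounded; a query that needs to revisit earlier elements, such as a sort, does not, which is the kind of distinction the compiler questions above are about.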