An Approach for Automatic Data Virtualization
Li Weng, Gagan Agrawal et al.
Motivating Applications
[Figures: Oil Reservoir Management; Magnetic Resonance Imaging]
Data-driven applications from science, engineering, and biomedicine:
• Oil Reservoir Management
• Water Contamination Studies
• Cancer Studies using MRI
• Telepathology with Digitized Slides
• Satellite Data Processing
• Virtual Microscope
• …
Opportunity and Issues
• Emergence of grid-based data repositories
  • Can enable sharing of data in an unprecedented way
• Access mechanisms for remote repositories
  • Complex low-level formats make accessing and processing the data difficult
• Main desired functionality
  • Ability to select and download a subset of the data
Current Approaches
• Databases
  • Relational model using SQL
  • Transaction properties: Atomicity, Consistency, Isolation, Durability (ACID)
  • Good! But is it too heavyweight for read-mostly scientific data?
• Manual implementation based on low-level datasets
  • Needs a detailed understanding of the low-level formats
• HDF5, NetCDF, etc.
  • No single established standard
• BinX, BFD, DFDL
  • Machine-readable descriptions, but the application depends on a specific layout
Data Virtualization
[Diagram: a Data Service applies Data Virtualization to a dataset, exposing an abstract view of the data]
By the Global Grid Forum's DAIS working group:
• A Data Virtualization describes an abstract view of data.
• A Data Service implements the mechanism to access and process data through the Data Virtualization.
Our Approach: Automatic Data Virtualization
• Automatically create data services
  • A new application of compiler technology
• A meta-data descriptor describes the layout of data in a repository
• An abstract view is exposed to the users
• This paper:
  • Relational table view
  • Subsetting specified through SQL SELECT statements and WHERE clauses
Outline
• Introduction
  • Motivation
  • System overview
• System design and algorithm
  • Design a meta-data descriptor
  • Automatic data virtualization using our meta-data descriptor
• Experimental results
• Related work
• Conclusions and future work
System Overview

  SELECT <Data Elements>
  FROM <Dataset Name>
  WHERE … AND Filter(<Data Element>);
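As a concrete illustration, a subsetting query against the IparsData virtual table defined later in this talk might look as follows (illustrative only; the exact syntax accepted by the generated data service may differ):

  SELECT X, Y, Z, SOIL
  FROM IparsData
  WHERE REL = 0 AND TIME >= 1 AND TIME <= 100
        AND Filter(SOIL);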
STORM Runtime System
A middleware to support data selection, data partitioning, and data transfer operations on flat-file datasets hosted on a parallel system.
Services:
• Query service
• Data source service
• Indexing service
• Filtering service
• Partition generation service
• Data mover service
Outline
• Introduction
  • Motivation
  • System overview
• System design and algorithm
  • Design a meta-data descriptor
  • Automatic data virtualization using our meta-data descriptor
• Experimental results
• Related work
• Conclusions and future work
Scientific Datasets
• Large volume
  • Gigabytes, terabytes, petabytes, …
• Stored as binary/character flat files with a highly repetitive structure
• Distributed datasets
  • Generated/collected by scientific simulations or instruments
• Multi-dimensional datasets
  • Spatial and/or temporal coordinates as subsetting index attributes
  • Filtering attributes
Design of a Meta-data Description Language
• Requirements
  • Specify the relationship of a dataset to the virtual dataset schema
  • Describe the dataset's physical layout within a file
  • Describe the dataset's distribution across the nodes of one or more clusters
  • Specify the subsetting index attributes
  • Easy to use for data repository administrators, and convenient for our code generation
Design Overview
• Dataset Schema Description Component
• Dataset Storage Description Component
• Dataset Layout Description Component
An Example
• Oil Reservoir Management
  • The dataset comprises several simulations on the same grid.
  • For each realization and each grid point, a number of attributes are stored.
  • The dataset is stored on a 4-node cluster.

Component I: Dataset Schema Description

  [IPARS]              //{* Dataset schema name *}
  REL = short int      //{* Data type definition *}
  TIME = int
  X = float
  Y = float
  Z = float
  SOIL = float
  SGAS = float

Component II: Dataset Storage Description

  [IparsData]                   //{* Dataset name *}
  DatasetDescription = IPARS    //{* Dataset schema for IparsData *}
  DIR[0] = osu0/ipars
  DIR[1] = osu1/ipars
  DIR[2] = osu2/ipars
  DIR[3] = osu3/ipars
Data Layout Description Component

  DATASET "ROOT" {
    DATATYPE  { … }
    DATAINDEX { … }
    DATA {
      DATASET dataset1
      DATASET dataset2
      DATASET dataset3
    }
    DATASET "dataset1" {
      DATATYPE  { … }
      DATASPACE { … }
      DATA { data1 data2 data3 }
    }
    DATASET "dataset2" {
      DATATYPE  { … }
      DATASPACE { … }
      DATA { data4 }
    }
    DATASET "dataset3" { … }
  }

[Figure: the descriptor as a tree: Dataset Root contains dataset1 (Data1, Data2, Data3), dataset2 (Data4), and dataset3 (Data5, Data6)]
An Example

Component III: Dataset Layout Description

  DATASET "IparsData" {                //{* Name for Dataset *}
    DATATYPE  { IPARS }                //{* Schema for Dataset *}
    DATAINDEX { REL TIME }
    DATA { DATASET ipars1  DATASET ipars2 }

    DATASET "ipars1" {
      DATASPACE {
        LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { X Y Z }
      }
      DATA { $DIR[$DIRID]/COORDS  $DIRID = 0:3:1 }
    }  //{* end of DATASET "ipars1" *}

    DATASET "ipars2" {
      DATASPACE {
        LOOP TIME 1:500:1 {
          LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { SOIL SGAS }
        }
      }
      DATA { $DIR[$DIRID]/DATA$REL  $REL = 0:3:1  $DIRID = 0:3:1 }
    }  //{* end of DATASET "ipars2" *}
  }

• Oil Reservoir Management
  • The LOOP keyword captures the repetitive structure within a file.
  • The grid has 4 partitions (0–3).
  • "IparsData" comprises "ipars1" and "ipars2": "ipars1" describes the data files storing the spatial coordinates, while "ipars2" describes the data files storing the other attributes.
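Reading the DATASPACE of "ipars2" literally, the byte offset of an attribute value for a given (TIME, GRID) pair follows directly from the loop nest. Below is a minimal Python sketch of that computation, assuming 4-byte floats and the loop bounds from the descriptor; it is our own illustration of what the generated extraction code must compute, not the tool's actual output:

  FLOAT_SIZE = 4
  ATTRS = ["SOIL", "SGAS"]        # innermost attribute order in ipars2
  POINTS_PER_PARTITION = 100      # grid points per $DIRID partition

  def ipars2_offset(time, grid, dirid, attr):
      """Byte offset of attr for (time, grid) inside DIR[dirid]/DATA$REL."""
      first_grid = dirid * POINTS_PER_PARTITION + 1   # GRID loop lower bound
      row = (time - 1) * POINTS_PER_PARTITION + (grid - first_grid)
      return (row * len(ATTRS) + ATTRS.index(attr)) * FLOAT_SIZE

  # For example, SOIL at TIME=2, GRID=105 on node 1 lies at byte 832:
  print(ipars2_offset(2, 105, 1, "SOIL"))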
Automatic Virtualization Using Meta-data
• Aligned file chunks:

    { num_rows,
      {File1, Offset1, Num_Bytes1},
      {File2, Offset2, Num_Bytes2},
      …,
      {Filem, Offsetm, Num_Bytesm} }

• Our tool parses the meta-data descriptor and generates function code. At run time, the query supplies the parameters with which the generated functions are invoked to create aligned file chunks (a minimal sketch of the structure follows below).

[Figure: the descriptor tree from the Data Layout Description slide]
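To make the aligned-file-chunk structure concrete, here is a small Python sketch; the class and field names are our own illustration, not the tool's generated code:

  from dataclasses import dataclass
  from typing import List

  @dataclass
  class FileSegment:
      filename: str     # file contributing some of the attributes
      offset: int       # byte offset of the segment within the file
      num_bytes: int    # length of the segment in bytes

  @dataclass
  class AlignedFileChunk:
      num_rows: int                 # rows jointly described by the segments
      segments: List[FileSegment]   # row i of the chunk is assembled from
                                    # row i of every segment

  # Example: 100 grid points whose coordinates live in COORDS and whose
  # SOIL/SGAS values live in DATA0, at illustrative offsets and lengths.
  chunk = AlignedFileChunk(
      num_rows=100,
      segments=[
          FileSegment("osu0/ipars/COORDS", 0, 100 * 3 * 4),  # X, Y, Z floats
          FileSegment("osu0/ipars/DATA0", 0, 100 * 2 * 4),   # SOIL, SGAS floats
      ],
  )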
Compiler Analysis

  Data_Extract {
    Find_File_Groups()
    Process_File_Groups()
  }

  Find_File_Groups {
    Let S be the set of files that match against the query
    Classify files in S by the set of attributes they have
    Let S1, …, Sm be the m sets
    T = Ø
    foreach {s1, …, sm}, si ∈ Si {     //{* cartesian product of S1, …, Sm *}
      if the values of the implicit attributes are not inconsistent {
        T = T ∪ {s1, …, sm}
      }
    }
    Output T
  }

  Process_File_Groups {
    foreach {s1, …, sm} ∈ T {
      Find_Aligned_File_Chunks()
      Supply the implicit attributes for each file chunk
      foreach Aligned File Chunk {
        Check against index
        Compute offset and length
        Output the aligned file chunk
      }
    }
  }

[Figure: the meta-data descriptor is compiled into index and extraction function code, which first creates and then processes aligned file chunks (AFCs)]
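The grouping step above can be pictured with a short Python sketch. It assumes each file has been tagged with the set of attributes it carries and with any implicit attribute values (e.g. $REL, $DIRID) recovered from its path; this is our own rendering of the pseudocode, not the generated code:

  from itertools import product

  def find_file_groups(files):
      """files: list of dicts {"name": str, "attrs": frozenset,
      "implicit": dict}; returns the consistent file groups T."""
      # Classify files by the set of attributes they carry (S1, ..., Sm).
      classes = {}
      for f in files:
          classes.setdefault(f["attrs"], []).append(f)
      groups = []
      # Cartesian product over the classes; keep a combination only if
      # its members agree on every shared implicit attribute.
      for combo in product(*classes.values()):
          merged, consistent = {}, True
          for f in combo:
              for key, val in f["implicit"].items():
                  if merged.get(key, val) != val:
                      consistent = False
                      break
                  merged[key] = val
              if not consistent:
                  break
          if consistent:
              groups.append(combo)
      return groups

For the IPARS example, the COORDS files carry {X, Y, Z} and the DATA$REL files carry {SOIL, SGAS}; agreement on $DIRID is what pairs each COORDS file with the DATA files from the same directory.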
An Example

Component III: Dataset Layout Description (repeated from the earlier slide)

• Consider a query that selects a subset with REL values of 0 and 1 and TIME from 1 to 100.
  • Exclude DATA2, DATA3
  • Exclude COORD2, COORD3
  • Determine eight file groups, for k = 0, 1, 2, 3:
      DIR[k]/{COORD0, DATA0}
      DIR[k]/{COORD1, DATA1}
  • Create 100 aligned file chunks for each file group
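In the relational view exposed to the user, this subsetting request might be phrased as follows (illustrative syntax):

  SELECT X, Y, Z, SOIL, SGAS
  FROM IparsData
  WHERE REL IN (0, 1) AND TIME >= 1 AND TIME <= 100;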
Outline
• Introduction
  • Motivation
  • System overview
• System design and algorithm
  • Design a meta-data descriptor
  • Automatic data virtualization using our meta-data descriptor
• Experimental results
• Related work
• Conclusions and future work
Experimental Setup & Design
A Linux cluster connected via Switched Fast Ethernet; each node has a PIII 933 MHz CPU, 512 MB main memory, and three 100 GB IDE disks.
Three sets of experiments:
• Test code generation ability
• Evaluate scalability
• Compare with hand-written codes
Test the Ability of Our Code Generation Tool
• Layout 0 – the original layout from the application collaborators
• Layout 1 – all data stored as a single table in one file (see the descriptor sketch below)
• Layout 2 – all data in one file, with each attribute stored as an array
• Layout 3 – Layout 1 split into multiple files based on the value of the time step
• Layout 4 – like Layout 3, but with each attribute stored as an array in each data file
• Layout 5 – data stored in 7 files, where the first file holds the spatial coordinates and the other attributes are divided among the remaining 6 files
• Layout 6 – like Layout 5, but with each attribute stored as an array in each data file
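For instance, Layout 1 might be captured by a descriptor along the following lines. This is purely our own sketch: the loop bounds follow the 4-partition, 500-time-step example used earlier, the single file name is hypothetical, and whether REL may appear as a DATASPACE loop variable is our assumption about the language:

  DATASET "IparsData" {
    DATATYPE  { IPARS }
    DATAINDEX { REL TIME }
    DATASPACE {
      LOOP REL 0:3:1 {
        LOOP TIME 1:500:1 {
          LOOP GRID 1:400:1 { X Y Z SOIL SGAS }
        }
      }
    }
    DATA { $DIR[0]/ipars_table }
  }

The point of the experiment is that only the descriptor changes across Layouts 0–6; the user's SQL queries stay the same.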
Test the Ability of Our Code Generation Tool
Oil Reservoir Management
• The performance difference relative to Layout 0 is within 4%–10%.
• The tool correctly and efficiently handles a variety of different layouts of the same data.
Evaluate the Scalability of Our Tool
• Scale the number of nodes hosting the Oil Reservoir Management dataset
• Extract a subset of interest 1.3 GB in size
• The execution times scale almost linearly.
• The performance difference varies between 5% and 34%, with an average difference of 16%.
Comparison with Hand-Written Codes
• Oil Reservoir Management dataset stored on 16 nodes: the performance difference is within 17%, with an average difference of 14%.
• Satellite data processing dataset stored on a single node: the performance difference is within 4%.
Related Work
• Describing data on the Grid
  • BinX and Binary Format Description (BFD)
  • HDF5
• Parallel / distributed databases
• Data cube
• Magda on top of MySQL
• Oracle's external tables
• OPeNDAP
• SRS
Conclusions and Future Work
• An automatic approach to support data virtualization for large, distributed scientific datasets in low-level formats
  • Designed a meta-data description language
  • A compiler-based strategy generates the extractor code automatically
  • The dataset can be stored in the format in which it is generated; no effort is spent loading it into a database system
• Experimental evaluation demonstrates the efficacy and efficiency of our tool
• Future work
  • Experimental studies with more real data-driven and interactive applications and larger scientific datasets in distributed, heterogeneous computing environments
  • Extend computational capability and flexibility by supporting user-defined aggregates
  • Integration of multiple datasets in a grid computing environment
Comparison with an Existing Database (PostgreSQL)
• A 6 GB dataset for satellite data processing; the total storage required after loading the data into PostgreSQL is 18 GB.
• Indexes were created on both the spatial coordinates and S1 in PostgreSQL.
• No special performance tuning was applied in the experiment.