Supporting SQL-3 Aggregations on Grid-based Data Repositories

Supporting SQL-3 Aggregations on Grid-based Data Repositories Li Weng, Gagan Agrawal, Umit Catalyurek, Joel Saltz

Scientific data repositories Large volume Gigabyte, Terabyte, Petabyte Distributed datasets Generated/collected by scientific simulations or instruments Multi-dimensional datasets Dimension attributes, measure attributes Scientific data analysis Scientific Data Analysis on Grid-based Data Repositories Data Specification Data Organization Data Extraction Data Movement Data Analysis Data Visualization

Motivating Scientific Applications Oil Reservoir Management Magnetic Resonance Imaging Data-driven applications from science, Engineering, biomedicine: Oil Reservoir Management Water Contamination Studies Cancer Studies using MRI Telepathology with Digitized Slides Satellite Data Processing Virtual Microscope …

Current Approaches • Databases • Relational model using SQL • Properties of transactions: Atomicity, Isolation, Durability, Consistency • Good! But is it too heavyweight for read-mostly scientific data ? • Manual implementation based on low-level datasets • Need detailed understanding of low-level formats • HDF5, NetCDF, etc • Depend on the access behavior API • No single established standard • BinX, BFD, DFDL • Machine readable descriptions, but application is dependent on a specific layout

Our Approach • Express the query & the computing declaratively on a virtual relational table view • Dataset in complex, low-level layouts can be abstracted as SQL-3 table to scientists. • Support basic SELECT query for specifying subset of interest. • Data analysis on subset of interest can be defined as SQL-3 aggregate function on SQL-3 relation.

Our Approach • Compilation based system • Meta-data descriptor • Generate data extracting service & data aggregation service • A lightweight layer on top of datasets • A runtime middleware STORM is used to work in coordination with the generated services.

Compiler Analysis and Code Generation Query frontend STORM Extraction Service System Overview Meta-data Descriptor User Defined Aggregate Select Query Input Aggregation Service

Outline • Introduction • Motivation • system overview • System design and algorithm • Canonical query structure and one example • Compiler analysis and code generation • Design a meta-data descriptor • Data extraction service & Data aggregation service • Experimental results • Related work • Conclusions

Canonical Query Structure SELECT <attribute list> , <AGG_name(Dataset Name)> From <Dataset Name> WHERE <Expression> GROUP BY <group-by attribute_list>; CREATE AGGREGATE <AGG_name> (BASETYPE = <Dataset Name>, SFUNC = <sfunc>, STYPE = <state_type> [, FINALFUNC = <ffunc>] [, INITCOND = <initial_condition>] ) CREATE FUNCTION <func> (<AGG_status>, <Dataset Name> ) RETURNS <rettype> AS ‘ <SQL statement list> ’ LANGUAGE SQL;

Oil Reservoir Management (IPARS) SELECT X, Y, Z, ipars_bypass_sum(IPARS) FROM IPARS WHERE REL in (0,5,10) AND TIME >= 1000 AND TIME <= 1200 GROUP BY X, Y, Z HAVING ipars_bypass_sum(OIL)>0; CREATE AGGREGATE ipars_bypass_sum ( BASETYPE = IPARS, SFUNC = ipars_func, STYPE = int, INITCOND = '1' ); CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS ' SELECT CASE WHEN $2.soil > 0.7 AND |/($2.oilx * $2.oilx + $2.oily * $2.oily + $2.oilz * $2.oilz)<30.0 THEN $1 & 1 ELSE 0 END; ' LANGUAGE SQL;

Compiler Analysis and Code Generation • Transform the canonical query into two pipelined sub-queries. • Data Extraction Service TempDataset = SELECT <all attributes> From <Dataset Name> WHERE <Expression> ; • Data Aggregation Service SELECT <attribute list> , <AGG_name(Dataset Name)> FROM TempDataset GROUP BY <group-by attribute_list>;

Design a Meta-data Descriptor-- An Example Component I: Dataset Schema Description [IPARS] // { * Dataset schema name *} REL = short int // {* Data type definition *} TIME = int X = float Y = float Z = float SOIL = float SGAS = float • Oil Reservoir Management • The dataset comprises several simulations on the same grid • For each realization, each grid point, a number of attributes are stored. • The dataset is stored on a 4 node cluster. Component II: Dataset Storage Description [IparsData] //{* Dataset name *} //{* Dataset schema for IparsData *} DatasetDescription = IPARS DIR[0] = osu0/ipars DIR[1] = osu1/ipars DIR[2] = osu2/ipars DIR[3] = osu3/ipars

An Example Component III: Dataset Layout Description DATASET “IparsData” { //{* Name for Dataset *} DATATYPE { IPARS } //{* Schema for Dataset *} DATAINDEX { REL TIME } DATA { DATASET ipars1, DATASET ipars2} DATASET “ipars1” { DATASPACE { LOOP GRID ( $DIRID*100+1):(($DIRID+1)*100):1 { X Y Z } } DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 } } // {* end of DATASET “ipars1” *} DATASET “ipars2” { DATASPACE { LOOP TIME 1:500:1 { LOOP GRID ( $DIRID*100+1):(( $DIRID+1)*100):1 { SOIL SGAS } } } DATA { $DIR[ $DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 } } //{* end of DATASET “ipars2” *} } • Oil Reservoir Management • Use LOOP keyword for capturing the repetitive structure within a file. • The grid has 4 partitions (0~3). • “IparsData” comprises “ipars1” and “ipars2”. “ipars1” describes the data files with the spatial coordinates’ stored; “ipars2” specifies the data files with other attributes stored.

Generate Data Extraction Service Using Meta-data • Aligned file chunks {num_rows, {File1,Offset1,Num_Bytes1}, {File2,Offset2,Num_Bytes2}, ……, {Filem,Offsetm,Num_Bytesm} } • Our tool parses the meta-data descriptor and generates function codes. • At run time, the query would provide parameters to invoke the generated functions to create Aligned File Chunks.

Generate Data Aggregation Service • Aggregate function analysis • Projection push-down helps to extract data only needed for a particular query and its aggregation. TempDataset = SELECT <useful attributes> From <Dataset Name> WHERE <Expression> ; • As for the IPARS application, only 7 out of the 22 attributes are actually needed for the considered query. The reduction of the data volume to be retrieved and communicated is 66%. • As for the TITAN application, 5 out of the 8 attributes are actually needed and the reduction is 38%.

Generate Data Aggregation Service 2. Aggregate function decomposition • The first step involves computations applied on each tuple; The second step updates the aggregate status variable. • Replace the largest expression with TempAttr. As for the IPARS, the number of attributes is reduced further from 7 to 4. CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS ' SELECT CASE WHEN $2.TempAttr THEN $1 & 1 ELSE 0 END; ' LANGUAGE SQL;

Generate Data Aggregation Service • Partition the subset of interest based on the values of the group-by attributes if more client nodes are provided as the computing unit. • Construct a hash-table using the values of the group-by attributes as the hash-key. And translate the aggregate function in SQL-3 into the imperative C/C++ code.

Experimental Setup & Design A Linux cluster connected via a Switched Fast Ethernet. Each node has a PIII 933MHz CPU, 512 MB main Memory, and three 100GB IDE disks. • Scalability test when varying the number of nodes for hosting data and performing the computations; • Performance test when the amount of data to be processed is increased; • Comparison with hand-written code; • The impact of the aggregation decomposition.

Experimental Results for IPARS • Scale the number of nodes hosting the data and the number of nodes for processing. • Extract a subset of interest at the size of 640MB from scanning the 1.9GB data. • The execution times scale almost linearly. • The performance difference varies between 6%~20%, with an average difference of 14%. • The aggregate decomposition can reduce the difference to be between 1% and 10%.

Experimental Results for IPARS • Evaluate the system’s ability to scale to larger datasets. • Use 8 data source nodes and 8 client nodes. • The execution time stays proportional to the amount of data to be retrieved and processed.

Experimental Results for TITAN • Scale the number of nodes hosting the data and the number of nodes for processing. • Extract a subset of interest at the size of 228MB from scanning the 456MB data. • The execution times scale almost linearly. • The performance difference is 17%. • The aggregate decomposition can reduce the difference to be 6%.

Experimental Results for TITAN • Evaluate the system’s ability to scale to larger datasets. • Use 8 data source nodes and 8 client nodes. • The execution time stays proportional to the amount of data to be retrieved and processed.

Related Work • Parallel / distributed databases • Parallelization of SQL-based aggregation and reductions • Data cubes • External tables in Oracle • Reduction research in parallelizing compilers • Runtime strategies for supporting reductions in a distributed environment

Conclusions • A compiler-based system for supporting SQL-3 aggregate function and select query with group-by operator on flat-file scientific datasets. • Both the extraction of the subset of interest and the aggregate computing can be expressed declaratively. • By using a meta-data descriptor to represent the layout of the dataset, our compiler generates efficient data extraction service. • The compiler analyzes the user-define aggregate function and generate code in a parallel environment.

Supporting SQL-3 Aggregations on Grid-based Data Repositories