210 likes | 313 Views
Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets. Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University. CCGrid 2012, Ottawa, Canada. Outline. Motivation and Introduction Background System Overview Experiment
E N D
Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University CCGrid 2012, Ottawa, Canada
Outline Motivation and Introduction Background System Overview Experiment Conclusion
Motivation • Science become increasingly data driven • Strong desire for efficient data analysis • Challenges • Data sizes grow rapidly • Slow IO and Network Bandwidth • An example • Different kinds of subsetting requests • Different scientific data formats
An Example 7.4 days! • GCRM (Global Cloud Resolving Model) • A global atmospheric circulation model
Client-side vs. Sever-side subsetting and aggregation Simple Request Advanced Request
Data Virtualization • Support SQL queries over scientific dataset • Standard • Flexible • Keep data in native format(etc. NetCDF, HDF5) • Compare with other scientific data management tools • SciDB: support for data arrays in parallel • OPeNDAP: no flexible subsetting and aggregation
Our Approach • User-defined subsetting and aggregations • Subsetting: Dimensions, Coordinates, Variables • Aggregation: SUM, AVG, COUNT, MAX, MIN • Support NetCDF data format • Developed by UCAR • Widely used in climate simulation • Parallel Data Access • Data Partition Strategy • Different Parallel Level
Background - NetCDF Metadata Actual value stored in m-d array Time = 1 to 3 Y = 1 to 4 X = 1 to 4
System Architecture Physical Metadata Logical Metadata Parse the SQL expression Parse the metadata file Generate Query Request Partition Criteria: Subsetting: Disk Access Aggregation: Data Transfer Read Data Post-filter data Local Data Aggregation
Data Aggregation • SQL: SELECT SUM(pressure) FROM GCRM Slave Processes Master Process
Data Parallelism Level 1: data file (2 < 12?) Level 2: variable (5 < 12?) Level 3: data block (12)
Experiment Goals • To compare the functionality and performance of our system with OPeNDAP • OPeNDAP makes local data accessible to remote locations regardless of local storage format. • Data Translation Mechanism • No flexible subsetting and aggregation support • To evaluate the parallel scalability of our system • To show how aggregation queries reduce the data transfer cost.
Compare with OPeNDAP for Type 1 Queries • Data size: 4GB • Input: 50 SQL queries • Query Type: queries only include dimensions • Object: • Baseline: NetCDF query time • Our system without parallelism • OPeNDAP • Relative Speedup: 2.34 – 3.10
Compare with OPeNDAP for Type 2, Type 3 Queries • Data size: 4GB • Input: 50 SQL queries • Query Type: queries include coordinates and variables • Object: • Baseline • Our system without parallelism • OPeNDAP + Filter • Relative Speedup: 1.58 – 3.47
Parallel Optimization – Different Data Size • Data size: 4GB – 32GB • Process number: 1 to 16 • Input: select the whole variable • Relative Speedup: • 4 procs: 2.17 – 2.87 • 8 procs: 4.06 – 5.54 • 16 procs: 7.23 – 9.33
Parallel Optimization – Different Queries • Data size: 32GB • Processes number: 1 to16 • Input: 100 SQL queries • Query Type: queries include dimensions, coordinates and variables • Relative Speedup: • 4 procs: 2.20 – 2.92 • 8 procs: 3.95 – 4.21 • 16 procs: 7.25 – 7.74
Data Aggregation - Time • Data size: 16GB • Process number: 1 - 16 • Input: 60 aggregation queries • Query Type: • Only Agg • Agg + Group by + Having • Agg + Group by • Relative Speedup: • 4 procs: 2.61 – 3.08 • 8 procs: 4.31 – 5.52 • 16 procs: 6.65 – 9.54
Data Aggregation – Data Transfer Amount • Data size: 16GB • Process number: 1 - 16 • Input: 60 aggregation queries • Query Type: • Only Agg • Agg + Group by + Having • Agg + Group by
Conclusion Data sizes increase in a fast speed Goal: Find exact data subset as user specifies Data virtualization on top of NetCDF dataset Query request partition and parallel processing A good speedup compared with OPeNDAP
Pre-filter Module Phase 1 Phase 2 Phase 3 Dataset Storage Metadata Dataset Logical Metadata Request Partition Strategy