Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets

Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University CCGrid 2012, Ottawa, Canada

Outline Motivation and Introduction Background System Overview Experiment Conclusion

Motivation • Science become increasingly data driven • Strong desire for efficient data analysis • Challenges • Data sizes grow rapidly • Slow IO and Network Bandwidth • An example • Different kinds of subsetting requests • Different scientific data formats

An Example 7.4 days! • GCRM (Global Cloud Resolving Model) • A global atmospheric circulation model

Client-side vs. Sever-side subsetting and aggregation Simple Request Advanced Request

Data Virtualization • Support SQL queries over scientific dataset • Standard • Flexible • Keep data in native format(etc. NetCDF, HDF5) • Compare with other scientific data management tools • SciDB: support for data arrays in parallel • OPeNDAP: no flexible subsetting and aggregation

Our Approach • User-defined subsetting and aggregations • Subsetting: Dimensions, Coordinates, Variables • Aggregation: SUM, AVG, COUNT, MAX, MIN • Support NetCDF data format • Developed by UCAR • Widely used in climate simulation • Parallel Data Access • Data Partition Strategy • Different Parallel Level

Background - NetCDF Metadata Actual value stored in m-d array Time = 1 to 3 Y = 1 to 4 X = 1 to 4

System Architecture Physical Metadata Logical Metadata Parse the SQL expression Parse the metadata file Generate Query Request Partition Criteria: Subsetting: Disk Access Aggregation: Data Transfer Read Data Post-filter data Local Data Aggregation

Data Aggregation • SQL: SELECT SUM(pressure) FROM GCRM Slave Processes Master Process

Data Parallelism Level 1: data file (2 < 12?) Level 2: variable (5 < 12?) Level 3: data block (12)

Experiment Goals • To compare the functionality and performance of our system with OPeNDAP • OPeNDAP makes local data accessible to remote locations regardless of local storage format. • Data Translation Mechanism • No flexible subsetting and aggregation support • To evaluate the parallel scalability of our system • To show how aggregation queries reduce the data transfer cost.

Compare with OPeNDAP for Type 1 Queries • Data size: 4GB • Input: 50 SQL queries • Query Type: queries only include dimensions • Object: • Baseline: NetCDF query time • Our system without parallelism • OPeNDAP • Relative Speedup: 2.34 – 3.10

Compare with OPeNDAP for Type 2, Type 3 Queries • Data size: 4GB • Input: 50 SQL queries • Query Type: queries include coordinates and variables • Object: • Baseline • Our system without parallelism • OPeNDAP + Filter • Relative Speedup: 1.58 – 3.47

Parallel Optimization – Different Data Size • Data size: 4GB – 32GB • Process number: 1 to 16 • Input: select the whole variable • Relative Speedup: • 4 procs: 2.17 – 2.87 • 8 procs: 4.06 – 5.54 • 16 procs: 7.23 – 9.33

Parallel Optimization – Different Queries • Data size: 32GB • Processes number: 1 to16 • Input: 100 SQL queries • Query Type: queries include dimensions, coordinates and variables • Relative Speedup: • 4 procs: 2.20 – 2.92 • 8 procs: 3.95 – 4.21 • 16 procs: 7.25 – 7.74

Data Aggregation - Time • Data size: 16GB • Process number: 1 - 16 • Input: 60 aggregation queries • Query Type: • Only Agg • Agg + Group by + Having • Agg + Group by • Relative Speedup: • 4 procs: 2.61 – 3.08 • 8 procs: 4.31 – 5.52 • 16 procs: 6.65 – 9.54

Data Aggregation – Data Transfer Amount • Data size: 16GB • Process number: 1 - 16 • Input: 60 aggregation queries • Query Type: • Only Agg • Agg + Group by + Having • Agg + Group by

Conclusion Data sizes increase in a fast speed Goal: Find exact data subset as user specifies Data virtualization on top of NetCDF dataset Query request partition and parallel processing A good speedup compared with OPeNDAP

Thanks

Pre-filter Module Phase 1 Phase 2 Phase 3 Dataset Storage Metadata Dataset Logical Metadata Request Partition Strategy

Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets

Supporting User Defined Subsetting and Aggregation over Parallel NetCDF Datasets

Presentation Transcript

User Defined Functions

User Defined Functions

User Defined Functions

User Defined Functions

User-Defined Classes

User-Defined Primitives

Supporting SQL Queries for subsetting large-SCALE Datasets in Paraview

User-Defined Types

Subsetting

User-Defined Classes

Aggregation and Subsetting in ERDDAP (a middleman data server)

Parallel netCDF Study.

Supporting Correlation Analysis on Scientific Datasets in Parallel and Distributed Settings

USER-DEFINED FUNCTIONS

User defined Functions

User Defined Data

Inheritance and User-Defined Controls

SUBSETTING

Structure and User Defined Type

User Defined Functions

User Defined Functions