Supporting a Light-Weight Data Management Layer Over HDF5
Yi Wang, Yu Su and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
Outline
• Introduction
• System Design
• Experimental Results
• Conclusion
Introduction
• Scientific data analysis is increasingly data-intensive
• Data volume approximately doubles each year
• Scientific data dissemination is hampered by the growth in dataset size
• Solution: data aggregation
• Scientific datasets are typically read-only or append-only
• Reloading the data into a database to obtain ACID properties is therefore often unnecessary
HDF5
• HDF5: Hierarchical Data Format, version 5
H5DS
[Figure: a 3-D dataset whose dimensions dim0, dim1 and dim2 (indices 0, 1, 2 each) carry the dimension scales time (hour): 0, 5, 10; distance (km): 0, 10, 20; height (km): 12, 13, 14]
• H5DS (HDF5 Dimension Scales)
• An auxiliary dataset associated with a dimension of the primary dataset
• Serves as a coordinate system
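To make the "coordinate system" role concrete, here is a minimal sketch of how a coordinate range could be translated into an index range using the scales shown on the slide. It is plain Python and does not call the HDF5 library; the helper name `coord_range_to_index_range` and the assumption that each scale is sorted in ascending order are illustrative, not taken from the talk.

```python
from bisect import bisect_left, bisect_right

# Dimension scales from the slide: one coordinate array per dimension.
SCALES = {
    "time": [0, 5, 10],       # hours, attached to dim0
    "distance": [0, 10, 20],  # km, attached to dim1
    "height": [12, 13, 14],   # km, attached to dim2
}


def coord_range_to_index_range(scale, low, high):
    """Map the closed coordinate interval [low, high] to a half-open index range.

    Hypothetical helper (not part of the HDF5 API); assumes `scale` is sorted
    in ascending order.
    """
    start = bisect_left(scale, low)    # first index whose coordinate is >= low
    stop = bisect_right(scale, high)   # one past the last index whose coordinate is <= high
    return start, stop


# "distance between 5 km and 20 km" selects indices 1 and 2 of dim1.
distance_range = coord_range_to_index_range(SCALES["distance"], 5, 20)
# "height between 12.5 km and 13.5 km" selects index 1 of dim2.
height_range = coord_range_to_index_range(SCALES["height"], 12.5, 13.5)
```

Once a coordinate condition is expressed as an index range, it can be handled in the same way as an ordinary index-based selection.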
HDF5 Compound Datatype
[Figure: a compound element with four members: int16, char, int32, and a 2x3x2 array of float32]
• Compound datatype: like a “struct” in the C language
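The struct analogy can be illustrated with Python's standard `struct` module. This sketch only mimics the member layout named on the slide (int16, char, int32, 2x3x2 float32 array); the packed little-endian layout, the field values, and the variable names are assumptions for illustration, and real HDF5 compound types may use different offsets or padding.

```python
import struct

# Packed little-endian record: int16, char, int32, then 2*3*2 = 12 float32 values.
# Member order follows the slide; the packing without padding is an assumption.
CELL = struct.Struct("<hci12f")

values = [float(i) for i in range(12)]
record = CELL.pack(7, b"A", 1234, *values)

# Unpack the record and split it back into its four members.
fields = CELL.unpack(record)
count, tag, ident = fields[:3]
flat = fields[3:]

# Rebuild the 2x3x2 array in row-major (C) order.
array = [
    [flat[(i * 3 + j) * 2:(i * 3 + j) * 2 + 2] for j in range(3)]
    for i in range(2)
]
```

Each record occupies `CELL.size` bytes (2 + 1 + 4 + 48), which is why a query on one member of a compound dataset still has to step over the other members.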
Motivation
• An SQL interface over the virtual (relational) view
• SELECT, FROM, WHERE, GROUP BY, HAVING
• Scientists do not need to learn the HDF5 libraries or write extra code
• Reduce data transfer volume by data aggregation
• Load a data subset instead of the entire dataset when possible
• Reduce I/O overhead
• Parallel query execution
• Accelerate data subsetting and data aggregation
Functionality
• Query based on dimension index values (type 1, index-based condition); also supported by the HDF5 API
• Query based on dimension scales (type 2, coordinate-based condition); uses the coordinate system instead of the detailed physical layout
• Query based on data values (type 3, content-based condition); simple datatype + compound datatype
• Aggregation query: SUM, COUNT, AVG, MIN, and MAX
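The three condition types can be told apart on a toy example. The sketch below is plain Python over a made-up 3x3 grid; the data values, the `select` helper, and the full scan it performs are illustrative assumptions and say nothing about how the actual system evaluates queries.

```python
# Toy 2-D dataset indexed by (time, height); all values are made up.
TIME = [0, 5, 10]      # dimension scale for dim0 (hours)
HEIGHT = [12, 13, 14]  # dimension scale for dim1 (km)
DATA = [
    [30.1, 31.4, 29.8],
    [32.0, 33.5, 30.2],
    [28.7, 34.9, 35.6],
]


def select(index_pred=None, coord_pred=None, value_pred=None):
    """Return (i, j, value) triples that pass every supplied predicate."""
    hits = []
    for i, row in enumerate(DATA):
        for j, value in enumerate(row):
            if index_pred and not index_pred(i, j):
                continue
            if coord_pred and not coord_pred(TIME[i], HEIGHT[j]):
                continue
            if value_pred and not value_pred(value):
                continue
            hits.append((i, j, value))
    return hits


type1 = select(index_pred=lambda i, j: i >= 1)              # index-based
type2 = select(coord_pred=lambda t, h: t <= 5 and h >= 13)  # coordinate-based
type3 = select(value_pred=lambda v: v > 33.0)               # content-based

# An aggregation over the content-based selection (MAX in SQL terms).
max_type3 = max(v for _, _, v in type3)
```

Types 1 and 2 can be resolved from indices and dimension scales alone, whereas type 3 has to inspect the data values themselves.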
Execution Overview
[Figure: query execution flow. Labels: 1D: AND-logic condition list; 2D: OR-logic condition list; generated at runtime; 1D: OR-logic condition list; same content-based condition]
Metadata Generation Strategy
• A large dataset may be distributed among multiple files
• Cache frequently used information to avoid repeated I/O requests
• Collect dispersed intrinsic header metadata
Hyperslab Selector
• Example: 4-dim salinity dataset
• dim1: time [0, 1023]
• dim2: cols [0, 166]
• dim3: rows [0, 62]
• dim4: layers [0, 33]
• True: nullify the elementary condition
• False: nullify the condition list
• Fill up all the index boundary values
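One plausible reading of this slide is sketched below: an elementary condition that is always true within the dimension bounds can be dropped, one that can never be true invalidates its whole AND-logic list, and any dimension left unconstrained is filled in with its full index range. The function `to_hyperslab`, the `(dim, low, high)` condition encoding, and the `(start, count)` output are assumptions made for illustration; only the dimension extents come from the slide.

```python
# Inclusive index extents of the 4-D salinity dataset from the slide.
EXTENTS = {"time": (0, 1023), "cols": (0, 166), "rows": (0, 62), "layers": (0, 33)}


def to_hyperslab(and_list):
    """Turn an AND-list of (dim, low, high) index conditions into a hyperslab.

    Returns {dim: (start, count)}, or None when the list can never match.
    Hypothetical helper, not the system's actual selector.
    """
    bounds = dict(EXTENTS)  # fill in the full range of every dimension first
    for dim, low, high in and_list:
        cur_low, cur_high = bounds[dim]
        low, high = max(low, cur_low), min(high, cur_high)
        if low > high:
            return None  # always-false condition: discard the whole AND-list
        # An always-true condition leaves the bounds unchanged, so it is
        # effectively dropped.
        bounds[dim] = (low, high)
    return {dim: (lo, hi - lo + 1) for dim, (lo, hi) in bounds.items()}


slab = to_hyperslab([("time", 100, 199), ("layers", 0, 50)])
empty = to_hyperslab([("time", 2000, 3000)])
```

Here `layers` between 0 and 50 covers the whole dimension and therefore adds no restriction, while a `time` range starting at 2000 lies outside the dataset and eliminates the list.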
Parallelization of Query Execution
• High-level parallelism
Parallelization of Query Execution (contd.)
• Low-level parallelism
• Combination: computing the union set
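The transcript does not preserve how the two parallelism levels are defined, so the following is only a generic sketch of the "split, process, then take the union" pattern the combination step suggests. It assumes that a selection is split into contiguous chunks, each worker scans one chunk, and the partial result sets are merged; all function names are hypothetical, and Python threads are used purely to show the structure, not to demonstrate speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, count, parts):
    """Split [start, start + count) into at most `parts` contiguous chunks."""
    base, extra = divmod(count, parts)
    chunks, pos = [], start
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        if size:
            chunks.append((pos, size))
            pos += size
    return chunks


def scan_chunk(chunk, values, threshold):
    """Return the set of indices in one chunk whose value exceeds the threshold."""
    start, size = chunk
    return {i for i in range(start, start + size) if values[i] > threshold}


def parallel_select(values, threshold, workers=4):
    """Scan chunks concurrently and combine the partial results as a union set."""
    chunks = split_range(0, len(values), workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda c: scan_chunk(c, values, threshold), chunks)
        return set().union(*partials)


values = [3, 9, 1, 7, 8, 2, 6, 10, 4, 5]
selected = parallel_select(values, 6)
```

Because the chunks are disjoint, the union simply concatenates the partial results; with overlapping sub-queries the union would also remove duplicates.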
Sequential Comparison with OPeNDAP
• What is OPeNDAP?
• A client/server-based scientific data management system
• Requires translating the HDF5 format into a standard OPeNDAP data format
• Supports only type 1 (index-based) queries
• Experimental datasets:
• 4 GB (sequential experiments) and 16 GB (parallel experiments)
• 2 separate datasets (salinity and temperature), or one compound-type dataset (cell)
• 4 dimensions: time, cols, rows, and layers
Sequential Comparison with OPeNDAP (Type 2 and Type 3 Queries)
Aggregation Query Examples
• AG1: Simple global aggregation
• AG2: GROUP BY clause + HAVING clause
• AG3: GROUP BY clause
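The slide names the three query shapes but the transcript does not include their text. As a rough illustration, the sketch below reproduces each shape in plain Python over made-up (layer, salinity) records; the SQL fragments in the comments are assumed syntax, not the queries used in the experiments.

```python
from collections import defaultdict

# Hypothetical (layer, salinity) records standing in for a selected data subset.
RECORDS = [(0, 34.0), (0, 36.0), (1, 30.0), (1, 32.0), (2, 35.0), (2, 37.0)]


def avg(xs):
    """Arithmetic mean of a non-empty list."""
    return sum(xs) / len(xs)


# AG1 shape: simple global aggregation,
# e.g. SELECT AVG(salinity) FROM ... (illustrative syntax).
ag1 = avg([s for _, s in RECORDS])

# AG3 shape: GROUP BY,
# e.g. SELECT AVG(salinity) FROM ... GROUP BY layers (illustrative syntax).
groups = defaultdict(list)
for layer, salinity in RECORDS:
    groups[layer].append(salinity)
ag3 = {layer: avg(vals) for layer, vals in groups.items()}

# AG2 shape: GROUP BY + HAVING,
# e.g. ... GROUP BY layers HAVING AVG(salinity) > 34 (illustrative syntax).
ag2 = {layer: mean for layer, mean in ag3.items() if mean > 34.0}
```

In all three shapes only the aggregated values need to leave the data source, which is how aggregation reduces the data transfer volume.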
Conclusion
• We’ve developed a light-weight data management layer over HDF5
• Supports queries based on dimension index, dimension scales and/or data values
• Supports parallel data subsetting and aggregation
• Our results show that
• Sequential performance is better than that of OPeNDAP
• Our system scales well