Supporting a Light-Weight Data Management Layer Over HDF5 Yi Wang, Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University
Outline Introduction System Design Experimental Results Conclusion
Introduction • Scientific data analysis is increasingly data-intensive • Data volume approximately doubles each year • Scientific data dissemination is hampered by growing dataset sizes • Solution: server-side data subsetting and aggregation • Scientific datasets are read-only or append-only, so reloading data into a database to maintain ACID properties is usually unnecessary
HDF5 (Hierarchical Data Format)
H5DS (HDF5 Dimension Scales) • An auxiliary dataset associated with a dimension of the primary dataset • Serves as a coordinate system (figure: time in hours, distance in km, and height in km attached to dim0, dim1, and dim2 of a dataset)
HDF5 Compound Datatype • Like a "struct" in the C language • Figure: a compound datatype with fields of type int16, char, int32, and a 2x3x2 array of float32
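The struct analogy can be made concrete with a minimal ctypes sketch mirroring the field layout in the figure (the record and field names here are made up for illustration; a real HDF5 compound datatype would be declared through the HDF5 library):

```python
import ctypes

# 2x3x2 array of float32, as in the figure
Float32Block = ((ctypes.c_float * 2) * 3) * 2

class Cell(ctypes.Structure):
    """Struct-like layout analogous to an HDF5 compound datatype."""
    _fields_ = [
        ("a", ctypes.c_int16),   # int16 field
        ("b", ctypes.c_char),    # char field
        ("c", ctypes.c_int32),   # int32 field
        ("v", Float32Block),     # 2x3x2 array of float32
    ]

rec = Cell(a=3, b=b"x", c=5)
rec.v[0][1][1] = 2.5
print(rec.a, rec.c, rec.v[0][1][1])  # prints: 3 5 2.5
```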
Outline Introduction System Design Experimental Results Conclusion
Motivation • An SQL interface over the virtual (relational) view • SELECT, FROM, WHERE, GROUP BY, HAVING • Scientists can query data without learning the HDF5 library API or writing extra programs • Reduce data transfer volume through server-side aggregation • Reduce I/O overhead by loading only the required data subset instead of the entire dataset • Parallel query execution • Accelerates data subsetting and data aggregation
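A subsetting query over such a virtual relational view would look like ordinary SQL. A minimal sketch, with sqlite3 standing in for the HDF5-backed layer (the table schema and column names here are hypothetical; each array element becomes one row of the view):

```python
import sqlite3

# Toy stand-in for the virtual relational view of a 4-dim salinity array
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE salinity (time INT, cols INT, rows INT, layers INT, val REAL)")
con.executemany(
    "INSERT INTO salinity VALUES (?, ?, ?, ?, ?)",
    [(t, c, r, l, 30.0 + t + 0.1 * l)
     for t in range(2) for c in range(2) for r in range(2) for l in range(2)],
)

# A subsetting query mixing an index-based and a value-based condition
cur = con.execute(
    "SELECT time, layers, val FROM salinity WHERE time = 1 AND val > 31.0"
)
for row in cur:
    print(row)
```

The point of the layer is that such a query is answered directly against the HDF5 file, without first loading the whole array into a database.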
Functionality • Query based on dimension index values (type 1: index-based condition) • Also supported by the HDF5 API • Query based on dimension scales (type 2: coordinate-based condition) • Uses the coordinate system instead of the detailed physical layout • Query based on data values (type 3: content-based condition) • Supports both simple and compound datatypes • Aggregation queries: SUM, COUNT, AVG, MIN, and MAX
Execution Overview (figure) • 1D: AND-logic condition list; 2D: OR-logic condition list • Generated at runtime: a 1D OR-logic condition list for conditions sharing the same content-based condition
Metadata Generation Strategy • Collect dispersed intrinsic header metadata, since a large dataset may be distributed among multiple files • Cache frequently used information to avoid repeated I/O requests
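The caching idea can be sketched with a memoized loader; this is a hypothetical stand-in (the function name and the metadata fields are made up), not the paper's implementation:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def load_header_metadata(path):
    """Parse a file's header metadata once; later calls hit the cache."""
    # Stand-in for reading the HDF5 header; a real layer would open the
    # file and collect dataset shapes, datatypes, and dimension scales.
    print(f"reading header of {path}")
    return (("time", 1024), ("cols", 167), ("rows", 63), ("layers", 34))

load_header_metadata("salinity_part0.h5")  # triggers one header read
load_header_metadata("salinity_part0.h5")  # served from the cache, no I/O
```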
Hyperslab Selector (figure) • Fill up all the index boundary values • True: nullify the elementary condition; false: nullify the condition list • Example: a 4-dim salinity dataset with dim1: time [0, 1023], dim2: cols [0, 166], dim3: rows [0, 62], dim4: layers [0, 33]
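The core of hyperslab selection is translating a coordinate-based (type 2) condition on a dimension scale into an index range on that dimension. A minimal sketch under that assumption (function name and scale values are illustrative; a sorted dimension scale is assumed):

```python
from bisect import bisect_left, bisect_right

def hyperslab_bounds(scale, low, high):
    """Return (first, last) indices whose scale values fall in [low, high].

    Returns None when no scale value falls in the range, i.e. the
    condition can never hold and the selection is empty.
    """
    first = bisect_left(scale, low)
    last = bisect_right(scale, high) - 1
    if first > last:
        return None  # empty selection: nullify the condition list
    return (first, last)

time_scale = [0, 5, 10, 15, 20]              # time (hours) for one dimension
print(hyperslab_bounds(time_scale, 5, 15))   # -> (1, 3)
print(hyperslab_bounds(time_scale, 21, 30))  # -> None
```

Dimensions without any condition simply keep their full index range, e.g. [0, 1023] for time in the example above.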
Parallelization of Query Execution • High-level parallelism • Low-level parallelism • Combination of partial results: computing the union set
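The combination step can be sketched as follows: split an OR-logic condition list among workers, let each worker evaluate its conditions, then union the partial selections. This is a thread-pool toy (the conditions and data are made up), not the paper's parallel setup:

```python
from multiprocessing.dummy import Pool  # thread pool, for illustration only

data = list(range(100))
or_conditions = [
    lambda x: x % 7 == 0,  # each lambda is one branch of the OR-logic list
    lambda x: x > 90,
    lambda x: x < 5,
]

def evaluate(cond):
    """One worker's partial selection for a single OR branch."""
    return {x for x in data if cond(x)}

with Pool(3) as pool:
    partials = pool.map(evaluate, or_conditions)

result = set().union(*partials)  # combine: computing the union set
print(len(result))
```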
Outline Introduction System Design Experimental Results Conclusion
Sequential Comparison with OPeNDAP • What's OPeNDAP? • A client/server-based scientific data management system • Requires translating HDF5 into a standard OPeNDAP data format • Supports only type 1 (index-based) queries • Experimental datasets: • 4 GB (sequential experiments) and 16 GB (parallel experiments) • Either 2 separate datasets (salinity and temperature) or one compound-type dataset (cell) • 4 dimensions: time, cols, rows, and layers
Sequential Comparison with OPeNDAP (Type 2 and Type 3 Queries)
Aggregation Query Examples AG1: Simple global aggregation AG2: GROUP BY clause + HAVING clause AG3: GROUP BY clause
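The three aggregation patterns can be illustrated with standard SQL. A minimal sqlite3 sketch over a hypothetical two-column salinity view (the schema and values are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE salinity (layers INT, val REAL)")
con.executemany(
    "INSERT INTO salinity VALUES (?, ?)",
    [(l, 30.0 + l + 0.5 * i) for l in range(3) for i in range(4)],
)

# AG1: simple global aggregation
print(con.execute("SELECT AVG(val) FROM salinity").fetchone())

# AG3: GROUP BY clause
print(con.execute(
    "SELECT layers, MAX(val) FROM salinity GROUP BY layers").fetchall())

# AG2: GROUP BY clause + HAVING clause
print(con.execute(
    "SELECT layers, COUNT(*) FROM salinity "
    "GROUP BY layers HAVING AVG(val) > 31.0").fetchall())
```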
Outline Introduction System Design Experimental Results Conclusion
Conclusion • We've developed a light-weight data management layer over HDF5 • Supports queries based on dimension indices, dimension scales, and/or data values • Supports parallel data subsetting and aggregation • Our results show that • Sequential performance is better than OPeNDAP's • The system scales well