
Supporting a Light-Weight Data Management Layer Over HDF5

Yi Wang, Yu Su and Gagan Agrawal, Department of Computer Science and Engineering, The Ohio State University.


Presentation Transcript


  1. Supporting a Light-Weight Data Management Layer Over HDF5 Yi Wang, Yu Su and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University

  2. Outline Introduction System Design Experimental Results Conclusion

  3. Introduction
     • Scientific data analysis is increasingly data-intensive
     • Data volume approximately doubles each year
     • Scientific data dissemination is hampered by dataset size growth
     • Solution: data aggregation
     • Scientific dataset: read-only or append-only
     • Reloading data into a database to maintain ACID properties is often not needed

  4. HDF5 • HDF5 (Hierarchical Data Format)

  5. H5DS
     [Figure: a dataset with index dimensions dim0, dim1 and dim2 (indices 0, 1, 2) and the dimension scales time (hour): 0, 5, 10; distance (km): 0, 10, 20; height (km): 12, 13, 14]
     • H5DS (HDF5 Dimension Scales)
     • An auxiliary dataset associated with a dimension of the primary dataset
     • Serves as a coordinate system

  6. HDF5 Compound Datatype
     [Figure: a compound element with four members: int16, char, int32, and a 2x3x2 array of float32]
     • Compound datatype: like a “struct” in the C language
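To make the struct analogy concrete, here is a minimal sketch that packs one record with the four member types named on the slide using Python's standard `struct` module. The field names (`count`, `flag`, `total`, `grid`) and the packed, little-endian layout are assumptions made for illustration; a real HDF5 compound type may use different member offsets.

```python
import struct

# Hypothetical record layout; the slide only gives the member types.
# "<" means little-endian, standard sizes, no padding (an assumption).
RECORD = struct.Struct("<hci12f")  # int16, char, int32, 2x3x2 float32


def pack_record(count, flag, total, grid):
    """Pack one record; grid is a nested 2x3x2 list of floats."""
    flat = [v for plane in grid for row in plane for v in row]
    return RECORD.pack(count, flag, total, *flat)


def unpack_record(buf):
    """Unpack one record back into (count, flag, total, grid)."""
    count, flag, total, *flat = RECORD.unpack(buf)
    grid = [
        [flat[(i * 3 + j) * 2:(i * 3 + j) * 2 + 2] for j in range(3)]
        for i in range(2)
    ]
    return count, flag, total, grid


# Small values that are exactly representable as float32.
grid = [[[float(i + j + k) for k in range(2)] for j in range(3)] for i in range(2)]
buf = pack_record(3, b"V", 5, grid)
print(RECORD.size)  # 2 + 1 + 4 + 12 * 4 bytes
print(unpack_record(buf) == (3, b"V", 5, grid))
```

Each element of a compound-type dataset then behaves like one such record, so a content-based condition can refer to an individual member.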

  7. Outline Introduction System Design Experimental Results Conclusion

  8. Motivation
     • An SQL interface over the virtual (relational) view
     • SELECT, FROM, WHERE, GROUP BY, HAVING
     • Scientists do not need to learn the HDF5 libraries or write extra code
     • Reduce data transfer volume by data aggregation
     • Load a data subset instead of the entire dataset where possible
     • Reduce I/O overhead
     • Parallel query execution
     • Accelerate data subsetting and data aggregation

  9. Functionality
     • Query based on dimension index values (type 1, index-based condition)
     • Also supported by the HDF5 API
     • Query based on dimension scales (type 2, coordinate-based condition)
     • Uses the coordinate system instead of the detailed physical layout
     • Query based on data values (type 3, content-based condition)
     • Simple datatype + compound datatype
     • Aggregation query
     • SUM, COUNT, AVG, MIN, and MAX
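A coordinate-based (type 2) condition can in principle be reduced to an index-based (type 1) condition by looking the bounds up in the dimension scale. The sketch below shows one way to do that with a binary search. It assumes the scale is sorted in ascending order, and the function name and scale values are illustrative rather than taken from the presented system.

```python
from bisect import bisect_left, bisect_right


def coord_range_to_index_range(scale, low, high):
    """Map the coordinate condition low <= x <= high onto an inclusive
    index range over a sorted dimension scale, or None if nothing matches."""
    first = bisect_left(scale, low)
    last = bisect_right(scale, high) - 1
    return (first, last) if first <= last else None


# Illustrative scale values (hours); assumed sorted in ascending order.
time_scale = [0, 5, 10, 15, 20]

print(coord_range_to_index_range(time_scale, 5, 12))   # indices of 5 and 10
print(coord_range_to_index_range(time_scale, 6, 9))    # no coordinate in range
```

Once the index range is known, the query can be answered with an ordinary index-based selection, which is why the coordinate system hides the physical layout from the user.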

  10. Execution Overview
     [Figure labels: 1D: AND-logic condition list; 2D: OR-logic condition list; Generated at runtime; 1D: OR-logic condition list; Same content-based condition]

  11. Metadata Generation Strategy
     • A large dataset may be distributed among multiple files
     • Cache frequently used information to avoid repeated I/O requests
     • Collect dispersed intrinsic header metadata

  12. Hyperslab Selector
     • True: nullify the elementary condition
     • False: nullify the condition list
     • 4-dim Salinity Dataset: dim1: time [0, 1023]; dim2: cols [0, 166]; dim3: rows [0, 62]; dim4: layers [0, 33]
     • Fill up all the index boundary values
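One reading of this slide is sketched below: an AND-list of index conditions is clipped against the dataset bounds, a condition that cannot be satisfied voids the whole list, and every unconstrained dimension is filled with its full index range. The dimension bounds come from the slide; the function name, the `(dim, low, high)` condition format and the `(start, count)` return shape are assumptions for illustration, not the system's actual interface.

```python
# Inclusive index bounds of the slide's 4-dim salinity dataset.
DIM_BOUNDS = {"time": (0, 1023), "cols": (0, 166), "rows": (0, 62), "layers": (0, 33)}


def build_hyperslab(conditions):
    """conditions: list of (dim, low, high) inclusive index ranges joined by AND.
    Returns (start, count) tuples in dimension order, or None if nothing matches."""
    bounds = dict(DIM_BOUNDS)  # unconstrained dimensions keep their full range
    for dim, low, high in conditions:
        cur_low, cur_high = bounds[dim]
        low, high = max(low, cur_low), min(high, cur_high)
        if low > high:
            return None  # always-false condition: the whole AND-list is void
        bounds[dim] = (low, high)
    start = tuple(lo for lo, _ in bounds.values())
    count = tuple(hi - lo + 1 for lo, hi in bounds.values())
    return start, count


# The layers condition covers the whole dimension, so it has no effect.
print(build_hyperslab([("time", 100, 199), ("layers", 0, 50)]))
# The time condition lies outside [0, 1023], so the selection is empty.
print(build_hyperslab([("time", 2000, 3000)]))
```

The `(start, count)` pair mirrors how a hyperslab is usually described: an offset and an extent per dimension.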

  13. Parallelization of Query Execution High-level parallelism:

  14. Parallelization of Query Execution (contd.) Low-level parallelism: Combination: computing the union set

  15. Outline Introduction System Design Experimental Results Conclusion

  16. Sequential Comparison with OPeNDAP
     • What is OPeNDAP?
     • A client/server-based scientific data management system
     • Requires translating the HDF5 format into a standard OPeNDAP data format
     • Supports only type 1 (index-based) queries
     • Experimental datasets:
     • 4 GB (sequential experiments) and 16 GB (parallel experiments)
     • 2 separate datasets: salinity and temperature
     • Or a compound-type dataset: cell
     • 4 dimensions: time, cols, rows, and layers

  17. Sequential Comparison with OPeNDAP (Type1 Queries)

  18. Type2 and Type3 Query Examples

  19. Sequential Comparison with OPeNDAP (Type2 and Type3 Queries)

  20. Parallel Query Processing for Type1 Queries

  21. Parallel Query Processing for Type2 and Type3 Queries

  22. Parallel Query Processing for Type2 and Type3 Queries

  23. Aggregation Query Examples
     • AG1: Simple global aggregation
     • AG2: GROUP BY clause + HAVING clause
     • AG3: GROUP BY clause
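The transcript does not include the query text for AG1 to AG3. As a rough illustration of how a GROUP BY aggregation with a HAVING filter can be evaluated over several data partitions, the sketch below keeps mergeable partial results (sum, count, min, max) per group and combines them at the end. All function names and sample values are hypothetical, and this is not presented as the authors' implementation.

```python
def partial_aggregate(rows):
    """rows: iterable of (group_key, value). Returns per-group partials
    [sum, count, min, max] that can later be merged across partitions."""
    acc = {}
    for key, value in rows:
        if key not in acc:
            acc[key] = [value, 1, value, value]
        else:
            p = acc[key]
            p[0] += value
            p[1] += 1
            p[2] = min(p[2], value)
            p[3] = max(p[3], value)
    return acc


def merge_partials(parts):
    """Combine the partial results produced by each partition."""
    merged = {}
    for part in parts:
        for key, (s, c, lo, hi) in part.items():
            if key not in merged:
                merged[key] = [s, c, lo, hi]
            else:
                m = merged[key]
                m[0] += s
                m[1] += c
                m[2] = min(m[2], lo)
                m[3] = max(m[3], hi)
    return merged


def finalize_avg(merged, having=lambda avg: True):
    """Compute AVG per group and keep groups that pass the HAVING predicate."""
    return {k: s / c for k, (s, c, _, _) in merged.items() if having(s / c)}


# Two hypothetical partitions of (layer, salinity) pairs.
part1 = partial_aggregate([(0, 30.0), (0, 34.0), (1, 20.0)])
part2 = partial_aggregate([(1, 22.0), (2, 36.0)])
merged = merge_partials([part1, part2])
print(finalize_avg(merged, having=lambda avg: avg > 30.0))
```

AVG is derived from the merged sum and count rather than averaged per partition, which keeps the result correct when groups are split unevenly across partitions.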

  24. Sequential and Parallel Performance of Aggregation Queries

  25. Outline Introduction System Design Experimental Results Conclusion

  26. Conclusion
     • We have developed a light-weight data management layer over HDF5
     • Supports queries based on dimension index, dimension scales and/or data values
     • Supports parallel data subsetting and aggregation
     • Our results show that
     • Sequential performance is better than OPeNDAP's
     • Our system scales well
