NCSA-NARA investigations of HDF5 in support of EXPRESS-Driven data

Mike Folk, The HDF NARA Project. PDES, Inc. Offsite Meeting, September 24-29, 2006.

Presentation Transcript


  1. NCSA-NARA investigations of HDF5 in support of EXPRESS-Driven data
     Mike Folk, The HDF NARA Project
     PDES, Inc. Offsite Meeting, September 24-29, 2006

  2. Acknowledgement This report is based upon work supported by the National Archives and Records Administration (NARA) through the grant NARA NSF 0202 GPG. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NARA. PDES, Inc. Offsite Sept 2006

  3. Participants
     • Mike Folk, Vailin Choi, Elena Pourmal – The HDF Group
     • Mark Conrad and Bob Chadduck – NARA
     • David Price – EuroSTEP
     • Keith Hunten – Lockheed-Martin
     • Steve Cooper and Denny Moore – Electric Boat
     • Others

  4. 1. What is HDF5?

  5. HDF5 is
     • A file format for managing any kind of data
     • Software system to manage data in the format
     • Suited especially to large volume or complex data
     • Suited for every size and type of system
     • Open file format, open software

  6. Definitions
     • “HDF” – Hierarchical Data Format
       • Originated in 1988
       • NCSA at University of Illinois at Urbana-Champaign
     • “HDF5”
       • Successor to HDF, introduced in 1998

  7. An HDF5 file is a container into which you can put your data objects.
     [Figure: a container holding example objects, including a palette and this table]
     lat | lon | temp
     ----|-----|-----
     12  | 23  | 3.1
     15  | 24  | 4.2
     17  | 21  | 3.6

  8. HDF5 data model
     • HDF5 file – container for data objects
     • Primary Objects
       • Groups
       • Datasets
     • Additional ways to organize data
       • Attributes for metadata
       • Sharable objects
       • Storage and access properties
     Everything else is built from these parts.
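The data model on this slide can be sketched with a toy in-memory structure. This is not the HDF5 library or its API; the class names `Group` and `Dataset`, the `lookup` helper, and the `units` attribute are hypothetical and exist only to show how groups, datasets, and attributes relate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """Toy stand-in for an HDF5 dataset: values plus attributes (metadata)."""
    data: List[float]
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    """Toy stand-in for an HDF5 group: named members plus attributes."""
    members: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)

    def lookup(self, path: str) -> Union["Group", Dataset]:
        """Resolve a slash-separated path, starting from this group."""
        node: Union[Group, Dataset] = self
        for name in path.strip("/").split("/"):
            if not name:
                continue  # the path "/" refers to the group itself
            if not isinstance(node, Group):
                raise KeyError(f"{name!r}: parent is not a group")
            node = node.members[name]
        return node


# Hypothetical layout: root group "/" holding a group "foo" with one dataset.
temp = Dataset(data=[3.1, 4.2, 3.6], attrs={"units": "degC"})  # "units" is made up
root = Group(members={"foo": Group(members={"temp": temp})})

found = root.lookup("/foo/temp")
print(found.data, found.attrs["units"])
```

In this sketch a "file" is simply the root group, which mirrors the slide's point that everything else is built from groups, datasets, and attributes.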

  9. HDF “groups” for organizing objects in files
     [Figure: a file tree with the groups “/” (root) and “/foo”, and objects including a 3-D array, a table (lat, lon, temp), a palette, two raster images, and a 2-D array]

  10. HDF5 “dataset” for holding the data
      • Metadata
        • Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
        • Datatype: IEEE 32-bit float
        • Attributes: time = 32.4, pressure = 987, temp = 56
        • Storage info: chunked, compressed
      • Data
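As a worked example of the dataspace and datatype on this slide, the arithmetic below computes the element count and uncompressed size of a 4 x 5 x 7 array of 32-bit floats. The `flat_index` helper is hypothetical and assumes row-major (C) ordering of elements; it is a sketch of the addressing idea, not library code.

```python
import math
import struct

dims = (4, 5, 7)                       # Dim_1, Dim_2, Dim_3 from the slide
rank = len(dims)                       # 3
n_elements = math.prod(dims)           # 4 * 5 * 7
element_size = struct.calcsize("<f")   # size of an IEEE 32-bit float, in bytes
raw_bytes = n_elements * element_size  # uncompressed size of the raw data


def flat_index(index, dims):
    """Map a multi-dimensional index to a linear offset (row-major order assumed)."""
    offset = 0
    for i, d in zip(index, dims):
        if not 0 <= i < d:
            raise IndexError(f"index {i} out of range for dimension of size {d}")
        offset = offset * d + i
    return offset


print(rank, n_elements, raw_bytes, flat_index((3, 4, 6), dims))
```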

  11. Datatypes (array elements)
      • Datatype – how to interpret a data element
      • Two classes: atomic and compound

  12. Datatypes
      • HDF5 atomic types
        • normal integer & float
        • user-definable (e.g. 13-bit integer)
        • fixed length and variable length multiples (e.g. strings)
        • references to objects/dataset regions
        • enumeration - names mapped to integers
        • array
      • HDF5 compound types
        • Records with fields – comparable to C structs
        • Members can be atomic or compound types
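The comparison of compound types to C structs can be illustrated with Python's standard `struct` module. This is an analogy rather than the HDF5 datatype API: the record layout (two 32-bit integers and one 32-bit float) is an assumption chosen to match the lat/lon/temp table shown on an earlier slide.

```python
import struct

# One record = two 32-bit integers (lat, lon) and one IEEE 32-bit float (temp).
# "<" selects little-endian byte order with no padding between fields.
record = struct.Struct("<iif")

rows = [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)]  # values from the table slide

# Pack all records back to back, like an array of fixed-size compound elements.
buffer = b"".join(record.pack(*row) for row in rows)

# Read back only the second record by computing its byte offset.
lat, lon, temp = record.unpack_from(buffer, 1 * record.size)
print(record.size, len(buffer), lat, lon, round(temp, 1))
```

Because every record has the same size, any element can be located by offset alone, which is what makes fixed-layout compound elements convenient to store in arrays.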

  13. “Groups”
      • A mechanism for collections of related objects
      • Every file starts with a root group
      • Similar to UNIX directories
      • Can have attributes
      [Figure: a tree rooted at “/” with nodes labeled tom, dick, harry, a, b, and c]

  14. Special Storage Options
      • Chunked: better subsetting access time; extendable
      • Compressed: improves storage efficiency, transmission speed
      • Extendable: arrays can be extended in any direction
      • Split file: metadata in one file, raw data in another
      [Figure: dataset “Fred”, with metadata for Fred and data for Fred stored in separate files, File A and File B]
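The idea behind chunked, compressed storage can be sketched with the standard `zlib` module: the array is cut into fixed-size chunks, each chunk is compressed independently, and reading one element only requires decompressing the chunk that holds it. This is a simplified illustration, not how the HDF5 library is implemented; the chunk size, the sample data, and the `read_element` helper are all made up for the example.

```python
import struct
import zlib

CHUNK = 100  # elements per chunk (arbitrary choice for illustration)
values = [i % 10 for i in range(1000)]  # made-up, highly repetitive data


def pack(vals):
    """Encode a list of integers as little-endian 32-bit values."""
    return struct.pack(f"<{len(vals)}i", *vals)


# Compress each chunk on its own so chunks can be read independently.
chunks = [
    zlib.compress(pack(values[start:start + CHUNK]))
    for start in range(0, len(values), CHUNK)
]


def read_element(index):
    """Fetch one element by decompressing only the chunk that contains it."""
    chunk_no, offset = divmod(index, CHUNK)
    raw = zlib.decompress(chunks[chunk_no])
    return struct.unpack_from("<i", raw, offset * 4)[0]


raw_size = len(values) * 4
stored_size = sum(len(c) for c in chunks)
print(len(chunks), raw_size, stored_size, read_element(257))
```

The trade-off shown here is the one the slide names: per-chunk compression reduces stored size while still allowing a subset to be read without touching the rest of the array.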

  15. Mesh Example, in HDFView

  16. HDF5 Software
      [Figure: layers, top to bottom: Tools & Applications; HDF I/O Library; HDF File]

  17. Features of library
      • Ability to create and access complex data structures
      • Fast, flexible I/O
      • Data transformation and filtering during I/O
      • Flexible API for power users
      • Compatibility with common data models
      • Able to represent all common data structures
      • Supports key language models – C, Fortran, Java, etc.

  18. Other info
      • Library and tools run almost anywhere
      • Other software from THG
        • Java viewer
        • Command-line utilities
      • Other software
        • Commercial (IDL, Matlab, Labview, etc.)
        • Community (EOS, ASCI, etc.)
        • Integration with other software (SRB, databases, etc.)

  19. Making HDF useful for your application
      • There are many ways to organize and access data in HDF5
      • How do we apply these capabilities to a particular domain, such as product data?
      • We have to decide how we will organize and access our data in a way that best addresses our needs.
      • And create data models, APIs and tools as appropriate to support our applications.
      • Or adapt existing data models, APIs and tools as appropriate to support our applications.

  20. Sample uses of HDF

  21. 1. NASA Earth Observing System (EOS)
      [Figure: EOS satellites and instruments, with data in HDF-EOS]
      • Terra: CERES, MISR, MODIS, MOPITT
      • Aqua (6/01): CERES, MODIS, AMSR
      • Aura: TES, HIRDLS, MLS, OMI

  22. 2. Advanced Simulation & Computing (ASC)
      Question: How do we maintain a nuclear stockpile in the absence of testing?
      Answer: Very large simulations

  23. ASC Data requirements
      • Large datasets (> a terabyte)
      • Fast I/O on massive parallel systems
      • Complex data and extensive metadata
      • Availability on leading edge systems

  24. 3. Bioinformatics--Managing genomic data caacaagccaaaactcgtacaa Cgagatatctcttggaaaaact gctcacaatattgacgtacaag gttgttcatgaaactttcggta Acaatcgttgacattgcgacct aatacagcccagcaagcagaat

  25. DNA sequencing workflows are complex
      • Diverse formats
      • Highly redundant data
      • Multiple levels of information
      • Complex associations
      • Repeated file processing
      • Non-scalable storage
      • Lack of persistence

  26. BioHDF: HDF5 as binary exchange format for bioinformatics

  27. 4. Flight test data

  28. Boeing flight test: HDF-Time-history, HDF-PACKET

  29. HDF role in the Software Stack

  30. [Figure: the HDF5 software stack, top to bottom]
      • Apps: simulation, visualization, remote sensing… Examples: thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models
      • App-specific API or GUI, over common application-specific data models: BioHDF, SAF, HDF-Packet, Matlab, HDF-EOS (labels shown: LANL, LLNL, SNL, Grids, COTS, NASA)
      • HDF5 data model & API
      • HDF5 serial & parallel I/O
      • HDF5 virtual file layer (I/O drivers): Split Files, MPI I/O, Custom, Stdio, Stream
      • Storage (HDF5 format): file; split metadata and raw data files; file on parallel file system; user-defined device; across the network or to/from another application or library

  31. 2. Why is there interest in HDF5 for product data? (Courtesy of David Price, EuroSTEP)

  32. Needs
      • STEP and related models exist using EXPRESS
      • ASCII, XML STEP formats defined, software developed
      • But ASCII/XML don’t adapt well for highly voluminous, complex data
        • Finite element analysis
        • Computational fluid dynamics
        • Heterogeneous product data

  33. EuroSTEP project
      • VIVACE: “Value Improvement through a Virtual Aeronautical Collaborative Enterprise”
      • Deliverable: EXPRESS-driven Large Volume Binary Data Representation

  34. Survey of State of the Art
      • Candidates
        • ASN.1: Abstract Syntax Notation 1
        • HDF5: Hierarchical Data Format
        • XML/Binary
        • CGNS: CFD General Notation System
        • SDAI implementation by LKSoft
      • Found HDF5 most suitable for very large scientific datasets and complex relationships

  35. Goal: Create open-source toolkit mapping EXPRESS to HDF5

  36. [Figure: the HDF5 software stack with STEP added, top to bottom]
      • Apps: simulation, visualization, remote sensing… Examples: thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models, and product model applications
      • App-specific APIs, over common application-specific data models: BioHDF, SAF, HDF-Packet, Matlab, HDF-EOS, plus STEP data models via STEP-HDF5 (labels shown: LANL, LLNL, SNL, Grids, COTS, NASA)
      • HDF5 data model & API
      • HDF5 serial & parallel I/O
      • HDF5 virtual file layer (I/O drivers): Split Files, MPI I/O, Custom, Stdio, Stream
      • Storage (HDF5 format): file; split metadata and raw data files; file on parallel file system; user-defined device; across the network or to/from another application or library

  37. NARA-sponsored work

  38. NCSA-THG NARA Research
      • Investigate the viability of scientific data formats, such as HDF5, for long-term preservation of engineering data in the federal archives

  39. Heterogeneous data aggregation, with HDF5
      • Goal: Using NARA’s TWR collection, investigate the possibilities and limitations of using HDF5 as a container for archiving heterogeneous collections of records, with special attention to STEP data.

  40. Activities
      • Use files, datatypes, structures in NARA TWR collection – STEP files, photos, schematics, etc.
      • Map these to HDF5 objects and structures, exploiting features of HDF5
      • Assess benefits and costs in terms of storage efficiency and accessibility
      • Investigate use of HDF5 as container for collection

  41. Relationship with EuroSTEP, Electric Boat, et al.
      • Working together to develop mappings from EXPRESS to HDF5
      • Sharing data for testing
      • Periodic meetings to share information and coordinate research
      • Some involvement with standardization

  42. Investigating I/O efficiency and size
      • Explore different datatypes and storage options for b-spline surface models (later: finite element models)
      • Two types of data – b-splines themselves and cartesian points
      • Variables
        • Different HDF5 datatypes
        • Dataset compression
        • Use of extra indexes in HDF5 for fast access

  43. Some results
      • Small files
        • HDF5 not appreciably better than STEP, sometimes worse
      • Large files
        • Compression always made HDF5 files smaller
        • Even without compression, HDF5 storage better
        • Indexing approach also tended to save space
      • Lessons
        • HDF5 can provide very efficient storage for cartesian points
        • Choice of data types and data storage is important
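One reason binary storage can be compact for cartesian points is that each coordinate takes a fixed number of bytes instead of a run of text characters. The sketch below compares a simplified, STEP-like text line per point against three packed 64-bit floats. It does not reproduce the measurements reported on this slide: the sample points are made up, the text layout is only an approximation of a STEP exchange file entry, and no HDF5 file overhead is modeled.

```python
import struct

# Made-up sample data: 1000 points with x, y, z coordinates.
points = [(i * 0.1, i * 0.2, i * 0.3) for i in range(1000)]

# Text encoding: one simplified STEP-like line per point (layout is an assumption).
text = "\n".join(
    f"#{n}=CARTESIAN_POINT('',({x!r},{y!r},{z!r}));"
    for n, (x, y, z) in enumerate(points)
).encode("ascii")

# Binary encoding: three little-endian 64-bit floats per point, no separators.
binary = b"".join(struct.pack("<3d", *p) for p in points)

text_size = len(text)
binary_size = len(binary)
print(text_size, binary_size)
```

For very small files a fixed per-file overhead can outweigh this per-point saving, which is consistent with the slide's observation that HDF5 was not appreciably better for small files.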

  44. HDF5 as container: HDFView demo

  46. Thank you

  47. HDF Information
      • HDF Information Center: http://hdfgroup.org/
      • HDF Help email address: help@hdfgroup.org
      • HDF users mailing list: hdfnews@hdfgroup.org
