HDF5 Tutorial LCI April 28, 2008 LCI Tutorial
Outline
• Why HDF5?
• Introduction to HDF5 data and programming models
• HDF5 tools and utilities
• HDF5 advanced topics
• Introduction to parallel HDF5
• HDF5 features that affect performance (or caching and buffering in HDF5)
Why HDF5?
Answering big questions …
• Matter & the universe
• Life and nature
• Weather and climate
[Figure: Total Column Ozone (Dobson), August 24, 2001 and August 24, 2002]
… involves big data …
… varied data …
Thanks to Mark Miller, LLNL
… and complex relationships …
[Figure labels: SNP Score, Contig Summaries, Discrepancies, Contig Qualities, Coverage Depth, Trace, Reads, Aligned bases, Read quality, Contig, Percent match]
… on big computers …
… and on little computers …
How do we…
• Describe our data?
• Read it? Store it? Find it? Share it? Mine it?
• Move it into, out of, and between computers and repositories?
• Achieve storage and I/O efficiency?
• Give applications and tools easy access to our data?
HDF started right here at NCSA
HDF solution
• Efficient storage, I/O
• Scientific data file format
• Common data models
• I/O software & tools
• Standard APIs
The HDF5 Format
An HDF5 file is a container… …into which you can put your data objects.
[Figure: example data objects, including a palette and the table below]
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6
HDF5 structures for organizing objects
[Figure: a file with groups “/” (root) and “/foo” holding a 3-D array, a lat/lon/temp table, a palette, two raster images, and a 2-D array]
Introduction to HDF5 Data and Programming Models
Tutorial Part I
Mesh Example, in HDFView
HDF5 Data Model
HDF5 data model
• HDF5 file – container for scientific data
• Primary Objects
  • Groups
  • Datasets
• Additional ways to organize data
  • Attributes
  • Sharable objects
  • Storage and access properties
Everything else is built from these parts.
HDF5 Dataset
• Metadata
  • Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  • Datatype: IEEE 32-bit float
  • Attributes: Time = 32.4, Pressure = 987, Temp = 56
  • Storage info: Chunked, Compressed
• Data
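As a rough worked example of how the dataspace and datatype on this slide determine the amount of raw data, the sketch below models the metadata in plain Python. The `DatasetInfo` class and its field names are hypothetical and are not part of the HDF5 API; only the dimensions (4, 5, 7) and the 32-bit float element size come from the slide.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class DatasetInfo:
    """Hypothetical stand-in for a dataset's dataspace and datatype metadata."""
    dims: tuple          # dataspace dimensions, e.g. (4, 5, 7)
    element_bytes: int   # datatype size in bytes, 4 for an IEEE 32-bit float

    @property
    def rank(self):
        return len(self.dims)

    @property
    def num_elements(self):
        return prod(self.dims)

    @property
    def raw_bytes(self):
        # Size of the raw data before any compression is applied.
        return self.num_elements * self.element_bytes


info = DatasetInfo(dims=(4, 5, 7), element_bytes=4)
print(info.rank, info.num_elements, info.raw_bytes)  # prints: 3 140 560
```

The attributes and storage settings add their own metadata on top of this, and compression would normally shrink the stored size, so 560 bytes is only the uncompressed size of the array itself.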
Dataspaces
• Two roles
  • Dataspace contains spatial info about a dataset stored in a file
    • Rank and dimensions
    • Permanent part of dataset definition
  • Dataspace describes application’s data buffer and data elements participating in I/O
[Figure: Rank = 2, Dimensions = 4x6; Rank = 1, Dimensions = 12]
Datatypes (array elements)
• Datatype – how to interpret a data element
• Permanent part of the dataset definition
• Two classes: atomic and compound
Datatypes
• HDF5 atomic types
  • normal integer & float
  • user-definable (e.g. 13-bit integer)
  • variable length types (e.g. strings)
  • pointers - references to objects/dataset regions
  • enumeration - names mapped to integers
  • array
• HDF5 compound types
  • Comparable to C structs
  • Members can be atomic or compound types
HDF5 dataset: array of records
• Dimensionality: 5 x 3
• Datatype: Record with fields int8, int4, int16, and a 2x3x2 array of float32
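To make the "comparable to C structs" idea concrete, here is a minimal sketch using Python's standard `struct` module rather than HDF5 itself. It packs the lat/lon/temp rows from the table shown earlier in the tutorial as fixed-size records. The field layout (two 32-bit integers and one 32-bit float, little-endian, no padding) is an assumption made for illustration.

```python
import struct

# Assumed record layout: lat (int32), lon (int32), temp (float32),
# little-endian with no padding between members.
RECORD = struct.Struct("<iif")

rows = [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)]

# Pack every record back to back, like an array of fixed-size compound elements.
raw = b"".join(RECORD.pack(*row) for row in rows)

# Unpack the second record directly from its byte offset.
lat, lon, temp = RECORD.unpack_from(raw, 1 * RECORD.size)
print(RECORD.size, len(raw), lat, lon, round(temp, 1))  # prints: 12 36 15 24 4.2
```

Because every record has the same size, the position of any element follows from its index, which is what lets an array of records be addressed like any other array.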
Attributes
• Attribute – data of the form “name = value”, attached to an object
• Operations are scaled-down versions of dataset operations
  • Not extendible
  • No compression
  • No partial I/O
• Optional for the dataset definition
• Can be overwritten, deleted, added during the “life” of a dataset
• Size under 64K in releases before HDF5 1.8.0
Groups
• A mechanism for collections of related objects
• Every file starts with a root group
• Similar to UNIX directories
• Can have attributes
[Figure: group hierarchy with labels “/”, A, B, C, k, l, m]
Path to HDF5 object in a file
• / (root)
• /x
• /foo
• /foo/temp
• /foo/bar/temp
[Figure: tree with “/” at the top and nodes foo, x, bar, temp, temp]
Shared objects
• /A/P
• /B/R
• /C/P
[Figure: root group “/” with groups A, B, C and links P, R, P]
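The path and sharing ideas can be sketched with nested Python dictionaries. This is only a conceptual model, not the HDF5 API: groups are dicts, `resolve` is a hypothetical helper, and the sketch assumes that /A/P and /C/P are two links to one object, as the slide title suggests.

```python
# Hypothetical in-memory model: a group is a dict mapping link names to objects.
shared_dataset = {"kind": "dataset", "data": [1, 2, 3]}
other_dataset = {"kind": "dataset", "data": [4, 5]}

root = {
    "A": {"P": shared_dataset},
    "B": {"R": other_dataset},
    "C": {"P": shared_dataset},   # second link to the same object
}


def resolve(group, path):
    """Follow an absolute path such as '/A/P' one link name at a time."""
    obj = group
    for name in path.strip("/").split("/"):
        if name:                  # '/' alone resolves to the root group
            obj = obj[name]
    return obj


# Both paths reach the same object, so a change made through one
# path is visible through the other.
resolve(root, "/A/P")["data"].append(99)
print(resolve(root, "/C/P")["data"])                    # prints: [1, 2, 3, 99]
print(resolve(root, "/A/P") is resolve(root, "/C/P"))   # prints: True
```

The point of the model is that a path names a route to an object, not the object itself, so one object can be reachable by several paths.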
Special Storage Options
• chunked: Better subsetting access time; extendable
• compressed: Improves storage efficiency, transmission speed
• extendable: Arrays can be extended in any direction
• split file: Metadata in one file, raw data in another
[Figure: Dataset “Fred” split across File A (metadata for Fred) and File B (data for Fred)]
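One way to see why chunked storage helps subsetting is to count how many chunks a small region touches. The sketch below is plain Python index arithmetic under an assumed 100 x 100 chunk shape; the function names are hypothetical and nothing here calls the HDF5 library.

```python
from itertools import product


def chunk_of(index, chunk_dims):
    """Return the coordinates of the chunk that holds one element."""
    return tuple(i // c for i, c in zip(index, chunk_dims))


def chunks_touched(start, count, chunk_dims):
    """Return all chunks overlapped by a rectangular region (start, count)."""
    ranges = []
    for s, n, c in zip(start, count, chunk_dims):
        first, last = s // c, (s + n - 1) // c
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


chunk_dims = (100, 100)  # assumed chunk shape for a large 2-D dataset
print(chunk_of((250, 30), chunk_dims))                  # prints: (2, 0)
print(chunks_touched((90, 90), (20, 20), chunk_dims))   # prints: [(0, 0), (0, 1), (1, 0), (1, 1)]
```

In this model a 20 x 20 subset that straddles chunk boundaries needs four chunks, regardless of how large the full array is.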
HDF5 Software
HDF5 software stack
• Tools & Applications
• HDF I/O Library
• HDF File
Structure of HDF5 Library
• Object API (C, Fortran 90, Java, C++)
  • Specify objects and transformation properties
  • Invoke data movement operations and data transformations
• Library internals
  • Performs data transformations and other prep for I/O
  • Configurable transformations (compression, etc.)
• Virtual file I/O (C only)
  • Perform byte-stream I/O operations (open/close, read/write, seek)
  • User-implementable I/O (stdio, network, memory, etc.)
Writing – move from memory to disk
Partial I/O
Move just part of a dataset
(a) Hyperslab from a 2D array to the corner of a smaller 2D array
(b) Regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array
(c) A sequence of points from a 2D array to a sequence of points in a 3D array.
(d) Union of hyperslabs in file to union of hyperslabs in memory.
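Cases (a) and (b) can be imitated in pure Python to show what a hyperslab selection picks out. The sketch describes a selection by start, stride, count, and block per dimension, which are the same four parameters HDF5 hyperslab selections use, but `hyperslab_indices` is a hypothetical helper and the array sizes and offsets are assumptions chosen for illustration.

```python
from itertools import product


def hyperslab_indices(start, stride, count, block):
    """List the coordinates selected by a (start, stride, count, block) hyperslab."""
    per_dim = []
    for s, st, c, b in zip(start, stride, count, block):
        # count blocks of size b, with block origins spaced st apart
        per_dim.append([s + i * st + j for i in range(c) for j in range(b)])
    return list(product(*per_dim))


# "File" array: 4x6, where each value encodes its position as row*10 + col.
file_data = [[r * 10 + c for c in range(6)] for r in range(4)]

# (a) A 2x3 hyperslab starting at (1, 2), copied to the corner of a 3x4 buffer.
src = hyperslab_indices(start=(1, 2), stride=(1, 1), count=(2, 3), block=(1, 1))
dst = hyperslab_indices(start=(0, 0), stride=(1, 1), count=(2, 3), block=(1, 1))
memory = [[None] * 4 for _ in range(3)]
for (sr, sc), (dr, dc) in zip(src, dst):
    memory[dr][dc] = file_data[sr][sc]

# (b) A regular series of 1x2 blocks, copied to a 1-D buffer at offset 2.
blocks = hyperslab_indices(start=(0, 0), stride=(2, 3), count=(2, 2), block=(1, 2))
buffer = [None] * 12
for k, (r, c) in enumerate(blocks):
    buffer[2 + k] = file_data[r][c]

print(memory[0], memory[1])  # prints: [12, 13, 14, None] [22, 23, 24, None]
print(buffer[2:10])          # prints: [0, 1, 3, 4, 20, 21, 23, 24]
```

In both cases the source and destination selections have different shapes but the same number of elements, and elements are paired up in row-major order.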
Layers – parallel example
Application I/O flows through many layers from application to disk.
• Parallel computing system (Linux cluster): compute nodes
• I/O library (HDF5)
• Parallel I/O library (MPI-I/O)
• Parallel file system (GPFS)
• Switch network/I/O servers
• Disk architecture & layout of data on disk
Virtual I/O layer
• Object API (C, Fortran 90, Java, C++)
• Library internals
• Virtual file I/O (C only)
Virtual file I/O layer
• A public API for writing I/O drivers
• Allows HDF5 to interface to disk, the network, memory, or a user-defined device
[Figure: virtual file I/O drivers (File, File Family, MPI I/O, Memory, Network, Stdio) mapped to “Storage” (File, File Family, Memory, Network)]
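The idea of swappable I/O drivers behind one byte-stream interface can be sketched as follows. This is a conceptual Python model only: the `Driver`, `MemoryDriver`, and `StdioDriver` classes and the `store_and_load` helper are hypothetical names, and the real HDF5 virtual file layer is a C API with a different interface.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Driver(ABC):
    """Hypothetical byte-stream interface that the rest of a library would call."""

    @abstractmethod
    def write(self, offset, data): ...

    @abstractmethod
    def read(self, offset, size): ...

    @abstractmethod
    def close(self): ...


class MemoryDriver(Driver):
    """Keeps the whole 'file' in a growable in-memory buffer."""

    def __init__(self):
        self.buf = bytearray()

    def write(self, offset, data):
        end = offset + len(data)
        if end > len(self.buf):
            self.buf.extend(b"\0" * (end - len(self.buf)))
        self.buf[offset:end] = data

    def read(self, offset, size):
        return bytes(self.buf[offset:offset + size])

    def close(self):
        pass


class StdioDriver(Driver):
    """Stores bytes in an ordinary file using seek/read/write."""

    def __init__(self, path):
        self.f = open(path, "w+b")

    def write(self, offset, data):
        self.f.seek(offset)
        self.f.write(data)

    def read(self, offset, size):
        self.f.seek(offset)
        return self.f.read(size)

    def close(self):
        self.f.close()


def store_and_load(driver):
    """Caller code is identical no matter which driver is plugged in."""
    driver.write(8, b"HDF5")
    result = driver.read(8, 4)
    driver.close()
    return result


path = os.path.join(tempfile.mkdtemp(), "demo.bin")
print(store_and_load(MemoryDriver()), store_and_load(StdioDriver(path)))
```

The calling code never learns where the bytes live, which is the property that lets a library target disk, memory, the network, or a user-defined device through one layer.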
Apps: simulation, visualization, remote sensing…
• Examples: Thermonuclear simulations, Product modeling, Data mining tools, Visualization tools, Climate models
• Common application-specific data models (appl-specific APIs): UDM, SAF, hdf5mesh, IDL, HDF-EOS
• Developers and users: LANL, LLNL, SNL, Grids, COTS, NASA
• HDF5 data model & API
• HDF5 serial & parallel I/O
• HDF5 virtual file layer (I/O drivers): Split Files, MPI I/O, Custom, Stdio, Stream
• Storage: File; Split metadata and raw data files; File on parallel file system; User-defined device; Across the network or to/from another application or library
• HDF5 format
Other info
• Runs almost anywhere
  • Most workstations
  • Big ASC machines, Cray, Compaq
  • TeraGrid and other clusters
• QA
  • Daily regression tests on key platforms
  • Meets NASA’s highest technology readiness level
Other HDF Software
• THG HDF
• Java tools
• Command-line utilities
• Regression and performance testing software
• Commercial (IDL, Matlab, HDF Explorer, etc.)
• Community (EOS, ASCI, etc.)
• Integration with other software (SRB, etc.)
Creating an HDF5 file with HDF5 tools
HDFView, h5mkgrp, h5import
Example: create this HDF5 file
[Figure: “/” (root) with members A and B; labels: 4x6 array of floats, 3-D array of floats]
Example: create this HDF5 file
• HDFView
• h5mkgrp file.h5 /B
• h5import A.txt -c A.conf -o file.h5
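The slides do not show the contents of A.txt or A.conf. As a small sketch, assuming A.txt is a plain text file of whitespace-separated floats holding the 4x6 array from the example, it could be generated like this. The helper name and the values written are made up for illustration, and the matching A.conf is not reproduced here.

```python
import os
import tempfile


def write_text_array(path, rows, cols):
    """Write a rows x cols array of floats as whitespace-separated text."""
    with open(path, "w") as f:
        for r in range(rows):
            values = [float(r * cols + c) for c in range(cols)]
            f.write(" ".join(f"{v:.1f}" for v in values) + "\n")


# Written to a temporary directory here; in practice this would be A.txt
# in the directory where h5import is run.
path = os.path.join(tempfile.mkdtemp(), "A.txt")
write_text_array(path, rows=4, cols=6)

with open(path) as f:
    lines = f.read().splitlines()
print(len(lines), lines[0])  # prints: 4 0.0 1.0 2.0 3.0 4.0 5.0
```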
Introduction to HDF5 Programming model and APIs
Programming model for sequential access
Goals of HDF5 Library
• Flexible API to support a wide range of operations on data
• High performance access in serial and parallel computing environments
• Compatibility with common data models and programming languages
Because of these goals, the HDF5 API is rich and large.
Operations supported by the API
• Create groups, datasets, attributes, linkages
• Create complex data types
• Assign storage and I/O properties to objects
• Complex subsetting during read/write
• Flexible I/O (parallel, remote, etc.)
• Ability to transform data during I/O
• Query about file structure and properties
• Query about object structure, content, properties
Characteristics of the HDF5 API
• For flexibility, the API is extensive – 300+ functions
• This can be daunting, at first
• But there is hope
• You can do a lot with just a few functions
• So start simple, and build up your knowledge
• The library functions are categorized by object type
• Once you learn the system, it’s much less daunting
• And there is an “H5Lite” API if all you want to do is simple things.