Update on HDF5 1.8
The HDF Group
HDF and HDF-EOS Workshop X, November 28, 2006
Why HDF5 1.8?
“… as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know.” (Donald Rumsfeld)
HDF and HDF-EOS Workshop X, Landover MD
Some things we knew we knew
• Need high-level APIs – image, etc.
• Need more datatypes – packed n-bit, etc.
• Need external and other links
• Tools needed – h5pack, etc.
• Caching embellishments
• Eventually, multithreading
Things we knew we did not know
• New requirements from EOS and ASCI
• New applications that would use HDF5
• How HDF5 would really perform in parallel
• What new tools, features and options would be needed
• New APIs, API features
Things we didn’t know we didn’t know
• Completely unanticipated applications
• New data types and structures
  • E.g. DNA sequences
• New operations
  • E.g. write many real-time streams simultaneously
HDF5 1.8 topics
• Dataset and datatype improvements
• Group improvements
• Link revisions
• Shared object header messages
• Metadata cache improvements
• Other improvements
• Platform-specific changes
• High-level APIs
• Parallel HDF5
• Tool improvements
Text-based data type descriptions
• Why:
  • Simplify datatype creation
  • Make datatype creation code more readable
  • Facilitate debugging by printing the text description of a data type
• What:
  • New routines to create a data type from its text description and to convert a data type back to text (H5LTdtype_to_text)
Text data type description – Example
• Create a datatype of compound type.

    /* Create the data type with text description */
    dtype = H5Ttext_to_type("typedef struct foo {int a; float b;} foo_t;");

    /* Convert the data type back to text */
    H5Ttype_to_text(dtype, NULL, H5T_C, &tsize);
Serialized datatypes and dataspaces
• Why:
  • Allow datatype and dataspace info to be transmitted between processes
  • Allow datatype/dataspace to be stored in non-HDF5 files
• What:
  • A new set of routines to serialize/deserialize HDF5 datatypes and dataspaces
Int to float conversion during I/O
• Why: Convert ints to floats during I/O
• What: Int to float conversion supported during I/O
Revised conversion exception handling
• Why: Give apps greater control over exceptions (range errors, etc.) during datatype conversion
• What: Revised conversion exception handling
Revised conversion exception handling
• To handle exceptions during conversions, register a handling function through H5Pset_type_conv_cb().
• Cases of exception:
  • H5T_CONV_EXCEPT_RANGE_HI
  • H5T_CONV_EXCEPT_RANGE_LOW
  • H5T_CONV_EXCEPT_TRUNCATE
  • H5T_CONV_EXCEPT_PRECISION
  • H5T_CONV_EXCEPT_PINF
  • H5T_CONV_EXCEPT_NINF
  • H5T_CONV_EXCEPT_NAN
• Return values: H5T_CONV_ABORT, H5T_CONV_UNHANDLED, H5T_CONV_HANDLED
Compression filter for n-bit data
• Why: Compact storage for user-defined datatypes
• What:
  • When data is stored on disk, padding bits are chopped off and only significant bits are stored
  • Supports most datatypes
  • Works with compound datatypes
N-bit compression example
• In memory, one value of N-Bit datatype is stored like this:

    | byte 3 | byte 2 | byte 1 | byte 0 |
    |????????|????SPPP|PPPPPPPP|PPPP????|

    S = sign bit, P = significant bit, ? = padding bit

• After passing through the N-Bit filter, all padding bits are chopped off, and the bits are stored on disk like this:

    |    1st value    |    2nd value    |
    |SPPPPPPP PPPPPPPP|SPPPPPPP PPPPPPPP|...

• The opposite (decompression) happens when going from disk to memory
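The bit layout above can be sketched in plain C. This is only an illustration of the idea, not the filter's implementation: the function names `nbit_extract` and `nbit_pack_pair` are hypothetical, and the offset of 4 bits and precision of 16 bits are read off the diagram on this slide.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the `precision` significant bits that start `offset` bits above the
 * least significant bit of a 32-bit word; the padding bits are discarded. */
uint32_t nbit_extract(uint32_t word, unsigned offset, unsigned precision)
{
    assert(precision >= 1 && precision <= 32 && offset + precision <= 32);
    uint32_t mask = (precision == 32) ? 0xFFFFFFFFu : ((1u << precision) - 1u);
    return (word >> offset) & mask;
}

/* Pack two values with the slide's layout (offset 4, precision 16) back to
 * back, first value in the high 16 bits, second value in the low 16 bits. */
uint32_t nbit_pack_pair(uint32_t first, uint32_t second)
{
    uint32_t a = nbit_extract(first, 4, 16);
    uint32_t b = nbit_extract(second, 4, 16);
    return (a << 16) | b;
}
```

With this layout two 32-bit in-memory values occupy 32 bits on disk instead of 64. Unpacking would shift each 16-bit field back to bit offset 4.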
Offset+size storage filter
• Why: Use less storage when less precision needed
• What:
  • Performs scale/offset operation on each value
  • Truncates result to fewer bits before storing
  • Currently supports integers and floats
• Example

    H5Pset_scaleoffset(dcr, H5Z_SO_INT, H5Z_SO_INT_MINBITS_DEFAULT);
    H5Dcreate(……, dcr);
    H5Dwrite(…);
Example with floating-point type
• Data: {104.561, 99.459, 100.545, 105.644}
• Choose scaling factor: the decimal precision to keep, e.g. scale factor D = 2
  1. Find minimum value (offset): 99.459
  2. Subtract minimum value from each element
     Result: {5.102, 0, 1.086, 6.185}
  3. Scale data by multiplying by 10^D = 100
     Result: {510.2, 0, 108.6, 618.5}
  4. Round the data to integers
     Result: {510, 0, 109, 619}
  5. Pack and store using the minimum number of bits
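The five steps can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not the filter's actual code: `scale_offset_encode` and `min_bits` are hypothetical names, and rounding is done by adding 0.5 and truncating, which is valid here because every value is non-negative after the minimum is subtracted.

```c
#include <assert.h>
#include <stddef.h>

/* Steps 1-4: find the minimum, subtract it, scale by 10^d, round to the
 * nearest integer. Returns the minimum, i.e. the offset kept with the data. */
double scale_offset_encode(const double *in, unsigned long *out, size_t n, unsigned d)
{
    assert(in != NULL && out != NULL && n > 0);

    double min = in[0];
    for (size_t i = 1; i < n; i++)
        if (in[i] < min)
            min = in[i];

    double scale = 1.0;
    for (unsigned k = 0; k < d; k++)
        scale *= 10.0;

    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned long)((in[i] - min) * scale + 0.5);

    return min;
}

/* Step 5: smallest number of bits that can hold every encoded value. */
unsigned min_bits(const unsigned long *v, size_t n)
{
    unsigned long max = 0;
    for (size_t i = 0; i < n; i++)
        if (v[i] > max)
            max = v[i];

    unsigned bits = 1;
    while (bits < 8 * sizeof max && (max >> bits) != 0)
        bits++;
    return bits;
}
```

For the slide's data with D = 2, the largest encoded value is about 619, which fits in 10 bits per element. The last element, 618.5, sits exactly on a rounding boundary, so binary floating point may produce 618 rather than 619 for it.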
“NULL” Dataspace
• Why:
  • Allow datasets with no elements to be described
  • NetCDF 4 needed a “place holder” for attributes
• What:
  • A dataset with no dimensions, no data
Access links by creation-time order
• Why:
  • Allow iteration & lookup of group’s links (children) by creation order as well as by name order
  • Support netCDF access model for netCDF 4
• What: Option to access objects in group according to relative creation time
“Compact groups”
• Why:
  • Save space and access time for small groups
  • If groups are small, they don’t need B-tree overhead
• What:
  • Alternate storage for groups with few links
• Example
  • File with 11,600 groups
  • With original group structure, file size ~ 20 MB
  • With compact groups, file size ~ 12 MB
  • Total savings: 8 MB (40%)
  • Average savings/group: ~700 bytes
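The savings figures in this example follow from simple arithmetic. The slide does not say which megabyte convention it uses, so the check below assumes 1 MB = 10^6 bytes:

```latex
\[
\text{total savings} = 20\,\text{MB} - 12\,\text{MB} = 8\,\text{MB},
\qquad
\frac{8\,\text{MB}}{20\,\text{MB}} = 40\%
\]
\[
\text{average savings per group} \approx
\frac{8 \times 10^{6}\ \text{bytes}}{11{,}600\ \text{groups}} \approx 690\ \text{bytes}
\]
```

This is consistent with the rounded figure of about 700 bytes per group.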
Better large group storage
• Why: Faster, more scalable storage and access for large groups
• What: New format and method for storing groups with many links
Intermediate group creation
• Why:
  • Simplify creation of a series of connected groups
  • Avoid having to create each intermediate group separately, one by one
• What:
  • Intermediate groups can be created when creating an object in a file, with one function call
Example: add intermediate groups
• Want to create “/A/B/C/dset1”
• “A” exists, but “B/C/dset1” does not

    H5Dcreate(file_id, "/A/B/C/dset1", ..)

• One call creates groups “B” & “C”, then creates “dset1”
[Figure: the hierarchy before (/ containing A) and after (/ containing A, B, C and dset1)]
What are links?
• Links connect groups to their members
• “Hard” links point to a target by address
• “Soft” links store the path to a target
[Figure: root group connected to a dataset by a hard link (<address>) and a soft link (“/target dataset”)]
New: External Links
• Why: Access objects by file & path within file
• What:
  • Store location of file and path within that file
  • Can link across files
[Figure: root group of file1.h5 holding external link “dataset EL” (“file2.h5”, “target dataset”), which points to the dataset “target dataset” (<address>) under the root group of file2.h5]
New: User-defined Links
• Why:
  • Allow applications to create their own kinds of links and link operations, such as
    • Create “hard” external link that finds an object by address
    • Create link that accesses a URL
    • Keep track of how often a link is accessed, or other behavior
• What:
  • App can create new kinds of links by supplying custom callback functions
  • Can do anything HDF5 hard, soft, or external links do
Shared object header messages
• Why: metadata duplicated many times, wasting space
• Example:
  • You create a file with 10,000 datasets
  • All use the same datatype and dataspace
  • HDF5 needs to write this information 10,000 times!
[Figure: Dataset 1, Dataset 2 and Dataset 3, each with its own copy of the datatype and dataspace plus its own data]
Shared object header messages
• What:
  • Enable messages to be shared automatically
  • HDF5 shares duplicated messages on its own!
[Figure: Dataset 1 and Dataset 2 sharing one datatype and one dataspace, each with its own data]
Shared Messages
• Happens automatically
• Works with datatypes, dataspaces, attributes, fill values, and filter pipelines
• Saves space if these objects are relatively large
• May be faster if HDF5 can cache shared messages
• Drawbacks
  • Usually slower than non-shared messages
  • Adds overhead to the file
    • Index for storing shared datatypes
    • 25 bytes per instance
  • Older library versions can’t read files with shared messages
Two informal tests
• File with 24 datasets, all with same big datatype
  • 26,000 bytes normally
  • 17,000 bytes with shared messages enabled
  • Saves 375 bytes per dataset
• But, make a bad decision: invoke shared messages but only create one dataset…
  • 9,000 bytes normally
  • 12,000 bytes with shared messages enabled
  • Probably slower when reading and writing, too.
• Moral: shared messages can be a big help, but only in the right situation!
Metadata cache improvements
• Why:
  • Improve I/O performance and memory usage when accessing many objects
• What:
  • New metadata cache APIs
    • control cache size
    • monitor actual cache size and current hit rate
  • Under the hood: adaptive cache resizing
    • Automatically detects the current working set size
    • Sets max cache size to the working set size
Metadata cache improvements
• Note: most applications do not need to worry about the cache
• See “Advanced topics” for details
• And if you do see unusual memory growth or poor performance, please contact us. We want to help you.
New extendible error-handling API
• Why: Enable app to integrate error reporting with HDF5 library error stack
• What: New error handling API
  • H5Epush – push major and minor error ID on specified error stack
  • H5Eprint – print specified stack
  • H5Ewalk – walk through specified stack
  • H5Eclear – clear specified stack
  • H5Eset_auto – turn error printing on/off for specified stack
  • H5Eget_auto – return settings for specified stack traversal
Attribute improvements
• Why:
  • Use less storage when large numbers of attributes are attached to a single object
  • Iterate over or look up attributes by creation order
• What:
  • Property to create index on the order in which the attributes are created
  • Improved attribute storage
Support for Unicode Character Set
• Why:
  • So apps can create names using Unicode
  • netCDF 4 needed this
• What:
  • UTF-8 Unicode encoding now supported
  • For string datatypes, names of links and attributes
• Example:

    H5Pset_char_encoding(lcpl_id, H5T_CSET_UTF8);
    H5Llink(file_id, "UTF-8 name", …, lcpl_id, …);
Efficient copying of HDF5 objects
• Why:
  • Enable apps to copy objects efficiently
• What:
  • New routines to copy an object in an HDF5 file within the current file or to another file
  • Done at a low level in the HDF5 file, allowing
    • Entire group hierarchies to be copied quickly
    • Compressed datasets to be copied without going through a decompression/compression cycle
Performance of object copy routines
Data transformation filter
• Why:
  • Apply arithmetic operations to data during I/O
• What:
  • Data transformation filter
  • Transform expressed by algebraic formula
  • Only +, -, *, and / supported
• Example:
  • Expression parameter set, such as x*(x-5)
  • When dataset read/written, x*(x-5) applied per element
  • When reading, values in file are unchanged
  • When writing, transformed data written to file
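The read and write behavior in the example can be mimicked in plain C. This is a rough sketch of the semantics only: the expression x*(x-5) is hard-coded here, whereas the slide describes it as a parameter, and all three function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* The slide's example expression, hard-coded for illustration. */
int example_transform(int x)
{
    return x * (x - 5);
}

/* Read path: the stored values stay as they are; only the caller's buffer
 * receives the transformed values. */
void read_with_transform(const int *stored, int *buf, size_t n)
{
    assert(stored != NULL && buf != NULL);
    for (size_t i = 0; i < n; i++)
        buf[i] = example_transform(stored[i]);
}

/* Write path: the transformed values are what end up in storage. */
void write_with_transform(int *stored, const int *buf, size_t n)
{
    assert(stored != NULL && buf != NULL);
    for (size_t i = 0; i < n; i++)
        stored[i] = example_transform(buf[i]);
}
```

The asymmetry matters in practice: reading a dataset through a transform is non-destructive, while writing through one permanently stores the transformed values.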
Stackable Virtual File Drivers
• What is a Virtual File Driver (VFD)?
Structure of HDF5 Library
• Object API (C, Fortran 90, Java, C++)
  • Specify objects and transformation properties
  • Invoke data movement operations and data transformations
• Library internals
  • Performs data transformations and other prep for I/O
  • Configurable transformations (compression, etc.)
• Virtual file I/O (C only)
  • Perform byte-stream I/O operations (open/close, read/write, seek)
  • User-implementable I/O (stdio, network, memory, etc.)
Stackable VFD
• HDF5 VFD allows
  • Storing data using different physical file layouts. E.g., Family VFD (writes file as “family of files”)
  • Doing different types of I/O. E.g., stdio (standard I/O); MPI-I/O (for parallel I/O)
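The “family of files” layout can be pictured as an address translation: a byte address in the logical HDF5 file maps to a member file and an offset inside it. The sketch below assumes fixed-size members; the type `family_loc` and the function `family_locate` are hypothetical names used for illustration, not library identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Where a logical file address lands in a family of fixed-size member files. */
typedef struct {
    uint64_t member; /* index of the member file */
    uint64_t offset; /* byte offset inside that member */
} family_loc;

/* Translate a logical address, assuming every member holds `member_size` bytes. */
family_loc family_locate(uint64_t addr, uint64_t member_size)
{
    assert(member_size > 0);
    family_loc loc = { addr / member_size, addr % member_size };
    return loc;
}
```

A driver of this kind performs no I/O of its own: once the address is translated, the actual read or write would be handed to another driver that touches the member file.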
Stackable VFD
• Why “stackable”:
  • Until now, only one VFD could be used at a time
  • VFDs could not interoperate
• What is “stackable”:
  • A non-terminal VFD may stack on top of compatible non-terminal VFDs and, eventually, a terminal VFD
• Two kinds of VFD
  • Non-terminal (e.g. Family)
  • Terminal (e.g. stdio; MPI-I/O)
Stackable VFD
[Figure: layered diagram. Application, then HDF5 API, then non-terminal VFDs (split, with metadata and rawdata branches; Family), then terminal VFDs (Sec2, stdio, mpiio), then HDF5 files. A default I/O path is also shown.]