
HDF5 Advanced Topics

This workshop covers advanced topics in HDF5, including dataset selections and partial I/O, chunking and filters, datatypes, compound datatypes, and object and dataset region references.


Presentation Transcript


  1. HDF5 Advanced Topics HDF-EOS Workshop XI

  2. Outline • Dataset selections and partial I/O • Chunking and filters • Datatypes: overview, variable length data, compound datatypes, object and dataset region references

  3. Selections and partial I/O

  4. What is a selection? • A selection describes the elements of a dataset that participate in partial I/O • Hyperslab selection • Point selection • Results of set operations on hyperslab or point selections (union, difference, …) • Used by both sequential and parallel HDF5

  5. Example of a hyperslab selection

  6. Example of regular hyperslab selection

  7. Example of irregular hyperslab selection

  8. Another example of hyperslab selection

  9. Example of point selection

  10. Hyperslab description • Offset - starting location of the hyperslab (1,1) • Stride - number of elements between the starts of adjacent blocks (3,2) • Count - number of blocks to select (2,6) • Block - size of each block (2,1) • Everything is “measured” in number of elements

  11. H5Sselect_hyperslab • space_id - identifier of the dataspace • op - selection operator (H5S_SELECT_SET, H5S_SELECT_OR) • offset - array with the starting coordinates of the hyperslab • stride - array specifying which positions along each dimension to select • count - array specifying how many blocks to select in each dimension • block - array specifying the size of the element block (NULL indicates a block of a single element in each dimension)
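
As an illustration of these four parameters, here is a minimal sketch that makes the hyperslab selection described on slide 10; the 8 x 12 dataspace size and the variable names are assumptions for illustration, not taken from the workshop code.

  hsize_t dims[2]   = {8, 12};          /* assumed dataspace, large enough for the selection */
  hsize_t offset[2] = {1, 1};           /* start at row 1, column 1 */
  hsize_t stride[2] = {3, 2};           /* a block starts every 3 rows and every 2 columns */
  hsize_t count[2]  = {2, 6};           /* 2 x 6 pattern of blocks */
  hsize_t block[2]  = {2, 1};           /* each block is 2 rows by 1 column */

  hid_t space_id = H5Screate_simple(2, dims, NULL);
  /* Replace any existing selection with this regular hyperslab */
  H5Sselect_hyperslab(space_id, H5S_SELECT_SET, offset, stride, count, block);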

  12. Reading/Writing Selections • Open the file • Open the dataset • Get file dataspace • Create a memory dataspace (data buffer) • Make the selection(s) • Read from or write to the dataset • Close the dataset, file dataspace, memory dataspace, and file

  13. Example c-hyperslab.c: reading two rows • Data in file: 4x6 matrix • Buffer in memory: 1-dim array of length 14

  14. Example: reading two rows (selection in the file dataspace) offset = {1,0}; count = {2,6}; stride = {1,1}; block = {1,1}; filespace = H5Dget_space(dataset); H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);

  15. Example: reading two rows (selection in the memory dataspace) offset = {1}; count = {12}; mdim = {14}; memspace = H5Screate_simple(1, mdim, NULL); H5Sselect_hyperslab(memspace, H5S_SELECT_SET, offset, NULL, count, NULL);

  16. Example: reading two rows H5Dread (…, …, memspace, filespace, …, …);
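
Putting slides 13-16 together, a minimal sketch of the complete read might look like the following; the file name, dataset name, and the use of HDF5 1.8's H5Dopen (which takes an access property list) are assumptions.

  hid_t   file, dataset, filespace, memspace;
  hsize_t foffset[2] = {1, 0}, fcount[2] = {2, 6};   /* rows 1-2 of the 4x6 file dataset */
  hsize_t mdim[1]    = {14};
  hsize_t moffset[1] = {1},    mcount[1] = {12};     /* elements 1-12 of the 14-element buffer */
  int     buf[14]    = {0};

  file    = H5Fopen("hyperslab.h5", H5F_ACC_RDONLY, H5P_DEFAULT);   /* assumed file name */
  dataset = H5Dopen(file, "IntArray", H5P_DEFAULT);                 /* assumed dataset name */

  filespace = H5Dget_space(dataset);
  H5Sselect_hyperslab(filespace, H5S_SELECT_SET, foffset, NULL, fcount, NULL);

  memspace = H5Screate_simple(1, mdim, NULL);
  H5Sselect_hyperslab(memspace, H5S_SELECT_SET, moffset, NULL, mcount, NULL);

  H5Dread(dataset, H5T_NATIVE_INT, memspace, filespace, H5P_DEFAULT, buf);

  H5Sclose(memspace);
  H5Sclose(filespace);
  H5Dclose(dataset);
  H5Fclose(file);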

  17. Chunking in HDF5

  18. HDF5 chunking • Chunked layout is needed for • Extendible datasets • Compression and other filters • To improve partial I/O for big datasets • (Figure: contiguous vs. chunked storage - with a chunked layout, only the two chunks that overlap the selection will be written/read; better subsetting access time; extendible)

  19. Creating chunked dataset • Create a dataset creation property list • Set property list to use chunked storage layout • Set property list to use filters • Create dataset with the above property list plist = H5Pcreate(H5P_DATASET_CREATE); H5Pset_chunk(plist, rank, ch_dims); H5Pset_deflate(plist, 9); dset_id = H5Dcreate (…, “Chunked”,…, plist); H5Pclose(plist);
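
Since chunking is also what makes a dataset extendible (slide 18), a hedged sketch of the unlimited-dimension variant follows; the dimension sizes, dataset name, and the seven-argument H5Dcreate of HDF5 1.8 are assumptions.

  hsize_t dims[2]    = {1000, 20};
  hsize_t maxdims[2] = {H5S_UNLIMITED, 20};    /* the first dimension may grow */
  hsize_t ch_dims[2] = {200, 20};              /* assumed chunk size */

  hid_t space = H5Screate_simple(2, dims, maxdims);
  hid_t plist = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(plist, 2, ch_dims);
  H5Pset_deflate(plist, 9);

  /* file_id: an already open file, as on the slide */
  hid_t dset_id = H5Dcreate(file_id, "Chunked", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, plist, H5P_DEFAULT);

  /* Later the dataset can be extended and more rows written */
  hsize_t new_dims[2] = {2000, 20};
  H5Dset_extent(dset_id, new_dims);

  H5Pclose(plist);
  H5Sclose(space);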

  20. HDF5 filters • HDF5 filters modify data during I/O operations • Available filters: • Checksum (H5Pset_fletcher32) • Shuffling filter (H5Pset_shuffle) • Compression: Scale+offset (H5Pset_scaleoffset), N-bit (H5Pset_nbit), GZIP/deflate (H5Pset_deflate), SZIP (H5Pset_szip), BZIP2 (example of a user-defined compression filter, http://www.hdfgroup.uiuc.edu/papers/papers/bzip2/)
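
Filters are simply stacked onto a dataset creation property list in the order they should be applied; a minimal sketch, where the chunk size and the particular filter combination are assumptions.

  hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
  hsize_t ch_dims[2] = {100, 100};   /* chunking is required before any filter can be used */
  H5Pset_chunk(dcpl, 2, ch_dims);

  H5Pset_fletcher32(dcpl);           /* checksum for error detection */
  H5Pset_shuffle(dcpl);              /* byte shuffling, often helps deflate */
  H5Pset_deflate(dcpl, 6);           /* GZIP compression, level 6 */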

  21. N-bit compression • In memory, one value of an N-Bit datatype is stored like this: | byte 3 | byte 2 | byte 1 | byte 0 | |????????|????SPPP|PPPPPPPP|PPPP????| S = sign bit, P = significant bit, ? = padding bit • After passing through the N-Bit filter, all padding bits are chopped off and the bits are stored on disk like this: | 1st value | 2nd value | |SPPPPPPP PPPPPPPP|SPPPPPPP PPPPPPPP|... • The opposite (decompression) happens when going from disk to memory

  22. N-bit compression • Provides compact storage for user-defined datatypes • How does it work? • When data is stored on disk, the padding bits are chopped off and only the significant bits are stored • Supports most datatypes • Works with compound datatypes • H5Pset_nbit(dcr); • H5Dcreate(……, dcr); • H5Dwrite (…);
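
A hedged sketch of how the N-Bit filter might be combined with a reduced-precision integer type; the 12-bit precision, 4-bit offset, chunk size, and names are assumptions.

  /* A user-defined integer that uses only 12 significant bits,
     starting 4 bits into each 32-bit word */
  hid_t nbit_type = H5Tcopy(H5T_NATIVE_INT);
  H5Tset_precision(nbit_type, 12);
  H5Tset_offset(nbit_type, 4);

  hid_t   dcr = H5Pcreate(H5P_DATASET_CREATE);
  hsize_t ch_dims[2] = {100, 100};        /* N-Bit, like all filters, needs chunked storage */
  H5Pset_chunk(dcr, 2, ch_dims);
  H5Pset_nbit(dcr);

  /* file_id, space_id, and buf are assumed to exist, as on the earlier slides */
  hid_t dset = H5Dcreate(file_id, "NBitData", nbit_type, space_id,
                         H5P_DEFAULT, dcr, H5P_DEFAULT);
  H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);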

  23. Scale+offset storage filter • Uses less storage when less precision is needed • Performs a scale/offset operation on each value • Truncates the result to fewer bits before storing • Currently supports integers and floats • Example H5Pset_scaleoffset(dcr, H5Z_SO_INT, H5Z_SO_INT_MINBITS_DEFAULT); H5Dcreate(……, dcr); H5Dwrite (…);

  24. Example with floating-point type • Data: {104.561, 99.459, 100.545, 105.644} • Choose scaling factor: the number of decimal digits to keep, e.g. scale factor D = 2 1. Find the minimum value (offset): 99.459 2. Subtract the minimum value from each element Result: {5.102, 0, 1.086, 6.185} 3. Scale the data by multiplying by 10^D = 100 Result: {510.2, 0, 108.6, 618.5} 4. Round the data to integers Result: {510, 0, 109, 619} This is what is stored in the file.
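
For floating-point data like this, the filter is used with the D-scaling variant and the chosen decimal precision; a minimal sketch, where the dataset name, dataspace, and chunk size are assumptions.

  double  data[4] = {104.561, 99.459, 100.545, 105.644};
  hsize_t dims[1] = {4}, ch[1] = {4};

  hid_t space = H5Screate_simple(1, dims, NULL);
  hid_t dcr   = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcr, 1, ch);
  /* D-scaling: keep 2 decimal digits, i.e. scale factor D = 2 */
  H5Pset_scaleoffset(dcr, H5Z_SO_FLOAT_DSCALE, 2);

  hid_t dset = H5Dcreate(file_id, "ScaledData", H5T_NATIVE_DOUBLE, space,
                         H5P_DEFAULT, dcr, H5P_DEFAULT);
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);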

  25. Writing or reading to/from a chunked dataset • Use the same set of operations as for a contiguous dataset • Selections do not need to coincide precisely with the chunks • The chunking mechanism is transparent to the application • Chunking and compression parameters can affect performance H5Dopen(…); ………… H5Sselect_hyperslab (…); ………… H5Dread(…);

  26. h5zip.c example Creates a compressed integer dataset 1000x20 in the zip.h5 file h5dump -p -H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
   GROUP "Data" {
      DATASET "Compressed_Data" {
         DATATYPE H5T_STD_I32BE
         DATASPACE SIMPLE { ( 1000, 20 ) ……… }
         STORAGE_LAYOUT {
            CHUNKED ( 20, 20 )
            SIZE 5316
         }

  27. h5zip.c example
         FILTERS {
            COMPRESSION DEFLATE { LEVEL 6 }
         }
         FILLVALUE {
            FILL_TIME H5D_FILL_TIME_IFSET
            VALUE 0
         }
         ALLOCATION_TIME {
            H5D_ALLOC_TIME_INCR
         }
      }
   }
}
}

  28. h5zip.c example (bigger chunk) Creates a compressed integer dataset 1000x20 in the zip.h5 file h5dump -p -H zip.h5
HDF5 "zip.h5" {
GROUP "/" {
   GROUP "Data" {
      DATASET "Compressed_Data" {
         DATATYPE H5T_STD_I32BE
         DATASPACE SIMPLE { ( 1000, 20 ) ……… }
         STORAGE_LAYOUT {
            CHUNKED ( 200, 20 )
            SIZE 2936
         }

  29. Chunking basics to remember • Chunking creates storage overhead in the file • Performance is affected by • Chunking and compression parameters • Chunk cache size (H5Pset_cache call) • Some hints for getting better performance • Use a chunk size no smaller than the block size (4k) on your system • Use a compression method appropriate for your data • Avoid selections that do not coincide with chunk boundaries
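
The cache hint above refers to the file-level H5Pset_cache call; HDF5 1.8.3 and later also provide the per-dataset H5Pset_chunk_cache. A hedged sketch follows; the cache sizes and the dataset path are assumptions.

  /* File-level raw data chunk cache: 521 hash slots, 16 MB, w0 = 0.75 (the library default) */
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_cache(fapl, 0, 521, 16 * 1024 * 1024, 0.75);
  hid_t file = H5Fopen("zip.h5", H5F_ACC_RDWR, fapl);

  /* Per-dataset alternative (HDF5 1.8.3+), set on a dataset access property list */
  hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
  H5Pset_chunk_cache(dapl, 521, 16 * 1024 * 1024, 0.75);
  hid_t dset = H5Dopen(file, "/Data/Compressed_Data", dapl);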

  30. Selections and I/O Performance

  31. Performance of serial I/O operations • The next slides show the performance effects of using different access patterns and storage layouts. • We use three test cases, each consisting of writing a selection to a rectangular array where the datatype of each element is a char. • Data is stored in row-major order. • Tests were executed on THG smirom using h5perf_serial and HDF5 version 1.8.

  32. Serial benchmarking tool • A new benchmarking tool, h5perf_serial, is under development. • Some features implemented at this time are: • Support for POSIX and HDF5 I/O calls. • Support for datasets and buffers with multiple dimensions. • Entire dataset access using a single or several I/O operations. • Selection of contiguous and chunked storage for HDF5 operations.

  33. Contiguous storage (Case 1) • Rectangular dataset of size 48K x 48K, with write selections of 512 x 48K. • HDF5 storage layout is contiguous. • Good I/O pattern for POSIX and HDF5 because each selection is contiguous. • POSIX: 5.19 MB/s • HDF5: 5.36 MB/s

  34. Contiguous storage (Case 2) • Rectangular dataset of 48K x 48K, with write selections of 48K x 512. • HDF5 storage layout is contiguous. • Bad I/O pattern for POSIX and HDF5 because each selection is noncontiguous. • POSIX: 1.24 MB/s • HDF5: 0.05 MB/s

  35. Chunked storage • Rectangular dataset of 48K x 48K, with write selections of 48K x 512. • HDF5 storage layout is chunked. Chunk and selection sizes are equal. • Bad I/O case for POSIX because selections are noncontiguous. • Good I/O case for HDF5 since selections are contiguous due to the chunked layout settings. • POSIX: 1.51 MB/s • HDF5: 5.58 MB/s

  36. Conclusions • Access patterns with small I/O operations repeatedly incur high latency and overhead costs. • Chunked storage may improve I/O performance by affecting the contiguity of the data selection.

  37. HDF5 Datatypes

  38. Datatypes • A datatype is • A classification specifying the interpretation of a data element • Specifies for a given data element • the set of possible values it can have • the operations that can be performed • how the values of that type are stored • May be shared between different datasets in one file

  39. Hierarchy of the HDF5 datatype classes

  40. General Operations on HDF5 Datatypes • Create • Derived and compound datatypes only • Copy • All datatypes • Commit (save in a file to share between different datasets) • All datatypes • Open • Committed datatypes only • Discover properties (size, number of members, base type) • Close

  41. Basic Atomic HDF5 Datatypes

  42. Basic Atomic Datatypes • Atomic type classes • integers & floats • strings (fixed and variable size) • pointers - references to objects/dataset regions • opaque • bitfield • An element of an atomic datatype is the smallest possible unit for an HDF5 I/O operation • Cannot write or read just the mantissa or exponent field of a float, or the sign field of an integer

  43. HDF5 Predefined Datatypes • The HDF5 library provides predefined datatypes (symbols) for all basic atomic classes except opaque • H5T_<arch>_<base> • Examples: • H5T_IEEE_F64LE • H5T_STD_I32BE • H5T_C_S1 • H5T_STD_B32LE • H5T_STD_REF_OBJ, H5T_STD_REF_DSETREG • H5T_NATIVE_INT • Predefined datatypes do not have constant values; they are initialized when the library is initialized

  44. When to use HDF5 Predefined Datatypes? • In dataset and attribute creation operations • Argument to H5Dcreate or to H5Acreate H5Dcreate(file_id, "/dset", H5T_STD_I32BE, dataspace_id, H5P_DEFAULT); H5Dcreate(file_id, "/dset", H5T_NATIVE_INT, dataspace_id, H5P_DEFAULT); • In dataset and attribute I/O operations • Argument to H5Dwrite/read, H5Awrite/read • Always use H5T_NATIVE_* types to describe data in memory • To create user-defined types • Fixed and variable-length strings • User-defined integers and floats (e.g. a 13-bit integer or a non-standard floating-point type) • In composite type definitions • Do not use them for declaring variables
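
As an example of the user-defined case above, a minimal sketch of building a 13-bit integer type from a predefined one and committing it so other datasets can share it; the name "int13" and the HDF5 1.8 form of H5Tcommit are assumptions.

  /* Start from a native int and narrow it to 13 significant bits */
  hid_t int13 = H5Tcopy(H5T_NATIVE_INT);
  H5Tset_precision(int13, 13);
  H5Tset_offset(int13, 0);

  /* Committing the datatype saves it in the file so it can be shared */
  H5Tcommit(file_id, "int13", int13, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);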

  45. HDF5 Fixed and Variable length array storage (figure comparing fixed-length and variable-length record storage over time)

  46. Storing strings in HDF5 (string.c) • Array of characters • Access to each character • Extra work to access and interpret each string • Fixed length string_id = H5Tcopy(H5T_C_S1); H5Tset_size(string_id, size); • Overhead for short strings • Can be compressed • Variable length string_id = H5Tcopy(H5T_C_S1); H5Tset_size(string_id, H5T_VARIABLE); • Overhead as for all VL datatypes (later) • Compression will not be applied to the actual data • See string.c for how to allocate the buffer when reading fixed and variable length strings
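
A hedged sketch showing the two string flavors side by side; the dataset names, string sizes, and sample values are assumptions (see string.c in the tutorial for the full program).

  hsize_t dims[1] = {2};
  hid_t   space   = H5Screate_simple(1, dims, NULL);

  /* Fixed length: every element occupies exactly 7 bytes */
  hid_t fixed_str = H5Tcopy(H5T_C_S1);
  H5Tset_size(fixed_str, 7);
  char fdata[2][7] = {"apple", "pear"};
  hid_t dset_f = H5Dcreate(file_id, "FixedStrings", fixed_str, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(dset_f, fixed_str, H5S_ALL, H5S_ALL, H5P_DEFAULT, fdata);

  /* Variable length: each element is a C string of arbitrary length */
  hid_t vlen_str = H5Tcopy(H5T_C_S1);
  H5Tset_size(vlen_str, H5T_VARIABLE);
  const char *vdata[2] = {"a short one", "a considerably longer string"};
  hid_t dset_v = H5Dcreate(file_id, "VarStrings", vlen_str, space,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(dset_v, vlen_str, H5S_ALL, H5S_ALL, H5P_DEFAULT, vdata);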

  47. HDF5 variable length datatypes • Each element is represented by the C struct typedef struct { size_t len; void *p; } hvl_t; • The base type can be any HDF5 type • H5Tvlen_create(base_type)

  48. Creation of HDF5 variable length array hvl_t data[LENGTH]; for(i=0; i<LENGTH; i++) { data[i].p=HDmalloc((i+1)*sizeof(unsigned int)); data[i].len=i+1; } tvl = H5Tvlen_create (H5T_NATIVE_UINT); (figure: each data[i].p points to its own buffer of data[i].len values)
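
Continuing the fragment above, a minimal sketch of writing the variable-length data; the dataset name and fill values are assumptions, and plain malloc stands in for the library-internal HDmalloc.

  int     i, j;
  hvl_t   data[LENGTH];
  hsize_t dims[1] = {LENGTH};

  for (i = 0; i < LENGTH; i++) {
      data[i].p   = malloc((i + 1) * sizeof(unsigned int));
      data[i].len = i + 1;
      for (j = 0; j <= i; j++)
          ((unsigned int *)data[i].p)[j] = i * 10 + j;   /* arbitrary fill values */
  }

  hid_t tvl   = H5Tvlen_create(H5T_NATIVE_UINT);
  hid_t space = H5Screate_simple(1, dims, NULL);
  hid_t dset  = H5Dcreate(file_id, "VLData", tvl, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Dwrite(dset, tvl, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

  for (i = 0; i < LENGTH; i++)
      free(data[i].p);     /* buffers we allocated ourselves; see slide 50 for H5Dvlen_reclaim */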

  49. HDF5 Variable Length Datatypes: Storage (figure: the dataset with a variable length datatype stores references into the global heap, where the raw data is kept)

  50. Reading HDF5 variable length array When the size and base datatype are known: hvl_t rdata[LENGTH]; /* Create the matching memory datatype */ tvl = H5Tvlen_create (H5T_NATIVE_UINT); ret = H5Dread(dataset, tvl, H5S_ALL, H5S_ALL, H5P_DEFAULT, rdata); /* Reclaim the read VL data */ ret = H5Dvlen_reclaim(tvl, H5S_ALL, H5P_DEFAULT, rdata);
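
When the base datatype is not known in advance, it can instead be discovered from the dataset itself; a hedged sketch, assuming the number of elements (LENGTH) is already known from the dataspace.

  hvl_t rdata[LENGTH];

  /* Ask the dataset for its on-disk type and convert it to a native memory type */
  hid_t ftype = H5Dget_type(dataset);
  hid_t mtype = H5Tget_native_type(ftype, H5T_DIR_DEFAULT);
  hid_t space = H5Dget_space(dataset);

  H5Dread(dataset, mtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, rdata);

  /* Let the library free the buffers it allocated for the VL data */
  H5Dvlen_reclaim(mtype, space, H5P_DEFAULT, rdata);

  H5Sclose(space);
  H5Tclose(mtype);
  H5Tclose(ftype);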
