N-Bit and Scale-offset filters Presented by Kent Yang Prepared by Xiaowen Wu
N-Bit filter • Introduction • Description • Implementation • Usage • Limitations
Introduction: N-Bit datatype • How to create an N-Bit datatype? • Only integer and floating-point datatypes can be used for construction • Integer datatype class • Floating-point datatype class • Integer or floating-point member(s) of a compound datatype • Integer or floating-point base datatype of an array datatype
Introduction: N-Bit datatype • How to create an N-Bit datatype? • Example code:
hid_t nbit_datatype = H5Tcopy(H5T_STD_I32LE);
H5Tset_precision(nbit_datatype, 16);
H5Tset_offset(nbit_datatype, 4);
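A short follow-up sketch (not from the original slides): verifying the new datatype's properties with the standard HDF5 query calls. Note that the in-memory size is unchanged; only the filter exploits the reduced precision.
#include <assert.h>
#include "hdf5.h"

int main(void)
{
    hid_t nbit_datatype = H5Tcopy(H5T_STD_I32LE);
    H5Tset_precision(nbit_datatype, 16);      /* keep 16 significant bits */
    H5Tset_offset(nbit_datatype, 4);          /* starting at bit 4        */

    assert(H5Tget_precision(nbit_datatype) == 16);
    assert(H5Tget_offset(nbit_datatype) == 4);
    assert(H5Tget_size(nbit_datatype) == 4);  /* still 4 bytes in memory  */

    H5Tclose(nbit_datatype);
    return 0;
}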
Introduction: a simple example • A value of the N-Bit datatype created by the example code is stored in memory on a little-endian machine like this:
| byte 3 | byte 2 | byte 1 | byte 0 |
|????????|????SPPP|PPPPPPPP|PPPP????|
S - sign bit, P - significant bit, ? - padding bit
For signed integers, the sign bit is included in the precision
• As data pass through the N-Bit filter towards disk, all padding bits are chopped off during compression, and the values are stored on disk like this:
|    1st value    |    2nd value    |
|SPPPPPPP PPPPPPPP|SPPPPPPP PPPPPPPP|...
• The opposite operation (decompression) is performed when data flow from disk through the N-Bit filter towards memory
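To make the bit chopping concrete, here is a minimal conceptual sketch, not the library's internal code, of extracting the significant bits of each value and appending them to a contiguous stream; the helper name pack_nbit is hypothetical, and little-endian bit order with a zero-initialized output buffer is assumed.
#include <stdint.h>
#include <stddef.h>

/* Append `precision` bits of `value`, starting at bit `offset`, to the
 * output bit stream tracked by `bit_pos`; padding bits are never copied. */
static void pack_nbit(uint8_t *out, size_t *bit_pos,
                      uint32_t value, unsigned offset, unsigned precision)
{
    for (unsigned i = 0; i < precision; i++) {
        if ((value >> (offset + i)) & 1u)              /* significant bit */
            out[*bit_pos / 8] |= (uint8_t)(1u << (*bit_pos % 8));
        (*bit_pos)++;
    }
}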
Introduction: N-Bit filter • More complex situations: • 1. Compound datatype • 2. Array datatype
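For the compound case, a minimal sketch of attaching an N-Bit member to a compound datatype; the struct name pair_t and its fields are hypothetical, not from the original slides.
#include "hdf5.h"

typedef struct {
    int   a;   /* to be stored as a 16-bit N-Bit field */
    float b;   /* stored at full precision             */
} pair_t;

hid_t make_nbit_compound(void)
{
    hid_t member = H5Tcopy(H5T_NATIVE_INT);
    H5Tset_precision(member, 16);                    /* 16 significant bits */
    H5Tset_offset(member, 0);                        /* starting at bit 0   */

    hid_t cmpd = H5Tcreate(H5T_COMPOUND, sizeof(pair_t));
    H5Tinsert(cmpd, "a", HOFFSET(pair_t, a), member);           /* N-Bit */
    H5Tinsert(cmpd, "b", HOFFSET(pair_t, b), H5T_NATIVE_FLOAT); /* full  */
    H5Tclose(member);
    return cmpd;
}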
Introduction: N-Bit filter • The N-Bit filter accepts almost all other HDF5 datatypes, passing them through at full length without compression: • Time • String • Bitfield • Opaque • Reference • Enum • Variable length
Introduction: N-Bit filter • One exception: array datatypes having variable-length or variable-length string as their base datatype • These are too complicated to accommodate, because the API call H5Tget_size does not return the correct disk size for them
N-Bit filter: pre-compression • Filter parameters are stored in the array cd_values[] by the filter callback function H5Z_set_local_nbit • They are passed to the function H5Z_filter_nbit by the HDF5 library • Parameters include • Datatype parameters • Integer/floating-point: size, endianness, precision, offset • Compound: total size, number of members, member offsets, parameters for each member • Array: total size, parameters for its base type • No-op: size, endianness • Etc.
N-Bit filter: pre-compression • Recursive calls are needed to set the parameters of complex datatypes • A coding scheme was developed for storing and retrieving the parameters of the different N-Bit datatypes
N-Bit filter: compression • Datatypes are categorized into 4 groups: • Integer and floating-point datatypes • Compound datatypes • Array datatypes • No-op datatypes • the filter performs no operation on them • if inside a compound datatype, they are packed at full length with the other N-Bit fields • Recursive function calls are used for complex situations (see the sketch below)
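A conceptual sketch, with hypothetical names and a hypothetical parameter record (not the library's internal H5Z_filter_nbit code), of how the recursive dispatch over these four groups might look; it reuses the pack_nbit helper sketched earlier and assumes atomic values of at most 4 little-endian bytes.
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parameter tree for one datatype node. */
typedef struct nbit_node {
    enum { NBIT_ATOMIC, NBIT_COMPOUND, NBIT_ARRAY, NBIT_NOOP } group;
    size_t size;                 /* bytes occupied by this node in memory */
    unsigned offset, precision;  /* NBIT_ATOMIC only                      */
    size_t nmembers;             /* NBIT_COMPOUND only                    */
    size_t *member_off;          /* byte offset of each compound member   */
    struct nbit_node *child;     /* members, or the array base type       */
    size_t nelems;               /* NBIT_ARRAY only                       */
} nbit_node_t;

void pack_nbit(uint8_t *out, size_t *bit_pos,
               uint32_t value, unsigned offset, unsigned precision);

static void compress_node(const nbit_node_t *n, const uint8_t *in,
                          uint8_t *out, size_t *bit_pos)
{
    switch (n->group) {
    case NBIT_ATOMIC: {                 /* strip padding, keep precision */
        uint32_t v = 0;
        memcpy(&v, in, n->size < 4 ? n->size : 4);
        pack_nbit(out, bit_pos, v, n->offset, n->precision);
        break;
    }
    case NBIT_COMPOUND:                 /* recurse into each member */
        for (size_t i = 0; i < n->nmembers; i++)
            compress_node(&n->child[i], in + n->member_off[i], out, bit_pos);
        break;
    case NBIT_ARRAY:                    /* recurse into each element */
        for (size_t i = 0; i < n->nelems; i++)
            compress_node(n->child, in + i * n->child->size, out, bit_pos);
        break;
    case NBIT_NOOP:                     /* packed at full length */
        for (size_t i = 0; i < n->size; i++)
            pack_nbit(out, bit_pos, in[i], 0, 8);
        break;
    }
}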
N-Bit filter: decompression • In structure, very similar to compression • Opposite direction • Same N-Bit parameters are needed
Enable N-Bit filter • Create a dataset creation property list • Set chunking (and specify chunk dimensions) • Set up use of the N-Bit filter • Create dataset specifying this property list • Close property list
N-Bit filter: usage example
/* Define dataset datatype (N-Bit), and set precision, offset */
datatype = H5Tcopy(H5T_NATIVE_INT);
precision = 17;
H5Tset_precision(datatype, precision);
offset = 4;
H5Tset_offset(datatype, offset);
/* Set the dataset creation property list for N-Bit compression */
chunk_size[0] = CH_NX;
chunk_size[1] = CH_NY;
properties = H5Pcreate(H5P_DATASET_CREATE);  /* Create property list */
H5Pset_chunk(properties, 2, chunk_size);     /* Set chunking */
H5Pset_nbit(properties);                     /* Set N-Bit filter */
/* Create a new dataset with N-Bit datatype and above property list */
dataset = H5Dcreate(file, DATASET_NAME, datatype, dataspace, properties);
N-Bit filter: limitations • Only compresses N-Bit datatypes or fields derived from integer or floating-point datatypes • No support for array datatypes having variable-length or variable-length string as their base datatype • Does not check whether a defined fill value can be represented by the N-Bit datatype of the dataset
N-Bit filter: limitations • Decompression sets all padding bits to zero • The library restores the original padding of decompressed data only when the memory datatype differs from the dataset datatype • Has upper limits on the number of N-Bit parameters • Due to the limit on the object header, into which the array cd_values[] has to fit • Reached only when the dataset datatype is extremely complex (rarely happens)
Scale-Offset filter • Introduction • Usage • Limitations • Suggestions
Introduction: Scale-Offset compression • Scale-Offset compression performs a scale and/or offset operation on each data value and truncates the resulting value to a minimum number of bits (minimum-bits) before storing it • Unlike N-Bit compression, offset in Scale-Offset compression means the minimum value of a set of data values
Introduction: minimum-bits of integer values • If the maximum value of the data to be compressed is 7065 and the minimum value is 2970 • Then the "span" of the dataset values is (max - min + 1), which is 4096 • If no fill value is defined for the dataset, the minimum-bits is: ceiling(log2(span)) = 12 • With a fill value set, the minimum-bits is: ceiling(log2(span + 1)) = 13
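A minimal sketch, using the numbers from this slide, of the minimum-bits computation:
#include <math.h>
#include <stdio.h>

int main(void)
{
    const int max = 7065, min = 2970;
    const double span = (double)max - min + 1;      /* 4096 */

    int minbits_nofill = (int)ceil(log2(span));     /* 12 */
    int minbits_fill   = (int)ceil(log2(span + 1)); /* 13: one extra code
                                                       reserved for the
                                                       fill value */
    printf("no fill value: %d bits, with fill value: %d bits\n",
           minbits_nofill, minbits_fill);
    return 0;
}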
Introduction: how the Scale-Offset filter compresses floating-point data • Based on the GRiB data packing method • The basic idea is to transform the data into integers by some kind of scaling, and then follow the integer procedure of the Scale-Offset filter to compress the data • The Scale-Offset compression of floating-point data is therefore lossy in nature • Two design options for the transformation: D-scaling (variable minimum-bits method) and E-scaling (fixed minimum-bits method) • Currently only D-scaling is implemented
Introduction: what is D-scaling? • D-scaling means decimal scaling • A scale factor is introduced to transform the data from floating-point to integer • The minimum value is subtracted from each data element before the transformation • The modified data are then multiplied by 10 (decimal) raised to the power of scale_factor • Only the integer part is kept and manipulated by the filter's integer-type routines during pre-compression and compression
Introduction: D-scaling example • D-scaling factor: 2 Minimum value: 99.459 Original data: {104.561, 99.459, 100.545, 105.644} • Subtracting 99.459 from each data element: {5.102, 0, 1.086, 6.185} • Multiplying by 10^2: {510.2, 0, 108.6, 618.5} • Rounding off the digits after the decimal point: {510, 0, 109, 619} • After decompression, each value is divided by 10^2 and the offset 99.459 is added back: {104.559, 99.459, 100.549, 105.649}
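The same walk-through as a minimal C sketch (values from the slide; the round trip shows where precision is lost):
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double data[] = {104.561, 99.459, 100.545, 105.644};
    const double min = 99.459;
    const int scale_factor = 2;                    /* D-scaling factor */
    const double scale = pow(10.0, scale_factor);  /* 10^2 = 100 */

    for (int i = 0; i < 4; i++) {
        long packed = lround((data[i] - min) * scale); /* 510, 0, 109, 619 */
        double restored = packed / scale + min;        /* lossy round trip */
        printf("%8.3f -> %4ld -> %8.3f\n", data[i], packed, restored);
    }
    return 0;
}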
H5Pset_scaleoffset API
H5Pset_scaleoffset(hid_t plist_id, int scale_factor, unsigned scale_type)
• hid_t plist_id IN: Dataset creation property list identifier
• int scale_factor IN: Parameter related to scale
• If scale_type is H5_SO_FLOAT_DSCALE, scale_factor denotes the decimal scale factor (D-scaling) and it can be positive, negative, or zero. Only this option is available
• If scale_type is H5_SO_FLOAT_ESCALE, scale_factor denotes minimum-bits (E-scaling), and it must be a positive integer. Currently this is not supported
• If scale_type is H5_SO_INT, scale_factor denotes minimum-bits, and it should be a positive integer or H5_SO_INT_MINIMUMBITS_DEFAULT (0, meaning the library calculates the minimum-bits). If scale_factor is less than 0, the library will reset it to 0
• unsigned scale_type IN: Flag indicating the compression method
H5_SO_FLOAT_DSCALE (0) Floating-point type, using the variable minimum-bits method
H5_SO_FLOAT_ESCALE (1) Floating-point type, using the fixed minimum-bits method
H5_SO_INT (2) Integer type
Scale-Offset filter: integer example /* Set the fill value of dataset */ fill_val = 10000; H5Pset_fill_value (properties, H5T_NATIVE_INT, &fill_val); /* Set parameters for Scale-Offset compression */ H5Pset_scaleoffset (properties, H5_SO_INT_MINIMUMBITS_DEFAULT, H5_SO_INT); /* Create a new dataset */ dataset = H5Dcreate (file, DATASET_NAME, H5T_NATIVE_INT, dataspace, properties);
Scale-Offset filter: floating-point example /* Set the fill value of dataset */ fill_val = 10000.0; H5Pset_fill_value (properties, H5T_NATIVE_FLOAT, &fill_val); /* * Set parameters for Scale-Offset compression; * use D-scaling method, set decimal scale factor to 3 */ H5Pset_scaleoffset (properties, 3, H5_SO_FLOAT_DSCALE); /* Create a new dataset */ dataset = H5Dcreate (file, DATASET_NAME, H5T_NATIVE_FLOAT, dataspace, properties);
Scale-Offset filter: limitations • For floating-point data handling • The compression is lossy • For D-scaling, the data range is limited by the maximum value that can be represented by the corresponding unsigned integer type • The floating-point implementation does not support the long double type
Scale-Offset filter: suggestions • For floating-point data: • It is better to convert the units of the data into a reasonable common range (e.g. 1200 m to 1.2 km) • If the data values are close to zero, it is strongly recommended to set the fill value away from zero (e.g. a large positive number) • If the user does nothing, the HDF5 library will set the fill value to zero, which may make the compression less effective
Scale-Offset filter: suggestions • For floating-point data (cont.): • Users are not encouraged to use a very large decimal scale factor (e.g. 100) for the D-scaling method • The fill value must be ignored when finding the maximum and minimum values • Each value therefore needs to be compared with the fill value • The epsilon for this comparison is 10 raised to the negative of the decimal scale factor • If the scale factor gets too large, the epsilon becomes zero • The comparison then always fails • The fill value cannot be ignored • The result is a much larger minimum-bits (poor compression), as the sketch below illustrates
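A minimal sketch of the comparison logic described above; whether the library computes the epsilon in single or double precision is an implementation detail not covered by the slides, and the sketch uses float so the underflow appears at a smaller scale factor.
#include <math.h>
#include <stdbool.h>

/* With a huge scale factor, powf underflows to 0 (FLT_TRUE_MIN is about
 * 1.4e-45), so the tolerance vanishes and the test practically never
 * succeeds for values read back from floating-point storage. */
static bool is_fill_value(float v, float fill, int scale_factor)
{
    float eps = powf(10.0f, -(float)scale_factor);
    return fabsf(v - fill) <= eps;
}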
Introduction: related library datatype conversion • Happens only when the memory datatype differs from the dataset datatype • Before N-Bit compression and after N-Bit decompression • Integer example on a little-endian machine: • From memory datatype H5T_NATIVE_INT to the dataset datatype • Precision of H5T_NATIVE_INT is 32, offset is 0 • Precision of the dataset datatype is 16, offset is 4 • Before conversion:
| byte 3 | byte 2 | byte 1 | byte 0 |
|SPPPPPPP|PPPPPPPP|PPPPPPPP|PPPPPPPP|
• After conversion:
| byte 3 | byte 2 | byte 1 | byte 0 |
|????????|????SPPP|PPPPPPPP|PPPP????|
• Only the precision part (15 significant bits plus the sign bit) is kept • All other significant bits are turned into padding bits
Introduction: related library datatype conversion • Floating-point example on a little-endian machine: • From memory datatype H5T_NATIVE_FLOAT (IEEE) to the dataset datatype • IEEE standard for H5T_NATIVE_FLOAT: precision: 32, offset: 0, mantissa size: 23, mantissa position: 0, exponent size: 8, exponent position: 23, sign bit: 1, sign position: 31 • Dataset datatype: precision: 20, offset: 7, mantissa size: 13, mantissa position: 7, exponent size: 6, exponent position: 20, sign bit: 1, sign position: 26 • Before conversion:
| byte 3 | byte 2 | byte 1 | byte 0 |
|SEEEEEEE|EMMMMMMM|MMMMMMMM|MMMMMMMM|
S - sign bit, E - exponent bit, M - mantissa bit
• After conversion:
| byte 3 | byte 2 | byte 1 | byte 0 |
|?????SEE|EEEEMMMM|MMMMMMMM|M???????|
• The sign bit and the truncated mantissa bits are kept • Converting the 8-bit exponent to a 6-bit exponent requires mathematical calculation
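These conversions are triggered automatically whenever H5Dwrite or H5Dread is called with a memory datatype that differs from the dataset datatype; a minimal sketch reusing identifiers from the earlier usage example (dataset is the N-Bit dataset; NX and NY are hypothetical buffer dimensions, not from the original slides):
int data[NX][NY];   /* application buffer at full 32-bit precision */
/* ... fill data ... */

/* Write: each value is converted to the dataset's N-Bit datatype
 * before the N-Bit filter compresses the chunk */
H5Dwrite(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

/* Read: the filter decompresses first, then the library converts the
 * values back to the H5T_NATIVE_INT memory layout */
H5Dread(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);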