Indexing HDF5: A Survey

Joel Plutchak
The HDF Group, Champaign, Illinois, USA
This work was supported by NASA/GSFC under Raytheon Co. contract number NNG10HP02C.

The Technology
• The HDF5 hierarchical data file format and API are flexible: they support self-describing, portable, and compact storage, as well as efficient I/O.
• HDF5 is a well-described and well-supported format that is used in a wide variety of disciplines.

The Problem
• The HDF5 API does not include mechanisms to efficiently find and access data based on data values, as one would when querying a relational database.
• Members of the HDF community have developed this capability so that their applications can quickly access targeted pieces of data, rapidly searching and selecting interesting portions based on ad hoc search criteria.

A Solution
Solutions to this problem are called indexing. An indexing layer sits between the HDF5 API and the application. It builds an index on one or more parameters and saves enough information in that index to find and retrieve specific parts of one or more datasets in an HDF5 file more efficiently.
[Diagram: Application, Query, Index, HDF5 API, HDF5 File]
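The idea of an index can be illustrated with a minimal pure-Python sketch. It does not use HDF5 or any of the packages surveyed here; the names `build_index` and `query_range` and the sample values are hypothetical. The index stores the values in sorted order together with their original positions, so a range query becomes two binary searches instead of a scan over every element.

```python
import bisect

def build_index(values):
    """Return (sorted_values, positions) where sorted_values[i] == values[positions[i]]."""
    positions = sorted(range(len(values)), key=values.__getitem__)
    sorted_values = [values[p] for p in positions]
    return sorted_values, positions

def query_range(index, low, high):
    """Return the original positions of all values v with low < v < high."""
    sorted_values, positions = index
    start = bisect.bisect_right(sorted_values, low)  # first entry strictly greater than low
    stop = bisect.bisect_left(sorted_values, high)   # first entry greater than or equal to high
    return sorted(positions[start:stop])

# Hypothetical dataset contents; a real application would read these from a file.
temperatures = [41.0, 38.5, 39.1, 42.6, 40.2, 44.0, 39.9]
index = build_index(temperatures)
hits = query_range(index, 39.1, 42.6)
print(hits)
```

The returned positions could then drive a selection so that only the matching elements are read. The packages described in this survey use more elaborate structures, but the division of labor is the same: build the index once, then answer many queries from it.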

Implementations
Several implementations add indexed access to HDF5 files. A few of them are:
• PyTables
• FastQuery / FastBit
• Alacrity
• HDF5 (prototype)
• Other experimental work in progress

PyTables
• Uses the Python programming language
• Built on top of the HDF5 library and the NumPy package
• Uses Optimized Partially Sorted Index (OPSI) technology designed for fast access to very large (>100M rows) tables

PyTables Example
• Create a table:
    table = h5file.create_table(group, 'readout', Particle, "Readout example")
• Query a table:
    condition = '(name == "Particle: 5") | (name == "Particle: 7")'
    for record in table.where(condition):
        pass  # do something with "record"

PyTables
Limitations:
• No support for relationships between datasets
Future work:
• No specifics; a continuing effort that welcomes additional developers, testers, and users
• Future maintenance and extended development proposals are underway
• The HDF Group is very interested in taking a significant role in this work as it moves forward

Alacrity
• Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying
• Exploits the representation of floating-point values by binning on the most significant bits, using an inverted index to map each bin to the positions of the values it contains
• The software is a research vehicle for a group at North Carolina State University
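Binning on significant bits with an inverted index can be sketched in a few lines of pure Python. This is an illustration of the general idea only, not Alacrity's actual implementation or API: the 16-bit bin width, the function names, and the sample data are all assumptions, and the sketch handles only non-negative values and does not include the compression side of the technique.

```python
import struct
from collections import defaultdict

HIGH_BITS = 16  # assumption: sign bit, 11 exponent bits, and 4 leading mantissa bits of a float64

def bin_key(value):
    """Return the HIGH_BITS most significant bits of the IEEE 754 double encoding."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", value))
    return bits >> (64 - HIGH_BITS)

def build_inverted_index(values):
    """Map each bin key to the list of positions whose value falls in that bin."""
    index = defaultdict(list)
    for position, value in enumerate(values):
        index[bin_key(value)].append(position)
    return index

def query_range(values, index, low, high):
    """Return positions with low <= v <= high; assumes 0 <= low <= high."""
    lo_key, hi_key = bin_key(low), bin_key(high)
    hits = []
    for key, positions in index.items():
        if lo_key < key < hi_key:
            # For non-negative doubles, bit order matches numeric order,
            # so every value in an interior bin satisfies the query.
            hits.extend(positions)
        elif key in (lo_key, hi_key):
            # Boundary bins hold only candidates; check the actual values.
            hits.extend(p for p in positions if low <= values[p] <= high)
    return sorted(hits)

# Hypothetical data values.
pressures = [1.0, 1.5, 2.0, 3.75, 1000.0, 0.25, 2.5]
index = build_inverted_index(pressures)
hits = query_range(pressures, index, 1.2, 3.0)
print(hits)
```

Only the two boundary bins require a look at the stored values; interior bins are answered from the index alone, which is where the speedup comes from.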

FastQuery / FastBit
• FastQuery is an extension to HDF5 from the Visualization Group at Lawrence Berkeley National Laboratory (LBNL)
• Based on LBNL’s FastBit, an efficient searching technology that uses bitmap indexing to process complex, multi-dimensional ad hoc queries on read-only numeric data
• Extends HDF5’s hyperslab selection mechanism to allow arbitrary range conditions on the data values contained in the datasets
• Compound queries can span multiple datasets
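Bitmap indexing and compound queries across datasets can be illustrated with a small pure-Python sketch. It is not FastBit or FastQuery code: the function names, bin edges, and sample data are hypothetical, Python integers stand in for bitmaps, and the sketch omits bitmap compression entirely. Each bin of each variable gets one bitmap in which bit i is set when row i falls in that bin.

```python
def build_bitmap_index(values, edges):
    """Return one bitmap per bin; bin b covers edges[b] <= v < edges[b + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for row, value in enumerate(values):
        for b in range(len(edges) - 1):
            if edges[b] <= value < edges[b + 1]:
                bitmaps[b] |= 1 << row
                break
    return bitmaps

def rows_in_bins(bitmaps, first_bin, last_bin):
    """OR together the bitmaps of bins first_bin..last_bin (inclusive)."""
    result = 0
    for b in range(first_bin, last_bin + 1):
        result |= bitmaps[b]
    return result

def bitmap_to_rows(bitmap):
    """List the row numbers whose bits are set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]

# Two hypothetical variables stored as separate datasets with aligned rows.
temperature = [12.0, 25.5, 31.0, 8.0, 27.0, 35.5]
pressure = [0.9, 1.4, 2.2, 1.1, 2.8, 0.4]
temp_index = build_bitmap_index(temperature, [0, 10, 20, 30, 40])
pres_index = build_bitmap_index(pressure, [0, 1, 2, 3])

# Compound query: 20 <= temperature < 40 AND 1 <= pressure < 3.
# Both conditions align with bin edges, so no candidate check is needed.
matches = rows_in_bins(temp_index, 2, 3) & rows_in_bins(pres_index, 1, 2)
rows = bitmap_to_rows(matches)
print(rows)
```

Because each condition reduces to a bitmap, combining conditions on different variables is a single bitwise AND or OR, which is why this style of index suits multi-dimensional ad hoc queries on read-only data. A query whose bounds fall inside a bin would additionally need to check the values in the boundary bins.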

FastQuery / FastBit
Assumptions:
• Data is 0-3 dimensional and block-structured
• Limited datatypes: float, double, int32, int64, byte
• Two-level hierarchical organization: TimeStep, VariableName
Future work:
• Arbitrary nesting
• More data schemas (unstructured, AMR, etc.)

HDF5 Data Analysis Extensions
The HDF Group is developing support for indexing and querying to enable application developers to create complex and high-performance queries on both metadata and data elements within an HDF5 container. This support takes the form of objects and associated APIs:
• Query Objects: The H5Q API is used to define a query and apply it to an HDF5 container
• View Objects: The H5V API is used to generate a selection from a query
• Index Objects: The H5X API is used to attach an index to data and build it; it is plug-in based to leverage multiple technologies
Note: These extensions were developed under Intel’s subcontract with Lawrence Livermore National Security, LLC under U.S. Department of Energy contract DE-AC52-07NA27344.

HDF5 Data Analysis Extensions Example
Add an index to an existing dataset:
    hid_t dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
    /* Add indexing information */
    H5Xcreate(dataset, H5X_PLUGIN_FASTBIT, H5P_DEFAULT);
    H5Dclose(dataset);
Create and apply a query:
    float query_lb = 39.1f, query_ub = 42.6f;
    hid_t query, query1, query2, dataspace;
    /* Create a simple query: 39.1 < x */
    query1 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_GREATER_THAN, H5T_NATIVE_FLOAT, &query_lb);
    /* Create a second simple query: x < 42.6 */
    query2 = H5Qcreate(H5Q_TYPE_DATA_ELEM, H5Q_MATCH_LESS_THAN, H5T_NATIVE_FLOAT, &query_ub);
    /* Combine the queries: 39.1 < x < 42.6 */
    query = H5Qcombine(query1, H5Q_COMBINE_AND, query2);
    /* Use the query to get a selection */
    dataset = H5Dopen(file, dataset_name, H5P_DEFAULT);
    H5Dquery(dataset, query, &dataspace);
    /* Read data here using dataspace */
    H5Dclose(dataset);

HDF5 Data Analysis Extensions Status
Phase I status (2014):
• Prototype implementations for the H5Q, H5V, and H5X APIs
• H5X API plugins for the Alacrity and FastBit technologies
• Incremental update of data is not supported by the indexing packages
Current work (started July 1):
• Views generated from queries to abstract selection results on multiple objects
• Support for indexing on chunked datasets
• Support for compound types
• Support for parallel indexing
• Query optimization
• Additional indexing plugins

Summary
• A variety of indexing methods exist that can speed up targeted access to data in HDF5 files.
• Capabilities and underlying technologies differ, so choose the best fit for your application.
• Work is ongoing. Let the developers know about your needs and experiences!

References & Sources
• PyTables
  • http://www.pytables.org/index.html
• Alacrity
  • J. Jenkins, I. Arkatkar, S. Lakshminarasimhan, D. A. Boyuka II, E. Schendel, N. Shah, S. Ethier, C.-S. Chang, J. Chen, H. Kolla, R. Ross, S. Klasky, N. Samatova, “ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying,” Transactions on Large-Scale Data- and Knowledge-Centered Systems, vol. 10 (2013).
• FastQuery / FastBit
  • http://www-vis.lbl.gov/Events/SC05/HDF5FastQuery/
  • K. Wu, “FastBit: an efficient indexing technology for accelerating data-intensive science,” Journal of Physics: Conference Series, vol. 16, no. 1 (2005).
  • “HDF5-FastQuery: An API for Simplifying Access to Data Storage, Retrieval, Indexing and Querying,” Report Number LBNL/PUB-958 (2006).
• HDF Data Analysis Extensions
  • J. Soumagne, Q. Koziol, “RFC: Data Analysis Extensions,” RFC THG 2014-07-17.v4, The HDF Group (2014).

Thank You
