Projection Indexes in HDF5 Rishi Rakesh Sinha The HDF Group
Science Produces Large Datasets • Observation/experiment driven • Simulation driven • Information driven [Figure labels: 144 MB/hr, 200 GB/run, > 7GB/expt]
Why Not Commercial DBMSs? • Proprietary format • Lack of portability • Low scalability • Lack of desirable access modes • Presence of expensive concurrency control and logging mechanisms • Expensive parallel versions
State of the Art Not Enough • Scientific file formats and associated I/O APIs • Concentrating on HDF5 • Data recovery is navigational • Subsetting only on a small set of attributes
Why Indexes? [Figure panels: Easy, Not So Easy]
Previous Indexing Efforts • Implicit indexing in HDF5 • JPL use of HDF Vdatas • HDF-EOS point data • PyTables • HDF5 internal B-Tree structures
Why a Standard Indexing API? • Avoid duplication of effort • PyTables • Standardize indexing in HDF5 • Standard API can be differently implemented • Make indexes portable • Store indexes in HDF5 files
H5IN API • Create_index • Parameters: location of index, location of data, binning information, memory limits • Returns: location of the index • Query • Parameters: dataset to query, query string • Returns: selection representing subset of the data corresponding to the query
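The two calls on this slide can be illustrated with a small in-memory sketch. It is not the H5IN implementation: the function names `create_index` and `query`, the `bin_edges` argument standing in for "binning information", and the use of plain Python lists instead of HDF5 datasets are all assumptions made for illustration. Memory limits and the query string are left out.

```python
from bisect import bisect_right

def create_index(values, bin_edges):
    """Hypothetical stand-in for Create_index: bucket element positions by value bin."""
    bins = [[] for _ in range(len(bin_edges) + 1)]
    for pos, v in enumerate(values):
        # bisect_right gives the bin number: 0 for v < edges[0], and so on.
        bins[bisect_right(bin_edges, v)].append(pos)
    return {"edges": list(bin_edges), "bins": bins}

def query(index, values, low, high):
    """Hypothetical stand-in for Query: positions where low < value < high."""
    first = bisect_right(index["edges"], low)
    last = bisect_right(index["edges"], high)
    selection = []
    # Only bins that can overlap (low, high) are scanned; values are rechecked
    # because a bin may straddle a query bound.
    for b in range(first, last + 1):
        selection.extend(p for p in index["bins"][b] if low < values[p] < high)
    return sorted(selection)

temperature = [95, 120, 180, 210, 150, 60]
t_index = create_index(temperature, bin_edges=[100, 200])
hits = query(t_index, temperature, 100, 200)
print(hits)
```

The result is a list of element positions, which plays the role of the selection that the slide says Query returns.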
Design Decisions • Limited scope of the prototype • Index stored in a separate dataset • Returns a selection • Projection index • Support for simple boolean queries
Limited Scope • 1st indexing prototype in HDF5 • Presence of implicit indexing • Index on single datasets • Query over single datasets • Conditions should be over a single dataset • Result could be mapped to a separate dataset
Index Storage [Diagram labels: Root Group: /; groups DAY1, DAY2, DAY3, DAY4; Location Data (F1, F2, F3); Pressure; Temperature]
Index Storage [Diagram labels: Root Group: /; DAY3; Location Data (F1, F2, F3); LD_INDEX (F1, F2)]
Index Storage [Diagram labels: Root Group: /; DAY3; Temperature; Pressure; T_IN; P_IN]
Returns a Selection • Example: FIND PRESSURE WHERE TEMP IN [100, 200] • Concise Storage • Efficient Boolean operations [Diagram labels: Pressure, Temperature]
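A rough sketch of why returning a selection makes Boolean operations cheap: if a selection is modeled as a set of element positions, AND and OR become set intersection and union, and the same positions can be applied to a different dataset. The `select` helper and the sample values are hypothetical, and a Python set is only a stand-in for an HDF5 selection.

```python
def select(values, low, high):
    """Return the set of element positions whose value lies in [low, high]."""
    return {i for i, v in enumerate(values) if low <= v <= high}

temperature = [90, 150, 200, 250, 120]
pressure = [29, 30, 31, 30, 29]

temp_sel = select(temperature, 100, 200)  # TEMP IN [100, 200]
pres_sel = select(pressure, 30, 31)       # a second, made-up condition

both = temp_sel & pres_sel    # AND of two conditions
either = temp_sel | pres_sel  # OR of two conditions

# FIND PRESSURE WHERE TEMP IN [100, 200]: apply the temperature
# selection to the pressure dataset.
found_pressure = [pressure[i] for i in sorted(temp_sel)]
print(found_pressure)
```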
Projection Index [Diagram labels: Temp 40, 50, 60; Pressure 29, 30, 31; markers A, B, C]
Why Projection Index? • Data is read only • A dataset, once written, is mostly not changed • Index does not need to be updated • Projection indexes are well suited • Number of disk accesses is the same as for a B-Tree • Multidimensional queries are not being considered
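The slides do not spell out the on-disk layout of the index, so the following is only one possible reading: a value-sorted projection of a single dataset, kept alongside the original element positions and searched by bisection. Because the data is read only, the sorted arrays are built once and never updated. All names here are hypothetical.

```python
from bisect import bisect_left, bisect_right

def build_projection_index(values):
    """Build a value-sorted copy of one dataset plus the original positions."""
    pairs = sorted((v, pos) for pos, v in enumerate(values))
    keys = [v for v, _ in pairs]
    positions = [pos for _, pos in pairs]
    return keys, positions

def range_lookup(index, low, high):
    """Return original positions whose value lies in [low, high]."""
    keys, positions = index
    start = bisect_left(keys, low)    # first key >= low
    stop = bisect_right(keys, high)   # one past the last key <= high
    return sorted(positions[start:stop])

temp = [50, 40, 60, 40, 55]
idx = build_projection_index(temp)
rows = range_lookup(idx, 40, 50)
print(rows)
```

Each lookup needs two binary searches over the sorted keys, which is the logarithmic behavior one would also expect from a B-Tree, without any machinery for updates.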
Only Simple Boolean Queries • Query format: SELECT SELECTION WHERE c11 < Attribute1 < c12 AND c21 < Attribute2 < c22 … • Because the results are selections, Boolean operations can be done inside the library
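As an illustration of the query format, the sketch below parses the WHERE part of such a query into (attribute, lower, upper) triples and evaluates them by intersecting per-condition selections. The parser, the `evaluate` helper, the dataset names and the sample values are all assumptions; the slides do not describe how the prototype parses or evaluates queries.

```python
import re

# One condition of the form "c1 < Attribute < c2" with numeric bounds.
CONDITION = re.compile(
    r"^\s*(-?\d+(?:\.\d+)?)\s*<\s*(\w+)\s*<\s*(-?\d+(?:\.\d+)?)\s*$"
)

def parse_where(where_clause):
    """Split a WHERE clause on AND and parse each range condition."""
    conditions = []
    for part in re.split(r"\s+AND\s+", where_clause.strip()):
        match = CONDITION.match(part)
        if match is None:
            raise ValueError(f"unsupported condition: {part!r}")
        low, name, high = match.groups()
        conditions.append((name, float(low), float(high)))
    return conditions

def evaluate(conditions, datasets):
    """AND the conditions together by intersecting sets of positions."""
    selection = None
    for name, low, high in conditions:
        hits = {i for i, v in enumerate(datasets[name]) if low < v < high}
        selection = hits if selection is None else selection & hits
    return sorted(selection) if selection is not None else []

datasets = {
    "Temperature": [90, 150, 200, 250, 120],
    "Pressure": [29, 30, 31, 30, 29],
}
conds = parse_where("100 < Temperature < 210 AND 29 < Pressure < 32")
result = evaluate(conds, datasets)
print(result)
```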
Conclusion • Developing a standard indexing API in HDF5 • Creating a proof-of-concept prototype using projection indexes • Taking a first step towards developing a query language for HDF5
Future Work • Multi-dimensionality • Multiple datasets in same file • Multiple datasets across files • Indexes on attributes • Allow user to index subset of datasets