HDF5 Tutorial
37th SPEEDUP Workshop on HPC
Albert Cheng, Elena Pourmal
The HDF Group
SPEEDUP Workshop - HDF5 Tutorial
Outline
• 8:00 – 9:00 Introduction to HDF5 data, programming models and tools
• 9:00 – 9:30 Advanced features
• 10:00 – 12:00 Introduction to Parallel HDF5
• 13:15 – 14:15 Caching and buffering in HDF5
• 14:45 – 16:45 New features in HDF5 1.8.0
Introduction to HDF5 Data, Programming Models and Tools
What is HDF?
HDF is…
• A file format for managing any kind of data
• A software system to manage data in the format
• Designed for high volume or complex data
• Designed for every size and type of system
• Open format and software library, tools
• There are two HDFs: HDF4 and HDF5
• Today we focus on HDF5
HDF5: The Format
An HDF5 “file” is a container into which you can put your data objects.
[Figure: a container holding a palette and a table]
lat | lon | temp
----|-----|-----
12  | 23  | 3.1
15  | 24  | 4.2
17  | 21  | 3.6
Structures to organize objects
• “Groups”: “/” (root), “/foo”
• “Datasets”: 3-D array, 2-D array, table (lat, lon, temp), raster images, palette
HDF5 model
• Groups – provide structure among objects
• Datasets – where the primary data goes
  • Data arrays
  • Rich set of datatype options
  • Flexible, efficient storage and I/O
• Attributes, for metadata
Everything else is built essentially from these parts.
HDF5: The Software
HDF5 Software
[Diagram, top to bottom: Tools, Applications, Libraries; HDF5 I/O Library; HDF5 File]
Users of HDF5 Software
Layers, from top to bottom:
• Tools & Applications: most data consumers are here. Scientific/engineering applications; domain-specific libraries/APIs and tools.
• HDF5 Application Programming Interface: applications and tools use this API to create, read, write, query, etc. Power users (consumers) work at this level.
• “Virtual file layer” (VFL): modules to adapt I/O to specific features of a system, or to do I/O in some special way.
• File system, MPI-IO, SAN, other layers
• “HDF5 File”: the “file” could be on a parallel system, in memory, on a network, a collection of files, etc.
HDF5 Philosophy
A single platform with multiple uses
• One general format
• One library, with
  • Options to adapt I/O and storage to data needs
  • Layers on top and below
• Ability to interact well with other technologies
• Attention to past, present, future compatibility
Who uses HDF5?
Who uses HDF5?
• Applications that deal with big or complex data
• Over 200 different types of apps
• 2+ million product users world-wide
• Academia, government agencies, industry
Applications with large amounts of data
NASA EOS remote sensing data
• The HDF format is the standard file format for storing data from NASA's Earth Observing System (EOS) mission.
• Petabytes of data are stored in HDF and HDF5 to support the Global Climate Change Research Program.
Large simulations
• A simulation can have billions of elements
• Each element can have dozens of associated values
Large images
• Electron tomography
• 25-80 Å resolution
• 4k x 4k x 500 images now
• 8k x 8k x 1k images (soon 256 GB)
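The 256 GB figure can be checked with a quick calculation. The sketch below assumes decimal units (k = 1000, GB = 10^9 bytes) and 4 bytes per voxel. The slide states neither, but 4 bytes per voxel is the value that reproduces 256 GB.

```python
from math import prod

BYTES_PER_VOXEL = 4  # assumption: not stated on the slide


def volume_bytes(dims, bytes_per_voxel=BYTES_PER_VOXEL):
    """Raw (uncompressed) size in bytes of a dense array with the given dimensions."""
    return prod(dims) * bytes_per_voxel


current = volume_bytes((4_000, 4_000, 500))    # 4k x 4k x 500
planned = volume_bytes((8_000, 8_000, 1_000))  # 8k x 8k x 1k

print(current / 10**9, "GB")
print(planned / 10**9, "GB")
```

Under these assumptions the current image size works out to one eighth of the planned one.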
It is not just about size
Data complexity
(Thanks to Mark Miller, LLNL)
Complex relationships within data
[Figure labels: SNP Score Contig Summaries Discrepancies Contig Qualities Coverage Depth Trace Reads Aligned bases Read quality Contig Percent match]
• High speed, multi-stream, multi-modal data collection
• Analyze and query specific parameters by time, space
• Different views of data
Flight test
HDF5 Data Model
HDF5 model (recap)
• Groups – provide structure among objects
• Datasets – where the primary data goes
  • Data arrays
  • Rich set of datatype options
  • Flexible, efficient storage and I/O
• Attributes, for metadata
• Other objects
  • Links (point to data in a file or in another HDF5 file)
  • Datatypes (can be stored for complex structures and reused by multiple datasets)
HDF5 Dataset
Metadata:
• Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
• Datatype: IEEE 32-bit float
• Attributes: Time = 32.4, Pressure = 987, Temp = 56
• Storage info: Chunked, Compressed
Data: the array elements themselves
HDF5 Dataspace
• Two roles
  • Dataspace contains spatial info about a dataset stored in a file
    • Rank and dimensions
    • Permanent part of dataset definition
  • Dataspace describes application’s data buffer and data elements participating in I/O
[Figure: one dataspace with Rank = 2, Dimensions = 4x6; another with Rank = 1, Dimensions = 12]
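As a rough illustration of what a dataspace records, here is a minimal pure-Python sketch. The `Dataspace` class is hypothetical and is not part of the HDF5 library. It models only rank and dimensions, using the 4 x 5 x 7 dataset of 32-bit floats from the dataset slide and the two shapes from the figure.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Dataspace:
    """Toy stand-in for an HDF5 dataspace: just a tuple of dimension sizes."""
    dims: tuple

    @property
    def rank(self):
        return len(self.dims)

    @property
    def npoints(self):
        return prod(self.dims)


file_space = Dataspace((4, 5, 7))
raw_bytes = file_space.npoints * 4  # 4 bytes per IEEE 32-bit float

two_d = Dataspace((4, 6))   # Rank = 2, Dimensions = 4x6
one_d = Dataspace((12,))    # Rank = 1, Dimensions = 12

print(file_space.rank, file_space.npoints, raw_bytes)
print(two_d.npoints, one_d.npoints)
```

The 4x6 space holds 24 elements and the rank-1 space holds 12, so at most 12 of the 24 elements fit in the smaller one at a time.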
HDF5 Datatype
• Datatype – how to interpret a data element
• Permanent part of the dataset definition
• Two classes: atomic and compound
• Can be stored in a file as an HDF5 object (HDF5 committed datatype)
• Can be shared among different datasets
HDF5 Datatype
• HDF5 atomic types include
  • normal integer & float
  • user-definable (e.g., 13-bit integer)
  • variable length types (e.g., strings)
  • references to objects/dataset regions
  • enumeration - names mapped to integers
  • array
• HDF5 compound types
  • Comparable to C structs (“records”)
  • Members can be atomic or compound types
HDF5 dataset: array of records
• Dimensionality: 5 x 3
• Datatype: Record with fields int8, int4, int16, and a 2x3x2 array of float32
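Because compound types are comparable to C structs, Python's standard `struct` module gives a rough feel for the record above. This is a sketch, not HDF5 code. It assumes a packed little-endian layout, reads the field names as bit widths, and stores the `int4` field in one byte because `struct` has no 4-bit type. The slide states none of these.

```python
import struct

# One record: int8, int4 (stored in one byte here), int16,
# then 2 * 3 * 2 = 12 float32 values. "<" means little-endian, no padding.
RECORD = struct.Struct("<bbh12f")

ROWS, COLS = 5, 3  # dimensionality from the slide

values = [float(i) for i in range(12)]
packed = RECORD.pack(-1, 7, 1000, *values)
unpacked = RECORD.unpack(packed)

record_bytes = RECORD.size
dataset_bytes = RECORD.size * ROWS * COLS

print(record_bytes, dataset_bytes)
print(unpacked[:3])
```

A real HDF5 compound type may use different member sizes, offsets and padding than this packed layout.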
Special storage options for dataset
• chunked: Better subsetting access time; extendable
• compressed: Improves storage efficiency, transmission speed
• extendable: Arrays can be extended in any direction
• external: Metadata in HDF5 file, raw data in a binary file
[Figure: Dataset “Fred” split across File A (Metadata for Fred) and File B (Data for Fred)]
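To see why chunking affects subsetting, consider how element coordinates map to chunks. The sketch below is a simplified model, not the library's internal algorithm. The function names and the 100 x 100 dataset with 10 x 10 chunks are made up for illustration.

```python
def chunk_of(index, chunk_shape):
    """Return the chunk coordinates that contain the element at `index`."""
    return tuple(i // c for i, c in zip(index, chunk_shape))


def chunks_touched(start, count, chunk_shape):
    """Return the set of chunks overlapped by a rectangular selection."""
    first = chunk_of(start, chunk_shape)
    last = chunk_of(tuple(s + n - 1 for s, n in zip(start, count)), chunk_shape)
    touched = {()}
    # Build the Cartesian product of chunk ranges, one dimension at a time
    for lo, hi in zip(first, last):
        touched = {t + (k,) for t in touched for k in range(lo, hi + 1)}
    return touched


chunk_shape = (10, 10)

# A 5 x 5 block starting at (8, 8) straddles four chunks
block = chunks_touched((8, 8), (5, 5), chunk_shape)

# One full column of a 100 x 100 dataset touches ten chunks
column = chunks_touched((0, 42), (100, 1), chunk_shape)

print(sorted(block))
print(len(column))
```

In this model a single-column read touches 10 of the 100 chunks, while a small block that crosses chunk boundaries touches several chunks at once.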
HDF5 Attribute
• Attribute – data of the form “name = value”, attached to an object by application
• Operations similar to dataset operations, but …
  • Not extendible
  • No compression or partial I/O
• Can be overwritten, deleted, added during the “life” of a dataset
HDF5 Group
• A mechanism for organizing collections of related objects
• Every file starts with a root group “/”
• Similar to UNIX directories
• Can have attributes
Path to HDF5 object in a file
• / (root)
• /X
• /Y
• /Y/temp
• /Y/bar/temp
[Figure: root “/” contains X and Y; Y contains temp and bar; bar contains temp]
Shared HDF5 objects
• /A/P
• /B/R
• /C/P
[Figure labels: “/”, A, B, C, P, R, P]
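Path lookup and object sharing can be mimicked with nested dictionaries. This is only an analogy: the `resolve` helper is hypothetical, the stored values are placeholders, and the code assumes that /A/P, /B/R and /C/P all name the same object.

```python
def resolve(root, path):
    """Walk an absolute path such as "/Y/bar/temp" through nested dicts."""
    node = root
    for name in path.strip("/").split("/"):
        if name:  # "/" alone resolves to the root itself
            node = node[name]
    return node


# Tree from the path slide: two different objects are both called "temp"
tree = {"X": {}, "Y": {"temp": [3.1, 4.2], "bar": {"temp": [3.6]}}}

# Shared object (assumption): one dataset reachable under three names
shared = [1, 2, 3]
links = {"A": {"P": shared}, "B": {"R": shared}, "C": {"P": shared}}

print(resolve(tree, "/Y/temp"), resolve(tree, "/Y/bar/temp"))
print(resolve(links, "/A/P") is resolve(links, "/B/R"))
```

The same name can refer to different objects in different groups, and different paths can lead to one object.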
HDF5 Data Model Example
ENSIGHT: Automotive crash simulation
Automotive crash simulation [three image slides]
Solid modeling [two image slides]
HDF5 mesh
Mesh Example, in HDFView
HDF5 Software
HDF5 software stack
[Diagram, top to bottom: Tools & Applications; HDF I/O Library; HDF File]
Structure of HDF5 Library
• Object API (C, Fortran 90, Java, C++)
  • Specify objects and transformation properties
  • Invoke data movement operations and data transformations
• Library internals
  • Performs data transformations and other prep for I/O
  • Configurable transformations (compression, etc.)
• Virtual file I/O (C only)
  • Perform byte-stream I/O operations (open/close, read/write, seek)
  • User-implementable I/O (stdio, network, memory, etc.)
Write – from memory to disk
[Figure: memory and disk layouts]
Partial I/O
Move just part of a dataset
(a) Hyperslab from a 2D array to the corner of a smaller 2D array
(b) Regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array
[Figures: disk and memory layouts]
Partial I/O
Move just part of a dataset
(c) A sequence of points from a 2D array to a sequence of points in a 3D array
(d) Union of hyperslabs in file to union of hyperslabs in memory
[Figures: memory and disk layouts]
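A hyperslab selection is commonly described by start, stride, count and block values per dimension. The pure-Python sketch below mimics cases (a) and (b) on a list of lists. `select_hyperslab` is a hypothetical helper written for this illustration and performs no HDF5 I/O; the array sizes and the fill value -1 are arbitrary.

```python
from itertools import product


def select_hyperslab(start, stride, count, block):
    """Return (row, col) coordinates of a 2-D hyperslab in row-major order."""
    rows = [start[0] + i * stride[0] + b
            for i in range(count[0]) for b in range(block[0])]
    cols = [start[1] + j * stride[1] + b
            for j in range(count[1]) for b in range(block[1])]
    return list(product(rows, cols))


# "Disk" array: 6 x 8, each value encodes its own position (10 * row + col)
disk = [[10 * r + c for c in range(8)] for r in range(6)]

# (a) 2 x 3 hyperslab at (1, 2) copied to the corner of a 3 x 4 memory array
memory = [[-1] * 4 for _ in range(3)]
for r, c in select_hyperslab((1, 2), (1, 1), (2, 3), (1, 1)):
    memory[r - 1][c - 2] = disk[r][c]

# (b) regular series of 1 x 2 blocks copied contiguously into a
#     1-D buffer, starting at offset 3
buffer = [-1] * 16
coords = select_hyperslab((0, 0), (2, 4), (3, 2), (1, 2))
for k, (r, c) in enumerate(coords):
    buffer[3 + k] = disk[r][c]

print(memory)
print(buffer)
```

Only the selected elements move; the rest of the destination keeps its fill value.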
Layers – parallel example
Application I/O flows through many layers from application to disk.
• Parallel computing system (Linux cluster): four compute nodes
• I/O library (HDF5)
• Parallel I/O library (MPI-I/O)
• Parallel file system (GPFS)
• Switch network/I/O servers
• Disk architecture & layout of data on disk