Learn about HDF5, a high-volume data format with tools and libraries to manage complex data efficiently. Discover how to overcome I/O bottlenecks in data processing with HDF5.
HDF Experiences with I/O Bottlenecks
Mike Folk, The HDF Group
Collaborative Expedition Workshop: Toward Scalable Data Management, Overcoming I/O Bottlenecks in Full Data Path Processing
June 10, 2008, National Science Foundation
Collaborative Expedition Workshop -- Overcoming I/O Bottlenecks
Topics
• What is HDF?
• I/O bottlenecks and HDF
What is HDF?
HDF is…
• A file format for managing any kind of data
• A software system to manage data in the format
• Designed for high-volume or complex data
• Designed for every size and type of system
• Open format, software library, and tools
• There are two HDFs: HDF4 and HDF5.
• For simplicity we focus on HDF5.
HDF5: The Format
An HDF5 “file” is a container into which you can put your data objects.
[Figure: a container holding sample data objects, including a palette and this table]
lat | lon | temp
----|-----|-----
12 | 23 | 3.1
15 | 24 | 4.2
17 | 21 | 3.6
Structures to organize objects
• “Groups”: “/” (root), “/foo”
• “Datasets”: 3-D array, 2-D array, table (lat, lon, temp), raster images, palette
HDF5 model
• Groups – provide structure among objects
• Datasets – where the primary data goes
  • Data arrays
  • Rich set of datatype options
  • Flexible, efficient storage and I/O
• Attributes, for metadata
Everything else is built essentially from these parts.
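The three building blocks named above (groups, datasets, attributes) can be pictured with a small in-memory toy. This is only a sketch of the data model, not the HDF5 library or its API: the class names `Group` and `Dataset`, the `lookup` helper, and the dataset name `temps` are all hypothetical, and the sample values are borrowed from the lat/lon/temp table shown earlier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """Toy dataset: a data array, a datatype label, and attributes (metadata)."""
    data: List[List[float]]
    dtype: str
    attrs: Dict[str, str] = field(default_factory=dict)


@dataclass
class Group:
    """Toy group: named children (groups or datasets) plus attributes."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, str] = field(default_factory=dict)

    def lookup(self, path: str):
        """Resolve a slash-separated path such as "/foo/temps" from this group."""
        node = self
        for name in path.strip("/").split("/"):
            if name:  # skip the empty component produced by "/" alone
                node = node.children[name]
        return node


# Build "/" (root) and "/foo", then put one dataset inside "/foo".
root = Group()
root.children["foo"] = Group(attrs={"source": "example"})
root.children["foo"].children["temps"] = Dataset(
    data=[[12, 23, 3.1], [15, 24, 4.2], [17, 21, 3.6]],
    dtype="float64",
    attrs={"columns": "lat, lon, temp"},
)

temps = root.lookup("/foo/temps")
```

Groups supply the path structure, the dataset carries the array and its datatype, and attributes hold small pieces of metadata on either kind of object.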
HDF5: The Software
HDF Software
[Figure: layered stack, top to bottom: Tools, Applications, Libraries; HDF I/O Library; HDF File]
Users of HDF Software
[Figure: software stack, top to bottom, with annotations]
• Tools & Applications: Most data consumers are here. Scientific/engineering applications. Domain-specific libraries/API, tools.
• HDF5 Application Programming Interface: Power users (consumers). Applications, tools use this API to create, read, write, query, etc.
• “Virtual file layer” (VFL): Modules to adapt I/O to specific features of system, or do I/O in some special way.
• File system, MPI-IO, SAN, other layers
• “File”: Could be on parallel system, in memory, network, collection of files, etc.
Philosophy: a single platform with multiple uses
• One general format
• One library, with
  • Options to adapt I/O and storage to data needs
  • Layers on top and below
• Ability to interact well with other technologies
• Attention to past, present, future compatibility
Who uses HDF?
Who uses HDF5?
• Applications that deal with big or complex data
• Over 200 different types of apps
• 2+ million product users worldwide
• Academia, government agencies, industry
Applications with large amounts of data
Large simulations
• A simulation can have billions of elements
• Each element can have dozens of associated values
Large images
• Electron tomography
• 25-80 Å resolution
• 4k x 4k x 500 images now
• 8k x 8k x 1k images soon (256 GB)
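The 256 GB figure above can be checked with a few lines of arithmetic. The slide does not state the sample size, so this sketch assumes 4 bytes per voxel and reads "k" as 1024; both are assumptions, and the function name `volume_gib` is made up for illustration.

```python
BYTES_PER_VOXEL = 4  # assumption: e.g. 32-bit samples; not stated on the slide
K = 1024             # assumption: "k" means 1024


def volume_gib(nx: int, ny: int, nz: int, bytes_per_voxel: int = BYTES_PER_VOXEL) -> float:
    """Size of an nx x ny x nz image stack in GiB (2**30 bytes)."""
    return nx * ny * nz * bytes_per_voxel / 2**30


now_gib = volume_gib(4 * K, 4 * K, 500)     # 4k x 4k x 500
soon_gib = volume_gib(8 * K, 8 * K, 1 * K)  # 8k x 8k x 1k

print(f"now:  {now_gib} GiB")
print(f"soon: {soon_gib} GiB")
```

Under these assumptions the current stacks come to about 31 GiB and the upcoming ones to 256 GiB, an eightfold jump in a single dataset.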
It’s not just about size.
Computational fluid dynamics simulation data
Earth Science (EOS)
• Terra: CERES, MISR, MODIS, MOPITT
• Aqua (6/01): CERES, MODIS, AMSR
• Aura: TES, HIRDLS, MLS, OMI
Flight test
• High speed, multi-stream, multi-modal data collection
• Analyze and query specific parameters by time, space
I/O Bottlenecks and HDF
What is an I/O bottleneck?
• "I/O bottleneck" – a phenomenon where the performance or capacity of an entire system is severely limited by some aspect of I/O.
• Two types of bottlenecks
  • Technology – getting the data around quickly
  • Usability/accessibility – acquiring and making use of it
• The role for HDF
  • Try not to cause bottlenecks
  • Offer ways to deal with bottlenecks when they occur
HDF Bottlenecks
[Figure: layered stack, top to bottom: Tools & Applications; HDF5 Application Programming Interface; Low level Interface; File system, MPI-IO, SAN, other layers; File]
Sources of bottlenecks
• Architectural features
• Characteristics of data and information objects
• Accessing and operating on objects
• Usability/accessibility – beyond specialization
Architecture-related I/O bottlenecks
Software that does I/O often needs to operate on different systems. Differences within and among these systems can create I/O bottlenecks, as well as solutions.
Architecture I/O bottleneck examples
Bottlenecks
• Not enough memory, so apps have to swap to disk
• In a cluster, multiple processors doing I/O on the same file simultaneously
• Parallel file system has special features to avoid bottlenecks
HDF response
• Keep an HDF file in core, so I/O goes from core to core
• Adaptable parallel I/O strategies, such as collective I/O, merging many small accesses into one large one
• Implement special I/O drivers in virtual file layer to exploit parallel file systems like PVFS, GPFS, Lustre
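The idea of "merging many small accesses into one large one" can be sketched without any parallel machinery. The code below is not MPI-IO or the HDF5 collective I/O implementation; it only shows the coalescing step on a list of (offset, length) byte ranges. The function name `coalesce` and the `max_gap` parameter are hypothetical.

```python
from typing import List, Tuple

Request = Tuple[int, int]  # (byte offset, length)


def coalesce(requests: List[Request], max_gap: int = 0) -> List[Request]:
    """Merge requests that overlap, touch, or lie within max_gap bytes of each other."""
    merged: List[Request] = []
    for offset, length in sorted(requests):
        if merged:
            last_offset, last_length = merged[-1]
            last_end = last_offset + last_length
            if offset <= last_end + max_gap:
                # Extend the previous request instead of issuing a new one.
                new_end = max(last_end, offset + length)
                merged[-1] = (last_offset, new_end - last_offset)
                continue
        merged.append((offset, length))
    return merged


# Four processes each want an adjacent 100-byte piece; a fifth wants a distant one.
requests = [(300, 100), (0, 100), (100, 100), (200, 100), (1000, 50)]
merged = coalesce(requests)
print(merged)
```

Five small requests collapse to two: one 400-byte access covering the adjacent pieces and the separate 50-byte access. Raising `max_gap` trades a little wasted transfer for fewer I/O operations.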
Characteristics of data and information objects
The size of objects, heterogeneity, and how we represent information. All are potential causes of I/O bottlenecks.
Characteristics of data and information objects: Heterogeneity
Bottlenecks
• Need to represent similar data from different sources, but it comes in different formats.
• Having to convert data for interoperability
HDF responses
• Creation of common models and corresponding I/O libraries, avoiding need to convert
• Add I/O filters to auto-convert data
Characteristics of data and information objects: Size
Bottlenecks
• Metadata/data differences: Hard to do both big I/O and small I/O efficiently, especially on high-end systems tuned for big I/O.
HDF response
• Metadata caching options: Caches metadata & data to avoid re-reading/writing
• Let application control cache
  • App can control when cache is flushed
  • App can advise about cache sizes, replacement strategies
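The caching response above can be sketched as a small write-back cache whose size and flush timing are controlled by the application. This is an illustrative toy, not the HDF5 metadata cache: the class `WriteBackCache`, its counters, and the least-recently-used replacement policy are assumptions chosen for the example, and a plain dict stands in for the file.

```python
from collections import OrderedDict


class WriteBackCache:
    """Toy LRU cache in front of a slow store; the app decides when to flush."""

    def __init__(self, store: dict, capacity: int):
        self.store = store            # stands in for the file on disk
        self.capacity = capacity      # application-chosen cache size
        self.entries = OrderedDict()  # key -> (value, dirty flag)
        self.store_reads = 0
        self.store_writes = 0

    def read(self, key):
        if key in self.entries:       # hit: no trip to the store
            self.entries.move_to_end(key)
            return self.entries[key][0]
        self.store_reads += 1         # miss: fetch once, then keep it
        value = self.store[key]
        self._insert(key, value, dirty=False)
        return value

    def write(self, key, value):
        self._insert(key, value, dirty=True)  # deferred until flush or eviction

    def _insert(self, key, value, dirty):
        self.entries[key] = (value, dirty)
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            old_key, (old_value, old_dirty) = self.entries.popitem(last=False)
            if old_dirty:
                self._write_back(old_key, old_value)

    def _write_back(self, key, value):
        self.store[key] = value
        self.store_writes += 1

    def flush(self):
        """Write every dirty entry back; called when the application chooses."""
        for key, (value, dirty) in list(self.entries.items()):
            if dirty:
                self._write_back(key, value)
                self.entries[key] = (value, False)


store = {"header": "v1", "index": "a"}
cache = WriteBackCache(store, capacity=2)
for _ in range(3):
    cache.read("header")        # only the first read reaches the store
cache.write("header", "v2")
cache.write("header", "v3")     # repeated small updates stay in memory
value_before_flush = store["header"]
cache.flush()                   # one write instead of two
```

Repeated small reads and updates are absorbed in memory, so the store sees one read and one write in total.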
Characteristics of data and information objects: Representation
Bottlenecks
• Different apps need different views of information, requiring transformation
  • Change coordinate systems
  • Ingest to database
  • Change engineering units
HDF response
• Group, index, reference structures provide different views at one time
• I/O filters can operate on data during I/O
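The statement that filters can operate on data during I/O can be illustrated with a small read pipeline. This is a sketch of the idea only and does not use the HDF5 filter interface: `read_with_filters`, `celsius_to_kelvin`, and `round2` are hypothetical names, and treating the stored temperatures as Celsius is an assumption made for the example.

```python
from typing import Callable, Iterable, List

Filter = Callable[[List[float]], List[float]]


def read_with_filters(stored: List[float], filters: Iterable[Filter]) -> List[float]:
    """Return a transformed view of the stored values; the stored copy is untouched."""
    data = list(stored)
    for apply_filter in filters:
        data = apply_filter(data)
    return data


def celsius_to_kelvin(values: List[float]) -> List[float]:
    """Change of engineering units, applied as the data is read."""
    return [v + 273.15 for v in values]


def round2(values: List[float]) -> List[float]:
    """Round to two decimal places to hide floating-point noise."""
    return [round(v, 2) for v in values]


stored_temps_c = [3.1, 4.2, 3.6]  # assumed to be in Celsius for this example
kelvin_view = read_with_filters(stored_temps_c, [celsius_to_kelvin, round2])
raw_view = read_with_filters(stored_temps_c, [])
```

Each application gets the representation it needs at read time, while only one copy of the data is stored.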
Accessing and operating on objects
I/O bottlenecks can occur when data is collected, generated, searched, analyzed, converted, and moved.
Accessing and operating on objects: Sequential R/W
Bottlenecks
• Data from a single source at very high rate
• Data from multiple sources, simultaneously
HDF response
• Use different file structures for sequential vs. random access
• Exploit available system optimizations (e.g. direct I/O to bypass system buffers)
Accessing and operating on objects: Partial access
Bottlenecks
• Access or operate on part of an object, slice through object, etc.
• Access to compressed object
• Perform a query about an object or collection
HDF response
• Offer rich set of partial I/O ops that recognize patterns and optimize for them
• Use chunking to enable fast slicing through arrays
• Compress in chunks, avoiding need to uncompress whole object
• Create and store indexes together with the data
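Compressing in chunks, as described above, means a partial read only has to decompress the chunks it touches. The sketch below shows the technique on a flat byte array using `zlib` from the Python standard library. It is not the HDF5 chunking or filter implementation; the 1024-byte chunk size and the function names `write_chunked` and `read_range` are assumptions for illustration.

```python
import zlib
from typing import List, Tuple

CHUNK = 1024  # bytes per chunk (an assumed value for this example)


def write_chunked(data: bytes, chunk_size: int = CHUNK) -> List[bytes]:
    """Split data into fixed-size chunks and compress each one independently."""
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def read_range(chunks: List[bytes], start: int, length: int,
               chunk_size: int = CHUNK) -> Tuple[bytes, int]:
    """Return (data[start:start+length], number of chunks decompressed)."""
    if length <= 0:
        return b"", 0
    first = start // chunk_size
    last = min((start + length - 1) // chunk_size, len(chunks) - 1)
    # Decompress only the chunks that overlap the requested range.
    parts = [zlib.decompress(chunks[i]) for i in range(first, last + 1)]
    buffer = b"".join(parts)
    offset = start - first * chunk_size
    return buffer[offset:offset + length], len(parts)


data = bytes(i % 251 for i in range(10 * CHUNK))  # 10 chunks of sample data
chunks = write_chunked(data)

piece, touched = read_range(chunks, start=2500, length=100)
spanning, touched_spanning = read_range(chunks, start=1000, length=100)
```

The first read falls inside a single chunk, so one of the ten chunks is decompressed; the second crosses a chunk boundary and touches two. Without chunking, either read would require decompressing the whole object.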
Accessing and operating on objects: Remote access
Bottlenecks
• All of the above are exacerbated when the data is accessed from a distance or over a slow network
HDF response
• Avoid moving the data. Send operation to the data vs. data to operation:
  • Put HDF5 software inside remote data system, such as iRODS
  • Implement remote query/access protocols, such as OPeNDAP
Usability/accessibility: Beyond specialization
Data is collected for specific purposes, then frequently turns out to have many other uses. Too often only the first users (the specialists) have the knowledge and tools to access the data and interpret it meaningfully.
The gaps between producer and user may be social, political, economic, semantic, temporal. The greater the gaps between producer and consumer, the greater are the challenges to usability and accessibility.
Usability/accessibility bottlenecks
• What data do I need and where do I find it?
• Now that I have it, what does this data really mean? Provenance? Quality?
• What tools do I need to access data? Do they exist? How do I use them?
• How do I transform the data to representations that address my information needs?
• How do I integrate and combine this data with my other data to create new information?
• Who can help me?
HDF responses to support usability
• Layer the software to make HDF accessible at different levels of expertise
• Develop and promote standard models and representations in HDF (EOS, netCDF, EXPRESS)
• Develop and promote metadata standards and their representation in HDF
• Provide simple tools to view the data
• Provide tools to export just the data needed to other formats
• Work with tool builders, open & proprietary
Supporting usability across time
• Export to simple, enduring formats, such as XML
• Create maps to the data
• Define and store Access Information Packages
• Be tenacious about backward compatibility
Philosophy: a single platform with multiple uses
• One general format
• One library, with
  • Options to adapt I/O and storage to data needs
  • Layers on top and below
• Ability to interact well with other technologies
• Attention to past, present, future compatibility
Thank you
Mike Folk, mfolk@hdfgroup.org