
HDF Experiences with I/O Bottlenecks

Mike Folk, The HDF Group. Collaborative Expedition Workshop: Toward Scalable Data Management, Overcoming I/O Bottlenecks in Full Data Path Processing. June 10, 2008, National Science Foundation.

Presentation Transcript


  1. HDF Experiences with I/O Bottlenecks. Mike Folk, The HDF Group. Collaborative Expedition Workshop: Toward Scalable Data Management, Overcoming I/O Bottlenecks in Full Data Path Processing. June 10, 2008, National Science Foundation. Collaborative Expedition Workshop -- Overcoming I/O Bottlenecks

  2. Topics
     • What is HDF?
     • I/O bottlenecks and HDF

  3. What is HDF?

  4. HDF is…
     • A file format for managing any kind of data
     • A software system to manage data in the format
     • Designed for high volume or complex data
     • Designed for every size and type of system
     • An open format and software library, with tools
     • There are two HDFs: HDF4 and HDF5.
     • For simplicity we focus on HDF5.

  5. HDF5: The Format

  6. An HDF5 “file” is a container… into which you can put your data objects.
     [Figure: a container holding example objects, including a palette and this table]
     lat | lon | temp
     ----|-----|-----
     12  | 23  | 3.1
     15  | 24  | 4.2
     17  | 21  | 3.6

  7. Structures to organize objects
     [Figure: “Groups”, shown as “/” (root) and “/foo”, organize “Datasets”: a 3-D array, a 2-D array, a table (lat, lon, temp), two raster images, and a palette.]

  8. HDF5 model
     • Groups: provide structure among objects
     • Datasets: where the primary data goes
        • Data arrays
        • Rich set of datatype options
        • Flexible, efficient storage and I/O
     • Attributes, for metadata
     Everything else is built essentially from these parts.
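
The three building blocks on this slide can be illustrated with a toy in-memory model. This is only a sketch: the `Group` and `Dataset` classes and the `resolve` helper are hypothetical names invented for illustration, not the HDF5 library API. The sample values reuse the temp column from slide 6, and the `units` attribute is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Holds the primary data plus descriptive attributes (metadata)."""
    data: list
    dtype: str
    attrs: dict = field(default_factory=dict)


@dataclass
class Group:
    """Provides structure: maps names to child groups or datasets."""
    children: dict = field(default_factory=dict)
    attrs: dict = field(default_factory=dict)


def resolve(root, path):
    """Walk a slash-separated path such as "/foo/temp" from the root group."""
    node = root
    for name in path.strip("/").split("/"):
        if name:  # an empty name means the path was just "/"
            node = node.children[name]
    return node


# Build "/" (root) containing group "/foo", which holds one dataset.
root = Group()
root.children["foo"] = Group()
root.children["foo"].children["temp"] = Dataset(
    data=[3.1, 4.2, 3.6],
    dtype="float32",
    attrs={"units": "degC"},  # assumed unit, for illustration only
)

temp = resolve(root, "/foo/temp")
```

Everything in the toy model is a group, a dataset, or an attribute dictionary, which mirrors the slide's closing remark.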

  9. HDF5: The Software

  10. HDF Software
      [Diagram, layers top to bottom: Tools, Applications, Libraries; HDF I/O Library; HDF File]

  11. Users of HDF Software
      [Diagram: software layers, top to bottom, with annotations]
      • Tools & Applications: Most data consumers are here. Scientific/engineering applications. Domain-specific libraries/API, tools.
      • HDF5 Application Programming Interface: Applications, tools use this API to create, read, write, query, etc. Power users (consumers).
      • “Virtual file layer” (VFL): Modules to adapt I/O to specific features of system, or do I/O in some special way.
      • File system, MPI-IO, SAN, other layers
      • “File”: could be on parallel system, in memory, network, collection of files, etc.
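
The layering on this slide can be sketched as a small driver interface: upper-layer code calls only `read` and `write`, and an interchangeable driver decides where the bytes live. The names `Driver`, `CoreDriver`, `FileDriver`, and `store_record` are hypothetical and chosen for illustration; this is not the HDF5 virtual file layer API.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class Driver(ABC):
    """Hypothetical low-level interface: the only thing upper layers see."""

    @abstractmethod
    def write(self, offset: int, data: bytes) -> None: ...

    @abstractmethod
    def read(self, offset: int, size: int) -> bytes: ...


class CoreDriver(Driver):
    """Keeps the whole "file" in memory."""

    def __init__(self):
        self.buf = bytearray()

    def write(self, offset, data):
        end = offset + len(data)
        if end > len(self.buf):
            self.buf.extend(bytes(end - len(self.buf)))  # zero-fill any gap
        self.buf[offset:end] = data

    def read(self, offset, size):
        return bytes(self.buf[offset:offset + size])


class FileDriver(Driver):
    """Stores the "file" as an ordinary file on the local file system."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def write(self, offset, data):
        with open(self.path, "r+b") as f:
            f.seek(offset)
            f.write(data)

    def read(self, offset, size):
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)


def store_record(driver: Driver, offset: int, payload: bytes) -> bytes:
    """Upper-layer code: identical regardless of which driver is plugged in."""
    driver.write(offset, payload)
    return driver.read(offset, len(payload))


core_result = store_record(CoreDriver(), 8, b"temp=3.1")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")
    file_result = store_record(FileDriver(path), 8, b"temp=3.1")
```

Because `store_record` depends only on the interface, a driver for a parallel file system or a network store could be swapped in without changing the calling code.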

  12. Philosophy: a single platform with multiple uses
      • One general format
      • One library, with
         • Options to adapt I/O and storage to data needs
         • Layers on top and below
      • Ability to interact well with other technologies
      • Attention to past, present, future compatibility

  13. Who uses HDF?

  14. Who uses HDF5?
      • Applications that deal with big or complex data
      • Over 200 different types of apps
      • 2+ million product users worldwide
      • Academia, government agencies, industry

  15. Applications with large amounts of data

  16. Large simulations
      • A simulation can have billions of elements
      • Each element can have dozens of associated values

  17. Large images
      • Electron tomography, 25-80 Å resolution
      • 4k x 4k x 500 images now
      • 8k x 8k x 1k images soon (256 GB)
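
The 256 GB figure can be checked with a little arithmetic. The slide does not state the voxel size or whether "k" means 1000 or 1024; the sketch below assumes 4 bytes per voxel and k = 1024, which reproduces the quoted number. The function name `volume_bytes` is made up for this example.

```python
def volume_bytes(nx, ny, nz, bytes_per_voxel=4):
    """Uncompressed size of an nx x ny x nz volume (4 bytes/voxel is assumed)."""
    return nx * ny * nz * bytes_per_voxel


K = 1024          # assumption: "k" on the slide means 1024
GIB = 1024 ** 3   # bytes per GiB

now_gib = volume_bytes(4 * K, 4 * K, 500) / GIB      # current image size
soon_gib = volume_bytes(8 * K, 8 * K, 1 * K) / GIB   # upcoming image size
```

Under these assumptions the current images come to about 31 GiB each and the upcoming ones to 256 GiB, roughly an eightfold increase.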

  18. It’s not just about size.

  19. Computational fluid dynamics simulation data

  20. Earth Science (EOS)
      [Figure: EOS satellites and their instruments. Terra: CERES, MISR, MODIS, MOPITT. Aqua (6/01): CERES, MODIS, AMSR. Aura: TES, HIRDLS, MLS, OMI.]

  21. Flight test
      • High speed, multi-stream, multi-modal data collection
      • Analyze and query specific parameters by time, space

  22. I/O Bottlenecks and HDF

  23. What is an I/O bottleneck?
      • "I/O bottleneck": a phenomenon where the performance or capacity of an entire system is severely limited by some aspect of I/O.
      • Two types of bottlenecks
         • Technology: getting the data around quickly
         • Usability/accessibility: acquiring and making use of it
      • The role for HDF
         • Try not to cause bottlenecks
         • Offer ways to deal with bottlenecks when they occur

  24. HDF Bottlenecks
      [Diagram, layers top to bottom: Tools & Applications; HDF5 Application Programming Interface; Low level Interface; File system, MPI-IO, SAN, other layers; File]

  25. Sources of bottlenecks
      • Architectural features
      • Characteristics of data and information objects
      • Accessing and operating on objects
      • Usability/accessibility: beyond specialization

  26. Architecture-related I/O bottlenecks
      Software that does I/O often needs to operate on different systems. Differences within and among these systems can create I/O bottlenecks, as well as solutions.

  27. Architecture I/O bottleneck examples
      Bottlenecks
      • Not enough memory, so apps have to swap to disk
      • In a cluster, multiple processors doing I/O on the same file simultaneously
      • Parallel file system has special features to avoid bottlenecks
      HDF response
      • Keep an HDF file in core, so I/O goes from core to core
      • Adaptable parallel I/O strategies, such as collective I/O, merging many small accesses into one large one
      • Implement special I/O drivers in virtual file layer to exploit parallel file systems like PVFS, GPFS, Lustre
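
The idea of merging many small accesses into one large one can be illustrated by coalescing byte ranges before issuing I/O. This is a minimal sketch of the general technique only; the `coalesce` function and its `max_gap` parameter are hypothetical, and nothing here describes how HDF5 or MPI-IO actually implement collective I/O.

```python
def coalesce(requests, max_gap=0):
    """Merge (offset, length) requests that touch or overlap.

    max_gap lets nearby requests merge even when a few unused bytes
    separate them, trading extra bytes read for fewer I/O calls.
    """
    merged = []
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1] + max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # extend the last range
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Four processes each ask for a small, adjacent piece of the same file.
small = [(0, 4096), (4096, 4096), (8192, 4096), (12288, 4096)]
one_big = coalesce(small)

# Requests separated by small holes merge only if a gap is tolerated.
sparse = [(0, 100), (120, 100), (1000, 100)]
strict = coalesce(sparse)
relaxed = coalesce(sparse, max_gap=32)
```

In the first case four requests collapse into a single 16 KiB access. In the second, tolerating a 32-byte gap merges the first two requests while leaving the distant third one separate.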

  28. Characteristics of data and information objects
      The size of objects, heterogeneity, and how we represent information. All are potential causes of I/O bottlenecks.

  29. Characteristics of data and information objects: Heterogeneity
      Bottlenecks
      • Need to represent similar data from different sources, but it comes in different formats.
      • Having to convert data for interoperability
      HDF responses
      • Creation of common models and corresponding I/O libraries, avoiding need to convert
      • Add I/O filters to auto-convert data

  30. Characteristics of data and information objects: Size
      Bottlenecks
      • Metadata/data differences: Hard to do both big I/O and small I/O efficiently, especially on high-end systems tuned for big I/O.
      HDF response: metadata caching options
      • Caches metadata & data to avoid re-reading/writing
      • Let application control cache
         • App can control when cache is flushed
         • App can advise about cache sizes, replacement strategies
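
An application-controlled cache of this kind can be sketched as a small write-back cache with a configurable size, least-recently-used replacement, and an explicit flush. The `MetadataCache` class is a hypothetical illustration, not the HDF5 metadata cache or its API; a plain dictionary stands in for the file.

```python
from collections import OrderedDict


class MetadataCache:
    """Small write-back LRU cache in front of a slow backing store."""

    def __init__(self, store, max_entries=2):
        self.store = store              # dict standing in for the file
        self.max_entries = max_entries  # application-chosen cache size
        self.entries = OrderedDict()    # key -> (value, dirty flag)
        self.store_reads = 0
        self.store_writes = 0

    def read(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)      # mark as recently used
            return self.entries[key][0]
        self.store_reads += 1                  # cache miss: go to the store
        value = self.store[key]
        self._insert(key, value, dirty=False)
        return value

    def write(self, key, value):
        self._insert(key, value, dirty=True)   # defer the store write

    def flush(self):
        """The application decides when dirty entries reach the store."""
        for key, (value, dirty) in list(self.entries.items()):
            if dirty:
                self.store[key] = value
                self.store_writes += 1
                self.entries[key] = (value, False)

    def _insert(self, key, value, dirty):
        self.entries[key] = (value, dirty)
        self.entries.move_to_end(key)
        while len(self.entries) > self.max_entries:
            old_key, (old_value, old_dirty) = self.entries.popitem(last=False)
            if old_dirty:                      # evicted dirty data must be saved
                self.store[old_key] = old_value
                self.store_writes += 1


store = {"header": 1, "btree": 2, "heap": 3}
cache = MetadataCache(store, max_entries=2)
for _ in range(100):
    cache.read("header")       # repeated small reads hit the cache
cache.write("btree", 20)       # buffered, not yet in the store
pending = store["btree"]
cache.flush()
flushed = store["btree"]
```

The hundred small reads cost a single access to the backing store, and the write reaches the store only when the application calls `flush`.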

  31. Characteristics of data and information objects: Representation
      Bottlenecks
      • Different apps need different views of information, requiring transformation
         • change coordinate systems
         • ingest to database
         • change engineering units
      HDF response
      • Group, index, reference structures provide different views at one time
      • I/O filters can operate on data during I/O
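
A filter that operates on data during I/O can be sketched as a chain of functions applied while stored bytes are decoded, so that the data is stored once and each reader gets the view it needs. The names `read_with_filters`, `celsius_to_kelvin`, and `round_to` are hypothetical, and treating the sample values as degrees Celsius is an assumption; this is not the HDF5 filter pipeline API.

```python
import struct


def celsius_to_kelvin(values):
    """Example filter: change engineering units."""
    return [v + 273.15 for v in values]


def round_to(ndigits):
    """Filter factory: returns a filter that rounds each value."""
    return lambda values: [round(v, ndigits) for v in values]


def read_with_filters(raw, count, filters=()):
    """Decode `count` little-endian float64 values, then run each filter in order."""
    values = list(struct.unpack(f"<{count}d", raw))
    for apply_filter in filters:
        values = apply_filter(values)
    return values


# Stored once; assumed here to be degrees Celsius (temp values from slide 6).
raw = struct.pack("<3d", 3.1, 4.2, 3.6)

as_stored = read_with_filters(raw, 3)
in_kelvin = read_with_filters(raw, 3, filters=[celsius_to_kelvin, round_to(2)])
```

The stored bytes never change; only the chain of filters passed at read time differs between the two views.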

  32. Accessing and operating on objects
      I/O bottlenecks can occur when data is collected, generated, searched, analyzed, converted, and moved.

  33. Accessing and operating on objects: Sequential R/W
      Bottlenecks
      • Data from a single source at very high rate
      • Data from multiple sources, simultaneously
      HDF response
      • Use different file structures for sequential vs. random access
      • Exploit available system optimizations (e.g. direct I/O to bypass system buffers)

  34. Accessing and operating on objects: Partial access
      Bottlenecks
      • Access or operate on part of an object, slice through object, etc.
      • Access to compressed object
      • Perform a query about an object or collection
      HDF response
      • Offer rich set of partial I/O ops that recognize patterns and optimize for them
      • Use chunking to enable fast slicing through arrays
      • Compress in chunks, avoiding need to uncompress whole object
      • Create and store indexes together with the data
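
Chunking and per-chunk compression can be illustrated with a small tiled array: each tile is compressed on its own, so a slice only decompresses the tiles it overlaps. This is a sketch under simplifying assumptions (an 8 x 8 integer array, 4 x 4 tiles, zlib); the helpers `write_chunked` and `read_region` are hypothetical and do not reflect the HDF5 storage format or API.

```python
import struct
import zlib

ROWS, COLS = 8, 8
CHUNK = 4  # tiles are CHUNK x CHUNK (an arbitrary choice for the demo)

# Demo data: element (r, c) holds r * COLS + c.
array = [[r * COLS + c for c in range(COLS)] for r in range(ROWS)]


def write_chunked(array):
    """Split the array into tiles and compress each tile on its own."""
    chunks = {}
    for r0 in range(0, ROWS, CHUNK):
        for c0 in range(0, COLS, CHUNK):
            tile = [array[r][c]
                    for r in range(r0, r0 + CHUNK)
                    for c in range(c0, c0 + CHUNK)]
            raw = struct.pack(f"<{len(tile)}i", *tile)
            chunks[(r0 // CHUNK, c0 // CHUNK)] = zlib.compress(raw)
    return chunks


def read_region(chunks, rows, cols):
    """Read array[rows, cols], decompressing only the tiles that overlap it."""
    needed = {(r // CHUNK, c // CHUNK) for r in rows for c in cols}
    tiles = {}
    for key in needed:
        raw = zlib.decompress(chunks[key])
        tiles[key] = struct.unpack(f"<{CHUNK * CHUNK}i", raw)
    region = [[tiles[(r // CHUNK, c // CHUNK)][(r % CHUNK) * CHUNK + c % CHUNK]
               for c in cols]
              for r in rows]
    return region, len(needed)


chunks = write_chunked(array)
# A column slice (all rows, column 5) overlaps only the two right-hand tiles.
column, tiles_read = read_region(chunks, range(ROWS), range(5, 6))
```

With a single compressed blob the whole array would have to be decompressed to read one column; with tiles, half of the data is never touched in this example.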

  35. Accessing and operating on objects: Remote access
      Bottlenecks
      • All of the above are exacerbated when the data is accessed from a distance or over a slow network
      HDF response
      • Avoid moving the data. Send operation to the data vs. data to operation:
         • Put HDF5 software inside remote data system, such as iRODS
         • Implement remote query/access protocols, such as OPeNDAP

  36. Usability/accessibility: Beyond specialization
      Data is collected for specific purposes, then frequently turns out to have many other uses. Too often only the first users (the specialists) have the knowledge and tools to access the data and interpret it meaningfully.

  37. The gaps between producer and user may be social, political, economic, semantic, temporal. The greater the gaps between producer and consumer, the greater are the challenges to usability and accessibility.

  38. Usability/accessibility bottlenecks
      • What data do I need and where do I find it?
      • Now that I have it, what does this data really mean? Provenance? Quality?
      • What tools do I need to access data? Do they exist? How do I use them?
      • How do I transform the data to representations that address my information needs?
      • How do I integrate and combine this data with my other data to create new information?
      • Who can help me?

  39. HDF responses to support usability
      • Layer the software to make HDF accessible at different levels of expertise
      • Develop and promote standard models and representations in HDF (EOS, netCDF, EXPRESS)
      • Develop and promote metadata standards and their representation in HDF.
      • Provide simple tools to view the data
      • Provide tools to export just the data needed to other formats.
      • Work with tool builders, open & proprietary

  40. Supporting usability across time
      • Export to simple, enduring formats, such as XML
      • Create maps to the data
      • Define and store Access Information Packages
      • Be tenacious about backward compatibility

  41. Philosophy: a single platform with multiple uses
      • One general format
      • One library, with
         • Options to adapt I/O and storage to data needs
         • Layers on top and below
      • Ability to interact well with other technologies
      • Attention to past, present, future compatibility

  42. Thank you. Mike Folk, mfolk@hdfgroup.org
