Parallel and Grid I/O Infrastructure • Rob Ross, Argonne National Lab • Parallel Disk Access and Grid I/O (P4) • SDM All Hands Meeting, March 26, 2002
Participants • Argonne National Laboratory • Bill Gropp, Rob Ross, Rajeev Thakur, Rob Latham, Anthony Chan • Northwestern University • Alok Choudhary, Wei-keng Liao, Avery Ching, Kenin Coloma, Jianwei Li • Collaborators • Lawrence Livermore National Laboratory • Ghaleb Abdulla, Tina Eliassi-Rad, Terence Critchlow • Application groups
Focus Areas in Project • Parallel I/O on clusters • Parallel Virtual File System (PVFS) • MPI-IO hints • ROMIO MPI-IO implementation • Grid I/O • Linking PVFS and ROMIO with Grid I/O components • Application interfaces • NetCDF and HDF5 • Everything is interconnected! • Wei-keng Liao will drill down into specific tasks
Parallel Virtual File System • Lead developer R. Ross (ANL) • R. Latham (ANL), developer • A. Ching, K. Coloma (NWU), collaborators • Open source, scalable parallel file system • Project began in the mid-1990s at Clemson University • Now a collaboration between Clemson and ANL • Successes • In use on large Linux clusters (OSC, Utah, Clemson, ANL, Phillips Petroleum, …) • 100+ unique downloads/month • 160+ users on mailing list, 90+ on developers list • Multiple gigabyte-per-second performance demonstrated
Keeping PVFS Relevant: PVFS2 • Scaling to thousands of clients and hundreds of servers requires some design changes • Distributed metadata • New storage formats • Improved fault tolerance • New technology, new features • High-performance networking (e.g. Infiniband, VIA) • Application metadata • New design and implementation warranted (PVFS2)
PVFS1, PVFS2, and SDM • Maintaining PVFS1 as a resource to community • Providing support, bug fixes • Encouraging use by application groups • Adding functionality to improve performance (e.g. tiled display) • Implementing next-generation parallel file system • Basic infrastructure for future PFS work • New physical distributions (e.g. chunking) • Application metadata storage • Ensuring that a working parallel file system will continue to be available on clusters as they scale
Data Staging for Tiled Display • Contact: Joe Insley (ANL) • Commodity components • projectors, PCs • Provide very high-resolution visualization • Staging application preprocesses “frames” into a tile stream for each “visualization node” • Uses MPI-IO to access data from PVFS file system • Streams of tiles are merged into movie files on visualization nodes • End goal is to display frames directly from PVFS • Enhancing PVFS and ROMIO to improve performance
Example Tile Layout • 3x2 display, 6 readers • Frame size is 2532x1408 pixels • Tile size is 1024x768 pixels (overlapped) • Movies broken into frames with each frame stored in its own file in PVFS • Readers pull data from PVFS and send to display
Tested Access Patterns • Subtile • Each reader grabs a piece of a tile • Small noncontiguous accesses • Lots of accesses for a frame • Tile • Each reader grabs a whole tile • Larger noncontiguous accesses • Six accesses for a frame • Reading individual pieces is simply too slow
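A minimal sketch of how a reader can describe its tile of a frame to MPI-IO in one shot, using the frame (2532x1408) and tile (1024x768) sizes from the slides. The 3-bytes-per-pixel depth, the "pvfs:" file name, and the exact overlap offsets are assumptions for illustration, not the tiled-display reader's actual code.

```c
/* Sketch: each reader describes its tile of a 2532x1408 frame as an MPI
 * subarray and reads it with one collective call.  Pixel depth, file name,
 * and overlap offsets are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

#define FRAME_W 2532
#define FRAME_H 1408
#define TILE_W  1024
#define TILE_H   768
#define PIXEL      3                     /* assumed bytes per pixel */

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 3x2 tile layout: ranks 0..5 map to (row, col) tiles */
    int col = rank % 3, row = rank / 3;
    int sizes[2]    = { FRAME_H, FRAME_W * PIXEL };
    int subsizes[2] = { TILE_H,  TILE_W * PIXEL };
    int starts[2]   = { row * (FRAME_H - TILE_H),               /* overlapped rows    */
                        col * ((FRAME_W - TILE_W) / 2) * PIXEL };/* overlapped columns */

    MPI_Datatype tile;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_BYTE, &tile);
    MPI_Type_commit(&tile);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "pvfs:/frames/frame0001",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_BYTE, tile, "native", MPI_INFO_NULL);

    /* one noncontiguous read per frame instead of many small ones */
    char *buf = malloc((size_t)TILE_H * TILE_W * PIXEL);
    MPI_File_read_all(fh, buf, TILE_H * TILE_W * PIXEL, MPI_BYTE,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&tile);
    free(buf);
    MPI_Finalize();
    return 0;
}
```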
Noncontiguous Access in ROMIO • ROMIO performs “data sieving” to cut down the number of I/O operations • Uses large reads that grab multiple noncontiguous pieces • Example: reading tile 1
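The slide's tile-1 example referred to a figure; below is a minimal, file-system-agnostic sketch of the data-sieving idea (not ROMIO's actual code): one large contiguous read spans all the requested pieces, then only the wanted bytes are copied out.

```c
/* Sketch of data sieving: service a sorted list of noncontiguous requests
 * with one big contiguous read, then copy out only the desired bytes.
 * The holes between pieces are read too, which is the extra data the
 * next slide refers to. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int sieve_read(int fd, int count, const off_t *offsets, const size_t *lengths,
               char **dest)
{
    off_t  start  = offsets[0];
    size_t extent = (size_t)(offsets[count - 1] + (off_t)lengths[count - 1] - start);

    char *scratch = malloc(extent);
    if (!scratch)
        return -1;

    /* one large read covers every piece (and the holes between them) */
    if (pread(fd, scratch, extent, start) != (ssize_t)extent) {
        free(scratch);
        return -1;
    }

    /* copy just the requested bytes out of the sieve buffer */
    for (int i = 0; i < count; i++)
        memcpy(dest[i], scratch + (offsets[i] - start), lengths[i]);

    free(scratch);
    return 0;
}
```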
Noncontiguous Access in PVFS • ROMIO data sieving • Works for all file systems (just uses contiguous reads) • Reads extra data (three times the desired amount) • Noncontiguous access primitive allows requesting just the desired bytes (A. Ching, NWU) • Support in ROMIO allows transparent use of the new optimization (K. Coloma, NWU) • PVFS and ROMIO support implemented
Metadata in File Systems • Associative arrays of information related to a file • Seen in other file systems (MacOS, BeOS, ReiserFS) • Some potential uses: • Ancillary data (from applications) • Derived values • Thumbnail images • Execution parameters • I/O library metadata • Block layout information • Attributes on variables • Attributes of dataset as a whole • Headers • Keeps header out of data stream • Eliminates need for alignment in libraries
Metadata and PVFS2 Status • Prototype metadata storage for PVFS2 implemented • R. Ross (ANL) • Uses Berkeley DB for storage of keyword/value pairs • Need to investigate how to interface to MPI-IO • Other components of PVFS2 coming along • Networking in testing (P. Carns, Clemson) • Client side API under development (Clemson) • PVFS2 beta early fourth quarter?
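A minimal sketch of keyword/value metadata storage in a Berkeley DB table, the kind of structure the prototype uses. This is not the PVFS2 prototype code: the key names, file name, and error handling are illustrative, and the DB->open call follows the Berkeley DB 4.1+ signature, which differs slightly in older releases.

```c
/* Sketch: store one keyword/value metadata pair for a file in a
 * Berkeley DB table.  Names are illustrative, not the PVFS2 prototype. */
#include <db.h>
#include <string.h>

int store_attr(const char *dbfile, const char *keyword,
               void *value, u_int32_t len)
{
    DB *dbp;
    DBT key, data;

    if (db_create(&dbp, NULL, 0) != 0)
        return -1;
    if (dbp->open(dbp, NULL, dbfile, NULL, DB_BTREE, DB_CREATE, 0644) != 0)
        return -1;

    memset(&key, 0, sizeof(key));
    memset(&data, 0, sizeof(data));
    key.data  = (void *)keyword;          /* e.g. "thumbnail", "block_layout" */
    key.size  = (u_int32_t)strlen(keyword) + 1;
    data.data = value;
    data.size = len;

    dbp->put(dbp, NULL, &key, &data, 0);  /* keyword -> value */
    dbp->close(dbp, 0);
    return 0;
}
```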
ROMIO MPI-IO Implementation • Written by R. Thakur (ANL) • R. Ross and R. Latham (ANL), developers • K. Coloma (NWU), collaborator • Implementation of the MPI-2 I/O specification • Operates on a wide variety of platforms • Abstract Device Interface for I/O (ADIO) aids in porting to new file systems • Successes • Adopted by industry (e.g. Compaq, HP, SGI) • Used at ASCI sites (e.g. LANL Blue Mountain)
ROMIO Current Directions • Support for PVFS noncontiguous requests • K. Coloma (NWU) • Hints - key to efficient use of HW & SW components • Collective I/O • Aggregation (synergy) • Performance portability • Controlling ROMIO Optimizations • Access patterns • Grid I/O • Scalability • Parallel I/O benchmarking
ROMIO Aggregation Hints • Part of ASCI Software Pathforward project • Contact: Gary Grider (LANL) • Implementation by R. Ross, R. Latham (ANL) • Hints control which processes do I/O in collectives • Examples: • All processes on same node as attached storage • One process per host • Additionally limit the number of processes that open the file • Good for systems without a shared FS (e.g. O2K clusters) • More scalable
Aggregation Example • Cluster of SMPs • Only one SMP box has connection to disks • Data is aggregated to processes on single box • Processes on that box perform I/O on behalf of the others
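A sketch of how the cluster-of-SMPs case above could be expressed through MPI_Info hints at open time. The cb_config_list and cb_nodes keys are ROMIO's aggregation hints; the host name, hint values, and file name are placeholders for illustration.

```c
/* Sketch: ROMIO aggregation hints restricting collective I/O to the SMP
 * box attached to storage.  "ionode" and the file name are placeholders. */
#include <mpi.h>

void open_with_aggregation(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* aggregate only on the host attached to the disks ... */
    MPI_Info_set(info, "cb_config_list", "ionode:*");
    /* ... or, alternatively, one aggregator per host:
     *     MPI_Info_set(info, "cb_config_list", "*:1");
     * cb_nodes can cap the total number of aggregators:        */
    MPI_Info_set(info, "cb_nodes", "4");

    MPI_File_open(comm, "/pfs/output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, fh);
    MPI_Info_free(&info);
}
```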
Optimization Hints • MPI-IO calls should be chosen to best describe the I/O taking place • Use of file views • Collective calls for inherently collective operations • Unfortunately, choosing the “right” calls can sometimes result in lower performance • Allow application programmers to tune ROMIO with hints rather than using different MPI-IO calls • Avoid the misapplication of optimizations (aggregation, data sieving)
Optimization Problems • ROMIO checks for applicability of two-phase optimization when collective I/O is used • With tiled display application using subtile access, this optimization is never used • Checking for applicability requires communication between processes • Results in 33% drop in throughput (on test system) • A hint that tells ROMIO not to apply the optimization can avoid this without changes to the rest of the application
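A sketch of the hint described above: the application keeps its collective calls but tells ROMIO not to attempt the two-phase (collective buffering) optimization, avoiding the applicability check and its extra communication. The romio_cb_read key and its enable/disable/automatic values are ROMIO's; the file name is illustrative.

```c
/* Sketch: disable ROMIO's two-phase optimization for collective reads
 * via a hint, leaving the application's MPI-IO calls unchanged. */
#include <mpi.h>

void open_frames_no_twophase(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    /* skip two-phase collective reads (and the communication that the
     * applicability check costs for the subtile pattern) */
    MPI_Info_set(info, "romio_cb_read", "disable");

    MPI_File_open(comm, "/pvfs/frames/frame0001",
                  MPI_MODE_RDONLY, info, fh);
    MPI_Info_free(&info);
}
```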
Access Pattern Hints • Collaboration between ANL and LLNL (and growing) • Examining how access pattern information can be passed to MPI-IO interface, through to underlying file system • Used as input to optimizations in MPI-IO layer • Used as input to optimizations in FS layer as well • Prefetching • Caching • Writeback
Status of Hints • Aggregation control finished • Optimization hints • Collectives, data sieving read finished • Data sieving write control in progress • PVFS noncontiguous I/O control in progress • Access pattern hints • Exchanging log files, formats • Getting up to speed on respective tools
Parallel I/O Benchmarking • No common parallel I/O benchmarks • New effort (consortium) to: • Define some terminology • Define test methodology • Collect tests • Goal: provide a meaningful test suite with consistent measurement techniques • Interested parties at numerous sites (and growing) • LLNL, Sandia, UIUC, ANL, UCAR, Clemson • In infancy…
Grid I/O • Looking at ways to connect our I/O work with components and APIs used in the Grid • New ways of getting data in and out of PVFS • Using MPI-IO to access data in the Grid • Alternative mechanisms for transporting data across the Grid (synergy) • Working towards more seamless integration of the tools used in the Grid and those used on clusters and in parallel applications (specifically MPI applications) • Facilitate moving between Grid and Cluster worlds
Local Access to GridFTP Data • Grid I/O Contact: B. Allcock (ANL) • GridFTP striped server provides high-throughput mechanism for moving data across Grid • Relies on proprietary storage format on striped servers • Must manage metadata on stripe location • Data stored on servers must be read back from servers • No alternative/more direct way to access local data • Next version assumes shared file system underneath
GridFTP Striped Servers • Remote applications connect to multiple striped servers to quickly transfer data over Grid • Multiple TCP streams better utilize WAN network • Local processes would need to use same mechanism to get to data on striped servers
PVFS under GridFTP • With PVFS underneath, GridFTP servers would store data on PVFS I/O servers • Stripe information stored on PVFS metadata server
Local Data Access • Application tasks that are part of a local parallel job could access data directly off PVFS file system • Output from application could be retrieved remotely via GridFTP
MPI-IO Access to GridFTP • Applications such as tiled display reader desire remote access to GridFTP data • Access through MPI-IO would allow this with no code changes • ROMIO ADIO interface provides the infrastructure necessary to do this • MPI-IO hints provide means for specifying number of stripes, transfer sizes, etc.
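A sketch of what such access might look like from the application's side: the usual MPI_File_open, with hints carrying stripe count and transfer size. The "gridftp:" file name prefix and a GridFTP-backed ADIO driver are assumptions here; striping_factor and striping_unit are reserved MPI-IO hint keys.

```c
/* Sketch (assuming a hypothetical GridFTP ADIO driver): stripe count and
 * transfer size supplied as MPI-IO hints, no application code changes. */
#include <mpi.h>

void open_remote_frames(MPI_Comm comm, MPI_File *fh)
{
    MPI_Info info;
    MPI_Info_create(&info);

    MPI_Info_set(info, "striping_factor", "8");      /* number of stripes  */
    MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB transfer size */

    MPI_File_open(comm, "gridftp:ftp://host.example.org/frames/frame0001",
                  MPI_MODE_RDONLY, info, fh);
    MPI_Info_free(&info);
}
```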
WAN File Transfer Mechanism • B. Gropp (ANL), P. Dickens (IIT) • Applications • PPM and COMMAS (Paul Woodward, UMN) • Alternative mechanism for moving data across Grid using UDP • Focuses on requirements for file movement • All data must arrive at destination • Ordering doesn’t matter • Lost blocks can be retransmitted when detected, but need not stop the remainder of the transfer
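A conceptual sketch of the receiving side of the approach described above: UDP blocks may arrive in any order, each is placed by its block number, and whatever is still missing can be re-requested. The packet layout, block size, and retransmission handshake are assumptions for illustration, not the actual WAN FT protocol.

```c
/* Conceptual sketch of order-independent UDP block receipt with a bitmap
 * of received blocks; wire format and block size are assumptions. */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

#define BLOCK_SIZE 8192               /* assumed payload per datagram */

struct packet {                       /* assumed wire format */
    uint32_t block;                   /* block number, network byte order */
    char     data[BLOCK_SIZE];
};

void receive_file(int sock, char *filebuf, uint32_t nblocks)
{
    char *got = calloc(nblocks, 1);   /* which blocks have arrived */
    uint32_t remaining = nblocks;
    struct packet p;

    while (remaining > 0) {
        ssize_t n = recvfrom(sock, &p, sizeof(p), 0, NULL, NULL);
        if (n <= (ssize_t)sizeof(uint32_t))
            continue;
        uint32_t b = ntohl(p.block);
        if (b < nblocks && !got[b]) {
            /* ordering doesn't matter: place the block by its number */
            memcpy(filebuf + (size_t)b * BLOCK_SIZE,
                   p.data, (size_t)n - sizeof(uint32_t));
            got[b] = 1;
            remaining--;
        }
        /* periodically (not shown) report still-missing block numbers to
         * the sender so only those are retransmitted */
    }
    free(got);
}
```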
WAN File Transfer Performance • Comparing TCP utilization to the WAN FT technique • See 10-12% utilization with a single TCP stream (8 streams needed to approach maximum utilization) • WAN FT obtains near 90% utilization and more uniform performance
Grid I/O Status • Planning with Grid I/O group • Matching up components • Identifying useful hints • Globus FTP client library is available • 2nd generation striped server being implemented • XIO interface prototyped • Hooks for alternative local file systems • Obvious match for PVFS under GridFTP
NetCDF • Applications in climate and fusion • PCM • John Drake (ORNL) • Weather Research and Forecast Model (WRF) • John Michalakes (NCAR) • Center for Extended Magnetohydrodynamic Modeling • Steve Jardin (PPPL) • Plasma Microturbulence Project • Bill Nevins (LLNL) • Maintained by Unidata Program Center • API and file format for storing multidimensional datasets and associated metadata (in a single file)
NetCDF Interface • Strong points: • It’s a standard! • I/O routines allow for subarray and strided access with single calls • Access is clearly split into two modes • Defining the datasets (define mode) • Accessing and/or modifying the datasets (data mode) • Weakness: no parallel writes, limited parallel read capability • This forces applications to ship data to a single node for writing, severely limiting usability in I/O intensive applications
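A minimal sketch of the define-mode/data-mode split in the existing netCDF-3 C interface, showing how a single call writes a whole subarray. Dimension names, sizes, and the file name are illustrative; in this serial API only one process can do the write.

```c
/* Sketch of the two-mode netCDF workflow: define the dataset, then switch
 * to data mode and write a subarray with one call.  Names are illustrative. */
#include <netcdf.h>

int write_temp(float *local, size_t *start, size_t *count)
{
    int ncid, dimids[2], varid;

    /* define mode: describe dimensions and variables */
    nc_create("climate.nc", NC_CLOBBER, &ncid);
    nc_def_dim(ncid, "lat", 180, &dimids[0]);
    nc_def_dim(ncid, "lon", 360, &dimids[1]);
    nc_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
    nc_enddef(ncid);                  /* switch to data mode */

    /* data mode: one call writes a whole subarray of the variable,
     * but only from a single process in the serial API */
    nc_put_vara_float(ncid, varid, start, count, local);
    return nc_close(ncid);
}
```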
Parallel NetCDF • Rich I/O routines and explicit define/data modes provide a good foundation • Existing applications are already describing noncontiguous regions • Modes allow for a synchronization point when file layout changes • Missing: • Semantics for parallel access • Collective routines • Option for using MPI datatypes • Implement in terms of MPI-IO operations • Retain file format for interoperability
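A hypothetical sketch of what the collective interface the design describes might look like: the create call takes an MPI communicator and info object, and every process writes its own subarray in one collective call implemented over MPI-IO. The ncmpi_* names and header are illustrative here, not the project's settled API.

```c
/* Hypothetical sketch of a collective parallel netCDF write; the ncmpi_*
 * names and <pnetcdf.h> header are assumptions for illustration. */
#include <mpi.h>
#include <pnetcdf.h>

int write_temp_parallel(MPI_Comm comm, float *local,
                        MPI_Offset *start, MPI_Offset *count)
{
    int ncid, dimids[2], varid;

    /* define mode is entered collectively, with a communicator and hints */
    ncmpi_create(comm, "climate.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "lat", 180, &dimids[0]);
    ncmpi_def_dim(ncid, "lon", 360, &dimids[1]);
    ncmpi_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
    ncmpi_enddef(ncid);

    /* data mode: each process writes its own subarray in one collective
     * call, implemented on top of MPI-IO underneath */
    ncmpi_put_vara_float_all(ncid, varid, start, count, local);
    return ncmpi_close(ncid);
}
```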
Parallel NetCDF Status • Design document created • B. Gropp, R. Ross, and R. Thakur (ANL) • Prototype in progress • J. Li (NWU) • Focus is on write functions first • Biggest bottleneck for checkpointing applications • Read functions follow • Investigate alternative file formats in future • Address differences in access modes between writing and reading
FLASH Astrophysics Code • Developed at ASCI Center at University of Chicago • Contact: Mike Zingale • Adaptive mesh (AMR) code for simulating astrophysical thermonuclear flashes • Written in Fortran90, uses MPI for communication, HDF5 for checkpointing and visualization data • Scales to thousands of processors, runs for weeks, needs to checkpoint • At the time, I/O was a bottleneck (½ of runtime on 1024 processors)
HDF5 Overhead Analysis • Instrumented FLASH I/O to log calls to H5Dwrite and the underlying MPI_File_write_at
HDF5 Hyperslab Operations • White region is hyperslab “gather” (from memory) • Cyan is “scatter” (to file)
Hand-Coded Packing • Packing time is in black regions between bars • Nearly order of magnitude improvement
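A sketch of the hand-packing idea behind that improvement: instead of handing H5Dwrite a noncontiguous memory selection (the interior of a block surrounded by guard cells), the interior is copied into a contiguous buffer and written with a simple memory space, so HDF5 does no per-element gather. The block and guard-cell sizes and the dataset layout are assumptions, not FLASH's actual I/O code.

```c
/* Sketch: pack a block's interior (stripping assumed guard cells) into a
 * contiguous buffer, then H5Dwrite with a simple memory space while the
 * file side still selects this block's hyperslab.  Sizes are illustrative. */
#include <hdf5.h>
#include <string.h>

#define NX 16                        /* assumed interior cells per dimension */
#define NG  4                        /* assumed guard cells on each side     */

void write_block(hid_t dset, hid_t filespace, hsize_t block_index,
                 double grid[NX + 2*NG][NX + 2*NG])
{
    double packed[NX][NX];

    /* hand-coded packing: copy the interior into a contiguous buffer */
    for (int i = 0; i < NX; i++)
        memcpy(packed[i], &grid[i + NG][NG], NX * sizeof(double));

    /* file side still uses a hyperslab: this block's slot in the dataset */
    hsize_t start[3] = { block_index, 0, 0 };
    hsize_t count[3] = { 1, NX, NX };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* memory side is contiguous, so no per-element gather inside HDF5 */
    hsize_t memdims[2] = { NX, NX };
    hid_t memspace = H5Screate_simple(2, memdims, NULL);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, packed);
    H5Sclose(memspace);
}
```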
Wrap Up • Progress being made on multiple fronts • ANL/NWU collaboration is strong • Collaborations with other groups maturing • Balance of immediate payoff and medium term infrastructure improvements • Providing expertise to application groups • Adding functionality targeted at specific applications • Building core infrastructure to scale, ensure availability • Synergy with other projects • On to Wei-keng!