Parallel and Grid I/O Infrastructure
W. Gropp, R. Ross, R. Thakur (Argonne National Laboratory)
A. Choudhary, W. Liao (Northwestern University)
G. Abdulla, T. Eliassi-Rad (Lawrence Livermore National Laboratory)
Outline
• Introduction
• PVFS and ROMIO
• Parallel NetCDF
• Query Pattern Analysis
Please interrupt at any point for questions!
What is this project doing?
• Extending existing infrastructure work
  • PVFS parallel file system
  • ROMIO MPI-IO implementation
• Helping match application I/O needs to underlying capabilities
  • Parallel NetCDF
  • Query Pattern Analysis
• Linking with Grid I/O resources
  • PVFS backend for the GridFTP striped server
  • ROMIO on top of a Grid I/O API
What Are All These Names?
• MPI - the Message Passing Interface standard
  • Also known as MPI-1
• MPI-2 - extensions to the MPI standard
  • I/O, RDMA, dynamic processes
• MPI-IO - the I/O part of the MPI-2 extensions
• ROMIO - an implementation of MPI-IO
  • Handles mapping MPI-IO calls onto communication (MPI) and file I/O
• PVFS - the Parallel Virtual File System
  • An implementation of a parallel file system for Linux clusters
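To make these layers concrete, here is a minimal MPI-IO sketch in which each process writes its own block of a shared file. The file name and block size are arbitrary choices for illustration, and error checking is omitted.

```c
/* Minimal MPI-IO sketch: each rank writes its own block of a shared file.
   ROMIO translates these calls into operations on the underlying file
   system (e.g., PVFS on a Linux cluster). */
#include <mpi.h>

#define BLOCK 1024   /* illustrative block size */

int main(int argc, char **argv)
{
    int rank;
    char buf[BLOCK];
    MPI_File fh;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int i = 0; i < BLOCK; i++)
        buf[i] = (char) rank;

    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: each rank targets a disjoint block of the file. */
    offset = (MPI_Offset) rank * BLOCK;
    MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```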
Fitting the Pieces Together
• Query Pattern Analysis (QPA) and Parallel NetCDF are both written in terms of MPI-IO calls
  • QPA tools pass information down through MPI-IO hints
  • Parallel NetCDF uses MPI-IO for data reads and writes
• The ROMIO implementation uses PVFS as the storage medium on Linux clusters, or could hook into Grid I/O resources
PVFS and ROMIO
• Provide a little background on the two
  • What they are, an example to set context, and current status
• Motivate the work
• Discuss current research and development
  • I/O interfaces
  • MPI-IO hints
  • PVFS2
Our work on these two is closely tied together.
Parallel Virtual File System
• Parallel file system for Linux clusters
  • Global name space
  • Distributed file data
  • Builds on TCP and local file systems
• Tuned for high-performance concurrent access
• Mountable like NFS file systems
• User-level interface library (used by ROMIO)
• 200+ users on the mailing list, 100+ downloads/month
  • Up from 160+ users in March
• Installations at OSC, Univ. of Utah, Phillips Petroleum, ANL, Clemson Univ., etc.
PVFS Architecture
• Client-server architecture with two server types:
  • Metadata server (mgr) - keeps track of file metadata (permissions, owner) and directory structure
  • I/O servers (iod) - orchestrate the movement of data between clients and local I/O devices
• Clients access PVFS in one of two ways:
  • MPI-IO (using the ROMIO implementation)
  • Mounting through the Linux kernel (loadable module)
PVFS Performance
• Ohio Supercomputer Center cluster
  • 16 I/O servers (IA32), 70+ clients (IA64), IDE disks
• Block-partitioned data, accessed through ROMIO
ROMIO
• An implementation of the MPI-2 I/O specification
• Operates on a wide variety of platforms
• The Abstract Device Interface for I/O (ADIO) aids in porting to new file systems
• Fortran and C bindings
• Successes
  • Adopted by industry (e.g. Compaq, HP, SGI)
  • Used at ASCI sites (e.g. LANL Blue Mountain)
Example of Software Layers
• The FLASH astrophysics application stores checkpoints and visualization data using HDF5
• HDF5 in turn uses MPI-IO (ROMIO) to write out its data files
• The PVFS client library is used by ROMIO to write data to the PVFS file system
• The PVFS client library interacts with the PVFS servers over the network
Example of Software Layers (2)
• The FLASH astrophysics application stores checkpoints and visualization data using HDF5
• HDF5 in turn uses MPI-IO (IBM's implementation) to write out its data files
• The GPFS file system stores the data to disks
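As a rough illustration of the layering in the two examples above, the sketch below shows how an application-level library such as HDF5 is pointed at an MPI-IO implementation through a file access property list. The file name is hypothetical, dataset creation and writes are elided, and an MPI-enabled HDF5 build is assumed.

```c
/* Sketch of the HDF5-on-MPI-IO layering (hypothetical file name).
   The mpio file access property tells HDF5 to perform its I/O through
   MPI-IO (ROMIO over PVFS, or IBM's MPI-IO over GPFS). */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL); /* route I/O through MPI-IO */

    hid_t file = H5Fcreate("flash_checkpoint.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);

    /* ... create datasets and write checkpoint data collectively ... */

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```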
Status of PVFS and ROMIO
• Both are freely available, widely distributed, documented, and supported products
• Current work focuses on:
  • Higher performance through richer file system interfaces
  • Hint mechanisms for optimizing the behavior of both ROMIO and PVFS
  • Scalability
  • Fault tolerance
Why Does This Work Matter?
• Much of the I/O on big machines goes through MPI-IO
  • Direct use of MPI-IO (visualization)
  • Indirect use through HDF5 or NetCDF (fusion, climate, astrophysics)
  • Hopefully soon through Parallel NetCDF!
• On clusters, PVFS is currently the most widely deployed parallel file system
• Optimizations in these layers are of direct benefit to those users
• They also provide guidance to vendors for possible future improvements
I/O Interfaces
• Scientific applications keep structured data sets in memory and in files
• For the highest performance, the description of that structure must be maintained through the software layers:
  • Allow the scientist to describe the data layout in memory and in the file
  • Avoid packing into buffers in intermediate layers
  • Minimize the number of file system operations needed to perform I/O
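A sketch of what this looks like at the MPI-IO level: the application describes a block-of-columns decomposition of a global array with an MPI datatype and a file view, and hands the whole structured request to the I/O layer in one collective call, with no packing in the application. The array dimensions and file name are arbitrary assumptions.

```c
/* Sketch: describing a structured (noncontiguous) file layout with an MPI
   datatype and a file view.  Each rank owns a block of columns of a global
   NX x NY array of doubles; sizes assume NY is divisible by nprocs. */
#include <mpi.h>
#include <stdlib.h>

#define NX 1024   /* global rows    */
#define NY 1024   /* global columns */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_File fh;
    MPI_Datatype filetype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int gsizes[2] = { NX, NY };
    int lsizes[2] = { NX, NY / nprocs };
    int starts[2] = { 0, rank * (NY / nprocs) };

    /* One datatype describes this rank's (strided) portion of the file. */
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    double *local = malloc(sizeof(double) * lsizes[0] * lsizes[1]);
    for (int i = 0; i < lsizes[0] * lsizes[1]; i++)
        local[i] = (double) rank;

    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    /* The file view passes the structured description down to MPI-IO. */
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, lsizes[0] * lsizes[1], MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}
```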
File System Interfaces
[Figure: noncontiguous data regions in memory and in the file]
• MPI-IO is a great starting point
• Most underlying file systems only provide POSIX-like contiguous access
• The List I/O work was a first step in the right direction
  • A proposed file system interface
  • Allows movement of lists of data regions in memory and in the file with one call
List I/O
• Implemented in PVFS
• Transparent to the user through ROMIO
• Distributed in the latest releases
List I/O Example
[Figure: flattening a file datatype. A datatype whose elements are single bytes is tiled three times over bytes 0-11 of the file; the access flattens to file offsets 0, 2, 6, 10 with lengths 1, 3, 3, 2.]
• A simple datatype is repeated (tiled) over the file
• We want to read the first 9 bytes
• This is converted into four [offset, length] pairs
• One can see how this process could result in a very large list of offsets and lengths
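The sketch below is a simplified stand-in for that flattening step (not ROMIO's actual code): it walks the tiled example datatype, which selects 1 byte at offset 0 and 2 bytes at offset 2 within a 4-byte extent, merges adjacent regions, and prints the four [offset, length] pairs from the figure.

```c
/* Sketch: flattening the tiled example datatype into [offset, length]
   pairs for the first 9 selected bytes.  Prints (0,1) (2,3) (6,3) (10,2). */
#include <stdio.h>

int main(void)
{
    int blk_off[2] = { 0, 2 };   /* block offsets within one datatype     */
    int blk_len[2] = { 1, 2 };   /* block lengths within one datatype     */
    int extent     = 4;          /* datatype extent in bytes              */

    long bytes_wanted = 9;       /* "read the first 9 bytes"              */
    long cur_off = -1, cur_len = 0;

    for (int tile = 0; bytes_wanted > 0; tile++) {
        for (int b = 0; b < 2 && bytes_wanted > 0; b++) {
            long off = (long) tile * extent + blk_off[b];
            long len = blk_len[b];
            if (len > bytes_wanted) len = bytes_wanted;
            bytes_wanted -= len;

            if (cur_off + cur_len == off) {
                cur_len += len;           /* merge adjacent regions        */
            } else {
                if (cur_len > 0)
                    printf("offset %ld, length %ld\n", cur_off, cur_len);
                cur_off = off;
                cur_len = len;
            }
        }
    }
    if (cur_len > 0)
        printf("offset %ld, length %ld\n", cur_off, cur_len);
    return 0;
}
```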
Describing Regular Patterns
• List I/O cannot describe regular patterns (e.g., a column of a 2D matrix) in an efficient manner
• MPI datatypes can do this easily
• Datatype I/O is our solution to this problem
  • A concise set of datatype constructors used to describe types
  • An API for passing these descriptions to a file system
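For example, one column of a row-major matrix of doubles can be captured by a single MPI vector constructor, where list I/O would need one [offset, length] pair per row; the matrix dimensions below are arbitrary.

```c
/* Sketch: an MPI datatype describing one column of an N x M matrix of
   doubles stored in row-major (C) order.  One constructor captures what
   list I/O would need N separate [offset, length] pairs to express. */
#include <mpi.h>

#define N 1000   /* rows    */
#define M 1000   /* columns */

int main(int argc, char **argv)
{
    MPI_Datatype column;

    MPI_Init(&argc, &argv);

    /* N blocks of 1 double, each M doubles apart: one matrix column. */
    MPI_Type_vector(N, 1, M, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* 'column' could now be used as a file view or I/O datatype, e.g.:
       MPI_File_set_view(fh, col_index * sizeof(double), MPI_DOUBLE,
                         column, "native", MPI_INFO_NULL);              */

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}
```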
Datatype I/O
• Built using a generic datatype-processing component (also used in MPICH2)
  • Optimizing for performance
• Prototype for PVFS in progress
  • API and server support
• Prototype of support in ROMIO in progress
  • Maps MPI datatypes to PVFS datatypes
  • Passes them through the new API
• This same generic datatype component could be used in other projects as well
Datatype I/O Example
[Figure: the same byte-sized datatype tiled three times over bytes 0-11 of the file.]
• Same datatype as in the previous example
• Describe the datatype with one construct:
  • index {(0,1), (2,2)} describes the pattern of one short block and one longer one
  • automatically tiled (as with MPI types for files)
• The linear relationship between the number of contiguous pieces and the size of the request is removed
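A sketch of the same pattern expressed with standard MPI constructors (the file name is the hypothetical one used earlier, assumed to exist; how ROMIO would hand this description down to PVFS's datatype I/O API is not shown): an indexed type with blocks {(0,1), (2,2)}, resized to a 4-byte extent so that it tiles when used as a file view.

```c
/* Sketch: the slide's index {(0,1), (2,2)} pattern as an MPI datatype.
   Two blocks (lengths 1 and 2 bytes at displacements 0 and 2), resized to
   a 4-byte extent so it tiles over the file.  Reading 9 bytes covers the
   same regions flattened earlier: (0,1), (2,3), (6,3), (10,2). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Datatype pattern, tiled;
    int blocklens[2] = { 1, 2 };
    int displs[2]    = { 0, 2 };
    MPI_File fh;
    char buf[9];

    MPI_Init(&argc, &argv);

    MPI_Type_indexed(2, blocklens, displs, MPI_BYTE, &pattern);
    MPI_Type_create_resized(pattern, 0, 4, &tiled);   /* extent = 4 bytes */
    MPI_Type_commit(&tiled);

    MPI_File_open(MPI_COMM_SELF, "testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_BYTE, tiled, "native", MPI_INFO_NULL);
    MPI_File_read(fh, buf, 9, MPI_BYTE, MPI_STATUS_IGNORE); /* first 9 selected bytes */
    MPI_File_close(&fh);

    MPI_Type_free(&pattern);
    MPI_Type_free(&tiled);
    MPI_Finalize();
    return 0;
}
```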
MPI Hints for Performance
• ROMIO has a number of performance optimizations built in
• The optimizations are somewhat general, but there are tuning parameters that are very system-specific:
  • buffer sizes
  • number and location of processes that perform I/O
  • data sieving and two-phase techniques
• Hints may be used to tune ROMIO to match the system
ROMIO Hints
• Currently all of ROMIO's optimizations may be controlled with hints:
  • data sieving
  • two-phase I/O
  • list I/O
  • datatype I/O
• Additional hints are being considered to allow ROMIO to adapt to access patterns:
  • collective-only I/O
  • sequential vs. random access
  • inter-file dependencies
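A sketch of how such hints are passed in practice, through an MPI_Info object at open time. The keys shown are standard ROMIO hints controlling collective buffering (two-phase) and data sieving; the values are arbitrary examples, not recommendations for any particular system.

```c
/* Sketch: passing ROMIO tuning hints via MPI_Info at file-open time. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Info info;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    MPI_Info_create(&info);
    MPI_Info_set(info, "cb_buffer_size", "8388608"); /* 8 MB collective buffer     */
    MPI_Info_set(info, "cb_nodes",       "4");       /* 4 aggregator processes     */
    MPI_Info_set(info, "romio_cb_write", "enable");  /* force two-phase writes     */
    MPI_Info_set(info, "romio_ds_write", "disable"); /* skip data sieving on write */

    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* ... writes tuned by the hints above ... */

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}
```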
PVFS2
• PVFS (version 1.x.x) plays an important role as a fast scratch file system for use today
• PVFS2 will supersede this version, adding:
  • More comprehensive system management
  • Fault tolerance through lazy redundancy
  • Distributed metadata
  • A component-based approach for supporting new storage and network resources
• Distributed metadata and fault tolerance will extend scalability into thousands and tens of thousands of clients and hundreds of servers
• The PVFS2 implementation is underway
Summary
• ROMIO and PVFS are a mature foundation on which to make additional improvements
• New, rich I/O descriptions allow for higher-performance access
• The addition of new hints to ROMIO allows for fine-tuning its operation
• PVFS2 focuses on the next generation of clusters