Progress in Storage Efficient Access: PnetCDF and MPI-I/O
SciDAC All Hands Meeting, March 2-3, 2005
Outline
• Parallel netCDF
  • Building blocks
  • Status report
  • Users and applications
  • Future work
• MPI I/O file caching sub-system
  • Enhanced client-side file caching for parallel applications
  • Scalable approach for enforcing file consistency and atomicity
Parallel netCDF
• Goals
  • Design parallel APIs
  • Keep the same file format
    • Backward compatible, easy to migrate from serial netCDF
  • Similar API names and argument lists, but with parallel semantics (see the sketch below)
• Tasks
  • Built on top of MPI for portability and high performance
    • Take advantage of existing MPI-IO optimizations (collective I/O, etc.)
  • Additional functionality for sophisticated I/O patterns
    • A new set of flexible APIs incorporates MPI derived datatypes to address the mapping between memory and file data layouts
  • Support C and Fortran interfaces
  • Support external data representations across platforms
[Slide figure: software stack on the compute nodes, with the application, Parallel netCDF, ROMIO, and ADIO in user space, connected through the switch network to the I/O servers in file-system space]
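To make the "similar API names, parallel semantics" point concrete, here is a minimal sketch of a PnetCDF write. The file name, variable name, and array sizes are made up for illustration, and exact argument types should be checked against the release being used:

#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv) {
    int rank, nprocs, ncid, dimid, varid, buf[100];
    MPI_Offset start[1], count[1];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    for (int i = 0; i < 100; i++) buf[i] = rank;     /* made-up local data */

    /* ncmpi_* calls mirror the serial nc_* names; a communicator and an
       MPI_Info object are added to express the parallel semantics. */
    ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "x", (MPI_Offset)nprocs * 100, &dimid);
    ncmpi_def_var(ncid, "temp", NC_INT, 1, &dimid, &varid);
    ncmpi_enddef(ncid);

    start[0] = (MPI_Offset)rank * 100;    /* each process writes its own slice */
    count[0] = 100;
    ncmpi_put_vara_int_all(ncid, varid, start, count, buf);   /* collective write */

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}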
PnetCDF Current Status
• High-level APIs (mimicking the serial netCDF API)
  • Fully supported in both C and Fortran
• Flexible APIs (extended to utilize MPI derived datatypes; see the sketch below)
  • Allow complex memory layouts for mapping between the I/O buffer and file space
  • Support varm routines (strided memory layout) ported from serial netCDF
  • Support array shuffles, e.g. transposition
• Test suites
  • C and Fortran self-test codes ported from the Unidata netCDF package to validate against single-process results
  • Parallel test codes for both sets of APIs
• Latest release is v0.9.4
• Pre-release v1.0
  • Sync with netCDF v3.6.0 (newest release from Unidata)
  • Parallel API user manual
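As an illustration of the flexible APIs, the sketch below uses an MPI derived datatype to describe a non-contiguous memory layout (a local array whose last two columns are ghost cells) in a single call. The function name, array shape, and offsets are hypothetical; the ncid and varid are assumed to come from a define phase like the one in the previous sketch:

#include <mpi.h>
#include <pnetcdf.h>

/* Write the first 6 of 8 columns in each of 4 local rows (the last 2
   columns are ghost cells) into the file region [start, start+count)
   of variable varid, using one flexible-API call. */
void write_interior(int ncid, int varid, int rank)
{
    int local[4][8];                          /* hypothetical local array   */
    MPI_Offset start[2] = { 0, (MPI_Offset)rank * 6 };
    MPI_Offset count[2] = { 4, 6 };
    MPI_Datatype interior;

    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 8; j++)
            local[i][j] = rank;

    /* 4 blocks of 6 ints with stride 8: skips the 2 ghost columns per row. */
    MPI_Type_vector(4, 6, 8, MPI_INT, &interior);
    MPI_Type_commit(&interior);

    /* Flexible API: the buffer layout is described by (bufcount, buftype)
       instead of requiring a contiguous buffer of the element type. */
    ncmpi_put_vara_all(ncid, varid, start, count, local, 1, interior);

    MPI_Type_free(&interior);
}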
PnetCDF Users and Applications
• FLASH – Astrophysical Thermonuclear application from the ASCI/Alliances Center at the University of Chicago
• ACTM – Atmospheric Chemical Transport Model from LLNL
• ROMS – Regional Ocean Model System from the NCSA HDF group
• ASPECT – data understanding infrastructure from ORNL
• pVTK – parallel visualization toolkit from ORNL
• PETSc – Portable, Extensible Toolkit for Scientific Computation from ANL
• PRISM – PRogram for Integrated Earth System Modeling
PnetCDF Future Work
• Data type conversion for external data representation
  • Reducing intermediate memory-copy operations
    • int64 ↔ int32, little-endian ↔ big-endian, int ↔ double
  • Data type caching at the PnetCDF level (without repeated decoding)
• I/O hints
  • Currently, only MPI hints (MPI file info) are supported (see the sketch below)
  • Need netCDF-level hints, e.g. access-sequence patterns over multiple arrays
• Non-blocking I/O
• Large array support (dimension sizes > 2^31 - 1)
• More flexible and extendable file format
  • Allow adding new objects dynamically
  • Store arrays of structured data types, such as C structures
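Since only MPI-IO hints are exposed today, a typical way to tune PnetCDF is to pass an MPI_Info object at file creation. The hint names below are standard ROMIO hints; the values (and whether they help) are illustrative and system-dependent:

#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv) {
    int ncid;
    MPI_Info info;

    MPI_Init(&argc, &argv);

    /* MPI-IO (ROMIO) hints are currently the only tuning knob; netCDF-level
       hints (e.g. multi-array access patterns) are future work. */
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");    /* collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "8388608");   /* 8 MB buffer (made-up value) */

    ncmpi_create(MPI_COMM_WORLD, "hints.nc", NC_CLOBBER, info, &ncid);
    /* ... define dimensions/variables and perform I/O as usual ... */
    ncmpi_enddef(ncid);
    ncmpi_close(ncid);

    MPI_Info_free(&info);
    MPI_Finalize();
    return 0;
}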
Client-side File Caching for MPI I/O
• Traditional client-side file caching
  • Treats each client independently, targeting distributed environments
  • Inadequate for parallel environments, where clients are most likely related to one another (e.g. read/write shared files)
• Collective caching
  • Application processes cooperate with each other to perform data caching and coherence control (leaving the I/O servers out of the task)
[Slide figure: traditional caching keeps local cache buffers in each client processor; collective caching builds a global cache pool in user space across the client processors, above the client-side file system, the interconnect network, and the I/O servers and their disks]
Design of Collective Caching
• Caching sub-system is implemented in user space
  • Built at the MPI I/O level, so it is portable across different file systems
• Distributed management
  • For cache metadata and lock control (vs. centralized)
• Two designs:
  • Using an I/O thread
  • Using the MPI remote-memory-access (RMA) facility (see the sketch below)
[Slide figure: the file is logically partitioned into blocks; cache metadata (per-block status) is distributed across the processes, and a global cache pool is formed from pages of each process's local memory, implemented between the MPI library and the client-side file system]
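The slide does not give implementation details, so the following is only a rough sketch of what the RMA-based flavor of distributed metadata management could look like: each process exposes the status words of the file blocks it owns in an MPI window, and any process can look up a block's status by computing the owner from a round-robin mapping. The names, the block count, and the one-int-per-block metadata layout are assumptions for illustration, not the actual design:

#include <mpi.h>

#define NBLOCKS_TOTAL 1024     /* made-up number of file blocks */

int main(int argc, char **argv) {
    int rank, nprocs;
    int *my_status;            /* status entries for the blocks this rank owns */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nlocal = (NBLOCKS_TOTAL + nprocs - 1) / nprocs;
    MPI_Alloc_mem(nlocal * sizeof(int), MPI_INFO_NULL, &my_status);
    for (int i = 0; i < nlocal; i++) my_status[i] = 0;   /* 0 = not cached */

    /* Expose the locally owned metadata in an RMA window (collective call). */
    MPI_Win_create(my_status, nlocal * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Look up the status of an arbitrary block with one-sided communication. */
    int block = 37;                        /* hypothetical block of interest */
    int owner = block % nprocs;            /* round-robin ownership mapping  */
    MPI_Aint disp = block / nprocs;        /* position inside owner's window */
    int status;

    MPI_Win_lock(MPI_LOCK_SHARED, owner, 0, win);
    MPI_Get(&status, 1, MPI_INT, owner, disp, 1, MPI_INT, win);
    MPI_Win_unlock(owner, win);            /* status is valid after unlock   */

    MPI_Win_free(&win);
    MPI_Free_mem(my_status);
    MPI_Finalize();
    return 0;
}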
Performance Results 1
• IBM SP at SDSC using GPFS
  • System peak performance: 2.1 GB/s for reads, 1 GB/s for writes
• Sliding-window benchmark
  • I/O requests are overlapped (see the sketch below)
  • Can cause cache coherence problems
[Slide figures: I/O bandwidth (MB/s) vs. number of nodes (4 to 32) for read/write amounts of 512 MB, 1 GB, 2 GB, 4 GB, 8 GB, and 16 GB, comparing byte-range locking I/O with collective caching]
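The slide does not spell out the sliding-window access pattern, so purely as an illustration of overlapping requests: in the sketch below, each process writes a window whose second half overlaps the next process's first half, which is the kind of pattern that triggers byte-range locking (and cache-coherence traffic) in the underlying file system. The file name and request size are made up:

#include <mpi.h>
#include <string.h>

int main(int argc, char **argv) {
    int rank;
    MPI_File fh;
    static char buf[1 << 20];                /* 1 MB per request (made-up size) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    memset(buf, rank & 0xff, sizeof(buf));

    MPI_File_open(MPI_COMM_WORLD, "window.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's window starts half-way into the previous rank's window,
       so every pair of neighboring ranks touches a shared file region. */
    MPI_Offset offset = (MPI_Offset)rank * (sizeof(buf) / 2);
    MPI_File_write_at(fh, offset, buf, sizeof(buf), MPI_CHAR, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}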
Performance Results 2
• BTIO benchmark
  • From NASA Ames Research Center, NAS Parallel Benchmarks version 2.4
  • Block tri-diagonal array partitioning pattern
  • Uses MPI collective I/O calls
  • I/O requests are not overlapped
• FLASH I/O benchmark
  • From U. of Chicago, ASCI Alliances Center
  • Access pattern is non-contiguous both in memory and in file
  • Uses HDF5
  • I/O requests are not overlapped
[Slide figures: I/O bandwidth (MB/s) vs. number of nodes for BTIO classes A and B (4 to 64 nodes) and for FLASH I/O with 8 x 8 x 8 and 16 x 16 x 16 blocks, comparing the original method with collective caching]