

1. Enabling High Performance Application I/O. Project 4: Parallel netCDF. SciDAC All Hands Meeting, September 11-13, 2002

  2. Outline • NetCDF overview • Parallel netCDF and MPI-IO • Progress on API implementation • Preliminary performance evaluation using LBNL test suite

3. NetCDF Overview
• NetCDF (network Common Data Form) is an API for reading/writing multi-dimensional data arrays
• Self-describing file format
  • A netCDF file includes information about the data it contains
• Machine independent
• Portable file format
• Popular in both the fusion and climate communities

netCDF example:

  { // CDL notation for netCDF dataset
  dimensions:  // dimension names and lengths
    lat = 5, lon = 10, level = 4, time = unlimited;
  variables:   // var types, names, shapes, attributes
    float temp(time, level, lat, lon);
      temp:long_name = "temperature";
      temp:units = "celsius";
    float rh(time, lat, lon);
      rh:long_name = "relative humidity";
      rh:valid_range = 0.0, 1.0;  // min and max
    int lat(lat), lon(lon), level(level), time(time);
      lat:units = "degrees_north";
      lon:units = "degrees_east";
      level:units = "millibars";
      time:units = "hours since 1996-1-1";
  // global attributes:
    :source = "Fictional Model Output";
  data:  // optional data assignments
    level = 1000, 850, 700, 500;
    lat   = 20, 30, 40, 50, 60;
    lon   = -160, -140, -118, -96, -84, -52, -45, -35, -25, -15;
    time  = 12;
    rh    = .5, .2, .4, .2, .3, .2, .4, .5, .6, .7,
            .1, .3, .1, .1, .1, .1, .5, .7, .8, .8,
            .1, .2, .2, .2, .2, .5, .7, .8, .9, .9,
            .1, .2, .3, .3, .3, .3, .7, .8, .9, .9,
             0, .1, .2, .4, .4, .4, .4, .7, .9, .9;  // 1 record allocated
  }

4. NetCDF File Format
• File header
  • Stores metadata for fixed-size arrays: number of arrays, dimension lists, global attribute list, etc.
• Array data
  • Fixed-size arrays are stored contiguously in the file
  • Variable-size arrays: records from all variable-sized arrays are stored interleaved

5. NetCDF APIs
• Dataset APIs
  • Create/open/close a dataset, set the dataset to define/data mode, and synchronize dataset changes to disk
• Define mode APIs
  • Define the dataset: add dimensions and variables
• Attribute APIs
  • Add, change, and read attributes of datasets
• Inquiry APIs
  • Inquire dataset metadata: dim(id, name, len), var(name, ndims, shape, id)
• Data mode APIs
  • Read/write variables (access methods: single value, whole array, subarray, strided subarray, sampled subarray)
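A minimal C sketch touching each of these serial netCDF API groups in order; the file name, dimension sizes, and variable name are illustrative only, and error checking is omitted:

  /* Serial netCDF sketch: dataset, define mode, attribute, data mode,
   * and inquiry APIs.  Names and sizes are illustrative only. */
  #include <netcdf.h>

  int main(void)
  {
      int ncid, dim_time, dim_lat, dimids[2], varid, ndims;

      /* Dataset API: create a dataset (starts in define mode). */
      nc_create("example.nc", NC_CLOBBER, &ncid);

      /* Define mode APIs: add dimensions and a variable. */
      nc_def_dim(ncid, "time", NC_UNLIMITED, &dim_time);
      nc_def_dim(ncid, "lat", 5, &dim_lat);
      dimids[0] = dim_time;
      dimids[1] = dim_lat;
      nc_def_var(ncid, "temp", NC_FLOAT, 2, dimids, &varid);

      /* Attribute API: attach an attribute to the variable. */
      nc_put_att_text(ncid, varid, "units", 7, "celsius");

      /* Leave define mode, then use a data mode API to write one record
         (a strided subarray write would use nc_put_vars_float instead). */
      nc_enddef(ncid);
      float record[5] = { 20.0f, 21.5f, 23.0f, 22.0f, 19.5f };
      size_t start[2] = { 0, 0 }, count[2] = { 1, 5 };
      nc_put_vara_float(ncid, varid, start, count, record);

      /* Inquiry API: ask how many dimensions the variable has. */
      nc_inq_varndims(ncid, varid, &ndims);

      nc_close(ncid);
      return 0;
  }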

6. Serial vs. Parallel netCDF
[Figure: with serial netCDF, P0-P3 funnel their data through a single process to the parallel file system; with parallel netCDF, P0-P3 access the shared file on the parallel file system directly]
• Serial netCDF
  • Parallel read: implemented by simply having all processes read the file independently; does NOT utilize the native I/O provided by the parallel file system, so parallel optimizations are missed
  • Sequential write: parallel writes are carried out by shipping data to a single process, which can overwhelm its memory capacity
• Parallel netCDF
  • Parallel read/write to a shared netCDF file
  • Built on top of MPI-IO, which utilizes the optimal I/O facilities provided by the parallel file system
  • Can pass high-level access hints down to the file system for further optimization

7. Design of the Parallel netCDF APIs
• Goals
  • Retain the original file format: applications using the original netCDF API can access the same files
  • A new set of parallel APIs, prefixed "ncmpi_" (C) and "nfmpi_" (Fortran)
  • Similar APIs: minimal changes from the original APIs for easy migration
  • Portable across machines
  • High performance: tune the APIs to provide better performance in today's computing environments

8. Parallel File System
[Figure: compute nodes connected through a switch network to multiple I/O servers, each holding several disks; a single file is striped across the disks]
• A parallel file system consists of multiple I/O nodes
  • Increases bandwidth between compute and I/O nodes
• Each I/O node may contain more than one disk
  • Increases bandwidth between disks and I/O nodes
• A file is striped across all disks in a round-robin fashion
  • Maximizes the possibility of parallel access

9. Parallel netCDF and MPI-IO
[Figure: software stack on each compute node: Parallel netCDF over ROMIO over ADIO in user space, communicating across the switch network with the I/O servers in file system space]
• The parallel netCDF APIs are the interface between applications and parallel file systems
• Parallel netCDF is implemented on top of MPI-IO
  • ROMIO is an implementation of the MPI-IO standard
  • ROMIO is built on top of ADIO
  • ADIO has implementations on various file systems, using optimal native I/O calls

10. Parallel API Implementations
• Dataset APIs
  • Collective calls
  • Add an MPI communicator to define the I/O process scope
  • Add MPI_Info to pass access hints for further optimization
• Define mode APIs: collective calls
• Attribute APIs: collective calls
• Inquiry APIs: collective calls
• Data mode APIs
  • Collective mode (default): ensures file consistency
  • Independent mode

File open:

  ncmpi_create/open(MPI_Comm comm,
                    const char *path,
                    int cmode,
                    MPI_Info info,
                    int *ncidp);

Switch in/out of independent data mode:

  ncmpi_begin_indep_data(int ncid);
  ncmpi_end_indep_data(int ncid);
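A minimal C sketch of this dataset-API flow, assuming the PnetCDF header pnetcdf.h and an MPI environment; the output file name and the striping hint value are illustrative, and error checking is omitted:

  /* Collective create with an MPI communicator and an MPI_Info hint,
   * then switching in and out of independent data mode. */
  #include <mpi.h>
  #include <pnetcdf.h>

  int main(int argc, char **argv)
  {
      int ncid;
      MPI_Info info;

      MPI_Init(&argc, &argv);
      MPI_Info_create(&info);
      /* Access hint passed down to MPI-IO (value chosen for illustration). */
      MPI_Info_set(info, "striping_factor", "4");

      /* Collective create: every process in MPI_COMM_WORLD takes part. */
      ncmpi_create(MPI_COMM_WORLD, "output.nc", NC_CLOBBER, info, &ncid);

      /* ... collective define mode calls would go here ... */
      ncmpi_enddef(ncid);

      /* Collective data mode is the default; switch to independent mode
         when each process needs to do its own, uncoordinated I/O. */
      ncmpi_begin_indep_data(ncid);
      /* ... independent reads/writes ... */
      ncmpi_end_indep_data(ncid);

      ncmpi_close(ncid);
      MPI_Info_free(&info);
      MPI_Finalize();
      return 0;
  }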

11. Data Mode APIs
• Collective and independent calls, distinguished by the "_all" suffix
• High-level APIs
  • Mimic the original netCDF data mode APIs: an easy migration path to the parallel interface
  • Map netCDF access types to MPI derived datatypes
• Flexible APIs
  • Better handling of internal data representations
  • More fully expose the capabilities of MPI-IO to the programmer

High-level APIs:

  ncmpi_put/get_vars_<type>_all(int ncid,
                                const MPI_Offset start[],
                                const MPI_Offset count[],
                                const MPI_Offset stride[],
                                const unsigned char *buf);

Flexible APIs:

  ncmpi_put/get_vars(int ncid,
                     const MPI_Offset start[],
                     const MPI_Offset count[],
                     const MPI_Offset stride[],
                     void *buf,
                     int count,
                     MPI_Datatype datatype);
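To illustrate the two flavors, a sketch using the present-day PnetCDF calls (these take a variable ID argument in addition to the arguments in the prototypes above); the routine, its arguments, and the data layout are assumptions for illustration, and error checking is omitted:

  /* Each process writes its own block of rows of a 2D float variable,
   * once with the high-level API and once with the flexible API.
   * In practice only one of the two calls would be used. */
  #include <mpi.h>
  #include <pnetcdf.h>

  void write_slab(int ncid, int varid, int rank,
                  const float *local, MPI_Offset nrows, MPI_Offset ncols)
  {
      /* Contiguous block of rows owned by this process. */
      MPI_Offset start[2] = { rank * nrows, 0 };
      MPI_Offset count[2] = { nrows, ncols };

      /* High-level API: the element type is encoded in the function name,
         and "_all" marks the call as collective. */
      ncmpi_put_vara_float_all(ncid, varid, start, count, local);

      /* Flexible API: the in-memory layout is described with an MPI
         datatype instead, here a flat buffer of nrows*ncols floats. */
      ncmpi_put_vara_all(ncid, varid, start, count,
                         local, nrows * ncols, MPI_FLOAT);
  }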

12. LBNL Benchmark
[Figure: block partition patterns of a 3D array over 8 processors: X, Y, Z, XY, XZ, YZ, and XYZ partitions]
• Test suite
  • Developed by Chris Ding et al. at LBNL
  • Written in Fortran
  • Simple block partition patterns (X, Y, Z, XY, XZ, YZ, XYZ)
  • Access to a 3D array stored in a single netCDF file
• Running on the IBM SP2 at NERSC, LBNL
  • Each compute node is an SMP with 16 processors
  • I/O is performed using all processors
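As an illustration of how one such pattern maps onto the parallel API, a sketch (in C, although the test suite itself is Fortran) of a Z partition, where each process reads a contiguous slab of Z planes; the routine and its arguments are hypothetical, nz is assumed divisible by the number of processes, and error checking is omitted:

  /* Z partition of an nz x ny x nx array: process `rank` of `nprocs`
   * collectively reads its slab of Z planes from the shared file. */
  #include <stdlib.h>
  #include <mpi.h>
  #include <pnetcdf.h>

  void read_z_slab(int ncid, int varid, int rank, int nprocs,
                   MPI_Offset nz, MPI_Offset ny, MPI_Offset nx)
  {
      /* Each process gets a contiguous range of Z planes. */
      MPI_Offset slab = nz / nprocs;
      MPI_Offset start[3] = { rank * slab, 0, 0 };
      MPI_Offset count[3] = { slab, ny, nx };

      float *buf = malloc(slab * ny * nx * sizeof(float));

      /* Collective read of this process's subarray. */
      ncmpi_get_vara_float_all(ncid, varid, start, count, buf);

      free(buf);
  }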

13. LBNL Results – 64 MB
• Array size: 256 x 256 x 256, real*4
• Read
  • In some cases performance improves over a single processor: an 8-processor parallel read is 2-3 times faster than serial netCDF
• Write
  • Performance is not better than serial netCDF; it is 7-8 times slower

14. Our Results – 64 MB
[Figure: read and write bandwidth (MB/sec, 0.1-1000, log scale) vs. number of processors (1-16) for the X, Y, Z, YX, ZX, ZY, and ZYX partitions]
• Array size: 256 x 256 x 256, real*4
• Run on the IBM SP2 at SDSC
• I/O is performed using one processor per node

15. LBNL Results – 1 GB
• Array size: 512 x 512 x 512, real*8
• Read
  • No better performance is observed
• Write
  • 4-8 processor writes result in 2-3 times higher bandwidth than using a single processor

16. Our Results – 1 GB
[Figure: read and write bandwidth (MB/sec, 0.1-1000, log scale) vs. number of processors (1-32) for the X, Y, Z, YX, ZX, ZY, and ZYX partitions]
• Array size: 512 x 512 x 512, real*8
• Run on the IBM SP2 at SDSC
• I/O is performed using one processor per node

17. Summary
• Complete the parallel C APIs
• Identify friendly users
  • ORNL, LBNL
• User reference manual
• Preliminary performance results
  • Using the LBNL test suite: typical access patterns
  • Obtained scalable results
