Project 4: Enabling High Performance Application I/O
Wei-keng Liao, Northwestern University
SciDAC All Hands Meeting, March 26-27, 2002
Outline
• Design of parallel netCDF APIs
  • Built on top of MPI-IO (student: Jianwei Li)
  • Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)
• Non-contiguous data access on PVFS
  • Design of non-contiguous access APIs (student: Avery Ching)
  • Interfaces to MPI-IO (student: Kenin Coloma)
  • Applications: FLASH, tiled visualization
  • Collaborators: Bill Gropp, Rob Ross, Rajeev Thakur (ANL)
• High-level data access patterns
  • ENZO astrophysics application
  • Access patterns of an AMR application
NetCDF Overview
NetCDF (network Common Data Form) is an interface for array-oriented data access. It defines a machine-independent file format for representing multi-dimensional arrays together with ancillary data, and provides an I/O library for the creation, access, and sharing of array-oriented data. Each netCDF file is a dataset, which contains a set of named arrays.

• Dataset components
  • Dimensions: name, length
    • Fixed dimensions
    • UNLIMITED dimension
  • Variables (named arrays): name, type, shape, attributes, array data
    • Fixed-sized variables: arrays of fixed dimensions
    • Record variables: arrays whose most-significant dimension is UNLIMITED
    • Coordinate variables: 1-D arrays with the same name as their dimension
  • Attributes: name, type, values, length
    • Variable attributes
    • Global attributes

netCDF example:

{ // CDL notation for netCDF dataset
dimensions: // dimension names and lengths
    lat = 5, lon = 10, level = 4, time = unlimited;
variables: // var types, names, shapes, attributes
    float temp(time,level,lat,lon);
        temp:long_name = "temperature";
        temp:units = "celsius";
    float rh(time,lat,lon);
        rh:long_name = "relative humidity";
        rh:valid_range = 0.0, 1.0; // min and max
    int lat(lat), lon(lon), level(level), time(time);
        lat:units = "degrees_north";
        lon:units = "degrees_east";
        level:units = "millibars";
        time:units = "hours since 1996-1-1";
    // global attributes:
        :source = "Fictional Model Output";
data: // optional data assignments
    level = 1000, 850, 700, 500;
    lat = 20, 30, 40, 50, 60;
    lon = -160,-140,-118,-96,-84,-52,-45,-35,-25,-15;
    time = 12;
    rh = .5,.2,.4,.2,.3,.2,.4,.5,.6,.7,
         .1,.3,.1,.1,.1,.1,.5,.7,.8,.8,
         .1,.2,.2,.2,.2,.5,.7,.8,.9,.9,
         .1,.2,.3,.3,.3,.3,.7,.8,.9,.9,
          0,.1,.2,.4,.4,.4,.4,.7,.9,.9; // 1 record allocated
}
Design of Parallel netCDF APIs
• Goals
  • Maintain exactly the original netCDF file format
  • Provide parallel I/O functionality on top of MPI-IO
• High-level parallel APIs
  • Minimize changes to the netCDF argument lists
  • For legacy codes needing only minimal changes
• Low-level parallel APIs
  • Expose MPI-IO components, e.g., derived datatypes
  • For experienced MPI-IO users
NetCDF File Structure
● Header (dataset definitions, extendable)
  - Number of records allocated
  - Dimension list
  - Global attribute list
  - Variable list
● Data (row-major, big-endian, 4-byte aligned)
  - Fixed-sized (non-record) data: the data for each variable is stored contiguously, in the order the variables were defined
  - Record data (non-contiguous between records of a variable): a variable number of fixed-size records, each of which contains one record for every record variable, in defined order
NetCDF APIs
• Dataset APIs -- create/open/close a dataset, switch the dataset between define and data mode, synchronize dataset changes to disk
  • Input: path and mode for create/open; dataset ID for an opened dataset
  • Output: dataset ID for create/open
• Define mode APIs -- define the dataset: add dimensions and variables
  • Input: opened dataset ID; dimension name and length to define a dimension; or variable name, number of dimensions, and shape to define a variable
  • Output: dimension ID or variable ID
• Attribute APIs -- add, change, and read attributes of datasets
  • Input: opened dataset ID; attribute number or attribute name to access an attribute; or attribute name, type, and value to add/change an attribute
  • Output: attribute value for a read
• Inquiry APIs -- inquire dataset metadata (in memory): dim (ID, name, length), var (name, ndims, shape, ID)
  • Input: opened dataset ID; dimension name or ID, or variable name or ID
  • Output: dimension info or variable info
• Data mode APIs -- read/write variables (access methods: single value, whole array, subarray, strided subarray, sampled subarray)
  • Input: opened dataset ID; variable ID; element start index, count, stride, index map
Design of Parallel APIs
• Two file descriptors
  • NetCDF file descriptor: used for header I/O (reuse of existing netCDF code), performed only by process 0
  • MPI_File handle: used for data array I/O, performed by all processes
• Implicit MPI file handle and communicator
  • Added to the internal data structure
  • The MPI communicator is passed as an argument to create/open
• I/O implemented using MPI-IO
  • The file view and offsets are computed from the metadata in the header and the user-provided arguments (start, count, stride), as sketched below
  • Users choose either collective or non-collective I/O calls
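To illustrate the last point, here is a minimal sketch (not the actual implementation) of how a subarray request described by start/count could be turned into an MPI-IO file view. The variable's shape, its byte offset in the file, and the use of MPI_INT are assumptions for the example; the real library must additionally honor netCDF's big-endian, 4-byte-aligned on-disk encoding.

/* Illustrative sketch: build an MPI-IO file view for a subarray request
 * described by start[] and count[], assuming a 3-D integer variable
 * that begins at byte offset var_begin in the file. */
#include <mpi.h>

void set_subarray_view(MPI_File fh, MPI_Offset var_begin,
                       const int sizes[3],   /* full variable shape   */
                       const int start[3],   /* user-provided start[] */
                       const int count[3])   /* user-provided count[] */
{
    MPI_Datatype filetype;

    /* Describe the requested block within the full array (row-major,
     * matching the netCDF on-disk layout). */
    MPI_Type_create_subarray(3, (int *)sizes, (int *)count, (int *)start,
                             MPI_ORDER_C, MPI_INT, &filetype);
    MPI_Type_commit(&filetype);

    /* Skip the header and preceding variables, then expose only the
     * requested subarray to this process. */
    MPI_File_set_view(fh, var_begin, MPI_INT, filetype,
                      "native", MPI_INFO_NULL);
    MPI_Type_free(&filetype);
}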
Collective/Non-collective APIs
• Dataset APIs
  • Collective calls over the communicator passed into the create or open call
  • All processes collectively switch between define and data mode
• Define mode, attribute, and inquiry APIs
  • Collective or non-collective calls
  • Operate on local memory (all processes hold identical header structures)
• Data mode APIs
  • Collective or non-collective calls
  • Access methods: single value, whole array, subarray, strided subarray
Changes in High-Level Parallel APIs
(* type = text | uchar | schar | short | int | long | float | double)
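The example code on the next slides suggests the shape of these changes; the following is a hedged sketch of the affected prototypes (argument names and types are illustrative, inferred from the examples rather than taken from the actual header): dataset create/open gain an MPI communicator, and data mode calls gain collective "_all" variants for each type*.

/* Sketch of the implied high-level API changes (illustrative only). */
#include <mpi.h>
#include <stddef.h>

int nc_create(MPI_Comm comm, const char *path, int cmode, int *ncidp);
int nc_open  (MPI_Comm comm, const char *path, int omode, int *ncidp);

/* Collective subarray write/read, shown here for the int type. */
int nc_put_vara_int_all(int ncid, int varid, const size_t start[],
                        const size_t count[], const int *buf);
int nc_get_vara_int_all(int ncid, int varid, const size_t start[],
                        const size_t count[], int *buf);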
Example Code - Write
• Create a dataset (collective)
  • The only change from serial netCDF: an MPI communicator is passed to nc_create
  • The input arguments must be the same on all processes
  • The returned ncid differs among processes but refers to the same dataset
  • All processes are placed in define mode
• Define dimensions (non-collective); all processes must make the same definitions
• Define variables (non-collective); all processes must make the same definitions
• Add attributes (non-collective); all processes must add the same attributes
• End define (collective); all processes switch from define mode to data mode
• Write variable data
  • All processes issue a series of collective writes, one per variable
  • Independent writes can be used instead, if desired
  • Each process provides its own, locally computed argument values
• Close the dataset (collective)

status = nc_create(comm, "test.nc", NC_CLOBBER, &ncid);

/* dimensions */
status = nc_def_dim(ncid, "x", 100L, &dimid1);
status = nc_def_dim(ncid, "y", 100L, &dimid2);
status = nc_def_dim(ncid, "z", 100L, &dimid3);
status = nc_def_dim(ncid, "time", NC_UNLIMITED, &udimid);
square_dim[0] = cube_dim[0] = xytime_dim[1] = dimid1;
square_dim[1] = cube_dim[1] = xytime_dim[2] = dimid2;
cube_dim[2] = dimid3;
xytime_dim[0] = udimid;
time_dim[0] = udimid;

/* variables */
status = nc_def_var(ncid, "square", NC_INT, 2, square_dim, &square_id);
status = nc_def_var(ncid, "cube",   NC_INT, 3, cube_dim,   &cube_id);
status = nc_def_var(ncid, "time",   NC_INT, 1, time_dim,   &time_id);
status = nc_def_var(ncid, "xytime", NC_INT, 3, xytime_dim, &xytime_id);

/* attributes */
status = nc_put_att_text(ncid, NC_GLOBAL, "title", strlen(title), title);
status = nc_put_att_text(ncid, square_id, "description", strlen(desc), desc);
status = nc_enddef(ncid);

/* variable data */
nc_put_vara_int_all(ncid, square_id, square_start, square_count, buf1);
nc_put_vara_int_all(ncid, cube_id,   cube_start,   cube_count,   buf2);
nc_put_vara_int_all(ncid, time_id,   time_start,   time_count,   buf3);
nc_put_vara_int_all(ncid, xytime_id, xytime_start, xytime_count, buf4);

status = nc_close(ncid);
Example Code - Read
• Open the dataset (collective)
  • The only change from serial netCDF: an MPI communicator is passed to nc_open
  • The input arguments must be the same on all processes
  • The returned ncid differs among processes but refers to the same dataset
  • All processes are placed in data mode
• Dataset inquiries (non-collective): counts, names, lengths, datatypes
• Read variable data
  • All processes issue a series of collective reads, one per variable, partitioned in a (B, *, *) manner
  • Independent reads can be used instead, if desired
  • Each process provides its own, locally computed argument values
• Close the dataset (collective)

status = nc_open(comm, filename, 0, &ncid);
status = nc_inq(ncid, &ndims, &nvars, &ngatts, &unlimdimid);

/* global attributes */
for (i = 0; i < ngatts; i++) {
    status = nc_inq_attname(ncid, NC_GLOBAL, i, name);
    status = nc_inq_att(ncid, NC_GLOBAL, name, &type, &len);
    status = nc_get_att_text(ncid, NC_GLOBAL, name, valuep);
}

/* variables */
for (i = 0; i < nvars; i++) {
    status = nc_inq_var(ncid, i, name, vartypes+i, varndims+i,
                        vardims[i], varnatts+i);
    /* variable attributes */
    for (j = 0; j < varnatts[i]; j++) {
        status = nc_inq_attname(ncid, varids[i], j, name);
        status = nc_inq_att(ncid, varids[i], name, &type, &len);
        status = nc_get_att_text(ncid, varids[i], name, valuep);
    }
}

/* variable data */
for (i = 0; i < NC_MAX_VAR_DIMS; i++)
    start[i] = 0;
for (i = 0; i < nvars; i++) {
    varsize = 1;
    /* dimensions */
    for (j = 0; j < varndims[i]; j++) {
        status = nc_inq_dim(ncid, vardims[i][j], name, shape + j);
        if (j == 0) {              /* partition along the first dimension */
            shape[j] /= nprocs;
            start[j] = shape[j] * rank;
        }
        varsize *= shape[j];
    }
    status = nc_get_vara_int_all(ncid, i, start, shape, (int *)valuep);
}
status = nc_close(ncid);
Non-contiguous Data Access on PVFS • Problem definition • Design approaches • Multiple I/O • Data sieving • PVFS list_io • Integration into MPI-IO • Experimental results • Artificial benchmark • FLASH application I/O • Tile visualization
Non-contiguous Data Access
• Data access that is not adjacent in memory or in file
  • Non-contiguous in memory, contiguous in file
  • Non-contiguous in file, contiguous in memory
  • Non-contiguous in file, non-contiguous in memory
• Two motivating applications
  • FLASH astrophysics application
  • Tile visualization
Multiple I/O Requests
• Intuitive strategy: one I/O request per contiguous data segment
• Produces a large number of I/O requests to the file system
• The communication cost between the application and the I/O servers becomes significant and can dominate the I/O time
Data Sieving I/O
• Reads a contiguous chunk from the file into a temporary buffer
• Extracts/updates only the requested portions
• Writes the buffer back to the file (for write operations)
• The number of I/O requests is reduced, but the I/O amount is increased
• The number of I/O requests depends on the size of the sieving buffer
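A minimal sketch of the data-sieving idea for a read, assuming a simple pread-based file interface and a caller-supplied, sorted list of (offset, length) requests; this illustrates the technique only and is not the ROMIO or PVFS implementation. A real implementation bounds the sieving buffer and issues several such chunks, as noted above.

/* Data-sieving read sketch: one large contiguous read covering the
 * requested extent, then copy out only the requested pieces. */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

int sieve_read(int fd, int n, const off_t file_off[], const size_t len[],
               char *dest[])
{
    /* Extent spanned by all requests (assumes file_off[] is sorted). */
    off_t  start  = file_off[0];
    size_t extent = (size_t)(file_off[n-1] + len[n-1] - start);

    char *buf = malloc(extent);
    if (!buf) return -1;

    /* Single contiguous request to the I/O servers. */
    if (pread(fd, buf, extent, start) != (ssize_t)extent) {
        free(buf);
        return -1;
    }

    /* Extract only the portions the caller actually asked for. */
    for (int i = 0; i < n; i++)
        memcpy(dest[i], buf + (file_off[i] - start), len[i]);

    free(buf);
    return 0;
}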
PVFS List_io
• Combines non-contiguous I/O requests into a single request
• Client support
  • APIs pvfs_list_read, pvfs_list_write
  • An I/O request carries a list of file offsets and file lengths
• I/O server support
  • The server waits for the trailing list of file offsets and lengths that follows the I/O request
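The following is a hypothetical sketch of what such a client-side list-I/O call might look like, based only on the description above and on the pvfs_read_list call named later in these slides (one request carrying lists of memory and file offsets/lengths); the actual PVFS prototype's name, argument order, and types may differ.

/* Hypothetical list-I/O client call and usage; not the real prototype. */
#include <stdint.h>

int pvfs_read_list(int fd,
                   int mem_count,  char *mem_offsets[], int32_t mem_lengths[],
                   int file_count, int64_t file_offsets[], int32_t file_lengths[]);

/* Example: read four 4 KB blocks spaced 1 MB apart in the file into one
 * contiguous user buffer, with a single request. */
void read_four_blocks(int fd, char *buf)
{
    char    *moff[1] = { buf };
    int32_t  mlen[1] = { 4 * 4096 };
    int64_t  foff[4];
    int32_t  flen[4];

    for (int i = 0; i < 4; i++) {
        foff[i] = (int64_t)i * (1 << 20);   /* every 1 MB */
        flen[i] = 4096;
    }
    pvfs_read_list(fd, 1, moff, mlen, 4, foff, flen);
}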
Artificial Benchmark
• Contiguous in memory, non-contiguous in file
• Parameters:
  • Number of accesses
  • Number of processors
  • Stride size = file size / number of accesses
  • Block size = stride size / number of processors
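A small sketch of the access pattern these parameters imply: within each of the n_access strides, the process with a given rank reads the block at its rank's position. Function and variable names are illustrative, not taken from the benchmark source.

/* Access pattern implied by the benchmark parameters (illustrative). */
#include <stdint.h>

void compute_offsets(int64_t file_size, int n_access, int nprocs, int rank,
                     int64_t offsets[], int64_t *block_size_out)
{
    int64_t stride_size = file_size   / n_access;  /* per the slide */
    int64_t block_size  = stride_size / nprocs;    /* per the slide */

    /* One block per stride, at this rank's position within the stride. */
    for (int i = 0; i < n_access; i++)
        offsets[i] = (int64_t)i * stride_size + (int64_t)rank * block_size;

    *block_size_out = block_size;
}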
Benchmark Results
[Charts: write and read times (seconds) versus number of accesses, comparing Multiple I/O, Data Sieving, and List_io]
• Parameter configuration
  • 8 clients
  • 8 I/O servers
  • 1 Gigabyte file size
• To avoid caching effects at the I/O servers, the benchmark reads/writes 4 files alternately, since each I/O server has 512 MB of memory
FLASH Application
• An astrophysics application developed at the University of Chicago
• Simulates the accretion of matter onto a compact star and the subsequent stellar evolution, including nuclear burning either on the surface of the compact star or in its interior
• The I/O benchmark measures the performance of FLASH output: checkpoint files and plot-files
• A typical large production run generates ~0.5 TB (100 checkpoint files and 1,000 plot-files)
[Image: the interior of an exploding star, depicting the distribution of pressure during a star explosion]
FLASH -- I/O Access Pattern
[Figure: memory organization of the FLASH block structure -- each block is a cube with guard cells surrounding an interior sub-cube; each element holds 24 variables (Variable 0 ... Variable 23); slices are cut along the X, Y, and Z axes]
• Each processor has 80 cubes (blocks)
• Each block has guard cells and an interior sub-cube which holds the data to be output
• Each element in the cube contains 24 variables, each of type double (8 bytes)
• Each variable is partitioned among all processors
• Output pattern: all variables are saved into a single file, one after another
FLASH I/O Results
• Access patterns
  • In memory
    • Each contiguous segment is small: 8 bytes
    • The stride between two segments is also small: 192 bytes
  • From memory to file
    • Multiple I/O: 8*8*8*80*24 = 983,040 requests per processor
    • Data sieving: 24 requests per processor
    • List_io: 8*8*8*80*24/64 = 15,360 requests per processor (64 is the maximum number of offset-length pairs per request)
  • In file
    • Each contiguous segment written by a processor is 8*8*8*8 = 4096 bytes
    • The output file size is 8 MB * number of processors
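As a concrete picture of the in-memory pattern (8-byte doubles separated by a 192-byte stride, i.e. one variable out of 24 per element), here is a hedged sketch using an MPI vector datatype. The block and variable counts come from the slides; everything else is illustrative, and guard cells are ignored for simplicity.

/* In-memory access pattern for writing one FLASH variable: pick one
 * 8-byte double out of every 24-double element (192-byte stride).
 * Illustrative sketch, not the FLASH I/O benchmark code. */
#include <mpi.h>

#define NVARS        24          /* variables per element (slide)  */
#define ELEMS_PER_PE (8*8*8*80)  /* interior elements per process  */

MPI_Datatype one_variable_memtype(void)
{
    MPI_Datatype memtype;

    /* count = number of elements, blocklength = 1 double,
     * stride = 24 doubles = 192 bytes between consecutive segments. */
    MPI_Type_vector(ELEMS_PER_PE, 1, NVARS, MPI_DOUBLE, &memtype);
    MPI_Type_commit(&memtype);
    return memtype;
}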
Tile Visualization
• Preprocesses "frames" into streams of tiles by staging tile data on the visualization nodes
• Read operations only
• Each node reads one sub-tile
• Each sub-tile has ghost regions that overlap with neighboring sub-tiles
• The non-contiguous nature of this access becomes apparent in a single node's logical file view
[Figure: a single node's file view of a frame divided into Tiles 1-6 on a 3x2 display; each process reads a non-contiguous set of rows]
• Example layout
  • 3x2 display
  • Frame size of 2532x1408 pixels
  • Tile size of 1024x768 with overlap
  • 3-byte RGB pixels
  • Each frame is stored as a file of size 10 MB
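A minimal sketch of how one node's sub-tile could be described as an MPI subarray of the frame file, using the layout above (2532x1408 frame, 1024x768 tiles, 3-byte pixels); the tile origin arguments are illustrative and ghost regions are ignored.

/* Describe one sub-tile of a 2532x1408, 3-byte-per-pixel frame as a
 * non-contiguous region of the frame file. Illustrative sketch only. */
#include <mpi.h>

MPI_Datatype tile_filetype(int row0, int col0)   /* tile origin in pixels */
{
    int sizes[2]    = { 1408, 2532 * 3 };   /* frame: rows x row bytes */
    int subsizes[2] = {  768, 1024 * 3 };   /* tile:  rows x row bytes */
    int starts[2]   = { row0,  col0 * 3 };
    MPI_Datatype ftype;

    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_BYTE, &ftype);
    MPI_Type_commit(&ftype);
    return ftype;
}

/* A node would then set this as its file view:
 *   MPI_File_set_view(fh, 0, MPI_BYTE, tile_filetype(r, c),
 *                     "native", MPI_INFO_NULL);
 * yielding 768 non-contiguous segments of 3072 bytes each. */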
Integrating List_io into ROMIO
[Figure: the flattened datatype (memory) and filetype (file) offset/length lists feed a pvfs_read_list(Memory offsets/lengths, File offsets/lengths) call]
• ROMIO uses its internal ADIO "flatten" routine to break both the filetype and the datatype down into lists of offset-length pairs
• Using these lists, ROMIO steps through the file and memory addresses together
• ROMIO generates matched memory and file offsets and lengths to pass to pvfs_list_io
• ROMIO calls pvfs_list_io once all data has been read, or when the configured maximum array size is reached, in which case a new list is generated
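Below is a hedged sketch of the pairing step described above: walk the flattened memory and file lists in lockstep, emit matched offset-length pairs, and flush through the list-I/O call whenever the maximum list size is reached. The flush_pairs function stands in for the actual pvfs_read_list/pvfs_list_io call, and the constants and names are illustrative rather than taken from ROMIO.

/* Pairing loop sketch: not ROMIO code. */
#include <stdint.h>

#define MAX_PAIRS 64   /* illustrative cap on offset-length pairs */

/* stand-in for the actual list-I/O call */
void flush_pairs(int n, int64_t moff[], int32_t mlen[],
                 int64_t foff[], int32_t flen[]);

void pair_and_flush(int nm, const int64_t m_off[], const int32_t m_len[],
                    int nf, const int64_t f_off[], const int32_t f_len[])
{
    int64_t moff[MAX_PAIRS], foff[MAX_PAIRS];
    int32_t mlen[MAX_PAIRS], flen[MAX_PAIRS];
    int     im = 0, ifl = 0, n = 0;
    int64_t mused = 0, fused = 0;   /* bytes consumed in current segments */

    while (im < nm && ifl < nf) {
        int64_t mrem = m_len[im]  - mused;  /* left in memory segment */
        int64_t frem = f_len[ifl] - fused;  /* left in file segment   */
        int64_t take = mrem < frem ? mrem : frem;

        /* Emit one matched memory/file pair of equal length. */
        moff[n] = m_off[im]  + mused;  mlen[n] = (int32_t)take;
        foff[n] = f_off[ifl] + fused;  flen[n] = (int32_t)take;
        n++;

        mused += take;  if (mused == m_len[im])  { im++;  mused = 0; }
        fused += take;  if (fused == f_len[ifl]) { ifl++; fused = 0; }

        if (n == MAX_PAIRS) { flush_pairs(n, moff, mlen, foff, flen); n = 0; }
    }
    if (n > 0)   /* flush the final partial list */
        flush_pairs(n, moff, mlen, foff, flen);
}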
Tile I/O Results
[Charts: accumulated I/O time versus number of I/O nodes (4, 8, 12, 16), for 4, 8, and 16 compute nodes and file sizes of 40 MB, 108 MB, 435 MB, and 1740 MB, comparing collective and non-collective data sieving against collective and non-collective read_list]
Analysis of Tile I/O Results
• Collective operations should in theory be faster, but...
• Hardware problem: over Fast Ethernet, the overhead of the collective I/O takes too long to be recovered relative to the independent I/O requests
• Software problem: a lot of extra data movement in the ROMIO collectives -- the aggregation is not as smart as it could be
• Planned work
  • Use the MPE logging facilities to pin down the problem
  • Study the ROMIO implementation, find the bottlenecks in the collectives, and try to weed them out
High Level Data Access Patterns • Study of file access patterns of astrophysics applications • FLASH from University of Chicago • ENZO from NCSA • Design of data management framework using XML and database • Essential metadata collection • Trigger rules for automatic I/O optimization
ENZO Application
• Simulates the formation of a cluster of galaxies, starting near the Big Bang and continuing to the present day
• Used to test theories of how galaxies form by comparing the results with what is actually observed in the sky today
• File I/O using HDF-4
• Dynamic load balancing using MPI
• Data partitioning: Adaptive Mesh Refinement (AMR)
AMR Data Access Pattern
• Adaptive Mesh Refinement partitions the problem domain into sub-domains recursively and dynamically
• A grid is owned by exactly one processor, but one processor can own many grids
• Check-pointing: each grid is written to a separate file (independent writes)
• During re-start
  • The sub-domain hierarchy need not be re-constructed
  • Grids at the same time stamp are read all together
• During visualization, all grids are combined into a top grid
AMR Hierarchy Represented in XML
• The AMR hierarchy maps naturally onto an XML hierarchy
• The XML is embedded in a relational database
• Metadata queries/updates go through the database
• The database can handle multiple queries simultaneously -- ideal for parallel applications

grid.xml:

<DataSet name="grid">
  <Producer name="astro" />
  <GridRank value="3" />
  <Grid id="0" level="0">
    <Dimension value="22 22 22"/>
    <Array name="density" dim="3">
      <type IsComplex="1"> float32, int32, double64 </type>
      <FileName value="grid0.dat">
    </Array>
    <Grid id="1" level="1">
      <Dimension value="10 8 12"/>
    </Grid>
    <Grid id="2" level="1">
      <Dimension value="5 6 4"/>
      <Grid id="3" level="2">
        <Dimension value="2 3 2"/>
      </Grid>
    </Grid>
  </Grid>
</DataSet>

grid.xml table (node key, type, name, parent key, cdata):

node  type       name      parent  cdata
0     element    DataSet   null
1     attribute  name      0       "grid"
2     element    Producer  0
3     attribute  name      2       "ENZO"
4     element    GridRank  0
5     attribute  value     4       3
6     element    Grid      4
7     attribute  id        6       0
8     attribute  level     6       0
File System Based XML
• The file system is used to support the decomposition of XML documents into files and directories
• This representation consists of an arbitrary hierarchy of directories and files; it preserves the XML philosophy of being textual, yet requires no further use of an XML parser to process the document
• Metadata is located near the scientific data
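As a purely hypothetical illustration of such a decomposition (the names below are invented for this sketch, not taken from the project), the grid.xml hierarchy above might be laid out with one directory per element and one small text file per attribute, with the array data files stored alongside:

grid/                        (DataSet element)
    name                     (attribute file containing "grid")
    Producer/name            ("ENZO")
    GridRank/value           (3)
    Grid.0/
        id  level            ("0", "0")
        Dimension/value      ("22 22 22")
        Array.density/FileName   (grid0.dat -- scientific data kept nearby)
        Grid.1/ ...
        Grid.2/
            Grid.3/ ...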
Summary
• List_io API incorporated into PVFS for non-contiguous data access
  • Read operation is completed
  • Write operation is in progress
• Parallel netCDF APIs
  • High-level APIs -- will be completed soon
  • Low-level APIs -- interfaces already defined
  • Validator
• High-level data access patterns
  • Access patterns of AMR applications
  • Other types of applications