Parallel I/O
Bronson Messer
Deputy Director of Science, OLCF
ORNL Theoretical Astrophysics Group
Dept. of Physics & Astronomy, University of Tennessee
I’ve done all this computation in parallel, but what’s the result?
• How do we get output from a parallel program?
• All the messages from individual processes to STDOUT sure are annoying, huh?
• How do we get input into a parallel program?
• Read and broadcast works just fine for command-line arguments (a sketch follows below)
• What if the input is TBs in size?
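A minimal sketch of the read-and-broadcast approach in C; the file name and parameter layout here are purely illustrative, not part of the original slides:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nparams = 0;
    double params[16];              /* hypothetical parameter buffer */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Only rank 0 touches the input file (error handling omitted) */
        FILE *fp = fopen("input.dat", "r");   /* hypothetical file name */
        while (nparams < 16 && fscanf(fp, "%lf", &params[nparams]) == 1)
            nparams++;
        fclose(fp);
    }

    /* Every other rank receives the values via broadcast */
    MPI_Bcast(&nparams, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(params, nparams, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}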
A few observations
• Most large-scale applications where I/O is important are heavy on the “O” and light on the “I”
• I/O really matters for performance: it is often the slowest ingredient, sometimes by orders of magnitude
Common Ways of Doing I/O in Parallel Programs
• Sequential I/O:
  • All processes send data to rank 0, and rank 0 writes it to the file
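A minimal sketch of this pattern in C, using MPI_Gather to stand in for the individual sends; the buffer size and file name are illustrative:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define LOCAL_N 1000   /* illustrative per-rank buffer size */

void sequential_write(const double *local, int myrank, int nprocs)
{
    double *global = NULL;

    if (myrank == 0)
        global = malloc((size_t)nprocs * LOCAL_N * sizeof(double));

    /* Everyone ships their chunk to rank 0 */
    MPI_Gather(local, LOCAL_N, MPI_DOUBLE,
               global, LOCAL_N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Only rank 0 touches the file */
    if (myrank == 0) {
        FILE *fp = fopen("output.dat", "wb");   /* hypothetical file name */
        fwrite(global, sizeof(double), (size_t)nprocs * LOCAL_N, fp);
        fclose(fp);
        free(global);
    }
}

Note that rank 0 must hold the entire dataset in memory before writing, which is exactly the single-node bottleneck listed on the next slide.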
Pros and Cons of Sequential I/O
• Pros:
  • parallel machine may support I/O from only one process (e.g., no common file system)
  • some I/O libraries (e.g., HDF-4, NetCDF) not parallel
  • resulting single file is handy for ftp, mv
  • big blocks improve performance
  • short distance from original, serial code
• Cons:
  • lack of parallelism limits scalability, performance (single-node bottleneck)
Another Way
• Each process writes to a separate file
• Pros:
  • parallelism, high performance
• Cons:
  • lots of small files to manage
  • difficult to read back data from a different number of processes
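A minimal sketch of the file-per-process approach; the naming pattern is illustrative:

#include <mpi.h>
#include <stdio.h>

void file_per_process_write(const double *local, int n, int myrank)
{
    char fname[64];

    /* One file per rank, e.g. out.0000, out.0001, ... */
    snprintf(fname, sizeof(fname), "out.%04d", myrank);

    FILE *fp = fopen(fname, "wb");
    fwrite(local, sizeof(double), (size_t)n, fp);
    fclose(fp);
}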
What is Parallel I/O?
• Multiple processes of a parallel program accessing data (reading or writing) from a common file
[Figure: processes P0, P1, P2, …, P(n-1) all accessing a single FILE]
Why Parallel I/O?
• Non-parallel I/O is simple but
  • poor performance (single process writes to one file) or
  • awkward and not interoperable with other tools (each process writes a separate file)
• Parallel I/O
  • provides high performance
  • can provide a single file that can be used with other tools (such as visualization programs)
  • this may or may not be true with ‘modern’ viz tools, but it sure is easier to wrangle one file than 144,000
Why is MPI a Good Setting for Parallel I/O?
• Writing is like sending a message and reading is like receiving
• Any parallel I/O system will need a mechanism to
  • define collective operations (MPI communicators)
  • define noncontiguous data layout in memory and file (MPI datatypes)
  • test completion of nonblocking operations (MPI request objects)
• i.e., lots of MPI-like machinery
Using MPI for Simple I/O
• Each process needs to write a chunk of data to a common file
[Figure: processes P0, P1, P2, …, P(n-1) each writing their own chunk of a single FILE]
Writing to a File
• Use MPI_File_write or MPI_File_write_at
• Use MPI_MODE_WRONLY or MPI_MODE_RDWR as the flags to MPI_File_open
• If the file doesn’t exist previously, the flag MPI_MODE_CREATE must also be passed to MPI_File_open
• We can pass multiple flags by using bitwise-or ‘|’ in C, or addition ‘+’ in Fortran
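A minimal sketch in C using the explicit-offset flavor; the file name "testfile" and BUFSIZE mirror the file-view example that appears later in these slides:

#include <mpi.h>
#define BUFSIZE 100

int main(int argc, char **argv)
{
    int buf[BUFSIZE], i, myrank;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    for (i = 0; i < BUFSIZE; i++)
        buf[i] = myrank * BUFSIZE + i;

    /* Combine the access flags with bitwise-or in C (addition in Fortran) */
    MPI_File_open(MPI_COMM_WORLD, "testfile",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Explicit-offset write: the offset argument is given in bytes */
    MPI_File_write_at(fh, (MPI_Offset)myrank * BUFSIZE * sizeof(int),
                      buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}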
Using File Views
• Processes write to a shared file
• MPI_File_set_view assigns regions of the file to separate processes
File Views
• Specified by a triplet (displacement, etype, and filetype) passed to MPI_File_set_view
• displacement = number of bytes to be skipped from the start of the file
• etype = basic unit of data access (can be any basic or derived datatype)
• filetype = specifies which portion of the file is visible to the process
File View Example

MPI_File thefile;

for (i = 0; i < BUFSIZE; i++)
    buf[i] = myrank * BUFSIZE + i;

MPI_File_open(MPI_COMM_WORLD, "testfile",
              MPI_MODE_CREATE | MPI_MODE_WRONLY,
              MPI_INFO_NULL, &thefile);

/* displacement is given in bytes, so scale by sizeof(int) */
MPI_File_set_view(thefile, (MPI_Offset)myrank * BUFSIZE * sizeof(int),
                  MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
MPI_File_close(&thefile);
A Simple File View Example
• etype = MPI_INT
• filetype = two MPI_INTs followed by a gap of four MPI_INTs
[Figure: from the head of the file, a displacement (in bytes) is skipped, then the filetype pattern is tiled repeatedly: two visible MPI_INTs, a four-MPI_INT gap, and so on...]
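A minimal sketch of how such a filetype might be built and installed as a view, assuming the "native" file representation (so the gap can be sized with sizeof(int)); the helper name and the three-rank interleaving are illustrative:

#include <mpi.h>

/* Hypothetical helper: gives a rank a view consisting of two MPI_INTs
   followed by a gap of four MPI_INTs, repeated through the file. */
void set_strided_view(MPI_File fh, int myrank)
{
    MPI_Datatype contig, filetype;
    MPI_Aint lb = 0, extent = 6 * sizeof(int);   /* pattern repeats every 6 ints */

    MPI_Type_contiguous(2, MPI_INT, &contig);                /* the two visible ints */
    MPI_Type_create_resized(contig, lb, extent, &filetype);  /* extend the extent to add the gap */
    MPI_Type_commit(&filetype);

    /* Stagger the ranks by two ints so their blocks interleave;
       with three ranks this tiles the file with no holes. */
    MPI_File_set_view(fh, (MPI_Offset)myrank * 2 * sizeof(int),
                      MPI_INT, filetype, "native", MPI_INFO_NULL);

    MPI_Type_free(&contig);   /* filetype stays in use while the view is active */
}

After the view is set, an ordinary MPI_File_write on that rank touches only its own two-int blocks and skips the gaps automatically.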
Ways to Write to a Shared File
• MPI_File_seek: like Unix seek
• MPI_File_read_at / MPI_File_write_at: combine seek and I/O for thread safety
• MPI_File_read_shared / MPI_File_write_shared: use the shared file pointer; good when order doesn’t matter
• Collective operations
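For the collective flavors, _all is appended to the name. As a minimal variation on the earlier explicit-offset sketch (same fh, buf, BUFSIZE, and myrank), the write becomes:

/* Collective variant: every rank in the communicator must make this call,
   which lets the MPI-IO layer aggregate and reorder the requests. */
MPI_File_write_at_all(fh, (MPI_Offset)myrank * BUFSIZE * sizeof(int),
                      buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);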
Whew, this is getting complicated…
• The bad news: it’s an inherently complicated process, and in “real life” it gets even more complicated
E.g., to use Lustre on Jaguar, all of these factors interact. This is why I/O is complicated…
• Total output size
• Number of files per output dump
• File size per processor
• Number of processors
• Weak vs. strong scaling
• Transfer size
• Blocksize
• Striping
• Chunking
• Strided or contiguous access
• Collective vs. independent
• I/O library
• File system hints
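Several of these knobs, file system hints in particular, can be passed to MPI-IO through an MPI_Info object. A minimal sketch, assuming the ROMIO-style hint names striping_factor and striping_unit (implementations silently ignore hints they do not understand); the file name and values are illustrative:

#include <mpi.h>

MPI_Info info;
MPI_File fh;

MPI_Info_create(&info);
/* Hint names and effects are implementation-defined; these are the common ROMIO ones */
MPI_Info_set(info, "striping_factor", "16");     /* number of OSTs to stripe across */
MPI_Info_set(info, "striping_unit", "1048576");  /* 1 MB stripe size */

MPI_File_open(MPI_COMM_WORLD, "dumpfile",        /* hypothetical file name */
              MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
MPI_Info_free(&info);

/* ... writes ... */

MPI_File_close(&fh);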
… but, if we have to go through all this…
• …we might as well get something out of it!
• What if we could have output files that
  • were easily portable from machine to machine,
  • could be read by a variety of visualization packages,
  • could be ‘looked at’ with a simple command,
  • and could be written fast?
• I’d buy that, right?
• What’s the cost? A LOT of typing…
Parallel HDF5
• Can store data structures, arrays, vectors, grids, complex data types, text
• Can use basic HDF5 types (integers, floats, reals) or user-defined types such as multi-dimensional arrays, objects, and strings
• Stores the metadata necessary for portability: endian type, size, architecture
• Not the only choice for platform-independent parallel I/O, but it is the most common
Data Model
• Groups
  • Arranged in a directory hierarchy
  • root group is always ‘/’
• Datasets
  • Dataspace
  • Datatype
• Attributes
  • Bind to Group & Dataset
• References
  • Similar to softlinks
  • Can also be subsets of data
[Figure: an example hierarchy. The root group “/” carries attributes “author”=Jane Doe and “date”=10/24/2006 and holds “Dataset0” and “Dataset1” (each with a type and space; “Dataset1” has attributes “time”=0.2345 and “validity”=None) plus a subgroup “subgrp” containing “Dataset0.1” and “Dataset0.2”]
This is why all the typing is necessary…
The pain begins
• First, create the parallel file

comm = MPI_COMM_WORLD
info = MPI_INFO_NULL

CALL MPI_INIT(mpierror)

! Initialize FORTRAN predefined datatypes
CALL h5open_f(error)

! Setup file access property list for MPI-IO access.
CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, error)
CALL h5pset_fapl_mpio_f(plist_id, comm, info, error)

! Create the file collectively.
CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)

! Close the file.
CALL h5fclose_f(file_id, error)

! Close FORTRAN interface
CALL h5close_f(error)

CALL MPI_FINALIZE(mpierror)
Pain continued…
• Then create a dataset

CALL h5fcreate_f(filename, H5F_ACC_TRUNC_F, file_id, error, access_prp = plist_id)

CALL h5screate_simple_f(rank, dimsf, filespace, error)

! Create the dataset with default properties.
CALL h5dcreate_f(file_id, "dataset1", H5T_NATIVE_INTEGER, filespace, dset_id, error)

! Close the dataset.
CALL h5dclose_f(dset_id, error)

! Close the file.
CALL h5fclose_f(file_id, error)
More typing…
• Then, write to the dataset…

! Create property list for collective dataset write
CALL h5pcreate_f(H5P_DATASET_XFER_F, plist_id, error)
CALL h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_COLLECTIVE_F, error)

! Write the dataset collectively (dimsf holds the dataset dimensions).
CALL h5dwrite_f(dset_id, H5T_NATIVE_INTEGER, data, dimsf, error, &
                file_space_id = filespace, &
                mem_space_id = memspace, &
                xfer_prp = plist_id)
h5dump is a simple way to take a look

HDF5 "SDS_row.h5" {
GROUP "/" {
   DATASET "dataset1" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 8, 5 ) / ( 8, 5 ) }
      DATA {
         10, 10, 10, 10, 10,
         10, 10, 10, 10, 10,
         11, 11, 11, 11, 11,
         11, 11, 11, 11, 11,
         12, 12, 12, 12, 12,
         12, 12, 12, 12, 12,
         13, 13, 13, 13, 13,
         13, 13, 13, 13, 13
      }
   }
}
}
But, in the end, you get a file you can use
• 11,552 processors, 1 file
Stuff to remember
• Big, multiphysics, parallel applications spew 10s-100s of TB over the course of a run
• Without parallel I/O, millions(!!!) of files would have to be produced
• Parallel I/O has to perform well if you don’t want to spend most of your time writing to disk
• Figuring out how to write, and optimizing performance, takes considerable thought (i.e., probably as much as the original program design)
• Platform-independent formats like HDF5 might not make that process easier, but you get a lot of added value