IO Best Practices For Franklin
Katie Antypas, User Services Group
Kantypas@lbl.gov
NERSC User Group Meeting, September 19, 2007
Outline
• Goals and scope of tutorial
• IO Formats
• Parallel IO strategies
• Striping
• Recommendations
Thanks to Julian Borrill, Hongzhang Shan, John Shalf and Harvey Wasserman for slides and data, Nick Cardo for Franklin/Lustre tutorials and the NERSC-IO group for feedback
NERSC User Group Meeting, September 17, 2007
Goals
• Answer, at a very high level, the question “How should I do my IO on Franklin?”
• With X GB of data to output running on Y processors, do this.
Axis of IO
This is why IO is complicated...
• Total Output Size
• File System Hints
• Transfer Size
• Blocksize
• Collective vs Independent
• Weak vs Strong Scaling
• Number of Files per Output Dump
• Number of Processors
• Strided or Contiguous Access
• Striping
• Chunking
• IO Library
• File Size Per Processor
Axis of IO
• Total File Size
• Transfer Size and Blocksize: primarily large block IO, transfer size same as blocksize
• Strong Scaling
• Number of Writers
• Number of Processors
• Striping: some basic tips
• IO Library: used HDF5
• File Size Per Processor
Parallel I/O: A User Perspective
• Wish List
  • Write data from multiple processors into a single file
  • File can be read in the same manner regardless of the number of CPUs that read from or write to the file (e.g. want to see the logical data layout, not the physical layout)
  • Do so with the same performance as writing one-file-per-processor (only writing one-file-per-processor because of performance problems)
  • And make all of the above portable from one machine to the next
I/O Formats
Common Storage Formats
Slide callout: Many NERSC users are at this level. We would like to encourage users to transition to a higher-level IO library.
• ASCII
  • Slow
  • Takes more space!
  • Inaccurate
• Binary
  • Non-portable (e.g. byte ordering and type sizes)
  • Not future proof
  • Parallel I/O using MPI-IO
• Self-Describing formats
  • NetCDF/HDF4, HDF5, Parallel NetCDF
  • Example in HDF5: API implements Object DB model in portable file
  • Parallel I/O using: pHDF5/pNetCDF (hides MPI-IO)
• Community File Formats
  • FITS, HDF-EOS, SAF, PDB, Plot3D
  • Modern implementations built on top of HDF, NetCDF, or other self-describing object-model API
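The ASCII and raw-binary drawbacks listed on this slide can be illustrated with a small standard-library Python sketch. The sample value and the `%.6e`-style text format are assumptions chosen for illustration; they are not taken from the talk.

```python
import struct

value = 0.1234567890123456  # arbitrary double-precision sample (assumption)

# ASCII with a typical short format: more bytes than binary, and digits are lost.
text = f"{value:.6e}"
ascii_bytes = text.encode("ascii")
roundtrip_from_text = float(text)

# Raw binary: 8 bytes and an exact round trip, but the byte order is baked in.
binary_le = struct.pack("<d", value)  # little-endian layout
binary_be = struct.pack(">d", value)  # big-endian layout
roundtrip_from_binary = struct.unpack("<d", binary_le)[0]

# Reading little-endian bytes as if they were big-endian gives a wrong number,
# which is the portability problem a self-describing format records and handles.
misread = struct.unpack(">d", binary_le)[0]

print(len(ascii_bytes), len(binary_le))
print(roundtrip_from_text == value, roundtrip_from_binary == value)
print(misread == value)
```

Self-describing formats such as HDF5 store the datatype and byte order alongside the data, so the reader does not have to know which of the two binary layouts was written.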
HDF5 Library
HDF5 is a general purpose library and file format for storing scientific data
• Can store data structures, arrays, vectors, grids, complex data types, text
• Can use basic HDF5 types (integers, floats, reals) or user defined types such as multi-dimensional arrays, objects and strings
• Stores metadata necessary for portability: endian type, size, architecture
HDF5 Data Model
• Groups
  • Arranged in directory hierarchy
  • root group is always ‘/’
• Datasets
  • Dataspace
  • Datatype
• Attributes
  • Bind to Group & Dataset
• References
  • Similar to softlinks
  • Can also be subsets of data
[Figure: example HDF5 file. Root group “/” with attributes “author”=Jane Doe and “date”=10/24/2006; datasets “Dataset0” and “Dataset1” (each with type and space); dataset attributes “time”=0.2345 and “validity”=None; group “subgrp” holding “Dataset0.1” and “Dataset0.2” (each with type and space)]
A Plug for Self Describing Formats ...
• Application developers shouldn’t care about the physical layout of data
• Using own binary file format forces user to understand layers below the application to get optimal IO performance
• Every time code is ported to a new machine or the underlying file system is changed or upgraded, user is required to make changes to improve IO performance
• Let other people do the work
  • HDF5 can be optimized for given platforms and file systems by HDF5 developers
  • User can stay with the high level
• But what about performance?
IO Library Overhead
• Very little, if any, overhead from HDF5 for one file per processor IO compared to POSIX and MPI-IO
Data from Hongzhang Shan
Ways to do Parallel IO
Serial I/O
[Figure: processors 0 1 2 3 4 5 send data to a master, which writes one File]
• Each processor sends its data to the master who then writes the data to a file
• Advantages
  • Simple
  • May perform ok for very small IO sizes
• Disadvantages
  • Not scalable
  • Not efficient, slow for any large number of processors or data sizes
  • May not be possible if memory constrained
Parallel I/O Multi-file
[Figure: processors 0 1 2 3 4 5, each writing its own File]
• Each processor writes its own data to a separate file
• Advantages
  • Simple to program
  • Can be fast (up to a point)
• Disadvantages
  • Can quickly accumulate many files
  • With Lustre, hit metadata server limit
  • Hard to manage
  • Requires post processing
  • Difficult for storage systems, HPSS, to handle many small files
Flash Center IO Nightmare…
• Large 32,000 processor run on LLNL BG/L
• Parallel IO libraries not yet available
• Intensive I/O application
  • checkpoint files: 0.7 TB, dumped every 4 hours, 200 dumps
    • used for restarting the run
    • full resolution snapshots of entire grid
  • plotfiles: 20 GB each, 700 dumps
    • coarsened by a factor of two averaging
    • single precision
    • subset of grid variables
  • particle files: 1400 particle files, 470 MB each
• 154 TB of disk capacity
• 74 million files!
• Unix tool problems
• 2 years later, still trying to sift through data and sew files together
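The 154 TB total follows from the dump counts and sizes listed on this slide. A short Python check, assuming decimal units (1 TB = 1000 GB, 1 GB = 1000 MB), which the slide does not state explicitly:

```python
# All sizes in GB, decimal units assumed (1 TB = 1000 GB, 1 GB = 1000 MB).
checkpoint_gb = 700 * 200           # 0.7 TB per checkpoint, 200 dumps
plotfile_gb = 20 * 700              # 20 GB per plotfile dump, 700 dumps
particle_gb = 1400 * 470 // 1000    # 1400 particle files, 470 MB each

total_gb = checkpoint_gb + plotfile_gb + particle_gb
total_tb = total_gb / 1000

print(checkpoint_gb, plotfile_gb, particle_gb)
print(total_tb)  # 154.658
```

Checkpoints account for about 140 TB of the total, so the checkpoint dump size and frequency dominate the storage requirement.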
Parallel I/O Single-file
[Figure: processors 0 1 2 3 4 5 all writing to one File]
• Each processor writes its own data to the same file using MPI-IO mapping
• Advantages
  • Single file
  • Manageable data
• Disadvantages
  • Lower performance than one file per processor at some concurrencies
Parallel IO single file
[Figure: processors 0 1 2 3 4 5 each write a section of a shared array of data: 3 5 2 9 2 4 3 1 9 8 2 4]
Each processor writes to a section of a data array. Each must know its offset from the beginning of the array and the number of elements to write.
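The offset bookkeeping can be sketched in plain Python, with a loop standing in for the processors. This is an illustration under assumptions, not the MPI-IO or HDF5 code path: the split of the twelve values into two per rank, the 4-byte integer element type, and the helper names `section_offsets` and `write_section` are all hypothetical. In an MPI program the offsets would typically come from an exclusive scan over the per-rank counts (for example `MPI_Exscan`).

```python
import os
import struct
import tempfile
from itertools import accumulate

ITEM = struct.Struct("<i")  # one 4-byte little-endian integer per element (assumption)

def section_offsets(counts):
    """Exclusive prefix sum: element offset at which each rank's section starts."""
    return [0] + list(accumulate(counts))[:-1]

def write_section(path, values, element_offset):
    """Write one rank's values at its own byte offset in the shared file."""
    with open(path, "r+b") as f:
        f.seek(element_offset * ITEM.size)
        f.write(struct.pack(f"<{len(values)}i", *values))

# Hypothetical decomposition: 6 ranks holding 2 elements each.
rank_data = [[3, 5], [2, 9], [2, 4], [3, 1], [9, 8], [2, 4]]
counts = [len(values) for values in rank_data]
offsets = section_offsets(counts)

fd, path = tempfile.mkstemp()
os.close(fd)

# Ranks may write in any order because each one owns a disjoint byte range.
for rank in reversed(range(len(rank_data))):
    write_section(path, rank_data[rank], offsets[rank])

with open(path, "rb") as f:
    raw = f.read()
shared_array = list(struct.unpack(f"<{len(raw) // ITEM.size}i", raw))
os.remove(path)

print(offsets)
print(shared_array)
```

The resulting file holds the logical array in rank order regardless of which rank wrote first, which is the property that lets a different number of processors read it back later.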
Trade offs
• Ideally users want speed, portability and usability
  • speed: one file per processor
  • portability: high level IO library
  • usability
    • single shared file and
    • own file format or community file format layered on top of high level IO library
It isn’t hard to have speed, portability or usability. It is hard to have speed, portability and usability in the same implementation.
Benchmarking Methodology and Results
Disclaimer
• IO runs done during production time
• Rates dependent on other jobs running on the system
• Focus on trends rather than one or two outliers
• Some tests ran twice, others only once
Peak IO Performance on Franklin
• Expectation that IO rates will continue to rise linearly
• Back end saturated around 250 processors
• Weak scaling IO, ~300 MB/proc
• Peak performance ~11 GB/sec (5 DDNs * ~2 GB/sec)
Image from Julian Borrill
Description of IOR
• Developed by LLNL, used for the Purple procurement
• Focuses on parallel/sequential read/write operations that are typical in scientific applications
• Can exercise one file per processor or shared file access for a common set of testing parameters
• Exercises an array of modern file APIs such as MPI-IO, POSIX (shared or unshared), HDF5 and parallel-netCDF
• Parameterized parallel file access patterns to mimic different application situations
Benchmark Methodology
Focus on performance difference between single shared file and one file per processor
[Figure: processors 0 1 2 3 4 5 writing to one shared File, next to processors 0 1 2 3 4 5 each writing to a separate File]
Benchmark Methodology
• Using IOR HDF5 Interface
• Contiguous IO
• Not intended to be a scaling study
• Blocksize and transfer size always the same but vary from run to run
• Goal is to fill out opposite chart with best IO strategy
[Chart: Processors (256, 512, 1024, 2048, 4096) vs Aggregate Output Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]
Small Aggregate Output Sizes 100 MB - 1 GB
One File per Processor vs Shared File - GB/Sec
[Plots: Aggregate File Size 100 MB; Aggregate File Size 1 GB]
• Peak performance line: anything greater than this is due to caching effects or timer granularity
• Clearly the ‘one file per processor’ strategy wins in the low concurrency cases, correct?
Small Aggregate Output Sizes 100 MB - 1 GB
One File per Processor vs Shared File - Time
[Plots: Aggregate File Size 100 MB; Aggregate File Size 1 GB]
• But when looking at absolute time, the difference doesn’t seem so big...
Aggregate Output Size 100 GB
One File per Processor vs Shared File
[Plots: Rate (GB/sec) with peak performance line; Time (seconds); annotations: 2.5 mins, 390 MB/proc, 24 MB/proc]
• Is there anything we can do to improve the performance of the 4096 processor shared file case?
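The per-processor annotations are consistent with dividing the aggregate output size by the processor count in decimal units. The sketch below assumes that the 390 MB/proc and 24 MB/proc labels correspond to 256 and 4096 processors, which is an inference from the arithmetic rather than something the slide states. The lower-bound time uses the ~11 GB/sec peak quoted earlier in the talk; `mb_per_proc` and `min_seconds_at_peak` are hypothetical helper names.

```python
def mb_per_proc(aggregate_gb, nprocs):
    """Output size per processor in MB, using decimal units (1 GB = 1000 MB)."""
    return aggregate_gb * 1000 / nprocs

def min_seconds_at_peak(aggregate_gb, peak_gb_per_s=11.0):
    """Best-case write time if the whole file system peak were available."""
    return aggregate_gb / peak_gb_per_s

print(mb_per_proc(100, 256))    # 390.625
print(mb_per_proc(100, 4096))   # 24.4140625
print(mb_per_proc(1000, 1024))  # 976.5625
print(mb_per_proc(1000, 4096))  # 244.140625
print(round(min_seconds_at_peak(100), 1))
```

The same arithmetic reproduces the 976 MB/proc and 244 MB/proc labels quoted for the 1 TB runs, assuming 1024 and 4096 processors respectively.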
Hybrid Model
[Figure: processors 0 1 2 3 4 5 split into groups, each group writing to its own shared File]
• Examine 4096 processor case more closely
• Group subsets of processors to write to separate shared files
• Try grouping 64, 256, 512, 1024, and 2048 processors to see performance difference from file per processor case vs single shared file case
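A minimal Python sketch of the grouping, assuming contiguous blocks of ranks per file; the talk does not say how ranks were assigned, and `rank % nfiles` would be an equally valid choice. The helper names and the file name pattern are hypothetical. In an MPI code, one common way to build such groups is to split the communicator with `MPI_Comm_split`, using the file index as the color.

```python
def file_index(rank, procs_per_file):
    """Index of the shared file a rank writes to (contiguous blocks of ranks)."""
    return rank // procs_per_file

def group_ranks(nprocs, procs_per_file):
    """Map each file index to the list of ranks that share that file."""
    groups = {}
    for rank in range(nprocs):
        groups.setdefault(file_index(rank, procs_per_file), []).append(rank)
    return groups

def file_name(rank, procs_per_file, stem="out"):
    """Hypothetical naming scheme for the per-group shared files."""
    return f"{stem}_{file_index(rank, procs_per_file):04d}.h5"

nprocs = 4096
for procs_per_file in (64, 256, 512, 1024, 2048):
    nfiles = len(group_ranks(nprocs, procs_per_file))
    print(procs_per_file, "procs per file ->", nfiles, "files")

print(file_name(130, 64))
```

Within each group the offset calculation for a single shared file applies unchanged, using the rank's position inside its group instead of its global rank.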
Effect of Grouping Processors into Separate Smaller Shared Files
100 GB Aggregate Output Size on 4096 procs
• Each processor writes out 24 MB
• Only difference between runs is the number of files into which processors are grouped
• Created a new MPI communicator in IOR for multiple shared files
• User gains some from grouping files
• Since very little data is written per processor, overhead for synchronization dominates
[Plot: Number of Files; cases: Single Shared File, 2048 procs write to single file, 512 procs write to single file, 64 procs write to single file, 1 file per proc]
Aggregate Output Size 1 TB
One File per Processor vs Shared File
[Plots: Rate (GB/sec); Time (seconds); annotations: ~3 mins, 976 MB/proc, 244 MB/proc]
• Is there anything we can do to improve the performance of the 4096 processor shared file case?
Effect of Grouping Processors into Separate Smaller Shared Files
• Each processor writes out 244 MB
• Only difference between runs is the number of files into which processors are grouped
• Created a new MPI communicator in IOR for multiple shared files
• Effect from grouping files is fairly substantial
• But do users want to do this?
• Important to show HDF5 developers, so that splitting files can be made easier in the API
[Plot cases: Single Shared File, 2048 procs write to single file, 512 procs write to single file, 64 procs write to single file, 1 file per proc]
Effect of Grouping Processors into Separate Smaller Shared Files
• Each processor writes out 488 MB
• Only difference between runs is the number of files into which processors are grouped
• Created a new MPI communicator in IOR for multiple shared files
[Plot cases: Single Shared File, 512 procs write to single file, 64 procs write to single file, 1 file per proc]
What is Striping?
• The Lustre file system on Franklin is made up of an underlying set of file systems called Object Storage Targets (OSTs), essentially a set of parallel IO servers
• A file is said to be striped when read and write operations access multiple OSTs concurrently
• Striping can be a way to increase IO performance, since writing to or reading from multiple OSTs simultaneously increases the available IO bandwidth
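A simplified model of round-robin striping shows why a large contiguous write can use several OSTs at once. This sketch assumes a plain RAID-0 style layout in which consecutive stripes go to consecutive members of the file's OST set; the 1 MB stripe size and stripe count of 4 are example values, and both function names are hypothetical.

```python
MiB = 1 << 20

def stripe_index(offset, stripe_size, stripe_count):
    """Which member of the file's OST set holds the byte at this file offset."""
    return (offset // stripe_size) % stripe_count

def osts_touched(offset, length, stripe_size, stripe_count):
    """Set of OST-set members touched by a contiguous access of length > 0."""
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    if last - first + 1 >= stripe_count:
        return set(range(stripe_count))
    return {s % stripe_count for s in range(first, last + 1)}

# A 24 MiB write spans every OST in a 4-way striped file...
print(osts_touched(0, 24 * MiB, MiB, 4))
# ...while a write smaller than one stripe lands on a single OST.
print(osts_touched(0, MiB // 2, MiB, 4))
print(stripe_index(5 * MiB, MiB, 4))  # 1
```

Under this model, accesses smaller than the stripe size gain nothing from a higher stripe count, which matches the talk's advice that striping mainly helps large shared files.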
What is Striping?
• File striping will most likely improve performance for applications which read or write to a single (or multiple) large shared files
• Striping will likely have little effect for the following types of IO patterns
  • Serial IO where a single processor performs all the IO
  • Multiple nodes perform IO, but access files at different times
  • Multiple nodes perform IO simultaneously to different files that are small (each < 100 MB)
  • One file per processor
Striping Commands
• Striping can be set at a file or directory level
• Set striping on a directory and all files created in that directory will inherit the striping level of the directory
• Moving a file into a directory with a set striping will NOT change the striping of that file
lfs setstripe <directory|file> <stripe size> <OST Offset> <stripe count>
• stripe size
  • Number of bytes in each stripe (multiple of 64k block)
• OST offset
  • Always keep this -1
  • Choose starting OST in round robin
• stripe count
  • Number of OSTs to stripe over
  • -1 stripe over all OSTs
  • 1 stripe over one OST
Stripe-Count Suggestions
• Franklin Default Striping
  • 1 MB stripe size
  • Round robin starting OST (OST Offset -1)
  • Stripe over 4 OSTs (Stripe count 4)
• Many small files, one file per proc
  • Use default striping
  • Or 0 -1 1
• Large shared files
  • Stripe over all available OSTs (0 -1 -1)
  • Or some number larger than 4 (0 -1 X)
• Stripe over odd numbers?
• Prime numbers?
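The suggestions can be turned into command lines with a small helper. This sketch only builds the argument list in the positional form shown in this talk, `lfs setstripe <directory|file> <stripe size> <OST Offset> <stripe count>`, and does not run anything. The directory names and the function name are hypothetical, and a stripe size of 0 is taken to mean "keep the default" because that is how the talk's own examples use it. Later Lustre releases moved to option flags, so check `lfs help setstripe` on the target system before using this form.

```python
def setstripe_args(path, stripe_size=0, ost_offset=-1, stripe_count=4):
    """Build the positional lfs setstripe command shown in the talk (not executed)."""
    return ["lfs", "setstripe", path,
            str(stripe_size), str(ost_offset), str(stripe_count)]

# Hypothetical output directories for the two cases in the suggestions.
shared_cmd = setstripe_args("shared_output_dir", 0, -1, -1)   # stripe over all OSTs
per_proc_cmd = setstripe_args("per_proc_output_dir", 0, -1, 1)  # one OST per file

print(" ".join(shared_cmd))
print(" ".join(per_proc_cmd))
```

Because files inherit the striping of the directory they are created in, setting the striping once on an output directory before the run is usually simpler than setting it per file.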
Recommendations
[Chart: recommended IO strategy by number of Processors (256, 512, 1024, 2048, 4096) and Aggregate File Size (100 MB, 1 GB, 10 GB, 100 GB, 1 TB)]
Legend
• Single Shared File, Default or No Striping
• Single Shared File, Stripe over some OSTs (~10)
• Single Shared File, Stripe over many OSTs
• Single Shared File, Stripe over many OSTs OR File per processor with default striping
• Benefits to mod n shared files
Recommendations
• Think about the big picture
  • Run time vs post processing trade off
  • Decide how much IO overhead you can afford
  • Data Analysis
  • Portability
  • Longevity
    • h5dump works on all platforms
    • Can view an old file with h5dump
    • If you use your own binary format you must keep track of not only your file format version but the version of your file reader as well
  • Storability
Recommendations
• Use a standard IO format, even if you are following a one file per processor model
• One file per processor model really only makes some sense when writing out very large files at high concurrencies; for small files, overhead is low
• If you must do one file per processor IO then at least put it in a standard IO format so pieces can be put back together more easily
• Splitting large shared files into a few files appears promising
  • Option for some users, but requires code changes and output format changes
  • Could be implemented better in IO library APIs
• Follow striping recommendations
• Ask the consultants, we are here to help!
Questions?