Real World IO Experiences With The FLASH Code
May 19, 2008
Anshu Dubey
Outline
• I/O options for parallel applications
• Parallel I/O: what we want
• FLASH I/O
  • Libraries
  • Checkpointing and restarts
  • Plotfiles
  • Performance
• Experiences
  • Non-adaptive turbulence simulation on BG/L
  • Fully adaptive GCD simulations on Power 5 / XT4
  • Fully adaptive RTFlame simulations on BG/P
I/O Options: Serial I/O
[Diagram: processors 0-5 all send their data to a single file via the master]
• Each processor sends its data to the master, which then writes the data to a single file (a minimal sketch follows)
• Advantages:
  • No parallel file system needed
  • Simple
• Disadvantages:
  • Not scalable
  • Not efficient
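A minimal sketch of this master-writes pattern in C with MPI, assuming each rank owns a fixed-size array of doubles; the array size and file name are illustrative.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* elements owned by each rank (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local[N];
    for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

    /* The master gathers every rank's data, then writes one file. */
    double *all = NULL;
    if (rank == 0) all = malloc((size_t)nprocs * N * sizeof(double));
    MPI_Gather(local, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *f = fopen("snapshot.bin", "wb");
        fwrite(all, sizeof(double), (size_t)nprocs * N, f);
        fclose(f);
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```

The bottleneck is plain to see: rank 0 must hold and write everyone's data, which is why the pattern is neither scalable nor efficient.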
I/O Options: Parallel Multi-file
[Diagram: processors 0-5 each write their own data to a separate file]
• Each processor writes its own data to a separate file (see the fragment below)
• Advantages:
  • Fast!
• Disadvantages:
  • Can quickly accumulate many files
  • Hard to manage
  • Requires post-processing
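By contrast, the file-per-process pattern needs no gather at all; each rank simply writes its own file. A fragment, reusing rank, local, and N from the previous sketch:

```c
/* Each rank writes its piece to its own file; with 32,768 ranks this
 * yields 32,768 files per snapshot, hence the management problem. */
char fname[64];
snprintf(fname, sizeof fname, "snapshot_%05d.bin", rank);
FILE *f = fopen(fname, "wb");
fwrite(local, sizeof(double), N, f);
fclose(f);
```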
I/O Options: Parallel I/O to a Single File
[Diagram: processors 0-5 each write their own section of one data array in a single file]
• Each processor writes its own data to a section of the data array in the same file
• Each must know the offset and the number of elements to write (see the MPI-IO sketch below)
• Advantages:
  • Single file, scalable
• Disadvantages:
  • Requires MPI-IO mappings or other higher-level libraries
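A sketch of the shared-file pattern with MPI-IO, again reusing rank, local, and N from the first sketch: every rank derives its byte offset from its rank, and all ranks write collectively.

```c
/* All ranks open the same file; each writes its slice at a
 * rank-derived offset. The collective call lets the MPI library
 * aggregate and optimize the accesses. */
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "snapshot.bin",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
MPI_File_write_at_all(fh, offset, local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
```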
What Applications Want from I/O
• Write data from multiple processors into a single file
• Read the file in the same manner regardless of the number of CPUs that read from or write to it
  • See the logical data layout, not the physical layout
• Do all of this with little overhead
  • The same performance as writing one file per processor
• Make all of the above, including the files, portable across platforms
  • Self-describing formats
I/O Formats in FLASH
• The distribution comes with support for the HDF5 and PnetCDF libraries, plus basic support for direct binary I/O
  • The direct binary format is for "all else failed" situations only
• Both libraries are:
  • Portable
  • Built on MPI-IO mappings
  • Self-describing, translating data between systems
• The two libraries can be used interchangeably in FLASH (a parallel HDF5 write is sketched below)
  • PnetCDF has better performance
  • HDF5 is more robust and supports larger file sizes
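To make the single-shared-file idea concrete at the library level, here is a hedged sketch of a parallel HDF5 write: one dataset in one shared file, each rank writing its own hyperslab. This is not FLASH's actual I/O code; the file name, dataset name, and 1-D layout are illustrative.

```c
#include <hdf5.h>
#include <mpi.h>

void write_parallel_hdf5(const double *local, hsize_t n_local,
                         int rank, int nprocs) {
    /* Open the file collectively through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);

    /* One global 1-D dataset; each rank selects its slab by offset. */
    hsize_t n_global = n_local * (hsize_t)nprocs;
    hid_t filespace = H5Screate_simple(1, &n_global, NULL);
    hid_t dset = H5Dcreate(file, "density", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start = n_local * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL,
                        &n_local, NULL);
    hid_t memspace = H5Screate_simple(1, &n_local, NULL);

    /* Collective write: all ranks participate in a single call. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
}
```

Because the file records the logical layout, a reader can open it with any number of processes, which is exactly the portability property listed above.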
I/O in FLASH
• Large files:
  • Checkpoint files: save the full state of the simulation
  • Plot files: data for analysis
• Smaller files:
  • Dat files: integrated quantities, output as serial files
  • Log files: status reports of the run and logging of important run-specific information
  • Input files: some simulations need to read files for initialization or table lookups
FLASH Checkpoint Files
• Simulations can stop at intermediate times for many reasons:
  • Unintentionally: machine failure, the queue window closing, execution failure, etc.
  • Intentionally, for analysis of intermediate results:
    • Determine whether the simulation is proceeding as expected
    • Decide whether any parameters need to change
    • Tweak the direction of the simulation
• It is crucial to be able to resume the simulation after most of the above stoppages
• Checkpoint files save the full state of the simulation
  • Simulations can be restarted transparently from checkpoint files
More on Checkpoint Files
• Checkpoint files are saved in full precision
• They have no knowledge of the hardware that wrote them, so restarts can be done on:
  • A different number of processors
  • A different platform
• Typical Flash Center production-run checkpoints are a few GB in size
  • Some larger runs have reached a few hundred GB for a single checkpoint
• Because they are large and take a long time to write, their frequency must be chosen judiciously
• When disk space is an issue we use rolling checkpoints (see the sketch below)
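A minimal sketch of the rolling idea, assuming only the K most recent checkpoints are kept by cycling the filename suffix; the naming scheme and retention count are illustrative, not FLASH's actual convention.

```c
#include <stdio.h>

#define K_KEEP 2  /* checkpoints retained on disk (illustrative) */

/* Map a checkpoint number to a filename. Checkpoint K_KEEP overwrites
 * checkpoint 0, so at most K_KEEP checkpoint files exist at once. */
void rolling_checkpoint_name(char *buf, size_t len, long step) {
    snprintf(buf, len, "flash_chk_%04ld.h5", step % K_KEEP);
}
```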
FLASH Plotfiles
• Plotfiles store the data needed for analysis
• FLASH can generate two types of plotfiles:
  • Mesh data (Eulerian)
  • Particle data (Lagrangian)
• Plotfiles are smaller than checkpoints:
  • Checkpoints save all variables; plotfiles save only those needed for analysis
  • Checkpoints are full precision; plotfiles are half precision
  • Sometimes coarsening of the data can be employed to reduce disk use
• Plotfiles are written more frequently and cannot be rolled
Improving I/O Performance
• Split I/O
  • Instead of a single file, write to a few files
  • Easy in parallel libraries by grouping processors (see the sketch below)
  • A utility stitches the files together; most often concatenation is enough
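A sketch of how the grouping might be done with MPI communicators; NGROUPS is an illustrative constant, and FLASH's actual split I/O may be organized differently.

```c
#include <mpi.h>

#define NGROUPS 8  /* output files per snapshot (illustrative) */

/* Split MPI_COMM_WORLD into NGROUPS contiguous writer groups. Each
 * group then performs its parallel I/O on io_comm instead of the
 * world communicator, producing one file per group. */
MPI_Comm make_io_comm(int rank, int nprocs) {
    int ranks_per_group = (nprocs + NGROUPS - 1) / NGROUPS;
    int group = rank / ranks_per_group;
    MPI_Comm io_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &io_comm);
    return io_comm;
}
```

Because each group's ranks are contiguous, stitching the output back together is just concatenating the NGROUPS files in group order.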
FLASH I/O Experiences
• Highlights from three different simulations
  • Run on five different platforms
  • With two different file systems
• Turbulence run in 2005-2006
  • On BG/L at Lawrence Livermore National Laboratory (LLNL)
  • Lustre file system
• Gravitationally Confined Detonation (GCD) simulations
  • On UP at LLNL, and on Seaborg and Franklin at NERSC
  • GPFS file system on the IBM platforms, Lustre on Franklin
• Rayleigh-Taylor flame (RTFlame) simulations
  • Currently being run on BG/P at Argonne National Laboratory
  • GPFS file system
Turbulence Run: The I/O Challenge
• The Lustre file system at LLNL was an unknown
  • FLASH had never run in that combination
• No parallel I/O library scaled beyond 1024 nodes
  • We quickly put together a direct I/O implementation
  • Each node wrote to its own file
  • Each snapshot meant 32,768 files
• Lagrangian particles move randomly
  • Sorting algorithms were needed for analysis
  • A huge processing issue
• We generated 74 million files in all
Turbulence Run: Overview
• Largest homogeneous, isotropic, compressible turbulence run
  • 1856^3 base grid size
  • 256^3 Lagrangian tracer particles
  • 3D turbulent RMS Mach number = 0.3 (1D = 0.17)
  • Re_λ ~ 500-1000
  • A full eddy-turnover time in steady state
  • Roughly one week of wall clock on 32,768 nodes
• Code components:
  • Hydrodynamics
  • Tracer particles
  • Uniform grid
  • Direct I/O
Turbulence Run: I/O Statistics
• 200 checkpoints, each a full double-precision snapshot of the simulation state
  • 6,553,600 files, for a total of about 140 TB
• 700 plotfiles at reduced precision for the Eulerian grid data
  • Coarsened to half the resolution, making them 1/8th the size
  • 22,937,600 files, for a total of 14 TB
• 1400 plotfiles for the Lagrangian particles
  • 45,875,200 files, for a total of 0.658 TB
• Total disk use: about 154 TB
• Transferring the plotfiles with GridFTP took more than a month
• The data are available to anyone interested
• The files required months of stitching together and post-processing
Gravitationally Confined Detonation (GCD) Runs
Simulation description:
• The simulation starts with an off-center bubble
• The bubble rises to the surface, developing Rayleigh-Taylor instabilities
• The material cannot escape because of gravity, so it races around the star
• At the opposite end, the fronts collide and initiate a detonation
Code components:
• Hydro with a shock-capturing scheme (PPM)
• Newtonian self-gravity (multipole solver)
• Nuclear flame model
• Lagrangian tracer particles
• AMR
• Parallel I/O (HDF5)
GCD Runs: I/O Statistics
• Successfully used fully parallel HDF5 I/O on all platforms
• A typical run generates:
  • 60 checkpoint files
  • About 1,000 each of grid and particle plotfiles
• About 17% of the computation time is spent in I/O
• File sizes:
  • Checkpoint files: ~20 GB
  • Plotfiles: ~5 GB
• Overall storage used: 5 TB per simulation
RTFlame Simulations
• Objective: study fundamental properties of Rayleigh-Taylor-driven turbulent nuclear burning to verify assumptions in the GCD model
  • Whether a sub-grid model is needed
    • Not if nuclear burning occurs primarily at large scales
  • The physical conditions under which the flame transitions to distributed burning
• Code components:
  • Hydro with a shock-capturing scheme (PPM)
  • Constant gravity
  • Nuclear flame model
  • Lagrangian tracer particles
  • AMR (adaptive mesh refinement)
  • Parallel I/O
RTFlame: Specifications
• Simulates the flame in an elongated box with a square base
• Sequence of resolutions
  • Currently running 256^2 x 1024 and 512^2 x 2048 on BG/P
• Node use varies from 512 to 4096
  • Running in dual or quad mode
RTFlame: I/O
• Checkpoint sizes are in the 9-70 GB range
• Plotfiles are in the 2-16 GB range
• Scaling of I/O is not very clear
  • Increasing the number of processors can sometimes cause the write time to increase substantially
• Observations:
  • Reads of checkpoints are extremely fast (2 min)
  • Writes were unstable without MPI barriers between different datasets (see the sketch below)
  • Though there have been failures during output, files that get completed are fine
  • There is a balance between scaling and I/O performance
  • I/O takes some memory footprint away from the simulation
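A sketch of the barrier workaround; write_dataset() is a hypothetical stand-in for the per-variable collective HDF5 write, not a real FLASH or HDF5 routine.

```c
/* Hypothetical helper: collectively writes one named variable. */
extern void write_dataset(hid_t file, const char *name);

/* Keeping all ranks in lockstep between datasets is what stabilized
 * the writes on BG/P in our runs. */
write_dataset(file, "dens");
MPI_Barrier(MPI_COMM_WORLD);
write_dataset(file, "pres");
MPI_Barrier(MPI_COMM_WORLD);
```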
RTFlame Simulation: One Checkpoint
• Time to write a single checkpoint
  • First plot: ~72 GB checkpoints; second: ~9 GB
  • Accounts for a little less than 10% of execution time
[Charts: checkpoint write time for the 256^2 run on 512/1K procs and the 512^2 run on 4K/8K procs]