Real World IO Experiences With The FLASH Code
May 19, 2008
Anshu Dubey
Outline
• I/O options for parallel applications
• Parallel I/O: what we want
• FLASH I/O
  • Libraries
  • Checkpointing and restarts
  • Plotfiles
  • Performance
• Experiences
  • Non-adaptive turbulence simulation on BG/L
  • Fully adaptive GCD simulations on Power 5 / XT4
  • Fully adaptive RTFlame simulations on BG/P
I/O Options: Serial I/O
[Diagram: processors 0-5 all send their data to a single file via the master]
• Each processor sends its data to the master, which then writes the data to a single file (a minimal sketch follows)
• Advantages:
  • No parallel file system needed
  • Simple
• Disadvantages:
  • Not scalable
  • Not efficient
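A minimal sketch of this master-writes pattern in C with MPI, assuming each rank owns a fixed-size array of doubles; the array size and file name are illustrative.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024  /* elements owned by each rank (illustrative) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local[N];
    for (int i = 0; i < N; i++) local[i] = rank + i * 1e-6;

    /* The master gathers every rank's data, then writes one file. */
    double *all = NULL;
    if (rank == 0) all = malloc((size_t)nprocs * N * sizeof(double));
    MPI_Gather(local, N, MPI_DOUBLE, all, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *f = fopen("snapshot.bin", "wb");
        fwrite(all, sizeof(double), (size_t)nprocs * N, f);
        fclose(f);
        free(all);
    }
    MPI_Finalize();
    return 0;
}
```

The bottleneck is plain to see: rank 0 must hold and write everyone's data, which is why the pattern is neither scalable nor efficient.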
I/O Options: Parallel Multi-file
[Diagram: processors 0-5 each write their own data to a separate file]
• Each processor writes its own data to a separate file (see the fragment below)
• Advantages:
  • Fast!
• Disadvantages:
  • Can quickly accumulate many files
  • Hard to manage
  • Requires post-processing
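By contrast, the file-per-process pattern needs no gather at all; each rank simply writes its own file. A fragment, reusing rank, local, and N from the previous sketch:

```c
/* Each rank writes its piece to its own file; with 32,768 ranks this
 * yields 32,768 files per snapshot, hence the management problem. */
char fname[64];
snprintf(fname, sizeof fname, "snapshot_%05d.bin", rank);
FILE *f = fopen(fname, "wb");
fwrite(local, sizeof(double), N, f);
fclose(f);
```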
I/O Options: Parallel I/O to a Single File
[Diagram: processors 0-5 each write their own section of one data array in a single file]
• Each processor writes its own data to a section of the data array in the same file
• Each must know the offset and the number of elements to write (see the MPI-IO sketch below)
• Advantages:
  • Single file, scalable
• Disadvantages:
  • Requires MPI-IO mappings or other higher-level libraries
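A sketch of the shared-file pattern with MPI-IO, again reusing rank, local, and N from the first sketch: every rank derives its byte offset from its rank, and all ranks write collectively.

```c
/* All ranks open the same file; each writes its slice at a
 * rank-derived offset. The collective call lets the MPI library
 * aggregate and optimize the accesses. */
MPI_File fh;
MPI_File_open(MPI_COMM_WORLD, "snapshot.bin",
              MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
MPI_File_write_at_all(fh, offset, local, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
MPI_File_close(&fh);
```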
What Applications Want from I/O
• Write data from multiple processors into a single file
• Read the file in the same manner regardless of the number of CPUs that read from or write to it
  • See the logical data layout, not the physical layout
• Do all of this with little overhead
  • The same performance as writing one file per processor
• Make all of the above, including the files, portable across platforms
  • Self-describing formats
I/O Formats in FLASH
• The distribution comes with support for the HDF5 and PnetCDF libraries, plus basic support for direct binary I/O
  • The direct binary format is for "all else failed" situations only
• Both libraries are:
  • Portable
  • Built on MPI-IO mappings
  • Self-describing, translating data between systems
• The two libraries can be used interchangeably in FLASH (a parallel HDF5 write is sketched below)
  • PnetCDF has better performance
  • HDF5 is more robust and supports larger file sizes
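To make the single-shared-file idea concrete at the library level, here is a hedged sketch of a parallel HDF5 write: one dataset in one shared file, each rank writing its own hyperslab. This is not FLASH's actual I/O code; the file name, dataset name, and 1-D layout are illustrative.

```c
#include <hdf5.h>
#include <mpi.h>

void write_parallel_hdf5(const double *local, hsize_t n_local,
                         int rank, int nprocs) {
    /* Open the file collectively through the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("checkpoint.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);

    /* One global 1-D dataset; each rank selects its slab by offset. */
    hsize_t n_global = n_local * (hsize_t)nprocs;
    hid_t filespace = H5Screate_simple(1, &n_global, NULL);
    hid_t dset = H5Dcreate(file, "density", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t start = n_local * (hsize_t)rank;
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &start, NULL,
                        &n_local, NULL);
    hid_t memspace = H5Screate_simple(1, &n_local, NULL);

    /* Collective write: all ranks participate in a single call. */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, local);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Pclose(fapl); H5Fclose(file);
}
```

Because the file records the logical layout, a reader can open it with any number of processes, which is exactly the portability property listed above.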
I/O in FLASH
• Large files:
  • Checkpoint files: save the full state of the simulation
  • Plot files: data for analysis
• Smaller files:
  • Dat files: integrated quantities, output as serial files
  • Log files: status reports of the run and logging of important run-specific information
  • Input files: some simulations need to read files for initialization or table lookups
FLASH Checkpoint Files
• Simulations can stop at intermediate times for many reasons:
  • Unintentionally: machine failure, the queue window closing, execution failure, etc.
  • Intentionally, for analysis of intermediate results:
    • Determine whether the simulation is proceeding as expected
    • Decide whether any parameters need to change
    • Tweak the direction of the simulation
• It is crucial to be able to resume the simulation after most of the above stoppages
• Checkpoint files save the full state of the simulation
  • Simulations can be restarted transparently from checkpoint files
More on Checkpoint Files
• Checkpoint files are saved in full precision
• They have no knowledge of the hardware that wrote them, so restarts can be done on:
  • A different number of processors
  • A different platform
• Typical Flash Center production-run checkpoints are a few GB in size
  • Some larger runs have reached a few hundred GB for a single checkpoint
• Because they are large and take a long time to write, their frequency must be chosen judiciously
• When disk space is an issue we use rolling checkpoints (see the sketch below)
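A minimal sketch of the rolling idea, assuming only the K most recent checkpoints are kept by cycling the filename suffix; the naming scheme and retention count are illustrative, not FLASH's actual convention.

```c
#include <stdio.h>

#define K_KEEP 2  /* checkpoints retained on disk (illustrative) */

/* Map a checkpoint number to a filename. Checkpoint K_KEEP overwrites
 * checkpoint 0, so at most K_KEEP checkpoint files exist at once. */
void rolling_checkpoint_name(char *buf, size_t len, long step) {
    snprintf(buf, len, "flash_chk_%04ld.h5", step % K_KEEP);
}
```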
FLASH Plotfiles
• Plotfiles store the data needed for analysis
• FLASH can generate two types of plotfiles:
  • Mesh data (Eulerian)
  • Particle data (Lagrangian)
• Plotfiles are smaller than checkpoints:
  • Checkpoints save all variables; plotfiles save only those needed for analysis
  • Checkpoints are full precision; plotfiles are half precision
  • Sometimes coarsening of the data can be employed to reduce disk use
• Plotfiles are written more frequently and cannot be rolled
Improving I/O Performance
• Split I/O
  • Instead of a single file, write to a few files
  • Easy in parallel libraries by grouping processors (see the sketch below)
  • A utility stitches the files together; most often concatenation is enough
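A sketch of how the grouping might be done with MPI communicators; NGROUPS is an illustrative constant, and FLASH's actual split I/O may be organized differently.

```c
#include <mpi.h>

#define NGROUPS 8  /* output files per snapshot (illustrative) */

/* Split MPI_COMM_WORLD into NGROUPS contiguous writer groups. Each
 * group then performs its parallel I/O on io_comm instead of the
 * world communicator, producing one file per group. */
MPI_Comm make_io_comm(int rank, int nprocs) {
    int ranks_per_group = (nprocs + NGROUPS - 1) / NGROUPS;
    int group = rank / ranks_per_group;
    MPI_Comm io_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group, rank, &io_comm);
    return io_comm;
}
```

Because each group's ranks are contiguous, stitching the output back together is just concatenating the NGROUPS files in group order.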
FLASH I/O Experiences
• Highlights from three different simulations
  • Run on five different platforms
  • With two different file systems
• Turbulence run in 2005-2006
  • On BG/L at Lawrence Livermore National Laboratory (LLNL)
  • Lustre file system
• Gravitationally Confined Detonation (GCD) simulations
  • On UP at LLNL, and on Seaborg and Franklin at NERSC
  • GPFS file system on the IBM platforms, Lustre on Franklin
• Rayleigh-Taylor flame (RTFlame) simulations
  • Currently being run on BG/P at Argonne National Laboratory
  • GPFS file system
Turbulence Run: The I/O Challenge
• The Lustre file system at LLNL was an unknown
  • FLASH had never run in that combination
• No parallel I/O library scaled beyond 1024 nodes
  • We quickly put together a direct I/O implementation
  • Each node wrote to its own file
  • Each snapshot meant 32,768 files
• Lagrangian particles move randomly
  • Sorting algorithms were needed for analysis
  • A huge processing issue
• We generated 74 million files in all
Turbulence Run: Overview
• Largest homogeneous, isotropic, compressible turbulence run
  • 1856^3 base grid size
  • 256^3 Lagrangian tracer particles
  • 3D turbulent RMS Mach number = 0.3 (1D = 0.17)
  • Re_λ ~ 500-1000
  • A full eddy-turnover time in steady state
  • Roughly one week of wall clock on 32,768 nodes
• Code components:
  • Hydrodynamics
  • Tracer particles
  • Uniform grid
  • Direct I/O
Turbulence Run: I/O Statistics
• 200 checkpoints, each a full double-precision snapshot of the simulation state
  • 6,553,600 files, for a total of about 140 TB
• 700 plotfiles at reduced precision for the Eulerian grid data
  • Coarsened to half the resolution, making them 1/8th the size
  • 22,937,600 files, for a total of 14 TB
• 1400 plotfiles for the Lagrangian particles
  • 45,875,200 files, for a total of 0.658 TB
• Total disk use: about 154 TB
• Transferring the plotfiles with GridFTP took more than a month
• The data are available to anyone interested
• The files required months of stitching together and post-processing
Gravitationally Confined Detonation (GCD) Runs
Simulation description:
• The simulation starts with an off-center bubble
• The bubble rises to the surface, developing Rayleigh-Taylor instabilities
• The material cannot escape because of gravity, so it races around the star
• At the opposite end, the fronts collide and initiate a detonation
Code components:
• Hydro with a shock-capturing scheme (PPM)
• Newtonian self-gravity (multipole solver)
• Nuclear flame model
• Lagrangian tracer particles
• AMR
• Parallel I/O (HDF5)
GCD Runs: I/O Statistics
• Successfully used fully parallel HDF5 I/O on all platforms
• A typical run generates:
  • 60 checkpoint files
  • About 1,000 each of grid and particle plotfiles
• About 17% of the computation time is spent in I/O
• File sizes:
  • Checkpoint files: ~20 GB
  • Plotfiles: ~5 GB
• Overall storage used: 5 TB per simulation
RTFlame Simulations
• Objective: study fundamental properties of Rayleigh-Taylor-driven turbulent nuclear burning to verify assumptions in the GCD model
  • Whether a sub-grid model is needed
    • Not if nuclear burning occurs primarily at large scales
  • The physical conditions under which the flame transitions to distributed burning
• Code components:
  • Hydro with a shock-capturing scheme (PPM)
  • Constant gravity
  • Nuclear flame model
  • Lagrangian tracer particles
  • AMR (adaptive mesh refinement)
  • Parallel I/O
RTFlame: Specifications
• Simulates the flame in an elongated box with a square base
• Sequence of resolutions
  • Currently running 256^2 x 1024 and 512^2 x 2048 on BG/P
• Node use varies from 512 to 4096
  • Running in dual or quad mode
RTFlame: I/O
• Checkpoint sizes are in the 9-70 GB range
• Plotfiles are in the 2-16 GB range
• Scaling of I/O is not very clear
  • Increasing the number of processors can sometimes cause the write time to increase substantially
• Observations:
  • Reads of checkpoints are extremely fast (2 min)
  • Writes were unstable without MPI barriers between different datasets (see the sketch below)
  • Though there have been failures during output, files that get completed are fine
  • There is a balance between scaling and I/O performance
  • I/O takes some memory footprint away from the simulation
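A sketch of the barrier workaround; write_dataset() is a hypothetical stand-in for the per-variable collective HDF5 write, not a real FLASH or HDF5 routine.

```c
/* Hypothetical helper: collectively writes one named variable. */
extern void write_dataset(hid_t file, const char *name);

/* Keeping all ranks in lockstep between datasets is what stabilized
 * the writes on BG/P in our runs. */
write_dataset(file, "dens");
MPI_Barrier(MPI_COMM_WORLD);
write_dataset(file, "pres");
MPI_Barrier(MPI_COMM_WORLD);
```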
RTFlame Simulation: One Checkpoint
• Time to write a single checkpoint
  • First plot: ~72 GB checkpoints; second: ~9 GB
  • Accounts for a little less than 10% of execution time
[Charts: checkpoint write time for the 256^2 run on 512/1K procs and the 512^2 run on 4K/8K procs]