
Petascale Science with GTC/ADIOS



Presentation Transcript


1. Petascale Science with GTC/ADIOS HPC User Forum 9/10/2008 Scott Klasky S. Ethier, S. Hodson, C. Jin, Z. Lin, J. Lofstead, R. Oldfield, M. Parashar, K. Schwan, A. Shoshani, M. Wolf, Y. Xiao, F. Zheng

2. Outline • GTC • EFFIS • ADIOS • Workflow • Dashboard • Conclusions

3. Advanced computing at NCCS
1.0 PF system (2008-2009):
• Compute nodes: 13,888 (73.6 GF/node); 2 sockets per node (F/HT1), 4 cores per socket (111,104 cores total)
• Core CPU: 2.3 GHz AMD Opteron; memory per core: 2 GB (DDR2-800)
• 256 service & I/O nodes
• Local storage: ~10 PB, 200+ GB/s
• Interconnect: 3D torus, SeaStar 2.1 NIC
• Aggregate memory: 222 TB
• Peak performance: 1.0 PF
• 150 cabinets, 3,400 ft²; 6.5 MW power
275 TF system:
• Compute nodes: 7,832 (35.2 GF/node); 1 socket per node (AM2/HT1), 4 cores per socket (31,328 cores total)
• Core CPU: 2.2 GHz AMD Opteron; memory per core: 2 GB (DDR2-800)
• 232 service & I/O nodes
• Local storage: ~750 TB, 41 GB/s
• Interconnect: 3D torus, SeaStar 2.1 NIC
• Aggregate memory: 63 TB
• Peak performance: 275 TF

4. Big Simulations for Early 2008: GTC Science Goals and Impact
Science goals:
• Use GTC (classic) to analyze cascades and propagation in Collisionless Trapped Electron Mode (CTEM) turbulence. Resolve the critical question of ρ* scaling of confinement in large tokamaks such as ITER: what are the consequences of departure from this scaling? Avalanches and turbulence spreading tend to break gyro-Bohm scaling, but zonal flows tend to restore it by shearing apart extended eddies: a competition.
• Use GTC-S (shaped) to study electron temperature gradient (ETG) drift turbulence and compare against NSTX experiments. NSTX has a spherical torus with a very low major-to-minor aspect ratio and a strongly shaped cross-section. NSTX experiments have produced very interesting high-frequency, short-wavelength modes; are these kinetic electron modes? ETG is a likely candidate, but only a fully nonlinear kinetic simulation with the exact shape and experimental profiles can address this.
Science impact:
• Further the understanding of CTEM turbulence by validation against modulated ECH heat-pulse propagation studies on the DIII-D, JET, and Tore Supra tokamaks. Is CTEM the key mechanism for electron thermal transport? Electron temperature fluctuation measurements will shed light. Understand the role of the nonlinear dynamics of precession drift resonance in CTEM turbulence.
• First direct comparison between simulation and experiment on ETG drift turbulence. GTC-S possesses the right geometry and the right nonlinear physics to possibly resolve this. Help pinpoint the micro-turbulence activity responsible for energy loss through the electron channel in NSTX plasmas.

5. GTC Early Application: Electron Microturbulence in Fusion Plasma
• "Scientific discovery": transition to favorable scaling of confinement for both ions and electrons now observed in simulations for ITER plasmas
• Electron transport is less understood but more important for ITER, since fusion products first heat the electrons
• Simulation of electron turbulence is more demanding due to shorter time scales and smaller spatial scales
• A recent GTC simulation of electron turbulence used 28,000 cores for 42 hours in a dedicated run on Jaguar at ORNL, producing 60 TB of data currently being analyzed. This run pushed 15 billion particles for 4,800 major time cycles
[Figure: ion vs. electron transport scaling — good news for ITER!]

6. GTC Electron Microturbulence Structure • 3D fluid data analysis provides critical information to characterize microturbulence, such as radial eddy size and eddy auto-correlation time • Flux-surface electrostatic potential demonstrates a ballooning structure • Radial turbulence eddies have an average size of ~5 ion gyroradii

7. EFFIS
[Diagram: EFFIS technology stack — adaptable I/O, provenance and metadata, code coupling, workflow, wide-area data movement, visualization, dashboard — grouped into foundation technologies and enabling technologies]
• From the SDM center: workflow engine (Kepler), provenance support, wide-area data movement
• From universities: code coupling (Rutgers), visualization (Rutgers)
• Newly developed technologies: adaptable I/O (ADIOS, with Georgia Tech), dashboard (with the SDM center)
Approach: place highly annotated, fast, easy-to-use I/O methods in the code, which can be monitored and controlled; have a workflow engine record all of the information; visualize it on a dashboard; move desired data to the user's site; and have everything reported to a database.

8. Outline • GTC • EFFIS • ADIOS • Conclusions

9. ADIOS: Motivation
• "Those fine fort.* files!"
• Multiple HPC architectures: BlueGene, Cray, IB-based clusters
• Multiple parallel filesystems: Lustre, PVFS2, GPFS, Panasas, pNFS
• Many different APIs: MPI-IO, POSIX, HDF5, netCDF
• GTC (fusion) has changed its IO routines 8 times so far based on performance when moving to different platforms
• Different IO patterns: restarts, analysis, diagnostics
• Different combinations provide different levels of IO performance
• Compensate for inefficiencies in the current IO infrastructures to improve overall performance

10. ADIOS Overview
• Allows plug-ins for different I/O implementations
• Abstracts the API from the method used for I/O
• Simple API, almost as easy as an F90 write statement
• Best-practice/optimized IO routines for all supported transports "for free"
• Componentization: thin API; XML file (data groupings with annotation, IO method selection, buffer sizes); common tools (buffering, scheduling); pluggable IO routines
[Diagram: scientific codes call the ADIOS API (buffering, scheduling, feedback), configured by external metadata in an XML file, on top of pluggable transports: pHDF-5, MPI-IO, MPI-CIO, pnetCDF, POSIX IO, viz engines, LIVE/DataTap, and other plug-ins]

11. ADIOS Philosophy (End User)
• Simple API, very similar to standard Fortran or C POSIX IO calls; as close to identical as possible for the C and Fortran APIs
• open, read/write, close are the core; set_path, end_iteration, begin/end_computation, init/finalize are the auxiliaries
• No changes in the API for different transport methods
• Metadata and configuration defined in an external XML file parsed once on startup
• Describe the various IO groupings, including attributes and hierarchical path structures for elements, as an adios-group
• Define the transport method used for each adios-group and give parameters for communication/writing/reading
• Change on a per-element basis what is written
• Change on a per-adios-group basis how the IO is handled
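To make the "open, write, close" core concrete, here is a minimal mock in C of what a restart write looks like through an API of this shape. This is a sketch, not the shipped ADIOS API: the call names come from the slide, but the argument lists, handle type, file names (gtc_config.xml, restart.bp), and variable names (nx, zion) are illustrative assumptions, and the printf stubs stand in for the real library.

```c
/* Minimal mock of the open/write/close pattern described on this slide.
 * Call names follow the slide; signatures and stub bodies are assumptions
 * for illustration, not the real ADIOS library. */
#include <stdio.h>
#include <stdint.h>

typedef int64_t adios_handle_t;   /* hypothetical handle type */

static void adios_init(const char *xml)
{ printf("parse %s once at startup\n", xml); }

static void adios_open(adios_handle_t *h, const char *group,
                       const char *file, const char *mode)
{ *h = 1; printf("open adios-group '%s' -> %s (mode %s)\n", group, file, mode); }

static void adios_write(adios_handle_t h, const char *name, const void *data)
{ (void)h; (void)data; printf("  write element '%s'\n", name); }

static void adios_close(adios_handle_t h)
{ (void)h; printf("close: the transport chosen in the XML flushes here\n"); }

static void adios_finalize(int rank) { (void)rank; }

int main(void)
{
    int    nx      = 8;
    double zion[8] = {0};
    adios_handle_t h;

    adios_init("gtc_config.xml");            /* metadata + method live in the XML */
    adios_open(&h, "restart", "restart.bp", "w");
    adios_write(h, "nx",   &nx);             /* what is written: per-element choice */
    adios_write(h, "zion", zion);
    adios_close(h);                          /* how it is written: per-group choice */
    adios_finalize(0);
    return 0;
}
```

The point of the pattern is the last two bullets above: swapping the transport (say, MPI-IO for an asynchronous method) touches only the external XML file, never this code.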

12. ADIOS Overview
• ADIOS is an IO componentization, which allows us to: abstract the API from the IO implementation; switch from synchronous to asynchronous IO at runtime; change from real-time visualization to fast IO at runtime
• Combines: fast I/O routines; ease of use; scalable architecture (from 100s of cores to millions of procs); QoS; metadata-rich output; visualization applied during simulations; analysis and compression techniques applied during simulations; provenance tracking

13. Design Goals
• ADIOS Fortran and C based API almost as simple as standard POSIX IO
• External configuration to describe metadata and control IO settings
• Take advantage of existing IO techniques (no new native IO methods)
Goal: fast, simple-to-write, efficient IO for multiple platforms without changing the source code
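A hedged sketch of what such an external configuration could look like, in the spirit of the adios-group / transport-method / buffer-size elements described on the previous slides. The tag and attribute names below are my assumptions about the general shape, not a verified schema from the 2008 release.

```xml
<!-- Illustrative sketch only: element and attribute names are assumptions
     in the spirit of the ADIOS XML configuration, not a verified schema. -->
<adios-config host-language="Fortran">

  <!-- One logical grouping of related outputs (an adios-group) -->
  <adios-group name="restart">
    <var name="nx"   type="integer"/>
    <var name="zion" type="double" dimensions="nx"/>
  </adios-group>

  <!-- Bind the group to a transport; swapping MPI-IO for an asynchronous
       method (e.g. DataTap) means editing this one line, not the code. -->
  <method group="restart" method="MPI"/>

  <!-- Buffer size hint consumed by the common buffering layer -->
  <buffer size-MB="100"/>

</adios-config>
```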

14. Architecture
• Data groupings: logical groups of related items written at the same time; not necessarily one group per writing event
• IO methods: choose what works best for each grouping; vetted, improved, and/or written by experts for each
• POSIX (Wei-keng Liao, Northwestern)
• MPI-IO (Steve Hodson, ORNL)
• MPI-IO collective (Wei-keng Liao, Northwestern)
• NULL (Jay Lofstead, GT)
• Ga Tech DataTap asynchronous (Hasan Abbasi, GT)
• phdf5
• others (pnetcdf on the way)

  15. Related Work • Specialty APIs • HDF-5 – complex API • Parallel netCDF – no structure • File system aware middleware • MPI ADIO layer – File system connection, complex API • Parallel File systems • Lustre – Metadata server issues • PVFS2 – client complexity • LWFS – client complexity • GPFS, pNFS, Panasas – may have other issues

  16. Supported Features • Platforms tested • Cray CNL (ORNL Jaguar) • Cray Catamount (SNL Redstorm) • Linux Infiniband/Gigabit (ORNL Ewok) • BlueGene P now being tested/debugged. • Looking for future OSX support. • Native IO Methods • MPI-IO independent, MPI-IO collective, POSIX, NULL, Ga Tech DataTap asynchronous, Rutgers DART asynchronous, Posix-NxM, phdf5, pnetcdf, kepler-db

17. Initial ADIOS performance
• MPI-IO method: the GTC and GTS codes have achieved over 20 GB/sec on the Cray XT at ORNL
• 30 GB diagnostic files every 3 minutes, 1.2 TB restart files every 30 minutes, 300 MB other diagnostic files every 3 minutes
• DART: <2% overhead for writing 2 TB/hour with the XGC code
• DataTap vs. POSIX: 1 file per process (POSIX); 5 secs for GTC computation; ~25 seconds for POSIX IO; ~4 seconds with DataTap
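As a rough cross-check (my own arithmetic, using only the figures on this slide and taking 1 TB as 1000 GB), these cadences correspond to a sustained average output rate well under 1 GB/s, so bursts at 20+ GB/s leave the machine almost entirely to computation:

```latex
\frac{30\,\text{GB}}{180\,\text{s}} + \frac{1200\,\text{GB}}{1800\,\text{s}} + \frac{0.3\,\text{GB}}{180\,\text{s}}
\approx 0.17 + 0.67 + 0.002 \approx 0.84\ \text{GB/s average}
```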

18. Codes & Performance
• June 7, 2008: 24-hour GTC run on Jaguar at ORNL
• 93% of the machine (28,672 cores)
• MPI-OpenMP mixed model on quad-core nodes (7,168 MPI procs)
• Three interruptions total (simple node failures), with two 10+ hour runs
• Wrote 65 TB of data at >20 GB/sec (25 TB for post analysis)
• IO overhead ~3% of wall clock time
• Mixed IO methods of synchronous MPI-IO and POSIX IO configured in the XML file
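A quick consistency check on the quoted I/O overhead (my arithmetic, assuming the full 65 TB was written at roughly the quoted 20 GB/s floor over the 24-hour run):

```latex
\frac{65{,}000\ \text{GB} \,/\, 20\ \text{GB/s}}{24 \times 3600\ \text{s}}
\approx \frac{3250\ \text{s}}{86{,}400\ \text{s}} \approx 3.8\%
```

which is in line with the reported ~3% of wall clock time.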

19. Chimera IO Performance (supernova code)
• Plot shows the minimum value from 5 runs with 9 restarts/run (2x scaling)
• Error bars show the maximum time for the method

20. Chimera Benchmark Results
• Why is ADIOS better than pHDF5? ADIOS_MPI_IO vs. pHDF5 with the MPI independent IO driver
• Used 512 cores, 5 restart dumps
• Conversion time on 1 processor for the 2048-core job = 3.6 s (read) + 5.6 s (write) + 6.9 s (other) = 18.8 s
• Numbers above are sums over all PEs (parallelism not shown)

21. DataTap
• A research transport to study asynchronous data movement
• Uses server-directed I/O to maintain high bandwidth and low overhead for data extraction
• I/O scheduling is performed to limit the perturbation caused by asynchronous I/O

22. DataTap scheduler
• Due to perturbations caused by asynchronous I/O, the overall performance of the application may actually get worse
• We schedule the data movement using application state information to prevent asynchronous I/O from interfering with MPI communication
• 800 GB of data
• Scheduled I/O takes 2x longer to move the data, but its overhead on the application is 2x lower

23. The flood of data
• Petascale GTC runs will produce 1 PB per simulation
• Coupling GTC with an edge code (core-edge coupling) gives 4 PB of data per run
• We can't store all of the GTC runs at ORNL unless we go to tape (~12 days to retrieve the data from tape at 1 GB/sec)
• 1.5 FTE looking at the data
• Need more 'real-time' analysis of the data: workflows, data-in-transit (IO graphs), ...?
• Can we create a staging area with "fat nodes"? Move data from the computational nodes to fat nodes over the HPC resource's network; reduce data on the fat nodes; allow users to "plug in" analysis routines on the fat nodes
• How fat? Shared memory helps (we don't have to parallelize all analysis codes)
• Typical upper bound: the codes we studied write at most 1/20th of memory per core for analysis. We want at most 1/20th of the resources (5% overhead) and need 2x memory per core for analysis (2x memory overhead: in data + out data)
• On the Cray at ORNL this means roughly 750 sockets (quad-core) for fat nodes, with 34 GB of shared memory
• Also useful for codes which require memory but not as many nodes
• Can we have shared memory on this portion? What are the other solutions?
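The tape-retrieval figure above follows from simple arithmetic (my own, using only the numbers on the slide, with 1 PB taken as 10^6 GB):

```latex
\frac{1\ \text{PB}}{1\ \text{GB/s}} = 10^{6}\ \text{s} \approx 11.6\ \text{days} \approx 12\ \text{days}
```

The fat-node sizing follows the same logic: if a code writes at most 1/20th of its memory per core for analysis and the staging area is held to 1/20th of the compute resource (5% overhead), each staging core needs roughly 2x the compute node's per-core memory to hold both the incoming and the outgoing data.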

24. Conclusions
• GTC is a code which is scaling to the petascale computers (BG/P, Cray XT)
• New changes bring new science and new IO (ADIOS)
• A major challenge in the future is speeding up the data analysis
• ADIOS is an IO componentization
• ADIOS is being integrated into Kepler
• Achieved over 50% of peak IO performance for several codes on Jaguar
• Can change IO implementations at runtime
• Metadata is contained in the XML file
• Petascale science starts with petascale applications
• Need enabling technologies to scale
• Need to rethink the ways we do science
