Requirements for an end-to-end solution for the Center for Plasma Edge Simulation (CPES FSP)
SDM AHM, October 5, 2005
Scott A. Klasky, ORNL
Perhaps not just the CPES FSP
• Can we form the CAFÉ solution?
• Combustion, Astrophysics, Fusion End-to-end framework:
  • Combustion SciDAC
  • Astrophysics: TSI SciDAC
  • Fusion SciDACs (CPES, SWIM, GPS, CEMM)
• SNS: follow closely, and try to exchange technology.
Center for Plasma Edge Simulations (a Fusion Simulation Project SciDAC)
• Research question: How can a particular plasma edge condition dramatically improve the confinement of fusion plasma, as observed in experiments? The physics of the transitional edge plasma that connects the hot core (of order 100 million degrees C, or tens of keV) with the material walls is the subject of this research question.
• 5-year goal: Predict the edge pedestal behavior for ITER and existing devices. This must be answered for the success of ITER.
• We are developing a testable pedestal simulation framework which incorporates the relevant spectrum of physics processes (e.g., transport, kinetic and magnetohydrodynamic stability and turbulence, flows, and atomic physics in realistic geometry) and spans the range of plasma parameters relevant to ITER.
• Use Kepler for the end-to-end solution, with autonomic, high-performance NxM data transfers for code coupling, code monitoring, and saving results.
[Figures: pedestal growth; M3D simulation depicting edge localized modes (ELMs). Workflow diagram: input files, data interpolation, MHD linear stability, monitor, job submission, XGC-ET simulation on a leadership-class computer, stable? (true/false), noise monitor, M3D simulation, distributed storage, portal, compute SOL, out-of-core isosurface.]
• Codes used in this project:
  • XGC-ET: a fully kinetic PIC code which will solve turbulence, neoclassical, and neutral dynamics self-consistently. High velocity-space resolution and an arbitrarily shaped wall are necessary to solve this research problem. Will acquire the gyrokinetic machinery from the GTC code, part of the GPS SciDAC. Will include DEGAS 2 for more accurate neutral atomic physics around the boundary.
  • M3D-edge: an edge-modified version of the M3D MHD/2-fluid code, part of the CEMM SciDAC. For nonlinear MHD ELM crashes.
  • Linear solvers: simple preconditioners for diagonally dominant systems; multigrid for scalable elliptic solves (perfect weak scaling); investigation of tree-code methods (e.g., fast multipole) for direct calculation of electrostatic forces (i.e., PIC without cells).
Code Coupling: Forming a computational pipeline
• 2 computers (or more):
  • 1 computer runs in batch.
  • The other system(s) are for interactive parallel use.
• Security will be bypassed if we can have all computers at ORNL.
[Pipeline diagram: Cray XT3 running XGC on 1,024P; two moves of 10 MB in <1 second each; interactive clusters running MHD-L (linear stability) on 4P and M3D on 32P; 30 GB/minute to a noise monitor on 80P (interactive cluster).]
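To make the pipeline concrete, a minimal sketch of the coupling loop implied by the diagram: XGC-ET advances on the batch machine, a small linear-stability check runs on the interactive cluster, and the nonlinear M3D calculation is invoked only when the pedestal goes unstable. All function names below are placeholders invented for this sketch, not CPES, XGC, M3D, or Kepler APIs.

```python
# Hypothetical sketch of the coupling loop; every function is a stand-in.

def run_xgc_step(step):
    """Stand-in for one XGC-ET step on the batch machine (e.g., the Cray XT3)."""
    return b"\x00" * 10_000_000              # ~10 MB of edge profile data

def transfer(data, src, dst):
    """Stand-in for the ~10 MB, sub-second move between machines."""
    print(f"moving {len(data)} bytes: {src} -> {dst}")
    return data

def linearly_stable(profiles):
    """Stand-in for the MHD-L check on a few interactive processors."""
    return True                              # pretend the pedestal stays stable

def run_m3d_crash(profiles):
    """Stand-in for the nonlinear M3D ELM-crash calculation."""
    return profiles                          # relaxed profiles handed back to XGC

for step in range(5):
    profiles = run_xgc_step(step)
    checked = transfer(profiles, "XT3/XGC", "cluster/MHD-L")
    if not linearly_stable(checked):
        crashed = transfer(checked, "cluster/MHD-L", "cluster/M3D")
        run_m3d_crash(crashed)               # couple the result back in a real run
```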
Interfaces must be designed to couple codes
• What variables are to be moved, and in what units?
• What is the data decomposition on the sending side? On the receiving side?
• InterComm (Sussman) seems very interesting (PVM):
  • Development of algorithms and techniques for effectively solving key problems in software support for coupled simulations.
  • Concentrates on three main issues:
    • Comprehensive support for determining at runtime what data is to be moved between simulations.
    • Flexibly and efficiently determining when the data should be moved.
    • Effectively deploying coupled simulation codes in a Grid computing environment.
  • A major goal is to minimize the changes that must be made to each individual simulation code.
    • Accomplished by having an individual simulation only specify what data will be made available for a potential data transfer, not when an actual data transfer will take place.
    • Decisions about when data transfers take place are made through a separate coordination specification, generally provided by the person building the complete coupled simulation.
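A toy illustration of that last point, separating "what data a code makes available" from "when it is moved." This is not the InterComm API; the class, variable names, and the coordination dictionary are invented for the sketch.

```python
# Illustrative only -- not InterComm. The simulation registers exportable
# arrays; a separate coordination specification decides when they move.

class Exporter:
    def __init__(self):
        self.registry = {}                    # name -> (array, decomposition)

    def export(self, name, array, decomposition):
        """The simulation only declares what *could* be transferred."""
        self.registry[name] = (array, decomposition)

# Coordination specification, written by the person assembling the coupled run:
# move electron temperature every 10 steps, density every step (made-up names).
coordination = {"Te": 10, "ne": 1}

def maybe_transfer(exporter, step, send):
    """Called each step by the coupling framework, not by the simulation."""
    for name, every in coordination.items():
        if name in exporter.registry and step % every == 0:
            array, decomp = exporter.registry[name]
            send(name, array, decomp)         # NxM redistribution handled elsewhere

# Usage sketch
ex = Exporter()
ex.export("Te", [1.0, 2.0, 3.0], decomposition="block")
maybe_transfer(ex, step=20, send=lambda n, a, d: print("sending", n))
```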
Look at Mb/s, not total data sizes
• Hawkes (SciDAC 2005), INCITE calculation:
  • ~2,000 Seaborg processors, 2.5 million hours total
  • ~5 TB of data, 9.3 Mb/s
• Blondin (SciDAC 2005):
  • 4 TB over 30 hours = 310 Mb/s
• CPES: code coupling 1.3 Mb/s; data saving (3D) 300 - 30(0) GB per 10 minutes
• The future is difficult to predict for data generation rates.
  • Codes add more physics, which slows the code down; algorithms speed the code up; new variables are generated; computers speed up; …
• This is also true for analysis of the data.
  • Do we need all of the data at all of the timesteps before we can analyze?
  • Can we do analysis and data movement together?
  • Analysis/visualization systems might have to be changed.
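A quick back-of-the-envelope check of the quoted sustained rates, assuming decimal units (1 TB = 1e12 bytes) and, for the Hawkes case, that the 2.5 million hours are CPU-hours spread over ~2,000 processors (~1,250 wall-clock hours); small differences from the quoted figures come from rounding and unit conventions.

```python
# Sanity-check the sustained output rates quoted above.

def mbit_per_s(nbytes, seconds):
    return nbytes * 8 / seconds / 1e6

# Hawkes: ~5 TB over ~1,250 wall-clock hours (2.5M CPU-hours / ~2,000 procs)
print(round(mbit_per_s(5e12, 1250 * 3600), 1))   # ~8.9, close to the 9.3 Mb/s quoted

# Blondin: 4 TB over 30 hours
print(round(mbit_per_s(4e12, 30 * 3600)))        # ~296, roughly the 310 Mb/s quoted
```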
What happens when the Mb/s gets too large?
• Must understand the "features" in the data.
• Use an AMR-like scheme to save the data:
  • Does the data change dramatically everywhere?
  • Is the data smooth in some regions?
• Compression techniques can save 100x, but we must still be able to "use" the data.
  • New viz/analysis tools? Or we could just stitch the grid back up and use the old tools.
• Useful for level-of-detail visualization (more detail in regions which change).
• Use in combination with "smart" data caching / data compression (see below).
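A minimal sketch of the "save only where the data changed" idea, assuming a block decomposition and a user-chosen tolerance; the ~100x savings would come from smooth, slowly varying regions where whole blocks can be skipped. Block size and tolerance here are arbitrary.

```python
# Sketch: keep only blocks that changed more than a tolerance since the
# previously saved timestep; untouched blocks are reconstructed from earlier
# saves (what "stitching the grid back up" for old tools would rely on).
import numpy as np

def changed_blocks(prev, curr, block=32, tol=1e-3):
    saved = {}
    for i in range(0, curr.shape[0], block):
        for j in range(0, curr.shape[1], block):
            p = prev[i:i + block, j:j + block]
            c = curr[i:i + block, j:j + block]
            if np.max(np.abs(c - p)) > tol:   # block changed "dramatically"
                saved[(i, j)] = c.copy()
    return saved

prev = np.zeros((256, 256))
curr = prev.copy()
curr[10:20, 10:20] = 1.0                       # a localized change
print(len(changed_blocks(prev, curr)), "of", (256 // 32) ** 2, "blocks saved")
```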
End-to-end/workflow requirements
• Easy to install:
  • Good examples: MPI, netCDF, HDF5, LN, bbcp
• Easy to use:
  • Good example: EnSight Gold
• Must have "value added" over simple approaches.
  • Value added is discussed in the following slides.
• Must be robust/fault tolerant.
  • The workflow cannot crash our simulations/nodes!
Need a data model
• Allows the CS community to design "modules" which can understand the data.
• Allow for netCDF and HDF5.
• Develop interfaces to "extract" a portion of the data from the files/memory.
• Must come from the application areas teaming up with the CS community.
• HDF5/netCDF is not a data model.
• Can we use the data model in SCIRun/AVS Express/EnSight as a start?
  • Meshes[] (uniform, rectilinear, structured, unstructured)
  • Hierarchy in meshes (AMR)
  • Cell-centered, vertex-centered, edge-centered data
  • Multiple variables on a mesh
• Can we use "simple" APIs in the codes which can write the data out?
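A hedged sketch of what such a data model might look like in code, covering the mesh types, AMR hierarchy, centering, and multiple variables listed above. The class names and the write() hook are illustrative, not an existing API; a real implementation would map write() onto HDF5 or netCDF.

```python
# Illustrative data model only -- not SCIRun, AVS Express, or EnSight code.
from dataclasses import dataclass, field
from typing import Literal
import numpy as np

@dataclass
class Mesh:
    kind: Literal["uniform", "rectilinear", "structured", "unstructured"]
    coords: np.ndarray                                   # node coordinates
    children: list = field(default_factory=list)         # AMR refinement patches

@dataclass
class Variable:
    name: str
    centering: Literal["cell", "vertex", "edge"]
    data: np.ndarray

@dataclass
class Dataset:
    mesh: Mesh
    variables: dict = field(default_factory=dict)

    def write(self, path):
        """The 'simple API' a code would call; would sit on HDF5/netCDF."""
        raise NotImplementedError

# Usage: one mesh carrying several variables with different centerings.
m = Mesh("uniform", np.linspace(0.0, 1.0, 11))
ds = Dataset(m)
ds.variables["Te"] = Variable("Te", "vertex", np.zeros(11))
ds.variables["flux"] = Variable("flux", "cell", np.zeros(10))
```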
Monitoring
• We want to "watch" portions of the data from the simulation as the simulation progresses.
• Want the ability to play back from t=0 to the current frame, i.e., snapshot movies.
• Want this information presented so that we can collaborate during/after the simulation:
  • Highlight parts of the data to discuss with other users.
  • Draw on the figures.
• Mostly 1D plots, some 2D (surface/contour) plots, some 3D plots.
• Example: http://w3.pppl.gov/transp/ElVis/121472A03_D3D.html
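A small sketch of the snapshot-movie part: each monitored step writes a 1D plot to disk so frames can be replayed from t=0. The directory layout and variable names are made up; a real portal (e.g., ElVis) would serve these interactively.

```python
# Write one monitoring frame per timestep; playing the PNGs back in order
# gives the "snapshot movie". Paths and names are illustrative.
import os
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render without a display on the compute side
import matplotlib.pyplot as plt

def write_snapshot(step, radius, temperature, outdir="monitor"):
    os.makedirs(outdir, exist_ok=True)
    fig, ax = plt.subplots()
    ax.plot(radius, temperature)
    ax.set_xlabel("r")
    ax.set_ylabel("Te")
    ax.set_title(f"timestep {step}")
    fig.savefig(os.path.join(outdir, f"Te_{step:05d}.png"))
    plt.close(fig)

r = np.linspace(0.0, 1.0, 50)
for step in range(3):
    write_snapshot(step, r, np.exp(-r) * (1 + 0.1 * step))
```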
Portal to launch workflow/monitor jobs
• Use the portal as a front-end to the workflow.
• Would like to see the workflow, but not monitor it.
• Perhaps it will allow us to choose among different workflows which were already created?
• Would like to launch the workflow and have automatic job submission for known clusters/HPC systems.
  • Submit to all, and kill the rest when one starts running.
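A hedged sketch of the "submit to all, kill the rest" idea. The qsub/qdel commands are generic PBS-style placeholders; real sites, schedulers, and job scripts will differ, and status checking is left as a callback.

```python
# Illustration only: submit the same job to every known cluster, keep whichever
# starts first, cancel the others. Scheduler commands are PBS-style stand-ins.
import subprocess
import time

CLUSTERS = {"clusterA": "job_a.pbs", "clusterB": "job_b.pbs"}   # made-up names

def submit_everywhere():
    jobids = {}
    for name, script in CLUSTERS.items():
        out = subprocess.run(["qsub", script], capture_output=True, text=True)
        jobids[name] = out.stdout.strip()                       # scheduler job id
    return jobids

def keep_first_running(jobids, is_running, poll_s=30):
    """is_running(cluster, jobid) -> bool would parse qstat output in practice."""
    while True:
        for name, jid in jobids.items():
            if is_running(name, jid):
                for other, ojid in jobids.items():
                    if other != name:
                        subprocess.run(["qdel", ojid])          # kill the losers
                return name, jid
        time.sleep(poll_s)
```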
Users want to write their own analysis
• Requires that they can do this in F90, C/C++, or Python.
• Need wizards to allow users to describe their input/output.
  • Similar to AVS/Express, SCIRun, OpenDX.
• Common scenario: users want the main data field (field_in), a string ("temperature"), a condition (>), and an output field. They also want this to run on their cluster with M processors, and to change the inputs at any given time.
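A sketch of that common scenario in Python: the string names the variable, the condition is applied to the field, and the work is split over M processes (multiprocessing here stands in for running on the user's cluster). Everything in the code is illustrative, not a real wizard-generated module.

```python
# Illustrative user analysis: keep values of the field that satisfy a
# condition, spread over M worker processes.
import operator
import numpy as np
from multiprocessing import Pool

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def analyze(chunk, condition, threshold):
    """chunk is one piece of field_in (e.g., 'temperature'); returns the output field."""
    mask = OPS[condition](chunk, threshold)
    return np.where(mask, chunk, 0.0)

def run_analysis(field_in, condition=">", threshold=1.0, M=4):
    chunks = np.array_split(field_in, M)
    with Pool(processes=M) as pool:
        parts = pool.starmap(analyze, [(c, condition, threshold) for c in chunks])
    return np.concatenate(parts)

if __name__ == "__main__":
    temperature = np.random.rand(10_000) * 2.0   # stand-in for the real field
    field_out = run_analysis(temperature, ">", 1.0, M=4)
```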
Efficient data movement
• On the same node:
  • Use a memory reference.
• On the same cluster:
  • Use MPI communication.
• On different clusters (NxM communication):
  • 2 approaches: memory-to-memory vs. files.
  • The file approach is not always usable.
    • It can break the solution for "code coupling" approaches, since I/O (open/close/read/write) can become the bottleneck.
  • Working with Parashar/Kohl to look into the NxM problem.
  • Do we make this part of Kepler?
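A hedged sketch of choosing the transfer path by locality, following the three cases above. The Endpoint record and the three backends are placeholders; a real version would use shared memory, MPI, and an NxM redistribution layer rather than print statements.

```python
# Placeholder backends stand in for the three real mechanisms listed above.
from dataclasses import dataclass

@dataclass
class Endpoint:
    node: str
    cluster: str
    rank: int

def same_node_reference(array, dst):
    print("pass a memory reference")

def mpi_send(array, dst):
    print(f"MPI send to rank {dst.rank}")

def nxm_redistribute(array, dst):
    print("memory-to-memory NxM redistribution (avoid the file system)")

def send(array, src, dst):
    if src.node == dst.node:
        same_node_reference(array, dst)       # cheapest: stay in memory
    elif src.cluster == dst.cluster:
        mpi_send(array, dst)                  # intra-cluster: MPI
    else:
        nxm_redistribute(array, dst)          # inter-cluster: NxM

send([1, 2, 3], Endpoint("n01", "batch", 0), Endpoint("login1", "viz", 5))
```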
Distributed Data Storage - 1
• Users do NOT want to know where their data is stored.
• Users want the FASTEST possible method to get to their data.
• Users "seldom" look at all of their data at once.
  • Usually, we look at a handful of variables at a time, with only a few time slices at a time (we DON'T need 4 TB in a second).
• Users require that the solution works on their laptop when traveling (must cache results on local disk)!
• Users do NOT want to change their mode of operation during travel.
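The access pattern above (a few variables, a few time slices) is what a storage layer needs to serve efficiently. A small self-contained example using HDF5 hyperslab reads via h5py; the file name and dataset path are made up.

```python
# Read one variable at one timestep instead of the whole dataset. The tiny file
# written first is only there to make the example self-contained.
import numpy as np
import h5py

with h5py.File("run0001.h5", "w") as f:
    f.create_dataset("/fields/Te", data=np.random.rand(100, 64, 64))  # (time, y, x)

with h5py.File("run0001.h5", "r") as f:
    te_slice = f["/fields/Te"][42, ...]       # only this 64x64 slice is read
print(te_slice.shape)
```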
Distributed Data Storage - 2
• LN (Logistical Networking) is a good example of an "almost" usable system.
  • Needs to directly understand HDF5/netCDF.
  • Needs to be able to cache information on local disks and modify the eXnodes.
  • Needs to be able to work with HPSS.
• But this is NOT enough!
Smart data cache
• Users typically access their data in "similar" patterns.
  • Look at timestep 1 for variables A, B; look at timestep 2 for A, B; …
• If we know what the user wants, and when he/she wants it, then we can use smart technologies.
• In a collaboration, the data access gets more complicated.
• Neural networks to the rescue!
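A toy sketch of the idea, assuming the simplest possible predictor: if the user just read variable A at timestep t, prefetch A at t+1. The fetch callable stands in for a slow remote/LN read; anything smarter (including the neural-network idea) would replace the one-line prediction.

```python
# Toy prefetching cache; the prediction rule is deliberately trivial.
class SmartCache:
    def __init__(self, fetch):
        self.fetch = fetch                    # fetch(var, timestep) -> data
        self.cache = {}
        self.history = []                     # could feed a smarter predictor

    def get(self, var, ts):
        self.history.append((var, ts))
        if (var, ts) not in self.cache:
            self.cache[(var, ts)] = self.fetch(var, ts)
        # Predict the next request: same variable, next timestep.
        self.cache.setdefault((var, ts + 1), self.fetch(var, ts + 1))
        return self.cache[(var, ts)]

cache = SmartCache(lambda v, t: f"{v}@{t}")   # stand-in for a slow remote read
cache.get("A", 1)
cache.get("B", 1)                             # A@2 and B@2 are now already cached
```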
Need data mining technology integrated into the solution
• We must understand the "features" of the data.
  • Requires a working relationship between application scientists and computer scientists.
• Want to detect features "on the fly" (from the current and previous timesteps).
• Could feature-based analysis be done by the end of the simulation?
  • Pre-compute everything possible by the end of the simulation. DO NOT REQUIRE the end user to wait for anything that we know we want.
Security
• Users do NOT want to deal with this, but of course they have to.
• Will DOE require single sign-on?
• Can "trusted" sites talk to other "trusted" sites via ports being opened from A to B?
• Will this be the death of workflow automation?
  • We cannot automate data movement if we must sign on each time with unique passwords.
Conclusions
• We "need" Kepler in order for the CPES project to be successful.
• We need efficient NxM data movement and monitoring.
• We need to be able to provide feedback to the simulation(s).
• Codes must be coupled, and we need an efficient mechanism to couple the data.
• What do we do about single sign-on?
  • ORNL tells me that we can have ports open from one site to another without violating the security model. What about other sites?
• Are we prepared for new architectures?
  • The Cray XT3 has only one small pipe out to the world.