LA-UR-14-22615 The Simulation-Code-Hardware Feedback Loop in Practice Bill Archer LANL Advanced Simulation and Computing Program Director Salishan Conference on High-Speed Computing April 21, 2014
Outline • An Atypical DSW Driver • Trinity Mission Need • Code Adaptation
NNSA procures systems to solve national security problems • In 2011 Directed Stockpile Work (DSW) had a particular classified multi-physics 3D problem they wanted to simulate • We’ll call this “The Problem” • It was simulated with a classified multi-physics integrated design code (IDC) • We’ll call this “The Code” • Run on Cielo, a Cray XE6 at LANL with 0.28 PiB of memory and 136K cores [Diagram: Nuclear performance assessment requires multi-physics codes: Thermonuclear Burn, Fission, Radiation (Photons), Hydrodynamics] Mission-driven problem
The Simulation ran into many problems. • Ran on half the machine; it took 3 to 4 days to get an 8-hour allocation • Dealt with resiliency by allocating extra nodes and restarting within the Moab allocation • Suffered memory exhaustion when using ½ of Cielo • 25 TiB dump file caused archiving and data movement problems Hero-class problem
The Simulation proved to be too big for Cielo • Code team developed tailored physics to improve accuracy while reducing memory usage by 40% • Reduced the dump file to 9 TB • Still had memory exhaustion DSW deferred the problem after a year of trying
In 2012 the code team tried to get The Code to run The Simulation. • Found an I/O gather that caused memory exhaustion • Very painful to debug at 65,000 cores • Further runs at 78,000 cores ran into I/O hangs • Went away with an operating system update • Took 80% of Cielo to run The Problem • 97,000 cores and 0.21 PiB memory • Throttled I/O sends and was able to run on 60% of Cielo (see the sketch below) Still not practical to run, need a bigger machine
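The slide does not say how the sends were throttled; below is a minimal, hypothetical sketch of the general pattern, assuming a gather-to-writer step in which each rank caps its outstanding nonblocking sends. The function name, buffer layout, message tags, and the cap itself are illustrative and are not taken from The Code.

```cpp
// Hypothetical sketch: cap the number of outstanding MPI_Isend requests so a
// large I/O gather does not exhaust memory with in-flight messages.
#include <mpi.h>
#include <cstddef>
#include <vector>

void throttled_io_send(std::vector<std::vector<double>>& chunks,
                       int io_rank, int max_in_flight, MPI_Comm comm)
{
    std::vector<MPI_Request> reqs;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        // If max_in_flight sends are already posted, wait for one to finish
        // before posting another -- this is the throttle.
        if (static_cast<int>(reqs.size()) >= max_in_flight) {
            int done = 0;
            MPI_Waitany(static_cast<int>(reqs.size()), reqs.data(), &done,
                        MPI_STATUS_IGNORE);
            reqs.erase(reqs.begin() + done);
        }
        MPI_Request r;
        MPI_Isend(chunks[i].data(), static_cast<int>(chunks[i].size()),
                  MPI_DOUBLE, io_rank, /*tag=*/static_cast<int>(i), comm, &r);
        reqs.push_back(r);
    }
    // Drain the remaining sends before the buffers are reused or freed.
    MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
}
```

Bounding the number of in-flight sends bounds the memory that unsent buffers and the MPI layer can consume at once, which is the failure mode described above.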
The Trinity Mission Need was driven by this class of problem. • The desire was to increase the resolution by 2X • Allows increased geometric and physics fidelity • Adaptive mesh refinement (AMR) allows us to limit the memory increase to 3X, about 0.75 PiB (see the arithmetic below) • Be able to run 2 to 4 of these problems at once • Became the basis of the critical decision documents and the Request for Proposals 2 to 4 PBytes of memory desired; no FLOPS requirement
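A quick sanity check on those factors, using only the numbers on these slides plus standard mesh geometry (the assumption that memory scales with cell count is mine, not stated here):

$$2\times \text{ resolution in 3D} \;\Rightarrow\; 2^3 = 8\times \text{ cells under uniform refinement}, \qquad \text{with AMR: } \frac{0.75\ \text{PiB}}{0.28\ \text{PiB}} \approx 2.7\times \approx 3\times.$$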
Trinity will meet the memory requirement for our simulations. • Trinity negotiations are underway • An announcement is expected in late May • Delivery expected in Q4 FY15 • Trinity will be deployed by Los Alamos and Sandia through the Advanced Computing at Extreme Scale (ACES) partnership • Sited at Los Alamos; used by Los Alamos, Sandia, and Livermore • Partnering with LBNL on the procurement for acquisition of NERSC-8 Trinity must demonstrate a significant capability improvement over current platforms (>> Cielo)
How do codes keep up with rapid changes in hardware? • ASC has multiple codes that total several million lines of code and represent several billion dollars of taxpayer investment • ASC is bringing in major systems every 2.5 years [Plots courtesy of LLNL: peak vs. effective performance for heavy-core and light-core systems] What do we do … today?
One answer is to isolate the physics from the hardware • Assumes we can’t afford to rewrite codes for every system • Moving to a new machine should only impact the hardware-aware infrastructure • Need an “abstraction layer” to isolate the physics modules from the hardware-aware infrastructure (a minimal sketch follows this slide) [Diagram: Code 1/2/3 are built from physics packages (Hydro 1, Hydro 2, Explosives, EOS) that call interfaces into abstraction layer(s) over the hardware-aware infrastructure: MPI, threads, I/O, viz] Despite years of research the community has failed to deliver a production-usable abstraction layer
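To make the diagram concrete, here is a minimal C++ sketch of the idea only: a tiny execution interface that physics packages program against, with swappable backends standing in for the hardware-aware infrastructure. The names (ExecutionPolicy, parallel_for, hydro_update) are hypothetical, and this is not any specific ASC abstraction layer, which would also have to cover data layout, memory spaces, MPI, and I/O.

```cpp
// Hypothetical sketch of an "abstraction layer": physics packages write loops
// against a small execution interface; backends bind those loops to whatever
// the hardware-aware infrastructure provides (serial, threads, ...).
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct ExecutionPolicy {                        // the abstraction layer interface
    virtual ~ExecutionPolicy() = default;
    virtual void parallel_for(std::size_t n,
                              const std::function<void(std::size_t)>& body) const = 0;
};

struct SerialPolicy : ExecutionPolicy {         // fallback backend
    void parallel_for(std::size_t n,
                      const std::function<void(std::size_t)>& body) const override {
        for (std::size_t i = 0; i < n; ++i) body(i);
    }
};

struct ThreadPolicy : ExecutionPolicy {         // simple threaded backend
    void parallel_for(std::size_t n,
                      const std::function<void(std::size_t)>& body) const override {
        unsigned nt = std::max(1u, std::thread::hardware_concurrency());
        std::vector<std::thread> pool;
        for (unsigned t = 0; t < nt; ++t)
            pool.emplace_back([t, n, nt, &body] {     // strided loop per thread
                for (std::size_t i = t; i < n; i += nt) body(i);
            });
        for (auto& th : pool) th.join();
    }
};

// A physics package sees only the interface, never MPI or threads directly.
void hydro_update(std::vector<double>& u, double dt, const ExecutionPolicy& exec) {
    exec.parallel_for(u.size(), [&](std::size_t i) { u[i] += dt * u[i]; });
}
```

The design point is that porting to a new machine means adding or tuning a backend, while the physics loop written against parallel_for is untouched.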
The other extreme: become agile and rewrite the codes for each machine • Successfully programmed Cray vector machines with loop-level pragmas • Successfully programmed parallel clusters with low-level MPI calls (both styles illustrated below) • Success occurred during decades of hardware stability! Any agile examples with large code bases? [Photos: Cray-1; ASCI Blue Mountain]
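For concreteness, a hedged modern stand-in for the two styles named above, written in C++ (the historical Cray work used Fortran loop-level compiler directives; the function names and the 1D halo layout here are illustrative only):

```cpp
// (1) Loop-level directive: hint to the compiler that the hot loop is safe to
//     vectorize.  (2) Low-level MPI: a halo exchange written directly against
//     the MPI API.  Both are sketches, not production code.
#include <mpi.h>
#include <cstddef>
#include <vector>

// (1) Assumes x.size() == y.size() and no aliasing between x and y.
void saxpy(std::vector<float>& y, const std::vector<float>& x, float a) {
    #pragma omp simd
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += a * x[i];
}

// (2) Exchange one ghost cell with each 1D neighbor by hand.  Assumes
//     u.size() >= 3; u[0] and u[n-1] are ghost cells; boundary ranks may pass
//     MPI_PROC_NULL for left/right.
void halo_exchange(std::vector<double>& u, int left, int right, MPI_Comm comm) {
    const int n = static_cast<int>(u.size());
    MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                 &u[n - 1], 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[n - 2], 1, MPI_DOUBLE, right, 1,
                 &u[0],     1, MPI_DOUBLE, left,  1, comm, MPI_STATUS_IGNORE);
}
```

Both styles bind the source directly to one machine model, which is exactly why they only worked well during long stretches of hardware stability.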
The ability to affordably and quickly adapt the codes to new hardware is THE problem. • ASC systems are bought to solve mission problems • If our codes can’t use the systems, there isn’t any reason to buy them. The community needs to come together on a solution for graceful code migration
Questions? [Photos: Cielo, Cray XE6, 1.4 PF/s, 2011; IBM Punch Card Accounting Machines, 20 Op/s, 1944]
Abstract There is a tendency to view simulations, codes, and hardware independently and in isolation from each other. I will discuss a mission driven simulation that pushed the limits of a code and the Cielo hardware and in the end was just too large to run. This in turn was a major driver of the hardware requirements for Trinity. The selected Trinity hardware is now driving changes to the code, with the intent that this will allow us to run the original simulation, and to start preparing the codes for the next generation hardware. This feedback loop is typical of how Los Alamos leverages hardware and codes to increase the simulation space for problems of mission interest.
Biography Dr. Bill Archer carried out his doctoral research on quantum chemistry at Los Alamos. Before returning to LANL in 1999 he was a post-doc at Rice University, a main-ring superconducting magnet designer at the Superconducting Super Collider, and an operations research analyst at the Center for Naval Analyses. While at CNA he spent seven years embedded with the Fleet. Upon returning to LANL he modified an atomic physics code for parallel processing on Blue Mountain. He then moved to the ASCI Crestone Project, where he was project leader and team leader for several years as the codes were brought into general production on the Q and White machines. He was one of the first members of the Thermonuclear Burn Initiative, where he started studying burn and the history of the Weapons Program. Since 2008 he has held a variety of Los Alamos management positions: Advanced Simulation and Computing (ASC) Integrated Codes Program Manager, line manager of the Simulation Analysis and Code Development Group, line manager of the Material and Physical Data Group, ASC Deputy Program Director (acting), and currently ASC Program Director.