SST/macro Tutorial Sandia National Laboratories Gilbert Hendry, Joseph Kenny, Jeremy Wilke, Benjamin Allan
Exascale Computing Roadmap
• We don’t yet completely understand the incredibly complex design space in the exascale regime
  • Application scaling; programming, communication, and execution models; data management; fault tolerance at all levels; power, performance, and cost tradeoffs for hardware technologies
  • Example: applications need as much memory as possible, but memory is expensive and power hungry, even when not being used
• What we will end up with if we don’t understand it:
  • A machine we can’t turn on (because the power trips the circuit breakers, or it’s too expensive to run)
  • A machine that turns itself off (because it fails a lot)
  • A machine we could never buy, because we’re getting extreme diminishing returns on machine investment for useful work (application performance)
Source: John Shalf, Exascale Initiative Steering Committee
S3D – Brute force combustion simulation
• Grid growth across simulation generations: 2.5x10^5 grid points [1] → 1.7x10^7 [2] → 2.0x10^8 [3] → 6.9x10^9 [4]; next target N_grid ≈ 1.8T (1.8x10^12)
• Simulation cost ≈ Re^(11/4)
  • A one-order-of-magnitude increase in Re requires a 560x scale-up in computing requirement
• Current petascale simulations have achieved Re ≈ 10^4
• Typical Re in practical systems: IC engines ≈ 10^5, gas turbine engines ≈ 10^6
Source: Hemanth Kolla, SNL
[1] T. Echekki, J.H. Chen, Comb. Flame, 1996, vol. 106.
[2] T. Echekki, J.H. Chen, Proc. Comb. Inst., 2002, vol. 29.
[3] R. Sankaran, E.R. Hawkes, J.H. Chen, Proc. Comb. Inst., 2007, vol. 31.
[4] E.R. Hawkes, O. Chatakonda, H. Kolla, A.R. Kerstein, J.H. Chen, Comb. Flame, 2012 (online).
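The cost figure on this slide is easy to sanity-check: with cost ≈ Re^(11/4), one order of magnitude in Re multiplies the cost by 10^(11/4) ≈ 562, which the slide rounds to 560x. A minimal sketch (the function name is ours, not part of S3D or SST/macro):

```cpp
#include <cmath>

// Relative compute-cost increase for scaling the Reynolds number by
// `re_ratio`, using the slide's scaling law: cost ~ Re^(11/4).
double cost_scaleup(double re_ratio) {
    return std::pow(re_ratio, 11.0 / 4.0);
}

// cost_scaleup(10.0) is ~562, i.e. the slide's "560x scale up" for one
// order of magnitude in Re (e.g. from petascale Re ~ 1e4 to IC-engine
// Re ~ 1e5).
```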
The Modeling Challenge
• Intel ASCI Red (1997): 1.3 Teraflops, 9,298 cores, 850 kW
• Cray Jaguar XT5 (~2009): 1.75 Petaflops, 224,256 cores, 7 MW [1]
• Exascale machine (~2020, projected): 1 Exaflops, 1B cores, 20 MW [2]
• ASCI Red → Jaguar: ~1000x performance, ~10x concurrency, ~10x power
• Jaguar → exascale: ~1000x performance, ~10000x concurrency, ~2x power
• New scaling challenges fundamentally impact architecture → hardware/software co-design
Scaling Performance: Cost of Data Movement
• Sustain performance scaling through massive concurrency
• Data movement becomes more expensive than computation
• Embedded platforms goal: 1-100 GOps/W; big science (to exascale) goal: 20 MW/Exaflop
Courtesy: R. Murphy, Sandia National Labs
Co-Design Automation
• Goal: provide a range of modeling tools to enable faster design-space exploration
• Components in the loop: constitutive model analysis, app compiler analysis, skeletonization, coarse-grained simulation (SST/macro), cycle-accurate simulation/emulation, machine design and library implementation
• Outputs of the loop: best parameter ranges and algorithmic changes
When is Macroscale/Coarse-Grained Simulation Appropriate?
Trade-off axes: simulation cost (time, memory) vs. model fidelity/accuracy vs. scale
• Constitutive models – can be powerful for reasoning about a system and its tradeoffs, but hard to investigate new concepts and complex interactions
• Coarse-grained simulation – moderate accuracy, predicts trends, can scale, but requires approximations
• Cycle-accurate simulation – highly accurate for detailed studies, but limited ability to scale
• Emulation – essentially exact and fast, but expensive to scale
• Testbed/prototype – provides real numbers, but time/cost a major factor
Uses of Macroscale Modeling • Performance modeling of realistic applications at scale • Job placement, domain decomposition • Network design space exploration • Determine effects of topology, bandwidths and latencies • Software prototyping • Application/algorithm design • Programming model/library development • Fault injection/response • nodes (most common), network components
Skeleton Apps Allow Extreme Scale Exploration • Simulation via traces and direct execution is limited by existing machine size • Skeleton App: Simplified code which (approximately) reproduces some behavior of interest for a full application • Correct output not produced • Communication skeletons remove computation, leaving only control flow and communication code • Allows interconnect simulations at extreme scales (millions of tasks) • Run 2-10x faster than real execution • Currently a manual process for all but the simplest codes • Work underway on compiler-based skeletonization
Motivating Skeletonization
[Figure: HPCCG miniapp – memory consumption and wall time of the SST/macro simulation, full app vs. skeleton]
Motivating Skeletonization (cont’d)
HPCCG with 64 ranks, 32x32x32 problem
[Figure: memory consumption of the SST/macro simulation with full app (left) vs. skeleton (right, squashed to make the axes the same); the initial ramp is setting up the app/communication]
The SST/Macro Simulator
• General website: sst.sandia.gov
  • Note: SST/Macro is separate from SST
• Open source code, C++ with C/Fortran interfaces
• Easily modified or extended with new models, topologies, metrics, etc.
• Runs as a single process (single address space)
  • Application processes modeled as user-space threads
  • Global data requires special attention (false sharing)
• Offline (trace-based) mode for MPI applications
• Online (skeleton) mode
  • MPI, sockets, OpenSHMEM, HPX
  • Coming: UPC, ARMCI, GASNet, Global Arrays
• Downloads, documentation, issue tracker at http://sst.sandia.gov
SST/macro: A Coarse-Grained Simulator
• An application code with minor modifications
• An SST/macro-specific implementation of interfaces (such as MPI), which simulates execution and communication
A Picture of SST/macro
• Processes run as user-space threads on simulated nodes
• Nodes have: cores, process/core affinity, memory contention/NUMA effects, NIC effects/contention
• Network switches have: packet arbitration, adaptive routing, queuing/buffering
• Messages can be: flows, packets, packetized flows
Building and Running 20 minutes
Building SST/Macro • Obtain source via mercurial (hg) or downloads page • hg clone https://bitbucket.org/ghendry/sstmacro/ • https://bitbucket.org/ghendry/sstmacro/downloads • tutorial website: http://casl.gatech.edu/research/eiger-tutorial/ • Autoconf and related tools are needed unless you are using an unmodified release or snapshot tar archive • Autoconf: 2.64 or later should work and 2.68 is known to work • Automake: 1.11 or later should work and 1.11.1 is known to work • Libtool: 2.2.6 or later should work and 2.4 is known to work • A C++ compiler is required • Doxygen and Graphviz are needed only to build the documentation
The following steps configure, compile, and run the validation suite for SST/macro. They assume your current directory is the source code distribution directory and that the INSTALLDIR variable is set to the directory where you would like SST/macro to be installed.

$ ./bootstrap.sh
$ ./configure --prefix=$INSTALLDIR --enable-eiger
$ make
$ make check

After some time and quite a bit of output you should see something similar to this:

========================================================================
RESULT SUMMARY: tests=34 failures=0 errors=0 disabled=0
========================================================================

This indicates that everything worked as expected; failures or errors indicate that something went wrong. The actual number of tests depends on the version of SST/macro used and the configuration options. SST/macro can now be installed to the INSTALLDIR specified in the configure step by running:

$ make install

Important: make sure to add INSTALLDIR to your PATH variable. At this point the installation tests can be run:

$ make installcheck

A result summary similar to that generated by make check should be displayed when the tests complete.
Running a skeleton
• Inputs: your application source code, a parameter file, and the installed SST/macro headers + libs
• The application code is compiled and linked against SST/macro to produce the simulation binary
Skeleton Example 1: 1d Integrator
You can find this simple example in sstmacro/tutorials/1d_integrator_cxx. It shows how to include the SST/macro headers and how to specify global variables.
How to run:
$ cd sstmacro/tutorials/1d_integrator_cxx
$ make
$ ./runsstmac -f parameters.ini
Example 1: 1d Integrator
• SST/macro owns the main routine; it calls user_skeleton_main to start the simulation
• Substitute a model for real computation where possible to increase simulation speed
The Five Principles of Skeletonization
• 1. Correct headers
  • #include <sstmac/sstmpi.h>, not #include <mpi.h>
• 2. Change main()
  • to int user_skeleton_main(int argc, char* argv[])
• 3. Change the types of global variables
  • int becomes SSTMAC_INT; globals are only supported for C++
  • you might have to rename your file to .cc or .cpp
• 4. Skeletonize as much as possible
  • remove computation that produces an answer
  • remove memory allocation associated with computation or communication
• 5. If you want simulation time to pass between MPI calls, add computation modeling
  • the example showed a simple model (SSTMAC_compute); later you’ll see Eiger, which is more sophisticated
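A compilable sketch of principles 2-5. The SST/macro symbols are stubbed out locally here so the sketch stands alone; a real skeleton gets them from <sstmac/sstmpi.h>, and SST/macro itself provides main() and calls user_skeleton_main. The 1.5 ms per-iteration cost is invented for illustration.

```cpp
// --- local stand-ins for SST/macro symbols, for illustration only ---
static double g_sim_time_s = 0.0;                  // simulated clock
void SSTMAC_compute(double seconds) { g_sim_time_s += seconds; }
#define SSTMAC_INT int   // principle 3: globals use SST/macro types

SSTMAC_INT g_iterations = 100;   // a global, declared with SSTMAC_INT

// Principle 2: main() becomes user_skeleton_main(); the simulator
// calls it to start the simulation.
int user_skeleton_main(int argc, char* argv[]) {
    (void)argc; (void)argv;
    for (int it = 0; it < g_iterations; ++it) {
        // Principle 4: the real solver body (and its allocations) is gone.
        // Principle 5: a compute model makes simulated time pass instead.
        SSTMAC_compute(1.5e-3);  // made-up per-iteration cost
        // A simulated MPI_Allreduce(...) would follow here via sstmpi.h.
    }
    return 0;
}
```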
Skeletonizing
Each stage trades off speed, scalability, accuracy, and effort:
• Full application
• Remove computation
• Remove memory allocation
• Add simple computation models
• Add Eiger computation models
Parameters
• An SST/macro simulation reads parameters.ini by default
  • you can specify a different file with -f my_param_file.ini on the command line
• Some of the more important parameters: the included machine-parameter file, the app name, the app launch size (ranks) and number of nodes to use, and parameters passed through to the application
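A sketch of what such a file might look like. Only network_name, launch_allocation, and launch_indexing appear elsewhere in this tutorial; the other key names and values here are illustrative guesses, so check the files shipped in sstmacro/configurations/ for the real spellings in your version.

```ini
# parameters.ini sketch -- keys marked "illustrative" are guesses
include machine.ini            # pull in the machine parameters (illustrative)
launch_app1 = hpccg            # app name (illustrative)
launch_app1_size = 64          # number of ranks (illustrative)
launch_allocation = random     # how the job gets nodes
launch_indexing = random       # how ranks map onto allocated nodes
network_name = packettrain     # congestion model (illustrative spelling)
```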
Changing machine model/parameters
• Change the topology layout (e.g. 16x6, 10 groups) and link bandwidth
• Check sstmacro/configurations/hoppertrain.ini for more examples
GUI for building a parameter file (requires SST/macro configured with --with-qt)
sstmacro/build $ [sudo] make gui
sstmacro/build $ open qt-gui/SSTMacro.app (on Macs)
SST/macro GUI
• Tabs for basic parameter classes (e.g. network)
• Tabs for a specific class (NIC) within network
SST/macro GUI: Manager
• Pick the basic congestion model: here we select “packet train”
SST/macro GUI: Tooltip docs
• Documentation of what network_name specifies
• Documentation for each choice of network
SST/macro GUI: MPI
• MPI overhead parameters
• MPI protocol parameters
Validation 5 minutes
Validation Methodology
• Validation for modern machines can be complicated:
  • Proprietary hardware
  • MPI implementation
  • Allocation noise
  • Thermal events
• Use microbenchmarks to instrument the machine in specific ways
  • some are in sstmacro/skeletons/osu
  • Idea: kick the machine in the pants to test corner cases; a full application will likely be more accurate than what we train on
• Formal UQ tools can aid in training a high-dimensional parameter space
Validation Study
[Plots: MPI_Scatter, MPI_Gather, MPI_Allgather]
• A validation study has been completed against a Cray XE6 using the packet train model
  • sstmacro/configurations/hoppertrain.ini
• Able to capture congestion behavior that results from both hardware and MPI implementation effects
• Demonstrated a simulation validation workflow that includes formal UQ methods
• Working on packaging up the UQ tools
A Real Miniapp: HPCCG 10 minutes
HPCCG: HPC Conjugate Gradient
• A mini-app developed by the Mantevo project (www.mantevo.org)
• “This simple benchmark code is a self-contained piece of C++ software that generates a 27-point finite difference matrix with a user-prescribed sub-block size on each processor.”
• By default, it weak scales
How to run:
$ (install SST/macro)
$ cd tutorial-ISPASS/apps/HPCCG_skel
$ make
$ ./runhpccg -f hpccg.ini
HPCCG: Overview – main.cpp
• Set up the finite-difference matrix (initialize.cpp)
• Run the algorithm
• Report on metrics
HPCCG: Overview – HPCCG.hpp
• Struct to store data
• Main loop
  • Dot product
  • Communication
  • Matrix-vector product
HPCCG: Skeletonize ddot(…)
• Enables profiling with callgraph/graphviz
• Simple linear model based on loop bounds and a “fudge factor”
1st Order Modeling
• Create a linear model out of the loop bounds and an estimate of the amount of work per loop iteration – “Cwork”
• SSTMAC_Compute_loop(lower_bound, upper_bound, Cwork)
• SSTMAC_Compute_loop2(lb1, ub1, lb2, ub2, Cwork)
• SSTMAC_Compute_loop3(lb1, ub1, lb2, ub2, lb3, ub3, Cwork)
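The shape of the model behind these calls can be written in a few lines. These stand-ins are ours, not SST/macro's actual implementation (which advances the simulated clock rather than returning a value): estimated time is simply trip count times Cwork, and nested loops multiply their trip counts.

```cpp
// Stand-in for the 1-D compute_loop model: estimated time is
// (loop trip count) x (work per iteration, "Cwork", in seconds).
double model_compute_loop(long lb, long ub, double cwork_s) {
    return static_cast<double>(ub - lb) * cwork_s;
}

// Stand-in for the 2-D variant: nested loops multiply trip counts.
double model_compute_loop2(long lb1, long ub1,
                           long lb2, long ub2, double cwork_s) {
    return static_cast<double>(ub1 - lb1)
         * static_cast<double>(ub2 - lb2) * cwork_s;
}
```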
Compute modeling by example
See the pattern of changes via the examples in tutorial-ispass/apps/HPCCG_* (use a diff GUI, e.g. meld):
• HPCCG_presst/ – original code extracted from the Mantevo mini-app
• HPCCG_presst_metrics/ – original code with lwperf coarse data collection (try ‘meld HPCCG_presst HPCCG_presst_metrics’)
• HPCCG_skel/ – SSTMAC model
• HPCCG_skel_metrics/ – SSTMAC model with lwperf coarse data collection
• HPCCG_full/ – SSTMAC model with all original computations side by side
Compute_loops models: the basics
Whole-device modeling (accelerator card or CPU not engaged in asynchronous communication simultaneously) is straightforward:
• Collect app data for a region of MPI/network-free code
• Vary the working-set size (W) and (if desired) the on-node core count (P)
• Store the wall-time cost (over many samples, compute an average R for each W, P)
• Store loop lengths or other work-item counters (add counters if needed)
• Store any available system counters that may indicate OS noise
• If the hardware is only theoretical, use a cycle simulator to get the time cost R
• Replace the work loads with SSTMAC_compute_loop calls receiving the same counter values seen in the application; start with the last argument (the per-iteration work estimate, N) set to 1
• Observe the simulated time S for a given W and compute the correction factor CF = R / S
• Update N_new = CF * N_old; this converges quickly in most cases
• If some variants of W fit in cache and others don’t, you need two Ns
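The calibration steps above form a fixed-point iteration, and because the model is linear in N, one step already lands on the measured time in the idealized case. A sketch (function names are ours, for illustration):

```cpp
// One calibration step: given measured wall time R and the simulated
// time S produced by the current work estimate N, rescale N by CF = R/S.
double calibrate_step(double R, double S, double N_old) {
    return (R / S) * N_old;
}

// Idealized simulator response: simulated time is linear in N.
double simulated_time(double k, double N) { return k * N; }

// With k = 0.5 and N = 1, S = 0.5; for a measured R = 2.0 the step gives
// N = 4, and the re-simulated time equals R exactly.
```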
Compute_loops models: some tips
• A 1-D loop with arbitrary length and N=1 replaces straight-line code blocks
• Nested minor loops or conditional logic can be modeled by making N for the major loop a function of counters for the minor loops or conditional trips
• Use a large block and a single call to the compute_loops library if all the loops in the block have the same bounds
• Node-local OpenMP blocks can be modeled if the timers enclose the entire parallel region
• Sanity checks when N doesn’t converge easily:
  • Plot all the data samples from the real application; averages may be contaminated by process interruptions and cache effects. Make an SST/macro config-file entry for N.
  • Diagonal plots with kinks in sloped lines may mean a loop model is missing
  • Diagonal plots with horizontal bars often mean the app timing data is noisy
• Use the LWPERF library for C++/C/Fortran to simplify comparisons
• There is no single N that, on x86, will model both a loop run on a single core and the same loop running on multiple memory-contending cores
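The "two Ns" situation (cache-resident vs. memory-bound working sets, and similarly single-core vs. contended multi-core runs) can be handled with a tiny dispatch at model time. The cache size and N values below are invented for illustration; in practice each N comes from a separate calibration.

```cpp
// Two calibrated work estimates (invented numbers): one for working
// sets that fit in the last-level cache, one for those that spill.
const double kLLCBytes    = 8.0 * 1024 * 1024;  // assumed 8 MB LLC
const double kNInCache    = 1.0;
const double kNOutOfCache = 3.5;

// Pick the work estimate N for a loop given its working-set size W.
double pick_N(double working_set_bytes) {
    return (working_set_bytes <= kLLCBytes) ? kNInCache : kNOutOfCache;
}
```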
LWPERF: instrumenting & validating many loop-nest models at once
LWPERF provides a lightweight macro interface and library for C/C++/Fortran 2xxx to log coarse-grained performance:
• Documentation and a simple example are available for all languages
• Simple tailoring to advanced environments
• Sends data to CSV files or the Eiger API
• Parameter-study and graphical post-processing scripts ship with HPCCG
  • Diagonal plots and time-series comparison plots for validation
• Currently supports MPI and SST/macro MPI, properly enclosed OpenMP, and accelerator calls
• Cut & paste supported:
  • Instrumentation looks the same in apps and skeleton apps
  • Instrumentation looks the same across all languages
• Keeps Eiger API calls out of application source files
• A single compiler flag deactivates all LWPERF behavior in an instrumented code; it can safely be added permanently to an application code base
Basic Usage, Statistics, Visualization and Performance Metrics 15 minutes
Debugging SST/macro
./runhpccg -f hpccg.ini -d "<debug> mpicheck | <debug> launch"
• Some of the more common debug flags:
  • mpicheck – prints a banner after MPI_Finalize(), so you know your app completed successfully
  • launch – see which nodes are launching applications
  • topology – see how nodes are connected together
  • router – see routing decisions
  • sstmac_mpi – see MPI calls
• Add " | timestamps" at the end to print timestamps with the debug output
• On a Mac, add -c to the command line to color-code output in the console
Visualization and analysis tools: traffic pattern from spyplot
./runhpccg -f hpccg.ini -d "<stats> spyplot"
Produces network_spyplot_bytes.png, a source-vs-destination traffic heat map at three levels: logical MPI space, the allocation (the nodes’ perspective), and the whole machine/network
Visualization and analysis tools: traffic pattern from spyplot with launch_allocation = random (whole machine/network view)
Visualization and analysis tools: traffic pattern from spyplot with launch_indexing = random (allocation view)