Discover how BigSim on Charm++ enhances parallel simulation, tackling complex applications with sophisticated algorithms and dynamic adaptive refinements.
The BigSim Parallel Simulation System
Gengbin Zheng, Ryan Mokos
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Charm++ Workshop 2010
Outline • Overview • BigSim Emulator • BigSim Simulator
Summarizing the State of the Art • Petascale • Very powerful parallel machines exist (Jaguar, Roadrunner, etc.) • Application domains exist that need that kind of power • New generation of applications • Use sophisticated algorithms • Dynamic adaptive refinements • Multi-scale, multi-physics • Parallel applications are more complex than sequential ones and hard to predict without actually running them • Challenge: Is it possible to simulate these applications at large scale using small clusters?
BigSim • Why BigSim, and why on Charm++? • Targets large-scale simulation • Object-based processor virtualization provides a virtualized execution environment • Charm++'s efficient message-passing runtime supports fine-grained decomposition • Portability
BigSim Infrastructure • Emulator • A virtualized execution environment for Charm++ and MPI applications • Requires little or no change to MPI application source code • Facilitates code development and debugging • Simulator • Trace-driven approach • Parallel Discrete Event Simulation • Simple latency and full network contention models • Predicts parallel performance at varying levels of resolution
Architecture of BigSim [diagram: Charm++/MPI applications run on the AMPI and Charm++ runtimes over the BigSim Emulator, which produces simulation trace logs; the BigSim Simulator, built on POSE, consumes those logs and feeds performance visualization in Projections]
MPI Alltoall Timeline [timeline screenshot]
BigSim Emulator • Emulates a full machine on existing machines by actually running the parallel program • E.g., NAMD on 256K target processors using 8K cores of the Ranger cluster • Implemented on Charm++ as libraries that link with the user application • Simple architecture abstraction: many multiprocessor (SMP) nodes connected via message passing • Does not emulate at the instruction level
BigSim Emulator: Functional View [diagram: a physical processor runs the Converse scheduler and Converse queue; it hosts multiple target nodes, each with a node-level queue, processor-level queues, incoming queues, and sets of worker and communication processors]
Processor Virtualization • Programmer: decomposes the computation into objects (user view) • Runtime: maps the computation onto the processors (system view); see the sketch below
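To make this division of labor concrete, here is a minimal Charm++ sketch; the names (Worker, compute) are hypothetical. The programmer only declares and writes the object array, and the runtime decides which physical processor each element runs on:

```cpp
// worker.ci -- interface file: the programmer declares a 1D array of
// chares (objects). Names here are illustrative, not from the slides.
//   mainmodule worker {
//     array [1D] Worker {
//       entry Worker();
//       entry void compute(int step);
//     };
//   };

// worker.C -- the runtime, not the programmer, places array elements.
#include "worker.decl.h"

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage* m) {}   // required so the runtime can migrate objects
  void compute(int step) {
    // thisIndex identifies this object; CkMyPe() is wherever the runtime put it
    CkPrintf("Worker %d runs step %d on PE %d\n", thisIndex, step, CkMyPe());
  }
};
#include "worker.def.h"
```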
Major Challenges • Running multiple copies of the code on each processor • Shared global variables • Charm++ applications already handle this • AMPI: global/static variables must be privatized via runtime techniques and compiler tools (a sketch of the idea follows below) • Simulation time and memory footprint • E.g., NAMD on 1024 target processors using 8 cores • Global read-only variables can be shared • Out-of-core execution
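The global-variable problem and its standard fix can be sketched as follows. This shows the general privatization idea with illustrative names, not the exact transformation BigSim's compiler tools perform:

```cpp
// Before: a global shared by every virtual MPI rank emulated on one
// physical processor -- the ranks would overwrite each other's value.
//   int iteration;

// After: per-rank state is packed into a struct that each virtual rank
// owns privately and passes explicitly (or reaches via thread-local data).
#include <cstdio>

struct RankState {
  int iteration = 0;   // formerly a global variable
};

void step(RankState& s, int rank) {
  s.iteration++;       // each virtual rank touches only its own copy
  std::printf("rank %d at iteration %d\n", rank, s.iteration);
}
```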
NAMD Emulation • Only a 19× slowdown • Only a 7× increase in memory
Out-of-core Emulation • Motivation: applications whose memory footprint is too large for the virtual memory system to handle well • Use the hard drive, similarly to checkpointing • Message-driven execution: peek at the message queue to see what executes next, and prefetch its state (see the sketch below)
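Because execution is message driven, the scheduler can look ahead in the queue and overlap disk I/O with computation. The sketch below illustrates the idea only; the types and helper functions are hypothetical, not BigSim's actual implementation:

```cpp
#include <deque>

struct Message { int targetObj; };     // which emulated object it drives

std::deque<Message> msgQueue;          // incoming message queue

void bringIntoCore(int obj)    { /* read obj's state from disk (stub) */ }
void evictIfNeeded()           { /* write cold objects back to disk (stub) */ }
void execute(const Message& m) { /* run the handler for m (stub) */ }

void scheduleStep() {
  if (msgQueue.empty()) return;
  Message next = msgQueue.front();
  msgQueue.pop_front();
  // Peek at the queue: start prefetching the object the *following*
  // message needs while the current one executes.
  if (!msgQueue.empty())
    bringIntoCore(msgQueue.front().targetObj);
  evictIfNeeded();
  bringIntoCore(next.targetObj);
  execute(next);
}
```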
What is in the Trace Logs? [figure: traces for 2 target processors] • Tools for reading bgTrace binary files: • charm/example/bigsim/tools/loadlog converts to a human-readable format • charm/example/bigsim/tools/log2proj converts to trace Projections log files • Each SEB has: startTime, endTime, incoming message ID, outgoing messages, dependences (see the sketch below)
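Conceptually, each SEB record carries the fields listed above. The struct below is a schematic view with illustrative field names, not the exact bgTrace binary layout:

```cpp
#include <vector>

// One Sequential Execution Block (SEB) from a trace log, schematically.
struct SEB {
  double startTime;                 // when the block began (target time)
  double endTime;                   // when it finished
  int incomingMsgID;                // message whose arrival triggered it
  std::vector<int> outgoingMsgIDs;  // messages this block sent
  std::vector<int> dependences;     // blocks that must complete before it
};
```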
BigSim Simulator: BigNetSim • Post-mortem network simulator built on POSE (Parallel Object-oriented Simulation Environment), which is itself built on Charm++ • Parallel Discrete Event Simulation • Passes emulator traces through different network models in BigNetSim to obtain final performance results • Details on using BigNetSim: • http://charm.cs.uiuc.edu/workshops/charmWorkshop2009/slides/tut_BigSim09.ppt • http://charm.cs.uiuc.edu/manuals/html/bignetsim/manual.html
POSE • Network layer constructs (NIC, switch, node, etc.) implemented as poser simulation objects • Network data constructs (message, packet, etc.) implemented as event methods on simulation objects
Posers • Each poser is a tiny simulation
Performance Prediction • Two components: • Time to execute blocks of sequential computational code (SEBs = Sequential Execution Blocks) • Communication time, based on a particular network topology
Sequential Time Prediction (Emulator) • Manual: advance processor time using BgElapse() calls in application code (see the sketch below) • Wallclock time: use a multiplier (scale factor) to account for architecture differences • Performance counters: count instructions with hardware counters, then use the expected time of each instruction on the target machine to derive execution time • Instruction-level simulation (e.g., Mambo): record cycle-accurate execution times for functions and use the interpolation tool to replace SEB times
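For the manual method, the usage looks like the sketch below. BgElapse() is the emulator call named above (its declaration is sketched here and its signature assumed); the cost model, a flop count over an assumed target flop rate, is purely illustrative:

```cpp
extern void BgElapse(double seconds);  // BigSim emulator API; declaration
                                       // sketched here, signature assumed

void computeBlock(double* a, int n) {
  for (int i = 0; i < n; i++)
    a[i] = a[i] * 1.000001 + 0.5;        // the real work (~2 flops/element)
  double predicted = (2.0 * n) / 5.0e9;  // assume a 5 GFLOP/s target CPU
  BgElapse(predicted);                   // advance the target's virtual clock
}
```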
Sequential Time Prediction (continued) • Model-based (recent work) • Performed after emulation • Determine the application functions responsible for most of the computation time • Run those functions on the target machine • Obtain run times as a function of each function's parameters to build a model • Feed the emulation traces through an offline modeling tool (like the interpolation tool) to replace SEB times • Generates a corrected set of traces
Communication Time Prediction (Simulator) • Valid for a particular network topology • Generic: simple latency model • A formula predicts time from latency and bandwidth parameters (see the sketch below) • Specific: BlueGene, Blue Waters, and others • Latency-only option, which uses a formula specific to the network • Full contention modeling
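The generic model reduces to the familiar alpha-beta formula: predicted time = latency + size / bandwidth. The sketch below uses placeholder parameter values, not BigNetSim's defaults:

```cpp
// Simple latency model: T(message) = alpha + bytes / beta.
double simpleLatency(double bytes) {
  const double alpha = 5.0e-6;   // per-message latency, e.g. 5 microseconds
  const double beta  = 1.0e9;    // link bandwidth, e.g. 1 GB/s
  return alpha + bytes / beta;   // predicted communication time in seconds
}
```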
Charm++ Workshop 2010 BGnode Transceiver BGproc BGproc Net Interface Switch Specific Model (Full Network) Channel Channel Channel Channel Channel Channel
Charm++ Workshop 2010 BGnode Transceiver BGproc BGproc Net Interface Switch Generic Model (Simple Latency) Channel Channel Channel Channel Channel Channel
What We Model • Processors • Nodes • NICs • Switches/hubs • Channels • Packet-level direct and indirect routing • Buffers with a credit scheme • Virtual channels
Other BigNetSim Features • Skip points • Set skip points in application code (e.g., after startup) • Simulate only between skip points • Transceiver • Traffic pattern generator that replaces nodes and processors • Windowing • Set the file window size to decrease the memory footprint • Can cut the footprint in half or better, depending on trace structure • Checkpoint-to-disk (recent work) • Saves simulator state at a time or GVT interval so the simulation can restart if a crash occurs
BigNetSim Tools • Located in BigNetSim/trunk/tools • Log Analyzer • Provides information about a set of traces • Number of events per simulated processor • Number of messages sent • Log Transformation (recently completed) • Produces a new set of traces with remapped objects • Useful for testing load-balancing scenarios
BigNetSim Output • BgPrintf() statements • Added to application code • "%f" is converted to the committed time during simulation (see the sketch below) • GVT = Global Virtual Time • Each GVT tick = 1/factor seconds, where factor is defined in BigNetSim/trunk/Main/TCsim.h • Link utilization statistics • Projections traces • Use the -tproj command-line parameter
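A minimal usage sketch of BgPrintf(), mirroring the "major loop" messages in the example output below; the declaration is included only to keep the sketch self-contained and is normally supplied by the BigSim headers:

```cpp
extern "C" void BgPrintf(const char* format);  // "%f" becomes committed time

void majorLoop() {
  BgPrintf("Start of major loop at %f\n");
  // ... application work ...
  BgPrintf("End of major loop at %f\n");
}
```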
BigNetSim Output Example
Charm++: standalone mode (not using charmrun)
Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
Charm++> cpu topology info is being gathered!
Charm++> 1 unique compute nodes detected!
bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
Opts: netsim on: 0
Initializing POSE...
POSE initialization complete.
Using Inactivity Detection for termination.
netsim skip_on 0 0
Info> timing factor 1.000000e+08 ...
Info> invoking startup task from proc 0 ...
[0:RECV_RESUME] Start of major loop at 0.014741
[0:RECV_RESUME] End of major loop at 0.034914
Simulation inactive at time: 38129444
Final GVT = 38129444
Final link stats [Node 0, Channel 0, ### Link]: ovt: 38129444, utilization time: 29685846, utilization %: 77.855439, packets sent: 472210 gvt=38129444
Final link stats [Node 0, Channel 3, ### Link]: ovt: 38129444, utilization time: 631019, utilization %: 0.016549, packets sent: 4259 gvt=38129444
1 PE Simulation finished at 18.052671. Program finished.
Ring Projections Timeline [timeline screenshot]
BigNetSim Performance • Examples of sequential simulator performance on Blue Print • 4k-VP MILC • Startup time: 0.7 hours • Execution time: 5.6 hours • Total run time: 6.3 hours • Memory footprint: ~3.1 GB • 256k-VP 3D Jacobi (10x10x10 grid, 3 iterations) • Startup time: 0.5 hours • Execution time: 1.5 hours • Total run time: 2.0 hours • Memory footprint: ~20 GB • Still tuning parallel simulator performance
Thank you! Free download of Charm++ and BigSim: http://charm.cs.uiuc.edu Send questions and comments to: ppl@charm.cs.uiuc.edu