The BigSim Parallel Simulation System
Gengbin Zheng, Ryan Mokos
Charm++ Workshop 2010
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Outline
• Overview
• BigSim Emulator
• BigSim Simulator
Summarizing the State of the Art
• Petascale
  • Very powerful parallel machines exist (Jaguar, Roadrunner, etc.)
  • Application domains exist that need that kind of power
• New generation of applications
  • Use sophisticated algorithms
  • Dynamic adaptive refinement
  • Multi-scale, multi-physics
• Parallel applications are more complex than sequential ones and are hard to predict without actually running them
• Challenge: is it possible to simulate these applications at large scale using small clusters?
BigSim
• Why BigSim, and why on Charm++?
  • Targets large-scale simulation
  • Object-based processor virtualization provides a virtualized execution environment
  • Efficient message-passing runtime from Charm++
  • Supports fine-grained decomposition
  • Portability
BigSim Infrastructure
• Emulator
  • A virtualized execution environment for Charm++ and MPI applications
  • Little or no change to MPI application source code
  • Facilitates code development and debugging
• Simulator
  • Trace-driven approach
  • Parallel Discrete Event Simulation
  • Simple latency or full network contention modeling
  • Predicts parallel performance at varying levels of resolution
Architecture of BigSim
[diagram] Charm++/MPI applications run on the BigSim Emulator, which is layered on the Charm++ and AMPI runtimes; the emulator writes simulation trace logs, which feed the POSE-based BigSim Simulator, whose results are visualized with Projections.
MPI Alltoall Timeline
[figure]
BigSim Emulator
• Emulates the full target machine on existing machines by actually running the parallel program
  • E.g., NAMD on 256K target processors using 8K cores of the Ranger cluster
• Implemented on Charm++ as libraries that link to the user application
• Simple architecture abstraction
  • Many multiprocessor (SMP) nodes connected via message passing
  • Does not emulate at the instruction level
(An example launch line follows below.)
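As a sketch, an emulation is launched like a normal Charm++ program plus target-machine parameters; the +x/+y/+z (target topology) and +cth/+wth (communication and worker threads per target node) flags correspond to the values echoed in the sample output later in this deck, but check the BigSim manual for your version:

  ./charmrun +p2 ./pgm +x8 +y1 +z1 +cth1 +wth1

Here 2 physical cores (+p2) emulate an 8x1x1 arrangement of target processors.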
BigSim Emulator: functional view
[diagram] On each physical processor, a Converse scheduler and Converse queue drive a set of emulated target nodes; each target node has a node-level queue, processor-level queues, and incoming queues serving its worker processors and communication processors.
Processor Virtualization
• Programmer: decomposes the computation into objects
• Runtime: maps the computation onto the processors
[diagram: user view of objects vs. system view of processors]
(A minimal sketch follows below.)
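To make the idea concrete, here is a minimal, hypothetical Charm++ sketch (the module and method names Worker and doWork are illustrative, not part of BigSim): the programmer sizes the chare array by the problem decomposition, and the runtime decides which physical processor runs each element.

  // worker.ci -- Charm++ interface file
  mainmodule worker {
    mainchare Main {
      entry Main(CkArgMsg *m);
    };
    array [1D] Worker {        // one chare per work unit, not per processor
      entry Worker();
      entry void doWork();     // the runtime picks the PE that executes this
    };
  };

  // In C++: create 4096 objects regardless of how many PEs exist.
  CProxy_Worker workers = CProxy_Worker::ckNew(4096);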
Major Challenges
• Running multiple copies of the code on each processor
  • Shared global variables
    • Charm++ applications already handle this
    • AMPI: global/static variables handled via runtime techniques and compiler tools (a generic privatization sketch follows below)
• E.g., NAMD on 1024 target processors using 8 cores:
  • Simulation time
  • Memory footprint
    • Global read-only variables can be shared
    • Out-of-core execution
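One common way to handle globals under virtualization (a general pattern, not a specific BigSim/AMPI API) is to move them into per-rank state that is passed explicitly; a minimal C-style sketch:

  /* Before: one global shared by every emulated rank on a processor. */
  /* int iteration_count; */

  /* After: privatized per-rank state. */
  typedef struct {
    int iteration_count;    /* formerly a global variable */
  } RankState;

  void step(RankState *st) {
    st->iteration_count++;  /* each emulated rank updates its own copy */
  }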
NAMD Emulation
[chart] Only a 19× slowdown and only a 7× increase in memory footprint.
Out-of-core Emulation
• Motivation
  • Applications with a large memory footprint that the virtual memory system cannot handle well
• Use the hard drive, similar to checkpointing
• Message-driven execution: peek at the message queue to see what executes next, enabling prefetch (see the sketch below)
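A self-contained C++ sketch of the prefetch idea (Msg, prefetch, and the resident-set bookkeeping are illustrative stand-ins, not emulator APIs):

  #include <queue>

  struct Msg { int targetPe; };        // message bound for a target processor

  static bool resident[64];            // which target processors are in memory

  void prefetch(int pe) {
    if (!resident[pe]) {
      /* start an asynchronous read of pe's state from disk */
      resident[pe] = true;
    }
  }

  void schedule(std::queue<Msg>& q) {
    while (!q.empty()) {
      Msg cur = q.front(); q.pop();
      if (!q.empty())
        prefetch(q.front().targetPe);  // peek ahead: overlap disk I/O with work
      /* execute cur on its (now resident) target processor,
         possibly evicting a cold processor's state to disk */
    }
  }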
What is in the Trace Logs?
[figure: traces for 2 target processors]
• Tools for reading bgTrace binary files:
  • charm/example/bigsim/tools/loadlog — converts to a human-readable format
  • charm/example/bigsim/tools/log2proj — converts to trace projections log files
• Each SEB (Sequential Execution Block) has:
  • startTime, endTime
  • incoming message ID
  • outgoing messages
  • dependences
(An illustrative record layout follows below.)
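The fields listed above suggest a record roughly like the following (an illustrative C++ sketch; the actual bgTrace format is binary and more detailed):

  #include <vector>

  // Illustrative layout of one Sequential Execution Block (SEB).
  struct SEB {
    double startTime, endTime;      // when the block ran on the target processor
    int    incomingMsgID;           // the message that triggered this block
    std::vector<int> outgoingMsgs;  // messages sent while the block ran
    std::vector<int> dependences;   // blocks that must commit before this one
  };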
BigSim Simulator: BigNetSim
• Post-mortem network simulator built on POSE (Parallel Object-oriented Simulation Environment), which is itself built on Charm++
• Parallel Discrete Event Simulation
• Passes emulator traces through different network models in BigNetSim to produce final performance results
• Details of using BigNetSim:
  • http://charm.cs.uiuc.edu/workshops/charmWorkshop2009/slides/tut_BigSim09.ppt
  • http://charm.cs.uiuc.edu/manuals/html/bignetsim/manual.html
POSE
• Network layer constructs (NIC, switch, node, etc.) implemented as poser simulation objects
• Network data constructs (message, packet, etc.) implemented as event methods on simulation objects
Posers
[diagram] Each poser is a tiny simulation.
Performance Prediction
• Two components:
  • Time to execute blocks of sequential computational code (SEBs: Sequential Execution Blocks)
  • Communication time, based on a particular network topology
Sequential Time Prediction (Emulator)
• Manual
  • Advance the processor time using BgElapse() calls in the application code (see the sketch after this list)
• Wallclock time
  • Use a multiplier (scale factor) to account for architecture differences
• Performance counters
  • Count instructions with hardware counters
  • Use the expected time of each instruction on the target machine to derive the execution time
• Instruction-level simulation (e.g., Mambo)
  • Record cycle-accurate execution times for functions
  • Use the interpolation tool to replace SEB times
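A minimal sketch of the manual approach. BgElapse() is named above; the argument is assumed here to be seconds of target-processor time, so check the BigSim manual for the exact signature and units:

  void computeStep() {
    /* ... sequential work elided ... */
    // Advance the emulated clock as if this block took 2.5 ms on the
    // target machine, rather than using the host's wallclock time.
    BgElapse(0.0025);
  }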
Sequential Time Prediction (continued)
• Model-based (recent work)
  • Performed after emulation
  • Determine the application functions responsible for most of the computation time
  • Run these functions on the target machine and obtain run times as a function of their parameters to create a model
  • Feed the emulation traces through an offline modeling tool (like the interpolation tool) to replace SEB times
  • Generates a corrected set of traces
Communication Time Prediction (Simulator)
• Valid for a particular network topology
• Generic: simple latency model
  • A formula predicts the time from latency and bandwidth parameters (see below)
• Specific
  • BlueGene, Blue Waters, and others
  • Latency-only option: uses a formula specific to the network
  • Full contention modeling
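The simple latency model is presumably the standard first-order communication cost model; as a sketch, for a message of S bytes:

  T(message) = latency + S / bandwidth

i.e., a fixed per-message latency plus a size-proportional transfer term. The actual parameter names in BigNetSim's configuration may differ.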
Specific Model (Full Network)
[diagram: a BGnode containing BGprocs, a transceiver, and a network interface, connected by channels to a switch]
Generic Model (Simple Latency)
[diagram: the same BGnode components, with channels modeled by the simple latency formula]
What We Model
• Processors
• Nodes
• NICs
• Switches/hubs
• Channels
• Packet-level direct and indirect routing
• Buffers with a credit scheme
• Virtual channels
Other BigNetSim Features
• Skip points
  • Set skip points in the application code (e.g., after startup)
  • Simulate only between skip points
• Transceiver
  • Traffic pattern generator that replaces nodes and processors
• Windowing
  • Set the file window size to decrease the memory footprint
  • Can cut the footprint in half or better, depending on trace structure
• Checkpoint-to-disk (recent work)
  • Saves the simulator state at a time or GVT interval so the run can be restarted if a crash occurs
BigNetSim Tools
• Located in BigNetSim/trunk/tools
• Log Analyzer
  • Provides information about a set of traces
    • Number of events per simulated processor
    • Number of messages sent
• Log Transformation (recently completed)
  • Produces a new set of traces with remapped objects
  • Useful for testing load-balancing scenarios
BigNetSim Output
• BgPrintf() statements
  • Added to the application code
  • Each "%f" is converted to the committed time during simulation (see the sketch below)
• GVT = Global Virtual Time
  • Each GVT tick = 1/factor seconds, where factor is defined in BigNetSim/trunk/Main/TCsim.h
• Link utilization statistics
• Projections traces
  • Use the -tproj command-line parameter
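A minimal usage sketch; the format strings match the sample output on the next slide, and the "%f" substitution behaves as described above:

  // Each %f is replaced with the committed virtual time when the
  // simulator commits the corresponding event.
  BgPrintf("Start of major loop at %f\n");
  /* ... major loop ... */
  BgPrintf("End of major loop at %f\n");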
BigNetSim Output Example
  Charm++: standalone mode (not using charmrun)
  Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
  Charm++> cpu topology info is being gathered!
  Charm++> 1 unique compute nodes detected!
  bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
  Opts: netsim on: 0
  Initializing POSE...
  POSE initialization complete.
  Using Inactivity Detection for termination.
  netsim skip_on 0 0
  Info> timing factor 1.000000e+08 ...
  Info> invoking startup task from proc 0 ...
  [0:RECV_RESUME] Start of major loop at 0.014741
  [0:RECV_RESUME] End of major loop at 0.034914
  Simulation inactive at time: 38129444
  Final GVT = 38129444
  Final link stats [Node 0, Channel 0, ### Link]: ovt: 38129444, utilization time: 29685846, utilization %: 77.855439, packets sent: 472210
  gvt=38129444
  Final link stats [Node 0, Channel 3, ### Link]: ovt: 38129444, utilization time: 631019, utilization %: 0.016549, packets sent: 4259
  gvt=38129444
  1 PE Simulation finished at 18.052671. Program finished.
Ring Projections Timeline
[figure]
BigNetSim Performance
• Examples of sequential simulator performance on Blue Print
  • 4k-VP MILC
    • Startup time: 0.7 hours
    • Execution time: 5.6 hours
    • Total run time: 6.3 hours
    • Memory footprint: ~3.1 GB
  • 256k-VP 3D Jacobi (10x10x10 grid, 3 iterations)
    • Startup time: 0.5 hours
    • Execution time: 1.5 hours
    • Total run time: 2.0 hours
    • Memory footprint: ~20 GB
• Still tuning parallel simulator performance
Thank you!
Free download of Charm++ and BigSim: http://charm.cs.uiuc.edu
Send questions and comments to: ppl@charm.cs.uiuc.edu