BigSim: Simulating PetaFLOPS Supercomputers

BigSim: Simulating PetaFLOPS Supercomputers Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign

Introduction • Petascale machines are being built • It is hard to understand parallel application without actually running it • Tools that allow one to predict the performance of applications before machines are built, also help debugging and tuning • Allows easier "offline" experimentation • Provides a feedback for machine designers 5th Annual Workshop on Charm++ and its Applications

Overview • Prediction accuracy • Sequential performance • Parallel performance (message passing) • Flexibility - different levels of fidelity/tradeoffs • Challenges for usability • Memory constraints • Time cost • How to analyze result - scalable performance analysis tools • Portability • Scalability • BigSim system 5th Annual Workshop on Charm++ and its Applications

Summary of Our Approach • Parallel Discrete Event Simulation (PDES) • the operation of a system is represented as a chronological sequence of events • Event-driven simulation method • Two phase simulation • BigSim Emulator, capable of running application • Charm++, Adaptive MPI / MPI • Predict sequential execution blocks • Generate trace logs • Parallel Discrete Event Simulator (POSE-based) • Predict message passing performance • Implemented on Charm++ 5th Annual Workshop on Charm++ and its Applications

BigSim System Architecture Performance visualization (Projections) Offline PDES Network Simulator BigSim Simulator Simulation output trace logs Charm++ Runtime Load Balancing Module Performance counters Instruction Sim (RSim, IBM, ..) Simple Network Model BigSim Emulator Charm++ and MPI applications 5th Annual Workshop on Charm++ and its Applications

BigSim Emulator – the first step • Actual execution of real applications on much smaller machines • Model each target processor as virtual processor • Using light-weight (Converse) threads simulating each target processor • Charm++/AMPI applications without change • No global variables/swap-globals/copy-globals 5th Annual Workshop on Charm++ and its Applications

Hardware thread Simulating (Host) Processor Emulation on a Parallel Machine Target Nodes 5th Annual Workshop on Charm++ and its Applications

Predicting Sequential Performance • Different level of fidelity: • Direct mapping • Performance counter • Instruction-level simulator • Tradeoff between time cost and accuracy • Exploit inherent determinacy of Charm++ application • Out of order messages do not change the computation behavior (structured dagger) 5th Annual Workshop on Charm++ and its Applications

[22] name:msgep (srcnode:0 msgID:21) ep:1 [[ recvtime:0.000498 startTime:0.000498 endTime:0.000498 ]] backward: forward: [23] [23] name:Chunk_atomic_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime:-1.000000 startTime:0.000498 endTime:0.000503 ]] msgID:3 sent:0.000498 recvtime:0.000499 dstPe:7 size:208 msgID:4 sent:0.000500 recvtime:0.000501 dstPe:1 size:208 backward: [22] forward: [24] [24] name:Chunk_overlap_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime:-1.000000 startTime:0.000503 endTime:0.000503 ]] backward: [0x80a7af0 23] forward: [25] [28] Event logs are generated for each target processor Predicted time on execution blocks with timestamps Event dependencies Message sending events Trace Logs 5th Annual Workshop on Charm++ and its Applications

Predicting message passing performance – second step • Parallel Discrete Event Simulation • Needed for correcting causality errors due to out-of-order messages • Timestamps are re-adjusted without rerunning the actual application • Simulate network behaviours: packetization, routing, contention, etc • simple latency based network model, and contention-based network model 5th Annual Workshop on Charm++ and its Applications

PDES and Charm++ • PDES is a naturally message-driven activity • Large number of entities naturally maps to virtual processors • Fine grained PDES • Migratable PDES entities and load balancing 5th Annual Workshop on Charm++ and its Applications

Network Simulator Design 5th Annual Workshop on Charm++ and its Applications

Network Simulator: Data Flow 5th Annual Workshop on Charm++ and its Applications

Network Communication Pattern Analysis • NAMD with apoa1 • 15 timestep 5th Annual Workshop on Charm++ and its Applications

Network Communication Pattern Analysis Data transferred (KB) in a single time step 5th Annual Workshop on Charm++ and its Applications

NAMD network simulation (scalability) • Simulating NAMD on 2048 processors with 3D Torus topology • Tests ran on Turing Apple cluster with Myrinet Terry L. Wilmarth, Gengbin Zheng, Eric J. Bohm, Yogesh Mehta, Praveen Jagadishprasad, Laxmikant V. Kalé, ``Performance Prediction using Simulation of Large-scale Interconnection Networks in POSE'', PADS 2005 5th Annual Workshop on Charm++ and its Applications

LeanMD Performance Analysis • Benchmark 3-away ER-GRE • 36573 atoms • 1.6 million objects • 8 step simulation • 64k BG processors • Running on PSC Lemieux 5th Annual Workshop on Charm++ and its Applications

Integration with Instruction level simulators • Instruction level simulators typically are • Standalone applications • Sequential execution • Very slow • An interpolation-based scheme • Use result of a smaller scale instruction level simulation to interpolate for large dataset • do a least-squares fit to determine the coefficients of an approximation polynomial function 5th Annual Workshop on Charm++ and its Applications

Case study: BigSim / Mambo void func( ) { StartBigSim( ) … EndBigSim( ) } Mambo Prediction for Target System Cycle-accurate prediction of sequential blocks on POWER7 processor BigSim Parallel Emulation BigSim Parallel Simulation Interpolation + Replace sequential timing Trace files Parameter files for sequential blocks Adjusted trace files 5th Annual Workshop on Charm++ and its Applications

Future Work • Use performance counter to predict sequential timing • Out-of-core execution for memory bound applications 5th Annual Workshop on Charm++ and its Applications

Out-of-core Emulation • Motivation • Physical memory is shared • VM system would not handle well • Message driven execution • Peek msg queue => what execute next? (prefetch) 5th Annual Workshop on Charm++ and its Applications

BigSim: Simulating PetaFLOPS Supercomputers