SimuTools, Malaga, Spain, March 17, 2010
µπ: A Scalable & Transparent System for Simulating MPI Programs
Kalyan S. Perumalla, Ph.D.
Senior R&D Manager, Oak Ridge National Laboratory
Adjunct Professor, Georgia Institute of Technology
Motivation & Background
Software & Hardware Lifetimes
• Lifetime of a large parallel machine: 5 years
• Lifetime of useful parallel code: 20 years
• Port, analyze, optimize
Software & Hardware Design
• Co-design: e.g., cost/benefit of a 1 μs barrier
• Hardware: e.g., load from the application
• Software: scaling, debugging, testing, customizing
• Ease of development: obviates the need for actual hardware at scale
• Energy efficiency: reduces failed runs at actual scale
μπ Performance Investigation System
• μπ = micro parallel performance investigator
• Performance prediction for MPI, Portals, and other parallel applications
• Actual application code executed on the real hardware
• Platform simulated at large virtual scale
• Timing customized by a user-defined machine model
• Scale is the key differentiator
  • Target: 1,000,000 virtual cores
  • E.g., 1,000,000 virtual MPI ranks in a simulated MPI application
• Based on the µsik microsimulator kernel, a highly scalable PDES engine
Generalized Interface & Timing Framework
[Figure: application timeline under μπ, with compute time (Tcomp) between MPI calls and communication time (Tcomm) between MPI call entry and MPI call exit]
• Accommodates an arbitrary level of timing detail
• Compute time: can use a full-system (instruction-level) simulation on the side, or a model with cache effects, a corrected processor speed, etc., depending on the user's desired accuracy-cost trade-off
• Communication time: can use a network simulator, queueing and congestion models, etc., again depending on the desired accuracy-cost trade-off
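As a rough illustration of this framework, the following is a hypothetical sketch (not μπ's actual API) of user-supplied timing models in C: a corrected-speed model for compute time and a latency-plus-bandwidth model for communication time. All function names, parameters, and constants are assumptions chosen for illustration; a full-system simulator or a queueing/congestion model could be substituted at higher accuracy and cost.

/* Hypothetical user-defined timing models (illustrative only; not μπ's API). */
#include <stddef.h>
#include <stdio.h>

/* Compute time: scale a measured host-side duration by an assumed ratio of
 * virtual-machine to host core speed (a simple corrected-speed model). */
static double model_compute_time(double measured_seconds, double speed_ratio)
{
    return measured_seconds / speed_ratio;
}

/* Communication time: simple latency + size/bandwidth model; a network
 * simulator or congestion model could be used here instead. */
static double model_comm_time(size_t bytes, double latency_s, double bandwidth_Bps)
{
    return latency_s + (double)bytes / bandwidth_Bps;
}

int main(void)
{
    /* Example: 10 ms measured compute on a 1.5x faster virtual core,
     * and a 1 MB message on an assumed 2 us latency, 10 GB/s link. */
    printf("Tcomp = %g s\n", model_compute_time(0.010, 1.5));
    printf("Tcomm = %g s\n", model_comm_time(1 << 20, 2e-6, 10e9));
    return 0;
}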
Compiling an MPI Application with μπ
• Modify the #include and recompile: change #include <mpi.h> to #include <mupi.h>
• Relink to the μπ library: use -lmupi instead of -lmpi
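A minimal sketch of the source change, assuming a trivial MPI test program; the file name and the compiler invocation in the trailing comment are assumptions, not taken from μπ's documentation.

/* test.c -- the only source change is the header include. */
#include <mupi.h>      /* was: #include <mpi.h> */
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0)
        printf("Hello from %d (virtual) ranks\n", size);
    MPI_Finalize();
    return 0;
}

/* Relink against the μπ library instead of MPI, e.g. (compiler and paths assumed):
 *    cc test.c -o test -lmupi        # instead of ... -lmpi
 */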
Executing an MPI Application over μπ
• Run the modified MPI application (a μπ simulation):
  mpirun -np 4 test -nvp 32
  runs test with 32 virtual MPI ranks; the simulation uses 4 real cores
• μπ itself uses multiple real cores to run the simulation in parallel
Interface Support
Existing, Sufficient
• MPI_Init(), MPI_Finalize()
• MPI_Comm_rank(), MPI_Comm_size()
• MPI_Barrier()
• MPI_Send(), MPI_Recv()
• MPI_Isend(), MPI_Irecv()
• MPI_Waitall()
• MPI_Wtime()
• MPI_COMM_WORLD
Planned, Optional
• Other wait variants
• Other send/recv variants
• Other collectives
• Group communication
Other, Performance-Oriented
• MPI_Elapse_time(dt)
  • Added for simulation speed
  • Avoids actual computation; instead simply elapses simulated time
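A hedged sketch of how MPI_Elapse_time() might replace a real compute kernel when building against μπ. The guard macro, the helper function, and the assumption that dt is given in seconds are illustrative only and not taken from μπ's documentation.

#include <mupi.h>      /* assumed to provide the MPI declarations, per the compile slide */
#include <unistd.h>

/* Placeholder for the application's real compute kernel (hypothetical). */
static void do_heavy_compute(void) { sleep(1); }

static void compute_phase(void)
{
#ifdef USE_MUPI_ELAPSE
    /* Under simulation, skip the real work and simply advance virtual time
     * by the modeled duration (dt assumed to be in seconds). */
    MPI_Elapse_time(1.0);
#else
    do_heavy_compute();
#endif
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    compute_phase();
    MPI_Finalize();
    return 0;
}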
Performance Study
• Benchmarks
  • Zero lookahead
  • 10 μs lookahead
• Platform
  • Cray XT5, 226K cores
• Scaling results
  • Event cost
  • Synchronization overhead
  • Multiplexing gain
Experimentation Platform: Jaguar*
* Data and images from http://nccs.gov
μπ Summary - Quantitative
• Unprecedented scalability
  • 27,648,000 virtual MPI ranks on 216,000 actual cores
• Optimal multiplex factor observed
  • 64 virtual ranks per real rank
• Low slowdown even in zero-lookahead scenarios
  • Even on fast virtual networks
μπ Summary - Qualitative
• The only available simulator for highly scaled MPI runs
  • Suitable for source-available, trace-driven, or modeled applications
• Configurable hardware timing
  • User-specified latencies, bandwidths, arbitrary inter-network models
• Executions repeatable and deterministic
  • Global time-stamped ordering
  • Deterministic timing model
  • Purely discrete event simulation
• Most suitable for applications whose MPI communication can be trapped, instrumented, or modeled
  • Trapped: on-line, live actual execution
  • Instrumented: off-line trace generation, trace-driven on-line execution
  • Modeled: model-driven computation and MPI communication patterns
• Nearly zero perturbation even with unlimited instrumentation
Ongoing Work
• NAS benchmarks
  • E.g., FFT
• Actual at-scale application
  • E.g., chemistry
• Optimized implementation of certain MPI primitives
  • E.g., MPI_Barrier(), MPI_Reduce()
• Tie to other important phenomena
  • E.g., energy consumption models
Thank you! Questions?
Discrete Computing Systems
www.ornl.gov/~2ip