Presented by Gengbin Zheng, Ryan Mokos Charm++ Workshop 2009 Parallel Programming Laboratory University of Illinois at Urbana-Champaign. BigSim Tutorial. 1. Outline. Overview BigSim Emulator BigSim Simulator Post-mortem simulation BigNetSim build flow
BigSim Tutorial
Presented by Gengbin Zheng and Ryan Mokos
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Charm++ Workshop 2009

Outline
• Overview
• BigSim Emulator
• BigSim Simulator
• Post-mortem simulation
• BigNetSim build flow
• Generic network model: Simple Latency Model
• Specific network models
• Extensibility

BigSim Infrastructure
• BigSim provides whole-system simulation of a large parallel machine.
• Goal: support early application development and the identification of performance bottlenecks.
• What BigSim can do:
  • Provide an execution environment that runs both Charm++ and MPI applications on large-scale target machines
  • Require no or only small changes to MPI application source code
  • Facilitate code development and debugging
  • Predict parallel performance at varying levels of resolution
  • Tune and scale performance
  • Help machine vendors who are designing future machines

BigSim Components
• BigSim Emulator
  • Runs AMPI/Charm++ applications on the emulator
  • Captures computation and communication information
  • Parallel: each physical processor emulates multiple target processors, leveraging Charm++'s virtualization support
• BigSim Simulator
  • Parallel discrete event simulation (PDES) with network contention
  • Produces performance data in a format compatible with the Projections graphical browser

What BigSim Cannot Do
• BigSim by itself:
  • does not predict cycle-accurate timing (that needs instruction-level simulation)
  • does not predict cache effects or virtual memory behavior
  • does not model O.S. jitter

BigSim Emulator
• Emulates a full machine on existing parallel machines
  • Actually runs a parallel program
  • E.g., multi-million objects on 128K target processors
• The emulator is implemented on Charm++
  • Libraries that link to the user application
• Simple architecture abstraction
  • Many multiprocessor (SMP) nodes connected via message passing

BigSim Emulator: functional view
[Figure: functional view of the emulator. Labels in the diagram: Real Processor, Converse scheduler, Converse Q, Target Node, communication processors, worker processors, inBuff, CorrectionQ, affinity message queues, non-affinity message queues.]

Install BigSim Emulator
• Download Charm++ v6.1.2
  • http://charm.cs.uiuc.edu/download/downloads.shtml
• Compile Charm++/AMPI with the "bigemulator" option:
  • ./build AMPI net-linux-x86_64 bigemulator -O
  • This builds the Charm++ and emulator libraries under net-linux-x86_64-bigemulator
• Compiler wrappers for MPI applications:
  • charm/net-linux-x86_64-bigemulator/bin/mpicc, mpicxx, mpif90, etc.

Prepare MPI Applications
• Make sure applications are AMPI-compliant
  • Adaptive MPI (AMPI): an implementation of the MPI standard on Charm++
  • Multithreaded
• Changes that may be needed:
  • Fortran: Program Main => Program MPI_Main
  • Handle global/static variables
    • Manual: group globals into one big structure and allocate it on the heap
    • Semi-automatic: use thread-local storage
      • static __thread int var;
    • Automatic: -swapglobals compiler option (ELF binaries)
      • Only handles globals, not statics

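The "manual" option above can be sketched in plain C as follows. This is only an illustration of the idea: the names RankState, rank_state_create, accumulate and rank_state_destroy are hypothetical and not part of AMPI, and the field iterations is an invented second global.

```c
#include <assert.h>
#include <stdlib.h>

/* All former file-scope globals live in one structure, so every rank
 * (a user-level thread under AMPI) can own a private copy. */
typedef struct {
    int value;      /* was: int value = 0; at file scope */
    int iterations; /* hypothetical second global, for illustration */
} RankState;

/* Allocate the per-rank state on the heap, zero-initialised. */
RankState *rank_state_create(void)
{
    RankState *s = calloc(1, sizeof *s);
    assert(s != NULL);
    return s;
}

/* A routine that used to touch the globals directly now takes the state. */
void accumulate(RankState *s, int myid)
{
    s->value += myid;
    s->iterations += 1;
}

void rank_state_destroy(RankState *s)
{
    free(s);
}
```

Each rank would call rank_state_create() once after MPI_Init and pass the pointer down to the routines that previously read or wrote the globals.
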
Ring Example (ring.c)

#include <stdio.h>
#include "mpi.h"

#define TIMES 10

#if CMK_BLUEGENE_CHARM
extern void BgPrintf(const char *);
#define BGPRINTF(x) if (myid == 0) BgPrintf(x);
#else
#define BGPRINTF(x)
#endif

int value = 0;   /* global variable: shared by all AMPI threads in a process */

int main(int argc, char *argv[])
{
  int myid, numprocs, i;
  double time;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  time = MPI_Wtime();
  BGPRINTF("Start of major loop at %f \n");
  for (i = 0; i < TIMES; i++) {
    if (myid == 0) {
      /* rank 0 starts the ring and receives from the last rank */
      MPI_Send(&value, 1, MPI_INT, myid+1, 999, MPI_COMM_WORLD);
      MPI_Recv(&value, 1, MPI_INT, numprocs-1, 999, MPI_COMM_WORLD, &status);
    } else {
      /* other ranks add their id and forward to the next rank */
      MPI_Recv(&value, 1, MPI_INT, myid-1, 999, MPI_COMM_WORLD, &status);
      value += myid;
      MPI_Send(&value, 1, MPI_INT, (myid+1)%numprocs, 999, MPI_COMM_WORLD);
    }
  }
  BGPRINTF("End of major loop at %f \n");

  if (myid == 0)
    printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time);
  MPI_Finalize();
  return 0;
}

Ring Example (AMPI-compliant)

#include <stdio.h>
#include "mpi.h"

#define TIMES 10

#if CMK_BLUEGENE_CHARM
extern void BgPrintf(const char *);
#define BGPRINTF(x) if (myid == 0) BgPrintf(x);
#else
#define BGPRINTF(x)
#endif

int main(int argc, char *argv[])
{
  int myid, numprocs, i, value = 0;   /* value is now local to each rank */
  double time;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  time = MPI_Wtime();
  BGPRINTF("Start of major loop at %f \n");
  for (i = 0; i < TIMES; i++) {
    if (myid == 0) {
      MPI_Send(&value, 1, MPI_INT, myid+1, 999, MPI_COMM_WORLD);
      MPI_Recv(&value, 1, MPI_INT, numprocs-1, 999, MPI_COMM_WORLD, &status);
    } else {
      MPI_Recv(&value, 1, MPI_INT, myid-1, 999, MPI_COMM_WORLD, &status);
      value += myid;
      MPI_Send(&value, 1, MPI_INT, (myid+1)%numprocs, 999, MPI_COMM_WORLD);
    }
  }
  BGPRINTF("End of major loop at %f \n");

  if (myid == 0)
    printf("Sum=%d, Time=%g\n", value, MPI_Wtime()-time);
  MPI_Finalize();
  return 0;
}

How to Compile and Run MPI Applications for the Emulator
• Compile with AMPI and the emulator:
  • charm/net-linux-x86_64-bigemulator/bin/mpicc -o ring ./ring.c
  • With the performance trace module:
    • charm/net-linux-x86_64-bigemulator/bin/mpicc -o ring ./ring.c -tracemode projections
• Run:
  • Use the mpirun provided by AMPI
  • Give the number of target processors as well as the number of real processors
• Define the target machine
  • Command-line options
    • +x +y +z
    • +cth +wth
    • E.g., mpirun -np 4 ./ring +x10 +y10 +z10 +cth2 +wth4
  • Or use a config file
    • mpirun -np 4 ./ring +bgconfig config

Bgconfig File Format
• +bgconfig ./bg_config

x 10
y 10
z 10
cth 2
wth 4
stacksize 4000
timing walltime
#timing bgelapse
#timing counter
#cpufactor 1.0
fpfactor 5e-7
traceroot /tmp
log yes
correct no
network bluegene

Ring Std Output

Justice> mpirun -np 4 ./pgm +bgconfig ./bg_config
Reading Bluegene Config file ./bg_config ...
BG info> Simulating 8x1x1 nodes with 1 comm + 1 work threads each.
BG info> Network type: ibmpower.
alpha: 1.000000e-06 bandwidth :1.700000e+09.
BG info> cpufactor is 1.000000.
BG info> floating point factor is 0.000000.
BG info> BG stack size: 30000 bytes.
BG info> Using WallTimer for timing method.
BG info> Generating timing log.
BG info> bgTrace root is './'.
LB> Load balancer ignores processor background load.
Start of major loop at 0.268719
End of major loop at 0.273697
Sum=280, Time=0.00497856
[0] Number is numX:8 numY:1 numZ:1 numCth:1 numWth:1 numEmulatingPes:4 totalWorkerProcs:8 bglog_ver:5
[2] Wrote to disk for 2 BG nodes.
[3] Wrote to disk for 2 BG nodes.
[1] Wrote to disk for 2 BG nodes.
[0] Wrote to disk for 2 BG nodes.
BG> BlueGene emulator shutdown gracefully!
BG> Emulation took 0.692498 seconds!

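The sizes reported in this output are consistent with a simple product: nodes = x * y * z, and worker processors = nodes * wth. The sketch below encodes that reading; it is inferred from the numbers shown (8x1x1 nodes, 1 worker thread, totalWorkerProcs:8) rather than taken from the emulator source, and the function names are hypothetical.

```c
#include <assert.h>

/* Number of target nodes in an x * y * z target machine. */
long target_nodes(int x, int y, int z)
{
    assert(x > 0 && y > 0 && z > 0);
    return (long)x * y * z;
}

/* Worker processors (the ranks the application runs on), assuming
 * every node contributes wth worker threads. */
long total_worker_procs(int x, int y, int z, int wth)
{
    assert(wth > 0);
    return target_nodes(x, y, z) * wth;
}
```

Under this reading, the earlier command line +x10 +y10 +z10 +cth2 +wth4 would describe 1000 nodes and 4000 worker processors.
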
Ring Output Files

Justice> ls -l
-rwxr-xr-x 1 gzheng kale 2194434 2009-04-15 00:03 ring
-rw-r--r-- 1 gzheng kale   10105 2009-04-15 00:04 pgm.sts
-rw-r--r-- 1 gzheng kale       0 2009-04-15 00:04 pgm.projrc
-rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.7.log
-rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.6.log
-rw-r--r-- 1 gzheng kale    4557 2009-04-15 00:04 pgm.5.log
-rw-r--r-- 1 gzheng kale    4559 2009-04-15 00:04 pgm.4.log
-rw-r--r-- 1 gzheng kale    4861 2009-04-15 00:04 pgm.3.log
-rw-r--r-- 1 gzheng kale    5163 2009-04-15 00:04 pgm.2.log
-rw-r--r-- 1 gzheng kale    5167 2009-04-15 00:04 pgm.1.log
-rw-r--r-- 1 gzheng kale    6670 2009-04-15 00:04 pgm.0.log
-rw-r--r-- 1 gzheng kale   23901 2009-04-15 00:04 bgTrace3
-rw-r--r-- 1 gzheng kale   23938 2009-04-15 00:04 bgTrace2
-rw-r--r-- 1 gzheng kale   24663 2009-04-15 00:04 bgTrace1
-rw-r--r-- 1 gzheng kale   24242 2009-04-15 00:04 bgTrace0
-rw-r--r-- 1 gzheng kale      60 2009-04-15 00:04 bgTrace

Slide annotations: 8 files (the pgm.*.log Projections logs); only 4 files (bgTrace0 to bgTrace3).

What is in the Trace Logs?
• Each bgTrace file here holds the traces for 2 target processors
• Tools for reading bgTrace binary files:
  • charm/examples/bigsim/tools/loadlog
    • Converts to a human-readable format
  • charm/examples/bigsim/tools/log2proj
    • Converts to Projections trace log files
• Each SEB (sequential execution block) has:
  • startTime, endTime
  • Incoming message ID
  • Outgoing messages
  • Dependences

Ring Projections Timeline
[Figure: Projections timeline view of the ring example.]

Performance Prediction
• How to predict sequential performance?
• Different levels of fidelity:
  • User-supplied timing expression
  • Wall clock time
  • Performance counters
  • Instruction-level simulation

Sequential Time - BgElapse
• BgElapse
  • Manually advances processor time

  MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
  value += myid;
  ...
  BgElapse(0.000005);
  MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);

• Run with +bgelapse

Sequential Time - using Wallclock
• The measured wallclock time T can be used after applying a suitable multiplier (scale factor)
  • T * factor
• Run the application with +bgwalltime and +bgcpufactor, or
  • +bgconfig ./bgconfig with the entries:
    timing walltime
    cpufactor 0.7
• Good for predicting a larger machine using a fraction of that machine

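The T * factor rule above amounts to a single multiplication. A minimal sketch, with a hypothetical function name (scale_walltime is not a BigSim call):

```c
#include <assert.h>

/* Predicted time on the target processor: the wall-clock time measured
 * on the emulating host multiplied by the cpufactor scale factor. */
double scale_walltime(double measured_seconds, double cpufactor)
{
    assert(measured_seconds >= 0.0 && cpufactor > 0.0);
    return measured_seconds * cpufactor;
}
```

With cpufactor 0.7, a block that took 2.0 s on the host would be charged about 1.4 s on the target.
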
Sequential Time - Performance Counters
• Count floating-point, integer, memory and branch instructions (for example) with hardware counters
• Convert these hardware counter values into an expected time on the target machine
• Cache performance and memory footprint effects can be approximated
  • by the percentage of memory accesses and the cache hit/miss ratio
• Example of use, for a floating-point intensive code:
  +bgconfig ./bg_config with the entries:
    timing counter
    fpfactor 5e-7
• Perfex and PAPI are supported

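One way to read the fpfactor entry is as a cost in seconds per counted floating-point instruction, which gives a linear conversion from counter value to time. That interpretation is an assumption made for this sketch, not something the slides state, and counter_time is a hypothetical name.

```c
#include <assert.h>

/* Assumed linear model: predicted time = floating-point instruction
 * count * fpfactor, with fpfactor read as seconds per instruction. */
double counter_time(double fp_instructions, double fpfactor)
{
    assert(fp_instructions >= 0.0 && fpfactor >= 0.0);
    return fp_instructions * fpfactor;
}
```

Under this assumption, 2 million counted floating-point instructions with fpfactor 5e-7 would be charged about 1 second.
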
Sequential Time - Instruction-level simulation
• Run an instruction-level simulator separately to get accurate timing information
• Issues:
  • It is a different, third-party hardware simulator
  • Hard to integrate with BigSim
  • Sequential
  • Does not model communication
  • Slow!

Interpolation
• BigSim and the instruction-level simulator interact through logs
• Reduce the problem size by sampling: an interpolation-based scheme
  • Run a smaller-sized problem, or
  • Run just one processor
• Assume the computation time can be modelled by a set of parameters:
  • TC = Fn(p1, p2, p3, ...)
• Use sample data from the instruction-level simulation to interpolate for the large dataset
  • With the sample data, do a least-squares fit to determine the coefficients of an approximating polynomial function

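As a rough illustration of the least-squares step, the sketch below fits the simplest possible case, a straight line T = a + b*p in one parameter, using the closed-form normal equations. The actual tool uses GSL and a more general polynomial; this degree-1, single-parameter version and the names LinearFit, fit_line and predict are simplifications introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Coefficients of the fitted line T = intercept + slope * p. */
typedef struct {
    double intercept;
    double slope;
} LinearFit;

/* Ordinary least-squares fit of n samples (p[i], t[i]). */
LinearFit fit_line(const double *p, const double *t, size_t n)
{
    double sp = 0.0, st = 0.0, spp = 0.0, spt = 0.0;
    double denom;
    LinearFit f;
    size_t i;

    assert(n >= 2);
    for (i = 0; i < n; i++) {
        sp  += p[i];
        st  += t[i];
        spp += p[i] * p[i];
        spt += p[i] * t[i];
    }
    denom = (double)n * spp - sp * sp;
    assert(denom != 0.0);            /* samples must not share one p value */
    f.slope = ((double)n * spt - sp * st) / denom;
    f.intercept = (st - f.slope * sp) / (double)n;
    return f;
}

/* Interpolated (or extrapolated) time for a parameter value p. */
double predict(LinearFit f, double p)
{
    return f.intercept + f.slope * p;
}
```

Sampled blocks supply the (p, t) pairs; predict() then gives the timing used for blocks that were never run through the instruction-level simulator.
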
Case study: BigSim / Mambo

void func( ) {
  startTraceBigSim( )
  …
  endTraceBigSim( )
}

[Figure: workflow diagram. Labels in the diagram: BigSim Parallel Emulation; Trace files; Parameter files for sequential blocks; Mambo (cycle-accurate prediction of sequential blocks on POWER7 processor); Prediction for Target System; Interpolation + Replace sequential timing; Adjusted trace files; BigSim Parallel Simulation.]

Ring Example

MPI_Recv(&value,1,MPI_INT,myid-1,999,MPI_COMM_WORLD,&status);
startTraceBigSim();
value += myid;
endTraceBigSim();
char param[128];
sprintf(param, "sum %d", myid);
tagTraceBigSim(param);
MPI_Send(&value,1,MPI_INT,(myid+1)%numprocs,999,MPI_COMM_WORLD);

Output Files

justice>ls -l
total 2328
-rw-r--r-- 1 gzheng kale      60 2009-04-15 11:08 bgTrace
-rw-r--r-- 1 gzheng kale   36757 2009-04-15 11:08 bgTrace0
-rw-r--r-- 1 gzheng kale   37023 2009-04-15 11:08 bgTrace1
-rwxr-xr-x 1 gzheng kale   94886 2009-04-14 09:46 charmrun*
-rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.0
-rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.1
-rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.2
-rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.3
-rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.4
-rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.5
-rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.6
-rw-r--r-- 1 gzheng kale       3 2009-04-15 11:08 param.7
-rwxr-xr-x 1 gzheng kale 2153700 2009-04-15 11:07 ring*
-rw-r--r-- 1 gzheng kale    2965 2009-04-15 11:07 ring.C
justice>cat param.7
48 sum 7

Run ring Through Instruction-level Simulator
• Compile the normal version of ring (not the emulator version)
• Run it sequentially through an instruction-level simulator
• Sample line of Mambo output:
  10900820693: (10718653772): TRACE_END: sum 7

Compile and Run Interpolation Tool
• Install GSL, the GNU Scientific Library
• cd charm/examples/bigsim/tools/rewritelog
• Modify the file interpolatelog.C to suit your setup:
  • OUTPUTDIR specifies a directory for the new log files
  • CYCLE_TIMES_FILE specifies the file that contains the accurate timing information
• make
• Run the interpolation tool in the bgTrace directory:
  • ./interpolatelog

Record/Replay
• Record only a subset of special logs when running the full-size emulation
• With the special logs, replay the execution of a particular target processor through a hardware simulator
• Example:
  • ./pgm +x 32768 +y 1 +z1 +bgrecord +bgrecordprocessors 0-32767:1024
  • ./pgm +bgreplay 31744

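The processor list 0-32767:1024 looks like a first-last:stride range, in which case processor 31744 (31 * 1024) is one of the recorded ones and can be replayed. The sketch below checks membership under that assumed reading of the syntax; pe_is_recorded is a hypothetical helper, not part of BigSim.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 if pe is selected by a "first-last:stride" list, 0 if it is
 * not, and -1 if the list cannot be parsed. The format is an assumed
 * interpretation of the +bgrecordprocessors argument. */
int pe_is_recorded(const char *spec, int pe)
{
    int first, last, stride;

    assert(spec != NULL);
    if (sscanf(spec, "%d-%d:%d", &first, &last, &stride) != 3)
        return -1;
    if (stride <= 0 || first > last)
        return -1;
    if (pe < first || pe > last)
        return 0;
    return (pe - first) % stride == 0;
}
```

Under this reading the example records 32 of the 32768 target processors, one every 1024.
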
Out-of-core Emulation
• Motivation
  • Applications with a large memory footprint
  • The VM system cannot handle them well
• Use the hard drive
  • Similar to checkpointing
• Message-driven execution
  • Peek at the message queue => what executes next? (prefetch)

Using Out-of-core
• Change the BigSim configuration file:
  • charm/tmp/conv-mach-bigemulator.h
  • #define BIGSIM_OUT_OF_CORE 1
• Recompile Charm++ and the application
• Run the application through the emulator with an additional command-line option:
  • +bgooc 1024

Postmortem Simulation
• Run the application once, get the trace logs, and run the simulation on those logs for a variety of network configurations
• Big Network Simulator (BigNetSim) is implemented on the POSE simulation framework
• Particularly useful when message-passing performance is critical and strongly affected by network contention
• Note: the BigSim emulator and the BigSim simulator both use the same network models for latency-only calculations, located in charm/src/langs/bluegene/bigsim_network.h

Implementation
• Post-mortem network simulators are Parallel Discrete Event Simulations
• Parallel Object Simulation Environment (POSE)
  • Network layer constructs (NIC, Switch, Node, etc.) are implemented as poser simulation objects
  • Network data constructs (message, packet, etc.) are implemented as event methods on simulation objects

Terms
• Several network models are available
  • Specific: e.g., BlueGene
    • Latency-only model: does not account for contention
    • Network contention model
  • Generic: Simple Latency Model, which uses a simple equation to determine message transmission time
• Emulating processors: physical processors on which the emulation is run (+p?)
• Simulating processors: physical processors on which the simulation (BigNetSim) is run (+p?)
• Target processors: virtual (or simulated) processors on which emulation and simulation are run (+vp?)

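The slides do not spell out the "simple equation". A common latency-only form is a fixed per-message latency plus size divided by bandwidth, which matches the alpha and bandwidth values the emulator prints at startup. The sketch below assumes that form; it is not taken from the BigNetSim source, and simple_latency_time is a hypothetical name.

```c
#include <assert.h>

/* Assumed latency-only model:
 *   time = alpha + bytes / bandwidth
 * alpha in seconds, bandwidth in bytes per second. No contention. */
double simple_latency_time(double alpha, double bandwidth, double bytes)
{
    assert(alpha >= 0.0 && bandwidth > 0.0 && bytes >= 0.0);
    return alpha + bytes / bandwidth;
}
```

With alpha = 1e-6 s and bandwidth = 1.7e9 bytes/s, a 1.7 MB message would take about 1.001 ms under this model; a contention model would add queueing delays on top.
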
BigNetSim Build Flow
• Download and compile charm
• Compile POSE
• Compile bigsim
• Download BigNetSim
• Compile BigNetSim
• Run simulator
• Output

Download and compile charm (if not done already)
• Download the latest version of charm from the PPL archives: http://charm.cs.uiuc.edu/download/downloads.shtml
• Compile charm
  • cd charm
  • ./build charm++ net-linux

Compile POSE
• cd charm
• ./build pose net-linux
• Options are set in pose_config.h
  • Stats enabled by POSE_STATS_ON=1
  • User event tracing enabled by TRACE_DETAIL=1
  • More advanced configuration options
    • speculation
    • checkpoints
    • load balancing

Compile bigsim
• cd charm/net-linux/tmp
• make bigsim

Download BigNetSim
• Download the latest revision from the repository:
  svn co https://charm.cs.uiuc.edu/svn/repos/BigNetSim
• Directory structure: BigNetSim/trunk/
  • BlueGene/ RedStorm/ and others - network models
  • SimpleLatency/ - Simple Latency Model
  • Topology/ Routing/ InputVcSelection/ OutputVcSelection/ - network configuration choices
  • Main/ - main simulation files
  • tools/ - tools directory
  • tmp/ - working directory created during build

Compile BigNetSim
• Fix BigNetSim/trunk/Makefile.common so that CHARMBASE points to your charm directory
• For the Simple Latency Model:
  • cd BigNetSim/trunk/SimpleLatency
  • For the parallel simulator: make
  • For the sequential simulator (runs on only 1 simulating processor): make SEQUENTIAL=1

Run Simulator
• cd BigNetSim/trunk/tmp
• Copy the bgTrace files into this tmp directory
• For the parallel build, run with:
  • ./charmrun +p4 bigsimulator -lat 1 -bw 1
• For the sequential build, run with:
  • ./bigsimulator -lat 1 -bw 1

Output
• Simulation completion time
  • Specified in "GVT ticks" (GVT = Global Virtual Time)
  • GVT tick length is determined by the value of #define factor in BigNetSim/trunk/Main/TCsim.h
  • Divide the final GVT by factor to get the simulation time in seconds
    • factor = 1e8 => 1 tick = 10ns
    • factor = 1e9 => 1 tick = 1ns

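The tick-to-seconds conversion described above is a single division. A minimal sketch with a hypothetical helper name:

```c
#include <assert.h>

/* Convert a final GVT (in ticks) to seconds: seconds = gvt / factor,
 * where factor is the number of ticks per second (e.g. 1e8 or 1e9). */
double gvt_to_seconds(long long final_gvt, double factor)
{
    assert(final_gvt >= 0 && factor > 0.0);
    return (double)final_gvt / factor;
}
```

For a run that reports Final GVT = 38129444 with a timing factor of 1e8, this gives roughly 0.381 seconds of simulated time.
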
Output (continued)
• Use BgPrintf(char *) in the source code to print event times
  • Each BgPrintf() called at execution time in online execution mode is stored in the trace log as a printing event
  • In postmortem simulation, the strings associated with BgPrintf() events are printed when the event is committed
  • "%f" in the string is replaced by the committed time
• Useful for determining iteration times during simulation as well as emulation

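The "%f" substitution can be pictured with a small stand-alone helper. This only illustrates the behavior described above and is not the simulator's implementation; expand_time is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copy msg into out, replacing the first "%f" with the time t.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if out is too small. */
int expand_time(char *out, size_t out_size, const char *msg, double t)
{
    const char *mark;
    int n;

    assert(out != NULL && msg != NULL);
    mark = strstr(msg, "%f");
    if (mark == NULL)
        n = snprintf(out, out_size, "%s", msg);
    else
        n = snprintf(out, out_size, "%.*s%f%s",
                     (int)(mark - msg), msg, t, mark + 2);
    return (n < 0 || (size_t)n >= out_size) ? -1 : n;
}
```

The same string therefore prints one time during emulation and another during simulation, because each run supplies its own value for t.
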
Output (continued)
• Projections
  • Copy the emulation Projections logs and sts file into BigNetSim/trunk/tmp
  • Two ways to use:
    • Command-line parameter: -projname <name>
      • Creates a new set of logs by updating the emulation logs
      • Assumes the emulation Projections logs are: <name>.*.log
      • Output: <name>-bg.*.log
      • Disadvantage: emulation Projections overhead is included
    • Command-line parameter: -tproj
      • Creates a new set of logs from the trace files, ignoring the emulation logs
      • Must first copy the <name>.sts file to tproj.sts
      • Output: tproj.*.log
      • Advantage: no emulation Projections overhead is included

Ring Example
• ./bigsimulator -lat 1 -bw 1

Charm++: standalone mode (not using charmrun)
Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
Charm++> cpu topology info is being gathered!
Charm++> 1 unique compute nodes detected!
bgtrace: totalBGProcs=8 X=8 Y=1 Z=1 #Cth=1 #Wth=1 #Pes=1
Opts: netsim on: 0
Initializing POSE...
POSE initialization complete.
Using Inactivity Detection for termination.
netsim skip_on 0 0
Info> timing factor 1.000000e+08 ...
Info> invoking startup task from proc 0 ...
[0:RECV_RESUME] Start of major loop at 0.347418
[0:RECV_RESUME] End of major loop at 0.349147
Simulation inactive at time: 38129444
Final GVT = 38129444
1 PE Simulation finished at 0.052671.
Program finished.

Projections - Ring Example
[Figure: Projections timelines for the ring example. Panels: Emulation; Simulation with -lat 1 (latency = 1s), generated with -tproj.]