SDSC Blue Gene: Optimization and Debugging
Mahidhar Tatineni, SDSC, April 6, 2007
Overview of Talk
• Blue Gene system overview – processor, networks
• Compiler optimizations
• Second FPU – restrictions, optimizations, and limitations
• Virtual Node (VN) & Communication Coprocessor (CO) mode
• Profiling
• Integrated Performance Monitoring (IPM)
• Troubleshooting – common issues
• Blue Gene core files – using addr2line
• Standard tuning procedure
• Task mapping
Blue Gene compute nodes
• 700 MHz PowerPC 440 processor
• 1 integer unit (FXU), 1 load/store unit, 2 FPUs
• L1: 32 KB, 32-byte lines, 64-way set associative
• L2: 16 lines of 128 bytes; acts as a prefetch buffer
• L3: 4 MB, ~35-cycle latency, shared
• Main limitations:
  • 512 MB of memory per node
  • MPI only; no OpenMP or pthreads
  • Limited system calls with the compute node kernel
  • Executables must be statically linked (no shared libraries)
Blue Gene: Networks
• Three-dimensional (3-D) torus
  • Interconnects all compute nodes
  • 1.4 Gb/s on all 6 bidirectional node links (2.1 GB/s per node)
• Global tree
  • Collectives functionality
  • 2.8 GB/s of bandwidth per link
  • Tree-traversal latency on the order of 5 µs
  • Interconnects all compute and I/O nodes
• Gigabit Ethernet
• Low-latency global barrier and interrupt network
• Control network
Using the compilers: Options
• Compiler options:
  -qarch=440 uses only the single FPU per processor (minimum option)
  -qarch=440d allows both FPUs per processor (alternate option)
  -qtune=440 tunes for the 440 processor
  -O3 gives minimal optimization with no SIMDization
  -O3 -qarch=440d adds back-end SIMDization
  -O3 -qhot adds TPO (a high-level inter-procedural optimizer) SIMDization and more loop optimization
  -O4 adds compile-time interprocedural analysis
  -O5 adds link-time interprocedural analysis
  (TPO SIMDization is the default with -O4 and -O5)
• Current recommendation: start with -O3 -qarch=440d -qtune=440; try -O4 and -O5 next (see the compile-line example below)
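For example, a typical first compile of a Fortran source with the recommended flags might look like the following (the mpxlf90 wrapper and the poisson file names follow the usage elsewhere in this talk; adjust for your site):

  mpxlf90 -O3 -qarch=440d -qtune=440 -o poisson poisson.f

If -O4 or -O5 changes the numerical results or gives no speedup, fall back to the -O3 line above.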
Practical flags
• When linking the MASS libraries: -Wl,--allow-multiple-definition
• When compiles take too long: try compiling on bg-login4 (the alternate login node), or try the -qnoipa option
• When compiling .f90 files: -qsuffix=f=f90
• To obtain a detailed compilation report: -qdebug=diagnostic -qlist -qsource -qreport=hotlist (see the example below)
• With the XL compilers, you can combine optimization flags with -g
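Putting several of these together, a hedged example of compiling a free-form Fortran source with a detailed compilation report (the file name solver.f90 is hypothetical):

  mpxlf90 -O3 -qarch=440d -qtune=440 -qsuffix=f=f90 -qdebug=diagnostic -qlist -qsource -qreport=hotlist -c solver.f90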
Second FPU
• To generate code that takes advantage of the second FPU, 16-byte alignment is required; alignment assertions may be needed.
• The easiest way to take advantage of the second FPU is to use optimized math library routines (e.g. MASS, ESSL).
• The XL compiler has two different components that can generate SIMD code:
  • The back-end optimizer, with -O3 -qarch=440d
  • The TPO front end, with -qhot, -O4, or -O5
• In many applications loads and stores are the bottleneck and the bandwidth to L3 or memory is saturated => double-FPU instructions can help for data in L1, but not for data in L3 or memory.
Second FPU – Usage example
• An example using alignment assertions

FORTRAN:
      call alignx(16, x(1))
      call alignx(16, y(1))
      do i = 1, n
         y(i) = a*x(i) + y(i)
      enddo

C:
      double *x, *y;
      __alignx(16, x);
      __alignx(16, y);
      for (i = 0; i < n; i++)
         y[i] = a*x[i] + y[i];
Math libraries from IBM
• Engineering and Scientific Subroutine Library (ESSL)
• Mathematical Acceleration Subsystem (MASS)
• Mathematical Acceleration Subsystem, Vectorized (MASSV)
• ESSL, MASS, and MASSV are tuned specifically for Blue Gene and can improve performance significantly (see the link example below)
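As a sketch of how these are linked (the -L path is hypothetical and site-specific; -lmassv, -lmass, and -lesslbg are the usual BG/L library names, combined here with the multiple-definition flag from the "Practical flags" slide):

  mpxlf90 -O3 -qarch=440d -qtune=440 -o app app.f \
          -L/usr/local/apps/lib -lmassv -lmass -lesslbg \
          -Wl,--allow-multiple-definition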
Modes for running jobs – VN & CO
• The default mode is communication coprocessor (CO) mode: one processor on the node is the main processor running the compute process, while the second processor behaves as an offload engine (mainly for communication functions).
• In virtual node (VN) mode, both processors run compute processes. The node resources (primarily the memory and the torus network) are shared by both processes, so in VN mode each task has half the memory per node compared to CO mode.
• Communication- and I/O-intensive tasks that exchange large amounts of data between compute nodes benefit from CO mode.
• Applications that are primarily CPU-bound and do not have a large per-node memory requirement benefit from VN mode. The mode is selected at job launch (see the mpirun example below).
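A hedged launch example selecting the mode with mpirun's -mode flag (the partition name and executable are hypothetical):

  mpirun -partition BOT512 -np 1024 -mode VN -cwd $PWD -exe ./a.out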
Profiling your code on the Blue Gene
• Standard profiling (prof, gprof) is available on the Blue Gene.
• Three levels of profiling are available with gmon, depending on the -pg and -g options on the compile and link commands:
  • Timer-tick profiling information: add -pg to the link options
  • Procedure-level profiling with timer-tick info: add -pg to the compile and link options
  • Full profiling (call-graph info, statement-level profiling, basic-block profiling, and machine-instruction profiling): add -pg -g to the compile and link options
• Each task generates a gmon.out.x file, where x is the rank of the task.
• Output can be read using the gprof command (see the build example below and the gprof example on the next slide).
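For example, to build the poisson code from the next slide with full profiling enabled (compiler wrapper as used elsewhere in this talk):

  mpxlf90 -pg -g -O3 -qarch=440d -qtune=440 -c poisson.f
  mpxlf90 -pg -g -o poisson poisson.o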
Example of profiling using gmon
• /bgl/BlueLight/ppcfloor/blrts-gnu/bin/powerpc-bgl-blrts-gnu-gprof --sum poisson gmon.out.*
• gprof poisson gmon.sum > test.out

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
40.59     11.45     11.45        8     1.43     2.55  poisson
31.58     20.36      8.91   204800     0.00     0.00  solve
11.59     23.63      3.27                             cvtloop
 3.47     24.61      0.98                             BGLML_Messager_VMadvance
 1.95     25.16      0.55                             __cvt_r
 1.28     25.52      0.36                             BGLML_Messager_tree_advance
 1.28     25.88      0.36                             memcpy
 1.10     26.19      0.31                             WriteUnit
 0.96     26.46      0.27                             write
 0.71     26.66      0.20                             sinl
 0.67     26.85      0.19                             _xlfBeginIO
Integrated Performance Monitoring (IPM)
• Integrated Performance Monitoring (IPM) is a tool that gives users a concise summary of the performance and communication characteristics of their codes.
• Recompile your code, linking to the IPM library by adding -L/usr/local/apps/ipm/lib/ -lipm to the link stage. For example:
  • C: mpcc main.c -L/usr/local/apps/ipm/lib/ -lipm
  • Fortran: mpxlf90 main.f -L/usr/local/apps/ipm/lib/ -lipm
• Run your job using mpirun-ipm
IPM Output
• For both Blue Gene and DataStar, a report summarizing the collected data is produced at the end of your output. Additionally, a file is produced whose name contains your username and a number generated by IPM (for example, mahidhar.1160615104.920400.0).
• To generate a Web page showing the analysis of your code, run the ipm_parse command followed by the filename:

bg-login1 0512/RUN1> /usr/local/apps/ipm/bin/ipm_parse_sdsc mahidhar.1160615104.920400.0
IPM at SDSC - Webpage creation in progress
Please wait - this may take several minutes.
100..200..300..400..500..
IPM: Data processing finished - Creating HTML output - please wait.
The web page will be visible at:
http://www.sdsc.edu/us/tools/top/ipm/output/bgsn.14860.0
Note the webpage will stay online for 30 days
It can be regenerated at any time, or a local copy can be saved using your web browser
Troubleshooting – Common Issues
• Running out of memory
• Rogue pointers – Blue Gene applications run in the same address space as the Compute Node Kernel and the communication buffers. A stray pointer can reference the area used for communications (the kernel itself is protected), which can lead to spurious and unpredictable communication errors and may even hang the node.
• Forcing MPI to allocate too much memory through excessive buffering of messages
• Using unsupported system calls (details on the next slide)
Unsupported System Calls w/ CNK
• The following calls are not supported by the Compute Node Kernel (CNK):
  • fork() and pthread_create()
  • system()
  • gethostname() and getlogin()
  • signal(SIGTRAP, xl__trce) or signal(SIGNAL, xl__trbk)
  • usleep()
Core files on the Blue Gene
• Core files on the Blue Gene are plain text. A sample follows:

bg-login1 /gpfs-wan/mahidhar> more core.0
Summary:
  program.........................../a.out
  ended with software signal.......0x00000005 (SIGTRAP - trace trap)
  generated by interrupt...........0x00000006 (program interrupt)
  while executing instruction at...0x0020074c
  ..
Memory:
  stack top........................0x10000000
  stack frame pointer..............0x0fff7ef0
  end of heap......................0x00386000
  start of program.................0x00200000
  brk() failed w/ ENOMEM...........0 time(s)
  ..
Function Call Chain:
  0x0020074c
  0x002001e4
End of stack

• The addresses can be translated using the addr2line command (use -e to name the executable):

bg-login1 /gpfs-wan/mahidhar> addr2line -e a.out 0x0020074c
/gpfs-wan/mahidhar/sample.f:38
Standard Tuning Procedure
• Pick a suitable dataset and an optimal processor set
• Get a rough estimate of the % of peak FLOPS achieved (see the worked numbers below)
  • 5-15% of peak is normal
• Understand scaling problems by running at different processor counts
• Run with IPM to check:
  • The communication/computation ratio
  • Any anomalies: too many messages, too many collectives, large differences between the profiles of different tasks, etc.
• Understand load imbalances, if any
  • E.g. task 0 is spending too much time in I/O, or task n has a very small communication time compared to the others
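As a rough worked example of the peak-FLOPS estimate (assuming the standard BG/L figures): each 700 MHz core can retire one double-FPU SIMD fused multiply-add per cycle, i.e. 4 flops/cycle, so peak is 0.7 GHz x 4 = 2.8 GFLOPS per processor (5.6 GFLOPS per node in VN mode). A code sustaining 0.28 GFLOPS per processor is therefore at 10% of peak, squarely in the normal range.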
Task Mapping
• The default task layout is XYZT. In VN mode this can lead to inefficiencies: you get two tasks per node only if #tasks = 2 x #nodes. Otherwise the XYZT layout will leave some nodes with just one task.
• Set BGLMPI_MAPPING=TXYZ to ensure that you get two tasks per node when you are in VN mode and are asking for fewer than 2 x #nodes tasks. The TXYZ mapping puts tasks 0 and 1 on the first node, tasks 2 and 3 on the next node, and so on, with the nodes in x, y, z torus order (see the example below).
• A mapfile can be used to specify the mapping of tasks to nodes explicitly.
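A hedged job-launch example combining VN mode with the TXYZ mapping via mpirun's -env flag (the partition name and executable are hypothetical; 768 tasks on 512 nodes is fewer than 2 x #nodes):

  mpirun -partition BOT512 -np 768 -mode VN -env BGLMPI_MAPPING=TXYZ -cwd $PWD -exe ./a.out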
Mapfile
• Can be used with the -mapfile option of mpirun
• The mapfile contains the information for associating torus coordinates with MPI ranks 0 to N-1
• The format of the mapfile is as follows (a small example follows this slide):

x0 y0 z0 t0
x1 y1 z1 t1
x2 y2 z2 t2
...

where MPI task 0 is mapped to torus coordinates (x0, y0, z0) using processor t0 on that node.
• The processor number t0 is always 0 in coprocessor mode, and is either 0 or 1 in virtual node mode. There is one line in the mapping file for each MPI task, in MPI rank order.
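As a small illustrative mapfile, the following places four tasks on two nodes in virtual node mode: ranks 0 and 1 run on the two processors of the node at torus coordinates (0,0,0), and ranks 2 and 3 on the node at (1,0,0):

0 0 0 0
0 0 0 1
1 0 0 0
1 0 0 1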
Cartesian communicator functions
• Functions to map nodes to specific hardware or processor set (pset) configurations:
• PMI_Cart_comm_create()
  • Creates a four-dimensional communicator that mimics the exact hardware on which it is run. This is a collective operation that runs on all the nodes.
• PMI_Pset_same_comm_create()
  • Creates a set of communicators where all the nodes in a given communicator are part of the same pset (all share the same I/O node)
  • Can be used to manage I/O effectively (see the sketch below)
• PMI_Pset_diff_comm_create()
  • Creates a set of communicators where no two nodes in a communicator are part of the same pset
  • Can be used to manage I/O effectively
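A minimal sketch of funneling I/O through one task per pset using PMI_Pset_same_comm_create(). The prototype shown is assumed from the application development guide (SG24-7179), and the header that declares it varies by driver version, so it is declared explicitly here:

  #include <mpi.h>

  /* Assumed prototype; the declaring header differs by driver version. */
  int PMI_Pset_same_comm_create(MPI_Comm *pset_comm);

  int main(int argc, char *argv[])
  {
      MPI_Comm pset_comm;
      int pset_rank;

      MPI_Init(&argc, &argv);

      /* All nodes sharing an I/O node end up in the same communicator. */
      PMI_Pset_same_comm_create(&pset_comm);
      MPI_Comm_rank(pset_comm, &pset_rank);

      /* Let only rank 0 of each pset touch the file system, so each
         I/O node serves exactly one writing task. */
      if (pset_rank == 0) {
          /* ... perform the I/O for this pset ... */
      }

      MPI_Comm_free(&pset_comm);
      MPI_Finalize();
      return 0;
  }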
References
• Blue Gene Web site at SDSC: http://www.sdsc.edu/us/resources/bluegene
• Blue Gene Application Development guide (IBM Redbooks): http://www.redbooks.ibm.com/abstracts/sg247179.html
• "Exploiting the Dual Floating Point Units in Blue Gene/L", whitepaper: http://www-1.ibm.com/support/docview.wss?uid=swg27007511&aid=1
• Using the XL compilers for Blue Gene: http://www-1.ibm.com/support/docview.wss?uid=swg27007895&aid=1