Getting the Most Out of the TeraGrid SGI Altix UV Systems
Mahin Mahmoodi, Raghu Reddy
TeraGrid 11 Conference, July 18, 2011, Salt Lake City
Outline • Blacklight memory BW and latency w.r.t. processor-core mapping • GRU environment variable • Portable performance evaluation tools on Blacklight • Case study: PSC Hybrid Benchmark • PAPI • IPM • SCALASCA • TAU
Blacklight memory BW and latency with respect to processor-core mapping
Blacklight per Blade/Processor/Core Memory Layout
[Diagram: one blade holds 2 processors (sockets) connected by QPI to a HUB; each blade has 128 GB of local memory. A node is 1 blade + 1 HUB.]
• 8 cores per processor (socket)
• L1: 64 KB per core; L2: 256 KB per core
• L3 (last-level cache): 24 MB per processor
Blacklight Node Pair Architecture
[Diagram: a "node pair" is two nodes connected by NUMAlink-5. Each node has one UV Hub connected by QPI to two Intel Nehalem EX-8 sockets, each with 64 GB of RAM.]
HPCC STREAM Benchmark
• Memory bandwidth is the rate at which data can be read from or stored into memory by the processor.
• STREAM measures sustainable main-memory bandwidth (in MB/s) and the corresponding computation rate for a simple vector kernel.
• Kernel: compute a = b + α·c (the STREAM Triad), where b and c are vectors of random 64-bit floating-point values and α is a given scalar.
• Problem size: the STREAM benchmark is specifically designed to work with datasets much larger than the available cache on any given system, so the results are more indicative of the performance of very large, vector-style applications.
• Design purpose: it is designed to stress local memory bandwidth. The vectors may be allocated in an aligned manner such that no communication is required to perform the computation.
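A minimal C sketch of this kernel (illustrative only; the array size and α below are placeholders, not the HPCC settings):

  #include <stdio.h>
  #include <stdlib.h>

  #define N (1L << 26)   /* ~64M doubles per array, far larger than the 24 MB L3 */

  int main(void)
  {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));
      const double alpha = 3.0;

      /* Initialize the operands (first touch determines NUMA placement). */
      for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

      /* STREAM-style Triad: a = b + alpha * c. Timing this loop and counting
         3 * N * sizeof(double) bytes moved gives the bandwidth in MB/s. */
      for (long i = 0; i < N; i++)
          a[i] = b[i] + alpha * c[i];

      printf("a[0] = %f\n", a[0]);   /* keep the compiler from eliding the loop */
      free(a); free(b); free(c);
      return 0;
  }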
Blacklight Memory Bandwidth w.r.t. Process-core Mapping
• HPCC STREAM used to measure memory BW (MB/s)
Effect of -openmp and omplace on STREAM Benchmark Bandwidth
• -openmp is the compilation flag
• omplace is the run-time command for an OpenMP code that ensures threads do not migrate across cores
• Take-home message: if the code is compiled with OpenMP, be sure to use omplace
• Example: mpirun -np 16 omplace -nt 4 ./myhybrid
Modified STREAM Benchmark
• Single-core modified STREAM is benchmarked
• Notation: blk-stride-arraysize (block, stride, array size)
• g: gigaword; units are in words (8 bytes)
Remote Memory Access
• Modified STREAM code is benchmarked
• Data is initialized on thread 0 and resides in thread 0's memory
• Data is then accessed by thread <n> (remote access)
• Block = blk, Stride = S, Array size = n
[Diagram: cores 0-7 and 8-15 sit on the two sockets of one hub (connected by QPI); cores 16-23 and 24-31 sit on the second hub of the node pair.]
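A simplified sketch of the idea (not the actual modified-STREAM source): thread 0 first-touches the array so its pages are placed in that socket's memory, and a thread on another socket/hub then reads it with a given block and stride. The blk and stride values below are illustrative.

  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define N (1L << 26)

  int main(void)
  {
      double *a = malloc(N * sizeof(double));
      double sum = 0.0;
      long blk = 1024, stride = 4096;   /* illustrative blk / stride values */

      #pragma omp parallel shared(a) reduction(+:sum)
      {
          int tid = omp_get_thread_num();

          /* Thread 0 initializes: first touch places the pages in its
             socket's local memory. */
          if (tid == 0)
              for (long i = 0; i < N; i++) a[i] = (double)i;

          #pragma omp barrier

          /* A remote thread (here, the last one) streams through the data in
             blocks of 'blk' separated by 'stride'; its loads cross QPI and/or
             NUMAlink, exposing remote-access latency and bandwidth. */
          if (tid == omp_get_num_threads() - 1)
              for (long s = 0; s + blk <= N; s += stride)
                  for (long i = s; i < s + blk; i++) sum += a[i];
      }
      printf("sum = %f\n", sum);
      free(a);
      return 0;
  }

Run it under omplace so the threads stay pinned to the intended cores.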
HPCC Ping-Pong Benchmark
• Latency: the time required to send an 8-byte message from one process to another.
• What does the ping-pong benchmark do? It runs on two processes: the client sends a message (ping) to the server, which bounces it back (pong). MPI standard blocking send and receive are used, and the ping-pong pattern is repeated in a loop. The communication time of one message is the total time measured on the client divided by twice the loop length. Start-up latency is masked out by starting the measurement after one non-measured ping-pong. For latency, HPCC uses 8-byte messages with a loop length of 8, repeats the benchmark 5 times, and reports the shortest latency. For bandwidth, 2,000,000-byte messages with loop length 1 are repeated twice.
• How is ping-pong measured on more than 2 processors? The benchmark reports the maximum latency and minimum bandwidth over a number of non-simultaneous ping-pong tests, performed between as many distinct exclusive processor pairs as possible (there is an upper bound on the time this test may take).
Reference: http://icl.cs.utk.edu/hpcc/faq/index.html
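A minimal MPI sketch of the ping-pong pattern described above (not the HPCC source; the loop length and message type here are illustrative):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, nloops = 8;
      double msg = 0.0, t0, t1;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* (HPCC also performs one non-measured warm-up ping-pong first.) */
      t0 = MPI_Wtime();
      for (int i = 0; i < nloops; i++) {
          if (rank == 0) {              /* client: ping, then wait for the pong */
              MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
              MPI_Recv(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {       /* server: bounce the message back */
              MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              MPI_Send(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
          }
      }
      t1 = MPI_Wtime();

      if (rank == 0)   /* total time divided by twice the loop length */
          printf("one-way latency ~ %g us\n", (t1 - t0) / (2.0 * nloops) * 1e6);

      MPI_Finalize();
      return 0;
  }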
Blacklight Latency with Respect to Process-core Mapping
• HPCC ping-pong used for latency measurement
• Ranks send and receive 8-byte messages one at a time
Global Reference Unit (GRU) Hardware Overview
[Diagram: two Nehalem EX sockets (4, 6, or 8 cores each), each with its own memory DIMMs, connected by QPI to a UV Hub; the hub holds 2 GRU chiplets and attaches to NUMAlink-5.]
• The GRU is a coprocessor that resides in the HUB (node controller) of a UV system
• The GRU provides high-bandwidth, low-latency communication between sockets
• The SGI MPT library uses GRU features to optimize inter-node communication
Run-time Tuning with GRU in the PSC Hybrid Benchmark
• Setting the GRU_RESOURCE_FACTOR variable at run time may improve communication time:
  setenv GRU_RESOURCE_FACTOR <n>, with n = 2, 4, 6, 8
• All runs are on 64 cores
• (ranks, threads): (64, 1), (8, 8), (8, 4)
[Charts: wall time and communication time for each configuration]
Effect of GRU on HPCC Ping-Pong BW
• HPCC ping-pong is run twice
• The following environment variables are set in one of the runs:
  setenv MPI_GRU_CBS 0
  setenv GRU_RESOURCE_FACTOR 4
  setenv MPI_BUFFER_MAX 2048
A Case Study: PSC Hybrid Benchmark Code (Laplace Solver)
• The code uses MPI and OpenMP to parallelize the solution of a partial differential equation (PDE)
• Tests the MPI/OpenMP performance of the code on a NUMA system
• Computation: each process updates the entries of the part of the array it owns
• Communication: each process communicates only with its two neighbors, and only at block boundaries, to receive the values of neighbor points owned by the other process
• No collective communication
• Communication is simplified by allocating an overlap (halo) area on each process for storing the values received from its neighbors
The Laplace Equation
• We solve the 2-D Laplace equation, ∂²T/∂x² + ∂²T/∂y² = 0, for T(x,y) at the grid points, subject to the following boundary conditions:
  • T = 0 along the top and left boundaries
  • T varies linearly from 0 to 100 along the right and bottom boundaries
• The solution method is known as the Point Jacobi iteration
[Diagram: square domain with the boundary values T = 0 and T = 100 indicated]
The Point Jacobi Iteration
• In this iterative method, each value T(i,j) is replaced by the average of its four neighbors until the convergence criterion is met:
  T(i,j) = 0.25 * [ T(i-1,j) + T(i+1,j) + T(i,j-1) + T(i,j+1) ]
[Stencil diagram: T(i,j) with its neighbors T(i-1,j), T(i+1,j), T(i,j-1), T(i,j+1)]
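A minimal C sketch of one Jacobi sweep (the array names, grid size, and toy boundary value are illustrative, not the benchmark's own):

  #include <stdio.h>

  /* One Jacobi sweep over an (nx+2) x (ny+2) grid with a one-cell halo;
     told holds the previous iterate, tnew receives the averaged values. */
  static void jacobi_sweep(int nx, int ny,
                           double told[nx + 2][ny + 2],
                           double tnew[nx + 2][ny + 2])
  {
      for (int i = 1; i <= nx; i++)
          for (int j = 1; j <= ny; j++)
              tnew[i][j] = 0.25 * (told[i - 1][j] + told[i + 1][j] +
                                   told[i][j - 1] + told[i][j + 1]);
  }

  int main(void)
  {
      enum { NX = 8, NY = 8 };
      double told[NX + 2][NY + 2] = {{0}}, tnew[NX + 2][NY + 2] = {{0}};

      told[NX + 1][NY + 1] = 100.0;      /* toy boundary value */
      jacobi_sweep(NX, NY, told, tnew);
      printf("tnew[NX][NY] = %f\n", tnew[NX][NY]);
      return 0;
  }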
Data Decomposition in the PSC Laplace Benchmark
• A 1-D, row-wise block partition is used
• Each process (PE) computes the Jacobi update for the points in its block and communicates only the block-boundary rows with its neighbor(s); a sketch of this exchange follows
[Diagram: the grid split into four row blocks owned by PE0, PE1, PE2, PE3]
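A hedged sketch of the halo (overlap) exchange implied by this decomposition, using the nonblocking MPI calls that dominate the Scalasca report shown later (MPI_Isend, MPI_Irecv, waits); the names and sizes are placeholders, not the benchmark's:

  #include <mpi.h>
  #include <stdio.h>

  /* Exchange the first and last owned rows with the neighbours above and
     below; incoming rows land in the halo (overlap) rows 0 and nrows+1. */
  static void exchange_halos(int rank, int nprocs, int nrows, int ncols,
                             double t[nrows + 2][ncols])
  {
      MPI_Request req[4];
      int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
      int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

      MPI_Irecv(t[0],         ncols, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
      MPI_Irecv(t[nrows + 1], ncols, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);
      MPI_Isend(t[1],         ncols, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
      MPI_Isend(t[nrows],     ncols, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
  }

  int main(int argc, char **argv)
  {
      int rank, nprocs;
      enum { NROWS = 4, NCOLS = 8 };
      double t[NROWS + 2][NCOLS];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      for (int i = 0; i < NROWS + 2; i++)          /* fill each block with its rank */
          for (int j = 0; j < NCOLS; j++)
              t[i][j] = (double)rank;

      exchange_halos(rank, nprocs, NROWS, NCOLS, t);

      if (rank == 0)
          printf("rank 0 halo row from below now holds %.1f\n", t[NROWS + 1][0]);

      MPI_Finalize();
      return 0;
  }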
Portable Performance Evaluation Tools on Blacklight
Goals:
• Give an overview of the performance-tool suite available on Blacklight
• Explain the functionality of the individual tools
• Teach how to use the tools effectively
  • Capabilities
  • Basic use
  • Hybrid profiling analysis
  • Reducing the profiling overhead
  • Common environment variables
Available Open-Source Performance Evaluation Tools on Blacklight
• PAPI
• IPM
• SCALASCA
• TAU
• module avail <tool> — view the available versions
• module load <tool> — bring a tool into the environment, e.g.: module load tau
What is PAPI?
• Middleware that provides a consistent programming interface to the hardware performance counters found in most major microprocessors.
• Countable hardware events:
  • PRESET: platform-neutral events
  • NATIVE: platform-dependent events
  • Derived: preset events may be derived from multiple native events
  • Multiplexed: events can be multiplexed if counters are limited
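For example, a short C sketch using standard PAPI calls to check whether a preset event is countable on the current platform:

  #include <stdio.h>
  #include "papi.h"

  int main(void)
  {
      /* The library must be initialized before any other PAPI call. */
      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
          fprintf(stderr, "PAPI init failed\n");
          return 1;
      }

      /* PAPI_query_event reports whether a preset event exists here. */
      if (PAPI_query_event(PAPI_FP_OPS) == PAPI_OK)
          printf("PAPI_FP_OPS is available on this platform\n");
      else
          printf("PAPI_FP_OPS is not available\n");

      return 0;
  }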
PAPI Utilities
• The utilities live in the PAPI bin directory. Load the module first to put them on your PATH, or use the absolute path to the utility. Example:
  % module load papi
  % which papi_avail
  /usr/local/packages/PAPI/usr/4.1.3/bin/papi_avail
• Run the utilities on compute nodes, since mmtimer is not available on the login nodes.
• Use <utility> -h for more information. Example:
  % papi_cost -h
  papi_cost computes min / max / mean / std. deviation for PAPI start/stop pairs, for PAPI reads, and for PAPI_accums.
  Usage: cost [options] [parameters] …
PAPI Utilities (cont.)
• Execute papi_avail to list the PAPI preset events:
  % papi_avail
  ……
  Name          Code        Avail  Deriv  Description (Note)
  PAPI_L1_DCM   0x80000000  Yes    No     Level 1 data cache misses
  PAPI_L2_DCM   0x80000002  Yes    Yes    Level 2 data cache misses
• Execute papi_native_avail to list the available native events:
  % papi_native_avail
  …….
  Event Code  Symbol                       | Long Description
  0x40000005  LAST_LEVEL_CACHE_REFERENCES  | This is an alias for LLC_REFERENCES
• Execute papi_event_chooser to select a compatible set of events that can be counted simultaneously:
  % papi_event_chooser
  Usage: papi_event_chooser NATIVE|PRESET evt1 evt2 ...
  % papi_event_chooser PRESET PAPI_FP_OPS PAPI_L1_DCM
  event_chooser.c  PASSED
PAPI High-level Interface
• Meant for application programmers who want coarse-grained measurements
• Calls the lower-level API
• Allows only PAPI preset events
• Easier to use and needs less setup (less additional code) than the low-level interface
• Supports 8 calls in C or Fortran: PAPI_num_counters, PAPI_start_counters, PAPI_read_counters, PAPI_accum_counters, PAPI_stop_counters, PAPI_flips, PAPI_flops, PAPI_ipc
PAPI High-level Example

  #include "papi.h"
  #define NUM_EVENTS 2
  long_long values[NUM_EVENTS];
  unsigned int Events[NUM_EVENTS] = {PAPI_TOT_INS, PAPI_TOT_CYC};

  /* Start the counters */
  PAPI_start_counters((int *)Events, NUM_EVENTS);

  do_work();   /* What we are monitoring */

  /* Stop counters and store results in values */
  retval = PAPI_stop_counters(values, NUM_EVENTS);
Low-level Interface • Increased efficiency and functionality over the high level PAPI interface • Obtain information about the executable, the hardware, and the memory environment • Multiplexing • Callbacks on counter overflow • Profiling • About 60 functions
PAPI Low-level Example

  #include "papi.h"
  #define NUM_EVENTS 2
  int Events[NUM_EVENTS] = {PAPI_FP_INS, PAPI_TOT_CYC};
  int EventSet;
  long_long values[NUM_EVENTS];

  /* Initialize the library */
  retval = PAPI_library_init(PAPI_VER_CURRENT);

  /* Allocate space for the new event set and do setup */
  retval = PAPI_create_eventset(&EventSet);

  /* Add FLOP and total-cycle events to the event set */
  retval = PAPI_add_events(EventSet, Events, NUM_EVENTS);

  /* Start the counters */
  retval = PAPI_start(EventSet);

  do_work();   /* What we want to monitor */

  /* Stop counters and store results in values */
  retval = PAPI_stop(EventSet, values);
Example: FLOPS with PAPI Calls

  program mflops_example
    implicit none
#include 'fpapi.h'
    integer :: i
    double precision :: a, b, c
    integer, parameter :: n = 1000000
    integer (kind=8) :: flpops = 0
    integer :: check
    real (kind=4) :: real_time = 0., proc_time = 0., mflops = 0.

    a = 1.e-8
    b = 2.e-7
    c = 3.e-6

    call PAPIF_flops(real_time, proc_time, flpops, mflops, check)
    print *, "first: ", flpops, proc_time, mflops, check

    do i = 1, n
      a = a + b * c
    end do

    call PAPIF_flops(real_time, proc_time, flpops, mflops, check)
    print *, "second: ", flpops, proc_time, mflops, check
    print *, 'sum = ', a
  end program mflops_example

Compilation:
  % module load papi
  % ifort -fpp $PAPI_INC -o mflops mflops_example.f $PAPI_LIB

Execution:
  % module load papi
  % ./a.out

Output (flpops, proc_time, mflops, check):
  first:        0  0.0000000E+00  0.0000000E+00  0
  second: 1000009  1.4875773E-03  672.2400       0
  sum =  6.100000281642980E-007
IPM: Integrated Performance Monitoring
• Lightweight and easy to use
• Profiles only MPI codes (not serial, not pure OpenMP)
• Profiles only MPI routines (not computational routines)
• Accesses hardware performance counters through PAPI
• Lists message-size information
• Provides the communication topology
• Reports wall time, communication %, flops, total memory usage, MPI-routine load imbalance, and time breakdown
• IPM-1 and IPM-2 (pre-release) are installed on Blacklight
• Generates a text report and visual (HTML-based) data
How to Use IPM on Blacklight: Basics
Compilation
• module load ipm
• Link your code against the IPM library at compile time, e.g.:
  icc test.c $PAPI_LIB $IPM_LIB -lmpi
  ifort -openmp test.f90 $PAPI_LIB $IPM_LIB -lmpi
Execution
• Optionally, set run-time environment variables, e.g.:
  export IPM_REPORT=FULL
  export IPM_HPM=PAPI_FP_OPS,PAPI_L1_DCM   (a comma-separated list of PAPI counters)
• % module load ipm
• Execute the binary normally (this step generates an XML file for the visual data)
Profiling report
• The text report appears in the batch output after the run completes
• For the HTML report, run ipm_parse -html <xml_file>, transfer the generated directory to your workstation, and open index.html to view the visual data
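IPM can also attribute time to user-defined code regions delimited by MPI_Pcontrol calls; a hedged sketch (the region name is illustrative, and this region feature should be verified against the IPM version installed on Blacklight):

  #include <mpi.h>
  #include <unistd.h>

  static void solve_step(void) { sleep(1); }   /* stand-in for real work */

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* IPM attributes everything between the matching Pcontrol calls
         to a region named "solver" in its report. */
      MPI_Pcontrol(1, "solver");
      solve_step();
      MPI_Pcontrol(-1, "solver");

      MPI_Finalize();
      return 0;
  }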
IPM Profiling: Message Sizes
• Message size per MPI call:
• In this run, 2 MB messages account for essentially 100% of the communication time, spent in MPI_Wait and MPI_Irecv
SCALASCA
• Automatic performance-analysis toolset for scalable analysis of large-scale applications
• Automated profile-based performance analysis
• Automatic search for bottlenecks based on properties formalizing expert knowledge
  • MPI wait states
  • Processor utilization (hardware counters)
• Particularly focused on the MPI & OpenMP paradigms
• Analysis of communication & synchronization overheads
• Automatic and manual instrumentation capabilities (a sketch of manual instrumentation follows this list)
• Runtime summarization and/or event-trace analyses
• Automatic search of event traces for patterns of inefficiency
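A hedged sketch of manual instrumentation with the Scalasca 1.x EPIK user API (macro names per the epik_user.h header of that series; verify against the installed version, and build through skin so the annotations are active):

  #include "epik_user.h"

  double compute_phase(int n)
  {
      double s = 0.0;

      /* Register and time a user-defined region around the hot loop;
         it appears as "solve_loop" in the Scalasca report. */
      EPIK_USER_REG(r_solve, "solve_loop");
      EPIK_USER_START(r_solve);
      for (int i = 0; i < n; i++)
          s += (double)i * 0.5;
      EPIK_USER_END(r_solve);

      return s;
  }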
How to Use SCALASCA on Blacklight: Basics
• module load scalasca
• Run the scalasca command (% scalasca) without arguments for basic usage info
• scalasca -h shows the quick-reference guide (PDF document)
• Instrumentation
  • Prepend skin (or scalasca -instrument) to the compile/link commands
  • Example: skin icc -openmp test.c -lmpi   (hybrid code)
• Measurement & analysis
  • Prepend scan (or scalasca -analyze) to the usual execution command (this step generates the epik directory)
  • Example: scan -t mpirun -np 16 omplace -nt 4 ./exe   (the optional -t enables trace generation)
• Report examination
  • Run square (or scalasca -examine) on the generated epik measurement directory to examine the report interactively (visual data)
  • Example: square epik_a.out_32x2_sum
  • Or run cube3_score -s on the epik directory for a text report
Distribution of Time for the Selected Call Tree by Process/Thread
[Screenshot: CUBE browser with metric pane, call-tree pane, and process/thread pane]
Distribution of Load Imbalance for the work_sync Routine by Process/Thread
[Screenshot: profile of a 64-core job with 8 threads per rank on Blacklight; the color code indicates severity]
Instructions for the Scalasca Textual Report
• % module load scalasca
• Run cube3_score with the -r flag on the Cube file generated in the epik directory to see the text report. Example:
  % cube3_score -r epik_homb_8x8_sum/epitome.cube
• Region classification:
  MPI (pure MPI functions)
  OMP (pure OpenMP regions)
  USR (user-level computational routines)
  COM (combined USR + MPI/OpenMP)
  ANY/ALL (aggregate of all region types)
• Example output:
  flt type  max_tbc     time      %      region
      ANY   5788698  20951.46  100.00  (summary) ALL
      MPI   5760322   8876.37   42.37  (summary) MPI
      OMP     23384  12063.81   57.58  (summary) OMP
      COM      4896      3.35    0.02  (summary) COM
      USR        72      1.10    0.01  (summary) USR
      MPI   2000050     16.38    0.08  MPI_Isend
      MPI   1920024   7785.68   37.16  MPI_Wait
      MPI   1840000   1063.18    5.07  MPI_Irecv
      OMP      8800     56.31    0.27  !$omp parallel @homb.c:754
      OMP      4800   8102.48   38.67  !$omp for @homb.c:758
      COM      4800      3.26    0.02  work_sync
      OMP      4800   3620.97   17.28  !$omp ibarrier @homb.c:765
      OMP      4800      2.41    0.01  !$omp ibarrier @homb.c:773
      MPI       120     11.03    0.05  MPI_Barrier
      EPK        48      6.83    0.03  TRACING
      OMP        44      0.03    0.00  !$omp parallel @homb.c:465
      OMP        44    121.81    0.58  !$omp parallel @homb.c:557
      MPI        40      0.01    0.00  MPI_Gather
      MPI        40      0.00    0.00  MPI_Reduce
      USR        24      0.00    0.00  gtimes_report
      COM        24      0.00    0.00  timeUpdate
      MPI        24      0.05    0.00  MPI_Finalize
      OMP        24     23.46    0.11  !$omp ibarrier @homb.c:601
      OMP        24    136.24    0.65  !$omp for @homb.c:569
      COM        24      0.00    0.00  initializeMatrix
      USR        24      1.10    0.01  createMatrix
  …
Scalasca: Notable Run-time Environment Variables
• Set ELG_BUFFER_SIZE to avoid intermediate flushes to disk
  Example: setenv ELG_BUFFER_SIZE 10000000   (bytes)
• To estimate a suitable ELG_BUFFER_SIZE, run the following on the epik directory:
  % scalasca -examine -s epik_homb_8x8_sum
  …
  Estimated aggregate size of event trace (total_tbc): 41694664 bytes
  Estimated size of largest process trace (max_tbc): 5788698 bytes
  (Hint: When tracing set ELG_BUFFER_SIZE > max_tbc to avoid intermediate flushes
  or reduce requirements using a file listing names of USR regions to be filtered.)
• Set EPK_METRICS to a colon-separated list of PAPI counters
  Example: setenv EPK_METRICS PAPI_TOT_INS:PAPI_FP_OPS:PAPI_L2_TCM
• Set EPK_FILTER to the name of a file listing routines to filter out, reducing the instrumentation and measurement overhead
  Example: setenv EPK_FILTER routines_filt
  % cat routines_filt
  sumTrace
  gtimes_report
  statistics
  stdoutIO
Time Spent in the Selected OpenMP Region, and Idle Threads
[Screenshot: CUBE display with the source code shown; idle threads are greyed out]
TAU Parallel Performance Evaluation Toolset
• Portable to essentially all computing platforms
• Supported programming languages and paradigms: Fortran, C/C++, Java, Python, MPI, OpenMP, hybrid, multithreading
• Supported instrumentation methods: source-code instrumentation, object and binary code, library wrapping
• Levels of instrumentation: routine, loop, block, I/O bandwidth & volume, memory tracking, CUDA, hardware counters, tracing
• Data analyzers: ParaProf, PerfExplorer, Vampir, Jumpshot
• Throttling of frequently called small subroutines to limit overhead
• Automatic and manual instrumentation (a sketch of the manual API follows)
• Interface to databases (Oracle, MySQL, …)
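A hedged sketch of TAU manual instrumentation in C (macro names as documented for TAU 2.x; verify against the installed version, and compile with tau_cc.sh so the timers are active):

  #include <TAU.h>

  int main(int argc, char **argv)
  {
      TAU_PROFILE_TIMER(loop_timer, "main_loop", "", TAU_USER);
      TAU_PROFILE_INIT(argc, argv);
      TAU_PROFILE_SET_NODE(0);          /* required for a non-MPI program */

      double s = 0.0;
      TAU_PROFILE_START(loop_timer);    /* time the loop as a named region */
      for (int i = 0; i < 1000000; i++)
          s += (double)i;
      TAU_PROFILE_STOP(loop_timer);

      return (s > 0.0) ? 0 : 1;
  }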
How to Use TAU on Blacklight: Basics
Step 0
  % module avail tau   (shows available TAU versions)
  % module load tau
Step 1: Compilation
• Choose a TAU Makefile stub based on the kind of profiling you want. The available Makefile stubs are listed by:
  ls $TAU_ROOT_DIR/x86_64/lib/Makefile*
  e.g.: Makefile.tau-icpc-mpi-pdt-openmp-opari for an MPI+OpenMP code
• Optionally set TAU_OPTIONS to specify compilation-specific options, e.g.:
  setenv TAU_OPTIONS "-optVerbose -optKeepFiles"   (verbose output & keep the instrumented files)
  export TAU_OPTIONS='-optTauSelectFile=select.tau -optVerbose'   (selective instrumentation)
• Use one of the TAU wrapper scripts to compile your code (tau_f90.sh, tau_cc.sh, or tau_cxx.sh), e.g.:
  tau_cc.sh foo.c   (generates an instrumented binary)
Step 2: Execution
• Optionally, set TAU run-time environment variables to choose the metrics to collect, e.g.:
  setenv TAU_CALLPATH 1   (callgraph generation)
  setenv TAU_METRICS <colon-separated list, e.g. PAPI counters>
• Run the instrumented binary from Step 1 normally (profile files will be generated)
Step 3: Data analysis
• Run pprof in the directory containing the profile files for a text report
• Run paraprof for visual data
• Run perfexplorer for analysis across multiple sets of profiles
• Run Jumpshot or Vampir for trace-file analysis
Hybrid Code Profiled with TAU
[Screenshot: routines' time breakdown per node/thread]
Hybrid Code Profiled with TAU (cont.)
[Screenshots: routines' exclusive time % on node 0, thread 0, and on rank 3, thread 4]