HPCPI/Xtools Performance Analysis Toolset
David LaFrance-Linden
High Performance Computing Division
Overview
• HPCPI
  • Statistical sampling profiler
  • Derived from DCPI (Digital Continuous Profiling Infrastructure)
  • Compare (loosely) to:
    • OProfile (open source): conceptually based on DCPI
    • Caliper (from HP, for Itanium): has many other modes/features
    • VTune (from Intel): with GUI
    • CodeAnalyst (from AMD): with GUI
• Xtools
  • Performance visualization tools
  • xclus: cluster-wide visualization tool
  • xperf: node-specific visualization tool
HPCPI – Standard sampling
• Set the default database location:
    % setenv HPCPIDB ~/hpcpidb
• Start the daemon:
    % hpcpid
    Using info for 'AMD64 (family 0Fh)' PMU
    1 tags, user definition:
    pretty formal           interval duty   randomize
    ------ ---------------- -------- ------ ---------
    Cycles CPU_CLK_UNHALTED    60000 always no
    maintainVCT = false
    1 groups; user definition:
    # 0                1       2       3
    1 CPU_CLK_UNHALTED <empty> <empty> <empty>
    ---- multiplexing interval = 1000000 ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6310
HPCPI – Standard sampling
• Run programs:
    % time ./mb_pi.O.exe -iters 100
    3.1415926535897932384626433832795028
    3.995u 0.000s 0:04.00 99.7% 0+0k 0+0io 0pf+0w
    % time ./mb_pi.g.exe -iters 100
    3.1415926535897932384626433832795028
    8.439u 0.000s 0:08.44 99.8% 0+0k 0+0io 0pf+0w
• Flush the database to disk:
    % hpcpictl flush
    hpcpictl flush successful
• Analyze:
    % hpcpiprof
    % hpcpiprof ./mb_pi.g.exe ./mb_pi.O.exe
    % hpcpilist mandel_val ./mb_pi.g.exe
hpcpiprof (by image)
    % hpcpiprof
    Event Name             Events Period Samples
    ---------------- ------------ ------ -------
    CPU_CLK_UNHALTED 163217580000  60000 2720293

    CPU_CLK_
    UNHALTED     %   cum% image
    -------- ----- ------ ----------------------------------
    136108e6 83.4%  83.4% vmlinux-2.6.9-34.7hp.XCsmp.o.hpcpi
     18077e6 11.1%  94.5% mb_pi.g.exe
      8618e6  5.3%  99.7% mb_pi.O.exe
       345e6  0.2% 100.0% emacs
        19e6  0.0% 100.0% libc-2.3.4.so
         7e6  0.0% 100.0% Xorg
         6e6  0.0% 100.0% tg3.ko
         5e6  0.0% 100.0% libgobject-2.0.so.0.400.7
         4e6  0.0% 100.0% libX11.so.6.2
         4e6  0.0% 100.0% libgdk-x11-2.0.so.0.400.13
         4e6  0.0% 100.0% libglib-2.0.so.0.400.7
         2e6  0.0% 100.0% hald
         2e6  0.0% 100.0% ld-2.3.4.so
         2e6  0.0% 100.0% ohci_hcd.ko
    ...
hpcpiprof (by procedure)
    % hpcpiprof ./mb_pi.O.exe ./mb_pi.g.exe
    Event Name            Events Period Samples
    ---------------- ----------- ------ -------
    CPU_CLK_UNHALTED 26694720000  60000  444912

    CPU_CLK_
    UNHALTED     %   cum% procedure       image
    -------- ----- ------ --------------- -----------
    175927e5 65.9%  65.9% mandel_val      mb_pi.g.exe
     84917e5 31.8%  97.7% mandel_val      mb_pi.O.exe
      4840e5  1.8%  99.5% mb_fill_in_data mb_pi.g.exe
      1264e5  0.5% 100.0% mb_fill_in_data mb_pi.O.exe
hpcpilist (by source/assembly)
    % hpcpilist mandel_val mb_pi.g.exe
    Event Name            Events Period
    ---------------- ----------- ------
    CPU_CLK_UNHALTED 17592720000  60000

    CPU_CLK_UNHALTED Source
    ---------------- -----------------------------------------------------
            34560e03 159: {
                     160:   register NUMTYPE zr = cr, zi = ci, zr2, zi2;
                     161:   register int n = 0, delta = nmax - nmin;
           302220e03 162:   register NUMTYPE rad2, four = CONSTANT4(4.0);
                     163:   register int keepgoing;
           365100e03 164:   while (((n += 1),
                     165:           (zr2 = MULT(zr,zr)),
                     166:           (zi2 = MULT(zi,zi)),
                     167:           (zi = 2*MULT(zr,zi) + ci),
                     168:           (keepgoing = (n < nmax)),
                     169:           (rad2 = zr2 + zi2),
                     170:           (zr = zr2 - zi2 + cr),
                     171:           (rad2 <= four))
                     172:          && keepgoing)
                     173:     ;
            16696e06 174:   if (cause_segv && cr > CONSTANT(0.0) ...) {
                     175:     if (1) cause_segv = 0;
                     176:     *(char*)(long)cause_segv = 1;
                     177:   }
           103620e03 178:   return (n >= nmax ? delta :
                     179:           n < delta ? n :
                     180:           n == delta ? 0 :
                     181:           n%delta);
                     182: }
HPCPI’s differentiators
• Sample rate
• Overhead
• Features
  • Can sample more than one event
  • Can sample an arbitrary number of events
  • The ‘label’ feature
• Attention to accuracy
Sample rate and overhead
• The default sample rate is higher (i.e., the sampling interval is lower), and the minimum sample rate is high by comparison with the other profilers
• Low overhead (on Itanium; the same technique is not possible on x86_64)
Feature: Can sample more than one event
• Useful for deriving metrics at the image, routine, or loop level
• OProfile, VTune, and CodeAnalyst can as well, but Caliper cannot yet
• Example: IPC (instructions per cycle), the ratio of two events:
  • CPU_CLK_UNHALTED
  • RETIRED_INSTRS
Example: IPC for ‘mb_pi’
• Restart the daemon with the IPCEvents set:
    % hpcpictl quit
    hpcpictl quit successful
    % hpcpid -events IPCEvents
    Using info for 'AMD64 (family 0Fh)' PMU
    2 tags, user definition:
    pretty  formal           interval duty   randomize
    ------- ---------------- -------- ------ ---------
    Cycles  CPU_CLK_UNHALTED    60000 always no
    Retired RETIRED_INSTRS      60000 1      no
    maintainVCT = false
    1 groups; user definition:
    # 0                1              2       3
    1 CPU_CLK_UNHALTED RETIRED_INSTRS <empty> <empty>
    ---- multiplexing interval = 1000000 ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6365
Example: IPC for ‘mb_pi’
• Collect and report:
    % ./mb_pi.O.exe -iters 100
    3.1415926535897932384626433832795028
    % ./mb_pi.g.exe -iters 100
    3.1415926535897932384626433832795028
    % hpcpictl flush
    hpcpictl flush successful
    % hpcpiprof mb_pi.g.exe mb_pi.O.exe
    Event Name            Events Period Samples
    ---------------- ----------- ------ -------
    CPU_CLK_UNHALTED 27032280000  60000  450538
    RETIRED_INSTRS   25146600000  60000  419110

    CPU_CLK_              RETIRED_
    UNHALTED     %   cum%   INSTRS procedure       image
    -------- ----- ------ -------- --------------- -----------
    177810e5 65.8%  65.8% 145161e5 mandel_val      mb_pi.g.exe
     86298e5 31.9%  97.7% 100236e5 mandel_val      mb_pi.O.exe
      5000e5  1.8%  99.6%   4334e5 mb_fill_in_data mb_pi.g.exe
      1213e5  0.4% 100.0%   1735e5 mb_fill_in_data mb_pi.O.exe
         1e5  0.0% 100.0%        0 main            mb_pi.g.exe
         1e5  0.0% 100.0%        0 main            mb_pi.O.exe
    % tclsh
    % expr {145161e5 / 177810e5}
    0.816382655644
    % expr {100236e5 / 86298e5}
    1.16151011611
• The optimized build sustains an IPC of about 1.16 versus 0.82 for the -g build, in addition to retiring fewer instructions overall
Multiplex arbitrary events
• The DCacheEvents set:
  • CPU_CLK_UNHALTED
  • RETIRED_INSTRS
  • DISPATCH_STALLS
  • DATA_CACHE_ACCESSES
  • DATA_CACHE_MISSES
  • DATA_CACHE_REFILLS_FROM_L2.ALL
  • DATA_CACHE_REFILLS_FROM_SYSTEM.ALL
  • DATA_CACHE_LINES_EVICTED.ALL
  • L1DTLB_MISS_L2DTLB_HIT
  • L1DTLB_AND_L2DTLB_MISS
  • L2_REQUESTS.ALL
  • L2_MISSES.ALL
  • L2_FILLS.ALL
• Why not just do them all? Or more!
• Unique to HPCPI
DCacheEvents event set on dear_rate
• Setup:
    % hpcpid -events DCacheEvents
    Using info for 'AMD64 (family 0Fh)' PMU
    13 tags, user definition:
    pretty    formal                             interval duty   randomize
    --------- ---------------------------------- -------- ------ ---------
    Cycles    CPU_CLK_UNHALTED                      60000 always no
    Retired   RETIRED_INSTRS                        60000 1      no
    StallsAll DISPATCH_STALLS                       60000 1      no
              DATA_CACHE_ACCESSES                   12000 1      no
              DATA_CACHE_MISSES                     12000 1      no
              DATA_CACHE_REFILLS_FROM_L2.ALL        12000 1      no
              DATA_CACHE_REFILLS_FROM_SYSTEM.ALL    12000 1      no
              DATA_CACHE_LINES_EVICTED.ALL          12000 1      no
              L1DTLB_MISS_L2DTLB_HIT                12000 1      no
              L1DTLB_AND_L2DTLB_MISS                12000 1      no
              L2_REQUESTS.ALL                       12000 1      no
              L2_MISSES.ALL                         12000 1      no
              L2_FILLS.ALL                          12000 1      no
    maintainVCT = false
    4 groups; user definition:
    # 0                1                            2                              3
    1 CPU_CLK_UNHALTED RETIRED_INSTRS               DISPATCH_STALLS                DATA_CACHE_ACCESSES
    2 CPU_CLK_UNHALTED DATA_CACHE_MISSES            DATA_CACHE_REFILLS_FROM_L2.ALL DATA_CACHE_REFILLS_FROM_SYSTEM.ALL
    3 CPU_CLK_UNHALTED DATA_CACHE_LINES_EVICTED.ALL L1DTLB_MISS_L2DTLB_HIT         L1DTLB_AND_L2DTLB_MISS
    4 CPU_CLK_UNHALTED L2_REQUESTS.ALL              L2_MISSES.ALL                  L2_FILLS.ALL
    ---- multiplexing interval = 1000000 ----
    Logging to /tmp/david_ll/hpcpid-hpc6.log
    Daemon is running on pid 6877
DCacheEvents event set on dear_rate
• Run:
    % (limit cpu 20sec; ./dear_rate.x86_64-Linux.exe 8 4 20 91)
    8 reads in blocks of 4, increment 91
      20000 iters in   19366069 cycles = 968.30 cycles/iter, 145.36MB/sec
      40000 iters in   37588341 cycles = 939.71 cycles/iter, 149.80MB/sec
     160000 iters in  144011554 cycles = 900.07 cycles/iter, 156.42MB/sec
    1111987 iters in  999881564 cycles = 899.18 cycles/iter, 156.58MB/sec
    1111987 iters in 1000096567 cycles = 899.38 cycles/iter, 156.54MB/sec
    1111987 iters in  999305772 cycles = 898.67 cycles/iter, 156.67MB/sec
    Cputime limit exceeded
• Flush:
    % hpcpictl flush
    hpcpictl flush successful
DCacheEvents event set on dear_rate
• Observe (derived metrics from these totals are sketched below):
    % hpcpiprof ./dear_rate.x86_64-Linux.exe
    Event Name                              Events Period Samples Active Fraction
    ---------------------------------- ----------- ------ ------- ---------------
    CPU_CLK_UNHALTED                   42096660000  60000  701611         100.00%
    RETIRED_INSTRS                      2004960000  60000    8354          25.00%
    DISPATCH_STALLS                    41377200000  60000  172405          25.00%
    DATA_CACHE_ACCESSES                  729216000  12000   15192          25.00%
    DATA_CACHE_MISSES                    395664000  12000    8243          25.00%
    DATA_CACHE_REFILLS_FROM_L2.ALL       394944000  12000    8228          25.00%
    DATA_CACHE_REFILLS_FROM_SYSTEM.ALL   389664000  12000    8118          25.00%
    DATA_CACHE_LINES_EVICTED.ALL         784080000  12000   16335          25.00%
    L1DTLB_MISS_L2DTLB_HIT                 4560000  12000      95          25.00%
    L1DTLB_AND_L2DTLB_MISS                71712000  12000    1494          25.00%
    L2_REQUESTS.ALL                      490704000  12000   10223          25.00%
    L2_MISSES.ALL                        390336000  12000    8132          25.00%
    L2_FILLS.ALL                         395904000  12000    8248          25.00%

                                                              DATA_CACHE_ DATA_CACHE_ DATA_CACHE_
                                                              REFILLS_    REFILLS_    LINES_      L1DTLB_
    CPU_CLK_                RETIRED_ DISPATCH_ DATA_CACHE_ DATA_CACHE_ FROM_L2     FROM_SYSTEM EVICTED     MISS_      L1DTLB_AND_ L2_REQUESTS L2_MISSES L2_FILLS
    UNHALTED      %   cum%  INSTRS   STALLS    ACCESSES    MISSES      .ALL        .ALL        .ALL        L2DTLB_HIT L2DTLB_MISS .ALL        .ALL      .ALL     procedure     image
    -------- ------ ------ -------- --------- ----------- ----------- ----------- ----------- ----------- ---------- ----------- ----------- --------- -------- ------------- --------------------------
    420965e5 100.0% 100.0%  20050e5  413772e5      7292e5      3957e5      3949e5      3897e5      7841e5       46e5       717e5      4907e5    3903e5   3959e5 read_memory8d dear_rate.x86_64-Linux.exe
         1e5   0.0% 100.0%        0         0         0e5           0           0           0           0          0           0           0         0        0 main          dear_rate.x86_64-Linux.exe
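The Active Fraction column shows that each multiplexed event was live for only a quarter of the run; the Events column is already extrapolated accordingly (Events = Samples × Period / ActiveFraction; e.g. 8354 × 60000 / 0.25 = 2004960000 for RETIRED_INSTRS, matching the report). A minimal C sketch of that extrapolation and of two derived metrics, using constants copied from the output above (the events() helper is illustrative, not an HPCPI API):

    #include <stdio.h>

    /* Extrapolate a multiplexed counter that was live only part of the
     * time, exactly as the Events column above does. */
    static double events(double samples, double period, double active_fraction) {
        return samples * period / active_fraction;
    }

    int main(void) {
        double retired   = events(8354,   60000, 0.25);  /* RETIRED_INSTRS      */
        double cycles    = events(701611, 60000, 1.00);  /* CPU_CLK_UNHALTED    */
        double dc_access = events(15192,  12000, 0.25);  /* DATA_CACHE_ACCESSES */
        double dc_miss   = events(8243,   12000, 0.25);  /* DATA_CACHE_MISSES   */

        printf("IPC            = %.3f\n", retired / cycles);          /* ~0.048 */
        printf("L1D miss ratio = %.1f%%\n", 100.0 * dc_miss / dc_access); /* ~54% */
        return 0;
    }

The very low IPC and high L1D miss ratio are consistent with dear_rate being a deliberately memory-bound micro-benchmark.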
The ‘label’ feature
• Partitions samples, usually based on process(es)
• See the man page for hpcpilabel
• DCPI-classic labels:
    % hpcpictl label run1 a.out one 1 uno
    % hpcpictl label run2 a.out two 2 dos
• Restrict to a script and its children:
    % hpcpictl label specs -pgid this runSpec
• Snapshot a system-wide interval:
    % hpcpictl label oneMinute -pid -1 -not sleep 60
• “Attach” to a process:
    % hpcpictl label attached -pid desiredPID sleep 99999
• Monitor the idle process on CPU 0 for 5 minutes:
    % hpcpictl label pid0cpu0 -pid 0 -cpu 0 -and sleep 300
• Can be initiated and managed by programs (see the sketch below)
  • Use popen() on hpcpictl with ‘-pgid this’ or ‘-pid parent’
  • Don’t forget to hpcpictl flush
• Use ‘-label labelName’ with the analysis tools
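A minimal sketch of program-initiated labeling along the lines suggested above. The label name myPhase, the 30-second bracket, and compute_phase() are all illustrative, and the bracketing idiom (the label covers the lifetime of the trailing sleep) follows the slide's own oneMinute example:

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical compute phase whose samples we want attributed. */
    static void compute_phase(void) {
        volatile double x = 0;
        for (long i = 0; i < 200000000L; i++)
            x += (double)i * 1e-9;
        (void)x;
    }

    int main(void) {
        /* Start labeling: samples from this process group are tagged
         * "myPhase" while the sleep bracket runs ('this' keyword as on
         * the slide). popen() lets hpcpictl run concurrently with us. */
        FILE *label = popen("hpcpictl label myPhase -pgid this sleep 30", "r");
        if (!label) { perror("popen"); return 1; }

        compute_phase();               /* the work to be profiled */

        pclose(label);                 /* wait for the bracket to end */
        if (system("hpcpictl flush") != 0)   /* don't forget to flush */
            fprintf(stderr, "hpcpictl flush failed\n");
        return 0;
    }

Once flushed, the phase can be analyzed in isolation with ‘-label myPhase’ on the analysis tools, per the last bullet above.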
Attention to accuracy (Itanium)
• Wrote micro-benchmarks with known behavior
• Eliminated post-unfreeze/pre-RFI event leaks
  • Micro-benchmark has no NOPs nor any predicate-squashed instructions
• Determined that event-based multiplexing is better than time-based
  • Micro-benchmark has a known (high) IPC
Xtools
• A pair of visualization tools
• Separable from, and cooperative with, HPCPI
• xclus
  • Cluster-wide monitoring
  • Utilizations: CPU, DRAM, HyperTransports
• xperf
  • Single-node monitoring
  • Graphs of derived events based on hardware counters
  • CPU utilization, IPC, cycle accounting, cache penalties, I/O activity, etc.
Basic structure of a system
[Diagram: two processors, each with CPUs/cores and local DRAM, linked by HyperTransport; ranks 0, 1, 2, … N-3, N-2, N-1 exchange 1/11 and 10/11 shares, ending with MPI_Finalize()]
For the icon design of xclus:
• Processors with CPUs/cores
• Local DRAM
• HyperTransports
mpi_two_way 1/11 (sketched below):
• The outermost processes exchange data
• Rank 0 sends 1/11 to rank N-1 and receives 10/11
• Rank N-1 sends 10/11 to rank 0 and receives 1/11
• The 1:10 ratio is easy to see
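A minimal MPI sketch of the exchange pattern just described. This is a reconstruction from the bullets, not the actual mpi_two_way source; the buffer size and iteration count are invented for illustration:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 11 * 1024 * 1024;        /* divisible by 11 */
        char *buf = malloc(N);
        const int small = N / 11;              /* the 1/11 share  */
        const int big   = N - small;           /* the 10/11 share */

        for (int iter = 0; iter < 1000; iter++) {
            if (rank == 0) {                   /* send 1/11, receive 10/11 */
                MPI_Send(buf, small, MPI_CHAR, size - 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, big, MPI_CHAR, size - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == size - 1) {     /* receive 1/11, send 10/11 */
                MPI_Recv(buf, small, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, big, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
            /* interior ranks idle, as in the description */
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }

Because the two directions carry a 1:10 traffic ratio, the asymmetry is immediately visible in xclus's HyperTransport icons.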
xclus, cluster running mpi_two_way; default: Utilization [screenshot]
xclus, cluster running mpi_two_way; show bandwidth (control and data) [screenshot]
xclus, cluster running mpi_two_way; show bandwidth, data only [screenshot]
xclus, cluster running mpi_two_way; waveover pop-up [screenshot]
xclus node grouping
[Diagram: four copies of the 12-rank mpi_two_way job (ranks 0–9, A, B each), every copy exchanging 1/11 vs. 10/11 and ending with MPI_Finalize()]
Run mpi_two_way on 12 processes (3 nodes), 4 times:
• bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
• bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
• bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
• bsub -n 12 [ xterm -e ] mpirun -srun mpi_two_way 1/11
Without node grouping:
• xclus [ -no-group-nodes ]
Force node grouping:
• xclus -group-nodes
xclus, (4x3)x4 mpi_two_way, not grouped [screenshot]
xclus, (4x3)x4 mpi_two_way, grouped [screenshot]
xclus, (4x3)x4 mpi_two_way, grouped, node waveover [screenshot]
xclus, (4x3)x4 mpi_two_way, grouped, HyperTransport waveover [screenshot]
xperf: node-specific time-graphs of counter-based metrics
• xperf can be started by itself:
    % xperf -node nodename
• Or by clicking a node in xclus
• Demonstration program: memr_rate, run so that (a sketch of the access pattern follows):
  • On CPU 0 it hits in L1 cache and gets 12+ GB/sec
  • On CPU 1 it misses L1, hits L2, and gets 2 GB/sec
  • On CPU 2 it starts missing L2 and gets 700 MB/sec
  • On CPU 3 it misses L2 and gets 400 MB/sec
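A minimal sketch of the kind of access pattern behind the demo (hypothetical; the actual memr_rate source and parameters are not shown here). The same read loop runs over working sets sized to fit L1, fit L2, straddle L2, or exceed L2, so each CPU shows a different bandwidth in xperf. The sizes assume roughly 64 KB of L1 and 1 MB of L2; pinning each size to its own CPU (e.g. via sched_setaffinity()) is omitted:

    #include <stdio.h>
    #include <stdlib.h>

    /* Sequential 8-byte reads over a working set of n longs. */
    static long read_loop(const long *buf, size_t n, long iters) {
        long sink = 0;
        for (long it = 0; it < iters; it++)
            for (size_t i = 0; i < n; i++)
                sink += buf[i];
        return sink;
    }

    int main(void) {
        /* One working set per CPU in the demo: L1-resident, L2-resident,
         * around the L2 boundary, and well past L2. */
        size_t sizes[] = { 32u << 10, 256u << 10, 1u << 20, 8u << 20 };
        volatile long keep = 0;        /* defeat dead-code elimination */
        for (int c = 0; c < 4; c++) {
            size_t n = sizes[c] / sizeof(long);
            long *buf = calloc(n, sizeof(long));
            if (!buf) return 1;
            keep += read_loop(buf, n, 4);
            printf("working set %zu KB done\n", sizes[c] >> 10);
            free(buf);
        }
        (void)keep;
        return 0;
    }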
xperf, initially [screenshot]
xperf, hide several graphs [screenshot]
xperf, with tear-off color keys [screenshot]
xperf: HPCPI’s initial presentation [screenshot]
xperf/HPCPI, top image [screenshot]
xperf/HPCPI, top procedure [screenshot]
xperf/HPCPI, top procedure, scrolled [screenshot]
Recap
• HPCPI
  • Sampling profiler
  • High frequency
  • Low overhead (on Itanium)
  • Arbitrary events (auto-placement, auto-multiplexing)
  • Attention to accuracy
  • The ‘label’ feature
• Xtools
  • xclus: cluster-wide utilization visualizer
  • xperf: node-specific time-graphs of counter-based metrics
  • Integrated with HPCPI
[End]
• Questions?
• Discussion?
• Break
• Next: HPCPI documentation review