Performance Modeling and Analysis with PEBIL
Michael Laurenzano, Ananta Tiwari, Laura Carrington
Performance Modeling and Characterization (PMaC) Laboratory
San Diego Supercomputer Center

Outline
• Motivation
  • Performance modeling in High Performance Computing (HPC)
  • How does binary instrumentation fit in?
• PEBIL = PMaC's Efficient Binary Instrumentation for Linux/x86
  • Binary instrumentation overview
  • Use case: memory tracing
  • Use case: function profiling

PMaC HPC Performance Models
• Performance model – a calculable expression of the runtime, efficiency, memory use, etc. of an HPC program on some machine
• Application signature – detailed summaries of the fundamental operations to be carried out by the application
  • Captures the requirements of the HPC application
  • Collected via trace tools
• Machine profile – characterizations of the rates at which a machine can carry out fundamental operations
  • Captures the characteristics of the HPC target system
  • Measured or projected via simple benchmarks on 1-2 nodes of the system
• Convolution methods map application signatures to machine profiles to produce a performance prediction: the performance of the application on the target system

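To make the idea of convolving a signature with a profile concrete, here is a deliberately simplified sketch. It assumes the signature reduces to an operation count per class and the profile to a sustained rate per class, and that the classes do not overlap in time. The names `OperationClass` and `predict_runtime` are hypothetical, and this is not claimed to be PMaC's actual convolution method.

```cpp
#include <cassert>
#include <vector>

// Hypothetical pairing of one class of fundamental operation with
// the rate at which the target machine sustains it.
struct OperationClass {
    double count;         // from the application signature: operations performed
    double rate_per_sec;  // from the machine profile: operations per second
};

// Predicted runtime in seconds, assuming the operation classes
// execute one after another with no overlap.
double predict_runtime(const std::vector<OperationClass>& classes) {
    double seconds = 0.0;
    for (const OperationClass& c : classes) {
        assert(c.rate_per_sec > 0.0);  // a rate of zero would make the model meaningless
        seconds += c.count / c.rate_per_sec;
    }
    return seconds;
}
```

For example, 2e9 memory operations at 1e9 per second plus 1e9 floating-point operations at 2e9 per second would give a prediction of 2.5 seconds under these assumptions.
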
Application Signature
• Application signature – fundamental operations used by the application
  • Requires low-level details of the application
  • Details attached to specific structures within the application
• Measurement? (e.g., timers or hardware counters)
  • Measuring at fine grain, transparently, and with reasonable overhead is HARD
• Use binary instrumentation

Binary Instrumentation
• Instrumentation – inserting extra code into a program, usually to inspect some aspect of its behavior
• Binary instrumentation – instrumentation of the compiled object/executable

    void incrementby(int& n, int c){
        counter++;   // instrumentation code
        n += c;
    }

The Case for Binary Instrumentation
• Low-level details of application
  • Program is in its binary form
• Compilers transform and optimize
  • Basic program structure
  • Memory access
  • Vectorization
  • Data dependencies
• The executable might be all we have
• Easy to tie details to application structures

    // as written in the source
    int identity(int n){
        int c = 0;
        while (c < n)
            c++;
        return c;
    }

    // what an optimizing compiler may reduce it to (equivalent for n >= 0)
    int identity(int n){
        return n;
    }

Runtime Overhead is a Big Deal
• PEBIL… the E stands for Efficient
• We want to model real HPC applications
  • Relatively long runtimes: minutes, hours, days?
  • Lots of CPUs: O(10^5) in the largest supercomputers
• High slowdowns create problems
  • Too long for the queue
  • Unsympathetic administrators/managers
  • Inconvenience
  • Unnecessary use of resources
• Mitigate problems by minimizing runtime overhead

Example Use Cases
• Memory address trace collection
  • Capture all application loads/stores
  • Use a buffer, batch process them
  • Very widely used
    • Performance/energy models (e.g., PMaC)
    • Cache design
    • Memory bug detection
  • For efficiency, this is often used with sampling
• Function/loop measurement
  • Insert calls to measurement routines around functions/loops
  • TAU uses this feature

PEBIL Design
• Efficiency is priority #1
• Designed around a few use cases
  • Execution counting
  • Memory tracing
• Static binary rewriter
  • Write an instrumented, runnable executable to disk
  • Keep original behavior intact
  • Gather information as a side effect
• Instrument once, run many times
  • No cost for inserting instrumentation at runtime
  • Code patching (not just-in-time compiled!)

How Binary Instrumentation Works (Basic block counting)

Original:

    0000c000 <foo>:
    c000: 48 89 7d f8       mov %rdi,-0x8(%rbp)    # Basic Block 1
    c004: 5e                pop %rsi               # Basic Block 2
    c005: 75 f8             jne 0xc004
    c007: c9                leaveq                 # Basic Block 3
    c008: c3                retq

Instrumented (a jump to instrumentation code is placed at the start of each basic block; later instructions shift and the branch target is updated):

    0000d000 <foo>:
    d000: e9 de ad be ef    jmp 0x1000             # to instrumentation
    d005: 48 89 7d f8       mov %rdi,-0x8(%rbp)
    d009: e9 de ad be ef    jmp 0x1010             # to instrumentation
    d00e: 5e                pop %rsi
    d00f: 75 00 00 00 f8    jne 0xd009
    d014: e9 de ad be ef    jmp 0x1020             # to instrumentation
    d019: c9                leaveq
    d01a: c3                retq

Instrumentation code (at 0x1000, 0x1010, 0x1020):

    // do stuff
    // jump back

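The effect of basic block counting can be pictured at the source level. The sketch below is only an analogy: a binary rewriter patches jumps into the executable rather than editing source, and the names `block_counts` and `count_block` are hypothetical. The function mirrors the three-block shape of `foo` above (entry, loop, exit).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical counters, one slot per basic block of foo().
std::array<std::uint64_t, 3> block_counts{};

// Stand-in for the "do stuff" part of an instrumentation snippet:
// bump the counter that belongs to the block being entered.
void count_block(std::size_t id) {
    assert(id < block_counts.size());
    ++block_counts[id];
}

// Source-level stand-in for a function with an entry block,
// a loop block, and an exit block.
int foo(int n) {
    count_block(0);              // block 1: entry, runs once per call
    int iterations = 0;
    do {
        count_block(1);          // block 2: loop body, runs once per iteration
        ++iterations;
    } while (iterations < n);
    count_block(2);              // block 3: exit, runs once per call
    return iterations;
}
```

After a call such as `foo(4)`, the counters would read 1, 4, 1, which is exactly the per-block execution profile that the patched jumps are meant to collect.
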
Use case: Memory Address Collection
• Collect the address of every load/store issued by the application
• Put addresses in a buffer, process addresses in batch
  • Fewer function calls
  • Less cache pollution

Application loop:

    for (i = 0; i < n; i++){
        A[i] = B[i];
    }

Instrumentation inserted into the loop body:

    if (cur + 2 > BUF_SIZE)
        clear_buf();               // process the batch and reset cur
    buffer[cur + 0] = &(A[i]);
    buffer[cur + 1] = &(B[i]);
    cur += 2;                      // advance past the two new entries

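The buffering scheme can be sketched as ordinary C++. This is a source-level illustration under stated assumptions, not PEBIL's generated code: `AddressBuffer`, `record`, `copy_traced`, and the capacity of 1024 are all hypothetical, and the "batch processing" here merely counts addresses where a real tool would run a cache simulator or similar analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t BUF_SIZE = 1024;  // assumed capacity, in addresses

struct AddressBuffer {
    std::uintptr_t entries[BUF_SIZE];
    std::size_t cur = 0;          // next free slot
    std::uint64_t processed = 0;  // stand-in for real batch analysis

    // Process everything collected so far in one batch, then reset.
    void clear_buf() {
        processed += cur;
        cur = 0;
    }

    // Append one address, flushing first if the buffer is full.
    void record(const void* addr) {
        if (cur + 1 > BUF_SIZE) clear_buf();
        entries[cur++] = reinterpret_cast<std::uintptr_t>(addr);
    }
};

// The A[i] = B[i] loop with its load and store addresses recorded.
void copy_traced(std::vector<int>& A, const std::vector<int>& B, AddressBuffer& buf) {
    assert(A.size() == B.size());
    for (std::size_t i = 0; i < A.size(); i++) {
        buf.record(&B[i]);  // address of the load
        buf.record(&A[i]);  // address of the store
        A[i] = B[i];
    }
}
```

The analysis routine runs once per full buffer rather than once per memory access, which is where the savings in function calls and cache pollution come from. A final `clear_buf()` after the loop drains whatever remains.
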
Optimization – Sampling w/ Instrumentation Point Disabling
• Processing addresses is usually expensive
  • Cache simulation (multiple caches), locality analysis, address stream compression
• Use interval-based sampling
  • Process the first X of every Y addresses (Y >= X)
  • Obvious result: reduced processing overhead
  • Not so obvious: reduced collection overhead, because address collection can be skipped outside the sampled regions
• Different approaches
  • PEBIL – swap instrumentation with nops
    • Very lightweight, limited functionality
  • PIN / Dyninst – arbitrarily remove, re-instrument
    • Heavyweight, rich functionality

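The "first X of every Y" rule is easy to state in code. The class below is a hypothetical sketch of the sampling decision only; it does not show instrumentation point disabling, which happens in the binary (swapping instrumentation with nops) and cannot be expressed at this level. The name `IntervalSampler` and its interface are assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Interval-based sampler: of every `period` (Y) addresses seen,
// only the first `sample` (X) are processed.
class IntervalSampler {
public:
    IntervalSampler(std::uint64_t sample, std::uint64_t period)
        : sample_(sample), period_(period) {
        assert(period > 0 && period >= sample);  // the Y >= X requirement
    }

    // Reports whether the address at the current position falls in
    // the sampled part of the interval, then advances the position.
    bool should_process() {
        bool in_sample = position_ < sample_;
        position_ = (position_ + 1) % period_;
        return in_sample;
    }

private:
    std::uint64_t sample_;
    std::uint64_t period_;
    std::uint64_t position_ = 0;
};
```

With `IntervalSampler s(10, 100)`, one address in ten is processed. In this sketch every address still pays for the `should_process()` check; removing even that cost outside the sampled region is the point of disabling the instrumentation itself.
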
[Figure: Memory Trace Overhead w/ Sampling – OpenMP NAS Parallel Benchmarks (8 threads)]

Use case: Inserting Profiling Routines
• Insert calls to timers/tracking code around functions and loops
• Want low overhead, especially where no instrumentation is introduced
  • Small overhead = accurate profile
• "Throttle" instrumentation points that are called too frequently
  • Don't just ignore them, disable them!
• Collaboration w/ Tuning and Analysis Utilities (TAU) project

    void compute(){                    // function id 0
        for (i = 0; i < n; i++){       // loop id 1
            A[i] = B[i];
        }
        for (i = 0; i < n; i++){       // loop id 2
            A[i] += C[i];
        }
    }

Inserted calls, in execution order:

    profile_begin(0);
    profile_begin(1);
    profile_end(1);
    profile_begin(2);
    profile_end(2);
    profile_end(0);

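Throttling can be sketched with a pair of measurement routines that stop recording once a point has been hit too often. Everything here is an assumption made for illustration: the `ProfilePoint` layout, the threshold of 1000 calls, and these particular bodies for `profile_begin`/`profile_end` are not TAU's or PEBIL's implementation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

using Clock = std::chrono::steady_clock;

struct ProfilePoint {
    std::uint64_t calls = 0;   // completed begin/end pairs while enabled
    bool enabled = true;       // cleared once the point is throttled
    Clock::time_point start{};
    Clock::duration total{};   // accumulated time inside the region
};

constexpr std::uint64_t THROTTLE_CALLS = 1000;  // assumed threshold
std::vector<ProfilePoint> points(3);            // ids 0..2, as in compute()

void profile_begin(std::size_t id) {
    assert(id < points.size());
    ProfilePoint& p = points[id];
    if (!p.enabled) return;    // throttled: skip the timer call
    p.start = Clock::now();
}

void profile_end(std::size_t id) {
    assert(id < points.size());
    ProfilePoint& p = points[id];
    if (!p.enabled) return;
    p.total += Clock::now() - p.start;
    if (++p.calls >= THROTTLE_CALLS) p.enabled = false;  // called too often
}
```

Note that a throttled point in this sketch still costs a function call and a flag check, which is the "just ignore them" behavior. Disabling the point in the binary, so that the call is never made, avoids that residual overhead.
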
Contact Info
• Download: https://github.com/mlaurenzano/PEBIL
• Email: michaell@sdsc.edu, lcarring@sdsc.edu