pSIGMA: A Symbolic Instrumentation Infrastructure to Guide Memory Analysis
Simone Sbaraglia, sbaragli@us.ibm.com
AGENDA
• Why pSIGMA?
• pSIGMA Highlights
• pSIGMA Infrastructure
• Examples
• Next steps
• Questions/Answers
Why pSIGMA?
• Understanding and tuning the memory-system performance of scientific applications is a critical issue, especially on shared-memory systems
• None of the currently available memory analysis tools combines fine-grained information with flexibility (what-if questions and architectural changes)
• To implement performance tools, we rely on instrumentation libraries (DPCL, Dyninst)
• These libraries are not performance-oriented:
  • If a probe is invoked on a memory operation, there is no information about the effect of that operation on the memory subsystem (hit/miss, prefetches, invalidations, cache-to-cache transfers, evictions, etc.)
  • The rules that trigger a probe cannot be described in terms of performance metrics ("invoke the probe on each L1 miss")
• These libraries are not symbolic:
  • If a probe function is invoked on a memory operation, there is no information about which data structure or function the operation refers to
  • The rules that trigger a probe cannot be described in terms of symbolic entities in the source program ("invoke the probe every time the array A is touched")
pSIGMA Highlights
• Symbolic, performance-oriented infrastructure for instrumenting parallel applications
• Main features:
  • Injects user-supplied probes into a parallel application
  • Operates on the binary, so compiler optimizations are preserved
  • Provides symbolic and memory-performance information to the probes
  • The rules that trigger a probe are specified using symbolic names and memory-related performance metrics
  • Provides information at the granularity of individual instructions
  • Instrumentation can be activated/deactivated during program execution
  • Generation/handling of certain events can be turned on/off dynamically based on other events
  • Supports MPI and (almost) OpenMP applications
  • Builtin probes to detect memory bottlenecks on a per-data-structure basis
  • Builtin parametric memory simulator
  • Prefetching API for user-defined prefetching algorithms, with builtin IBM algorithms
  • Simulates data-structure changes (padding)
pSigma Software (diagram): the user writes a script of the desired events, handlers, and machine configuration; psigmaInst combines the application binary with a library of event handlers (pSigma and user-supplied) to produce an instrumented binary. During execution, the instrumented program performs memory simulation and dispatches events from a catalogue to the handlers through a standardized event-description interface; instrumentation can be activated/deactivated at run time.
Specification of Events: the event_description file
• A set of directives, each specifying an event and the corresponding action
• An event is either a predicate on the current operation (e.g., the operation is a load, or was an L1 miss) or a predicate on the state of the system (the L1 miss count exceeded a constant)
• Events can use symbolic names and performance metrics
• Events can be combined using logical operators
• Events can be qualified by a context, if the event is to be considered only for certain data structures/functions
• The event description also specifies the parameters of the memory system to simulate
Specification of Events: event_description file grammar

directive  ::= on event [and/or event and/or …] do action
event      ::= instrEvent [context] | counter [context] relOp counter [context]
instrEvent ::= load | store | L1miss | L2hit | malloc | free | f_entry | f_exit | …
context    ::= for data-structure in function
counter    ::= L1Misses | L2Misses | Loads | Hits | …
relOp      ::= greater than | smaller than | equal to

Examples:
• on Load for A in F do call myHandler
• on L1Miss for A in F do call myHandler
• on L1Miss for A in F and (L1Misses for A in F > 1000) do call myHandler

A GUI is provided to build the event_description file, instrument, and run the application
Standardized event description interface
The event handlers have access to the following information:
• InstructionData: characterizes the instruction that triggered the event
  • Instruction that triggered the event
  • Virtual address of the instruction
  • FileName, FunctionName, lineNum of the instruction
  • Opcode (load, store, function call, …)
  • Address and symbol loaded or stored
  • Current stack pointer
  • Data loaded or stored
  • Address and number of bytes allocated or freed
• MemoryData: characterizes the memory impact of the instruction
  • Hit/miss at each level, evicted lines, prefetched lines, …
• CumulativeData: represents the current state of the system
  • InstructionsExecuted, Loads, Stores, Hits, Misses, PrefetchedLines, PrefetchedUsedLines, …
Example:

      parameter (nx=512, ny=512)
      real a1(nx,ny), a2(nx,ny), a3(nx,ny)

      do id = 1, nd
        do jd = 1, nd
          do iy = 1, ny
            do ix = 1, nx
              ixy = ix + (iy-1)*nx
              a1(ix,iy) = a1(ix,iy) + c11(ixy,id,jd)*b1(ixy,jd)
     &                  + c12(ixy,id,jd)*b2(ixy,jd) + …
              a2(ix,iy) = a2(ix,iy) + c11(ixy,id,jd)*b1(ixy,jd)
     &                  + c12(ixy,id,jd)*b2(ixy,jd) + …
              a3(ix,iy) = a3(ix,iy) + c11(ixy,id,jd)*b1(ixy,jd)
     &                  + c12(ixy,id,jd)*b2(ixy,jd) + …
            end do
          end do
        end do
      end do

Event Description File:

on load for a1 do call myHandler
on store for a1 do call myHandler
on load for a2 do call myHandler
on store for a2 do call myHandler
on load for a3 do call myHandler
on store for a3 do call myHandler

nCaches    2
TLBEntries 256
TLBAssoc   2-way
TLBRepl    LRU
…
Event Handler:
myHandler is invoked when a1, a2, a3 are touched. It computes the TLB hit ratio for a1, a2, a3.

void myHandler(unsigned int *p)
{
    resultPtr cacheresult;
    char *symbName;

    /* get a pointer to the cache result structure */
    cacheresult = getCacheResult(0);

    /* symbol name */
    symbName = getSymbolName(p);

    if (!strcmp(symbName, "a1@array1")) {
        Accesses[0]++;
        Hits[0] += cacheresult->tlbResult[0];
    } else if (!strcmp(symbName, "a2@array1")) {
        Accesses[1]++;
        Hits[1] += cacheresult->tlbResult[0];
    } else if (!strcmp(symbName, "a3@array1")) {
        Accesses[2]++;
        Hits[2] += cacheresult->tlbResult[0];
    }
}
Memory Profile for a1, a2, a3

                     L1     L2     TLB
Data: a1
  Load Hit Ratio:    96%    99%    7%
  Store Hit Ratio:   99%    16%    -
Data: a2
  Load Hit Ratio:    97%    50%    7%
  Store Hit Ratio:   99%    16%    -
Data: a3
  Load Hit Ratio:    96%    59%    0.1%
  Store Hit Ratio:   99%    17%    -

Restructured Kernel (leading dimension padded):

      parameter (nx=512, ny=512)
      real a1(nx+1,ny), a2(nx+1,ny), a3(nx+1,ny)
      …

New Memory Profile for a1, a2, a3

                     L1     L2     TLB
Data: a1
  Load Hit Ratio:    96%    99%    99%
  Store Hit Ratio:   99%    16%    -
Data: a2
  Load Hit Ratio:    97%    50%    98%
  Store Hit Ratio:   99%    16%    -
Data: a3
  Load Hit Ratio:    96%    59%    96%
  Store Hit Ratio:   99%    17%    -
Builtin event handlers: Memory Profile
• Executes functional cache simulation and provides a memory profile
• Power3/Power4 prefetching implemented
• Write-back/write-through caches, replacement policies, etc.
• The archFile specifies:
  • Number of cache levels
  • Cache size, line size, associativity, replacement policy, write policy, prefetching algorithm
  • Coherency level
• Multiple machines can be simulated at once
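An archFile for one simulated machine might look like the following. The TLB parameters are taken from the earlier example slide; the cache keywords are assumptions sketched from the parameter list above, not the documented syntax:

```
nCaches     2
L1Size      32K
L1LineSize  128
L1Assoc     2-way
L1Repl      LRU
L1Write     write-through
L2Size      1M
L2LineSize  128
L2Assoc     4-way
TLBEntries  256
TLBAssoc    2-way
TLBRepl     LRU
Prefetch    power4
```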
SiGMA Memory Analysis Flow (diagram): sigmaInst takes the application binary, the source files, and the architecture parameters and produces an instrumented binary; running it generates trace files and an address map that feed the simulator; results land in the repository, from which queries produce the memory profile output.
Memory Profile: provides counters such as hits, misses, and cold misses for
• each cache level
• each function
• each data structure
• each data structure within each function

Output sorted by the SIGMA memtime:

memtime = SUM_i ( LoadHits(i)*LoadLat(i) + StoreHits(i)*StoreLat(i) ) + TLBMisses * Lat(TLBmiss)

memtime should track wall time for memory-bound applications
Memory Profile Output

FUNCTION: calc1 (memtime = 0.0050)
                        L1             L2           L3          TLB           MEM
  Load Acc/Miss/Cold    522819/2252/1  2252/345/0   345/0/0     784419/126/0  0/-/-
  Load Acc/Miss Ratio   232            6            -           6225          -
  Load Hit Ratio        99.57%         84.68%       100.00%     99.98%        -
  Est. Load Latency     0.0008 sec     0.0000 sec   0.0000 sec  0.0001 sec    0.0000 sec
  Load Traffic          -              238.38 Kb    43.12 Kb    -             0.00 Kb
  ………
FUNCTION: calc2 (memtime = 0.0042)
  Load Acc/Miss/Cold    622230/2631/0  2631/1661/0  1661/0/0    814269/94/0   0/-/-
  Load Acc/Miss Ratio   236            1            -           8662          -
  Load Hit Ratio        99.58%         36.87%       100.00%     99.99%        -
  Est. Load Latency     0.0010 sec     0.0000 sec   0.0001 sec  0.0001 sec    0.0000 sec
  Load Traffic          -              121.25 Kb    207.62 Kb   -             0.00 Kb
  ………
DATA: u (memtime = 0.0012)
  Load Acc/Miss/Cold    167710/708/0   708/317/0    317/0/0     216097/31/0   0/-/-
  Load Acc/Miss Ratio   236            2            -           6970          -
  Load Hit Ratio        99.58%         55.23%       100.00%     99.99%        -
  Est. Load Latency     0.0003 sec     0.0000 sec   0.0000 sec  0.0000 sec    0.0000 sec
  Load Traffic          -              48.88 Kb     39.62 Kb    -             0.00 Kb
  ……….
DATA: v (memtime = 0.0012)
  Load Acc/Miss/Cold    167710/721/0   721/316/0    316/0/0     216097/31/0   0/-/-
  Load Acc/Miss Ratio   232            2            -           6970          -
  Load Hit Ratio        99.57%         56.17%       100.00%     99.99%        -
  Est. Load Latency     0.0003 sec     0.0000 sec   0.0000 sec  0.0000 sec    0.0000 sec
  Load Traffic          -              50.62 Kb     39.50 Kb    -             0.00 Kb
  ……….
SIGMA Repository query GUI (diagram): select a metric (TLB misses, cache misses, hit ratio, …), a data structure from the list in the .addr file, and a control structure (file, function, code segment); output can be produced from any subspace of this 3D space.
Visualization: Repository and Query Language
• Build tables and lists
• ASCII and bar-chart output
• Supports arithmetic operators (+ - * / ./)
• Computes derived metrics
Partial Instrumentation
• Statically select functions to instrument:
    sigmaInst -d -dfunc f1,f2,…,fn appbin
• Dynamically select code sections to instrument:

#include "signal_sigma.h"

for (lp = 1; lp < NIT; lp++) {
    /* start sigma */
    if ((lp == 2) || (lp == 7)) {
        signal_sigma(TRACE_ON, lp);
    }
    for (i = 0; i < n; i++) {
        for (k = 0; k < n; k++) {
            u[n * i + k] = 0.0;
            f[n * i + k] = rhs;
        }
    }
    /* stop sigma */
    if ((lp == 2) || (lp == 7)) {
        signal_sigma(TRACE_OFF, lp);
    }
}
• Unless the entire application is instrumented, there will be some inaccuracy in the results
• Use dynamic selection at phase or loop boundaries, or when the cache can be assumed 'cold'
• Use sigma signals to reset the cache
• Ongoing research to find optimal sampling techniques
• Ongoing research on automatic sampling
Builtin Event Handlers: Trace Generation
• Generates a compressed memory trace
• The cache simulation can run on the compressed trace
• Control-flow-based trace compression (patented):
  • Compresses the addresses produced by each instruction separately
  • Captures strides, repetitions, nested patterns
  • Compression is performed online
  • Compresses all trace events, not only addresses
Builtin Event Handler: Detecting False Sharing (work in progress)
Target: memory performance of shared-memory applications (infinite cache)
• Detect cache misses due to invalidation
• Detect false-sharing misses
• Measure cache-to-cache transfers
• Provide suggestions for code rearrangement to minimize false-sharing misses and/or minimize cache invalidations and/or maximize cache-to-cache transfers:
  • Remapping of data structures
  • Remapping of computation
  • Changing the scheduling policy
pSigma and the New HPC Toolkit (4Q2005)
• Look at all aspects of performance (communication, processor, memory, sharing of data) from within a single interface
• Operate on the binary (no source code modification)
• Provide information in terms of symbolic names
• Centralized GUI for instrumentation and analysis
• Dynamic instrumentation capabilities
• Graphics capabilities (bar charts, plots, etc.)
• Simultaneous instrumentation for all tools (one run!)
• Query capabilities: compute derived metrics and plot them
• Selective instrumentation of MPI functions
pSigma in the new HPC toolkit (diagram): the application binary is instrumented through pSigma under the PeekPerf GUI; execution of the instrumented binary feeds the memory, communication, CPU, shared-memory, and I/O profilers, whose output flows into visualization and query analysis.
Next Steps
• Linux porting
• Complete the support for shared-memory applications
• More handlers (sharing of data, data flow, etc.)
• Support 64-bit applications
• Support C++ codes