
pSIGMA: A Symbolic Instrumentation Infrastructure to Guide Memory Analysis



  1. pSIGMA: A Symbolic Instrumentation Infrastructure to Guide Memory Analysis. Simone Sbaraglia, sbaragli@us.ibm.com

  2. AGENDA • Why pSIGMA? • pSIGMA Highlights • pSIGMA Infrastructure • Examples • Next steps • Questions/Answers

  3. Why pSIGMA? • Understanding and tuning the memory-system performance of scientific applications is a critical issue, especially on shared-memory systems • None of the memory analysis tools currently available combines fine-grained information with flexibility (what-if questions, architectural changes) • To implement performance tools, we rely on instrumentation libraries (DPCL, Dyninst) • These libraries are not performance-oriented: • If a probe is invoked on a memory operation, there is no information about the effect of that operation on the memory subsystem (hit/miss, prefetches, invalidations, cache-to-cache transfers, evictions, etc.) • The rules that trigger a probe cannot be described in terms of performance metrics ("invoke the probe on each L1 miss") • These libraries are not symbolic: • If a probe function is invoked on a memory operation, there is no information about which data structure or function the operation refers to • The rules that trigger a probe cannot be described in terms of symbolic entities in the source program ("invoke the probe every time the array A is touched")

  4. pSIGMA Highlights • A symbolic, performance-oriented infrastructure for instrumenting parallel applications • Main features: • Injects user-supplied probes into a parallel application • Operates on the binary, so compiler optimizations remain in effect • Provides symbolic and memory-performance information to the probes • The rules that trigger a probe are specified using symbolic names and memory-related performance metrics • Provides information at the granularity of individual instructions • Activates/deactivates the instrumentation during program execution • Dynamically turns generation/handling of certain events on or off based on other events • Supports MPI and (almost) OpenMP applications • Builtin probes to detect memory bottlenecks on a per-data-structure basis • Builtin parametric memory simulator • Prefetching API for user-defined prefetching algorithms, with builtin IBM algorithms • Simulates data-structure changes (padding)

  5. pSigma Software (architecture diagram): psigmaInst takes the application binary, a library of event-handlers and a user script of the desired events, handlers and machine configuration (drawn from a catalogue of events) and produces an instrumented binary. During execution of the instrumented program, memory simulation runs alongside, and events are delivered to user and pSigma event-handlers through a standardized event description interface, which also allows the instrumentation to be activated/deactivated.

  6. Specification of Events: the event_description file • A set of directives, each specifying an event and the corresponding action • An event is either a predicate on the current operation (e.g., the operation is a load, or it missed in L1) or a predicate on the state of the system (e.g., the L1 miss count exceeded a constant) • Events can use symbolic names and performance metrics • Events can be combined using logical operators • Events can be qualified by a context, when the event is to be considered only for certain data structures or functions • The event description also specifies the parameters of the memory system to simulate

  7. Specification of Events: event_description file

    directive  ::= on event [and/or event and/or …] do action
    event      ::= instrEvent [context] | counter [context] relOp counter [context]
    instrEvent ::= load, store, L1miss, L2hit, malloc, free, f_entry, f_exit, …
    context    ::= for data-structure in function
    counter    ::= L1Misses, L2Misses, Loads, Hits, …
    relOp      ::= greater than, smaller than, equal to

Examples:
• on Load for A in F do call myHandler
• on L1Miss for A in F do call myHandler
• on L1Miss for A in F and (L1Misses for A in F > 1000) do call myHandler

A GUI is provided to build the event_description file, instrument and run the application.

  8. Standardized event description interface The event handlers have access to the following information: • InstructionData: characterizes the instruction that triggered the event • Instruction that triggered the event • Virtual address of the instruction • FileName, FunctionName, lineNum of the instruction • Opcode (load, store, function call, …) • Address and symbol loaded or stored • Current stack pointer • Data loaded or stored • Address and number of bytes allocated or freed • MemoryData: characterizes the memory impact of the instruction • Hit/miss in each level, evicted lines, prefetched lines, … • CumulativeData: represents the current state of the system • InstructionsExecuted, Loads, Stores, Hits, Misses, PrefetchedLines, PrefetchedUsedLines, …
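As a rough C sketch, the three records listed above could look like the structures below. The field names follow the bullets on this slide; the exact types and layout are assumptions, not pSIGMA's actual header.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint64_t    instrAddr;       /* virtual address of the instruction     */
        const char *fileName;        /* source location of the instruction     */
        const char *functionName;
        int         lineNum;
        int         opcode;          /* load, store, function call, ...        */
        uint64_t    dataAddr;        /* address loaded or stored               */
        const char *symbolName;      /* symbol loaded or stored                */
        uint64_t    stackPtr;        /* current stack pointer                  */
        uint64_t    dataValue;       /* data loaded or stored                  */
        uint64_t    allocAddr;       /* address allocated or freed             */
        size_t      allocBytes;      /* number of bytes allocated or freed     */
    } InstructionData;

    typedef struct {
        int hitLevel[4];             /* hit/miss in each cache level           */
        int evictedLines;
        int prefetchedLines;
    } MemoryData;

    typedef struct {
        uint64_t instructionsExecuted, loads, stores, hits, misses;
        uint64_t prefetchedLines, prefetchedUsedLines;
    } CumulativeData;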

  9. Example:

    parameter (nx=512, ny=512)
    real a1(nx,ny), a2(nx,ny), a3(nx,ny)
    do id = 1,nd
      do jd = 1,nd
        do iy = 1,ny
          do ix = 1,nx
            ixy = ix + (iy-1)*nx
            a1(ix,iy) = a1(ix,iy) + c11(ixy,id,jd)*b1(ixy,jd) + c12(ixy,id,jd)*b2(ixy,jd) + …
            a2(ix,iy) = a2(ix,iy) + c11(ixy,id,jd)*b1(ixy,jd) + c12(ixy,id,jd)*b2(ixy,jd) + …
            a3(ix,iy) = a3(ix,iy) + c11(ixy,id,jd)*b1(ixy,jd) + c12(ixy,id,jd)*b2(ixy,jd) + …
          end do
        end do
      end do
    end do

Event Description File:

    on load for a1 do call myHandler
    on store for a1 do call myHandler
    on load for a2 do call myHandler
    on store for a2 do call myHandler
    on load for a3 do call myHandler
    on store for a3 do call myHandler

    nCaches    2
    TLBEntries 256
    TLBAssoc   2-way
    TLBRepl    LRU
    …

  10. Event Handler:
• myHandler is invoked whenever a1, a2 or a3 is touched. It computes the TLB hit ratio for a1, a2 and a3.

    void myHandler(unsigned int *p)
    {
        resultPtr cacheresult;
        char *symbName;

        /* get a pointer to the cache result structure */
        cacheresult = getCacheResult(0);

        /* symbol name of the data structure touched by this access */
        symbName = getSymbolName(p);

        if (!strcmp(symbName, "a1@array1")) {
            Accesses[0]++;
            Hits[0] += cacheresult->tlbResult[0];
        } else if (!strcmp(symbName, "a2@array1")) {
            Accesses[1]++;
            Hits[1] += cacheresult->tlbResult[0];
        } else if (!strcmp(symbName, "a3@array1")) {
            Accesses[2]++;
            Hits[2] += cacheresult->tlbResult[0];
        }
    }
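The handler above only accumulates counters; a minimal end-of-run report could then turn them into the TLB hit ratios shown on the next slide. The sketch below assumes Accesses and Hits are the global counters updated by myHandler; the function name reportTlbRatios and the counter types are illustrative, not part of the pSIGMA API.

    #include <stdio.h>

    extern unsigned long Accesses[3], Hits[3];   /* filled in by myHandler above */

    /* hypothetical helper: print the TLB hit ratio for a1, a2 and a3 */
    void reportTlbRatios(void)
    {
        static const char *names[3] = { "a1", "a2", "a3" };
        int i;
        for (i = 0; i < 3; i++) {
            double ratio = Accesses[i] ? 100.0 * Hits[i] / Accesses[i] : 0.0;
            printf("%s: TLB hit ratio %.2f%% (%lu hits / %lu accesses)\n",
                   names[i], ratio, Hits[i], Accesses[i]);
        }
    }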

  11. Memory Profile for a1, a2, a3

                       L1     L2     TLB
    Data: a1
      Load Hit Ratio:  96%    99%    7%
      Store Hit Ratio: 99%    16%    -
    Data: a2
      Load Hit Ratio:  97%    50%    7%
      Store Hit Ratio: 99%    16%    -
    Data: a3
      Load Hit Ratio:  96%    59%    0.1%
      Store Hit Ratio: 99%    17%    -

Restructured Kernel:

    parameter (nx=512, ny=512)
    real a1(nx+1,ny), a2(nx+1,ny), a3(nx+1,ny)
    …

New Memory Profile for a1, a2, a3

                       L1     L2     TLB
    Data: a1
      Load Hit Ratio:  96%    99%    99%
      Store Hit Ratio: 99%    16%    -
    Data: a2
      Load Hit Ratio:  97%    50%    98%
      Store Hit Ratio: 99%    16%    -
    Data: a3
      Load Hit Ratio:  96%    59%    96%
      Store Hit Ratio: 99%    17%    -

  12. Builtin event handlers: Memory Profile • Executes a functional cache simulation and provides a memory profile • Prefetching for the Power3/Power4 architectures is implemented • Write-back/write-through caches, replacement policies, etc. • The archFile specifies: • Number of cache levels • Cache size, line size, associativity, replacement policy, write policy, prefetching algorithm • Coherency level • Multiple machines can be simulated at once
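As an illustration only, an archFile covering the parameters above might look like the snippet below. The directive names extrapolate from the nCaches/TLB keywords shown in the earlier event_description example; the per-level keywords and the values are hypothetical, not the exact pSIGMA syntax.

    nCaches     2
    L1Size      32768
    L1LineSize  128
    L1Assoc     2-way
    L1Repl      LRU
    L1Write     write-through
    L2Size      1572864
    L2LineSize  128
    L2Assoc     8-way
    L2Repl      LRU
    L2Write     write-back
    Prefetching Power4
    TLBEntries  256
    TLBAssoc    2-way
    TLBRepl     LRU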

  13. architecture parameters source files trace files Repository binary Address Map Simulator sigmaInst Sigma Inst binary SiGMA Memory Analysis Flow Query Output Memory Profile

  14. Memory Profile: provides counters such as hits, misses and cold misses for • each cache level • each function • each data structure • each data structure within each function. Output is sorted by the SIGMA memtime: • memtime = SUM over cache levels i of [ LoadHits(i)*LoadLat(i) + StoreHits(i)*StoreLat(i) ] + #TLBmisses * Lat(TLBmiss) • memtime should track wall time for memory-bound applications
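For concreteness, the memtime computation can be written out as below; the counter arrays and latency parameters are placeholder names that mirror the formula above, not pSIGMA data structures.

    /* memtime = sum over cache levels of (load hits * load latency +
       store hits * store latency), plus the TLB miss penalty */
    double memtime(int nLevels,
                   const unsigned long *loadHits,  const double *loadLat,
                   const unsigned long *storeHits, const double *storeLat,
                   unsigned long tlbMisses, double tlbMissLat)
    {
        double t = 0.0;
        int i;
        for (i = 0; i < nLevels; i++)
            t += loadHits[i] * loadLat[i] + storeHits[i] * storeLat[i];
        return t + tlbMisses * tlbMissLat;
    }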

  15. Memory Profile Output

                         L1              L2            L3           TLB             MEM
    FUNCTION: calc1 (memtime = 0.0050)
    Load Acc/Miss/Cold   522819/2252/1   2252/345/0    345/0/0      784419/126/0    0/-/-
    Load Acc/Miss Ratio  232             6             -            6225            -
    Load Hit Ratio       99.57%          84.68%        100.00%      99.98%          -
    Est. Load Latency    0.0008 sec      0.0000 sec    0.0000 sec   0.0001 sec      0.0000 sec
    Load Traffic         -               238.38 Kb     43.12 Kb     -               0.00 Kb
    ………
    FUNCTION: calc2 (memtime = 0.0042)
    Load Acc/Miss/Cold   622230/2631/0   2631/1661/0   1661/0/0     814269/94/0     0/-/-
    Load Acc/Miss Ratio  236             1             -            8662            -
    Load Hit Ratio       99.58%          36.87%        100.00%      99.99%          -
    Est. Load Latency    0.0010 sec      0.0000 sec    0.0001 sec   0.0001 sec      0.0000 sec
    Load Traffic         -               121.25 Kb     207.62 Kb    -               0.00 Kb
    ………

                         L1              L2            L3           TLB             MEM
    DATA: u (memtime = 0.0012)
    Load Acc/Miss/Cold   167710/708/0    708/317/0     317/0/0      216097/31/0     0/-/-
    Load Acc/Miss Ratio  236             2             -            6970            -
    Load Hit Ratio       99.58%          55.23%        100.00%      99.99%          -
    Est. Load Latency    0.0003 sec      0.0000 sec    0.0000 sec   0.0000 sec      0.0000 sec
    Load Traffic         -               48.88 Kb      39.62 Kb     -               0.00 Kb
    ……….
    DATA: v (memtime = 0.0012)
    Load Acc/Miss/Cold   167710/721/0    721/316/0     316/0/0      216097/31/0     0/-/-
    Load Acc/Miss Ratio  232             2             -            6970            -
    Load Hit Ratio       99.57%          56.17%        100.00%      99.99%          -
    Est. Load Latency    0.0003 sec      0.0000 sec    0.0000 sec   0.0000 sec      0.0000 sec
    Load Traffic         -               50.62 Kb      39.50 Kb     -               0.00 Kb
    ……….

  16. Memory Profile Viewer – Data Structure Focus

  17. SIGMA Repository (diagram): output can be produced from any subspace of a 3D space whose dimensions are Metrics (TLB misses, cache misses, hit ratio, …), Data Structure (selected from the list of data structures in the .addr file) and Control Structure (file, function, code segment).

  18. Visualization: Repository and Query Language • Build tables and lists • ASCII and bar-chart output • Support for arithmetic operators (+ - * / ./) • Compute derived metrics

  19. Partial Instrumentation
• Statically select functions to instrument:

    sigmaInst -d -dfunc f1,f2,…,fn appbin

• Dynamically select code sections to instrument:

    #include "signal_sigma.h"

    for (lp = 1; lp < NIT; lp++) {
        /* start sigma */
        if ((lp == 2) || (lp == 7)) {
            signal_sigma(TRACE_ON, lp);
        }
        for (i = 0; i < n; i++) {
            for (k = 0; k < n; k++) {
                u[n * i + k] = 0.0;
                f[n * i + k] = rhs;
            }
        }
        /* stop sigma */
        if ((lp == 2) || (lp == 7)) {
            signal_sigma(TRACE_OFF, lp);
        }
    }

  20. Unless the entire application is instrumented, there will be some inaccuracy in the results • Use dynamic selection at phase or loop boundaries, or where the cache can be assumed 'cold' • Use sigma signals to reset the cache • Ongoing research to find optimal sampling techniques • Ongoing research on automatic sampling

  21. Builtin Event Handlers: Trace Generation • Generates a compressed memory trace • The cache simulation can run on the compressed trace • Control-flow-based trace compression (patented): • Compresses the addresses produced by each instruction separately • Captures strides, repetitions, nested patterns • Compression is performed online • Compresses all trace events, not only addresses
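To make "capture strides and repetitions" concrete, the sketch below folds the address stream of a single instruction into (base, stride, count) records online. It is only an illustration of the idea under those assumptions; it is not IBM's patented compression scheme and does not handle nested patterns.

    #include <stdint.h>

    typedef struct { uint64_t base; int64_t stride; uint64_t count; } StrideRec;
    typedef struct { StrideRec cur; int started; } StrideState;   /* one per instruction */

    /* Feed one address; returns 1 and fills *out when a finished record
       must be emitted, 0 while the current pattern is still growing. */
    int strideCompress(StrideState *s, uint64_t addr, StrideRec *out)
    {
        if (!s->started) {                       /* first address: start a record */
            s->cur.base = addr; s->cur.stride = 0; s->cur.count = 1;
            s->started = 1;
            return 0;
        }
        int64_t last = (int64_t)(s->cur.base + s->cur.stride * (s->cur.count - 1));
        int64_t d = (int64_t)addr - last;
        if (s->cur.count == 1) {                 /* second address fixes the stride */
            s->cur.stride = d; s->cur.count = 2;
            return 0;
        }
        if (d == s->cur.stride) {                /* same stride: extend the run */
            s->cur.count++;
            return 0;
        }
        *out = s->cur;                           /* pattern broken: emit and restart */
        s->cur.base = addr; s->cur.stride = 0; s->cur.count = 1;
        return 1;
    }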

  22. Trace Compression Rate

  23. Builtin Event Handler: Detecting False Sharing (work in progress) • Target: memory performance of shared-memory applications (infinite cache) • Detect cache misses due to invalidations • Detect false-sharing misses • Measure cache-to-cache transfers • Provide suggestions for code rearrangement to minimize false-sharing misses and/or cache invalidations and/or maximize cache-to-cache transfers • Remapping of data structures • Remapping of computation • Changing the scheduling policy
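As a toy illustration of the false-sharing check, consider a line-based simulation that remembers which bytes of a cache line the last writing thread touched: when another thread later misses on that line because of the invalidation, the miss counts as false sharing if its bytes do not overlap the writer's bytes. The names, the 128-byte line size and the structure below are illustrative only, not pSIGMA's builtin handler.

    #include <stdint.h>

    #define LINE_SIZE 128                     /* assumed cache-line size */

    typedef struct {
        uint64_t tag;                         /* line address                    */
        int      lastWriter;                  /* thread id of the last writer    */
        uint8_t  written[LINE_SIZE];          /* bytes written by that thread    */
    } LineInfo;

    /* Classify an invalidation miss by thread `tid` on `line`, accessing
       `size` bytes at `offset`: 1 = false sharing, 0 = true sharing. */
    int isFalseSharingMiss(const LineInfo *line, int tid, int offset, int size)
    {
        int i;
        if (tid == line->lastWriter)
            return 0;                         /* same thread: not a sharing miss */
        for (i = offset; i < offset + size && i < LINE_SIZE; i++)
            if (line->written[i])
                return 0;                     /* overlaps the writer's data      */
        return 1;                             /* disjoint bytes in the same line */
    }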

  24. pSigma and the New HPC Toolkit (4Q2005) • Look at all aspects of performance (communication, processor, memory, sharing of data) from within a single interface • Operate on the binary (no source code modification) • Provide information in terms of symbolic names • Centralized GUI for instrumentation and analysis • Dynamic instrumentation capabilities • Graphics capabilities (bar charts, plots, etc.) • Simultaneous instrumentation for all tools (one run!) • Query capabilities: compute derived metrics and plot them • Selective instrumentation of MPI functions

  25. pSigma in the new HPC toolkit (diagram): the application binary goes through pSigma binary instrumentation and the instrumented binary is executed; the PeekPerf GUI ties together the memory, communication, CPU, shared-memory and I/O profilers with visualization, query and analysis capabilities.

  26. Next Steps • Linux port • Complete the support for shared-memory applications • More handlers (sharing of data, data flow, etc.) • Support for 64-bit applications • Support for C++ codes

  27. Questions / Comments
