Loop-Based Automated Performance Analysis
Eli Collins
eli@cs.wisc.edu
Computer Sciences Department
University of Wisconsin-Madison
Madison, WI 53706 USA
Motivation
• Automated performance analysis
• Ongoing work: APART
• Previous work: Callgraph, Deepstart
  • Faster, more efficient searching
• This work: better localize performance problems
  • Report performance data at finer granularity
Motivation (Cont.)
• Function granularity works well
  • Doesn't overload the user w/ fine-grained data
• Why is a function a bottleneck?
  • Large function w/ multiple bottlenecks
  • Small function called repeatedly in a loop
• Idea: search inside bottleneck functions
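The two bottleneck shapes above can be sketched in C (all names here — compute, cheap_helper — are hypothetical): function-level profiling blames compute as a whole, but only loop-level data shows which loop is hot, or that a small helper called inside it is the real cost.

```c
#include <assert.h>

/* Tiny function: fast in isolation, expensive when called n times. */
long cheap_helper(long x) {
    return x * x;
}

long compute(long n) {
    long sum = 0;
    /* loop 1: the actual hot spot -- calls cheap_helper n times */
    for (long i = 0; i < n; i++)
        sum += cheap_helper(i);
    /* loop 2: negligible by comparison */
    for (long i = 0; i < 10; i++)
        sum += i;
    return sum;
}
```

A function-granularity tool reports only "compute is a bottleneck"; searching inside it separates loop 1 from loop 2.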
Performance Consultant (PC)
• For code, the PC searches the callgraph
  • Breadth-first search
  • Prunes non-bottleneck functions
• Introduce a new callgraph level that
  • Is a logical unit of computation
  • Improves granularity
  • Partitions functions for searching
  • Keeps the search space manageable (scalability)
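A toy sketch of the pruned breadth-first search described above, assuming a hypothetical adjacency-array callgraph; the hot[] flag stands in for the PC's instrumented bottleneck test:

```c
#include <assert.h>

#define MAXN 8

int nchild[MAXN];        /* number of callees per node        */
int child[MAXN][MAXN];   /* callee indices per node           */
int hot[MAXN];           /* 1 if measured cost exceeds threshold */

/* BFS from root; expand only bottleneck nodes, prune the rest.
 * Records bottlenecks in found[] and returns how many were found. */
int search(int root, int found[], int cap) {
    int q[MAXN * MAXN], head = 0, tail = 0, n = 0;
    q[tail++] = root;
    while (head < tail) {
        int f = q[head++];
        if (!hot[f]) continue;            /* prune: don't expand callees */
        if (n < cap) found[n++] = f;      /* report this bottleneck      */
        for (int i = 0; i < nchild[f]; i++)
            q[tail++] = child[f][i];      /* enqueue callees             */
    }
    return n;
}
```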
Loops in the Callgraph
[Figure: two callgraphs rooted at main, w/ functions f1 and f2 — a plain callgraph vs. a callgraph w/ loops, where a loop node (loop 1) is inserted between a function and the calls made inside that loop]
Why Loops?
• Loops may be bottlenecks themselves
  • Especially in scientific and long-running applications
• Loops are natural sources of parallelism
  • Compilers/HW exploit them
  • OpenMP PARALLEL DO, loop unrolling/fusion
  • Loop-level data provides feedback on the effectiveness of these optimizations
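As a concrete (hypothetical) instance of the loop parallelism mentioned above: an OpenMP parallel-for reduction. Loop-level timing on this loop would show whether the parallelization actually paid off; without -fopenmp the pragma is simply ignored and the loop runs serially.

```c
#define N 1000

/* Dot product; iterations are split across threads when OpenMP
 * is enabled, with sum combined via the reduction clause. */
double dot(const double *a, const double *b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];
    return sum;
}
```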
Why Loops?
• Loops logically decompose functions
  • Natural hierarchy (named by nesting)
• We instrument loops in the binary
  • The binary is what actually executes
  • Typically can correlate PC results w/ the original source
  • Difficult at basic-block or instruction granularity
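A sketch of the nesting hierarchy; the loop-naming scheme in the comments is assumed for illustration, not taken from the tool:

```c
/* Nesting gives loops natural hierarchical names, e.g. (scheme assumed):
 *   smooth.loop_1     -- outer sweep loop
 *   smooth.loop_1.1   -- inner point loop, nested in loop_1
 * Reporting "smooth.loop_1.1 is a bottleneck" is far more precise
 * than "smooth is a bottleneck", yet still maps back to the source. */
void smooth(double *a, int n, int sweeps) {
    for (int s = 0; s < sweeps; s++)        /* smooth.loop_1   */
        for (int i = 1; i < n - 1; i++)     /* smooth.loop_1.1 */
            a[i] = 0.5 * (a[i-1] + a[i+1]);
}
```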
What’s new?
• Loop-level performance data is not new
  • Existing tools: DPOMP, HPCView, SvPablo
  • Edge instrumentation in EEL and OM
• Integrating loops into an automated search
• Techniques to instrument loops on the fly
  • Technical challenges doing this efficiently
  • Especially on IA32 (AMD64/EM64T)
• Results for some MPI/OpenMP applications
Binary Loop Instrumentation

Source loop:
    do {
        ...
        if (x > 100) break;
        ...
    } while (x < y);

Compiled loop, w/ the four instrumentation points marked:
                                 ; 1: Entry
    LP:   inc %edx               ; 2: Begin iter.
          inc %eax
          cmp $0x64,%eax
          jg  DONE
          inc %edx
          inc %eax
          cmp %edx,%eax
          jl  LP                 ; 3: End iter.
    DONE:                        ; 4: Exit
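A C-level analogue of the four instrumentation points, with hypothetical counters standing in for the snippets the tool inserts into the binary:

```c
#include <assert.h>

/* Hypothetical counters for the four loop instrumentation points. */
long entries, iters_begun, iters_ended, exits;

long instrumented_loop(long x, long y) {
    entries++;                       /* 1: Entry            */
    do {
        iters_begun++;               /* 2: Begin iteration  */
        x++;
        if (x > 100) break;          /* early exit path     */
        x++;
        iters_ended++;               /* 3: End iteration (backward branch) */
    } while (x < y);
    exits++;                         /* 4: Exit             */
    return x;
}
```

Entry/exit counts give inclusive loop metrics; begin/end-iteration counts distinguish full trips from early exits via break.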
New Instrumentation Techniques
• Traditional function, edge instrumentation
  • Function relocation, as in previous work
  • Function entry, exit, call sites
• For loops, may relocate the function again
  • Ensures enough padding around basic blocks that need to be instrumented
  • Avoids trap-based instrumentation
Loop-based Search Strategy
• PC uses loops as steps in its refinement
Loop-based Search Strategy
• Inclusive metric: instrument loop entry/exit
• If a node is a bottleneck, instrument
  • Function: its outermost loops and call sites
  • Loop: its nested loops and call sites
• # of PC experiments
  • More total experiments possible w/ loops
  • But loops can help prune the search
  • E.g. loops that contain multiple call sites
Results
• Loops were frequently bottlenecks
  • 10 total leaf-level function bottlenecks
  • 7 of these contained loop bottlenecks
• Bottleneck functions had many loops
  • Especially true for Fortran applications
  • OM3: 1 function, 83% of CPU time, 90 loops
• Good results even when the code is not modular
  • Correlate loops w/ source using call sites
Summary
• Not much overhead
  • Avoids trap-based instrumentation
  • Only instruments loops of bottleneck functions
• Finds bottlenecks at a similar rate
  • Loop-aware search finds more — there is more in total to find
• More precise results
• Little change in search time
  • Similar rates of experimentation
Loop-Based Automated Performance Analysis
eli@cs.wisc.edu
http://www.paradyn.org
http://www.dyninst.org