Loop-Based Automated Performance Analysis
Eli Collins
eli@cs.wisc.edu
Computer Sciences Department
University of Wisconsin-Madison
Madison, WI 53706 USA
Motivation
• Automated performance analysis
• Ongoing work: APART
• Previous work: Callgraph, Deepstart
  • Faster, more efficient searching
• This work: better localize performance problems
  • Report performance data at finer granularity
Motivation (Cont.)
• Function granularity works well
  • Doesn't overload the user w/ fine-grained data
• Why is a function a bottleneck?
  • Large function w/ multiple bottlenecks
  • Small function called repeatedly in a loop
• Idea: search inside bottleneck functions
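The two bottleneck shapes above can be sketched in C (all names here — compute, cheap_helper — are hypothetical): function-level profiling blames compute as a whole, but only loop-level data shows which loop is hot, or that a small helper called inside it is the real cost.

```c
#include <assert.h>

/* Tiny function: fast in isolation, expensive when called n times. */
long cheap_helper(long x) {
    return x * x;
}

long compute(long n) {
    long sum = 0;
    /* loop 1: the actual hot spot -- calls cheap_helper n times */
    for (long i = 0; i < n; i++)
        sum += cheap_helper(i);
    /* loop 2: negligible by comparison */
    for (long i = 0; i < 10; i++)
        sum += i;
    return sum;
}
```

A function-granularity tool reports only "compute is a bottleneck"; searching inside it separates loop 1 from loop 2.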
Performance Consultant (PC)
• For code, the PC searches the callgraph
  • Breadth-first search
  • Prunes non-bottleneck functions
• Introduce a new callgraph level that
  • Is a logical unit of computation
  • Improves granularity
  • Partitions functions for searching
  • Keeps the search space manageable (scalability)
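A toy sketch of the pruned breadth-first search described above, assuming a hypothetical adjacency-array callgraph; the hot[] flag stands in for the PC's instrumented bottleneck test:

```c
#include <assert.h>

#define MAXN 8

int nchild[MAXN];        /* number of callees per node        */
int child[MAXN][MAXN];   /* callee indices per node           */
int hot[MAXN];           /* 1 if measured cost exceeds threshold */

/* BFS from root; expand only bottleneck nodes, prune the rest.
 * Records bottlenecks in found[] and returns how many were found. */
int search(int root, int found[], int cap) {
    int q[MAXN * MAXN], head = 0, tail = 0, n = 0;
    q[tail++] = root;
    while (head < tail) {
        int f = q[head++];
        if (!hot[f]) continue;            /* prune: don't expand callees */
        if (n < cap) found[n++] = f;      /* report this bottleneck      */
        for (int i = 0; i < nchild[f]; i++)
            q[tail++] = child[f][i];      /* enqueue callees             */
    }
    return n;
}
```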
Loops in the Callgraph
[Figure: two callgraphs rooted at main, w/ functions f1 and f2 — a plain callgraph vs. a callgraph w/ loops, where a loop node (loop 1) is inserted between a function and the calls made inside that loop]
Why Loops?
• Loops may be bottlenecks themselves
  • Especially in scientific and long-running applications
• Loops are natural sources of parallelism
  • Compilers/HW exploit them
  • OpenMP PARALLEL DO, loop unrolling/fusion
  • Loop-level data provides feedback on the effectiveness of these optimizations
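As a concrete (hypothetical) instance of the loop parallelism mentioned above: an OpenMP parallel-for reduction. Loop-level timing on this loop would show whether the parallelization actually paid off; without -fopenmp the pragma is simply ignored and the loop runs serially.

```c
#define N 1000

/* Dot product; iterations are split across threads when OpenMP
 * is enabled, with sum combined via the reduction clause. */
double dot(const double *a, const double *b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];
    return sum;
}
```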
Why Loops?
• Loops logically decompose functions
  • Natural hierarchy (named by nesting)
• We instrument loops in the binary
  • The binary is what actually executes
  • Typically can correlate PC results w/ the original source
  • Difficult at basic-block or instruction granularity
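A sketch of the nesting hierarchy; the loop-naming scheme in the comments is assumed for illustration, not taken from the tool:

```c
/* Nesting gives loops natural hierarchical names, e.g. (scheme assumed):
 *   smooth.loop_1     -- outer sweep loop
 *   smooth.loop_1.1   -- inner point loop, nested in loop_1
 * Reporting "smooth.loop_1.1 is a bottleneck" is far more precise
 * than "smooth is a bottleneck", yet still maps back to the source. */
void smooth(double *a, int n, int sweeps) {
    for (int s = 0; s < sweeps; s++)        /* smooth.loop_1   */
        for (int i = 1; i < n - 1; i++)     /* smooth.loop_1.1 */
            a[i] = 0.5 * (a[i-1] + a[i+1]);
}
```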
What’s new?
• Loop-level performance data is not new
  • Existing tools: DPOMP, HPCView, SvPablo
  • Edge instrumentation in EEL and OM
• Integrating loops into an automated search
• Techniques to instrument loops on the fly
  • Technical challenges doing this efficiently
  • Especially on IA32 (AMD64/EM64T)
• Results for some MPI/OpenMP applications
Binary Loop Instrumentation

Source loop:
    do {
        ...
        if (x > 100) break;
        ...
    } while (x < y);

Compiled loop, w/ the four instrumentation points marked:
                                 ; 1: Entry
    LP:   inc %edx               ; 2: Begin iter.
          inc %eax
          cmp $0x64,%eax
          jg  DONE
          inc %edx
          inc %eax
          cmp %edx,%eax
          jl  LP                 ; 3: End iter.
    DONE:                        ; 4: Exit
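A C-level analogue of the four instrumentation points, with hypothetical counters standing in for the snippets the tool inserts into the binary:

```c
#include <assert.h>

/* Hypothetical counters for the four loop instrumentation points. */
long entries, iters_begun, iters_ended, exits;

long instrumented_loop(long x, long y) {
    entries++;                       /* 1: Entry            */
    do {
        iters_begun++;               /* 2: Begin iteration  */
        x++;
        if (x > 100) break;          /* early exit path     */
        x++;
        iters_ended++;               /* 3: End iteration (backward branch) */
    } while (x < y);
    exits++;                         /* 4: Exit             */
    return x;
}
```

Entry/exit counts give inclusive loop metrics; begin/end-iteration counts distinguish full trips from early exits via break.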
New Instrumentation Techniques
• Traditional function, edge instrumentation
  • Function relocation, as in previous work
  • Function entry, exit, call sites
• For loops, may relocate the function again
  • Ensures enough padding around basic blocks that need to be instrumented
  • Avoids trap-based instrumentation
Loop-based Search Strategy
• PC uses loops as steps in its refinement
Loop-based Search Strategy
• Inclusive metric: instrument loop entry/exit
• If a node is a bottleneck, instrument
  • Function: its outermost loops and call sites
  • Loop: its nested loops and call sites
• # of PC experiments
  • More total experiments possible w/ loops
  • But loops can help prune the search
  • E.g. loops that contain multiple call sites
Results
• Loops were frequently bottlenecks
  • 10 total leaf-level function bottlenecks
  • 7 of these contained loop bottlenecks
• Bottleneck functions had many loops
  • Especially true for Fortran applications
  • OM3: 1 function, 83% of CPU time, 90 loops
• Good results even when the code is not modular
  • Correlate loops w/ source using call sites
Summary
• Not much overhead
  • Avoids trap-based instrumentation
  • Only instruments loops of bottleneck functions
• Finds bottlenecks at a similar rate
  • Loop-aware search finds more — there is more in total to find
• More precise results
• Little change in search time
  • Similar rates of experimentation
Loop-Based Automated Performance Analysis
eli@cs.wisc.edu
http://www.paradyn.org
http://www.dyninst.org