An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes. Jeremy Abramson and Pedro C. Diniz University of Southern California / Information Sciences Institute 4676 Admiralty Way, Suite 1001 Marina del Rey, California 90292.
Motivation
• Performance analysis is conceptually easy: just run the program!
  • This gives the "what" of performance. Is this interesting? Is that realistic?
  • Huge programs with large data sets
  • "Uncertainty principle" and intractability of profiling/instrumenting
• Performance prediction and analysis is in practice very hard
  • Not just interested in wall-clock time: the "why" of performance is a big concern
  • How to accurately characterize program behavior? What about architecture effects?
  • Wall-clock time can't be reused across machines, but program characteristics can
Motivation (2)
• What about the future?
  • Different architecture = better results?
  • Compiler transformations (e.g., loop unrolling)
• Need a fast, scalable, automated way of determining program characteristics
• Determine what causes poor performance
  • What does profiling tell us?
  • How can the programmer use low-level profiling information?
Overview
• Approach
  • High-level / low-level synergy
  • Not architecture-bound
• Experimental results
  • CG core
• Caveats and future work
• Conclusion
Low versus High level information

    la   $r0, a              # address of a
    lw   $r1, i
    mult $offset, $r1, 4
    add  $offset, $offset, $r0
    lw   $r2, ($offset)
    add  $r3, $r2, 1
    la   $r4, b
    sw   $r3, ($r4)

or

    B = A[i] + 1

• Which can provide meaningful performance information to a programmer?
• How do we capture the information at a low level while maintaining the structure of high-level source?
Low versus High level information (2)
• Drawbacks of looking at the low level
  • Too much data!
  • You found a "problem" spot. What now?
  • How do programmers relate information back to the source level?
• Drawbacks of looking at the source level
  • What about the compiler? The generated code may look very different
  • Architecture impacts?
• Solution: look at the high-level structure and try to anticipate the compiler
Experimental Approach
• Goal: derive performance expectations from source code for different architectures
  • What should the performance be, and why?
  • What is limiting the performance? Data dependencies? Architecture limitations?
• Use high-level information
  • WHIRL intermediate representation in Open64; arrays are not lowered
• Construct a data-flow graph (DFG) and decorate it with latency information
• Schedule the DFG
  • Compute an as-soon-as-possible (ASAP) schedule
  • Variable number of functional units: ALUs, load/store units, registers
  • Pipelining of operations
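The scheduling step above can be sketched as a small ASAP list scheduler. This is an illustrative reconstruction, not the SLOPE implementation: node names, latencies, and unit counts are assumptions, and units are modeled as fully pipelined (one issue per cycle each).

```python
from collections import defaultdict

# Toy DFG for B = A[i] + 1: load i -> address calc -> load A[i] -> add -> store B
# Latencies and unit kinds are illustrative assumptions.
nodes = {
    "load_i":  {"unit": "mem", "latency": 2, "deps": []},
    "addr":    {"unit": "alu", "latency": 1, "deps": ["load_i"]},
    "load_A":  {"unit": "mem", "latency": 2, "deps": ["addr"]},
    "add1":    {"unit": "alu", "latency": 1, "deps": ["load_A"]},
    "store_B": {"unit": "mem", "latency": 2, "deps": ["add1"]},
}
units = {"alu": 1, "mem": 1}  # functional-unit counts: the sweep parameter

def asap_schedule(nodes, units):
    start = {}
    issued = defaultdict(list)        # unit kind -> issue cycles already taken
    done = set()
    while len(done) < len(nodes):
        for name, n in nodes.items():
            if name in done or any(d not in done for d in n["deps"]):
                continue
            # earliest cycle at which all inputs are available
            t = max((start[d] + nodes[d]["latency"] for d in n["deps"]), default=0)
            # pipelined units: only the issue cycle is contended
            while issued[n["unit"]].count(t) >= units[n["unit"]]:
                t += 1
            start[name] = t
            issued[n["unit"]].append(t)
            done.add(name)
    total = max(start[k] + nodes[k]["latency"] for k in nodes)
    return start, total
```

Sweeping `units` (e.g., `{"alu": 2, "mem": 1}`) is what lets this style of model ask the sensitivity questions on the following slides.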
Compilation process
1. Source (C/Fortran):

    for (i; i < 0; …
        …
        B = A[i] + 1
        …

2. Open64 WHIRL (high-level):

    OPR_STID: B
      OPR_ADD
        OPR_ARRAY
          OPR_LDA: A
          OPR_LDID: i
        OPR_CONST: 1

3. Annotated DFG
Memory modeling approach
• i is a loop induction variable
• The ARRAY node represents the address calculation at a high level
• Register hit? Assign latency 0
• The array expression is affine: assume a cache hit and assign latency accordingly
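The affine-hit rule above can be sketched as a small predicate over subscript expressions. This is a hedged reconstruction, not the paper's code: the tuple-AST shape and the latency constants are illustrative assumptions.

```python
# Illustrative latencies (cycles); the real model would take these
# from the target architecture description.
HIT_LATENCY, MISS_LATENCY = 2, 20

def is_affine(expr):
    """expr is a nested tuple AST: ('const', c), ('var', name),
    ('add', a, b), ('mul', a, b), or ('load', addr) for an indirect index."""
    op = expr[0]
    if op in ("const", "var"):
        return True  # constants and induction variables are affine terms
    if op == "add":
        return is_affine(expr[1]) and is_affine(expr[2])
    if op == "mul":
        a, b = expr[1], expr[2]
        # affine only if one factor is a constant
        return (a[0] == "const" and is_affine(b)) or \
               (b[0] == "const" and is_affine(a))
    return False  # e.g. a 'load' node: indirect subscript, not affine

def mem_latency(subscript):
    """Assign the latency for an array access given its subscript AST."""
    return HIT_LATENCY if is_affine(subscript) else MISS_LATENCY
```

Under this rule, `a(k)` with affine `k` gets the hit latency, while an indirect access such as `y(rowidx(k))` in the CG kernel is charged as a miss.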
Example: CG

      do 200 j = 1, n
         xj = x(j)
         do 100 k = colstr(j), colstr(j+1)-1
            y(rowidx(k)) = y(rowidx(k)) + a(k)*xj
  100    continue
  200 continue
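For readers less familiar with Fortran, the same sparse matrix-vector kernel can be written in plain Python (a hedged rendering: 0-based indices, and the multiply `a[k] * xj` as in the standard NAS CG benchmark). The point of the example is the indirect access `y[rowidx[k]]`, which defeats the affine cache-hit assumption.

```python
def cg_kernel(n, colstr, rowidx, a, x, y):
    """Accumulate y += A @ x for a sparse matrix A stored by columns:
    column j holds entries a[colstr[j]:colstr[j+1]] at rows rowidx[k]."""
    for j in range(n):
        xj = x[j]
        for k in range(colstr[j], colstr[j + 1]):
            y[rowidx[k]] += a[k] * xj  # indirect, data-dependent store
    return y
```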
CG Analysis Results
• Figure 4: validation results of CG on a MIPS R10000 machine
• Prediction results are consistent with the un-optimized version of the code
CG Analysis Results (2)
• Figure 5: cycle time for an iteration of CG with varying architectural configurations
• What's the best way to use processor space?
  • Pipelined ALUs?
  • Replicated standard ALUs?
Caveats, Future Work
• More compiler-like features are needed to improve accuracy
• Control flow
  • Implement trace scheduling
  • Multiple paths can give upper/lower performance bounds
• Simple compiler transformations
  • Common sub-expression elimination
  • Strength reduction
  • Constant folding
• Register allocation
  • "Distance"-based methods?
  • Anticipate cache behavior for spill code
• Software pipelining?
  • Unrolling exploits ILP
• Run-time data?
  • Array references, loop trip counts, access patterns from performance skeletons
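One of the listed transformations, constant folding, is simple enough to sketch on the same tuple AST used earlier. This is an illustrative sketch of the technique, not Open64's folding pass; the AST shape is an assumption.

```python
def fold(expr):
    """Recursively collapse add/mul nodes whose operands are all constants."""
    if expr[0] in ("const", "var"):
        return expr
    op, a, b = expr[0], fold(expr[1]), fold(expr[2])
    if a[0] == "const" and b[0] == "const":
        val = a[1] + b[1] if op == "add" else a[1] * b[1]
        return ("const", val)  # fold at analysis time
    return (op, a, b)
```

Running such passes before scheduling keeps the predicted operation counts closer to what the real compiler would emit.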
Conclusions
• SLOPE provides very fast performance prediction and analysis results
• The high-level approach gives more meaningful information
  • It still tries to anticipate the compiler and the memory hierarchy
• More compiler transformations are to be added
  • Maintain the high-level approach while refining low-level accuracy