
An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes

An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes. Jeremy Abramson and Pedro C. Diniz University of Southern California / Information Sciences Institute 4676 Admiralty Way, Suite 1001 Marina del Rey, California 90292.


Presentation Transcript


  1. An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes
     Jeremy Abramson and Pedro C. Diniz
     University of Southern California / Information Sciences Institute
     4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292

  2. Motivation
     • Performance analysis is conceptually easy
     • Just run the program!
     • The “what” of performance. Is this interesting?
     • Is that realistic?
     • Huge programs with large data sets
     • “Uncertainty principle” and intractability of profiling/instrumenting
     • Performance prediction and analysis is in practice very hard
     • Not just interested in wall clock time
     • The “why” of performance is a big concern
     • How to accurately characterize program behavior?
     • What about architecture effects?
     • Can’t reuse wall clock time
     • Can reuse program characteristics

  3. Motivation (2)
     • What about the future?
     • Different architecture = better results?
     • Compiler transformations (loop unrolling)
     • Need a fast, scalable, automated way of determining program characteristics
     • Determine what causes poor performance
     • What does profiling tell us?
     • How can the programmer use profiling (low-level) information?

  4. Overview
     • Approach
     • High level / low level synergy
     • Not architecture-bound
     • Experimental results
     • CG core
     • Caveats and future work
     • Conclusion

  5. Low versus High level information
       la   $r0, a
       lw   $r1, i
       mult $offset, $r1, 4
       add  $offset, $offset, $r0
       lw   $r2, $offset
       add  $r3, $r2, 1
       la   $r4, b
       sw   $r4, $r3
     or
     • Which can provide meaningful performance information to a programmer?
     • How do we capture the information at a low level while maintaining the structure of high level source?

  6. Low versus High level information (2)
     • Drawbacks of looking at low-level
     • Too much data!
     • You found a “problem” spot. What now?
     • How do programmers relate information back to source level?
     • Drawbacks of looking at source-level
     • What about the compiler?
     • Code may look very different
     • Architecture impacts?
     • Solution: Look at high-level structure, try to anticipate compiler

  7. Experimental Approach
     • Goal: Derive performance expectations from source code for different architectures
     • What should the performance be and why?
     • What is limiting the performance?
     • Data-dependencies?
     • Architecture limitations?
     • Use high level information
     • WHIRL intermediate representation in Open64
     • Arrays not lowered
     • Construct DFG
     • Decorate graph with latency information
     • Schedule the DFG
     • Compute as-soon-as-possible schedule
     • Variable number of functional units
     • ALU, Load/Store, Registers
     • Pipelining of operations
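The "decorate, then schedule" steps on this slide can be illustrated with a minimal sketch. It assumes a hand-built data-flow graph for a statement such as B = A[i] + 1 and made-up latencies; the slides give no concrete cycle counts, and the names `LATENCY`, `asap_schedule` and `schedule_length` are hypothetical, not part of Open64 or the authors' tool. Functional units are treated as unlimited here, so only data dependences delay an operation.

```python
# Hypothetical per-operation latencies in cycles (not taken from the slides).
LATENCY = {"load": 2, "addr": 1, "add": 1, "store": 1, "const": 0}


def asap_schedule(nodes, edges):
    """Return {node: start_cycle} for an as-soon-as-possible schedule.

    nodes: {name: operation kind}; edges: list of (producer, consumer).
    Assumes unlimited functional units.
    """
    preds = {n: [] for n in nodes}
    for src, dst in edges:
        preds[dst].append(src)

    start = {}

    def visit(n):
        if n not in start:
            # A node may start once every producer has delivered its result.
            start[n] = max(
                (visit(p) + LATENCY[nodes[p]] for p in preds[n]), default=0
            )
        return start[n]

    for n in nodes:
        visit(n)
    return start


def schedule_length(nodes, edges):
    """Cycle in which the last operation of the ASAP schedule completes."""
    start = asap_schedule(nodes, edges)
    return max(start[n] + LATENCY[nodes[n]] for n in nodes)


# DFG for B = A[i] + 1, decorated with operation kinds.
nodes = {
    "i": "load",
    "addr_A": "addr",
    "A[i]": "load",
    "one": "const",
    "add": "add",
    "B": "store",
}
edges = [
    ("i", "addr_A"),
    ("addr_A", "A[i]"),
    ("A[i]", "add"),
    ("one", "add"),
    ("add", "B"),
]

print(asap_schedule(nodes, edges))
print(schedule_length(nodes, edges))
```

With these assumed latencies the add cannot start before cycle 5 and the whole statement takes 7 cycles, all of it on the dependence chain i → address → A[i] → add → B.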

  8. Compilation process
     1. Source (C/Fortran)
          for (i; i < 0; … …
            B = A[i] + 1
          …
     2. Open64 WHIRL (High-level)
          OPR_STID: B
            OPR_ADD
              OPR_ARRAY
                OPR_LDA: A
                OPR_LDID: i
              OPR_CONST: 1
     3. Annotated DFG
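As a rough illustration of going from step 2 to step 3, the sketch below flattens the WHIRL-style tree for B = A[i] + 1 into graph nodes and producer-to-consumer edges. The `Whirl` class and `to_dfg` function are hypothetical stand-ins written in plain Python; they do not use or mirror the real Open64 API, and only the operator names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Whirl:
    """Hypothetical stand-in for a WHIRL tree node (not the Open64 API)."""
    opr: str
    sym: object = None
    kids: list = field(default_factory=list)


def to_dfg(root):
    """Flatten an expression tree into (nodes, edges) in post-order.

    nodes: list of labels; edges: (producer_index, consumer_index) pairs,
    so every value flows from a child operator to its parent.
    """
    nodes, edges = [], []

    def walk(n):
        kid_ids = [walk(k) for k in n.kids]
        nodes.append(n.opr if n.sym is None else f"{n.opr}: {n.sym}")
        me = len(nodes) - 1
        edges.extend((k, me) for k in kid_ids)
        return me

    walk(root)
    return nodes, edges


# The tree shown on the slide for B = A[i] + 1.
tree = Whirl("OPR_STID", "B", [
    Whirl("OPR_ADD", kids=[
        Whirl("OPR_ARRAY", kids=[
            Whirl("OPR_LDA", "A"),
            Whirl("OPR_LDID", "i"),
        ]),
        Whirl("OPR_CONST", 1),
    ]),
])

nodes, edges = to_dfg(tree)
for src, dst in edges:
    print(f"{nodes[src]} -> {nodes[dst]}")
```

Because the array access stays a single OPR_ARRAY node rather than being lowered to address arithmetic, the resulting graph keeps a one-to-one correspondence with the source expression, which is what lets latencies be reported back at source level.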

  9. Memory modeling approach
     • Array node represents address calculation at a high level
     • i is a loop induction variable
     • Register hit? Assign latency
     • Array expression is affine. Assume a cache hit, and assign latency accordingly
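One way to read this slide is as a small classification rule: scalars such as the induction variable are assumed to live in registers, and array references with an affine index are assumed to hit in cache. The sketch below encodes that reading. The latency constants, the tuple encoding of expressions, and the function names are all assumptions made for illustration; the slide does not say how non-affine references are handled, so the sketch returns `None` for them. As a simplification, only induction variables and integer constants count as affine terms.

```python
# Hypothetical latencies in cycles; the slides give no concrete numbers.
REGISTER_HIT = 1
CACHE_HIT = 2


def is_affine(index, induction_vars):
    """True if `index` is an affine function of the induction variables.

    index is a tiny expression tree: an int constant, a variable name,
    or a tuple (op, left, right).
    """
    if isinstance(index, int):
        return True
    if isinstance(index, str):
        return index in induction_vars
    op, left, right = index
    if op in ("+", "-"):
        return is_affine(left, induction_vars) and is_affine(right, induction_vars)
    if op == "*":
        # A product stays affine only if one side is a constant.
        return (isinstance(left, int) and is_affine(right, induction_vars)) or (
            isinstance(right, int) and is_affine(left, induction_vars)
        )
    return False  # e.g. an indirect index loaded from another array


def memory_latency(ref, induction_vars):
    """ref is ("scalar", name) or ("array", base, index)."""
    if ref[0] == "scalar":
        return REGISTER_HIT
    if is_affine(ref[2], induction_vars):
        return CACHE_HIT
    return None  # not covered by the slide


print(memory_latency(("scalar", "i"), {"i"}))
print(memory_latency(("array", "A", "i"), {"i"}))
print(memory_latency(("array", "y", ("load", "rowidx", "k")), {"k"}))
```

The last call models an indirect reference of the form y(rowidx(k)), whose index is itself loaded from memory and therefore falls outside the affine case.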

  10. Example: CG
            do 200 j = 1, n
               xj = x(j)
               do 100 k = colstr(j), colstr(j+1)-1
                  y(rowidx(k)) = y(rowidx(k)) + a(k) + xj
        100    continue
        200 continue

  11. CG Analysis Results
      • Figure 4. Validation results of CG on a MIPS R10000 machine
      • Prediction results consistent with un-optimized version of the code

  12. CG Analysis Results (2)
      • Figure 5. Cycle time for an iteration of CG with varying architectural configurations
      • What’s the best way to use processor space?
      • Pipelined ALUs?
      • Replicate standard ALUs?
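The pipelined-versus-replicated question can be made concrete with a toy resource-constrained scheduler. This is only a sketch under stated assumptions: the greedy cycle-by-cycle policy, the operation latencies, and the name `list_schedule_length` are all hypothetical, and the workload (four independent multiplies feeding one add) is invented for illustration rather than taken from CG or from Figure 5.

```python
def list_schedule_length(ops, deps, units, pipelined):
    """Cycle count of a greedy, cycle-by-cycle list schedule.

    ops:       {name: (unit_kind, latency)}, latency >= 1
    deps:      {name: [producer names]}
    units:     {unit_kind: number of units}
    pipelined: set of unit kinds that accept a new operation every cycle;
               other units stay busy for the full latency of what they run.
    """
    finish = {}                                        # op -> result-ready cycle
    free_at = {k: [0] * n for k, n in units.items()}   # next free cycle per unit
    pending = list(ops)
    cycle = 0
    while pending:
        for op in list(pending):
            kind, lat = ops[op]
            ready = all(
                p in finish and finish[p] <= cycle for p in deps.get(op, [])
            )
            if not ready:
                continue
            slots = free_at[kind]
            for u in range(len(slots)):
                if slots[u] <= cycle:
                    slots[u] = cycle + (1 if kind in pipelined else lat)
                    finish[op] = cycle + lat
                    pending.remove(op)
                    break
        cycle += 1
    return max(finish.values())


# Invented workload: four independent 3-cycle multiplies feeding a 1-cycle add.
ops = {f"m{i}": ("alu", 3) for i in range(4)}
ops["sum"] = ("alu", 1)
deps = {"sum": ["m0", "m1", "m2", "m3"]}

configs = {
    "1 plain ALU": ({"alu": 1}, set()),
    "1 pipelined ALU": ({"alu": 1}, {"alu"}),
    "2 plain ALUs": ({"alu": 2}, set()),
}
for name, (units, pipelined) in configs.items():
    print(name, list_schedule_length(ops, deps, units, pipelined))
```

With these made-up numbers a single plain ALU needs 13 cycles, while one pipelined ALU and two plain ALUs both finish in 7, so the two design options tie on this workload. Sweeping such configurations over the real DFG is the kind of sensitivity analysis the slide refers to.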

  13. Caveats, Future Work
      • More compiler-like features are needed to improve accuracy
      • Control flow
      • Implement trace scheduling
      • Multiple paths can give upper/lower performance bounds
      • Simple compiler transformations
      • Common sub-expression elimination
      • Strength reduction
      • Constant folding
      • Register allocation
      • “Distance”-based methods?
      • Anticipate cache for spill code
      • Software pipelining?
      • Unrolling exploits ILP
      • Run-time data?
      • Array references, loop trip counts, access patterns from performance skeletons

  14. Conclusions
      • SLOPE provides very fast performance prediction and analysis results
      • High-level approach gives more meaningful information
      • Still try to anticipate compiler and memory hierarchy
      • More compiler transformations to be added
      • Maintain high-level approach, refine low-level accuracy

  15. An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes
      Jeremy Abramson and Pedro C. Diniz
      University of Southern California / Information Sciences Institute
      4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292
