
Optimizing the Use of High Performance Software Libraries

This presentation argues that libraries act as programming languages without compilers, examines the optimization opportunities that are lost as a result, and proposes a compiler for libraries that extends compilation to library types and operators and supports library-level optimizations.


Presentation Transcript


  1. Optimizing the Use of High Performance Software Libraries
     Samuel Z. Guyer and Calvin Lin
     University of Texas at Austin

  2. Overview
     • Libraries are programming languages
       • Extend the capabilities of the base language
       • Define new data types and operators
     • Problem: languages without compilers
       • Programmers are responsible for performance
       • Optimization opportunities are lost
     • Solution: a compiler for libraries
       • Extend the compiler to library types and operators
       • Support library-level optimizations

  3. Outline
     • Motivating example
     • System architecture
     • Details – by example
     • Related work
     • Conclusions

  4. Example: Math Library

     Source code:

         for (i=1;i<=N;i++) {
           d1 = 2.0 * i;
           d2 = pow(x, i);
           d3 = 1.0/z;
           d4 = cos(z);
           uint = uint/4;
           d5 = sin(y)/cos(y);
         }

     After a traditional optimizer:

         d1 = 0.0;
         d3 = 1.0/z;
         for (i=1;i<=N;i++) {
           d1 += 2.0;
           d2 = pow(x, i);
           d4 = cos(z);
           uint = uint >> 2;
           d5 = sin(y)/cos(y);
         }

     After a library-level optimizer:

         d1 = 0.0;
         d2 = 1.0;
         d3 = 1.0/z;
         d4 = cos(z);
         for (i=1;i<=N;i++) {
           d1 += 2.0;
           d2 *= x;
           uint = uint >> 2;
           d5 = tan(y);
         }

     • How can a compiler do this automatically?

  5. System Architecture
     [Diagram: application source code and the library (header files + source code + annotations) are fed to the Broadway Compiler, which produces optimized, integrated compiled code]
     • The Broadway compiler
       • Configurable compiler mechanisms
     • Annotation file
       • Conveys library-specific information
       • Accompanies the usual library files

  6. Benefits
     • Practical
       • One set of annotations for many applications
       • Works for existing libraries and applications
       • Development process essentially unchanged
     • Conceptual: separation of concerns
       • Compiler provides the mechanisms
       • Annotations provide the library-specific expertise
       • Application developer can focus on design
     • OK, but how does it work?

  7. Specifying Optimizations
     • Problem: configurability
       • Each library has its own optimizations
     • Solution: pattern-based transformations
       • Pattern: code template with meta-variables
       • Action: replace, remove, move code

         pattern {
           ${obj:y} = sin(${obj:x}) / cos(${obj:x});
         }
         {
           replace { $y = tan($x); }
         }

     • What about non-functional interfaces?

  8. Non-functional Interfaces
     [Diagram: surface objects A, A_upper, A_lower; internally, views view1, view2, view3 refer to shared data1]
     • Problem: library calls have side-effects
       • Pass-by-reference using pointers
       • Complex data structures
     • Example: PLAPACK parallel linear algebra [van de Geijn 1997]
       • Manipulate distributed matrices through views

         PLA_Obj_horz_split_2(A, height, &A_upper, &A_lower);

  9. Dependence Annotations
     • Solution: explicit dependence information
       • Summarizes library routine behavior
       • Requires a heavy-duty pointer analyzer [Wilson & Lam, 1995]
       • Supports many traditional optimizations

         procedure PLA_Obj_horz_split_2(A, height, A_upper, A_lower)
         {
           on_entry { A --> view1 --> data1; }
           access   { view1, height }
           modify   { }
           on_exit  { A_upper --> new view2 --> data1;
                      A_lower --> new view3 --> data1; }
         }

  10. Domain Information
      [Diagram: matrices distributed over a processor grid; PLA_Gemm calls are rewritten to PLA_Local_gemm or PLA_Rankk depending on how their operands are distributed]
      • PLAPACK matrices are distributed
      • Optimizations exploit special cases
        • Example: matrix multiply
      • Problem: how to extract this information?
        • How do we describe this property?
        • How do we track it through the program?

  11. Library-specific Analysis
      • Solution: configurable dataflow analyzer
        • Compiler provides the interprocedural framework
        • Library defines flow values
        • Each library routine defines transfer functions
      • Issue: how much configurability?
        • Avoid exposing the underlying lattice theory
        • Simple flow value type system

  12. Analysis Annotations
      • Accompany the dependence annotations

          procedure PLA_Obj_horz_split_2(A, height, A_upper, A_lower)
          {
            on_entry { A --> view1 --> data1; }
            access   { view1, height }
            modify   { }
            on_exit  { A_upper --> new view2 --> data1;
                       A_lower --> new view3 --> data1; }
          }

          property Distribution :
            map-of<object, {General, RowPanel, ColPanel, Local, Empty}>;

          analyze Distribution {
            (view1 == General)  => view2 = RowPanel, view3 = General;
            (view1 == ColPanel) => view2 = Local,    view3 = ColPanel;
            (view1 == Local)    => view2 = Local,    view3 = Empty;
          }

  13. Using the Results of Analysis
      • Patterns can test the flow values

          pattern {
            PLA_Gemm(${obj:A}, ${obj:B}, ${obj:C});
          }
          {
            when ((Distribution[viewA] == Local) &&
                  (Distribution[viewB] == Local) &&
                  (Distribution[viewC] == Local))
            replace { PLA_Local_gemm($A, $B, $C); }
            on_entry { A --> viewA --> dataA;
                       B --> viewB --> dataB;
                       C --> viewC --> dataC; }
          }

      • DFA and patterns are complementary

  14. Status
      • Prototype
        • Partially automated
        • Significant speed-up for PLAPACK
      • Current status
        • Interprocedural pointer and dependence analyzer
        • Continuing work on annotation language
        • Dataflow analyzer and pattern matcher in progress

  15. Related Work
      • Supporting work
        • Dataflow analysis, pointer analysis
        • Pattern matching, partial evaluation
        • Pattern-based code generators
      • Configurable compilers
        • Optimizer generators (Genesis)
        • PAG abstract interpretation system
        • Open compilers (Magik, SUIF, MOPS)
      • Software generators and transformation systems
      • Specialization (Synthetix, Speckle)

  16. Conclusions
      • Many opportunities
        • Many existing libraries and applications
        • Future: class libraries
      • Not easy
        • Complexity of libraries
        • Configurability – power versus usability
      • We have a promising solution
        • Good initial results
        • Many interesting research directions

  17. PLAPACK Results
      [Charts: MFLOPS vs. number of processors (0 to 40) for Cholesky factorization (3072×3072), and MFLOPS per processor vs. matrix size (500 to 4500) for the PLA_Trsm kernel, each comparing Baseline, Hand-optimized, and Broadway]
      • Compared three Cholesky programs
        • Baseline: clean and simple, but still fast
        • Hand-optimized by the PLAPACK group
        • Broadway: automatic analysis, manual transforms
