This presentation treats libraries as programming languages without compilers, discusses the optimization opportunities that are lost as a result, and proposes a solution: a compiler for libraries that understands library types and operators and supports library-level optimizations.
Optimizing the Use of High Performance Software Libraries
Samuel Z. Guyer and Calvin Lin
University of Texas at Austin
Overview
• Libraries are programming languages
  • Extend the capabilities of the base language
  • Define new data types and operators
• Problem: languages without compilers
  • Programmers are responsible for performance
  • Optimization opportunities are lost
• Solution: a compiler for libraries
  • Extend the compiler to library types and operators
  • Support library-level optimizations
Outline
• Motivating example
• System architecture
• Details – by example
• Related work
• Conclusions
Example: Math Library

Source code:
  for (i=1; i<=N; i++) {
    d1 = 2.0 * i;
    d2 = pow(x, i);
    d3 = 1.0/z;
    d4 = cos(z);
    uint = uint/4;
    d5 = sin(y)/cos(y);
  }

Traditional optimizer:
  d1 = 0.0;
  d3 = 1.0/z;
  for (i=1; i<=N; i++) {
    d1 += 2.0;
    d2 = pow(x, i);
    d4 = cos(z);
    uint = uint >> 2;
    d5 = sin(y)/cos(y);
  }

Library-level optimizer:
  d1 = 0.0;
  d2 = 1.0;
  d3 = 1.0/z;
  d4 = cos(z);
  for (i=1; i<=N; i++) {
    d1 += 2.0;
    d2 *= x;
    uint = uint >> 2;
    d5 = tan(y);
  }

• How can a compiler do this automatically?
System Architecture
[Diagram: application source code and the library's header files + source code + annotations feed the Broadway Compiler, which produces optimized, integrated compiled code]
• The Broadway compiler
  • Configurable compiler mechanisms
• Annotation file
  • Conveys library-specific information
  • Accompanies the usual library files
Benefits
• Practical
  • One set of annotations for many applications
  • Works for existing libraries and applications
  • Development process essentially unchanged
• Conceptual: separation of concerns
  • Compiler provides the mechanisms
  • Annotations provide the library-specific expertise
  • Application developer can focus on design
• OK, but how does it work?
Specifying Optimizations
• Problem: configurability
  • Each library has its own optimizations
• Solution: pattern-based transformations
  • Pattern: code template with meta-variables
  • Action: replace, remove, move code

  pattern {
    ${obj:y} = sin(${obj:x}) / cos(${obj:x});
  }
  {
    replace { $y = tan($x); }
  }

• What about non-functional interfaces?
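To make the pattern's effect concrete, here is a hedged before/after sketch of application code in C; the function and variable names are invented for illustration, and only the rewritten statement comes from the pattern above.

  #include <math.h>

  /* Before: the statement matches the pattern  $y = sin($x) / cos($x);  */
  double angle_ratio_before(double x) {
      double y;
      y = sin(x) / cos(x);
      return y;
  }

  /* After: the replace action substitutes the library-level identity,
     turning two libm calls into one. */
  double angle_ratio_after(double x) {
      double y;
      y = tan(x);
      return y;
  }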
Non-functional Interfaces
[Diagram: the surface objects A, A_upper, A_lower map to internal views view1, view2, view3, all referring to the same underlying data1]
• Problem: library calls have side effects
  • Pass-by-reference using pointers
  • Complex data structures
• Example: PLAPACK parallel linear algebra [van de Geijn 1997]
  • Manipulate distributed matrices through views

  PLA_Obj_horz_split_2(A, height, &A_upper, &A_lower)
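As a hedged illustration of why this interface is opaque to a conventional compiler (the wrapper function and the header name are assumptions for illustration, not taken from the slides):

  #include "PLA.h"   /* PLAPACK header; name assumed for illustration */

  void split_example(PLA_Obj A, int height)
  {
      PLA_Obj A_upper = NULL, A_lower = NULL;

      /* The compiler sees only pointers being passed, so without extra
         information it must assume the call may read or write anything
         reachable from A, A_upper, or A_lower. */
      PLA_Obj_horz_split_2(A, height, &A_upper, &A_lower);

      /* In reality the call only reads A's view and height, and creates two
         new views (A_upper, A_lower) onto A's underlying matrix data. */
  }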
Dependence Annotations
• Solution: explicit dependence information
  • Summarizes library routine behavior
  • Requires heavy-duty pointer analyzer [Wilson & Lam, 1995]
  • Supports many traditional optimizations

  procedure PLA_Obj_horz_split_2(A, height, A_upper, A_lower)
  {
    on_entry { A --> view1 --> data1; }
    access   { view1, height }
    modify   { }
    on_exit  { A_upper --> new view2 --> data1;
               A_lower --> new view3 --> data1; }
  }
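For example, this summary is enough to let the compiler apply a traditional optimization such as loop-invariant code motion to the library call. A hedged sketch in C, with a hypothetical loop and an assumed header name:

  #include "PLA.h"   /* PLAPACK header; name assumed for illustration */

  /* Hypothetical loop whose split arguments never change. */
  void before(PLA_Obj A, int height, int n)
  {
      PLA_Obj A_upper = NULL, A_lower = NULL;
      for (int i = 0; i < n; i++) {
          /* Without the annotation the compiler must assume the call could
             read or write almost anything, so it stays inside the loop. */
          PLA_Obj_horz_split_2(A, height, &A_upper, &A_lower);
          /* ... work on A_upper and A_lower ... */
      }
  }

  /* After library-level optimization: the annotation shows the call reads
     only A's view and height and writes only A_upper and A_lower, so the
     loop-invariant call can be hoisted. */
  void after(PLA_Obj A, int height, int n)
  {
      PLA_Obj A_upper = NULL, A_lower = NULL;
      PLA_Obj_horz_split_2(A, height, &A_upper, &A_lower);
      for (int i = 0; i < n; i++) {
          /* ... work on A_upper and A_lower ... */
      }
  }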
Domain Information
[Figure: matrices distributed over a processor grid; two PLA_Gemm calls with specially distributed operands specialize to PLA_Local_gemm and PLA_Rankk]
• PLAPACK matrices are distributed
• Optimizations exploit special cases
  • Example: matrix multiply
• Problem: how to extract this information?
  • How do we describe this property?
  • How do we track it through the program?
Library-specific Analysis
• Solution: configurable dataflow analyzer
  • Compiler provides interprocedural framework
  • Library defines flow values
  • Each library routine defines transfer functions
• Issue: how much configurability?
  • Avoid exposing the underlying lattice theory
  • Simple flow value type system
Analysis Annotations
• Accompany the dependence annotations

  procedure PLA_Obj_horz_split_2(A, height, A_upper, A_lower)
  {
    on_entry { A --> view1 --> data1; }
    access   { view1, height }
    modify   { }
    on_exit  { A_upper --> new view2 --> data1;
               A_lower --> new view3 --> data1; }
  }

  property Distribution : map-of<object, {General, RowPanel, ColPanel, Local, Empty}>;

  analyze Distribution {
    (view1 == General)  => view2 = RowPanel, view3 = General;
    (view1 == ColPanel) => view2 = Local,    view3 = ColPanel;
    (view1 == Local)    => view2 = Local,    view3 = Empty;
  }
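A hedged C sketch of what these rules amount to inside the analyzer; this illustrates the transfer-function idea, not Broadway's actual implementation:

  /* The flow value: the Distribution property attached to each view. */
  typedef enum { General, RowPanel, ColPanel, Local, Empty } Distribution;

  /* Transfer function for PLA_Obj_horz_split_2, mirroring the analyze rules
     above: from the input view's distribution, compute the distributions of
     the two result views. */
  void transfer_horz_split_2(Distribution view1,
                             Distribution *view2, Distribution *view3)
  {
      switch (view1) {
      case General:  *view2 = RowPanel; *view3 = General;  break;
      case ColPanel: *view2 = Local;    *view3 = ColPanel; break;
      case Local:    *view2 = Local;    *view3 = Empty;    break;
      default:       *view2 = General;  *view3 = General;  break; /* conservative fallback */
      }
  }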
Using the Results of Analysis
• Patterns can test the flow values

  pattern {
    PLA_Gemm(${obj:A}, ${obj:B}, ${obj:C});
  }
  {
    when ((Distribution[viewA] == Local) &&
          (Distribution[viewB] == Local) &&
          (Distribution[viewC] == Local))
    replace { PLA_Local_gemm($A, $B, $C); }
    on_entry { A --> viewA --> dataA;
               B --> viewB --> dataB;
               C --> viewC --> dataC; }
  }

• DFA and patterns are complementary
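The net effect on application code, shown as a hedged before/after sketch; the three-argument call form follows the simplified slide example, not PLAPACK's full PLA_Gemm signature:

  /* Before: the general distributed matrix-matrix multiply. */
  PLA_Gemm(A, B, C);

  /* After: the analysis proved that the views of A, B, and C are all Local,
     so the pattern replaces the call with the purely local multiply. */
  PLA_Local_gemm(A, B, C);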
Status
• Prototype
  • Partially automated
  • Significant speed-up for PLAPACK
• Current status
  • Interprocedural pointer and dependence analyzer
  • Continuing work on annotation language
  • Dataflow analyzer and pattern matcher in progress
Related Work
• Supporting work
  • Dataflow analysis, pointer analysis
  • Pattern matching, partial evaluation
  • Pattern-based code generators
• Configurable compilers
  • Optimizer generators (Genesis)
  • PAG abstract interpretation system
  • Open compilers (Magik, SUIF, MOPS)
• Software generators and transformation systems
  • Specialization (Synthetix, Speckle)
Conclusions
• Many opportunities
  • Many existing libraries and applications
  • Future: class libraries
• Not easy
  • Complexity of libraries
  • Configurability – power versus usability
• We have a promising solution
  • Good initial results
  • Many interesting research directions
PLAPACK Results
[Graphs: MFLOPS per processor for Cholesky (3072×3072) over 0 to 40 processors and for the PLA_Trsm kernel over matrix sizes 500 to 4500, comparing Baseline, Hand-optimized, and Broadway versions; y-axis up to 300 MFLOPS]
• Compared three Cholesky programs
  • Baseline: clean and simple, but still fast
  • Hand-optimized by PLAPACK group
  • Broadway: automatic analysis, manual transforms