The SGI Pro64 Compiler Infrastructure - A Tutorial Guang R. Gao (U of Delaware), J. Dehnert (SGI), J. N. Amaral (U of Alberta), R. Towle (SGI)
Acknowledgement The SGI Compiler Development Teams • The MIPSpro/Pro64 Development Team University of Delaware • CAPSL Compiler Team These individuals contributed directly to this tutorial: A. Douillet (Udel), F. Chow (Equator), S. Chan (Intel), W. Ho (Routefree), Z. Hu (Udel), K. Lesniak (SGI), S. Liu (HP), R. Lo (Routefree), S. Mantripragada (SGI), C. Murthy (SGI), M. Murphy (SGI), G. Pirocanac (SGI), D. Stephenson (SGI), D. Whitney (SGI), H. Yang (Udel)
What is Pro64? • A suite of optimizing compiler tools for Linux/Intel IA-64 systems • C, C++ and Fortran90/95 compilers • Conforming to the IA-64 Linux ABI and API standards • Open to all researchers/developers in the community • Compatible with HP Native User Environment
Who Might Want to Use Pro64? • Researchers: test new compiler analysis and optimization algorithms • Developers: retarget to another architecture/system • Educators: a compiler teaching platform
Outline • Background and Motivation • Part I: An overview of the SGI Pro64 compiler infrastructure • Part II: The Pro64 code generator design • Part III: Using Pro64 in compiler research & development • SGI Pro64 support • Summary
Outline • Logical compilation model and component flow • WHIRL Intermediate Representation • Inter-Procedural Analysis (IPA) • Loop Nest Optimizer (LNO) and Parallelization • Global optimization (WOPT) • Feedback • Design for debuggability and testability
Logical Compilation Model [Figure: the driver (sgicc/sgif90/sgiCC) forks and execs the front end + IPA (gfec/gfecc/mfef90), the back end (be, as), and the linker (ld). Data path: Src (.c/.C/.f) → WHIRL (.B/.I) → obj (.o) → a.out/.so]
Components of Pro64 • Front end • Interprocedural Analysis and Optimization • Loop Nest Optimization and Parallelization • Global Optimization • Code Generation
Data Flow Relationship Between Modules [Figure: data flow between modules. Recoverable labels: front ends gfec, gfecc, f90; .B and .I files; -IPA with Local IPA and Main IPA; Inliner; Lower to High WHIRL; lower I/O (only for f90); WHIRL levels Very High, High, Mid, Low; LNO (-O3); Main opt (-O2/-O3); Lower Mid WHIRL; Lower all (-O0); CG; WHIRL C output (.w2c.c, .w2c.h); WHIRL Fortran output (.w2f.f); -phase: w=off; "Take either path"]
Front Ends • C front end based on gcc • C++ front end based on g++ • Fortran90/95 front end from MIPSpro
Intermediate Representation IR is called WHIRL • Tree structured, with references to symbol table • Maps used for local or sparse annotation • Common interface between components • Multiple languages, multiple targets • Same IR, 5 levels of representation • Continuous lowering during compilation • Optimization strategy tied to level
IPA Main Stage Analysis: • alias analysis • array section • code layout Optimization (fully integrated): • inlining • cloning • dead function and variable elimination • constant propagation
IPA Design Features • User transparent • No makefile changes • Handles DSOs, unanalyzed objects • Provide info (e.g. alias analysis, procedure properties) smoothly to: • loop nest optimizer • main optimizer • code generator
Loop Nest Optimizer/Parallelizer • All languages (including OpenMP) • Loop level dependence analysis • Uniprocessor loop level transformations • Automatic parallelization
Loop Level Transformations • Loop Fission • Loop Fusion • Loop Unroll and Jam • Loop Interchange • Loop Peeling • Loop Tiling • Vector Data Prefetching • Based on unified cost model • Heuristics integrated with software pipelining • Loop vector dependency info passed to CG
Parallelization • Automatic: array privatization, doacross parallelization, array section analysis • Directive based: OpenMP, integrated with automatic methods
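As an illustration of one transformation from this list, here is a minimal Python sketch of loop tiling applied to a matrix transpose. It is not taken from LNO itself; the function names and the choice of transpose as the kernel are assumptions made for the example. The tiled version visits the same (i, j) pairs as the plain loop nest, only in blocks of `tile` × `tile`.

```python
def transpose(a):
    """Reference loop nest: plain row-major traversal."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[j][i] = a[i][j]
    return out


def transpose_tiled(a, tile):
    """Same computation, with both loops split into tile-sized blocks."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):            # tile loop over rows
        for jj in range(0, n, tile):        # tile loop over columns
            # min() handles a partial tile at the edge of the matrix
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out


matrix = [[i * 5 + j for j in range(5)] for i in range(5)]
tiled_result = transpose_tiled(matrix, 2)
```

A real optimizer would pick the tile size from a cost model of the cache hierarchy; here it is just a parameter.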
Global Optimization Phase • SSA is the unifying technology • Use only SSA as program representation • All traditional global optimizations implemented • Every optimization preserves SSA form • Can reapply each optimization as needed
Pro64 Extensions to SSA • Representing aliases and indirect memory operations (Chow et al, CC 96) • Integrated partial redundancy elimination (Chow et al, PLDI 97; Kennedy et al, CC 98, TOPLAS 99) • Support for speculative code motion • Register promotion via load and store placement (Lo et al, PLDI 98)
Feedback Used throughout the compiler • Instrumentation can be added at any stage • Explicit instrumentation data incorporated where inserted • Instrumentation data maintained and checked for consistency through program transformations.
Design for Debuggability (DFD) and Testability (DFT) • DFD and DFT built in from the start • Can build with extra validity checks • Simple option specification used to: • Substitute components known to be good • Enable/disable full components or specific optimizations • Invoke alternative heuristics • Trace individual phases
Where to Obtain Pro64 Compiler and its Support • SGI Source download http://oss.sgi.com/projects/Pro64/ • University of Delaware Pro64 Support Group http://www.capsl.udel.edu/~pro64 pro64@capsl.udel.edu
PART II Overview of The Pro64 Code Generator
Outline • Code generator flow diagram • WHIRL/CGIR and TARG-INFO • Hyperblock formation and predication (HBF) • Predicate Query System (PQS) • Loop preparation (CGPREP) and software pipelining • Global and local instruction scheduling (IGLS) • Global and local register allocation (GRA, LRA)
Flowchart of Code Generator [Figure: code generator flowchart. Boxes: WHIRL; WHIRL-to-TOP Lowering; CGIR: Quad Op List; Control Flow Opt I, EBO; Hyperblock Formation, Critical-Path Reduction; Process Inner Loops: unrolling, EBO; Loop prep, software pipelining; IGLS: pre-pass; GRA, LRA, EBO; IGLS: post-pass, Control Flow Opt; Control Flow Opt II, EBO; Code Emission. Legend: EBO = extended basic block optimization (peephole, etc.); PQS = Predicate Query System]
From WHIRL to CGIR: An Example (a) Source: int *a; int i; int aa; aa = a[i]; (b) WHIRL: a tree rooted at ST aa, whose child is LD of (a + CVTL32(i) * 4) (c) CGIR: T1 = sp + &a; T2 = ld T1; T3 = sp + &i; T4 = ld T3; T5 = sxt T4; T6 = T5 << 2; T7 = T6; T8 = T2 + T7; T9 = ld T8; T10 = sp + &aa; st T10 T9
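The step from a tree-shaped IR to a flat list of quads can be sketched in a few lines of Python. This is only an illustration of the idea, not the Pro64 lowering code: the tuple encoding of the tree, the opcode names, and the helper names `lower` and `lower_store` are all assumptions. Unlike the slide, the sketch keeps the multiply by 4 rather than strength-reducing it to a shift, and it emits no copy op.

```python
from itertools import count


def lower(tree, quads, temps):
    """Emit quads for an expression tree; return the operand holding its value."""
    op = tree[0]
    if op == "const":
        return str(tree[1])                      # literals are used directly
    if op == "var":                              # stack variable: address, then load
        addr = f"T{next(temps)}"
        quads.append((addr, "add", "sp", f"&{tree[1]}"))
        val = f"T{next(temps)}"
        quads.append((val, "ld", addr, None))
        return val
    if op in ("ld", "sxt"):                      # unary operators
        src = lower(tree[1], quads, temps)
        dst = f"T{next(temps)}"
        quads.append((dst, op, src, None))
        return dst
    lhs = lower(tree[1], quads, temps)           # binary operators
    rhs = lower(tree[2], quads, temps)
    dst = f"T{next(temps)}"
    quads.append((dst, op, lhs, rhs))
    return dst


def lower_store(name, tree):
    """Lower `name = tree` for a stack variable `name`."""
    quads, temps = [], count(1)
    val = lower(tree, quads, temps)
    addr = f"T{next(temps)}"
    quads.append((addr, "add", "sp", f"&{name}"))
    quads.append((None, "st", addr, val))        # a store produces no result
    return quads


# aa = a[i], written as a tree: load from a + sxt(i) * 4
tree = ("ld", ("add", ("var", "a"),
               ("mul", ("sxt", ("var", "i")), ("const", 4))))
quads = lower_store("aa", tree)
```

Each quad is (result, opcode, operand1, operand2), which mirrors the "TOPs are quads, operands/results are TNs" description of CGIR.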
Code Generation Intermediate Representation (CGIR) • TOPs (Target Operations) are “quads” • Operands/results are TNs • Basic block nodes in control flow graph • Load/store architecture • Supports predication • Flags on TOPs (copy ops, integer add, load, etc.) • Flags on operands (TNs)
From WHIRL to CGIR Cont’d • Information passed • alias information • loop information • symbol table and maps
The Target Information Table (TARG_INFO) Objective: • Parameterized description of a target machine and system architecture • Separates architecture details from the compiler’s algorithms • Minimizes compiler changes when targeting a new architecture
The Target Information Table (TARG_INFO) Cont’d • Based on an extension of Cydra tables, with major improvements • Architectures already targeted: • Whole MIPS family • IA-64 • IA-32 • SGI graphics processors (earlier version)
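The separation between target description and compiler algorithm can be illustrated with a toy table. The sketch below is not the TARG_INFO format: the target names, opcodes, and latency numbers are invented for the example. The point is only that the algorithm (`chain_latency`) never mentions a specific machine; retargeting means adding a table entry.

```python
# Hypothetical per-target latency tables (all values are made up).
TARGETS = {
    "target_a": {"ld": 2, "add": 1, "mul": 3},
    "target_b": {"ld": 4, "add": 1, "mul": 2},
}


def chain_latency(opcodes, target):
    """Latency of a chain of dependent ops, read from the target's table."""
    table = TARGETS[target]
    return sum(table[op] for op in opcodes)


chain = ["ld", "mul", "add"]
latency_a = chain_latency(chain, "target_a")
latency_b = chain_latency(chain, "target_b")
```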
Hyperblock Formation and Predicated Execution • Hyperblock: a single-entry, multiple-exit control-flow region (loop body, hammock region, etc.) • Hyperblock formation algorithm • Based on Scott Mahlke’s method [Mahlke96] • But with less aggressive tail duplication
Hyperblock Formation Algorithm Steps: Region Identification → Block Selection → Tail Duplication → If Conversion • Region identification: hammock regions, innermost loops, general regions (path based) • Block selection: paths sorted by priorities (freq., size, length, etc.); inclusion of a path is guided by its impact on resources, scheduling height, and priority level • If conversion: internal branches are removed via predication; predicate reuse Objective: Keep the scheduling height close to that of the highest priority path.
Hyperblock Formation - An Example (a) Source: aa = a[i]; bb = b[i]; switch (aa) { case 1: if (aa < tabsiz) aa = tab[aa]; case 2: if (bb < tabsiz) bb = tab[bb]; default: ans = aa + bb; } (b) CFG [Figure: blocks 1, 2, 4, 5, 6, 7, 8] (c) Hyperblock formation with aggressive tail duplication [Figure: hyperblocks H1 and H2, with duplicated blocks 6’, 7’, 8’]
Hyperblock Formation - An Example Cont’d [Figure: three panels over the same CFG (blocks 1, 2, 4, 5, 6, 7, 8): (a) CFG; (b) hyperblock formation with aggressive tail duplication, giving hyperblocks H1 and H2 with duplicated blocks 6’, 7’, 8’; (c) Pro64 hyperblock formation, giving hyperblocks H1 and H2]
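The if-conversion step can be sketched on the simplest region, a hammock with one compare and two arms. The Python below is an assumption-laden illustration, not the Pro64 algorithm: the op encoding (guard, dest, opcode, a, b), the `cmp.lt` opcode, and the names `if_convert` and `run` are made up here. The compare writes a complementary predicate pair, each arm is guarded by one predicate, and the result is branch-free straight-line code.

```python
import operator

ALU = {"add": operator.add, "sub": operator.sub}


def if_convert(cmp_op, then_ops, else_ops, pt="p1", pf="p2"):
    """Turn `if cmp: then_ops else: else_ops` into predicated straight-line ops."""
    opcode, a, b = cmp_op
    out = [(None, (pt, pf), opcode, a, b)]                  # compare sets both predicates
    out += [(pt, d, o, x, y) for (d, o, x, y) in then_ops]  # guarded by the true predicate
    out += [(pf, d, o, x, y) for (d, o, x, y) in else_ops]  # guarded by the false predicate
    return out


def run(ops, env):
    """Tiny interpreter: an op whose guard predicate is false is nullified."""
    env, preds = dict(env), {}
    for guard, dest, opcode, a, b in ops:
        if guard is not None and not preds[guard]:
            continue
        x = env[a] if isinstance(a, str) else a
        y = env[b] if isinstance(b, str) else b
        if opcode == "cmp.lt":
            pt, pf = dest
            preds[pt], preds[pf] = x < y, not (x < y)
        else:
            env[dest] = ALU[opcode](x, y)
    return env


# if (aa < tabsiz) x = aa + 1; else x = aa - 1;
ops = if_convert(("cmp.lt", "aa", "tabsiz"),
                 [("x", "add", "aa", 1)],
                 [("x", "sub", "aa", 1)])
taken = run(ops, {"aa": 3, "tabsiz": 10})
not_taken = run(ops, {"aa": 30, "tabsiz": 10})
```

Both arms now occupy issue slots on every execution, which is why path inclusion is weighed against resources and scheduling height.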
Features of the Pro64 Hyperblock Formation (HBF) Algorithm • Form “good” vs. “maximal” hyperblocks • Avoid unnecessary duplication • No reverse if-conversion • Hyperblocks are not a barrier to global code motion later in IGLS
Predicate Query System (PQS) • Purpose: gather information and provide interfaces allowing other phases to make queries regarding the relationships among predicate values • PQS functions (examples) BOOL PQSCG_is_disjoint (PQS_TN tn1, PQS_TN tn2) BOOL PQSCG_is_subset (PQS_TN_SET& tns1, PQS_TN_SET& tns2)
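A conservative version of such queries can be sketched by recording, for each compare, the predicate pair it defines and the predicate that qualifies it. The Python class below is a hypothetical model, not the PQS implementation or its interface: it assumes every compare writes a complementary pair and that both results are false when the qualifying predicate is false. Under those assumptions a predicate is a subset of each of its qualifying ancestors, and two predicates are disjoint when their ancestor chains contain a complementary pair.

```python
class PredicateInfo:
    """Hypothetical predicate-relation tracker (names are illustrative)."""

    def __init__(self):
        self._parent = {}    # predicate -> qualifying predicate, or None
        self._partner = {}   # predicate -> complement from the same compare

    def define_pair(self, pt, pf, qual=None):
        """Record `pt, pf = cmp(...)` executed under predicate `qual`."""
        for p, q in ((pt, pf), (pf, pt)):
            self._parent[p] = qual
            self._partner[p] = q

    def _ancestors(self, p):
        chain = []
        while p is not None:
            chain.append(p)
            p = self._parent[p]
        return chain

    def is_subset(self, p, q):
        """True if p being true is known to imply q being true."""
        return q in self._ancestors(p)

    def is_disjoint(self, p, q):
        """True if p and q are known never to be true together."""
        anc_q = set(self._ancestors(q))
        return any(self._partner[a] in anc_q for a in self._ancestors(p))


pqs = PredicateInfo()
pqs.define_pair("p1", "p2")              # p1, p2 = cmp(...)
pqs.define_pair("p3", "p4", qual="p1")   # (p1) p3, p4 = cmp(...)
```

A False answer means "not proven", so a client such as a scheduler or register allocator stays safe by treating it as a possible overlap.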
Loop Preparation and Optimization for Software Pipelining • Loop canonicalization for SWP • Read/Write removal (register aware) • Loop unrolling (resource aware) • Recurrence removal or extension • Prefetch • Forced if-conversion
Pro64 Software Pipelining Method Overview • Test for SWP-amenable loops • Extensive loop preparation and optimization before application [DeTo93] • Use lifetime sensitive SWP algorithm [Huff93] • Register allocation after scheduling based on Cydra 5 [RLTS92, DeTo93] • Handle both while and do loops • Smooth switching to normal scheduling if not successful.
Pro64 Lifetime-Sensitive Modulo Scheduling for Software Pipelining Features • Try to place an op ASAP or ALAP to minimize register pressure • Slack scheduling • Limited backtracking • Operation-driven scheduling framework [Figure: scheduling loop. Compute Estart/Lstart for all unplaced ops → choose a good op to place into the current partial schedule within its Estart/Lstart range → succeed? If no: eject conflicting ops and repeat. If yes: repeat until done, then register allocate]
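The Estart/Lstart computation at the top of that loop can be sketched for the simplest case. The Python below handles only an acyclic dependence graph: loop-carried dependences, the initiation interval, and resource conflicts are all ignored, and the op names and latencies are invented. Estart is the earliest cycle allowed by predecessors, Lstart the latest cycle allowed by successors, and slack is their difference.

```python
def estart_lstart(ops, deps, last):
    """ops: names in topological order; deps: {(src, dst): latency};
    last: latest cycle in which any op may issue."""
    estart = {op: 0 for op in ops}
    for op in ops:                                  # forward pass
        for (src, dst), lat in deps.items():
            if dst == op:
                estart[op] = max(estart[op], estart[src] + lat)
    lstart = {op: last for op in ops}
    for op in reversed(ops):                        # backward pass
        for (src, dst), lat in deps.items():
            if src == op:
                lstart[op] = min(lstart[op], lstart[dst] - lat)
    slack = {op: lstart[op] - estart[op] for op in ops}
    return estart, lstart, slack


ops = ["ld_a", "ld_b", "mul", "add", "st"]
deps = {("ld_a", "mul"): 2, ("mul", "add"): 3,
        ("ld_b", "add"): 2, ("add", "st"): 1}
estart, lstart, slack = estart_lstart(ops, deps, last=6)
most_constrained = min(ops, key=slack.get)   # a slack scheduler places tight ops first
```

Here only `ld_b` has freedom (cycles 0 through 3); placing it late shortens the lifetime of its result, which is the register-pressure idea behind the ASAP/ALAP choice.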
Integrated Global Local Scheduling (IGLS) Method • The basic IGLS framework integrates global code motion (GCM) with local scheduling [MaJD98] • IGLS extended to hyperblock scheduling • Performs profitable code motion between hyperblock regions and normal regions
IGLS Phase Flow Diagram [Figure: phases Hyperblock Scheduling (HBS), Global Code Motion (GCM), and Local Code Scheduling (LCS); steps Block Priority Selection, Motion Selection, Target Selection]
Advantages of the Extended IGLS Method - The Example Revisited • Advantages: • No rigid boundaries between hyperblocks and non-hyperblocks • GCM moves code into and out of a hyperblock according to profitability [Figure: (a) Pro64 hyperblock, with hyperblocks H1 and H2; (b) profitable duplication, with hyperblocks H1, H2, H3 and duplicated block 8’]
Software Pipelining vs. Normal Scheduling [Figure: decision flow. Is the loop a SWP-amenable candidate? No: IGLS → GRA/LRA → IGLS → Code Emission. Yes: inner loop processing and software pipelining; on success go to Code Emission; on failure or if not profitable, fall back to the IGLS path]
Global and Local Register Allocation (GRA/LRA) • LRA-RQ provides an estimate of local register requirements • Allocates global variables using a priority-based register allocator [ChowHennessy90, Chow83, Briggs92] • Incorporates IA-64 specific extensions, e.g. register stack usage [Figure: from prepass IGLS → GRA (LRA register request LRA-RQ; priority-based register allocation with IA-64 extensions) → LRA → to postpass IGLS]
Local Register Allocation (LRA) • Assign_Registers using reverse linear scan • Reordering: depth-first ordering on the DDG [Figure: Assign_Registers; on success, done; on failure, Fix_LRA: the first time, instruction reordering; otherwise spill global, spill local]
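The core of a priority-based allocator can be sketched as follows. This Python is a simplified illustration, not GRA: live ranges are given a precomputed priority, they are colored in decreasing priority order, and a live range with no free register is simply spilled. Live-range splitting, the LRA-RQ estimate, and the IA-64 register stack are left out, and all names and numbers are assumptions.

```python
def priority_color(priorities, interferes, num_regs):
    """priorities: {live_range: priority};
    interferes: set of frozenset({a, b}) pairs that are live at the same time."""
    assignment, spilled = {}, []
    for lr in sorted(priorities, key=priorities.get, reverse=True):
        # Registers already taken by interfering, higher-priority live ranges
        taken = {reg for other, reg in assignment.items()
                 if frozenset((lr, other)) in interferes}
        free = [r for r in range(num_regs) if r not in taken]
        if free:
            assignment[lr] = free[0]
        else:
            spilled.append(lr)       # simplification: spill instead of splitting
    return assignment, spilled


priorities = {"a": 10, "b": 8, "c": 5, "d": 1}
interferes = {frozenset(p) for p in
              [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]}
assignment, spilled = priority_color(priorities, interferes, num_regs=2)
```

With two registers, `c` loses to the higher-priority `a` and `b`, while `d` still gets a register because its only neighbor was spilled.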
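A reverse linear scan over one basic block can be sketched like this. The Python is an illustrative model rather than the LRA code: it assumes every temporary is defined once in the block, has no live-out values, and reports failure by returning None where the real allocator would go on to reorder or spill. Walking backward, the first time a temporary appears as a source is its last use, so a register is claimed there and released again when the definition is reached.

```python
def reverse_linear_scan(ops, num_regs):
    """ops: list of (dest, sources) in program order; dest may be None.
    Returns {temp: register}, or None if the block needs more registers."""
    free = list(range(num_regs))
    reg_of = {}    # final assignment
    live = {}      # temps currently live in the backward walk
    for dest, sources in reversed(ops):
        if dest is not None:
            if dest in live:
                free.append(live.pop(dest))   # reached the definition: range ends
            else:
                if not free:
                    return None
                reg_of[dest] = free[-1]       # dead result: register needed only here
        for src in sources:
            if src not in live:               # last use, seen first when going backward
                if not free:
                    return None
                live[src] = reg_of[src] = free.pop()
    return reg_of


block = [("t1", []), ("t2", []), ("t3", ["t1", "t2"]),
         ("t4", ["t3"]), (None, ["t4"])]
alloc = reverse_linear_scan(block, 2)
too_few = reverse_linear_scan(block, 1)
```

The None result corresponds to the "failed" edge in the slide, where Fix_LRA would try instruction reordering before spilling.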
Future Research Topics for Pro64 Code Generator • Hyperblock formation • Predicate query system • Enhanced speculation support