Compiling for IA-64
Carol Thompson, Optimization Architect, Hewlett Packard
History of ILP Compilers
• CISC era: no significant ILP
  • Compiler is merely a tool to enable use of a high-level language, at some performance cost
• RISC era: advent of ILP
  • Compiler-influenced architecture
  • Instruction scheduling becomes important
• EPIC era: ILP as driving force
  • Compiler-specified ILP
Increasing Scope for ILP Compilation
• Early RISC compilers
  • Basic block scope (delimited by branches & branch targets)
• Superscalar RISC and early VLIW compilers
  • Trace scope (single entry, single path)
  • Superblocks & hyperblocks (single entry, multiple path)
• EPIC compilers
  • Composite regions: multiple entry, multiple path
[Figure: basic blocks, traces, superblocks, and composite regions]
Unbalanced and Unbiased Control Flow
• Most code is not well balanced
  • Many very small blocks
  • Some very large
  • Then and else clauses are frequently unbalanced
    • Number of instructions
    • Pathlength
• Many branches are highly biased
  • But some are not
• Compiler can obtain frequency information from profiling or derive it heuristically
[Figure: example control-flow graph annotated with branch frequencies]
Basic Blocks
• Basic blocks are simple
  • No issues with executing unnecessary instructions
  • No speculation or predication support required
• But, very limited ILP
  • Short blocks offer very little opportunity for parallelism
  • Long latency code is unable to take advantage of issue bandwidth in an earlier block
Traces
• Traces allow scheduling of multiple blocks together
  • Increases available ILP
  • Long latency operations can be moved up, as long as they are on the same trace
• But, unbiased branches are a problem
  • Long latency code in slightly less frequent paths can't move up
  • Issue bandwidth may go unused (not enough concurrent instructions to fill available execution units)
Superblocks and Hyperblocks
• Superblocks and hyperblocks allow inclusion of multiple important paths
  • Long latency code may migrate up from multiple paths
  • Hyperblocks may be fully predicated
  • More effective utilization of issue bandwidth
• But, requires code duplication
• Wholesale predication may lengthen important paths
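To make the duplication concrete, here is a small hedged sketch (the function and profile numbers are invented, not from the talk) of superblock formation by tail duplication: the join block is copied onto the cold path, so the hot path becomes a single-entry region with no incoming side edges.

    /* Before: B3 is a join point, so the hot path B1;B3 cannot be
       scheduled as a single-entry region. */
    int before(int a, int x) {
        int t;
        if (a)
            t = x * 3;        /* B1: hot path (say, 90% by profile) */
        else
            t = x - 1;        /* B2: cold path */
        return t + 7;         /* B3: join block */
    }

    /* After tail duplication: B3 is copied onto the cold path, so B1
       followed by its private copy of B3 forms a superblock that can be
       scheduled and speculated into as one unit. */
    int after(int a, int x) {
        if (a)
            return x * 3 + 7; /* superblock: B1 + duplicated B3 */
        return x - 1 + 7;     /* cold path with its own copy of B3 */
    }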
Composite Regions
• Allow rejoin from non-Region code
  • Wholesale code duplication is not required
• Support full code motion across the Region
• Allow all interesting paths to be scheduled concurrently
• Nested, less important Regions bear the burden of the rejoin
  • Compensation code, as needed
Predication Approaches
• Full predication of entire Region
  • Penalizes short paths
On-Demand Predication
• Predicate (and speculate) as needed
  • Reduce critical path(s)
  • Fully utilize issue bandwidth
• Retain control flow to accommodate unbalanced paths
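As an illustration of what on-demand predication produces, here is a minimal if-conversion sketch (the example is assumed, not taken from the talk); the comments show the fully predicated, IA-64-style form the compiler can emit, so both arms issue in parallel and the branch disappears.

    int absdiff(int a, int b) {
        int d;
        if (a > b)          /*   cmp.gt p1,p2 = a,b  : one compare sets both predicates */
            d = a - b;      /*   (p1) sub d = a,b                                        */
        else
            d = b - a;      /*   (p2) sub d = b,a                                        */
        return d;
    }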
Predicate Analysis
• Instruction scheduler requires knowledge of predicate relationships
  • For dependence analysis
  • For code motion
  • …
• Predicate Query System
  • Graphical representation of predicate relationships
  • Superset, subset, disjoint, …
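One simple way to picture such a query system (this representation is an assumption for illustration, not necessarily HP's implementation) is to model each predicate as the set of paths through the region on which it is true; the relationship queries then reduce to set algebra, as in this C sketch.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t PredSet;   /* one bit per path through the region */

    static bool disjoint(PredSet p, PredSet q)  { return (p & q) == 0; }
    static bool subset(PredSet p, PredSet q)    { return (p & ~q) == 0; }
    static bool superset(PredSet p, PredSet q)  { return subset(q, p); }

    int main(void) {
        /* A compare generates p1 on one outcome and p2 on the other:
           with two paths, p1 covers path 0 and p2 covers path 1. */
        PredSet p1 = 0x1, p2 = 0x2, always = 0x3;
        return (disjoint(p1, p2) && subset(p1, always) && superset(always, p2)) ? 0 : 1;
    }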
Predicate Computation
• Compute all predicates possibly needed
• Optimize
  • To share predicates where possible
  • To utilize parallel compares
  • To fully utilize dual-target compares
Predication and Branch Counts
• Predication reduces branches
  • At both moderate and aggressive optimization levels
Predication & Branch Prediction
• Comparable misprediction rate with predication
  • Despite significantly fewer branches
  • Increased mean time between mispredicted branches
Register Allocation
• Modeled as a graph-coloring problem
• Nodes in the graph represent live ranges of variables
• Edges represent a temporal overlap of the live ranges
• Nodes sharing an edge must be assigned different colors (registers)
[Figure: straight-line code defining and using x, y, and z; x overlaps y and y overlaps z, so the interference graph requires two colors]
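The following toy C sketch (values assumed purely for illustration) builds the interference graph for the x, y, z example and colors it greedily; since x and z never overlap, two colors suffice.

    #include <stdio.h>

    enum { X, Y, Z, N };

    int main(void) {
        /* interference[i][j] = 1 when live ranges i and j overlap in time */
        int interference[N][N] = {{0}};
        interference[X][Y] = interference[Y][X] = 1;  /* x still live when y is defined */
        interference[Y][Z] = interference[Z][Y] = 1;  /* y still live when z is defined */

        const char *name[N] = { "x", "y", "z" };
        int color[N];

        /* greedy coloring: give each node the lowest color unused by
           its already-colored neighbours */
        for (int i = 0; i < N; i++) {
            int used[N] = {0};
            for (int j = 0; j < i; j++)
                if (interference[i][j]) used[color[j]] = 1;
            int c = 0;
            while (used[c]) c++;
            color[i] = c;
        }

        for (int i = 0; i < N; i++)
            printf("%s -> r%d\n", name[i], color[i]);  /* x -> r0, y -> r1, z -> r0 */
        return 0;
    }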
Register Allocation With Control Flow
[Figure: the same definitions and uses of x, y, and z distributed across branch paths; the interference graph still requires two colors]
Register Allocation With Predication
[Figure: the same code after if-conversion, as a single straight-line predicated block]
• Without predicate information, the allocator sees all three live ranges overlapping, so the graph now requires three colors
Predicate Analysis
[Figure: the predicated code, with predicates p0, p1, and p2 guarding the live ranges of x, y, and z]
• p1 and p2 are disjoint: if p1 is TRUE, p2 is FALSE, and vice versa
Register Allocation With Predicate Analysis
[Figure: interference graph with the false edge removed]
• Because p1 and p2 are disjoint, the live ranges they guard never interfere; the graph is now back to two colors
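Continuing the predicate-set sketch from above (again an assumed representation, not HP's code), the allocator can suppress the false edge by testing predicate disjointness alongside temporal overlap.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t PredSet;                        /* path set, as in the earlier sketch */
    typedef struct { int start, end; PredSet guard; } LiveRange;

    static bool overlaps(LiveRange a, LiveRange b) {
        return a.start < b.end && b.start < a.end;
    }

    /* Two ranges interfere only if they overlap in time AND their guarding
       predicates can both be true; disjoint guards mean no edge. */
    static bool interferes(LiveRange a, LiveRange b) {
        bool guards_disjoint = (a.guard & b.guard) == 0;
        return overlaps(a, b) && !guards_disjoint;
    }

    int main(void) {
        /* After if-conversion, the two ranges guarded by p1 and p2 (shown
           here as y and z for concreteness) occupy the same instruction
           window, but the disjoint guards mean no edge is added, so the
           graph needs only two colors again. */
        LiveRange y = { 2, 8, 0x1 }, z = { 3, 9, 0x2 };
        return interferes(y, z) ? 1 : 0;             /* returns 0 */
    }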
Effect of Predicate-Aware Register Allocation
• Reduces register requirements for individual procedures by 0% to 75%
  • Depends upon how aggressively predication is applied
• Average dynamic reduction in register stack allocation for gcc is 4.7%
Object-Oriented Code
• Challenges
  • Small procedures, many indirect (virtual) calls
    • Limits size of regions, scope for ILP
  • Exception handling
  • Bounds checking (Java)
    • Inherently serial: must check before executing load or store
• Solutions
  • Inlining
    • For non-virtual functions or provably unique virtual functions
    • Speculative inlining for most common variant
  • Dynamic optimization (e.g. Java)
    • Make use of dynamic profile
  • Speculative execution
    • Guarantees correct exception behavior
    • Liveness analysis of handlers
    • Architectural support for speculation ensures recoverability
Method Calls
• Barrier between execution streams
• Often, location of called method must be determined at runtime
• Costly “identity check” on object must complete before method may begin
  • Even if the call nearly always goes to the same place
• Little ILP
[Figure: call-dependent code waits while the target method is resolved among several possible targets]
Speculating Across Method Calls
• Compiler predicts target method
  • Profiling
  • Current state of class hierarchy
• Predicted method is inlined
  • Full or partial
• Speculative execution of called method begins while actual target is determined
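A hedged C-level sketch of speculative inlining of an indirect call (function names and bodies are invented): the profile-predicted target is inlined behind a cheap identity check on the target pointer, with the real indirect call kept as the fallback. On IA-64, control speculation can let the inlined body start executing even before that identity check completes.

    #include <stdio.h>

    typedef int (*area_fn)(int);

    static int circle_area(int r) { return 3 * r * r; }  /* dominant target per profile */
    static int square_area(int s) { return s * s; }

    /* Original code: an indirect (virtual-style) call */
    static int area_indirect(area_fn f, int n) { return f(n); }

    /* Speculatively inlined version */
    static int area_speculative(area_fn f, int n) {
        if (f == circle_area)      /* cheap identity check on the predicted target  */
            return 3 * n * n;      /* inlined body of the dominant method           */
        return f(n);               /* rare case: call whatever the target really is */
    }

    int main(void) {
        printf("%d %d\n", area_indirect(circle_area, 2),      /* 12 */
                          area_speculative(square_area, 2));  /* 4  */
        return 0;
    }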
Speculation Across Method Calls
[Figure: the dominant called method is inlined and executes speculatively while the target method is resolved; the other target method is called only if needed]
Bounds & Null Checks
• Checks inhibit code motion
• Null checks: x = y.foo; becomes
    if( y == null ) throw NullPointerException;
    x = y.foo;
• Bounds checks: x = a[i]; becomes
    if( a == null ) throw NullPointerException;
    if( i < 0 || i >= a.length ) throw ArrayIndexOutOfBoundsException;
    x = a[i];
Speculating Across Bounds Checks
• Bounds checks rarely fail
• The checked form of x = a[i]; can be speculated:
    ld.s t = a[i];
    if( a == null ) throw NullPointerException;
    if( i < 0 || i >= a.length ) throw ArrayIndexOutOfBoundsException;
    chk.s t
    x = t;
• Long latency load can begin before the checks
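The deferred-fault behaviour that makes this legal can be pictured with the following C simulation (types and helpers are invented; real code would use the ld.s/chk.s instructions, not C): the speculative load records a NaT-style token instead of faulting, and the later check falls back to a non-speculative reload or the exception path.

    #include <stdio.h>
    #include <stddef.h>

    typedef struct { int value; int nat; } SpecVal;   /* register value + NaT ("deferred fault") bit */

    /* ~ ld.s: load early; on an address that would fault, set the NaT bit
       instead of raising the exception.  The simulation stands in for the
       hardware's fault detection by checking the base and bound itself. */
    static SpecVal spec_load(const int *a, size_t len, size_t i) {
        SpecVal v = { 0, 1 };
        if (a != NULL && i < len) { v.value = a[i]; v.nat = 0; }
        return v;
    }

    static int checked_index(const int *a, size_t len, size_t i) {
        SpecVal t = spec_load(a, len, i);       /* long-latency load starts here */
        /* ...the null and bounds checks execute while the load is in flight... */
        if (a == NULL || i >= len) return -1;   /* stand-in for throwing the exception */
        if (t.nat) return a[i];                 /* ~ chk.s: recovery code redoes the load */
        return t.value;
    }

    int main(void) {
        int data[4] = { 10, 20, 30, 40 };
        printf("%d %d\n", checked_index(data, 4, 2),   /* 30 */
                          checked_index(NULL, 0, 7));  /* -1 */
        return 0;
    }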
Exception Handling
• Exception handling inhibits motion of subsequent code
    if( y.foo ) throw MyException;
    x = y.bar + z.baz;
Speculation in the Presence of Exception Handling
• Execution of subsequent instructions may begin before the exception is resolved
    if( y.foo ) throw MyException;
    x = y.bar + z.baz;
  becomes
    ld t1 = y.foo
    ld.s t2 = y.bar
    ld.s t3 = z.baz
    add x = t2 + t3
    if( t1 ) throw MyException;
    chk.s x
Dependence Graph for Instruction Scheduling
    if( n < p->count ) {
        (*log)++;
        return p->x[n];
    } else {
        return 0;
    }
[Figure: dependence graph over the generated IA-64 code]
    add t1 = 8,p
    ld4 count = [t1]
    cmp4.lt p1,p2=n,count
    (p1) ld4 t3 = [log]
    (p1) add t2 = 1,t3
    (p1) st4 [log] = t2
    mov out0 = 0
    (p1) ld4 t3 = [p]
    shladd t4 = n,4,t3
    (p1) ld4 out0 = [t4]
    br.ret rp
Dependence Graph with Predication & Speculation
• During dependence graph construction, potentially control- and data-speculative edges and nodes are identified
• Check nodes are added where possibly needed (note that only data speculation checks are shown here)
    add t1 = 8,p
    ld4 count = [t1]
    cmp4.lt p1,p2=n,count
    (p1) ld4 t3 = [log]
    (p1) add t2 = 1,t3
    (p1) st4 [log] = t2
    mov out0 = 0
    (p1) ld4 t3 = [p]
    chk.a p
    shladd t4 = n,4,t3
    chk.a t4
    (p1) ld4 out0 = [t4]
    br.ret rp
Dependence Graph with Predication & Speculation
• Speculative edges may be violated; here the graph is re-drawn to show the enhanced parallelism
• Note that the speculation of both writes to the out0 register would require insertion of a copy; the scheduler must consider this in its scheduling
• Nodes with sufficient slack (e.g. writes to out0) will not be speculated
[Figure: re-drawn dependence graph containing the nodes]
    (p1) ld4 t3 = [p]
    (p1) ld4 t3 = [log]
    add t1 = 8,p
    (p2) mov out0 = 0
    shladd t4 = n,4,t3
    (p1) add t2 = 1,t3
    ld4 count = [t1]
    (p1) ld4 out0 = [t4]
    cmp4.lt p1,p2=n,count
    (p1) st4 [log] = t2
    chk.a p
    chk.a t4
    br.ret rp
Conclusions
• IA-64 compilers push compiler complexity further
  • However, the technology is a logical progression from today's
• Today's RISC compilers
  • Are more complex
  • Are more reliable
  • And deliver more performance than those of the early days
• Complexity trend is mirrored in both hardware and applications
  • Need a balance to maximize benefits from each