Code Size Efficiency in Global Scheduling for ILP Processors

Huiyang Zhou, Tom Conte
TINKER Research Group
Department of Electrical & Computer Engineering
North Carolina State University

Outline
• Introduction
• Quantitative measure of code size efficiency
• Best code size efficiency for a given code size limit
• Optimal code size efficiency for a program
• Summary
• Future work

Introduction
• Instruction level parallelism (ILP) vs. static code size
  • Region enlarging optimizations usually enhance ILP
    • Cyclic scheduling: loop unrolling, loop peeling, etc.
    • Acyclic scheduling: tail duplication, recovery code, etc.
• I-cache and ITLB performance vs. static code size
  • Larger code usually means larger I-cache footprint
• Trade-off between the conflicting effects of code size increase
  • Especially in acyclic global scheduling

Background of Treegion Scheduling
• Treegion scheduling
  • An acyclic scheduling technique
  • Two phases
    • Treegion formation
    • Treegion-based instruction scheduling: Tree Traversal Scheduling (TTS) (HPCA-4, LCPC’01)
• Treegion
  • Basic scheduling unit
  • A single-entry / multiple-exit nonlinear region whose CFG forms a tree (i.e., no merge points and no back-edges within a treegion)
[Figure: example CFG with basic blocks BB1 to BB6 grouped into two treegions, Tree1 and Tree2]

Background of Treegion Scheduling
• Treegion examples
[Figure: a CFG with BB1 to BB6 grouped into Tree1 and Tree2, next to the same CFG with duplicated blocks BB4’, BB5’, BB6’ forming Tree 1’]
Natural treegion: a treegion formed without tail duplication (i.e., natural treegion formation does not increase code size).
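Natural treegion formation can be sketched as a partition of the CFG in which every merge point starts a new tree. The sketch below is a minimal illustration, not the authors' implementation: the function name `natural_treegions`, the dictionary CFG encoding, and the example edges (BB4 taken as the merge point of BB2 and BB3) are assumptions made for the example.

```python
from collections import defaultdict

def natural_treegions(cfg, entry):
    """Partition a CFG into natural treegions (no tail duplication).

    cfg maps each block name to a list of successor block names.
    A successor with exactly one incoming edge is absorbed into the
    current tree; any other successor (a merge point or back-edge
    target) becomes the root of its own treegion.
    """
    pred_count = defaultdict(int)
    for succs in cfg.values():
        for s in succs:
            pred_count[s] += 1

    treegions = []
    roots = [entry]
    seen_roots = {entry}
    while roots:
        root = roots.pop(0)
        tree = []
        stack = [root]
        while stack:
            block = stack.pop()
            tree.append(block)
            for s in cfg.get(block, []):
                if s in seen_roots:
                    continue              # already the root of some treegion
                if pred_count[s] == 1:
                    stack.append(s)       # single predecessor: stays in this tree
                else:
                    seen_roots.add(s)     # merge point: starts a new treegion
                    roots.append(s)
        treegions.append(tree)
    return treegions

# Hypothetical edges for the BB1..BB6 example (assumed, not from the slides).
example_cfg = {
    "BB1": ["BB2", "BB3"],
    "BB2": ["BB4"],
    "BB3": ["BB4"],
    "BB4": ["BB5", "BB6"],
    "BB5": [],
    "BB6": [],
}
trees = natural_treegions(example_cfg, "BB1")
```

Under these assumed edges the result contains two treegions, one rooted at BB1 and one rooted at BB4. No block is copied, so code size is unchanged.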

Code Size Effects in Treegion Scheduling
• Tail duplication increases code size
• General operation combining reduces code size
[Figure: operation combining example in the tail-duplicated treegion. Before: one block contains R1=R3+R4 and BB5’ contains R7=R3+R4, R9=R7*4. After: the redundant R7=R3+R4 is removed and BB5’ contains R9=R1*4.]

Quantitative Measure of Code Size Efficiency
• ILP vs. static code size
Havanki’s heuristic: a previously proposed treegion formation heuristic [HPCA-4].

Code Size Efficiency for Any Code Size Related Optimizations
• Use the ratio of IPC changes over code size changes as an indication of code size efficiency.
  • Average code size efficiency
  • Instantaneous code size efficiency
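The two ratios can be sketched on a list of (code size, static IPC) points. This is a minimal sketch under stated assumptions: the average efficiency is measured from the baseline point to the current one, the instantaneous efficiency between two consecutive points, and the numbers are made up for illustration rather than taken from the results.

```python
def average_efficiency(points, i):
    """IPC change over code size change from the baseline points[0] to points[i]."""
    size0, ipc0 = points[0]
    size_i, ipc_i = points[i]
    return (ipc_i - ipc0) / (size_i - size0)

def instantaneous_efficiency(points, i):
    """IPC change over code size change between points[i-1] and points[i]."""
    size_prev, ipc_prev = points[i - 1]
    size_i, ipc_i = points[i]
    return (ipc_i - ipc_prev) / (size_i - size_prev)

# Hypothetical (relative code size, static IPC) points, baseline first.
points = [(1.00, 2.0), (1.02, 2.3), (1.05, 2.5), (1.30, 2.6)]

for i in range(1, len(points)):
    print(i, average_efficiency(points, i), instantaneous_efficiency(points, i))
```

With points like these the instantaneous efficiency drops sharply for the last step while the average efficiency declines more slowly, which is the behaviour the two measures are meant to separate.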

Average and Instantaneous Code Size Efficiency
[Figure: static IPC vs. code size curve through points A0 to A4]

Estimate Static IPC Before Scheduling
• Use the expected execution time to calculate the static IPC
  For a multi-path region:
• Now, IPC changes can be calculated as execution time saved by the optimization.
Example: [Figure: tree1 and tree2 compared with Tree1’]
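One plausible reading of the expected-execution-time estimate is sketched below. It is an assumption, not the formula from the talk: the expected time of a multi-path region is taken as the probability-weighted sum of estimated path lengths, and static IPC as expected operations over expected time. The `Path` type and its fields are hypothetical names introduced for the example.

```python
from dataclasses import dataclass

@dataclass
class Path:
    probability: float  # profile probability of taking this root-to-exit path
    cycles: int         # estimated schedule length of the path, in cycles
    ops: int            # operations executed along the path

def expected_time(paths):
    """Probability-weighted execution time of a multi-path region (assumed form)."""
    return sum(p.probability * p.cycles for p in paths)

def static_ipc(paths):
    """Expected operations divided by expected execution time (assumed form)."""
    expected_ops = sum(p.probability * p.ops for p in paths)
    return expected_ops / expected_time(paths)

def time_saved(paths_before, paths_after):
    """Execution time saved by an optimization, per region entry."""
    return expected_time(paths_before) - expected_time(paths_after)

# Hypothetical two-path region before and after an optimization.
before = [Path(0.75, 4, 12), Path(0.25, 8, 16)]
after = [Path(0.75, 3, 12), Path(0.25, 8, 16)]
```

In this made-up example the optimization shortens only the frequent path, so the saving is weighted by that path's probability.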

Optimal Code Size Efficiency For A Given Code Size Limit
For a fixed code size, try to maximize the static IPC, i.e., maximize the average code size efficiency.
[Figure: static IPC vs. code size, with the natural treegion point and the size limit marked]

Optimal Tail Duplication Under Code Size Constraint
1. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
2. Find the one with the best code size efficiency.
3. If the selected candidate satisfies the code size constraint, perform the tail duplication and update the code size efficiencies of the candidates that are affected by the tail duplication process.
4. Repeat steps 2-3 until the code size limit is reached.
[Figure: IPC vs. relative code size, with the code size limit marked]
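The selection loop above can be sketched as a greedy pick by efficiency. This is a minimal sketch, not the authors' code: the function name, the candidate encoding as (IPC gain, size cost) pairs, and the optional `update` callback are hypothetical, and dropping a candidate that does not fit (rather than stopping) is an assumption, since the steps do not say what happens in that case.

```python
def greedy_tail_duplication(candidates, size_limit, update=None):
    """Pick tail duplication candidates by best efficiency under a size limit.

    candidates: dict mapping a name to (ipc_gain, size_cost), size_cost > 0.
    update:     optional callback update(chosen_name, remaining) that may
                rewrite entries in `remaining` affected by the duplication.
    Returns (chosen names in order, total size used).
    """
    remaining = dict(candidates)
    chosen = []
    used = 0
    while remaining:
        # Find the candidate with the best gain/cost ratio.
        best = max(remaining, key=lambda n: remaining[n][0] / remaining[n][1])
        gain, cost = remaining.pop(best)
        if used + cost > size_limit:
            continue  # assumption: skip candidates that exceed the limit
        used += cost
        chosen.append(best)
        if update is not None:
            update(best, remaining)  # refresh efficiencies of affected candidates
    return chosen, used

# Hypothetical candidates: name -> (IPC gain, code size cost in operations).
example = {"BB4": (0.30, 10), "BB7": (0.10, 2), "BB9": (0.05, 20)}
picked, size_used = greedy_tail_duplication(example, size_limit=15)
```

Here BB7 is taken first because its ratio is highest, BB4 still fits afterwards, and BB9 is rejected because it would exceed the limit.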

Processor Model
Specification
Execution: Dispatch/Issue/Retire bandwidth: 8; Universal function units: 8; Operation latency: ALU, ST, BR: 1 cycle; LD, floating-point (FP) add/subtract: 2 cycles.
I-cache: Compressed (zero-nop), two banks, each bank 2-way 16KB; Line size: 16 operations of 4 bytes each; Miss latency: 12 cycles.
D-cache: Size/Associativity/Replacement: 64KB/4-way/LRU; Line size: 32 bytes; Miss penalty: 14 cycles.
Branch predictor: G-share style multiway branch prediction [20]; Branch prediction table: 2^14 entries; Branch target buffer: 2^14 entries/8-way/LRU; Branch misprediction penalty: 10 cycles.

Results: ILP vs. Code Size
[Figure: results chart with series labeled 0%, 2%, 5%, 30%, 80%]

Results: ILP vs. Code Size (cont.)
[Figure: results chart with series labeled 0%, 2%, 5%, 30%, 80%]
Reason: only a very small part of the program is frequently executed.

Optimal Code Size Efficiency
• Definition: the point where the ‘diminishing returns’ start
• Finding the optimal code size efficiency
[Figure: IPC vs. relative code size curve with points A and A’ and line l]

Finding the Optimal Code Size Efficiency
• K is the slope of line l
[Figure: K vs. relative code size, with levels K1 and K2 and the point A or A’ marked]
A threshold on the first derivative of the IPC vs. code size curve, which is simply a threshold on the instantaneous code size efficiency.

Finding the Optimal Code Size Efficiency (cont.)
• Meaning of K1 and K2
  • K1 and K2 are the slopes of the lines l1 and l2.
  • The range (K1 – K2) determines the robustness of the threshold scheme.
  • Point B → threshold set to K1
  • Point C → threshold set to K2
[Figure: IPC vs. relative code size curve with points A, B, C and lines l1, l2]

Algorithm for Finding the Optimal Code Size Efficiency
• Set the threshold k anywhere between tan(π/12) and tan(π/6).
• Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
• If there is a candidate whose instantaneous code size efficiency is above the threshold, duplicate the candidate and update the efficiency of the affected candidates; repeat until there are no more such candidates.
When the expected execution time is used, the threshold scheme becomes (derivation details in ref [21])
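The threshold scheme can be sketched as follows. This is an illustrative sketch under stated assumptions, not the form derived in ref [21]: each candidate is encoded as a hypothetical (relative IPC gain, relative code size increase) pair so that its slope is comparable with tan of an angle, and among the candidates above the threshold the steepest one is duplicated first, an ordering the steps above do not specify.

```python
import math

def threshold_tail_duplication(candidates, k, update=None):
    """Duplicate every candidate whose instantaneous efficiency exceeds k.

    candidates: dict mapping a name to (relative_ipc_gain, relative_size_increase),
                with relative_size_increase > 0.
    update:     optional callback update(chosen_name, remaining) that may
                rewrite the entries affected by a duplication.
    Returns the chosen names in the order they were duplicated.
    """
    remaining = dict(candidates)
    chosen = []
    while True:
        above = [n for n, (gain, cost) in remaining.items() if gain / cost > k]
        if not above:
            break  # no candidate is above the threshold any more
        # Assumption: take the steepest candidate first.
        best = max(above, key=lambda n: remaining[n][0] / remaining[n][1])
        remaining.pop(best)
        chosen.append(best)
        if update is not None:
            update(best, remaining)
    return chosen

K_HIGH = math.tan(math.pi / 6)   # about 0.577
K_LOW = math.tan(math.pi / 12)   # about 0.268

# Hypothetical candidates with slopes 3.0, 0.5 and 0.08.
example = {"BB4": (0.06, 0.02), "BB7": (0.01, 0.02), "BB9": (0.004, 0.05)}
strict = threshold_tail_duplication(example, K_HIGH)
loose = threshold_tail_duplication(example, K_LOW)
```

With these made-up slopes the stricter threshold accepts only BB4, while the looser one also accepts BB7. Candidates far above or far below the whole range are unaffected by where k is set within it.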

Results for Optimal Code Size Efficiency
• Varying the threshold from tan(π/12) to tan(π/6), the threshold scheme finds the optimal efficiency accurately.
• Use m88ksim as an example
[Figure: m88ksim results chart with series labeled 0%, 2%, 5%, 10%, 20%]

I-Cache Impacts of the Code Size Increase
Code size impacts and locality impacts (ref [3])

I-Cache Impacts of the Code Size Increase (cont.)
Denser schedule of optimal efficiency results

I-Cache Impacts of the Code Size Increase (cont.)
The combined impact

Processor Performance
On average, a significant speedup in dynamic IPC (17% over natural treegion) at the cost of a 2% code size increase.

Conclusions
• Quantitative measure of the code size efficiency: the ratio of IPC changes over code size increase
• Best code size efficiency for a given code size limit
  • Results
    • Significant but varying impact on IPC
• Optimal efficiency: simple yet robust threshold scheme to find the ‘knee’ of the curve
  • Results
    • Improved I-cache performance (4%)
    • Significant speedup (17%)
    • Moderate static code size increase (2%)
• Future Work
  • Combine with other optimizations, e.g., loop unrolling.

Contact Information
Huiyang Zhou: hzhou@eos.ncsu.edu
Tom Conte: conte@eos.ncsu.edu
TINKER Research Group, North Carolina State University
www.tinker.ncsu.edu
