Sami YEHIA and Olivier TEMAM LRI, Paris South University France

From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance Without ILP or Speculation

Scaling Up Processors
• Larger pipelines, caches, instruction windows and reservation stations
• Aggressive speculation mechanisms: branch prediction, value prediction, data prefetching...
• Rely on ILP exploitation
• What about scaling with little ILP?

Concept
• A sequence of instructions is a set of functions
• Combinatorial functions
• 2^(64*num_registers) inputs! (Theoretically)
Program (inputs r1, r2, r3, …, rn):
  …
  addq r1,r2,r3
  subq r3,10,r4
  …
  sll r5,6,r6
  addq r5,r5,r4
Logic circuit: input bits r1_63, r1_62, r1_61, …, r1_1, r1_0; output bits f1_63, f1_62, f1_61, …, f1_1, f1_0
r6 = f1(r1,r2,…,rn)
r4 = f2(r1,r2,…,rn)
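The idea that a straight-line sequence is just a combinatorial function of its live-in registers can be checked on a toy scale. The sketch below is an illustration only: it assumes a 4-bit register width instead of 64, a shortened hypothetical sequence (with a shift by 1 instead of 6), and it reads the slide's input count as 2^(width*num_registers) input combinations.

```python
from itertools import product

WIDTH = 4                     # toy register width in bits (the slide assumes 64)
MASK = (1 << WIDTH) - 1
LIVE_INS = ("r1", "r2", "r5")  # registers read before being written

def run(r1, r2, r5):
    """Execute the hypothetical sequence one instruction at a time."""
    r3 = (r1 + r2) & MASK     # addq r1,r2,r3
    r4 = (r3 - 10) & MASK     # subq r3,10,r4
    r6 = (r5 << 1) & MASK     # sll r5,1,r6 (shift amount shortened for the toy width)
    return r4, r6

# The whole sequence as a lookup table: one entry per combination of live-in values.
table = {regs: run(*regs)
         for regs in product(range(1 << WIDTH), repeat=len(LIVE_INS))}

print(len(table))             # number of input combinations
print(table[(3, 9, 5)])       # (r4, r6) for r1=3, r2=9, r5=5
```

The table has 2^(4*3) entries here; at 64 bits the same enumeration is out of reach, which is why the count is only a theoretical bound.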

Principles
• An « independent » function for each output
  f_r3(r9,r10) = r9 + r10 - 1
  f_r4(r9,r10) = sign_extension(r9 + r10 - 1)_{31:0}
  f_r5(r9,r10) = ((r9 + r10 - 1) << 1) >> 1
  f_br(r9,r10) = (r9 + r10 - 1)  (((r9 + r10 - 1) << 1) >> 1)
[Figure: DFG]
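A minimal sketch of what "one independent function per output" means: each output is computed directly from the inputs r9 and r10, without waiting for the intermediate registers. The dependent instruction sequence in `run_sequence` is an assumption reconstructed from the functions on the slide, and the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def sext32(x):
    """Sign-extend the low 32 bits of x to a 64-bit two's-complement value."""
    x &= 0xFFFFFFFF
    return (x | ~0xFFFFFFFF) & MASK64 if x & 0x80000000 else x

def run_sequence(r9, r10):
    """Assumed dependent sequence: every step waits for the previous result."""
    r3 = (r9 + r10) & MASK64
    r3 = (r3 - 1) & MASK64
    r4 = sext32(r3)                      # depends on r3
    r5 = ((r3 << 1) & MASK64) >> 1       # depends on r3
    return r3, r4, r5

# Collapsed form: each output depends only on the inputs r9 and r10.
def f_r3(r9, r10):
    return (r9 + r10 - 1) & MASK64

def f_r4(r9, r10):
    return sext32(r9 + r10 - 1)

def f_r5(r9, r10):
    return (((r9 + r10 - 1) << 1) & MASK64) >> 1

r9, r10 = 2**31, 5
print(run_sequence(r9, r10) == (f_r3(r9, r10), f_r4(r9, r10), f_r5(r9, r10)))
```

In software the two forms cost the same; the point of the slide is that in hardware the three collapsed functions can be evaluated in parallel.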

Hardware Operator
• Eliminate dependencies to calculate a+b+c
[Figure: a and b feed a first adder producing f1; f1 and c feed a second adder producing out]
• r10 + r9 - 1 to hardware operators
  f1_i = f'(a_i, b_i, cout1_{i-1})
  cout1_i = f'_c(a_i, b_i, cout1_{i-1})
  out_i = f''(f1_i, c_i, cout2_{i-1}) = f''(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
  cout2_i = f''_c(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
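The bit-slice equations can be illustrated with a small software model. This is a sketch under assumptions: f' and f'' are taken to be the sum output of a standard full adder (XOR) and f'_c and f''_c its carry output (majority), which the slide does not spell out. Each output bit i is produced from a_i, b_i, c_i and the two incoming carries only.

```python
MASK64 = (1 << 64) - 1

def add3_bitsliced(a, b, c, width=64):
    """Compute (a + b + c) mod 2**width with two interleaved carry chains."""
    out = 0
    cout1 = cout2 = 0                                        # cout1_{-1}, cout2_{-1}
    for i in range(width):
        a_i, b_i, c_i = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        f1_i = a_i ^ b_i ^ cout1                             # f'  (assumed full-adder sum)
        cout1_next = (a_i & b_i) | (cout1 & (a_i ^ b_i))     # f'_c (assumed carry)
        out_i = f1_i ^ c_i ^ cout2                           # f''
        cout2_next = (f1_i & c_i) | (cout2 & (f1_i ^ c_i))   # f''_c
        out |= out_i << i
        cout1, cout2 = cout1_next, cout2_next
    return out

# r10 + r9 - 1, with -1 encoded as the all-ones 64-bit constant
r9, r10 = 40, 3
print(add3_bitsliced(r10, r9, MASK64))
```

The second adder's bit i never waits for the complete result of the first adder, only for f1_i and the carries from bit i-1.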

Complexity Effectiveness
• Scalability of ILP vs. functions
[Figure: Performance / Complexity plot; curves: ILP exploitation, Functions]

Related Work
• ASIC
• General-purpose context
• 3-1 Interlock Collapsing ALU [Y. Sazeides, S. Vassiliadis and J. Smith, MICRO-29, 1996]
• Chimaera [Z. Ye et al., ISCA-27, 2000]
• Grid Processors [R. Nagarajan et al., MICRO-34, 2001]
• Cascade one or more hardware operators to execute specific functions
[Figure: cascaded hardware operators (AND, OR, XOR, Adder)]

Building Functions
• From traces of instructions to configuration macros
• Compilation toolchain to study:
  • Potential of the approach
  • Performance analysis on a superscalar processor
[Figure label: Traces]

Potential of the Approach
• Theoretical speedup
• Cuts: limits to DFG collapsing (height)
• Number of inputs
• Non-collapsible instructions
• Load instructions (27.7%)
• Carries from upper significant bits
• The lower the ILP, the higher the speedup
[Figure: DFG of ops collapsed into functions F1 and F2, with a cut at the load (@, LD, mem)]
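The effect of cuts on the DFG height can be sketched with a toy model. Everything below is an assumption made for illustration, not the authors' toolchain: every instruction, load, and collapsed function takes one cycle, each value is written once, loads always cut, and a function is also cut when its input set would exceed `max_inputs`. The trace and the name `critical_path` are hypothetical.

```python
def critical_path(trace, collapse, max_inputs=4):
    """Return the DFG height in cycles, with or without collapsing."""
    ready = {}    # value name -> cycle at which the value is available
    inputs = {}   # value name -> inputs of the function that produces it
    for op, dst, srcs in trace:
        for s in srcs:
            ready.setdefault(s, 0)       # live-in values are ready at cycle 0
            inputs.setdefault(s, {s})
        if collapse and op != "ld":
            # Fold the producers of the sources into one function.
            merged = set().union(*(inputs[s] for s in srcs))
            if len(merged) > max_inputs:  # too many inputs: cut here
                merged = set(srcs)
        else:
            merged = set(srcs)
        ready[dst] = 1 + max((ready[i] for i in merged), default=0)
        # A load result is not folded into later logic: it becomes a plain input.
        inputs[dst] = {dst} if (op == "ld" or not collapse) else merged
    return max(ready.values())

trace = [
    ("add", "t1", ("a", "b")),
    ("sll", "t2", ("t1",)),
    ("add", "addr", ("t2", "c")),
    ("ld",  "v", ("addr",)),       # load: cut between F1 and F2
    ("add", "t3", ("v", "d")),
    ("sub", "t4", ("t3", "a")),
]

base = critical_path(trace, collapse=False)
collapsed = critical_path(trace, collapse=True)
print(base, collapsed, base / collapsed)
```

Lowering `max_inputs` to 2 in this model forces extra cuts and raises the collapsed height, which is the trade-off the "Number of inputs" bullet refers to.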

Theoretical Speedup

Number of Inputs

Non Collapsable Instructions

Implementation
• rePlay Framework

Performance Evaluation

rePlay Optimization Engine Delay
• Functions built “offline”

Latency of Function units

Future Work
• Address prediction to overcome load cuts
[Figure: DFG with a load cut (op, @, LD, mem, F1, F2) and the same DFG with Address Prediction & Cache Preloading using a predicted address @’]

Q & A

Carries from Upper Significant Bits

Optimization Engine Delay

Latency of Function units
