From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance Without ILP or Speculation
Sami YEHIA and Olivier TEMAM
LRI, Paris South University, France
Scaling Up Processors
• Larger pipelines, caches, instruction windows and reservation stations
• Aggressive speculation mechanisms: branch prediction, value prediction, data prefetching, ...
• Rely on ILP exploitation
• What about scaling with little ILP?
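Concept
[Figure: a program fragment (… addq r1,r2,r3; subq r3,10,r4; … sll r5,6,r6; addq r5,r5,r4) with input registers r1, r2, r3, …, rn is mapped to a logic circuit whose inputs are the register bits r1_63, r1_62, r1_61, …, r1_1, r1_0 and whose outputs are the function bits f1_63, f1_62, f1_61, …, f1_1, f1_0]
r6 = f1(r1,r2,…,rn)
r4 = f2(r1,r2,…,rn)
• A sequence of instructions is a set of functions
• Combinatorial functions
• 2^(64*num_registers) possible inputs! (theoretically)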
Principles
• An "independent" function for each output
f_r3(r9,r10) = r9 + r10 - 1
f_r4(r9,r10) = sign_extension(r9 + r10 - 1)[31:0]
f_r5(r9,r10) = ((r9 + r10 - 1) << 1) >> 1
f_br(r9,r10) = (r9 + r10 - 1) [operator missing in source] (((r9 + r10 - 1) << 1) >> 1)
[Figure: data-flow graph (DFG)]
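The idea of one independent function per output can be illustrated with a short Python sketch. This is only an illustration under stated assumptions: the slide does not show the original instruction sequence, so `run_sequence` is a hypothetical reconstruction, the right shift is assumed to be logical, registers are assumed to be 64 bits wide, and the branch function is omitted because its operator is missing in the source.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit registers


def sext32(x):
    """Sign-extend the low 32 bits of x to a 64-bit two's-complement value."""
    low = x & 0xFFFFFFFF
    if low & 0x80000000:
        return (low | ~0xFFFFFFFF) & MASK64
    return low


def run_sequence(r9, r10):
    """Hypothetical dependent sequence: r4 and r5 both wait for r3."""
    r3 = (r9 + r10 - 1) & MASK64
    r4 = sext32(r3)
    r5 = ((r3 << 1) & MASK64) >> 1  # assumed logical right shift
    return r3, r4, r5


# One function per output, each expressed only in terms of the inputs r9, r10.
def f_r3(r9, r10):
    return (r9 + r10 - 1) & MASK64


def f_r4(r9, r10):
    return sext32((r9 + r10 - 1) & MASK64)


def f_r5(r9, r10):
    return ((((r9 + r10 - 1) & MASK64) << 1) & MASK64) >> 1


def functions(r9, r10):
    """Evaluate all outputs; no function depends on another's result."""
    return f_r3(r9, r10), f_r4(r9, r10), f_r5(r9, r10)
```

Because each function reads only r9 and r10, the three outputs no longer form a dependence chain, which is the property the hardware operators exploit.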
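Hardware Operator
• Eliminate dependencies to calculate a+b+c
[Figure: two cascaded adders; a and b feed the first adder (+), whose output f1 feeds the second adder (+) together with c to produce out]
• r10 + r9 - 1 to hardware operators
f1_i = f'(a_i, b_i, cout1_{i-1})
cout1_i = f'_c(a_i, b_i, cout1_{i-1})
out_i = f''(f1_i, c_i, cout2_{i-1}) = f''(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
cout2_i = f''_c(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})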
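The per-bit equations for a+b+c can be checked with a small bit-level model. The Python sketch below is an illustration, not the authors' hardware design: it assumes ordinary full-adder logic for f', f'_c, f'' and f''_c (the slide does not define them), a 64-bit width, and the hypothetical name `add3_collapsed`.

```python
MASK64 = (1 << 64) - 1  # assumption: 64-bit operands


def add3_collapsed(a, b, c, width=64):
    """Compute (a + b + c) mod 2**width bit by bit.

    Each output bit depends only on a_i, b_i, c_i and the two incoming
    carries, so no full intermediate sum a + b is ever materialized.
    """
    out = 0
    cout1 = 0  # carry chain of the first adder (a + b)
    cout2 = 0  # carry chain of the second adder (f1 + c)
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        ci = (c >> i) & 1
        # f1_i = f'(a_i, b_i, cout1_{i-1}); needed here only for the carries
        f1i = ai ^ bi ^ cout1
        # out_i = f''(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
        outi = ai ^ bi ^ ci ^ cout1 ^ cout2
        # cout1_i = f'_c(a_i, b_i, cout1_{i-1})
        next_cout1 = (ai & bi) | (cout1 & (ai ^ bi))
        # cout2_i = f''_c(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
        next_cout2 = (f1i & ci) | (cout2 & (f1i ^ ci))
        out |= outi << i
        cout1, cout2 = next_cout1, next_cout2
    return out
```

For the slide's example r10 + r9 - 1, the constant -1 would be passed as its 64-bit two's-complement encoding, for instance `add3_collapsed(r10, r9, MASK64)`.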
Complexity Effectiveness
• Scalability of ILP vs. functions
[Figure: plot of performance against complexity, with curves labeled "ILP exploitation" and "Functions"]
Related Work
• ASIC
• General-purpose context
• 3-1 Interlock Collapsing ALU [Y. Sazeides, S. Vassiliadis and J. Smith, MICRO-29, 1996]
• Chimaera [Z. Ye et al., ISCA-27, 2000]
• Grid Processors [R. Nagarajan et al., MICRO-34, 2001]
• Cascade one or more hardware operators to execute specific functions
[Figure: cascaded hardware operators (AND, OR, XOR units and an adder)]
Building Functions
[Figure label: Traces]
• From traces of instructions to configuration macros
• Compilation toolchain to study:
  • Potential of the approach
  • Performance analysis on a superscalar processor
Potential of the Approach
• Theoretical speedup
• Cuts: limits to DFG collapsing (height)
• Number of inputs
• Non-collapsible instructions
• Load instructions (27.7%)
• Carries from upper significant bits
The lower the ILP, the higher the speedup
[Figure: a DFG of op nodes in which a load (LD, with its address @ and memory access) forms a cut that separates functions F1 and F2]
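How a load cut limits collapsing can be sketched with a toy model. The Python below is a simplification and not the authors' methodology: it assumes a purely sequential dependence chain, unit latency for every instruction and for every collapsed function, and only load cuts (input-count and carry cuts are ignored). The names `collapse_chain` and `toy_speedup` are hypothetical.

```python
def collapse_chain(chain):
    """Split a dependent chain of opcodes at loads.

    Returns a list of steps: ("F", ops) for a run of collapsible
    instructions merged into one function, or ("LD", ["LD"]) for a load,
    which is assumed to be non-collapsible and therefore forms a cut.
    """
    steps, current = [], []
    for op in chain:
        if op == "LD":
            if current:
                steps.append(("F", current))
                current = []
            steps.append(("LD", [op]))
        else:
            current.append(op)
    if current:
        steps.append(("F", current))
    return steps


def toy_speedup(chain):
    """Chain length before collapsing divided by the number of steps after.

    Assumes one time unit per instruction and one per collapsed function.
    """
    return len(chain) / len(collapse_chain(chain))


# A chain shaped like the slide's figure: three ops, a load, two ops.
example = ["op", "op", "op", "LD", "op", "op"]
```

In this toy model the example collapses into three steps (F1, LD, F2), so a chain with more loads gains less from collapsing.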
Implementation
[Figure: rePlay Framework]
rePlay Optimization Engine Delay
• Function built "offline"
Future Work
• Address prediction to overcome load cuts
[Figure: Address Prediction & Cache Preloading; two DFGs of op nodes with functions F1 and F2 around a load (LD, address @, memory access), one of them using a predicted address @']