
Sami YEHIA and Olivier TEMAM LRI, Paris South University France

From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance Without ILP or Speculation. Sami YEHIA and Olivier TEMAM, LRI, Paris South University, France. Scaling Up Processors.


Presentation Transcript


  1. From Sequences of Dependent Instructions to Functions: A Complexity-Effective Approach for Improving Performance Without ILP or Speculation. Sami YEHIA and Olivier TEMAM, LRI, Paris South University, France

  2. Scaling Up Processors
     • Larger pipelines, caches, instruction windows and reservation stations
     • Aggressive speculation mechanisms: branch prediction, value prediction, data prefetching, ...
     • Rely on ILP exploitation
     • What about scaling with little ILP?

  3. Concept
     [Figure: a program reading registers r1, r2, r3, …, rn (… addq r1,r2,r3; subq r3,10,r4; … sll r5,6,r6; addq r5,r5,r4) is mapped onto a logic circuit with input bits r1_{63}, r1_{62}, r1_{61}, …, r1_{1}, r1_{0} and output bits f1_{63}, f1_{62}, f1_{61}, …, f1_{1}, f1_{0}]
     r6 = f1(r1,r2,…,rn)
     r4 = f2(r1,r2,…,rn)
     • 2^{64*num_registers} inputs! (Theoretically)
     • Combinatorial functions
     • A sequence of instructions is a set of functions

  4. Principles
     • An "independent" function for each output
     f_r3(r9,r10) = r9 + r10 - 1
     f_r4(r9,r10) = sign_extension((r9 + r10 - 1)[31:0])
     f_r5(r9,r10) = ((r9 + r10 - 1) << 1) >> 1
     f_br(r9,r10) = (r9 + r10 - 1) [operator lost in the transcript] (((r9 + r10 - 1) << 1) >> 1)
     [Figure: DFG]
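
The idea on this slide can be sketched in Python: a chain of dependent instructions and a set of per-output functions of the live-in registers give the same results. The instruction chain below is an assumed reconstruction (the slide only shows the resulting functions), the names `run_sequence`, `run_functions` and `sext32` are hypothetical, and the branch function is left out because its comparison operator is missing from the transcript.

```python
MASK64 = (1 << 64) - 1


def sext32(x):
    """Sign-extend the low 32 bits of x to a 64-bit value."""
    low = x & 0xFFFFFFFF
    return (low | ~0xFFFFFFFF) & MASK64 if low & 0x80000000 else low


def run_sequence(r9, r10):
    """Execute an assumed dependent instruction chain one step at a time."""
    t = (r9 + r10) & MASK64          # add:  t  = r9 + r10
    r3 = (t - 1) & MASK64            # sub:  r3 = t - 1
    r4 = sext32(r3)                  # 32-bit result, sign-extended to 64 bits
    r5 = ((r3 << 1) & MASK64) >> 1   # shift left then right: clears the top bit
    return r3, r4, r5


# One independent function per output; each reads only the live-in registers.
def f_r3(r9, r10):
    return (r9 + r10 - 1) & MASK64


def f_r4(r9, r10):
    return sext32((r9 + r10 - 1) & MASK64)


def f_r5(r9, r10):
    return (((r9 + r10 - 1) << 1) & MASK64) >> 1


def run_functions(r9, r10):
    """Evaluate every output directly; no function waits on another one."""
    return f_r3(r9, r10), f_r4(r9, r10), f_r5(r9, r10)
```

Because no function consumes the result of another, the three outputs could in principle be evaluated side by side once r9 and r10 are available, which is the property the slide calls "independent".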

  5. Hardware Operator
     • Eliminate dependencies to calculate a+b+c
     [Figure: inputs a and b feed a first adder producing f1; f1 and c feed a second adder producing out]
     • r10 + r9 - 1 to hardware operators
     f1_i = f'(a_i, b_i, cout1_{i-1})
     cout1_i = f'_c(a_i, b_i, cout1_{i-1})
     out_i = f''(f1_i, c_i, cout2_{i-1}) = f''(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
     cout2_i = f''_c(a_i, b_i, c_i, cout1_{i-1}, cout2_{i-1})
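
The bit-slice equations above can be illustrated with a small Python model. It assumes that f' and f'' are ordinary full-adder sum functions (XOR) and that f'_c and f''_c are full-adder carries (majority); the slide does not spell these out, and the names `maj` and `add3_bitslice` are hypothetical.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1


def maj(x, y, z):
    """Carry-out of a one-bit full adder (majority of three bits)."""
    return (x & y) | (x & z) | (y & z)


def add3_bitslice(a, b, c):
    """Compute (a + b + c) mod 2**WIDTH, one bit slice at a time.

    Each slice i sees only a_i, b_i, c_i and the two carries coming
    from slice i-1, so a+b is never produced as a separate full result.
    """
    out = 0
    cout1 = cout2 = 0                        # carries entering bit 0
    for i in range(WIDTH):
        ai, bi, ci = (a >> i) & 1, (b >> i) & 1, (c >> i) & 1
        f1 = ai ^ bi ^ cout1                 # f1_i
        next_cout1 = maj(ai, bi, cout1)      # cout1_i
        out_i = f1 ^ ci ^ cout2              # out_i
        cout2 = maj(f1, ci, cout2)           # cout2_i
        cout1 = next_cout1
        out |= out_i << i
    return out
```

For the slide's example, r10 + r9 - 1 corresponds to calling `add3_bitslice(r10, r9, MASK)`, since MASK is the 64-bit two's-complement encoding of -1.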

  6. Complexity Effectiveness
     • Scalability of ILP vs. functions
     [Figure: axes "Performance" and "Complexity"; curves "ILP exploitation" and "Functions"]

  7. Related Work
     [Figure: blocks labeled AND, OR, XOR, Adder, AND, OR, XOR]
     • ASIC
     • General-purpose context
     • 3-1 Interlock Collapsing ALU [Y. Sazeides, S. Vassiliadis and J. Smith, MICRO-29, 1996]
     • Chimaera [Z. Ye et al., ISCA-27, 2000]
     • Grid Processors [R. Nagarajan et al., MICRO-34, 2001]
     • Cascade one or more hardware operators to execute specific functions

  8. Building Functions
     [Figure label: Traces]
     • A compilation toolchain, from traces of instructions to configuration macros, used to study:
     • Potential of the approach
     • Performance analysis on a superscalar processor

  9. Potential of the Approach
     • Theoretical speedup
     • Cuts: limits to DFG collapsing (height)
     • Number of inputs
     • Non-collapsible instructions
     • Load instructions (27.7%)
     • Carries from upper significant bits
     The lower the ILP, the higher the speedup
     [Figure: a DFG of op nodes collapsed into functions F1 and F2, with a cut at a load (LD) node that takes an address (@) and accesses memory (mem)]
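
How cuts bound collapsing can be sketched with a toy greedy grouping over a linear trace. This is only an illustration under stated assumptions, not the toolchain from the talk: the `Instr` record, the `collapse` function and the `max_inputs` limit are hypothetical, every load forms its own group, and a group is closed as soon as it would need more than `max_inputs` external inputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str
    srcs: tuple
    is_load: bool = False


def collapse(trace, max_inputs=4):
    """Greedily group a trace into functions, cutting at loads and input limits."""
    groups = []
    current, inputs, defined = [], set(), set()
    for ins in trace:
        # External inputs the group would need if it absorbed this instruction.
        new_inputs = inputs | {s for s in ins.srcs if s not in defined}
        if ins.is_load or (current and len(new_inputs) > max_inputs):
            if current:
                groups.append(current)       # cut: close the running group
            current, inputs, defined = [], set(), set()
            if ins.is_load:
                groups.append([ins])         # a load is never collapsed
                continue
            new_inputs = set(ins.srcs)
        current.append(ins)
        inputs = new_inputs
        defined.add(ins.dest)
    if current:
        groups.append(current)
    return groups


# A dependent chain: two ops, a load, then two more ops.
trace = [
    Instr("r3", ("r1", "r2")),
    Instr("r4", ("r3",)),
    Instr("r5", ("r4",), is_load=True),
    Instr("r6", ("r5", "r1")),
    Instr("r7", ("r6", "r5")),
]
groups = collapse(trace)
```

In this example the five-instruction chain becomes three groups (function, load, function); if each group took one step, the dependent chain would shrink from five steps to three, and the load is what prevents it from shrinking further.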

  10. Theoretical Speedup

  11. Number of Inputs

  12. Non-Collapsible Instructions

  13. Implementation: rePlay Framework

  14. Performance Evaluation

  15. RePlay Optimization Engine Delay
     • Function built "offline"

  16. Latency of Function units

  17. Future Work
     • Address prediction to overcome load cuts
     [Figure: DFGs of op, LD, mem and address (@, @') nodes grouped into functions F1 and F2, labeled "Address Prediction & Cache Preloading"]

  18. Q & A

  19. Carries from Upper Significant Bits

  20. Optimization Engine Delay

  21. Latency of Function units
