
Trace-Level Reuse


Presentation Transcript


  1. 1999 International Conference on Parallel Processing ICPP´99 Trace-Level Reuse A. González, J. Tubella and C. Molina Dpt. d´Arquitectura de Computadors Universitat Politècnica de Catalunya

  2. Motivation • Increase performance by overcoming dataflow limitation • DATA SPECULATION • Exploits predictability of values • DATA REUSE • Exploits redundancy of computations ICPP´99

  3. Motivation • Redundant computations are rather frequent • code: loops, recursive subroutines • data: finite domain of values • The results could be reused instead of recomputed • [figure: repeated OUT = f(IN) computations in the dynamic execution stream] ICPP´99

  4. Motivation • Reuse granularity • an instruction • a sequence of instructions • TRACE-LEVEL REUSE • Performance potential of data reuse • at instruction-level • at trace-level ICPP´99

  5. Outline • Trace-level reuse • Performance potential • A first approach • Related work • Conclusions ICPP´99

  6. Trace-Level Reuse • Trace • Any dynamic sequence of instructions • Goal • Avoid the execution of a trace by reusing its results • provided that the same trace with the same inputs has already been executed • Advantages • Reduces the utilization of other machine resources • Reduces the time to compute results • Allows the processor to exceed the dataflow limit ICPP´99

  7. Trace-Level Reuse • Hardware scheme • Main Issues • Reuse Trace Memory (RTM) • Dynamic trace collection • Reuse test • State update ICPP´99

  8. Reuse Trace Memory (RTM) • RTM stores candidate traces to be reused • Trace input: initial address, input register identifiers & contents, input memory addresses & contents • Trace output: next address, output register identifiers & contents, output memory addresses & contents ICPP´99
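
As a concrete reading of the entry layout above, here is a minimal C++ sketch of an RTM entry with the input and output fields named on the slide. The types, field widths and container choices are illustrative assumptions, not the paper's hardware encoding.

```cpp
#include <cstdint>
#include <vector>

// One recorded value: a register (identifier & contents) or a memory location
// (address & contents).
struct RegValue { uint8_t  id;   uint64_t value; };
struct MemValue { uint64_t addr; uint64_t value; };

// Sketch of one RTM entry, mirroring the fields listed on the slide.
struct RTMEntry {
    // Trace input: what must match the current state for the trace to be reused
    uint64_t              initial_address = 0; // PC of the first instruction of the trace
    std::vector<RegValue> in_regs;             // input register identifiers & contents
    std::vector<MemValue> in_mem;              // input memory addresses & contents

    // Trace output: what is written back when the trace is reused
    uint64_t              next_address = 0;    // PC to continue fetching from
    std::vector<RegValue> out_regs;            // output register identifiers & contents
    std::vector<MemValue> out_mem;             // output memory addresses & contents
};

int main() {
    RTMEntry e;
    e.initial_address = 0x1000;
    e.in_regs  = {{1, 5}};   // the trace read r1 == 5 from outside the trace
    e.out_regs = {{2, 10}};  // the trace produced r2 = 10
    e.next_address = 0x1010; // fetch continues here on a reuse hit
}
```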

  9. Dynamic trace collection • Chooses candidate traces • Initial address • Next address • Input and output trace locations are computed at execution-time and stored along with their values in RTM ICPP´99
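
A minimal sketch of how a trace's input and output locations could be collected at execution time, as the slide describes: the first read of a location not produced inside the trace becomes a trace input, and every written location becomes a trace output. The class and method names are illustrative; the paper does this in hardware while the candidate trace executes.

```cpp
#include <cstdint>
#include <map>

struct TraceUnderConstruction {
    uint64_t initial_address = 0;                    // PC where the trace starts
    uint64_t next_address    = 0;                    // PC following the trace
    std::map<uint8_t, uint64_t>  in_regs, out_regs;  // register id -> value
    std::map<uint64_t, uint64_t> in_mem,  out_mem;   // memory addr -> value

    // A read is a trace input only if the trace itself has not written the
    // location first (and the location has not already been recorded).
    void read_reg (uint8_t id,    uint64_t v) { if (!out_regs.count(id)  && !in_regs.count(id)) in_regs[id]  = v; }
    void read_mem (uint64_t addr, uint64_t v) { if (!out_mem.count(addr) && !in_mem.count(addr)) in_mem[addr] = v; }

    // Every written location (with its latest value) is a trace output.
    void write_reg(uint8_t id,    uint64_t v) { out_regs[id]  = v; }
    void write_mem(uint64_t addr, uint64_t v) { out_mem[addr] = v; }
};

int main() {
    TraceUnderConstruction t;
    t.initial_address = 0x1000;
    t.read_reg(1, 5);    // r1 read before any write in the trace -> trace input
    t.write_reg(2, 10);  // r2 written -> trace output
    t.read_reg(2, 10);   // produced inside the trace -> not an input
    t.next_address = 0x1010;
}
```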

  10. Reuse Test & State Update • Reuse test • At some points of the execution the reuse test is performed • Checks whether a trace input, stored in the RTM, matches the current execution state • State update • Writes output trace values to output trace locations • REUSE LATENCY • Reuse test plus State update ICPP´99
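
The two steps can be pictured as follows. This is a minimal software sketch assuming a simple architectural-state model (a register array plus a sparse memory map), not the paper's hardware mechanism.

```cpp
#include <array>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct RegValue { uint8_t id; uint64_t value; };
struct MemValue { uint64_t addr; uint64_t value; };

struct RTMEntry {
    uint64_t initial_address = 0, next_address = 0;
    std::vector<RegValue> in_regs, out_regs;
    std::vector<MemValue> in_mem,  out_mem;
};

struct ArchState {
    std::array<uint64_t, 32> regs{};             // integer register file
    std::unordered_map<uint64_t, uint64_t> mem;  // sparse memory image
};

// Reuse test: every recorded trace input must match the current state.
bool reuse_test(const RTMEntry& e, const ArchState& s) {
    for (const auto& r : e.in_regs)
        if (s.regs[r.id] != r.value) return false;
    for (const auto& m : e.in_mem) {
        auto it = s.mem.find(m.addr);
        if (it == s.mem.end() || it->second != m.value) return false;
    }
    return true;
}

// State update: write the trace outputs and redirect fetch past the trace.
uint64_t state_update(const RTMEntry& e, ArchState& s) {
    for (const auto& r : e.out_regs) s.regs[r.id] = r.value;
    for (const auto& m : e.out_mem)  s.mem[m.addr] = m.value;
    return e.next_address;   // the trace itself is never executed
}

int main() {
    ArchState s; s.regs[1] = 5;
    RTMEntry  e; e.in_regs = {{1, 5}}; e.out_regs = {{2, 10}}; e.next_address = 0x1010;
    if (reuse_test(e, s))
        state_update(e, s);  // reuse hit: r2 becomes 10 without executing the trace
}
```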

  11. Outline • Trace-level reuse • Performance Potential • A first approach • Related work • Conclusions ICPP´99

  12. Performance Potential • Base-line machine • ISA: Alpha • Only constrained by: • Data dependences • Data dependences + Finite instruction window • Reuse engine • Perfect trace reuse • Maximum-length traces • Minimum number of traces ICPP´99

  13. Performance Potential • Instruction-level reuse (ILR) • Perfect instruction reuse engine: • All previously executed instances of each instruction are checked for possible reuse • Maximum reusability: almost 90% ICPP´99

  14. ILR • Performance limits • Base-line machine • constrained by data dependences • Reuse engine: 1-cycle latency ICPP´99

  15. ILR • Performance limits • Base-line machine constrained by • data dependences • data dependences and instruction window • Reuse latency: 1 to 4 cycles ICPP´99

  16. ILR • Performance limits • Moderate potential with a perfect reuse engine • Instruction latency is reduced • The reuse of a chain of dependent instructions is still a sequential process • Source operands must be ready ICPP´99

  17. Performance Potential • Trace-level reuse (TLR) • Perfect reuse engine • Traces consist of maximum-length dynamic sequences of reusable instructions • Upper bound on the maximum reusability • Lower bound on the minimum number of traces • [figure: dynamic instructions I1…I6 grouped into a trace] ICPP´99
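
A minimal sketch of how such maximum-length traces can be formed from the dynamic stream for the limit study: each maximal run of consecutive reusable instructions is merged into one trace, which yields the upper bound on reusability and the lower bound on the number of traces. The types below are illustrative.

```cpp
#include <vector>

struct DynInstr { bool reusable; /* opcode, operands, ... omitted */ };

// Group each maximal run of consecutive reusable instructions in the dynamic
// execution stream into one trace; return the size of every trace formed.
std::vector<int> maximal_traces(const std::vector<DynInstr>& stream) {
    std::vector<int> sizes;
    int run = 0;
    for (const auto& i : stream) {
        if (i.reusable) ++run;                                 // extend the current trace
        else if (run > 0) { sizes.push_back(run); run = 0; }   // non-reusable instruction ends it
    }
    if (run > 0) sizes.push_back(run);
    return sizes;
}

int main() {
    // Example stream: three reusable instructions, one non-reusable, two reusable
    // -> traces of sizes 3 and 2.
    std::vector<DynInstr> stream = {{true}, {true}, {true}, {false}, {true}, {true}};
    maximal_traces(stream);
}
```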

  18. TLR • Average trace size: 15.0 instructions • FP: 11.7 • INT: 20.3 ICPP´99

  19. TLR • Performance limits • Base-line machine constrained by • data dependences and instruction window (256-entry) • Reuse engine latency • Constant • Linear: f(#INPUTS + #OUTPUTS) ICPP´99
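
As a purely illustrative reading of the linear model: if the reuse latency is modelled as c + k · (#inputs + #outputs) with, say, k = 1 cycle per value, a trace with 12 input and 12 output values pays roughly 24 cycles on top of the constant term c, whereas a small trace with 3 inputs and 2 outputs pays only about 5. The constants here are assumptions, not figures from the paper.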

  20. Outline • Trace-level reuse • Performance potential • A first approach • Related work • Conclusions ICPP´99

  21. A First Approach • Reuse Trace Memory (RTM) • Indexed by trace initial address (4-way and 8-way) • Maximum number of input and output values: • 8 register values • 4 memory values • Sizes • 512 entries (4 different entries per initial address) • 4K entries (8 entries per initial address) • 32K entries (16 entries per initial address) • 256K entries (16 entries per initial address) ICPP´99
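
A minimal sketch of the set-associative lookup this organization implies: the trace's initial address selects a set, and all ways in the set are candidate entries on which the full reuse test is then applied. The index function, the entry contents and the configuration used in main are assumptions for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct RTMEntry {
    bool     valid = false;
    uint64_t initial_address = 0;
    // input/output fields omitted; see the entry sketch earlier
};

class ReuseTraceMemory {
public:
    ReuseTraceMemory(std::size_t entries, std::size_t ways)
        : ways_(ways), sets_(entries / ways), table_(entries) {}

    // Return the candidate entries stored under this initial address; the full
    // reuse test on the recorded inputs is applied to each candidate afterwards.
    std::vector<const RTMEntry*> lookup(uint64_t pc) const {
        std::vector<const RTMEntry*> candidates;
        std::size_t set = (pc >> 2) % sets_;               // simple index function (assumption)
        for (std::size_t w = 0; w < ways_; ++w) {
            const RTMEntry& e = table_[set * ways_ + w];
            if (e.valid && e.initial_address == pc) candidates.push_back(&e);
        }
        return candidates;
    }

private:
    std::size_t ways_, sets_;
    std::vector<RTMEntry> table_;
};

int main() {
    ReuseTraceMemory rtm(4096, 8);   // e.g. the 4K-entry, 8-way configuration
    rtm.lookup(0x1000);              // no valid entries yet -> empty candidate list
}
```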

  22. A First Approach • In-order execution • Reuse test performed for every fetch operation • [figure: the PC indexes both the RTM and the instruction cache; the reuse test is applied alongside the fetch, decode, execute and commit stages] ICPP´99
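
The placement of the reuse test in this in-order pipeline can be sketched as below. The stage functions are trivial stand-ins for the blocks drawn on the slide, with made-up behavior only to show the control flow on a reuse hit versus normal execution.

```cpp
#include <cstdint>
#include <cstdio>

// Trivial stand-ins for the hardware blocks on the slide (behavior is made up).
static bool     rtm_reuse_hit(uint64_t pc)           { return pc == 0x1000; } // pretend one trace is stored
static uint64_t rtm_state_update(uint64_t pc)        { return pc + 24; }      // skip a 6-instruction trace
static uint64_t execute_one_instruction(uint64_t pc) { return pc + 4; }       // normal in-order path

int main() {
    uint64_t pc = 0x1000;
    for (int cycle = 0; cycle < 4; ++cycle) {
        // The RTM is probed with the fetch PC on every fetch operation (as the
        // slide states); on a reuse hit the whole trace is skipped and fetch is
        // redirected to the trace's next address, otherwise the instruction
        // flows down the normal pipeline.
        pc = rtm_reuse_hit(pc) ? rtm_state_update(pc) : execute_one_instruction(pc);
        std::printf("cycle %d -> next PC 0x%llx\n", cycle, (unsigned long long)pc);
    }
}
```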

  23. A First Approach • Dynamic trace collection • Built traces contain only reusable instructions • an additional memory is needed to check instruction reusability • Fixed-length traces • starting at any address • Trace expansion on reuse hit ICPP´99

  24. Reusable Instructions • 25% reusability for a 4K-entry RTM ICPP´99

  25. Trace Size • 6 instructions for a 4K-entry RTM ICPP´99

  26. Related work • Data Reuse • Software implementation • Memoization [Richardson,92] • Hardware implementation • Tree Machine [Harbison,82] • At instruction-level • Reuse Buffer [Sodani and Sohi,97] • Register renaming [Jourdan et al.,98] • Redundant Computation Buffer [Molina, González and Tubella,99] • At “trace”-level • Result cache [Richardson,93] [Oberman and Flynn,95] • Basic block reuse [Huang and Lilja,99] ICPP´99

  27. Conclusions • Increasing the granularity of reuse from instructions to traces • Less reusability • More effective • Fetch bandwidth is reduced • Effective instruction window size is increased • Number of operations per reused instruction is reduced • DATA DEPENDENCES ARE BROKEN ICPP´99

  28. Conclusions • Concentrate effort on • devising strategies to choose reusable traces • High-level structures • Compiler assistance • reducing the reuse test overhead • Boolean test • Invalidate/validate RTM entries ICPP´99
