1999 International Conference on Parallel Processing ICPP´99 Trace-Level Reuse A. González, J. Tubella and C. Molina Dpt. d´Arquitectura de Computadors Universitat Politècnica de Catalunya
Motivation • Increase performance by overcoming the dataflow limitation • DATA SPECULATION • Exploits predictability of values • DATA REUSE • Exploits redundancy of computations
Motivation • Redundant computations are rather frequent • code • loops, recursive subroutines • data • finite domain of values • The results could be reused instead of recomputed [Figure: a dynamic execution stream in which the computation OUT = f(IN) recurs; the repeated instances are marked as redundant computations]
Motivation • Reuse granularity • an instruction • a sequence of instructions • TRACE-LEVEL REUSE • Performance potential of data reuse • at instruction-level • at trace-level
Outline • Trace-level reuse • Performance potential • A first approach • Related work • Conclusions
Trace-Level Reuse • Trace • Any dynamic sequence of instructions • Goal • Avoid the execution of a trace by reusing its results • provided that the same trace with the same inputs has already been executed • Advantages • Reduces utilization of other machine resources • Reduces time to compute results • Allows the processor to exceed the dataflow limit
Trace-Level Reuse • Hardware scheme • Main Issues • Reuse Trace Memory (RTM) • Dynamic trace collection • Reuse test • State update
Reuse Trace Memory (RTM) • RTM stores candidate traces to be reused • Each RTM entry describes one TRACE • Initial Address • Trace input (INPUT) • Input register identifiers & contents • Input memory addresses & contents • Trace output (OUTPUT) • Output register identifiers & contents • Output memory addresses & contents • Next Address
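As a rough illustration of the entry layout on this slide, the following Python sketch models one RTM entry. The class name `RTMEntry` and all field names are assumptions made for this example; the slide specifies only what each field contains.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RTMEntry:
    """One candidate trace (class and field names are hypothetical)."""
    initial_address: int   # PC of the first instruction of the trace
    next_address: int      # PC at which execution continues after the trace
    in_regs: Dict[str, int] = field(default_factory=dict)   # input register id -> content
    in_mem: Dict[int, int] = field(default_factory=dict)    # input memory address -> content
    out_regs: Dict[str, int] = field(default_factory=dict)  # output register id -> content
    out_mem: Dict[int, int] = field(default_factory=dict)   # output memory address -> content


# Made-up example: a trace that reads r1 = 5 and produces r2 = 25.
entry = RTMEntry(initial_address=0x1000, next_address=0x1010,
                 in_regs={"r1": 5}, out_regs={"r2": 25})
```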
Dynamic trace collection • Chooses candidate traces • Initial address • Next address • Input and output trace locations are computed at execution-time and stored along with their values in RTM
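One way to picture how input and output locations could be computed at execution time is sketched below. The function `collect_trace` and the `(dest, srcs, fn)` instruction format are assumptions of this example, not the hardware mechanism from the talk. A location counts as a trace input if it is read before the trace writes it; every written location is treated as an output, which is a conservative simplification.

```python
def collect_trace(instrs, state):
    """Execute instrs on state and return the trace's (inputs, outputs).

    instrs: list of (dest, srcs, fn) tuples (hypothetical format).
    state:  dict mapping location name -> value; it is updated in place.
    """
    inputs, outputs = {}, {}
    for dest, srcs, fn in instrs:
        for s in srcs:
            # Read before any write inside the trace: this is a live-in value.
            if s not in outputs and s not in inputs:
                inputs[s] = state[s]
        state[dest] = fn(*(state[s] for s in srcs))
        outputs[dest] = state[dest]  # keep the last value written to dest
    return inputs, outputs


state = {"r1": 3, "r2": 4}
trace = [
    ("r3", ("r1", "r2"), lambda a, b: a + b),  # r3 = r1 + r2
    ("r4", ("r3", "r1"), lambda a, b: a * b),  # r4 = r3 * r1
]
inputs, outputs = collect_trace(trace, state)
print(inputs)   # {'r1': 3, 'r2': 4}
print(outputs)  # {'r3': 7, 'r4': 21}
```

Note that `r3` is not an input of the trace, because the trace itself produces it before it is read.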
Reuse Test & State Update • Reuse test • At certain points during execution the reuse test is performed • Checks whether a trace input stored in the RTM matches the current execution state • State update • Writes output trace values to output trace locations • REUSE LATENCY • Reuse test plus state update
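The two steps on this slide can be sketched in a few lines of Python. The dictionary layout of `entry` and the names `reuse_test` and `try_reuse` are assumptions of this example; registers and memory are merged into a single location-to-value map for brevity.

```python
def reuse_test(entry, state):
    """True if every recorded trace input matches the current state."""
    return all(state.get(loc) == val for loc, val in entry["inputs"].items())


def try_reuse(entry, state, pc):
    """Return (new_pc, hit). On a hit the trace is skipped, not executed."""
    if pc == entry["initial_address"] and reuse_test(entry, state):
        state.update(entry["outputs"])      # state update: write the trace outputs
        return entry["next_address"], True  # continue after the trace
    return pc, False


entry = {
    "initial_address": 0x1000,
    "next_address": 0x1008,
    "inputs": {"r1": 3, "r2": 4},
    "outputs": {"r3": 7, "r4": 21},
}

state = {"r1": 3, "r2": 4, "r3": 0, "r4": 0}
pc, hit = try_reuse(entry, state, 0x1000)         # inputs match: reuse

other_state = {"r1": 9, "r2": 4}
pc2, hit2 = try_reuse(entry, other_state, 0x1000)  # r1 differs: no reuse
```

In this model the reuse latency corresponds to the cost of `reuse_test` plus the `state.update` call.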
Outline • Trace-level reuse • Performance Potential • A first approach • Related work • Conclusions
Performance Potential • Base-line machine • ISA: Alpha • Only constrained by: • Data dependences • Data dependences + Finite instruction window • Reuse engine • Perfect trace reuse • Maximum-length traces • Minimum number of traces
Performance Potential • Instruction-level reuse (ILR) • Perfect instruction reuse engine: • All previously executed instances of each instruction are checked for a possible reuse • Maximum reusability: almost 90%
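A perfect instruction reuse engine of this kind can be approximated in software by remembering every operand tuple seen for each static instruction. In the sketch below, representing a dynamic instruction as `(pc, operands)` is an assumption, and the sample stream is made up for illustration; it is not data from the talk.

```python
from collections import defaultdict


def reusability(dynamic_stream):
    """Fraction of dynamic instructions whose (pc, operands) was seen before."""
    seen = defaultdict(set)  # pc -> set of operand tuples already executed
    reusable = 0
    for pc, operands in dynamic_stream:
        if operands in seen[pc]:
            reusable += 1          # same instruction, same inputs: reusable
        else:
            seen[pc].add(operands)
    return reusable / len(dynamic_stream) if dynamic_stream else 0.0


stream = [
    (0x10, (1, 2)),
    (0x14, (3,)),
    (0x10, (1, 2)),  # repeats the first instance
    (0x14, (4,)),
    (0x10, (1, 2)),  # repeats it again
]
print(reusability(stream))  # 0.4
```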
ILR • Performance limits • Base-line machine • constrained by data dependences • Reuse engine: 1-cycle latency
ILR • Performance limits • Base-line machine constrained by • data dependences • data dependences and instruction window • Reuse latency: 1 to 4 cycles
ILR • Performance limits • Moderate potential with a perfect reuse engine • Instruction latency is reduced • The reuse of a chain of dependent instructions is still a sequential process • Source operands must be ready
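The point about dependent chains can be made concrete with a toy timing model. The function name, the latencies and the 1-cycle reuse latency below are illustrative assumptions, not measurements from the talk.

```python
def chain_completion_time(exec_latencies, reusable, reuse_latency=1):
    """Cycles to finish a chain where each instruction depends on the previous one."""
    time = 0
    for latency, hit in zip(exec_latencies, reusable):
        # A reused instruction still has to wait for its source operand
        # (the previous result), so its cost adds to the chain.
        time += reuse_latency if hit else latency
    return time


latencies = [1, 3, 1, 4]  # hypothetical execution latencies in cycles
no_reuse = chain_completion_time(latencies, [False] * 4)
all_reused = chain_completion_time(latencies, [True] * 4)
print(no_reuse, all_reused)  # 9 4
```

Even with every instruction reused, the chain still costs one reuse operation per link. Reusing the whole chain as a single trace would cost one reuse operation in total.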
Performance Potential • Trace-level reuse (TLR) • Perfect reuse engine • Traces consist of maximum-length dynamic sequences of reusable instructions • Upper bound on reusability • Lower bound on the number of traces [Figure: dynamic instruction sequence I1 to I6, part of which is marked as a TRACE]
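Forming maximum-length traces amounts to grouping consecutive reusable instructions in the dynamic stream. A minimal sketch, assuming each dynamic instruction has already been flagged as reusable or not (the flags below are made up):

```python
from itertools import groupby


def max_length_traces(reusable_flags):
    """Sizes of the maximal runs of consecutive reusable instructions."""
    return [len(list(run))
            for is_reusable, run in groupby(reusable_flags)
            if is_reusable]


flags = [True, True, True, False, True, True, False, False, True]
sizes = max_length_traces(flags)
print(sizes)                    # [3, 2, 1]
print(sum(sizes) / len(sizes))  # 2.0
```

Because every run is as long as possible, the number of traces is as small as it can be for a given set of reusable instructions.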
TLR • Average trace size: 15.0 instructions • FP: 11.7 • INT: 20.3 [Figure: trace-size chart; only the data labels 203 and 116 survive]
TLR • Performance limits • Base-line machine constrained by • data dependences and instruction window (256-entry) • Reuse engine latency • Constant • Linear: f(#INPUTS+#OUTPUTS) [Figure: two panels, CONSTANT and LINEAR]
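The two latency models can be written down as small functions. The slide gives only the form f(#INPUTS+#OUTPUTS), so the concrete linear function below, with its hypothetical `values_per_cycle` parameter, is an assumption chosen for illustration.

```python
import math


def constant_latency(cycles):
    """Constant model: every reuse costs the same number of cycles."""
    return cycles


def linear_latency(n_inputs, n_outputs, values_per_cycle):
    """Linear model (assumed form): cycles grow with the values read and written."""
    return math.ceil((n_inputs + n_outputs) / values_per_cycle)


print(constant_latency(1))      # 1
print(linear_latency(6, 2, 4))  # 2
print(linear_latency(6, 3, 4))  # 3
```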
Outline • Trace-level reuse • Performance potential • A first approach • Related work • Conclusions
A First Approach • Reuse Trace Memory (RTM) • Indexed by trace initial address (4-way and 8-way) • Maximum number of input and output values: • 8 register values • 4 memory values • Sizes • 512 entries (4 different entries per initial address) • 4K entries (8 entries per initial address) • 32K entries (16 entries per initial address) • 256K entries (16 entries per initial address)
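A set-associative RTM indexed by the trace initial address might look like the sketch below. Several details are assumptions of this example and are not stated on the slide: the FIFO replacement policy, dropping the two low address bits before indexing, and applying the 8-register and 4-memory limits to inputs and outputs separately.

```python
MAX_REG_VALUES = 8  # limit from the slide
MAX_MEM_VALUES = 4  # limit from the slide


class RTM:
    def __init__(self, num_entries=512, ways=4):
        self.ways = ways
        self.num_sets = num_entries // ways
        self.sets = [[] for _ in range(self.num_sets)]

    def _set_for(self, initial_address):
        # Alpha instructions are 4 bytes wide, so the two low bits carry no information.
        return self.sets[(initial_address >> 2) % self.num_sets]

    def insert(self, initial_address, inputs, outputs, next_address):
        """inputs and outputs: {"regs": {id: value}, "mem": {addr: value}}."""
        for values in (inputs, outputs):
            if (len(values["regs"]) > MAX_REG_VALUES
                    or len(values["mem"]) > MAX_MEM_VALUES):
                return False  # trace does not fit in one entry
        entries = self._set_for(initial_address)
        if len(entries) == self.ways:
            entries.pop(0)  # FIFO replacement (assumed policy)
        entries.append((initial_address, inputs, outputs, next_address))
        return True

    def lookup(self, initial_address, regs, mem):
        """Return (outputs, next_address) on a reuse hit, else None."""
        for addr, inputs, outputs, next_address in self._set_for(initial_address):
            if (addr == initial_address
                    and all(regs.get(r) == v for r, v in inputs["regs"].items())
                    and all(mem.get(a) == v for a, v in inputs["mem"].items())):
                return outputs, next_address
        return None


rtm = RTM(num_entries=512, ways=4)
rtm.insert(0x1000, {"regs": {"r1": 3}, "mem": {}},
           {"regs": {"r2": 9}, "mem": {}}, 0x1008)
hit = rtm.lookup(0x1000, {"r1": 3}, {})   # same inputs: reuse hit
miss = rtm.lookup(0x1000, {"r1": 4}, {})  # different r1: no hit
```

Keeping several entries per initial address lets the same trace be stored with different input values, which matches the "4 different entries per initial address" wording on the slide.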
A First Approach • In-order execution • Reuse test performed for every fetch operation [Figure: pipeline diagram with the blocks PC, RTM, RTM entry, Instruction Cache, Reuse Test, Fetch, Decode, Execute and Commit]
A First Approach • Dynamic trace collection • Traces are built only from reusable instructions • an additional memory is needed to check instruction reusability • Fixed-length traces • starting at any address • Trace expansion on reuse hit
Reusable Instructions • 25% reusability for a 4K-entry RTM
Trace Size • 6 instructions for a 4K-entry RTM
Related work • Data Reuse • Software implementation • Memoization [Richardson,92] • Hardware implementation • Tree Machine [Harbison,82] • At instruction-level • Reuse Buffer [Sodani and Sohi,97] • Register renaming [Jourdan et al.,98] • Redundant Computation Buffer [Molina, González and Tubella,99] • At “trace”-level • Result cache [Richardson,93] [Oberman and Flynn,95] • Basic block reuse [Huang and Lilja,99]
Conclusions • Increasing the granularity of reuse from instructions to traces • Less reusability • More effective • Fetch bandwidth requirements are reduced • Effective instruction window size is increased • Number of operations per reused instruction is reduced • DATA DEPENDENCES ARE BROKEN
Conclusions • Concentrate effort on • devising strategies to choose reusable traces • High-level structures • Compiler assistance • reducing the reuse test overhead • Boolean test • Invalidate/validate RTM entries
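One possible reading of the "Boolean test" and "invalidate/validate" items is that each RTM entry tracks whether its inputs still hold, so the reuse test reduces to checking a single flag. The sketch below is only an interpretation. The class `ValidBitEntry` and its methods are hypothetical, and the entry is assumed to be created while its inputs still hold the recorded values.

```python
class ValidBitEntry:
    def __init__(self, inputs, outputs):
        self.inputs = dict(inputs)   # location -> value the trace was recorded with
        self.outputs = dict(outputs)
        self.mismatched = set()      # input locations whose current value differs

    def on_write(self, location, value):
        """Called for every write to a register or memory location."""
        if location not in self.inputs:
            return                                 # not a trace input: ignore
        if value == self.inputs[location]:
            self.mismatched.discard(location)      # validate
        else:
            self.mismatched.add(location)          # invalidate

    @property
    def valid(self):
        # The reuse test becomes a Boolean check instead of a value comparison.
        return not self.mismatched


entry = ValidBitEntry({"r1": 3, "r2": 4}, {"r3": 7})
entry.on_write("r1", 5)          # overwrites an input with a different value
after_invalidate = entry.valid   # False
entry.on_write("r5", 1)          # unrelated location
entry.on_write("r1", 3)          # recorded value restored
after_validate = entry.valid     # True
```

The comparison work moves from the reuse test to the write path, which is the trade-off this sketch is meant to show.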