ICS'99, Rhodes (Greece) - June 20-25, 1999
Dynamic Removal of Redundant Computations
Carlos Molina, Antonio González and Jordi Tubella
Universitat Politècnica de Catalunya - Barcelona
{cmolina,antonio,jordit}@ac.upc.es
Motivation
• Quasi-common subexpression: R = S / T; ... X = S / U;
• Quasi-invariant: for (i=0; i<N; i++) A[i] = B[i]+C[i];
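To make the quasi-invariant case concrete, the sketch below counts how many iterations of the A[i] = B[i]+C[i] loop see the same input pair as the previous iteration, and so would recompute a sum that was just produced. It assumes that "quasi-invariant" means an operation whose operands frequently repeat, and it looks back only one iteration; the function name is hypothetical and not from the slides.

```c
#include <stddef.h>

/* Hypothetical helper: counts iterations i >= 1 of A[i] = B[i] + C[i]
 * whose operand pair (B[i], C[i]) equals the pair of iteration i - 1.
 * Each such iteration repeats the previous addition exactly.
 * Assumption: a history depth of one iteration. */
size_t count_repeated_inputs(const int *b, const int *c, size_t n) {
    size_t repeats = 0;
    for (size_t i = 1; i < n; i++) {
        if (b[i] == b[i - 1] && c[i] == c[i - 1]) {
            repeats++;
        }
    }
    return repeats;
}
```

A high count relative to N suggests that a reuse mechanism with even a short history could skip much of the loop's arithmetic.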
Outline
• Instruction Reuse
• Related Work
• Redundant Computation Buffer
• Performance Results
• Conclusions
Instruction Reuse
[Figure: pipeline with stages Fetch, Decode & Rename, OOO Execution, Commit, and a Reuse Mechanism accessed through an index]
Related Work
• Instruction Reuse
 - Value Cache for the Tree Machine (Harbison 82)
 - Result Cache (Richardson 92, Oberman et al. 95)
 - Reuse Buffer (Sodani and Sohi 97)
 - Physical Register Reuse (Jourdan et al. 98)
• Trace Reuse
 - Basic blocks (Huang and Lilja 99)
 - General traces (González et al. 99)
Related Work
• Result Cache
 - Richardson 92, Oberman & Flynn 95
 - Special purpose (long latency operations)
 - Indexed by operand values
 - No reuse chaining
 - Can reuse dynamic instances of other static instructions
• Reuse Buffer
 - Sodani & Sohi 97
 - General purpose
 - Indexed by PC
 - Reuse chaining
 - Only reuse dynamic instances of same static instructions
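The indexing difference between the two schemes can be sketched as follows. This is a minimal illustration, not the cited designs: the direct-mapped table, its size, the hash function and every identifier are assumptions, and reuse chaining is not modeled. The point is only that a table indexed by operand values can serve any static instruction, while a table indexed by PC serves only the instruction that filled the entry.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 64 /* hypothetical size; direct mapped for simplicity */

typedef struct {
    bool valid;
    uint64_t pc;          /* tag, used only by the PC-indexed table */
    int opcode;
    int64_t opnd1, opnd2;
    int64_t result;
} Entry;

/* Tables must be zero-initialized before use, e.g. Table t = {0}; */
typedef struct { Entry e[ENTRIES]; } Table;

/* Hypothetical hash over the opcode and operand values. */
unsigned operand_index(int opcode, int64_t a, int64_t b) {
    uint64_t h = (uint64_t)a * 31u + (uint64_t)b * 17u + (uint64_t)opcode;
    return (unsigned)(h % ENTRIES);
}

bool same_operation(const Entry *e, int opcode, int64_t a, int64_t b) {
    return e->valid && e->opcode == opcode && e->opnd1 == a && e->opnd2 == b;
}

/* Result-cache style: indexed by operand values, so the PC plays no role. */
bool rc_lookup(const Table *t, int opcode, int64_t a, int64_t b, int64_t *result) {
    const Entry *e = &t->e[operand_index(opcode, a, b)];
    if (!same_operation(e, opcode, a, b)) return false;
    *result = e->result;
    return true;
}

void rc_update(Table *t, int opcode, int64_t a, int64_t b, int64_t result) {
    t->e[operand_index(opcode, a, b)] = (Entry){ true, 0, opcode, a, b, result };
}

/* Reuse-buffer style: indexed by PC, so only the same static instruction hits. */
bool rb_lookup(const Table *t, uint64_t pc, int opcode, int64_t a, int64_t b,
               int64_t *result) {
    const Entry *e = &t->e[pc % ENTRIES];
    if (e->pc != pc || !same_operation(e, opcode, a, b)) return false;
    *result = e->result;
    return true;
}

void rb_update(Table *t, uint64_t pc, int opcode, int64_t a, int64_t b,
               int64_t result) {
    t->e[pc % ENTRIES] = (Entry){ true, pc, opcode, a, b, result };
}
```

In this sketch, a division 8 / 2 recorded by the instruction at one PC hits in the operand-indexed table when any other instruction divides 8 by 2, but misses in the PC-indexed table unless the same PC executes it again.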
Redundant Computation Buffer
[Figure: RCB organization. Tables shown: Atable, Vtable, Mtable. Entry fields shown: opcode, opnd1, opnd2, result/address, pointer; address tag, result. Outputs: Reuse Test, Reused Value, Reused Memory Value]
RCB (Working Example)
Code: while (cond) { r = s / t ; ...... x = s / u ; }
[Figure, four animation steps showing the Atable (entries labeled 10: and 20:) and the Vtable]
• Step 1 - I1: 8 / 2 = 4. Vtable entry: div 8 2 4 nil
• Step 2 - I2: 8 / 2 = 4. Vtable entry: div 8 2 4 nil
• Step 3 - I2: 8 / 2 = 4. Vtable entry: div 8 2 4 nil
• Step 4 - I1: 9 / 3 = 3, I2: 9 / 3 = 3. Vtable entries: div 9 3 3 nil and div 8 2 4 nil
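The working example can be approximated in code. The following is a sketch under stated assumptions, not the authors' design: the Atable is taken to be indexed by PC and to hold one pointer into the Vtable, Vtable entries are shared between static instructions that compute the same operation on the same values, and a linear search stands in for whatever lookup the hardware uses. The pointer field is carried but left nil, as in the example, so its role is not modeled. All identifiers and sizes are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NIL (-1)
#define ATABLE_SIZE 64 /* hypothetical sizes */
#define VTABLE_SIZE 64

enum { OP_DIV = 1 };

typedef struct {
    int opcode;
    int64_t opnd1, opnd2;
    int64_t result;
    int pointer;                 /* kept at NIL; chaining is not modeled here */
} VEntry;

typedef struct {
    int atable[ATABLE_SIZE];     /* per-PC index into vtable, or NIL */
    VEntry vtable[VTABLE_SIZE];
    int vcount;                  /* number of valid Vtable entries */
    int victim;                  /* round-robin replacement position */
} RCB;

void rcb_init(RCB *r) {
    for (int i = 0; i < ATABLE_SIZE; i++) r->atable[i] = NIL;
    r->vcount = 0;
    r->victim = 0;
}

static bool v_match(const VEntry *v, int opcode, int64_t a, int64_t b) {
    return v->opcode == opcode && v->opnd1 == a && v->opnd2 == b;
}

/* Reuse test: follow the Atable pointer for this PC and compare operands. */
bool rcb_reuse_test(const RCB *r, uint64_t pc, int opcode, int64_t a, int64_t b,
                    int64_t *result) {
    int v = r->atable[pc % ATABLE_SIZE];
    if (v == NIL || !v_match(&r->vtable[v], opcode, a, b)) return false;
    *result = r->vtable[v].result;
    return true;
}

/* After executing: link this PC to a matching Vtable entry, creating one
 * only if no other instruction has already recorded the same computation.
 * Returns the Vtable index now referenced by this PC. */
int rcb_update(RCB *r, uint64_t pc, int opcode, int64_t a, int64_t b,
               int64_t result) {
    int v = NIL;
    for (int i = 0; i < r->vcount; i++) {
        if (v_match(&r->vtable[i], opcode, a, b)) { v = i; break; }
    }
    if (v == NIL) {
        if (r->vcount < VTABLE_SIZE) {
            v = r->vcount++;
        } else {
            v = r->victim;
            r->victim = (r->victim + 1) % VTABLE_SIZE;
        }
        r->vtable[v] = (VEntry){ opcode, a, b, result, NIL };
    }
    r->atable[pc % ATABLE_SIZE] = v;
    return v;
}
```

With this sketch, I1 at PC 10 records div 8 2 4, I2 at PC 20 is then linked to that same entry instead of allocating a second one, and both instructions pass the reuse test as long as their operands stay at 8 and 2. When the operands change to 9 and 3, the test fails and a new entry is recorded.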
Enhancements to Other Schemes
• Enhanced Result Cache
 [Figure: Atable indexed by Operands, plus Mtable]
• Enhanced Reuse Buffer
 [Figure: Atable indexed by PC, plus Mtable]
[Entry fields shown in both figures: opcode, opnd1, opnd2, result/address; address tag, result]
Timing Considerations
• Pipeline Stages: fetch, decode & rename, opnd read & dispatch, issue, execute, write back, commit
• Latency of the Reuse Buffer: Atable lookup, reuse test
• Latency of the RCB: 1st Atable lookup, 2nd Atable lookup, reuse test
• Latency of the Result Cache: Atable lookup, reuse test
[Figure: each lookup sequence drawn against the pipeline stages]
Experimental Framework
• Simulator
 - Alpha version of the SimpleScalar Toolset
• Benchmarks
 - SPEC95
• Maximum Optimization Level
 - DEC C & F77 compilers with -non_shared -O5
• Statistics
 - Collected for 125 million instructions
 - Skipping initializations
Basic Reuse Statistics
• We evaluate different schemes
 - Enhanced Result Cache (ERC)
 - Enhanced Reuse Buffer (ERB)
 - Redundant Computation Buffer (RCB)
• We find the best configuration for each scheme
 - Number of entries
 - History depth
• The best configurations are then evaluated
 - Percentage of reuse
 - Speedup
Study of Reuse (ERB)
[Figure: x-axis is size in Kbytes, from 8 to 4096 in powers of two]
Study of Reuse (RCB)
[Figure: x-axis is size in Kbytes, from 8 to 4096 in powers of two]
Study of Reuse (Comparative)
[Figure: x-axis is size in Kbytes, from 8 to 4096 in powers of two]
Performance Evaluation
• Two different capacities are evaluated
 - 32 KB
 - 200 KB
• The best configuration has been chosen for each reuse scheme
• We present a performance evaluation for a superscalar processor
 - Speedup
 - Percentage of reuse
Speedup (32 KB)
[Figure: y-axis is speedup, from 1.00 to 1.20]
Speedup (200 KB)
[Figure: y-axis is speedup, from 1.00 to 1.25]
Reuse (32 KB)
[Figure: legend includes "Ops ready"]
Reuse (200 KB)
[Figure: legend includes "Ops ready"]
Reuse by Instruction Category
[Figure: categories are Load Value, Memory Address, Arithmetic, Cond Branch]
Hybrid Scheme
[Figure: Atables indexed by PC and by operands (Opnds); entries with fields opcode, opnd1, opnd2, result/addr, pointer]
Speedup (Hybrid Scheme)
[Figure: y-axis is speedup, from 1.00 to 1.20]
Speedup (Perfect Reuse Engine)
[Figure: y-axis is speedup, from 1.00 to 2.20]
Conclusions
• Redundant Computation Buffer
 - Quasi-invariants
 - Quasi-common subexpressions
• High reuse coverage and low latency
 - 30% reuse
 - 10% speedup
• Outperforms previous schemes