Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan Zhou

ReSlice: Selective Re-execution of Long-retired Misspeculated Instructions Using Forward Slicing
Smruti R. Sarangi, Wei Liu, Josep Torrellas, Yuanyuan Zhou
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
Presented at CARG, March 8th, 2006, by Jason Zebchuk

Data Value Speculation • Predict the value and proceed speculatively • When the correct value is known, squash and re-execute if the prediction was wrong • Initial proposals: L1 data values; store/load data dependences • Aggressive novel proposals (→ retire speculative instructions): values of L2 misses; thread independence in Thread-Level Speculation (TLS); speculative synchronization

Long-latency Speculation • Misprediction recovery is very wasteful • Most discarded instructions are still useful [Figure: timeline from Checkpoint and Prediction to Resolution over speculatively retired instructions; labels: 6.6 instructions, 210 instructions]

Contributions (I) • ReSlice: architecture to buffer the forward slice and re-execute it on a misprediction [Figure: timeline with Checkpoint, Prediction, Buffered Forward Slice, and Resolution; outcomes labeled "If succeeds" and "If fails"]

Contributions (II) • A sufficient condition for correct slice re-execution and merge • Guarantees that the program state is repaired correctly • Application to Thread-Level Speculation (TLS) on SpecInt • Prediction: task live-ins • Removed 61% of task squashes • Speedup: up to 33% over TLS (avg 12%) • E×D² reduction: avg 20%

Outline • Motivation and Contributions • Idea of ReSlice • Architecture Design • Experimental Results • Conclusions

Idea of ReSlice • Initial execution of the task • Predict the value of a "risky" load and continue • e.g., loads that miss in L2, exposed loads in TLS, … • Buffer the forward slice of the load in HW • If a misprediction happens • Re-execute the slice with the correct value • If re-execution succeeds: merge the register and memory state and continue • If it fails: roll back and re-execute (conventional recovery)
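The recovery decision on this slide can be sketched in a few lines of Python. This is only an illustration of the control flow: `resolve_prediction`, the dictionary-based state, and the `reexecute_slice` callback (returning repaired locations, or `None` on failure) are hypothetical names and simplifications, not part of the ReSlice hardware.

```python
def resolve_prediction(predicted, actual, state, reexecute_slice, checkpoint):
    """Return (action, resulting_state) once the real value of the risky load is known.

    state / checkpoint: dicts mapping a location name to its value (assumed model).
    reexecute_slice:    hypothetical callback; given the correct value it returns
                        a dict of repaired locations, or None if re-execution fails.
    """
    if predicted == actual:
        # Correct prediction: nothing to repair.
        return "continue", state

    updates = reexecute_slice(actual)
    if updates is not None:
        # Re-execution succeeded: merge the repaired values and keep going.
        merged = dict(state)
        merged.update(updates)
        return "merged", merged

    # Re-execution failed: conventional recovery from the checkpoint.
    return "rolled_back", dict(checkpoint)
```

Only the third path discards the work done after the checkpoint; the second path is what ReSlice adds over conventional squash-and-restart recovery.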

Why is this Challenging? • Initial execution is speculative → Buffered slices may not be correct
Code ("risky" load is #1): … #1: LD R1 mem_A … #2: ST R2 (R1) … #3: LD R5 mem[0x20] …
Initial execution: R1 ← 0x10 • ST R2 mem[0x10]
Correct execution: R1 ← 0x20 • ST R2 mem[0x20] • LD R5 mem[0x20]
Problem: Instruction #3 is not buffered!
[Figure columns: Initial Execution, Buffered Slice, Correct Execution]

Solution: Run-time Address Check
Code ("risky" load is #1): … #1: LD R1 mem_A … #2: ST R2 (R1) … #3: LD R5 mem[0x20] …
Initial execution: R1 ← 0x10 • ST R2 mem[0x10]
Slice re-execution: R1 ← 0x20 • ST R2 mem[0x20]
"Inhibiting" Store: different store addresses, and mem[0x20] was loaded in the initial execution
[Figure columns: Initial Execution, Buffered Slice, Slice Re-Execution]

Problems • Missing Insts • Extra Insts • Data Corruption (example just shown) • "Dangling" Load • "Inhibiting" Load [Figure columns: Initial Speculative Execution, Buffered Slice, Slice Re-execution]

A Sufficient Condition • Guarantees correct slice re-execution and correct state merge with the program • Two sub-conditions • Every branch takes the same direction in the initial execution and in the slice re-execution • None of the three problems above is present • Easy for HW to check at run-time • For details, see the paper • (Briefly: compare the addresses accessed in the old and new runs, and also look for references to the same address in the first run)
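A deliberately conservative Python sketch of such a run-time check follows. It is not the exact condition from the paper: the function name, its arguments, and the rule "any changed address that non-slice code also touched in the first run is unsafe" are assumptions made for illustration.

```python
def slice_reexecution_is_safe(old_branches, new_branches,
                              old_addrs, new_addrs, first_run_refs):
    """Simplified stand-in for the sufficient condition (assumed model).

    old_branches / new_branches: branch outcomes of the slice, in slice order.
    old_addrs / new_addrs:       addresses accessed by the slice's memory
                                 instructions in the initial run and in the
                                 re-execution, in slice order.
    first_run_refs:              addresses referenced by non-slice instructions
                                 during the initial run.
    """
    # Sub-condition 1: every branch takes the same direction in both runs.
    if old_branches != new_branches:
        return False
    if len(old_addrs) != len(new_addrs):
        return False

    # Sub-condition 2 (conservative): an address may change only if neither
    # the old nor the new location was touched by code outside the slice.
    for old, new in zip(old_addrs, new_addrs):
        if old != new and (old in first_run_refs or new in first_run_refs):
            return False
    return True
```

With the earlier example, the store moves from 0x10 to 0x20 while mem[0x20] was loaded in the initial run, so `slice_reexecution_is_safe([], [], [0x10], [0x20], {0x20})` rejects the re-execution and the task falls back to a squash.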

How does ReSlice Work? [Timeline figure: Checkpoint • Prediction • Slice Collection • Resolution: compare the correct and predicted value • Correct prediction / Misprediction • Slice Re-Execution: Correct / Incorrect • State Merging]

Architecture Design [Figure: CPU with Cache, Register State, and Memory State; Re-Execution Unit (REU); Slice Buffer holding Slice Descriptors, each with Slice Info, Live Ins, and Buffered Instr.]

Other Architectural Requirements • Tag physical registers with a "SliceTag" • Tag Cache for storing the addresses accessed by slice instructions • Undo Log records the first store to each memory location in each slice • High-coverage predictor to choose which instructions to use as seeds • Uses the dependence and value predictor already used by TLS • Uses TLS information to identify speculatively read/written locations in the cache
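One way to picture the Undo Log is the Python sketch below. The slide only says that the first store to each location in each slice is recorded; that the log keeps the value held before that store, the `UndoLog` class itself, and the dictionary memory model (unwritten locations read as 0) are assumptions made for illustration.

```python
class UndoLog:
    """Per-slice log of the value each location held before its first store."""

    def __init__(self):
        # (slice_id, address) -> value before the slice's first store there
        self._old_values = {}

    def record_store(self, slice_id, addr, memory):
        """Call before a slice store writes memory[addr]."""
        key = (slice_id, addr)
        if key not in self._old_values:      # only the first store is logged
            self._old_values[key] = memory.get(addr, 0)

    def undo(self, slice_id, memory):
        """Restore every location the given slice wrote."""
        for (sid, addr), old in self._old_values.items():
            if sid == slice_id:
                memory[addr] = old


# Usage: two stores by slice 0 to the same address, then undo.
memory = {0x10: 7}
log = UndoLog()
log.record_store(0, 0x10, memory)
memory[0x10] = 8
log.record_store(0, 0x10, memory)   # ignored: not the first store
memory[0x10] = 9
log.undo(0, memory)
```

Logging only the first store keeps one entry per location per slice, which is enough to recover the pre-slice value.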

Step 1: Slice Collection • Fill up the Slice Buffer when a prediction is made • Track both register and memory dependences • Save live-in operands and slice instructions • Slices are buffered when instructions are retired [Figure: architecture diagram with the CPU, Slice Buffer, and Slice Descriptor]

An Example [Slice descriptor contents] • Buffered insts: ADD R3, R1, R2; ST R3, mem[R4] • Live-ins: R4 = 0x10 (for ST R3, mem[R4])
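Slice collection can be thought of as taint tracking over the retired instruction stream. The Python sketch below follows the two instructions of the example above, assuming R1 holds the predicted value. The `Retired` record, `collect_slice`, the register values, and the unrelated third instruction are hypothetical; the sketch also records R2 as a live-in, which the slide does not show, and keeps one value per live-in register rather than one per instruction.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Retired:
    text: str                        # instruction as written on the slide
    srcs: Dict[str, int]             # source register -> value at retirement
    dst: Optional[str] = None        # destination register, if any
    load_addr: Optional[int] = None  # address read, if a load
    store_addr: Optional[int] = None # address written, if a store


@dataclass
class SliceDescriptor:
    instructions: List[str] = field(default_factory=list)
    live_ins: Dict[str, int] = field(default_factory=dict)


def collect_slice(seed_reg, retired_stream):
    tainted_regs = {seed_reg}   # registers that depend on the predicted value
    tainted_mem = set()         # addresses written by slice stores
    desc = SliceDescriptor()
    for ins in retired_stream:
        reg_dep = any(r in tainted_regs for r in ins.srcs)
        mem_dep = ins.load_addr in tainted_mem
        if reg_dep or mem_dep:
            desc.instructions.append(ins.text)
            for reg, value in ins.srcs.items():
                if reg not in tainted_regs:          # produced outside the slice
                    desc.live_ins.setdefault(reg, value)
            if ins.dst is not None:
                tainted_regs.add(ins.dst)
            if ins.store_addr is not None:
                tainted_mem.add(ins.store_addr)
        else:
            # Overwritten by non-slice code: no longer depends on the seed.
            if ins.dst is not None:
                tainted_regs.discard(ins.dst)
            if ins.store_addr is not None:
                tainted_mem.discard(ins.store_addr)
    return desc


stream = [
    Retired("ADD R3, R1, R2", {"R1": 5, "R2": 3}, dst="R3"),
    Retired("ADD R6, R7, R8", {"R7": 1, "R8": 2}, dst="R6"),   # unrelated
    Retired("ST R3, mem[R4]", {"R3": 8, "R4": 0x10}, store_addr=0x10),
]
descriptor = collect_slice("R1", stream)
```

Here `descriptor.instructions` holds the ADD and the ST but not the unrelated instruction, and `descriptor.live_ins` holds R2 and R4.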

Step 2: Slice Re-Execution • REU takes over after a misprediction • In-order execution • The sufficient condition is checked during slice re-execution • On success, merge the register and memory state; otherwise, squash the task and restart [Figure: architecture diagram with the CPU, Re-Execution Unit (REU), and Slice Buffer]
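The in-order re-execution step can be sketched as a tiny interpreter over the buffered instructions. Everything here is hypothetical: the tuple encoding, the two supported opcodes, and the `address_ok` callback that stands in for the sufficient-condition check.

```python
def reexecute(slice_insts, live_ins, seed_reg, correct_value, address_ok):
    """Re-run a buffered slice in program order with the correct seed value.

    slice_insts: tuples ("ADD", dst, src_a, src_b) or
                 ("ST", src, addr_reg, addr_in_initial_run)   (assumed encoding)
    address_ok:  callback(old_addr, new_addr) -> bool, a stand-in for the
                 run-time address check
    Returns (register_updates, memory_updates), or None if the slice must be
    abandoned and the task squashed.
    """
    regs = dict(live_ins)
    regs[seed_reg] = correct_value
    reg_out, mem_out = {}, {}
    for ins in slice_insts:                  # strictly in order
        op = ins[0]
        if op == "ADD":
            _, dst, a, b = ins
            regs[dst] = regs[a] + regs[b]
            reg_out[dst] = regs[dst]
        elif op == "ST":
            _, src, addr_reg, old_addr = ins
            new_addr = regs[addr_reg]
            if not address_ok(old_addr, new_addr):
                return None                  # check failed: squash instead
            mem_out[new_addr] = regs[src]
        else:
            return None                      # opcode not modeled in this sketch
    return reg_out, mem_out
```

For example, `reexecute([("ADD", "R3", "R1", "R2"), ("ST", "R3", "R4", 0x10)], {"R2": 3, "R4": 0x10}, "R1", 7, lambda old, new: old == new)` produces the register and memory updates that the state-merging step would then apply.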

Step 3: State Merging • Copy live registers back to the CPU register file • Merge memory state (details in paper) [Figure: architecture diagram with the CPU, Re-Execution Unit (REU), and Slice Buffer]

Multiple Overlapping Slices • One slice may change the live-ins of the other slice • Detect overlapping slices during slice collection • Combine the instructions of the overlapping slices and re-execute them in program order [Figure: two overlapping slices, each starting at a misprediction]
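Combining overlapping slices in program order might look like the Python sketch below. It assumes each buffered instruction carries a program-order sequence number and treats two slices as overlapping when they share an instruction; both are illustrative assumptions, and the slides do not give the hardware's actual detection rule.

```python
def slices_overlap(slice_a, slice_b):
    """Each slice is a list of (sequence_number, instruction_text) pairs."""
    return bool({seq for seq, _ in slice_a} & {seq for seq, _ in slice_b})


def combine_slices(slice_a, slice_b):
    """Union of both slices, deduplicated and sorted into program order."""
    by_seq = {}
    for seq, text in slice_a + slice_b:
        by_seq[seq] = text
    return [(seq, by_seq[seq]) for seq in sorted(by_seq)]


# Usage: instruction 7 consumes values from both slices.
slice_a = [(3, "ADD R3, R1, R2"), (7, "ADD R6, R3, R5")]
slice_b = [(5, "ADD R5, R4, R4"), (7, "ADD R6, R3, R5")]
combined = combine_slices(slice_a, slice_b)
```

Re-executing `combined` as one unit ensures the shared instruction sees the repaired values from both slices.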

Outline • Motivation and Contributions • Idea of ReSlice • Architecture Design • Experimental Results • Conclusions

Methodology • Simulated CMP architecture • 5 GHz • Private 16k L1 per core; 1 MB shared L2 • SpecInt 2000 applications • Configurations: Serial = one 3-issue core; Baseline (TLS) = four 3-issue cores with TLS; TLS+ReSlice = four 3-issue cores with TLS and ReSlice

Slice Characterization (Averages) • 10.4 instructions • 4.5 reg live-ins • 1.0 mem live-ins • 2.2 reg live-outs • 1.9 mem live-outs • 1.6 slices per task • 15% of tasks with overlapping slices

Number of Task Squashes • 61% of squashes are avoided because of ReSlice

Performance • Speedup TLS+ReSlice • Over TLS: up to 33% (avg 12%) • Over Serial: avg 45%

Energy × Delay² • E×D² reduction: 20% over TLS

Conclusions • Generic architecture for forward slice re-execution • Sufficient condition for correct re-execution and merge • Improves TLS on SpecInt • Speedups: up to 33% over TLS (avg 12%) • E×D² reduction: avg 20% over TLS • Recovering wasted work is a promising approach • Boosts performance • Improves energy efficiency

