TurboROB: A Low Cost Checkpoint/Restore Accelerator
Patrick Akl and Andreas Moshovos
AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto
{pakl, moshovos}@eecg.toronto.edu
Recovering From Control Flow Mispredictions
[Figure: Execution Timeline. Predict a Branch Outcome, follow the Predicted Path, Misprediction Discovered, Recover Processor State, Redirect Fetch, Resume Execution on the Correct Path]
• Accelerate Recovery – Improve Performance
State-of-the-Art Recovery
[Figure: between "Predict a Branch Outcome" and "Misprediction Discovered", state is kept either as a State Snapshot or as a Log of Changes (the ROB); log columns: what, old value]
• Scalability and/or Performance Issues
Turbo-ROB
[Figure: between "Predict a Branch Outcome" and "Misprediction Discovered", the ROB keeps a Log of Changes; the Turbo-ROB keeps a Partial Log of Changes]
• Make common case fast:
  • Recover only at branches
• Store only as much as needed:
  • Partial Log
Outline
• Control Flow Misspeculation Recovery
• TurboROB
• Methodology and Results
• Summary
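State Recovery Example: Register Alias Table
Original Code
A add r1, r2, 100
B breq r1, E
C sub r1, r2, r2
Renamed Code
A add p4, p2, 100
B breq p4, E
C sub p5, p2, p2
[Figure: the RAT has one entry per Architectural Register (# arch. regs entries, labeled Lg(# arch. regs)), each holding a Physical Register name. The entry for r1 goes from p1 to p4 (after A) to p5 (after C), and back to p4 on recovery at B.]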
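The renaming in the example above can be reproduced with a small sketch. This is an illustration only, not the authors' implementation: the class name `RenameTable`, the identity initial mapping (r1 to p1, r2 to p2, and so on), and the in-order free list are assumptions chosen so that the numbers match the slide.

```python
class RenameTable:
    """Toy register alias table (RAT): arch register index -> physical register index."""

    def __init__(self, num_arch_regs, num_phys_regs):
        # Assumption: architectural register i starts out mapped to physical register i.
        self.rat = list(range(num_arch_regs))
        # Assumption: the remaining physical registers are handed out in order.
        self.free = list(range(num_arch_regs, num_phys_regs))

    def rename(self, dest, srcs):
        """Rename one instruction; dest is None for instructions without a destination."""
        # Sources read the current mapping before the destination is remapped.
        phys_srcs = [self.rat[s] for s in srcs]
        phys_dest = None
        if dest is not None:
            phys_dest = self.free.pop(0)
            self.rat[dest] = phys_dest
        return phys_dest, phys_srcs


table = RenameTable(num_arch_regs=4, num_phys_regs=8)
a = table.rename(dest=1, srcs=[2])        # A: add r1, r2, 100
b = table.rename(dest=None, srcs=[1])     # B: breq r1, E
c = table.rename(dest=1, srcs=[2, 2])     # C: sub r1, r2, r2
print(a, b, c)
```

If B turns out to be mispredicted, the RAT entry for r1 has to go back from p5 to p4, which is the state recovery problem the rest of the talk addresses.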
ROB: Slow, Fine-Grain Recovery
[Figure: Reorder Buffer in Program Order with branches (B); the RAT is INVALID until recovery completes]
Each entry contains
• Architectural destination register
• Its previous RAT map
Recovery steps
1. Misprediction discovered
2. Locate newest instruction
3. Undo RAT updates in reverse order
• Too slow: recovery latency proportional to number of instructions to squash
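The ROB walk described above can be sketched as follows. This is a minimal model under stated assumptions: the ROB is a Python list in program order, each entry is a hypothetical `(arch_dest, prev_map)` tuple, and the helper names `rob_dispatch` and `rob_recover` are made up for illustration.

```python
def rob_dispatch(rat, rob, dest, new_map):
    """Append one instruction to the ROB, logging the previous RAT map of its destination."""
    if dest is None:
        rob.append((None, None))          # e.g. a branch: no destination register
    else:
        rob.append((dest, rat[dest]))     # remember the mapping being overwritten
        rat[dest] = new_map
    return len(rob) - 1                   # ROB index of the new entry


def rob_recover(rat, rob, branch_index):
    """Squash everything younger than rob[branch_index], newest first.

    Returns the number of entries walked, which is what recovery latency
    is proportional to in this scheme.
    """
    steps = 0
    while len(rob) > branch_index + 1:
        dest, prev_map = rob.pop()        # newest instruction first
        if dest is not None:
            rat[dest] = prev_map          # undo its RAT update
        steps += 1
    return steps


rat = {"r1": "p1", "r2": "p2"}
rob = []
rob_dispatch(rat, rob, "r1", "p4")            # A: add r1, ...
branch = rob_dispatch(rat, rob, None, None)   # B: breq
rob_dispatch(rat, rob, "r1", "p5")            # C: sub r1, ...
rob_dispatch(rat, rob, "r2", "p6")            # a further wrong-path instruction
print(rob_recover(rat, rob, branch), rat)
```

Every squashed instruction costs one step here, even when several of them wrote the same register.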
Global Checkpoints: Fast, Coarse-Grain Recovery
[Figure: Reorder Buffer in Program Order with branches (B), four of them with a checkpoint; the RAT is INVALID after the misprediction]
• Misprediction discovered
• Branch w/ GC: Recovery is “Instantaneous”
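For contrast with the ROB walk, checkpoint-based recovery can be sketched as a whole-RAT snapshot that is copied back in one step. The class name `CheckpointedRAT`, the `max_checkpoints` limit, and the policy of freeing the restored checkpoint together with all younger ones are assumptions made for this illustration.

```python
class CheckpointedRAT:
    """Toy RAT with a small, fixed number of global checkpoints (GCs)."""

    def __init__(self, rat, max_checkpoints):
        self.rat = dict(rat)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}             # branch id -> RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT at a branch; fails when all checkpoints are in use."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False                  # this branch is not covered by a GC
        self.checkpoints[branch_id] = dict(self.rat)
        return True

    def restore(self, branch_id):
        """Recover in one step, independent of how many instructions are squashed."""
        self.rat = dict(self.checkpoints[branch_id])
        # Assumption: this checkpoint and all younger ones are freed on recovery.
        ids = list(self.checkpoints)
        for b in ids[ids.index(branch_id):]:
            del self.checkpoints[b]


gc = CheckpointedRAT({"r1": "p4", "r2": "p2"}, max_checkpoints=1)
covered = gc.take_checkpoint("B")
gc.rat["r1"] = "p5"                       # wrong-path update after the branch
gc.restore("B")
print(covered, gc.rat)
```

A branch for which `take_checkpoint` returns `False` has no snapshot, so it would need some other recovery mechanism.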
Impact of More Checkpoints
[Figure: RAT checkpoints next to the Working Copy, shown as Concept and as Actual Implementation; labels: architectural register, physical register]
• More checkpoints?
  • Power hungry structure
  • Increased delay
• Only a few checkpoints can practically be implemented
• Cannot always cover all branches
Intelligent Checkpointing & BranchTap
[Figure: branches (B) in program order, some of them with a checkpoint]
• Use Few Checkpoints Effectively
• BranchTap:
  • Throttle Speculation
Conventional Mechanisms: Recovery Scenarios
[Figure: three timelines of branches (B) with a checkpoint; one scenario is labeled Re-Execution]
Outline
• Background
• Turbo-ROB
• Methodology and Results
• Summary
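Turbo-ROB ~ Recovery Cost
[Figure: a branch (B) followed by instructions writing R2, R1, R1, R2, R1; during ROB Recovery some undo steps are marked useful and the rest redundant]
• We only need to reverse the first subsequent change for every RAT entry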
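The partial-log idea can be sketched in a few lines. This is a reading of the slide, not the authors' design: the class name `TurboROB`, the `seen` set used to detect the first change to a register after a branch, and the log entry layout are all assumptions.

```python
class TurboROB:
    """Toy partial log: after each branch, record only the first change to each RAT entry."""

    def __init__(self, rat):
        self.rat = dict(rat)
        self.log = []        # ("branch", id) markers and ("map", arch_reg, prev_map) entries
        self.seen = set()    # registers already logged since the most recent branch

    def branch(self, branch_id):
        self.log.append(("branch", branch_id))
        self.seen.clear()    # the next write to any register is a "first change" again

    def update(self, arch_reg, new_map):
        # Log only if some branch is in flight and this register is not yet logged.
        if self.log and arch_reg not in self.seen:
            self.log.append(("map", arch_reg, self.rat[arch_reg]))
            self.seen.add(arch_reg)
        self.rat[arch_reg] = new_map

    def recover(self, branch_id):
        """Undo logged changes newest first, back to the branch marker; return undo steps."""
        steps = 0
        while self.log:
            entry = self.log.pop()
            if entry[0] == "branch":
                if entry[1] == branch_id:
                    break
            else:
                _, reg, prev_map = entry
                self.rat[reg] = prev_map
                steps += 1
        self.seen.clear()    # conservative: may log redundantly later, never incorrectly
        return steps


trob = TurboROB({"R1": "p1", "R2": "p2"})
trob.branch("B")
for reg, phys in [("R2", "p3"), ("R1", "p4"), ("R1", "p5"), ("R2", "p6"), ("R1", "p7")]:
    trob.update(reg, phys)
print(trob.recover("B"), trob.rat)
```

For the five writes in the figure, a full ROB walk would undo five entries, while this log holds only the two useful ones. The trade-off in this sketch is that recovery is only possible at logged branches, which is the "recover only at branches" point made earlier in the talk.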
Turbo-ROB Replacing the ROB
[Figure: two timelines of branches (B) tracked in the TROB; one scenario is labeled Re-Execution]
Selective Turbo-ROB w/ ROB
[Figure: branches (B) with the TROB]

Selective Turbo-ROB w/ GCs
[Figure: branches (B) with the TROB and a checkpoint]
Outline
• Background
• TurboROB
• Methodology and Results
• Summary
Results Overview
• TROB as an ROB replacement
  • BranchTap offers better performance than ROB
  • Fewer resources
  • Even for smaller windows
• Selective TROB as a GC reduction mechanism
  • TROB reduces pressure for GCs
  • Offload a critical structure: RAT
• In the paper:
  • Selective TROB as an ROB accelerator
  • Even the smallest TROB accelerates recovery
Methodology
• Simulator based on SimpleScalar
  • Alpha/OSF
• 24 SPEC CPU 2000 benchmarks
  • Reference Inputs
• Processor configurations
  • 4-way OoO core
  • 128/256/512 in-flight instructions
  • 1K-entry confidence table for low confidence branch identification / similar results with Anyweak
• 1B committed instructions after skipping 2B
“Perfect Checkpointing” Configuration
• A checkpoint is auto-magically taken at all mispredicted branches
• All recoveries are fast
• We report the “deterioration relative to perfect checkpointing”
TROB Replacing the ROB/512-Entry Window
• 64-entry TROB == ROB on average
• Pathological cases exist: 256-entry needed
• 512-entry TROB better than ROB
TROB Replacing the ROB/128-Entry Window
• 64-entry TROB 50% better than ROB
• Fewer pathological cases
• 128-entry TROB better than ROB
sTROB and Global Checkpoints/128-Entry Window
• TROB + 1 GC better than 4 GCs
Summary
• TROB vs. ROB
  • Replacement:
    • Same resources: better performance
    • Fewer resources: often better performance
    • Except when accuracy is high
  • Acceleration:
    • ¼ resources: 35% improvement
• TROB vs. GCs
  • Reduce pressure from the critical path
  • With just 1 GC, match the performance of four GCs
• One more alternative for designers
  • Allows different area/performance/power tradeoffs