
TurboROB A Low Cost Checkpoint/Restore Accelerator

Patrick Akl and Andreas Moshovos, AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto. {pakl, moshovos}@eecg.toronto.edu


Presentation Transcript


  1. TurboROB: A Low Cost Checkpoint/Restore Accelerator • Patrick Akl and Andreas Moshovos, AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto • {pakl, moshovos}@eecg.toronto.edu

  2. Recovering From Control Flow Mispredictions • Execution timeline: predict a branch outcome and follow the predicted path; misprediction discovered; recover processor state; redirect fetch to the correct path; resume execution • Accelerate Recovery – Improve Performance

  3. State-of-the-Art Recovery • Between "Predict a Branch Outcome" and "Misprediction Discovered", state is protected either by a log of changes (the ROB: what changed and its old value) or by a state snapshot • Scalability and/or Performance Issues

  4. Turbo-ROB • Between "Predict a Branch Outcome" and "Misprediction Discovered", the ROB keeps a log of changes; the Turbo-ROB keeps a partial log of changes • Make the common case fast: recover only at branches • Store only as much as needed: partial log

  5. Outline • Control Flow Misspeculation Recovery • TurboROB • Methodology and Results • Summary

  6. State Recovery Example: Register Alias Table • The RAT maps each architectural register to a physical register (figure labels: # arch. regs, Lg(# arch. regs)) • Original code: A: add r1, r2, 100; B: breq r1, E; C: sub r1, r2, r2 • Renamed code: A: add p4, p2, 100; B: breq p4, E; C: sub p5, p2, p2 • RAT entries shown in the figure: p1, p4 and p5 in turn for r1, plus p2 and p3
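The renaming in the example above can be sketched in a few lines of Python. This is only an illustration, not the authors' hardware: the `rename` helper, the tuple layout for instructions, the free list, and the initial mapping of r1 to p1 are assumptions made for the sketch.

```python
def rename(code, rat, free_list):
    """Rename (op, dest, srcs) tuples using a register alias table.

    Hypothetical layout: dest is None for instructions such as branches
    that write no register; immediates and labels pass through unchanged.
    """
    renamed = []
    for op, dest, srcs in code:
        # Sources read the current mapping before the destination is remapped.
        new_srcs = [rat.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = free_list.pop(0)   # allocate a fresh physical register
            rat[dest] = new_dest          # later readers of dest see the new name
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Assumed starting state: r1 -> p1, r2 -> p2, with p4 and p5 free.
rat = {"r1": "p1", "r2": "p2"}
code = [
    ("add", "r1", ["r2", "100"]),   # A
    ("breq", None, ["r1", "E"]),    # B
    ("sub", "r1", ["r2", "r2"]),    # C
]
renamed = rename(code, rat, ["p4", "p5"])
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, r1 maps to p5. If branch B turns out to be mispredicted, the entry for r1 has to go back to p4, which is the state recovery problem the rest of the talk addresses.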

  7. ROB: Slow, Fine-Grain Recovery • Each reorder buffer entry contains the architectural destination register and its previous RAT map • Recovery steps: 1. Misprediction discovered 2. Locate newest instruction 3. Undo RAT updates in reverse order • (figure: reorder buffer in program order with branches marked B; the RAT is marked INVALID during recovery) • Too slow: recovery latency proportional to number of instructions to squash
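The three recovery steps on this slide can be sketched as a reverse walk over the log. This is a minimal sketch under stated assumptions: the ROB is modelled as a Python list, each entry is either `None` (no destination register) or a hypothetical `(arch_dest, previous_mapping)` pair, and `rob_recover` is a name invented for the illustration.

```python
def rob_recover(rat, rob, branch_index):
    """Undo the RAT updates of every entry younger than rob[branch_index].

    Walks from the newest entry back to the branch, restoring each
    destination's previous mapping. Returns the number of entries walked,
    which is what makes the latency proportional to the squash count.
    """
    steps = 0
    while len(rob) > branch_index + 1:
        entry = rob.pop()                 # newest instruction first
        steps += 1
        if entry is not None:
            arch_dest, previous = entry
            rat[arch_dest] = previous     # undo this rename
    return steps


# Assumed state at the moment the misprediction is discovered.
rat = {"r1": "p7", "r2": "p6"}
rob = [
    ("r1", "p1"),   # add: r1 was p1, became p4
    None,           # the mispredicted branch
    ("r1", "p4"),   # r1 became p5
    ("r2", "p2"),   # r2 became p6
    ("r1", "p5"),   # r1 became p7
]
walked = rob_recover(rat, rob, branch_index=1)
print(walked, rat)
```

Three entries are walked to restore two RAT entries, so one of the undo steps did work that a later step overwrote.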

  8. Global Checkpoints: Fast, Coarse-Grain Recovery • (figure: reorder buffer in program order with branches marked B, some of them carrying a checkpoint) • Misprediction discovered: the invalid RAT is restored from a checkpoint • Branch w/ GC: Recovery is “Instantaneous”
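Checkpoint-based recovery can be sketched as copying the whole RAT at a branch and copying it back on a misprediction. The class below is an illustration only: `CheckpointedRAT`, its method names, and the `max_checkpoints` limit are assumptions standing in for the small number of hardware checkpoints the talk mentions.

```python
class CheckpointedRAT:
    """RAT with a bounded number of full-copy checkpoints (hypothetical model)."""

    def __init__(self, mapping, max_checkpoints):
        self.rat = dict(mapping)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}             # branch id -> RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT at a branch; fails when no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.rat)
        return True

    def restore(self, branch_id):
        """Restore in one step, then free this and all younger checkpoints."""
        self.rat = dict(self.checkpoints[branch_id])
        ids = list(self.checkpoints)      # dicts keep insertion order
        for stale in ids[ids.index(branch_id):]:
            del self.checkpoints[stale]


c = CheckpointedRAT({"r1": "p1", "r2": "p2"}, max_checkpoints=1)
c.rat["r1"] = "p4"
took_first = c.take_checkpoint("B1")      # covered branch
c.rat["r1"] = "p5"
took_second = c.take_checkpoint("B2")     # no checkpoint left for this branch
c.restore("B1")
print(took_first, took_second, c.rat)
```

The second branch is left uncovered, which mirrors the point that a few checkpoints cannot always cover all branches.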

  9. Impact of More Checkpoints • (figure: RAT checkpoints next to the working copy, concept vs. actual implementation, mapping architectural registers to physical registers) • More checkpoints? • Power hungry structure • Increased delay • Only a few checkpoints can practically be implemented • Cannot always cover all branches

  10. Intelligent Checkpointing & BranchTap • (figure: branches marked B, some of them carrying a checkpoint) • Use Few Checkpoints Effectively • BranchTap: Throttle Speculation

  11. Conventional Mechanisms: Recovery Scenarios • (figure: three timelines of branches marked B, each with a checkpoint; one scenario involves re-execution)

  12. Outline • Background • Turbo-ROB • Methodology and Results • Summary

  13. Turbo-ROB ~ Recovery Cost • (figure: a branch B and instructions writing R2, R1, R1, R2, R1; ROB recovery steps are marked useful or redundant) • We only need to reverse the first subsequent change for every RAT entry
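The partial-log idea can be sketched as follows: after each branch, record only the first change to each RAT entry and skip the rest. This is a sketch of the principle stated on the slide, not the authors' design. The `TurboROB` class, the `logged` set used to detect a first write, and the reading of the figure as writes to R2, R1, R1, R2, R1 after the branch are all assumptions.

```python
class TurboROB:
    """Partial log of RAT changes, recoverable only at branches (illustrative)."""

    def __init__(self, rat):
        self.rat = dict(rat)
        self.log = []          # ("branch", id) or ("map", arch_reg, previous_mapping)
        self.logged = set()    # registers already logged since the latest branch

    def branch(self, branch_id):
        self.log.append(("branch", branch_id))
        self.logged.clear()    # the next write to any register is a first change

    def write(self, arch_reg, new_phys):
        if arch_reg not in self.logged:
            # Only the first change after a branch is needed for recovery.
            self.log.append(("map", arch_reg, self.rat[arch_reg]))
            self.logged.add(arch_reg)
        self.rat[arch_reg] = new_phys

    def recover(self, branch_id):
        """Undo logged changes back to the given branch; returns undo count."""
        undone = 0
        while self.log[-1] != ("branch", branch_id):
            entry = self.log.pop()
            if entry[0] == "map":
                self.rat[entry[1]] = entry[2]
                undone += 1
        self.logged.clear()
        return undone


t = TurboROB({"r1": "p1", "r2": "p2"})
t.branch("B")
for reg, phys in [("r2", "p3"), ("r1", "p4"), ("r1", "p5"), ("r2", "p6"), ("r1", "p7")]:
    t.write(reg, phys)
logged_entries = len(t.log) - 1           # exclude the branch marker
undone = t.recover("B")
print(logged_entries, undone, t.rat)
```

Five register writes produce two log entries and two undo steps, where a full ROB walk would have taken five.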

  14. Turbo-ROB Replacing the ROB • (figure: two timelines of branches marked B recovered from the TROB; one scenario involves re-execution)

  15. Selective Turbo-ROB w/ ROB • Selective Turbo-ROB w/ GCs • (figure: timelines of branches marked B with TROB coverage and, in the GC case, a checkpoint)

  16. Outline • Background • TurboROB • Methodology and Results • Summary

  17. Results Overview • TROB as an ROB replacement • BranchTap offers better performance than ROB • Fewer resources • Even for smaller windows • Selective TROB as a GC reduction mechanism • TROB reduces pressure for GCs • Offload a critical structure: RAT • In the paper: • Selective TROB as an ROB accelerator • Even the smallest TROB accelerates recovery

  18. Methodology • Simulator based on SimpleScalar • Alpha/OSF • 24 SPEC CPU 2000 benchmarks • Reference inputs • Processor configurations • 4-way OoO core • 128/256/512 in-flight instructions • 1K-entry confidence table for low-confidence branch identification / similar results with Anyweak • 1B committed instructions after skipping 2B

  19. “Perfect Checkpointing” Configuration • A checkpoint is auto-magically taken at all mispredicted branches • All recoveries are fast • We report the “deterioration relative to perfect checkpointing”

  20. TROB Replacing the ROB / 512-Entry Window • (chart, with the direction of improvement labelled “better”) • 64-entry TROB == ROB on the Average • Pathological cases exist → 256-entry needed • 512-Entry TROB better than ROB

  21. TROB Replacing the ROB / 128-Entry Window • (chart, with the direction of improvement labelled “better”) • 64-Entry 50% better than ROB • Fewer pathological cases • 128-Entry TROB better than ROB

  22. sTROB and Global Checkpoints / 128-Entry Window • (chart, with the direction of improvement labelled “better”) • TROB + 1 GC better than 4 GCs

  23. Summary • TROB vs. ROB • Replacement • Same resources → better performance • Fewer resources → often better performance • Except when accuracy is high • Acceleration: • ¼ resources → 35% improvement • TROB vs. GCs • Reduce pressure from the critical path • With just 1 GC match the performance of four GCs • One more alternative for designers • Allows different area/performance/power tradeoffs

