
TurboROB: A Low Cost Checkpoint/Restore Accelerator

TurboROB: A Low Cost Checkpoint/Restore Accelerator. Patrick Akl¹ and Andreas Moshovos, AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto (¹now with AMD/ATI).





Presentation Transcript


  1. TurboROB: A Low Cost Checkpoint/Restore Accelerator
     Patrick Akl¹ and Andreas Moshovos
     AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto
     ¹Now with AMD/ATI

  2. What Happens on a Branch Misprediction?
     [Figure: execution timeline. Predict a branch outcome and execute the predicted path; the misprediction is discovered; recover processor state; redirect fetch to the correct path; resume execution]
     • We wish to make recovery fast

  3. Recovery Mechanisms Overview
     • ROB:
       • Buffers all changes
       • Slow
     • Instantaneous checkpoints:
       • Snapshot the state before speculating
       • Fast
       • Problem: we cannot have enough checkpoints
     • Checkpoint prediction:
       • Allocate the few available checkpoints judiciously
     • Speculation control:
       • Sometimes deeper speculation means a higher recovery cost
       • This can hurt performance
       • Throttle speculation

  4. TurboROB Overview
     • Complements or replaces existing mechanisms
       • ROB: recovers at any point
       • TurboROB: recovers only at frequent points
     • Improves performance for most programs
     • Misprediction performance penalty reduced by 28% on average
     • BranchTap comes “for free”
     • Very simple to implement
     • Better than more accurate checkpoint predictors

  5. Outline • Background • BranchTap • Methodology and Results • Summary

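  6. State Recovery Example: Register Alias Table (RAT)
     Original code          Renamed code
     A  add  r1, r2, 100    A  add  p4, p2, 100
     B  breq r1, E          B  breq p4, E
     C  sub  r1, r2, r2     C  sub  p5, p2, p2
     [Figure: the RAT has one entry per architectural register (# arch. regs entries, indexed with lg(# arch. regs) bits), each holding a physical register; the r1 entry changes p1 → p4 → p5 as A and C are renamed, while r2 and r3 stay mapped to p2 and p3]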
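
The renaming on this slide can be replayed in a few lines. This is a minimal sketch, assuming a simple tuple encoding for instructions and a free list that hands out p4 and then p5. The function and variable names are hypothetical. It also records, for each destination, the mapping that was overwritten, which is the state a ROB entry keeps for recovery.

```python
from collections import deque


def rename(program, rat, free_list):
    """Rename registers through the RAT, in program order.

    Each instruction is (label, op, dest, srcs): dest is an architectural
    register or None (e.g. a branch); srcs mixes registers and literals.
    Returns the renamed code and, per renamed destination, the mapping it
    overwrote (the information a ROB entry keeps for undo).
    """
    renamed, undo_log = [], []
    for label, op, dest, srcs in program:
        # Read source mappings before the destination is remapped;
        # literals such as 100 or the branch target E pass through unchanged.
        new_srcs = [rat.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            undo_log.append((label, dest, rat[dest]))
            new_dest = free_list.popleft()  # allocate a free physical register
            rat[dest] = new_dest
        operands = ([new_dest] if new_dest else []) + new_srcs
        renamed.append(f"{label} {op} {', '.join(operands)}")
    return renamed, undo_log


program = [
    ("A", "add", "r1", ["r2", "100"]),
    ("B", "breq", None, ["r1", "E"]),
    ("C", "sub", "r1", ["r2", "r2"]),
]
rat = {"r1": "p1", "r2": "p2", "r3": "p3"}  # initial RAT from the slide
free_list = deque(["p4", "p5"])             # assumed free-list order

renamed, undo_log = rename(program, rat, free_list)
for line in renamed:
    print(line)
```

Running it prints the renamed code column of the slide, and the RAT entry for r1 ends at p5.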

  7. ROB: Slow, Fine-Grain Recovery
     • Each ROB entry contains:
       • The architectural destination register
       • Its previous RAT mapping
     • Recovery (the reorder buffer holds instructions in program order):
       1. The misprediction is discovered; the RAT is now invalid
       2. Locate the newest instruction
       3. Undo RAT updates in reverse order, back to the mispredicted branch
     • Too slow: recovery latency is proportional to the number of instructions to squash
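
The three recovery steps above amount to a backward walk over the ROB. This is a minimal sketch, assuming each walked entry costs one step. Real designs undo several entries per cycle. The `RobEntry` and `rob_recover` names are hypothetical. The example state is the one left by the RAT example on the previous slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    label: str
    arch_dest: Optional[str]  # None for instructions with no destination
    prev_map: Optional[str]   # RAT mapping this instruction overwrote


def rob_recover(rob, rat, branch_index):
    """Undo every RAT update younger than rob[branch_index].

    Returns the number of entries walked; in this sketch each entry costs
    one step, so latency grows linearly with the squashed instructions.
    """
    steps = 0
    for entry in reversed(rob[branch_index + 1:]):  # newest first
        if entry.arch_dest is not None:
            rat[entry.arch_dest] = entry.prev_map
        steps += 1
    del rob[branch_index + 1:]  # squash the wrong-path instructions
    return steps


# State after renaming A, B, C from the RAT example (C wrote r1 -> p5).
rob = [RobEntry("A", "r1", "p1"), RobEntry("B", None, None),
       RobEntry("C", "r1", "p4")]
rat = {"r1": "p5", "r2": "p2", "r3": "p3"}

steps = rob_recover(rob, rat, branch_index=1)  # B was mispredicted
print(steps, rat["r1"], [e.label for e in rob])
```

With only one wrong-path instruction the walk is short. A branch followed by hundreds of in-flight instructions would need hundreds of undo steps, which is the slowness this slide points out.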

  8. Global Checkpoints: Fast, Coarse-Grain Recovery
     [Figure: reorder buffer in program order, with global checkpoints taken at some branches]
     • Misprediction discovered: the RAT is invalid
     • Branch with a global checkpoint (GC): recovery is “instantaneous”, because the RAT is restored directly from the checkpoint
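
A global checkpoint is a full copy of the RAT taken at a branch. Restoring it is a single bulk copy, however many instructions followed the branch. This is a minimal sketch, assuming a pool of 4 checkpoints (the count evaluated later in the deck) allocated in program order. The `CheckpointedRat` class and its methods are hypothetical, not the authors' hardware.

```python
class CheckpointedRat:
    """RAT with a small pool of global checkpoints (hypothetical model)."""

    def __init__(self, initial, capacity=4):
        self.rat = dict(initial)
        self.capacity = capacity
        self.checkpoints = {}  # branch tag -> RAT copy, in program order

    def try_checkpoint(self, tag):
        """Snapshot the RAT at a branch; fail if no checkpoint is free."""
        if len(self.checkpoints) >= self.capacity:
            return False
        self.checkpoints[tag] = dict(self.rat)
        return True

    def release(self, tag):
        """Free a checkpoint once its branch resolves as correctly predicted."""
        self.checkpoints.pop(tag, None)

    def recover(self, tag):
        """Direct recovery: one bulk copy, independent of wrong-path length."""
        tags = list(self.checkpoints)
        self.rat = dict(self.checkpoints[tag])
        for younger in tags[tags.index(tag):]:  # this and younger snapshots
            del self.checkpoints[younger]


crat = CheckpointedRat({"r1": "p1", "r2": "p2", "r3": "p3"})
crat.rat["r1"] = "p4"          # A renamed
crat.try_checkpoint("B")       # checkpoint at branch B
crat.rat["r1"] = "p5"          # C renamed on the predicted path
crat.recover("B")              # B mispredicted
print(crat.rat["r1"], crat.checkpoints)
```

The catch is visible in `try_checkpoint`. Once the few checkpoints are in use, further branches get none and must fall back to slower recovery.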

  9. Impact of More Checkpoints
     [Figure: concept vs. actual implementation: a working RAT copy (architectural register → physical register) plus one RAT copy per checkpoint]
     • More checkpoints?
       • Power-hungry structure
       • Increased delay
     • Only a few checkpoints can practically be implemented
       • They cannot always cover all branches

  10. Intelligent Checkpointing
      • State-of-the-art solution
      • Checkpoint allocation: allocate checkpoints at hard-to-predict branches
      • Checkpoint management: release checkpoints as soon as they are no longer needed
      • Use the few checkpoints efficiently
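
Allocating checkpoints at hard-to-predict branches requires a confidence estimator. The deck states only that a 1K-entry confidence table is used. This is a minimal sketch, assuming a PC-indexed table of resetting counters in the style of the Jacobsen, Rotenberg and Smith estimator. The 4-bit limit of 15, the confidence threshold of 15, the 4-byte instruction indexing and all names are assumptions made for illustration.

```python
class ConfidenceTable:
    def __init__(self, entries=1024, max_count=15, threshold=15):
        self.entries = entries
        self.max_count = max_count      # 4-bit saturating limit (assumed)
        self.threshold = threshold      # "high confidence" cutoff (assumed)
        self.counters = [0] * entries   # all branches start low-confidence

    def _index(self, pc):
        # Drop the 2 low bits (assumes 4-byte instructions), then wrap.
        return (pc >> 2) % self.entries

    def is_low_confidence(self, pc):
        return self.counters[self._index(pc)] < self.threshold

    def update(self, pc, predicted_correctly):
        i = self._index(pc)
        if predicted_correctly:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)
        else:
            self.counters[i] = 0  # resetting counter: clear on a misprediction


def should_checkpoint(table, pc, free_checkpoints):
    """Allocate a checkpoint only at a low-confidence branch, if one is free."""
    return free_checkpoints > 0 and table.is_low_confidence(pc)


table = ConfidenceTable()
pc = 0x400100
print(should_checkpoint(table, pc, free_checkpoints=4))   # True: counter is 0
for _ in range(15):
    table.update(pc, predicted_correctly=True)
print(should_checkpoint(table, pc, free_checkpoints=4))   # False: high confidence
```

The management half of the slide (releasing a checkpoint as soon as its branch resolves) is the release step of the checkpoint pool, not part of the estimator.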

  11. Conventional Mechanisms: Recovery Scenarios
      • Misspeculation on a branch with a GC: direct recovery (fast)
      • Misspeculation on a branch without a GC: indirect recovery (slow)
      • With intelligent checkpointing, 30% of recoveries are indirect, yet they cause 75% of the performance loss
      [Figure: ROB with a checkpointed mispredicted branch (fast recovery) vs. ROB with a mispredicted branch that has no checkpoint (slow recovery)]
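
The gap between direct and indirect recovery explains why a minority of recoveries can dominate the loss. This is a rough latency model, assuming the simplest indirect scheme: a ROB walk from the newest instruction back to the branch. The deck does not specify the exact indirect scheme, and other variants exist. The restore and undo rates are assumed values, and the function name is hypothetical.

```python
import math


def recovery_cycles(has_checkpoint, younger_entries,
                    restore_cycles=1, undo_per_cycle=4):
    """Rough recovery latency for one misprediction (illustrative model).

    restore_cycles and undo_per_cycle are assumed values, not parameters
    from the presentation.
    """
    if has_checkpoint:
        return restore_cycles  # direct recovery: copy the checkpoint back
    # Indirect recovery: undo every ROB entry younger than the branch.
    return math.ceil(younger_entries / undo_per_cycle)


print(recovery_cycles(True, 200))   # 1
print(recovery_cycles(False, 200))  # 50
```

With 200 younger instructions, the indirect recovery in this model costs 50 times more than the direct one.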

  12. Outline • Background • BranchTap • Methodology and Results • Summary

  13. BranchTap Motivation
      [Figure: a low-confidence branch without a checkpoint in the ROB, shown in a no-wait scenario (speculation continues past the branch) and a wait scenario (fetch waits); when the misprediction is discovered, the recovery cost tracks the number of instructions after the branch]
      • Sometimes it is better to wait if no checkpoint is available

  14. BranchTap Concept
      • Key idea: stall when speculation is likely to degrade performance
      • Count the number of low-confidence branches without a checkpoint
        • If the count exceeds a threshold, stall
      • Threshold selection:
        • Fixed:
          • The best threshold varies greatly across programs
          • Can degrade performance significantly
        • Adaptive:
          • Robust performance
          • Minimizes recovery cost while conserving good speculation opportunities
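
The counting rule on this slide needs only a counter and a comparator. This is a minimal sketch with a fixed threshold. It assumes the counter is incremented when a low-confidence branch is renamed without a checkpoint, and decremented when that branch resolves or is squashed. The class and method names are hypothetical, and the adaptive threshold policy is not modeled.

```python
class BranchTapModel:
    """Hypothetical model of the BranchTap stall decision (fixed threshold)."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.unprotected = 0  # in-flight low-confidence branches without a GC

    def on_branch_renamed(self, low_confidence, got_checkpoint):
        if low_confidence and not got_checkpoint:
            self.unprotected += 1

    def on_branch_done(self, low_confidence, had_checkpoint):
        # Called once per branch when it resolves or is squashed, so the
        # counter only tracks branches that are still in flight.
        if low_confidence and not had_checkpoint:
            self.unprotected -= 1

    def should_stall(self):
        # Stall fetch while the count exceeds the threshold.
        return self.unprotected > self.threshold


tap = BranchTapModel(threshold=2)
tap.on_branch_renamed(low_confidence=False, got_checkpoint=False)  # ignored
for _ in range(3):
    tap.on_branch_renamed(low_confidence=True, got_checkpoint=False)
stall = tap.should_stall()   # 3 > 2: stall
tap.on_branch_done(low_confidence=True, had_checkpoint=False)
after = tap.should_stall()   # 2 > 2 is false: resume fetching
print(stall, after)
```

A small threshold stalls often and wastes good speculation, and a large one rarely stalls at all. That trade-off is why the slide argues for adapting the threshold.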

  15. Threshold Adaptation Policy
      • BranchTap adapts its threshold across and within applications

  16. Outline • Background • BranchTap • Methodology and Results • Summary

  17. Results Overview
      • Performance without checkpoints
        • BranchTap improves performance even with just a ROB
      • Performance with 4 checkpoints
        • BranchTap improves over conventional recovery methods
      • Performance with larger checkpoint predictors
        • BranchTap offers better performance than a 64x larger predictor

  18. Methodology
      • Simulator based on SimpleScalar
      • 24 SPEC CPU 2000 benchmarks, reference inputs
      • Processor configurations:
        • 8-way out-of-order (OoO) core
        • Up to 1K in-flight instructions
        • 1K-entry confidence table for identifying low-confidence branches
      • 1B committed instructions after skipping 100B

  19. “Perfect Checkpointing” Configuration
      • A checkpoint is auto-magically taken at every mispredicted branch
      • All recoveries are fast
      • We report the “deterioration relative to perfect checkpointing”
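
The deck does not give the exact formula for “deterioration relative to perfect checkpointing”. This is a minimal sketch of the arithmetic under two assumptions. First, deterioration is the extra execution time over the perfect-checkpointing run. Second, figures such as “-39% deterioration” are relative changes between two such deteriorations. The cycle counts below are hypothetical inputs chosen only to illustrate the calculation, not results from the study.

```python
def deterioration(cycles, perfect_cycles):
    """Extra execution time relative to the perfect-checkpointing run."""
    return (cycles - perfect_cycles) / perfect_cycles


def relative_change(new, base):
    """Relative change of one deterioration with respect to another."""
    return (new - base) / base


# Hypothetical cycle counts for one benchmark (not results from the deck).
perfect, conventional, branchtap = 1_000_000, 1_100_000, 1_061_000

d_conv = deterioration(conventional, perfect)  # 0.10
d_bt = deterioration(branchtap, perfect)       # 0.061
change = relative_change(d_bt, d_conv)         # about -0.39
print(f"{d_conv:.3f} {d_bt:.3f} {change:+.0%}")
```

Under these assumptions, a 39% reduction in deterioration means BranchTap removes 39% of the gap between conventional recovery and perfect checkpointing. This reading matches the summary's “~39% of misprediction penalty removed” with no checkpoints.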

  20. Performance with No Checkpoints
      [Figure: deterioration relative to “perfect checkpointing” (lower is better); BranchTap reduces the deterioration by 39%]
      • BranchTap improves over conventional mechanisms
      • Adaptation leads to robust performance improvements

  21. Performance Evaluation with 4 Checkpoints
      [Figure: deterioration relative to “perfect checkpointing” (lower is better); BranchTap reduces the deterioration by 28%]
      • BranchTap with 4 checkpoints is better than 6 checkpoints alone

  22. BranchTap vs. Larger Checkpoint Predictors
      • BranchTap with a 1K-entry confidence table and 4 GCs:
        • Higher performance than a 64K-entry confidence table with 4 GCs
        • Lower complexity; comes virtually “for free”
      [Figure: deterioration (lower is better) vs. confidence table size, with BranchTap shown for comparison]

  23. Outline • Background • BranchTap • Methodology and Results • Summary

  24. Summary
      • Performance with 4 (no) checkpoints:
        • ~28 (39)% of the misprediction penalty removed
      • BranchTap is robust:
        • Up to 6 (13)% better and at most 1.2 (0.1)% worse than conventional mechanisms
      • BranchTap is very simple to implement:
        • A few counters and comparators
      • BranchTap is better than other alternatives:
        • BT + 1K predictor is better than a 64K predictor alone
        • BT + 4 GCs is better than 6 GCs alone

  25. BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control
      Patrick Akl and Andreas Moshovos
      AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto
      {pakl, moshovos}@eecg.toronto.edu
