
Towards Performance-Efficient Temporal Redundancy



Presentation Transcript


  1. CSE 598C Project Towards Performance-Efficient Temporal Redundancy Sudhanva Gurumurthi

  2. Approaches to Redundancy • Spatial Redundancy • Hardware duplication • IBM G3, HP NonStop Himalaya • Informational Redundancy • Parity and ECC • Temporal Redundancy

  3. “Sphere of Replication” • [Figure with labels: Input Replicator, Output Comparator, Rest of the System] • Source: Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives”, ISCA’02

  4. Sphere of Replication in Superscalar Processors • [Figure: superscalar pipeline with labels PC, I-Cache, Decode, RAT, ROB, FUs, Registers, Commit, a comparison (=), and instruction copies I0 and I1] • Source: J. Ray et al., “Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery”, MICRO’01.

  5. System Configuration • 8-wide superscalar processor • 128-entry RUU, 64-entry LSQ • 4 I-ALUs, 3 I-MULT/DIVs, 2 FP-ALUs, 1 FP-MULT/DIV/SQRT • 32 KB, 2-way L1-dCache • 64 KB, 2-way L1-iCache • 512 KB, 4-way L2-Cache • 112-cycle memory latency

  6. Performance Loss

  7. Towards Single-Thread Performance • The instruction scheduler is oblivious to the presence of two execution contexts. • Goals • Minimize impact on the critical-path of execution. • Critical-instruction scheduling

  8. The Concept of Criticality and Slack • Only some of the instructions in a program cause bottlenecks. • Critical Instructions • Critical instructions cannot be delayed! • Slack measures how critical an instruction is: the number of cycles it can be delayed without lengthening execution. Critical instructions have zero slack.
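
The notion of slack can be made concrete with a small sketch. The code below computes slack over a data-dependence graph only, which is a simplification: it ignores the runtime resource dependences that the microarchitectural critical path also includes. The function name `compute_slack`, the four-instruction graph, and its latencies are hypothetical and chosen for illustration.

```python
from collections import defaultdict

def compute_slack(latency, deps):
    """Return per-instruction slack in cycles.

    latency: dict mapping instruction name -> latency, listed in
             topological (program) order. Hypothetical input format.
    deps:    list of (producer, consumer) data-dependence edges.
    """
    order = list(latency)
    preds, succs = defaultdict(list), defaultdict(list)
    for producer, consumer in deps:
        preds[consumer].append(producer)
        succs[producer].append(consumer)

    # Forward pass: earliest cycle at which each instruction can start.
    earliest = {}
    for n in order:
        earliest[n] = max((earliest[p] + latency[p] for p in preds[n]), default=0)
    total = max(earliest[n] + latency[n] for n in order)

    # Backward pass: latest start that does not lengthen total execution.
    latest = {}
    for n in reversed(order):
        finish_by = min((latest[s] for s in succs[n]), default=total)
        latest[n] = finish_by - latency[n]

    return {n: latest[n] - earliest[n] for n in order}

# Illustrative graph: I1 (4 cycles) and I2 (2 cycles) both feed I3, which feeds I4.
slack = compute_slack(
    {"I1": 4, "I2": 2, "I3": 1, "I4": 1},
    [("I1", "I3"), ("I2", "I3"), ("I3", "I4")],
)
print(slack)
```

In this graph I2 can be delayed by two cycles without affecting I3, while I1, I3, and I4 have zero slack and are therefore critical.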

  9. Microarchitectural Critical-Path • A compiler can examine data dependences but not resource dependences. • ROB stalls • Branch mispredictions • Routing network stalls • The critical path is a function of data dependences and inherent instruction latencies, as well as resource usage at runtime.

  10. Critical-Path Prediction [B. Fields et al, ISCA’01] • Tries to construct a dependence-graph model of the critical path at runtime. • Considers both machine-independent data dependences and machine-specific resource dependences. • Tracks the last-arriving chain of edges along the dynamically constructed microarchitectural dependence graph using a token-passing algorithm.

  11. Token-Passing Example • [Figure: dependence graph with labels ROB Size and Critical] • 1. Plant token • 2. Propagate token • 3. Is token alive? • 4. Yes, train critical • Found CP without constructing entire graph • Adapted from Brian Fields’s ISCA’01 slides
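
The plant, propagate, and check steps of token passing can be sketched in a few lines. This is a deliberately simplified model, not the hardware described by Fields et al.: it uses one node per instruction and a single last-arriving edge per instruction. The function name `token_survives`, the `last_arriving` list format, and the `horizon` parameter are assumptions made for this sketch.

```python
def token_survives(last_arriving, plant_at, horizon):
    """Simplified token passing over a stream of instructions.

    last_arriving[j] is the index of the earlier instruction whose
    result arrived last at instruction j, or None if j waited on
    nothing in the stream (hypothetical input format).
    horizon (>= 1) is how many instructions later the token is checked.
    """
    has_token = [False] * len(last_arriving)
    has_token[plant_at] = True                  # 1. plant token

    for j in range(plant_at + 1, len(last_arriving)):
        src = last_arriving[j]
        if src is not None and has_token[src]:  # 2. propagate along the
            has_token[j] = True                 #    last-arriving edge

    # 3. is the token alive at or beyond the horizon?
    return any(has_token[plant_at + horizon:])

# Instruction 0 feeds nothing; instructions 1..7 form a last-arriving chain.
chain = [None, None, 1, 2, 3, 4, 5, 6]
print(token_survives(chain, plant_at=0, horizon=4))  # token dies: not critical
print(token_survives(chain, plant_at=1, horizon=4))  # token lives: train critical
```

A surviving token means the planted instruction lies on a long chain of last-arriving edges, so it is trained as critical (step 4) without ever materializing the full graph.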

  12. Hardware Implementation • Two components • Critical-path table • Trainer • Critical-Path Table • Stores predictions indexed by PC (16K-entry with 6-bit hysteresis) • Looked up in parallel with instruction fetch • Trainer • Token array • Stores the dependence graph of the ROB-size most recent instructions, with 1 bit per node to show whether the token propagated into that node.
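
A PC-indexed table with saturating hysteresis counters might look like the following minimal sketch. The 16K entries and 6-bit counter width come from the slide; the increment, decrement, and threshold values, the index function, and the class name `CriticalPathTable` are assumptions chosen for illustration.

```python
class CriticalPathTable:
    """PC-indexed criticality predictor with saturating counters (sketch)."""

    def __init__(self, entries=16 * 1024, bits=6, inc=8, dec=1, threshold=8):
        self.entries = entries
        self.max_count = (1 << bits) - 1   # 63 for a 6-bit counter
        self.inc = inc                     # assumed training increment
        self.dec = dec                     # assumed training decrement
        self.threshold = threshold         # assumed prediction threshold
        self.counters = [0] * entries

    def _index(self, pc):
        # Assumed index function: drop the byte offset, wrap to table size.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Looked up alongside fetch: is this instruction predicted critical?"""
        return self.counters[self._index(pc)] >= self.threshold

    def train(self, pc, critical):
        """Called by the trainer once a planted token is found alive or dead."""
        i = self._index(pc)
        if critical:
            self.counters[i] = min(self.counters[i] + self.inc, self.max_count)
        else:
            self.counters[i] = max(self.counters[i] - self.dec, 0)

table = CriticalPathTable()
table.train(0x400100, critical=True)
print(table.predict(0x400100))
```

The asymmetric increment and decrement give the hysteresis its purpose: one critical training flips the prediction, while a single non-critical training only nudges the counter.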

  13. Hardware Implementation

  14. Evaluation Methodology • SimpleScalar 3.0 • Code modified to simulate temporal-redundant execution based on [Ray’01]. • Critical-path predictor integrated with above. • Fast-forward 1 billion instructions. • Detailed simulation of 1 billion instructions.

  15. Scheduling Strategies • base • Default scheduler used in SimpleScalar • Loads, long-latency instructions and branches selected first. • Other instructions selected in oldest-first order.

  16. Scheduling Strategies • cp-all • All predicted-critical instructions selected first. • Other instructions selected in oldest-first order. • cp-dual • All predicted-critical instructions in redundant context selected first. • Other predicted-critical instructions next. • Other instructions (both contexts) in oldest-first order.
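
The three selection policies (base, cp-all, cp-dual) can be expressed as sort keys over the set of ready instructions. This is a sketch under stated assumptions, not the modified SimpleScalar code: the `ReadyInstr` fields, the key functions, and the use of oldest-first order as the tie-breaker inside each priority group are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReadyInstr:
    seq: int               # age; a lower value means an older instruction
    redundant: bool        # True if the instruction is in the redundant context
    critical: bool         # prediction from the critical-path table
    priority_class: bool   # load, long-latency instruction, or branch

def base_key(i):
    # base: loads, long-latency instructions and branches first, then oldest-first.
    return (not i.priority_class, i.seq)

def cp_all_key(i):
    # cp-all: all predicted-critical instructions first, then oldest-first.
    return (not i.critical, i.seq)

def cp_dual_key(i):
    # cp-dual: critical + redundant context, then other critical, then the rest.
    if i.critical and i.redundant:
        rank = 0
    elif i.critical:
        rank = 1
    else:
        rank = 2
    return (rank, i.seq)

def select(ready, key, width):
    """Pick up to `width` ready instructions in priority order."""
    return sorted(ready, key=key)[:width]

a = ReadyInstr(seq=0, redundant=False, critical=False, priority_class=False)
b = ReadyInstr(seq=1, redundant=False, critical=True, priority_class=False)
c = ReadyInstr(seq=2, redundant=True, critical=True, priority_class=False)
d = ReadyInstr(seq=3, redundant=True, critical=False, priority_class=True)
ready = [a, b, c, d]

for name, key in [("base", base_key), ("cp-all", cp_all_key), ("cp-dual", cp_dual_key)]:
    print(name, [i.seq for i in select(ready, key, width=2)])
```

With an issue width of 2, the same ready set yields a different pair of instructions under each policy, which is the effect the strategies are meant to expose.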

  17. Availability of Critical Instructions

  18. Performance Loss

  19. Summary • More workloads. • Might need to stagger threads. • Larger ROBs needed to prevent stalls. • AR-SMT-style delay-buffer • Difficult to reconcile two far-flung threads

  20. Backup Slides

  21. Why not just provision more functional-units? • Scheduler complexity • Broadcast-free schedulers • Why couldn’t we have just used them to boost single-thread performance?

  22. Temporal Redundancy • Execute multiple contexts of the same program. • At the commit-stage, check results of all the copies and re-execute if necessary. • Simple voting • Checker processor
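
The check-at-commit idea can be sketched as a compare-and-re-execute loop over two copies. This is an illustrative software model, not a description of any specific hardware: `commit_checked`, `make_flaky_adder`, and the retry limit are hypothetical names and parameters, and the transient fault is simulated by flipping the low bit of a result.

```python
def commit_checked(execute, operands, max_retries=3):
    """Run two copies of an operation and commit only when they agree.

    Returns (result, retries_used). Raises if agreement is never reached.
    """
    for attempt in range(max_retries + 1):
        primary = execute(operands)
        redundant = execute(operands)
        if primary == redundant:      # check results at the commit stage
            return primary, attempt
        # Mismatch: treat as a transient fault and re-execute both copies.
    raise RuntimeError("copies never agreed; fault may not be transient")

def make_flaky_adder(faulty_calls):
    """Adder whose result has its low bit flipped on the given call numbers."""
    state = {"calls": 0}

    def add(operands):
        state["calls"] += 1
        result = operands[0] + operands[1]
        if state["calls"] in faulty_calls:
            result ^= 1               # simulated transient bit flip
        return result

    return add

# The first call is corrupted, so the first comparison fails and one retry is needed.
result, retries = commit_checked(make_flaky_adder({1}), (2, 3))
print(result, retries)
```

Comparing two copies only detects a fault, which is why re-execution is needed; a voting scheme with three or more copies could instead pick the majority result directly.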

  23. Sphere of Replication • Components within the Sphere protected via redundant execution. • Components outside the Sphere protected via spatial/informational redundancy. • Temporal redundancy does not preclude extra hardware support.

  24. Illustrative Example – Program Graph • [Figure: program graph with nodes I1, I2, I3, I4 and edge weights 4 and 2; annotations: Slack = 2, TES = 4, Critical!]
