CSE 598C Project Towards Performance-Efficient Temporal Redundancy Sudhanva Gurumurthi
Approaches to Redundancy • Spatial Redundancy • Hardware duplication • IBM G3, HP NonStop Himalaya • Informational Redundancy • Parity and ECC • Temporal Redundancy
“Sphere of Replication” [Figure labels: Input Replicator, Output Comparator, Rest of the System] Source: Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives”, ISCA’02
Sphere of Replication in Superscalar Processors [Figure: superscalar datapath; labels: PC, I-Cache, Decode, RAT, ROB (entries I0, I1), FU (three units), Registers, Commit, comparator (=)] Source: J. Ray et al., “Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery”, MICRO’01.
System Configuration • 8-wide superscalar processor • 128-entry RUU, 64-entry LSQ • 4 I-ALUs, 3 I-MULT/DIVs, 2 FP-ALUs, 1 FP-MULT/DIV/SQRT • 32 KB, 2-way L1 D-cache • 64 KB, 2-way L1 I-cache • 512 KB, 4-way L2 cache • 112-cycle memory latency
Towards Single-Thread Performance • The instruction scheduler is oblivious to the presence of two execution contexts. • Goals • Minimize impact on the critical-path of execution. • Critical-instruction scheduling
The Concept of Criticality and Slack • Only some of the instructions in a program cause bottlenecks. • These are the critical instructions. • Critical instructions cannot be delayed! • Slack is a measure of how critical an instruction is: the less slack it has, the more critical it is.
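The idea of slack can be illustrated with a small sketch. It assumes a hypothetical four-instruction dependence graph with made-up latencies (the names `compute_slack`, `latency`, and `edges` are illustrative, not from the slides) and considers data dependences only. An instruction's slack is the number of cycles it can finish late without lengthening the longest path; zero slack means the instruction is critical.

```python
from collections import defaultdict

def compute_slack(latency, edges):
    """Return {instr: slack} for a dependence DAG.

    latency: {instr: cycles}, with keys listed in topological order.
    edges:   list of (producer, consumer) data dependences.
    """
    order = list(latency)
    preds, succs = defaultdict(list), defaultdict(list)
    for producer, consumer in edges:
        preds[consumer].append(producer)
        succs[producer].append(consumer)

    # Forward pass: earliest cycle at which each instruction can finish.
    earliest = {}
    for n in order:
        start = max((earliest[p] for p in preds[n]), default=0)
        earliest[n] = start + latency[n]
    total = max(earliest.values())

    # Backward pass: latest finish time that does not extend `total`.
    latest = {}
    for n in reversed(order):
        latest[n] = min((latest[s] - latency[s] for s in succs[n]),
                        default=total)

    return {n: latest[n] - earliest[n] for n in order}

# Hypothetical graph: I1 and I2 both feed I3, which feeds I4.
latency = {"I1": 4, "I2": 2, "I3": 1, "I4": 1}
edges = [("I1", "I3"), ("I2", "I3"), ("I3", "I4")]
slack = compute_slack(latency, edges)
critical = [n for n, s in slack.items() if s == 0]
```

In this made-up graph, I2 can be delayed by two cycles without affecting the total execution time, while I1, I3, and I4 have no slack.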
Microarchitectural Critical Path • A compiler can examine data dependences but not resource dependences. • ROB stalls • Branch mispredictions • Routing network stalls • The critical path is a function of data dependences and inherent instruction latencies as well as resource usage at runtime.
Critical-Path Prediction [B. Fields et al., ISCA’01] • Tries to construct a dependence-graph model of the critical path at runtime. • Considers both machine-independent data dependences and machine-specific resource dependences. • Tracks the last-arriving chain of edges along the dynamically constructed microarchitectural dependence graph using a token-passing algorithm.
Token-Passing Example [Figure: token passing over a window labeled “ROB Size”: 1. plant token; 2. propagate token; 3. is token alive?; 4. yes, train critical] • Found CP without constructing entire graph. Adapted from Brian Fields’s ISCA’01 slides.
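The four steps of the token-passing example can be sketched in a few lines. This is a heavily simplified software model, not the hardware trainer: it assumes each instruction records only the index of the producer whose result arrived last (`last_arriving`), and the names `token_alive`, `plant_at`, `check_at`, and `rob_size` are hypothetical.

```python
def token_alive(last_arriving, plant_at, check_at, rob_size):
    """Return True if a token planted at `plant_at` is still alive at `check_at`.

    last_arriving[j] is the index of the earlier instruction whose result
    arrived last at instruction j, or None if j has no such producer.
    The token is 'alive' if one of the `rob_size` most recent instructions
    at check time holds it.
    """
    holders = {plant_at}                      # step 1: plant token
    for j in range(plant_at + 1, check_at + 1):
        if last_arriving[j] in holders:       # step 2: propagate along
            holders.add(j)                    # last-arriving edges only
    recent = range(check_at - rob_size + 1, check_at + 1)
    return any(j in holders for j in recent)  # step 3: is token alive?

# Hypothetical trace: instruction 0 feeds nothing on a last-arriving edge,
# while instruction 1 starts a chain that reaches the newest instructions.
last_arriving = [None, None, 1, 2, 3, 4, 5, 6]
train_0 = token_alive(last_arriving, plant_at=0, check_at=7, rob_size=2)
train_1 = token_alive(last_arriving, plant_at=1, check_at=7, rob_size=2)
# step 4: an instruction whose token stayed alive is trained as critical
```

Only the set of token holders is tracked, which mirrors the point of the slide: the critical path is identified without building the whole graph.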
Hardware Implementation • Two components • Critical-path table • Trainer • Critical-Path Table • Stores predictions indexed by PC (16K-entry with 6-bit hysteresis) • Looked up in parallel with instruction fetch • Trainer • Token array • Stores the dependence graph of the ROB-size most recent instructions, with 1 bit per node to show whether the token propagated into that node.
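A minimal software sketch of a PC-indexed critical-path table with hysteresis counters follows. The table size and counter width come from the slide; the indexing function, the prediction threshold, and the increment and decrement amounts are assumptions chosen for illustration, and the class name `CriticalPathTable` is hypothetical.

```python
class CriticalPathTable:
    """PC-indexed table of saturating hysteresis counters (sketch)."""

    def __init__(self, entries=16 * 1024, counter_bits=6,
                 threshold=8, inc=8, dec=1):
        self.entries = entries                    # 16K entries (from slide)
        self.max_count = (1 << counter_bits) - 1  # 6-bit counter: 0..63
        self.threshold = threshold                # assumed value
        self.inc = inc                            # assumed value
        self.dec = dec                            # assumed value
        self.counters = [0] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, simple modulo indexing, no tags.
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Looked up alongside fetch: is this instruction predicted critical?"""
        return self.counters[self._index(pc)] >= self.threshold

    def train(self, pc, critical):
        """Update the counter with the trainer's verdict for this PC."""
        i = self._index(pc)
        if critical:
            self.counters[i] = min(self.max_count, self.counters[i] + self.inc)
        else:
            self.counters[i] = max(0, self.counters[i] - self.dec)

table = CriticalPathTable()
table.train(0x400100, critical=True)
predicted = table.predict(0x400100)
```

Incrementing faster than decrementing biases the table towards remembering instructions that were critical at least occasionally; whether that bias is desirable depends on the workload.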
Evaluation Methodology • SimpleScalar 3.0 • Code modified to simulate temporally redundant execution based on [Ray’01]. • Critical-path predictor integrated with the above. • Fast-forward 1 billion instructions. • Detailed simulation of 1 billion instructions.
Scheduling Strategies • base • Default scheduler used in SimpleScalar • Loads, long-latency instructions and branches selected first. • Other instructions selected in oldest-first order.
Scheduling Strategies • cp-all • All predicted-critical instructions selected first. • Other instructions selected in oldest-first order. • cp-dual • All predicted-critical instructions in redundant context selected first. • Other predicted-critical instructions next. • Other instructions (both contexts) in oldest-first order.
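The three selection policies (base, cp-all, cp-dual) can be expressed as sort keys over the ready instructions. This is a sketch under stated assumptions, not the simulator's code: the `ReadyInstr` record and its field names are hypothetical, and within each priority group ties are broken oldest-first, which the slides state only for the "other instructions" group.

```python
from dataclasses import dataclass

@dataclass
class ReadyInstr:
    name: str
    age: int                  # smaller means older
    redundant: bool           # belongs to the redundant context
    predicted_critical: bool  # critical-path predictor output
    priority_class: bool      # load, long-latency instruction, or branch

def select_order(ready, strategy):
    """Return ready instructions in the order the scheduler would pick them."""
    if strategy == "base":
        # Loads, long-latency instructions and branches first.
        key = lambda i: (not i.priority_class, i.age)
    elif strategy == "cp-all":
        # All predicted-critical instructions first.
        key = lambda i: (not i.predicted_critical, i.age)
    elif strategy == "cp-dual":
        # Predicted-critical in the redundant context, then other
        # predicted-critical, then everything else.
        def key(i):
            if i.predicted_critical and i.redundant:
                rank = 0
            elif i.predicted_critical:
                rank = 1
            else:
                rank = 2
            return (rank, i.age)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return sorted(ready, key=key)

# Hypothetical ready queue.
ready = [
    ReadyInstr("a", 0, redundant=False, predicted_critical=False, priority_class=False),
    ReadyInstr("b", 1, redundant=False, predicted_critical=True, priority_class=False),
    ReadyInstr("c", 2, redundant=True, predicted_critical=True, priority_class=False),
    ReadyInstr("d", 3, redundant=True, predicted_critical=False, priority_class=True),
]
orders = {s: [i.name for i in select_order(ready, s)]
          for s in ("base", "cp-all", "cp-dual")}
```

A real select stage would also be limited by issue width and functional-unit availability; the sketch shows only the priority ordering.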
Summary • More workloads. • Might need to stagger threads. • Larger ROBs needed to prevent stalls. • AR-SMT-style delay-buffer • Difficult to reconcile two far-flung threads
Why not just provision more functional-units? • Scheduler complexity • Broadcast-free schedulers • Why couldn’t we have just used them to boost single-thread performance?
Temporal Redundancy • Execute multiple contexts of the same program. • At the commit-stage, check results of all the copies and re-execute if necessary. • Simple voting • Checker processor
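The commit-stage check can be sketched as a comparison of the results produced by each copy, with re-execution on disagreement. This is an illustrative software model only: `commit_check`, `execute_redundantly`, and the retry limit are hypothetical names and parameters, and strict-majority voting is one simple choice of voting rule (with two copies it reduces to requiring both to agree).

```python
from collections import Counter

def commit_check(results):
    """Compare the results of all redundant copies of one instruction.

    Returns (value, True) if a strict majority of copies agree,
    otherwise (None, False) to signal that re-execution is needed.
    """
    value, votes = Counter(results).most_common(1)[0]
    if votes > len(results) / 2:
        return value, True
    return None, False

def execute_redundantly(execute, copies=2, max_retries=3):
    """Run `execute` in `copies` contexts; re-execute until they agree."""
    for _ in range(max_retries + 1):
        results = [execute() for _ in range(copies)]
        value, ok = commit_check(results)
        if ok:
            return value
    raise RuntimeError("redundant copies never agreed")

# Hypothetical transient fault: the first pair disagrees, the retry agrees.
outcomes = iter([1, 2, 3, 3])
committed = execute_redundantly(lambda: next(outcomes))
```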
Sphere of Replication • Components within the Sphere protected via redundant execution. • Components outside the Sphere protected via spatial/informational redundancy. • Temporal redundancy does not preclude extra hardware support.
Illustrative Example – Program Graph [Figure: program graph with nodes I1, I2, I3, I4; labels: 4, 2, “Slack = 2”, “TES = 4”, “Critical!”]