How to Speed-up Fault-Tolerant Clock Generation in VLSI Systems-on-Chip via Pipelining
Matthias Függer1, Andreas Dielacher2 and Ulrich Schmid1
1 Vienna University of Technology, Embedded Computing Systems Group, {fuegger, s}@ecs.tuwien.ac.at
2 RUAG Space Vienna, andreas.dielacher@space.at
Outline
• Fault-tolerant SoCs
• Asynchronous fault-tolerant clock generation algorithm
• Making it faster
• Proving it correct
• FPGA implementation
Making SoCs fault-tolerant
• System-level approach: replication of functional units
• Communication between units is necessary to maintain consistency
• The problems are analogous to those of replicated state machines in distributed systems!
Fault-tolerant SoC needs Common Time
Common time eases (or even allows) solving problems of replica determinism (atomic broadcast).
[Figure: tick traces of nodes p and q in q's local clock domain, illustrating π(t) = 2 and #ticks(Δ) = 3]
• Precision: at any t, π(t) is bounded
• Accuracy: l(Δ) < #ticks in any interval Δ < u(Δ)
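To make the precision and accuracy notions concrete, here is a minimal Python sketch (not part of DARTS; the timestamp lists, function names and the strict/non-strict comparisons are illustrative assumptions) that evaluates both properties on given tick timestamps:

```python
from bisect import bisect_right

def ticks_up_to(timestamps, t):
    """Number of ticks a node has produced by time t (timestamps sorted ascending)."""
    return bisect_right(timestamps, t)

def precision_at(nodes, t):
    """pi(t): largest difference in tick count between any two nodes at time t."""
    counts = [ticks_up_to(ts, t) for ts in nodes]
    return max(counts) - min(counts)

def accuracy_ok(timestamps, t1, t2, lower, upper):
    """Check l(Delta) <= #ticks within [t1, t2] <= u(Delta) for one node."""
    n = ticks_up_to(timestamps, t2) - ticks_up_to(timestamps, t1)
    return lower <= n <= upper

# Two correct nodes, q lagging slightly behind p.
p = [1.0, 2.0, 3.0, 4.0, 5.0]
q = [1.4, 2.4, 3.4, 4.4]
print(precision_at([p, q], 4.2))        # -> 1
print(accuracy_ok(p, 1.0, 4.0, 2, 4))   # -> True
```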
Clocking a fault-tolerant SoC
• Classical synchronous SoC: (−) single point of failure; (+) common time across chip (< 1 tick)
• GALS: (+) no single point of failure; (−) no common time across chip; synchronizers needed, causing overhead & metastability
• Globally coordinated clock generation (DARTS): (+) no single point of failure; (+) common time across chip (< small # of ticks)
DARTS High-level Algorithm
[Figure: generation of ticks k, k+1 with n = 5 nodes, f = 1 faulty; TQS]
Initially: send tick(1) to all; clock := 1;
If received tick(m) from at least f+1 remote nodes and m > clock:
    send tick(clock+1), …, tick(m) to all; clock := m;
If received tick(m) from at least 2f+1 remote nodes and m >= clock:
    send tick(m+1) to all; clock := m+1;
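The following event-driven Python sketch restates the two rules for a single node. It is only an illustration of the rules, not the DARTS hardware: broadcast is an assumed callback, the received bookkeeping and the rule2_guard hook are sketch devices (the hook is reused for the pipelined variant later), and rule 2 is made to fire exactly once per tick number to mimic the threshold crossing.

```python
class DartsNode:
    """Single-node sketch of the DARTS tick rules (n nodes, at most f Byzantine faulty)."""

    def __init__(self, f, broadcast):
        self.f = f
        self.broadcast = broadcast      # broadcast(k): send tick(k) to all other nodes
        self.received = {}              # received[m] = set of remote nodes that sent tick(m)
        self.clock = 1
        self.broadcast(1)               # Initially: send tick(1) to all; clock := 1

    def rule2_guard(self, m):
        return m >= self.clock          # relaxed in the pipelined variant

    def on_tick(self, sender, m):
        """Called whenever tick(m) is received from a remote node."""
        self.received.setdefault(m, set()).add(sender)
        count = len(self.received[m])
        # Rule 1: tick(m) from >= f+1 remote nodes and m > clock
        #         -> catch up: send tick(clock+1), ..., tick(m); clock := m
        if count >= self.f + 1 and m > self.clock:
            for k in range(self.clock + 1, m + 1):
                self.broadcast(k)
            self.clock = m
        # Rule 2: tick(m) from >= 2f+1 remote nodes and guard holds
        #         -> advance: send tick(m+1); clock := m+1
        # (== makes the rule fire exactly once, when the threshold is first reached.)
        if count == 2 * self.f + 1 and self.rule2_guard(m):
            self.broadcast(m + 1)
            self.clock = max(self.clock, m + 1)
```

For instance, with f = 1 a node only advances its clock via rule 2 after hearing tick(m) from three remote nodes.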
DARTS Hardware Implementation
Common time property proved in [EDCC06].
[Block diagram: hardware counter modules provide the m > clock and m >= clock status for the algorithm rules above.]
DARTS Performance
[Figure: the lock-step case (Δ = 1); ticks k, k+1, k+2 each separated by the end-to-end delay Δ]
Obtained frequency: 1/Δ, depends on the end-to-end delay Δ
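As a worked example, using the end-to-end delay reported for the FPGA prototype on the last slides: with Δ ≈ 125 ns, the obtained frequency is 1/Δ ≈ 8 MHz.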
Making DARTS faster: Pipelining
[Figure: the lock-step case (Δ = 1) with X = 4; ticks k, k+X+1, k+2X+2 each separated by the end-to-end delay Δ, with X+1 ticks generated per Δ]
Idea: Let tick k+X+1 depend on tick k.
Obtained frequency: (X+1)/Δ; the maximum depends on local delays
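With the same Δ ≈ 125 ns and X = 4 this gives (X+1)/Δ = 5/125 ns ≈ 40 MHz; since 1/Δloc ≈ 1/25 ns is also about 40 MHz, the local delays cap further gains, which matches the FPGA observation later that the frequency stabilizes at X = 4.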
Making DARTS faster: Algorithm Adaptations
Original DARTS:
Initially: send tick(1) to all; clock := 1;
If received tick(m) from at least f+1 remote nodes and m > clock:
    send tick(clock+1), …, tick(m) to all; clock := m;   [not changed]
If received tick(m) from at least 2f+1 remote nodes and m >= clock:
    send tick(m+1) to all; clock := m+1;

Pipelined DARTS (pDARTS), only a small change in the algorithm:
Initially: send tick(1), …, tick(X+1) to all; clock := X+1;
If received tick(m) from at least f+1 remote nodes and m > clock:
    send tick(clock+1), …, tick(m) to all; clock := m;
If received tick(m) from at least 2f+1 remote nodes and m + X >= clock:
    send tick(m+1) to all; clock := m+1;   [allows sending tick k+X+1 based on tick k]
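Continuing the single-node sketch from the algorithm slide, only the initialization and the rule-2 guard change; the class structure and the max() safeguard in the base sketch are sketch assumptions, not part of the slides:

```python
class PipelinedDartsNode(DartsNode):
    """pDARTS sketch: start X+1 ticks ahead and relax the rule-2 guard."""

    def __init__(self, f, broadcast, X):
        self.X = X
        self.f = f
        self.broadcast = broadcast
        self.received = {}
        self.clock = X + 1
        for k in range(1, X + 2):       # Initially: send tick(1), ..., tick(X+1) to all
            self.broadcast(k)

    def rule2_guard(self, m):
        # m + X >= clock instead of m >= clock: tick(m+1) may be generated while the
        # node is already up to X ticks ahead, i.e. tick k+X+1 can depend on tick k.
        return m + self.X >= self.clock
```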
Is pDARTS correct?!
[Figure: ticks k−X, k+1 with n = 5 nodes, f = 1 faulty; TQS; pDARTS rules as on the previous slide]
Easy to prove in classical system models (synchronous, Θ-model).
pDARTS Hardware Implementation
[Block diagrams: hardware counter modules provide the m > clock status and the m + X >= clock status for the pDARTS rules.]
Is pDARTS still correct?!
• Correctness proof for the high-level algorithm: yes, but a proof gap to the hardware remains.
• The low-level pDARTS requires far more complex proofs than DARTS, and queuing effects inside the Counter Modules are not neglected.
• Approach: a formal framework tied to the hardware, in which pDARTS is proved correct.
The formal Framework
• Ingredients
  • Classical models: step-based (state machines)
  • Modules with signal ports
• A signal's behavior is specified by
  • event trace: (t, x) in S
  • status function: S(t) = x
  • counting function: #S(t) = k
• Basic/compound modules: their behavior is specified by relations on the port behavior
[Figure: channel module with input I, output O, delay within [Δ−, Δ+], initially 0]
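A tiny Python sketch may make the three views of a signal concrete; the Signal class and the choice of initial value 0 are assumptions of the sketch, not the framework's definitions:

```python
from bisect import bisect_right

class Signal:
    """A signal port's behavior as an event trace: a time-ordered list of events (t, x)."""

    def __init__(self, events):
        self.events = sorted(events)              # [(t, x), ...], ordered by time
        self.times = [t for t, _ in self.events]

    def status(self, t):
        """Status function S(t): value of the most recent event at or before t."""
        i = bisect_right(self.times, t)
        return self.events[i - 1][1] if i > 0 else 0   # assume initial value 0

    def count(self, t):
        """Counting function #S(t): number of events up to and including t."""
        return bisect_right(self.times, t)

# A clock signal toggling 1/0: by t = 2.5 its status is 1 and 3 events have occurred.
s = Signal([(1.0, 1), (1.5, 0), (2.0, 1)])
print(s.status(2.5), s.count(2.5))   # -> 1 3
```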
The formal Framework: Diff-Gate Module (Counting Function model)
When to remove tick k from both the local and the remote pipe:
• k = 0: if tick 1 was received in the remote pipe at t and tick 1 was received in the local pipe at t', remove tick 0 from both pipes within max(t, t') + [Δ−Diff, Δ+Diff].
• k > 0: if tick k+1 was received in the remote pipe at t, tick k+1 was received in the local pipe at t', and tick k−1 was removed at t'', remove tick k from both pipes within max(t, t', t'') + [Δ−Diff, Δ+Diff].
The active signal is raised only if exactly 1 tick is in the local pipe.
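As an illustration of this timing relation, the following sketch computes earliest/latest removal-time bounds from given reception times; the function name, the list encoding and the concrete delay numbers are assumptions for illustration, not part of the framework:

```python
def removal_time_bounds(remote, local, d_minus, d_plus):
    """Earliest/latest removal times of ticks 0..K-1 from the Diff-Gate's pipes.

    remote[k], local[k]: reception time of tick k+1 in the remote/local pipe.
    Returns lists (lo, hi) with lo[k] <= removal time of tick k <= hi[k].
    """
    lo, hi = [], []
    for k in range(min(len(remote), len(local))):
        base_lo = max(remote[k], local[k])
        base_hi = base_lo
        if k > 0:                       # tick k also waits until tick k-1 has been removed
            base_lo = max(base_lo, lo[k - 1])
            base_hi = max(base_hi, hi[k - 1])
        lo.append(base_lo + d_minus)
        hi.append(base_hi + d_plus)
    return lo, hi

# Example with hypothetical reception times and a Diff-Gate delay in [1, 3] time units.
print(removal_time_bounds([2, 5, 9], [3, 6, 8], 1, 3))   # -> ([4, 7, 10], [6, 9, 12])
```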
Proof Results
• Precision
• Accuracy: L(t2 − t1) ≤ #ticks in any (t2 − t1) ≤ U(t2 − t1)
• Bounded queue sizes
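Spelled out with the notions from the earlier slide, where C_p(t) denotes the number of ticks a correct node p has generated by time t, the two clock conditions take roughly the following shape; the concrete constants π, L and U proved for pDARTS depend on the delay bounds and are not given on the slide:

```latex
\text{Precision:}\quad \forall t:\; \bigl|C_p(t) - C_q(t)\bigr| \le \pi
\qquad
\text{Accuracy:}\quad \forall t_1 \le t_2:\; L(t_2 - t_1) \;\le\; C_p(t_2) - C_p(t_1) \;\le\; U(t_2 - t_1)
```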
FPGA prototype implementation
• Target: APEX EP20K1000 FPGA
• Measured from X = 0 (conventional DARTS) up to a maximum of X = 4 (frequency stabilizes)
• Δ is large (slow) compared to Δloc: Δ ≈ 125 ns, Δloc ≈ 25 ns
Conclusions
• Replication makes SoCs fault-tolerant.
• Clocking a replicated state machine is non-trivial, but possible.
• Unfortunately: slow!
• Applied the pipelining idea to make it faster.
• Formal analysis within a hardware-inspired formal framework.
• Proved it correct & implemented an FPGA prototype.
Spreading effect of Ticks
Ticks tend to spread out evenly after an initial phase.