How to Speed-up Fault-Tolerant Clock Generation in VLSI Systems-on-Chip via Pipelining
Matthias Függer1, Andreas Dielacher2 and Ulrich Schmid1
1 Vienna University of Technology, Embedded Computing Systems Group, {fuegger, s}@ecs.tuwien.ac.at
2 RUAG Space Vienna, andreas.dielacher@space.at
Outline
• Fault-tolerant SoCs
• Asynchronous fault-tolerant clock generation algorithm
• Making it faster
• Proving it correct
• FPGA implementation
Making SoCs fault-tolerant
• System-level approach: replication of functional units
• Communication between units is necessary to maintain consistency
• The problems are analogous to those of replicated state machines in distributed systems!
Fault-tolerant SoC needs Common Time
Common time eases (or even allows) solving problems of replica determinism (atomic broadcast).
[Figure: tick traces of nodes p and q in q's local clock domain, illustrating π(t) = 2 and #ticks(Δ) = 3]
• Precision: at any t, π(t) is bounded
• Accuracy: l(Δ) < #ticks in any interval Δ < u(Δ)
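To make the precision and accuracy notions concrete, here is a minimal Python sketch (not part of DARTS; the timestamp lists, function names and the strict/non-strict comparisons are illustrative assumptions) that evaluates both properties on given tick timestamps:

```python
from bisect import bisect_right

def ticks_up_to(timestamps, t):
    """Number of ticks a node has produced by time t (timestamps sorted ascending)."""
    return bisect_right(timestamps, t)

def precision_at(nodes, t):
    """pi(t): largest difference in tick count between any two nodes at time t."""
    counts = [ticks_up_to(ts, t) for ts in nodes]
    return max(counts) - min(counts)

def accuracy_ok(timestamps, t1, t2, lower, upper):
    """Check l(Delta) <= #ticks within [t1, t2] <= u(Delta) for one node."""
    n = ticks_up_to(timestamps, t2) - ticks_up_to(timestamps, t1)
    return lower <= n <= upper

# Two correct nodes, q lagging slightly behind p.
p = [1.0, 2.0, 3.0, 4.0, 5.0]
q = [1.4, 2.4, 3.4, 4.4]
print(precision_at([p, q], 4.2))        # -> 1
print(accuracy_ok(p, 1.0, 4.0, 2, 4))   # -> True
```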
Clocking a fault-tolerant SoC
• Classical synchronous SoC: (−) single point of failure; (+) common time across chip (< 1 tick)
• GALS: (+) no single point of failure; (−) no common time across chip; synchronizers needed, causing overhead & metastability
• Globally coordinated clock generation (DARTS): (+) no single point of failure; (+) common time across chip (< small # of ticks)
DARTS High-level Algorithm
[Figure: generation of ticks k, k+1 with n = 5 nodes, f = 1 faulty; TQS]
Initially: send tick(1) to all; clock := 1;
If received tick(m) from at least f+1 remote nodes and m > clock:
    send tick(clock+1), …, tick(m) to all; clock := m;
If received tick(m) from at least 2f+1 remote nodes and m >= clock:
    send tick(m+1) to all; clock := m+1;
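The following event-driven Python sketch restates the two rules for a single node. It is only an illustration of the rules, not the DARTS hardware: broadcast is an assumed callback, the received bookkeeping and the rule2_guard hook are sketch devices (the hook is reused for the pipelined variant later), and rule 2 is made to fire exactly once per tick number to mimic the threshold crossing.

```python
class DartsNode:
    """Single-node sketch of the DARTS tick rules (n nodes, at most f Byzantine faulty)."""

    def __init__(self, f, broadcast):
        self.f = f
        self.broadcast = broadcast      # broadcast(k): send tick(k) to all other nodes
        self.received = {}              # received[m] = set of remote nodes that sent tick(m)
        self.clock = 1
        self.broadcast(1)               # Initially: send tick(1) to all; clock := 1

    def rule2_guard(self, m):
        return m >= self.clock          # relaxed in the pipelined variant

    def on_tick(self, sender, m):
        """Called whenever tick(m) is received from a remote node."""
        self.received.setdefault(m, set()).add(sender)
        count = len(self.received[m])
        # Rule 1: tick(m) from >= f+1 remote nodes and m > clock
        #         -> catch up: send tick(clock+1), ..., tick(m); clock := m
        if count >= self.f + 1 and m > self.clock:
            for k in range(self.clock + 1, m + 1):
                self.broadcast(k)
            self.clock = m
        # Rule 2: tick(m) from >= 2f+1 remote nodes and guard holds
        #         -> advance: send tick(m+1); clock := m+1
        # (== makes the rule fire exactly once, when the threshold is first reached.)
        if count == 2 * self.f + 1 and self.rule2_guard(m):
            self.broadcast(m + 1)
            self.clock = max(self.clock, m + 1)
```

For instance, with f = 1 a node only advances its clock via rule 2 after hearing tick(m) from three remote nodes.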
DARTS Hardware Implementation
Common time property proved in [EDCC06].
[Block diagram: hardware counter modules provide the m > clock and m >= clock status for the algorithm rules above.]
DARTS Performance
[Figure: the lock-step case (Δ = 1); ticks k, k+1, k+2 each separated by the end-to-end delay Δ]
Obtained frequency: 1/Δ, depends on the end-to-end delay Δ
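As a worked example, using the end-to-end delay reported for the FPGA prototype on the last slides: with Δ ≈ 125 ns, the obtained frequency is 1/Δ ≈ 8 MHz.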
Making DARTS faster: Pipelining
[Figure: the lock-step case (Δ = 1) with X = 4; ticks k, k+X+1, k+2X+2 each separated by the end-to-end delay Δ, with X+1 ticks generated per Δ]
Idea: Let tick k+X+1 depend on tick k.
Obtained frequency: (X+1)/Δ; the maximum depends on local delays
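With the same Δ ≈ 125 ns and X = 4 this gives (X+1)/Δ = 5/125 ns ≈ 40 MHz; since 1/Δloc ≈ 1/25 ns is also about 40 MHz, the local delays cap further gains, which matches the FPGA observation later that the frequency stabilizes at X = 4.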
Making DARTS faster: Algorithm Adaptations
Original DARTS:
Initially: send tick(1) to all; clock := 1;
If received tick(m) from at least f+1 remote nodes and m > clock:
    send tick(clock+1), …, tick(m) to all; clock := m;   [not changed]
If received tick(m) from at least 2f+1 remote nodes and m >= clock:
    send tick(m+1) to all; clock := m+1;

Pipelined DARTS (pDARTS), only a small change in the algorithm:
Initially: send tick(1), …, tick(X+1) to all; clock := X+1;
If received tick(m) from at least f+1 remote nodes and m > clock:
    send tick(clock+1), …, tick(m) to all; clock := m;
If received tick(m) from at least 2f+1 remote nodes and m + X >= clock:
    send tick(m+1) to all; clock := m+1;   [allows sending tick k+X+1 based on tick k]
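Continuing the single-node sketch from the algorithm slide, only the initialization and the rule-2 guard change; the class structure and the max() safeguard in the base sketch are sketch assumptions, not part of the slides:

```python
class PipelinedDartsNode(DartsNode):
    """pDARTS sketch: start X+1 ticks ahead and relax the rule-2 guard."""

    def __init__(self, f, broadcast, X):
        self.X = X
        self.f = f
        self.broadcast = broadcast
        self.received = {}
        self.clock = X + 1
        for k in range(1, X + 2):       # Initially: send tick(1), ..., tick(X+1) to all
            self.broadcast(k)

    def rule2_guard(self, m):
        # m + X >= clock instead of m >= clock: tick(m+1) may be generated while the
        # node is already up to X ticks ahead, i.e. tick k+X+1 can depend on tick k.
        return m + self.X >= self.clock
```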
Is pDARTS correct?!
[Figure: ticks k−X, k+1 with n = 5 nodes, f = 1 faulty; TQS; pDARTS rules as on the previous slide]
Easy to prove in classical system models (synchronous, Θ-model).
pDARTS Hardware Implementation
[Block diagrams: hardware counter modules provide the m > clock status and the m + X >= clock status for the pDARTS rules.]
Is pDARTS still correct?!
• Correctness proof for the high-level algorithm: yes, but a proof gap to the hardware remains.
• The low-level pDARTS requires far more complex proofs than DARTS, and queuing effects inside the Counter Modules are not neglected.
• Approach: a formal framework tied to the hardware, in which pDARTS is proved correct.
The formal Framework
• Ingredients
  • Classical models: step-based (state machines)
  • Modules with signal ports
• A signal's behavior is specified by
  • event trace: (t, x) in S
  • status function: S(t) = x
  • counting function: #S(t) = k
• Basic/compound modules: their behavior is specified by relations on the port behavior
[Figure: channel module with input I, output O, delay within [Δ−, Δ+], initially 0]
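A tiny Python sketch may make the three views of a signal concrete; the Signal class and the choice of initial value 0 are assumptions of the sketch, not the framework's definitions:

```python
from bisect import bisect_right

class Signal:
    """A signal port's behavior as an event trace: a time-ordered list of events (t, x)."""

    def __init__(self, events):
        self.events = sorted(events)              # [(t, x), ...], ordered by time
        self.times = [t for t, _ in self.events]

    def status(self, t):
        """Status function S(t): value of the most recent event at or before t."""
        i = bisect_right(self.times, t)
        return self.events[i - 1][1] if i > 0 else 0   # assume initial value 0

    def count(self, t):
        """Counting function #S(t): number of events up to and including t."""
        return bisect_right(self.times, t)

# A clock signal toggling 1/0: by t = 2.5 its status is 1 and 3 events have occurred.
s = Signal([(1.0, 1), (1.5, 0), (2.0, 1)])
print(s.status(2.5), s.count(2.5))   # -> 1 3
```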
The formal Framework: Diff-Gate Module (Counting Function model)
When to remove tick k from both the local and the remote pipe:
• k = 0: if tick 1 was received in the remote pipe at t and tick 1 was received in the local pipe at t', remove tick 0 from both pipes within max(t, t') + [Δ−Diff, Δ+Diff].
• k > 0: if tick k+1 was received in the remote pipe at t, tick k+1 was received in the local pipe at t', and tick k−1 was removed at t'', remove tick k from both pipes within max(t, t', t'') + [Δ−Diff, Δ+Diff].
The active signal is raised only if exactly 1 tick is in the local pipe.
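As an illustration of this timing relation, the following sketch computes earliest/latest removal-time bounds from given reception times; the function name, the list encoding and the concrete delay numbers are assumptions for illustration, not part of the framework:

```python
def removal_time_bounds(remote, local, d_minus, d_plus):
    """Earliest/latest removal times of ticks 0..K-1 from the Diff-Gate's pipes.

    remote[k], local[k]: reception time of tick k+1 in the remote/local pipe.
    Returns lists (lo, hi) with lo[k] <= removal time of tick k <= hi[k].
    """
    lo, hi = [], []
    for k in range(min(len(remote), len(local))):
        base_lo = max(remote[k], local[k])
        base_hi = base_lo
        if k > 0:                       # tick k also waits until tick k-1 has been removed
            base_lo = max(base_lo, lo[k - 1])
            base_hi = max(base_hi, hi[k - 1])
        lo.append(base_lo + d_minus)
        hi.append(base_hi + d_plus)
    return lo, hi

# Example with hypothetical reception times and a Diff-Gate delay in [1, 3] time units.
print(removal_time_bounds([2, 5, 9], [3, 6, 8], 1, 3))   # -> ([4, 7, 10], [6, 9, 12])
```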
Proof Results
• Precision
• Accuracy: L(t2 − t1) ≤ #ticks in any (t2 − t1) ≤ U(t2 − t1)
• Bounded queue sizes
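Spelled out with the notions from the earlier slide, where C_p(t) denotes the number of ticks a correct node p has generated by time t, the two clock conditions take roughly the following shape; the concrete constants π, L and U proved for pDARTS depend on the delay bounds and are not given on the slide:

```latex
\text{Precision:}\quad \forall t:\; \bigl|C_p(t) - C_q(t)\bigr| \le \pi
\qquad
\text{Accuracy:}\quad \forall t_1 \le t_2:\; L(t_2 - t_1) \;\le\; C_p(t_2) - C_p(t_1) \;\le\; U(t_2 - t_1)
```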
FPGA prototype implementation
• Target: APEX EP20K1000 FPGA
• Measured from X = 0 (conventional DARTS) up to a maximum of X = 4 (frequency stabilizes)
• Δ is large (slow) compared to Δloc: Δ ≈ 125 ns, Δloc ≈ 25 ns
Conclusions
• Replication makes SoCs fault-tolerant.
• Clocking a replicated state machine is non-trivial, but possible.
• Unfortunately: slow!
• Applied the pipelining idea to make it faster.
• Formal analysis within a hardware-inspired formal framework.
• Proved it correct & implemented an FPGA prototype.
Spreading effect of Ticks
Ticks tend to spread out evenly after an initial phase.