
How to Speed-up Fault-Tolerant Clock Generation in VLSI Systems-on-Chip via Pipelining



Presentation Transcript


  1. How to Speed-up Fault-Tolerant Clock Generation in VLSI Systems-on-Chip via Pipelining. Matthias Függer1, Andreas Dielacher2 and Ulrich Schmid1. 1Vienna University of Technology, Embedded Computing Systems Group, {fuegger, s}@ecs.tuwien.ac.at. 2RUAG Space Vienna, andreas.dielacher@space.at

  2. Outline • Fault-tolerant SoCs • Asynchronous fault-tolerant clock generation algorithm • Making it faster • Proving it correct • FPGA implementation

  3. Making SoCs fault-tolerant • System-level approach: replication of functional units • Communication between the units is necessary to maintain consistency • The problems are analogous to those of replicated state machines in distributed systems!

  4. Fault-tolerant SoC needs Common Time • Common time eases (or even allows) solving the problems of replica determinism, e.g., atomic broadcast • [Figure: nodes p and q exchanging tick(2), ..., tick(5) across q's local clock domain; example values π(t) = 2 and #ticks(Δ) = 3] • Precision: at any time t, π(t) is bounded • Accuracy: l(Δ) < #ticks in any interval Δ < u(Δ)

  5. Clocking a fault-tolerant SoC • Classical synchronous SoC: (-) single point of failure, (+) common time across chip (within 1 tick) • GALS: (+) no single point of failure, (-) no common time across chip, so units must synchronize, causing overhead & metastability • Globally coordinated clock generation (DARTS): (+) no single point of failure, (+) common time across chip (within a small number of ticks)

  6. DARTS High-level Algorithm (n = 5, f = 1) • [Figure: tick queue (TQS) holding ticks k, k+1] • Initially: send tick(1) to all; clock := 1 • If received tick(m) from at least f+1 remote nodes and m > clock: send tick(clock+1), ..., tick(m) to all; clock := m • If received tick(m) from at least 2f+1 remote nodes and m >= clock: send tick(m+1) to all; clock := m+1
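The two rules above can be exercised in a short simulation. The sketch below is a toy synchronous-round model of my own (the `Node` class, `step` method, and round structure are illustrative assumptions; the real algorithm is asynchronous and message-driven), covering only the fault-free, fully symmetric case:

```python
# Toy synchronous-round simulation of the DARTS high-level algorithm.
N, F = 5, 1                     # n = 5 nodes, tolerating f = 1 Byzantine fault

class Node:
    def __init__(self):
        self.clock = 1          # "Initially: send tick(1) to all; clock := 1"
        self.sent = 1           # highest tick broadcast so far

    def step(self, remote):
        """remote: highest tick heard from each of the other nodes."""
        # Catch-up rule: tick(m) from >= f+1 remote nodes and m > clock.
        for m in sorted(set(remote), reverse=True):
            if sum(r >= m for r in remote) >= F + 1 and m > self.clock:
                self.clock = m                  # send tick(clock+1..m) to all
                self.sent = max(self.sent, m)
                break
        # Progress rule: tick(m) from >= 2f+1 remote nodes and m >= clock.
        m = self.clock
        if sum(r >= m for r in remote) >= 2 * F + 1:
            self.clock = m + 1                  # send tick(m+1) to all
            self.sent = max(self.sent, m + 1)

nodes = [Node() for _ in range(N)]
for _ in range(10):                             # ten lock-step rounds
    snapshot = [nd.sent for nd in nodes]
    for i, nd in enumerate(nodes):
        nd.step([s for j, s in enumerate(snapshot) if j != i])

print([nd.clock for nd in nodes])   # → [11, 11, 11, 11, 11]
```

Note that with one node silent, the 2f+1 = 3 progress threshold is still met by the remaining four nodes, and a lagging correct node catches up via the f+1 rule, which is where the fault tolerance comes from.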

  7. DARTS Hardware Implementation • Common time property proved in [EDCC06] • The hardware provides the m > clock and m >= clock status signals needed by the two rules of the algorithm

  8. DARTS Performance • The lock-step case (Δ = 1): tick k+1 depends on tick k, so consecutive ticks are one end-to-end delay Δ apart • Common time property proved in [EDCC06] • Obtained frequency: 1/Δ, which depends on the end-to-end delay Δ

  9. Making DARTS faster: Pipelining • The lock-step case (Δ = 1), here with X = 4: X+1 ticks are in flight per end-to-end delay Δ • Idea: let tick k+X+1 depend on tick k • Obtained frequency: (X+1)/Δ; the maximum depends on the local delays
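The dependency change can be written as a timing recurrence; the sketch below is an illustrative model of my own (all names assumed), not the paper's formal analysis:

```python
# Illustrative timing recurrence for pipelined tick generation:
# tick k+X+1 fires one end-to-end delay DELTA after tick k, so up to
# X+1 ticks are in flight per DELTA.
DELTA, X = 1.0, 4

t = [0.0] * (X + 1)          # ticks 0..X are sent immediately at start-up
for k in range(200):
    t.append(t[k] + DELTA)   # recurrence: t[k+X+1] = t[k] + DELTA

rate = (len(t) - 1) / (t[-1] - t[0])   # average tick rate over the run
print(round(rate, 2))        # approaches (X+1)/DELTA = 5
```

As the slide notes, the achievable maximum of X is bounded by the local delays: once consecutive ticks are closer together than the local path delay, deeper pipelining no longer helps.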

  10. Making DARTS faster: Algorithm Adaptations • Initialization changes: send tick(1), ..., tick(X+1) to all; clock := X+1 (was: send tick(1) to all; clock := 1) • The f+1 rule is not changed: if received tick(m) from at least f+1 remote nodes and m > clock: send tick(clock+1), ..., tick(m) to all; clock := m • The 2f+1 rule is relaxed to m + X >= clock (was: m >= clock): if received tick(m) from at least 2f+1 remote nodes and m + X >= clock: send tick(m+1) to all; clock := m+1 • This allows sending tick k+X+1 based on tick k • Only a small change in the algorithm

  11. Is pDARTS correct?! (n = 5, f = 1) • [Figure: tick queue (TQS) holding ticks k-X, ..., k+1] • The adapted algorithm: initially send tick(1), ..., tick(X+1) to all and set clock := X+1; the f+1 rule unchanged; the 2f+1 rule with m + X >= clock • Correctness is easy to prove in classical system models (synchronous, Θ-model)

  12. pDARTS Hardware Implementation • The hardware provides the m > clock status and the m + X >= clock status needed by the two rules of the adapted algorithm

  13. pDARTS Hardware Implementation (cont.) • [Figure: circuit schematic] • Provides the m > clock status • Provides the m + X >= clock status

  14. Is pDARTS still correct?! • Correctness proof: for the high-level algorithm, yes, but a proof gap to the implementation remains • The low-level pDARTS requires far more complex proofs than DARTS, and queuing effects inside the Counter Modules cannot be neglected • Therefore: build a formal framework tied to the hardware and prove correctness therein

  15. The formal Framework • Ingredients • Classical models are step-based (state machines); here, modules with signal ports • A signal's behavior is specified by its event trace ((t, x) in S), its status function (S(t) = x), and its counting function (#S(t) = k) • Basic/compound modules: their behavior is specified by relations on the port behavior • [Figure: module with input port I, output port O, delay interval [Δ-, Δ+], initially 0]
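A toy rendering of the three signal views may help; the trace values and helper names below are my own, not the framework's notation:

```python
# A signal as an event trace {(t, x)}, with its status function S(t)
# (last value at or before t) and counting function #S(t) (events up to t).
trace = [(0.0, 1), (1.2, 0), (2.5, 1), (3.1, 0)]   # made-up (time, value) events

def status(trace, t):
    """S(t): value of the most recent event at or before time t."""
    vals = [x for (u, x) in trace if u <= t]
    return vals[-1] if vals else 0   # signal is initially 0

def count(trace, t):
    """#S(t): number of events up to and including time t."""
    return sum(1 for (u, _) in trace if u <= t)

print(status(trace, 2.0), count(trace, 2.0))   # → 0 2
```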

  16. The formal Framework: Diff-Gate Module (Counting Function model) • When to remove tick k from both the local and the remote pipe: • For k = 0: if tick 1 was received in the remote pipe at t and tick 1 in the local pipe at t', then remove tick 0 from both pipes within max(t, t') + [Δ-Diff, Δ+Diff] • For k > 0: if tick k+1 was received in the remote pipe at t, tick k+1 in the local pipe at t', and tick k-1 was removed at t'', then remove tick k from both pipes within max(t, t', t'') + [Δ-Diff, Δ+Diff] • The active signal is set only if exactly 1 tick is in the local pipe
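The removal rule chains each tick's removal to the arrival of the next tick in both pipes and to the previous removal; a hedged sketch of the worst-case timing (the arrival times and the [Δ-Diff, Δ+Diff] values below are made up):

```python
# Worst-case removal times under the Diff-Gate rule (illustrative data).
D_MIN, D_MAX = 1.0, 2.0                  # [Delta-Diff, Delta+Diff]
remote = [0.0, 1.0, 2.0, 3.5, 5.0]       # arrival of tick k+1 in the remote pipe
local  = [0.2, 1.1, 2.4, 3.0, 5.5]       # arrival of tick k+1 in the local pipe

removal = []                             # latest removal time of tick k
for k, (t_r, t_l) in enumerate(zip(remote, local)):
    trigger = max(t_r, t_l)              # both pipes must hold tick k+1 ...
    if k > 0:
        trigger = max(trigger, removal[-1])  # ... and tick k-1 must be removed
    removal.append(trigger + D_MAX)      # removal happens within trigger + D_MAX

print(removal)
```

The chained max makes removal times monotone in k, which is what keeps the queue sizes bounded in the proof.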

  17. Proof Results • Precision: π(t) bounded • Accuracy: L(t2 - t1) ≤ #ticks in any (t2 - t1) ≤ U(t2 - t1) • Bounded queue sizes (the bound depends on ...)

  18. FPGA prototype implementation • APEX EP20K1000 FPGA • X = 0 corresponds to conventional DARTS; the speedup stabilizes at a maximum of X = 4 • Δ is slow compared to Δloc: Δ about 125 ns, Δloc about 25 ns
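A back-of-the-envelope check of these numbers, assuming the (X+1)/Δ frequency formula from the pipelining slide applies directly to the measured Δ of about 125 ns:

```python
# Estimated tick frequencies for the reported FPGA delay (rough estimate only).
DELTA = 125e-9                      # measured end-to-end delay, ~125 ns
for X in (0, 4):                    # X = 0: conventional DARTS, X = 4: pipelined max
    mhz = (X + 1) / DELTA / 1e6
    print(f"X={X}: {mhz:.0f} MHz")  # → X=0: 8 MHz, X=4: 40 MHz
```

So pipelining buys roughly a factor of X+1 = 5 in clock frequency before the local delay Δloc (about 25 ns) becomes the bottleneck.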

  19. Conclusions • Replication makes a SoC fault-tolerant • Clocking a replicated state machine is non-trivial, but possible • Unfortunately: slow! • Apply the pipelining idea to make it faster • Formal analysis within a hardware-inspired formal framework • Proved it correct & implemented an FPGA prototype

  20. Spreading effect of Ticks • The pipeline tends to spread out ticks evenly after an initial phase
