Fault Tolerant NoC Communication Greg Link
Benefits and Drawbacks of NoC • Benefits • Exploits parallelism with scalable interconnect • Employs heavy design reuse • Removes complexity of bus interconnect • Drawbacks • Tight area constraints on routers • Low latency imperative • Uncertain node-node delays
Why Fault Tolerance? • Offers many advantages: • Avoids costly packet retransmissions • Avoids catastrophic data loss • Can increase chip yield • Allows higher speed operation • In NoC specifically • Ensures success of interconnect • Grows in importance as technology scales
Previous Work • Significant research has been done in reliable and fault tolerant communication for large-scale interconnect • Fault-Tolerant NoC work is limited • Mostly flooding style algorithms • consume large amounts of energy • network congestion severely limits performance • no guaranteed delivery even in fault-free case • Stochastic algorithms have difficult-to-predict delays
My Goal • Develop a non-stochastic (and hence lower power) means of fault-tolerant communication for NoC • Target Goals: • Consume relatively minor die area • Predictable communication behaviour • Limit the number of unnecessary packet duplicates • Target Applications: • 14x14 2D network for LDPC decoding • 4x4 2D network used in Neural Network-on-Chip
A Two-Part Problem • Permanent faults • Nodes that • Improperly route packets • Don’t route at all • Can be detected at boot-time • Must be routed ‘around’ to avoid problems
The Second Half of the Problem • Transient faults: • Occur between nodes, and represent bit-errors in the interconnect • As technology scales, interconnect can have more than 1000 bits • With a bit-error rate of 10^-12 on a 16x16 torus, there will be a word-error every 2x10^6 cycles (< 6 ms) • To avoid costly source-dest retransmissions and their required storage, we must ensure no word-errors during transmit
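The word-error interval quoted above can be checked with a short calculation. This is only a sketch: the slide does not state the channel count or link width, so the 4 output channels per node and the 512-bit channel width below are assumptions, chosen because they roughly reproduce the 2x10^6 figure.

```python
def cycles_between_word_errors(bit_error_rate, channels, bits_per_channel):
    """Expected cycles between word errors, assuming independent bit errors
    and every channel carrying a full word on every cycle."""
    bits_per_cycle = channels * bits_per_channel
    return 1.0 / (bit_error_rate * bits_per_cycle)

# Assumed: 16x16 torus, 4 output channels per node, 512 bits per channel.
channels = 16 * 16 * 4
cycles = cycles_between_word_errors(1e-12, channels, 512)
print(f"about {cycles:.2e} cycles between word errors")
```

Under these assumptions the interval is a little under 2x10^6 cycles, which falls below 6 ms for any clock faster than roughly 320 MHz.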
The more ‘permanent’ solution… • Since permanent faults can be detected statically, use a series of ‘bootstrap’ operations. • Routers are configured to ‘ignore’ certain neighbors if they don’t respond properly to queries. • Results in “X-Preferred-Y” routing
X-Preferred-Y Routing • In the nominal case, use standard deadlock-free X-Y routing • If the normal direction choice is unavailable, send the packet in the secondary direction of choice. • If both choices are unavailable, copy the message to all functional output channels. • Packets sent in X-directions are henceforth routed using Y-Preferred-X routing to prevent backtracking • All packets sent in ‘improper’ directions have their lifetime values reduced by 1. This prevents livelock and endless oscillation. (The optimal lifetime was empirically determined to be near 2-3.)
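The per-hop decision described on this slide might be sketched as follows. This is an illustration, not the author's router logic. The function name, the mesh-style (x, y) coordinates, the direction labels and the `came_from` argument are all hypothetical. The sketch also assumes that any hop other than the primary choice counts as 'improper' and costs one unit of lifetime, and that a packet with no lifetime left is dropped rather than misrouted. Switching between X-Preferred-Y and Y-Preferred-X is reduced to a `prefer_x` flag that the caller would have to maintain.

```python
def productive(delta, positive_dir, negative_dir):
    """Direction that reduces the remaining offset, or None if already aligned."""
    if delta > 0:
        return positive_dir
    if delta < 0:
        return negative_dir
    return None

def xpy_route(cur, dst, working, lifetime, prefer_x=True, came_from=None):
    """Return a list of (output, new_lifetime) pairs for one packet at one router.

    working: set of output directions whose neighbours passed the boot-time check.
    """
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx == 0 and dy == 0:
        return [("LOCAL", lifetime)]          # arrived: eject to the local node

    x_dir = productive(dx, "E", "W")
    y_dir = productive(dy, "N", "S")
    order = [x_dir, y_dir] if prefer_x else [y_dir, x_dir]
    order = [d for d in order if d is not None]

    if order[0] in working:                   # nominal case: plain X-Y routing
        return [(order[0], lifetime)]
    if lifetime == 0:                         # assumed: no lifetime left, drop
        return []
    if len(order) > 1 and order[1] in working:
        return [(order[1], lifetime - 1)]     # secondary direction of choice
    # Both choices unavailable: copy to every functional output,
    # skipping the port the packet arrived on (an assumption).
    return [(d, lifetime - 1) for d in sorted(working) if d != came_from]
```

For example, `xpy_route((0, 0), (2, 1), {"N", "S", "W"}, 2)` falls back to the Y direction with the lifetime reduced to 1, while removing "N" as well makes the router copy the packet to the remaining functional outputs.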
Notes on X-P-Y Routing • Does not handle all fault patterns, but handles most • Routing fails only when many faulty nodes cluster in a local area (and in a ‘C’ shape) • Deadlock Avoidance • The packet attempts to return to X-Preferred-Y routing as soon as possible • Can only have ‘misrouting’ packets in the local vicinity of faulty nodes • These faulty nodes make certain turns impossible
Possible Transient Solutions • Use SECDED encoding/decoding for each packet hop • Checks and corrects errors at every hop • Requires two additional pipeline stages (normally only three total) • Use SECDED encoding for the entire packet at the NIC level, and decode only the destination during transmission • Two extra cycles required overall, but less fault tolerance
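For readers unfamiliar with SECDED (single-error correction, double-error detection), a minimal sketch using an extended Hamming (8,4) code is shown below. The slides do not say which code or word width the author used, so the 4-bit data word and the function names are assumptions made purely for illustration. A real link would protect a much wider flit.

```python
def secded_encode(data):
    """Encode 4 data bits as [p0, p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4            # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4            # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4            # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p4, d2, d3, d4]
    p0 = 0
    for bit in word:             # overall parity enables double-error detection
        p0 ^= bit
    return [p0] + word

def secded_decode(code):
    """Return (data, status); status is 'ok', 'corrected' or 'double_error'."""
    word = list(code)
    syndrome = 0
    for pos in range(1, 8):      # XOR of the positions holding a 1
        if word[pos]:
            syndrome ^= pos
    overall = 0
    for bit in word:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        word[syndrome] ^= 1      # syndrome 0 means p0 itself was flipped
        status = "corrected"
    else:
        return None, "double_error"   # detected but not correctable
    return [word[3], word[5], word[6], word[7]], status
```

In the per-hop scheme every router would run both functions. In the NIC-level scheme only the endpoints would, which is where the latency and tolerance trade-off on this slide comes from.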
A New Solution? • Use shadow registers (see Razor project) to detect soft-errors and crosstalk delay-induced timing errors • Secondary ‘extra’ set of registers resamples the link shortly after the primary clock fires • Discrepancies must be due to timing violation or bitflip • Reissue the flit, and allow two cycles for retransmission this time
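The detect-and-reissue behaviour described above could be modelled as below. This is a behavioural sketch only. `transfer_flit`, `sample_link` and the two toy links are hypothetical names, and the model assumes the two-cycle retry always succeeds and that a failed attempt costs one cycle plus the two retry cycles.

```python
def transfer_flit(flit, sample_link):
    """Send one flit over a link; return (received_value, cycles_used).

    sample_link(flit, cycles_allowed) returns (primary, shadow): the value
    latched on the main clock edge and the value resampled shortly after.
    """
    primary, shadow = sample_link(flit, 1)
    if primary == shadow:
        return primary, 1        # error-free case runs at full speed
    # Mismatch: timing violation or bit flip. Reissue with two cycles allowed.
    value, _ = sample_link(flit, 2)
    return value, 1 + 2

def clean_link(flit, cycles_allowed):
    """Toy link that never corrupts data."""
    return flit, flit

def slow_link(flit, cycles_allowed):
    """Toy link whose lowest bit settles too late for a one-cycle transfer."""
    if cycles_allowed == 1:
        return flit ^ 0b1, flit
    return flit, flit
```

With these toy links, `transfer_flit(0b1010, clean_link)` completes in one cycle and `transfer_flit(0b1010, slow_link)` delivers the same value after three.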
Evaluating the Solution - Area • The first question: Is it even feasible? • XPY routing consumes an additional 17% switch area in a 14x14 LDPC application • Only a 7% switch area penalty in 4x4 Neural Network application • Doesn’t affect clock rate • Bootstrap operations still included off chip
Evaluating the Solution – Area (2) • Transient Solutions: • Full Node-Node SECDED has a 34% area penalty in the 14x14 app, and a 41% penalty in the 4x4 NN app • NIC-based SECDED increases area by only 20% and 17% respectively • Shadow registers require an extra flit of buffering at each port, imposing nearly a 70% area penalty • As technology scales, one flit is less and less of a problem
Evaluation – Energy Consumption • The maximum number of replications of a given message is 3*lifetime copies • Finite and bounded number of duplicates • Fewer than a similarly effective redundant random walk (RRW) in the worst case • In the best case, no copies are generated at all • As energy is dominated by hop count, this results in nearly an order-of-magnitude lower energy consumption than the RRW method (which is an order of magnitude lower than stochastic flooding)
Evaluation – Energy Consumption (2) • Work still in progress on transient solutions • Energy consumption in NoC switches is normally dominated by buffer activity • SECDED blocks require significant switching activity • Additional bits (and hence larger buffers) required to send SECDED data • Shadow latches must be clocked as well • Still an open question
Evaluation – Performance Impact • Due to a power outage, I cannot show you the effectiveness graph. BUT… • For 5% faults, more than 99% of all paths on the chip are routable • For small networks (4x4), over 50% of the chips have all valid paths routable, even with 10% faults • Effectiveness of the algorithm reduces as network size increases – the longer average paths increase the odds of stumbling across a non-routable section
Performance Impact – Transient Solutions • Normal packet latency in the 14x14 LDPC application (3 stage pipeline) is roughly 16 cycles • Full SECDED increases this to nearly 27 cycles • NIC-based SECDED increases it to only 18 cycles • The shadow register technique imposes little delay penalty, as the error-free case (the majority of cases) runs at optimal speed • For small networks (2 stage pipeline), the average latency is much smaller, nearing four cycles. • The two cycle penalty imposed by NIC-based SECDED becomes much more significant • Shadow registers are much faster for small networks and short pipelines
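The latency figures above are consistent with a simple back-of-the-envelope model, sketched here. The assumption that packet latency scales linearly with router pipeline depth is mine, not the author's; `scaled_latency` is a hypothetical helper.

```python
def scaled_latency(base_cycles, base_stages, extra_stages):
    """Latency if every hop grows from base_stages to base_stages + extra_stages,
    assuming latency is proportional to pipeline depth."""
    return base_cycles * (base_stages + extra_stages) / base_stages

# 14x14 LDPC network: 16 cycles with a 3-stage pipeline.
full_secded = scaled_latency(16, 3, 2)    # two extra stages at every hop
nic_secded = 16 + 2                       # two extra cycles once per packet

# Relative cost of the fixed two-cycle NIC penalty in each network.
penalty_large = 2 / 16                    # 14x14 network, about 16 cycles
penalty_small = 2 / 4                     # small network, about 4 cycles

print(f"full SECDED: {full_secded:.1f} cycles, NIC SECDED: {nic_secded} cycles")
print(f"NIC penalty: {penalty_large:.1%} (large) vs {penalty_small:.1%} (small)")
```

Under this model the per-hop scheme lands close to the quoted 27 cycles, and the same two-cycle NIC penalty grows from an eighth of the latency in the large network to half of it in the small one.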
Conclusions • X-Preferred-Y routing can be implemented in almost all designs and offers a nearly-deterministic and resource-friendly solution • NIC-based SECDED is an effective means of preventing soft errors but the performance impact is significant for small networks • Shadow registers especially useful for performance-critical apps • Get even better as technology scales • Area penalty minimized • Soft errors and Crosstalk problems increasing
Further Work • On-chip bootstrap nodes to determine and reconfigure around faults • Energy consumption of transient solutions • Non-homogeneous irregular topologies