High-Performance Networks for Dataflow Architectures Pravin Bhat Andrew Putnam
Overview • Motivation & Design Constraints • Network design • Performance • Adaptive Routing • Conclusion
Motivation • Signal delay on wires is more important than transistor switching speed • Seriously decreased reliability in future processes • Factory testing will not be possible • Expect 20% of transistors to be DOA • Expect 10% more to die over several months • Dataflow is an answer, but the network is currently a bottleneck
Dataflow Characteristics • Unpredictable traffic • Cannot pre-allocate resources • Highly bursty traffic • Quick delivery of bursts is critical • Nodes are not guaranteed to consume messages • Potential for livelock & deadlock
Network Requirements • High performance during bursts • Area efficient • Guaranteed message delivery • Deadlock and livelock free • Fault tolerant • Regular 2-D physical structure
Topology • On-chip, so it must be implementable in 2-D • Regular tiled structure suggests: • Grid • Torus • Hypercube • Fat Tree • Hypercube is difficult to route and to scale • Fat Tree has a single point of failure
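To make the grid versus torus choice concrete, the sketch below compares worst-case hop counts on the two topologies. The function names and the 8x8 tile array are illustrative assumptions, not taken from the slides; the model counts hops only and ignores the longer torus wires (the ASIC Model slide reports 6.16 ns per torus link against 2.76 ns per grid link).

```python
def grid_hops(src, dst):
    """Manhattan distance between two (x, y) tiles on a grid."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Shortest hop count on a torus, where every row and column wraps around."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    # In each dimension, take the shorter of the direct and wrap-around routes.
    return min(dx, width - dx) + min(dy, height - dy)


# Corner-to-corner traffic on a hypothetical 8x8 array of tiles.
corner_grid = grid_hops((0, 0), (7, 7))
corner_torus = torus_hops((0, 0), (7, 7), 8, 8)
```

Fewer hops on the torus have to be weighed against its slower links, which is why the slides evaluate performance per area rather than hop count alone.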
Routing • Static routing does not provide essential fault tolerance • Use a modified Virtual Channel algorithm • VC guarantees deadlock freedom if nodes consume messages • Dynamically adaptive to handle transient faults & congestion • Initial studies used static routing
Flow Control • Resource reservation is not possible • Long-latency wires prohibit handshakes • Send messages optimistically, assuming the receiver will accept them • Buffer just enough to allow the receiver to send a reject signal on the subsequent clock cycle
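The send-then-reject scheme can be sketched as a small cycle-by-cycle simulation. This is a minimal model under stated assumptions, not the WaveScalar hardware: the sender is assumed to keep a copy of the message it sent in the previous cycle, the reject signal is assumed to arrive exactly one cycle later, and `simulate`, `capacity`, and `drain_every` are hypothetical names.

```python
from collections import deque


def simulate(messages, capacity, drain_every, cycles):
    """Return the messages consumed by the receiver, in order."""
    pending = deque(messages)
    held = None        # assumed: copy of the message sent last cycle
    rejected = False   # reject signal that becomes visible this cycle
    rx_queue = deque()
    delivered = []
    for cycle in range(cycles):
        # The receiver consumes one message every `drain_every` cycles.
        if cycle % drain_every == 0 and rx_queue:
            delivered.append(rx_queue.popleft())
        # The sender retries the held copy if it was rejected, else sends the next one.
        if held is not None and rejected:
            msg = held
        elif pending:
            msg = pending.popleft()
        else:
            msg = None
        held, rejected = msg, False
        if msg is not None:
            if len(rx_queue) < capacity:
                rx_queue.append(msg)   # accepted without any handshake
            else:
                rejected = True        # the sender sees this on the next cycle
    return delivered
```

For example, `simulate(list(range(5)), capacity=1, drain_every=3, cycles=40)` delivers all five messages in order even though most sends are rejected at least once, because the one-message copy held by the sender is enough to retry.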
Deadlock-Free Operation • Nodes cannot always consume messages • Add a dedicated channel to and from memory • Adds 8% area overhead • Rotate stalled operands out of PEs to ensure forward progress • Send first operand back at a faster rate to avoid livelock
Performance • Ran network-centric simulations • 20 billion instructions • Spec2000, Splash2, and Dataflow benchmarks • Goal is to find optimum balance of: • Number of Virtual Channels • Queue Length • Link Bandwidth • Packets per message
ASIC Model • Performance must be balanced with area • Developed RTL model of WaveScalar network architecture • 90 nm process ASIC standard cell library • Timing per link: • Grid links: 2.76 ns • Torus links: 6.16 ns • Network switch is 11.6% of chip area
Virtual Channels Flow Control • In hardware, only the head of a queue can be dequeued in one clock cycle • If the first message in a queue is blocked, then every message behind it is blocked • Network utilization suffers because links sit idle
Virtual Channels Flow Control • Virtual Channels: several small queues instead of one long queue • Decouples buffer resources from link resources • Increases network throughput by increasing link usage
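The head-of-queue blocking problem and the virtual channel remedy can be illustrated with a toy model of a single link. Everything here is an assumption made for illustration: the port names, the fixed set of blocked destinations, and the rule that the link forwards the first unblocked queue head it finds each cycle.

```python
from collections import deque


def delivered_in(queues, blocked, cycles):
    """Count the messages sent over one link in `cycles` clock cycles.

    Each queue holds destination port names. Only the head of each queue can
    be dequeued, and the link carries at most one message per cycle.
    """
    queues = [deque(q) for q in queues]
    sent = 0
    for _ in range(cycles):
        for q in queues:
            if q and q[0] not in blocked:
                q.popleft()
                sent += 1
                break  # the physical link is used once per cycle
    return sent


traffic = ["east", "north", "north", "north"]

# One four-entry queue: the blocked "east" message stalls everything behind it.
single = delivered_in([traffic], blocked={"east"}, cycles=4)

# The same four entries split into two two-entry virtual channels.
two_vcs = delivered_in([traffic[0::2], traffic[1::2]], blocked={"east"}, cycles=4)
```

With the same total buffer space, the single queue sends nothing while the two virtual channels still move the messages that are not stuck behind the blocked head.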
Dimension Order Routing • Old WaveScalar Routing Protocol • Network topology is a static grid • Packets first travel to the correct x-coordinate and then to the correct y-coordinate • Low network utilization from not using all available paths • Not fault tolerant
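A minimal sketch of dimension order routing on a grid follows, assuming tiles are addressed by (x, y) coordinates; the function name is hypothetical.

```python
def dimension_order_route(src, dst):
    """Return the (x, y) tiles visited from src to dst: x first, then y."""
    x, y = src
    path = [(x, y)]
    # Travel along the x dimension until the column matches.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Then travel along the y dimension until the row matches.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

Every source and destination pair maps to exactly one path, which is why the scheme leaves alternative links unused and cannot steer around a faulty link.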
Adaptive Routing • Progressively chooses longer routes instead of waiting for an unavailable resource • High Network Utilization • Fault tolerant • Can cause deadlock
Deadlock-Free Adaptive Routing • Some Virtual Channels are reserved for Dimension Order Routing; the rest are used for Adaptive Routing • Every time a packet is routed in the wrong direction, its Dimension Reversal count is incremented • No packet is allowed to wait in a virtual channel with a packet that has a lower Dimension Reversal count • Mathematically proven to be deadlock free
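The waiting rule can be sketched as follows. This is an illustration under loud assumptions, not the proven algorithm itself: a "wrong direction" hop is modelled here as any hop that increases the Manhattan distance to the destination, and a packet that may not wait in any adaptive channel is assumed to fall back to the reserved dimension-order channels. The names `Packet`, `take_hop`, `may_wait`, and `choose_channel` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dst: tuple
    reversals: int = 0  # Dimension Reversal count


def take_hop(packet, pos, nxt):
    """Move the packet to `nxt`, counting the hop if it leads away from dst."""
    def dist(p):
        return abs(p[0] - packet.dst[0]) + abs(p[1] - packet.dst[1])
    if dist(nxt) > dist(pos):  # assumed meaning of "wrong direction"
        packet.reversals += 1
    return nxt


def may_wait(packet, occupants):
    """Slide rule: never wait behind a packet with a lower reversal count."""
    return all(o.reversals >= packet.reversals for o in occupants)


def choose_channel(packet, adaptive_channels):
    """Pick an adaptive channel the packet may wait in, else fall back."""
    for name, occupants in adaptive_channels.items():
        if may_wait(packet, occupants):
            return name
    return "dimension-order"  # assumed fallback to the reserved channels
```

A packet that has been misrouted many times therefore finds fewer adaptive channels it may wait in, and is eventually pushed onto the dimension-order channels.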
Conclusion • Best performance per area with: • 2 Virtual Channels • 2 Links • 2-4 entries per queue • Torus Topology • Adaptive Routing • Dataflow chip networks can be high-performance at reasonable area