400 likes | 549 Views
ARIADNE A gnostic R econfiguration I n A D isconnected N etwork E nvironment. Konstantinos Aisopos (Princeton, MIT), Andrew DeOrio (Michigan), Li- Shiuan Peh (MIT), Valeria Bertacco (Michigan). What is “reconfiguration”?. Silicon technologies move into the nanometer regime
E N D
ARIADNEAgnostic ReconfigurationIn ADisconnected Network Environment Konstantinos Aisopos (Princeton, MIT), Andrew DeOrio (Michigan), Li-ShiuanPeh (MIT), Valeria Bertacco (Michigan)
What is “reconfiguration”? Silicon technologies move into the nanometer regime …transistors become unreliable In future chips of 100 billion transistors, 10% of transistors will eventually fail over the lifetime of the chip for architects: permanent faults Shekhar Borkar Our focus in this talk: Network-on-Chip cannot resend need to re-route around the fault P S D P$ S$ NIC reconfiguration: “the process of replacing the routing algorithm” R
Why is reconfiguration challenging? • XY routing X S Y D
Agnostic Reconfiguration algorithm In A Disconnected Network Environment Why is reconfiguration challenging? • XY routing S D
Why is reconfiguration challenging? • XY routing Agnostic Reconfiguration algorithm S In A Disconnected Network Environment D
Outline • Motivation • Ariadne • Baseline • Deadlocks • Synchronization • Evaluation • Overhead • Performance • Reliability • Conclusions
How will S find a path to D? RT RT RT RT S • … • … • … • … D: D: D: D: S W E N • … • … • … • … D
How will S find a path to D? RT RT RT RT RT RT S • … • … • … • … • … • … D: D: D: D: D: D: W,N E,N S W,S E,S N • … • … • … • … • … • … D
How will S find a path to D? RT RT RT S • … • … • … D: D: D: W W S • … • … • … D
ARIADNE: baseline • Upon a fault that changes the topology… • a nodecan let everyone know how it can be reached with a single broadcast • N nodes can let everyone know how they can be reached with N broadcasts
ARIADNE: baseline statically assigned node IDs • Upon a fault that changes the topology… • Every node broadcasts “in-turn” to let others know how it can be reached 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 • last • 2nd • 3rd 16 17 18 19 20 21 22 23 • 1st 17 20 19 • fault detector: 18 24 25 27 28 29 30 31 24 • … 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
ARIADNE: baseline • Upon a fault that changes the topology… • Every node broadcasts “in-turn” to let others know how it can be reached • last • 2nd • 3rd • 1st 20 17 19 • fault detector: 18 • … • Issues: • deadlock avoidance • synchronization • (when to broadcast, multiple detectors)
ARIADNE: deadlocks S S D D S D
ARIADNE: deadlocks up*/down* disable routes where rank goes 0 1 2 3 4 5 6 7 first bcast ONLY: nodes are assigned ranks higher 8 9 10 11 12 13 14 15 bcaster “root” 0 16 17 18 19 20 21 22 23 • down lower • up immediate neighbors 1 24 25 26 27 28 29 30 31 assume routing circle: 1 node will have higher rank than its neighbors, breaking the circular route 2-hop neighbors 2 32 33 34 35 36 37 38 39 r 3-hop neighbors 3 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 unique ordering: among nodes with same rank, arbitrarily select a higher one 56 57 58 59 60 61 62 63
ARIADNE: deadlocks up*/down* disable routes where rank goes first bcast ONLY: nodes are assigned ranks higher bcaster “root” 0 • down lower • up immediate neighbors 1 assume routing circle: 1 node will have higher rank than its neighbors, breaking the circular route 2-hop neighbors 2 r 3-hop neighbors 3 unique ordering: among nodes with same rank, arbitrarily select a higher one
ARIADNE: deadlocks up*/down* disable routes where rank goes first bcast ONLY: nodes are assigned ranks S higher bcaster “root” 0 • down lower • up immediate neighbors 1 D assume routing circle: 1 node will have higher rank than its neighbors, breaking the circular route 2-hop neighbors 2 r 3-hop neighbors 3 unique ordering: among nodes with same rank, arbitrarily select a higher one connectivity: can reach any node via the root S D
ARIADNE: deadlocks • Upon a fault that changes the topology… • Every node broadcasts “in-turn” to let others know how it can be reached • Issues: • deadlock avoidance • synchronization
ARIADNE: deadlocks • Upon a fault that changes the topology… • Every node broadcasts “in-turn” to let others know how it can be reached RULE: (i) first broadcast ranks nodes (ii) remaining broadcasts spread ONLY via enabled turns • Issues: • synchronization
ARIADNE: synchronization • 1-bit • 1-bit • arbitration • 1-bit • 1-bit • 1-bit • 1-bit • how does the recipient of a flag know the broadcasting node? • 1-bit • 1-bit • “in turn” broadcasts: when does previous broadcast complete? can broadcasts overlap? • NO
ARIADNE: synchronization Solution : Atomic Broadcasts • Nodes utilize the cycle count as a global reference point • Each node is assigned a unique broadcast slot from the “global” cycle counter
ARIADNE: synchronization 0 1 2 3 cycle count (same for all nodes) bcast node bcast cycle 0 1 1 1 0 0 1 0 0 1 0 1 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0 0 0 0 0 0 1 1 1 0 6 6 5 4 4 9 5 0 15 15 15 0 7 15 4 5 5 6 7 log(16) bits log(16) bits … X X X X X X 8 9 10 11 5 initiates bcast … 5’s bcast completes 6’s bcast completes 4’s bcast completes 12 13 14 15 waits for 5 0 6 initiates bcast … longest (in hops) broadcast … … reconfiguration completes in (16)2 =(number of nodes)2 cycles
ARIADNE: synchronization 0 1 2 3 cycle count (same for all nodes) bcast node bcast cycle 0 1 1 1 0 0 0 1 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1 1 0 1 1st hop 5 5 5 4 9 5 0 1 7 15 2 15 4 5 5 6 7 … X X X X X X 2nd hop 8 9 8 10 11 5 initiates bcast 5’s bcast completes 12 13 14 15 8 resigns from becoming the root node waits for waits for 8 5 0 0 … (!) we need to reconfigure once even for multiple faults
Outline • Motivation • Ariadne • Baseline • Deadlocks • Synchronization • Evaluation • Overhead • Performance • Reliability • Conclusions
Evaluation: Overhead Evaluation • On-chip routing algorithms for irregular topologies Immunet (V. Puente, ISCA’04) • Vicis routing algo • (D. Fick, DATE’09) • exceptions to turn model to apply it to an arbitrary topology • reserves an escape VC for deadlock freedom (routes deterministically in a ring) ARIADNE 6.0% 2.0% 1.5% • synthesized a baseline 5-stage pipelined router (5 ports, 2 VCs, 5-flit buffer/VC) • with Synopsys Design Compiler (IBM 130nm target library): • router area (mm2): baseline=2.708, Ariadne=2.761, Vicis=2.748, Immunet=2.870
Evaluation: Performance • Average over 10 PARSEC benchmarks • 1000 fault configurations • Experimental Setup: Garnet + GEMS lower is better • System Configuration (GEMS) • deadlocks • traffic • Network Architecture (GARNET) • routing in • a ring
Evaluation: Performance + Reliability • On-chip routing algorithms for irregular topologies Immunet (V. Puente, ISCA’04) • Vicis routing algo • (D. Fick, DATE’09) • exceptions to turn model to apply it to an arbitrary topology • reserves an escape VC for deadlock freedom (routes deterministically in a ring) ARIADNE 6.0% 2.0% 1.5%
Outline • Motivation • Ariadne • Baseline • Deadlocks • Synchronization • Evaluation • Overhead • Performance • Reliability • Conclusions
Conclusions We have presented Ariadne. • a reconfiguration algorithm that provides deadlock-free routing paths in irregular network topologies that result from faulty links • is implemented in a fully distributed mode, resulting in simple hardware and low complexity • enables a trade-off between performance and reliable functionality on unreliable silicon
Thank You! Questions? [source: wikipedia] The Greek legend of Princess Ariadne • “Ariadne (Αριάδνη), was the daughter of King Minos of Crete. Minos attacked Athens after his son was killed there. The Athenians asked for terms, and were required to sacrifice seven young men and seven maidens every nine years to the Minotaur, a monster with the head of a bull on the body of a man. One year, the sacrificial party included Theseus, a young man who volunteered to come and kill the Minotaur. Ariadne fell in love at first sight, and helped him by giving him a ball of red fleece thread that she was spinning, to find his way out of the Minotaur's labyrinth.” …similarly to Princess Ariadne, our Ariadne algorithm helps packets find their way in the labyrinth-like topology of a faulty network.
BACKUP SLIDES
Evaluation: Results (reconfiguration dynamics)
Ariadne vs. off-chip approaches • Off-chip networks utilize centralized software algorithms for reconfiguration (-) topology needs to be communicated to a central node (-) interfacing with OS (-) updated routing tables need to be delivered back to each node C OS
Ariadne: Motivation • Many transistor failures are expected to occur at NoCs fabricated at advanced technology nodes • These failures will result in router link failures and disconnected routers
Evaluation: Related Work • On-chip routing algorithms for irregular topologies implemented for comparison Immunet (V. Puente, ISCA 2004) • Vicis routing algorithm • (D. Fick, DATE 2009) • 1 escape VC reserved for deadlock freedom: routes deterministically in a ring S • Other VCs route adaptively. BUT, if no other VC is available, a packet switches to escape VC D • latency when traffic • 6% overhead (3 routing tables) • reliability
Evaluation: Related Work • On-chip routing algorithms for irregular topologies implemented for comparison Immunet (V. Puente, ISCA 2004) • Vicis routing algorithm • (D. Fick, DATE 2009) • No faults? use turn model • 1 escape VC reserved for deadlock freedom: routes deterministically in a ring • Upon fault occurrence: • re-enable disabled turns to increase connectivity • Other VCs route adaptively. BUT, if no other VC is available, a packet switches to escape VC • latency when traffic • 6% overhead (3 routing tables) ? • reliability • (assuming north last) • (assuming north last) • (assuming north last)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63