A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router. Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002.
A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router
Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002
Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$ (ack: Richard Kessler)
Intel*, UPV!, & HP$
[Figure: 21364 chip (including router): 21264 core, L2 cache tags, L2 cache data, MC1, MC2, and router. Alpha 21364 network: grid of nodes labeled M (Rambus memory) and IO (I/O).]
The Alpha 21364 8x7 Router
[Figure: crossbar between input ports and output ports; a distributed arbitration algorithm controls the crossbar]
• 8 input ports: 4 network, 2 memory, 1 cache, 1 I/O
• 7 output ports: 4 network, 2 memory/cache, 1 I/O
• Router pipeline length = 13/14 cycles
• Virtual cut-through
Problem: Maximize # Matches
[Table: packets queued at each input port, with the older packet at each input port marked; numbers in table cells: destination output port]
Cells outside the labeled rows: 1 2 3 6 3 3 3
Input Port 0: 2 3
Input Port 1: 1 3
Input Port 2: 1 2
Input Port 3: 1 2 3
Input Port 4: 1 3
Input Port 5: 0 2
Input Port 6: 4 2
Input Port 7: 5 2
• Oldest Packet First: one match
• Smarter algorithm (shaded boxes): 7 matches (perfect)
Simpler Algorithms Have Fewer Matches
[Figure: algorithms compared along a complexity axis; assumes all output ports are free]
Complexity May Not Pay Off
[Figure: algorithms compared along a complexity axis, at 30% input buffer occupancy]
Key Results
• Arbitration algorithms
  • WFA: Wave Front Arbitration algorithm (Tamir & Chi, 1993; SGI Spider)
  • PIM1: Parallel Iterative Matching with one iteration (Anderson et al., ASPLOS 1992)
  • SPAA: Simple, Pipelined Arbitration Algorithm (21364)
• SPAA outperforms WFA & PIM1
  + SPAA’s matching power is similar to that of WFA & PIM1 (when many output ports are busy)
  + SPAA minimizes interactions between ports
  + SPAA can be pipelined more effectively
• Rotary Rule
  + avoids network saturation under very heavy load
Wave Front Arbiter (WFA)
• Proposed by Tamir & Chi, 1993
• Used in the SGI Spider/Origin switch
• Implemented via a “connection” matrix
[Figure: connection matrix with rows for input ports 0-3 and columns for output ports, wave fronts numbered 1-7; cell (i,j) takes Request, N, and W and produces Grant, S, and E]
• Cell logic:
  Grant = Request & N & W
  S = N & NOT(Grant)
  E = W & NOT(Grant)
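
The cell equations above can be tried out in software. The following is a minimal sketch, not the SGI Spider or 21364 hardware: it evaluates the cells row by row instead of as parallel wave fronts, which respects the same N and W dependencies. The function name `wavefront_arbitrate` and the `start_row`/`start_col` parameters (a stand-in for changing the first cell) are assumptions made for illustration.

```python
def wavefront_arbitrate(request, start_row=0, start_col=0):
    """Apply the WFA cell equations to a request matrix.

    request[i][j] is True when input port i has a packet for output port j.
    start_row/start_col are hypothetical parameters that pick the
    top-priority cell; real hardware evaluates diagonals in parallel.
    Returns a grant matrix of the same shape.
    """
    n_in, n_out = len(request), len(request[0])
    grant = [[False] * n_out for _ in range(n_in)]
    north = [True] * n_out          # N signal entering each column (output still free)
    for r in range(n_in):
        i = (start_row + r) % n_in
        west = True                 # W signal entering this row (input still free)
        for c in range(n_out):
            j = (start_col + c) % n_out
            g = request[i][j] and north[j] and west   # Grant = Request & N & W
            grant[i][j] = g
            north[j] = north[j] and not g             # S = N & NOT(Grant)
            west = west and not g                     # E = W & NOT(Grant)
    return grant


requests = [[True, True],
            [True, False]]
print(wavefront_arbitrate(requests))               # [[True, False], [False, False]]
print(wavefront_arbitrate(requests, start_row=1))  # [[False, True], [True, False]]
```

In this small example the choice of first cell changes the outcome from one match to two, which hints at why the starting cell matters for fairness and for the Rotary Rule variant.
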
WFA Advantage & Pipeline
+ High degree of interaction among output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
WFA Limitations
[Figure: two overlapped arbitrations through stages (1), (2), and (3) of 1.5, 1.5, and 1 cycles; marked interval: 3 cycles]
- Higher number of estimated cycles
  • 4 cycles in 0.18 micron
- Harder to pipeline effectively
  • micropipelining waves (2) is difficult because the initial cell changes every cycle
  • restarting (1) before (2) completes is complex
  • large in-flight packet table due to large number of nominations (up to 54)
  • may require multiple copies of the matrix to buffer pipeline stages (these must avoid stale nominations)
Parallel Iterative Matching (PIM)
• Steps in one iteration (PIM1)
  • Nominate: each input port nominates packets for every output port (same packet nominated multiple times …)
  • Grant: each unmatched output port selects an input port packet randomly
  • Accept: each unselected input port selects a grant randomly
[Figure: Nominate, Grant, and Accept steps between input ports 0-1 and output ports 0-1; output port 0 is unused in this arbitration round]
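
The three PIM1 steps can be sketched in a few lines. This is an illustrative model only, assuming every output port starts the round unmatched; the function name `pim1` and the request-matrix representation are hypothetical and do not come from the slides.

```python
import random


def pim1(request, rng=None):
    """One Nominate/Grant/Accept iteration.

    request[i][j] is True when input port i nominates a packet for
    output port j. Returns a dict mapping input port -> accepted output port.
    """
    rng = rng or random.Random()
    n_in, n_out = len(request), len(request[0])

    # Nominate + Grant: each output port picks one nominating input at random.
    offers = {i: [] for i in range(n_in)}
    for j in range(n_out):
        nominators = [i for i in range(n_in) if request[i][j]]
        if nominators:
            offers[rng.choice(nominators)].append(j)

    # Accept: each input port that received grants picks one at random.
    # Outputs whose grant is not accepted stay unused this round.
    return {i: rng.choice(outs) for i, outs in offers.items() if outs}


requests = [[True, True],
            [True, True]]
print(pim1(requests, random.Random(0)))
```

When both output ports grant the same input port, only one grant is accepted and the other output port goes unused for the round, which is the situation the figure illustrates.
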
PIM1 Advantage & Pipeline
+ High interaction between input and output ports reduces arbitration collisions & improves # of matches
Algorithm (implemented via a connection matrix)
(1) Select packet at input port & load matrix (1.5 cycles)
(2) Run through matrix and inform input ports (1.5 cycles)
(3) Forward arbitration to output ports (1 cycle)
PIM1 Limitations
[Figure: two overlapped arbitrations through stages (1), (2), and (3) of 1.5, 1.5, and 1 cycles; marked interval: 3 cycles]
- Higher number of estimated cycles
  • 4 cycles in 0.18 micron
- Harder to pipeline effectively
  • restarting (1) before (2) completes is complex
  • same packet can be nominated multiple times, requiring the “Accept” step (part of stage 2)
  • large in-flight packet table due to large number of nominations (up to 54)
  • may require multiple copies of the matrix to buffer pipeline stages (these must avoid stale nominations)
Simple, Pipelined Arbitration Algorithm (SPAA), used in the Alpha 21364 Router
• Algorithm
  • Nominate: each input port nominates packets for exactly one output port (one packet nominated only once)
  • Grant: each output port selects an input port packet based on the least-recently selected one
  • Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles
[Figure: arbitration steps between input ports 0-1 and output ports 0-1]
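
One SPAA round can be sketched as follows. This is a simplified model, not the 21364 implementation: it treats every output port as free, and the function name `spaa_round`, the `last_selected` bookkeeping, and the lowest-port-number tie-break are assumptions made for illustration.

```python
def spaa_round(nominations, last_selected, now):
    """One Nominate/Grant/Reset round.

    nominations: dict input port -> the single output port it nominates to.
    last_selected: dict (output port, input port) -> round of the last grant;
        updated in place. Missing entries count as "never selected".
    now: current round number.
    Returns (grants, losers): grants maps output port -> winning input port,
    losers lists the input ports that must reset and renominate.
    """
    by_output = {}
    for inp, out in nominations.items():
        by_output.setdefault(out, []).append(inp)

    grants = {}
    for out, inputs in by_output.items():
        # Grant: least-recently selected input wins (ties: lowest port number).
        winner = min(inputs, key=lambda inp: (last_selected.get((out, inp), -1), inp))
        grants[out] = winner
        last_selected[(out, winner)] = now

    # Reset: unselected input ports renominate in a later round.
    losers = sorted(inp for inp, out in nominations.items() if grants[out] != inp)
    return grants, losers


history = {}
print(spaa_round({0: 3, 1: 3, 2: 5}, history, now=0))  # ({3: 0, 5: 2}, [1])
print(spaa_round({0: 3, 1: 3}, history, now=1))        # ({3: 1}, [0])
```

Because each input port nominates to only one output port, no Accept step is needed: a grant is final as soon as the output port makes it.
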
SPAA’s Simplicity
• Low degree of interaction among ports
  - increases arbitration collisions
  + reduces complexity
Algorithm (no centralized matrix)
(1) Select packet at input port & load matrix (1 cycle)
(2) Forward packets to output ports (1 cycle)
(3) Output ports select packets and return feedback to input ports (1 cycle)
SPAA’s Advantages
[Figure: two overlapped arbitrations through stages (1), (2), and (3) of 1 cycle each; marked interval: 1 cycle]
+ Fewer cycles
  • 3 cycles in 0.18 micron
+ Speculatively read out input buffer
  • prior to output port arbitration
  • because only one packet is nominated to one output port
+ Easier to pipeline
  • restart (1) for free input ports before (2) completes
  • only one packet nominated to one output port
  • small number (16) of in-flight packets
  • avoids any centralized matrix
  • speculative read allows data flits to follow header flits
Saturation Behavior
[Figure: plot with the saturation point marked]
• Reasons: hot spots & tree saturation
• 21364’s router shows a cyclic pattern (link utilization with time)
• Ideally, operate at saturation bandwidth
• Solution: throttle input load
Rotary Rule
• 21364’s in-built throttling
  + maximum outstanding cache miss requests per processor = 16
• Rotary Rule: more throttling
  + 21364 is a “direct” network
  + Rotary Rule prioritizes traffic in network ports over local ports
  + also clears network congestion
  + relies on anti-starvation mechanism
• WFA+Rotary: change first cell
• SPAA+Rotary: change output port priority to the Rotary Rule
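
For SPAA, changing the output port priority to the Rotary Rule can be pictured as a two-level selection at each output port. The sketch below is one plausible reading, not the 21364 logic: the port numbering in `NETWORK_INPUTS`, the function name `rotary_grant`, and the least-recently-selected tie-break within a class are assumptions, and the anti-starvation mechanism the slide mentions is not modeled.

```python
# Assumed numbering: input ports 0-3 are the network ports,
# the remaining input ports are local (memory, cache, I/O).
NETWORK_INPUTS = frozenset({0, 1, 2, 3})


def rotary_grant(candidates, last_selected):
    """Pick the winning input port at one output port.

    candidates: non-empty list of input ports nominating to this output port.
    last_selected: dict input port -> round of its last grant here.
    Network ports always beat local ports; within a class the
    least-recently selected port wins (ties: lowest port number).
    """
    network = [p for p in candidates if p in NETWORK_INPUTS]
    pool = network if network else list(candidates)
    return min(pool, key=lambda p: (last_selected.get(p, -1), p))


print(rotary_grant([5, 2, 6], {2: 10}))  # 2: network port wins despite a recent grant
print(rotary_grant([5, 6], {5: 3}))      # 6: only local ports compete
```

Without a separate anti-starvation check, a steady stream of network traffic would keep local ports waiting indefinitely in this model, which is why the rule depends on such a mechanism.
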
Simulation Methodology
• Asim
  • modeling infrastructure
  • detailed timing model of 21364 network
  • selected design points validated against RTL
• Traffic patterns
  • 70% three coherence hops, 30% two coherence hops
  • random destinations
  • other traffic combinations in paper and simulated internally
64 Node Network: Base Case
[Figure: plot with the knee marked]
• SPAA outperforms WFA & PIM1
• 24% higher throughput at knee
64 Node Network: With Rotary Rule • Rotary Rule helps both SPAA & WFA
Summary & Conclusions
• Arbitration algorithms
  • WFA: Wave Front Arbitration algorithm (Tamir & Chi, 1993; SGI Spider)
  • PIM1: Parallel Iterative Matching with one iteration (Anderson et al., ASPLOS 1992)
  • SPAA: Simple, Pipelined Arbitration Algorithm (21364)
• SPAA outperforms WFA & PIM1
  + SPAA’s matching power is similar to that of WFA & PIM1 (when many output ports are busy)
  + SPAA minimizes interactions between ports
  + SPAA can be pipelined more effectively
• Rotary Rule
  + avoids network saturation under heavy load