
A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router

A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router. Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002.


Presentation Transcript


  1. A Comparative Study of Arbitration Algorithms for the Alpha 21364 Pipelined Router. Tenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2002. Shubu Mukherjee*, Federico Silla!, Peter Bannon$, Joel Emer*, Steve Lang*, & Dave Webb$ (ack: Richard Kessler). Affiliations: Intel*, UPV!, & HP$

  2. Alpha 21364 Network [Figure: the 21364 chip (including router), with blocks labeled 21264 Core, L2 Cache Tags, L2 Cache Data, MC1, MC2, and Router; network nodes each attached to Rambus memory (M) and I/O (IO)]

  3. The Alpha 21364 8x7 Router [Figure: crossbar between input ports and output ports] Distributed arbitration algorithm controls the crossbar • 8 input ports: 4 network, 2 memory, 1 cache, 1 I/O • 7 output ports: 4 network, 2 memory/cache, 1 I/O • Router pipeline length = 13/14 cycles • Virtual cut-through

  4. Problem: Maximize # Matches [Table: packets waiting at input ports 0 to 7, ordered by age (older packets marked); numbers in table cells are destination output ports] • Oldest Packet First: one match • Smarter algorithm (shaded boxes): 7 matches (perfect)
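The gap between the two policies on this slide can be illustrated with a small sketch. The request lists below are hypothetical (they are not the slide's table), and `oldest_first_matches` and `max_matches` are names chosen for this example. The second function uses a standard augmenting-path bipartite matching as a stand-in for the "smarter algorithm"; the slides do not say which algorithm was used.

```python
def oldest_first_matches(requests):
    """Count matches when every input port offers only its oldest packet.

    requests[i] lists the destination output ports of the packets waiting
    at input port i, oldest first (hypothetical layout for illustration).
    """
    taken = set()
    for dests in requests:
        if dests and dests[0] not in taken:
            taken.add(dests[0])  # first input asking for a free output wins
    return len(taken)


def max_matches(requests, num_outputs):
    """Maximum input/output matching via augmenting paths."""
    owner = [None] * num_outputs  # owner[out] = input currently matched to out

    def try_assign(inp, seen):
        for out in requests[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or displace its owner if the owner can move.
            if owner[out] is None or try_assign(owner[out], seen):
                owner[out] = inp
                return True
        return False

    return sum(try_assign(inp, set()) for inp in range(len(requests)))


# Hypothetical example: every oldest packet targets output port 3.
requests = [[3, 0], [3, 1], [3, 2], [3]]
print(oldest_first_matches(requests))  # 1
print(max_matches(requests, 4))        # 4
```

With these made-up requests, Oldest Packet First finds a single match because all the oldest packets collide on one output, while looking past the head of each queue matches every port.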

  5. Simpler Algorithms Have Fewer Matches [Chart; axis label: complexity] Assumes all output ports are free

  6. Complexity may not pay off [Chart; axis label: complexity] @ 30% input buffer occupancy

  7. Key Results • Arbitration Algorithms • WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider) • PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992) • SPAA: Simple, Pipelined Arbitration Algorithm (21364) • SPAA outperforms WFA & PIM1 + SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy) + SPAA minimizes interactions between ports + SPAA can be pipelined more effectively • Rotary Rule + avoids network saturation under very heavy load

  8. Wave Front Arbiter (WFA) • Proposed by Tamir & Chi, 1993 • Used in the SGI Spider/Origin switch • Implemented via a “connection” matrix [Figure: connection matrix of cells (i,j) for input ports 0 to 3 and the output ports; each cell has Request, N, and W inputs and Grant, S, and E outputs] • Cell logic: Grant = Request & N & W; S = N & NOT(Grant); E = W & NOT(Grant)
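The three cell equations on this slide can be traced in software. The following is a minimal sketch, assuming rows are input ports, columns are output ports, and the wave always starts at cell (0,0); the function name `wavefront_arbitrate` and the example request matrix are made up for illustration. A hardware arbiter evaluates each diagonal in parallel and would normally vary the starting cell, which this sketch does not model.

```python
def wavefront_arbitrate(request):
    """Apply Grant = Request & N & W, S = N & NOT(Grant), E = W & NOT(Grant).

    request[i][j] is truthy if input port i wants output port j.
    Returns a matrix of booleans with the granted cells set.
    """
    rows, cols = len(request), len(request[0])
    north = [True] * cols  # N signal entering each column (output still free)
    west = [True] * rows   # W signal entering each row (input still free)
    grant = [[False] * cols for _ in range(rows)]

    # Sweep anti-diagonals: cells on one diagonal do not depend on each other.
    for wave in range(rows + cols - 1):
        for i in range(rows):
            j = wave - i
            if 0 <= j < cols:
                g = bool(request[i][j]) and north[j] and west[i]
                grant[i][j] = g
                north[j] = north[j] and not g  # S output feeds the cell below
                west[i] = west[i] and not g    # E output feeds the cell to the right
    return grant


# Hypothetical 3x3 request matrix.
request = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
grant = wavefront_arbitrate(request)
print([(i, j) for i in range(3) for j in range(3) if grant[i][j]])
# [(0, 0), (2, 1)]
```

In this example input port 1 goes unmatched because cell (0,0) is granted first and consumes output port 0, which is why the starting position of the wave matters for fairness.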

  9. WFA Advantage & Pipeline + High degree of interaction among output ports reduces arbitration collisions & improves # of matches • Algorithm (implemented via a connection matrix): (1) Select packet at input port & load matrix (1.5 cycles) (2) Run through matrix and inform input ports (1.5 cycles) (3) Forward arbitration to output ports (1 cycle)

  10. WFA Limitations [Pipeline diagram: two arbitrations, each with stages (1), (2), (3) of 1.5, 1.5, and 1 cycles; label: 3 cycles] - Higher number of estimated cycles • 4 cycles in 0.18 micron - Harder to pipeline effectively • micropipelining waves (2) is difficult because initial cell changes every cycle • restarting (1) before (2) completes is complex • large in-flight packet table due to large number of nominations (up to 54) • may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)

  11. Parallel Iterative Matching (PIM) • Steps in One Iteration (PIM1) • Nominate: each input port nominates packets for every output port (same packet nominated multiple times …) • Grant: unmatched output port selects an input port packet randomly • Accept: unselected input port selects a grant randomly [Figure: Nominate, Grant, and Accept steps between input ports 0 and 1 and output ports 0 and 1; output port 0 is unused in this arbitration round]
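The Nominate, Grant, and Accept steps listed on this slide can be sketched as a single function. This is an illustrative sketch only: the function name `pim1`, the request layout, and the use of a seeded `random.Random` are assumptions for the example, and it models one iteration at the port level rather than per packet.

```python
import random


def pim1(requests, num_outputs, rng):
    """One iteration of parallel iterative matching.

    requests[i] lists the output ports that input port i has packets for.
    Returns a dict mapping each matched input port to its output port.
    """
    # Nominate: every input nominates to every output it has a packet for.
    nominations = {out: [] for out in range(num_outputs)}
    for inp, dests in enumerate(requests):
        for out in sorted(set(dests)):
            nominations[out].append(inp)

    # Grant: each output with nominations picks one input at random.
    grants = {}
    for out, inputs in nominations.items():
        if inputs:
            grants.setdefault(rng.choice(inputs), []).append(out)

    # Accept: each input that received grants picks one of them at random.
    return {inp: rng.choice(outs) for inp, outs in grants.items()}


# Hypothetical requests for 3 input ports and 2 output ports.
requests = [[0, 1], [0, 1], [1]]
matches = pim1(requests, 2, random.Random(0))
print(matches)
```

When two outputs grant the same input, that input accepts only one of them and the other output stays idle for the round, which is the situation the slide's figure shows for output port 0.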

  12. PIM1 Advantage & Pipeline + High interaction between input and output ports reduces arbitration collisions & improves # of matches • Algorithm (implemented via connection matrix): (1) Select packet at input port & load matrix (1.5 cycles) (2) Run through matrix and inform input ports (1.5 cycles) (3) Forward arbitration to output ports (1 cycle)

  13. PIM1 Limitations [Pipeline diagram: two arbitrations, each with stages (1), (2), (3) of 1.5, 1.5, and 1 cycles; label: 3 cycles] - Higher number of estimated cycles • 4 cycles in 0.18 micron - Harder to pipeline effectively • restarting (1) before (2) completes is complex • same packet can be nominated multiple times requiring the “Accept” step (part of stage 2) • large in-flight packet table due to large number of nominations (up to 54) • may require multiple copies of matrix to buffer pipeline stages (these must avoid stale nominations)

  14. Simple, Pipelined Arbitration Algorithm (SPAA), used in the Alpha 21364 Router • Algorithm • Nominate: each input port nominates packets for exactly one output port (one packet nominated only once) • Grant: each output port selects an input port packet based on the least-recently selected one • Reset: input ports reset state of all unselected packets and renominate them in subsequent cycles [Figure: input ports 0 and 1 and output ports 0 and 1; labels: Nominate, Grant, Accept, Reset]
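A single SPAA arbitration round, as described on this slide, might look like the sketch below. The name `spaa_round`, the dictionary layout, and the per-output least-recently-selected lists are assumptions made for the example; the real router is pipelined hardware, and the slides do not describe how it stores this state.

```python
def spaa_round(nominations, lru_order):
    """One Nominate/Grant/Reset round.

    nominations: dict input port -> the single output port it nominated to.
    lru_order:   dict output port -> list of input ports, least recently
                 selected first (updated in place).
    Returns (grants, losers): grants maps output port -> winning input port;
    losers are the input ports that must reset and renominate later.
    """
    grants = {}
    for out, order in lru_order.items():
        # Grant: the least recently selected contender wins this output.
        contenders = [inp for inp in order if nominations.get(inp) == out]
        if contenders:
            winner = contenders[0]
            grants[out] = winner
            order.remove(winner)
            order.append(winner)  # winner becomes most recently selected

    # Reset: inputs whose nomination was not granted try again in a later round.
    losers = [inp for inp, out in nominations.items() if grants.get(out) != inp]
    return grants, losers


# Hypothetical state: 3 input ports, 2 output ports.
lru = {0: [0, 1, 2], 1: [0, 1, 2]}
print(spaa_round({0: 0, 1: 0, 2: 1}, lru))  # ({0: 0, 1: 2}, [1])
print(spaa_round({0: 0, 1: 0}, lru))        # ({0: 1}, [0])
```

Because each input nominates to only one output, a granted packet never needs an Accept decision. The cost is that a losing input sits idle for the round even if another output it could have used was free.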

  15. SPAA’s Simplicity • Low degree of interaction among ports - increases arbitration collisions + reduces complexity • Algorithm (no centralized matrix): (1) Select packet at input port & load matrix (1 cycle) (2) Forward packets to output ports (1 cycle) (3) Output ports select packets and return feedback to input ports (1 cycle)

  16. SPAA’s Advantages [Pipeline diagram: two arbitrations, each with stages (1), (2), (3) of 1 cycle each; label: 1 cycle] + Fewer cycles • 3 cycles in 0.18 micron + Speculatively read out input buffer • prior to output port arbitration • because only one packet is nominated to one output port + Easier to pipeline • restart (1) for free input ports before (2) completes • only one packet nominated to one output port • small number (16) of in-flight packets • avoids any centralized matrix • speculative read allows data flits to follow header flits

  17. Summary: Simpler is Better

  18. Saturation Behavior [Chart label: saturation point] • Reasons: Hot spots & tree saturation • 21364’s router shows cyclic pattern (link utilization with time) • Ideally, operate at saturation bandwidth • Solution: throttle input load

  19. Rotary Rule • 21364’s in-built throttling + maximum outstanding cache miss requests per processor = 16 • Rotary Rule: more throttling + 21364 is a “direct” network + Rotary Rule prioritizes traffic in network ports over local ports + also, clears network congestion + relies on anti-starvation mechanism • WFA+Rotary: change first cell • SPAA+Rotary: change output port priority to the Rotary Rule
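One way to picture "SPAA+Rotary" is as a change to how an output port ranks its contenders: inputs coming from the network are considered before local inputs. The sketch below assumes input ports 0 to 3 are the four network ports (the slides give the port counts but not a numbering) and uses a hypothetical helper `rotary_pick`. The anti-starvation mechanism that the slide says the rule relies on is not modeled.

```python
# Assumed numbering: input ports 0-3 are network ports, 4-7 are local
# (memory, cache, I/O). This numbering is for illustration only.
NETWORK_INPUTS = frozenset({0, 1, 2, 3})


def rotary_pick(contenders, lru_order):
    """Pick a winner for one output port under the Rotary Rule.

    contenders: set of input ports nominating to this output.
    lru_order:  input ports, least recently selected first.
    Network-port traffic is ranked ahead of local-port traffic; ties
    within each class fall back to least-recently-selected order.
    """
    network = [p for p in lru_order if p in contenders and p in NETWORK_INPUTS]
    local = [p for p in lru_order if p in contenders and p not in NETWORK_INPUTS]
    ranked = network + local
    return ranked[0] if ranked else None


lru_order = [5, 6, 2, 0, 1, 3, 4, 7]
print(rotary_pick({2, 5}, lru_order))  # 2: network port beats local port 5
print(rotary_pick({5, 6}, lru_order))  # 5: only local contenders, LRU order
```

Giving in-flight network traffic priority drains packets already in the network before new local traffic is injected, which is the throttling effect the slide describes.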

  20. Simulation Methodology • Asim • modeling infrastructure • detailed timing model of 21364 network • selected design points validated against RTL • Traffic Patterns • 70% three coherence hops, 30% two coherence hops • random destinations • other traffic combinations in paper and simulated internally

  21. 64 Node Network: Base Case [Chart label: Knee] • SPAA outperforms WFA & PIM1 • 24% higher throughput at knee

  22. 64 Node Network: With Rotary Rule • Rotary Rule helps both SPAA & WFA

  23. Summary & Conclusions • Arbitration Algorithms • WFA: Wave Front Arbitration Algorithm (Tamir & Chi, 1993, SGI Spider) • PIM1: Parallel Iterative Matching with one iteration (Anderson, et al., ASPLOS 1992) • SPAA: Simple, Pipelined Arbitration Algorithm (21364) • SPAA outperforms WFA & PIM1 + SPAA’s matching power similar to WFA & PIM1 (when many output ports are busy) + SPAA minimizes interactions between ports + SPAA can be pipelined more effectively • Rotary Rule + avoids network saturation under heavy load
