Explore the Alpha 21364 network architecture designed for communication-intensive server applications, high-performance computing, and server environments. This detailed overview includes specifications, design goals, and key features.
The Alpha 21364 Network Architecture • By Shubhendu S. Mukherjee, Peter Bannon, Steven Lang, Aaron Spink, and David Webb • Compaq Computer Corporation • Presented by Luis Alfredo Campos
Alpha 21364 Goals • Support communication-intensive server applications • High performance technical computing • Database servers • Web servers • Telecommunication applications • Achieve: • Extremely low latency • Enormous bandwidth • Support directory cache coherence • Improve: • Reliability • Availability
Overview • Alpha 21264 core with enhancements • Tightly coupled multiprocessor network • Connects up to 128 processors • Two-dimensional torus network • Integrated L2 cache • Integrated memory controller • Router • Directory-based cache coherence • Separate virtual channels • Packet classes
Network Packet Classes • Seven packet classes • Request (3 flits) • Forward (3 flits) • Block Response (18 or 19 flits) • Non-Block Response (2 or 3 flits) • Write I/O (19 flits) • Read I/O (3 flits) • Special (1 or 3 flits) • Each flit carries 32 bits of data plus 7 bits of ECC
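The flit counts on this slide translate directly into packet sizes on the wire. The following is a minimal sketch of that arithmetic; the names `PACKET_CLASS_FLITS` and `packet_bits` are illustrative and do not come from the presentation.

```python
# Flit layout as stated on the slide: 32 data bits plus 7 ECC bits.
FLIT_DATA_BITS = 32
FLIT_ECC_BITS = 7
FLIT_BITS = FLIT_DATA_BITS + FLIT_ECC_BITS

# Allowed packet lengths, in flits, for each of the seven classes.
PACKET_CLASS_FLITS = {
    "Request": (3,),
    "Forward": (3,),
    "Block Response": (18, 19),
    "Non-Block Response": (2, 3),
    "Write I/O": (19,),
    "Read I/O": (3,),
    "Special": (1, 3),
}


def packet_bits(flits: int) -> int:
    """Total bits on the link for a packet of the given flit count."""
    return flits * FLIT_BITS


def ecc_fraction() -> float:
    """Fraction of every flit that is spent on ECC."""
    return FLIT_ECC_BITS / FLIT_BITS


if __name__ == "__main__":
    for name, lengths in PACKET_CLASS_FLITS.items():
        sizes = ", ".join(str(packet_bits(n)) for n in lengths)
        print(f"{name}: {sizes} bits")
```

Under these numbers a 3-flit Request occupies 117 bits on a link, and a 19-flit Write I/O packet occupies 741 bits.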
Network Architecture • Two-dimensional torus • Limited Support for Imperfect Tori • Allows Fault Remapping • Virtual Cut-Through Routing • Buffer space for 316 packets
Adaptive Routing • The current and destination routers sit at diagonally opposite corners of four possible rectangles in the torus • Packets route within the minimum rectangle • Maximizes the bandwidth between source and destination
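Staying inside the minimum rectangle amounts to taking, in each dimension, the shorter way around the ring. The sketch below shows one way to compute the candidate next hops under that reading. The function names, the (x, y) coordinate scheme, and the tie-break toward the positive direction are assumptions made for illustration, not details from the presentation.

```python
def ring_step(cur: int, dst: int, size: int) -> int:
    """Return +1, -1, or 0: the direction of the shorter way around a ring.

    Ties (destination exactly half way around) go to +1; that tie-break
    is an assumption of this sketch.
    """
    if cur == dst:
        return 0
    forward = (dst - cur) % size   # hops when moving in the + direction
    backward = size - forward      # hops when moving in the - direction
    return 1 if forward <= backward else -1


def productive_hops(cur, dst, width, height):
    """Neighbors of `cur` that stay inside the minimum rectangle to `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    hops = []
    sx = ring_step(cx, dx, width)
    sy = ring_step(cy, dy, height)
    if sx:
        hops.append(((cx + sx) % width, cy))
    if sy:
        hops.append((cx, (cy + sy) % height))
    return hops


if __name__ == "__main__":
    # On a 4x4 torus, (0, 0) reaches (3, 1) fastest by wrapping in x.
    print(productive_hops((0, 0), (3, 1), 4, 4))
```

An adaptive router would then choose among the returned hops, for example by preferring the less congested output port. A packet that has already matched one coordinate has only one productive hop left.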
Avoiding Deadlocks in Adaptive Routing • “Adaptive routing will not deadlock a network as long as packets can drain via a deadlock-free path” • 19 virtual channels • Three virtual channels per packet class, except the Special class (only one channel) • Adaptive, VC0, and VC1 • Adaptive is the first choice • The VC0 and VC1 combination creates a deadlock-free network
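The total of 19 follows from six packet classes with three channels each plus the single Special channel. The sketch below enumerates the channels and shows the "adaptive first, deadlock-free fallback" preference. The slide does not say how a packet chooses between VC0 and VC1, so `pick_channel` takes that choice as a parameter; both function names and the `"single"` label are hypothetical.

```python
# The six classes that get an Adaptive, VC0, and VC1 channel each.
THREE_CHANNEL_CLASSES = [
    "Request",
    "Forward",
    "Block Response",
    "Non-Block Response",
    "Write I/O",
    "Read I/O",
]
CHANNEL_KINDS = ("Adaptive", "VC0", "VC1")


def virtual_channels():
    """List every (packet class, channel) pair described on the slide."""
    channels = [(c, k) for c in THREE_CHANNEL_CLASSES for k in CHANNEL_KINDS]
    channels.append(("Special", "single"))  # Special has only one channel
    return channels


def pick_channel(packet_class, adaptive_available, escape="VC0"):
    """Prefer the adaptive channel; otherwise use the deadlock-free one.

    `escape` must be "VC0" or "VC1". The rule for choosing between them
    is not given in the source, so the caller supplies it.
    """
    if packet_class == "Special":
        return ("Special", "single")
    if adaptive_available:
        return (packet_class, "Adaptive")
    if escape not in ("VC0", "VC1"):
        raise ValueError("escape must be 'VC0' or 'VC1'")
    return (packet_class, escape)


if __name__ == "__main__":
    print(len(virtual_channels()))  # 6 * 3 + 1
```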
Router Architecture • 9 pipeline types • Input and Output: Local, Interprocessor, and I/O • Pin-to-pin latency of 13 cycles • Running at 1.2 GHz • Network links run 33% slower • Running at 0.8 GHz • Synchronous with outgoing links • Asynchronous with incoming links
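The clock figures on this slide can be checked with a few lines of arithmetic. This sketch assumes the 13-cycle pin-to-pin latency is counted in cycles of the 1.2 GHz router clock; the helper names are illustrative.

```python
CORE_GHZ = 1.2          # router clock from the slide
LINK_GHZ = 0.8          # network link clock from the slide
PIN_TO_PIN_CYCLES = 13  # assumed to be router-clock cycles


def cycles_to_ns(cycles: int, ghz: float) -> float:
    """Convert a cycle count to nanoseconds at the given clock rate."""
    return cycles / ghz


def relative_slowdown(core_ghz: float, link_ghz: float) -> float:
    """How much slower the link clock is, as a fraction of the core clock."""
    return 1 - link_ghz / core_ghz


if __name__ == "__main__":
    print(f"{cycles_to_ns(PIN_TO_PIN_CYCLES, CORE_GHZ):.2f} ns pin to pin")
    print(f"{relative_slowdown(CORE_GHZ, LINK_GHZ):.1%} slower links")
```

Under that assumption the pin-to-pin latency is about 10.8 ns, and the 0.8 GHz links are one third slower than the 1.2 GHz router, which matches the 33% figure.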
Arbitration • Needs to avoid central bottleneck • 16 local arbiters • 7 global arbiters • Least Recently Selected (LRS) Scheme • Local Arbiters • Classes • Virtual Channel • Global Arbiters • Input ports • Rotary Rule mode • Priority to oldest packets • Coherence Dependence Priority (CDP) Rule mode • Priority depending on class ordering
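A Least Recently Selected policy can be modeled as a queue of requesters ordered by how long ago each last won. The sketch below is a generic software illustration of that idea, not the 21364's hardware implementation; the class name `LRSArbiter` and the port labels are hypothetical.

```python
class LRSArbiter:
    """Grant the requester that was selected least recently."""

    def __init__(self, ports):
        # Front of the list = least recently selected.
        self._order = list(ports)

    def grant(self, requesting):
        """Return the winning port among `requesting`, or None if empty."""
        for port in self._order:
            if port in requesting:
                # The winner becomes the most recently selected.
                self._order.remove(port)
                self._order.append(port)
                return port
        return None


if __name__ == "__main__":
    arbiter = LRSArbiter(["north", "south", "east", "west"])
    print(arbiter.grant({"south", "east"}))  # south waited longest
    print(arbiter.grant({"south", "east"}))  # east wins this time
```

Because each winner moves to the back of the order, a port that keeps requesting cannot be starved by the others.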
Questions • How Is the 1.2 GHz Internal/800 MHz External Clock OK? • Why 2-d Torus? • What Are the Limitations Imposed?