AlphaServer GS320 Architecture & Design

AlphaServer GS320 Architecture & Design • Gharachorloo, Sharma, Steely, and Van Doren • Compaq Research & High-Performance Servers • Published in 2000 (ASPLOS-IX) • Presented by Matt Johnson • CPS221/ECE259, Advanced Computer Arch. II • Duke University, 1/30/08

AlphaServer GS320 Architecture & Design • Sold by HP until 2004, now discontinued

Overview • Design Goals • Architecture (from 10E+3 ft.) • Coherence Protocol • Memory Consistency • Performance • Analysis/Questions

Design Goals • Targeting small/medium multiprocessors • Exploit known (and limited) system size to implement ideas that don't scale well • e.g. protocol optimizations (limited queue sizes) • Avoid the high latency and protocol overhead of traditional directory protocols, and the bandwidth/scalability problems of snooping

Design Goals • RAS (reliability, availability, serviceability) • Modularity (QBBs, we'll get to them in a moment) • Hardware partitions (failure containment) • Efficiency • Tight integration with CPUs (Alpha 21264) • CPU support for coherence/consistency operations • Directory protocol avoids NACKs and stalls

Architecture • Between 4 and 32 Alpha 21264 CPUs • Arranged in Quad-processor Building Blocks • 7M+ ASIC Gates • 4 CPUs • 32GB Memory • 8 PCI Slots • 10-Port Switch

Architecture • 10-Port Local Switch (per QBB) • 4 Processor, 4 Memory, 1 I/O (PCI), 1 Global • 2 QBBs can be connected directly, up to 8 with a global switch • Hardware Coherence Support • DIRectory: 14 bits per 64-byte cache line store owner/sharer info • Duplicate TAG Store keeps a copy of the CPUs' L2 cache tags • Transactions-in-Transit Table keeps track of outstanding transactions from a node (48 entries) • All implemented in ASICs, some supported by the 21264
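The sizing figures on the two Architecture slides can be checked with some quick arithmetic. The sketch below is only illustrative: the constants come from the slides, while the function names are made up here, and it assumes the 32GB-per-QBB figure is fully covered by 14-bit directory entries.

```python
# Figures quoted on the slides; the helper names are hypothetical.
CPUS_PER_QBB = 4
MAX_QBBS = 8               # with a global switch
MEM_PER_QBB_GB = 32
LINE_BYTES = 64
DIR_BITS_PER_LINE = 14


def max_cpus() -> int:
    """Largest configuration: every global-switch port holds a full QBB."""
    return CPUS_PER_QBB * MAX_QBBS


def directory_overhead() -> float:
    """Directory bits as a fraction of the data bits they describe."""
    return DIR_BITS_PER_LINE / (LINE_BYTES * 8)


def directory_bytes_per_qbb() -> int:
    """Directory storage needed to cover one QBB's memory (assumed fully populated)."""
    lines = MEM_PER_QBB_GB * 2**30 // LINE_BYTES
    return lines * DIR_BITS_PER_LINE // 8


if __name__ == "__main__":
    print(max_cpus())
    print(f"{directory_overhead():.2%}")
    print(directory_bytes_per_qbb() // 2**20, "MiB")
```

Under these assumptions the directory costs a little under 3% of the memory it tracks.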

Architecture

Coherence Protocol • 4 Types of Requests • Read (not writing, don't need an exclusive copy) • Read-Exclusive (don't have it, want to write to it) • Exclusive (have a shared copy, want to write to it) • Exclusive-Without-Data • Used when you want to write an entire cache line (64B) • Don't need to transfer the old data in this case
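The four request types above can be read as a small decision function. This is a minimal sketch, assuming a simplified three-state cache line (invalid, shared, exclusive); the state names, the `choose_request` function, and the choice to prefer Exclusive-Without-Data for any full-line write are assumptions for illustration, not the 21264's actual state machine.

```python
from enum import Enum
from typing import Optional


class LineState(Enum):      # hypothetical, simplified cache-line states
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class Request(Enum):        # the four request types named on the slide
    READ = "Read"
    READ_EXCLUSIVE = "Read-Exclusive"
    EXCLUSIVE = "Exclusive"
    EXCLUSIVE_WITHOUT_DATA = "Exclusive-Without-Data"


def choose_request(state: LineState, is_write: bool,
                   writes_full_line: bool = False) -> Optional[Request]:
    """Return the request to issue, or None if the access hits locally."""
    if state is LineState.EXCLUSIVE:
        return None                       # already own the line
    if not is_write:
        # A shared copy is enough to satisfy a read.
        return None if state is LineState.SHARED else Request.READ
    if writes_full_line:
        # All 64 bytes are overwritten, so the old data is not needed.
        return Request.EXCLUSIVE_WITHOUT_DATA
    if state is LineState.SHARED:
        return Request.EXCLUSIVE          # have the data, need ownership
    return Request.READ_EXCLUSIVE         # need both data and ownership
```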

Coherence Protocol • Satisfies all requests w/o NACKs or retries • Blocks at the host • Saves bandwidth • Accomplishes this by "doing the right thing" on the requestee side, transparently to the requester • State machines at nodes can be simple, fast, small • Dependencies are resolved on the outskirts of the system, not by clogging up the core w/ a heavy protocol

Coherence Protocol • Deadlock is prevented by using 3 virtual lanes, plus a separate lane for I/O • Q0 for requests, Q1 for local responses, Q2 for remote responses, QIO for I/O (PCI) transactions • Total ordering required on Q1, point-to-point ordering on Q0/QIO, no requirements on Q2 • Responses are split into 2 parts (↓ latency, ↑ perf.) • Commit (yeah, I heard ya) • Data (except exclusive-without-data requests)
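The lane and ordering rules on this slide can be written down as a lookup table. The sketch below is an assumption-laden illustration: the `LANES` table restates the slide, while `may_reorder` and `response_parts` are hypothetical helpers that interpret "total", "point-to-point", and "no requirements" in the usual interconnect sense.

```python
from enum import Enum


class Ordering(Enum):
    TOTAL = "total"                    # all messages on the lane stay in one order
    POINT_TO_POINT = "point-to-point"  # order kept only between the same two endpoints
    NONE = "none"                      # no ordering requirement


# Lane -> (traffic carried, ordering required), as listed on the slide.
LANES = {
    "Q0": ("requests", Ordering.POINT_TO_POINT),
    "Q1": ("local responses", Ordering.TOTAL),
    "Q2": ("remote responses", Ordering.NONE),
    "QIO": ("I/O (PCI) transactions", Ordering.POINT_TO_POINT),
}


def may_reorder(lane: str, same_endpoints: bool) -> bool:
    """May the network swap two messages travelling on this lane?"""
    ordering = LANES[lane][1]
    if ordering is Ordering.TOTAL:
        return False
    if ordering is Ordering.POINT_TO_POINT:
        return not same_endpoints
    return True


def response_parts(request: str) -> list:
    """A response is a commit plus, unless no data is needed, the data itself."""
    if request == "Exclusive-Without-Data":
        return ["commit"]
    return ["commit", "data"]
```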

Coherence Protocol • Instead of building their protocol to handle the general case, they optimize it for a specific case • e.g. the crossbar local and global switches lend themselves to meeting the ordering requirements • they can delay certain responses because they can bound the latency by a reasonable time

Coherence Protocol

Coherence Protocol • (u,v)=(1,0) should be disallowed (would violate sequential consistency)
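The slide's figure is not reproduced in the text, so the example below assumes the standard two-processor test: P1 writes A=1 then B=1, and P2 reads u=B then v=A, with both locations initially 0. Under that assumption, a brute-force enumeration of every interleaving that respects program order shows which (u,v) results sequential consistency permits. All names here are illustrative.

```python
from itertools import combinations

# Assumed litmus test (not taken from the slide's missing figure).
P1 = [("write", "A", 1), ("write", "B", 1)]
P2 = [("read", "B", "u"), ("read", "A", "v")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's own order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        chosen = set(slots)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]


def sc_outcomes():
    """Collect every (u, v) reachable under sequential consistency."""
    results = set()
    for order in interleavings(P1, P2):
        mem = {"A": 0, "B": 0}
        regs = {}
        for kind, addr, arg in order:
            if kind == "write":
                mem[addr] = arg
            else:
                regs[arg] = mem[addr]
        results.add((regs["u"], regs["v"]))
    return results


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

With this test, (1,0) never appears among the outcomes: seeing the new B but the old A would require P1's writes or P2's reads to take effect out of program order.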

Performance

Performance • 33.5 Gflops on Linpack workload • Supports 2720 users on SAP Benchmark • Higher I/O bandwidth, but similar commercial workload performance to IBM RS/6000 S80 (24-CPU) and Sun 10000 (64-CPU) systems • NUMA + lightweight protocol → ↓ mem. latency • Much better for applications where this matters

Analysis/Questions
