AlphaServer GS320 Architecture & Design • Gharachorloo, Sharma, Steely, and Van Doren • Compaq Research & High-Performance Servers • Published in 2000 (ASPLOS-IX) • Presented by Matt Johnson • CPS221/ECE259, Advanced Computer Arch. II • Duke University, 1/30/08
AlphaServer GS320 Architecture & Design • Sold by HP until 2004, now discontinued
Overview • Design Goals • Architecture (from 10E+3 ft.) • Coherence Protocol • Memory Consistency • Performance • Analysis/Questions
Design Goals • Targeting small/medium multiprocessors • Exploit known (and limited) system size to implement ideas that don't scale well • e.g. protocol optimizations (limited queue sizes) • Avoid the high latency and protocol overhead of traditional directory protocols, and the bandwidth/scalability problems of snooping
Design Goals • RAS (reliability, availability, serviceability) • Modularity (QBBs, we'll get to them in a moment) • Hardware partitions (failure containment) • Efficiency • Tight integration with CPUs (Alpha 21264) • CPU support for coherence/consistency operations • Directory Protocol avoids NACKs and stalls
Architecture • Between 4 and 32 Alpha 21264 CPUs • Arranged in Quad-processor Building Blocks • 7M+ ASIC Gates • 4 CPUs • 32GB Memory • 8 PCI Slots • 10-Port Switch
Architecture • 10-Port Local Switch (per QBB) • 4 Processor, 4 Memory, 1 I/O (PCI), 1 Global • 2 QBBs can be connected directly, up to 8 with a global switch • Hardware Coherence Support • DIRectory: 14 bits per 64-byte cache line stores owner/sharer info • Duplicate TAG Store copies CPUs' L2 cache tags • Transactions-in-Transit Table keeps track of outstanding transactions from a node (48 entries) • All implemented in ASICs, some supported by the 21264
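To get a feel for the storage cost of the directory, here is a quick back-of-the-envelope calculation (a sketch; the 14-bit entry and 64-byte line figures are from the slide, and applying them to the 32GB per-QBB memory quoted above is my own arithmetic, not a number from the paper):

```python
# Back-of-the-envelope directory storage cost for the GS320 scheme:
# 14 directory bits per 64-byte cache line (figures from the slide).
LINE_SIZE_BYTES = 64
DIR_BITS_PER_LINE = 14

overhead = DIR_BITS_PER_LINE / (LINE_SIZE_BYTES * 8)
print(f"Directory overhead: {overhead:.2%}")   # ~2.73% of data storage

# Applied to the 32GB of memory in one QBB:
mem_bytes = 32 * 2**30
lines = mem_bytes // LINE_SIZE_BYTES
dir_bytes = lines * DIR_BITS_PER_LINE // 8
print(f"Directory size for 32GB: {dir_bytes / 2**20:.0f} MiB")
```

The small fixed entry (14 bits) is only possible because the maximum system size is known: with at most 8 nodes, owner/sharer state fits in a few bits, which is exactly the "exploit limited system size" design goal.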
Coherence Protocol • 4 Types of Requests • Read (not writing, don't need an exclusive copy) • Read-Exclusive (don't have it, want to write to it) • Exclusive (have a shared copy, want to write to it) • Exclusive-Without-Data • Used when you want to write an entire cache line (64B) • Don't need to transfer the old data in this case
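The payoff of Exclusive-Without-Data is easy to see in a toy model (a sketch; the names and dictionary encoding are illustrative, not the GS320's actual message format):

```python
# Toy model of the four request types and whether each needs the old
# line's data shipped to the requester (names are illustrative).
FULL_LINE = 64  # bytes per cache line

REQUEST_TYPES = {
    "Read":                 {"wants_write": False, "needs_data": True},
    "ReadExclusive":        {"wants_write": True,  "needs_data": True},
    "Exclusive":            {"wants_write": True,  "needs_data": False},  # requester already has a shared copy
    "ExclusiveWithoutData": {"wants_write": True,  "needs_data": False},  # requester will overwrite the whole line
}

def data_bytes_transferred(req_type):
    """Bytes of line data moved to satisfy one request in this toy model."""
    return FULL_LINE if REQUEST_TYPES[req_type]["needs_data"] else 0

# Writing a full 64-byte line: ExclusiveWithoutData skips the useless transfer.
print(data_bytes_transferred("ReadExclusive"))         # 64
print(data_bytes_transferred("ExclusiveWithoutData"))  # 0
```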
Coherence Protocol • Satisfies all requests w/o NACKs or retries • Blocks at the home node • Saves bandwidth • Accomplishes this by "doing the right thing" on the requestee side, transparently to the requester • State machines at nodes can be simple, fast, and small • Dependencies are resolved on the outskirts of the system, not by clogging up the core with a heavy protocol
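A minimal sketch of the "block rather than NACK" idea (this models only the serialization of conflicting requests at the home node, under my own simplified assumptions, not the real GS320 message protocol):

```python
from collections import defaultdict, deque

class HomeNode:
    """Toy home node: requests to a busy line are queued in arrival
    order instead of being NACKed, so no requester ever retries."""
    def __init__(self):
        self.busy = set()                  # lines with a transaction in flight
        self.waiting = defaultdict(deque)  # per-line queue of blocked requests
        self.serviced = []                 # order requests were actually serviced

    def request(self, line, requester):
        if line in self.busy:
            self.waiting[line].append(requester)  # block at home, no NACK
        else:
            self.busy.add(line)
            self.serviced.append((line, requester))

    def complete(self, line):
        """Current transaction on `line` finished; wake the next waiter."""
        if self.waiting[line]:
            self.serviced.append((line, self.waiting[line].popleft()))
        else:
            self.busy.discard(line)

home = HomeNode()
home.request("A", "P0")  # serviced immediately
home.request("A", "P1")  # conflicts: queued, not NACKed
home.request("A", "P2")
home.complete("A")       # P1 now serviced, in arrival order
home.complete("A")       # then P2
print(home.serviced)     # [('A', 'P0'), ('A', 'P1'), ('A', 'P2')]
```

Because conflicts wait at the edge instead of bouncing retries through the interconnect, the core of the system never sees the same request twice, which is where the bandwidth saving comes from.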
Coherence Protocol • Deadlock is prevented by using 3 virtual lanes (plus a separate I/O lane) • Q0 for requests, Q1 for local responses, Q2 for remote responses, QIO for I/O (PCI) transactions • Total ordering required on Q1, point-to-point ordering on Q0/QIO, no requirements on Q2 • Split responses into 2 parts (↓ Latency, ↑ Perf.) • Commit (yeah, I heard ya) • Data (except for exclusive-without-data requests)
Coherence Protocol • Instead of building their protocol to handle the general case, they optimize it for a specific case • e.g. the crossbar local and global switches lend themselves to meeting the ordering requirements • they can delay certain responses because they can bound the latency by a reasonable time
Coherence Protocol • Example from the paper: the outcome (u,v)=(1,0) should be disallowed (it would violate sequential consistency)
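The slide refers to a figure from the paper that is not reproduced here; a plausible reconstruction is the classic message-passing litmus test (P1 writes A then B; P2 reads B into u, then A into v). Enumerating every sequentially consistent interleaving shows (u,v)=(1,0) can never occur under SC (a sketch; the two-thread program below is my assumption, not necessarily the paper's exact figure):

```python
from itertools import permutations

# P1: A = 1; B = 1        P2: u = B; v = A
def run(schedule):
    """Execute one interleaving atomically, op by op; return (u, v)."""
    mem = {"A": 0, "B": 0}
    regs = {}
    ops = {
        ("P1", 0): lambda: mem.__setitem__("A", 1),
        ("P1", 1): lambda: mem.__setitem__("B", 1),
        ("P2", 0): lambda: regs.__setitem__("u", mem["B"]),
        ("P2", 1): lambda: regs.__setitem__("v", mem["A"]),
    }
    for step in schedule:
        ops[step]()
    return (regs["u"], regs["v"])

# All interleavings of the 4 ops that keep each thread's program order.
steps = [("P1", 0), ("P1", 1), ("P2", 0), ("P2", 1)]
outcomes = set()
for perm in permutations(steps):
    p1 = [s for s in perm if s[0] == "P1"]
    p2 = [s for s in perm if s[0] == "P2"]
    if p1 == [("P1", 0), ("P1", 1)] and p2 == [("P2", 0), ("P2", 1)]:
        outcomes.add(run(perm))

print(sorted(outcomes))  # (1, 0) never appears under SC
```

Intuitively: if P2 saw B=1 (so u=1), then A=1 happened even earlier in the interleaving, so the later read of A must also return 1. The GS320's early-commit optimization must preserve exactly this guarantee.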
Performance • 33.5 Gflops on the Linpack benchmark • Supports 2,720 users on the SAP benchmark • Higher I/O bandwidth, but similar commercial-workload performance to IBM RS/6000 S80 (24-CPU) and Sun Enterprise 10000 (64-CPU) systems • NUMA + lightweight protocol → ↓ memory latency • Much better for applications where this matters