
AlphaServer GS320 Architecture & Design

AlphaServer GS320 Architecture & Design. Gharachorloo, Sharma, Steely, and Van Doren, Compaq Research & High-Performance Servers. Published in 2000 (ASPLOS-IX). Presented by Matt Johnson, CPS221/ECE259, Advanced Computer Arch. II, Duke University, 1/30/08.


Presentation Transcript


  1. AlphaServer GS320 Architecture & Design • Gharachorloo, Sharma, Steely, and Van Doren • Compaq Research & High-Performance Servers • Published in 2000 (ASPLOS-IX) • Presented by Matt Johnson • CPS221/ECE259, Advanced Computer Arch. II • Duke University, 1/30/08

  2. AlphaServer GS320 Architecture & Design • Sold by HP until 2004, now discontinued

  3. Overview • Design Goals • Architecture (from 10E+3 ft.) • Coherence Protocol • Memory Consistency • Performance • Analysis/Questions

  4. Design Goals • Targeting small/medium multiprocessors • Exploit known (and limited) system size to implement ideas that don't scale well • e.g. protocol optimizations (limited queue sizes) • Avoid the high latency and protocol overhead of traditional directory protocols, and the bandwidth/scalability problems of snooping protocols

  5. Design Goals • RAS (reliability, availability, serviceability) • Modularity (QBBs, we'll get to them in a moment) • Hardware partitions (failure containment) • Efficiency • Tight integration with CPUs (Alpha 21264) • CPU support for coherence/consistency operations • Directory Protocol avoids NACKs and stalls

  6. Architecture • Between 4 and 32 Alpha 21264 CPUs • Arranged in Quad-processor Building Blocks (QBBs) • 7M+ ASIC Gates • 4 CPUs • 32 GB Memory • 8 PCI Slots • 10-Port Switch

  7. Architecture • 10-Port Local Switch (per QBB) • 4 Processor, 4 Memory, 1 I/O (PCI), 1 Global • 2 QBBs can be connected directly, up to 8 with a global switch • Hardware Coherence Support • DIRectory: 14 bits per 64-byte cache line, storing owner/sharer info • Duplicate TAG Store copies CPUs' L2 cache tags • Transactions-in-Transit Table keeps track of outstanding transactions from a node (48 entries) • All implemented in ASICs, some supported by 21264
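The directory figures on this slide can be turned into a quick storage estimate. The following is a back-of-the-envelope sketch, not something taken from the paper: it assumes the 32 GB per QBB from the previous slide means 32 GiB, and that every 64-byte line of that memory carries exactly 14 directory bits. The function names are made up for illustration.

```python
LINE_BYTES = 64                 # cache line size (from the slide)
DIR_BITS_PER_LINE = 14          # directory bits per line (from the slide)
QBB_MEMORY_BYTES = 32 * 2**30   # 32 GB per QBB, read here as 32 GiB (assumption)


def directory_overhead_fraction(dir_bits=DIR_BITS_PER_LINE, line_bytes=LINE_BYTES):
    """Directory bits as a fraction of the data bits they describe."""
    return dir_bits / (line_bytes * 8)


def directory_bytes(memory_bytes=QBB_MEMORY_BYTES,
                    dir_bits=DIR_BITS_PER_LINE,
                    line_bytes=LINE_BYTES):
    """Total directory storage needed to cover `memory_bytes` of memory."""
    lines = memory_bytes // line_bytes
    return lines * dir_bits // 8


if __name__ == "__main__":
    print(f"overhead: {directory_overhead_fraction():.2%}")
    print(f"directory per QBB: {directory_bytes() / 2**20:.0f} MiB")
```

Under these assumptions the directory costs a little under 3% of memory capacity, which is the kind of overhead a fixed, small maximum system size makes affordable.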

  8. Architecture

  9. Coherence Protocol • 4 Types of Requests • Read (not writing, don't need an exclusive copy) • Read-Exclusive (don't have it, want to write to it) • Exclusive (have a shared copy, want to write to it) • Exclusive-Without-Data • Used when you want to write an entire cache line (64B) • Don't need to transfer the old data in this case
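The choice among the four request types can be sketched as a small decision function. This is an illustrative reading of the slide, not the actual 21264 or GS320 logic: the `LineState` values, the `choose_request` name, and the decision to use Exclusive-Without-Data for any full-line write regardless of current state are all assumptions.

```python
from enum import Enum


class LineState(Enum):
    """Simplified local cache state (hypothetical three-state model)."""
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"


class Request(Enum):
    READ = "Read"
    READ_EXCLUSIVE = "Read-Exclusive"
    EXCLUSIVE = "Exclusive"
    EXCLUSIVE_WITHOUT_DATA = "Exclusive-Without-Data"


def choose_request(state, is_write, writes_full_line=False):
    """Return the request to issue, or None if the access hits locally."""
    if state is LineState.EXCLUSIVE:
        return None                       # already own the line
    if not is_write:
        # Loads hit on a shared copy; otherwise fetch a readable copy.
        return None if state is LineState.SHARED else Request.READ
    if writes_full_line:
        # Whole 64B line is overwritten, so the old data is not needed.
        return Request.EXCLUSIVE_WITHOUT_DATA
    if state is LineState.SHARED:
        return Request.EXCLUSIVE          # have the data, need ownership only
    return Request.READ_EXCLUSIVE         # need both data and ownership


if __name__ == "__main__":
    print(choose_request(LineState.INVALID, is_write=True))
    print(choose_request(LineState.SHARED, is_write=True))
```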

  10. Coherence Protocol • Satisfies all requests w/o NACKs or retries • Blocks at the host • Saves bandwidth • Accomplishes this by "doing the right thing" on the requestee side, transparently to the requester • State machines at nodes can be simple, fast, small • Dependencies are resolved on the outskirts of the system, not by clogging up the core w/ a heavy protocol

  11. Coherence Protocol • Deadlock is prevented by using 3 virtual lanes (Q0, Q1, Q2), plus a QIO lane for I/O • Q0 for requests, Q1 for local responses, Q2 for remote responses, QIO for I/O (PCI) transactions • Total ordering required on Q1, Point-to-Point ordering on Q0/QIO, no requirements on Q2 • Split responses into 2 parts (↓Latency, ↑Perf.) • Commit (yeah, I heard ya) • Data (except exclusive-without-data requests)
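The lane-to-ordering mapping on this slide can be written down as a small table. The sketch below only encodes what the slide states; the `may_reorder` helper and its interpretation of "point-to-point" (order preserved only between the same source and destination) are illustrative assumptions, not the switch implementation.

```python
from enum import Enum


class Lane(Enum):
    Q0 = "requests"
    Q1 = "local responses"
    Q2 = "remote responses"
    QIO = "I/O (PCI) transactions"


class Ordering(Enum):
    TOTAL = "total"
    POINT_TO_POINT = "point-to-point"
    NONE = "none"


# Ordering requirement per lane, as listed on the slide.
LANE_ORDERING = {
    Lane.Q0: Ordering.POINT_TO_POINT,
    Lane.Q1: Ordering.TOTAL,
    Lane.Q2: Ordering.NONE,
    Lane.QIO: Ordering.POINT_TO_POINT,
}


def may_reorder(lane, same_src_dst):
    """True if two messages on `lane` may be delivered out of order.

    `same_src_dst` says whether both messages share source and destination.
    """
    ordering = LANE_ORDERING[lane]
    if ordering is Ordering.TOTAL:
        return False                 # every observer sees one global order
    if ordering is Ordering.POINT_TO_POINT:
        return not same_src_dst      # order kept only along a single path
    return True                      # no ordering requirement at all


if __name__ == "__main__":
    for lane in Lane:
        print(lane.name, LANE_ORDERING[lane].value)
```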

  12. Coherence Protocol • Instead of building their protocol to handle the general case, they optimize it for a specific case • e.g. the crossbar local and global switches lend themselves to meeting the ordering requirements • they can delay certain responses because they can bound the latency by a reasonable time

  13. Coherence Protocol

  14. Coherence Protocol • (u,v)=(1,0) should be disallowed (would violate sequential consistency)
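The figure this slide refers to did not survive in the transcript, so the exact program is unknown. As a sketch, assume the classic message-passing litmus test: P1 executes A=1; B=1 and P2 executes u=B; v=A, with A and B initially 0. Enumerating every sequentially consistent interleaving of that assumed program shows why (u,v)=(1,0) cannot occur.

```python
from itertools import combinations

# Assumed litmus test (not recovered from the slide): A = B = 0 initially.
P1 = [("store", "A", 1), ("store", "B", 1)]
P2 = [("load", "B", "u"), ("load", "A", "v")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each program's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(n)]


def sc_outcomes():
    """Return the set of (u, v) results reachable under sequential consistency."""
    outcomes = set()
    for order in interleavings(P1, P2):
        mem = {"A": 0, "B": 0}
        regs = {}
        for op, loc, arg in order:
            if op == "store":
                mem[loc] = arg        # arg is the value written
            else:
                regs[arg] = mem[loc]  # arg is the destination register
        outcomes.add((regs["u"], regs["v"]))
    return outcomes


if __name__ == "__main__":
    print(sorted(sc_outcomes()))
```

Seeing u=1 means the store to B has already happened, and since A=1 precedes it in P1's program order, the later load of A must return 1.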

  15. Performance

  16. Performance • 33.5 GFLOPS on Linpack workload • Supports 2720 users on SAP Benchmark • Higher I/O bandwidth than, but similar commercial workload performance to, IBM RS/6000 S80 (24-CPU) and Sun 10000 (64-CPU) systems • NUMA + Lightweight Protocol → ↓Mem. Latency • Much better for applications where this matters

  17. Analysis/Questions
