Explore the architecture and design of the AlphaServer GS320 by Compaq, focusing on overcoming snooping protocol limitations and directory structure inefficiencies for mid-range multiprocessors, with a detailed overview and solutions for reducing latency and improving memory consistency.
Architecture and Design of the AlphaServer GS320 Gharachorloo, et al. (Compaq) Presented by Curt Harting http://h18002.www1.hp.com/alphaserver/gs320/
Motivation • Make money – server revenue at the time was in 4- to 64-processor systems • Snooping protocols work really well on small systems (<8 processors) but don’t scale well • Directory structures at the time were made for large (>64 processors) systems, and were too slow for mid-range multiprocessors
The problems • Snooping • Limited by bandwidth • Too much for each controller to do per cycle • Directories • Long latency • Too much glue (Amdahl’s Law)
Overview • 32- or 64-processor directory machine • 8 Quad-Processor Building Blocks (QBBs) connected by a crossbar • Each QBB has: • 4 processors (with external L2) • 4 memory modules • 1 I/O interface • 1 Global Port • DTAG (duplicate tag store) • DIR (directory, 14 bits per line) • TTT (transactions-in-transit table) • 4 request types: read, read-exclusive (readX), exclusive (X), exclusive without data
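To make the 14-bit DIR entry concrete, here is a minimal Python sketch of how such an entry might be packed. The split into a 6-bit owner field plus one sharer bit per QBB is an assumption for illustration, as is the 64-byte line size used for the overhead figure; the slide only gives the 14-bit total and the count of 8 QBBs. The function names are hypothetical.

```python
QBB_COUNT = 8                  # from the slide: 8 quad-processor building blocks
OWNER_BITS = 6                 # assumption: width of the owner field
SHARER_BITS = QBB_COUNT        # assumption: one sharer bit per QBB
ENTRY_BITS = OWNER_BITS + SHARER_BITS  # 14 bits, matching the slide


def pack_entry(owner, sharers):
    """Pack an owner id and a collection of sharing QBB numbers into one entry."""
    if not 0 <= owner < (1 << OWNER_BITS):
        raise ValueError("owner id does not fit in the owner field")
    vector = 0
    for qbb in sharers:
        if not 0 <= qbb < QBB_COUNT:
            raise ValueError("unknown QBB number")
        vector |= 1 << qbb
    return (owner << SHARER_BITS) | vector


def unpack_entry(entry):
    """Return (owner, sorted list of sharing QBB numbers)."""
    owner = entry >> SHARER_BITS
    vector = entry & ((1 << SHARER_BITS) - 1)
    return owner, [q for q in range(QBB_COUNT) if vector & (1 << q)]


def directory_overhead(line_bytes=64):
    """Directory bits per data bit, for an assumed line size."""
    return ENTRY_BITS / (line_bytes * 8)
```

Under these assumptions the directory costs under 3% extra storage per line, which is one reason a per-QBB bit vector is attractive at this system size.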
Reducing Latency • No waiting for invalidated copies to ACK on a GETX • No Nack’ing • Directory updates state as soon as the request arrives • Dirty-Sharing • NUMA
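The "no NACKs, update the directory on arrival" idea can be sketched as a toy home directory in Python. This is an illustrative model, not the GS320 implementation: the class, the message tuple layout `(lane, kind, destination, requester)`, and the message names are all hypothetical.

```python
class Directory:
    """Toy home directory that never NACKs a read-exclusive request."""

    def __init__(self):
        self.owner = {}  # line address -> owning node; absent means memory owns it

    def read_exclusive(self, line, requester):
        previous = self.owner.get(line)
        # State changes as soon as the request arrives, before any data moves.
        self.owner[line] = requester
        if previous is None:
            # Memory holds the line: home replies with data directly.
            return ("Q1", "data_reply", requester, requester)
        # Otherwise forward to whoever the directory currently names as owner,
        # even if that node is still waiting for its own data.
        return ("Q1", "forward_readx", previous, requester)


home = Directory()
first = home.read_exclusive(0x40, "A")   # served from memory
second = home.read_exclusive(0x40, "B")  # forwarded to A
third = home.read_exclusive(0x40, "C")   # forwarded to B, which may not have data yet
```

Because the directory names B as owner before B has received the line, the third request can reach B early. The protocol resolves that case at the owner rather than retrying at the home.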
The Three Lane Information Super-Highway • Information is passed on three virtual lanes (and an I/O lane). • Q0: Carries a message from a processor to the block’s home • Point-to-point ordering must occur • Q1: Carries messages from the home • Point of serialization! Must have total order • Q2: Replies/data
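The lane assignment and the Q1 total-order requirement can be illustrated with a short Python sketch. It assumes a single hypothetical serialization point that stamps each Q1 message once and then delivers it to every destination; the message kinds in `LANE_FOR` and all class and function names are made up for the example.

```python
from collections import deque
from itertools import count

# Which lane each (hypothetical) message kind travels on, following the slide.
LANE_FOR = {
    "request": "Q0",      # processor -> block's home
    "forward": "Q1",      # home -> owner or sharers
    "home_reply": "Q1",   # home -> requester
    "owner_reply": "Q2",  # reply/data from a third party
}


class Q1Switch:
    """Toy serialization point: one global stamp per Q1 message."""

    def __init__(self, nodes):
        self._stamps = count()
        self.inbox = {node: deque() for node in nodes}

    def send(self, payload, destinations):
        stamp = next(self._stamps)
        for node in destinations:
            self.inbox[node].append((stamp, payload))
        return stamp


def sees_total_order(switch):
    """True if every node receives its Q1 messages in increasing stamp order."""
    for queue in switch.inbox.values():
        stamps = [stamp for stamp, _ in queue]
        if stamps != sorted(stamps):
            return False
    return True


switch = Q1Switch(["qbb0", "qbb1", "qbb2"])
switch.send("invalidate line 0x40", ["qbb0", "qbb1"])
switch.send("forward read 0x80", ["qbb1", "qbb2"])
```

Any two nodes that both receive a pair of Q1 messages see them in the same relative order here, because the stamp is assigned once before fan-out. That is the property the slide calls total order.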
An example • Reproduction of Figure 2d
Caveats • Early request race – request gets to the owner before the data does • Solution: Stall the Q1 lane until the data arrives • Late request race – request for data arrives after a writeback operation • Solution: Buffer the victim until a writeback ACK is received • Intra-node transactions – Check the TTT, possible loop through the global port • Markers – Used to preserve global order
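The late request race fix can be sketched as a victim buffer in Python. This is a simplified model under assumed names (`CacheNode`, `evict`, `writeback_ack`, `forwarded_request` are all hypothetical): an evicted dirty line stays readable until the home acknowledges the writeback, so a forwarded request that arrives in between can still be served.

```python
class CacheNode:
    """Toy cache that keeps evicted dirty lines until the writeback is ACKed."""

    def __init__(self):
        self.lines = {}    # address -> data currently cached
        self.victims = {}  # address -> data evicted but not yet ACKed by home

    def evict(self, addr):
        data = self.lines.pop(addr)
        self.victims[addr] = data          # hold a copy until the ACK
        return ("writeback", addr, data)   # message sent toward the home

    def writeback_ack(self, addr):
        # Home now has the data, so the buffered copy can be dropped.
        self.victims.pop(addr, None)

    def forwarded_request(self, addr):
        if addr in self.lines:
            return self.lines[addr]
        if addr in self.victims:
            # Late request: the line already left the cache, but the
            # victim buffer still holds it.
            return self.victims[addr]
        raise LookupError("no copy of this line; home should not have forwarded")


node = CacheNode()
node.lines[0x100] = "dirty data"
message = node.evict(0x100)
late_reply = node.forwarded_request(0x100)  # arrives before the ACK
node.writeback_ack(0x100)
```

After the ACK, the home serves later requests from memory, so the node no longer needs the buffered copy.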
Memory Consistency • A quick, very high-level overview: • Separation of data and requests • The previously atomic response has been split into two parts: the commit and the data • Still many rules about what can proceed when
Questions • The total ordering of the Q1 lane “comes naturally in a crossbar switch”? • The GS320 is said to be expandable to 64 processors, but the system detailed in the paper is tailored to 32 processors. How easily can it be expanded? • Addressing has been a major issue in other papers, but it is not discussed in this one. Why?