Architecture and Design of AlphaServer GS320 Presented by Vijeta Johri 02/13/04
Motivation • Huge demand for small- and medium-scale multiprocessors compared to larger servers • Scarcity of applications and operating systems that scale to large processor counts • Achieving high reliability and fault containment is difficult • GS320 is targeted at medium-scale multiprocessing • Takes advantage of the smaller system size • Eliminates inefficiencies of traditional directory-based protocols
AlphaServer GS320 Architecture • Hierarchical shared-memory multiprocessor built from up to 8 QBBs (quad building blocks) connected by a global switch • Each QBB contains: • 10-port local switch • 4 Alpha 21264 processors, each with separate on-chip I & D caches and an external cache • 4 memory modules (1-8 GB of SDRAM) • I/O interface supporting 8 PCI buses
AlphaServer GS320 Architecture • QBBs (contd.) • DIR (directory) • 14-bit entry per 64-byte memory line • 6-bit owner field • 8-bit coarse vector with QBB granularity (see the sketch below) • Dirty sharing supported • DTAG (duplicate tag store) • Functions as a centralized full-map directory for the caches within a QBB • Maintains coherence within the QBB • TTT (transaction-in-transit table) • 48-entry associative table tracking transactions in progress at the node • Global switch • Supports virtual lanes & multicast
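The 14-bit directory entry can be pictured as a small bit-field record: a 6-bit owner plus an 8-bit coarse vector with one bit per QBB. The C sketch below is only an illustration of the information the entry holds; the field layout and names are my assumptions, not the actual hardware encoding.

```c
#include <stdio.h>

#define NUM_QBBS        8    /* the 8-bit coarse vector has one bit per QBB */
#define LINE_SIZE_BYTES 64   /* one directory entry per 64-byte memory line */

/* Hypothetical layout of the 14-bit entry: 6-bit owner field plus an
 * 8-bit coarse sharing vector at QBB granularity. */
typedef struct {
    unsigned owner   : 6;  /* owning processor/memory/IO                */
    unsigned sharers : 8;  /* bit i set => some cache in QBB i may share */
} dir_entry_t;

/* Record that QBB 'qbb' may hold a shared copy of the line. */
static void dir_add_sharer(dir_entry_t *e, unsigned qbb) {
    if (qbb < NUM_QBBS)
        e->sharers |= 1u << qbb;
}

int main(void) {
    dir_entry_t e = { .owner = 5, .sharers = 0 };
    dir_add_sharer(&e, 2);
    dir_add_sharer(&e, 7);
    printf("owner=%u sharers=0x%02x\n",
           (unsigned)e.owner, (unsigned)e.sharers);
    return 0;
}
```

Because the coarse vector only records sharing at QBB granularity, an invalidation sent to a QBB is filtered by that QBB's DTAG, which knows which local processor caches actually hold the line.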
Cache Coherence • Goals • Make common transactions efficient • Exploit the small system size and the ordering properties of the interconnect • Reduce protocol messages and resource occupancy • Invalidation-based protocol • 4 request types (sketched below) • Read • Read-exclusive • Exclusive • Exclusive-without-data • Reply forwarding from remote owners • Eager exclusive replies (ownership granted before invalidations are acknowledged)
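The sketch below models the four request types and the home node's reply-forwarding behavior in C. The line-state fields and the handler are my own simplification of an invalidation-based directory protocol, not the GS320's actual control tables.

```c
#include <stdbool.h>
#include <stdio.h>

/* The four request types listed above. */
typedef enum { REQ_READ, REQ_READ_EXCLUSIVE, REQ_EXCLUSIVE,
               REQ_EXCLUSIVE_WITHOUT_DATA } req_type_t;

/* Simplified directory view of one line at the home node. */
typedef struct {
    bool     dirty_remote;   /* a remote owner holds the latest data */
    unsigned owner;          /* valid only when dirty_remote is true */
    unsigned sharer_mask;    /* coarse vector of sharing QBBs        */
} line_state_t;

/* The home either replies itself (eager exclusive reply / data from
 * memory) or forwards to the owner, which then replies directly to
 * the requestor (reply forwarding). */
static void home_handle(req_type_t t, line_state_t *s, unsigned requestor) {
    if (s->dirty_remote) {
        printf("forward req %d to owner %u; owner replies to %u\n",
               (int)t, s->owner, requestor);
    } else {
        printf("home replies to %u from memory", requestor);
        if (t != REQ_READ && s->sharer_mask)
            printf(" and multicasts invalidates (mask 0x%x)", s->sharer_mask);
        printf("\n");
    }
    if (t != REQ_READ) {                 /* exclusive-style requests   */
        s->dirty_remote = true;          /* take ownership eagerly,     */
        s->owner        = requestor;     /* before invalidations are    */
        s->sharer_mask  = 0;             /* acknowledged                */
    }
}

int main(void) {
    line_state_t line = { false, 0, 0x5 };
    home_handle(REQ_READ_EXCLUSIVE, &line, 3);
    home_handle(REQ_READ, &line, 1);
    return 0;
}
```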
Cache Coherence • Handles corner cases without NAKs/retries and without blocking at the home directory • Guarantees the owner node can always service a forwarded request • All transactions complete with at most 1 message to the home • Directory controller implemented as a simple pipelined state machine • Eliminates livelock and starvation • Virtual lanes (see the sketch below) • Q0: processor to home (point-to-point order) • Q1: home/memory to processors (total order) • Q2: replies from a third-party node or processor to the requestor
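As a mental model of the three virtual lanes, the sketch below maps message classes to lanes and records the ordering property each lane provides. The message-class names are my own labels, not hardware terminology.

```c
#include <stdio.h>

/* The three virtual lanes described above. */
typedef enum { Q0, Q1, Q2 } lane_t;

typedef enum {
    MSG_REQUEST_TO_HOME,       /* processor -> home                        */
    MSG_FORWARD_OR_INVAL,      /* home -> processors (forwards, invals...) */
    MSG_REPLY_FROM_THIRD_PARTY /* owner / third party -> requestor         */
} msg_class_t;

static lane_t lane_for(msg_class_t m) {
    switch (m) {
    case MSG_REQUEST_TO_HOME:        return Q0; /* point-to-point order */
    case MSG_FORWARD_OR_INVAL:       return Q1; /* total order          */
    case MSG_REPLY_FROM_THIRD_PARTY: return Q2; /* replies              */
    }
    return Q2; /* unreachable */
}

int main(void) {
    printf("an invalidate travels on Q%d\n", (int)lane_for(MSG_FORWARD_OR_INVAL));
    return 0;
}
```

The total order on Q1 is what lets the protocol drop invalidation-acknowledgement messages, as noted on the next slide.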
Cache Coherence • Dealing with the late request race • Two-level mechanism • Wait for a victim signal before discarding the line from the victim buffer • For writebacks to a remote home, the TTT maintains a copy • Dealing with the early request race • Delay a forwarded request on Q1 until the data arrives on Q2 (sketched below) • Allows transactions to be serviced within a node • No invalidation-acknowledgement messages • Multicast is used to send Q1 messages to multiple nodes
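The early-race rule ("delay a forwarded Q1 request until the matching data arrives on Q2") can be written as a small piece of per-processor logic. The miss-entry structure and function names below are hypothetical; this is only a sketch of the rule, not the actual miss-handling hardware.

```c
#include <stdbool.h>
#include <stdio.h>

/* One outstanding miss at a processor (roughly a miss-address-file entry). */
typedef struct {
    unsigned long addr;
    bool data_arrived;      /* set when the Q2 data reply comes back      */
    bool deferred_forward;  /* a Q1 forwarded request is waiting for data */
} miss_entry_t;

/* Q1 delivers a forwarded request for 'addr'.  If our own miss to that
 * line has not received its Q2 data yet, defer the forward instead of
 * NAKing it (the protocol never NAKs or retries). */
static void on_q1_forward(miss_entry_t *m, unsigned long addr) {
    if (m && m->addr == addr && !m->data_arrived) {
        m->deferred_forward = true;
        printf("defer forward for 0x%lx until Q2 data arrives\n", addr);
    } else {
        printf("service forward for 0x%lx immediately\n", addr);
    }
}

/* Q2 delivers the data reply: complete the miss, then service any
 * forward that was deferred behind it. */
static void on_q2_data(miss_entry_t *m) {
    m->data_arrived = true;
    if (m->deferred_forward)
        printf("now service deferred forward for 0x%lx\n", m->addr);
}

int main(void) {
    miss_entry_t m = { 0x1000, false, false };
    on_q1_forward(&m, 0x1000);  /* early race: forward beats our data */
    on_q2_data(&m);
    return 0;
}
```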
Cache Coherence • Notation in the protocol examples: R = requestor, H = home, O = owner, S = sharer • Dirty sharing • No sharing writeback (the owner can supply data to sharers without writing the line back to the home) • Marker message (sketched below) • Allows the requestor node to disambiguate the order of its request relative to other requests
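A minimal sketch of how a requestor might use the marker, assuming (my assumption) the home emits the marker on the totally ordered Q1 lane at the point where the request is serialized: any Q1 invalidation arriving before the marker was ordered before the requestor's own transaction, and anything arriving after it was ordered after.

```c
#include <stdbool.h>
#include <stdio.h>

/* Requestor-side view of one outstanding transaction. */
typedef struct {
    bool marker_seen;   /* marker from home has arrived on Q1 */
} txn_t;

/* A Q1 invalidation for the same line: the totally ordered Q1 lane plus
 * the marker tell us whether it was serialized before or after our own
 * request at the home. */
static void on_q1_inval(const txn_t *t) {
    if (t->marker_seen)
        printf("inval ordered AFTER my request: applies to the copy I receive\n");
    else
        printf("inval ordered BEFORE my request: applies only to the old copy\n");
}

int main(void) {
    txn_t t = { false };
    on_q1_inval(&t);          /* invalidation before the marker */
    t.marker_seen = true;     /* marker arrives on Q1           */
    on_q1_inval(&t);          /* invalidation after the marker  */
    return 0;
}
```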
Memory Consistency Optimizations • The Alpha memory model is supported • Memory-barrier instructions impose memory ordering • To implement safe early acknowledgement of invalidations, the reply message is split into (see the sketch below) • a data component, needed to service the request • a commit component, used for ordering • An early commit component is generated for read and read-exclusive requests
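The data/commit split can be modeled as two separate events per miss, with a memory barrier waiting only for commit events. The counter-based sketch below is my own illustration of the idea, not the actual 21264/GS320 implementation.

```c
#include <stdio.h>

/* Per-processor count of requests whose commit event has not yet been
 * received; the data reply may arrive earlier or later. */
static int uncommitted = 0;

static void issue_request(void)   { uncommitted++; }
static void on_commit_event(void) { uncommitted--; } /* commit component on Q1 */
static void on_data_reply(void)   { /* consumes data; no ordering role */ }

/* Alpha memory barrier: ordering only needs the commit events, so the
 * barrier can complete before remote invalidations have truly finished. */
static void memory_barrier(void) {
    while (uncommitted > 0) {
        printf("MB waiting for %d commit(s)\n", uncommitted);
        on_commit_event();   /* in hardware: stall until commits drain */
    }
    printf("MB complete\n");
}

int main(void) {
    issue_request();        /* e.g. a read-exclusive miss                */
    on_data_reply();        /* eager data reply: the write can proceed...*/
    memory_barrier();       /* ...but ordering waits for the commit      */
    return 0;
}
```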
Performance Evaluation • Relatively high back-to-back (dependent) read latency • Effective latency is smaller for independent read misses • Smaller L2 hit latency than in snoopy systems • The latency impact of sending invalidations is small and independent of the number of sharers • With a memory barrier, writes to a local home with remote sharers take longer than writes with no sharers • Conflicting writes to the same line have approximately the same latency as a 1-hop write
Questions • Do you think the AlphaServer GS320 can completely replace snoopy systems? If not, why? • What are the major disadvantages of the AlphaServer GS320?