Explore the DASH prototype utilizing directory-based cache coherence for large-scale shared memory multiprocessors. Learn about the two-level architecture, cluster and inter-cluster setup, memory distribution, directory protocol, hardware cost analysis, performance metrics, and conclusions from performance results. Understand the importance of locality and memory hierarchy for achieving good speed-ups in parallel computing.
CS 258 Parallel Computer Architecture
Lecture 15.1
DASH: Directory Architecture for Shared Memory
Implementation, cost, performance
Daniel Lenoski et al., "The DASH Prototype: Implementation and Performance", Proceedings of the International Symposium on Computer Architecture, 1992.
March 17, 2008
Rhishikesh Limaye
DASH objectives
• Demonstrate a large-scale shared-memory multiprocessor using directory-based cache coherence.
• Prototype with 16-64 processors.
• The argument: for both performance and programmability, a parallel architecture should
  • scale to 100s-1000s of processors,
  • have high-performance individual processors, and
  • have a single shared address space.
Two-level architecture
• Cluster:
  • Bus-based shared memory with snoopy cache coherence
  • 4 processors per cluster
• Inter-cluster:
  • Scalable interconnection network
  • Directory-based cache coherence
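A minimal sketch of this two-level organization as a data model; the names and fields are illustrative assumptions, not taken from the paper.

    #include <stdio.h>

    /* Illustrative constants for the DASH prototype: 4 CPUs per cluster,
     * up to 16 clusters, i.e. 16-64 processors in total. */
    #define CPUS_PER_CLUSTER 4
    #define MAX_CLUSTERS     16

    struct cluster {
        int id;
        int cpus;          /* kept coherent by the snoopy bus inside the cluster   */
        long mem_bytes;    /* this cluster's slice of the distributed shared memory */
    };

    struct dash {
        struct cluster clusters[MAX_CLUSTERS];  /* connected by the mesh network; */
        int nclusters;                          /* directories keep them coherent */
    };

    int main(void) {
        struct dash m = { .nclusters = MAX_CLUSTERS };
        for (int i = 0; i < m.nclusters; i++)
            m.clusters[i] = (struct cluster){ .id = i, .cpus = CPUS_PER_CLUSTER };
        printf("clusters=%d, processors=%d\n",
               m.nclusters, m.nclusters * CPUS_PER_CLUSTER);
        return 0;
    }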
Cluster level
• Minor modifications to an off-the-shelf SGI 4D/340 cluster:
  • 4 MIPS R3000 processors + 4 R3010 floating-point coprocessors
  • L1 write-through, L2 write-back
• Cache coherence:
  • MESI, i.e. the Illinois protocol
  • Cache-to-cache transfers are good for cached remote locations
  • The write-through L1 gives the inclusion property
• Pipelined bus with a maximum bandwidth of 64 MB/s
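A simplified sketch of the Illinois/MESI snooping behavior on a bus read. The state names are standard MESI; the function is only an illustration, not the 4D/340 controller's actual logic.

    #include <stdio.h>

    enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    /* What a snooping L2 does when it sees another processor's read on the
     * cluster bus.  Returns 1 if this cache supplies the data
     * (cache-to-cache transfer), 0 if memory responds instead. */
    static int snoop_bus_read(enum mesi *line)
    {
        switch (*line) {
        case MODIFIED:             /* only up-to-date copy: write back, supply */
            *line = SHARED;
            return 1;
        case EXCLUSIVE:            /* clean but exclusive: supply, drop to S   */
            *line = SHARED;
            return 1;
        case SHARED:               /* Illinois: a sharer supplies clean data   */
            return 1;              /* (arbitration among sharers omitted)      */
        case INVALID:
        default:
            return 0;              /* no copy here: memory responds            */
        }
    }

    int main(void)
    {
        enum mesi l2_line = MODIFIED;
        int supplied = snoop_bus_read(&l2_line);
        printf("supplied=%d new_state=%d (1=SHARED)\n", supplied, l2_line);
        return 0;
    }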
Inter-cluster directory protocol
• Three states per 16-byte memory block: invalid (uncached), shared, dirty.
• Memory is distributed across the clusters.
• Directory bits:
  • Simple scheme: 1 presence bit per cluster + 1 dirty bit.
  • This is good for the prototype, which has at most 16 clusters; it should be replaced by a limited-pointer or sparse directory for more clusters.
• Replies are sent directly between clusters, not through the home cluster:
  • i.e. invalidation acks are collected at the requesting node, not at the home node of the memory location.
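A minimal sketch of the full-bit-vector directory entry described above: one presence bit per cluster plus a dirty bit per 16-byte block, with the three states derived from those bits. The address-to-home-cluster interleaving is an assumption for illustration, not DASH's actual address map.

    #include <stdint.h>
    #include <stdio.h>

    #define NCLUSTERS   16     /* prototype maximum                          */
    #define BLOCK_BYTES 16     /* directory state is kept per 16-byte block  */

    /* One directory entry: 16 presence bits + 1 dirty bit = 17 bits/block. */
    struct dir_entry {
        uint16_t presence;     /* bit c set => cluster c may hold a copy      */
        uint8_t  dirty;        /* 1 => exactly one cluster holds a dirty copy */
    };

    enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_DIRTY };

    /* The three directory states are derived from the bits. */
    static enum dir_state state_of(const struct dir_entry *e)
    {
        if (e->dirty)    return DIR_DIRTY;
        if (e->presence) return DIR_SHARED;
        return DIR_UNCACHED;
    }

    /* Memory is distributed: a block's home cluster owns its directory
     * entry.  Simple modulo interleaving, assumed for illustration. */
    static int home_cluster(uint32_t paddr)
    {
        return (paddr / BLOCK_BYTES) % NCLUSTERS;
    }

    int main(void)
    {
        struct dir_entry e = { .presence = (1u << 3) | (1u << 7), .dirty = 0 };
        printf("state=%d (0=uncached,1=shared,2=dirty), home of 0x1230 = %d\n",
               state_of(&e), home_cluster(0x1230));
        return 0;
    }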
Extra hardware for directory
For each cluster, we have the following:
• Directory bits (DRAM): 17 bits per 16-byte cache line
• Directory controller: snoops every bus transaction within the cluster, accesses the directory bits, and takes the appropriate action
• Reply controller with a remote access cache (SRAM, 128 KB, 16-byte lines):
  • Snoops remote accesses on the local bus
  • Stores the state of ongoing remote accesses made by local processors
  • Lockup-free: handles multiple outstanding requests
  • QUESTION: what happens if two remote requests collide in this direct-mapped cache?
• Pseudo-CPU: stands in for remote nodes that request local memory
• Performance monitor
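To make the QUESTION above concrete, here is an illustrative sketch of the collision: the sizes come from the slide, but the indexing scheme is an assumption. Two outstanding remote requests whose addresses map to the same remote-access-cache entry cannot both be tracked, so one must be stalled or retried even though the RAC is otherwise lockup-free.

    #include <stdint.h>
    #include <stdio.h>

    /* Remote access cache: 128 KB of SRAM, 16-byte lines, direct mapped. */
    #define RAC_BYTES   (128 * 1024)
    #define RAC_LINE    16
    #define RAC_ENTRIES (RAC_BYTES / RAC_LINE)     /* 8192 entries         */

    /* Direct-mapped index: low-order line-number bits (assumed mapping).  */
    static unsigned rac_index(uint32_t paddr)
    {
        return (paddr / RAC_LINE) % RAC_ENTRIES;
    }

    int main(void)
    {
        /* Two remote addresses exactly RAC_BYTES apart need the same RAC  */
        /* entry, so the second request collides with the first.           */
        uint32_t a = 0x00400040;
        uint32_t b = a + RAC_BYTES;
        printf("index(a)=%u index(b)=%u collide=%d\n",
               rac_index(a), rac_index(b), rac_index(a) == rac_index(b));
        return 0;
    }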
Memory performance
• 4-level memory hierarchy: (1) the processor's L1 and L2 caches, (2) the local cluster (other L2s + local memory), (3) the directory home cluster, (4) a remote cluster.
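A sketch of how a load is classified into the four levels. The flag names are made up for illustration, and no latency numbers from the paper are assumed here.

    #include <stdio.h>

    /* Where a load is ultimately serviced in the 4-level hierarchy.        */
    enum mem_level {
        LVL_PROC_CACHE,     /* 1: the processor's own L1/L2                 */
        LVL_LOCAL_CLUSTER,  /* 2: another L2 or memory in the same cluster  */
        LVL_HOME_CLUSTER,   /* 3: the directory home of the address         */
        LVL_REMOTE_CLUSTER  /* 4: another remote cluster (e.g. dirty copy)  */
    };

    static enum mem_level classify(int hit_own_cache, int in_local_cluster,
                                   int home_can_supply)
    {
        if (hit_own_cache)    return LVL_PROC_CACHE;
        if (in_local_cluster) return LVL_LOCAL_CLUSTER;
        if (home_can_supply)  return LVL_HOME_CLUSTER;
        return LVL_REMOTE_CLUSTER;  /* home forwards the request onward     */
    }

    int main(void)
    {
        /* Worst case: miss everywhere, serviced by a third cluster (level 4). */
        printf("level=%d\n", classify(0, 0, 0));
        return 0;
    }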
Hardware cost of directory
[Table 2 in the paper]
• 13.7% of DRAM: directory bits
  • For larger systems, a sparse representation is needed.
• 10% of SRAM: remote access cache
• 20% of logic gates: controllers and network interfaces
• Clustering is important: with uniprocessor nodes, the directory logic would be about 44%.
• Comparison to message passing:
  • A message-passing machine spends about 10% of its logic and essentially no memory on communication.
  • Thus hardware coherence costs about 10% more logic and 10% more memory.
  • It is later argued that the performance improvement is much larger than 10%: about 3-4x.
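The directory DRAM overhead can be sanity-checked with quick arithmetic: 17 directory bits per 16-byte (128-bit) line is 17/128, roughly 13.3%, which is in line with the ~13.7% figure quoted from Table 2. This is only a rough cross-check, since the exact figure depends on what else the paper counts.

    #include <stdio.h>

    int main(void)
    {
        /* Full-bit-vector directory: 16 presence bits + 1 dirty bit.   */
        const double dir_bits_per_line  = 16 + 1;    /* 17 bits          */
        const double data_bits_per_line = 16 * 8;    /* 16-byte line     */

        /* 17 / 128 = ~13.3%, close to the ~13.7% DRAM cost in Table 2.  */
        printf("directory DRAM overhead ~= %.1f%%\n",
               100.0 * dir_bits_per_line / data_bits_per_line);
        return 0;
    }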
Performance monitor
• Configurable events.
• SRAM-based counters:
  • 2 banks of 16K x 32 SRAM.
  • Addressed by events: event 0, event 1, event 2, ... drive address bits 0, 1, 2, ...
  • Thus each bank can track log2(16K) = 14 events.
• Trace buffer made of DRAM:
  • Can store 2M memory operations.
  • With software support, can log all memory operations.
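A sketch of the event-addressed counting scheme: 14 event signals form a 14-bit address into a 16K-entry bank, and the addressed counter is incremented, so each bank effectively histograms the combinations of its 14 events. The array sizes come from the slide; the code itself is illustrative.

    #include <stdint.h>
    #include <stdio.h>

    #define EVENTS_PER_BANK 14                       /* log2(16K) address bits */
    #define BANK_ENTRIES    (1u << EVENTS_PER_BANK)  /* 16K x 32-bit counters  */

    static uint32_t bank0[BANK_ENTRIES];

    /* Concatenate the bank's 14 event signals into an SRAM address and
     * increment that counter; over time the bank holds a histogram over
     * all 2^14 event combinations. */
    static void count_events(const int events[EVENTS_PER_BANK])
    {
        uint32_t addr = 0;
        for (int i = 0; i < EVENTS_PER_BANK; i++)
            addr |= (events[i] ? 1u : 0u) << i;      /* event i -> address bit i */
        bank0[addr]++;
    }

    int main(void)
    {
        int ev[EVENTS_PER_BANK] = { 1, 0, 1 };       /* events 0 and 2 asserted */
        count_events(ev);
        printf("counter[0x%x] = %u\n", 5u, bank0[5]);
        return 0;
    }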
Performance results
• 9 applications.
• Good speed-up on 5 of them, without optimizing specifically for DASH.
• MP3D has bad locality; PSIM4 is an enhanced version of MP3D.
• Cholesky: more processors lead to too fine a granularity, unless the problem size is increased unreasonably.
• Note the dip after P = 4, once the machine spans more than one cluster.
Detailed study of 3 applications
What to read from Tables 4, 5, and 6:
• Water and LocusRoute have an equal fraction of local reads, but Water scales well and LocusRoute does not.
• Remote caching works: Water and LocusRoute make a remote reference every 20 and 11 instructions respectively, yet the busy pclks between processor stalls are 506 and 181.
Conclusions
• Locality is still important, because remote latencies are higher.
• However, for many applications, natural locality can be enough (Barnes-Hut, Radiosity, Water).
• Thus, good speed-ups can be achieved without a more difficult programming model (i.e. message passing).
• For higher performance, one has to consider the extended memory hierarchy, but only for critical data structures. This is analogous to the argument in the uniprocessor world: caches vs. scratchpad memories and stream buffers.