Cache Coherence Schemes for Multiprocessors
Sivakumar M, Osman Unsal
• Consistency
• Different Directory Schemes
• Comparison of Directory Schemes
• Hierarchical Directory Scheme (in detail)
Referred Papers:
• “Directory-Based Cache Coherence in Large-Scale Multiprocessors”, David Chaiken, Craig Fields, Kiyoshi Kurihara and Anant Agarwal
• “A Survey of Cache Coherence Schemes for Multiprocessors”, Per Stenstrom
• “Cache Consistency and Sequential Consistency”, James R. Goodman
• “LimitLESS Directories: A Scalable Cache Coherence Scheme”, David Chaiken, John Kubiatowicz and Anant Agarwal
• “A Hierarchical Directory Scheme for Large-Scale Cache-Coherent Multiprocessors”, a dissertation by Yeong-Chang Maa
CONSISTENCY
• Strict Consistency
Any read to memory location X returns the value stored by the most recent write operation to X.
  Valid:    P1: W(x)1                 Invalid:  P1: W(x)1
            P2:        R(x)1                    P2:        R(x)0  R(x)1
• Sequential Consistency: program order + memory coherence
The result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
  Valid:    P1: W(x)1                 Valid:    P1: W(x)1
            P2:        R(x)0  R(x)1             P2:        R(x)1  R(x)1
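The sequential-consistency examples above can be checked mechanically: enumerate every interleaving that preserves each processor's program order and collect the read values P2 could observe. A minimal sketch, assuming the two-processor trace from the slide (the function and trace representation are illustrative, not from the referenced papers):

```python
def sc_outcomes():
    # P1 issues one write to x; P2 issues two reads of x (initially 0).
    p1 = [("W", 1)]
    p2 = [("R", None), ("R", None)]
    outcomes = set()
    # Program order within each processor is fixed, so an interleaving
    # is determined by where P1's single write lands among P2's reads.
    for w_pos in range(len(p1) + len(p2)):
        x = 0
        reads = []
        merged = list(p2)
        merged.insert(w_pos, p1[0])
        for op, val in merged:
            if op == "W":
                x = val
            else:
                reads.append(x)
        outcomes.add(tuple(reads))
    return outcomes
```

Every outcome in the returned set is sequentially consistent; R(x)1 followed by R(x)0 is absent, since no legal interleaving lets a read observe the write and a later read miss it.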
CONSISTENCY
• Causal Consistency
Writes that are potentially causally related must be seen by all processes in the same order; concurrent writes may be seen in a different order on different machines.
  P1: W(x)1               W(x)3
  P2:        R(x)1 W(x)2
  P3:        R(x)1               R(x)3 R(x)2
  P4:        R(x)1               R(x)2 R(x)3
(W(x)2 is causally dependent on W(x)1; W(x)2 and W(x)3 are concurrent, so P3 and P4 may see them in different orders.)
• PRAM Consistency
Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.
• Processor Consistency
PRAM consistency plus: for every memory location X, there is global agreement about the order of writes to X.
CONSISTENCY
• Weak Consistency
Uses synchronization variables, which are themselves sequentially consistent.
No access to a synchronization variable is allowed until all previous writes have completed everywhere.
No data access is allowed until all previous accesses to synchronization variables have been performed.
• Release Consistency
Synchronization (e.g., barriers) is split into Acquire and Release operations; Acquire and Release must be processor consistent.
Variants: lazy release and eager release consistency.
• Entry Consistency
A lock is associated with each shared variable or element.
Directory-Based Cache Coherence
• Need
Limited bandwidth and bus cycle times rule out shared buses at scale.
Scalability: growing disparity between bus and processor speed.
With a point-to-point network, bandwidth increases as the number of processors increases.
• Drawback
The network has no broadcast capability, so sharers must be tracked explicitly.
More complex protocol.
Directory Schemes
• Full-map directories
Tang’s scheme: each directory entry holds N presence bits plus status bits, for N processors.
Memory overhead scales as N^2, assuming M ∝ N.
Censier and Feautrier’s scheme (distributed).
Stenstrom’s scheme (distributed).
• Limited directories
Classified as Dir_i X, where X is NB or B and i < N.
Eviction by pointer replacement: resembles a set-associative cache and requires an eviction policy.
Efficient if each memory block is referenced by only a few processors.
Memory overhead scales as M * i * log N.
If X is B, more than i copies may exist (on pointer overflow, a write falls back to broadcast invalidation); if X is NB, a pointer is evicted so at most i copies exist.
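The two overhead formulas above can be compared directly: full-map storage is one presence bit per processor per block, while a limited directory stores i pointers of ceil(log2 N) bits per block. A small sketch (status bits ignored; the function names are illustrative):

```python
import math

def full_map_bits(M, N):
    # Full-map (Tang-style): N presence bits for each of M blocks.
    return M * N

def limited_bits(M, N, i):
    # Limited directory: i pointers of ceil(log2 N) bits per block.
    return M * i * math.ceil(math.log2(N))
```

With M proportional to N, full-map overhead grows as N^2 (doubling N quadruples the bits), while the limited scheme grows only as N log N.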
Directory Schemes
• Chained Directories
• Make use of pointers, like linked lists
• Cache-block replacement is complex:
• splice the intermediate cache out of the chain, or
• invalidate the location in the caches further down the chain
• Variation: doubly linked chain
• Optimizes the replacement process
• Needs a larger average message size
• Comparison of full-map, limited, and chained schemes
• Metric: processor utilization
• Utilization depends on the frequency of memory references and the latency of the memory system
• Latency depends on topology, speed, number of processors, memory access latency, and the frequency and size of messages
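The replacement advantage of the doubly linked variation can be sketched in a few lines: with both forward and backward pointers, an evicting cache unlinks itself without walking the chain from the head or invalidating down-chain copies. A minimal illustration (the class and helper names are mine, not from the papers):

```python
class ChainNode:
    """One cache's entry in the doubly linked sharing chain for a block."""
    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.prev = None
        self.next = None

def link(a, b):
    # Append b after a in the chain.
    a.next, b.prev = b, a

def splice_out(node):
    # On block replacement, patch neighbors around the evicting cache.
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None
```

A singly linked chain cannot do this locally: the evicting cache does not know its predecessor, which is why those schemes either traverse the chain or invalidate everything after the departing node.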
Directory Schemes
• Analysis (trace-driven)
• No coherence: treats all addresses in the trace as unshared; gives an upper bound
• Cache only private data: baseline for comparison with the other schemes
• P-Thor: minimizes communication and has minimal synchronization points
• Speech: poor performance with limited directories due to pointer thrashing
• Performance improvement through system-level optimizations:
• a tree barrier structure instead of a linear barrier
• separating read-only blocks from read/write blocks
• reducing the block size
Directory Schemes
• Coarse Vector Dir_i CV_r
• Initially behaves as a limited directory with i pointers
• On pointer overflow, switches to a coarse presence vector in which each bit stands for a group of r processors
• Dir_0 B
• Two status bits encode four states: Absent; Present1 (present and clean in exactly one cache); Present (present and clean in more than one cache); PresentM (present and dirty in exactly one cache)
• LimitLESS Directory Scheme
• Combination of hardware and software techniques
• Realizes the performance of a full-map directory with the memory overhead of a limited directory
• Sectored Directory Dir_N/L
• L sub-blocks share one directory entry
• Overhead is MN/L
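The Dir_i CV_r behavior above can be sketched as a tiny model: exact pointers until overflow, then a coarse vector whose bits each cover r processors, so invalidations go to whole groups. A sketch under those assumptions (class and method names are mine):

```python
class CoarseVectorDir:
    """Dir_i CV_r sketch: up to i exact sharer pointers; on overflow,
    fall back to a coarse bit vector, one bit per group of r processors."""
    def __init__(self, i, r, n_procs):
        self.i, self.r, self.n_procs = i, r, n_procs
        self.pointers = set()   # exact sharer ids (pointer mode)
        self.coarse = None      # group ids (coarse-vector mode)

    def add_sharer(self, p):
        if self.coarse is None and len(self.pointers | {p}) <= self.i:
            self.pointers.add(p)
            return
        if self.coarse is None:  # overflow: reinterpret as coarse vector
            self.coarse = {q // self.r for q in self.pointers}
            self.pointers = set()
        self.coarse.add(p // self.r)

    def invalidation_targets(self):
        # Processors that must receive an invalidate on a write.
        if self.coarse is None:
            return sorted(self.pointers)
        return sorted(p for p in range(self.n_procs)
                      if p // self.r in self.coarse)
```

In pointer mode invalidations are exact; after overflow they over-approximate, hitting every processor in each marked group, which trades extra invalidation traffic for bounded directory storage.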
Directory Schemes
• Directory Cache Dir_a1,a2
• a1 entries hold short limited-directory pointers
• a2 entries hold long full-map pointers
• Hierarchical Scheme
Hierarchical Cache Coherence Schemes
• Network Architecture (Wilson)
• Hierarchical cache/bus architecture: a combination bus and directory scheme
• Each higher-level cache contains a copy of all blocks cached underneath it
• Write-invalidate protocol
• Higher-level caches act as filters
• Data Diffusion Machine
• Hierarchy of buses with large processor caches
• Write-invalidate protocol
• Higher-level caches hold only state information
• No global memory; cost-effective
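The filtering role of the higher-level caches follows from the inclusion property stated above: since a higher-level cache holds a copy of every block cached beneath it, a coherence transaction on the upper bus needs to pass down only on a tag hit. A one-function sketch (names are illustrative):

```python
def forward_invalidate_down(higher_level_tags, block):
    # Wilson-style filtering: by multilevel inclusion, a miss in the
    # higher-level tag array proves no lower cache holds the block,
    # so the invalidation need not be forwarded to the lower bus.
    return block in higher_level_tags
```

This is what keeps lower buses from seeing every transaction in the machine: most remote traffic dies at the first level whose tags miss.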
Hierarchical Full-Mapped Directory Scheme
Directory entry fields: tag bits, descendants presence vector, ackctr, MRU, INV, UP, MRQ, Tr, dirty
• States of HFMD
• ABS: no entries in descendants; cleared des. vector and Tr bit
• ABT: descendant entries being invalidated; cleared des. vector and Tr bit
• RO: read-only entries in the descendants; set des. vector, cleared dirty and Tr bits
• RW: a dirty (read-write) entry is in the descendants; set des. vector and dirty bit, cleared Tr bit
• RT: descendant entries have outstanding read requests; set des. vector and Tr bit, cleared dirty bit
• WT: descendant entries have outstanding write or modify requests; set des. vector, dirty bit and Tr bit
• INV: descendant entries being invalidated from the directory entry; cleared des. vector, set Tr bit and INV bit
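The state list above amounts to a flag table, which can be restated compactly. A sketch under the stated bit settings; bits the list does not mention for a state are assumed cleared (my assumption, not from the dissertation):

```python
# Flag settings per HFMD state: (des_vector, dirty, Tr, INV).
# Unmentioned bits assumed 0 -- an assumption for illustration only.
HFMD_FLAGS = {
    "ABS": (0, 0, 0, 0),  # no descendant entries
    "ABT": (0, 0, 0, 0),  # descendants being invalidated
    "RO":  (1, 0, 0, 0),  # read-only copies below
    "RW":  (1, 1, 0, 0),  # a dirty copy below
    "RT":  (1, 0, 1, 0),  # outstanding read requests
    "WT":  (1, 1, 1, 0),  # outstanding write/modify requests
    "INV": (0, 0, 1, 1),  # being invalidated from this entry
}

def tr_set(state):
    # The Tr bit marks states with transactions in flight (RT, WT, INV).
    return HFMD_FLAGS[state][2] == 1
```

Encoding the states this way makes the structure visible: des. vector distinguishes "copies below" from "none", dirty distinguishes read-only from read-write, and Tr separates stable states from in-flight ones.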