Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs252

CS252Graduate Computer ArchitectureLecture 21April 11th, 2012Distributed Shared Memory (con’t)Synchronization Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs252

Req NACK Inv Ack Req Inv Ack WData Inv Ack Recall: Sequential Consistency of Directory Protocols • How to get exclusion zone for directory protocol? • Clearly need to make sure that invalidations really invalidate copies • Keep state to handle reordering at client (previous slide’s problem) • While acknowledgements outstanding, cannot handle read requests • NAK read requests • Queue read requests • Example for invalidation-based scheme: • block owner (home node) provides appearance of atomicity by waiting for all invalidations to be ack’d before allowing access to new value • As a result, write commit point becomes point at which WData leaves home node (after last ack received) • Much harder in update schemes! Reader Reader REQ HOME Reader

1 2 4a 3 3:interv ention 4b 2:interv ention 1: req 1: req 4a:re vise L H R L H R 2:reply 4:reply 3:response 4b:response 1 2 1: req 2:interv ention 4 3 3a:re vise L H R 3b:response Recall: Deadlock Issues with Protocols • Consider Dual graph of message dependencies • Nodes: Networks, Arcs: Protocol actions • Number of networks = length of longest dependency • Must always make sure response (end) can be absorbed! 2 Networks Sufficient to Avoid Deadlock Need 4 Networks to Avoid Deadlock 1 2 Need 3 Networks to Avoid Deadlock 3a 3b

Original: Need 3 Networks to Avoid Deadlock Optional NACK When blocked Need 2 Networks to 1: req 1: req 2:interv 2:interv ention ention 1 1 2 2 3a:re 3a:re vise vise L L H H R R 3 3 3b:response 3b:response Recall: Mechanisms for reducing depth X NACK 2:intervention 1: req 1 2’ Transform to Request/Resp: Need 2 Networks to 3a:re vise L H R 2’:SendInt To R 2 3a 3b:response 3a

A Popular Middle Ground • Two-level “hierarchy” • Individual nodes are multiprocessors, connected non-hiearchically • e.g. mesh of SMPs • Coherence across nodes is directory-based • directory keeps track of nodes, not individual processors • Coherence within nodes is snooping or directory • orthogonal, but needs a good interface of functionality • Examples: • Convex Exemplar: directory-directory • Sequent, Data General, HAL: directory-snoopy • SMP on a chip?

Example Two-level Hierarchies

Advantages of Multiprocessor Nodes • Potential for cost and performance advantages • amortization of node fixed costs over multiple processors • applies even if processors simply packaged together but not coherent • can use commodity SMPs • less nodes for directory to keep track of • much communication may be contained within node (cheaper) • nodes prefetch data for each other (fewer “remote” misses) • combining of requests (like hierarchical, only two-level) • can even share caches (overlapping of working sets) • benefits depend on sharing pattern (and mapping) • good for widely read-shared: e.g. tree data in Barnes-Hut • good for nearest-neighbor, if properly mapped • not so good for all-to-all communication

Disadvantages of Coherent MP Nodes • Bandwidth shared among nodes • all-to-all example • applies to coherent or not • Bus increases latency to local memory • With coherence, typically wait for local snoop results before sending remote requests • Snoopy bus at remote node increases delays there too, increasing latency and reducing bandwidth • May hurt performance if sharing patterns don’t comply

Insight into Directory Requirements • If most misses involve O(P) transactions, might as well broadcast!  Study Inherent program characteristics: • frequency of write misses? • how many sharers on a write miss • how these scale • Also provides insight into how to organize and store directory information

Cache Invalidation Patterns

Sharing Patterns Summary • Generally, few sharers at a write, scales slowly with P • Code and read-only objects (e.g, scene data in Raytrace) • no problems as rarely written • Migratory objects (e.g., cost array cells in LocusRoute) • even as # of PEs scale, only 1-2 invalidations • Mostly-read objects (e.g., root of tree in Barnes) • invalidations are large but infrequent, so little impact on performance • Frequently read/written objects (e.g., task queues) • invalidations usually remain small, though frequent • Synchronization objects • low-contention locks result in small invalidations • high-contention locks need special support (SW trees, queueing locks) • Implies directories very useful in containing traffic • if organized properly, traffic and latency shouldn’t scale too badly • Suggests techniques to reduce storage overhead

Organizing Directories DirectorySchemes Centralized Distributed How to find source of directory information Flat Hierarchical How to locate copies Memory-based Cache-based

How to Find Directory Information • centralized memory and directory - easy: go to it • but not scalable • distributed memory and directory • flat schemes • directory distributed with memory: at the home • location based on address (hashing): network xaction sent directly to home • hierarchical schemes • ??

How Hierarchical Directories Work • Directory is a hierarchical data structure • leaves are processing nodes, internal nodes just directory • logical hierarchy, not necessarily phyiscal • (can be embedded in general network)

Find Directory Info (cont) • distributed memory and directory • flat schemes • hash • hierarchical schemes • node’s directory entry for a block says whether each subtree caches the block • to find directory info, send “search” message up to parent • routes itself through directory lookups • like hiearchical snooping, but point-to-point messages between children and parents

How Is Location of Copies Stored? • Hierarchical Schemes • through the hierarchy • each directory has presence bits child subtrees and dirty bit • Flat Schemes • vary a lot • different storage overheads and performance characteristics • Memory-based schemes • info about copies stored all at the home with the memory block • Dash, Alewife , SGI Origin, Flash • Cache-based schemes • info about copies distributed among copies themselves • each copy points to next • Scalable Coherent Interface (SCI: IEEE standard)

P M Flat, Memory-based Schemes • info about copies co-located with block at the home • just like centralized scheme, except distributed • Performance Scaling • traffic on a write: proportional to number of sharers • latency on write: can issue invalidations to sharers in parallel • Storage overhead • simplest representation: full bit vector, (called “Full-Mapped Directory”), i.e. one presence bit per node • storage overhead doesn’t scale well with P; 64-byte line implies • 64 nodes: 12.7% ovhd. • 256 nodes: 50% ovhd.; 1024 nodes: 200% ovhd. • for M memory blocks in memory, storage overhead is proportional to P*M: • Assuming each node has memory Mlocal= M/P,  P2Mlocal • This is why people talk about full-mapped directories as scaling with the square of the number of processors

P M = MlocalP Reducing Storage Overhead • Optimizations for full bit vector schemes • increase cache block size (reduces storage overhead proportionally) • use multiprocessor nodes (bit per mp node, not per processor) • still scales as P*M, but reasonable for all but very large machines • 256-procs, 4 per cluster, 128B line: 6.25% ovhd. • Reducing “width” • addressing the P term? • Reducing “height” • addressing the M term?

Storage Reductions • Width observation: • most blocks cached by only few nodes • don’t have a bit per node, but entry contains a few pointers to sharing nodes • Called “Limited Directory Protocols” • P=1024 => 10 bit ptrs, can use 100 pointers and still save space • sharing patterns indicate a few pointers should suffice (five or so) • need an overflow strategy when there are more sharers • Height observation: • number of memory blocks >> number of cache blocks • most directory entries are useless at any given time • Could allocate directory from pot of directory entries • If memory line doesn’t have a directory, no-one has copy • What to do if overflow? Invalidate directory with invaliations • organize directory as a cache, rather than having one entry per memory block

Case Study: Alewife Architecture • Cost Effective Mesh Network • Pro: Scales in terms of hardware • Pro: Exploits Locality • Directory Distributed along with main memory • Bandwidth scales with number of processors • Con: Non-Uniform Latencies of Communication • Have to manage the mapping of processes/threads onto processors due • Alewife employs techniques for latency minimization and latency tolerance so programmer does not have to manage • Context Switch in 11 cycles between processes on remote memory request which has to incur communication network latency • Cache Controller holds tags and implements the coherence protocol

LimitLESS Protocol (Alewife) • Limited Directory that is Locally Extended through Software Support • Handle the common case (small worker set) in hardware and the exceptional case (overflow) in software • Processor with rapid trap handling (executes trap code within 4 cycles of initiation) • State Shared • Processor needs complete access to coherence related controller state in the hardware directories • Directory Controller can invoke processor trap handlers • Machine needs an interface to the network that allows the processor to launch and intercept coherence protocol packets

The Protocol • Alewife: p=5-entry limited directory with software extension (LimitLESS) • Read-only directory transaction: • Incoming RREQ with n  p Hardware memory controller responds • If n > p: send RREQ to processor for handling

Transition to Software • Trap routine can either discard packet or store it to memory • Store-back capability permits message-passing and block transfers • Potential Deadlock Scenario with Processor Stalled and waiting for a remote cache-fill • Solution: Synchronous Trap (stored in local memory) to empty input queue

Transition to Software (Con’t) • Overflow Trap Scenario • First Instance: Full-Map bit-vector allocated in local memory and hardware pointers transferred into this and vector entered into hash table • Otherwise: Transfer hardware pointers into bit vector • Meta-State Set to “Trap-On-Write” • While emptying hardware pointers, Meta-State: “Trans-In-Progress” • Incoming Write Request Scenario • Empty hardware pointers to memory • Set AckCtr to number of bits that are set in bit-vector • Send invalidations to all caches except possibly requesting one • Free vector in memory • Upon invalidate acknowledgement (AckCtr == 0), send Write-Permission and set Memory State to “Read-Write”

Flat, Cache-based Schemes • How they work: • home only holds pointer to rest of directory info • distributed linked list of copies, weaves through caches • cache tag has pointer, points to next cache with a copy • on read, add yourself to head of the list (comm. needed) • on write, propagate chain of invals down the list • Scalable Coherent Interface (SCI) IEEE Standard • doubly linked list

Scaling Properties (Cache-based) • Traffic on write: proportional to number of sharers • Latency on write: proportional to number of sharers! • don’t know identity of next sharer until reach current one • also assist processing at each node along the way • (even reads involve more than one other assist: home and first sharer on list) • Storage overhead: quite good scaling along both axes • Only one head ptr per memory block • rest is all prop to cache size • Very complex!!!

Summary of Directory Organizations • Flat Schemes: • Issue (a): finding source of directory data • go to home, based on address • Issue (b): finding out where the copies are • memory-based: all info is in directory at home • cache-based: home has pointer to first element of distributed linked list • Issue (c): communicating with those copies • memory-based: point-to-point messages (perhaps coarser on overflow) • can be multicast or overlapped • cache-based: part of point-to-point linked list traversal to find them • serialized • Hierarchical Schemes: • all three issues through sending messages up and down tree • no single explict list of sharers • only direct communication is between parents and children

Summary of Directory Approaches • Directories offer scalable coherence on general networks • no need for broadcast media • Many possibilities for organizing directory and managing protocols • Hierarchical directories not used much • high latency, many network transactions, and bandwidth bottleneck at root • Both memory-based and cache-based flat schemes are alive • for memory-based, full bit vector suffices for moderate scale • measured in nodes visible to directory protocol, not processors • will examine case studies of each

Role of Synchronization • Types of Synchronization • Mutual Exclusion • Event synchronization • point-to-point • group • global (barriers) • How much hardware support? • high-level operations? • atomic instructions? • specialized interconnect?

Components of a Synchronization Event • Acquire method • Acquire right to the synch • enter critical section, go past event • Waiting algorithm • Wait for synch to become available when it isn’t • busy-waiting, blocking, or hybrid • Release method • Enable other processors to acquire right to the synch • Waiting algorithm is independent of type of synchronization • makes no sense to put in hardware

Strawman Lock lock: ld register, location/* copy location to register */ cmp location, #0/* compare with 0 */ bnz lock/* if not 0, try again */ st location, #1/* store 1 to mark it locked */ ret/* return control to caller */ unlock: st location, #0/* write 0 to location */ ret/* return control to caller */ Busy-Wait Why doesn’t the acquire method work? Release method?

What to do if only load and store? • Here is a possible two-thread solution: Thread AThread B Set A=1; Set B=1; while (B) {//X if (!A) {//Y do nothing; Critical Section; } } Critical Section; Set B=0; Set A=0; • Does this work? Yes. Both can guarantee that: • Only one will enter critical section at a time. • At X: • if B=0, safe for A to perform critical section, • otherwise wait to find out what will happen • At Y: • if A=0, safe for B to perform critical section. • Otherwise, A is in critical section or waiting for B to quit • But: • Really messy • Generalization gets worse

Atomic Instructions • Specifies a location, register, & atomic operation • Value in location read into a register • Another value (function of value read or not) stored into location • Many variants • Varying degrees of flexibility in second part • Simple example: test&set • Value in location read into a specified register • Constant 1 stored into location • Successful if value loaded into register is 0 • Other constants could be used instead of 1 and 0 • How to implement test&set in distributed cache coherent machine? • Wait until have write privileges, then perform operation without allowing any intervening operations (either locally or remotely)

20 l T est&set, c = 0 s s T est&set, exponential backof f, c = 3.64 l s 18 T est&set, exponential backof f, c = 0 n n Ideal u s 16 s l s s 14 n l l s s n 12 s s) n l m s s ime ( s l 10 l T n l s n n 8 n l 6 n l n 4 l n s l l 2 n l s uuuuuuuuuuuuuuu l n n n n s u l 0 3 5 7 9 11 13 15 Number of processors T&S Lock Microbenchmark: SGI Chal. lock: t&s register, location bnz lock /* if not 0, try again */ ret /* return control to caller */ unlock: st location, #0 /* write 0 to location */ ret /* return control to caller */ lock; delay(c); unlock;

Zoo of hardware primitives • test&set (&address) { /* most architectures */ result = M[address]; M[address] = 1; return result;} • swap (&address, register) { /* x86 */ temp = M[address]; M[address] = register; register = temp;} • compare&swap (&address, reg1, reg2) { /* 68000 */ if (reg1 == M[address]) { M[address] = reg2; return success; } else { return failure; }} • load-linked&store conditional(&address) { /* R4000, alpha */ loop: ll r1, M[address]; movi r2, 1; /* Can do arbitrary comp */ sc r2, M[address]; beqz r2, loop;}

Mini-Instruction Set debate • atomic read-modify-write instructions • IBM 370: included atomic compare&swap for multiprogramming • x86: any instruction can be prefixed with a lock modifier • High-level language advocates want hardware locks/barriers • but it’s goes against the “RISC” flow,and has other problems • SPARC: atomic register-memory ops (swap, compare&swap) • MIPS, IBM Power: no atomic operations but pair of instructions • load-locked, store-conditional • later used by PowerPC and DEC Alpha too • 68000: CCS: Compare and compare and swap • No-one does this any more • Rich set of tradeoffs

Other forms of hardware support • Separate lock lines on the bus • Lock locations in memory • Lock registers (Cray Xmp,Intel Single-Chip CC) • Hardware full/empty bits (Tera, Alewife) • QOLB (machines supporting SCI protocol) • Bus support for interrupt dispatch

Enhancements to Simple Lock • Reduce frequency of issuing test&sets while waiting • Test&set lock with backoff • Don’t back off too much or will be backed off when lock becomes free • Exponential backoff works quite well empirically: ith time = k*ci • Busy-wait with read operations rather than test&set • Test-and-test&set lock • Keep testing with ordinary load • cached lock variable will be invalidated when release occurs • When value changes (to 0), try to obtain lock with test&set • only one attemptor will succeed; others will fail and start testing again

Busy-wait vs Blocking • Busy-wait: I.e. spin lock • Keep trying to acquire lock until read • Very low latency/processor overhead! • Very high system overhead! • Causing stress on network while spinning • Processor is not doing anything else useful • Blocking: • If can’t acquire lock, deschedule process (I.e. unload state) • Higher latency/processor overhead (1000s of cycles?) • Takes time to unload/restart task • Notification mechanism needed • Low system overheadd • No stress on network • Processor does something useful • Hybrid: • Spin for a while, then block • 2-competitive: spin until have waited blocking time

Improved Hardware Primitives: LL-SC • Goals: • Test with reads • Failed read-modify-write attempts don’t generate invalidations • Nice if single primitive can implement range of r-m-w operations • Load-Locked (or -linked), Store-Conditional • LL reads variable into register • Follow with arbitrary instructions to manipulate its value • SC tries to store back to location • succeed if and only if no other write to the variable since this processor’s LL • indicated by condition codes; • If SC succeeds, all three steps happened atomically • If fails, doesn’t write or generate invalidations • must retry aquire

Simple Lock with LL-SC lock: ll reg1, location/* LL location to reg1 */ sc location, reg2/* SC reg2 into location*/ beqz reg2, lock/* if failed, start again */ ret unlock: st location, #0/* write 0 to location */ ret • Can do more fancy atomic ops by changing what’s between LL & SC • But keep it small so SC likely to succeed • Don’t include instructions that would need to be undone (e.g. stores) • SC can fail (without putting transaction on bus) if: • Detects intervening write even before trying to get bus • Tries to get bus but another processor’s SC gets bus first • LL, SC are not lock, unlock respectively • Only guarantee no conflicting write to lock variable between them • But can use directly to implement simple operations on shared variables

Ticket Lock • Only one r-m-w per acquire • Two counters per lock (next_ticket, now_serving) • Acquire: fetch&inc next_ticket; wait for now_serving == next_ticket • atomic op when arrive at lock, not when it’s free (so less contention) • Release: increment now-serving • Performance • low latency for low-contention - if fetch&inc cacheable • O(p) read misses at release, since all spin on same variable • FIFO order • like simple LL-SC lock, but no inval when SC succeeds, and fair • Backoff? • Wouldn’t it be nice to poll different locations ...

Array-based Queuing Locks • Waiting processes poll on different locations in an array of size p • Acquire • fetch&inc to obtain address on which to spin (next array element) • ensure that these addresses are in different cache lines or memories • Release • set next location in array, thus waking up process spinning on it • O(1) traffic per acquire with coherent caches • FIFO ordering, as in ticket lock, but, O(p) space per lock • Not so great for non-cache-coherent machines with distributed memory • array location I spin on not necessarily in my local memory • Example: MCS lock (Mellor-Crummey and Scott)

A r r a y - b a s e d l L L - S C 6 L L - S C , e x p o n e n t i a l n T i c k e t u T i c k e t , p r o p o r t i o n a l s 7 7 7 u u u u u u u 6 6 6 6 u u u u u u u u u u u 6 u 6 5 5 5 u u 6 l u u u l u u u 6 l l 6 l l u u 6 l l 6 l l l u l u l l l 6 l 6 l 4 4 4 l l l l l 6 l u l l l l l s 6 u l s l s l l l s u s l l l s s l l u 6 s l l s s l 6 u s u s n s s s 6 s s 6 s s Time (s) s Time (s) Time (s) s s s s s u s s s s u u 6 s s s s s s s n s s s 6 s n 6 3 3 3 n 6 n 6 n n 6 n n n n n n n 6 n 6 6 6 6 n n 2 2 2 6 n 6 n 6 6 n n 6 u s 6 n 6 6 n 6 6 n 6 n n 6 n n n 6 u n 1 1 n 1 n l n s n 6 n n s n 6 n s n n n l n s n u l l 6 n n u 6 u 6 0 0 0 1 3 5 7 9 1 1 1 3 1 5 1 3 5 7 9 1 1 1 3 1 5 1 3 5 7 9 1 1 1 3 1 5 Number of processors Number of processors Number of processors (a) Null (c = 0, d = 0) (c) Delay (c = 3.64 s, d = 1.29 s) (b) Critical-section (c = 3.64 s, d = 0) Lock Performance on SGI Challenge Loop: lock; delay(c); unlock; delay(d);

Point to Point Event Synchronization • Software methods: • Interrupts • Busy-waiting: use ordinary variables as flags • Blocking: use semaphores • Full hardware support: full-empty bit with each word in memory • Set when word is “full” with newly produced data (i.e. when written) • Unset when word is “empty” due to being consumed (i.e. when read) • Natural for word-level producer-consumer synchronization • producer: write if empty, set to full; consumer: read if full; set to empty • Hardware preserves atomicity of bit manipulation with read or write • Problem: flexibility • multiple consumers, or multiple writes before consumer reads? • needs language support to specify when to use • composite data structures?

Barriers • Software algorithms implemented using locks, flags, counters • Hardware barriers • Wired-AND line separate from address/data bus • Set input high when arrive, wait for output to be high to leave • In practice, multiple wires to allow reuse • Useful when barriers are global and very frequent • Difficult to support arbitrary subset of processors • even harder with multiple processes per processor • Difficult to dynamically change number and identity of participants • e.g. latter due to process migration • Not common today on bus-based machines

A Simple Centralized Barrier • Shared counter maintains number of processes that have arrived • increment when arrive (lock), check until reaches numprocs • Problem? struct bar_type {int counter; struct lock_type lock; int flag = 0;} bar_name; BARRIER (bar_name, p) { LOCK(bar_name.lock); if (bar_name.counter == 0) bar_name.flag = 0; /* reset flag if first to reach*/ mycount = bar_name.counter++; /* mycount is private */ UNLOCK(bar_name.lock); if (mycount == p) { /* last to arrive */ bar_name.counter = 0; /* reset for next barrier */ bar_name.flag = 1; /* release waiters */ } else while (bar_name.flag == 0) {}; /* busy wait for release */ }

A Working Centralized Barrier • Consecutively entering the same barrier doesn’t work • Must prevent process from entering until all have left previous instance • Could use another counter, but increases latency and contention • Sense reversal: wait for flag to take different value consecutive times • Toggle this value only when all processes reach BARRIER (bar_name, p) { local_sense = !(local_sense); /* toggle private sense variable */ LOCK(bar_name.lock); mycount = bar_name.counter++; /* mycount is private */ if (bar_name.counter == p) UNLOCK(bar_name.lock); bar_name.flag = local_sense; /* release waiters*/ else { UNLOCK(bar_name.lock); while (bar_name.flag != local_sense) {}; } }

Centralized Barrier Performance • Latency • Centralized has critical path length at least proportional to p • Traffic • About 3p bus transactions • Storage Cost • Very low: centralized counter and flag • Fairness • Same processor should not always be last to exit barrier • No such bias in centralized • Key problems for centralized barrier are latency and traffic • Especially with distributed memory, traffic goes to same node

Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs252

Prof John D. Kubiatowicz http://www.cs.berkeley.edu/~kubitron/cs252

Presentation Transcript

Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs252

Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs252

September 29, 2000 Prof. John Kubiatowicz

Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs252

Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs252

October 22 nd , 2003 Prof. John Kubiatowicz cs.berkeley/~kubitron/courses/cs252-F03

September 15, 2000 Prof. John Kubiatowicz

September 17 th , 2003 Prof. John Kubiatowicz

September 22, 1999 Prof. John Kubiatowicz

September 15, 2003 Prof. John Kubiatowicz cs.berkeley/~kubitron/courses/cs252-F03

Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs252

Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs252

September 24 th , 2003 Prof. John Kubiatowicz

September 29, 2000 Prof. John Kubiatowicz

September 22, 1999 Prof. John Kubiatowicz

September 17 th , 2003 Prof. John Kubiatowicz

September 15, 2003 Prof. John Kubiatowicz cs.berkeley/~kubitron/courses/cs252-F03

Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs252

Prof John D. Kubiatowicz cs.berkeley/~kubitron/cs252