Lecture on High Performance Processor Architecture (CS05162) Introduction to Shared Memory Model and Transactional Memory Guo Rui timmyguo@mail.ustc.edu.cn Fall 2007 University of Science and Technology of China Department of Computer Science and Technology
Outline • SMP/CMP & Shared Memory Model • Synchronization: Critical-Section Problem • Lock (Mutex) • Transactional Memory CS of USTC
Natural Extensions of Memory System [Figure, in order of increasing scale: (a) Shared cache: P1…Pn share an interleaved first-level $ and interleaved main memory through a switch; (b) Centralized memory (dance hall, UMA): each P has a private $, and all reach shared memory modules through an interconnection network; (c) Distributed memory (NUMA): each node has a P, $ and local Mem, with nodes connected by an interconnection network] CS of USTC
Interconnection • Bus (Shared media) • Broadcast & snoop • Contention & arbitration • Cheap • Routing network (2D-Mesh etc.) • unicast communication • Multi-hop communication • Expensive CS of USTC
Example Cache Coherence Problem [Figure: P1, P2, P3 with private caches each holding u:5 loaded from memory; P3 writes u = 7, and P1 and P2 then ask u = ?] • Processors see different values for u after event 3 • With write-back caches, the value written back to memory depends on happenstance of which cache flushes or writes back its value, and when • Processes accessing main memory may see a very stale value • Unacceptable to programs, and frequent! CS of USTC
Caches and Cache Coherence • private processor caches create a problem • Copies of a variable can be present in multiple caches • A write by one processor may not become visible to others • They’ll keep accessing stale value in their caches => Cache coherence problem • What do we do about it? • Nothing at all • Organize the mem hierarchy to make it go away • Detect and take actions to eliminate the problem CS of USTC
Intuitive Memory Model [Figure: uniprocessor hierarchy P → L1 (100:67) → L2 (100:35) → Memory/Disk (100:34), with stale copies of address 100 at different levels] • Reading an address should return the last value written to that address • Easy in uniprocessors • except for I/O • Cache coherence problem in MPs is more pervasive and more performance critical CS of USTC
Example: Write-thru Invalidate Protocol [Figure: the same three-processor example; P3's write-through of u = 7 updates memory and invalidates the stale cached copies, so later reads by P1 and P2 miss and fetch u = 7] CS of USTC
Invalidate vs. Update • Basic question of program behavior: • Is a block written by one processor later read by others before it is overwritten? • Invalidate. • yes: readers will take a miss • no: multiple writes without additional traffic • also clears out copies that will never be used again • Update. • yes: avoids misses on later references • no: multiple useless updates • even to pack rats => Need to look at program reference patterns and hardware complexity but first - correctness CS of USTC
[MSI state diagram: transitions among M, S, I driven by PrRd, PrWr, BusRd and BusRdX; M flushes on BusRd and BusRdX; PrWr in S or I issues BusRdX] MSI Invalidate Protocol • Read obtains block in "shared" state • even if it is the only cached copy • Obtain exclusive ownership before writing • BusRdX causes others to invalidate (demote) • If M in another cache, that cache will flush • BusRdX even on a hit in S • promote to M (upgrade) • What about replacement? • S->I, M->I as before CS of USTC
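To make the protocol concrete, here is a minimal software sketch of the MSI state machine for a single cache line, written in C; the type and function names (msi_state_t, on_proc_read, on_bus_snoop) are illustrative stand-ins, not part of any real coherence controller.

/* Minimal sketch of the MSI invalidate protocol as a per-line state machine. */
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_FLUSH } bus_op_t;

/* Processor-side read: I -> S via BusRd; S and M simply hit (PrRd/-). */
static bus_op_t on_proc_read(msi_state_t *s)
{
    if (*s == INVALID) { *s = SHARED; return BUS_RD; }
    return BUS_NONE;
}

/* Processor-side write: must own the line in M; S or I issue BusRdX first. */
static bus_op_t on_proc_write(msi_state_t *s)
{
    if (*s == MODIFIED) return BUS_NONE;      /* PrWr/- */
    *s = MODIFIED;                            /* upgrade */
    return BUS_RDX;                           /* invalidates other copies */
}

/* Snooped bus transaction issued by another processor. */
static bus_op_t on_bus_snoop(msi_state_t *s, bus_op_t op)
{
    bus_op_t reply = BUS_NONE;
    if (*s == MODIFIED) reply = BUS_FLUSH;    /* supply the dirty data */
    if (op == BUS_RDX)  *s = INVALID;         /* demote to I */
    else if (op == BUS_RD && *s == MODIFIED) *s = SHARED;
    return reply;
}

int main(void)
{
    msi_state_t p1 = INVALID, p2 = INVALID;
    on_proc_read(&p1);                        /* P1: I -> S */
    on_bus_snoop(&p1, on_proc_write(&p2));    /* P2 writes: BusRdX invalidates P1 */
    printf("P1=%d P2=%d\n", p1, p2);          /* P1=INVALID(0), P2=MODIFIED(2) */
    return 0;
}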
Setup for Memory Consistency • Coherence => Writes to a location become visible to all in the same order • But when does a write become visible? • How do we establish orders between a write and a read by different processors? • use event synchronization • typically use more than one location! CS of USTC
Example
/* Assume initial value of A and flag is 0 */
P1:                P2:
A = 1;             while (flag == 0);  /* spin idly */
flag = 1;          print A;
• Intuition not guaranteed by coherence • Expect memory to respect order between accesses to different locations issued by a given process • Coherence is not enough! • it pertains only to a single location • it only preserves order among accesses to the same location by different processes CS of USTC
Memory Consistency Model • Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another • What orders are preserved? • Given a load, constrains the possible values returned by it • Without it, can’t tell much about an SAS program’s execution • Implications for both programmer and system designer • Programmer uses to reason about correctness and possible results • System designer can use to constrain how much accesses can be reordered by compiler or hardware • Contract between programmer and system CS of USTC
[Figure: processors P1…Pn issue memory references in program order; a "switch" that is randomly set after each memory reference selects which processor accesses memory next] Sequential Consistency • Total order achieved by interleaving accesses from different processes • Maintains program order, and memory operations, from all processes, appear to [issue, execute, complete] atomically w.r.t. others • as if there were no caches, and a single memory • "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979] CS of USTC
SC Example
/* Assume initial values of A and B are 0 */
P1:              P2:
(1a) A = 1;      (2a) print B;
(1b) B = 2;      (2b) print A;
• What matters is the order in which operations appear to execute, not the chronological order of events • Possible outcomes for (A,B): (0,0), (1,0), (1,2) • What about (0,2), i.e. print A yields 0 and print B yields 2? • program order => 1a->1b and 2a->2b • A = 0 implies 2b->1a, which implies 2a->1b • B = 2 implies 1b->2a, which leads to a contradiction CS of USTC
Outline • SMP/CMP & Shared Memory Model • Synchronization: Critical-Section Problem • Lock (Mutex) • Transactional Memory CS of USTC
The problem • for (i=0; i < 100; i++, cnt++); // cnt==0 initially, run concurrently by two threads A and B • A: Reg1=cnt //Reg1=0 • B: Reg1=cnt //Reg1=0 • A: Reg1++ //Reg1=1 • B: Reg1++ //Reg1=1 • A: cnt=Reg1 //cnt=1 • B: cnt=Reg1 //cnt=1 CS of USTC
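A runnable sketch of this lost-update race, together with one way to avoid it using a C11 atomic increment; the variable and function names are illustrative.

/* The load/increment/store sequence on cnt is not atomic, so two threads can
 * interleave and lose updates; atomic_fetch_add performs one indivisible RMW. */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int cnt = 0;                 /* racy counter */
static atomic_int safe_cnt = 0;     /* atomic counter */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100; i++) {
        cnt++;                               /* may lose updates */
        atomic_fetch_add(&safe_cnt, 1);      /* never loses updates */
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* cnt may print less than 200 due to lost updates; safe_cnt prints 200 */
    printf("cnt=%d safe_cnt=%d\n", cnt, atomic_load(&safe_cnt));
    return 0;
}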
Critical-Section Problem • Accessing shared resources • Only one thread at a time may execute inside the critical section that protects the shared resources. • Correctness criteria • Mutual exclusion • Progress • Bounded waiting CS of USTC
Software mutex algorithm (Peterson's algorithm for two threads)
bool flag[2] = {false, false};
int turn = 0;
void enterCriticalSection(int t) {
    int other = 1 - t;
    flag[t] = true;
    turn = other;
    while (flag[other] == true && turn == other)
        ;   /* waiting */
}
void leavingCriticalSection(int t) {
    flag[t] = false;
}
CS of USTC
Strawman Lock
lock:   ld  register, location   /* copy location to register */
        cmp register, #0         /* compare with 0 */
        bnz lock                 /* if not 0, try again */
        st  location, #1         /* store 1 to mark it locked */
        ret                      /* return control to caller */
unlock: st  location, #0         /* write 0 to location */
        ret                      /* return control to caller */
Busy-wait loop. Why doesn't the acquire method work? (The load, test and store are separate instructions, so two processors can both read 0 and both enter.) Release method? CS of USTC
Hardware support • Atomic Read-modify-Write instruction • IBM 370, Sparc: atomic compare&swap • x86: any instruction can be prefixed with a lock modifier • MIPS, PowerPC, Alpha: Load-link, Store conditional • Other forms of hardware support • Lock locations in memory • Lock registers (Cray Xmp) • Others… CS of USTC
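As a sketch of how such a primitive is used, here is a spinlock built on compare&swap, with C11 atomic_compare_exchange_weak standing in for the hardware instruction; the type and function names are illustrative.

/* Compare&swap lock: acquire succeeds only if the lock word changes 0 -> 1. */
#include <stdatomic.h>

typedef struct { atomic_int held; } cas_lock_t;   /* 0 = free, 1 = held */

static void cas_lock(cas_lock_t *l)
{
    int expected;
    do {
        expected = 0;                              /* only succeed if currently free */
    } while (!atomic_compare_exchange_weak(&l->held, &expected, 1));
}

static void cas_unlock(cas_lock_t *l)
{
    atomic_store(&l->held, 0);                     /* release: write 0 */
}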
Simple Test&Set Lock
lock:   t&s register, location  /* atomically read location and set it to 1 */
        bnz lock                /* if not 0, try again */
        ret                     /* return control to caller */
unlock: st  location, #0        /* write 0 to location */
        ret                     /* return control to caller */
• Other read-modify-write primitives • Swap • Fetch&op • Compare&swap • Three operands: location, register to compare with, register to swap with • Not commonly supported by RISC instruction sets • cacheable or uncacheable CS of USTC
Performance Criteria for Synch. Ops • Latency (time per op) • especially when light contention • Bandwidth (ops per sec) • especially under high contention • Traffic • load on critical resources • especially on failures under contention • Storage • Fairness CS of USTC
Enhancements to Simple Lock • Reduce frequency of issuing test&sets while waiting • Test&set lock with backoff • Don't back off too much or you will still be backed off when the lock becomes free • Exponential backoff works quite well empirically: delay on the ith attempt = k*c^i • Busy-wait with read operations rather than test&set • Test-and-test&set lock • Keep testing with ordinary loads • cached lock variable will be invalidated when release occurs • When value changes (to 0), try to obtain lock with test&set • only one attemptor will succeed; others will fail and start testing again CS of USTC
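A sketch combining the two enhancements (test-and-test&set plus exponential backoff) in C, with atomic_exchange standing in for the hardware test&set; the backoff constants and names are illustrative.

/* Test-and-test&set lock with exponential backoff. */
#include <stdatomic.h>
#include <sched.h>

#define BACKOFF_BASE 4        /* k: initial delay, in yield iterations */
#define BACKOFF_MULT 2        /* c: growth factor per failed attempt */
#define BACKOFF_MAX  4096     /* cap so waiters are not backed off forever */

static void tts_lock(atomic_int *lock)
{
    unsigned delay = BACKOFF_BASE;
    for (;;) {
        /* Spin with ordinary loads: hits in the local cache until the
         * holder's release invalidates the line. */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;
        /* Lock looks free: try the real test&set. */
        if (atomic_exchange(lock, 1) == 0)
            return;                             /* acquired */
        /* Failed: back off (delay = k * c^i on the ith retry). */
        for (unsigned i = 0; i < delay; i++)
            sched_yield();
        if (delay < BACKOFF_MAX)
            delay *= BACKOFF_MULT;
    }
}

static void tts_unlock(atomic_int *lock)
{
    atomic_store(lock, 0);
}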
Improved Hardware Primitives: LL-SC • Goals: • Test with reads • Failed read-modify-write attempts don’t generate invalidations • Nice if single primitive can implement range of r-m-w operations • Load-Locked (or -linked), Store-Conditional • LL reads variable into register • Follow with arbitrary instructions to manipulate its value • SC tries to store back to location • succeed if and only if no other write to the variable since this processor’s LL • indicated by condition codes; • If SC succeeds, all three steps happened atomically • If fails, doesn’t write or generate invalidations • must retry acquire CS of USTC
Simple Lock with LL-SC
lock:   ll   reg1, location   /* LL location to reg1 */
        bnz  reg1, lock       /* if already locked, try again */
        /* other operations: put the new (locked) value in reg2 */
        sc   location, reg2   /* SC reg2 into location */
        beqz lock             /* if SC failed, start again */
        ret
unlock: st   location, #0     /* write 0 to location */
        ret
• Can do more fancy atomic ops by changing what's between LL & SC • But keep it small so SC is likely to succeed • Don't include instructions that would need to be undone (e.g. stores) • SC can fail (without putting a transaction on the bus) if: • it detects an intervening write even before trying to get the bus • it tries to get the bus but another processor's SC gets the bus first • LL, SC are not lock, unlock respectively • They only guarantee no conflicting write to the lock variable between them • But can be used directly to implement simple operations on shared variables CS of USTC
Trade-offs So Far • Latency? • Bandwidth? • Traffic? • Storage? • Fairness? • What happens when several processors spinning on lock and it is released? • traffic per P lock operations? CS of USTC
Ticket Lock • Only one r-m-w per acquire • Two counters per lock (next_ticket, now_serving) • Acquire: my_ticket = fetch&inc(next_ticket); wait until now_serving == my_ticket • atomic op when arriving at the lock, not when it's free (so less contention) • Release: increment now_serving • Performance • low latency for low contention - if fetch&inc cacheable • O(p) read misses at release, since all spin on same variable • FIFO order • like simple LL-SC lock, but no inval when SC succeeds, and fair • Backoff? • Wouldn't it be nice to poll different locations ... CS of USTC
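A minimal C sketch of the ticket lock, with atomic_fetch_add standing in for the hardware fetch&increment; field and function names are illustrative.

/* Ticket lock: one RMW per acquire, FIFO handoff via now_serving. */
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   /* incremented once per acquire (one RMW) */
    atomic_uint now_serving;   /* plain store on release */
} ticket_lock_t;

static void ticket_acquire(ticket_lock_t *l)
{
    /* One atomic op when arriving at the lock, not when it becomes free. */
    unsigned my_ticket = atomic_fetch_add(&l->next_ticket, 1);
    /* Spin with ordinary loads until it is our turn (FIFO order). */
    while (atomic_load(&l->now_serving) != my_ticket)
        ;
}

static void ticket_release(ticket_lock_t *l)
{
    /* All waiters spin on now_serving, hence O(p) read misses at release. */
    atomic_store(&l->now_serving, atomic_load(&l->now_serving) + 1);
}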
Array-based Queuing Locks • Waiting processes poll on different locations in an array of size p • Acquire • fetch&inc to obtain address on which to spin (next array element) • ensure that these addresses are in different cache lines or memories • Release • set next location in array, thus waking up process spinning on it • O(1) traffic per acquire with coherent caches • FIFO ordering, as in ticket lock, but, O(p) space per lock • Not so great for non-cache-coherent machines with distributed memory • array location I spin on not necessarily in my local memory (solution later) CS of USTC
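A C sketch of an array-based queuing lock in the spirit of the description above; MAX_PROCS, the cache-line padding and all names are assumptions, and at most MAX_PROCS waiters are supported.

/* Array-based queuing lock: each waiter spins on its own cache line. */
#include <stdatomic.h>

#define MAX_PROCS  64
#define CACHE_LINE 64

typedef struct {
    struct { atomic_int can_go; char pad[CACHE_LINE - sizeof(atomic_int)]; }
        slot[MAX_PROCS];              /* one cache line per waiter */
    atomic_uint next_slot;            /* fetch&inc hands out spin locations */
} array_lock_t;

static void array_lock_init(array_lock_t *l)
{
    for (int i = 0; i < MAX_PROCS; i++)
        atomic_store(&l->slot[i].can_go, i == 0);   /* slot 0 starts open */
    atomic_store(&l->next_slot, 0);
}

static unsigned array_acquire(array_lock_t *l)
{
    unsigned me = atomic_fetch_add(&l->next_slot, 1) % MAX_PROCS;
    while (!atomic_load(&l->slot[me].can_go))       /* spin on my own line */
        ;
    atomic_store(&l->slot[me].can_go, 0);           /* reset for reuse */
    return me;                                      /* pass to release */
}

static void array_release(array_lock_t *l, unsigned me)
{
    /* Wake only the next waiter: O(1) coherence traffic per handoff. */
    atomic_store(&l->slot[(me + 1) % MAX_PROCS].can_go, 1);
}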
[Chart: Lock Performance on SGI Challenge for the microbenchmark loop: lock; delay(c); unlock; delay(d); three panels, (a) Null (c = 0, d = 0), (b) Critical-section (c = 3.64 µs, d = 0), (c) Delay (c = 3.64 µs, d = 1.29 µs), each plotting time (µs) against number of processors for array-based, LL-SC, LL-SC with exponential backoff, ticket, and ticket-with-proportional-backoff locks] CS of USTC
The Problems of Lock • Performance • priority inversion • convoying • Productivity • deadlock: not composable • performance vs. ease of use: fine-granulated locks vs. coarse-granulated locks • Key: lock is conservative • Deadlock example: Thread 1: Lock B; Lock A; ... UnLock A; UnLock B. Thread 2: Lock A; Lock B; ... UnLock B; UnLock A CS of USTC
Outline • SMP/CMP & Shared Memory Model • Synchronization: Critical-Section Problem • Lock (Mutex) • Transactional Memory CS of USTC
Concept of Transactional Memory • Borrowed from the database transaction (xact) concept • ACID properties • Atomicity, Consistency, Isolation, Durability • A xact is an atomic group of instructions • All results appear to the system if the xact succeeds • No results appear to the system if the xact fails • Temporary results in different xacts do not interfere with each other • Compare: Lock • Compare: LL-SC inst. CS of USTC
Interaction between xacts in the TCC model
/* transactional version */
for (i = 0; i < 100; i++) {
    xact_begin;
    cnt++;
    xact_commit;
}
/* lock version */
for (i = 0; i < 100; i++) {
    Lock(l_counter);
    cnt++;
    unLock(l_counter);
}
CS of USTC
Nesting Transactions • Flattening • Closed nesting • Partial abort • Open nesting • Partial commit CS of USTC
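For illustration, a software sketch of flattening, the simplest nesting policy: nested begin/commit pairs are absorbed into the outermost transaction via a depth counter. xact_hw_begin/xact_hw_commit are hypothetical stand-ins for the underlying HTM primitives, not a real API.

/* Flattened nesting via a per-thread depth counter. */
static void xact_hw_begin(void)  { /* start hardware transaction (stub) */ }
static void xact_hw_commit(void) { /* commit hardware transaction (stub) */ }

static _Thread_local int xact_depth = 0;

void xact_begin(void)
{
    if (xact_depth++ == 0)
        xact_hw_begin();        /* only the outermost begin starts the HW xact */
}

void xact_commit(void)
{
    if (--xact_depth == 0)
        xact_hw_commit();       /* only the outermost commit makes results visible */
}

/* Under flattening, an abort anywhere rolls back to the outermost xact_begin.
 * Closed nesting would instead checkpoint at each level (partial abort), and
 * open nesting would let an inner commit become globally visible (partial commit). */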
Works on Hardware Transactional Memory • Basic Model • Basic HTM, 1993 Herlihy • SLE/TLR,2002? Rajwar & Goodman • TCC,2004 Lance Hammond • LogTM,2006 K. E. Moore • Virtualization • UTM,2005 C. Scott Ananian • VTM,2005 Rajwar & Herlihy • TTM, 2004 K. E. Moore • LogTM-SE,2007 • And more… CS of USTC
TCC abstract • Lazy version management & conflict detection • Old values stay in place; new values are buffered • Conflicts detected when one of the transactions commits • Bus based • Buffer written addresses in a write buffer • Broadcast modifications on commit • Fast abort by discarding the buffered new values • Ordered transaction support (phase numbers) • Centralized arbitration needed CS of USTC
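A schematic software model of this commit-time (lazy) conflict detection: each transaction records a read-set and a write-set, and a committing transaction's broadcast write-set aborts any concurrent transaction that has read one of those addresses. This is an illustrative model with assumed names, not the TCC hardware.

/* Lazy conflict detection: compare a committer's write-set against other
 * transactions' read-sets. */
#include <stdbool.h>
#include <stddef.h>

#define SET_MAX 256

typedef struct {
    const void *addr[SET_MAX];
    size_t      n;
} addr_set_t;

typedef struct {
    addr_set_t read_set;    /* addresses speculatively loaded (read bits) */
    addr_set_t write_set;   /* addresses buffered in the write buffer */
    bool       aborted;
} xact_t;

static bool set_contains(const addr_set_t *s, const void *a)
{
    for (size_t i = 0; i < s->n; i++)
        if (s->addr[i] == a)
            return true;
    return false;
}

/* Called for every other in-flight transaction when one transaction commits
 * and broadcasts its write-set on the bus. */
static void on_remote_commit(xact_t *me, const addr_set_t *remote_writes)
{
    for (size_t i = 0; i < remote_writes->n; i++)
        if (set_contains(&me->read_set, remote_writes->addr[i]))
            me->aborted = true;   /* we read something the committer overwrote */
}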
TCC Architecture CS of USTC
TCC Programming Model Steps: 1. Divide the program into transactions 2. Specify order - unordered transaction model - ordered transaction model 3. Performance tuning Optimization guidelines: (1) maximize parallelism & minimize data dependencies (2) large transactions are advisable (3) choose small transactions when violations are frequent [Figure: timeline of transactions labeled with phase numbers, contrasting unordered and ordered commit] CS of USTC
Cache Structure extension • read bit: • set on speculative load, checked against remote commits; • on a per-line or per-word basis. • modified bit: • set on speculative store; lines with this flag are discarded on abort • on a per-line basis. • renamed bit: • similar to the modified bit. • on a per-word basis. • Avoids false violations CS of USTC
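As an illustration, a C struct sketch of the per-line metadata such a cache might keep; the field names and the 32-bit word granularity are assumptions, not the actual TCC design.

/* Speculative cache-line metadata: per-line modified bit plus per-word
 * read and renamed bits. */
#include <stdint.h>

#define WORDS_PER_LINE 8

typedef struct {
    uint64_t tag;
    unsigned valid    : 1;
    unsigned modified : 1;                 /* speculatively stored; discard line on abort */
    uint8_t  read[WORDS_PER_LINE];         /* per-word read bits: checked against
                                              remote commits to detect violations */
    uint8_t  renamed[WORDS_PER_LINE];      /* per-word renamed bits: word written before
                                              being read, so a remote commit to it is
                                              not a true violation */
    uint32_t data[WORDS_PER_LINE];
} spec_cache_line_t;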
What if cache is full ? • Speculative lines cannot be replaced during a transaction. • Solution: • (1) move the cache line to a victim buffer. • (2) stall and request commit permission. CS of USTC
Rollback & Commit • How to rollback ? -- Checkpoints • (1) by hardware or by software. • (2) the hardware scheme can be associated with register renaming. • How to commit ? • (1) write buffer: • a separate buffer that buffers all stores • (2) address buffer: • only keeps the tags of the cache lines to be committed. CS of USTC
LogTM abstract • Eager version management & conflict detection • New values stored in place • Conflicts detected on each memory reference • Implemented on a modified directory-based MOESI protocol • Old values logged in memory • Fast local commit; no arbitration needed • Slow local abort: roll back the log entries • Unordered transactions only • No transaction size limit CS of USTC
LogTM Architecture CS of USTC
Transaction Log & Rollback • Log region in memory • Cacheable • Specified by two registers • Add a log entry on the first modification to a block • Check the W bit • Minimizes duplication • Trap to a software handler on rollback • Roll back the log entries in LIFO (last-in, first-out) order. CS of USTC
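A schematic software model of this eager versioning: the first transactional write to a block logs its old value, commit simply discards the log, and abort replays it in LIFO order. Names and structure are illustrative, not the LogTM hardware.

/* Undo log for eager version management. */
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64
#define LOG_MAX    1024

typedef struct {
    void   *addr;                     /* block address */
    uint8_t old_data[BLOCK_SIZE];     /* value before the first write */
} log_entry_t;

typedef struct {
    log_entry_t entry[LOG_MAX];       /* log region lives in cacheable memory; */
    int         top;                  /* in hardware, bounded by two registers */
} undo_log_t;

/* Called on the first transactional write to a block (W bit not yet set). */
static void logtm_log_old_value(undo_log_t *log, void *block)
{
    log_entry_t *e = &log->entry[log->top++];
    e->addr = block;
    memcpy(e->old_data, block, BLOCK_SIZE);   /* the new value then goes in place */
}

static void logtm_commit(undo_log_t *log)
{
    log->top = 0;                             /* fast local commit: drop the log */
}

static void logtm_abort(undo_log_t *log)
{
    while (log->top > 0) {                    /* slow abort: undo in LIFO order */
        log_entry_t *e = &log->entry[--log->top];
        memcpy(e->addr, e->old_data, BLOCK_SIZE);
    }
}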
Problems of LogTM • A transaction conflict may stall the requestor • False sharing cannot be easily eliminated CS of USTC
Conflict detection CS of USTC
Overflow handling • Speculative data can be swapped out like a normal cache line. • How to track the speculation in this situation? • Modified directory protocol • The overflow bit! CS of USTC
Overflow bit CS of USTC
TCC vs. LogTM • TCC: buffers new data locally (old data stays globally available); central arbitration before commit; broadcasts new data on commit; conflicts detected upon commit; ordering specified with phase numbers; transaction size limited by cache size (stall to solve); solves false sharing with per-word masks • LogTM: logs old data in memory (new data visible in place); no arbitration; local commit, rollback on abort; conflicts detected as the transaction runs; deadlock detection needed; only unordered transactions supported; no transaction size limit; false sharing hard to solve CS of USTC