Transactional Memory
Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit
Traditional Software Scaling
[Figure: user code on a traditional uniprocessor speeds up over time with Moore's law: 1.8x, 3.6x, 7x.]
Multicore Software Scaling
[Figure: ideally, the same user code on a multicore scales the same way: 1.8x, 3.6x, 7x.]
Unfortunately, not so simple…
Real-World Multicore Scaling
[Figure: in practice, user code on a multicore achieves only 1.8x, 2x, 2.9x.]
Parallelization and synchronization require great care…
Why? Amdahl's Law: Speedup = 1 / (SequentialPart + ParallelPart/N).
Pay for N = 8 cores with SequentialPart = 25%, and the speedup is only 2.9x!
As the number of cores grows, the effect of the 25% sequential part becomes more acute: 2.3x at 4 cores, 2.9x at 8, 3.4x at 16, 3.7x at 32…
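For concreteness, here is a minimal sketch (not from the slides) that reproduces these numbers; the class and method names are illustrative.

// Minimal sketch (not from the slides): Amdahl's-law speedups for a 25% sequential part.
public class Amdahl {
    // Speedup = 1 / (sequential + parallel / n)
    static double speedup(double sequentialPart, int cores) {
        double parallelPart = 1.0 - sequentialPart;
        return 1.0 / (sequentialPart + parallelPart / cores);
    }

    public static void main(String[] args) {
        for (int n : new int[] {4, 8, 16, 32}) {
            System.out.printf("%d cores: %.1fx%n", n, speedup(0.25, n));
        }
        // Prints roughly 2.3x, 2.9x, 3.4x, 3.7x
    }
}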
The reason we get only 2.9x speedup: the shared data structures.
[Figure: many cores accessing shared data structures; in both the fine-grained and coarse-grained versions, 25% of the data is shared and 75% is unshared.]
Fine-grained parallelism on the shared part has a huge performance benefit over coarse-grained access.
A FIFO Queue
[Figure: a queue holding a, b, c between Head and Tail; Dequeue() => a at the Head, Enqueue(d) at the Tail.]
A Concurrent FIFO Queue
[Figure: thread P calls Dequeue() => a and thread Q calls Enqueue(d), both under a single object lock.]
Simple code, easy to prove correct. But the object lock creates contention and a sequential bottleneck.
Fine-Grained Locks
[Figure: P dequeues a at the Head while Q enqueues d at the Tail under separate fine-grained locks.]
Finer granularity, more complex code. Verification nightmare: worry about deadlock, livelock…
Fine-Grained Locks
[Figure: Q enqueues b while P dequeues a from a queue holding a single item.]
Complex boundary cases: empty queue, last item. Worry about how to acquire multiple locks.
Moreover: Locking Relies on Conventions
• The relation between the lock bit and the object bits exists only in the programmer's mind
• Actual comment from the Linux kernel (hat tip: Bradley Kuszmaul):
/*
 * When a locked buffer is visible to the I/O layer
 * BH_Launder is set. This means before unlocking
 * we must clear BH_Launder, mb() on alpha and then
 * clear BH_Lock, so no reader can see BH_Launder set
 * on an unlocked buffer and then risk to deadlock.
 */
Lock-Free (JDK 1.5+)
[Figure: P dequeues a and Q enqueues d with no locks.]
Even finer granularity, even more complex code. Worry about starvation, subtle bugs, and code that is hard to modify…
Composing Objects
[Figure: thread P dequeues a from Q1 and enqueues it into Q2.]
Complex: moving data atomically between structures is more than twice the worry…
Transactional Memory
[Figure: P dequeues a and Q enqueues d as transactions.]
Great performance, simple code. Don't worry about deadlock, livelock, subtle bugs, etc.…
Promise of Transactional Memory
[Figure: Q enqueues d while P dequeues a near the queue's boundary cases.]
Don't worry about which locks need to cover which variables, and when… TM deals with the boundary cases under the hood.
Composing Objects
[Figure: P atomically dequeues a from Q1 and enqueues it into Q2.]
It becomes easy to modify multiple structures atomically: transactions provide composability…
The Transactional Manifesto
• Current practice is inadequate to meet the multicore challenge
• Research agenda:
• Replace locking with a transactional API
• Design languages to support this model
• Implement the run-time to be fast enough
Transactions
• Atomic
• Commit: takes effect
• Abort: effects rolled back, usually retried
• Serializable
• Appear to happen in one-at-a-time order
Atomic Blocks

atomic {
  x.remove(3);
  y.add(3);
}

atomic {
  y = null;
}

No data race
Designing a FIFO Queue
Write the sequential code first:

public void LeftEnq(Item x) {
  Qnode q = new Qnode(x);
  q.left = this.left;
  this.left.right = q;
  this.left = q;
}
Designing a FIFO Queue
Then enclose it in an atomic block:

public void LeftEnq(Item x) {
  atomic {
    Qnode q = new Qnode(x);
    q.left = this.left;
    this.left.right = q;
    this.left = q;
  }
}
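For reference, a minimal sketch (not from the slides) of the declarations the code above assumes; Qnode, Item, and the field names are illustrative assumptions, and boundary cases (such as an empty queue) are ignored as on the slide.

// Minimal sketch (not from the slides) of the declarations LeftEnq assumes.
// DQueue, Qnode, Item, and the left/right fields are illustrative names.
public class DQueue<Item> {
    static class Qnode<T> {
        T value;
        Qnode<T> left, right;
        Qnode(T value) { this.value = value; }
    }

    Qnode<Item> left;   // leftmost node ('this.left' on the slide)
    Qnode<Item> right;  // rightmost node
}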
Warning
• Not always this simple: conditional waits, enhanced concurrency, complex patterns
• But often it is
• Works for sadistic homework
Composition

public <T> void Transfer(Queue<T> q1, Queue<T> q2) {
  atomic {
    T x = q1.deq();
    q2.enq(x);
  }
}

Trivial or what?
Roll Back

public T LeftDeq() {
  atomic {
    if (this.left == null)
      retry;
    …
  }
}

Roll back the transaction and restart when something changes
OrElse Composition

atomic {
  x = q1.deq();
} orElse {
  x = q2.deq();
}

Run the 1st method. If it retries, run the 2nd method. If that also retries, the entire statement retries.
Transactional Memory
• Software transactional memory (STM)
• Hardware transactional memory (HTM)
• Hybrid transactional memory (HyTM): try in hardware, fall back to software if unsuccessful
Hardware versus Software
• Do we need hardware at all? Analogies:
• Virtual memory: yes!
• Garbage collection: no!
• Probably do need HW for performance
• Do we need software? Policy issues don't make sense for hardware
Transactional Consistency
• Memory transactions are collections of reads and writes executed atomically
• Transactions should maintain internal and external consistency
• External: with respect to the interleavings of other transactions
• Internal: the transaction itself should operate on a consistent state
External Consistency
Invariant: x = 2y.
[Figure: application memory starts with x = 4, y = 2. Transaction A writes x = 8 and y = 4. Transaction B reads x and y and computes z = 1/(x - y) = 1/2, which is consistent with either the old or the new values.]
Simple Lock-Based STM
• STMs come in different forms: lock-based and lock-free
• Here we describe a simple lock-based STM
Synchronization
• The transaction keeps:
• Read set: locations & values read
• Write set: locations & values to be written
• Deferred update: changes installed at commit
• Lazy conflict detection: conflicts detected at commit
(see the sketch below)
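A minimal sketch (not from the slides) of this per-transaction bookkeeping; the types, names, and use of integer addresses are illustrative assumptions.

// Minimal sketch (not from the slides) of a transaction descriptor.
import java.util.HashMap;
import java.util.Map;

class TxDescriptor {
    // Read set: locations and the values read from them.
    final Map<Integer, Long> readSet = new HashMap<>();
    // Write set: locations and the values to be written at commit.
    final Map<Integer, Long> writeSet = new HashMap<>();

    void recordRead(int addr, long valueRead) {
        readSet.put(addr, valueRead);
    }

    void recordWrite(int addr, long newValue) {
        // Deferred update: nothing touches shared memory here; the new value
        // is installed, and conflicts are detected, only at commit time.
        writeSet.put(addr, newValue);
    }

    // A later read of a location this transaction has already written
    // should see its own deferred update first.
    Long lookupPendingWrite(int addr) {
        return writeSet.get(addr);
    }
}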
STM: Transactional Locking
[Figure: application memory is mapped to an array of versioned write-locks, each holding a version number V#.]
(see the sketch below)
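A minimal sketch (not from the slides) of such an array of versioned write-locks. All names are illustrative; a real STM usually hashes many memory locations to one lock, but here the mapping is one-to-one for simplicity. The later sketches below reuse these structures.

// Minimal sketch (not from the slides): versioned write-locks covering memory.
import java.util.concurrent.locks.ReentrantLock;

class VersionedLock {
    final ReentrantLock lock = new ReentrantLock(); // the write-lock
    volatile long version = 0;                      // the V# it protects
}

class StmMemory {
    final long[] memory;          // application memory (one word per slot)
    final VersionedLock[] locks;  // one versioned write-lock per slot

    StmMemory(int size) {
        memory = new long[size];
        locks = new VersionedLock[size];
        for (int i = 0; i < size; i++) {
            locks[i] = new VersionedLock();
        }
    }
}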
Reading an Object
[Figure: a memory word and its versioned lock.]
• If the location is not already locked, put its V# and value in the read set (RS)
To Write an Object
[Figure: a memory word and its versioned lock.]
• Add the V# and the new value to the write set (WS)
To Commit
[Figure: new values X and Y are installed and their version numbers incremented from V# to V#+1.]
• Acquire the write locks
• Check that the V#s in the RS & WS are unchanged
• Install the new values
• Increment the V#s
• Release the locks
Problem: Internal Inconsistency
• A zombie is a currently active transaction that is destined to abort because it saw an inconsistent state
• If zombies see inconsistent states, errors can occur, and the fact that the transaction will eventually abort does not save us
Internal Consistency
Invariant: x = 2y.
[Figure: application memory starts with x = 4, y = 2. Transaction B reads x = 4. Transaction A then writes x = 8 and y = 4, killing B. Zombie B reads y = 4 and computes z = 1/(x - y) = 1/0: a divide-by-zero error.]
Solution: The “Global Clock”
• Have one shared global clock
• Incremented by (a small subset of) writing transactions
• Read by all transactions
• Used to validate that the state being worked on is always consistent
Read-Only Transactions
[Figure: the shared version clock (100) is copied into the private read version (RV); memory words are covered by versioned locks.]
• Copy the shared version clock into RV
• For each location: read the lock & V#, read the memory word, check it is unlocked, recheck the V# is unchanged, and check V# ≤ RV
Reads form a snapshot of memory. No read set!
(see the sketch below)
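A minimal sketch (not from the slides) of such a read-only read, reusing the illustrative StmMemory and VersionedLock structures sketched above; a real implementation typically packs the lock bit and the version into a single word, which this sketch does not.

// Minimal sketch (not from the slides): one read inside a read-only transaction.
import java.util.concurrent.atomic.AtomicLong;

class ReadOnlyTx {
    final StmMemory stm;
    final long rv; // private read version (RV)

    ReadOnlyTx(StmMemory stm, AtomicLong versionClock) {
        this.stm = stm;
        this.rv = versionClock.get(); // copy the shared version clock into RV
    }

    // Returns the value read, or throws to signal abort-and-retry.
    long read(int addr) {
        VersionedLock vl = stm.locks[addr];
        long v1 = vl.version;           // read the V#
        long value = stm.memory[addr];  // read the memory word
        long v2 = vl.version;           // recheck the V#
        // Check unlocked, V# unchanged across the read, and V# <= RV.
        if (vl.lock.isLocked() || v1 != v2 || v2 > rv) {
            throw new IllegalStateException("inconsistent read: abort and retry");
        }
        return value; // all reads form a snapshot of memory; no read set kept
    }
}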
Regular Transactions
[Figure: the shared version clock (100) is copied into the private read version (RV); memory words are covered by versioned locks.]
• Copy the shared version clock into RV
• On each read/write, check: unlocked and V# ≤ RV
• Add the location to the read/write set
Regular Transactions: Commit
[Figure: the shared version clock is fetch-and-incremented from 100 to 101; new values x and y are installed and their locks' V#s set to the new value.]
• Acquire the write locks
• WV = Fetch&Increment(Version Clock)
• Check each read-set V# ≤ RV
• Update memory
• Set the written V#s to WV
(see the sketch below)
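A minimal sketch (not from the slides) of this commit, again reusing the illustrative StmMemory, VersionedLock, and shared version-clock structures from the sketches above; validation of read-set locations locked by other transactions is simplified.

// Minimal sketch (not from the slides): commit of a regular transaction.
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;

class StmCommit {
    static boolean commit(StmMemory stm, AtomicLong versionClock, long rv,
                          Set<Integer> readSet,            // addresses read
                          Map<Integer, Long> writeSet) {   // address -> new value
        // 1. Acquire the write locks (in address order, to avoid deadlock).
        TreeMap<Integer, Long> ordered = new TreeMap<>(writeSet);
        for (int addr : ordered.keySet()) stm.locks[addr].lock.lock();
        try {
            // 2. WV = Fetch&Increment of the shared version clock.
            long wv = versionClock.incrementAndGet();
            // 3. Validate the read set: each V# <= RV and not locked by another tx.
            for (int addr : readSet) {
                VersionedLock vl = stm.locks[addr];
                boolean lockedByOther =
                    vl.lock.isLocked() && !vl.lock.isHeldByCurrentThread();
                if (lockedByOther || vl.version > rv) return false; // abort
            }
            // 4. Update memory, then set the written locations' V#s to WV.
            for (Map.Entry<Integer, Long> e : ordered.entrySet()) {
                stm.memory[e.getKey()] = e.getValue();
                stm.locks[e.getKey()].version = wv;
            }
            return true; // committed
        } finally {
            // 5. Release the write locks.
            for (int addr : ordered.keySet()) stm.locks[addr].lock.unlock();
        }
    }
}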
Hardware Transactional Memory
• Exploit cache coherence, which already almost does it: invalidation, consistency checking
• Speculative execution: branch prediction = optimistic synchronization!
HW Transactional Memory
[Figure: a processor issues a transactional read; the line is cached and marked T, and the transaction is active. Caches are connected to memory over the interconnect.]
Transactional Memory
[Figure: a second processor transactionally reads the same line; both transactions are active, with the line marked T in both caches.]
Transactional Memory
[Figure: one of the two transactions commits while the other remains active; both caches still hold the line marked T.]
Transactional Memory
[Figure: a write marks the cache line D (dirty); one transaction is committed, the other still active.]
Rewind
[Figure: a write (line marked D) conflicts with transactional entries (marked T) in other caches; one transaction is aborted and its speculative state rewound, while the others remain active.]
Transaction Commit
• At the commit point, if there were no cache conflicts, we win
• Mark the transactional entries:
• Read-only: valid
• Modified: dirty (eventually written back)
• That's all, folks! Except for a few details…