This talk explores the challenges and solutions for optimizing memory transactions as an abstraction for shared-memory concurrency. It focuses on the problems with locks, controlling concurrent access, and research challenges in sharing data between concurrent threads.
Optimizing memory transactions
Tim Harris, Maurice Herlihy, Jim Larus, Virendra Marathe, Simon Marlow, Simon Peyton Jones, Mark Plesko, Avi Shinnar, Satnam Singh, David Tarditi
Moore’s law
Chart: transistor counts (1,000 up to 1,000,000,000) from the 4004, 8086, 286, 386 and 486 through the Pentium, Pentium II, Pentium 4 and Itanium 2, 1970–2005, annotated with three eras of how the extra transistors were spent: increases in clock frequency (pipelined execution), increases in ILP (superscalar processors, increasing ILP through speculation, data-parallel extensions), and now parallelism. Source: table from http://www.intel.com/technology/mooreslaw/index.htm
Making use of parallel hardware is hard
• How do we share data between concurrent threads? Programmers aren’t good at thinking about thread interleavings, and locks provide a complicated and non-composable solution
• How do we find things to run in parallel? Server workloads often provide a natural source of parallelism: deal with different clients concurrently. Traditional HPC is reasonably well understood too: problems can be partitioned, or parallel building blocks composed
Both of these are difficult problems; in this talk I’ll focus on the first: sharing data
Overview • Introduction • Problems with locks • Controlling concurrent access • Research challenges • Conclusions
A deceptively simple problem: a double-ended queue (e.g. holding 3, 11, 14, 12, 6, 30, 5, 67)
• PushLeft / PopLeft work on the left-hand end
• PushRight / PopRight work on the right-hand end
• Make it thread-safe by adding a lock? Needlessly serializes operations when the queue is long (sketched below)
• One lock for each end? Much more complex when the queue is short (0, 1, 2 items), because the two ends then interfere
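For concreteness, here is a minimal sketch of the single-lock option (Deque, Node and gate are illustrative names, not from the talk; Item stands for the element type). Note how the one lock serializes PushLeft against PopRight even when the queue is long and the two operations touch disjoint nodes:

class Node {
    internal Item item;
    internal Node prev, next;
    internal Node(Item i) { item = i; }
}

class Deque {
    private object gate = new object();   // one lock guards both ends
    private Node left, right;

    public void PushLeft(Item i) {
        lock (gate) {                     // also excludes PopRight, even though
            Node n = new Node(i);         // on a long queue the two ends touch
            n.next = left;                // disjoint nodes
            if (left == null) right = n; else left.prev = n;
            left = n;
        }
    }

    public Item PopRight() {
        lock (gate) {
            if (right == null) return null;   // empty
            Node n = right;
            right = n.prev;
            if (right == null) left = null; else right.next = null;
            return n.item;
        }
    }
}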
Locks are broken
• Races: due to forgotten locks
• Deadlock: locks acquired in the “wrong” order
• Lost wake-ups: forgotten notify to a condition variable
• Error recovery is tricky: need to restore invariants and release locks in exception handlers
• Tension between simplicity (coarse-grained locking) and scalability (fine-grained locking)
• Lack of progress guarantees
• …but worst of all: locks do not compose
Atomic memory transactions
• To a first approximation, just write the sequential code, and wrap atomic around it
• All-or-nothing semantics: atomic commit
• Atomic block executes in isolation
• Cannot deadlock (there are no locks!)
• Atomicity makes error recovery easy (e.g. an exception thrown inside the Get code)

Item Get() {
  atomic { ... sequential get code ... }
}
Atomic blocks compose (locks do not)

void GetTwo() {
  atomic {
    i1 = Get();
    i2 = Get();
  }
  DoSomething(i1, i2);
}

• Guarantees to get two consecutive items
• Get() is unchanged (the nested atomic is a no-op)
• Cannot be achieved with locks (except by breaking the Get() abstraction)
Blocking: how does Get() wait for data?

Item Get() {
  atomic {
    if (size == 0) {
      retry;
    } else {
      ...remove item from queue...
    }
  }
}

• retry means “abandon execution of the atomic block and re-run it (when there is a chance it’ll complete)”
• No lost wake-ups
• No consequential change to GetTwo(), even though GetTwo must wait for there to be two items in the queue
Choice: waiting for either of two queues (Q1 and Q2 feeding R)

void GetEither() {
  atomic {
    do { i = Q1.Get(); }
    orelse { i = Q2.Get(); }
    R.Put(i);
  }
}

• do {...this...} orelse {...that...} tries to run “this”
• If “this” retries, it runs “that” instead
• If both retry, the do-block retries. GetEither() will thereby wait for there to be an item in either queue
Why is ‘orelse’ left-biased? Blocking v polling:

Item TryGet() {
  atomic {
    do { return Get(); }
    orelse { return null; }
  }
}

• No need to write separate blocking / polling methods
• Can extend to deal with timeouts by using a ‘wakeup’ message on a shared memory buffer, as sketched below
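A minimal sketch of that timeout extension, assuming a hypothetical transactional Cell<bool> and a StartTimer helper whose timer thread atomically sets the flag after the interval (neither is part of the talk’s API):

Item GetWithTimeout(int millis) {
  Cell<bool> timeoutFlag = new Cell<bool>(false);
  StartTimer(millis, timeoutFlag);       // timer thread atomically sets the flag later
  atomic {
    do {
      return Get();                      // first choice: an item arrived
    } orelse {
      if (!timeoutFlag.Value) retry;     // nothing yet: wait for an item OR the flag
      return null;                       // timed out
    }
  }
}

Because the whole do-block retries, the transaction is re-run whenever either the queue or the flag changes, so no separate polling loop is needed.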
Choice: a server waiting on two clients

atomic {
  do {
    WaitForShutdownRequest();
    // Process shutdown request
  } orelse {
    Item i;
    do { i = q1.PopLeft(3); }
    orelse { i = q2.PopLeft(3); }
    // Process item
  }
}

• Unlike WaitAny or Select, we can compose code using retry – e.g. wait for a shutdown request, or for client requests
• q1.PopLeft is tried first – if it gets an item without hitting retry then we’re done
• If q1.PopLeft can’t finish, its side effects are undone and q2.PopLeft is tried instead
Invariants
• Suppose a queue should always have at least one element in it; how can we check this?
• Find all the places where queues are modified and put in a check? The check is duplicated (and possibly forgotten)
• Define ‘NonEmptyQueue’ which wraps the ordinary queue operations? The invariant may be associated with multiple data structures (“queue X contains fewer elements than queue Y”), and what if different queues have different constraints?
Invariants
• Define invariants that must be preserved by successful atomic blocks

Queue NewNeverEmptyQueue(Item i) {
  atomic {
    Queue q = new Queue();
    q.Put(i);
    always { !q.IsEmpty(); }   // computation to check the invariant: must be true when
                               // proposed, and preserved by top-level atomic blocks
    return q;
  }
}

Queue q = NewNeverEmptyQueue(i);
atomic { q.Get(); }            // invariant failure: would leave q empty
Summary so far
With locks, you think about:
• Which lock protects which data? Which data can be mutated, and when, by other threads? Which condition variables must be notified, and when?
• None of this is explicit in the source code

With atomic blocks, you think about:
• Purely sequential reasoning within a block, which is dramatically easier
• What are the invariants (e.g. the tree is balanced)?
• Each atomic block maintains the invariants
• Much easier setting for static analysis tools
Summary so far
• Atomic blocks (atomic, retry, orelse, always) are a real step forward
• It’s like using a high-level language instead of assembly code: whole classes of low-level errors are eliminated
• Not a silver bullet:
  • you can still write buggy programs;
  • concurrent programs are still harder to write than sequential ones;
  • aimed at shared memory
• But the improvement remains substantial
Can they be implemented with acceptable performance?
• Direct-update STM (but see the programming-model implications later; sketched below)
  • Locking allows transactions to make updates in place in the heap
  • Log overwritten values
  • Reads naturally see earlier writes that the transaction has made (using either optimistic or pessimistic concurrency control)
  • Makes successful commit operations faster, at the cost of extra work on contention or when a transaction aborts
• Compiler integration
  • Decompose the transactional memory operations into primitives
  • Expose the primitives to compiler optimization (e.g. to hoist concurrency-control operations out of a loop)
• Runtime system integration
  • Integration with the garbage collector and other runtime system components, to scale to atomic blocks containing 100M accesses
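As a rough sketch (illustrative names, not the actual implementation), the state such a direct-update STM keeps might look like this: each object carries a word that is either a version number or a lock held by a writing transaction, and each transaction carries a read log and an undo log:

using System.Collections.Generic;

// Per-object STM word: either "at version V", or "locked by transaction T,
// which saw version V before locking".
class STMWord {
    internal bool locked;
    internal int version;
    internal Transaction owner;    // valid only when locked
}

class ReadLogEntry { internal object obj; internal int versionSeen; }
class UndoLogEntry { internal object obj; internal int fieldOffset; internal long oldValue; }

class Transaction {
    internal List<ReadLogEntry> readLog = new List<ReadLogEntry>();
    internal List<UndoLogEntry> undoLog = new List<UndoLogEntry>();
    // OpenForRead:   record the object's current version in the read log.
    // OpenForUpdate: lock the object's STM word, remembering the old version.
    // LogOldValue:   copy a field's contents into the undo log before the
    //                in-place write.
    // Commit:        validate the read log, then unlock updated objects with
    //                incremented versions (sketched after the example below).
    // Abort:         replay the undo log, then unlock with the old versions.
}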
Example: contention between transactions

Setup: two objects, c1 (ver = 100, val = 10) and c2 (ver = 200, val = 40). Thread T1 runs:

int t = 0;
atomic {
  t += c1.val;
  t += c2.val;
}

Thread T2 runs:

atomic {
  t = c1.val;
  t++;
  c1.val = t;
}

1. T1 reads from c1: logs that it saw version 100.
2. T2 also reads from c1: logs that it saw version 100.
3. Suppose T1 now reads from c2, seeing it at version 200.
4. Before updating c1, thread T2 must lock it: c1’s version word becomes “locked: T2”, and T2 records the old version number (lock: c1, 100) in its log.
5. Before updating c1.val, T2 must log the data it’s going to overwrite (c1.val = 10); after logging the old value, T2 makes its update in place, setting c1.val = 11.
6. T2’s transaction commits successfully: it checks that the version it locked matches the version it previously read, then unlocks the object, installing the new version number (ver = 101).
7. T1 attempts to commit: it checks that the versions it read are still up-to-date. Object c1 was updated from version 100 to 101, so T1’s transaction is aborted and re-run.
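A minimal sketch of that commit protocol, using the illustrative Transaction structures from earlier (GetSTMWord and Abort are hypothetical helpers; a real implementation would make the version check and unlock atomic, e.g. with compare-and-swap, and here we assume one updated field per object for brevity):

static bool Commit(Transaction tx) {
    // 1. Validate: every object we read must still be at the version we saw,
    //    or be locked by us with that same pre-lock version.
    foreach (ReadLogEntry r in tx.readLog) {
        STMWord w = GetSTMWord(r.obj);      // hypothetical: fetch the object's STM word
        bool ok = (!w.locked && w.version == r.versionSeen)
               || (w.locked && w.owner == tx && w.version == r.versionSeen);
        if (!ok) { Abort(tx); return false; }   // e.g. T1 finding c1 at 101, not 100
    }
    // 2. Publish: unlock each updated object, installing an incremented version
    //    number; the in-place writes become permanent.
    foreach (UndoLogEntry u in tx.undoLog) {
        STMWord w = GetSTMWord(u.obj);
        w.version = w.version + 1;              // e.g. c1: 100 -> 101
        w.owner = null;
        w.locked = false;
    }
    return true;
}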
Compiler integration
• We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL)
  • OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples)
  • OpenForUpdate – before the first time we update an object
  • LogOldValue – before the first time we write to a given field

Source code:
atomic {
  ...
  t += n.value;
  n = n.next;
  ...
}

Basic intermediate code:
OpenForRead(n);
t = n.value;
OpenForRead(n);
n = n.next;

Optimized intermediate code:
OpenForRead(n);
t = n.value;
n = n.next;
Compiler integration – avoiding upgrades

Source code:
atomic {
  ...
  c1.val++;
  ...
}

Compiler’s intermediate code:
OpenForRead(c1);
temp1 = c1.val;
temp1++;
OpenForUpdate(c1);
LogOldValue(&c1.val);
c1.val = temp1;

Optimized intermediate code:
OpenForUpdate(c1);
temp1 = c1.val;
temp1++;
LogOldValue(&c1.val);
c1.val = temp1;
Compiler integration – other optimizations
• Hoist OpenFor* and Log* operations out of loops (sketched below)
• Avoid OpenFor* and Log* operations on objects allocated inside atomic blocks (these objects must be thread-local)
• Move OpenFor* operations from methods to their callers
• Further decompose operations to allow logging-space checks to be combined
• Expose the OpenFor* and Log* implementations to inlining and further optimization
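For example, here is a sketch of the loop-hoisting case on a hypothetical accumulator (acc and a are illustrative; the concurrency-control operations for the array accesses are omitted for brevity):

Source code:
atomic {
  for (int k = 0; k < 10; k++) {
    acc.total += a[k];
  }
}

Naive intermediate code repeats the concurrency-control work on every iteration:
for (int k = 0; k < 10; k++) {
  OpenForUpdate(acc);
  LogOldValue(&acc.total);
  acc.total += a[k];
}

After hoisting (acc is loop-invariant, so opening and logging once suffices):
OpenForUpdate(acc);
LogOldValue(&acc.total);
for (int k = 0; k < 10; k++) {
  acc.total += a[k];
}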
Results: concurrency control overhead

Normalised execution time, single-threaded (chart values):
• Sequential baseline: 1.00x
• Coarse-grained locking: 1.13x
• Direct-update STM + compiler integration: 1.46x
• Direct-update STM: 2.04x
• Fine-grained locking: 2.57x
• Buffer memory accesses, non-blocking commit: 5.69x

Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535. The trade-off: the designs with higher single-threaded overhead are the ones that can scale to multicore.
Results: scalability

Chart: microseconds per operation against #threads, comparing coarse-grained locking, fine-grained locking, buffered-update non-blocking STM, and direct-update STM + compiler integration.
Results: long-running tests

Chart: normalised execution time on the tree, skip, go, merge-sort and xlisp benchmarks, comparing the original application (no tx) against direct-update STM, with compile-time optimizations, and with run-time filtering (off-scale bar labels in the chart: 10.8, 73, 162).
The transactional elephant
How do all these issues fit into the bigger picture (just the trunk, as it were, of the concurrency elephant)?
• Semantics of atomic blocks: progress guarantees, forms of nesting, interaction outside the transaction
• Other applications of transactional memory – e.g. speculation
• Software transactional memory algorithms and APIs
• Compiling and optimizing atomic blocks to STM APIs
• Integration with other resources – transactional I/O etc.
• Hardware transactional memory
• Hardware acceleration of software transactional memory
Semantics of atomic blocks • What does it mean for a program using atomic blocks to be correctly synchronized? • What guarantees are lost / retained in programs which are not correctly synchronized? • What tools and analysis techniques can we develop to identify problems, or to prove correct synchronization?
Tempting goal: “strong” atomicity
• “Atomic means atomic” – as if the block runs with every other thread suspended, whether that thread is inside or outside an atomic block
• Problems:
  • Is it practical to implement?
  • Is it actually useful?

Thread 1: atomic { x++; }
Thread 2: x++;

The atomic action can run at any point during Thread 2’s ‘x++’ (see the sketch below)
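To see why the plain x++ is the awkward case, consider its usual compilation into a read-modify-write sequence (an illustrative decomposition, not from the slides); without strong atomicity, Thread 1’s transaction can commit in the middle of it:

// Thread 2's plain x++ is really three steps:
int r = x;     // read        <- atomic { x++; } may commit here...
r = r + 1;     // modify      <- ...or here...
x = r;         // write back  <- ...overwriting the transaction's update
// One of the two increments is then lost: x ends up x+1, not x+2.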
One approach: single-lock semantics
• Think of atomic blocks as acquiring and releasing a single process-wide lock:

atomic { x++; }   becomes   lock (TxLock) { x++; }
atomic { y++; }   becomes   lock (TxLock) { y++; }

• If the resulting single-lock program is correctly synchronized then so is the original
One approach: single-lock semantics

Thread 1: atomic { x++; }          Thread 2: atomic { x++; }
Correct: consistently protected by atomic blocks

Thread 1: atomic { x++; }          Thread 2: x++;
Not correctly synchronized: race on x

Thread 1: atomic { x++; }          Thread 2: lock (x) { x++; }
Not correctly synchronized: no interaction between “TxLock” and x’s lock

Thread 1: atomic { lock (x) { x++; } }          Thread 2: atomic { lock (x) { x++; } }
Correct: consistently protected by x’s lock
Single-lock semantics & privatization
• This is correctly synchronized under single-lock semantics (initially x = 0, x_shared = true):

Thread 1:
atomic { // Tx1
  x_shared = false;
}
x++;

Thread 2:
atomic { // Tx2
  if (x_shared) {
    x++;
  }
}

• Single-lock semantics => x = 1 or x = 2 after both threads’ actions
• Problems arise both with optimistic concurrency control (whether with in-place or deferred updates) and with non-strict pessimistic concurrency control (see the interleaving sketched below)
• Q: is this a problem in the semantics or in the implementation?
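Why the direct-update STM described earlier can break this (an illustrative interleaving; initially x = 0, x_shared = true):

// 1. Tx2 reads x_shared (true), opens x for update, and writes x = 1 in place.
// 2. Tx1 runs and commits: x_shared = false. Tx2 is now doomed to fail validation.
// 3. Thread 1, believing x is now private, runs its plain x++:
//    it reads the dirty value 1 and writes x = 2.
// 4. Tx2 validates, fails, and rolls back its undo log, restoring x = 0.
// Final state: x = 0 - impossible under single-lock semantics, where x ends up 1 or 2.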
Segregated-heap semantics
• Many problems are simplified if we distinguish tx-accessible and non-tx-accessible data
  • tx-accessible data is always and only accessed within transactions
  • non-tx-accessible data is never accessed within transactions
• This is the model we provide in STM-Haskell
  • A TVar a is a tx-mutable cell holding a value of type ‘a’
  • A computation of type STM a is only allowed to update TVars, not other mutable cells
  • A computation of type IO a can mutate other cells, and can access TVars only by entering atomic blocks
• …but there is a vast annotation or expressiveness burden in trying to control effects to the same extent in other languages
Conclusions • Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures • Only one part of the concurrency landscape • But we believe no one solution will solve all the challenges there • Our experience with compiler integration shows that a pure software implementation can perform well for some workloads • We still need a better understanding of realistic workloads • We need this to inform the design of the semantics of atomic blocks, as well as to guide the optimization of their implementation