
Optimizing memory transactions

This talk explores the challenges of sharing data between concurrent threads. It covers the problems with locks, atomic memory transactions as an alternative way of controlling concurrent access, how software transactional memory can be implemented with acceptable performance, and the remaining research challenges.



  1. Optimizing memory transactions Tim Harris, Maurice Herlihy, Jim Larus, Virendra Marathe, Simon Marlow, Simon Peyton Jones, Mark Plesko, Avi Shinnar, Satnam Singh, David Tarditi

  2. Moore’s law [Chart: transistor counts per chip, from roughly 1,000 in the early 1970s (4004, 8086, 286, 386, 486) to roughly 1,000,000,000 by 2005 (Intel Pentium, Pentium II, Pentium 4, Itanium 2). Parallelism has come from pipelined execution, superscalar processors, increasing ILP through speculation, data-parallel extensions, and increases in clock frequency. Source: table from http://www.intel.com/technology/mooreslaw/index.htm]

  3. [Image slide; source viewed 15 Nov 2006]

  4. Making use of parallel hardware is hard • How do we share data between concurrent threads? Programmers aren’t good at thinking about thread interleavings, and locks provide a complicated and non-composable solution • How do we find things to run in parallel? Traditional HPC is reasonably well understood: problems can be partitioned, or parallel building blocks composed. Server workloads often provide a natural source of parallelism: deal with different clients concurrently • Both of these are difficult problems; in this talk I’ll focus on the first

  5. Overview • Introduction • Problems with locks • Controlling concurrent access • Research challenges • Conclusions

  6. A deceptively simple problem: a double-ended queue (e.g. holding 3 11 14 12 6 30 5 67) • PushLeft / PopLeft work on the left-hand end • PushRight / PopRight work on the right-hand end • Make it thread-safe by adding a single lock? Needlessly serializes operations when the queue is long • One lock for each end? Much more complex when the queue is short (0, 1, 2 items)

  7. Locks are broken • Races: due to forgotten locks • Deadlock: locks acquired in “wrong” order. • Lost wakeups: forgotten notify to condition variable • Error recovery tricky: need to restore invariants and release locks in exception handlers • Simplicity vs scalability tension • Lack of progress guarantees • …but worst of all… locks do not compose

  8. Atomic memory transactions • To a first approximation, just write the sequential code, and wrap atomic around it • All-or-nothing semantics: atomic commit • Atomic block executes in isolation • Cannot deadlock (there are no locks!) • Atomicity makes error recovery easy (e.g. exception thrown inside the Get code) Item Get() { atomic { ... sequential get code ... } }
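
  For concreteness, here is what the same idea looks like in GHC’s STM-Haskell (a system built by several of the same authors), using a deliberately naive TVar-list queue of our own invention as a stand-in for the talk’s Queue type — a sketch, not the talk’s implementation:

    import Control.Concurrent.STM

    -- A naive queue: a TVar holding a list. getSTM is the plain sequential
    -- code; 'atomically' is the atomic wrapper around it. (Blocking on an
    -- empty queue is handled two slides later, via retry.)
    type Queue a = TVar [a]

    getSTM :: Queue a -> STM a
    getSTM q = do
      xs <- readTVar q
      case xs of
        x:rest -> do writeTVar q rest; return x
        []     -> error "empty queue (see the retry slide)"

    get :: Queue a -> IO a
    get = atomically . getSTM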

  9. Atomic blocks compose (locks do not) • Guarantees to get two consecutive items • Get() is unchanged (the nested atomic is a no-op) • Cannot be achieved with locks (except by breaking the Get() abstraction) void GetTwo() { atomic { i1 = Get(); i2 = Get(); } DoSomething( i1, i2 ); }
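
  Continuing the sketch: composition falls out because getSTM is an STM action, so the two gets can be wrapped in a single atomically; this corresponds to the outer atomic block, the nested one being a no-op:

    -- Guarantees two consecutive items: both gets commit together or not at all.
    getTwo :: Queue a -> IO (a, a)
    getTwo q = atomically $ do
      i1 <- getSTM q
      i2 <- getSTM q
      return (i1, i2)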

  10. Blocking: how does Get() wait for data? • retry means “abandon execution of the atomic block and re-run it (when there is a chance it’ll complete)” • No lost wake-ups • No consequential change to GetTwo(), even though GetTwo must wait for there to be two items in the queue Item Get() { atomic { if (size==0) { retry; } else { ...remove item from queue... } } }
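
  In the Haskell sketch this is GHC’s actual retry; replacing the error case makes getSTM block until data arrives:

    -- Blocking get: retry abandons the transaction and re-runs it; the
    -- runtime wakes it only once q has been written by another thread.
    getBlocking :: Queue a -> STM a
    getBlocking q = do
      xs <- readTVar q
      case xs of
        x:rest -> do writeTVar q rest; return x
        []     -> retry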

  11. Choice: waiting for either of two queues [Diagram: items flow from queues Q1 and Q2 into R] void GetEither() { atomic { do { i = Q1.Get(); } orelse { i = Q2.Get(); } R.Put( i ); } } • do {...this...} orelse {...that...} tries to run “this” • If “this” retries, it runs “that” instead • If both retry, the do-block retries. GetEither() will thereby wait for there to be an item in either queue
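
  The Haskell analogue uses GHC’s orElse directly (putting to R is modelled as consing onto our naive queue):

    -- Wait for an item on either queue, then put it on r. If both gets
    -- retry, the whole transaction blocks until one queue is written.
    getEither :: Queue a -> Queue a -> Queue a -> IO ()
    getEither q1 q2 r = atomically $ do
      i <- getBlocking q1 `orElse` getBlocking q2
      modifyTVar' r (i:)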

  12. Why is ‘orelse’ left-biased? Blocking vs. polling: • No need to write separate blocking / polling methods • Can extend to deal with timeouts by using a ‘wakeup’ message on a shared memory buffer Item TryGet() { atomic { do { return Get(); } orelse { return null; } } }
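
  Left bias is what turns a blocking method into a polling one for free; in the Haskell sketch:

    -- A try-get derived from the blocking get: the right branch only runs
    -- if the left one retries, so a non-empty queue is never misreported.
    tryGet :: Queue a -> IO (Maybe a)
    tryGet q = atomically $ (Just <$> getBlocking q) `orElse` return Nothing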

  13. Choice: a server waiting on two clients

  atomic {
    Item i;
    do { i = q1.PopLeft(3); } orelse { i = q2.PopLeft(3); }
    // Process item
  }

  atomic {
    do {
      WaitForShutdownRequest();
      // Process shutdown request
    } orelse {
      …
    }
  }

  Unlike WaitAny or Select we can compose code using retry – e.g. wait for a shutdown request, or for client requests. Try q1.PopLeft first – if it gets an item without hitting retry then we’re done; if q1.PopLeft can’t finish, its side effects are undone and q2.PopLeft is tried instead.

  14. Invariants • Suppose a queue should always have at least one element in it; how can we check this? • Find all the places where queues are modified and put in a check? The check is duplicated (and possibly forgotten) • Define ‘NonEmptyQueue’ which wraps the ordinary queue operations? The invariant may be associated with multiple data structures (“queue X contains fewer elements than queue Y”), and what if different queues have different constraints?

  15. Invariants • Define invariants that must be preserved by successful atomic blocks:

  Queue NewNeverEmptyQueue(Item i) {
    atomic {
      Queue q = new Queue();
      q.Put(i);
      always { !q.IsEmpty(); }  // computation to check the invariant: must be true
      return q;                 // when proposed, and preserved by top-level atomic blocks
    }
  }

  Queue q = NewNeverEmptyQueue(i);
  atomic { q.Get(); }           // invariant failure: would leave q empty
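
  GHC’s STM offered a dynamically-checked invariant mechanism along these lines for a while (since removed). As a hedged approximation, continuing the Haskell sketch, the constructor below validates the invariant itself before committing; unlike ‘always’, it does not re-check in later atomic blocks, and the names are ours:

    import Control.Monad (when)

    -- Hypothetical sketch: check the invariant inside the constructor's
    -- transaction, aborting via an STM exception if it is violated.
    newNeverEmptyQueue :: a -> IO (Queue a)
    newNeverEmptyQueue i = atomically $ do
      q <- newTVar [i]
      xs <- readTVar q
      when (null xs) (throwSTM (userError "invariant: queue must not be empty"))
      return q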

  16. Summary so far With locks, you think about: • Which lock protects which data? What data can be mutated when by other threads? Which condition variables must be notified when? • None of this is explicit in the source code With atomic blocks you think about • Purely sequential reasoning within a block, which is dramatically easier • What are the invariants (e.g. the tree is balanced)? • Each atomic block maintains the invariants • Much easier setting for static analysis tools

  17. Summary so far • Atomic blocks (atomic, retry, orElse, always) are a real step forward • It’s like using a high-level language instead of assembly code: whole classes of low-level errors are eliminated • Not a silver bullet: • you can still write buggy programs; • concurrent programs are still harder to write than sequential ones; • aimed at shared memory • But the improvement remains substantial

  18. Can they be implemented with acceptable performance? • Direct-update STM (but: see programming model implications later) • Locking to allow transactions to make updates in place in the heap • Log overwritten values • Reads naturally see earlier writes that the transaction has made (using either optimistic or pessimistic concurrency control) • Makes successful commit operations faster at the cost of extra work on contention or when a transaction aborts • Compiler integration • Decompose the transactional memory operations into primitives • Expose the primitives to compiler optimization (e.g. to hoist concurrency control operations out of a loop) • Runtime system integration • Integration with the garbage collector or runtime system components to scale to atomic blocks containing 100M accesses

  19–26. Example: contention between transactions

  Thread T1:                    Thread T2:
    int t = 0;                    atomic {
    atomic {                        t = c1.val;
      t += c1.val;                  t++;
      t += c2.val;                  c1.val = t;
    }                             }

  Initially object c1 has ver = 100, val = 10 and object c2 has ver = 200, val = 40.

  • T1 reads from c1: logs that it saw version 100 (T1’s log: c1.ver=100)
  • T2 also reads from c1: logs that it saw version 100 (T2’s log: c1.ver=100)
  • Suppose T1 now reads from c2, seeing it at version 200 (T1’s log: c1.ver=100, c2.ver=200)
  • Before updating c1, thread T2 must lock it, recording the old version number (T2’s log: lock: c1, 100); c1 is now marked locked:T2
  • Before updating c1.val, T2 must log the data it is going to overwrite (T2’s log: …, c1.val=10); after logging the old value, T2 makes its update in place, setting c1.val = 11
  • T2 commits: it checks that the version it locked matches the version it previously read, then unlocks the object, installing the new version number (c1.ver = 101)
  • T1 attempts to commit: it checks that the versions it read are still up-to-date. Object c1 was updated from version 100 to 101, so T1’s transaction is aborted and re-run
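
  To make the reader’s commit-time check concrete, here is a minimal Haskell sketch of version validation, under the simplifying assumption of per-object version fields as in the example; Obj, ReadLog and validate are our names, not the system’s API, and locking is elided:

    import Data.IORef

    -- One heap object: a version stamp plus a payload.
    data Obj a = Obj { ver :: IORef Int, val :: IORef a }

    -- A read log: each object the transaction read, with the version it saw.
    type ReadLog a = [(Obj a, Int)]

    -- At commit, re-check every logged version; any mismatch means some
    -- transaction (like T2 above) committed an update, so we must abort.
    validate :: ReadLog a -> IO Bool
    validate entries = and <$> mapM stillCurrent entries
      where stillCurrent (obj, seen) = (== seen) <$> readIORef (ver obj)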

  27. Compiler integration • We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL) • OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples) • OpenForUpdate – before the first time we update an object • LogOldValue – before the first time we write to a given field

  Source code:
    atomic { … t += n.value; n = n.next; … }
  Basic intermediate code:
    OpenForRead(n); t = n.value;
    OpenForRead(n); n = n.next;
  Optimized intermediate code:
    OpenForRead(n); t = n.value; n = n.next;

  28. Compiler integration – avoiding upgrades

  Source code:
    atomic { … c1.val++; … }
  Compiler’s intermediate code:
    OpenForRead(c1); temp1 = c1.val; temp1++;
    OpenForUpdate(c1); LogOldValue(&c1.val); c1.val = temp1;
  Optimized intermediate code (open for update once; no read-to-update upgrade):
    OpenForUpdate(c1); temp1 = c1.val; temp1++;
    LogOldValue(&c1.val); c1.val = temp1;

  29. Compiler integration – other optimizations • Hoist OpenFor* and Log* operations out from loops • Avoid OpenFor* and Log* operations on objects allocated inside atomic blocks (these objects must be thread local) • Move OpenFor* operations from methods to their callers • Further decompose operations to allow logging-space checks to be combined • Expose OpenFor* and Log*’s implementation to inlining and further optimization

  30. Results: concurrency control overhead [Chart: normalised execution time; workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535] • Sequential baseline: 1.00x • Coarse-grained locking: 1.13x • Direct-update STM + compiler integration: 1.46x • Direct-update STM: 2.04x • Fine-grained locking: 2.57x • Buffer memory accesses, non-blocking commit: 5.69x — the fine-grained locking and STM designs pay these single-thread overheads but are the ones scalable to multicore

  31. Results: scalability [Graph: microseconds per operation vs. #threads, comparing coarse-grained locking, fine-grained locking, buffered-update non-blocking STM, and direct-update STM + compiler integration]

  32. Results: long-running tests [Chart: normalised execution time on the tree, skip, go, merge-sort and xlisp benchmarks, comparing the original application (no tx) against direct-update STM, with compile-time optimizations, and with run-time filtering]

  33. Research challenges

  34. The transactional elephant How do all these issues fit into the bigger picture (just the trunk, as it were, of the concurrency elephant)? • Semantics of atomic blocks, progress guarantees, forms of nesting, interaction outside the transaction • Other applications of transactional memory – e.g. speculation • Software transactional memory algorithms and APIs • Compiling and optimizing atomic blocks to STM APIs • Integration with other resources – transactional IO etc. • Hardware transactional memory • Hardware acceleration of software transactional memory

  35. Semantics of atomic blocks • What does it mean for a program using atomic blocks to be correctly synchronized? • What guarantees are lost / retained in programs which are not correctly synchronized? • What tools and analysis techniques can we develop to identify problems, or to prove correct synchronization?

  36. Tempting goal: “strong” atomicity • “Atomic means atomic” – as if the block runs with all other threads suspended, including code outside any atomic block • Problems: • Is it practical to implement? • Is it actually useful?

  Thread 1: atomic { x++; }    Thread 2: x++;

  The atomic action can run at any point during Thread 2’s ‘x++’

  37. One approach: single lock semantics • Think of atomic blocks as acquiring & releasing a single process-wide lock:

  atomic { x++; }   =>   lock (TxLock) { x++; }
  atomic { y++; }   =>   lock (TxLock) { y++; }

  • If the resulting single-lock program is correctly synchronized then so is the original

  38. One approach: single lock semantics

  Thread 1: atomic { x++; }               Thread 2: atomic { x++; }
    Correct: consistently protected by atomic blocks

  Thread 1: atomic { x++; }               Thread 2: x++;
    Not correctly synchronized: race on x

  Thread 1: atomic { x++; }               Thread 2: lock (x) { x++; }
    Not correctly synchronized: no interaction between “TxLock” and x’s lock

  Thread 1: atomic { lock (x) { x++; } }  Thread 2: lock (x) { x++; }
    Correct: consistently protected by x’s lock

  39. Single lock semantics & privatization • This is correctly synchronized under single-lock semantics (initially x=0, x_shared=true):

  // Tx1                       // Tx2
  atomic {                     atomic {
    x_shared = false;            if (x_shared) { x++; }
  }                            }
  x++;

  • Single-lock semantics => x=1 or x=2 after both threads’ actions • Problems arise both with optimistic concurrency control (either in-place or deferred updates) and with non-strict pessimistic concurrency control • Q: is this a problem in the semantics or in the implementation?

  40. Segregated heap semantics • Many problems are simplified if we distinguish tx- and non-tx-accessible data • tx-accessible data is always and only accessed within transactions • non-tx-accessible data is never accessed within transactions • This is the model we provide in STM-Haskell • A TVar a is a tx-mutable cell holding a value of type ‘a’ • A computation of type STM a is only allowed to update TVars, not other mutable cells • A computation of type IO a can mutate other cells, and can touch TVars only by entering atomic blocks • …but there is a vast annotation or expressiveness burden in trying to control effects to the same extent in other languages
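
  To make that model concrete, here is a small runnable STM-Haskell example; ‘transfer’ and the variable names are ours, but the types (TVar, STM, IO, atomically) are exactly the segregation the slide describes:

    import Control.Concurrent.STM

    -- STM actions may only read and write TVars; IO effects are ruled
    -- out by the type of the STM monad.
    transfer :: TVar Int -> TVar Int -> Int -> STM ()
    transfer from to n = do
      modifyTVar' from (subtract n)
      modifyTVar' to   (+ n)

    main :: IO ()
    main = do
      a <- newTVarIO 10
      b <- newTVarIO 0
      atomically (transfer a b 3)  -- IO code reaches TVars only via atomically
      print =<< readTVarIO a       -- prints 7
      -- print =<< readTVar a      -- type error: readTVar is STM, not IO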

  41. Conclusions

  42. Conclusions • Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures • Only one part of the concurrency landscape • But we believe no one solution will solve all the challenges there • Our experience with compiler integration shows that a pure software implementation can perform well for some workloads • We still need a better understanding of realistic workloads • We need this to inform the design of the semantics of atomic blocks, as well as to guide the optimization of their implementation
