
Optimizing memory transactions






Presentation Transcript


  1. Optimizing memory transactions Tim Harris, Mark Plesko, Avi Shinnar, David Tarditi

  2. The Big Question: are atomic blocks feasible? • Atomic blocks may be great for the programmer; but can they be implemented with acceptable performance? • At first, atomic blocks look insanely expensive. A recent implementation (Harris+Fraser, OOPSLA ’03): • Every load and store instruction logs information into a thread-local log • A store instruction writes the log only • A load instruction consults the log first • At the end of the block: validate the log; and atomically commit it to shared memory • Assumptions throughout this talk: • Reads outnumber writes (3:1 or more) • Conflicts are rare
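The Harris+Fraser scheme above can be sketched in a few lines of Python (a standalone illustration, not their actual implementation; names like `TxLog` are invented): every store writes only the thread-local log, every load consults the log first, and commit validates the versions read before writing back.

```python
class TxLog:
    """Sketch of a deferred-update (write-log) transaction.
    Writes are buffered in the log; reads check the log before the heap."""
    def __init__(self, heap, versions):
        self.heap = heap            # shared dict: location -> value
        self.versions = versions    # shared dict: location -> version number
        self.writes = {}            # location -> buffered value
        self.reads = {}             # location -> version seen at first read

    def load(self, addr):
        if addr in self.writes:             # a store writes the log only,
            return self.writes[addr]        # so loads consult the log first
        self.reads.setdefault(addr, self.versions[addr])
        return self.heap[addr]

    def store(self, addr, value):
        self.writes[addr] = value           # heap untouched until commit

    def commit(self):
        # Validate: every location read must still be at the version we saw.
        for addr, ver in self.reads.items():
            if self.versions[addr] != ver:
                return False                # conflict: abort, caller re-runs
        for addr, value in self.writes.items():
            self.heap[addr] = value
            self.versions[addr] += 1
        return True

heap = {"c1": 10}
versions = {"c1": 100}
tx = TxLog(heap, versions)
tx.store("c1", tx.load("c1") + 1)
assert tx.commit() and heap["c1"] == 11
```

The sketch elides the locking that makes validation and write-back atomic with respect to other threads; it only shows why commit must search the read set, which is the cost the direct-update design attacks.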

  3. State of the art ~2003 – normalised execution time: sequential baseline (1.00x), coarse-grained locking (1.13x), fine-grained locking (2.57x), Harris+Fraser WSTM (5.69x). Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535

  4. Our new techniques prototyped in Bartok • Direct-update STM • Allow transactions to make updates in place in the heap • Avoids reads needing to search the log to see earlier writes that the transaction has made • Makes successful commit operations faster at the cost of extra work on contention or when a transaction aborts • Compiler integration • Decompose the transactional memory operations into primitives • Expose the primitives to compiler optimization (e.g. to hoist concurrency control operations out of a loop) • Runtime system integration • Integration with the garbage collector or runtime system components to scale to atomic blocks containing 100M memory accesses

  5. Results: concurrency control overhead – normalised execution time: sequential baseline (1.00x), coarse-grained locking (1.13x), direct-update STM + compiler integration (1.46x), direct-update STM (2.04x), fine-grained locking (2.57x), Harris+Fraser WSTM (5.69x). Workload: operations on a red-black tree, 1 thread, 6:1:1 lookup:insert:delete mix with keys 0..65535. Scalable to multicore.

  6. Results: scalability – microseconds per operation vs #threads for coarse-grained locking, fine-grained locking, WSTM (atomic blocks), DSTM (API), OSTM (API), and direct-update STM + compiler integration

  7. Direct update STM • Augment objects with (i) a lock, (ii) a version number • Transactional write: • Lock objects before they are written to (abort if another thread holds that lock) • Log the overwritten data – we need it to restore the heap in case of retry, transaction abort, or a conflict with a concurrent thread • Make the update in place to the object • Transactional read: • Log the object’s version number • Read from the object itself • Commit: • Check the version numbers of the objects we’ve read • Increment the version numbers of the objects we’ve written, unlocking them
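The write/read/commit operations on this slide can be sketched as follows (an illustrative Python model, not Bartok's implementation; the class and method names are invented):

```python
class ConflictError(Exception):
    pass

class Obj:
    """Heap object augmented with a lock (owner) and a version number."""
    def __init__(self, val, ver=1):
        self.val, self.ver, self.owner = val, ver, None

class DirectUpdateTx:
    def __init__(self):
        self.read_log = []   # (obj, version seen at first read)
        self.undo_log = []   # (obj, overwritten value) - to restore the heap
        self.locked = []

    def open_for_read(self, obj):
        self.read_log.append((obj, obj.ver))  # log the version number
        return obj.val                        # read from the object itself

    def open_for_update(self, obj, new_val):
        if obj.owner not in (None, self):     # another thread holds the lock
            self.abort()
            raise ConflictError()
        if obj.owner is None:
            obj.owner = self                  # lock before writing
            self.locked.append(obj)
        self.undo_log.append((obj, obj.val))  # log the overwritten data
        obj.val = new_val                     # make the update in place

    def commit(self):
        for obj, seen in self.read_log:       # validate versions we read
            if obj.ver != seen:
                self.abort()
                return False
        for obj in self.locked:               # install new versions, unlock
            obj.ver += 1
            obj.owner = None
        return True

    def abort(self):
        for obj, old in reversed(self.undo_log):
            obj.val = old                     # restore the heap
        for obj in self.locked:
            obj.owner = None
        self.read_log, self.undo_log, self.locked = [], [], []

# One successful increment, as in T2's transaction on the next slides.
c1 = Obj(10, ver=100)
tx = DirectUpdateTx()
tx.open_for_update(c1, tx.open_for_read(c1) + 1)
assert tx.commit() and c1.val == 11 and c1.ver == 101
```

Note how commit does no log search on the read path: reads went straight to the object, so a successful commit is just validation plus unlocking, with the extra work (undo) paid only on abort.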

  8. Thread T2 Thread T1 c1 c2 atomic { t = c1.val; t ++; c1.val = t; } int t = 0; atomic { t += c1.val; t += c2.val; } ver = 100 ver = 200 val = 10 val = 40 Example: contention between transactions T1’s log: T2’s log:

  9. Thread T2 Thread T1 c1 c2 atomic { t = c1.val; t ++; c1.val = t; } int t = 0; atomic { t += c1.val; t += c2.val; } ver = 100 ver = 200 val = 10 val = 40 Example: contention between transactions T1’s log: c1.ver=100 T2’s log: (empty) T1 reads from c1: logs that it saw version 100

  10. Thread T2 Thread T1 c1 c2 atomic { t = c1.val; t ++; c1.val = t; } int t = 0; atomic { t += c1.val; t += c2.val; } ver = 100 ver = 200 val = 10 val = 40 Example: contention between transactions T1’s log: c1.ver=100 T2’s log: c1.ver=100 T2 also reads from c1: logs that it saw version 100

  11. Thread T2 Thread T1 c1 c2 atomic { t = c1.val; t ++; c1.val = t; } int t = 0; atomic { t += c1.val; t += c2.val; } ver = 100 ver = 200 val = 10 val = 40 Example: contention between transactions T1’s log: c1.ver=100 c2.ver=200 T2’s log: c1.ver=100 Suppose T1 now reads from c2, sees it at version 200

  12. Thread T2 Thread T1 c1 c2 atomic { t = c1.val; t ++; c1.val = t; } int t = 0; atomic { t += c1.val; t += c2.val; } locked:T2 ver = 200 val = 10 val = 40 Example: contention between transactions T1’s log: c1.ver=100 c2.ver=200 T2’s log: c1.ver=100 lock: c1, 100 Before updating c1, thread T2 must lock it: record old version number

  13. Thread T2 Thread T1 c1 c2 atomic { t = c1.val; t ++; c1.val = t; } int t = 0; atomic { t += c1.val; t += c2.val; } locked:T2 ver = 200 val = 11 val = 40 Example: contention between transactions (1) Before updating c1.val, thread T2 must log the data it’s going to overwrite (2) After logging the old value, T2 makes its update in place to c1 T1’s log: c1.ver=100 c2.ver=200 T2’s log: c1.ver=100 lock: c1, 100 c1.val=10

  14. Thread T2 Thread T1 c1 c2 atomic { t = c1.val; t ++; c1.val = t; } int t = 0; atomic { t += c1.val; t += c2.val; } ver=101 ver = 200 val = 11 val = 40 Example: contention between transactions (1) Check that the version we locked matches the version we previously read (2) T2’s transaction commits successfully. Unlock the object, installing the new version number T1’s log: c1.ver=100 c2.ver=200 T2’s log: c1.ver=100 lock: c1, 100 c1.val=10

  15. Thread T2 Thread T1 c1 c2 atomic { t = c1.val; t ++; c1.val = t; } int t = 0; atomic { t += c1.val; t += c2.val; } ver=101 ver = 200 val = 11 val = 40 Example: contention between transactions T1’s log: c1.ver=100 c2.ver=200 T2’s log: (empty after commit) (1) T1 attempts to commit. Check the versions it read are still up-to-date. (2) Object c1 was updated from version 100 to 101, so T1’s transaction is aborted and re-run.
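The interleaving on slides 8–15 can be replayed with plain version-number bookkeeping (a standalone sketch; the dicts stand in for the object headers):

```python
# Replay of slides 8-15: T1 reads c1 and c2; T2 increments c1 and
# commits first, so T1's validation fails and T1 must re-run.
c1 = {"val": 10, "ver": 100}
c2 = {"val": 40, "ver": 200}

# T1 reads both counters, logging the versions it saw.
t1_read_log = [(c1, c1["ver"]), (c2, c2["ver"])]   # versions 100 and 200
t1_sum = c1["val"] + c2["val"]

# T2 reads c1, locks it, logs the overwritten data, updates in place:
t2_read_log = [(c1, c1["ver"])]
t2_undo = (c1, c1["val"])      # old value 10, kept in case of rollback
c1["val"] += 1                 # direct update in place
c1["ver"] += 1                 # commit: install version 101, unlock

# T1 now tries to commit: the version check on c1 fails (100 != 101).
t1_valid = all(obj["ver"] == ver for obj, ver in t1_read_log)
assert not t1_valid            # T1 is aborted and re-run

# The re-run of T1 sees T2's committed update.
t1_sum = c1["val"] + c2["val"]
assert t1_sum == 51
```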

  16. Compiler integration • We expose decomposed log-writing operations in the compiler’s internal intermediate code (no change to MSIL) • OpenForRead – before the first time we read from an object (e.g. c1 or c2 in the examples) • OpenForUpdate – before the first time we update an object • LogOldValue – before the first time we write to a given field Source code: atomic { … t += n.value; n = n.next; … } Basic intermediate code: OpenForRead(n); t = n.value; OpenForRead(n); n = n.next; Optimized intermediate code: OpenForRead(n); t = n.value; n = n.next;

  17. Compiler integration – avoiding upgrades Source code: atomic { … c1.val ++; … } Compiler’s intermediate code: OpenForRead(c1); temp1 = c1.val; temp1 ++; OpenForUpdate(c1); LogOldValue(&c1.val); c1.val = temp1; Optimized intermediate code: OpenForUpdate(c1); temp1 = c1.val; temp1 ++; LogOldValue(&c1.val); c1.val = temp1;
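Both rewrites shown on slides 16–17 – dropping a redundant OpenForRead of an already-open object, and promoting a read that is later upgraded straight to OpenForUpdate – can be illustrated with a toy pass over a straight-line list of decomposed operations (the op names Load/Store are invented stand-ins for the field accesses; a real pass must also handle control flow):

```python
def optimize(ops):
    """Toy peephole pass over straight-line decomposed STM ops, given as
    (kind, obj) pairs. Drops repeated opens of an object that is already
    open, and promotes an OpenForRead directly to OpenForUpdate when the
    object is updated later in the block, avoiding a read-to-update
    upgrade."""
    will_update = {obj for kind, obj in ops if kind == "OpenForUpdate"}
    opened = {}                  # obj -> "read" or "update"
    out = []
    for kind, obj in ops:
        if kind == "OpenForRead":
            if obj in opened:
                continue                 # already open: redundant
            if obj in will_update:
                out.append(("OpenForUpdate", obj))  # avoid a later upgrade
                opened[obj] = "update"
            else:
                out.append(("OpenForRead", obj))
                opened[obj] = "read"
        elif kind == "OpenForUpdate":
            if opened.get(obj) == "update":
                continue                 # already open for update
            out.append((kind, obj))
            opened[obj] = "update"
        else:
            out.append((kind, obj))
    return out

# The c1.val++ example from slide 17:
ops = [("OpenForRead", "c1"), ("Load", "c1"),
       ("OpenForUpdate", "c1"), ("LogOldValue", "c1"), ("Store", "c1")]
```

Running the pass on `ops` leaves a single OpenForUpdate followed by the load, log, and store, matching the optimized intermediate code on the slide.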

  18. Compiler integration – other optimizations • Hoist OpenFor* and Log* operations out from loops • Avoid OpenFor* and Log* operations on objects allocated inside atomic blocks (these objects must be thread local) • Move OpenFor* operations from methods to their callers • Further decompose operations to allow logging-space checks to be combined • Expose OpenFor* and Log*’s implementation to inlining and further optimization

  19. What about… version wrap-around • Timeline: T1 opens obj1 for read, sees version 17 … meanwhile other transactions repeatedly open obj1 for update and commit (version 18, 19, …) until the version number wraps around … a later commit brings obj1 back to version 17 – oops: T1’s stale read-log entry now validates • Solution: validate read log at each GC, force GC at least once every #versions transactions
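The hazard is easy to reproduce with a tiny version space (illustrative only; real version fields have many more bits, but the arithmetic is the same):

```python
# Wrap-around hazard with a 2-bit version field.
N = 4                      # version numbers wrap mod N
ver = 1

read_log_ver = ver         # a long-running reader logs version 1

for _ in range(N):         # N committed updates elsewhere...
    ver = (ver + 1) % N    # ...bring the counter back to 1

# The stale reader validates successfully even though the object
# changed N times - exactly the "oops" on slide 19. Validating read
# logs at each GC, and forcing a GC at least once every N transactions,
# ensures no log entry can survive a full wrap.
assert ver == read_log_ver
```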

  20. Runtime integration – garbage collection 1. GC runs while some threads are in atomic blocks atomic { … } obj1.field = old obj2.ver = 100 obj3.locked @ 100

  21. Runtime integration – garbage collection 1. GC runs while some threads are in atomic blocks atomic { … } obj1.field = old obj2.ver = 100 obj3.locked @ 100 2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed

  22. Runtime integration – garbage collection 1. GC runs while some threads are in atomic blocks atomic { … } obj1.field = old obj2.ver = 100 obj3.locked @ 100 2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed 3. GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back

  23. Runtime integration – garbage collection 1. GC runs while some threads are in atomic blocks atomic { … } obj1.field = old obj2.ver = 100 obj3.locked @ 100 2. GC visits the heap as normal – retaining objects that are needed if the blocks succeed 3. GC visits objects reachable from refs overwritten in LogForUndo entries – retaining objects needed if any block rolls back 4. Discard log entries for unreachable objects: they’re dead whether or not the block succeeds

  24. Results: long running tests – normalised execution time of the original application (no tx) vs direct-update STM, run-time filtering, and compile-time optimizations, on the tree, skip, go, merge-sort and xlisp benchmarks

  25. Conclusions • Atomic blocks and transactional memory provide a fundamental improvement to the abstractions for building shared-memory concurrent data structures • Only one part of the concurrency landscape • But we believe no one solution will solve all the challenges there • Our experience with compiler integration shows that a pure software implementation can perform well for short transactions and scale to vast transactions (many accesses & many locations) • We still need a better understanding of realistic workload distributions

  26. Backup slides • Backup slides beyond this point

  27. ‘Parallelism preserving’ design • Any updates must pull data into the local cache in exclusive mode • Even if an operation is ‘read only’, acquiring a multi-reader lock will involve fetching a line in exclusive mode • Our optimistic design lets data shared by multiple cores remain cached by all of them • Scalability at the cost of wasted work when optimism does not pay off [Figure: four cores with private L1/L2 caches and a shared L3 – data held in shared (S) mode in multiple caches vs data held in exclusive (E) mode in a single cache]

  28. Compilation stages: MSIL + atomic block boundaries → IR + cloned atomic code → IR + explicit STM operations → IR + low-level STM operations. MSIL + atomic block boundaries: try { … if (node.Right != this.sentinelNode) … } catch (AtomicFakeException) { } IR + cloned atomic code: l_1 = ObjectField_open_read<Right>(t_0) t_1 = ObjectField_open_read<sentinelNode>(l_0) t_2 = Neq<bool>(l_1, t_1) IR + explicit STM operations: OpenObjForRead(t_0) l_1 = ObjectField<Right>(t_0) OpenObjForRead(l_0) t_1 = ObjectField<sentinelNode>(l_0) t_2 = Neq(l_0, t_1) IR + low-level STM operations: t = GetCurrentThread m = ObjectField<TryAllManager>(t) ReadLogReserve(m, 2) OpenObjForReadFast(m, t_0) l_1 = ObjectField<Right>(t_0) OpenObjForReadFast(m, l_0) t_1 = ObjectField<sentinelNode>(m, l_0) t_2 = Neq(l_0, t_1)

  29. Some examples (xlisp garbage collector) /* follow the left sublist if there is one */ if (livecar(xl_this)) { xl_this.n_flags |= (byte)LEFT; tmp = prev; prev = xl_this; xl_this = prev.p; prev.p = tmp; } Open ‘prev’ for update here to avoid an inevitable upgrade

  30. Some examples (othello) public int PlayerPos (int xm, int ym, int opponent, bool upDate) { int rotate; // 8 degrees of rotation int x, y; bool endTurn; bool plotPos; int turnOver = 0; // inital checking ! if (this.Board[ym,xm] != BInfo.Blank) return turnOver; // can't overwrite player Calls to PlayerPos must open ‘this’ for read: do this in the caller

  31. Basic design: open for read – append an entry to the read-objects log: 1. Store obj ref 2. Copy meta data (the transactional version number from the object’s header, tag 00)

  32. Basic design: open for update – append an entry to the updated-objects log: 1. Store obj ref 2. Copy meta data 3. CAS to acquire the object

  33. Multi-use header word – a single header word next to the vtable holds, depending on its two low-order tag bits (00/01/10/11), the transactional version number, the hash code, or the lock word (with combined version+hashcode and lock-word encodings)

  34. Filtering duplicate log entries • Per-thread table of recently logged objects / addresses, indexed by hash • Fast O(1) logical clearing by embedding transaction sequence numbers in entries (hash value XORed with the current sequence number)
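The sequence-number trick can be sketched as a small direct-mapped table (an illustrative model with invented names; the real filter packs the sequence number into the entry by XOR, and eviction only makes the filter conservative):

```python
class LogFilter:
    """Sketch of a per-thread duplicate-log filter: a small direct-mapped
    table whose entries carry the transaction sequence number, so
    'clearing' the table between transactions is just seq += 1 (O(1))."""
    def __init__(self, size=64):
        self.size = size
        self.tags = [None] * size     # each slot holds (address, seq)
        self.seq = 0

    def new_transaction(self):
        self.seq += 1                 # logically clears every entry

    def seen(self, addr):
        """Return True if addr was already logged in this transaction;
        otherwise record it and return False (caller writes the log)."""
        slot = hash(addr) % self.size
        if self.tags[slot] == (addr, self.seq):
            return True               # duplicate: skip the log write
        self.tags[slot] = (addr, self.seq)   # may evict another address
        return False

f = LogFilter()
assert f.seen("c1.val") is False     # first write: must log
assert f.seen("c1.val") is True      # duplicate: filtered out
f.new_transaction()
assert f.seen("c1.val") is False     # stale entry ignored after "clear"
```

Eviction on a hash collision means a duplicate can occasionally be re-logged, which wastes a little space but is always safe; the filter never suppresses a log entry that is actually needed.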

  35. Semantics of atomic blocks • I/O • Details • Workloads

  36. Challenges & opportunities • Moving away from locks as a programming abstraction lets us re-think much of the concurrent programming landscape • Application software: atomic blocks for synchronization and shared state access; explicit threads only for explicit concurrency; CILK-style parallel loops and calls; data-parallel libraries • Managed runtime & STM implementation: re-use STM mechanisms for TLS and optimistic (in the technical sense) parallel loops etc • Multi-core / many-core hardware: H/W TM or TM-acceleration
