Compiler and Runtime Support for Efficient Software Transactional Memory
Vijay Menon
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Brian R. Murphy, Bratin Saha, Tatiana Shpeisman
Motivation
• Multi-core architectures are mainstream
• Software concurrency needed for scalability
• Concurrent programming is hard
  - Difficult to reason about shared data
• Traditional mechanism: lock-based synchronization
  - Hard to use
  - Must be fine-grain for scalability
  - Deadlocks
  - Not easily composable
• New solution: Transactional Memory (TM)
  - Simpler programming model: atomicity, consistency, isolation
  - No deadlocks
  - Composability
  - Optimistic concurrency
• Analogy: GC : memory allocation ≈ TM : mutual exclusion
Composability

  class Bank {
      ConcurrentHashMap accounts;
      …
      void deposit(String name, int amount) {
          synchronized (accounts) {
              int balance = accounts.get(name);   // Get the current balance
              balance = balance + amount;         // Increment it
              accounts.put(name, balance);        // Set the new balance
          }
      }
      …
  }

• Thread-safe – but no scaling
• ConcurrentHashMap (Java 5 / JSR 166) does not help
• Performance requires redesign from scratch & fine-grain locking
Transactional solution

  class Bank {
      HashMap accounts;
      …
      void deposit(String name, int amount) {
          atomic {
              int balance = accounts.get(name);   // Get the current balance
              balance = balance + amount;         // Increment it
              accounts.put(name, balance);        // Set the new balance
          }
      }
      …
  }

The underlying system provides:
• isolation (thread safety)
• optimistic concurrency
Transactions are Composable
• Scalability on a 16-way 2.2 GHz Xeon system
Our System
• A Java Software Transactional Memory (STM) system
  - Pure software implementation
  - Language extensions in Java
  - Integrated with JVM & JIT
• Novel features
  - Rich transactional language constructs in Java
  - Efficient, first-class nested transactions
  - RISC-like STM API
  - Compiler optimizations
  - Per-type word- and object-level conflict detection
  - Complete GC support
System Overview
[Toolchain diagram: Transactional Java → (Polyglot) → Java + STM API → (StarJIT) → Transactional STIR → Optimized T-STIR → Native Code, running on the ORP VM with the McRT STM runtime]
Transactional Java
• Java + new language constructs:
  - Atomic: execute block atomically
      atomic {S}
  - Retry: block until alternate path possible
      atomic {… retry; …}
  - Orelse: compose alternate atomic blocks
      atomic {S1} orelse {S2} … orelse {Sn}
  - Tryatomic: atomic with escape hatch
      tryatomic {S} catch (TxnFailed e) {…}
  - When: conditionally atomic region
      when (condition) {S}
• Builds on prior research: Concurrent Haskell, CAML, CILK, Java; HPCS languages: Fortress, Chapel, X10
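A small usage sketch (not from the talk) of how the constructs above might compose, written in the Transactional Java syntax; the Channel and Message classes are hypothetical:

  import java.util.LinkedList;
  import java.util.Queue;

  // Sketch only: a blocking take() built from atomic, retry, and orelse.
  // retry blocks until another transaction changes data read so far;
  // orelse falls through to an alternate atomic block.
  class Channel {
      private Queue<Message> primary = new LinkedList<Message>();
      private Queue<Message> backup  = new LinkedList<Message>();

      Message take() {
          atomic {
              if (primary.isEmpty()) retry;   // wait until something arrives
              return primary.remove();
          } orelse {
              if (backup.isEmpty()) retry;    // otherwise try the backup queue
              return backup.remove();
          }
      }
  }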
Transactional Java → Java

  Transactional Java:
      atomic {
          S;
      }

  Standard Java + STM API:
      while (true) {
          TxnHandle th = txnStart();
          try {
              S’;
              break;
          } finally {
              if (!txnCommit(th)) continue;
          }
      }

  STM API: txnStart[Nested], txnCommit[Nested], txnAbortNested, txnUserRetry, ...
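The same mapping plausibly extends to nesting; a sketch, assuming the Nested variants listed above follow the txnStart/txnCommit pattern and only roll back the inner transaction on conflict:

  // Sketch only: hypothetical expansion of a nested atomic block.
  // The enclosing transaction keeps its read/write sets; only the
  // inner transaction is retried when its commit fails.
  while (true) {
      TxnHandle inner = txnStartNested();
      try {
          S2;                                       // body of the inner atomic block
          break;
      } finally {
          if (!txnCommitNested(inner)) continue;    // retry just the inner block
      }
  }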
JVM STM support
• On-demand cloning of methods called inside transactions
• Garbage collection support
  - Enumeration of refs in read set, write set & undo log
• Extra transaction record field in each object
  - Supports both word & object granularity
• Native method invocation throws exception inside transaction
  - Some intrinsic functions allowed
• Runtime STM API
  - Wrapper around McRT-STM API
  - Polyglot / StarJIT automatically generates calls to the API
Background: McRT-STM
• STM for C / C++ (PPoPP 2006) and Java (PLDI 2006)
• Writes:
  - strict two-phase locking
  - update in place
  - undo on abort
• Reads:
  - versioning
  - validation before commit
• Granularity per type
  - Object-level: small objects
  - Word-level: large arrays
• Benefits
  - Fast memory accesses (no buffering / object wrapping)
  - Minimal copying (no cloning for large objects)
  - Compatible with existing types & libraries
Ensuring Atomicity: Novel Combination
• Pessimistic writes: + in-place updates, + fast commits
• Optimistic reads: + fast reads, + caching effects, + avoids lock operations
• Quantitative results in PPoPP’06
McRT-STM: Example
• STM read & write barriers before accessing memory inside transactions
• STM tracks accesses & detects data conflicts

  Source:
      atomic {
          B = A + 5;
      }

  With barriers:
      stmStart();
      temp = stmRd(A);
      stmWr(B, temp + 5);
      stmCommit();
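Based on the McRT-STM design on the previous slide, the two barriers plausibly behave as sketched below; the helper names (txnRecordOf, lockExclusive, loadWord, storeWord) and the readSet/undoLog objects are illustrative assumptions, not the real McRT-STM API:

  // Sketch only: optimistic read barrier and pessimistic (two-phase locking) write barrier.
  int stmRd(Object obj, int offset) {
      long rec = txnRecordOf(obj);                       // assumed helper: load the object's transaction record
      readSet.log(rec);                                  // remember <record, version> for commit-time validation
      return loadWord(obj, offset);                      // read in place, no buffering (conflict handling omitted)
  }

  void stmWr(Object obj, int offset, int value) {
      lockExclusive(obj);                                // assumed helper: CAS the record to this thread's descriptor
      undoLog.log(obj, offset, loadWord(obj, offset));   // save the old value so abort can undo in place
      storeWord(obj, offset, value);                     // update in place
  }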
Transaction Record
• Pointer-sized record per object / word
• Two states:
  - Shared (low bit is 1)
      Read-only / multiple readers
      Value is version number (odd)
  - Exclusive
      Write-only / single owner
      Value is thread's transaction descriptor (4-byte aligned)
• Mapping
  - Object: slot in object
  - Field: hashed index into global record table
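A minimal sketch of decoding that encoding; TxnDescriptor and fromAddress are hypothetical names, not part of the actual runtime:

  // Sketch only: interpret a pointer-sized transaction record value.
  boolean isShared(long record) {
      return (record & 1) == 1;            // odd => shared, the value is a version number
  }

  long versionOf(long record) {
      return record;                       // the odd value itself serves as the version
  }

  TxnDescriptor ownerOf(long record) {
      // even (4-byte aligned) => pointer to the owning thread's transaction descriptor
      return (TxnDescriptor) fromAddress(record);
  }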
Transaction Record: Example
• Every data item has an associated transaction record
• Object granularity: for class Foo { int x; int y; }, each object carries an extra transaction record field (TxR) next to its vtable
• Word granularity: object words hash into a global table of transaction records TxR1 … TxRn; the hash is f(obj.hash, offset)
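For word granularity, f(obj.hash, offset) could be realized roughly as below; the table size and mixing step are assumptions for illustration:

  // Sketch only: map (object, field offset) to a slot in the global record table.
  static final int TABLE_SIZE = 1 << 20;                    // assumed power-of-two table size
  static final long[] txnRecordTable = new long[TABLE_SIZE];

  static int recordIndex(Object obj, int fieldOffset) {
      int h = System.identityHashCode(obj) + fieldOffset;   // f(obj.hash, offset)
      h ^= h >>> 16;                                        // simple bit mixing, assumed
      return h & (TABLE_SIZE - 1);                          // index of the word's transaction record
  }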
Transaction Descriptor
• Descriptor per thread
  - Info for version validation, lock release, undo on abort, …
• Read and write set: {<Ti, Ni>}
  - Ti: transaction record
  - Ni: version number
• Undo log: {<Ai, Oi, Vi, Ki>}
  - Ai: field / element address
  - Oi: containing object (or null for static)
  - Vi: original value
  - Ki: type tag (for garbage collection)
• In an atomic region
  - A read operation appends to the read set
  - A write operation appends to the write set and undo log
  - GC enumerates the read / write / undo logs
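Putting these sets together, commit and abort plausibly work as sketched below, consistent with the versioning/validation scheme described above; the entry classes and helper methods are illustrative assumptions:

  // Sketch only: commit validates optimistic reads, then releases write locks;
  // abort rolls back in-place updates from the undo log.
  boolean commit() {
      for (ReadEntry r : readSet) {                     // validation before commit
          if (currentValueOf(r.record) != r.version) {  // a conflicting write committed meanwhile
              abort();
              return false;
          }
      }
      for (WriteEntry w : writeSet) {
          releaseWithNewVersion(w.record);              // unlock, publishing a new (odd) version
      }
      return true;
  }

  void abort() {
      for (int i = undoLog.size() - 1; i >= 0; i--) {   // newest entry first
          UndoEntry u = undoLog.get(i);
          restoreWord(u.address, u.originalValue);      // undo the in-place update
      }
      for (WriteEntry w : writeSet) {
          releaseWithNewVersion(w.record);              // give up exclusive ownership
      }
  }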
McRT-STM: Example

  class Foo { int x; int y; }
  Foo bar, foo;

  T1:
      atomic {
          t = foo.x;
          bar.x = t;
          t = foo.y;
          bar.y = t;
      }

  T2:
      atomic {
          t1 = bar.x;
          t2 = bar.y;
      }

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example

  T1:
      stmStart();
      t = stmRd(foo.x);
      stmWr(bar.x, t);
      t = stmRd(foo.y);
      stmWr(bar.y, t);
      stmCommit();

  T2:
      stmStart();
      t1 = stmRd(bar.x);
      t2 = stmRd(bar.y);
      stmCommit();

• T1 copies foo into bar
• T2 reads bar, but should not see intermediate values
McRT-STM: Example
[Execution diagram: foo (record version 3) holds x = 9, y = 7; bar (record version 5) starts at x = 0, y = 0]
• T1 acquires bar exclusively and updates it in place, logging undo entries <bar.x, 0> and <bar.y, 0>; its read set holds <foo, 3>, its write set <bar, 5>
• T2's read set records bar's version; T2 waits while T1 owns bar
• T2 should read [0, 0] or [9, 7], never an intermediate value
• On commit, bar's version advances (5 → 7); on abort, the undo log restores bar to [0, 0]
Early Results: Overhead breakdown
• Time breakdown on a single processor
• STM read & validation overheads dominate → good optimization targets
System Overview
[Same toolchain diagram as earlier: Transactional Java → (Polyglot) → Java + STM API → (StarJIT) → Transactional STIR → Optimized T-STIR → Native Code, on the ORP VM with the McRT STM runtime]
Leveraging the JIT
• StarJIT: high-performance dynamic compiler
  - Identifies transactional regions in Java+STM code
  - Differentiates top-level and nested transactions
  - Inserts read/write barriers in transactional code
  - Maps STM API to first-class opcodes in STIR
• Good compiler representation → greater optimization opportunities
Representing Read/Write Barriers

  Source:
      atomic {
          a.x = t1
          a.y = t2
          if (a.z == 0) {
              a.x = 0
              a.z = t3
          }
      }

  With traditional barriers:
      stmWr(&a.x, t1)
      stmWr(&a.y, t2)
      if (stmRd(&a.z) == 0) {
          stmWr(&a.x, 0)
          stmWr(&a.z, t3)
      }

• Traditional barriers hide redundant locking/logging
An STM IR for Optimization

Redundancies exposed:

  Source:
      atomic {
          a.x = t1
          a.y = t2
          if (a.z == 0) {
              a.x = 0
              a.z = t3
          }
      }

  Expanded IR:
      txnOpenForWrite(a)
      txnLogObjectInt(&a.x, a)
      a.x = t1
      txnOpenForWrite(a)
      txnLogObjectInt(&a.y, a)
      a.y = t2
      txnOpenForRead(a)
      if (a.z == 0) {
          txnOpenForWrite(a)
          txnLogObjectInt(&a.x, a)
          a.x = 0
          txnOpenForWrite(a)
          txnLogObjectInt(&a.z, a)
          a.z = t3
      }
Optimized Code

  Source:
      atomic {
          a.x = t1
          a.y = t2
          if (a.z == 0) {
              a.x = 0
              a.z = t3
          }
      }

  After optimization:
      txnOpenForWrite(a)
      txnLogObjectInt(&a.x, a)
      a.x = t1
      txnLogObjectInt(&a.y, a)
      a.y = t2
      if (a.z == 0) {
          a.x = 0
          txnLogObjectInt(&a.z, a)
          a.z = t3
      }

• Fewer & cheaper STM operations
Compiler Optimizations for Transactions
• Standard optimizations
  - CSE, dead code elimination, …
  - Careful IR representation exposes opportunities and enables optimizations with almost no modifications
  - Subtle in presence of nesting
• STM-specific optimizations
  - Immutable field / class detection & barrier removal (vtable / String)
  - Transaction-local object detection & barrier removal
  - Partial inlining of STM fast paths to eliminate call overhead (see the sketch below)
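For instance, partial inlining might split the read barrier into an inlined fast path and an out-of-line slow path, roughly as below; the Fast/Slow split and helper names are assumptions for illustration:

  // Sketch only: inline the common case, call out of line for the rest.
  void txnOpenForReadFast(Object obj) {
      long rec = txnRecordOf(obj);        // assumed helper: load the transaction record
      if ((rec & 1) == 1) {               // fast path: record is shared (odd version number)
          readSet.log(rec);
          return;
      }
      txnOpenForReadSlow(obj, rec);       // slow path: record is exclusively owned, resolve the conflict
  }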
Experiments
• 16-way 2.2 GHz Xeon with 16 GB shared memory
  - L1: 8 KB, L2: 512 KB, L3: 2 MB, L4: 64 MB (per four processors)
• Workloads
  - Hashtable, binary tree, OO7 (OODBMS)
  - Mix of gets, in-place updates, insertions, and removals
• Object-level conflict detection by default
  - Word / mixed where beneficial
Effectiveness of Compiler Optimizations
• 1P overheads over a thread-unsafe baseline
• Prior STMs typically incur ~2x overhead on 1P
• With compiler optimizations:
  - < 40% over no concurrency control
  - < 30% over synchronization
Scalability: Java HashMap Shootout
• Unsafe (java.util.HashMap)
  - Thread-unsafe, no concurrency control
• Synchronized
  - Coarse-grain synchronization via the SynchronizedMap wrapper
• Concurrent (java.util.concurrent.ConcurrentHashMap)
  - Multi-year effort: JSR 166 → Java 5
  - Optimized for concurrent gets (no locking)
  - For updates, divides the bucket array into 16 segments (size / locking)
• Atomic
  - Transactional version via an “AtomicMap” wrapper (see the sketch below)
• Atomic Prime
  - Transactional version with minor hand optimization
  - Tracks size per segment à la ConcurrentHashMap
• Execution
  - 10,000,000 operations / 200,000 elements
  - Defaults: load factor, threshold, concurrency level
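The “AtomicMap” wrapper could be as simple as delegating each operation to the unsynchronized map inside an atomic block, analogous to Collections.synchronizedMap; this sketch uses the Transactional Java syntax and is not the actual wrapper from the paper:

  import java.util.HashMap;

  // Sketch only: every operation on the plain HashMap runs as a transaction.
  class AtomicMap<K, V> {
      private final HashMap<K, V> m = new HashMap<K, V>();

      V get(K key)          { atomic { return m.get(key); } }
      V put(K key, V value) { atomic { return m.put(key, value); } }
      V remove(K key)       { atomic { return m.remove(key); } }
      int size()            { atomic { return m.size(); } }
  }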
Scalability: 100% Gets
• The Atomic wrapper is competitive with ConcurrentHashMap
• The benefit of the compiler optimizations persists as the processor count scales
Scalability: 20% Gets / 80% Updates
• ConcurrentHashMap thrashes on 16 segments
• Atomic still scales
20% Inserts and Removes
• Atomic conflicts on the entire bucket array
  - The array is a single object under object-level conflict detection
20% Inserts and Removes: Word-Level We still conflict on the single size field in java.util.HashMap
20% Inserts and Removes: Atomic Prime
• Atomic Prime tracks size per segment, lowering the bottleneck (see the sketch below)
• No degradation, modest performance gain
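The hand optimization plausibly replaces HashMap's single size field with per-segment counters, so inserts and removes that touch different segments no longer collide on one word; a sketch, with the segment count and field names assumed:

  // Sketch only: per-segment element counts, updated inside the already-atomic
  // put/remove and summed on demand.
  static final int SEGMENTS = 16;                    // matches ConcurrentHashMap's default
  private final int[] segmentCount = new int[SEGMENTS];

  private void adjustCount(int hash, int delta) {    // called from within an atomic put/remove
      segmentCount[hash & (SEGMENTS - 1)] += delta;
  }

  int size() {                                       // also called inside an atomic block
      int total = 0;
      for (int c : segmentCount) total += c;
      return total;
  }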
20% Inserts and Removes: Mixed-Level
• Mixed-level preserves the wins & reduces overheads
  - word-level for arrays
  - object-level for non-arrays
Scalability: java.util.TreeMap
• 100% gets and 80% gets
• Results similar to HashMap
Scalability: OO7 – 80% Reads
• Operations & traversals over a synthetic database
• “Coarse” atomic is competitive with medium-grain synchronization
Key Takeaways
• Optimistic reads + pessimistic writes is a nice sweet spot
• Compiler optimizations significantly reduce STM overhead
  - 20–40% over thread-unsafe
  - 10–30% over synchronized
• Simple atomic wrappers are sometimes good enough
  - Minor modifications give performance competitive with complex fine-grain synchronization
• Word-level conflict detection is crucial for large arrays
  - Mixed granularity provides the best of both
Research challenges
• Performance
  - Compiler optimizations
  - Hardware support
  - Dealing with contention
• Semantics
  - I/O & communication
  - Strong atomicity
  - Nested parallelism
  - Open transactions
• Debugging & performance analysis tools
• System integration
Conclusions
• Rich transactional language constructs in Java
• Efficient, first-class nested transactions
• RISC-like STM API
• Compiler optimizations
• Per-type word- and object-level conflict detection
• Complete GC support