Guilt-free Nonblocking Software Transactional Memory Virendra J. Marathe Department of Computer Science University of Rochester
Single Processor Performance Problems • Heat wall • Diminishing ILP gains http://www.tomshardware.com/2005/11/21/
The Concurrency Revolution is here! • Multicore chips are already here • Intel, AMD Dual and Quad Core chips already in market • 8-core versions already on their way • Sun’s Niagara and Rock processors • Parallel programming reaching the masses • Applications need to become parallel to leverage the multicore performance potential
The Parallel Programming Challenges • Finding Parallelism • Expressing Parallelism • Concurrency Control (synchronization) • Program Verification & Debugging • Performance Debugging
Synchronization • Traditional Approach: Locks • Gives mutual exclusion guarantees • Tension between locking granularity and scalability • Locking not composable • Coarse-grain locking • + Easy to maintain • – Poor scaling • Fine-grain locking • + Good scaling • – Hard to get right (data races, deadlocks)
Transactional Memory (TM) • TM borrowed from database transactions • Memory is the database • Marked blocks of code are transactions • Ensures • Atomicity: a transaction's updates take effect all at once or not at all • Isolation: a transaction does not observe the uncommitted updates of concurrent transactions • Consistency: transactions preserve program invariants • Durability: not supported (memory isn’t durable)
Transactional Memory to the Rescue • Programmer does not have to worry about concurrency control • Best of both coarse and fine grain locking • Simplicity of coarse grain locking • Scalability of fine grain locking • Gives Composability
    HashTable hash1, hash2;
    ...
    atomic {
        Object o1 = hash1.remove(key1);
        hash2.insert(key1, o1);
    }

    class HashTable {
        ...
        Object remove(Key key) { atomic { ... } }
        void insert(Key key, Object obj) { atomic { ... } }
    }
A Typical Memory Transaction • Speculatively executed blocks of code • Reads and Writes are speculative • Transactions try to commit updates at the end • The runtime ensures that transactions are atomic and isolated • Requires conflict detection mechanisms • Done entirely in software (STM), hardware (HTM), or a hybrid of both (HyTM) • Non-atomic or non-isolated transactions abort and re-execute [Figure: example transaction: begin, read(A), read(B), write(C), write(A), commit]
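The abort-and-re-execute behavior above can be sketched as a retry loop. This is only an illustration: `txn_body`, `run_transaction`, and `flaky_body` are hypothetical names, and the body's return value stands in for whatever conflict detection a real TM runtime performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transaction body: returns true if its commit succeeded,
 * false if a conflict forced it to abort. */
typedef bool (*txn_body)(void *arg);

/* Re-execute the body until one attempt commits; returns the attempt count. */
static int run_transaction(txn_body body, void *arg) {
    assert(body != NULL);
    int attempts = 0;
    do {
        attempts++;            /* begin: start a fresh speculative attempt */
    } while (!body(arg));      /* false means abort, so run the body again */
    return attempts;
}

/* Toy body for illustration: "conflicts" until the counter reaches zero. */
static bool flaky_body(void *arg) {
    int *conflicts_left = arg;
    if (*conflicts_left > 0) {
        (*conflicts_left)--;
        return false;          /* conflict detected: abort this attempt */
    }
    return true;               /* no conflict: commit */
}
```

For example, a body that conflicts twice commits on its third attempt.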
Lock-based Transactions • Use per location locks for conflict detection • Causes unnecessary waiting in some situations [Figure: timeline. T1: begin, write(A), read(B), write(C), abort, releasing its locks for A, B, C. T2: begin, read(A) conflicts with T1 and stalls until T1 releases the lock for A, then commits.]
Nonblocking Transactions • Nonblocking algorithms • Get rid of locks • No need to wait • Require non-trivial engineering to get right • But the payoff is worth it [Figure: timeline. T1: begin, write(A), read(B), write(C), abort. T2: begin, read(A) conflicts with T1, aborts T1 without stalling, then commits.]
Nonblocking Progress and STM • Nonblocking Progress – arbitrary delays in some threads do not prevent others from making forward progress • Nonblocking STMs • Transactions acquire revocable locks for written locations • Acquired locations are released at commit/abort time • Competing transactions need not block for current owners of locks
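One common way to make ownership revocable (used in DSTM-style designs, and assumed here rather than taken from this talk) is to give every transaction a status word that changes with a single compare-and-swap. A competitor aborts the owner instead of waiting for it. The names below are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum txn_status { TXN_ACTIVE, TXN_COMMITTED, TXN_ABORTED };

/* Hypothetical transaction descriptor holding only the status word. */
typedef struct {
    _Atomic int status;
} txn_desc;

/* Move a transaction out of ACTIVE with one CAS; exactly one caller wins. */
static bool txn_finish(txn_desc *t, int target) {
    assert(target != TXN_ACTIVE);
    int expected = TXN_ACTIVE;
    return atomic_compare_exchange_strong(&t->status, &expected, target);
}

/* A competitor revokes ownership by aborting the owner, without waiting. */
static bool abort_owner(txn_desc *owner) {
    return txn_finish(owner, TXN_ABORTED);
}

/* The owner commits with the same CAS; it fails if it was aborted first. */
static bool try_commit(txn_desc *self) {
    return txn_finish(self, TXN_COMMITTED);
}
```

Because both transitions race on the same word, a delayed owner can never commit after a competitor has aborted it.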
Nonblocking Progress and STM • TM research began for nonblocking concurrent algorithms [Herlihy&Moss ISCA’93] • Early software TMs (STMs) were nonblocking, but slow • Recent shift toward blocking STMs • Significant performance improvements • General argument – nonblocking STMs are fundamentally slow • We argue – one can improve the common case performance of nonblocking STMs
Remaining Talk • Why is nonblocking progress important? • Background on STM Implementations • What makes nonblocking STMs slow? • Making nonblocking STMs fast • Experimental Results • Conclusions
The Virtues of Nonblocking Progress • Tolerance of arbitrary delays due to • Preemption • Page faults • Thread faults • External scheduler support mitigates some problems, but • Not portable • Better to contain the problem within the STM • Environments where blocking is unacceptable • TxLinux interrupt handler transactions
STM Speculative Writes • Two types of implementations for speculative writes: • Redo Log: writes made to private buffer [Figure: T1 executes begin, write(A), read(B), write(C), commit. Its descriptor holds a status and a write set (A: new value, C: new value); the new values are copied back to A and C in memory at commit.]
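A minimal single-threaded sketch of a redo log, assuming a small fixed-size buffer. `redo_log`, `redo_write`, `redo_read`, and `redo_copyback` are illustrative names, and conflict detection is omitted entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REDO_MAX 16   /* hypothetical capacity */

typedef struct { uintptr_t *addr; uintptr_t new_val; } redo_entry;
typedef struct { redo_entry e[REDO_MAX]; size_t n; } redo_log;

/* Buffer a speculative write privately; shared memory is left untouched. */
static void redo_write(redo_log *log, uintptr_t *addr, uintptr_t val) {
    for (size_t i = 0; i < log->n; i++) {
        if (log->e[i].addr == addr) { log->e[i].new_val = val; return; }
    }
    assert(log->n < REDO_MAX);
    log->e[log->n++] = (redo_entry){ addr, val };
}

/* A transaction must see its own buffered writes before falling back to memory. */
static uintptr_t redo_read(const redo_log *log, const uintptr_t *addr) {
    for (size_t i = 0; i < log->n; i++) {
        if (log->e[i].addr == addr) return log->e[i].new_val;
    }
    return *addr;
}

/* After a successful commit, copy the new values back to memory. */
static void redo_copyback(redo_log *log) {
    for (size_t i = 0; i < log->n; i++) *log->e[i].addr = log->e[i].new_val;
    log->n = 0;
}
```

An abort simply discards the log, which is why aborts are cheap with this scheme while commits pay for the copyback.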
STM Speculative Writes • Two types of implementations for speculative writes: • Undo Log: writes are made directly to memory [Figure: T1 executes begin, write(A), read(B), write(C), abort. Its descriptor holds a status and an undo log (A: old value, C: old value); the old values are restored to A and C in memory on abort.]
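The undo log is the mirror image: writes go straight to memory and the log remembers what to put back. Again a single-threaded sketch with hypothetical names (`undo_log`, `undo_write`, `undo_abort`, `undo_commit`) and no conflict detection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNDO_MAX 16   /* hypothetical capacity */

typedef struct { uintptr_t *addr; uintptr_t old_val; } undo_entry;
typedef struct { undo_entry e[UNDO_MAX]; size_t n; } undo_log;

/* Save the old value, then write directly to memory. */
static void undo_write(undo_log *log, uintptr_t *addr, uintptr_t val) {
    assert(log->n < UNDO_MAX);
    log->e[log->n++] = (undo_entry){ addr, *addr };
    *addr = val;
}

/* On abort, restore old values newest-first so that repeated writes to the
 * same location end at the value seen before the transaction began. */
static void undo_abort(undo_log *log) {
    while (log->n > 0) {
        log->n--;
        *log->e[log->n].addr = log->e[log->n].old_val;
    }
}

/* On commit the new values are already in place; just drop the log. */
static void undo_commit(undo_log *log) {
    log->n = 0;
}
```

Here commits are cheap and aborts pay for the restore, the opposite trade-off from a redo log.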
STM Speculative Reads • Reads are invisible • Logged in a private read set • Read set validated to ensure isolation • Several schemes (e.g. incremental, commit counter, timestamp, etc.) [Figure: T1 executes begin, write(A), read(B), write(C), commit. read(B) logs B and its current state in the read set; at commit T1 verifies that B hasn’t changed, then commits.]
What makes Nonblocking STMs slow? • Nonblocking STMs require infrastructure to avoid waiting during conflicts • Indirection (object-based STMs) • Copying and Cloning • Helping • Stealing • Incremental Read Set Validation • Extremely costly • These usually lead to overheads in the (contention-free) common case
What makes Nonblocking STMs slow? • Indirection (in object-based STMs), and • Copying and cloning [Figure: a DSTM transactional object (Start pointer to a Locator holding Owner Txn, Old Data, New Data) next to an RSTM transactional object (Owner Txn, Old Data, New Data), shown as Txn 1 and Txn 2 access them.]
What makes Nonblocking STMs slow? • Helping: Help the conflicting transaction to finish • Too much contention [Figure: T1 executes begin, write(A), read(B), write(C), commit. T2, T3, and T4 each begin, read(A), and help T1.]
What makes Nonblocking STMs slow? • Stealing • Steal the right to access conflicting location • Take over the responsibility of cleanup [Figure: T1 executes begin, write(A), read(B), write(C), abort. T2: begin, write(A), steal A, commit. T3: begin, write(A), steal A, commit.]
Stealing [Harris & Fraser approach] • Need infrastructure to • Handle the case of multiple stealers • reference counters • Retrieve correct logical values of stolen locations • storing old and new values, • expensive memory management for preserving logical values in transaction read/write sets • helping to restore logical values of stolen locations • Manage races among stealers • Extra atomic ops (2N for N locations) • Stealing is still promising
What makes Blocking STMs fast? • Significantly less overhead in the common case • Simple metadata structure • Just 1 word to indicate ownership • Streamlined fast path • Performance optimizations • Timestamp based validation • We need to incorporate all these features in a nonblocking STM to make it competitive
Our Contributions • A novel approach for stealing • Keep the common case simple • Resort to complicated case only when stealing happens • More streamlined common case execution path • Incorporate recent optimizations (timestamp based validation) • We are the first to do this in nonblocking STMs
STM Data Structures • Word-based STM • Conflict detection at granularity of contiguous blocks of memory • Appropriate for unmanaged languages – C, C++ • A table of ownership records (orecs) • Each heap location hashes into a single orec • Each orec indicates if currently owned or free, and identifies the owner • Transaction Descriptor • Read set • Write set (redo log) – a 2D list, each row corresponds to an acquired orec • Status – Active/Aborted/Committed
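The address-to-orec mapping can be sketched as follows. The table size, the 8-byte block granularity, and the hash (shift then mask) are assumptions made for illustration; the talk does not specify them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define OREC_COUNT  1024   /* hypothetical table size */
#define BLOCK_SHIFT 3      /* hypothetical: 8-byte conflict-detection blocks */

static_assert((OREC_COUNT & (OREC_COUNT - 1)) == 0,
              "OREC_COUNT must be a power of two for the mask below");

/* One word per orec: says whether the block is owned and by whom. */
typedef struct { _Atomic uintptr_t word; } orec;

static orec orec_table[OREC_COUNT];

/* Every address in the same block hashes to the same table slot. */
static size_t orec_index(const void *addr) {
    return ((uintptr_t)addr >> BLOCK_SHIFT) & (OREC_COUNT - 1);
}

static orec *orec_for(const void *addr) {
    return &orec_table[orec_index(addr)];
}
```

Distinct blocks can collide on one orec, which causes false conflicts but does not affect correctness.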
STM Data Structures [Figure: location locX (value 10) in the shared heap hashes to one of the ownership records o1–o5. Transaction descriptor T1 (COMMITTED) holds a write set with the entry locX:11 and a read set.]
Common Case Execution • Algorithm behaves like a blocking STM in the absence of contention • Log reads, writes of transaction • Acquire ownership of write set locations via their orecs • Ensure that reads are still consistent (timestamp-based validation) • Copyback updates after commit • Release orecs via store instruction (details offline) • Ours is the first nonblocking STM with this feature
Timestamps and Validation • A significant optimization to read set validation (e.g. TL2) [Figure: a global clock, orecs o1–o5 holding timestamps (TS), and shared heap locations A, B, C. T1 (ACTIVE, Begin_TS: 10) executes begin, write(A), read(B), write(C), commit and checks TS(loc) on each access.]
Timestamps and Validation • Ensures that transactions access mutually consistent data • Validation per memory access takes constant time • Assumption that conflicts will be rare • Results in major performance difference • Prior nonblocking STMs required incremental validation
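The constant-time check can be sketched in the style of TL2. This is a simplified model, not the talk's implementation: it assumes a single global counter, an unowned orec that stores only a version, and a writer that bumps the clock when it releases. All names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t global_clock;

/* In this model an unowned orec just holds the time of its last release. */
typedef struct { _Atomic uint64_t version; } ts_orec;

typedef struct { uint64_t begin_ts; } ts_txn;

/* Sample the global clock when the transaction begins. */
static void ts_begin(ts_txn *t) {
    t->begin_ts = atomic_load(&global_clock);
}

/* One comparison per access: the location is consistent with everything
 * read so far if it has not been released since this transaction began. */
static bool ts_read_ok(const ts_txn *t, ts_orec *o) {
    return atomic_load(&o->version) <= t->begin_ts;
}

/* A committing writer advances the clock and stamps the orec with it. */
static void ts_release(ts_orec *o) {
    uint64_t now = atomic_fetch_add(&global_clock, 1) + 1;
    assert(atomic_load(&o->version) < now);   /* versions only move forward */
    atomic_store(&o->version, now);
}
```

Incremental validation would instead recheck the whole read set on every new read, which is quadratic in the read set size.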
Adding Timestamps • Recall: orec contains a pointer to the owner • Superimpose a timestamp on this pointer • A writer releases orec by storing back the current global time
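One possible way to superimpose a timestamp on the owner pointer is to tag the low bit of the orec word. The encoding below is an assumption for illustration and not necessarily the layout used in the talk; it relies on descriptors being at least 2-byte aligned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor; its int member gives it alignment of at least 2. */
typedef struct txn { int status; } txn;

#define OWNED_BIT ((uintptr_t)1)

/* Owned: the word is the owner's descriptor address with the low bit set. */
static uintptr_t orec_pack_owner(const txn *owner) {
    assert(((uintptr_t)owner & OWNED_BIT) == 0);
    return (uintptr_t)owner | OWNED_BIT;
}

/* Free: the word is the release timestamp shifted left, low bit clear. */
static uintptr_t orec_pack_timestamp(uintptr_t ts) {
    return ts << 1;
}

static bool orec_is_owned(uintptr_t word) {
    return (word & OWNED_BIT) != 0;
}

static txn *orec_owner(uintptr_t word) {
    return (txn *)(word & ~OWNED_BIT);
}

static uintptr_t orec_timestamp(uintptr_t word) {
    return word >> 1;
}
```

With such an encoding, releasing an orec is a single store of `orec_pack_timestamp(now)`, which frees the orec and publishes its new version at once.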
Common Case Example [Figure: T1 (ACTIVE, write set locX:11) owns orec o1, to which locX hashes. T1 becomes COMMITTED, copies the new value back (locX: 10 to 11), and releases the orec with a store. Orec fields: ID, flags (S, C), ver#. Annotations: copyback in progress, copyback complete, locX’s logical value.]
Uncommon Case Stealing • Two flags in the orec for the stealing process • stolen_orec: for orec’s stolen/unstolen state • copier_exists: indicates if there exists an owner in cleanup phase
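The two flags can be modeled as bits in the orec word that are set and cleared atomically. The bit positions below are hypothetical (bit 0 is left free here for an owned/unowned tag), and the helper names are illustrative; the actual stealing protocol involves more steps than flag updates.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define STOLEN_OREC   ((uintptr_t)1 << 1)  /* orec is in the stolen state */
#define COPIER_EXISTS ((uintptr_t)1 << 2)  /* an owner is in its cleanup phase */
#define OREC_FLAGS    (STOLEN_OREC | COPIER_EXISTS)

typedef struct { _Atomic uintptr_t word; } flag_orec;

static bool flag_test(flag_orec *o, uintptr_t flag) {
    assert((flag & ~OREC_FLAGS) == 0);
    return (atomic_load(&o->word) & flag) != 0;
}

/* Atomic read-modify-write, so concurrent updates to other bits survive. */
static void flag_set(flag_orec *o, uintptr_t flag) {
    assert((flag & ~OREC_FLAGS) == 0);
    atomic_fetch_or(&o->word, flag);
}

static void flag_clear(flag_orec *o, uintptr_t flag) {
    assert((flag & ~OREC_FLAGS) == 0);
    atomic_fetch_and(&o->word, ~flag);
}
```

Keeping both flags in the same word as the owner field lets a single atomic operation observe or change ownership and stealing state together.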
Stealing Example [Figure: stealing of orec o1, to which locX hashes. Owner T1 (COMMITTED, write set locX:11); stealer 1, T2 (ACTIVE, then COMMITTED, write set locX:11, locX:12); stealer 2, T3 (ACTIVE, write set locX:12), the third owner. The S and C flags and ver# in the orec change as the orec is stolen; steps shown include clearing C and redoing the copyback. locX’s logical value takes the values 10, 11, 12. Annotations: copyback in progress, copyback complete.]
Stealing Complexity • Stealing mechanism quite complex • Several corner case race conditions need to be handled (happy to talk offline) • Invariant: At most 1 transaction does a copyback for an orec at any given time • Simplifies our design significantly • Overhead of accessing stolen locations is quite high, requiring a lookup in the last stealer’s write set • However, we can throttle stealing and make it an uncommon case
Undo Log Variant • We have developed the first nonblocking undo log STM through simple modifications to a redo log variant • Stealing of orecs happens in the redo log STM when a committed owner is delayed • In undo log STMs stealing largely happens when an aborted owner is delayed • Logical values of locations are in aborted owner’s undo log
Experimental Platform • Implementation of all STMs done in C • Throughput tests conducted on microbenchmarks • Scalable workloads: hash table, binary search tree • Torture tests (no scaling): counter, array of counters • Tests conducted on a 16 processor Sun Fire machine • We compared the following STMs • TL2, • TL2 with schedctl calls to avoid preemption pathologies, • Harris and Fraser’s word-based nonblocking STM • Our Base blocking and nonblocking variants (do not contain store-based release and optimizations), and • 3 variants of our Optimized STM (eager redo log, lazy redo log, undo log)
Binary Search Tree (32K nodes) [Plot; series: Our Optimized STMs, TL2, Base NB, HF-STM] • Major performance gap closed
Hash Table (64 buckets, 256 keys) [Plot; series: TL2-Sched, TL2, Our Optimized STMs]
Array of 16 Counters [Plot; series: Undo Log, TL2, TL2-Sched, Redo Log]
Conclusion • We presented several variants of a new STM that • Effectively decouples the common case from nonblocking infrastructure • Enables a more streamlined fast path (comparable to state-of-the-art blocking STMs) • Enables integration of key optimizations such as • Timestamp-based transaction validation • We have shown that common case performance of nonblocking STMs can be made competitive with state-of-the-art blocking STMs
My Work during Ph.D. • Nonblocking STMs • Comparison of nonblocking STMs [LCR’04, URCS TR 839] • Adaptive STM [DISC’05] • Rochester STM [Transact’06, DISC’06, Transact’07] • Word-based STMs [PODC’05, PPoPP’07, PPoPP’08, URCS TR 932] • Enabling Cooperation among Transactions • Transaction Synchronizers [SCOOL’05] • Hardware Acceleration of STMs • RTM [Transact’06, ISCA’07] • Programming Model aspects of software transactions • Privatization [PODC’07, URCS TR 915, submitted] • Bag-of-tasks programming model with Transactions [PPoPP’07] • Interaction with non-transactional code • Transaction Safe Nonblocking Algorithms [DISC’07, URCS TR 924] • Composite Abortable Locks [IPDPS’06]
Future Goals and Directions • Short Term • Investigate Programming Models centered around TM • Language integration and semantics of software transactions (a hot research topic) • Interaction of TM with data, dataflow parallelism • Interaction with traditional lock-based code • Investigate workloads to understand usability of TM • More aspects of STM runtimes • Concurrent nesting, data locality, runtime optimizations, etc. • Long Term Goal • Make parallel programming much more accessible to the masses
Thank You! Questions?
Array of 16 Counters – Stealing Rate [Plot; series: Undo Log, Redo Log]
My Work during Ph.D. • Nonblocking STMs • Comparison of nonblocking STMs [LCR’04, URCS TR 839] • Identified several design tradeoffs • Adaptive STM [DISC’05] • Adaptation in levels of indirection and ownership acquisition technique • Rochester STM [Transact’06, DISC’06, Transact’07] • Further reduction in levels of indirection • Word-based STMs [PODC’05, PPoPP’07, PPoPP’08, URCS TR 932] • Really guilt-free nonblocking STMs
My Work during Ph.D. • Enabling Cooperation among Transactions • Transaction Synchronizers [SCOOL’05, ongoing] [Figure labels: T1, T2, T_1_2, Comm. Channel]