Toward High Performance Nonblocking Software Transactional Memory. Virendra J. Marathe, University of Rochester. Mark Moir, Sun Microsystems Labs.
Nonblocking Progress & Transactional Memory • Nonblocking Progress – arbitrary delays in some threads do not prevent others from making forward progress • TM research began for nonblocking concurrent algorithms [Herlihy&Moss ISCA’93] • Early software TMs (STMs) were nonblocking, but slow • Recent shift toward blocking STMs • Significant performance improvements • General argument – nonblocking STMs are fundamentally slow • We were not convinced
Agenda • Why is nonblocking progress important? • Background on STM Implementations • What makes nonblocking STMs slow? • Making nonblocking STMs fast • Experimental Results • Conclusions
The Virtues of Nonblocking Progress • Tolerance of arbitrary delays due to • Preemption • Page faults • Thread faults • External scheduler support mitigates some of these problems, but • It is not portable • Ideally, the problem is contained within the STM • Environments where blocking is unacceptable • TxLinux interrupt handler transactions
STM Implementations • Transactions execute speculatively • Reads and writes use STM metadata • Speculative writes typically acquire ownership of locations (using atomic ops, e.g. CAS) • Reads are typically logged in a private read set for validation at commit time • Post-commit/abort cleanup • Make speculative updates non-speculative, or roll back speculative updates • Release ownership of locations • Note: this cleanup phase is what forces waiting in blocking STMs
STM Implementations • Two types of implementations for speculative writes: • Redo Log – • writes made to private buffer, • and flushed out on commit • ownership acquisition can be done at first write (eager acquire) or commit time (lazy acquire) • Undo Log – • writes are made directly to memory (need eager acquire), • old values are logged in a private buffer, and • old values are restored in case of an abort • Read set validation to ensure isolation • Several schemes (e.g. incremental, commit counter, timestamp, etc.)
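The two write-logging disciplines above can be sketched in a few lines of C. This is a single-threaded illustration only: it omits ownership acquisition and validation, and the names `txn_log`, `redo_write`, `undo_write`, and the fixed `LOG_CAP` are assumptions made for the example, not identifiers from the talk.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAP 16  /* assumed fixed capacity, for illustration only */

/* One logged (address, value) pair. */
typedef struct { long *addr; long val; } log_entry;

typedef struct {
    log_entry entries[LOG_CAP];
    size_t    len;
} txn_log;

/* Redo log: buffer the NEW value privately; memory is untouched until commit. */
void redo_write(txn_log *log, long *addr, long val) {
    for (size_t i = 0; i < log->len; i++) {
        if (log->entries[i].addr == addr) { log->entries[i].val = val; return; }
    }
    assert(log->len < LOG_CAP);
    log->entries[log->len++] = (log_entry){ addr, val };
}

/* A redo-log transaction must see its own buffered writes. */
long redo_read(const txn_log *log, const long *addr) {
    for (size_t i = 0; i < log->len; i++) {
        if (log->entries[i].addr == addr) return log->entries[i].val;
    }
    return *addr;
}

/* Commit flushes the buffer to memory; abort simply discards it. */
void redo_commit(txn_log *log) {
    for (size_t i = 0; i < log->len; i++) *log->entries[i].addr = log->entries[i].val;
    log->len = 0;
}
void redo_abort(txn_log *log) { log->len = 0; }

/* Undo log: save the OLD value, then write directly to memory. */
void undo_write(txn_log *log, long *addr, long val) {
    assert(log->len < LOG_CAP);
    log->entries[log->len++] = (log_entry){ addr, *addr };
    *addr = val;
}

/* Commit discards the log; abort restores old values, newest first. */
void undo_commit(txn_log *log) { log->len = 0; }
void undo_abort(txn_log *log) {
    while (log->len > 0) {
        log_entry e = log->entries[--log->len];
        *e.addr = e.val;
    }
}
```

The asymmetry is visible in the cleanup paths: a redo log does its copying after a commit, while an undo log does its copying after an abort.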
What makes nonblocking STMs slow? • In Blocking STMs • Transaction waits for a conflicting transaction in its post-commit/abort cleanup phase • Nonblocking STMs avoid waiting with • Indirection (object-based STMs) • Copying and Cloning • Helping • Stealing (Harris & Fraser; also our approach) • These usually lead to overheads in the (contention-free) common case
What makes blocking STMs fast? • Significantly less overhead in the common case • Simple metadata structure • Streamlined fast path • Performance optimizations • Timestamp based validation • We need to incorporate all these features in a nonblocking STM to make it competitive
Our Contributions • Keep the common case simple • Resort to complicated case only when cleanup is delayed • More streamlined common case execution path • Incorporate recent optimizations (timestamp based validation)
STM Data Structures • Word-based STM • Conflict detection at granularity of contiguous blocks of memory • Appropriate for unmanaged languages – C, C++ • A table of ownership records (orecs) • Each heap location hashes into a single orec • Each orec indicates if currently owned or free, and identifies the owner • Transaction Descriptor • Read set • Write set (redo log) – a 2D list, each row corresponds to an acquired orec • Status – Active/Aborted/Committed
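A rough C rendering of these structures may help. The table size, the block size, the hash function, and every identifier below (`orec_table`, `orec_for`, `ws_row`, and so on) are assumptions chosen for the sketch; the slides do not specify them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OREC_COUNT  1024u  /* assumed table size (power of two) */
#define BLOCK_SHIFT 3u     /* assumed 8-byte conflict-detection blocks */

_Static_assert((OREC_COUNT & (OREC_COUNT - 1u)) == 0,
               "OREC_COUNT must be a power of two for the mask below");

typedef enum { TXN_ACTIVE, TXN_ABORTED, TXN_COMMITTED } txn_status;

typedef struct txn_desc txn_desc;

/* Ownership record: free when owner is NULL, otherwise names the owner. */
typedef struct { txn_desc *owner; } orec;

/* Write set as a 2D list: one row per acquired orec, one entry per location. */
typedef struct { uintptr_t addr; long val; } ws_entry;
typedef struct { orec *o; ws_entry entries[8]; size_t len; } ws_row;

struct txn_desc {
    txn_status status;
    orec      *read_set[64];   size_t n_reads;
    ws_row     write_set[16];  size_t n_rows;
};

orec orec_table[OREC_COUNT];

/* Every heap address hashes to exactly one orec; all bytes of a block share it. */
size_t orec_index(const void *addr) {
    return (size_t)(((uintptr_t)addr >> BLOCK_SHIFT) & (OREC_COUNT - 1u));
}

orec *orec_for(const void *addr) { return &orec_table[orec_index(addr)]; }
```

Because the table is finite, distinct blocks can hash to the same orec, so two transactions touching unrelated locations may still conflict.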
Common Case Execution • Algorithm behaves like a blocking STM in the absence of contention • Log reads, writes of transaction • Acquire ownership of write set locations via their orecs • Ensure that reads are still consistent (read set validation) • Flush out updates after commit/abort • Release orecs
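The common-case steps above can be sketched as a lazy-acquire commit routine. This is an illustration under stated assumptions, not the authors' code: on a conflict it aborts rather than waiting or stealing, it validates with a per-orec version counter instead of timestamps, and all names (`txn_commit`, `release_orecs`, `read_entry`, `write_entry`) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum { TXN_ACTIVE, TXN_ABORTED, TXN_COMMITTED };

typedef struct txn txn;
typedef struct { _Atomic(txn *) owner; _Atomic unsigned long version; } orec;

typedef struct { orec *o; unsigned long seen_version; } read_entry;
typedef struct { orec *o; long *addr; long val; } write_entry;

struct txn {
    _Atomic int status;
    read_entry  reads[8];  size_t n_reads;
    write_entry writes[8]; size_t n_writes;
};

/* Release the orecs of writes[0..n) that this transaction still owns. */
void release_orecs(txn *t, size_t n, bool bump_version) {
    for (size_t i = 0; i < n; i++) {
        orec *o = t->writes[i].o;
        if (atomic_load(&o->owner) != t) continue;  /* shared orec, already released */
        if (bump_version) o->version++;
        atomic_store(&o->owner, NULL);
    }
}

static bool abort_txn(txn *t, size_t n_acquired) {
    atomic_store(&t->status, TXN_ABORTED);
    release_orecs(t, n_acquired, false);
    return false;
}

bool txn_commit(txn *t) {
    /* 1. Acquire ownership of every write-set location via its orec. */
    for (size_t i = 0; i < t->n_writes; i++) {
        orec *o = t->writes[i].o;
        txn *expected = NULL;
        if (atomic_load(&o->owner) == t) continue;
        if (!atomic_compare_exchange_strong(&o->owner, &expected, t))
            return abort_txn(t, i);
    }
    /* 2. Validate: no read orec is owned by someone else or has changed. */
    for (size_t i = 0; i < t->n_reads; i++) {
        orec *o = t->reads[i].o;
        txn *owner = atomic_load(&o->owner);
        if ((owner != NULL && owner != t) || o->version != t->reads[i].seen_version)
            return abort_txn(t, t->n_writes);
    }
    /* 3. Commit point: the transaction's updates become logically visible. */
    atomic_store(&t->status, TXN_COMMITTED);
    /* 4. Cleanup: flush the redo log to memory, then 5. release the orecs. */
    for (size_t i = 0; i < t->n_writes; i++) *t->writes[i].addr = t->writes[i].val;
    release_orecs(t, t->n_writes, true);
    return true;
}
```

Steps 4 and 5 are the cleanup phase: a thread delayed between the commit point and the release is exactly the situation in which a blocking STM makes other transactions wait.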
Uncommon Case: Stealing • Two flags in the orec for the stealing process • stolen_orec: for orec’s stolen/unstolen state • copier_exists: indicates if there exists an owner in cleanup phase
Stealing Example • [Figure: the shared heap, the ownership records (orecs) o1 to o5 with their ID, flags (S, C), and version number, and three write sets. locX hashes to o1. The owner T1 (COMMITTED) holds locX:11 and its copyback is shown in progress, then complete, with the label "Clear C". Stealer 1, T2 (ACTIVE, then COMMITTED), holds locX:11, then locX:12. Stealer 2, T3 (ACTIVE), holds locX:12. A marker tracks where locX's logical value currently lives.]
Stealing Complexity • Stealing mechanism quite complex • Several corner case race conditions need to be handled (read the paper for further details) • Overhead of accessing stolen locations is quite high, requiring a lookup in the last stealer’s write set • However, we can throttle stealing and make it an uncommon case
Streamlining Common Case • To release acquired orecs, prior nonblocking STMs required • Expensive synch. instructions (e.g. CAS) • Indirection & garbage collection • Blocking STMs use a store instruction • So do we (details in the paper)
Timestamps and Validation • A significant optimization to read set validation (e.g. TL2) • Log time at which orec was modified (done when owner releases orec) • A reader checks if the orec was modified after it began execution, and if so, aborts conservatively
Adding Timestamps • Recall: orec contains a pointer to the owner • Superimpose a timestamp on this pointer • A writer releases orec by storing back the current global time • Timestamps lowered the cost of read set validation significantly
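One way to superimpose a timestamp on the owner pointer is to tag the low bit of the orec word. The encoding below is an assumption in the style of versioned locks (as in TL2), not necessarily the layout used in the paper, and the names `orec_acquire`, `orec_release`, and `read_is_valid` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { int status; } txn_desc;  /* stand-in descriptor, at least 2-byte aligned */

typedef atomic_uintptr_t orec_word;
atomic_uintptr_t global_clock = 0;        /* global time, advanced on each release */

/* Assumed encoding: low bit 1 means "free, holds a timestamp";
 * low bit 0 means "owned, holds an aligned pointer to the owner". */
static inline bool      is_timestamp(uintptr_t w)   { return (w & 1u) != 0; }
static inline uintptr_t make_timestamp(uintptr_t t) { return (t << 1) | 1u; }
static inline uintptr_t timestamp_of(uintptr_t w)   { return w >> 1; }

/* Acquire: CAS the observed timestamp to a pointer to the new owner. */
bool orec_acquire(orec_word *o, txn_desc *me) {
    uintptr_t seen = atomic_load(o);
    if (!is_timestamp(seen)) return false;  /* already owned */
    return atomic_compare_exchange_strong(o, &seen, (uintptr_t)me);
}

/* Release: a plain store of the new global time, no CAS needed by the owner. */
void orec_release(orec_word *o) {
    uintptr_t now = atomic_fetch_add(&global_clock, 1) + 1;
    atomic_store_explicit(o, make_timestamp(now), memory_order_release);
}

/* A read stays valid if its orec is free and was not released after we began. */
bool read_is_valid(orec_word *o, uintptr_t start_time) {
    uintptr_t w = atomic_load(o);
    return is_timestamp(w) && timestamp_of(w) <= start_time;
}
```

With this encoding a reader checks one word per read-set entry and compares it against the time at which its transaction began, rather than re-reading versions recorded at each access.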
Undo Log Variant • We have developed the first nonblocking undo log STM through simple modifications to a redo log variant • Stealing of orecs happens in the redo log STM when a committed owner is delayed • In undo log STMs stealing largely happens when an aborted owner is delayed • Logical values of locations are in aborted owner’s undo log
Experimental Platform • All STMs implemented in C • Throughput tests conducted on microbenchmarks • Scalable workloads: hash table, binary search tree • Torture tests (no scaling): counter, array of counters • Tests conducted on a 16-processor Sun Fire machine • We compared the following STMs • TL2 • TL2 with schedctl calls to avoid preemption pathologies • Harris and Fraser’s word-based nonblocking STM • Our Base blocking and nonblocking variants (without store-based release and the other optimizations) • 3 variants of our Optimized STM (eager redo log, lazy redo log, undo log)
Binary Search Tree • [Plot; series: Our Optimized STMs, TL2, Base NB, HF-STM]
Hash Table • [Plot; series: TL2-Sched, TL2, Our Optimized STMs]
Array of Counters • [Plot; series: Undo Log, TL2, TL2-Sched, Redo Log]
Array of Counters: Stealing rate • [Plot; series: Undo Log, Redo Log]
Conclusion • We presented several variants of a new STM that • Effectively decouples the common case from nonblocking infrastructure • Enables a more streamlined fast path (comparable to state-of-the-art blocking STMs) • Enables integration of key optimizations such as • Timestamp-based transaction validation • We have shown that common case performance of nonblocking STMs can be made competitive with state-of-the-art blocking STMs
Thank You! Questions?
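Common Case Example • [Figure: the shared heap and the ownership records (orecs) o1 to o5 with their ID, flags (S, C), and version number. locX hashes to o1. T1 (ACTIVE, then COMMITTED) holds locX:11 in its write set; its copyback is shown in progress, then complete, as locX changes from 10 to 11, and T1 then releases the orec with a store. A marker tracks where locX's logical value currently lives.]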
Basic Idea • Transaction steals ownership of the location under conflict • Inspired by Harris & Fraser’s WSTM • Stealing • Requires complex metadata management • Leads to high latency reads and writes • Switch the stolen location back to unstolen state as quickly as possible
Phase-I STM: Switching orec back to Unstolen state • If an orec is stolen, the logical values of the locations that map to it may be in the last stealer’s write set (pointed to by the orec) • A stealer reuses such a write set row (for a new transaction) only after the row is reclaimed • A subsequent stealer that comes across a stolen orec with copier_exists == false switches the orec to the unstolen state • Stealing-releasing is a complex process
Phase-I STM: Illustration • [Figure: the shared heap and the ownership records (orecs) o1 to o5 with their ID, flags (S, C), and version number. Ownership of an orec passes from the first owner T1 (COMMITTED, labeled "Clear C") to the second owner (stealer 1) T2 (ACTIVE) and then to the third owner (stealer 2) T3 (ACTIVE).]
STM API • stm_begin(my_txn): Initializes a transaction • stm_read(my_txn,loc): Speculative read of location loc • stm_write(my_txn,loc,val): Speculative write of val to loc • stm_commit(my_txn): Attempts to commit the transaction
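A typical calling pattern for this API is a retry loop around the transaction body. The slide gives only the function names and arguments, so the C types, the boolean return value of `stm_commit`, and the toy single-threaded implementation behind the calls are all assumptions made so that the example compiles and runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy, single-threaded stand-in for the STM (assumption): a bare redo log
 * with no orecs and no validation, so stm_commit never fails here. */
typedef struct { long *addr; long val; } stm_entry;
typedef struct { stm_entry log[16]; size_t len; } stm_txn;

void stm_begin(stm_txn *t) { t->len = 0; }

long stm_read(stm_txn *t, long *loc) {
    for (size_t i = t->len; i-- > 0; )          /* newest buffered write wins */
        if (t->log[i].addr == loc) return t->log[i].val;
    return *loc;
}

void stm_write(stm_txn *t, long *loc, long val) {
    assert(t->len < 16);
    t->log[t->len++] = (stm_entry){ loc, val };
}

bool stm_commit(stm_txn *t) {
    for (size_t i = 0; i < t->len; i++) *t->log[i].addr = t->log[i].val;
    t->len = 0;
    return true;  /* a real STM returns false when the transaction aborts */
}

/* Calling pattern: re-execute the whole body until the commit succeeds. */
void transfer(long *from, long *to, long amount) {
    stm_txn t;
    do {
        stm_begin(&t);
        long f = stm_read(&t, from);
        long g = stm_read(&t, to);
        stm_write(&t, from, f - amount);
        stm_write(&t, to, g + amount);
    } while (!stm_commit(&t));
}
```

For instance, with `long a = 100, b = 0;`, calling `transfer(&a, &b, 30)` leaves `a` at 70 and `b` at 30; in a real STM the loop body might run more than once before the commit succeeds.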
Phase-I STM: Example • [Figure: the shared heap and the ownership records (orecs) o1 to o5 with their ID, flags (S, C), and version number. locX hashes to o1. The first owner T1 (COMMITTED, labeled "Clear C") holds locX:11 and its copyback is shown in progress, then complete, as locX changes from 10 to 11. The second owner (stealer 1) T2 (ACTIVE) and the third owner (stealer 2) T3 (ACTIVE) each hold locX:11 in their write sets. A marker tracks where locX's logical value currently lives.]
Phase-I STM: Stealing Mechanism • Steal orec when transaction encounters orec acquired by a committed transaction • The committed transaction is copying back its speculative updates • Stealing done in two steps: • Merge speculative updates of victim to the orec’s locations into stealer’s write set • Acquire the orec with an atomic op • This involves setting some special flags that indicate to the system that the orec is stolen
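The two stealing steps can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the packing of the `stolen_orec` and `copier_exists` flags into the low bits of the orec word, the choice to set both flags on a steal, and all helper names are assumptions, and the corner-case races mentioned in the talk are not handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { TXN_ACTIVE, TXN_ABORTED, TXN_COMMITTED };

#define STOLEN_OREC   ((uintptr_t)1)  /* orec is in the stolen state */
#define COPIER_EXISTS ((uintptr_t)2)  /* an owner is still in its cleanup phase */
#define FLAG_MASK     ((uintptr_t)3)
#define WS_CAP        8

typedef struct { _Atomic uintptr_t word; } orec;  /* owner pointer | flags */
typedef struct { orec *o; long *addr; long val; } ws_entry;

typedef struct txn {
    _Atomic int status;
    ws_entry    writes[WS_CAP];
    size_t      n_writes;
} txn;

_Static_assert(_Alignof(txn) >= 4, "two low pointer bits must be free for flags");

txn *owner_of(uintptr_t w) { return (txn *)(w & ~FLAG_MASK); }

static bool ws_has(const txn *t, const long *addr) {
    for (size_t i = 0; i < t->n_writes; i++)
        if (t->writes[i].addr == addr) return true;
    return false;
}

/* Try to steal orec o from a committed owner that is still copying back. */
bool steal_orec(orec *o, txn *stealer) {
    uintptr_t seen = atomic_load(&o->word);
    txn *victim = owner_of(seen);
    if (victim == NULL || victim == stealer) return false;
    if (atomic_load(&victim->status) != TXN_COMMITTED) return false;

    /* Step 1: merge the victim's updates for this orec into our write set,
     * keeping any value the stealer has already written itself. */
    size_t old_len = stealer->n_writes;
    for (size_t i = 0; i < victim->n_writes; i++) {
        ws_entry e = victim->writes[i];
        if (e.o != o || ws_has(stealer, e.addr)) continue;
        if (stealer->n_writes == WS_CAP) { stealer->n_writes = old_len; return false; }
        stealer->writes[stealer->n_writes++] = e;
    }

    /* Step 2: take the orec with one atomic op, marking it as stolen. */
    uintptr_t desired = (uintptr_t)stealer | STOLEN_OREC | COPIER_EXISTS;
    if (!atomic_compare_exchange_strong(&o->word, &seen, desired)) {
        stealer->n_writes = old_len;  /* orec changed under us: undo the merge */
        return false;
    }
    return true;
}
```

The merge must precede the CAS: once the orec names the stealer, other transactions look for the logical values of its locations in the stealer's write set, so those values have to be there already.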
Phase-I STM: Stolen orec state • Logical values of stolen locations are always in the stealer’s write set • Subsequent accesses to these locations must look up the stealer’s write set • Quite expensive • We use some flags to indicate when it is safe for a new stealer to switch the orec back to the unstolen state