Overview of POWER HTM. Maged Michael IBM T J Watson Research Center WTTM 2014 15 July 2014. Outline. POWER HTM features. Use cases. Performance results. Acknowledgment of IBM colleagues in Austin, Yorktown, Tokyo, and Toronto.
Overview of POWER HTM
Maged Michael
IBM T J Watson Research Center
WTTM 2014, 15 July 2014
Outline
• POWER HTM features
• Use cases
• Performance results
Acknowledgment of IBM colleagues in Austin, Yorktown, Tokyo, and Toronto.
Any errors in describing POWER HTM features and performance in this presentation are my own.
WTTM 2014 - POWER HTM
POWER HTM Features
Basic Transactional Instructions
• TBEGIN: Begins an outermost transaction (or increments nesting level)
• TEND: Commits an outermost transaction (or decrements nesting level)
• TBEGIN sets a condition register to indicate success or failure
• TEND sets a condition register to indicate whether it was executed in a transaction or not (i.e., extraneous TEND)
• Transaction failure transfers control to the instruction following TBEGIN
• Basic example
    tbegin.                          # begin transaction
    beq   failure_handler            # branch to failure handler if failure code is set
    ...
    tend.
    bgt   was_not_in_a_transaction   # (optional) check if tend was extraneous
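The tbegin./tend. pattern above can be sketched in C. This is a minimal sketch, assuming GCC's PowerPC HTM built-ins (`__builtin_tbegin`, `__builtin_tabort`, `__builtin_tend`, enabled with `-mhtm`). The names `fallback_lock`, `shared_counter`, `MAX_HTM_ATTEMPTS`, and `counter_increment` are hypothetical and not from the slides. When HTM is not enabled, only the lock-based path is compiled.

```c
#include <assert.h>
#include <stdatomic.h>
#if defined(__HTM__)               /* defined by GCC when building with -mhtm */
#include <htmintrin.h>
#endif

#define MAX_HTM_ATTEMPTS 4         /* hypothetical retry budget */

static atomic_int fallback_lock;   /* hypothetical: 0 = free, 1 = held */
static long shared_counter;

/* Increment shared_counter atomically and return the new value. */
long counter_increment(void)
{
    long v;
#if defined(__HTM__)
    for (int attempt = 0; attempt < MAX_HTM_ATTEMPTS; attempt++) {
        if (__builtin_tbegin(0)) {
            /* Transactional state. Reading the lock puts it in the read set,
               so a fallback acquisition conflicts with this transaction. */
            if (atomic_load_explicit(&fallback_lock, memory_order_relaxed) != 0)
                __builtin_tabort(0);
            v = ++shared_counter;
            __builtin_tend(0);
            return v;
        }
        /* On failure, execution resumes here with tbegin reporting failure. */
    }
#endif
    /* Non-HTM software path: needed because HTM gives no progress guarantee. */
    int expected = 0;
    while (!atomic_compare_exchange_weak(&fallback_lock, &expected, 1))
        expected = 0;
    v = ++shared_counter;
    atomic_store(&fallback_lock, 0);
    return v;
}
```

Calling `counter_increment()` twice from one thread returns 1 and then 2, whichever path is taken.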
Features of Basic Transactions
• No hardware progress guarantee. Failure handlers must include an alternative non-HTM software path.
• Strong isolation. Hardware detection of conflicts with non-transactional accesses.
• Flat nesting. Transaction failure transfers control to the instruction following the outermost TBEGIN.
• Order guarantee for successful transactions among three groups of (cacheable write-back) memory accesses:
  • Before TBEGIN
  • Inside the transaction
  • After TEND
Example: Initially X == Y == 0. The outcome r1 == r2 == 0 is not allowed.
    Thread 1          Thread 2
    st X = 1          st Y = 1
    tbegin.           tbegin.
    ld r1 = Y         ld r2 = X
    tend.             tend.
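To see why r1 == r2 == 0 is excluded in the example above, one can enumerate the possible outcomes once each thread's store is ordered before its transaction. The following is a minimal sketch in C, assuming (as a simplification) that the order guarantee lets each thread's two steps run in program order and each transaction act as one atomic step. It models the guarantee and does not execute HTM instructions. The names `outcome_seen`, `run_schedule`, and `enumerate_outcomes` are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* outcome_seen[r1][r2] is set to 1 if some schedule yields (r1, r2). */
static int outcome_seen[2][2];

/* Thread 0: "st X = 1", then "ld r1 = Y" (its transaction).
   Thread 1: "st Y = 1", then "ld r2 = X" (its transaction).
   order[i] names the thread that executes the i-th step overall. */
static void run_schedule(const int order[4])
{
    int x = 0, y = 0, r1 = 0, r2 = 0;
    int pc[2] = {0, 0};                  /* next step of each thread */
    for (int i = 0; i < 4; i++) {
        int t = order[i];
        if (t == 0) {
            if (pc[0] == 0) x = 1; else r1 = y;
        } else {
            if (pc[1] == 0) y = 1; else r2 = x;
        }
        pc[t]++;
    }
    outcome_seen[r1][r2] = 1;
}

/* Enumerate every schedule in which each thread takes exactly two steps. */
void enumerate_outcomes(void)
{
    memset(outcome_seen, 0, sizeof outcome_seen);
    for (int mask = 0; mask < 16; mask++) {
        int order[4], steps_of_thread1 = 0;
        for (int i = 0; i < 4; i++) {
            order[i] = (mask >> i) & 1;
            steps_of_thread1 += order[i];
        }
        if (steps_of_thread1 == 2)
            run_schedule(order);
    }
}
```

After `enumerate_outcomes()`, the entries for (0,1), (1,0), and (1,1) are set and the entry for (0,0) is not: whichever transaction runs last always observes the other thread's earlier store.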
Transaction Abort
• TABORT: Causes transaction failure
• Unconditional variants with and without 8-bit code
• Conditional variants with 32/64-bit register or immediate parameters
• Example: Transactional lock elision entry
  With an unconditional abort:
    tbegin.
    beq-  tle_failure_handler
    ld    r=LOCK               # load lock
    cmpi  r==FREE              # compare with free value
    beq+  $+8                  # if free, start critical section
    tabort.                    # if not free, abort TLE transaction
    <critical section>
  With a conditional abort:
    tbegin.
    beq-  tle_failure_handler
    ld    r=LOCK               # load lock
    tabort[wd]ci. r!=FREE      # if not free, abort TLE transaction
    <critical section>
Transactional Registers and Failure Causes
• TFHAR: Address of failure handler, i.e., outermost TBEGIN + 4
• TFIAR: Address of failure instruction when applicable
• TEXASR: Transaction exception and status register. Includes cause of transaction failure.
• TEXASR contains a summary bit that hints whether the cause of failure is likely to be persistent or transient
• TEXASR also contains an 8-bit software code that may have been provided with a TABORT instruction
• Failure causes include conflicts, abort instructions, footprint overflow, I/O, access to non-write-back memory, nesting level overflow, and disallowed instructions (e.g., sleep, cache invalidation).
Suspending/Resuming Transactional State
• TSUSPEND: Suspends the current transaction, i.e., transitions from transactional state to suspended state
• TRESUME: Resumes the suspended transaction
• Loads and stores in suspended state are performed non-speculatively as they occur and do not use hardware transactional resources
• No new transactions can be initiated in suspended state
• Transaction failure is recorded but failure handling is deferred until the transaction is resumed
• Loads from locations written transactionally return the written values as long as the transaction has not failed
• Stores in suspended state to locations accessed transactionally cause transaction failure
• TCHECK: Checks for transaction failure and validity of prior memory operations. (May be used in transactional state too)
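The suspend/resume behaviour above can be sketched in C. This is a minimal sketch, assuming GCC's PowerPC HTM built-ins `__builtin_tsuspend` and `__builtin_tresume` (enabled with `-mhtm`) and an assumed 128-byte conflict-detection granularity. The names `balance`, `attempt_log`, `fallback_lock`, and `deposit` are hypothetical. The store to `attempt_log` is made in suspended state, so it is non-speculative and survives a rollback. It is kept on its own line so that it does not touch data accessed transactionally.

```c
#include <assert.h>
#include <stdatomic.h>
#if defined(__HTM__)               /* defined by GCC when building with -mhtm */
#include <htmintrin.h>
#endif

#define LINE_BYTES 128             /* assumed conflict-detection granularity */

static _Alignas(LINE_BYTES) long balance;                     /* transactional data */
static _Thread_local _Alignas(LINE_BYTES) long attempt_log;   /* per-thread log */
static _Alignas(LINE_BYTES) atomic_int fallback_lock;         /* 0 = free, 1 = held */

/* Add amount to balance; attempt_log counts every attempt,
   including attempts whose transaction is later rolled back. */
void deposit(long amount)
{
#if defined(__HTM__)
    if (__builtin_tbegin(0)) {
        if (atomic_load_explicit(&fallback_lock, memory_order_relaxed) != 0)
            __builtin_tabort(0);
        balance += amount;         /* speculative store */
        __builtin_tsuspend();
        attempt_log++;             /* non-speculative store in suspended state */
        __builtin_tresume();       /* a recorded failure is handled from here on */
        __builtin_tend(0);
        return;
    }
#endif
    /* Non-HTM software path. */
    attempt_log++;
    int expected = 0;
    while (!atomic_compare_exchange_weak(&fallback_lock, &expected, 1))
        expected = 0;
    balance += amount;
    atomic_store(&fallback_lock, 0);
}
```

If the transaction fails after the suspended store, the update to `balance` is discarded but the increment of `attempt_log` remains, which is the property that makes suspended state useful for logging and statistics.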
Rollback Only Transactions (ROT)
• Intended for single thread speculation
• Not intended for shared data
• No conflict detection
• Keeps track only of transactional stores
• No order guarantees
• May be nested with atomic transactions
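A rollback-only transaction can serve as a hardware undo log for thread-private data. The following is a minimal sketch in C, assuming that GCC's `__builtin_tbegin(1)` (enabled with `-mhtm`) starts a ROT by setting the R field of tbegin. The function names `would_overflow` and `scale_all_or_nothing` are hypothetical. The array is assumed to be private to the calling thread, since a ROT performs no conflict detection.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#if defined(__HTM__)               /* defined by GCC when building with -mhtm */
#include <htmintrin.h>
#endif

/* True if v * factor does not fit in an int (factor must be positive). */
static int would_overflow(int v, int factor)
{
    return v > INT_MAX / factor || v < INT_MIN / factor;
}

/* Multiply every element of a thread-private array by factor, in place.
   Returns 1 on success. Returns 0 and leaves the array unchanged if any
   product would overflow. */
int scale_all_or_nothing(int *a, size_t n, int factor)
{
    assert(factor > 0);
#if defined(__HTM__)
    if (__builtin_tbegin(1)) {              /* assumed: R = 1 requests a ROT */
        for (size_t i = 0; i < n; i++) {
            if (would_overflow(a[i], factor))
                __builtin_tabort(0);        /* optimistic assumption was wrong */
            a[i] *= factor;
        }
        __builtin_tend(0);
        return 1;
    }
    /* Rolled back: the hardware discarded the stores made above. */
#endif
    /* Conservative path: validate everything first, then update. */
    for (size_t i = 0; i < n; i++)
        if (would_overflow(a[i], factor))
            return 0;
    for (size_t i = 0; i < n; i++)
        a[i] *= factor;
    return 1;
}
```

The optimistic path makes a single pass and relies on rollback when the assumption fails; the conservative path needs two passes.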
Use Cases
Transactional Lock Elision
• Transactional lock elision - Entry
    pthread_mutex_lock(mutex) {
      if (do_tle(mutex)) {              // Check TLE state and collect stats if needed
        attempts = 0;                   // Count TLE attempts for current
      TRY_TLE:
        if (__TM_begin()) {             // Inside HW transaction
          if (!is_free(mutex))
            __TM_abort();               // If mutex is busy abort HW transaction
          return 0;                     // return SUCCESS
        }
        // HW transaction failed
        // Failure handler:
        // Decide to retry TLE or fallback on conventional implementation
        // based on number of failed attempts, cause of failure, and lock recursion
        // May update TLE stats for the mutex
        if (decide_to_try_TLE_again(mutex,++attempts,__TM_is_failure_persistent())) {
          wait_until_free(mutex);
          backoff(attempts);
          goto TRY_TLE;
        }
      }
      <Fallback on conventional non-TLE lock acquisition implementation>
    }
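The slide leaves the retry decision and the backoff unspecified. One possible policy is sketched below in C. Everything in it is hypothetical: the limits `TLE_MAX_ATTEMPTS` and `TLE_MAX_BACKOFF`, the simplified signature without the mutex argument, and the function `backoff_iterations`. It gives up at once on a failure hinted as persistent, retries transient failures a bounded number of times, and doubles the spin count after each failed attempt.

```c
#include <assert.h>
#include <stdbool.h>

#define TLE_MAX_ATTEMPTS 8        /* hypothetical limit, would need tuning */
#define TLE_MAX_BACKOFF  1024u    /* hypothetical cap on spin iterations */

/* attempts: number of failed TLE attempts so far (1 after the first failure).
   failure_persistent: the persistent/transient hint from the last failure. */
bool decide_to_try_tle_again(int attempts, bool failure_persistent)
{
    if (failure_persistent)
        return false;             /* e.g. footprint overflow: retrying is futile */
    return attempts < TLE_MAX_ATTEMPTS;
}

/* Spin iterations to wait before the next attempt: 1, 2, 4, ... capped. */
unsigned backoff_iterations(int attempts)
{
    assert(attempts >= 1);
    unsigned n = 1;
    for (int i = 1; i < attempts && n < TLE_MAX_BACKOFF; i++)
        n *= 2;
    return n;
}
```

A production policy would likely also use the per-mutex statistics and the lock recursion state that the slide mentions.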
Transactional Lock Elision
• Transactional lock elision - Exit
    pthread_mutex_unlock(mutex) {
      if (is_free(mutex))
        if (__TM_end())                 // End TLE transaction
          return 0;                     // return success
      <Follow conventional non-TLE path>
    }
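The entry and exit paths fit together as follows. This is a minimal sketch in C of a single-attempt elided spinlock, assuming GCC's PowerPC HTM built-ins (enabled with `-mhtm`). The type `tle_lock` and the functions `tle_acquire` and `tle_release` are hypothetical and far simpler than a Pthreads mutex. The exit path relies on the same observation as the slide: a thread that finds the lock free at unlock time must have elided it, so it ends the transaction instead of releasing the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#if defined(__HTM__)               /* defined by GCC when building with -mhtm */
#include <htmintrin.h>
#endif

typedef struct {
    atomic_int held;               /* hypothetical: 0 = free, 1 = held */
} tle_lock;

void tle_acquire(tle_lock *l)
{
#if defined(__HTM__)
    if (__builtin_tbegin(0)) {
        if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0)
            return;                /* elided: critical section runs transactionally */
        __builtin_tabort(0);       /* lock busy: fail the transaction */
    }
    /* Transaction failed or was aborted: take the lock conventionally. */
#endif
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->held, &expected, 1))
        expected = 0;
}

void tle_release(tle_lock *l)
{
#if defined(__HTM__)
    if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0) {
        __builtin_tend(0);         /* lock is free, so this thread elided it */
        return;
    }
#endif
    atomic_store(&l->held, 0);     /* conventional release */
}
```

A caller brackets the critical section with `tle_acquire(&l)` and `tle_release(&l)`; on either path the lock is free again afterwards.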
Path Length Reduction
• Example: java.util.concurrent ConcurrentLinkedQueue.offer(), critical path of the CAS-based implementation
No TM (23 instructions):
        l     t=[tail]
        isync
        l     s=[t.next]
        isync
        l     r=[tail]
        isync
        cmp   r,t
        bne   start_over
        cmpi  s,0
        bne   fix_tail
        hwsync
    L1: larx  r=[t.next]
        cmp   r,s
        bne   start_over
        stcx  [t.next]=n
        bne-  L1
        hwsync
    L2: larx  r=[tail]
        cmp   r,t
        bne   skip_stcx
        stcx  [tail]=n
        bne-  L2
        isync
Path Length Reduction
• CLQ with TM
TM (9 instructions on the common path):
        tbegin
        beq-  failure_handler
        l     t=[tail]
        l     s=[t.next]
        cmpi  s,0
        beq+  L1               # skip next instruction
        mr    t=s              # not common case
    L1: st    [t.next]=n
        st    [tail]=n
        tend
• Fallback on conventional CAS-based implementation in case of TM failure
• Aggregation of memory barriers
Other Use Case Examples
• Hybrid HW/SW high-level transactions. E.g., HTM commit acceleration, spin-waiting in suspended state.
• Thread-level speculation with commit ordering using suspended-mode accesses
• Single thread speculation using Rollback-Only Transaction. Assume safe optimization and rollback if optimization was unsafe.
Performance
Single Thread
• An empty Pthreads TLE critical section is 6% faster than a conventional Pthreads critical section.
• 71% reduction in execution time (warm caches) of CLQ offer()/poll() pairs using TM path length reduction and memory barrier aggregation
• The execution time of an empty transaction with suspend/resume is 3.4x that of an empty transaction without suspend/resume
Pthreads TLE - Microbenchmarks
• Pattern 1: high contention, no conflicts, data set fits in TM capacity
• Pattern 2: high contention, data set that overflows TM capacity
• Pattern 3: Mixed pattern
  • 80% high contention, no conflict, fits in TM capacity
  • 20% medium contention, overflows TM capacity
Pthreads TLE - Memcached
• Memcached server with varying number of threads
• Client running on the same machine
• 96 hardware threads (12 cores, SMT8)
• Best TLE throughput (on 16 threads) is 26.9% higher than best locking throughput (on 12 threads)
• On 16 threads, TLE throughput is 37.5% higher than locking throughput
Summary
• POWER HTM Instruction Set
• Suspend / Resume
• Rollback Only Transactions
• Low HTM overheads
• Caution not to learn wrong lessons from specific implementations of specific HTM architectures. E.g., POWER HTM and BG/Q HTM
Thank You