
Overview of POWER HTM

Overview of POWER HTM. Maged Michael IBM T J Watson Research Center WTTM 2014 15 July 2014. Outline. POWER HTM features. Use cases. Performance results. Acknowledgment of IBM colleagues in Austin, Yorktown, Tokyo, and Toronto.


Presentation Transcript


  1. Overview of POWER HTM
     Maged Michael, IBM T. J. Watson Research Center
     WTTM 2014, 15 July 2014

  2. Outline
     • POWER HTM features
     • Use cases
     • Performance results
     Acknowledgment of IBM colleagues in Austin, Yorktown, Tokyo, and Toronto. Any errors in describing POWER HTM features and performance in this presentation are my own.
     WTTM 2014 - POWER HTM

  3. POWER HTM Features

  4. Basic Transactional Instructions
     • TBEGIN: Begins an outermost transaction (or increments the nesting level)
     • TEND: Commits an outermost transaction (or decrements the nesting level)
     • TBEGIN sets a condition register to indicate success or failure
     • TEND sets a condition register to indicate whether it was executed in a transaction or not (i.e., an extraneous TEND)
     • Transaction failure transfers control to the instruction following TBEGIN
     • Basic example:

         tbegin.                          # begin transaction
         beq   failure_handler            # branch to failure handler if failure code is set
         ...
         tend.
         bgt   was_not_in_a_transaction   # (optional) check if tend was extraneous
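As a rough C-level counterpart to the basic example above, the sketch below wraps GCC's PowerPC HTM built-ins `__builtin_tbegin` and `__builtin_tend` (available when compiling with `-mhtm`) in macros and pairs the transaction with a non-HTM software path. The `HTM_BEGIN`/`HTM_END` macros, the `increment` function, and the retry limit are illustrative assumptions, not part of the presentation. On targets without HTM the macros simply report failure, so only the software path runs.

```c
#include <stdatomic.h>

/* Assumption: GCC on POWER with -mhtm defines __HTM__ and provides the
   built-ins; elsewhere every "transaction" is reported as failed. */
#if defined(__powerpc64__) && defined(__HTM__)
#define HTM_BEGIN() __builtin_tbegin(0)          /* nonzero if the transaction started */
#define HTM_END()   ((void)__builtin_tend(0))
#else
#define HTM_BEGIN() 0
#define HTM_END()   ((void)0)
#endif

enum { MAX_HTM_ATTEMPTS = 3 };                   /* hypothetical retry limit */

/* Increment a shared counter, preferring a hardware transaction. */
void increment(atomic_long *counter)
{
    for (int attempt = 0; attempt < MAX_HTM_ATTEMPTS; attempt++) {
        if (HTM_BEGIN()) {
            /* Transactional path: the load and store commit together or not at all. */
            long v = atomic_load_explicit(counter, memory_order_relaxed);
            atomic_store_explicit(counter, v + 1, memory_order_relaxed);
            HTM_END();
            return;
        }
        /* On failure, control resumes here, as if HTM_BEGIN() had returned 0. */
    }
    /* Software fallback: there is no hardware progress guarantee. */
    atomic_fetch_add(counter, 1);
}
```

The atomic fallback relies on the hardware detecting conflicts between transactions and non-transactional accesses. Code built with `-mhtm` also assumes that the processor and operating system actually support HTM at run time.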

  5. Features of Basic Transactions
     • No hardware progress guarantee. Failure handlers must include an alternative non-HTM software path.
     • Strong isolation. Hardware detection of conflicts with non-transactional accesses.
     • Flat nesting. Transaction failure transfers control to the instruction following the outermost TBEGIN.
     • Order guarantee for successful transactions among three groups of (cacheable write-back) memory accesses:
       • Before TBEGIN
       • Inside the transaction
       • After TEND
     Example: Initially X == Y == 0. The outcome r1 == r2 == 0 is not allowed.

         Thread 1          Thread 2
         st X = 1          st Y = 1
         tbegin.           tbegin.
         ld r1 = Y         ld r2 = X
         tend.             tend.

  6. Transaction Abort
     • TABORT: Causes transaction failure
     • Unconditional variants with and without an 8-bit code
     • Conditional variants with 32/64-bit register or immediate parameters
     • Example: Transactional lock elision entry, using an unconditional abort:

         tbegin.
         beq-  tle_failure_handler
         ld    r=LOCK                # load lock
         cmpi  r==FREE               # compare with free value
         beq+  $+8                   # if free, start critical section
         tabort.                     # if not free, abort TLE transaction
         <critical section>

       The same entry using a conditional abort:

         tbegin.
         beq-  tle_failure_handler
         ld    r=LOCK                # load lock
         tabort[wd]ci. r!=FREE       # if not free, abort TLE transaction
         <critical section>

  7. Transactional Registers and Failure Causes
     • TFHAR: Address of the failure handler, i.e., outermost TBEGIN + 4
     • TFIAR: Address of the failure instruction, when applicable
     • TEXASR: Transaction exception and status register. Includes the cause of transaction failure.
     • TEXASR contains a summary bit that provides a hint of whether the cause of failure is likely to be persistent or transient
     • TEXASR also contains an 8-bit software code that may have been provided with a TABORT instruction
     • Failure causes include conflicts, abort instructions, footprint overflow, I/O, access to non-write-back memory, nesting level overflow, and disallowed instructions (e.g., sleep, cache invalidation).
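To make the TEXASR description above more concrete, here is a minimal sketch of a decoder for a 64-bit TEXASR value. The bit positions are an assumption based on my reading of Power ISA 2.07 and of the field macros in GCC's `htmintrin.h` (failure code in bits 0:7 with bit 7 as the persistent hint, footprint overflow in bit 10, abort in bit 31, using IBM numbering where bit 0 is the most significant); they should be checked against the ISA before use. The names `texasr_field`, `htm_failure`, and `decode_texasr` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract `width` bits ending at IBM bit number `last_bit`
   (bit 0 is the most significant bit of the 64-bit register). */
static uint64_t texasr_field(uint64_t texasr, unsigned last_bit, unsigned width)
{
    return (texasr >> (63u - last_bit)) & ((UINT64_C(1) << width) - 1u);
}

/* Hypothetical summary of a few failure indications. */
struct htm_failure {
    unsigned code;            /* assumed bits 0:7: 8-bit failure code        */
    bool persistent;          /* assumed bit 7: failure likely persistent    */
    bool footprint_overflow;  /* assumed bit 10                              */
    bool abort_instruction;   /* assumed bit 31: caused by a TABORT variant  */
};

struct htm_failure decode_texasr(uint64_t texasr)
{
    struct htm_failure f;
    f.code               = (unsigned)texasr_field(texasr, 7, 8);
    f.persistent         = texasr_field(texasr, 7, 1) != 0;
    f.footprint_overflow = texasr_field(texasr, 10, 1) != 0;
    f.abort_instruction  = texasr_field(texasr, 31, 1) != 0;
    return f;
}
```

In a failure handler built with GCC and `-mhtm`, the input would typically come from `__builtin_get_texasr()`; the decoder itself is plain C and can be exercised with synthetic values.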

  8. Suspending/Resuming Transactional State
     • TSUSPEND: Suspends the current transaction, i.e., transitions from transactional state to suspended state
     • TRESUME: Resumes the suspended transaction
     • Loads and stores in suspended state are performed non-speculatively as they occur and do not use hardware transactional resources
     • No new transactions can be initiated in suspended state
     • Transaction failure is recorded, but failure handling is deferred until the transaction is resumed
     • Loads in suspended state from locations written transactionally return the written values as long as the transaction has not failed
     • Stores in suspended state to locations accessed transactionally cause transaction failure
     • TCHECK: Checks for transaction failure and validity of prior memory operations (may be used in transactional state too)
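One way to picture suspended state is a statistics update that must survive a rollback. The sketch below assumes GCC's `__builtin_tsuspend` and `__builtin_tresume` built-ins (with `-mhtm`); the macros, the `tx_stats` structure, and `add_counted` are hypothetical names. It also assumes conflicts are tracked per 128-byte cache line, which is why the counter is given a line of its own so that the suspended store does not touch a location accessed transactionally.

```c
#include <stdatomic.h>

/* Assumption: GCC on POWER with -mhtm; elsewhere transactions always "fail". */
#if defined(__powerpc64__) && defined(__HTM__)
#define HTM_BEGIN()   __builtin_tbegin(0)
#define HTM_END()     ((void)__builtin_tend(0))
#define HTM_SUSPEND() ((void)__builtin_tsuspend())
#define HTM_RESUME()  ((void)__builtin_tresume())
#else
#define HTM_BEGIN()   0
#define HTM_END()     ((void)0)
#define HTM_SUSPEND() ((void)0)
#define HTM_RESUME()  ((void)0)
#endif

/* Hypothetical per-thread statistics on a private 128-byte line. */
struct tx_stats {
    _Alignas(128) volatile unsigned long started;
};

/* Add delta to *value, counting every transaction that got started. */
void add_counted(atomic_long *value, long delta, struct tx_stats *stats)
{
    if (HTM_BEGIN()) {
        HTM_SUSPEND();
        stats->started++;   /* performed non-speculatively; kept even if we roll back */
        HTM_RESUME();
        long v = atomic_load_explicit(value, memory_order_relaxed);
        atomic_store_explicit(value, v + delta, memory_order_relaxed);
        HTM_END();
        return;
    }
    /* Software fallback when the transaction fails or HTM is unavailable. */
    atomic_fetch_add(value, delta);
}
```

If the transaction fails after the resume, `started` keeps its increment while the update to `*value` is discarded and redone by the fallback.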

  9. Rollback Only Transactions (ROT)
     • Intended for single thread speculation
     • Not intended for shared data
     • No conflict detection
     • Keeps track only of transactional stores
     • No order guarantees
     • May be nested with atomic transactions

  10. Use Cases

  11. Transactional Lock Elision
     • Transactional lock elision - Entry

         pthread_mutex_lock(mutex) {
           if (do_tle(mutex)) {            // Check TLE state and collect stats if needed
             attempts = 0;                 // Count TLE attempts for current
           TRY_TLE:
             if (__TM_begin()) {
               // Inside HW transaction
               if (!is_free(mutex))
                 __TM_abort();             // If mutex is busy abort HW transaction
               return 0;                   // return SUCCESS
             }
             // HW transaction failed
             // Failure handler:
             // Decide to retry TLE or fallback on conventional implementation
             // based on number of failed attempts, cause of failure, and lock recursion
             // May update TLE stats for the mutex
             if (decide_to_try_TLE_again(mutex, ++attempts, __TM_is_failure_persistent())) {
               wait_until_free(mutex);
               backoff(attempts);
               goto TRY_TLE;
             }
           }
           <Fallback on conventional non-TLE lock acquisition implementation>
         }
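The slide leaves the retry decision abstract. A minimal sketch of one possible policy is shown below: give up at once on a failure hinted as persistent, otherwise retry up to a fixed limit with a capped exponential backoff. The function names, the dropped `mutex` parameter, and both constants are assumptions made for illustration; the presentation does not specify the actual policy.

```c
#include <stdbool.h>

enum {
    TLE_MAX_ATTEMPTS  = 8,   /* hypothetical retry limit        */
    TLE_BACKOFF_SHIFT = 10   /* hypothetical cap: 2^10 spins    */
};

/* Decide whether another elision attempt is worthwhile.
   `attempts` is the number of failed attempts so far. */
bool decide_to_try_tle_again(int attempts, bool failure_persistent)
{
    if (failure_persistent)
        return false;                    /* retrying is unlikely to help */
    return attempts < TLE_MAX_ATTEMPTS;
}

/* Number of spin iterations to wait before the next attempt:
   doubles with each failure, capped at 2^TLE_BACKOFF_SHIFT. */
unsigned backoff_spins(int attempts)
{
    int shift = attempts < TLE_BACKOFF_SHIFT ? attempts : TLE_BACKOFF_SHIFT;
    if (shift < 0)
        shift = 0;
    return 1u << shift;
}
```

A real policy could also weigh the recorded failure cause and per-mutex statistics, as the comments in the slide suggest.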

  12. Transactional Lock Elision
     • Transactional lock elision - Exit

         pthread_mutex_unlock(mutex) {
           if (is_free(mutex))
             if (__TM_end())    // End TLE transaction
               return 0;        // return success
           <Follow conventional non-TLE path>
         }
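To show how the entry and exit paths fit together, the sketch below elides a simple test-and-set lock instead of a Pthreads mutex. It assumes GCC's `__builtin_tbegin`, `__builtin_tend`, and `__builtin_tabort` built-ins (with `-mhtm`); `struct tle_lock`, `tle_acquire`, `tle_release`, and the macros are hypothetical names, and the retry logic from the entry slide is omitted for brevity.

```c
#include <stdatomic.h>

/* Assumption: GCC on POWER with -mhtm; elsewhere elision never starts. */
#if defined(__powerpc64__) && defined(__HTM__)
#define HTM_BEGIN() __builtin_tbegin(0)
#define HTM_END()   ((void)__builtin_tend(0))
#define HTM_ABORT() ((void)__builtin_tabort(0))
#else
#define HTM_BEGIN() 0
#define HTM_END()   ((void)0)
#define HTM_ABORT() ((void)0)
#endif

struct tle_lock { atomic_int held; };   /* 0 = free, 1 = held */

void tle_acquire(struct tle_lock *l)
{
    if (HTM_BEGIN()) {
        /* Reading the lock word adds it to the transaction's read set, so a
           later conventional acquisition by another thread conflicts with us. */
        if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0)
            return;                      /* elided: run critical section transactionally */
        HTM_ABORT();                     /* lock busy: fail the transaction */
    }
    /* Conventional path: really take the lock. */
    int expected = 0;
    while (!atomic_compare_exchange_weak_explicit(&l->held, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;
}

void tle_release(struct tle_lock *l)
{
    if (atomic_load_explicit(&l->held, memory_order_relaxed) == 0) {
        HTM_END();                       /* lock looks free, so we were eliding it */
        return;
    }
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

As on the slides, the release path infers that it is inside a transaction from the fact that the lock it "holds" still appears free.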

  13. Path Length Reduction
     • Example: java.util.concurrent ConcurrentLinkedQueue.offer()
     Critical path of the CAS-based implementation (no TM; the slide numbers its lines 1 to 23):

         l     t=[tail]
         isync
         l     s=[t.next]
         isync
         l     r=[tail]
         isync
         cmp   r,t
         bne   start_over
         cmpi  s,0
         bne   fix_tail
         hwsync
     L1: larx  r=[t.next]
         cmp   r,s
         bne   start_over
         stcx  [t.next]=n
         bne-  L1
         hwsync
     L2: larx  r=[tail]
         cmp   r,t
         bne   skip_stcx
         stcx  [tail]=n
         bne-  L2
         isync

  14. Path Length Reduction
     • CLQ with TM (the slide numbers its lines 1 to 9):

         tbegin.
         beq-  failure_handler
         l     t=[tail]
         l     s=[t.next]
         cmpi  s,0
         beq+  L1                # skip next instruction
         mr    t=s               # not common case
     L1: st    [t.next]=n
         st    [tail]=n
         tend.

     • Fallback on conventional CAS-based implementation in case of TM failure
     • Aggregation of memory barriers
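The same idea can be sketched in C: attempt the enqueue as one short transaction and fall back on a CAS-based enqueue (in the style of the Michael-Scott queue) if the transaction does not start or fails. This is an illustration under assumptions, not the Java implementation measured in the talk: it uses GCC's `-mhtm` built-ins, and the `node`, `queue`, `queue_init`, and `enqueue` names are hypothetical. Like the slide, the transactional path assumes the tail lags the last node by at most one.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: GCC on POWER with -mhtm; elsewhere only the CAS path runs. */
#if defined(__powerpc64__) && defined(__HTM__)
#define HTM_BEGIN() __builtin_tbegin(0)
#define HTM_END()   ((void)__builtin_tend(0))
#else
#define HTM_BEGIN() 0
#define HTM_END()   ((void)0)
#endif

struct node  { int value; _Atomic(struct node *) next; };
struct queue { _Atomic(struct node *) tail; };

/* The queue starts with a caller-provided dummy node. */
void queue_init(struct queue *q, struct node *dummy)
{
    atomic_init(&dummy->next, NULL);
    atomic_init(&q->tail, dummy);
}

void enqueue(struct queue *q, struct node *n)
{
    atomic_init(&n->next, NULL);         /* n is not shared yet */
    if (HTM_BEGIN()) {
        /* Transactional path: plain loads and stores, no CAS, no barriers. */
        struct node *t = atomic_load_explicit(&q->tail, memory_order_relaxed);
        struct node *s = atomic_load_explicit(&t->next, memory_order_relaxed);
        if (s != NULL)
            t = s;                       /* tail lagging by one: not the common case */
        atomic_store_explicit(&t->next, n, memory_order_relaxed);
        atomic_store_explicit(&q->tail, n, memory_order_relaxed);
        HTM_END();
        return;
    }
    /* Fallback: conventional CAS-based enqueue. */
    for (;;) {
        struct node *t = atomic_load(&q->tail);
        struct node *s = atomic_load(&t->next);
        if (t != atomic_load(&q->tail))
            continue;                    /* tail moved: start over */
        if (s != NULL) {                 /* help advance a lagging tail */
            atomic_compare_exchange_strong(&q->tail, &t, s);
            continue;
        }
        struct node *expected = NULL;
        if (atomic_compare_exchange_strong(&t->next, &expected, n)) {
            atomic_compare_exchange_strong(&q->tail, &t, n);
            return;
        }
    }
}
```

The transactional path replaces two CAS loops and their surrounding barriers with four plain memory accesses, which is the path length reduction the slides describe.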

  15. Other Use Case Examples
     • Hybrid HW/SW high-level transactions, e.g., HTM commit acceleration and spin-waiting in suspended state
     • Thread-level speculation with commit ordering using suspended-mode accesses
     • Single thread speculation using Rollback-Only Transactions: assume an optimization is safe, and roll back if it turns out to be unsafe

  16. Performance

  17. Single Thread
     • An empty Pthreads TLE critical section is 6% faster than a conventional Pthreads critical section.
     • 71% reduction in execution time (warm caches) of CLQ offer()/poll() pairs using TM path length reduction and memory barrier aggregation.
     • The execution time of an empty transaction with suspend/resume is 3.4x that of an empty transaction without suspend/resume.

  18. Pthreads TLE - Microbenchmarks
     • Pattern 1: high contention, no conflicts, data set fits in TM capacity
     • Pattern 2: high contention, data set that overflows TM capacity
     • Pattern 3: mixed pattern
       • 80% high contention, no conflict, fits in TM capacity
       • 20% medium contention, overflows TM capacity

  19. Pthreads TLE - Memcached
     • Memcached server with a varying number of threads
     • Client running on the same machine
     • 96 hardware threads: 12 cores, SMT8
     • Best TLE throughput (on 16 threads) is 26.9% higher than best locking throughput (on 12 threads)
     • On 16 threads, TLE throughput is 37.5% higher than locking throughput
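The two percentages above can be combined into a small worked calculation. Assuming both refer to the same 16-thread TLE measurement, and writing T for throughput (notation introduced here, not on the slide):

```latex
\[
T_{\mathrm{TLE}}(16) = 1.269\, T_{\mathrm{lock}}(12),
\qquad
T_{\mathrm{TLE}}(16) = 1.375\, T_{\mathrm{lock}}(16)
\]
\[
\frac{T_{\mathrm{lock}}(16)}{T_{\mathrm{lock}}(12)}
  = \frac{1.269}{1.375} \approx 0.923
\]
```

Under that assumption, locking throughput at 16 threads is roughly 7.7% below its 12-thread peak, while TLE throughput is still at its best at 16 threads.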

  20. Summary
     • POWER HTM Instruction Set
     • Suspend / Resume
     • Rollback Only Transactions
     • Low HTM overheads
     • Caution not to learn wrong lessons from specific implementations of specific HTM architectures, e.g., POWER HTM and BG/Q HTM
     Thank You
