Hybrid Transactional Memory
Nir Shavit, MIT and Tel-Aviv University
Joint work with Alex Matveev (and describing the work of many in this summer school)
Transactional Memory
• Memory transactions are collections of reads and writes executed atomically
• Should provide disjoint access parallelism
• Should maintain internal and external consistency
  • External (serializability): with respect to the interleavings of other transactions
  • Internal (opacity): the transaction itself should operate on a consistent state
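External Consistency
Application memory: x = 0, y = 0
Transaction A: Read y; Write x = 4; Return x+y
Transaction B: Read x; Write y = 4; Return x+y
• The two transactions cannot both return 4: in any serial order, the later transaction sees the earlier write and returns 8
• Canonical synchronization problem all STM/HTM implementations must solve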
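The claim that both transactions cannot return 4 can be checked by enumerating the serial orders. The sketch below is only an illustration: the function names `txn_a`, `txn_b` and `serial_outcomes` are made up here, and memory is modeled as a plain dictionary.

```python
from itertools import permutations


def txn_a(mem):
    """Transaction A from the slide: read y, write x = 4, return x + y."""
    y = mem["y"]
    mem["x"] = 4
    return mem["x"] + y


def txn_b(mem):
    """Transaction B from the slide: read x, write y = 4, return x + y."""
    x = mem["x"]
    mem["y"] = 4
    return x + mem["y"]


def serial_outcomes():
    """Run A and B in every serial order and collect (A result, B result)."""
    outcomes = set()
    for order in permutations([("A", txn_a), ("B", txn_b)]):
        mem = {"x": 0, "y": 0}  # initial application memory
        results = {name: fn(mem) for name, fn in order}
        outcomes.add((results["A"], results["B"]))
    return outcomes
```

Every serial order gives one transaction 4 and the other 8, so an execution in which both return 4 is not serializable.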
Locking STMs
[Figure: application memory is mapped to an array of versioned write-locks (V#).]
Commit Time Locking (Write Buff)
[Figure: memory and lock array, each lock holding a version number (V#) and a lock bit; steps: Read/Write, Lock, Validate, Write, Unlock.]
• To read/write: check unlocked, add to read/write set
• Acquire locks
• Validate read/write v#'s unchanged
• Write values
• Release each lock with v#+1
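The commit-time locking steps can be sketched as a single-threaded simulation. This is a minimal sketch under stated assumptions, not the implementation of any particular STM: the names `Memory`, `Tx` and `AbortTx` are hypothetical, there is one versioned lock per location, and an abort is modeled as an exception.

```python
class AbortTx(Exception):
    """Raised when a transaction must abort and retry."""


class Memory:
    def __init__(self, values):
        self.values = dict(values)
        self.version = {addr: 0 for addr in values}     # v# per location
        self.locked = {addr: False for addr in values}  # lock bit per location


class Tx:
    def __init__(self, mem):
        self.mem = mem
        self.seen = {}    # addr -> v# observed on first access (read/write set)
        self.writes = {}  # addr -> buffered value, applied only at commit

    def _track(self, addr):
        # Check unlocked, then add to the read/write set.
        if self.mem.locked[addr]:
            raise AbortTx(addr)
        self.seen.setdefault(addr, self.mem.version[addr])

    def read(self, addr):
        self._track(addr)
        return self.writes.get(addr, self.mem.values[addr])

    def write(self, addr, value):
        self._track(addr)
        self.writes[addr] = value

    def commit(self):
        acquired = []
        try:
            for addr in sorted(self.writes):         # acquire locks
                if self.mem.locked[addr]:
                    raise AbortTx(addr)
                self.mem.locked[addr] = True
                acquired.append(addr)
            for addr, version in self.seen.items():  # validate v#'s unchanged
                locked_by_other = self.mem.locked[addr] and addr not in self.writes
                if locked_by_other or self.mem.version[addr] != version:
                    raise AbortTx(addr)
            for addr, value in self.writes.items():  # write values
                self.mem.values[addr] = value
                self.mem.version[addr] += 1          # new version: v#+1
        finally:
            for addr in acquired:                    # release locks
                self.mem.locked[addr] = False
```

Running transactions A and B from the External Consistency slide through this sketch, with both executing before either commits, lets the first committer succeed and makes the second fail validation because the version of the location it read has changed.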
Internal Inconsistency (Opacity) [GuerraouiKapalka07]
[Figure: memory holds x = 8 and y = 4 before Transaction A runs.]
Transaction A: Write x = 4
Transaction B: Read x (sees 4); Read y (sees 4); Compute z = 1/(x-y): DIV by 0 ERROR!
Transaction A: Write y = 2
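The interleaving on this slide can be replayed step by step. The sketch below assumes the initial values x = 8 and y = 4 (read off the slide's figure) and uses a hypothetical function name; it simply executes the operations in the order shown, with no synchronization at all.

```python
def zombie_interleaving():
    """Replay the slide's interleaving without any concurrency control."""
    mem = {"x": 8, "y": 4}     # assumed initial values
    mem["x"] = 4               # Transaction A: write x = 4
    x, y = mem["x"], mem["y"]  # Transaction B: read x, read y
    try:
        z = 1 / (x - y)        # B computes on an inconsistent snapshot
    except ZeroDivisionError:
        z = None               # the DIV by 0 error from the slide
    mem["y"] = 2               # Transaction A: write y = 2
    return x, y, z, mem
```

B fails even though it could never commit: the fault happens inside the transaction, before any commit-time validation gets a chance to run.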
TL2/TinySTM's Global Clock [DiceShalevShavit06/RiegelFelberFetzer06]
• Have a shared global version clock
• Incremented by writing transactions (as infrequently as possible)
• Read by all transactions
• Used to validate that the state viewed by a transaction is always consistent (opacity)
TL2 Style STM
[Figure: global version clock (VClock), memory, and lock array; steps: Read Clock, Read/Write, Lock, Inc, Validate, Write, Unlock.]
• Read Vclock
• Read/write: if unlocked and v# less than clock, add to read/write set
• Acquire locks
• Increment clock
• Validate each v# less than clock
• Write values
• Release locks with v# = new clock
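The TL2-style steps can be sketched as a single-threaded simulation. Everything named here (`TL2Memory`, `TL2Tx`, `AbortTx`) is hypothetical, and the sketch makes two simplifying assumptions: a location is accepted when its version is not newer than the clock value read at the start, and writes are only buffered until commit.

```python
class AbortTx(Exception):
    """Raised when a transaction must abort and retry."""


class TL2Memory:
    def __init__(self, values):
        self.clock = 0                                  # global version clock
        self.values = dict(values)
        self.version = {addr: 0 for addr in values}     # v# per location
        self.locked = {addr: False for addr in values}  # lock bit per location


class TL2Tx:
    def __init__(self, mem):
        self.mem = mem
        self.start = mem.clock  # read Vclock
        self.read_set = set()
        self.writes = {}        # buffered writes

    def read(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        # Accept only unlocked locations not written since the clock was read.
        if self.mem.locked[addr] or self.mem.version[addr] > self.start:
            raise AbortTx(addr)
        self.read_set.add(addr)
        return self.mem.values[addr]

    def write(self, addr, value):
        self.writes[addr] = value

    def commit(self):
        acquired = []
        try:
            for addr in sorted(self.writes):  # acquire locks
                if self.mem.locked[addr]:
                    raise AbortTx(addr)
                self.mem.locked[addr] = True
                acquired.append(addr)
            self.mem.clock += 1               # increment clock
            new_version = self.mem.clock
            for addr in self.read_set:        # validate each v# against start
                locked_by_other = self.mem.locked[addr] and addr not in self.writes
                if locked_by_other or self.mem.version[addr] > self.start:
                    raise AbortTx(addr)
            for addr, value in self.writes.items():
                self.mem.values[addr] = value          # write values
                self.mem.version[addr] = new_version   # v# = new clock
        finally:
            for addr in acquired:             # release locks
                self.mem.locked[addr] = False
```

In this sketch a reader that started before a writer committed aborts on its first read of a newer location, so it never computes on a mix of old and new values.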
TL2 Style STM
• Advantages
  • Great Disjoint Access Parallelism
• Disadvantages
  • Accessing Meta-Data is Expensive
  • Progress guarantee is only deadlock freedom
NOrec STM [DalessandroSpearScott10]
• Use shared global clock as a seqlock
• Validation in every read if a seqlock change is detected
• Value-based validation: no need for meta-data (local time stamps or locks)
NOrec STM
[Figure: timeline of a NOrec transaction. Steps: check seqlock is not odd; Read/Write (with validation if seqlock changed), recording values in the R/W set; Lock seqlock (set odd), with validation if seqlock changed; Write; Unlock seqlock (set even).]
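The NOrec steps can be sketched as a single-threaded simulation. The class names are hypothetical, and two things are simplified: where a real implementation would spin while the seqlock is odd and would lock it with a compare-and-swap, this sketch aborts on an odd seqlock and uses plain increments.

```python
class AbortTx(Exception):
    """Raised when a transaction must abort and retry."""


class NOrecMemory:
    def __init__(self, values):
        self.seqlock = 0  # even: unlocked, odd: a writer is committing
        self.values = dict(values)


class NOrecTx:
    def __init__(self, mem):
        self.mem = mem
        if mem.seqlock % 2 == 1:
            raise AbortTx("writer in progress")
        self.snapshot = mem.seqlock
        self.reads = {}   # addr -> value seen (no per-location meta-data)
        self.writes = {}  # buffered writes

    def _validate(self):
        # Value-based validation: re-read every location in the read set.
        if self.mem.seqlock % 2 == 1:
            raise AbortTx("writer in progress")
        for addr, seen in self.reads.items():
            if self.mem.values[addr] != seen:
                raise AbortTx(addr)
        self.snapshot = self.mem.seqlock

    def read(self, addr):
        if addr in self.writes:
            return self.writes[addr]
        value = self.mem.values[addr]
        if self.mem.seqlock != self.snapshot:  # validate only on a change
            self._validate()
            value = self.mem.values[addr]
        self.reads[addr] = value
        return value

    def write(self, addr, value):
        self.writes[addr] = value

    def commit(self):
        if not self.writes:
            return  # read-only transactions commit without touching seqlock
        if self.mem.seqlock != self.snapshot:
            self._validate()
        self.mem.seqlock += 1  # lock seqlock (set odd)
        for addr, value in self.writes.items():
            self.mem.values[addr] = value
        self.mem.seqlock += 1  # unlock seqlock (set even)
```

A reader whose previously read values are unchanged survives a concurrent commit, while a reader that already saw an overwritten value aborts on its next read.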
NOrec STM
• Advantages
  • No Expensive Meta-Data
• Disadvantages
  • Poor Disjoint Access Parallelism (all writes are serialized by clock)
  • Progress guarantee is only starvation freedom
Hardware TM [HerlihyMoss93, IBM/Intel13]
• Advantages
  • Everything in Hardware, No Meta Data
  • Great Disjoint Access Parallelism
• Disadvantages
  • No Progress Guarantee; transactions fail because of:
    • Unsupported instructions: system or protected instructions
    • Exceptions: page faults and similar
    • Capacity limit: too many accessed locations
Hybrid TM [Moir; Damron et al.; Kumar et al.]
• Fast-Path: execute the transaction using best-effort HTM
• If it aborts because of special instructions or because the transaction is too large, then…
• Slow-Path: execute the transaction using STM
Performance of HTM with the progress guarantee of STM
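The fast-path/slow-path structure can be sketched as a small retry wrapper. This is a simulation under stated assumptions: `HTMAbort`, its `reason` strings and `run_hybrid` are hypothetical names, the hardware transaction is just a Python callable that may raise, and retrying a few times after other kinds of abort (such as conflicts) is a policy chosen for this sketch rather than something the slide specifies.

```python
class HTMAbort(Exception):
    """Simulated hardware-transaction abort carrying a reason string."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


# Aborts that will recur on every retry, so retrying in hardware is pointless.
PERSISTENT = {"capacity", "unsupported"}


def run_hybrid(fast_path, slow_path, max_retries=3):
    """Try the best-effort HTM fast path, then fall back to the STM slow path."""
    for _ in range(max_retries):
        try:
            return "fast", fast_path()
        except HTMAbort as abort:
            if abort.reason in PERSISTENT:
                break  # special instruction or transaction too large
    return "slow", slow_path()
```

The returned tag records which path executed the transaction; the slow path is what supplies the progress guarantee when the hardware keeps aborting.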
Traditional Hybrid TM [DamronFedorovaLevLuchangcoMoirNussbaum06]
[Figure: software and hardware transactions sharing the versioned write-locks.]
• Software transaction: update locks
• Hardware transaction: test versioned write-lock in every read/write; update in write
Traditional Hybrid TM
• Advantages
  • Progress Guarantee of STM
• Disadvantages
  • HTM must access meta data
  • Fast path is actually slow because of extra load and branch on every read
Phased TM [LevMoirNussbaum07]
• Two modes: all hardware or all software
• Shared global mode indicator
• If some hardware transaction aborts, switch to software mode
• Eventually mode reverts back to hardware
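The mode indicator can be sketched as a small state machine. This is a simulation with hypothetical names (`PhasedTM`, `HTMAbort`, `software_phase_length`); in particular, reverting to hardware after a fixed number of software transactions is an assumption made only so the sketch has some revert rule, since the slide says only that the mode eventually reverts.

```python
class HTMAbort(Exception):
    """Simulated hardware-transaction abort."""


class PhasedTM:
    def __init__(self, software_phase_length=2):
        self.mode = "hardware"  # shared global mode indicator
        self.software_phase_length = software_phase_length
        self.remaining = 0      # software transactions left in this phase

    def run(self, hw_body, sw_body):
        if self.mode == "hardware":
            try:
                return "hardware", hw_body()
            except HTMAbort:
                # One hardware abort switches everyone to software mode.
                self.mode = "software"
                self.remaining = self.software_phase_length
        result = sw_body()
        self.remaining -= 1
        if self.remaining == 0:
            self.mode = "hardware"  # assumed revert policy
        return "software", result
```

After a single abort, the next transaction runs in software even if it would have succeeded in hardware, which is the cost of having one global mode.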
Phased TM
• Advantages
  • Fast-path Pure HTM: No Meta Data Accesses
• Disadvantages
  • Single Software Transaction Causes all HTM to switch to STM slow path
  • Not clear how to tune to avoid frequent mode transitions…
Hybrid NOrec (1st Attempt)
[Figure, built up over four slides.]
Software NOrec: Not odd? seqlock; Read/Write (with validation); Lock seqlock (set odd); Validate; Write; Unlock seqlock (set even)
Hardware: Read/Write (no validation); Not odd? seqlock; Write seqlock +2
• Software will fail seqlock validation!
• Hardware will fail seqlock validation!
• Guaranteed external consistency
• Problem: hardware opacity
Internal Inconsistency (Opacity) [GuerraouiKapalka07]
[Figure: memory holds x = 8 and y = 4 before Software A runs.]
Software A: Lock seqlock +1; Write x = 4
Hardware B: Read x (sees 4); Read y (sees 4); Compute z = 1/(x-y): DIV by 0 ERROR! … Odd? seqlock
Software A: Write y = 2; Unlock seqlock +1
Hybrid NOrec (2nd Attempt)
[Figure.]
Software NOrec: Not odd? seqlock; Read/Write (with validation); Lock seqlock (set odd); Validate; Write; Unlock seqlock (set even)
Hardware: Not odd? seqlock; Read/Write (no validation); Write seqlock +2
• Guarantee hardware opacity
• Hardware will detect seqlock invalidation!
Hybrid NOrec
• Advantages
  • Fast-path HTM: No Meta Data Accesses
• Disadvantages
  • Limited Disjoint Access Parallelism
  • Seqlock is in hardware tracking set throughout HTM transaction
  • Major sequential bottleneck
Possible Solutions
• Forget opacity, use sandboxing [DalessandroCarougeWhiteLevMoirScottSpear2011]
• Hybrid NOrec 2 [RiegelMarlierNowackFelberFetzer11]: use non-transactional operations in a hardware transaction to read the seqlock and validate that it has not changed after every read
But sandboxing is complex… and non-transactional ops are only available in the AMD proposal, not in actual IBM or Intel hardware…
Reduced Hardware Approach to HyTM [MatveevShavit13]
• Use short hardware transactions in the software slow-path
  • I.e. create new “mixed” software/hardware path
• Not in order to make slow-path faster
• But rather, in order to remove meta-data accesses from fast path
• Default to all software if mixed path fails
Transactional Writes Imply Hardware Opacity
[Figure: x goes from 8 to 4; y goes from 4 to 2.]
Trans A: Write x = 4; Write y = 2
Hardware B: Read x; Read y; Compute z = 1/(x-y): DIV by 0 ERROR!
If A's writes are performed in a hardware transaction, this cannot happen…
Reduced Hardware NOrec [MatveevShavit13]
• In Slow-path commit, use a small hardware transaction to:
  • Write all values
  • Check seqlock has not changed
  • Write seqlock+1
• In Fast-path:
  • Move seqlock test to end, un-instrumented read/writes
Reduced Hardware NOrec
Guarantee fast-path opacity without having seqlock in TM tracking set for long
[Figure.]
Software NOrec: Read/Write (with validation); then in HTM Trans: Write values; Changed? seqlock; seqlock +1
Hardware: Read/Write (no instrumentation); Read seqlock; Changed? seqlock; Write seqlock +1
• Hardware will detect write conflict without seqlock!
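The slow-path commit with a short hardware transaction can be sketched as follows. This is a simulation, not the authors' implementation: the hardware transaction is modeled by staging writes in a copy and publishing them only if the seqlock check passes, all names are hypothetical, and the value-based revalidation between attempts is borrowed from the NOrec slides as an assumption. The final fallback to an all-software path is not modeled.

```python
class HTMAbort(Exception):
    """Simulated abort of the short hardware transaction."""


class Memory:
    def __init__(self, values):
        self.seqlock = 0
        self.values = dict(values)


def short_hw_commit(mem, snapshot, writes):
    """Simulated short hardware transaction: all of it happens or none of it."""
    staged = dict(mem.values)
    staged.update(writes)        # write all values (speculatively)
    if mem.seqlock != snapshot:  # check seqlock has not changed
        raise HTMAbort("seqlock changed")  # staged writes are discarded
    mem.values = staged          # publish writes and seqlock together
    mem.seqlock = snapshot + 1   # write seqlock+1


def slow_path_commit(mem, snapshot, reads, writes, max_attempts=3):
    """Repeatedly attempt the short hardware transaction in commit."""
    for _ in range(max_attempts):
        try:
            short_hw_commit(mem, snapshot, writes)
            return True
        except HTMAbort:
            # Assumed NOrec-style value-based revalidation before retrying.
            if any(mem.values[addr] != seen for addr, seen in reads.items()):
                return False     # read set is stale: abort the transaction
            snapshot = mem.seqlock
    return False                 # all-software fallback not modeled here
```

The hardware transaction covers only the write set and the seqlock, so its footprint stays small even when the software transaction read many locations.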
Reduced Hardware NOrec
• Properties
  • Fast-path: No Meta Data; No instrumentation of reads or writes
  • Slow-path:
    • short hardware transaction: size of write set
    • can repeatedly attempt short hardware transaction in commit
Reduced Hardware NOrec
• Advantages
  • Hardware Disjoint Access Parallelism
  • seqlock accessed only at end of HTM transaction
  • Surprise: 1st HyTM that is Obstruction-free and Privatizing
• Disadvantages
  • Still window of possible abort due to seqlock increment
Reduced Hardware TL2 Style
[Figure, built up over two slides.]
Software TL2 style: Read Clock; Read/Write (validate); Validate; then in HTM Trans: Write values
Hardware: Read/Write (no validation); Read Clock; Write values with Clock +1
• Hardware will see software: hardware will detect write conflict
• Problem: if between validate and hardware write, can have inconsistency
• Solution: combine validation and writes in a single transaction (in HTM Trans: Validate and Write values)
Reduced Hardware TL2 Style
• Advantages
  • Complete Disjoint Access Parallelism
  • GV6 clock incremented on aborts only
  • Obstruction-free
• Disadvantages
  • No privatization
  • Mixed path transaction size of meta-data set
HyTM: Long Journey
• Combination of ideas:
  • hardware transactions,
  • global clocks,
  • no meta data access,
  • mixed hardware software paths
• And there is still room for improvement