Explore the development of transactional memory technology, the challenges faced, implementations on the Rock processor, CPS tests, and conclusions about transaction failures and retry strategies.
Early Experience with a Commercial HTM Implementation
Dave Dice, Yossi Lev, Mark Moir, and Dan Nussbaum
Background
• Why TM? Developers can no longer rely on next year's processor to run code faster.
• Applications must be able to exploit more cores.
• Traditional approach -> locks -> developer responsibility, bottlenecks, error prone.
• TM -> system responsibility, scalable.
• TM essence: ensure that multiple memory accesses are performed "atomically".
• Bounded HTM imposes unreasonable constraints: the programmer must ensure that a transaction does not access more than a fixed, architecture-specific number of cache lines.
• Best-effort HTM makes no such guarantee; it is useful when combined with software that exploits HTM but does not depend on any particular hardware transaction succeeding.
• Best effort – can commit much larger transactions than a bounded design guarantees.
• Best effort – the processor can respond to difficult events by simply aborting the transaction.
Two pre-production revisions of Rock included HTM: R1 and R2 (the latter revised after feedback on R1).
Goal: test that the HTM feature worked as expected. Challenge: R1 provided identical feedback for transaction failures that required different responses; the feedback was refined in R2.
General
• Rock is a SPARC processor that uses aggressive speculation to provide high single-thread performance.
• Example: on a load miss it runs ahead speculatively in order to issue subsequent memory requests early.
• Speculation is enabled by checkpointing the processor state, with hardware reverting to the checkpoint to re-execute when speculation fails.
• The checkpoint mechanism is reused for the HTM feature via two instructions, CHKPT and COMMIT: all code between them executes atomically.
• CHKPT specifies a pc-relative fail address to branch to if the transaction fails.
• The CPS register gives feedback about the cause of an abort. (A usage sketch follows below.)
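The CHKPT/COMMIT pattern can be pictured with a small sketch. This is not a real API: `__chkpt()`, `__commit()`, and `read_cps()` are hypothetical stand-ins for the CHKPT and COMMIT instructions and for reading the CPS register, assumed here only to show the control flow (transactional path vs. fail-address path).

```cpp
#include <cstdint>

// Hypothetical stand-ins for Rock's instructions; not a real compiler API.
// __chkpt() starts a transaction and returns true on the transactional path;
// after an abort, control resumes at the fail address, modeled here as the
// same call returning false.
extern "C" bool     __chkpt();
extern "C" void     __commit();   // commits the transaction
extern "C" uint32_t read_cps();   // reads the CPS register after a failure

bool try_increment_atomically(volatile long* counter) {
    if (__chkpt()) {                  // everything until __commit() is atomic
        *counter = *counter + 1;
        __commit();
        return true;
    }
    // Fail-address path: CPS tells us why the transaction aborted.
    uint32_t cps = read_cps();
    (void)cps;                        // the caller decides whether to retry
    return false;
}
```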
General (cont.)
• Rock has 16 cores.
• Each core can execute 2 threads – SE mode gives 32 hardware threads.
• SSE mode – 16 software threads – combines the dedicated resources of two hardware threads; in particular, the store buffers of the two hardware threads are combined into one larger store buffer.
• Rock has a deferred queue – speculative instructions that depend on loads that miss are held pending the cache fill; overflowing this queue causes transaction failure.
• SSE mode is used in all tests in the article.
CPS Register – CPS Tests and Indications
• CPS tests verify the circumstances under which transactions abort.
• A failing transaction can set multiple bits in the CPS register (see the table in the previous slide; the bit values quoted here are collected in the sketch below).
• CPS tests:
• Save-restore – a transaction fails when it executes a RESTORE instruction immediately after a SAVE instruction (this pattern is common in function calls) -> CPS = 0x8 = INST.
• TLB misses – to test DTLB misses, we re-mmap the memory accessed by the transaction before executing it.
• Load from an address with no TLB mapping -> CPS = 0x90 = LD|PREC.
• Store to an address with no TLB mapping -> CPS = 0x100 = ST.
• ITLB miss – we copy code to mmap'ed memory and try to execute it with no ITLB mapping.
• Eviction – the transaction performs a long sequence of load instructions that cannot all reside in the L1 cache, so the transaction fails -> CPS = 0x80 = LD and CPS = 0x40 = SIZ.
• CPS = 0x80 = LD -> the transaction displaced a transactionally marked cache line from L1.
• CPS = 0x40 = SIZ -> too many instructions were deferred due to cache misses.
• CPS = 0x001 = EXOG -> a context switch occurred after the failure and before the thread read CPS.
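For reference, the CPS bit values named in these tests can be gathered into a small decoder. The numeric values are the ones quoted on this slide (PREC = 0x010 is implied by the 0x90 = LD|PREC case); other CPS bits exist but are omitted, and the constant and function names here are ours, not part of any real header.

```cpp
#include <cstdint>
#include <string>

// CPS bit values as quoted in the tests above (illustrative subset).
enum CpsBits : uint32_t {
    CPS_EXOG = 0x001,  // exogenous event, e.g. a context switch before CPS was read
    CPS_COH  = 0x002,  // coherence conflict on a transactionally marked line
    CPS_INST = 0x008,  // unsupported instruction, e.g. SAVE immediately followed by RESTORE
    CPS_PREC = 0x010,  // precise exception (implied by the 0x90 = LD|PREC case)
    CPS_SIZ  = 0x040,  // too many deferred instructions / size limit exceeded
    CPS_LD   = 0x080,  // transactionally marked line displaced from L1, or load problem
    CPS_ST   = 0x100,  // store could not be performed (e.g. no TLB mapping)
};

std::string describe_cps(uint32_t cps) {
    std::string out;
    if (cps & CPS_EXOG) out += "EXOG ";
    if (cps & CPS_COH)  out += "COH ";
    if (cps & CPS_INST) out += "INST ";
    if (cps & CPS_PREC) out += "PREC ";
    if (cps & CPS_SIZ)  out += "SIZ ";
    if (cps & CPS_LD)   out += "LD ";
    if (cps & CPS_ST)   out += "ST ";
    return out.empty() ? "(none)" : out;   // e.g. 0x90 -> "PREC LD "
}
```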
CPS Register – CPS Tests and Indications (cont.)
• CPS tests (cont.):
• Cache-set test – perform loads from five different addresses that map to the same 4-way L1 cache set -> CPS = 0x80 = LD, and sometimes CPS = 0x002 = COH.
• Why 0x002 (coherence) in a read-only, single-threaded run? An eviction from the L2 cache caused by the OS idle loop displaces a transactionally marked line from the L1 cache.
• Overflow – perform stores to 33 different cache lines; we know the store queue has 32 entries -> CPS = 0x100 = ST (if there are no TLB mappings) and CPS = 0x140 = ST|SIZ (if we "warm" the TLB first by a dummy CAS – from zero to zero – to each memory location accessed by the transaction). A sketch of this test follows below.
• Coherence – perform stores to 16 different cache lines, so the transaction cannot fail due to overflow, but all threads store to the same locations, causing transactions to conflict.
• With no backoff before retrying, the success rate is low -> CPS = 0x2 = COH.
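As an illustration, the overflow test above might be coded roughly as follows. It reuses the hypothetical `__chkpt()`/`__commit()`/`read_cps()` stand-ins from the earlier sketch; the 64-byte cache-line size and the buffer layout are assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins from the earlier sketch; not a real API.
extern "C" bool     __chkpt();
extern "C" void     __commit();
extern "C" uint32_t read_cps();

constexpr size_t CACHE_LINE = 64;   // assumption: 64-byte cache lines
constexpr size_t NUM_LINES  = 33;   // one more than the 32-entry store queue
constexpr size_t STRIDE     = CACHE_LINE / sizeof(long);

alignas(CACHE_LINE) static long buf[NUM_LINES * STRIDE];

// Returns 0 if the transaction (unexpectedly) commits, otherwise the CPS value.
// Expected per the slide: ST without warming, ST|SIZ with warming.
uint32_t overflow_test(bool warm_tlb_first) {
    if (warm_tlb_first) {
        for (size_t i = 0; i < NUM_LINES; i++)
            // dummy CAS from zero to zero: touches the line and establishes
            // a TLB mapping without changing the stored value
            __sync_val_compare_and_swap(&buf[i * STRIDE], 0L, 0L);
    }
    if (__chkpt()) {
        for (size_t i = 0; i < NUM_LINES; i++)   // 33 stores overflow the store queue
            buf[i * STRIDE] = 1;
        __commit();
        return 0;
    }
    return read_cps();
}
```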
Some Conclusions (CPS Tests)
• It is still challenging in some cases to determine the reason for a transaction failure – the ST bit, for example.
• The ST bit is set either because the address for a store is unavailable due to an outstanding load miss, or because of a micro-TLB miss.
• In the first case, retrying will help.
• In the second, since the micro-TLB mapping is derived from the TLB, retrying will not help if the TLB has no mapping.
• A good strategy for ST-bit failures: retry several times, and then retry again after a TLB "warmup" (see the sketch below).
• Unreasonable CPS values – CPS sometimes indicates failure causes that could not have happened.
• UCTI bit – the transaction misspeculated by executing a mispredicted branch before the load on which the branch depends was resolved.
• UCTI bit – indicates that a branch was executed while the load it depends on was not yet resolved.
• UCTI bit – the transaction should retry: the load may then be resolved and the branch predicted correctly (the correct code is executed).
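The strategy suggested above (retry a few times, then warm the TLB and retry again, then give up to a software path) could be structured roughly like this. `try_hw_txn()`, `warm_tlb_for_txn()`, and `run_sw_fallback()` are hypothetical helpers, and the retry counts are arbitrary.

```cpp
#include <cstdint>

constexpr uint32_t CPS_ST = 0x100;        // value quoted on the earlier slide
constexpr int RETRIES_BEFORE_WARMUP = 3;  // arbitrary illustrative counts
constexpr int RETRIES_AFTER_WARMUP  = 3;

// Hypothetical: run the transaction body once; on failure, write CPS to *cps.
bool try_hw_txn(uint32_t* cps);
// Hypothetical: touch the locations the transaction will store to (e.g. with
// dummy zero-to-zero CAS operations) so their TLB mappings exist.
void warm_tlb_for_txn();
// Hypothetical: software fallback (STM transaction or lock).
void run_sw_fallback();

void run_with_st_strategy() {
    uint32_t cps = 0;
    for (int i = 0; i < RETRIES_BEFORE_WARMUP; i++)
        if (try_hw_txn(&cps)) return;     // a plain retry helps if the store
                                          // address was delayed by a load miss
    if (cps & CPS_ST)
        warm_tlb_for_txn();               // a retry alone cannot fix a missing
                                          // TLB mapping, so warm it first
    for (int i = 0; i < RETRIES_AFTER_WARMUP; i++)
        if (try_hw_txn(&cps)) return;
    run_sw_fallback();                    // still failing: give up on hardware
}
```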
Simple Static Transactions
• Implementing a counter – by CAS and by HTM (see the sketch below).
• With and without backoff (in the HTM version, backoff is triggered by the COH bit).
• Result, as expected: more threads on centralized data -> more contention -> degradation in throughput.
• Result: HTM without backoff was severe, because Rock's conflict resolution is "requester wins" – requests for transactionally marked cache lines are honored immediately.
• Implementing DCAS – described in the ATMTP article.
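A sketch of the two counter variants compared here: a CAS loop, and an HTM increment that backs off when the COH bit indicates a conflict. `hw_txn_increment()` is a hypothetical single hardware-transactional attempt (it could be built on the CHKPT/COMMIT sketch above); the backoff parameters are arbitrary.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

constexpr uint32_t CPS_COH = 0x002;

// CAS-based counter: retry the compare-and-swap until it succeeds.
void cas_increment(std::atomic<long>& counter) {
    long old = counter.load();
    while (!counter.compare_exchange_weak(old, old + 1)) {
        // `old` is reloaded with the current value by compare_exchange_weak
    }
}

// Hypothetical: one hardware-transactional attempt at ++*counter, reporting
// the CPS value on failure.
bool hw_txn_increment(long* counter, uint32_t* cps);

// HTM-based counter with backoff on coherence conflicts. Without the backoff,
// Rock's "requester wins" policy lets threads repeatedly abort each other.
void htm_increment_with_backoff(long* counter) {
    uint32_t cps = 0;
    int backoff = 1;
    while (!hw_txn_increment(counter, &cps)) {
        if (cps & CPS_COH) {
            for (int i = 0; i < backoff; i++)
                std::this_thread::yield();   // crude stand-in for a pause/delay
            if (backoff < 1024) backoff *= 2;
        }
    }
}
```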
Hash Table
• STM – SkySTM / TL2.
• HyTM – Rock HTM + SkySTM / TL2.
• PhTM – Rock HTM + SkySTM / TL2.
• One-lock – software only -> no TM.
• Phased TM (PhTM) – all transactions in the system execute in the same mode; one failing hardware transaction causes all transactions in the system to execute in software.
• Hybrid TM (HyTM) – transactions using different modes can execute concurrently.
• All decisions about retrying, backing off, or switching to STM are made by the library (transparent to the programmer). A mode-switch sketch follows below.
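To make the PhTM/HyTM distinction concrete, here is a minimal sketch of the phased idea only: one global mode flag read by every operation, flipped to software when a hardware transaction gives up. This illustrates the concept, not the actual PhTM library; `try_hw_txn()` and `run_sw_txn()` are hypothetical, and a real PhTM also switches back to hardware mode later.

```cpp
#include <atomic>

enum class Mode { Hardware, Software };
std::atomic<Mode> g_mode{Mode::Hardware};   // one mode for the whole system

// Hypothetical hooks: one hardware attempt at the body, and an STM execution.
template <class Body> bool try_hw_txn(Body& body);
template <class Body> void run_sw_txn(Body& body);

template <class Body>
void phtm_execute(Body body) {
    if (g_mode.load(std::memory_order_acquire) == Mode::Hardware) {
        for (int attempt = 0; attempt < 8; attempt++)
            if (try_hw_txn(body))
                return;                      // committed in hardware
        // This transaction cannot make progress in hardware: switch the whole
        // system to software (phased: no HW/SW mixing, unlike HyTM).
        g_mode.store(Mode::Software, std::memory_order_release);
    }
    run_sw_txn(body);
    // A real PhTM would also decide when to switch back to Hardware mode.
}
```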
Hash Table (2^17 buckets)
• 50% inserts, 50% deletes; key range: 256.
• Hash table pre-populated with about half of the keys.
• 1,000,000 operations per thread.
• 50% inserts, 50% deletes; key range: 128,000.
• Hash table pre-populated with about half of the keys.
• 1,000,000 operations per thread.
Hash Table – Conclusions
• High hardware success rates for both HyTM and PhTM -> they outperform all software-only methods.
• Both key ranges perform roughly similarly.
• In the second scenario (128,000 key range), the quantitative benefit of hardware over software-only is smaller.
• In the second scenario, the key range is large enough that the active part of the hash table does not fit in the L1 cache -> all methods suffer cache misses.
• In the second scenario, more than half of the hardware transactions are retries, even with a single thread.
• In the first scenario, only 0.02% of hardware transactions retry with a single thread, and 16% with 16 threads.
• CPS register in the first scenario (256 key range) -> mostly COH – more contention in the smaller key range.
• CPS register in the second scenario (128,000 key range) -> mostly ST – more cache misses.
• It is possible that the "fail (to SW), then retry (in HW)" path interferes with subsequent retry attempts – hard to diagnose, because adding diagnostic code changes the behavior.
Red-Black Tree
• More challenging than hash tables.
• Longer transactions – sometimes the tree is rotated to restore balance – more stores!
• More data dependencies.
• Mispredicted branches are more likely when traversing the tree.
• The tree is pre-populated with half of the specified key range.
• Each thread performs 1,000,000 operations according to the operation distribution.
• Scenario (a) – 128 keys, 100% lookups. Scenario (b) – 2048 keys, 96% lookups, 2% inserts, and 2% deletes.
• Measured: total operations completed per second.
Red-Black Tree – Results
• Scenario (a) yields excellent results, similar to the hash table results.
• In some cases TL2 (STM) outperforms PhTM – PhTM should get the benefit of both HW and SW; the challenge is deciding when to switch.
• New approach added for diagnosis:
• In PhTM, a callback function was added that is invoked when a software transaction tries to commit.
• Only the switching thread is in the software phase, so we can examine a software transaction executed by a thread that just failed a hardware transaction.
• Data collected: 1) operation name (insert, delete, get); 2) read-set size; 3) write-set size (number of cache lines/words); 4) max number of cache lines mapped to a single L1 cache set; 5) number of words in the write set mapped to the store queue; 6) stack writes. (A sketch of such a callback follows below.)
• With a much larger tree, hardware transactions fail even single-threaded with 100% lookups.
• Transactions read more locations in a deeper tree -> more chance of an L1 cache miss.
• The problem was NOT overflowing an L1 cache set (there were usually only 2 loads hitting any one set).
• The problem was NOT exceeding the store queue.
• The problem was: too many instructions were deferred due to cache misses -> increasing the number of HW retries reduces transaction failures (retries bring data into the cache, so fewer instructions need to be deferred).
• The retries prevent achieving better performance than software transactions -> no chance of doing better with PhTM here.
• CPS = ST most of the time -> together with the above, the conclusion is that failures are due to stores that encounter micro-TLB misses.
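The data-collection callback described above might look roughly like the following struct and hook. The field and function names are assumptions for illustration; the actual PhTM instrumentation is not a public API.

```cpp
#include <cstddef>

// Hypothetical record filled in when a software transaction is about to
// commit, mirroring the data items listed above.
struct TxnStats {
    const char* op;                   // "insert", "delete", or "get"
    std::size_t read_set_words;       // read-set size
    std::size_t write_set_lines;      // write-set size in cache lines
    std::size_t write_set_words;      // write-set size in words
    std::size_t max_lines_in_one_set; // max cache lines mapped to one L1 set
    std::size_t store_queue_words;    // write-set words mapped to the store queue
    std::size_t stack_writes;         // writes to the stack
};

// Hypothetical hook: registered with the PhTM library and invoked just before
// a software transaction commits, i.e. right after hardware attempts failed.
void on_sw_commit(const TxnStats& stats);
```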
TLE
• TLE uses a hardware transaction to perform a lock's critical section (without acquiring the lock) – critical sections execute in parallel if there is no conflict. (A sketch follows after this slide.)
• Scenario (a) – STL vector (C++):
• An array of elements with constant-time access to any element.
• We start with a vector of 100 elements, each representing a counter initialized to 20.
• Each thread performs 10,000 iterations, each of which accesses a random counter,
• performing an increment (20%), decrement (20%), or read (60%) on the counter.
• Decrementing a counter to 0 also deletes the counter.
• Incrementing a counter to 40 also "splits" the counter.
• 20 transactional retries before acquiring the lock.
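A minimal sketch of the TLE mechanism itself, under the same assumptions as the earlier sketches: `txn_begin()`/`txn_end()` are hypothetical stand-ins for CHKPT/COMMIT, and the lock is modeled as a simple atomic flag so the transaction can read it (putting the lock word in its read set, which forces an abort if another thread actually acquires the lock).

```cpp
#include <atomic>

constexpr int MAX_TXN_RETRIES = 20;   // the experiments use 20 retries

// Hypothetical stand-ins for CHKPT/COMMIT; not a real API.
bool txn_begin();   // true on the transactional path, false after an abort
void txn_end();     // commit

struct ElidableLock {
    std::atomic<bool> held{false};
    void acquire() { bool f = false;
                     while (!held.compare_exchange_weak(f, true)) f = false; }
    void release() { held.store(false); }
};

template <class CriticalSection>
void tle_execute(ElidableLock& lock, CriticalSection cs) {
    for (int i = 0; i < MAX_TXN_RETRIES; i++) {
        if (txn_begin()) {
            if (!lock.held.load()) {   // lock word is now in our read set: a
                cs();                  // real acquisition by another thread
                txn_end();             // will abort this transaction
                return;
            }
            txn_end();                 // lock is held: commit the empty attempt
        }                              // and fall through to retry
    }
    lock.acquire();                    // too many failures: take the lock
    cs();
    lock.release();
}
```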
TLE – Scenario (a) – STL Vector (C++) – Results
• The number of retries was increased from 4 to 20, because with many threads there are many cache misses -> transactions fail -> the lock gets used.
• Results were excellent – very scalable with TLE.
Scenario (b) – TLE in Java:
• Elide locks introduced by the "synchronized" keyword:
• TLE-aware JIT compiler.
• The JVM uses the CPS register to guide decisions on whether to retry, back off and then retry, or give up and acquire the lock.
• Hashtable is synchronized. HashMap is unsynchronized – it can be made thread-safe with a wrapper.
TLE – Scenario (b) – Hashtable (Java) – Results
• 2-6-2 indicates 20% puts, 60% gets, 20% removes.
• With 100% gets – successful – scalability with the number of threads.
• With more puts and removes -> more transactions fail -> the lock is acquired more often -> performance diminishes.
• TLE outperforms the lock in every mix.
TLE – Scenario (b) – HashMap (Java) – Results
• In some cases there was a degradation compared to Hashtable -> almost comparable to the original lock.
• When there was NO degradation – the compiler inlined the synchronized collection wrapper into the put call site -> when the JVM converted the synchronized methods to transactions, the code to be executed was in the same function.
• When there WAS degradation – the synchronized collection wrapper was inlined into the worker loop body, and the put function was then called -> this failed the transactions because the transaction now contained a function call (recall the save-restore failure mode above).
• TLE must be applied selectively!
Minimum Spanning Forest Algorithm
• An algorithm that uses transactions to build a minimum spanning forest (MSF) of a given graph.
• Using STM only -> good scalability, but too much overhead -> parallelization is not profitable.
• Algorithm overview:
• Each thread picks a starting vertex v'.
• It grows an MST from v' using Prim's algorithm.
• It maintains a heap containing all edges connecting the current MST to other nodes.
• When the MSTs of two threads meet at a vertex -> the MSTs and heaps are merged, and one thread starts over with a new vertex v''.
Minimum Spanning Forest Algorithm
• The algorithm's main transaction:
• Extract the minimum-weight edge from T's heap and examine the vertex v' it connects to:
• Case 1: v' does not belong to any MST:
• a) add v' to T's MST;
• b) remove T's heap from public space, for edge addition.
• Case 2: v' already belongs to T's MST – do nothing.
• If v' belongs to another thread's (T2's) MST:
• Case 3: if T2's heap is in public space – steal it and merge it with T's heap.
• Case 4: otherwise – move T's heap to T2's public queue, so T2 can later merge the heaps.
• The above transaction is failure-prone: 1) the heap is like the RB-tree – traversing dynamic data confounds branch prediction; 2) the Extract-Min operation is too big.
• The authors' variant: in cases 1 and 3, since T's heap is removed from public space, Extract-Min can be done OUTSIDE the transaction, right after the transaction commits (while the heap is private).
Minimum Spanning Forest Algorithm
• The variant alternative:
• Examine the minimum edge in the heap inside the transaction.
• Based on the conflict-resolution case for the vertex v':
• In cases 2 and 4 – extract it transactionally.
• In cases 1 and 3 – extract it non-transactionally.
• 7 versions of MSF were evaluated:
• msf-seq: a sequential version of the original variant, run single-threaded with no protection for atomic blocks.
• msf-{orig/opt}-sky: original/new variants of the benchmark; atomic blocks executed as software transactions using the SkySTM library.
• msf-{orig/opt}-le: original/new variants of the benchmark; atomic blocks executed using TLE (8 hardware transaction retries).
• msf-{orig/opt}-lock: original/new variants of the benchmark; atomic blocks executed using a single lock.
Minimum Spanning Forest Algorithm – Results
• Single-threaded STM is slower than the sequential (single-threaded) version.
• Fraction of transactions that ended up acquiring the lock in single-threaded runs:
• 0.04% in msf-opt-le!
• 33% in msf-orig-le.
• All STM/TLE versions scale with the number of threads.
• The transactions are still too big – the single-lock solutions do not scale.
Conclusions
• We can exploit HTM to enhance the performance of STM.
• HTM-aware compilers may be able to make code more amenable to succeeding in hardware transactions.
• We need richer feedback from future HTM features.