
Early Experience with Commercial HTM Implementation: Challenges and Solutions

An overview of early experience with a commercial hardware transactional memory (HTM) implementation: the motivation for transactional memory, the Rock processor's HTM feature, CPS-register tests, and conclusions about why transactions fail and how to respond.


Presentation Transcript


  1. Early Experience with a Commercial HTM Implementation Dave Dice, Yossi Lev, Mark Moir, and Dan Nussbaum

  2. Background • Why TM? Developers can no longer rely on next year’s processor to run their code faster. • Applications must be able to exploit more cores. • Traditional approach -> locks -> the developer’s responsibility, a bottleneck, error prone. • TM -> the system’s responsibility, scalable. • TM essence – ensure that multiple memory accesses are performed “atomically”.

  3. Bounded HTM – imposes unreasonable constraints – it must be ensured that a transaction does not access more than a fixed, architecture-specific number of cache lines. • HTM can still be useful by combining it with software that exploits hardware transactions but does not depend on any particular one succeeding -> BEST EFFORT. • Best effort – can commit much larger transactions. • Best effort – the processor can respond to difficult events by aborting.
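
The best-effort contract implies a standard usage pattern: attempt the operation as a hardware transaction a bounded number of times, then fall back to a software path that never depends on hardware success. Below is a minimal C++ sketch of that pattern; htm_begin/htm_commit are hypothetical stand-ins (Rock's real interface, the chkpt/commit instruction pair, appears on slide 7), and a global lock stands in for the software fallback.

```cpp
#include <mutex>

// Hypothetical best-effort HTM primitives (stand-ins for Rock's
// chkpt/commit instructions, not a real API): htm_begin() returns true
// if we are now inside a hardware transaction, false if a transaction
// aborted and control resumed at the fail point.
bool htm_begin();
void htm_commit();

std::mutex fallback_lock;

// Best-effort use of HTM: try the operation as a hardware transaction a
// few times; if it keeps failing, fall back to a software path that does
// not depend on any hardware transaction succeeding.
template <typename CriticalSection>
void run_atomically(CriticalSection cs, int max_retries = 8) {
    for (int i = 0; i < max_retries; ++i) {
        if (htm_begin()) {      // transaction started
            cs();               // body executes atomically
            htm_commit();       // commit makes all stores visible at once
            return;
        }
        // aborted: retry (real code would consult the CPS register here)
    }
    std::lock_guard<std::mutex> g(fallback_lock);  // software fallback
    cs();
}
```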

  4. Rock: two pre-production revisions that included HTM – R1, and R2 (after feedback on R1). Exceptions.

  5. Goal: test that the HTM feature worked as expected. • Challenge – R1 provided identical feedback about transaction failures that required different responses -> the feedback was refined in R2.

  6. [Outline slide – Transactional Hash Table]

  7. General • Rock is a SPARC processor that uses aggressive speculation to provide high single-thread performance. • Example – on a load miss, it runs ahead speculatively in order to issue subsequent memory requests early. • Speculation is enabled by checkpoints of the processor state, with hardware reverting to the checkpoint to re-execute. • The checkpoint mechanism is reused for the HTM feature – two instructions, CHKPT and COMMIT – all code between them executes atomically. • CHKPT takes a pc-relative fail address to branch to in case of transaction failure. • The CPS register gives feedback about the cause of an abort.
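
A rough C++ model of the control flow just described, under the same hypothetical primitives as before (this is not Rock's actual programming interface): chkpt() behaves a bit like setjmp, in that control can resume at the fail point, modeled here as a false return.

```cpp
#include <cstdint>

// Hypothetical primitives modeling Rock's interface (not a real API):
// chkpt() returns true when a transaction has just begun, and false when
// control has resumed at the pc-relative fail address after an abort.
bool     chkpt();
void     commit();
uint32_t read_cps();   // read the CPS register: why did we abort?

int shared_counter = 0;

// Returns 0 on commit, or the CPS value explaining the abort.
uint32_t increment_transactionally() {
    if (chkpt()) {              // chkpt <fail-address>
        ++shared_counter;       // everything here executes atomically
        commit();               // commit: stores become visible at once
        return 0;
    }
    return read_cps();          // fail address: consult CPS for the cause
}
```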

  8. General cont’ • Rock has 16 cores. • Each core is capable of executing 2 threads – 32 threads in total. • SST mode – 16 software threads – combines the dedicated resources of two hardware threads; in particular, the store buffers of the two hardware threads are combined into one larger buffer. • Rock has a deferred queue – speculative instructions that depend on loads that miss are held there pending the cache fill – overflowing it causes transaction failure. • SST MODE IS USED IN ALL TESTS IN THE ARTICLE.

  9. CPS Register – CPS Tests and Indications • CPS tests – verify the circumstances under which transactions abort. • CPS register – a failing transaction can set multiple bits (see table on the previous slide). • CPS Tests: • Save-restore – a transaction fails when it executes a RESTORE instruction immediately after a SAVE instruction (this pattern is common in function calls) -> CPS = 0x8 = INST. • TLB misses – to test DTLB misses, we re-mmap the memory accessed by the transaction before executing it. • Load from an address with no TLB mapping -> CPS = 0x90 = LD|PREC. • Store to an address with no TLB mapping -> CPS = 0x100 = ST. • ITLB miss – we copy code to mmap’ed memory and try to execute it with no ITLB mapping. • Eviction – in a transaction we perform a long sequence of load instructions that cannot all reside in the L1 cache – the transaction fails -> CPS = 0x80 = LD and CPS = 0x40 = SIZ. • CPS = 0x80 = LD -> the transaction had a transactionally marked cache line removed from the L1. • CPS = 0x40 = SIZ -> too many instructions were deferred due to cache misses. • CPS = 0x001 = EXOG -> a context switch occurred after the failure and before CPS was read by the thread.
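
The bit values named on this slide and the next can be collected into constants plus a decoder, which is handy when logging aborts. The values below are exactly the ones the tests report (0x8 = INST, 0x90 = LD|PREC, 0x100 = ST, 0x80 = LD, 0x40 = SIZ, 0x001 = EXOG, 0x002 = COH); the CPS register has further bits (e.g. UCTI) that are not listed with values here, so they are omitted.

```cpp
#include <cstdint>
#include <string>

// CPS (Checkpoint Status) bits, as observed by the tests on these slides.
enum CpsBit : uint32_t {
    CPS_EXOG = 0x001,  // exogenous event, e.g. context switch before CPS was read
    CPS_COH  = 0x002,  // coherence: another agent touched a marked line
    CPS_INST = 0x008,  // unsupported instruction (e.g. the save/restore pattern)
    CPS_PREC = 0x010,  // precise exception, e.g. load with no TLB mapping
    CPS_SIZ  = 0x040,  // too many deferred instructions / size limit
    CPS_LD   = 0x080,  // transactionally marked line evicted from L1
    CPS_ST   = 0x100,  // store problem, e.g. micro-TLB miss on a store
};

// Render a CPS value in the slides' "LD|SIZ"-style notation.
std::string cps_to_string(uint32_t cps) {
    struct { uint32_t bit; const char* name; } table[] = {
        {CPS_EXOG, "EXOG"}, {CPS_COH, "COH"}, {CPS_INST, "INST"},
        {CPS_PREC, "PREC"}, {CPS_SIZ, "SIZ"}, {CPS_LD, "LD"}, {CPS_ST, "ST"},
    };
    std::string out;
    for (auto& e : table)
        if (cps & e.bit) { if (!out.empty()) out += "|"; out += e.name; }
    return out.empty() ? "0" : out;
}
```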

  10. CPS Register – CPS Tests and Indications – cont’ • CPS Tests cont’: • Cache set test – perform loads to five different addresses that map to the same 4-way L1 cache set -> CPS = 0x80 = LD, and sometimes CPS = 0x002 = COH. • Why 0x002 (coherence), if we run read-only and single-threaded? – the OS idle loop evicts from the L2 cache a transactionally marked line of the L1 cache. • Overflow – perform stores to 33 different cache lines – we know the store queue has 32 entries -> CPS = 0x100 = ST (if there is no TLB mapping) and CPS = 0x140 = ST|SIZ (if we “warm” the TLB first, by a dummy CAS – from zero to zero – to each memory location accessed by the transaction). • Coherence – perform stores to 16 different cache lines – the transaction won’t fail due to overflow – but all threads store to the same locations, causing the transactions to conflict. • With no back-off before retrying, the success rate is low -> CPS = 0x2 = COH.
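
A sketch of the overflow test just described, under the same hypothetical chkpt/commit/read_cps primitives as above; the 64-byte line size is an assumption for illustration.

```cpp
#include <cstdint>

bool     chkpt();      // assumed primitives, as before
void     commit();
uint32_t read_cps();

constexpr int CACHE_LINE = 64;   // assumed line size
constexpr int N_LINES    = 33;   // one more than the 32-entry store queue

alignas(CACHE_LINE) static volatile uint8_t region[N_LINES * CACHE_LINE];

// Storing to 33 distinct cache lines inside one transaction exceeds the
// 32-entry store queue, so the transaction fails: expect CPS = ST (0x100)
// with cold TLB mappings, or ST|SIZ (0x140) after warming the TLB.
uint32_t store_overflow_test() {
    if (chkpt()) {
        for (int i = 0; i < N_LINES; ++i)
            region[i * CACHE_LINE] = 1;   // one store per cache line
        commit();
        return 0;                         // not expected to be reached
    }
    return read_cps();                    // report the failure cause
}
```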

  11. Some Conclusions (CPS Tests): • It is still challenging in some cases to determine the reason for a transaction failure – the ST bit, for example. • The ST bit is set either because the address for a store is unavailable due to a heavy load miss, or because of a micro-TLB miss. • In the first case – retrying will help. • In the second – since micro-TLB mappings are derived from the TLB, if the TLB is “empty”, retrying won’t help. • A good transaction strategy for ST-bit failures – retry several times, and then retry again after a TLB “warmup”. • Unreasonable CPS values – CPS sometimes indicates failure causes that could not have happened. • UCTI bit – the transaction misspeculates by executing a branch that was mispredicted, before the load on which the branch depends is resolved. • UCTI bit – indicates that a branch was executed while the load it depends on was not yet resolved. • UCTI bit – the transaction should retry – the load may then be resolved and the branch correct (the correct code is executed).
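
The ST-bit strategy above can be expressed as a small retry policy. This is a sketch under assumptions: body() stands for whatever routine attempts the hardware transaction once, and the dummy CAS from zero to zero (as in the TLB tests on the previous slide) serves as the warmup. A real warmup would touch every page the transaction accesses, not just one target.

```cpp
#include <atomic>
#include <cstdint>

// Warm the TLB (and hence the micro-TLB) for a location without
// logically modifying it: a dummy CAS from 0 to 0 performs a store-class
// access to the page, installing the mapping.
void warm_tlb(std::atomic<uint64_t>* p) {
    uint64_t expected = 0;
    p->compare_exchange_strong(expected, 0);
}

// The ST bit is ambiguous: a transient store-address delay (retry helps)
// or a micro-TLB miss (plain retry never helps, since the micro-TLB is
// filled from the TLB). Strategy from the slide: retry a few times
// plain, then warm the TLB and retry again.
bool run_with_st_strategy(bool (*body)(), std::atomic<uint64_t>* target,
                          int plain_retries = 4, int warmed_retries = 4) {
    for (int i = 0; i < plain_retries; ++i)
        if (body()) return true;          // body attempts one HW transaction
    warm_tlb(target);                     // install the missing mapping
    for (int i = 0; i < warmed_retries; ++i)
        if (body()) return true;
    return false;                         // give up; caller falls back
}
```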

  12. [Outline slide – Transactional Hash Table]

  13. Simple Static Transactions • Implementing a counter – by CAS and by HTM. • With/without back-off (in HTM, back-off is triggered by the COH bit). • Result -> as expected -> more threads (centralized data) -> more contention -> degradation in throughput. • Result -> HTM without back-off was severe -> Rock’s conflict resolution is “requester wins” – requests for transactionally marked cache lines are honored immediately. • Implementing DCAS – described in the ATMTP article.
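
For concreteness, here is a runnable C++ sketch of the CAS-based counter with exponential back-off; the delay values and cap are arbitrary illustrative choices, not the paper's policy.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<long> counter{0};

// CAS-based increment with exponential back-off. The same idea applies
// to the HTM version: under Rock's "requester wins" conflict resolution,
// retrying without back-off leads to severe throughput degradation.
void increment_with_backoff() {
    int delay = 1;
    long old = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(old, old + 1)) {
        // CAS failed: another thread updated the counter; back off, retry
        std::this_thread::sleep_for(std::chrono::nanoseconds(delay));
        if (delay < 1024) delay *= 2;     // cap the back-off
    }
}
```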

  14. Hash Table • STM – SkySTM/TL2. • HyTM – Rock HTM + SkySTM/TL2. • PhTM – Rock HTM + SkySTM/TL2. • One-lock – software only -> no TM. • Phased TM (PhTM) – all transactions in the system execute in the same mode – one failing hardware transaction causes all transactions in the system to execute in software. • Hybrid TM (HyTM) – transactions using different modes can execute concurrently. • All decisions about retrying, backing off, or switching to STM are made by the library (transparent to the programmer).
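
The PhTM/HyTM distinction can be made concrete with a sketch of the global phase flag that characterizes PhTM. This is illustrative only: the real library also switches back to hardware and coordinates in-flight transactions, and try_hw_txn/run_sw_txn are assumed placeholders, not the library's API.

```cpp
#include <atomic>

// Sketch of the PhTM idea: one global mode. In HARDWARE phase all
// transactions run on HTM; a persistently failing hardware transaction
// flips the whole system into SOFTWARE phase (one-way here, for brevity).
enum class Phase { HARDWARE, SOFTWARE };
std::atomic<Phase> phase{Phase::HARDWARE};

bool try_hw_txn();   // assumed: run the txn in HW, true on commit
void run_sw_txn();   // assumed: run the txn under the STM (SkySTM/TL2)

void run_phtm_txn(int max_hw_retries = 8) {
    if (phase.load() == Phase::HARDWARE) {
        for (int i = 0; i < max_hw_retries; ++i)
            if (try_hw_txn()) return;
        phase.store(Phase::SOFTWARE);   // one failing HW txn moves everyone to SW
    }
    run_sw_txn();                       // software phase: STM for all threads
}
```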

  15. Hash Table (2^17 buckets) • 50% inserts, 50% deletes; key range – 256. • Hash table pre-populated with about half of the keys. • 1,000,000 operations per thread.

  16. 50% inserts, 50% deletes; key range – 128,000. • Hash table pre-populated with about half of the keys. • 1,000,000 operations per thread.

  17. Hash Table – Conclusions • High hardware success rates for both HyTM and PhTM -> they outperform all software-only methods. • Both key ranges perform roughly similarly. • In the second scenario (128,000 key range) – the quantitative benefit of hardware over software-only is smaller. • In the second scenario – the key range is large enough that the active part of the hash table does not fit in the L1 cache -> all methods suffer cache misses. • In the second scenario – more than half of the hardware transactions are retries – even with a single thread. • In the first scenario – only 0.02% of hardware transactions retry with a single thread, and 16% with 16 threads. • CPS register – in the first scenario (256 key range) -> COH – more contention in the smaller key range. • CPS register – in the second scenario (128,000 key range) -> ST – more cache misses. • It is possible that the “fail (to SW) – retry (in HW)” path interferes with subsequent retry attempts – this is hard to diagnose, because adding diagnostic code changes the behavior.

  18. Red-Black Tree • More challenging than hash tables. • Longer transactions – sometimes the tree rotates to rebalance – more stores! • More data dependencies. • Mispredicted branches are more likely when traversing the tree. • The tree is pre-populated to about half of the specified key range. • Each thread performs 1,000,000 operations according to the operation distribution. • Scenario (a) – 128 keys and 100% lookups. Scenario (b) – 2048 keys, 96% lookups, 2% inserts, and 2% deletes. • Measured: total operations completed per second.

  19. Red Black Tree – key range [0,128]

  20. Red Black Tree – key range [0,2048]

  21. Red-Black Tree – Results • Scenario (a) yields excellent results – similar to the hash table results. • In some cases TL2 (STM) outperforms PhTM – PhTM should get the benefit of both HW and SW – the challenge is in deciding when to switch. • A NEW APPROACH was added: • In PhTM, a callback function was added that is invoked when a software transaction tries to commit. • Only the switching thread will be in the software phase – giving the ability to examine a software transaction executed by a thread that just failed a hardware transaction. • Data collected: 1) operation name (insert, delete, get); 2) read set size; 3) write set size (number of cache lines/words); 4) max number of cache lines mapped to a single L1 cache set; 5) number of words in the write set mapped to the store queue; 6) stack writes. • With a much larger tree – even single-threaded and with 100% lookups -> hardware transactions fail. • Transactions read more locations in a deeper tree -> more chance of an L1 cache miss. • The problem was NOT overflowing an L1 cache set (usually at most 2 loads hit any one set). • The problem was NOT exceeding the store queue. • The problem was: too many instructions were deferred due to cache misses -> increasing the number of HW retries reduces transaction failures (retrying brings data into the cache -> no need to defer). • The retries prevent achieving better performance than software transactions -> no chance to do better with PhTM. • CPS = ST most of the time -> together with the above, the conclusion is that failures are due to stores encountering micro-TLB misses.
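
A sketch of what such a commit-time diagnostic callback might look like; the struct fields mirror the six data items listed above, but all names and the reporting logic are illustrative assumptions, not the paper's code.

```cpp
#include <cstddef>
#include <cstdio>

// Statistics about a software transaction that a just-failed hardware
// attempt could not complete, gathered when the SW transaction commits.
struct TxnStats {
    const char* op_name;          // insert / delete / get
    size_t read_set_lines;        // read set size, in cache lines
    size_t write_set_words;       // write set size, in words
    size_t max_lines_one_l1_set;  // max lines mapping to a single L1 set
    size_t store_queue_words;     // write-set words occupying the store queue
    size_t stack_writes;          // writes to the stack
};

// Invoked by the (hypothetical) PhTM commit hook for the switching thread.
void on_sw_commit(const TxnStats& s) {
    // Flag transactions that could never have committed in hardware.
    if (s.store_queue_words > 32)
        std::printf("%s: write set exceeds the 32-entry store queue\n",
                    s.op_name);
}
```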

  22. TLE • TLE – use a hardware transaction to perform a lock’s critical section (without acquiring the lock) – CRITICAL SECTIONS EXECUTE IN PARALLEL (if there is no conflict). • Scenario (a) – STL vector (C++): • an array of elements; • constant-time access to any element. • We start with a vector of 100 elements, each representing a counter initialized to 20. • Each thread performs 10,000 iterations, each of which accesses a random counter, • performing an increment (20%), decrement (20%), or read (60%) on the counter. • Decrementing a counter to 0 also deletes the counter. • Incrementing a counter to 40 also “splits” the counter. • 20 transactional retries before acquiring the lock.
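
A minimal C++ sketch of the TLE pattern, again assuming the hypothetical chkpt/commit primitives, with a simple test-and-set lock standing in for the real one. Reading the lock word inside the transaction is the key trick: a thread that actually acquires the lock then conflicts with, and aborts, every elided execution.

```cpp
#include <atomic>

bool chkpt(); void commit();          // assumed HTM primitives, as before

std::atomic<bool> lock_held{false};   // simple test-and-set lock

void acquire(std::atomic<bool>& l) { while (l.exchange(true)) { /* spin */ } }
void release(std::atomic<bool>& l) { l.store(false); }

// Transactional Lock Elision: run the critical section as a hardware
// transaction without taking the lock; after max_retries failures, fall
// back to acquiring the lock for real.
template <typename CS>
void tle_execute(CS critical_section, int max_retries = 20) {
    for (int i = 0; i < max_retries; ++i) {
        if (chkpt()) {
            if (!lock_held.load()) {  // lock free: safe to run elided
                critical_section();
                commit();
                return;
            }
            commit();                 // lock was held: end empty txn, retry
        }
    }
    acquire(lock_held);               // fallback: serialize under the lock
    critical_section();
    release(lock_held);
}
```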

  23. TLE – Scenario (a) – STL Vector (C++)

  24. TLE – Scenario (a) – STL Vector (C++) – Results: • The number of retries was increased from 4 to 20 -> because with many threads there are a lot of cache misses -> transactions fail -> the lock was used. • Results were excellent – very scalable with TLE. • Scenario (b) – TLE in Java: • Elide locks introduced by the “synchronized” keyword: • a TLE-aware JIT compiler. • The JVM makes use of the CPS register to guide decisions about whether to retry, back off and then retry, or give up and acquire the lock. • Hashtable – synchronized. HashMap – unsynchronized – can be made thread-safe with a wrapper.

  25. TLE – Scenario (b) – Hashtable (JAVA)

  26. TLE – Scenario (b) – Hashtable (Java) – Results • 2-6-2 indicates 20% puts, 60% gets, 20% removes. • With 100% gets – successful – scalability with the number of threads. • With more puts and removes -> more transactions fail -> the lock is acquired more often -> performance diminishes. • TLE outperforms the lock in every distribution. TLE – Scenario (b) – HashMap (Java) – Results • In some cases there was a degradation compared to Hashtable -> almost comparable to the original lock. • When there was NO degradation – the compiler inlined the synchronized-collection wrapper into the put function call -> when the JVM converted the synchronized methods to transactions, the code to be executed was in the same function. • When there WAS degradation – the synchronized-collection wrapper was inlined into the worker loop body, and the put function was then called -> this failed the transactions, because now the transaction contained a function call. • TLE MUST BE APPLIED SELECTIVELY!

  27. Minimum Spanning Forest Algorithm • An algorithm that uses transactions to build an MSF from a given graph. • Using STM only -> good scalability, but too much overhead -> parallelization is not profitable. • Algorithm overview – general: • Each thread picks a starting vertex v’. • It grows an MST from v’ using Prim’s algorithm. • It maintains a HEAP – containing all edges connecting the current MST with other nodes. • When the MSTs of two threads meet at a vertex -> the MSTs and heaps are merged – one thread starts over with a new vertex v’’.

  28. Minimum Spanning Forest Algorithm • The algorithm’s main transaction: • Extract the minimum-weight edge from thread T’s heap and examine the vertex v’ it connects to: • Case 1: v’ does not belong to any MST: • a) add v’ to T’s MST; • b) remove T’s heap from the public space, for edge additions. • Case 2: if v’ belongs to T’s MST, do nothing. • If v’ belongs to another thread’s (T2’s) MST: • Case 3: if T2’s heap is in the public space – steal it and merge it with T’s heap. • Case 4: otherwise – move T’s heap to T2’s public queue, so that T2 can later merge the heaps. • The above transaction is failure-prone: 1) the heap is like the RB-tree – traversing dynamic data confounds branch prediction; 2) the extract-min operation is too big. • The authors’ variant algorithm – in cases 1 and 3, since T’s heap is removed from the public space, extract-min can be done OUTSIDE the transaction – right after the transaction commits (when the heap is private).
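
An illustrative-only sketch of the variant's key idea: keep the expensive extract-min out of the transaction whenever the heap becomes private (cases 1 and 3). Edge, the case analysis, and the txn_* primitives are placeholders, not the paper's actual code.

```cpp
#include <queue>
#include <vector>

struct Edge { int weight = 0; int to = 0; };
struct EdgeCmp {
    bool operator()(const Edge& a, const Edge& b) const { return a.weight > b.weight; }
};
using Heap = std::priority_queue<Edge, std::vector<Edge>, EdgeCmp>;

bool txn_begin();              // assumed: true once a transaction is running
void txn_commit();             // assumed: commit it
bool resolve_case(int vertex); // assumed helper: handles cases 1-4 for the
                               // vertex; returns true iff the thread's heap
                               // ends up private (cases 1 and 3)

void msf_step(Heap& heap) {
    if (heap.empty()) return;
    while (!txn_begin()) { /* retry policy elided */ }
    const Edge min_edge = heap.top();        // examine (don't extract) min edge
    bool heap_private = resolve_case(min_edge.to);
    if (!heap_private) heap.pop();           // cases 2,4: extract transactionally
    txn_commit();
    if (heap_private) heap.pop();            // cases 1,3: extract after commit,
                                             // outside the transaction
}
```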

  29. Minimum Spanning Forest Algorithm • The variant alternative: • Examine the minimum edge in the heap inside the transaction. • Based on the ownership resolution of the vertex v’: • In cases 2 and 4 – extract it transactionally. • In cases 1 and 3 – extract it non-transactionally. • 7 versions of MSF were evaluated: • msf-seq: a sequential version of the original variant, run single-threaded, with no protection for atomic blocks. • msf-{orig/opt}-sky: original/new variants of the benchmark; atomic blocks executed as software transactions, using the SkySTM library. • msf-{orig/opt}-le: original/new variants of the benchmark; atomic blocks executed using TLE (8 HW transaction retries). • msf-{orig/opt}-lock: original/new variants of the benchmark; atomic blocks executed using a single lock.

  30. Minimum Spanning Forest Algorithm – Results

  31. Minimum Spanning Forest Algorithm – Results • Single-threaded STM is slower than the sequential (single-threaded) version. • Fraction of transactions that ended up acquiring the lock in the single-threaded runs: • 0.04% in msf-opt-le! • 33% in msf-orig-le. • All STM/TLE versions are scalable with the number of threads. • The transactions are still too big – and the single-lock solution is not scalable.

  32. Conclusions • We can exploit HTM to enhance the performance of STM. • HTM-aware compilers may be able to make code more amenable to succeeding in hardware transactions. • We need richer feedback from future HTM features.
