EazyHTM : Eager-Lazy Hardware Transactional Memory

Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris, Mateo Valero. Barcelona Supercomputing Center, UPC; BITS Pilani; Microsoft Research Cambridge.

Presentation Transcript


  1. EazyHTM: Eager-Lazy Hardware Transactional Memory • Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris, Mateo Valero • Barcelona Supercomputing Center, UPC; BITS Pilani; Microsoft Research Cambridge

  2. Why Transactional Memory? • Lock-based parallel programming has problems • Deadlocks, races, complexity, performance, … • Transactional Memory (TM) to the rescue • Optimistic concurrency control mechanism • Easy to use • Deadlock free • Supports composability • Protects data in critical sections • Hardware-TM (HTM), Software-TM (STM) and hybrid

  3. HTM terminology • Atomic section/transaction: a group of instructions that appears to take effect instantaneously • Version management (where speculative values are stored): • in place, logging the original value, or • buffered in private storage and published on commit • Conflict: one TX writes a line that another TX reads or writes • Detection: the action of checking for conflicts • Resolution: the action taken to resolve a conflict • Can be an abort, a stall of the execution, …
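The two version-management options on this slide can be illustrated with a small software model. This is only a sketch: the class names `UndoLogTx` and `WriteBufferTx` and the dictionary standing in for memory are assumptions made for illustration, not part of EazyHTM or of any HTM named in the slides.

```python
_MISSING = object()  # marks "address had no value before the write"

class UndoLogTx:
    """In-place versioning: write memory directly, log the original value."""
    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, original value) in program order

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory.get(addr, _MISSING)))
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()  # new values are already in place

    def abort(self):
        # Restore originals in reverse order so the oldest value wins.
        for addr, old in reversed(self.undo_log):
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo_log.clear()

class WriteBufferTx:
    """Buffered versioning: keep writes private, publish them on commit."""
    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}

    def read(self, addr):
        return self.buffer.get(addr, self.memory.get(addr))

    def write(self, addr, value):
        self.buffer[addr] = value

    def commit(self):
        self.memory.update(self.buffer)  # publish all buffered writes
        self.buffer.clear()

    def abort(self):
        self.buffer.clear()  # nothing reached memory
```

In this model the undo log makes commit trivial and abort costly, while the write buffer does the opposite, which matches the fast-commit/slow-abort and fast-abort/complex-commit trade-off described for eager and lazy HTMs.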

  4. Eager HTM • A.k.a. pessimistic • Writes in place; detects and resolves conflicts on every access • LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07] • [Figure: timelines of TX 1, TX 2 and TX 3 with accesses W, R, R; labels: fast commit, slow abort, stall, limited concurrency]

  5. Lazy HTM • A.k.a. optimistic • Writes buffered; detects and resolves conflicts on commit • TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07] • [Figure: timelines of TX 1, TX 2 and TX 3 with accesses W, R, R; labels: fast abort, good concurrency, complex commit (validate + write)]

  6. The Motivation: Splitting conflict management • Eager-Lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]): • Software begin, commit and abort • Probabilistic (signature based) conflict detection • EazyHTM is the first pure-hardware eager-lazy TM • Design space (conflict detection / conflict resolution): • Eager / eager: LogTM • Eager / lazy: EazyHTM (good concurrency, fast commit) • Lazy / eager: impossible • Lazy / lazy: TCC, S-TCC

  7. Outline • Motivation • Contributions • Hardware changes • The Protocol • Evaluation • Conclusions

  8. EazyHTM Contributions • The best of two worlds • Eager conflict detection: simple commit, exact list of conflicts known in advance • Lazy conflict resolution: good concurrency • Parallel commits of non-conflicting TXs • Designed for CMPs (Chip-Multiprocessors) • Exploits core proximity • MESI/MOESI protocol upgrade (easier verification)

  9. Hardware changes • Directory (existing directory logic, shared by the cores): • TD, 1 bit per line (read-only optimization bit; details in the paper) • Private cache(s) (existing cache logic), holding the read/write set: • SR, 1 bit per line • SM, 1 bit per line • CPU: • Register file checkpoint • Racers list, 1 bit per core • Killers list, 1 bit per core • The lists track conflicts as bit-vectors (32 bits for 32 cores)

  10. Racers and killers list • If line is shared between two TXs: • Read-Read • No conflict • Write-Read, Read-Write, Write-Write • Writer adds reader TX into “racers” list • “TXs that I have to abort” list, if I commit first • Reader adds writer TX into “killers” list • “TXs that can abort me” list, if they commit first • We illustrate only the Write-after-Read (WAR) conflict
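The list-update rule on this slide can be sketched as follows. The bit-vector encoding follows the "1 bit per core" description of the hardware, but the names `TxLists` and `record_shared_line` are hypothetical, and treating a Write-Write conflict symmetrically (each writer lists the other as both racer and killer) is an assumption the slides do not spell out.

```python
from dataclasses import dataclass

@dataclass
class TxLists:
    core: int         # core id, also the bit position in other cores' lists
    racers: int = 0   # bit i set: abort the TX on core i if I commit first
    killers: int = 0  # bit i set: the TX on core i can abort me if it commits first

def record_shared_line(a, a_op, b, b_op):
    """Update both TXs' lists for one shared line; ops are "R" or "W".

    Returns True when the sharing is a conflict, False for Read-Read.
    """
    if a_op == "R" and b_op == "R":
        return False
    for writer, op, other in ((a, a_op, b), (b, b_op, a)):
        if op == "W":
            writer.racers |= 1 << other.core   # writer must abort the other TX
            other.killers |= 1 << writer.core  # other TX can be aborted by writer
    return True

# Write-after-Read example from the slides: TX 0 read line A, TX 2 writes it.
tx0, tx2 = TxLists(core=0), TxLists(core=2)
conflict = record_shared_line(tx0, "R", tx2, "W")
```

After the call, bit 0 is set in TX 2's racers list and bit 2 is set in TX 0's killers list; neither transaction has been aborted yet, since resolution is deferred to commit time.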

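  11. EazyHTM Protocol: Conflict Detection (1/2) • Case: no other sharers of line @A • TX 0: BTX, RD A, CTX; TX 2: BTX, WR A, CTX • (1) The accessing core sends txMark @A to the directory (replaces GETS/GETX) • (2) The directory replies ACK @A, 0 • [Figure: cores TX 0 and TX 2 with their racers and killers lists, and the directory's sharers entry for @A]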

  12. EazyHTM Protocol: Conflict Detection (2/2) • Case: 1 other sharer of line @A (potential conflict) • TX 0: BTX, RD A, CTX; TX 2: BTX, WR A, CTX • (1) txMark @A to the directory • (2) Directory reply: ACK @A, 1 • (3) txAccessor #2, @A • (4) Reader #0, @A; TX 2 remembers in its racers list: abort TX#0 on commit • (5) Writer #2, @A; TX 0 remembers in its killers list: TX#2 can abort me
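The directory's part in conflict detection can be modeled in a few lines. This is a rough sketch under assumptions: the `Directory` class, the tuple layout of the messages, and the idea that the ACK count equals the number of forwarded txAccessor messages are illustrative readings of the two detection slides, not the paper's actual implementation.

```python
from collections import defaultdict

class Directory:
    """Toy directory that tracks transactional sharers per cache line."""
    def __init__(self):
        self.sharers = defaultdict(set)  # line address -> core ids

    def tx_mark(self, core, addr):
        """Handle a txMark request (the slides' replacement for GETS/GETX).

        Returns the ACK sent back to the requester, carrying the number of
        other sharers, plus one txAccessor message per other sharer.
        """
        others = sorted(self.sharers[addr] - {core})
        self.sharers[addr].add(core)
        ack = ("ACK", addr, len(others))
        forwards = [("txAccessor", core, addr, target) for target in others]
        return ack, forwards

directory = Directory()
first_ack, first_forwards = directory.tx_mark(0, "A")    # TX 0 reads A
second_ack, second_forwards = directory.tx_mark(2, "A")  # TX 2 writes A
```

The first request finds no other sharers and gets `("ACK", "A", 0)`; the second gets `("ACK", "A", 1)` and causes one txAccessor message to be forwarded to core 0, which corresponds to the two detection cases shown on the slides.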

  13. EazyHTM Protocol: Conflict Resolution • TX 0: BTX, RD A, CTX; TX 2: BTX, WR A, CTX • TX#2 reached the commit point first, so TX#0 is aborted • (1) Abort from TX#2 • (2) Abort Ack from TX#0 • (3) WR @A (commit) to the directory
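The commit-time resolution steps can be sketched as a small simulation. The names `Tx` and `try_commit`, the status strings, and the message log are hypothetical; the sketch also assumes the committer collects every Abort Ack before writing back, which is how the numbered steps on the slide read.

```python
class Tx:
    def __init__(self, core):
        self.core = core
        self.racers = set()  # cores whose TXs must abort if this TX commits first
        self.status = "running"

def try_commit(tx, txs, log):
    """Lazy resolution: on commit, abort every racer, then write back."""
    if tx.status != "running":
        return False  # already aborted by a killer that committed first
    for core in sorted(tx.racers):
        victim = txs[core]
        log.append(("Abort", tx.core, core))     # step 1: Abort from committer
        if victim.status == "running":
            victim.status = "aborted"
        log.append(("AbortAck", core, tx.core))  # step 2: Abort Ack from victim
    log.append(("WR(commit)", tx.core))          # step 3: publish the writes
    tx.status = "committed"
    return True

txs = {0: Tx(0), 2: Tx(2)}
txs[2].racers.add(0)  # recorded earlier, during eager conflict detection
log = []
tx2_committed = try_commit(txs[2], txs, log)
tx0_committed = try_commit(txs[0], txs, log)
```

Here TX 2 commits and TX 0 finds itself aborted when it reaches its own commit point. A transaction with an empty racers list sends no Abort messages at all, so its commit involves only the write-back.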

  14. EazyHTM Protocol: Disjoint data => parallel commit • NO SERIALIZATION • TX 0: BTX, WR A, CTX, works with line @A; TX 2: BTX, WR B, CTX, works with line @B • (1) txMark @A and txMark @B • (2) ACK @A, 0 and ACK @B, 0 (0 other sharers each) • (3) WR @A (commit) and WR @B (commit) proceed in parallel

  15. Implementation • Implemented in M5, full-system simulator (Alpha) • Private L1 (32KB, 4-way, 64B CL, 2 cycles) • Private L2 (512KB, 8-way, 64B CL, 10 cycles) • Memory (with directory, 100 cycles) • ICN (2D Mesh, 10 cycles per hop)
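As a back-of-the-envelope check, the simulated cache sizes can be combined with the per-line and per-core bits from the hardware-changes slide. This is an illustrative estimate only: it assumes SR and SM bits are kept in both the private L1 and the private L2 (the slide only says "Private Cache(s)") and uses the 32-core example from the slides; the directory's TD bits are left out because the directory size is not given.

```python
KIB = 1024
LINE_BYTES = 64  # cache line size from the implementation slide
CORES = 32       # example core count from the hardware-changes slide

def cache_lines(capacity_bytes, line_bytes=LINE_BYTES):
    """Number of lines in a cache of the given capacity."""
    return capacity_bytes // line_bytes

l1_lines = cache_lines(32 * KIB)   # private L1
l2_lines = cache_lines(512 * KIB)  # private L2

# Assumption: one SR bit and one SM bit per line in both private caches.
sr_sm_bits = 2 * (l1_lines + l2_lines)
# Racers and killers lists: one bit per core each.
list_bits = 2 * CORES

per_core_bits = sr_sm_bits + list_bits
print(f"L1 lines: {l1_lines}, L2 lines: {l2_lines}")
print(f"Extra state per core: {per_core_bits} bits ({per_core_bits // 8} bytes)")
```

Under these assumptions the per-line SR/SM bits dominate the added state, while the racers and killers lists contribute only 64 bits per core.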

  16. Evaluation • Evaluated STAMP benchmarks • Compared with Scalable-TCC-like HTM • Same base simulator • Implemented specialized directory protocol • Compared with ideal lazy HTM (MESI based) • magical conflict detection • instant conflict resolution • parallel write-back commit

  17. Kmeans Low • K-means: groups N-dimensional space into K clusters • Small TXs (RS 15 CL; WS 5 CL) • Low contention (10% aborts) • Similar profile to “replacing locks with atomic” • Most of the SPLASH-2 suite has a similar profile • Near ideal performance

  18. SSCA2 • SSCA2: large directed graph operations • Small TXs (RS 50 CL, WS 10 CL) • Low contention (1.2% aborts) • Near ideal performance • Scalability affected by barriers, not by contention

  19. Yada • Yada: Delaunay mesh refinement • Large TXs (RS 260 CL, WS 140 CL) • Moderate contention (35% aborts) • Good performance also for large TXs

  20. Intruder • Intruder: signature-based network intrusion detection system • Medium TXs (RS 53 CL, WS 20 CL) • High contention (85% aborts) • Very bad scalability for all HTMs • Every transaction detects conflicts over and over again; the many conflict detection messages slow down execution

  21. Only high-conflict STAMP • >50% abort rate only • Averages over: Labyrinth, Intruder, Kmeans-Hi • Results highly affected by Intruder • High-contention, high-core-count execution should be optimized

  22. Only low-conflict STAMP • <50% abort rate only • Low abort rate necessary for scaling • Excludes: Labyrinth 8-32, Intruder 16-32, Kmeans-Hi 32

  23. Conclusions • Introduced EazyHTM, a new HTM implementation • Eager conflict detection, lazy conflict resolution • Fast: performs well for low conflict parallel applications • Minimal changes to directory protocols (easier verification) • As scalable as standard directory protocol • EazyHTM mechanism could allow (future work): • Simpler transaction prioritization • Less wasted work • Better performance optimization • Power efficient TM mechanisms

  24. Thank you! Questions? sasa.tomic@bsc.es
