LogTM: Log-based Transactional Memory Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood Presented by: Eduardo Cuervo
Motivation • Previous TM systems abort fast, commit slow • Old values “in place” • New values somewhere else • Commit is the common case! • Remember Amdahl’s Law • Conflicts usually solved by hardware • Fast but myopic • Trapping to SW if needed for careful resolution
LogTM • Eager version management • Puts new values in place for faster commits • No data moves even on cache overflow • Eager conflict detection • Detects offending ld/st immediately • Fast conflict detection on evicted blocks • Fast commit by lazy reset of directory state • Handle aborts by SW • Aborts are much less common than commits
Eager Version Management • Per-thread log in cacheable virtual memory • On a store, logs the address and previous contents of the block • Write (W) bit • Tracks whether a block has been stored to and logged • Faster commits • Clear W bits and reset the log pointer • Slower aborts • Must also write the old values back
[Figure: eager version management example, six frames. Each frame shows cache blocks at virtual addresses 00, 40, and c0 with their R and W bits, log memory at 1000, 1040, and 1080, and the registers LogBase (always 1000), LogPtr, and a transaction count.]
Frame 1: all R/W bits clear; LogPtr 1000; count 1.
Frame 2: block 00 has R set; LogPtr 1000; count 1.
Frame 3: block 00 has R set, block c0 has W set; LogPtr 1048; count 1.
Frame 4: block 00 has R set, block 40 has R and W set, block c0 has W set; LogPtr 1090; count 1.
Frames 5 and 6: all R/W bits clear; LogPtr 1000; count 0.
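The version management scheme above can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design: the class name `TxLog`, the dictionary standing in for memory, and the set standing in for the W bits are all assumptions made for the example.

```python
class TxLog:
    """Toy model of eager version management (hypothetical names throughout)."""

    def __init__(self, memory):
        self.memory = memory   # dict: block address -> current value
        self.log = []          # (address, old value) pairs, oldest first
        self.w_bits = set()    # blocks already stored to and logged

    def store(self, addr, value):
        # Log the old contents only on the first store to a block.
        if addr not in self.w_bits:
            self.log.append((addr, self.memory[addr]))
            self.w_bits.add(addr)
        self.memory[addr] = value  # the new value goes in place

    def commit(self):
        # Fast path: clear W bits and reset the log; no data moves.
        self.w_bits.clear()
        self.log.clear()

    def abort(self):
        # Slow path: walk the log backwards, restoring old values.
        for addr, old in reversed(self.log):
            self.memory[addr] = old
        self.w_bits.clear()
        self.log.clear()
```

Commit touches only bookkeeping state, while abort has to copy every logged value back, which matches the "faster commits, slower aborts" trade-off on the slide.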
Conflict detection • Coherence requests sent to directory • Directory will forward to other processor(s) • Processors will detect conflict • Using local state • Ack/Nack as response • Requester resolves any conflict • Adds read bit to each cache block • Extends MOESI protocol • “Sticky” states
Conflict detection • Works even after cache overflow • Forwards conflicting requests to “interested” processors • Adds a per-processor overflow bit • The transactional block can be updated • Requests will still be redirected to the processor • Processor can NACK on conflict
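As a rough illustration of how a processor might answer a forwarded request from its local state, consider the sketch below. The class `Processor`, the method `respond`, and the rule that a set overflow bit causes a conservative NACK for blocks with no local R/W bits are assumptions for this example; the slides only state that the processor can NACK on conflict.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Toy model of local conflict checks (hypothetical names)."""
    in_transaction: bool = False
    overflow: bool = False                      # per-processor overflow bit
    r_bits: set = field(default_factory=set)    # cached blocks read in the transaction
    w_bits: set = field(default_factory=set)    # cached blocks written in the transaction

    def respond(self, request, block):
        """Answer a forwarded 'GETS' (read) or 'GETX' (write) request."""
        if not self.in_transaction:
            return "ACK"
        if block in self.w_bits:
            return "NACK"   # any access conflicts with a transactional write
        if block in self.r_bits:
            # A remote read is fine; a remote write conflicts with our read.
            return "NACK" if request == "GETX" else "ACK"
        if self.overflow:
            # Assumption: the block may have been evicted with its bits lost,
            # so answer conservatively.
            return "NACK"
        return "ACK"
```

For example, a transactional reader of block 0x40 acknowledges another reader but NACKs a writer.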
Replacement behavior • Depends on MOESI state • M: Replace with transactional writeback • Sets state as “Sticky@Processor” • Requests are forwarded to the processor • S: Silently replaced • Adds processor to sharer list • Requests forwarded to all sharers • O: Write back to directory • Adds itself to sharer list, same as S if requested exclusively • E: Same as O
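[Figure: directory conflict-detection example, six frames. Each frame shows the directory state for one block, the coherence messages in flight, and processors P and Q with their cache state, (R W) bits, data version, TMcount, and Overflow bit.]
Frame 1: directory Idle [old]; P: I (- -) [none], TMcount 1, Overflow 0.
Frame 2: messages GETX, DATA, ACK; directory M@P [old]; P: M (R W) [new], TMcount 1, Overflow 0.
Frame 3: messages GETS, Fwd_GETS, NACK, NACK; directory M@P [old]; P: M (R W) [new], TMcount 1, Overflow 0; Q: I (- -), TMcount 1, Overflow 0.
Frame 4: messages PUTX, NACK, WB_XACT; directory M@P [new]; P: I (- -), TMcount 1, Overflow 1.
Frame 5: messages GETS, Fwd_GETS, NACK, NACK; directory M@P [new]; P: I (- -), TMcount 1, Overflow 1; Q: I (- -), TMcount 1, Overflow 0.
Frame 6: messages GETS, Fwd_GETS, CLEAN, DATA, ACK; directory E@Q [new]; P: I (- -), TMcount 0, Overflow 0; Q: E (R -) [new], TMcount 1, Overflow 0.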
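The sticky-state behavior in the frames above can be approximated with a small directory model. This is a sketch under stated assumptions: the class `Directory`, its method names, and the `forward` callback are hypothetical, and the real protocol has many more states and messages than the three replies modeled here.

```python
class Directory:
    """Toy directory with a sticky owner entry (hypothetical names)."""

    def __init__(self):
        self.owner = {}    # block -> processor id; may be stale ("sticky")
        self.memory = {}   # block -> data held at the directory

    def transactional_writeback(self, block, proc, data):
        # The data is written back, but the owner entry is deliberately
        # kept so later requests are still forwarded to `proc`.
        self.memory[block] = data

    def request(self, block, requester, forward):
        """Handle a request; `forward(owner, block)` returns the owner's reply."""
        owner = self.owner.get(block)
        if owner is not None and owner != requester:
            if forward(owner, block) == "NACK":
                return "NACK"   # owner still has a conflicting transaction
            # Any other reply (e.g. "CLEAN") lets the directory lazily
            # clean up the stale owner entry below.
        self.owner[block] = requester
        return "DATA"
```

The cleanup happens only when a later request finds the old owner no longer interested, which is the lazy reset of directory state mentioned earlier in the deck.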
Conflict detection • Lazy cleanup is better if overflow is rare • Can be improved otherwise (e.g., by using Bloom filters) • Ambiguities handled conservatively • Refetch during the same transaction vs. an earlier transaction • Set R & W bits • Log old values
Conflict Resolution • When two transactions conflict • At least one must stall or abort • Quick myopic decision by HW • Slow and careful by SW • Hybrid approach: • HW seeks fast solution, traps to software if problem persists
Conflict resolution • Distributed timestamps • Trap to the conflict handler (SW) when the transaction • Could cause deadlock • Is logically later than the transaction it conflicts with • Per-processor possible-cycle flag • Conflict if a NACK is received from a logically earlier transaction while the possible-cycle flag is set
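The timestamp rule above can be sketched as follows. The class `TxProcessor` and its method names are hypothetical, and the rule for when the possible-cycle flag gets set (on sending a NACK to a logically earlier transaction) is an assumption that the slide does not spell out.

```python
class TxProcessor:
    """Toy model of timestamp-based conflict resolution (hypothetical names)."""

    def __init__(self, timestamp):
        self.timestamp = timestamp    # smaller value = logically earlier
        self.possible_cycle = False   # per-processor possible-cycle flag

    def send_nack(self, requester_timestamp):
        # Assumption: stalling a logically earlier transaction is what
        # makes a deadlock cycle possible, so remember it.
        if requester_timestamp < self.timestamp:
            self.possible_cycle = True

    def receive_nack(self, nacker_timestamp):
        # Trap to software only if an earlier transaction NACKed us
        # while our possible-cycle flag is set; otherwise stall and retry.
        if nacker_timestamp < self.timestamp and self.possible_cycle:
            return "trap"
        return "retry"
```

Under this model the logically earliest transaction never traps, so at least one transaction in any cycle keeps making progress.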
Evaluation • Target System • SPARC, Solaris, 32 processors at 1 GHz • L1: 16 KB 4-way split, 1-cycle latency • L2: 4 MB 4-way unified, 12-cycle latency • Memory: 4 GB, 80-cycle latency • Directory: full-bit-vector sharer list, migratory sharing optimization, directory cache, 6-cycle latency • Interconnection: hierarchical switch topology, 14-cycle link latency • Simulated using Simics • LogTM interface added by “magic” instructions
Microbenchmark • Shared counter micro-benchmark • Compared to • Exponential Backoff • MCS locks • LogTM outperforms them • LogTM does not abort transactions
SPLASH • Evaluated using a subset of SPLASH-2 • Used two versions of raytrace (with/without false sharing) • False sharing has significant impact! • Performance gains from moderate to large
Benchmark Analysis • LogTM must read a block before writing it to the log • Benchmarks showed that data is usually read anyway • LogTM is more sensitive to false sharing than lock approaches • The log needs to be valid only when a transaction aborts • So a k-block log write buffer avoids most log writes, as shown in the benchmarks
Related Work • TCC • Lazy version management (slow commits) • Lazy conflict detection (detect on commit) • LTM • On overflow stores new values in uncacheable in-memory hash table • LogTM allows both old and new versions cached
Related Work • UTM • Logs blocks targeted by both loads and stores • More complete conflict detection • Must walk log on certain coherence requests • VTM • Per address space virtual mode for cache evictions, paging, context switches • Virtualized VTM uses micro-code for conflict detection. (LogTM uses MOESI extension)
Conclusion • Presents a TM implementation designed to speed up the common case • Efficiently handles cache evictions • Requires simple architectural changes • Registers, state, directory extension • Work towards hybrid conflict detection • No paging or context switch support • Very sensitive to false sharing