Transactional Memory: Architectural Support for a Lock-Free Data Structure
By: Major Bhadauria
Some material borrowed from: Konrad Lai, Microprocessor Technology Labs, Intel
Intel Multicore University Research Conference, Dec 8, 2005
Motivation • Multiple cores face a serious programmability problem • Writing correct parallel programs is very difficult • Transactional Memory addresses key part of the problem • Makes parallel programming easier by simplifying coordination • Requires hardware support for performance • Implements lock-free data structures
Benefits of Going Lock-Free • Allows us to avoid: • Priority inversion: a lower-priority process is preempted while holding a lock needed by a higher-priority process • Convoying: a process holding a lock is de-scheduled, needlessly stalling other processes that are waiting on that lock • Deadlock: processes A and B each need locks C and D; A holds C and waits for D while B holds D and waits for C, so neither can ever proceed (see the sketch below)
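A minimal C sketch of the deadlock case, assuming two pthread mutexes standing in for resources C and D (the lock and function names are mine): each thread acquires the locks in the opposite order, so each can end up holding one lock while waiting forever for the other.

```c
#include <pthread.h>

pthread_mutex_t lock_C = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t lock_D = PTHREAD_MUTEX_INITIALIZER;

/* Process A: acquires C first, then D. */
void *process_a(void *arg) {
    pthread_mutex_lock(&lock_C);
    pthread_mutex_lock(&lock_D);   /* blocks forever if B already holds D */
    /* ... critical section using both resources ... */
    pthread_mutex_unlock(&lock_D);
    pthread_mutex_unlock(&lock_C);
    return NULL;
}

/* Process B: acquires D first, then C -- the opposite order. */
void *process_b(void *arg) {
    pthread_mutex_lock(&lock_D);
    pthread_mutex_lock(&lock_C);   /* blocks forever if A already holds C */
    /* ... critical section using both resources ... */
    pthread_mutex_unlock(&lock_C);
    pthread_mutex_unlock(&lock_D);
    return NULL;
}
```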
Problem: Lock-Based Synchronization • Software engineering problems • Lock-based programs do not compose • Performance and correctness are tightly coupled • Timing-dependent errors are difficult to find and debug • Performance problems • High performance requires finer-grain locking • More locks mean more overhead • Lock-based synchronization of shared data access is fundamentally problematic • Need a better concurrency model for multi-core software
What is Transactional Memory (TM)? • Transactional Memory (TM) allows an arbitrary set of memory locations to be updated atomically, with transactions appearing to execute serially. Basic mechanisms: • Isolation: track reads and writes, detect when conflicts occur • Version management: record new/old values • Atomicity: commit new values or abort back to old values
Thread 1: begin_xaction; A = A – 20; B = B + 20; A = A – B; C = C + 20; end_xaction
Thread 2: begin_xaction; C = C - 30; A = A + 30; end_xaction
• Thread 1's accesses and updates to A, B, C are atomic • Thread 2 sees either "all" or "none" of Thread 1's updates
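As one possible software realization of the atomic blocks above (only a sketch: it assumes a compiler with GCC's -fgnu-tm transactional-memory extension, and the variable and function names are mine, slightly simplified from the pseudocode), each thread's updates either all become visible at once or not at all:

```c
/* Compile with: gcc -fgnu-tm transfer.c   (assumes GCC's TM language extension) */
long A = 100, B = 200, C = 0;

/* Thread 1: all of its updates commit atomically or abort together. */
void thread1(void) {
    __transaction_atomic {
        A = A - 20;
        B = B + 20;
        C = C + 20;
    }
}

/* Thread 2: observes either all of Thread 1's updates or none of them. */
void thread2(void) {
    __transaction_atomic {
        C = C - 30;
        A = A + 30;
    }
}
```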
Transactional Memory benefits • Focus: Multithreaded programmability crisis • Programmability & performance • Allows conservative synchronization • Programmer focuses on parallelism & correctness, HW extracts performance • Software engineering and composability • Allows library re-use and composition (locks make this very difficult) • Critical for wide multi-core demand • Makes high performance MT programming easier • Captures a fundamental, well-known, intuitive “atomic” construct • Been around for decades • Similar to a “critical section” but without its problems • No deadlocks, priority inversion, data races, unnecessary serialization
Software Transactional Memory • Software Transactional Memory (1995 to the present) • Significant work from Sun, Brown, Cambridge, Microsoft • Serious performance limitations • Degrades the "common" case of no conflicts/contention • >90% of transactions have no conflicts • 90% of critical sections are uncontended: what if all of these slowed down by 5X? • Serious deployability limitations • Relies on special runtime support • Invasive to applications and libraries • Is STM too slow and too invasive to deploy? Could there be a better implementation? • Still great for understanding the complex usage models of the future
Herlihy & Moss’ Implementation • Minimum implementation builds on the existing LL/SC structure: • Loading a shared value for reading • Writing to a shared value • Commit/abort functions to make tentative values permanent or discard them • Extensions for performance: • Store a copy of the old value in the cache – reduces bus traffic and latency • Store committed data in the cache – reduces bus traffic • Read a value with a hint that it will later be written – reduces bus traffic • A Validate function to check the current transaction status – reduces the number of orphan transactions • A BUSY bus signal for transactional requests – reduces the number of aborts
Herlihy & Moss’ Implementation • Transactional memory primitives: • Load-transactional (LT): reads the value of a shared memory location into a private register • Load-transactional-exclusive (LTX): like LT, but hints that the location is likely to be updated • Store-transactional (ST): tentatively writes a value to a shared memory location; the write is not visible to other processors until a successful commit • TM instructions: commit, abort, validate • Commit makes tentative changes permanent if the transaction’s read set (locations referenced by LT) has not changed, and no other process has read any location in the write set (locations referenced by LTX and ST) • Abort discards all updates to the write set • Validate tests the current transaction status: true – continue, false – abort
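A sketch of how these primitives compose into a lock-free update, in the spirit of the paper's shared-counter example; the intrinsic names (LTX, ST, COMMIT) are hypothetical stand-ins for the instructions above, since real syntax would be ISA-specific.

```c
typedef unsigned long word;

/* Hypothetical intrinsics standing in for the TM instructions (illustration only). */
extern word LTX(volatile word *addr);            /* transactional read, hints a later update */
extern void ST(volatile word *addr, word value); /* tentative write, invisible until commit */
extern int  COMMIT(void);                        /* nonzero if the transaction committed */

/* Lock-free shared counter: retry until the transaction commits. */
void counter_increment(volatile word *counter) {
    for (;;) {
        word old = LTX(counter);   /* adds the counter to the read and write sets */
        ST(counter, old + 1);      /* tentative update */
        if (COMMIT())              /* publish atomically, or abort on conflict */
            return;
        /* aborted because another transaction touched the counter; retry */
    }
}
```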
Herlihy & Moss’ Implementation • Hardware modifications (modest) • [Figure: side-by-side comparison of the original ("OLD") and modified ("NEW") hardware organization]
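In place of the lost figure, here is a rough C model of the key addition, the small dedicated transactional cache; this is my paraphrase of the paper's mechanism, and all struct, field, and size names are invented. Each transactionally written line is held as a pair of entries, an XCOMMIT copy of the old value and an XABORT copy of the new one, so commit and abort reduce to discarding one entry of each pair.

```c
/* Illustrative model of the dedicated transactional cache (names and size invented). */
enum tx_tag { EMPTY, NORMAL, XCOMMIT, XABORT };

struct tx_line {
    unsigned long addr;   /* line address */
    unsigned long data;   /* old value (XCOMMIT) or tentative new value (XABORT) */
    enum tx_tag   tag;
};

#define TX_CACHE_LINES 64         /* small, fully associative; size is illustrative */
struct tx_line tx_cache[TX_CACHE_LINES];

/* Commit: drop the saved old values, make the tentative new values ordinary data. */
void tx_commit_lines(void) {
    for (int i = 0; i < TX_CACHE_LINES; i++) {
        if (tx_cache[i].tag == XCOMMIT)      tx_cache[i].tag = EMPTY;
        else if (tx_cache[i].tag == XABORT)  tx_cache[i].tag = NORMAL;
    }
}

/* Abort: drop the tentative new values, keep the saved old values as ordinary data. */
void tx_abort_lines(void) {
    for (int i = 0; i < TX_CACHE_LINES; i++) {
        if (tx_cache[i].tag == XABORT)       tx_cache[i].tag = EMPTY;
        else if (tx_cache[i].tag == XCOMMIT) tx_cache[i].tag = NORMAL;
    }
}
```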
Hardware Support for Performance • Record recovery state • Buffer updates / track accesses • Commit if there is no external access (discard all updates if a conflict is detected); the coherence protocol between Core 1 and Core 2 detects the conflicts
Core 1: begin_xaction; A.withdraw(20); B.deposit(20); end_xaction
Core 2: begin_xaction; Sum = A.sum + B.sum; end_xaction
• [Figure: architectural memory state as Core 1's transaction commits, with A.sum going from 100 to 80 and B.sum from 200 to 220; a conflicting access forces an abort & restart] • Core 2 sees $300 – never $280 or $320
Herlihy & Moss’ Benchmarks • Compared against: • Test-and-test-and-set (TTS) w/ exponential backoff • Software queuing (QOSB mechanism) • LOAD_LINKED/STORE_COND (LL/SC)
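For context, a compact C11 sketch of the first comparison point, a test-and-test-and-set spinlock with exponential backoff (the function and constant names are mine):

```c
#include <stdatomic.h>

static atomic_int lock_word;        /* 0 = free, 1 = held */

void tts_lock(void) {
    unsigned delay = 1;
    for (;;) {
        /* "test": spin on plain reads so waiting cores stay in their caches */
        while (atomic_load_explicit(&lock_word, memory_order_relaxed) != 0)
            ;
        /* "test-and-set": one atomic exchange to try to take the lock */
        if (atomic_exchange_explicit(&lock_word, 1, memory_order_acquire) == 0)
            return;
        /* lost the race: back off exponentially before probing again */
        for (volatile unsigned i = 0; i < delay; i++)
            ;
        if (delay < (1u << 16))
            delay <<= 1;
    }
}

void tts_unlock(void) {
    atomic_store_explicit(&lock_word, 0, memory_order_release);
}
```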
Herlihy & Moss’ Implementation • Limitations • Transaction size is implementation-dependent • A transaction must finish within a single scheduling quantum • The number of locations accessed cannot exceed an architecturally specified limit • Hence: short durations, small data sets • Requires a sequentially consistent cache-coherence model (otherwise fences may be needed) • Nested transactions are problematic • The programmer must therefore be aware of hardware details
What if HW is not sufficient? • This is the deployability challenge • The missing piece of the puzzle for all prior work… • Resource limitations are fundamental • Space: caches • More HW only delays the inevitable: there will always be an n+1 case • Time: scheduling quanta • Programmers have no control over time • Affects functionality, not just performance • Some transactions may never complete • Making the HW limit explicit is difficult • Allows only limited usage • Unreasonable for high-level languages • How do you architect it in an evolvable manner?
Virtualizing Transactional Memory • Hides hardware limitations the way virtual memory hides physical memory limitations • Two modes • One for the common case, built into HW • A second for buffer overflow, page faults, context switches, or thread migration, built from SW and HW data structures • Virtualization allows transactions to be suspended, to migrate, or to overflow state from local buffers, and allows nested transactions • Extra challenge: unlike ordinary TM, it cannot rely on cache-coherence mechanisms alone, since it must detect conflicts between active transactions and transactions whose state has partially or completely overflowed to virtual memory
Virtualize for Completeness • [Figure: a core with limited buffers spills transactional log/buffer space into the virtual address space on timer interrupts, context switches, and exceptions; out-of-band concurrency control covers the overflowed state] • Overflow management • Uses virtual memory • Software libraries and microcode • Programmer-transparent • Performance isolation • Suspendable/swappable
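A purely illustrative sketch of the overflow idea (this is not the actual VTM design; every structure and name below is invented): while the hardware buffer has room, transactional state stays on the fast hardware path; once it fills, further transactional writes are logged to a table in virtual memory, and conflict detection must consult that overflowed state as well as the cache-coherence mechanism.

```c
/* Invented structures sketching "spill overflowed transactional state to virtual memory". */
#include <stdbool.h>
#include <stddef.h>

#define HW_TX_BUFFER_LINES 64      /* capacity of the fast hardware path (illustrative) */
#define MAX_OVERFLOW       1024    /* capacity of the software-visible table (illustrative) */

struct overflow_entry {            /* one spilled transactional write */
    void          *addr;
    unsigned long  old_value;      /* kept for rollback on abort */
    unsigned long  new_value;      /* published on commit */
};

struct tx_state {
    size_t hw_lines_used;          /* lines currently tracked by the hardware buffer */
    size_t overflow_count;
    struct overflow_entry overflow[MAX_OVERFLOW];  /* overflow table in virtual memory */
};

/* Transactional store: use the hardware buffer while it has room, else spill to memory. */
bool tx_record_store(struct tx_state *tx, void *addr,
                     unsigned long old_value, unsigned long new_value) {
    if (tx->hw_lines_used < HW_TX_BUFFER_LINES) {
        tx->hw_lines_used++;       /* hardware tracks this line via coherence */
        return true;
    }
    if (tx->overflow_count == MAX_OVERFLOW)
        return false;              /* out of space: the transaction must abort */
    tx->overflow[tx->overflow_count++] =
        (struct overflow_entry){ addr, old_value, new_value };
    return true;
}

/* Conflict detection must now also scan other transactions' overflowed entries. */
bool tx_addr_in_overflow(const struct tx_state *tx, const void *addr) {
    for (size_t i = 0; i < tx->overflow_count; i++)
        if (tx->overflow[i].addr == addr)
            return true;
    return false;
}
```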
Recent TM Research • Recent work focuses on the harder problems of TM • Making the model immune to cache/buffer size limitations, scheduling limitations, etc. • TCC (Stanford) (2004) • Same size limitation as Herlihy/Moss (transactions limited to the local caches) • LTM (MIT), VTM (Intel), LogTM (Wisconsin) (2005) • Assume hardware TM support • Add support to make transactions immune to resource limitations • Goals are similar, approaches very different • LTM: handles resource overflow only • VTM: complete virtualization • LogTM: handles resource overflow only, and optimizes for commits
Some Research Challenges • Large transactions • Language extensions • I/O, loopholes, escape hatches, … • Interaction and co-existence with: • Other synchronization schemes: locks, flags, … • Other transactions • Database transactions • System transactions (Microsoft) • Other libraries, system software, the operating system, … • Performance monitoring, tuning, debugging, … • Open vs. closed nesting • Interaction between transactional and non-transactional code • Usage & workloads • PLDI workshop
TM – First Decade • IBM 801 database storage (1980s) • Lock attribute bits on virtual memory (via TLBs, PTEs) at 128-byte granularity • Load-Linked/Store-Conditional (Jensen et al., 1987) • Optimistic update of a single cache line (Alpha, MIPS, PowerPC) • Transactional Memory (Herlihy & Moss, 1993) • Coined the term; TM is a generalization of LL/SC • Instructions explicitly identify transactional loads and stores • Used a dedicated transaction cache • Transaction size limited by the transaction cache • Oklahoma Update (Stone et al./IBM, 1993) • Proposed concurrently, similar to TM • Operated on dedicated monitored registers rather than a cache
References • M. Herlihy and J. E. B. Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures”, ISCA 1993 • R. Rajwar, M. Herlihy, and K. Lai, “Virtualizing Transactional Memory”, ISCA 2005 • K. E. Moore et al., “LogTM: Log-based Transactional Memory”, HPCA-12, 2006