Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory
Paul E. McKenney, IBM Linux Technology Center
Maged M. Michael, IBM TJ Watson Research
Jonathan Walpole, Portland State University
Presented by Vidhya Priyadharshnee Palaniswamy Gnanam
Outline
• Concurrency Control Techniques Review
• Objective
• Locking Critique
• TM Critique
• Where do Locking and TM fit in?
• Conclusion
• Recent Work
• Future Work
Multicore Computing
• The speed of individual cores is no longer increasing at the rate it used to, so ever more complicated applications rely on a growing number of CPU cores for speed.
• To use these extra cores, programs must be parallelized.
• Synchronizing access to shared data is critical for the correctness of these programs.
Lock Based Synchronization
• “Traditional” pessimistic synchronization approach.
• Simple: partition the shared data and protect each partition with a separate lock.
• Locks prevent concurrent access and enable sequential reasoning about critical section code.
• Reader-writer locking allows multiple readers to gain access concurrently, which improves scalability if used correctly.
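The partitioning idea can be sketched in C with POSIX mutexes. This is a minimal illustration under stated assumptions, not code from the paper: the names `striped_table`, `NBUCKETS`, `table_init`, `table_add`, and `table_get` are hypothetical, and the "shared data" is just one running total per partition.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NBUCKETS 16  /* hypothetical number of partitions */

/* One partition of the shared data, protected by its own lock. */
struct bucket {
    pthread_mutex_t lock;
    long total;
};

struct striped_table {
    struct bucket buckets[NBUCKETS];
};

void table_init(struct striped_table *t)
{
    assert(t != NULL);
    for (size_t i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&t->buckets[i].lock, NULL);
        t->buckets[i].total = 0;
    }
}

/* Add delta to the partition that owns key; only that partition is locked,
   so updates to keys in other partitions can proceed concurrently. */
void table_add(struct striped_table *t, unsigned key, long delta)
{
    struct bucket *b = &t->buckets[key % NBUCKETS];

    pthread_mutex_lock(&b->lock);
    b->total += delta;
    pthread_mutex_unlock(&b->lock);
}

/* Read the total of the partition that owns key. */
long table_get(struct striped_table *t, unsigned key)
{
    struct bucket *b = &t->buckets[key % NBUCKETS];
    long v;

    pthread_mutex_lock(&b->lock);
    v = b->total;
    pthread_mutex_unlock(&b->lock);
    return v;
}
```

Keys that map to the same bucket still contend on one lock, so the partition count is a tuning choice between lock overhead and contention.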
Lock Based Synchronization: Downsides
Lock based synchronization opens a whole new can of worms, though.
• High contention on non-partitionable data structures.
• Coarse-grained locking limits concurrency: lock contention is high and scalability is poor.
• Fine-grained locking is hard, and lock acquisition overhead affects performance.
• Locking introduces dependencies among threads:
  • Thread failures propagate
  • Fault tolerance of the system suffers
Non Blocking Synchronization
• Lock-free, “optimistic” synchronization.
• Execute the critical section unconstrained, then check at the end whether you were the only one. If so, continue. If not, roll back and retry.
• Optimistic synchronization keeps threads independent. Depending on the implementation, it offers different levels of fault tolerance through different progress guarantees: wait freedom, lock freedom, or obstruction freedom.
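The "execute, check, retry" pattern can be sketched with C11 atomics. This is a minimal example, assuming a shared counter as the data; the function name `optimistic_add` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Optimistically add delta to *ctr without taking a lock.
   Returns the number of retries that were needed. */
int optimistic_add(atomic_long *ctr, long delta)
{
    int retries = 0;
    long seen;

    assert(ctr != NULL);
    seen = atomic_load(ctr);  /* read the current value unconstrained */

    /* Commit only if nobody changed the value since we read it. On failure
       the compare-exchange stores the current value into `seen`, so the
       loop recomputes the update and tries again. */
    while (!atomic_compare_exchange_weak(ctr, &seen, seen + delta))
        retries++;
    return retries;
}
```

The weak compare-exchange may fail spuriously, so the retry count can be nonzero even without contention; under heavy contention it grows, which is the cost the next slide describes.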
Non Blocking Synchronization: Downsides
• Difficult programming logic.
• Heavy use of atomic operations such as compare-and-swap (CAS), which verify and, if verification passes, finalize an update in one step.
• The impact of contention can be severe: more retries cause heavy bus and cache contention, which slows down the threads that are making progress.
• May not perform as well as a lock-based approach in a non-preemptible kernel.
Objective
• Each technique has both green and dry areas.
• The goal of the paper is to:
  • Spot the green and dry areas of lock based synchronization and of transactional memory (a form of non-blocking synchronization, NBS)
  • Critique both constructively to understand where each technique fits
Locking Strengths
• Simple and elegant idea: allow only one CPU at a time to access a given piece of data.
• Provides disjoint access parallelism, but only with extra effort.
• Does not require specialized hardware support; it works on existing commodity hardware.
• Supported on multiple platforms, because it is widely used and well-defined, standardized locking APIs such as the POSIX pthread API exist.
• Much legacy code uses locking.
• More programmers have experience with it.
• Contention effects are concentrated within locking primitives, allowing critical sections to run at full speed.
Locking Strengths
• The cost of waiting can be minimized, for example by reducing power consumption while waiting on a lock.
• Good for protecting non-idempotent operations such as I/O, thread creation, memory remapping, and system rebooting.
• Interacts naturally with other synchronization mechanisms, including reference counting, atomic operations, non-blocking synchronization, and RCU.
• Interacts naturally with debuggers.
Locking: Problems & Improvements
Problem: Lock Contention
• Some data structures, such as unstructured graphs and trees, are difficult to partition.
• May have to settle for coarse-grained locking, which leads to high contention and reduced scalability.
Solution
• Redesign algorithms to use partitionable data structures.
• Replace trees and graphs with hash tables and radix trees.
The problem remains with non-partitionable data structures!
Locking: Problems & Improvements
Problem: Lock Overhead
• Lock granularity determines scalability.
• Can we partition the shared data as much as possible and protect each partition with a separate lock?
• Locking uses expensive instructions and creates high synchronization overhead.
• Locking introduces communication-related cache misses into read-mostly workloads that would otherwise run entirely within the CPU cache.
Solution
• Lock overhead cannot be eliminated completely, but it can be avoided in common cases.
• In read-mostly situations, locked updates may be paired with read-copy-update (RCU) or hazard pointers. This reduces lock overhead in the common case and increases read-side performance and scalability.
The problem remains in update-heavy workloads!
Locking: Problems & Improvements
• Performance vs. scalability
• Need the right granularity of locks!
Locking: Problems & Improvements
Problem: Deadlock
• Multiple threads acquire the same set of locks in different orders.
• Self-deadlock: an interrupt occurs while a thread holds a lock, and the interrupt handler also needs that lock.
Solution
• Require a clear locking hierarchy: multiple locks are acquired in a pre-specified order.
• If a lock is not free, the thread surrenders the conflicting locks it holds and retries.
• Detect deadlock and break the cycle by terminating selected threads, chosen by priority or work done.
• Track lock acquisition, dynamically detect potential deadlock, and prevent it before it occurs.
• To avoid self-deadlock, disable interrupts on entering the critical section, or avoid acquiring locks in interrupt handlers.
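A locking hierarchy can be as simple as "always take the lock at the lower address first". The sketch below assumes that ordering rule and a hypothetical `account` type; `account_init`, `lock_pair`, `unlock_pair`, and `transfer` are illustrative names, not an API from the paper.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct account {
    pthread_mutex_t lock;
    long balance;
};

void account_init(struct account *a, long balance)
{
    pthread_mutex_init(&a->lock, NULL);
    a->balance = balance;
}

/* Acquire both locks in one fixed global order (lower address first), so
   two threads locking the same pair can never wait on each other in a cycle. */
static void lock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
    if (x == y) {                 /* same lock: take it only once */
        pthread_mutex_lock(x);
        return;
    }
    if ((uintptr_t)x > (uintptr_t)y) {
        pthread_mutex_t *tmp = x;
        x = y;
        y = tmp;
    }
    pthread_mutex_lock(x);
    pthread_mutex_lock(y);
}

static void unlock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
    pthread_mutex_unlock(x);
    if (x != y)
        pthread_mutex_unlock(y);
}

/* Move amount from one account to another.
   Returns 0 on success, -1 if the source balance is too small. */
int transfer(struct account *from, struct account *to, long amount)
{
    int rc = -1;

    assert(amount >= 0);
    lock_pair(&from->lock, &to->lock);
    if (from->balance >= amount) {
        from->balance -= amount;
        to->balance += amount;
        rc = 0;
    }
    unlock_pair(&from->lock, &to->lock);
    return rc;
}
```

Without the ordering step, one thread running `transfer(a, b, ...)` and another running `transfer(b, a, ...)` could each hold one lock and wait forever for the other.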
Locking: Problems & Improvements
Problem: Priority Inversion
• Priority inversion can cause a high-priority thread to miss its real-time scheduling deadline, which is unacceptable in safety-critical systems.
Solution
• Priority inheritance: a low-priority thread holding the lock temporarily inherits the priority of the blocked high-priority thread, so that no medium-priority thread can preempt it.
• Priority ceiling: the lock holder is assigned the priority of the highest-priority task that might acquire that lock.
• Preemption is disabled entirely while locks are held.
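POSIX exposes the first solution as a mutex attribute. The sketch below requests priority inheritance with `pthread_mutexattr_setprotocol` and `PTHREAD_PRIO_INHERIT`; the wrapper name `init_pi_mutex` and its fall-back-to-a-plain-mutex policy are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Initialize *m as a priority-inheritance mutex if the platform allows it.
   Returns 1 if priority inheritance is enabled, 0 if the request was
   refused and a default mutex was created instead, -1 on failure. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int used_pi = 0;

    assert(m != NULL);
    if (pthread_mutexattr_init(&attr) != 0)
        return -1;

    /* Ask that a thread holding this mutex inherit the priority of the
       highest-priority thread blocked on it. */
    if (pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT) == 0 &&
        pthread_mutex_init(m, &attr) == 0)
        used_pi = 1;
    pthread_mutexattr_destroy(&attr);

    if (used_pi)
        return 1;
    /* Hypothetical fallback policy: use a plain mutex. */
    return pthread_mutex_init(m, NULL) == 0 ? 0 : -1;
}
```

Priority inheritance only matters when threads run under a real-time scheduling policy with distinct priorities; the mutex is otherwise locked and unlocked like any other.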
Locking: Problems & Improvements
Problem: Convoying
• Preemption or blocking (due to I/O, a page fault, etc.) of the lock holder can block other threads.
• The critical section is effectively lengthened far beyond its own code, making lock acquisition latency non-deterministic.
• May lead to starvation of large critical sections.
• Problem for real-time workloads.
Solution
• Use scheduler-conscious synchronization so that the scheduler avoids preempting a thread that holds a lock.
• Use RCU for read-side critical sections to avoid non-deterministic lock acquisition latency on the read side.
• To avoid starvation, use FCFS lock acquisition primitives with a limit on the number of threads, e.g. semaphores.
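One way to get first-come-first-served acquisition is a ticket lock. Note that this swaps in a different primitive from the semaphore the slide names, and it addresses only the fairness point: a spinning ticket lock does nothing about preemption of the lock holder. The names `ticket_lock`, `ticket_init`, `ticket_acquire`, and `ticket_release` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ticket_lock {
    atomic_uint next_ticket;  /* next ticket number to hand out */
    atomic_uint now_serving;  /* ticket number currently allowed in */
};

void ticket_init(struct ticket_lock *l)
{
    assert(l != NULL);
    atomic_init(&l->next_ticket, 0);
    atomic_init(&l->now_serving, 0);
}

/* Take a ticket and spin until it is served. Threads enter strictly in
   ticket order, so no waiter can be overtaken indefinitely.
   Returns the ticket number that was drawn. */
unsigned ticket_acquire(struct ticket_lock *l)
{
    unsigned me = atomic_fetch_add(&l->next_ticket, 1);

    while (atomic_load(&l->now_serving) != me)
        ;  /* busy-wait until our turn */
    return me;
}

/* Admit the holder of the next ticket. */
void ticket_release(struct ticket_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1);
}
```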
Locking: Problems & Improvements
Problem: Lack of Composability and Modularity
• Composing atomic operations into larger atomic operations is difficult.
• Leads to self-deadlock if the inner critical section tries to acquire the same lock that the outer critical section is holding.
Solution
• Need to know what locks other modules use before calling or composing them.
Abstraction is lost!
Locking: Problems & Improvements
Problem: Indefinite Blocking
• Caused by termination of the lock holder.
• Creates problems for fault-tolerant software.
Solution
• Abort and restart the entire application: simple and reliable.
• Identify the terminated lock holder and clean up its state: extremely complex.
Fault tolerance of the software is still affected!
Composability
• With locking, operations may be thread-safe individually but not when composed together.
• Consider popping an item from one stack and pushing it onto another.

T1:
    /* lock() and unlock() stand for the stack's own lock operations. */
    struct foo *pop(struct foo_stack *src)
    {
        struct foo *q;

        lock(src);
        q = src->top;
        if (q != NULL)
            src->top = q->next;
        unlock(src);
        return q;
    }

T2:
    void push(struct foo_stack *dst, struct foo *q)
    {
        lock(dst);
        q->next = dst->top;
        dst->top = q;
        unlock(dst);
    }

The intermediate state (the item is in neither stack) is visible!
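TM Approach

    /* begin_txn and end_txn are pseudocode transaction delimiters. */
    struct foo *pop_push(struct foo_stack *src, struct foo_stack *dst)
    {
        struct foo *q;

        begin_txn;
        q = src->top;
        if (q != NULL) {
            src->top = q->next;
            q->next = dst->top;
            dst->top = q;
        }
        end_txn;
        return q;
    }

Let the TM system take care of the rest!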
Transactional Memory
• A solution to the problem of consistency in the face of concurrency, adopted from the database world: transactions.
• Simple, composable, scalable
• Atomic blocks == transactions
  • Atomicity: all-or-nothing execution of a transaction.
  • Isolation: partial results are invisible to other transactions and threads.
Transactional Memory
• TM is a non-blocking synchronization mechanism: at least one thread will succeed.
• Can be constructed to be either:
  • Optimistic
    • Speculates on concurrency without waiting for permission (acquires no locks on reads or writes).
    • Performs well when critical regions rarely interfere with each other.
  • Pessimistic
    • "Always ask for permission": acquires locks on reads and writes (blocking), as used in databases.
    • Good when conflicts are frequent.
HW Transactional Memory
• New instructions (LT, LTX, ST, Abort, Commit, Validate)
• Fully associative transactional cache for buffering updates
• Piggybacks on the multiprocessor cache coherence protocol to detect transaction conflicts
SW Transactional Memory
• Obstruction-free
  • Introduces a level of indirection.
  • Logs the modifications to memory locations in descriptors.
  • Based on the transaction outcome, commits by atomically writing the new values to the memory locations, or aborts by reverting to the old values.
• Non-obstruction-free
  • Revocable two-phase locking for writes: a transaction locks all objects that it writes and does not release these locks until the transaction terminates. If deadlock occurs, one transaction aborts, releasing its locks and reverting its writes.
  • Optimistic concurrency control for reads: whenever a transaction reads from an object, it logs the version it read. When the transaction commits, it verifies that these are still the current versions of the objects.
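The read-side version check can be sketched as follows. This is a deliberately single-threaded illustration of the bookkeeping only: a real STM would need atomic accesses and the write locking described above. All names (`tobj`, `txn`, `MAX_READS`, `txn_begin`, `txn_read`, `tobj_write`, `txn_validate`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_READS 8  /* hypothetical cap on the read log size */

/* A transactional object: a value plus a version bumped on every write. */
struct tobj {
    unsigned version;
    long value;
};

struct read_entry {
    const struct tobj *obj;
    unsigned version;  /* version observed when the object was read */
};

struct txn {
    struct read_entry reads[MAX_READS];
    size_t nreads;
};

void txn_begin(struct txn *tx)
{
    tx->nreads = 0;
}

/* Read an object inside a transaction and log the version that was seen. */
long txn_read(struct txn *tx, const struct tobj *o)
{
    assert(tx->nreads < MAX_READS);
    tx->reads[tx->nreads].obj = o;
    tx->reads[tx->nreads].version = o->version;
    tx->nreads++;
    return o->value;
}

/* Stand-in for a committed write by some transaction (locking elided). */
void tobj_write(struct tobj *o, long v)
{
    o->value = v;
    o->version++;
}

/* Commit-time check: every logged version must still be current. */
bool txn_validate(const struct txn *tx)
{
    for (size_t i = 0; i < tx->nreads; i++)
        if (tx->reads[i].obj->version != tx->reads[i].version)
            return false;
    return true;
}
```

If validation fails, the transaction aborts and retries; the read log is the consistency-validation cost that a later slide lists among STM overheads.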
TM Strengths
• Non-blocking: the system as a whole makes progress.
• Familiar to many users from database systems, and from a trivial hardware implementation: LL/SC.
Scalable
• Allows multiple non-interfering threads to execute concurrently in a critical section.
Automatic Disjoint Access Parallelism
• Achieved automatically, without having to design a complex fine-grained locking solution.
Modular & Composable
• Transactions may be nested or composed.
TM Strengths
Deadlock Free
• Avoids common pitfalls of lock composition, such as deadlock.
Fault Tolerance
• Failure of one transaction does not affect others.
Non-Partitionable Data Structures
• Can be used with difficult-to-partition data structures such as unstructured graphs.
TM Problems & Improvements
Problem: Portability in Hardware TM
• Portability: HTM needs special hardware.
• Transaction size is limited by the transactional cache.
• Newer implementations address transactional cache overflow through virtualization.
Solution
• Use HTM for small transactions and, with language support, fall back to STM otherwise.
• Making this transparent to the application requires the semantics of HTM and STM to be identical.
TM Problems & Improvements
Problem: Performance in Software TM
Poor performance compared to locking, even at low levels of contention:
• Atomic operations for acquiring shared object handles
• Cost of consistency validation
• Effect of shared object metadata on the cache
• Dynamic allocation, data copying, and memory reclamation
Solution
• STM performance can be improved by relaxing the non-blocking property, which eliminates the overheads of indirection, dynamic allocation, data copying, and memory reclamation.
This reintroduces many of the problems of locking!
TM Problems & Improvements
Problem: Non-Idempotent Operations: I/O
• A transaction cannot perform operations that cannot be undone, such as I/O, memory remapping, and thread creation and destruction.
• Such an operation cannot simply be repeated on each transaction retry; for I/O, that would send the same request multiple times.
Common Solution
• Postpone I/O until the outcome of the transaction is known, to avoid I/O retries.
Problematic Scenario
• The I/O waits until commit,
• and the commit waits for the I/O to complete.
Self-deadlock!
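The "postpone I/O until the outcome is known" approach can be sketched as a per-transaction output buffer that is flushed on commit and discarded on abort. This is an illustration only; `io_txn`, `TX_BUF`, `io_begin`, `io_write`, `io_abort`, and `io_commit` are hypothetical names. It also makes the problematic scenario concrete: nothing reaches the output before `io_commit`, so a transaction that needs a reply to its own request before committing can never get one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TX_BUF 256  /* hypothetical per-transaction buffer size */

struct io_txn {
    char buf[TX_BUF];
    size_t len;
};

void io_begin(struct io_txn *tx)
{
    tx->len = 0;
}

/* Queue output instead of performing it.
   Returns 0 on success, -1 if the buffer would overflow. */
int io_write(struct io_txn *tx, const char *s)
{
    size_t n;

    assert(tx != NULL && s != NULL);
    n = strlen(s);
    if (n > TX_BUF - tx->len)
        return -1;
    memcpy(tx->buf + tx->len, s, n);
    tx->len += n;
    return 0;
}

/* Abort: drop the queued output, so a retry cannot duplicate it. */
void io_abort(struct io_txn *tx)
{
    tx->len = 0;
}

/* Commit: perform the queued output exactly once.
   Returns the number of bytes written. */
size_t io_commit(struct io_txn *tx, FILE *out)
{
    size_t n = fwrite(tx->buf, 1, tx->len, out);

    tx->len = 0;
    return n;
}
```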
TM Problems & Improvements
Solutions: Non-Idempotent Operations: I/O
• Buffered I/O might be addressed by including the buffering mechanism within the scope of the transactions doing I/O. This cannot handle the scenario where commit and I/O wait on each other.
• Both sender and receiver could be included in one transaction, but transactions are currently limited to a single system.
• Transactions performing non-idempotent operations can be executed in “inevitable” mode, in which the transaction is guaranteed to commit, avoiding the irreversibility problem of I/O and similar operations. This does not scale, because at most one transaction can be inevitable at a time.
TM Problems & Improvements
Problem: Contention Management
• When transactions collide, only one can proceed; the others must be rolled back.
• Large transactions can be starved by smaller ones.
• A high-priority thread can be delayed when its transactions are rolled back due to conflicts with those of a lower-priority thread.
Solution
• Communication between the scheduler and the transaction contention manager.
• Carefully select the transactions to roll back, based on priority, amount of work done, etc.
• Convert read-only transactions to non-transactional form, in a manner similar to the pairing of locking with RCU.
• The writer must have the necessary primitives to support non-transactional readers.
• E.g., "A Relativistic Enhancement to Software Transactional Memory" by Philip Howard and Jonathan Walpole
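A victim-selection policy along the lines of "priority first, then work done" might look like the sketch below. The policy details, the `txn_info` fields, and the function name `pick_victim` are assumptions for illustration; real contention managers use a variety of heuristics.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-transaction bookkeeping visible to the contention manager. */
struct txn_info {
    int priority;             /* higher value means more important */
    unsigned long work_done;  /* e.g. number of objects accessed so far */
};

/* Given two conflicting transactions, return the one to roll back:
   the lower-priority one, or on equal priority the one that has done
   less work and therefore loses less when it restarts. */
const struct txn_info *pick_victim(const struct txn_info *a,
                                   const struct txn_info *b)
{
    assert(a != NULL && b != NULL);
    if (a->priority != b->priority)
        return a->priority < b->priority ? a : b;
    return a->work_done <= b->work_done ? a : b;
}
```

Favoring the transaction with more accumulated work is one way to limit starvation of large transactions by small ones.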
TM Problems: Privatization
Privatization
• An optimization technique that allows some data to be accessed non-transactionally.
Need
• Improves performance by temporarily exempting objects from the overhead of transactional access.
• Trades strong isolation for performance.
Problems
• Can break isolation guarantees, causing inconsistent concurrent access.
• Performance vs. correctness
TM Problems: Privatization
Certain STM optimizations can end up allowing concurrent access to privatized data!
[Figure: timeline of accesses to privatized data]
TM Problems & Improvements
Problem: Ratio of Data and Control Operation Overheads
• DBMS: data operations usually include reads and writes to mass storage devices, so transaction overhead is negligible by comparison.
• TM: data operations almost always consist only of reads and writes to memory, so transaction overhead is large by comparison.
Solution
• Use TM for heavyweight operations, such as grouping system calls.
Problem: Debuggability
• Transactions are difficult to debug: breakpoints cause unconditional aborts.
Solution
• The debugging issue can be addressed by using STM, which requires a high degree of compatibility between STM and HTM.
TM Problems & Improvements
Other Problems:
• Interaction with other systems is important. In practice it is complicated and expensive.
• Conflict-prone variables: data structures that unavoidably appear in every critical section cause excessive conflicts.
• Performance overhead from conflict resolution and excessive restarts when conflict rates are high.
Conclusion: Use the Right Tool For The Job!
• There is no silver bullet: successful adoption of multithreaded/multi-core CPUs will require a combination of techniques.
• Analogy with engineering: How many types of fasteners are there? How many subtypes? Nail, screw, clip, bolt, glue, joint, magnet...
• Neither locking nor TM solves the fundamental performance and scalability problems.
• Combine the strengths of various synchronization mechanisms according to need.
• Integrate with other techniques: “use the right tool for the job”.
• TM's applicability may increase if STM performance improves.
• Formalize and generalize existing techniques such as RCU.
Recent Work
• cx_spinlocks
  • A new hybrid TM and locking primitive
  • "TxLinux: Using and Managing Hardware Transactional Memory in an Operating System" by Christopher J. Rossbach, Owen S. Hofmann, Donald E. Porter, Hany E. Ramadan, Aditya Bhandari, and Emmett Witchel
• “Inevitable transactions”: special transactions containing non-idempotent operations (I/O).
  • Such transactions unconditionally abort any conflicting transactions, so non-idempotence is OK.
  • Allowing more than one concurrent inevitable transaction is necessary to achieve reasonable I/O performance, but its feasibility is an open question.
Recent Work
• Glue together relativistic programming and transactional memory to gain scalability for both readers and writers.
  • "A Relativistic Enhancement to Software Transactional Memory" by Philip Howard and Jonathan Walpole
Future Work
• Expand the comparison to include other synchronization mechanisms (message passing, deferred reclamation, RCU).
• Investigate combining different mechanisms:
  • TM and locking (much work in this area)
  • RCU and locking (typical use of RCU)
  • TM and RCU (very little work done here)
• There might still be hope for a “silver bullet”.
• But until then, it would be quite foolish to ignore combinations of existing mechanisms.
References
• Lecture slides from Winter 2008 by the authors
• "Parallel Programming with Transactional Memory" by Ulrich Drepper, Red Hat
• "Software Transactional Memory: Why Is It Only a Research Toy?" by Calin Cascaval, Colin Blundell, Maged Michael, Harold W. Cain, Peng Wu, Stefanie Chiras, and Siddhartha Chatterjee
• "Privatization Techniques for Software Transactional Memory" by Michael F. Spear, Virendra J. Marathe, Luke Dalessandro, and Michael L. Scott
• "Inevitability Mechanisms for Software Transactional Memory" by Michael F. Spear, Maged M. Michael, and Michael L. Scott
• http://en.wikipedia.org/wiki/Software_transactional_memory