Software Transactional Memory Should Not Be Obstruction Free. Robert Ennals Intel Research Cambridge 15 JJ Thomson Avenue, Cambridge, CB3 0FD, UK robert.ennals@intel.com presented by Ted Cooper for CS510 – Concurrent Systems (Spring 2014) Portland State University theod@pdx.edu.
Grand Context (courtesy of Professor Walpole) • Locking is slow and hard to get right. Clearly, non-blocking algorithms must be the answer! • But non-blocking algorithms (harder to get right) might starve out threads. Thus, they should be wait-free. • Wait-free algorithms must use “helping” to ensure all threads make progress, so they perform poorly, and are no simpler to reason about. • Transactions look like lock-based and sequential programs, so maybe they're easier to reason about. Can we make them fast? • But hardware transactional memory implementations have limits on transaction size and other problems, must coexist with locks in real systems, and don't seem to be faster than locks in practice. Can we at least get an STM that handles transactions of arbitrary size and length and performs reasonably? • What properties do we really need in an STM? Does it need to be some flavor of non-blocking?
STM Context • STM performance not stellar compared to conventional locks. • Processor speed growing faster than memory bandwidth. Can we reduce memory accesses to improve STM performance? • Do existing STM implementations maximize processor use? If not, can we improve processor use to improve performance? • “Obstruction-freedom” has been borrowed by STM researchers from distributed systems (which have independent failure domains, so it's important that one node be able to continue progressing if another fails). Is this a useful property for STM? How does it affect performance?
Terminology • Thread: Programmer-level idea, single parallelizable control flow. Think green threads, user-level threads. Transactions run on threads. • Task: OS-level idea, one runs per available core. Runtime multiplexes threads onto tasks. Think OS threads. • Non-blocking: At any given time, there is some thread whose progress is not blocked (e.g. by mutual exclusion). • Obstruction-free: A property non-blocking algorithms can have. If all other threads are suspended (i.e. no contention), a thread can complete its operation in a finite number of its own steps. This may require retrying. Does not guarantee progress in the presence of conflicting operations, e.g. livelock is possible • Obstruction-free is the weakest additional “natural” property a non-blocking algorithm can have.
Livelock? • Threads are doing work, but one's work prevents the other from progressing. Just like deadlock, you can have 2-participant, 3-participant, n-participant livelock. • “A real-world example of livelock occurs when two people meet in a narrow corridor, and each tries to be polite by moving aside to let the other pass, but they end up swaying from side to side without making any progress because they both repeatedly move the same way at the same time.” http://en.wikipedia.org/wiki/Deadlock#Livelock • In this example, each person's “sway deterministically until there is no obstacle” algorithm is obstruction-free since it can proceed if the other person holds still, but it is not guaranteed to make progress while the other person does the same thing.
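The corridor example can be sketched as a tiny simulation. This is an illustration only: the function name `corridor`, the two-sided corridor model, and the step limit are assumptions made for the sketch, not anything from the paper.

```python
def corridor(steps, other_holds_still=False):
    """Return the step at which the two walkers pass, or None if they never do.

    Each walker stands on side 0 or 1 of the corridor; they can pass only
    when they are on different sides.
    """
    a, b = 0, 0  # both start on the same side, blocking each other
    for step in range(1, steps + 1):
        a = 1 - a                 # walker A sways to the other side
        if not other_holds_still:
            b = 1 - b             # walker B makes the same move at the same time
        if a != b:
            return step           # different sides: A can pass
    return None                   # still blocked after `steps` attempts


print(corridor(1000))                          # both sway in lockstep: None
print(corridor(1000, other_holds_still=True))  # B "suspended": A passes at step 1
```

With the other walker held still (the "in isolation" case), the swaying algorithm finishes in one step. With both swaying in lockstep, it never finishes, which is the livelock that obstruction-freedom does not rule out.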
Non-blocking algorithms • Wait-free: Under contention, every thread makes progress, i.e. no starvation • Lock-free: Under contention, some thread makes progress. If multiple threads try to operate on the same data, someone will win. A given thread may never win, so could be starved, but the system as a whole will make progress, so no livelock. • Obstruction-free: In isolation (all contenders suspended), a given thread makes progress. Under contention, this progress may not be useful, i.e. 2 threads could forever interfere and retry, livelocking. • Hierarchy: every wait-free algorithm is lock-free, and every lock-free algorithm is obstruction-free.
Do we need obstruction-free STM? • STM common case: parallelizing existing sequential programs • Sequential programmers are used to blocking semantics, e.g. system calls(?) • If we map tasks to cores 1-1, and run in-flight transactions to completion before scheduling new ones, it's unlikely that any thread will be suspended mid-transaction, and only suspended transactions can block other transactions.
There is no one thread use case to rule them all • Threading for convenience: Multiple threads to track computations that proceed independently, e.g. compute and GUI threads. Blocking locks are fine here, may need priority levels for locks to ensure low-priority threads don't block high-priority threads. • Threading for performance: Actual concurrent computation is possible. Blocking fine in sequential code, so also fine in transactions (draw picture) • To STMify lock-based code, we can map lock-protected critical sections to transactions. This is no worse, since locks don't allow any concurrency in critical sections.
Obstruction-free misconception 1 • Misconception: Obstruction-freedom prevents a long-running transaction from blocking others • Counterexample: A transaction t reads an object x, computes for a year, writes to x. t completes only if any other transaction that needs x blocks until t finishes. So, either t blocks contending transactions or t never completes. • Question: Is it a problem for a transaction to block others of the same or lower priority?
Obstruction-free misconception 2 • Misconception: Obstruction-freedom prevents the system from locking up if a thread t is switched out mid-transaction. • Argument 1: The OS will always switch the task running t back in eventually (provided all tasks have the same OS scheduling priority), so you don't need obstruction-freedom to make progress as long as temporary interruptions are okay. • Argument 2: STM runtime can match the number of tasks to the number of available cores (dynamically). In this situation tasks (and the threads they run) will be switched out by the OS rarely, if ever. • Argument 3: STM runtime can start a new transaction on a given task only when that task's last transaction completes, i.e. the runtime never preempts an in-flight transaction. That is, we allow in-flight transactions to obstruct new ones :)
Obstruction-free misconception 3 • Misconception: Obstruction-freedom prevents the system from locking up if a thread t fails. i.e. the system should continue to make progress as a whole if transactions fail silently. • Argument 1: If it's a software failure, an equivalent lock-based or sequential program would also fail. • Argument 2: If it's a hardware failure, then a) node failures in distributed systems are common, while independent core failures in shared memory multiprocessors that don't bork the whole system are exceedingly rare, and b) again, a hardware failure would also break a lock-based or sequential program.
Improved cache locality • If object metadata lives in the same cache line as object data, only one memory access to load a shared object. If program is memory bandwidth-limited, performance is directly proportional to number of memory accesses. • Any metadata we can't fit in the object data cache line should live in memory that is private to a given transaction, so transactions don't fight over it and so it stays in one cache.
Improved cache locality cont'd • What does this have to do with obstruction-freedom? • No obstruction-free STM can store object metadata and data in the same cache line. They all require object data to be behind a level of indirection to prevent the following situation: • Transaction t is writing to object x and is switched out. • Transaction s runs, needs x. What can s do? • s could wait for t to finish with x, but that isn't obstruction-free. • s could access x, but if t wakes up again it might overwrite x, invalidating s's transaction and leaving s in an undefined state. • s could abort t, but we can't guarantee the abort has succeeded without an acknowledgement from t, and that isn't obstruction-free. Even if s could abort t, then t could restart and abort s, resulting in livelock. My question: Could we avoid livelock with a total ordering of abort precedence, i.e. s can abort t but t can't abort s? • This is the same reason we need pointers and copies in relativistic programming.
Optimal number of in-flight transactions • Consider N in-flight transactions on N cores. • A new transaction t tries to start before any of the N complete. • While t exists but has not yet been scheduled to run, it can make no progress in isolation, and so is not obstruction-free. • So as soon as t exists, we have to switch out an in-flight transaction and share N cores among N+1 transactions. • This introduces context-switching overhead, which was previously avoided, and which wastes cycles. • This also increases the number of concurrently running transactions, increasing the probability of conflicts among transactions. • Why not just let each transaction complete without context-switching it out, and once it completes run the new transaction in its task? Then we'd always have N transactions running on N cores.
What does a non-obstruction-free STM that employs these optimizations look like, and how does it perform against existing obstruction-free STMs?
The Lightweight Transaction Library • Ennals wrote a non-obstruction-free STM library to test these ideas. • In summary, it handily beats Fraser's STM and Fraser's C implementation of DSTM, both of which are obstruction-free. • It is available at: http://sourceforge.net/projects/libltx
Memory Layout • ltx designates a public memory region all transactions can access, where shared objects (and only shared objects) live. • It also allocates a private memory region to each transaction for the transaction state, which other transactions (usually) do not access. Each private region is allocated contiguously, starting at an aligned address, once and is reused by subsequent transactions that run on the same core, so it stays in that core's cache. This means that cache misses on private memory are rare.
What lives in private memory? • At the very beginning (i.e. the aligned base address), a descriptor for the transaction itself from which its priority can be determined. • Read and write descriptors, one for each shared object x the current transaction t has accessed. • Read descriptors contain: • x's version number as of the last time t read it. This is used to check whether t needs to restart because the data it read changed before t could commit. • A pointer to x, so t can read the data, check x's version, and check whether x has been locked for writing by another transaction.
What lives in private memory? cont'd • Write descriptors contain: • The object's version number as of the last time t read it. This is used to compute a new version number on a successful commit, or to roll x back to its previous version on abort. • A pointer to x, so t knows where to write on commit or abort. • A copy of x's object data. This is where t stages changes to x before committing. Note that unlike in RP, where changes are made visible by replacing a public pointer to the old version with a public pointer to the new version, ltx copies this staged object data back to the public object data during commit, enforcing the public/private division. This is unavoidable, since object metadata and data are stored adjacently in the public region in a fixed location (to avoid the extra memory accesses imposed by indirection).
Object handles • Each public object has a handle (metadata) stored adjacent to the object data. • The last bit of the handle signals whether a transaction is currently writing to the object x: • If 1, no transaction is currently writing, and the rest of the handle represents x's current version number. • If 0, a transaction t is currently writing to x, and the rest of the handle is a pointer to t's write descriptor (more on this later) for x. Some fixed number of higher order bits in this pointer can also be used to locate t's transaction descriptor, since private regions are allocated in aligned contiguous blocks.
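The handle encoding described above can be sketched with plain integers standing in for machine words. This is a minimal illustration, not the ltx code: the function names and the region size `REGION_SIZE` are hypothetical, and the sketch assumes the transaction descriptor sits at the aligned base of the private region, as the private-memory slide says.

```python
REGION_SIZE = 1 << 12  # assumed size and alignment of a private region (hypothetical)


def version_handle(version):
    """Unlocked handle: the version number with the low bit set to 1."""
    return (version << 1) | 1


def locked_handle(write_descriptor_addr):
    """Locked handle: address of the owner's write descriptor, low bit 0."""
    assert write_descriptor_addr % 2 == 0, "descriptor address must be even"
    return write_descriptor_addr


def is_locked(handle):
    return handle & 1 == 0


def version_of(handle):
    assert not is_locked(handle)
    return handle >> 1


def transaction_descriptor_addr(handle):
    """Keep only the high-order bits to find the aligned base of the region."""
    assert is_locked(handle)
    return handle & ~(REGION_SIZE - 1)


h = version_handle(7)
print(is_locked(h), version_of(h))   # False 7

d = 0x5000 + 0x40                    # a write descriptor inside the region at 0x5000
print(hex(transaction_descriptor_addr(locked_handle(d))))   # 0x5000
```

Because descriptors are word-aligned, their addresses always end in a 0 bit, so a single word can hold either a tagged version number or a pointer without ambiguity.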
• How could “Version Seen” be a pointer?
Writes • Managed using revocable two-phase locking: • A transaction locks every object to which it needs to write, but keeps enough information around to release the lock and restore the object to its previous state on abort. • If two transactions deadlock on write sets, one aborts. My question: How does deadlock detection work in this case? Does a transaction s who needs an object x locked by t use x's handle to find t's write descriptors and some record of the set of objects t intends to ultimately lock, compare that to its own write descriptors and pending locks, look for a cycle, and abort if it finds one?
Writes cont'd • How does t lock x for writing? • t reads x's handle. If it ends in a 1 then the rest is x's version number, and t stores that and a pointer to x in a write descriptor d, then uses a compare and swap or other atomic operation to replace x's handle with a pointer to d with a trailing 0. If the atomic operation succeeds, t has locked x. Otherwise some other transaction has concurrently updated x, and t must retry. If t successfully locks x, it makes a copy of x's object data in the write descriptor.
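The locking step above can be sketched as follows. This is a simulation under stated assumptions: Python has no hardware compare-and-swap, so a `threading.Lock` stands in for the atomic operation, and the locked handle holds the descriptor object itself rather than its address. The names `SharedObject`, `WriteDescriptor`, and `try_lock_for_write` are hypothetical.

```python
import threading


class SharedObject:
    """A public object: one handle word stored next to the object data."""

    def __init__(self, data):
        self.handle = (0 << 1) | 1      # version 0, low bit 1 = unlocked
        self.data = data
        self._guard = threading.Lock()  # stands in for the hardware atomic

    def compare_and_swap(self, expected, new):
        """Replace the handle with `new` only if it still equals `expected`."""
        with self._guard:
            if self.handle == expected:
                self.handle = new
                return True
            return False


class WriteDescriptor:
    def __init__(self, obj, version):
        self.obj = obj          # pointer to x
        self.version = version  # x's version when it was locked
        self.staged = None      # private copy of x's data, filled in after locking


def try_lock_for_write(obj):
    """One attempt; returns a WriteDescriptor on success, None if t must retry."""
    handle = obj.handle
    if not isinstance(handle, int):
        return None                        # already locked by another transaction
    d = WriteDescriptor(obj, handle >> 1)  # low bit is 1, the rest is the version
    if not obj.compare_and_swap(handle, d):
        return None                        # handle changed under us: retry
    d.staged = list(obj.data)              # copy the data only once the lock is held
    return d


x = SharedObject([1, 2, 3])
d = try_lock_for_write(x)
print(d.version, d.staged, x.handle is d)   # 0 [1, 2, 3] True
print(try_lock_for_write(x))                # None: x is already locked
```

In this sketch an already-locked object simply makes the attempt fail; the waiting and abort-request behavior is a separate step.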
Writes cont'd • What if x is already locked by another transaction s? • t (busy?) waits for a bounded number of cycles for x to become available. If this time expires and x is still locked, t gets s's transaction descriptor (available via the pointer in the locked handle) and checks s's priority. If s is of the same or lower priority than t, t requests that s abort itself.
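A rough sketch of this bounded wait follows. Several details are assumptions rather than facts from the slides: larger numbers mean higher priority, the abort request is modeled as a flag on the owner's descriptor, and a waiter that meets a higher-priority owner simply reports that it is still waiting. All names are hypothetical.

```python
class Transaction:
    def __init__(self, priority):
        self.priority = priority        # assumption: larger number = higher priority
        self.abort_requested = False    # set by a waiter, polled by the owner


def wait_then_maybe_request_abort(lock_is_free, me, owner, max_spins=1000):
    """Spin for a bounded number of checks, then consult the owner's priority."""
    for _ in range(max_spins):
        if lock_is_free():
            return "free"               # the owner released x while we waited
    if owner.priority <= me.priority:
        owner.abort_requested = True    # ask the owner to abort itself
        return "abort requested"
    return "still waiting"              # assumption: never ask a higher-priority owner


t = Transaction(priority=1)
s = Transaction(priority=1)
print(wait_then_maybe_request_abort(lambda: False, t, s, max_spins=10))  # abort requested
print(s.abort_requested)                                                 # True
```

Note that the waiter only requests the abort; the owner s performs it, which is why this scheme blocks rather than being obstruction-free.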
Reads • Managed using optimistic concurrency control: • t reads x's handle. If x is not locked, it logs the version number from the handle in a read descriptor for x, along with a pointer to x. If x is locked, t waits in the same fashion as for writing. • When t attempts to commit, it compares its logged copy of x's version number to the current value in x's handle, and the commit fails if they differ.
Commits • When t is ready to commit, it first checks whether it is still valid: • If no other transaction has written to an object in t's read set (i.e. the version numbers in the read descriptors still match the handles), t is valid. • If t is valid, it can commit. t must have locked all the objects in its write set, so we don't need to check those to determine validity. For each write descriptor d for an object x, t simply copies the updated object data in d (private memory) to the corresponding object data in public memory, then overwrites the lock in x's handle with an incremented version number for x, releasing the lock and publishing the new version of x in one fell swoop.
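The read logging and commit steps can be sketched in a single-threaded model. This is a simplification under stated assumptions: the version number and the lock owner are kept in two separate fields for readability, whereas ltx packs them into one handle word, and writes fail instead of waiting when the object is locked. The class names `Obj` and `Txn` are hypothetical.

```python
class Obj:
    def __init__(self, data):
        self.version = 0
        self.owner = None   # transaction holding the write lock, if any
        self.data = data


class Txn:
    def __init__(self):
        self.reads = []     # read descriptors: (object, version seen)
        self.writes = []    # write descriptors: [object, version seen, staged data]

    def read(self, obj):
        self.reads.append((obj, obj.version))   # log the version for validation
        return obj.data

    def write(self, obj, value):
        assert obj.owner is None, "sketch: no waiting, the lock must be free"
        obj.owner = self                        # lock x for writing
        self.writes.append([obj, obj.version, value])

    def commit(self):
        # Validate: every object read must still be at the version we logged
        # and must not be locked by another transaction.
        for obj, seen in self.reads:
            if obj.version != seen or obj.owner not in (None, self):
                self.abort()
                return False
        for obj, seen, staged in self.writes:
            obj.data = staged        # copy private staged data to public memory
            obj.version = seen + 1   # publish the incremented version number
            obj.owner = None         # release the lock
        return True

    def abort(self):
        for obj, _seen, _staged in self.writes:
            obj.owner = None         # unlock; public data was never modified


x, y = Obj(10), Obj(20)
t1, t2 = Txn(), Txn()
t1.write(y, t1.read(x) + 1)      # t1 reads x and stages a write to y
t2.write(x, 99)                  # t2 writes x and commits first
print(t2.commit())               # True
print(t1.commit())               # False: x changed after t1 read it
print(x.data, y.data, y.owner)   # 99 20 None
```

The write set needs no validation because every object in it is locked by the committing transaction; only the optimistic reads can have gone stale.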
Commits cont'd • What if t isn't valid? • t may have read inconsistent data and gone into a weird state, e.g. an infinite loop or a segfault from reading an out-of-date or corrupted array index. • Because we can't predict the behavior caused by inconsistent data, t may not retry properly, so the runtime has to periodically abort outstanding invalid transactions.
Performance Evaluation • Benchmarks on Fraser's testbed to ensure that comparison to Fraser's STM and C DSTM is fair. • SunFire 15K server • 106 UltraSparc III processors @ 1.2GHz • Benchmarks • Red-black tree and skip-list, both read and write random set elements, 75% reads, 25% writes.
Lower on y axis (CPU time per operation in microseconds) is better. • Key space varied to compare performance under contention • ltx takes 50-60% of the time of Fraser's STM and 35% of the time of C DSTM • Probably wins because of the cache locality optimization (fewer total memory accesses): compared to Fraser's STM, ltx incurs 48% as many L2 misses, 58% as many L1 misses, and 22% as many TLB misses
Lower on y axis (CPU time per operation in microseconds) is better. • Key space varied from 16 to 2^19 to compare performance under contention, number of processors used fixed at 90. • Under high contention (left region of each graph) ltx takes ~20% of the time of Fraser's STM, and C DSTM barely runs. • Fraser's transactions help blockers, so his STM performs poorly for the same reason wait-free algorithms do.
Lower on y axis (CPU time per operation in microseconds) is better. • Run on 4-way SPARC machine and number of tasks varied to measure effect of OS context-switching. • Unsurprisingly, as rate of context-switching increases performance degrades. • ltx more affected by context-switching than Fraser since switched-out transactions can block others in ltx, but ltx is still faster. • Under normal ltx deployment, number of tasks always upper-bounded by available cores, so context-switching rarely occurs.
Conclusions • Obstruction-freedom is not necessary for STM. • 2 non-obstruction-free STM optimizations that maximize cache locality and minimize context-switching are demonstrated in an implementation that outperforms existing best-in-class obstruction-free STM implementations. • Therefore, Ennals believes that STM designers should abandon obstruction-freedom. • But wait, ltx writers use locks. Weren't we trying to get away from locks?