280 likes | 454 Views
The Synergy Between Non-blocking Synchronization and Operating System Structure. By Michael Greenwald and David Cheriton Presented by Jonathan Walpole. The Cache Kernel. Yet Another Attempt at a µ-kernel Minimal privileged mode code: Caching model of kernel functionality
E N D
The Synergy Between Non-blocking Synchronization and Operating System Structure By Michael Greenwald and David Cheriton Presented by Jonathan Walpole CS510 – Concurrent Systems
The Cache Kernel • Yet Another Attempt at a µ-kernel • Minimal privileged mode code: Caching model of kernel functionality • OS functionality in user-mode class libraries allowing application-customizability of OS • Shared memory and signal-based communication are the only communication mechanisms • Goal: scalable, robust, flexible operating system design CS510 – Concurrent Systems
Synergy of NBS and kernel structure • Claim: NBS and Good Design go together in the Cache Kernel design and implementation • NBS allows better OS structuring • Good OS structure simplifies NBS CS510 – Concurrent Systems
Non-blocking Synchronization • Basic idea of NBS • Associate version number with data structure • Read version number at start of critical section • Atomically check and increment version number at end of critical section at the same time as applying the update(s), using a Double Compare and Swap (DCAS) instruction. • Roll back and retry on failure (i.e., if version number changed while you were in the critical section). • An optimistic synchronization technique similar to that of the Synthesis Kernel CS510 – Concurrent Systems
Example: Deletion from Linked List do { retry: backoffIfNeeded(); version = list->version; /* Read version # */ for (p = list->head;(p->next != elt); p = p->next) { if (p == NULL) { /* Not found */ if (version != list->version) { goto retry; } /* List changed */ return NULL; /* Really not found */ } } } while( !DCAS( &(list->version), &(p->next), version, elt, version+1, elt->next ) ) CS510 – Concurrent Systems
Double Compare and Swap int DCAS( int *addr1, int *addr2, int old1, int old2, int new1, int new2){<begin atomic> if ((*addr1 == old1) && (*addr2 == old2)) { *addr1 = new1; *addr2 = new2; return(TRUE); } else { return(FALSE); }<end atomic>} CS510 – Concurrent Systems
Hardware Support for DCAS • Could be supported through extension to LL/SC (load-linked/store-conditional) sequence • LLP/SCP • LLP (load-linked-pipelined) • Load and link a second address after an earlier LL • LLP is linked to the following SCP • SCP (store-conditional-pipelined) • Store depends on both previous LLP and LL not having been invalidated in this CPU’s cache • If LLP/SCP fails, so does enclosing LL/SC CS510 – Concurrent Systems
Software Support for DCAS • Basic idea: make DCAS operations atomic by putting them in a critical section protected by a lock! • OS manages the lock (how many locks do you need?) • Presumably, DCAS is a system call • Must make sure that delayed lock holders release the lock and roll back • Ensures that DCAS still has non-blocking properties • Problems implementing this efficiently • How to support readers? • How to avoid lock contention? CS510 – Concurrent Systems
Back to the List Example do { retry: backoffIfNeeded(); version = list->version; /* Read version # */ for (p = list->head;(p->next != elt); p = p->next) { if (p == NULL) { /* Not found */ if (version != list->version) { goto retry; } /* List changed */ return NULL; /* Really not found */ } } } while( !DCAS( &(list->version), &(p->next), version, elt, version+1, elt->next ) ) CS510 – Concurrent Systems
Problem … • What happens if another thread deletes an element while we are traversing the list? • We may end up with an invalid pointer in p • What if we traverse it into the free pool? • What if the memory has been reused? • What if we end up in a different data structure? • What if the memory is reused for a different type? • How can we prevent this kind of reader hijacking? • Or at least, how can we prevent it from crashing the reader? CS510 – Concurrent Systems
Type-stable memory management (TSM) • Descriptor that is allocated as type T is guaranteed to remain of type T for at least tstable • A generalization of existing technique – allocation pools: • e.g. process-descriptors are statically allocated at system init, and are type-stable for lifetime of system • But is this sufficient to ensure that read-side code will reach the DCAS in the presence of concurrent modification? • And how long is tstable ? • When is it safe to reuse memory? CS510 – Concurrent Systems
Other solutions to reader hijacking • If memory is reclaimed immediately upon updates, readers must protect themselves from hijacking • But how? • Must ensure that a pointer is valid before dereferencing it • Additional checks at the end of the read side critical section are required to ensure no concurrent write access took place CS510 – Concurrent Systems
Claim: TSM aids NBS • Type stability ensures safety of pointer dereferences. • Without TSM, delete example is too big to fit on slide • And very expensive to execute • Need to check for changes on each loop iteration • TSM makes NBS code simpler and faster CS510 – Concurrent Systems
Other benefits of NBS in Cache Kernel • Signals are only form of IPC in Cache Kernel • NBS simplifies synchronization in signal handlers • Makes it easier to write efficient signal-safe code CS510 – Concurrent Systems
Contention Minimizing Data Structures (CMDS) • Locality-based structuring used in Cache kernel: • Replication: Per-processor structures (ie. run queue) • Hierarchical data structures with read-mostly high levels (ie. hash tables with per bucket synchronization) • Cache block alignment of descriptors to minimize false sharing and improve cache efficiency • Well-known techniques used in other systems (Tornado, Synthesis, …) CS510 – Concurrent Systems
Benefits of CMDS • Minimizes logical and physical contention • Minimizes (memory) consistency overhead • Minimizes false sharing • Reduces lock conflicts/convoys in lock-based system • Reduces synchronization contention with NBS • fewer retries from conflict at point of DCAS • CMDS is good for locks, cache consistency and NBS! • NBS needs CMDS CS510 – Concurrent Systems
Minimizing the Window of Inconsistency • Delaying writes, and grouping them together at the end minimizes the window of inconsistency • Advantages of a small window of inconsistency • Less probability of failure leaving inconsistency • Preemption-safe: easy to back-out of critical section • Reduces lock hold time & contention (in lock-based systems) • Its good for system design whether you use blocking or non-blocking synchronization! CS510 – Concurrent Systems
How Does NBS Affect Contention? • What is the relationship between window of inconsistency and probability of conflict (contention)? • NBS approaches have a critical section just like locking approaches • NBS critical section is defined by the scope of the version number • This is NOT the same as the window of inconsistency! CS510 – Concurrent Systems
Priority Inversion Issues • NBS allows synchronization to be subordinate to scheduling • It avoids priority inversion problem of locks • Also, page fault,I/O, and other blocking operations • Does the highest priority process always make progress? • Is there a retry-based equivalent to priority inversion? CS510 – Concurrent Systems
Why NBS is good for OS structure • Fail-stop safe: OS class libraries tolerant of user threads being terminated. • Most Cache Kernel OS functionality implemented in user-mode class libraries • Portable: same code on uniprocessor, multiprocessor, and signal handlers • Deadlock free (almost) • See examples of deletion and insertion without going through reclamation CS510 – Concurrent Systems
Implementing non-blocking Synchronization • Basic approach • Read version number • Rely on TSM for type safety of pointers • Increment version number and check with every modification (abort if changed). • Straightforward transformation from locking • … so long as you have DCAS and can tolerate TSM • Replace acquire with read of version number • Replace release with DCAS … CS510 – Concurrent Systems
Implementing non-blocking Synchronization (cont.) • Variants of the basic approach: • N reads, 1 write: No backout • 2 reads, 2 writes: No version number • In the Cache kernel, every case of synchronization falls into the special cases CS510 – Concurrent Systems
Complexity and Correctness • DCAS reduces size of algorithms (lines of code) by 75% compared to CAS, for linked lists, queues and stacks • Special case uses of DCAS reduce complexity further • Relatively straightforward transformation from locking • Similar code size CS510 – Concurrent Systems
Performance Issues • Simulation-based study • With non-preemptive scheduling: • DCAS-based NBS almost as fast as spin-locks • CAS-based NBS slower • With preemptive scheduling: • DCAS and CAS-based NBS better than spin-locks • Note, this was using mid 1990’s CPUs • The cost of CAS is much higher on today’s CPUs • What are the implications of this? CS510 – Concurrent Systems
Hardware-Based Optimizations • A hardware-based optimization using advisory locking • Cache based advisory locking can help avoid contention in the form of “useless parallelism” • Instead of just checking at the end and forcing retries, check at the beginning and back off if retry is inevitable • Uses Cload instruction (conditional load) which succeeds only if the location does not have an advisory lock set, and sets the lock on success • Initially load version # with Cload, wait and retry on failure (seems like using TSL in a spin-lock?) CS510 – Concurrent Systems
Conclusions • Claim: “Good OS structure” can support Non-blocking synchronization • Type-Stable Mem. Mgmt (TSM) • Data Structures that Minimize Contention (CMDS) • Minimizing the window of inconsistency • Claim: Non-blocking synchronization can support convenient OS structure • Avoids deadlock; allows signals as sole IPC mechanism • Fault tolerant; kernel functionality can be moved to user space • Performance; isolates scheduling from synchronization • Claim: Strong synergy between non-blocking synchronization & good OS structure. CS510 – Concurrent Systems
Advantages of Non-blocking Synchronization • Non-blocking: • No deadlock! Especially useful for signal handlers. • Portability: • same code on uni and multiprocessors and in signal handlers • Performance: • Minimizes interference between synchronization and process scheduling • Recovery: • Insulation from process failures • So why isn’t it universally deployed? CS510 – Concurrent Systems
Obstacles to deployment • Complexity: • confusing to design and write efficient algorithms, especially in the absence of a DCAS primitive • Correctness: • Its hard to be convinced that there are no subtle bugs, especially on weak memory consistency architectures. • Is TSM enough to preserve critical section invariants? • Performance: • Poor performance under contention due to excessive retry • high overhead due to expense of CAS on modern CPUs CS510 – Concurrent Systems