
CS 3304 Comparative Languages

Learn about synchronization implementations in concurrency, including mutual exclusion, spin locks, barriers, and nonblocking algorithms. Understand memory consistency models and the cost of ordering in parallel processing systems.

dbowling

Presentation Transcript


  1. CS 3304 Comparative Languages • Lecture 24: Concurrency – Implementations • 12 April 2012

  2. Implementing Synchronization • Typically, synchronization is used to: • Make some operation atomic. • Delay that operation until some necessary precondition holds. • Atomicity: usually achieved with mutual exclusion locks. • Mutual exclusion ensures that only one thread is executing some critical section of code at any given point in time: • Much early research was devoted to figuring out how to build it from simple atomic reads and writes. • Dekker is generally credited with finding the first correct solution for two threads in the early 1960s. • Dijkstra: a version that works for n threads in 1965. • Peterson: a much simpler two-thread solution in 1981. • Condition synchronization: allows a thread to wait for a precondition, e.g. a predicate on the value(s) in one or more shared variables.

  3. Busy-Wait Synchronization • Busy-wait condition synchronization with atomic reads and writes is easy: • You just cast each condition in the form of “location X contains value Y” and you keep reading X in a loop until you see what you want. • Other forms are more difficult: • Spin locks: provide mutual exclusion. • Barriers: ensure that no thread continues past a given point in a program until all threads have reached that point.
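As a minimal sketch of busy-wait condition synchronization in Java: the consumer keeps reading a flag until it sees the value it wants. The class name `BusyWaitFlag` and the method `produceAndConsume` are hypothetical, and the flag is declared `volatile` on the assumption that the Java memory model is the one in force, so that the data written before the flag is visible once the flag is seen.

```java
// Hypothetical demo: one thread publishes a value, the caller spins until it is ready.
class BusyWaitFlag {
    private static int payload;              // ordinary shared data
    private static volatile boolean ready;   // the "location X contains value Y" condition

    static int produceAndConsume(int value) {
        ready = false;
        Thread producer = new Thread(() -> {
            payload = value;   // write the data first
            ready = true;      // then set the flag; the volatile write publishes payload
        });
        producer.start();
        while (!ready) {       // busy-wait: keep reading the flag in a loop
            Thread.onSpinWait();   // hint to the processor that we are spinning (Java 9+)
        }
        return payload;        // ordered after the volatile read that saw ready == true
    }

    public static void main(String[] args) {
        System.out.println(produceAndConsume(42));
    }
}
```

The waiting thread consumes processor cycles for the whole wait, which is the cost the later slides on scheduler-based synchronization address.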

  4. Spin Locks • Spin lock: a busy-wait mutual exclusion mechanism. • Processors have instructions for atomic read/modify/write. • The problem with spin locks is that they waste processor cycles and create excess demand for hardware resources (contention). • Synchronization mechanisms are needed that interact with a thread/process scheduler to put a thread to sleep and run something else instead of spinning. • Note, however, that spin locks are still valuable for certain things, and are widely used. • In particular, it is better to spin than to sleep when the expected spin time is less than the rescheduling overhead. • Reader-writer lock: allows concurrent access by reader threads, while a writer requires exclusive access.
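A spin lock built on an atomic read/modify/write operation might look like the following sketch. It uses `AtomicBoolean.compareAndSet` as the atomic primitive and reads the flag before attempting the swap (a test-and-test-and-set variant, chosen here to reduce contention). `TasSpinLock` and `countWithLock` are hypothetical names for illustration, not part of the lecture.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical test-and-test-and-set spin lock plus a small demo harness.
class TasSpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void acquire() {
        while (true) {
            // Read first; attempt the atomic swap only when the lock looks free.
            if (!held.get() && held.compareAndSet(false, true)) {
                return;
            }
            Thread.onSpinWait();   // busy-wait hint (Java 9+)
        }
    }

    void release() {
        held.set(false);
    }

    // Several threads increment a plain counter; the lock makes each increment atomic.
    static int countWithLock(int threads, int perThread) {
        TasSpinLock lock = new TasSpinLock();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    lock.acquire();
                    try {
                        counter[0]++;          // critical section
                    } finally {
                        lock.release();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(countWithLock(4, 10000));
    }
}
```

The critical section here is a single increment, which is the short-hold case in which spinning tends to be cheaper than rescheduling.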

  5. Barriers • In data-parallel algorithms the correctness often depends on making sure that every thread completes the previous step before any moves on to the next. • Globally shared counter - modified by an atomic fetch_and_decrement instruction: • Threads toggle their local sense. • Threads decrement the counter and wait. • The last thread (counter is 1) allows other threads to proceed: • Reinitializes the counter to n. • Sets the global sense to its local sense. • Sense reversing can lead to significant contention on large machines: • The fastest software barriers are O(log n). • Special hardware for near-constant-time.
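The sense-reversing barrier described above can be sketched in Java as follows, with `AtomicInteger.getAndDecrement` standing in for fetch_and_decrement. The class `SenseBarrier`, its `await` method, and the `runPhases` harness are hypothetical names; the harness simply counts how often a thread observes a neighbor that has not yet finished the current step.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sense-reversing barrier for a fixed number of threads.
class SenseBarrier {
    private final int n;
    private final AtomicInteger count;
    private volatile boolean sense = false;                       // global sense
    private final ThreadLocal<Boolean> localSense =
            ThreadLocal.withInitial(() -> false);                 // per-thread sense

    SenseBarrier(int n) {
        this.n = n;
        this.count = new AtomicInteger(n);
    }

    void await() {
        boolean mySense = !localSense.get();     // toggle the local sense
        localSense.set(mySense);
        if (count.getAndDecrement() == 1) {      // last thread to arrive
            count.set(n);                        // reinitialize the counter
            sense = mySense;                     // release the waiting threads
        } else {
            while (sense != mySense) {
                Thread.onSpinWait();             // spin until the global sense flips
            }
        }
    }

    // Each thread writes its step number, waits, then checks that all threads wrote it.
    static int runPhases(int threads, int phases) {
        SenseBarrier barrier = new SenseBarrier(threads);
        int[] step = new int[threads];
        AtomicInteger violations = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int id = 0; id < threads; id++) {
            final int me = id;
            workers[id] = new Thread(() -> {
                for (int s = 1; s <= phases; s++) {
                    step[me] = s;
                    barrier.await();             // everyone has written step s
                    for (int v : step) {
                        if (v != s) violations.incrementAndGet();
                    }
                    barrier.await();             // everyone has finished checking
                }
            });
            workers[id].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return violations.get();
    }

    public static void main(String[] args) {
        System.out.println(runPhases(3, 20));
    }
}
```

All waiting threads spin on the single `sense` variable, which is the source of the contention the slide mentions for large machines.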

  6. Nonblocking Algorithms • Compare-and-swap (CAS) - a universal primitive for single-location atomic update. • Lock-based update: acquire(L); r1 := x; r2 := foo(r1); x := r2; release(L) • versus CAS-based update: start: r1 := x; r2 := foo(r1); r2 := CAS(x, r1, r2); if !r2 goto start • Non-blocking: if the CAS operation fails, it is because some other thread has made progress. • Generalization: repeat: prepare (harmless if we need to repeat); CAS (if successful, completes in a way visible to all threads); until success; then clean up (performed by any thread if the original is delayed). • Advantages: tolerant of page faults and preemption; can be safely used in signal/interrupt handlers; can be faster. • Disadvantages: exceptionally subtle and difficult to devise.
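The CAS retry loop from this slide maps fairly directly onto `AtomicInteger.compareAndSet` in Java. The sketch below is illustrative only; `CasUpdate`, `update`, and `parallelIncrements` are hypothetical names, and `foo` is passed in as an `IntUnaryOperator`.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of the "read, compute, CAS, retry" pattern.
class CasUpdate {
    // Atomically replaces x with foo(x) without taking a lock; returns the new value.
    static int update(AtomicInteger x, IntUnaryOperator foo) {
        while (true) {                              // start:
            int r1 = x.get();                       //   r1 := x
            int r2 = foo.applyAsInt(r1);            //   r2 := foo(r1)
            if (x.compareAndSet(r1, r2)) {          //   CAS(x, r1, r2)
                return r2;
            }
            // CAS failed: some other thread changed x, so that thread made progress. Retry.
        }
    }

    static int parallelIncrements(int threads, int perThread) {
        AtomicInteger x = new AtomicInteger(0);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    update(x, v -> v + 1);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return x.get();
    }

    public static void main(String[] args) {
        System.out.println(parallelIncrements(4, 10000));
    }
}
```

Because `foo` may run more than once per update, it should be free of side effects, which is the "harmless if we need to repeat" requirement on the prepare step.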

  7. Memory Consistency Models • Hardware memory coherence alone is not enough to make a multiprocessor behave as most programmers would expect. • When more than one location is written at about the same time, the order in which the writes become visible to different processors becomes very important. • Sequential consistency: • All writes are visible to all processors in the same order. • Any given processor’s writes are visible in the order they were performed. • Very difficult to implement efficiently. • Relaxed memory models: • Certain loads and stores may appear to occur “out of order”. • Important ramifications for language designers, compiler writers, and the implementors of synchronization mechanisms and nonblocking algorithms.

  8. The Cost of Ordering • Straightforward implementations require both hardware and compilers to serialize operations. • Example - ordinary store instruction: • Temporal loop: • A’s write of inspected precedes its read of X in program order. • B’s write of X precedes its read of inspected in program order. • B’s read of inspected appears to precede A’s write of inspected, because it sees the unset value. • A’s read of X appears to precede B’s write of X as well, leaving us with xa = 0 and ib = false. • May also be caused by compiler optimization.

  9. Forcing Order • Avoiding temporal loops: use special synchronization or memory fence instructions. • Temporal loop - both A and B must prevent their read from bypassing (completing before) the logically earlier write: • Identifying the read or the write as a synchronization instruction • Sometimes more significant program changes are needed. • Fences and synchronization instructions may not suffice to solve the problem – concurrent propagation of writes. • Enclose the writes in a lock-based critical section.

  10. Data Race Freedom • Multiprocessor memory behavior - transitive happens-before relationship between instructions: • In certain cases an instruction on one processor happens before an instruction on another processor. • Write data-race-free programs according to some (language-specific) memory model: • Never perform conflicting operations unless they are ordered by the model. • Memory consistency models distinguish: • Data races (memory races): between ordinary loads and stores. • Synchronization races - between lock operations, volatile loads and stores, or other distinguished operations: • Temporal loop: avoid by declaring both X and inspected as volatile. • Concurrent propagation of writes: both C and D should read X and Y together in a single atomic operation.
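The temporal-loop example can be sketched in Java with both X and inspected declared volatile, as the slide suggests. Under the Java memory model, volatile accesses are totally ordered, so the outcome xa = 0 and ib = false should never be observed. The class `TemporalLoop` and the method `countTemporalLoops` are hypothetical; the harness just repeats the experiment and counts forbidden outcomes.

```java
// Hypothetical harness for the temporal-loop example with volatile variables.
class TemporalLoop {
    static volatile int X;
    static volatile boolean inspected;
    static int xa;          // result of A's read of X
    static boolean ib;      // result of B's read of inspected

    static int countTemporalLoops(int trials) {
        int forbidden = 0;
        for (int t = 0; t < trials; t++) {
            X = 0;
            inspected = false;
            Thread a = new Thread(() -> {
                inspected = true;   // A's write of inspected ...
                xa = X;             // ... precedes its read of X
            });
            Thread b = new Thread(() -> {
                X = 1;              // B's write of X ...
                ib = inspected;     // ... precedes its read of inspected
            });
            a.start();
            b.start();
            try {
                a.join();
                b.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            if (xa == 0 && !ib) {
                forbidden++;        // both reads bypassed the other thread's write
            }
        }
        return forbidden;
    }

    public static void main(String[] args) {
        System.out.println(countTemporalLoops(1000));
    }
}
```

Removing the `volatile` modifiers turns both accesses into ordinary loads and stores, at which point the program has a data race and the forbidden outcome becomes permissible.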

  11. Scheduler Implementation • OS-level processes must synchronize access to the ready list and condition queues, usually by means of spinning: • Assumes a single “low-level” lock (scheduler_lock) that protects the entire scheduler. • On a large multiprocessor we might increase concurrency by employing a separate lock for each condition queue, and another for the ready list. • Synchronization for sleep_on: disable_signals; acquire_lock(scheduler_lock); if not desired_condition sleep_on(condition_queue); release_lock(scheduler_lock); reenable_signals

  12. Scheduler-Based Synchronization • Busy-wait synchronization is generally level independent: • Consumes cycles that could be used for computation. • Makes sense only if the processor is idle or the expected wait time is less than the time required to switch contexts. • Scheduler-based synchronization is level dependent: • Specific to threads (language implementation) or processes (OS). • Semaphores were the first proposed scheduler-based synchronization mechanism, and remain widely used. • Conditional critical regions (CCRs), monitors, and transactional memory came later. • Bounded buffer abstraction - a concurrent queue of limited size into which producer threads insert data: • Buffer evens out the fluctuations. • The correct implementation requires both atomicity and condition synchronization.

  13. Semaphores • A semaphore is a special counter: • Has an initial value and two operations, P and V, for changing its value. • A semaphore keeps track of the difference between the number of P and V operations that have occurred. • A P operation is delayed (the process is de-scheduled) until #P-#V <= C, the initial value of the semaphore. • Semaphores are generally fair, i.e., processes complete P operations in the same order they start them. • Problems with semaphores: • They're pretty low-level: • When using them for mutual exclusion, it's easy to forget a P or a V, especially when they don't occur in strictly matched pairs. • Their use is scattered all over the place: • If you want to change how processes synchronize access to a data structure, you have to find all the places in the code where they touch that structure, which is difficult and error-prone.
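To illustrate P and V, here is a sketch of the bounded buffer mentioned earlier, built from three `java.util.concurrent.Semaphore` objects: one for mutual exclusion and two for condition synchronization. `acquireUninterruptibly` plays the role of P and `release` the role of V. The class `SemaphoreBuffer` and its method names are hypothetical.

```java
import java.util.concurrent.Semaphore;

// Hypothetical bounded buffer using semaphores for both atomicity and conditions.
class SemaphoreBuffer {
    private final int[] slots;
    private int head, tail;
    private final Semaphore mutex = new Semaphore(1);       // protects slots, head, tail
    private final Semaphore fullSlots = new Semaphore(0);   // counts items available
    private final Semaphore emptySlots;                     // counts free slots

    SemaphoreBuffer(int capacity) {
        slots = new int[capacity];
        emptySlots = new Semaphore(capacity);
    }

    void insert(int value) {
        emptySlots.acquireUninterruptibly();   // P: wait for a free slot
        mutex.acquireUninterruptibly();        // P: enter critical section
        slots[tail] = value;
        tail = (tail + 1) % slots.length;
        mutex.release();                       // V: leave critical section
        fullSlots.release();                   // V: announce a new item
    }

    int remove() {
        fullSlots.acquireUninterruptibly();    // P: wait for an item
        mutex.acquireUninterruptibly();
        int value = slots[head];
        head = (head + 1) % slots.length;
        mutex.release();
        emptySlots.release();                  // V: announce a free slot
        return value;
    }

    // A producer thread inserts 1..items; the caller removes them and returns the sum.
    static long transfer(int items, int capacity) {
        SemaphoreBuffer buffer = new SemaphoreBuffer(capacity);
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= items; i++) buffer.insert(i);
        });
        producer.start();
        long sum = 0;
        for (int i = 0; i < items; i++) sum += buffer.remove();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(100, 4));
    }
}
```

Swapping the two acquire calls in `insert` or `remove` would allow deadlock, which gives a sense of how easy these low-level operations are to misuse.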

  14. Semaphore Operations - Scheduler • Implementations of P and V for the scheduler operations. • The code for sleep_on cannot disable timer signals and acquire the scheduler lock itself because the caller needs to test a condition and then block as a single atomic operation.

  15. Language-Level Mechanisms • Semaphores are considered to be too “low level” for well-structured, maintainable code: • Their operations are simple subroutine calls that are easy to leave out. • Uses of a given semaphore tend to get scattered throughout a program (unless hidden inside an abstraction) - difficult to track down for purposes of software maintenance. • Other language mechanisms include: • Monitors. • Conditional critical regions. • Transactional memory. • Implicit synchronization.

  16. Monitors • Suggested by Dijkstra as a solution to the problems of semaphores (languages Concurrent Pascal, Modula, Mesa). • Monitor is a module or object with operations, internal state, and a number of condition variables: • Only one operation of a given monitor is allowed to be active at a given point in time (programmers are relieved of the responsibility of using P and V operations correctly). • A thread that calls a busy monitor is automatically delayed until the monitor is free. • An operation can suspend itself by waiting on a condition variable (not the same as semaphores – no memory). • All operations on the encapsulated data, including synchronization, are collected together. • Monitors have the highest-level semantics, but a few sticky semantic problems - they are also widely used.

  17. Monitor - Semantic Details • Hoare’s definition of monitors: • One thread queue for every condition variable. • Two bookkeeping queues: • Entry queue: threads that attempt to enter a busy monitor. • Urgent queue: when a thread executes a signal operation from within a monitor, and some other thread is waiting on the specific condition, then the signaling thread waits on the monitor’s urgent queue. • Monitor variations: • Semantics of the signal operation. • Management of mutual exclusion when a thread waits inside a nested sequence of two or more monitor calls. • Monitor invariant: a predicate that captures the notion that “the state of the monitor is consistent.” • Needs to be true initially and at monitor exit. • Monitors and semaphores are equally powerful.

  18. Signals • One signals a condition variable when some condition on which a thread may be waiting has become true. • To make sure the condition is still true when the thread wakes up, the thread needs to switch as soon as the signal occurs: we need the urgent queue. • Induces unnecessary scheduling overhead. • Mesa – signals are hints, not absolutes: if not desired_condition wait(condition_variable) becomes while not desired_condition wait(condition_variable) • Modula-3 takes a similar approach. • Concurrent Pascal - signal operation causes an immediate return from the monitor operation in which it appears: • Preserves invariant and low overhead but precludes algorithms in which a thread does useful work after signaling a condition.

  19. Nested Monitor Calls • Usually a wait in a nested sequence of monitor operations: • Releases mutual exclusion on the innermost monitor. • Leaves the outer monitors locked. • Can lead to deadlock if the only way for another thread to reach a corresponding signal operation is through the same outer monitors: • The thread that entered the outer monitor first is waiting for the second thread to execute a signal operation but the second thread is waiting for the first to leave the monitor. • Deadlock: any situation in which a collection of threads are all waiting for each other, and none of them can proceed. • Solution - release exclusion on outer monitors when waiting in an inner one – adopted by early uniprocessor implementations: • Requires that the monitor invariant holds at any subroutine call that may result in a wait or (Hoare semantics) signal in a nested monitor. • Such calls may not all be known to the programmer.

  20. Conditional Critical Regions • Proposed as an alternative to semaphores by Brinch Hansen. • Critical region - a syntactically delimited critical section in which the code is permitted to access a protected variable: • Specifies a Boolean condition that must be true before control enters: region protected_variable, when Boolean_condition do … end region • No thread can access the protected variable except within a region statement. • Any thread that reaches a region statement waits until the condition is true and no other thread is currently in a region for the same variable. • Nesting regions: a deadlock is possible. • Languages – Edison: • Influenced synchronization mechanism of Ada 95, Java, and C#.

  21. Synchronization in Ada 95 • In addition to the message passing of Ada 83, Ada 95 has a notion of protected object: • Three types of methods: functions, procedures, and entries. • Functions can only read the fields of the object. • Procedures and entries can read and write them. • An implicit reader-writer lock on the protected object ensures that potentially conflicting operations exclude one another in time. • Entries differ from procedures: • Can have a Boolean expression guard, for which the calling thread will wait before beginning execution. • Three special forms of call: • Timed: abort after waiting for a specified amount of time. • Conditional: execute alternative code if the call cannot proceed now. • Asynchronous: execute alternative code now, abort if call can proceed. • Ada 95 shared memory sync: a hybrid of monitors and CCRs.

  22. Synchronization in Java • Every object has an implicit mutual exclusion lock, acquired with synchronized. • Synchronized statements that refer to different objects may proceed concurrently. • Within a synchronized statement or method, a thread can suspend itself by calling the predefined method wait. • Threads can be awoken for spurious reasons: while (!condition) { wait(); } • Resuming a thread suspended on an object: • Some other thread must execute the predefined method notify from within a synchronized statement. • There is also notifyAll that awakes all threads. • Synchronization in Java is sort of a hybrid of monitors and CCRs (Java 5 added explicit locks and condition variables in java.util.concurrent) – similarly in C#.
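A small sketch of the wait-in-a-loop idiom with `synchronized`, `wait`, and `notifyAll`: a withdrawal blocks until the balance is large enough. The class `MonitorAccount` and its methods are hypothetical and exist only to illustrate the pattern.

```java
// Hypothetical monitor-style class: all methods are synchronized on the object's lock.
class MonitorAccount {
    private int balance;

    synchronized void deposit(int amount) {
        balance += amount;
        notifyAll();                 // wake all waiters; each rechecks its own condition
    }

    synchronized void withdraw(int amount) throws InterruptedException {
        while (balance < amount) {   // loop, because wakeups may be spurious or premature
            wait();                  // releases the object's lock while suspended
        }
        balance -= amount;
    }

    synchronized int balance() {
        return balance;
    }

    // The withdrawal of 70 can only complete after both deposits of 50 have been made.
    static int demo() {
        MonitorAccount account = new MonitorAccount();
        Thread withdrawer = new Thread(() -> {
            try {
                account.withdraw(70);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        withdrawer.start();
        account.deposit(50);
        account.deposit(50);
        try {
            withdrawer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return account.balance();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The first deposit wakes the waiting thread even though its condition does not yet hold, so replacing the `while` with an `if` would let the withdrawal proceed with too small a balance.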

  23. Lock Variables • C# and Java versions prior to 5: threads are never waiting for more than one condition. • Java 5 java.util.concurrent package provides a more general solution - explicit creation of Lock variables: Lock l = new ReentrantLock(); l.lock(); try { … } finally { l.unlock(); } • Lacks the implicit release at the end of scope associated with synchronized methods and statements. • Java objects using only synchronized methods: monitors. • A Java synchronized statement that begins with a wait in a loop resembles a CCR.
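One thing an explicit Lock makes possible is more than one condition per lock, via `Lock.newCondition()`. The sketch below is a hypothetical `BoundedCounter` (not from the lecture) in which increments wait on one condition and decrements wait on another; the `demo` method and its return format are assumptions made for illustration.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical counter constrained to 0..max, with a separate condition for each bound.
class BoundedCounter {
    private final Lock l = new ReentrantLock();
    private final Condition belowMax = l.newCondition();   // waited on by increment
    private final Condition aboveZero = l.newCondition();  // waited on by decrement
    private final int max;
    private int value;
    private int peak;                                      // largest value ever observed

    BoundedCounter(int max) {
        this.max = max;
    }

    void increment() {
        l.lock();
        try {
            while (value == max) belowMax.awaitUninterruptibly();
            value++;
            peak = Math.max(peak, value);
            aboveZero.signal();          // one decrement is now possible
        } finally {
            l.unlock();                  // explicit release; no implicit end-of-scope unlock
        }
    }

    void decrement() {
        l.lock();
        try {
            while (value == 0) aboveZero.awaitUninterruptibly();
            value--;
            belowMax.signal();           // one increment is now possible
        } finally {
            l.unlock();
        }
    }

    // Returns {final value, peak value} after `ops` increments and `ops` decrements.
    static int[] demo(int max, int ops) {
        BoundedCounter counter = new BoundedCounter(max);
        Thread up = new Thread(() -> {
            for (int i = 0; i < ops; i++) counter.increment();
        });
        up.start();
        for (int i = 0; i < ops; i++) counter.decrement();
        try {
            up.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {counter.value, counter.peak};
    }

    public static void main(String[] args) {
        int[] result = demo(2, 50);
        System.out.println(result[0] + " " + result[1]);
    }
}
```

With a single implicit condition per object, both kinds of waiter would share one queue and `notifyAll` would be needed to avoid waking the wrong kind.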

  24. The Java Memory Model • Specifies exactly: • Which operations are guaranteed to be ordered across threads. • For every read/write pair, whether the read is permitted to return the value written by the write. • A Java thread is allowed to: • Buffer or reorder its writes until the point at which it writes a volatile variable or leaves a monitor. • Keep cached copies of values written by other threads until it reads a volatile variable or enters a monitor. • The compiler can: • Reorder ordinary reads/writes in the absence of intrathread data dependences. • It cannot reorder volatile accesses, monitor entry, or monitor exit with respect to one another.

  25. Transactional Memory • Locks (semaphores, monitors, CCRs) make it easy to write data-race-free programs but they do not scale: • Adding processors and threads: the lock becomes a bottleneck. • We can partition program data into equivalence classes: a critical section must acquire a lock for every accessed equivalence class. • Different critical sections may acquire locks in different orders: deadlock can result. • Enforcing a common order can be difficult. • Locks may be too low level a mechanism. • The mapping between locks and critical sections is an implementation detail from a semantic point of view: • What we really want is a composable atomic construct: transactional memory (TM).
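To make the lock-ordering problem concrete, here is a sketch in which two threads transfer in opposite directions between two accounts. Deadlock is avoided by always acquiring the lock of the account with the smaller id first. The names `OrderedTransfer`, `Account`, and `transfer` are hypothetical, and the sketch assumes that ids are distinct and that `from` and `to` are different accounts.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example of enforcing a common lock acquisition order.
class OrderedTransfer {
    static final class Account {
        final int id;                                   // assumed unique; defines lock order
        int balance;
        final ReentrantLock lock = new ReentrantLock();

        Account(int id, int balance) {
            this.id = id;
            this.balance = balance;
        }
    }

    static void transfer(Account from, Account to, int amount) {
        // Always lock the lower-id account first, whatever the direction of transfer.
        Account first = from.id < to.id ? from : to;
        Account second = (first == from) ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    // Opposite-direction transfers run concurrently; the total should be conserved.
    static int demo(int rounds) {
        Account a = new Account(1, 1000);
        Account b = new Account(2, 1000);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) transfer(a, b, 1);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) transfer(b, a, 1);
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println(demo(10000));
    }
}
```

The ordering rule has to be known and followed by every critical section that touches two accounts, which is the composability problem that motivates transactional memory.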

  26. Atomicity without Locks • Transactions have been used for atomicity in databases. • The basic idea of TM: • The programmer labels code blocks as atomic. • The underlying system takes responsibility for executing those blocks in parallel whenever possible. • If the code inside the atomic block can safely be rolled back in the event of conflict, then the implementation can be based on speculation. • Implementation: rather surprising amount of variety. • Challenges: • What should we do about operations inside transactions that cannot easily be rolled back (I/O, system calls)? • How should we discourage programmers from creating large transactions? • How should transactions interact with locks and nonblocking data structures?

  27. Implicit Synchronization • Thread operations on shared data are restricted in such a way that synchronization can be implicit in the operations themselves, rather than appearing as explicit operations: • Example: the forall loop of HPF and Fortran 95. • Dependence analysis: compiler identifies situations in which statements within a loop do not depend on one another and can proceed without synchronization. • Automatic parallelization: • Considerable success with well-structured data-parallel programs. • Thread level, for irregularly structured programs, is very difficult.
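Java has no forall loop, but a parallel stream over an index range is a rough analogue when the iterations are independent: the programmer writes no explicit synchronization, and the only synchronization point is the end of the loop. This is an analogy rather than the HPF or Fortran 95 construct itself, and the independence of iterations is asserted by the programmer here, not discovered by the compiler. `ForallSquares` and `squares` are hypothetical names.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical forall-style loop: each iteration writes only its own array element.
class ForallSquares {
    static int[] squares(int n) {
        int[] out = new int[n];
        // Iterations may run on different threads; none depends on another's result.
        IntStream.range(0, n).parallel().forEach(i -> out[i] = i * i);
        // forEach returns only after every iteration has completed.
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squares(5)));
    }
}
```

If the loop body read `out[i - 1]`, the iterations would no longer be independent and this form would be incorrect, which is the kind of dependence a parallelizing compiler has to detect.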

  28. Futures • Implicit synchronization without compiler analysis: • Multilisp Scheme: (future (my-function my-args)) • In a purely functional program, semantically neutral. • Executes its work in parallel until it detects an attempt to perform an operation that is too complex for the system to run safely in parallel. • Work in a future is suspended if it depends in some way on the current continuation, such as raising an exception. • C# 3.0/Parallel FX: Future class. • Java: Future interface and Executor objects. • CC++: single-assignment variable. • Linda: a set of subroutines that manipulate a shared abstraction called the tuple space. • Parallel logic programming - AND and OR (speculative) parallelism: fail to adhere to the deterministic search.
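In Java, a future is obtained by submitting a task to an `ExecutorService`; the caller synchronizes with the task only when it asks for the result with `get()`. The sketch below splits a summation in two, computing one half in a future and the other half in the calling thread. `FutureSum`, `sumRange`, and `parallelSum` are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: half of a sum is computed in a future.
class FutureSum {
    // Sums the integers in [lo, hi).
    static long sumRange(int lo, int hi) {
        long sum = 0;
        for (int i = lo; i < hi; i++) sum += i;
        return sum;
    }

    // Sums the integers in [0, n).
    static long parallelSum(int n) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Callable<Long> leftHalf = () -> sumRange(0, n / 2);
            Future<Long> left = pool.submit(leftHalf);   // starts running in the pool
            long right = sumRange(n / 2, n);             // meanwhile, work in this thread
            return left.get() + right;                   // get() blocks until the future is done
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(101));
    }
}
```

Because `sumRange` has no side effects, wrapping it in a future does not change the result, only when and where it is computed.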

  29. Message Passing • Most concurrent programming on large multicomputers and networks is currently based on messages. • To send/receive a message, one must generally specify where to send it to, or where to receive it from: communication partners need names for one another: • Addressing messages to processes: Hoare’s CSP (Communicating Sequential Processes). • Addressing messages to ports: Ada. • Addressing messages to channels: Occam. • Ada’s comparatively high-level semantics for parameter modes allows the same set of modes to be used for both subroutines and entries (rendezvous). • Some concurrent languages provide parameter modes specifically designed with remote invocation in mind.
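Within a single Java program, channel-style addressing can be approximated with a `BlockingQueue` that both partners name: the sender puts messages on the channel and the receiver takes them off. This is only an in-process analogy to the channels of a language such as Occam. The class `ChannelDemo`, the method `sendAndSum`, and the use of -1 as an end-of-stream message are assumptions made for this sketch.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sender/receiver pair communicating only through a named channel.
class ChannelDemo {
    private static final int END = -1;   // assumed end-of-stream marker

    static int sendAndSum(int messages) {
        BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(1);
        Thread sender = new Thread(() -> {
            try {
                for (int i = 1; i <= messages; i++) {
                    channel.put(i);      // blocks while the channel is full
                }
                channel.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        sender.start();

        int sum = 0;
        try {
            // take() blocks until a message is available.
            for (int m = channel.take(); m != END; m = channel.take()) {
                sum += m;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sendAndSum(10));
    }
}
```

The two threads share no variables other than the channel itself, so all synchronization is carried by the send and receive operations.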

  30. Summary • We focus on shared-memory programming models and on synchronization in particular. • We distinguish between atomicity and condition synchronization, and between busy-wait and scheduler-based implementation. • Busy-wait mechanisms include spin locks and barriers. • Scheduler-based implementations include semaphores, monitors, and conditional critical regions. • Transactional memories sacrifice performance for the sake of programmability.
