CS 3304 Comparative Languages • Lecture 22: Concurrency - Fundamentals • 5 April 2012
Introduction • The classic von Neumann (stored-program) model of computing has a single thread of control; parallel programs have more than one thread of control. • Motivations for concurrency: • To capture the logical structure of a problem. • To exploit extra processors, for speed. • To cope with separate physical devices. • Concurrent: any system in which two or more tasks may be underway (at an unpredictable point in their execution). • Concurrent and parallel: more than one task can be physically active at once (requires more than one processor). • Concurrent, parallel, and distributed: processors are associated with devices that are physically separated in the real world.
Process • A process or thread is a potentially-active execution context. • Processes/threads can come from: • Multiple CPUs. • Kernel-level multiplexing of a single physical machine. • Language- or library-level multiplexing of a kernel-level abstraction. • They can run: • In true parallel. • Unpredictably interleaved. • Run-until-block. • Most work focuses on the first two cases, which are equally difficult to deal with. • A process can be thought of as an abstraction of a physical processor.
Levels of Parallelism • Circuit and gate level: signals can propagate down thousands of connections at once. • Instruction-level parallelism (ILP): parallelism among the instructions of a machine-language program. • Vector parallelism: an operation repeated on every element of a very large data set. • Thread-level parallelism: coarser grain (e.g., multicore processors); represents a fundamental shift, because parallelism must now be written explicitly into the high-level structure of the program.
Levels of Abstraction • “Black box” parallel libraries: parallelism under the hood. • Programmer-specified mutually independent tasks. • Synchronization: serves to eliminate races between threads by controlling the ways in which their actions can interleave in time. • A race condition occurs whenever two or more threads race toward points in the code at which they touch some common object; the behavior of the system depends on which thread gets there first. • Synchronization makes some sequence of instructions (a critical section) appear to be atomic, i.e., to happen all at once. • Implementing synchronization mechanisms may require a good understanding of the hardware and the run-time system.
Race Conditions • A race condition occurs when actions in two processes are not synchronized and program behavior depends on the order in which the actions happen. • Race conditions are not all bad; sometimes any of the possible program outcomes are acceptable (e.g., workers taking things off a task queue). • A race we want to avoid: • Suppose processors A and B share memory, and both try to increment variable X at more or less the same time. • Very few processors support arithmetic operations directly on memory, so each processor executes a three-instruction sequence: load X; increment; store X. • If both processors execute these instructions simultaneously, X could go up by one or by two (a C sketch of this lost update follows).
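The lost-update scenario above can be reproduced with a short program. This is an illustrative sketch, not part of the slides; it assumes POSIX threads, and the function name and iteration count are arbitrary:

    /* Illustrative race sketch, using POSIX threads. Compile with: cc -pthread race.c */
    #include <pthread.h>
    #include <stdio.h>

    static long x = 0;                      /* shared variable "X" */

    static void *increment_many(void *arg) {
        for (int i = 0; i < 1000000; i++)
            x++;                            /* compiles to load X; increment; store X */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, increment_many, NULL);
        pthread_create(&b, NULL, increment_many, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Interleaved load/inc/store sequences lose updates, so the total is
           usually less than 2000000. */
        printf("x = %ld (expected 2000000)\n", x);
        return 0;
    }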
Synchronization • Synchronization is the act of ensuring that events in different processes happen in a desired order. • Synchronization can be used to eliminate race conditions: in our example we need to synchronize the increment operations to enforce mutual exclusion on access to X. • Most synchronization can be regarded as one of the following: • Mutual exclusion: making sure that only one process is executing a critical section (e.g., touching a variable) at a time, usually by means of a mutual exclusion lock (acquire/release). • Condition synchronization: making sure that a given process does not proceed until some condition holds (e.g., until a variable contains a given value).
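As an illustration of mutual exclusion, here is a hedged sketch of the same increment protected by a pthread mutex; lock/unlock play the roles of acquire/release, and the names are again made up for the example:

    /* Illustrative sketch: the increment made safe with a pthread mutex. */
    #include <pthread.h>
    #include <stdio.h>

    static long x = 0;
    static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *increment_many(void *arg) {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&x_lock);    /* acquire: enter the critical section */
            x++;                            /* only one thread touches x at a time */
            pthread_mutex_unlock(&x_lock);  /* release */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, increment_many, NULL);
        pthread_create(&b, NULL, increment_many, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("x = %ld\n", x);             /* now reliably 2000000 */
        return 0;
    }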
Multithreaded Programs • Motivations: • Hardware: I/O, device handlers. • Simulation: discrete events. • Web-based applications: browsers. • Using many threads ensures that comparatively fast operations do not wait for slow ones. • If a thread blocks, the implementation automatically switches to another thread. • Preemptive approach: the implementation automatically switches among threads even in the absence of blocking, to prevent any one thread from monopolizing the CPU.
Dispatch Loop Alternative • A dispatch loop centralizes the handling of all delay-inducing events in a single loop. • The programmer must identify all tasks and the corresponding state information. • It is very complex to subdivide tasks and save their state. • The principal problem: it hides the algorithmic structure of the program. • Every distinct task could be described elegantly, if not for the fact that we must return to the top of the dispatch loop at every delay-inducing operation. • It turns the program inside out: • The management of tasks is explicit. • The control flow within tasks is implicit.
Multiprocessor Architecture • Single-site (nondistributed) parallel computers take one of two forms: • Processors that share access to a common memory. • Processors that must communicate with messages. • Shared-memory machines are typically referred to as multiprocessors: • Smaller ones (2-8 processors) are usually symmetric. • Larger ones employ a distributed-memory architecture, where each memory bank is physically adjacent to a particular processor or a small group of processors. • Two main classes of programming notation: • Synchronized access to shared memory: a processor can read/write remote memory without the assistance of another processor. • Message passing between processes that don't share memory: requires the active participation of processors at both ends.
Shared Memory • To implement synchronization you have to have something that is atomic: • That means it happens all at once, as an indivisible action. • In most machines, reads and writes of individual memory locations are atomic (note that this is not trivial; memory and/or buses must be designed to arbitrate and serialize concurrent accesses). • In early machines, reads and writes of individual memory locations were all that was atomic. • To simplify the implementation of mutual exclusion, hardware designers began in the late 1960s to build so-called read-modify-write, or fetch-and-phi, instructions into their machines.
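For illustration, a minimal sketch of a fetch-and-add style read-modify-write using C11 atomics (an assumption: a C11-capable compiler). atomic_fetch_add performs the load, add, and store as one indivisible action, so no update can be lost:

    /* Illustrative sketch of a read-modify-write ("fetch-and-phi") operation. */
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_long x = 0;

    static void increment(void) {
        atomic_fetch_add(&x, 1);            /* hardware-supported atomic increment */
    }

    int main(void) {
        increment();
        printf("x = %ld\n", atomic_load(&x));
        return 0;
    }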
Memory Coherence • Caches pose a serious problem on shared-memory machines: • A processor that has cached a particular memory location may not see changes made to that location by another processor. • The coherence problem: how to keep cached copies of a memory location consistent with one another. • Bus-based symmetric machines: relatively easy; they leverage the broadcast nature of the bus, so when a processor needs to write a cache line, it requests an exclusive copy and waits for other processors to invalidate their copies. • Without a broadcast bus: difficult; notifications take time to propagate, and the order in which updates are seen matters (consistency is hard). • Supercomputers: how to accommodate nonuniform access times and the lack of hardware support for shared memory across the full machine.
Concurrent Programming Fundamentals • Thread: an active entity that the programmer thinks of as running concurrently with other threads. • Threads are built on top of one or more processes provided by the operating system: • Heavyweight processes: each has its own address space. • Lightweight processes: share an address space. • Task: a well-defined unit of work that must be performed by some thread: • A collection of threads may share a common “bag of tasks”. • Terminology is inconsistent across systems and authors.
Communication and Synchronization • Communication: any mechanism that allows one thread to obtain information produced by another. • Shared memory: the program’s variables are accessible to multiple threads. • Message passing: threads have no common state. • Synchronization: any mechanism that allows the programmer to control the relative order in which operations occur on different threads. • With shared memory, synchronization is not implicit and requires special constructs. • With message passing, synchronization is implicit in the communication. • Synchronization implementation: • Spinning (busy-waiting): a thread runs in a loop reevaluating some condition (makes no sense on a uniprocessor); a spin-lock sketch follows. • Blocking (scheduler-based): the waiting thread voluntarily relinquishes its processor to some other thread (requires a data structure associated with the synchronization action).
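A minimal sketch of the spinning approach, assuming C11 atomics: a test-and-set spin lock that busy-waits until the flag is cleared. A blocking (scheduler-based) implementation would instead enqueue the waiting thread and transfer to another one. Names here are illustrative, not from the slides:

    #include <stdatomic.h>

    static atomic_flag lock = ATOMIC_FLAG_INIT;

    static void spin_acquire(void) {
        while (atomic_flag_test_and_set(&lock))
            ;   /* spin: re-test until the previous holder clears the flag */
    }

    static void spin_release(void) {
        atomic_flag_clear(&lock);
    }

    int main(void) {
        spin_acquire();
        /* ... critical section ... */
        spin_release();
        return 0;
    }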
Languages and Libraries • Thread-level concurrency can be provided by: • An explicitly concurrent language. • Compiler-supported extensions to traditional sequential languages. • A library package outside the language proper. • Early examples: support for concurrency in Algol 68; support for coroutines in Simula. • An explicitly concurrent programming language has the advantage of compiler support.
Thread Creation Syntax • Six principal options: • Co-begin. • Parallel loops. • Launch-at-elaboration. • Fork/join. • Implicit receipt. • Early reply. • The first two options delimit threads with special control-flow constructs. • The SR language provides all six options. • Java, C#, and most libraries: fork/join. • Ada: launch-at-elaboration and fork/join. • OpenMP: co-begin and parallel loops. • RPC systems: implicit receipt.
Co-Begin • A compound statement that calls for concurrent execution of its constituent statements:
    co-begin   -- all n statements run concurrently
        stmt_1
        stmt_2
        …
        stmt_n
    end
• The principal means of creating threads in Algol 68. • OpenMP equivalent:
    #pragma omp parallel sections
    {
        #pragma omp section
        { printf("thread 1 here\n"); }
        #pragma omp section
        { printf("thread 2 here\n"); }
    }
Parallel Loops • Loops whose iterations can be executed concurrently. • OpenMP/C:
    #pragma omp parallel for
    for (int i = 0; i < 3; i++) {
        printf("thread %d here\n", i);
    }
• C# / Parallel FX:
    Parallel.For(0, 3, i => {
        Console.WriteLine("Thread " + i + " here");
    });
• High-Performance Fortran:
    forall (i=1:n-1)
        A(i) = B(i) + C(i)
        A(i+1) = A(i) + A(i+1)
    end forall
• OpenMP provides additional support to control scheduling and data semantics.
Launch-at-Elaboration • The code for a thread is declared with syntax resembling that of a subroutine with no parameters. • When the declaration is elaborated, a thread is created to execute the code. • Ada:
    procedure P is
        task T is
            …
        end T;
    begin -- P
        …
    end P;
• Task T begins to execute as soon as control enters procedure P. • When control reaches the end of P, it waits for the corresponding instance of T to complete before returning.
Fork/Join • Fork is more general than co-begin, parallel loops, and launch-at-elaboration, all of which produce properly nested thread executions. • Join allows a thread to wait for the completion of a previously forked thread. • Ada supports the definition of task types; shared variables are used for communication. • Java: parameters are passed at start time; objects of a class implementing the Runnable interface are passed to an Executor object. • Cilk prepends spawn to an ordinary function call:
    spawn foo(args);
• Cilk provides a highly efficient mechanism for scheduling these tasks. • A library-level fork/join sketch with POSIX threads follows.
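At the library level, fork/join looks roughly like the following POSIX-threads sketch (illustrative only; the worker foo and its argument are made up for the example). pthread_create plays the role of fork and pthread_join the role of join:

    #include <pthread.h>
    #include <stdio.h>

    static void *foo(void *arg) {
        printf("child thread running with arg %d\n", *(int *)arg);
        return NULL;
    }

    int main(void) {
        pthread_t child;
        int arg = 42;
        pthread_create(&child, NULL, foo, &arg);   /* fork: child runs concurrently */
        printf("parent continues while the child runs\n");
        pthread_join(child, NULL);                 /* join: wait for the child to finish */
        return 0;
    }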
Implicit Receipt • RPC systems create a new thread automatically in response to an incoming request from some other address space. • A server binds a communication channel to a local thread body or subroutine. • When a request comes in, a new thread springs into existence to handle it. • The bind operation grants remote clients the ability to perform a fork within the server’s address space. • The process is not fully automatic.
Early Reply • An ordinary sequential subroutine call can be thought of in either of two ways: • As a single thread that saves its current context (program counter and registers), executes the subroutine, and returns to what it was doing before. • As two threads: one that executes the caller and another that executes the callee; the call is then essentially a fork/join pair. • With early reply, the callee does not have to terminate to release the caller, just complete its portion of the work. • In languages like SR or Hermes, the callee can execute a reply operation that returns results to the caller without terminating. • The portion of the callee prior to the reply is like the constructor of a Java/C# thread.
Implementation of Threads • Threads are usually implemented on top of one or more processes provided by the operating system. • Making every thread a separate OS process: • Processes are too expensive. • Operations on them require a system call. • Provides features that are seldom used (e.g., priorities). • Running all threads in a single process: • Precludes parallel execution on a multicore or multiprocessor machine. • If the currently running thread makes a system call that blocks, none of the program’s other threads can run.
Two-Level Thread Implementation • User-level threads run on top of kernel-level processes. • Similar code appears at both levels of the system: • The language run-time system implements threads on top of one or more processes. • The operating system implements processes on top of one or more physical processors. • The typical implementation starts with coroutines. • Turning coroutines into threads: • Hide the argument to transfer by implementing a scheduler. • Implement a preemption mechanism. • Allow data structures to be shared.
Coroutines • Multiple execution contexts, only one of which is active at a time. • transfer(other):
    -- save all callee-saves registers on the stack, including ra and fp
    *current := sp
    current := other
    sp := *current
    -- pop all callee-saves registers (including ra, but not sp)
    return   -- into a different coroutine!
• other and current are pointers to context blocks: • A context block contains sp; it may contain other things as well (priority, I/O status, etc.). • There is no need to change the PC explicitly; it always changes at the same place in the code. • Create a new coroutine in a state that looks like it is blocked in transfer. (Or let it execute and then "detach"; that is basically early reply.) • A ucontext-based sketch of transfer in C follows.
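The same save-and-restore idea can be sketched with the POSIX ucontext API (obsolescent in POSIX but still widely available): swapcontext saves the current registers and stack pointer and resumes another saved context, much like transfer above. The structure and names below are illustrative, not the course's implementation:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, co_ctx;
    static char co_stack[64 * 1024];

    static void coroutine_body(void) {
        printf("coroutine: first resume\n");
        swapcontext(&co_ctx, &main_ctx);   /* transfer back to main */
        printf("coroutine: second resume\n");
    }                                      /* returning resumes uc_link (main_ctx) */

    int main(void) {
        getcontext(&co_ctx);
        co_ctx.uc_stack.ss_sp = co_stack;
        co_ctx.uc_stack.ss_size = sizeof co_stack;
        co_ctx.uc_link = &main_ctx;        /* where to continue when the coroutine returns */
        makecontext(&co_ctx, coroutine_body, 0);

        printf("main: transfer to coroutine\n");
        swapcontext(&main_ctx, &co_ctx);   /* save main's context, run the coroutine */
        printf("main: transfer again\n");
        swapcontext(&main_ctx, &co_ctx);
        printf("main: coroutine finished\n");
        return 0;
    }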
Uniprocessor Scheduling • A thread is either blocked or runnable: • current_thread: the thread currently running on the processor. • ready_list: a queue of runnable threads. • Waiting queues: queues of threads blocked waiting for conditions. • Fairness: each thread gets a frequent “slice” of the processor. • Cooperative multithreading: any long-running thread must yield the processor explicitly from time to time. • Schedulers provide the ability to "put a thread/process to sleep" and run something else: • Start with coroutines. • Make a uniprocessor run-until-block thread package. • Add preemption. • Add multiple processors.
Run-Until-Block • We need to get rid of the explicit argument to transfer. • ready_list: threads that are runnable but not running.
    procedure reschedule:
        t : thread := dequeue(ready_list)
        transfer(t)
• To do this safely, we need to save current_thread. • Suppose we are just relinquishing the processor for the sake of fairness (as in MacOS or Windows 3.1):
    procedure yield:
        enqueue(ready_list, current_thread)
        reschedule
• Now suppose we are implementing synchronization:
    procedure sleep_on(q):
        enqueue(q, current_thread)
        reschedule
• Some other thread/process will move us to the ready list when we can continue.
Preemption • Use timer interrupts (in the OS) or signals (in a library package) to trigger involuntary yields. • This requires that we protect the scheduler data structures:
    procedure yield:
        disable_signals
        enqueue(ready_list, current)
        reschedule
        re-enable_signals
• Note that reschedule takes us to a different thread, possibly in code other than yield. • Every call to reschedule must therefore be made with signals disabled, and must re-enable them upon its return:
    disable_signals
    if not <desired condition>
        sleep_on <condition queue>
    re-enable_signals
• A sketch of how signal masking might be implemented follows.
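One plausible way to implement disable_signals / re-enable_signals in a user-level thread library is to mask the preemption signal with sigprocmask. The sketch below assumes SIGALRM is the timer signal used to force yields (an assumption for illustration, not something the slides specify):

    #include <signal.h>

    static sigset_t preempt_sig;

    static void disable_signals(void) {
        sigemptyset(&preempt_sig);
        sigaddset(&preempt_sig, SIGALRM);            /* the signal that forces yields */
        sigprocmask(SIG_BLOCK, &preempt_sig, NULL);  /* preemption cannot occur here */
    }

    static void reenable_signals(void) {
        sigprocmask(SIG_UNBLOCK, &preempt_sig, NULL);
    }

    int main(void) {
        disable_signals();
        /* ... enqueue(ready_list, current); reschedule(); ... */
        reenable_signals();
        return 0;
    }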
Multiprocessor Scheduling • True or quasi-parallelism introduces races between calls in separate OS processes. • Additional synchronization is needed to make scheduler operations in separate processes atomic:
    procedure yield:
        disable_signals
        acquire(scheduler_lock)   -- spin lock
        enqueue(ready_list, current)
        reschedule
        release(scheduler_lock)
        re-enable_signals

    disable_signals
    acquire(scheduler_lock)   -- spin lock
    if not <desired condition>
        sleep_on <condition queue>
    release(scheduler_lock)
    re-enable_signals
Summary • Six different constructs for creating threads: co-begin, parallel loops, launch-at-elaboration, fork/join, implicit receipt, and early reply. • Most concurrent programming systems implement their language- or library-level threads on top of a collection of OS-level processes. • The OS, in turn, implements its processes on top of a collection of hardware processors.