MIMD COMPUTERS
Fundamentals of Parallel Processing
MIMD Computers or Multiprocessors
Several terms are often used in a confusing way.
Definition: Multiprocessors are computers capable of running multiple instruction streams simultaneously to cooperatively execute a single program.
Definition: Multiprogramming is the sharing of computing equipment by many independent jobs; the jobs interact only through their requests for the same resources. Multiprocessors can be used to multiprogram single-stream programs.
Definition: A process is a dynamic instance of an instruction stream: a combination of code and process state, for example the program counter and status words. Processes are also called tasks, threads, or virtual processors.
Definition: Multiprocessing is either:
a) running a program (possibly a sequential one) on a multiprocessor [not of interest to us], or
b) running a program consisting of multiple cooperating processes.
Two main types of MIMD or multiprocessor architectures:
• Shared memory multiprocessor
• Distributed memory multiprocessor
Distributed memory multiprocessors are also known as explicit communication multiprocessors.
Notation: A summary of the notation used in the following figures:
L: Link, a component that transfers information from one place to another.
K: Controller, a component that evokes the operation of other components in the system.
S: Switch, constructs links between components. It has an associated set of possible links; it sets some and breaks others to establish a connection.
T: Transducer, a component that changes the i-unit (information) used to encode a given meaning. It changes the format, not the meaning.
Some Example Configurations
Fully Shared Memory Architecture:
Shared Plus Private Memory Architecture:
Adding private memories to the previous configuration produces a hybrid architecture.
If the local memories are managed by hardware, they are called caches.
NUMA (Non-Uniform Memory Access) Machines: Some locations in shared memory take longer to access than others, which has an important impact on performance.
UMA (Uniform Memory Access) Machines: All shared memory locations take the same time to access.
Cluster: A machine built by connecting several small shared memory multiprocessors, called clusters, through a communication network accessed by send and receive instructions. The shared memory of a cluster is private with respect to the other clusters.
Characteristics of shared memory multiprocessors:
• Interprocessor communication is done in the memory interface by read and write instructions.
• Memory may be physically distributed, and reads and writes from different processors may take different amounts of time and may collide in the interconnection network.
• Memory latency (the time to complete a read or write) may be long and variable.
• Messages through the interconnecting switch are the size of single memory words (or perhaps cache lines).
• Randomization of requests (e.g., by interleaving words across memory modules) may be used to reduce the probability of collision.
Characteristics of message passing multiprocessors:
• Interprocessor communication is done by software using data transmission instructions (send and receive).
• Read and write refer only to memory private to the processor issuing them.
• Data may be aggregated into long messages before being sent into the interconnecting switch.
• Large data transmissions may mask long and variable latency in the communications network.
• Global scheduling of communications can help avoid collisions between long messages.
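As an illustration of the message passing style (MPI is used here only as a familiar example and is not discussed in these slides), the following C sketch aggregates data into one long message before sending it:

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 aggregates 1000 doubles into one long message and sends it to rank 1. */
    int main(int argc, char **argv) {
        int rank;
        double buf[1000];
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            for (int i = 0; i < 1000; i++) buf[i] = i;
            MPI_Send(buf, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, 1000, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %.0f ... %.0f\n", buf[0], buf[999]);
        }
        MPI_Finalize();
        return 0;
    }

Note that reads and writes of buf refer only to each process's private memory; the data crosses the interconnection network only through the explicit send and receive calls.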
Distributed memory multiprocessors are characterized by their network topologies.
Both distributed and shared memory multiprocessors use an interconnection network. The distinctions are often in the details of the low-level switching protocol rather than in the high-level switch topology:
• Indirect networks: often used in shared memory architectures; resources such as processors, memories, and I/O devices are attached externally to a switch that may have a complex internal structure of interconnected switching nodes.
• Direct networks: more common in message passing architectures; resources are associated with the individual nodes of a switching topology.
Ring Topology
An N-processor ring topology can take up to N/2 steps to transmit a message from one processor to another (assuming a bidirectional ring).
A rectangular mesh topology is also possible.
An N-processor mesh topology can take up to about 2(√N − 1) steps to transmit a message from one processor to another (for a √N × √N mesh).
The hypercube architecture is another interconnection topology.
Each processor connects directly to log2 N others, whose indices are obtained by changing one bit of the binary index of the reference processor (Gray code). Up to log2 N steps are needed to transmit a message between processors.
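The neighbor and distance rules are easy to express in code; the following C sketch (an illustration, not from the slides) lists a node's neighbors by flipping address bits and computes the number of routing steps as the Hamming distance between node indices:

    #include <stdio.h>

    /* Neighbors of node id in a d-dimensional hypercube: flip each of its d address bits. */
    void print_neighbors(unsigned id, unsigned d) {
        for (unsigned bit = 0; bit < d; bit++)
            printf("node %u <-> node %u\n", id, id ^ (1u << bit));
    }

    /* Minimum number of routing steps between two nodes = Hamming distance of their indices. */
    unsigned hops(unsigned a, unsigned b) {
        unsigned x = a ^ b, count = 0;
        while (x) { count += x & 1u; x >>= 1; }
        return count;
    }

    int main(void) {
        print_neighbors(5, 4);                      /* node 0101 in a 4-cube: 0100, 0111, 0001, 1101 */
        printf("hops(5, 10) = %u\n", hops(5, 10));  /* 0101 -> 1010 differs in 4 bits: 4 steps */
        return 0;
    }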
Form of a four-dimensional hypercube
Classification of Real Systems
Overview of the Cm* architecture, an early system.
Five clusters with ten PEs each were built. The Cm* system illustrates a mixture of shared and distributed memory ideas.
There are three answers to the question: Is Cm* a shared memory multiprocessor?
1. At the level of microcode in the Kmap, there are explicit send and receive instructions and message passing software, so it is not shared memory.
2. At the level of the LSI-11 instruction set, the machine has shared memory. There are no send and receive instructions, and any memory address can be accessed by any processor in the system, so it is shared memory.
3. Two operating systems were built for the machine, StarOS and Medusa. The processes these operating systems supported could not share any memory; they communicated by making operating system calls to pass messages between processors, so it is not shared memory.
The architecture of the Sequent Balance system (similar to the Symmetry and the Encore Multimax) illustrates another bus-based architecture.
Sequent Balance
System bus: an 80 Mbytes/second system bus links the CPUs, memory, and I/O processors. Data and 32-bit addresses are time-multiplexed on the bus; the sustained transfer rate is 53 Mbytes/second.
Multibus: provides access to standard peripherals.
SCSI (Small Computer System Interface): provides access to low-cost peripherals for entry-level configurations and for software distribution.
Ethernet: connects systems in a local area network.
Sequent Balance Atomic Lock Memory (ALM):
User-accessible hardware locks allow mutual exclusion on shared data structures. There are 16K hardware locks in a set; one or more sets can be installed in a machine, one per Multibus adapter board. Each lock is a 32-bit double word, and the least significant bit determines the state of the lock: locked (1) or unlocked (0).
Reading a lock returns the value of this bit and sets it to 1, thus locking the lock. Writing 0 to a lock unlocks it.
Locks can support a variety of synchronization techniques, including busy waits, counting/queuing semaphores, and barriers.
Alliant FX/8: designed to exploit the parallelism found in scientific programs automatically.
Up to 8 processors called Computational Elements (CEs) and up to 12 Interactive Processors (IPs) share a global memory of up to 256 Mbytes. All accesses by CEs and IPs to the bus go through cache memory: up to 512 Kbytes of cache shared by the CEs and up to 128 Kbytes of cache shared by the IPs (every 3 IPs share 32 Kbytes of cache). The CEs are connected together directly through a concurrency control bus.
Each IP contains a Motorola 68000 CPU; IPs are used for interactive processes and I/O. CEs have custom chips to support M68000 instructions, floating point instructions (Weitek processor chip), vector arithmetic instructions, and concurrency instructions. The vector registers are 32 elements long for both integer and floating point types.
Programming Shared Memory Multiprocessors
Key features needed to program shared memory MIMD computers:
• Process management:
  • Fork/Join
  • Create/Quit
  • Parbegin/Parend
• Data sharing:
  • Shared variables
  • Private variables
• Synchronization:
  • Control-based: critical sections, barriers
  • Data-based: Lock/Unlock, Produce/Consume
In the introduction to the MIMD pseudo code we presented minimal extensions to sequential pseudo code for process management and data sharing:
• Fork/Join for basic process management
• Shared/Private storage classes for data sharing between processes
We will discuss these in more detail a little later. Another essential mechanism for programming shared memory multiprocessors is synchronization. Synchronization guarantees some relationship between the rates of progress of the parallel processes.
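Before turning to synchronization, here is a minimal sketch of the Fork/Join and Shared/Private ideas using pthreads (pthreads is an assumption of the sketch; the course pseudo code is machine independent). Each thread writes only its own element of the shared array, so no synchronization is needed yet:

    #include <pthread.h>
    #include <stdio.h>

    #define NPROC 4

    double partial[NPROC];              /* shared: visible to every thread */

    void *worker(void *arg) {
        long id = (long)arg;            /* private: each thread gets its own copy */
        double psum = 0.0;              /* private partial sum */
        for (int i = 0; i < 1000; i++)
            psum += (double)(id * 1000 + i);
        partial[id] = psum;             /* each thread writes a distinct element, so no race */
        return NULL;
    }

    int main(void) {
        pthread_t t[NPROC];
        double total = 0.0;
        for (long i = 0; i < NPROC; i++)            /* Fork */
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NPROC; i++) {           /* Join */
            pthread_join(t[i], NULL);
            total += partial[i];
        }
        printf("total = %.0f\n", total);
        return 0;
    }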
Let's demonstrate why synchronization is absolutely essential.
Example: Assume the following statement is being executed by n processes in a parallel program:
  Sum := Sum + Psum
where
  Sum: shared variable, initially 0
  Psum: private variable.
Assume further that P1 calculates Psum = 10 and P2 calculates Psum = 3. The final value of Sum must therefore be 13.
At the assembly level, Pi's code is:
  load  Ri2, Sum   ; Ri2 ← Sum
  load  Ri1, Psum  ; Ri1 ← Psum
  add   Ri1, Ri2   ; Ri1 ← Ri1 + Ri2
  store Sum, Ri1   ; Sum ← Ri1
where Rix refers to register x of process i. The following scenario is possible when two processes execute the statement concurrently:
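One interleaving that loses an update (the original slide showed the scenario as a figure, so the exact layout here is a reconstruction); Sum is initially 0, P1 has Psum = 10 and P2 has Psum = 3:
  1. P1: load R12, Sum            ; R12 = 0
  2. P2: load R22, Sum            ; R22 = 0
  3. P1: add and store Sum, R11   ; Sum = 0 + 10 = 10
  4. P2: add and store Sum, R21   ; Sum = 0 + 3 = 3 (overwrites P1's update)
The final value of Sum is 3 rather than the correct 13, because both processes read Sum before either wrote it back.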
Synchronization operations can be divided into two basic classes:
• Control oriented: progress past some point in the program is controlled (critical sections).
• Data oriented: access to a data item is controlled by the state of the data item (Lock and Unlock).
An important concept is atomicity. The word atomic is used in the sense of invisible.
Definition: Let S be a set of processes and q be an operation, perhaps composite. q is atomic with respect to S iff for any process P ∈ S which shares variables with q, the state of these variables seen by P is either that before the start of q or that resulting from the completion of q. In other words, states internal to q are invisible to processes of S.
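As an aside (not from the original slides), on a machine with C11 atomics the lost update above disappears if the entire read-modify-write is performed as one atomic operation:

    #include <stdatomic.h>

    atomic_int Sum = 0;            /* shared, initially 0 */

    void add_partial(int psum) {   /* psum is the calling process's private partial sum */
        /* load, add, and store happen as one indivisible operation,
           so concurrent callers cannot lose each other's updates */
        atomic_fetch_add(&Sum, psum);
    }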
Synchronization Examples
Control Oriented Synchronization: the critical section is a simple control oriented synchronization.
  Process 1              Process 2
    ...                    ...
    Critical               Critical
      code body1             code body2
    End critical           End critical
    ...                    ...
Software Solution: We first implement the critical section using software methods only. These solutions are all based on the fact that read and write (load and store) are the only atomic machine-level (hardware) instructions available. We must ensure that only one process at a time is allowed in the critical section: once a process is executing in its critical section, no other process is allowed to enter.
We first present a solution for two-process execution only (the code itself appeared as a figure; a sketch is given below).
Shared variables:
  Var want-in: ARRAY[0..1] OF Boolean;
      turn: 0..1;
  Initially want-in[0] = want-in[1] = false and turn = 0.
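A standard algorithm consistent with these shared variables is Peterson's algorithm; the following C sketch (using C11 atomics, which the original slides do not assume) is a reconstruction rather than the exact figure from the course:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool want_in[2];        /* initially false, false */
    atomic_int  turn;              /* initially 0 */

    void enter_critical(int i) {                   /* i is 0 or 1 */
        int j = 1 - i;                             /* the other process */
        atomic_store(&want_in[i], true);           /* I want in */
        atomic_store(&turn, j);                    /* but you go first if we collide */
        while (atomic_load(&want_in[j]) && atomic_load(&turn) == j)
            ;                                      /* busy-wait */
    }

    void exit_critical(int i) {
        atomic_store(&want_in[i], false);
    }

Next, we present a software solution for N processes.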
Bakery Algorithm (due to Leslie Lamport)
Definitions/Notation:
• Before a process enters its critical section, it receives a number. The process holding the smallest number is allowed to enter the critical section.
• Two processes Pi and Pj may receive the same number. In this case, if i < j then Pi is served first.
• The numbering scheme generates numbers in increasing order of enumeration, for example: 1, 2, 2, 3, 4, 4, 4, 5, 6, 7, 7, ...
• (A, B) < (C, D) if: 1. A < C, or 2. A = C and B < D.
Bakery Algorithm: Shared Data:
  VAR piknum: ARRAY[0..N-1] OF BOOLEAN;
      number: ARRAY[0..N-1] OF INTEGER;
  Initially piknum[i] = false and number[i] = 0, for i = 0, 1, ..., N-1.
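The algorithm body itself appeared as a figure; the following C sketch is a standard reconstruction of Lamport's bakery algorithm using the shared data declared above (C11 atomics and the value of N are assumptions of the sketch, not of the slides):

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 8                       /* number of processes (example value) */

    atomic_bool piknum[N];            /* "picking a number", initially all false */
    atomic_int  number[N];            /* ticket numbers, initially all 0 */

    static int max_number(void) {
        int m = 0;
        for (int k = 0; k < N; k++) {
            int v = atomic_load(&number[k]);
            if (v > m) m = v;
        }
        return m;
    }

    void bakery_enter(int i) {
        atomic_store(&piknum[i], true);
        atomic_store(&number[i], 1 + max_number());   /* take the next number */
        atomic_store(&piknum[i], false);
        for (int j = 0; j < N; j++) {
            while (atomic_load(&piknum[j]))
                ;                                     /* wait while Pj is still choosing */
            /* wait while (number[j], j) < (number[i], i), i.e. Pj is ahead of us */
            while (atomic_load(&number[j]) != 0 &&
                   (atomic_load(&number[j]) < atomic_load(&number[i]) ||
                    (atomic_load(&number[j]) == atomic_load(&number[i]) && j < i)))
                ;
        }
    }

    void bakery_exit(int i) {
        atomic_store(&number[i], 0);                  /* give up the number */
    }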
Hardware Solutions: Most computers provide special instructions to ease the implementation of critical section code. In general, an instruction is needed that can read and modify the contents of a memory location in one cycle. These instructions, referred to as rmw (read-modify-write) instructions, can do more than just a read (load) or a write (store) in one memory cycle.
Test&Set is a machine-level instruction (implemented in hardware) that can test and modify the contents of a word in one memory cycle. Its operation can be described as follows:
  Function Test&Set(Var v: Boolean): Boolean;
  Begin
    Test&Set := v;
    v := true;
  End;
In other words, Test&Set returns the old value of v and sets it to true regardless of its previous value.
Swap is another such instruction. It swaps the contents of two memory locations in one memory cycle and is common in IBM computers. Its operation can be described as follows:
  Procedure Swap(Var a, b: Boolean);
  Var temp: Boolean;
  Begin
    temp := a;
    a := b;
    b := temp;
  End;
Now we can implement the critical section entry and exit sections using the Test&Set and Swap instructions:
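The slide's figure with the entry and exit code is not reproduced here; a C sketch of the Test&Set version, using C11 atomic_exchange as a stand-in for the hardware instruction, would be:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool lock = false;                  /* shared, initially unlocked */

    void enter(void) {
        while (atomic_exchange(&lock, true))   /* Test&Set: set to true, return old value */
            ;                                  /* old value was true: someone holds it, busy-wait */
    }

    void leave(void) {
        atomic_store(&lock, false);            /* clear the lock */
    }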
The implementation with Swap requires the use of two variables, one shared and one private:
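Again the figure is missing; a C sketch of the Swap version, with one shared variable (lock) and one private variable (key), could look like this (atomic_exchange plays the role of the one-cycle Swap):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool lock = false;                      /* shared */

    void enter_swap(void) {
        bool key = true;                           /* private */
        do {
            key = atomic_exchange(&lock, key);     /* Swap(lock, key) */
        } while (key);                             /* repeat until we swapped while lock was false */
    }

    void leave_swap(void) {
        atomic_store(&lock, false);
    }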
The above implementations suffer from busy waiting: while a process is in its critical section, the other processes attempting to enter their critical sections wait in either the While loop (Test&Set case) or the Repeat loop (Swap case). The amount of busy waiting is proportional to the number of processes contending for the critical section and to the length of the critical section.
With fine-grain parallelism, busy waiting may give the best performance. However, if most programs are designed with coarse-grain parallelism in mind, busy waiting becomes very costly in terms of performance and machine resources. The contention caused by busy-waiting processes degrades the performance of even the process that is executing in its critical section.
Semaphores are one way to deal with cases involving potentially large amounts of busy waiting; using semaphore operations can limit the amount of busy waiting.
Definition: A semaphore S is a shared integer variable that can only be accessed through two indivisible operations, P(S) and V(S):
  P(S): S := S - 1; If S < 0 Then Block(S);
  V(S): S := S + 1; If S <= 0 Then Wakeup(S);
• Block(S) results in the suspension of the process invoking it.
• Wakeup(S) results in the resumption of exactly one process that has previously invoked Block(S).
Note: P and V are executed atomically.
Given the above definition, critical section entry and exit can be implemented using a semaphore as follows:
  Shared Var mutex: Semaphore;
  Initially mutex = 1
  P(mutex);
    ... critical section ...
  V(mutex);
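For comparison (an illustration, not part of the slides), POSIX semaphores provide exactly this pattern: sem_wait corresponds to P and sem_post to V.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    sem_t mutex;                          /* plays the role of the semaphore mutex above */
    int shared_counter = 0;               /* shared data protected by the critical section */

    void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            sem_wait(&mutex);             /* P(mutex) */
            shared_counter++;             /* critical section */
            sem_post(&mutex);             /* V(mutex) */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        sem_init(&mutex, 0, 1);           /* initially mutex = 1 */
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("counter = %d\n", shared_counter);   /* always 200000 */
        sem_destroy(&mutex);
        return 0;
    }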
Semaphores are implemented using machine-level instructions such as Test&Set or Swap:
  Shared Var lock: Boolean;
  Initially lock = false
  P(S): While Test&Set(lock) Do { };
        Begin
          S := S - 1;
          If S < 0 Then Block process;
          lock := false;
        End
  V(S): While Test&Set(lock) Do { };
        Begin
          S := S + 1;
          If S <= 0 Then Make a suspended process ready;
          lock := false;
        End
Problem: Implement the semaphore operations using the Swap instruction.
Shared Var lock: Boolean;
Initially lock = false
Private Var key: Boolean;
  P(S): key := true;
        Repeat Swap(lock, key); Until key = false;
        S := S - 1;
        If S < 0 Then Block process;
        lock := false
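The matching V(S) is not shown on the slide; a sketch in the same style (an assumption, mirroring the Test&Set version above) would be:
  V(S): key := true;
        Repeat Swap(lock, key); Until key = false;
        S := S + 1;
        If S <= 0 Then Make a suspended process ready;
        lock := false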
Data Oriented Synchronization:
• LOCK L: if lock L is set then wait; if it is clear, set it and proceed.
• UNLOCK L: unconditionally clear lock L.
Using Test&Set, LOCK and UNLOCK correspond to the following:
  LOCK L:   Repeat y = Test&Set(L) Until y = 0
  UNLOCK L: L = 0
Relationship between locks and critical sections: Critical sections are more like locks if we consider named critical sections. Execution inside a named critical section excludes simultaneous execution inside any other critical section of the same name; however, processes may execute concurrently in critical sections with different names. A simple correspondence between locks and critical sections is:
  Critical max          LOCK max
    critical code         critical code
  End critical          UNLOCK max
Both synchronizations are used to solve the mutual exclusion problem. However,
Locks are more general than critical sections, since UNLOCK does not have to appear in the same process as LOCK.
Asynchronous Variables: a second type of data oriented synchronization. These variables have both a value and a state, which is either full or empty. Asynchronous variables are accessed by two principal atomic operations:
• Produce: wait for the state of the asynchronous variable to be empty, write a value, and set the state to full.
• Consume: wait for the state of the asynchronous variable to be full, read the value, and set the state to empty.
A more complete set of operations on asynchronous variables is provided by the Force parallel programming language:
  Produce  asynch var = expression
  Consume  private var = asynch var
  Copy     private var = asynch var   -- wait for full, read the value, don't change the state.
  Void     asynch var                 -- initialize the state to empty.
Asynchronous variables can be implemented in terms of critical sections.
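Before the critical-section implementation on the next slide, here is a sketch of the same full/empty behaviour using a pthreads mutex and condition variable (these primitives are an assumption of the sketch, not something the slides provide):

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {                      /* an asynchronous variable: value plus full/empty state */
        pthread_mutex_t m;
        pthread_cond_t  cv;
        bool            full;
        double          value;
    } asynch_var;

    /* Void: initialize the state to empty */
    #define ASYNCH_VAR_INIT { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, 0.0 }

    void produce(asynch_var *v, double x) {   /* wait for empty, write the value, set full */
        pthread_mutex_lock(&v->m);
        while (v->full)
            pthread_cond_wait(&v->cv, &v->m);
        v->value = x;
        v->full  = true;
        pthread_cond_broadcast(&v->cv);
        pthread_mutex_unlock(&v->m);
    }

    double consume(asynch_var *v) {           /* wait for full, read the value, set empty */
        pthread_mutex_lock(&v->m);
        while (!v->full)
            pthread_cond_wait(&v->cv, &v->m);
        double x = v->value;
        v->full = false;
        pthread_cond_broadcast(&v->cv);
        pthread_mutex_unlock(&v->m);
        return x;
    }

    double copy_value(asynch_var *v) {        /* Copy: wait for full, read the value, leave the state unchanged */
        pthread_mutex_lock(&v->m);
        while (!v->full)
            pthread_cond_wait(&v->cv, &v->m);
        double x = v->value;
        pthread_mutex_unlock(&v->m);
        return x;
    }

A variable declared as asynch_var av = ASYNCH_VAR_INIT; starts empty, just as Void would leave it.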
Represent an asynchronous variable by the following data structure:
  V   value
  Vf  state -- true corresponds to full, false corresponds to empty
  Vn  name
Pseudo code to implement the Produce operation is:
  1. L: Critical Vn
  2.      privf := Vf
  3.      If not(privf) Then
  4.      Begin
  5.        V := value-of-expression;
  6.        Vf := true;
  7.      End;
  8.    End critical
  9.    If privf Then goto L;
Note: the private variable privf is used to obtain a copy of the shared Vf, the state of the asynchronous variable, before the process attempts to perform the Produce operation. If the test in statement 3 reveals that the state is full, the process returns to statement 1 and tries again.
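The Consume operation is not shown; a sketch in the same style (the mirror image of Produce: read the value and set the state to empty) would be:
  1. L: Critical Vn
  2.      privf := Vf
  3.      If privf Then
  4.      Begin
  5.        private-var := V;
  6.        Vf := false;
  7.      End;
  8.    End critical
  9.    If not(privf) Then goto L;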