Parallel Computing Final Exam Review
What is Parallel computing? • Parallel computing involves performing parallel tasks using more than one computer. • Example in real life with related principles: book shelving in a library • Single worker • p workers, each stacking n/p books, but with an arbitration problem (many workers try to stack the next book on the same shelf). • p workers, each stacking n/p books, but without an arbitration problem (each worker works on a different set of shelves).
Important Issues in parallel computing • Task/Program Partitioning. • How to split a single task among the processors so that each processor performs the same amount of work, and all processors work collectively to complete the task. • Data Partitioning. • How to split the data evenly among the processors in such a way that processor interaction is minimized. • Communication/Arbitration. • How we allow communication among different processors and how we arbitrate communication related conflicts.
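The data-partitioning goal above (an even split among processors) can be sketched as a simple block partition. This is only an illustration: the function name `block_partition` and the choice of contiguous blocks are assumptions, not something the slides prescribe.

```python
def block_partition(n, p):
    """Split indices 0..n-1 into p contiguous blocks whose sizes differ by at most 1."""
    base, extra = divmod(n, p)
    blocks = []
    start = 0
    for rank in range(p):
        # The first `extra` processors take one additional item.
        size = base + (1 if rank < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


if __name__ == "__main__":
    for rank, block in enumerate(block_partition(10, 3)):
        print(f"processor {rank}: items {list(block)}")
```

In the library analogy, each block is the set of books (or shelves) handed to one worker, so no two workers contend for the same item.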
Challenges • Design of parallel computers so that we resolve the above issues. • Design, analysis and evaluation of parallel algorithms run on these machines. • Portability and scalability issues related to parallel programs and algorithms • Tools and libraries used in such systems.
Units of Measure in HPC • High Performance Computing (HPC) units are: - Flop: floating point operation - Flop/s: floating point operations per second - Bytes: size of data (a double precision floating point number is 8 bytes) • Typical sizes are millions, billions, trillions…
Mega: Mflop/s = 10^6 flop/sec; Mbyte = 2^20 = 1048576 ~ 10^6 bytes
Giga: Gflop/s = 10^9 flop/sec; Gbyte = 2^30 ~ 10^9 bytes
Tera: Tflop/s = 10^12 flop/sec; Tbyte = 2^40 ~ 10^12 bytes
Peta: Pflop/s = 10^15 flop/sec; Pbyte = 2^50 ~ 10^15 bytes
Exa: Eflop/s = 10^18 flop/sec; Ebyte = 2^60 ~ 10^18 bytes
Zetta: Zflop/s = 10^21 flop/sec; Zbyte = 2^70 ~ 10^21 bytes
Yotta: Yflop/s = 10^24 flop/sec; Ybyte = 2^80 ~ 10^24 bytes
• See www.top500.org for current list of fastest machines
What is a parallel computer? • A parallel computer is a collection of processors that cooperatively solve computationally intensive problems faster than other computers. • Parallel algorithms allow the efficient programming of parallel computers. • This way the waste of computational resources can be avoided. • Parallel computer vs. supercomputer • A supercomputer is a general-purpose computer that can solve computationally intensive problems faster than traditional computers. • A supercomputer may or may not be a parallel computer.
Flynn’s taxonomy of computer architectures (control mechanism) • Depending on the instruction and data streams, computer architectures can be divided into the following groups. • (1) SISD (Single Instruction Single Data): This is a sequential computer. • (2) SIMD (Single Instruction Multiple Data): This is a parallel machine like the Thinking Machines CM-200. SIMD machines are suited for data-parallel programs where the same set of instructions is executed on a large data set. • Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of machines. • (3) MISD (Multiple Instructions Single Data): Some consider a systolic array a member of this group. • (4) MIMD (Multiple Instructions Multiple Data): All other parallel machines. A MIMD architecture can be an MPMD or an SPMD. In a Multiple Program Multiple Data organization, each processor executes its own program, as opposed to a single program that is executed by all processors on a Single Program Multiple Data architecture. • Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters, and the IBM SP. Note: Some consider the CM-5 a combination of MIMD and SIMD, as it contains control hardware that allows it to operate in a SIMD mode.
SIMD and MIMD Processors A typical SIMD architecture (a) and a typical MIMD architecture (b).
Taxonomy based on Address-Space Organization (memory distribution) • Message-Passing Architecture • In a distributed memory machine each processor has its own memory. Each processor can access its own memory faster than it can access the memory of a remote processor (NUMA, for Non-Uniform Memory Access). This architecture is also known as message-passing architecture and such machines are commonly referred to as multicomputers. • Examples: Cray T3D/T3E, IBM SP1/SP2, workstation clusters. • Shared-Address-Space Architecture • Provides hardware support for read/write to a shared address space. Machines built this way are often called multiprocessors. • (1) A shared memory machine has a single address space shared by all processors (UMA, for Uniform Memory Access). • The time taken by a processor to access any memory word in the system is identical. • Examples: SGI Power Challenge, SMP machines. • (2) A distributed shared memory system is a hybrid between the two previous ones. A global address space is shared among the processors but is distributed among them. Example: SGI Origin 2000. Note: The existence of a cache in shared-memory parallel machines causes cache coherence problems when a cached variable is modified by one processor and the shared variable is requested by another processor. cc-NUMA stands for cache-coherent NUMA architectures (Origin 2000).
NUMA and UMA Shared-Address-Space Platforms Typical shared-address-space architectures: (a) Uniform-memory access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.
Message Passing vs. Shared Address Space Platforms • Message passing requires little hardware support, other than a network. • Shared address space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).
Taxonomy based on processor granularity • Granularity sometimes refers to the power of individual processors. Sometimes it is also used to denote the degree of parallelism. • (1) A coarse-grained architecture consists of (usually few) powerful processors (e.g. old Cray machines). • (2) A fine-grained architecture consists of (usually many inexpensive) processors (e.g. Thinking Machines CM-200, CM-2). • (3) A medium-grained architecture is between the two (e.g. CM-5). • Process granularity refers to the amount of computation assigned to a particular processor of a parallel machine for a given parallel program. It also refers, within a single program, to the amount of computation performed before communication is issued. If the amount of computation is small (low degree of concurrency) a process is fine-grained. Otherwise granularity is coarse.
Taxonomy based on processor synchronization • (1) In a fully synchronous system a global clock is used to synchronize all operations performed by the processors. • (2) An asynchronous system lacks any synchronization facilities. Processor synchronization needs to be explicit in a user’s program. • (3) A bulk-synchronous system comes in between a fully synchronous and an asynchronous system. Synchronization of processors is required only at certain parts of the execution of a parallel program.
Physical Organization of Parallel Platforms – ideal architecture (PRAM) • The Parallel Random Access Machine (PRAM) is one of the simplest ways to model a parallel computer. • A PRAM consists of a collection of (sequential) processors that can synchronously access a global shared memory in unit time. Each processor can thus access the shared memory as fast (and efficiently) as it can access its own local memory. • The main advantage of the PRAM is its simplicity in capturing parallelism and abstracting away communication and synchronization issues related to parallel computing. • Processors are considered to be in abundance and unlimited in number. The resulting PRAM algorithms thus exhibit unlimited parallelism (the number of processors used is a function of problem size). • The abstraction thus offered by the PRAM is a fully synchronous collection of processors and a shared memory, which makes it popular for parallel algorithm design. • It is, however, this abstraction that also makes the PRAM unrealistic from a practical point of view. • Full synchronization offered by the PRAM is too expensive and time demanding in parallel machines currently in use. • Remote memory (i.e. shared memory) access is considerably more expensive in real machines than local memory access. • UMA machines with unlimited parallelism are difficult to build.
Four Subclasses of PRAM • Depending on how concurrent access to a single memory cell (of the shared memory) is resolved, there are various PRAM variants. • ER (Exclusive Read) or EW (Exclusive Write) PRAMs do not allow concurrent access of the shared memory. • It is allowed, however, for CR (Concurrent Read) or CW (Concurrent Write) PRAMs. • Combining the rules for read and write access there are four PRAM variants: • EREW: • access to a memory location is exclusive. No concurrent read or write operations are allowed. • Weakest PRAM model • CREW • Multiple read accesses to a memory location are allowed. Multiple write accesses to a memory location are serialized. • ERCW • Multiple write accesses to a memory location are allowed. Multiple read accesses to a memory location are serialized. • Can simulate an EREW PRAM • CRCW • Allows multiple read and write accesses to a common memory location. • Most powerful PRAM model • Can simulate both EREW PRAM and CREW PRAM
Resolving concurrent write access • (1) In the arbitrary PRAM, if multiple processors write into a single shared memory cell, then an arbitrary processor succeeds in writing into this cell. • (2) In the common PRAM, processors must write the same value into the shared memory cell. • (3) In the priority PRAM, the processor with the highest priority (smallest or largest indexed processor) succeeds in writing. • (4) In the combining PRAM, if more than one processor writes into the same memory cell, the result written into it depends on the combining operator. If it is the sum operator, the sum of the values is written; if it is the maximum operator, the maximum is written. Note: An algorithm designed for the common PRAM can be executed on a priority or arbitrary PRAM and exhibit similar complexity. The same holds for an arbitrary PRAM algorithm when run on a priority PRAM.
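The four write-resolution rules can be illustrated with a small simulation of a single memory cell. This is a sketch under stated assumptions: the function name `resolve_concurrent_write`, the `(processor_id, value)` representation, and the choice that the smallest index has highest priority are all made up for illustration.

```python
import operator
import random
from functools import reduce


def resolve_concurrent_write(writes, rule, combine=operator.add):
    """Return the value stored after several processors write one cell in the same step.

    writes: non-empty list of (processor_id, value) pairs.
    rule:   'arbitrary', 'common', 'priority' or 'combining'.
    """
    if not writes:
        raise ValueError("at least one write is required")
    values = [value for _, value in writes]
    if rule == "arbitrary":
        # Any one of the writers succeeds.
        return random.choice(values)
    if rule == "common":
        # All writers must agree on the value.
        if len(set(values)) != 1:
            raise ValueError("common PRAM: all processors must write the same value")
        return values[0]
    if rule == "priority":
        # Assumption: the smallest processor index has the highest priority.
        return min(writes, key=lambda w: w[0])[1]
    if rule == "combining":
        # The combining operator (sum by default) merges all written values.
        return reduce(combine, values)
    raise ValueError(f"unknown rule: {rule}")


if __name__ == "__main__":
    writes = [(3, 5), (1, 7), (2, 2)]
    print(resolve_concurrent_write(writes, "priority"))
    print(resolve_concurrent_write(writes, "combining"))
    print(resolve_concurrent_write(writes, "combining", combine=max))
```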
Interconnection Networks for Parallel Computers • Interconnection networks carry data between processors and to memory. • Interconnects are made of switches and links (wires, fiber). • Interconnects are classified as static or dynamic. • Static networks • Consist of point-to-point communication links among processing nodes • Also referred to as direct networks • Dynamic networks • Built using switches (switching elements) and links • Communication links are connected to one another dynamically by the switches to establish paths among processing nodes and memory banks.
Static and Dynamic Interconnection Networks Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Network Topologies • Bus-Based Networks • The simplest network, consisting of a shared medium (bus) that is common to all the nodes. • The distance between any two nodes in the network is constant (O(1)). • Ideal for broadcasting information among nodes. • Scalable in terms of cost, but not scalable in terms of performance. • The bounded bandwidth of a bus places limitations on the overall performance as the number of nodes increases. • Typical bus-based machines are limited to dozens of nodes. • Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.
Network Topologies • Crossbar Networks • Employs a grid of switches or switching nodes to connect p processors to b memory banks. • Nonblocking network: • the connection of a processing node to a memory bank does not block the connection of any other processing node to other memory banks. • The total number of switching nodes required is Θ(pb). (It is reasonable to assume b ≥ p.) • Scalable in terms of performance. • Not scalable in terms of cost. • Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500.
Network Topologies: Multistage Network • Multistage Networks • Intermediate class of networks between bus-based network and crossbar network • Blocking networks: access to a memory bank by a processor may disallow access to another memory bank by another processor. • More scalable than the bus-based network in terms of performance, more scalable than crossbar network in terms of cost. The schematic of a typical multistage interconnection network.
Network Topologies: Multistage Omega Network • Omega network • Consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks). • Each stage consists of an interconnection pattern that connects p inputs and p outputs: • Perfect shuffle (left rotation): input i is connected to the output whose binary representation is the left rotation of that of i. • Each switch has two connection modes: • Pass-through connection: the inputs are sent straight through to the outputs. • Cross-over connection: the inputs to the switching node are crossed over and then sent out. • Has p/2 × log p switching nodes.
Network Topologies: Multistage Omega Network A complete Omega network with the perfect shuffle interconnects and switches can now be illustrated: A complete omega network connecting eight inputs and eight outputs. An omega network has p/2 × log p switching nodes, and the cost of such a network grows as Θ(p log p).
Network Topologies: Multistage Omega Network – Routing • Let s be the binary representation of the source and d be that of the destination processor. • The data traverses the link to the first switching node. If the most significant bits of s and d are the same, then the data is routed in pass-through mode by the switch; otherwise, it is routed in crossover mode. • This process is repeated for each of the log p switching stages, using the next most significant bit at each stage. • Note that this is not a non-blocking switch.
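The routing rule can be traced in code. The sketch below is an illustrative model, not taken from the slides: it tracks a message's position (an output-link number) after each stage, applying a perfect shuffle (left rotation) followed by a pass-through or crossover decision on the current bit of s and d. The function names and the link-numbering convention are assumptions.

```python
def rotate_left(x, n_bits):
    """Perfect shuffle: rotate the n_bits-wide binary representation of x left by one."""
    mask = (1 << n_bits) - 1
    return ((x << 1) | (x >> (n_bits - 1))) & mask


def omega_route(src, dst, n_bits):
    """Return the output-link position of the message after each of the n_bits stages."""
    pos = src
    links = []
    for stage in range(n_bits):
        pos = rotate_left(pos, n_bits)      # shuffle interconnect before the switch
        bit = n_bits - 1 - stage            # most significant bit first
        if ((src >> bit) & 1) != ((dst >> bit) & 1):
            pos ^= 1                        # crossover: leave on the switch's other output
        links.append(pos)                   # pass-through leaves pos unchanged
    return links


def conflicts(route_a, route_b):
    """True if two simultaneous messages need the same output link at some stage."""
    return any(a == b for a, b in zip(route_a, route_b))


if __name__ == "__main__":
    a = omega_route(0b010, 0b111, 3)
    b = omega_route(0b110, 0b100, 3)
    print([format(x, "03b") for x in a])
    print([format(x, "03b") for x in b])
    print("blocked:", conflicts(a, b))
```

Under this model every message ends at its destination after log p stages, and the two messages from the blocking example on these slides (010 to 111 and 110 to 100) contend for the same first-stage output link.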
Network Topologies: Multistage Omega Network – Routing An example of blocking in omega network: one of the messages (010 to 111 or 110 to 100) is blocked at link AB.
Network Topologies - Fixed Connection Networks (static) • Completely-connected Network • Star-Connected Network • Linear array • 2d-array or 2d-mesh or mesh • 3d-mesh • Complete Binary Tree (CBT) • 2d-Mesh of Trees • Hypercube
Evaluating Static Interconnection Network • One can view an interconnection network as a graph whose nodes correspond to processors and its edges to links connecting neighboring processors. The properties of these interconnection networks can be described in terms of a number of criteria. • (1) Set of processor nodes V. The cardinality of V is the number of processors p (also denoted by n). • (2) Set of edges E linking the processors. An edge e = (u, v) is represented by a pair (u, v) of nodes. If the graph G = (V, E) is directed, this means that there is a unidirectional link from processor u to v. If the graph is undirected, the link is bidirectional. In almost all networks that will be considered in this course communication links will be bidirectional. The exceptions will be clearly distinguished. • (3) The degree d_u of node u is the number of links containing u as an endpoint. If graph G is directed we distinguish between the out-degree of u (number of pairs (u, v) ∈ E, for any v ∈ V) and similarly, the in-degree of u. The degree d of graph G is the maximum of the degrees of its nodes, i.e. d = max_u d_u.
Evaluating Static Interconnection Network (cont.) • (4) The diameter D of graph G is the maximum of the lengths of the shortest paths linking any two nodes of G. A shortest path between u and v is the path of minimal length linking u and v. We denote the length of this shortest path by d_uv. Then, D = max_{u,v} d_uv. The diameter of a graph G denotes the maximum delay (in terms of number of links traversed) that will be incurred when a packet is transmitted from one node to the other of the pair that contributes to D (i.e. from u to v or the other way around, if D = d_uv). Of course such a delay would hold if messages follow shortest paths (the case for most routing algorithms). • (5) Latency is the total time to send a message including software overhead. Message latency is the time to send a zero-length message. • (6) Bandwidth is the number of bits transmitted in unit time. • (7) Bisection width is the number of links that need to be removed from G to split the nodes into two sets of about the same size (±1).
Network Topologies - Fixed Connection Networks (static) • Completely-connected Network • Each node has a direct communication link to every other node in the network. • Ideal in the sense that a node can send a message to another node in a single step. • Static counterpart of crossbar switching networks • Nonblocking • Star-Connected Network • One processor acts as the central processor. Every other processor has a communication link connecting it to this central processor. • Similar to bus-based network. • The central processor is the bottleneck.
Network Topologies: Completely Connected and Star Connected Networks Example of an 8-node completely connected network. (a) A completely-connected network of eight nodes; (b) a star connected network of nine nodes.
Network Topologies - Fixed Connection Networks (static) • Linear array • In a linear array, each node (except the two nodes at the ends) has two neighbors, one each to its left and right. • Extension: ring or 1-D torus (linear array with wraparound). • 2d-array or 2d-mesh or mesh • The processors are ordered to form a 2-dimensional structure (square) so that each processor is connected to its four neighbors (north, south, east, west), except perhaps for the processors on the boundary. • Extension of linear array to two dimensions: each dimension has √p nodes, with a node identified by a two-tuple (i, j). • 3d-mesh • A generalization of a 2d-mesh in three dimensions. Exercise: Find the characteristics of this network and its generalization in k dimensions (k > 2). • Complete Binary Tree (CBT) • There is only one path between any pair of nodes. • Static tree network: has a processing element at each node of the tree. • Dynamic tree network: nodes at intermediate levels are switching nodes and the leaf nodes are processing elements. • Communication bottleneck at higher levels of the tree. • Solution: increasing the number of communication links and switching nodes closer to the root.
Network Topologies: Linear Arrays Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Network Topologies: Two- and Three Dimensional Meshes Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.
Network Topologies: Tree-Based Networks Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Network Topologies: Fat Trees A fat tree network of 16 processing nodes.
Evaluating Network Topologies • Linear array • |V| = N, |E| = N − 1, d = 2, D = N − 1, bw = 1 (bisection width). • 2d-array or 2d-mesh or mesh • For |V| = N, we have a √N × √N mesh structure, with |E| ≤ 2N = O(N), d = 4, D = 2√N − 2, bw = √N. • 3d-mesh • Exercise: Find the characteristics of this network and its generalization in k dimensions (k > 2). • Complete Binary Tree (CBT) on N = 2^n leaves • For a complete binary tree on N leaves, we define the level of a node to be its distance from the root. The root is of level 0 and the number of nodes of level i is 2^i. Then, |V| = 2N − 1, |E| ≤ 2N − 2 = O(N), d = 3, D = 2 lg N, bw = 1.
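The figures for the linear array and the 2d-mesh can be checked numerically on small instances. The following is a minimal sketch: the helper names (`linear_array`, `mesh_2d`, `metrics`) are hypothetical, and the diameter is computed by brute-force breadth-first search from every node.

```python
from collections import deque


def linear_array(n):
    """Adjacency lists of a linear array with n nodes and no wraparound."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def mesh_2d(side):
    """Adjacency lists of a side x side mesh with no wraparound."""
    adj = {}
    for r in range(side):
        for c in range(side):
            candidates = ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1))
            adj[(r, c)] = [(rr, cc) for rr, cc in candidates
                           if 0 <= rr < side and 0 <= cc < side]
    return adj


def eccentricity(adj, start):
    """Length of the longest shortest path starting at `start` (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


def metrics(adj):
    """Return (|V|, |E|, degree d, diameter D) of an undirected graph."""
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    degree = max(len(nbrs) for nbrs in adj.values())
    diameter = max(eccentricity(adj, u) for u in adj)
    return len(adj), n_edges, degree, diameter


if __name__ == "__main__":
    print("linear array, N = 8:", metrics(linear_array(8)))
    print("4 x 4 mesh, N = 16:", metrics(mesh_2d(4)))
```

For N = 8 the linear array gives D = N − 1 = 7, and for the 4 × 4 mesh D = 2√N − 2 = 6, matching the formulas above.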
Network Topologies - Fixed Connection Networks (static) • 2d-Mesh of Trees • An N^2-leaf 2d-MOT consists of N^2 nodes ordered as in an N × N 2d-array (but without the links). The N rows and N columns of the 2d-MOT form N row CBTs and N column CBTs respectively. • For such a network, |V| = N^2 + 2N(N − 1), |E| = O(N^2), d = 3, D = 4 lg N, bw = N. The 2d-MOT possesses an interesting decomposition property: if the 2N roots of the CBTs are removed, we get four N/2 × N/2 2d-MOTs. • The 2d-Mesh of Trees (2d-MOT) combines the advantages of 2d-meshes and binary trees. A 2d-mesh has large bisection width but large diameter (√N). On the other hand, a binary tree on N leaves has small bisection width but small diameter. The 2d-MOT has small diameter and large bisection width. • A 3d-MOT can be defined similarly.
Network Topologies - Fixed Connection Networks (static) • Hypercube • The hypercube is the major representative of a class of networks that are called hypercubic networks. Other such networks are the butterfly, the shuffle-exchange graph, the de Bruijn graph, cube-connected cycles, etc. • Each vertex of an n-dimensional hypercube is represented by a binary string of length n. Therefore there are |V| = 2^n = N vertices in such a hypercube. Two vertices are connected by an edge if their strings differ in exactly one bit position. • Let u = u_1 u_2 ... u_i ... u_n. An edge is a dimension i edge if it links two nodes that differ in the i-th bit position. • This way vertex u is connected to vertex u^(i) = u_1 u_2 ... ū_i ... u_n with a dimension i edge. Therefore |E| = (N lg N)/2 and d = lg N = n. • The hypercube is the first network examined so far whose degree is not a constant but a very slowly growing function of N. The diameter of the hypercube is D = lg N. A path from node u to node v can be determined by correcting the bits of u to agree with those of v, starting from dimension 1 in a “left-to-right” fashion. The bisection width of the hypercube is bw = N/2. This is a result of the following property of the hypercube: if all N/2 edges of dimension i are removed from an n-dimensional hypercube, we get two hypercubes, each of dimension n − 1.
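The hypercube's adjacency rule and the bit-correcting path can be sketched with integer labels, where flipping one bit moves along one dimension. The names `hypercube_neighbors` and `hypercube_route` are illustrative, and mapping "dimension 1" to the most significant bit is an assumption about the slide's left-to-right convention.

```python
def hypercube_neighbors(u, n):
    """Neighbors of vertex u in an n-dimensional hypercube: flip each bit in turn."""
    return [u ^ (1 << i) for i in range(n)]


def hypercube_route(u, v, n):
    """Path from u to v obtained by correcting differing bits left to right (MSB first)."""
    path = [u]
    cur = u
    for i in reversed(range(n)):
        if (cur ^ v) & (1 << i):
            cur ^= 1 << i           # traverse the edge of this dimension
            path.append(cur)
    return path


if __name__ == "__main__":
    n = 4
    N = 1 << n
    edges = sum(len(hypercube_neighbors(u, n)) for u in range(N)) // 2
    print("vertices:", N, "edges:", edges, "degree:", n)
    print([format(x, "04b") for x in hypercube_route(0b0101, 0b0110, n)])
```

The number of hops on such a path equals the number of bit positions in which u and v differ, so it never exceeds lg N.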
Network Topologies: Hypercubes and their Construction Construction of hypercubes from hypercubes of lower dimension.
Performance Metrics for Parallel Systems • Number of processing elements p • Execution Time • Parallel runtime: the time that elapses from the moment a parallel computation starts to the moment the last processing element finishes execution. • Ts: serial runtime • Tp: parallel runtime • Total Parallel Overhead T0 • Total time collectively spent by all the processing elements, minus the running time required by the fastest known sequential algorithm for solving the same problem on a single processing element. • T0 = pTp − Ts
Performance Metrics for Parallel Systems • Speedup S: • The ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on p processing elements. • S = Ts(best)/Tp • Example: adding n numbers: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n) • Theoretically, speedup can never exceed the number of processing elements p (S ≤ p). • Proof: Assume the speedup is greater than p; then each processing element spends less than time Ts/p solving the problem. In this case, a single processing element could emulate the p processing elements and solve the problem in fewer than Ts units of time. This is a contradiction because speedup, by definition, is computed with respect to the best sequential algorithm. • Superlinear speedup: In practice, a speedup greater than p is sometimes observed. This usually happens when the work performed by a serial algorithm is greater than that of its parallel formulation, or due to hardware features that put the serial implementation at a disadvantage.
Example for Superlinear speedup • Superlinear speedup: • Example 1: Superlinear effects from caches: With a problem instance of size A and a 64KB cache, the cache hit rate is 80%. Assume a cache latency of 2ns and a DRAM latency of 100ns; then the memory access time is 2*0.8 + 100*0.2 = 21.6ns. If the computation is memory bound and performs one FLOP per memory access, this corresponds to a processing rate of 46.3 MFLOPS. With a problem instance of size A/2 per processor and a 64KB cache, the cache hit rate is higher, i.e., 90%; 8% of the accesses go to local DRAM and the other 2% go to remote DRAM with a latency of 400ns, so the memory access time is 2*0.9 + 100*0.08 + 400*0.02 = 17.8ns. The corresponding execution rate at each processor is 56.18 MFLOPS, and for two processors the total processing rate is 112.36 MFLOPS. The speedup is then 112.36/46.3 = 2.43!
Example for Superlinear speedup • Superlinear speedup: • Example 2: Superlinear effects due to exploratory decomposition: explore the leaf nodes of an unstructured tree. Each leaf has a label associated with it and the objective is to find a node with a specified label, say ‘S’. The solution node is the rightmost leaf in the tree. A serial formulation of this problem based on depth-first tree traversal explores the entire tree, i.e. all 14 nodes, taking 14 time units. Now consider a parallel formulation in which the left subtree is explored by processing element 0 and the right subtree is explored by processing element 1. The total work done by the parallel algorithm is only 9 nodes and the corresponding parallel time is 5 time units. The speedup is then 14/5 = 2.8.
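The arithmetic in the cache example can be reproduced in a few lines. This is a sketch using only the numbers stated on the slide; the helper names are made up, and the model (one FLOP per memory access, so rate = 1 / average access time) is the slide's own simplification.

```python
def avg_access_ns(levels):
    """Average memory access time from (fraction_of_accesses, latency_ns) pairs."""
    assert abs(sum(fraction for fraction, _ in levels) - 1.0) < 1e-9
    return sum(fraction * latency for fraction, latency in levels)


def mflops(access_ns):
    """One FLOP per memory access: 1 access per ns would be 1000 MFLOPS."""
    return 1e3 / access_ns


# One processor, problem size A: 80% cache hits, 20% DRAM.
serial_ns = avg_access_ns([(0.80, 2), (0.20, 100)])
# Two processors, problem size A/2 each: 90% cache, 8% local DRAM, 2% remote DRAM.
parallel_ns = avg_access_ns([(0.90, 2), (0.08, 100), (0.02, 400)])

serial_rate = mflops(serial_ns)
parallel_rate = 2 * mflops(parallel_ns)
speedup = parallel_rate / serial_rate

if __name__ == "__main__":
    print(f"serial:   {serial_ns:.1f} ns/access, {serial_rate:.1f} MFLOPS")
    print(f"parallel: {parallel_ns:.1f} ns/access, {parallel_rate:.2f} MFLOPS total")
    print(f"speedup on 2 processors: {speedup:.2f}")
```

The speedup exceeds 2 because each processor's smaller working set fits the cache better, not because any less work is done.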
Performance Metrics for Parallel Systems (cont.) • Efficiency E • Ratio of speedup to the number of processing elements. • E = S/p • A measure of the fraction of time for which a processing element is usefully employed. • Example: adding n numbers on n processing elements: Tp = Θ(log n), Ts = Θ(n), S = Θ(n/log n), E = Θ(1/log n) • Cost (also called Work or processor-time product) W • Product of parallel runtime and the number of processing elements used. • W = Tp*p • Example: adding n numbers on n processing elements: W = Θ(n log n). • Cost-optimal: a parallel system is cost-optimal if the cost of solving a problem on a parallel computer has the same asymptotic growth (in Θ terms), as a function of the input size, as the fastest-known sequential algorithm on a single processing element. • Problem Size W2 • The number of basic computation steps in the best sequential algorithm to solve the problem on a single processing element. • W2 = Ts of the fastest known algorithm to solve the problem on a sequential computer.
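The metrics defined so far (speedup, efficiency, cost, total overhead) follow directly from Ts, Tp and p. The sketch below plugs in the example of adding n numbers on n processing elements; the concrete timings (Ts = n − 1 unit-time additions, Tp = log2 n steps) are illustrative assumptions that ignore constant factors and communication cost.

```python
import math


def parallel_metrics(ts, tp, p):
    """Speedup, efficiency, cost and total overhead from serial/parallel runtimes."""
    speedup = ts / tp
    return {
        "speedup": speedup,            # S = Ts / Tp
        "efficiency": speedup / p,     # E = S / p
        "cost": p * tp,                # W = p * Tp
        "overhead": p * tp - ts,       # T0 = p * Tp - Ts
    }


if __name__ == "__main__":
    n = 1024
    # Assumed unit-time model: n - 1 serial additions, log2(n) parallel steps.
    m = parallel_metrics(ts=n - 1, tp=math.log2(n), p=n)
    for name, value in m.items():
        print(f"{name:>10}: {value:.4g}")
```

With these assumptions the cost p·Tp grows as n log n while Ts grows as n, so this formulation is not cost-optimal.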
Parallel vs Sequential Computing: Amdahl’s Law • Theorem 0.1 (Amdahl’s Law) Let f, 0 ≤ f ≤ 1, be the fraction of a computation that is inherently sequential. Then the maximum obtainable speedup S on p processors is S ≤ 1/(f + (1 − f)/p). • Proof. Let T be the sequential running time for the named computation. fT is the time spent on the inherently sequential part of the program. On p processors the remaining computation, if fully parallelizable, would achieve a running time of at least (1 − f)T/p. This way the running time of the parallel program on p processors is the sum of the execution times of the sequential and parallel components, that is, fT + (1 − f)T/p. The maximum allowable speedup is therefore S ≤ T/(fT + (1 − f)T/p) and the result is proven.
Amdahl’s Law • Amdahl used this observation to advocate the building of even more powerful sequential machines, as one cannot gain much by using parallel machines. For example, if f = 10%, then S ≤ 10 as p → ∞. The underlying assumption in Amdahl’s Law is that the sequential component of a program is a constant fraction of the whole program. In many instances, as problem size increases, the fraction of computation that is inherently sequential decreases. In many cases even a speedup of 10 is quite significant by itself. • In addition, Amdahl’s law is based on the concept that parallel computing always tries to minimize parallel time. In some cases a parallel computer is used to increase the problem size that can be solved in a fixed amount of time. For example, in weather prediction this would increase the accuracy of, say, a three-day forecast or would allow a more accurate five-day forecast.
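The bound S ≤ 1/(f + (1 − f)/p) is easy to tabulate. The sketch below (function name `amdahl_speedup` is illustrative) shows how the bound for f = 10% approaches, but never reaches, 1/f = 10 as p grows.

```python
def amdahl_speedup(f, p):
    """Upper bound on speedup with sequential fraction f on p processors."""
    return 1.0 / (f + (1.0 - f) / p)


if __name__ == "__main__":
    f = 0.10
    for p in (1, 10, 100, 1000, 10**6):
        print(f"p = {p:>7}: S <= {amdahl_speedup(f, p):.3f}")
    print(f"limit as p grows: 1/f = {1 / f:.1f}")
```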
Parallel vs Sequential Computing: Gustafson’s Law • Theorem 0.2 (Gustafson’s Law) Let the execution time of a parallel algorithm on p processors consist of a sequential segment fT and a parallel segment (1 − f)T, where the sequential segment is constant. The scaled speedup of the algorithm is then S = (fT + (1 − f)Tp)/(fT + (1 − f)T) = f + (1 − f)p. • For f = 0.05 and p = 20, we get S = 19.05, whereas Amdahl’s law gives S ≤ 10.26. [Figure: on p processors the run takes fT + (1 − f)T = T; the same work on 1 processor takes fT + (1 − f)Tp = T(f + (1 − f)p).] • Amdahl’s Law assumes that problem size is fixed when it deals with scalability. Gustafson’s Law assumes that running time is fixed.
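The two laws can be compared side by side. The sketch below assumes p = 20, which is the processor count that reproduces the figures 19.05 and 10.26 quoted for f = 0.05; the function names are illustrative.

```python
def gustafson_scaled_speedup(f, p):
    """Scaled speedup when running time is held fixed: S = f + (1 - f) p."""
    return f + (1.0 - f) * p


def amdahl_bound(f, p):
    """Speedup bound when problem size is held fixed: S <= 1 / (f + (1 - f) / p)."""
    return 1.0 / (f + (1.0 - f) / p)


if __name__ == "__main__":
    f, p = 0.05, 20   # p = 20 is an assumption inferred from the quoted results
    print(f"Gustafson: S  = {gustafson_scaled_speedup(f, p):.2f}")
    print(f"Amdahl:    S <= {amdahl_bound(f, p):.2f}")
```

The gap between the two numbers reflects the different assumptions: Gustafson's law lets the parallel part of the work grow with p, while Amdahl's law keeps the total work fixed.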
Brent’s Scheduling Principle (Emulations) • Suppose we have an efficient parallel algorithm with unlimited parallelism, i.e. an algorithm that runs on zillions of processors. In practice zillions of processors may not be available. Suppose we have only p processors. A question that arises is what we can do to “run” the efficient zillion-processor algorithm on our limited machine. • One answer is emulation: simulate the zillion-processor algorithm on the p-processor machine. • Theorem 0.3 (Brent’s Principle) Let a parallel algorithm require m operations and run in parallel time t. Then running this algorithm on a limited machine with only p processors requires time at most m/p + t. • Proof: Let m_i be the number of computational operations at the i-th step, i.e. m = m_1 + m_2 + ... + m_t. If we assign the p processors on the i-th step to work on these m_i operations, they can conclude in time ⌈m_i/p⌉ ≤ m_i/p + 1. Thus the total running time on p processors would be ⌈m_1/p⌉ + ... + ⌈m_t/p⌉ ≤ (m_1/p + 1) + ... + (m_t/p + 1) = m/p + t.
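Brent's bound can be checked on a concrete schedule. The sketch below emulates a step-by-step operation count on p processors, charging ⌈m_i/p⌉ time for step i, and compares the total against m/p + t. The example schedule (a tree sum of 16 numbers with 8, 4, 2, 1 additions per step) and the function names are illustrative assumptions.

```python
def emulated_time(ops_per_step, p):
    """Time on p processors when step i with m_i operations takes ceil(m_i / p)."""
    return sum((m_i + p - 1) // p for m_i in ops_per_step)


def brent_bound(ops_per_step, p):
    """Brent's upper bound m/p + t for m total operations in t parallel steps."""
    m = sum(ops_per_step)
    t = len(ops_per_step)
    return m / p + t


if __name__ == "__main__":
    # Assumed example: summing 16 numbers with a binary tree of additions.
    steps = [8, 4, 2, 1]
    for p in (1, 2, 3, 8):
        print(f"p = {p}: emulated time {emulated_time(steps, p)}, "
              f"bound {brent_bound(steps, p):.2f}")
```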
The Message Passing Interface (MPI): Introduction • The Message-Passing Interface (MPI) is an attempt to create a standard to allow tasks executing on multiple processors to communicate through some standardized communication primitives. • It defines a standard library for message passing that one can use to develop message-passing programs using C or Fortran. • The MPI standard defines both the syntax and the semantics of this functional interface to message passing. • MPI comes in a variety of flavors: freely available ones such as LAM-MPI and MPICH, and also commercial versions such as Critical Software’s WMPI. • It supports message passing on a variety of platforms, from Linux-based or Windows-based PCs to supercomputers and multiprocessor systems. • After the introduction of MPI, whose functionality includes a set of 125 functions, a revision of the standard took place that added C++ support, external memory accessibility and also Remote Memory Access (similar to BSP’s put and get capability) to the standard. The resulting standard is known as MPI-2 and has grown to almost 241 functions.