270 likes | 518 Views
Bulk Synchronous Processing (BSP) Model. Course: CSC 8350 Instructor: Dr. Sushil Prasad Presented by: Chris Moultrie. Outline. The Model Computation on BSP Model Automatic Memory Management Matrix Multiplication Computational Analysis BSP vs. PRAM BSPRAM. The Model.
E N D
Bulk Synchronous Processing (BSP) Model • Course: CSC 8350 • Instructor: Dr. Sushil Prasad • Presented by: Chris Moultrie
Outline • The Model • Computation on BSP Model • Automatic Memory Management • Matrix Multiplication • Computational Analysis • BSP vs. PRAM • BSPRAM
The Model • This model was proposed by Leslie Valiant in 1990. • Combination of 3 attributes: • Components: to perform processing and/or memory functions • Router:to deliver messages among components • Periodicity parameter L: to facilitate synchronization at regular intervals of L time units.
Computation on BSP Model • A computation consists of several supersteps • Asuperstep consists of: • A computation where each processor uses only locally held values • A global message transmission from each processor to any subset of others • A barrier synchronization • At the end of a superstep, the transmitted messages become available as local data for the next superstep
Continued.. • The components can be seen as processors • The router can be seen as inter-connection network • The periodicity parameter can be seen as barrier. Virtual Processors Local Computation Global Communication Barrier Synchronization
Components (Processors) • No need for programmers to manage memory, assign communication and perform low-level synchronization. • This is achieved by programs written with sufficient parallel slackness. • When programs written for v virtual processors are run on p real processors with v >> p (e.g. v = p log p) then there is parallel slackness. • Parallel slackness makes work distribution more balanced (than in cases such as v=p or v < p).
Barrier Synchronization • After each period of L time units (periodicity parameter), a global check is made to determine whether each processor has completed its task. • If all processors have completed the superstep the machine proceeds to next superstep • Otherwise, the next period of L units is allocated to the unfinished superstep. • Synchronization can be switched off for a subset of processors. However, they can still send messages over the network.
Continued.. • What is the optimal value for L? • The lower bound is set by the hardware. • The upper bound is set by the software, which in turn defines the granularity of the system. • When each processor has an independent task of approximately L steps, only then optimal processor utilization can be achieved.
The Network (Router) • The network delivers messages point to point. • It assumes no combining, duplicating or broadcasting facilities. • It basically realizes arbitrary h-relations • That is each processor sends at most h messages and receives at most h messages.
Continued.. • If ĝ is network throughput when it is in continuous operation and s is the latency or startup cost then h-relation is ĝh + s. • If ĝh > s, then we can let g = 2ĝ and the cost of a h-relation becomes gh (an overestimate of at most 2). • h-relations therefore can be realized in gh time for h larger than some h0 • If L > gh0 then every h-relation for h < h0 will cost as if it were a h0 relation.
Continued.. • Value of g is dictated by the network design. • By increasing the bandwidth of network connections and providing better switching the value of g is kept low. • As p increases, the required communication can increase with p2 and to maintain a fixed or low g, network costs increase similarly.
Automatic Memory Management • Random distribution, equally frequent access • If p accesses are made to p components, one component will get about (log p/log log p) accesses with high probability. Which will need Ω (log p/ log log p) time units. • If (p log p) accesses are made the probability is high that each processor will get no more than (3log p) and the time requirement will be O(log p) • In general, if pf(p) accesses are made, where f(p) grows faster than (log p) the worst-case access will exceed the average rate by even smaller factors. • To make the mapping from symbolic address to physical address efficiently computable hashing can be used.
Matrix Multiplication n/√p X n/√p • n = 16 • p = 4 • Each processor has to perform 2n3/p additions and multiplications, and receives 2n2/√p messages. • Every processor has 2n2/p elements which are to be sent at most √p times. • This may be achieved by data replication at source when g = O(n/√p) and L = O(n3/p) provided h is suitably small. n X n
Matrix Multiplication on Hypercube • Let us assume that in g units of time a packet can traverse one edge of the hypercube. • That is, a packet takes O (g log p) time to go to an arbitrary destination. • In the previous example, the computational (local) bounds stay intact when implemented on a hypercube. • The communication now becomes O(n logp/ √p). • Therefore, L = O(n3/p) suffice, if network can realize the h-relation (g log p) given above.
Computational Analysis • The execution time for one superstep Si of a BSP program consisting of S supersteps is given by: • wi +ghi + L • Where, wi is the largest amount of work done and hi is the largest number of messages sent or received during superstep Si. • The execution time of entire program is • W + gH + LS • Where, W = Σ(wi) and H = Σ(hi), for i = 0 to s-1
BSP vs. PRAM • BSP can be regarded as a generalization of the PRAM model. • If the BSP architecture has a small value of g (g=1), then it can be regarded as PRAM. • Hashing can be used to automatically achieve efficient memory management • The value of L determines the degree of parallel slackness required to achieve optimal efficiency. • If L = g = 1 … corresponds to idealized PRAM where no slackness is required.
BSPRAM • Variant of BSP, intended to support shared-memory style programming. • There are two levels of memory • Local memory of individual processors • Shared global memory • The network is implemented as a random-access shared memory unit. • As in BSP the computation proceeds in supersteps. • A superstep consists of an input phase, a local computation phase, and an output phase.
Continued.. • In the input phase a processor can read data from the main memory; in the output phase it can write data to the main memory. • The processors are synchronized between supersteps. • The computation within a superstep are asynchronous. • There are two types of BSPRAM, EREW BSPRAM, in which every cell of memory can be read from and written to only once in every superstep, and CRCW BSPRAM, which has no such restriction on memory access.
Computational Analysis • We will assume for the sake of convenience that if a value x is being written to a memory cell containing value y, the result may be determined by any function f(x,y) computable in O(1) time. • Similarly if values x1, x2, x3,….., xm are being written to a main memory cell containing the value y, the result may be determined by any prescribed function f(x1⊕….⊕Xm,y) where ⊕ is a commutative and associative operator and both f and ⊕ are computable in time O(1).
Continued.. • The computation cost is similar to BSP and can be given by • w + hg + l, where w is the total number of local operations performed by each processor, and h is defined as a sum of number of data units read from and written to the main memory. g and l are fixed parameter of computer. • We write BSPRAM(p,g,l) to denote a BSPRAM with the given values for p,g, and l • An Asynchronous EREW PRAM charges a unit cost for global read/write operation, d units for communication startup and B units for synchronization, which is equivalent to EREW BSPRAM (p,1,d+B)
Simulation • For efficient BSPRAM simulation on BSP some extra “parallelism” is necessary. • A BSPRAM algorithm has a slackness σ, if the communication cost of each of its supersteps is at least σ. • An optimal randomized simulation on BSP (p,g,l) can be achieved for • Theorem (i) Any EREW BSPRAM (p,g,l) algorithm with slackness σ ≥ log p; • Theorem (ii) Any CRCW BSPRAM (p,g,l) algorithm with slackness σ ≥ pε for some ε > 0.
Continued.. • A BSPRAM algorithm is said to be communication-oblivious, if the sequence of communication and synchronization operations executed by any processor are the same for any size input but no such restrictions are made for local computation. • A BSPRAM algorithm is said to have granularity ˠ if all memory cells used by the algorithm can be partitioned into granules of size at least ˠ. • σ ≥ ˠ
Matrix multiplication on BSPRAM • We need to multiply two matrices X and Y and output the result matrix Z. • Zik = Xij + Yjk, for j = 1,…,n. here 1 ≤ i,k ≤ n • Initialization: Zik <- 0 for i,k = 1,…,n • Computation: Vijk <- XijYjk,Zik <- Zik + Vijk for all i,j,k, 1 ≤ i,j,k ≤ n • Computation of different triples i,j,k is independent and therefore can be performed in parallel.
Continued.. • The array V =(Vijk) is represented as a cube of volume n3 in integer three-dimensional space • The matrices are represented as projections of the cube. • Computation of point Vijk requires the input of its X and Y projections xij and yjk, and the output of its Z projection Zik.
Continued.. • In order to provide a communication efficient BSP algorithm, the array V must be divided into p regular cubic blocks of size n/p1/3 • There will be n/p2/3 such partitions for each matrix. • Each processor can compute a block product sequentially. • Cost analysis: • W = O(n3/p), H = O(n2/ p2/3), S = O(1) • The algorithm is oblivious with σ = ˠ = n2/p2/3
References • Leslie G. Valiant, A bridging model for parallel computation, Communications of the ACM, 1990 • Alexandre Tiskin, The bulk-synchronous parallel random access machine, Theoretical computer science, 1998 - Elsevier