Frédéric Gava
The BSP Model (Bulk-Synchronous Parallel)
Background
(Figure: a spectrum of parallel programming approaches, from implicit to explicit: automatic parallelization, skeletons, data-parallelism, parallel extensions, concurrent programming; BSP is placed on this spectrum.)
The BSP model
A BSP architecture (figure: processor/memory pairs P/M linked by a network, with a unit of global synchronization) is characterized by:
• p: the number of processors
• r: the processors' speed
• L: the cost of a global synchronization
• g: the cost of a communication phase in which each processor sends or receives at most one word
Model of execution
(Figure: super-step i with local work w_i, communication g·h_i and barrier L, followed by super-step i+1 with w_{i+1}, g·h_{i+1} and L.)
Each super-step, starting at its beginning, consists of:
• local computing on each processor
• global (collective) communications between processors
• a global synchronization: the exchanged data become available for the next super-step
Cost(i) = (max_{0 ≤ x < p} w_x^i) + h_i·g + L
Cost model
• Cost(program) = sum of the costs of its super-steps
• BSP computation: scalable, portable, predictable
• BSP algorithm design: minimising W (computation time), H (communication volume), S (number of super-steps)
• Cost(program) = W + g*H + S*L
• g and L can be measured by benchmarking, so the cost can be predicted (a small helper is sketched below)
• Main principles:
  • load balancing minimises W
  • data locality minimises H
  • coarse granularity minimises S
• In general, data locality is good, network locality is bad!
• Typically, problem size n >>> p (slackness)
• Input/output distribution is even, but otherwise arbitrary
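As a minimal illustration of this prediction capability, the helper below evaluates the cost formula for given W, H, S and benchmarked g and L. All names are illustrative, not part of any BSP library:

    #include <stdio.h>

    /* Predicted BSP cost of a program: W + g*H + S*L, where W is the total
       computation, H the total communication volume (in words), S the number
       of super-steps, and g, L come from a benchmark of the machine. */
    static double bsp_predicted_cost(double W, double H, double S,
                                     double g, double L) {
        return W + g * H + S * L;
    }

    int main(void) {
        /* example numbers only: 1e9 operations, 1e6 words exchanged,
           10 super-steps, on a machine with g = 4 and L = 1e5 */
        printf("predicted cost = %g\n",
               bsp_predicted_cost(1e9, 1e6, 10, 4.0, 1e5));
        return 0;
    }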
A libertarian model
• No master:
  • homogeneous power of the nodes
  • a global (collective) decision procedure instead
• No god:
  • confluence (no divine intervention)
  • predictable cost
  • scalable performance
• Practiced but confined
Advantages and drawbacks
• Advantages:
  • allows cost prediction and is deadlock free
  • structures the execution and thus enables bulk sending; this can be very efficient (sending one message of 1000 bytes performs better than sending 1000 messages of 1 byte) on many architectures (multi-cores, clusters, etc.)
  • abstract architecture = portable
  • ...?
• Drawbacks:
  • some algorithmic patterns do not fit well in the BSP model: pipelines, etc.
  • some problems are difficult (or impossible) to fit into a coarse-grained execution model (fine-grained parallelism)
  • abstract architecture = it does not exploit some capabilities of particular architectures (clusters of multi-cores, grids) and thus needs other libraries or execution models
  • ...?
Example: broadcast
Direct broadcast (one super-step): processor 0 sends its n words to each of the other p−1 processors.
BSP cost = p·n·g + L
Broadcast with 2 super-steps: processor 0 first scatters its n words (n/p to each processor), then a total exchange rebuilds the full vector everywhere.
BSP cost = 2·n·g + 2·L
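Below is a minimal sketch of the one-super-step direct broadcast, assuming the classic BSPlib C primitives (bsp_put, bsp_sync, registration); the function name and the use of doubles are illustrative:

    #include <bsp.h>

    #define N 1024   /* hypothetical vector length */

    /* Direct broadcast in one super-step: processor 0 writes its N doubles
       into every other processor, h = (p-1)*N words, hence a cost of about
       p*N*g + L. The extra sync that makes the registration effective is
       left out of the cost accounting. */
    void broadcast_direct(double data[N]) {
        int p = bsp_nprocs();
        int s = bsp_pid();
        bsp_push_reg(data, N * sizeof(double));   /* allow remote puts into data */
        bsp_sync();
        if (s == 0)
            for (int dest = 1; dest < p; ++dest)
                bsp_put(dest, data, data, 0, N * sizeof(double));
        bsp_sync();                               /* data available everywhere */
        bsp_pop_reg(data);
    }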
BSP algorithms
• Matrices: multiplication, inversion, decomposition, linear algebra, etc.
• Sparse matrices: likewise.
• Graphs: shortest path, decomposition, etc.
• Geometry: Voronoi diagram, intersection of polygons, etc.
• FFT: Fast Fourier Transform
• Pattern matching
• Etc.
Parallel prefixes
• If we suppose an associative operator (+):
  • a+(b+c) = (a+b)+c, or better
  • a+(b+(c+d)) = (a+b) + (c+d), so (a+b) can be computed on one processor while (c+d) is computed on another processor
• Example:
Parallel Prefixes
• Classical log(p) super-steps method (figure: processors 0 to 3):
Cost = log(p) × (Time(op) + Size(d)×g + L)
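Here is a minimal sketch of this log(p) method for a prefix sum of one double per processor, again assuming the classic BSPlib primitives; each loop iteration is one super-step:

    #include <bsp.h>

    /* After the loop, x on processor i holds x_0 + x_1 + ... + x_i
       (inclusive prefix sum), in ceil(log2(p)) super-steps. */
    double prefix_sum(double x) {
        int p = bsp_nprocs();
        int s = bsp_pid();
        double in = 0.0;
        bsp_push_reg(&in, sizeof(double));
        bsp_sync();
        for (int step = 1; step < p; step *= 2) {
            in = 0.0;
            if (s + step < p)                        /* send my partial sum upward */
                bsp_put(s + step, &x, &in, 0, sizeof(double));
            bsp_sync();                              /* one super-step per iteration */
            x += in;                                 /* combine with received value */
        }
        bsp_pop_reg(&in);
        return x;
    }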
Parallel Prefixes
• Divide-and-conquer method (figure: processors 0 to 3)
Our parallel machine
• Cluster of PCs:
  • Pentium IV 2.8 GHz
  • 512 MB RAM
• A front-end Pentium IV 2.8 GHz, 512 MB RAM
• Gigabit Ethernet cards and switch
• Ubuntu 7.04 as OS
How to read benchmarks
• There are many ways to publish benchmarks:
  • tables
  • graphics
• The goal is to say « it is a good parallel method, see my benchmarks », but it is often easy to arrange the presentation of the graphics to hide the problems
• Using graphics (from the easiest way to hide problems to the hardest):
  • increase the size of the data for some fixed numbers of processors
  • increase the number of processors for a typical size of data
  • acceleration, i.e., Time(seq)/Time(par)
  • efficiency, i.e., Acceleration/Number of processors
  • increase both the number of processors and the size of the data
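For reference, a tiny helper (illustrative names) computing the acceleration and efficiency figures mentioned above:

    /* Acceleration (speedup) and efficiency, as defined on the slide. */
    static double acceleration(double t_seq, double t_par) {
        return t_seq / t_par;
    }

    static double efficiency(double t_seq, double t_par, int p) {
        return acceleration(t_seq, t_par) / p;   /* 1.0 means perfect scaling */
    }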
Super-linear acceleration?
• Better than the theoretical acceleration. Possible if the data fit in cache memory rather than in RAM, because each processor holds only a small part of the data.
• Why do caches matter? Each processor has a small amount of memory called cache, and accessing it is (almost) twice as fast as accessing RAM.
• Take for example the multiplication of matrices
“Fast” multiplication
• A straightforward C implementation of res = mult(A, B) (of size N*N) can look like this:

    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            for (k = 0; k < N; ++k)
                res[i][j] += a[i][k] * b[k][j];

• Consider the following equation: res[i][j] = Σ_k a[i][k] × bT[j][k], where “T” is the transposition of matrix “b”
“Fast” multiplication
• One can implement this equation in C as:

    double tmp[N][N];
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            tmp[i][j] = b[j][i];
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            for (k = 0; k < N; ++k)
                res[i][j] += a[i][k] * tmp[j][k];

where tmp is the transpose of “b”
• This new multiplication is about 2 times faster. With other cache optimisations, one can obtain programs up to 64 times faster without really modifying the algorithm.
Presentation
• We have a set of bodies:
  • coordinates in 2D or 3D
  • a point mass
• The classic N-body problem is to calculate the gravitational energy of the N point masses, that is: E = −G × Σ_{i<j} m_i·m_j / ‖r_i − r_j‖
• Quadratic complexity...
• In practice, N is very big and sometimes it is impossible to keep the whole set in main memory
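As a reminder of where the quadratic cost comes from, here is a minimal sequential sketch (2D, illustrative names) that sums the pairwise terms directly:

    #include <math.h>

    /* Direct O(N^2) computation of the gravitational energy of n point
       masses in 2D: each unordered pair is visited exactly once. */
    double gravitational_energy(int n, const double x[], const double y[],
                                const double m[], double G) {
        double e = 0.0;
        for (int i = 0; i < n; ++i)
            for (int j = i + 1; j < n; ++j) {
                double dx = x[i] - x[j], dy = y[i] - y[j];
                e -= G * m[i] * m[j] / sqrt(dx * dx + dy * dy);
            }
        return e;
    }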
Parallel methods
• Each processor has a sub-part of the original set
• Parallel method, on each processor:
  1. compute the local interactions
  2. compute the interactions with the other point masses
  3. parallel prefix of the local interactions
• For step 2, simple parallel methods:
  • using a total exchange of the sub-sets
  • using a systolic loop (a sketch follows the next slide)
• Cost of the systolic method:
Systolic loop
(Figure: processors 0 to 3; at each super-step every processor passes the sub-set it currently holds to the next processor, so after p−1 shifts each processor has seen all the point masses.)
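A minimal sketch of this systolic loop, assuming BSPlib-style primitives; the block size, the body type and the interaction routine are illustrative, not taken from the slides:

    #include <bsp.h>
    #include <math.h>
    #include <string.h>

    #define BLOCK 128   /* hypothetical number of point masses per processor */

    typedef struct { double x, y, m; } body_t;

    /* Add to *acc the terms -m_i*m_j/d for every body of `mine` against every
       body of `others` (G omitted; coincident points are skipped). */
    static void interact(const body_t *mine, const body_t *others,
                         int n, double *acc) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                double dx = mine[i].x - others[j].x, dy = mine[i].y - others[j].y;
                double d = sqrt(dx * dx + dy * dy);
                if (d > 0.0) *acc -= mine[i].m * others[j].m / d;
            }
    }

    /* Systolic loop: at every super-step each processor forwards the block it
       currently holds to processor (pid+1) mod p, so after p-1 shifts every
       processor has seen every block. Each ordered pair is counted once, so
       the global energy is half of the (prefix-)sum of the local accumulators. */
    double systolic_nbody(const body_t local[BLOCK]) {
        int p = bsp_nprocs(), s = bsp_pid();
        double acc = 0.0;
        body_t current[BLOCK], incoming[BLOCK];
        memcpy(current, local, sizeof current);
        bsp_push_reg(incoming, sizeof incoming);
        bsp_sync();
        interact(local, current, BLOCK, &acc);                /* local block    */
        for (int step = 1; step < p; ++step) {
            bsp_put((s + 1) % p, current, incoming, 0, sizeof incoming);
            bsp_sync();                                       /* one super-step */
            memcpy(current, incoming, sizeof current);
            interact(local, current, BLOCK, &acc);            /* visiting block */
        }
        bsp_pop_reg(incoming);
        return acc;
    }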
Parallel methods
• There exist many better algorithms than this one
• Especially when computing all the interactions is not needed (distant molecules)
• One classic algorithm divides the space into sub-spaces, computes the n-body problem recursively on each sub-space (which is itself split into sub-sub-spaces), and only considers interactions between these sub-spaces. The recursion stops when there are at most two molecules in a sub-space
• This brings the cost down to n·log(n) computations
Presentation
• Classic problem: find the prime numbers by enumeration
• Pure functional implementation using lists
• Complexity: n×log(n)/log(log(n))
• We used:
  • elim: int list → int → int list, which deletes from a list all the integers that are multiples of the given parameter
  • final elim: int list → int list → int list, which iterates elim
  • seq_generate: int → int → int list, which returns the list of integers between two bounds
  • select: int → int list → int list, which gives the first prime numbers of a list
Parallel methods
• Simple parallel methods:
  • using a kind of scan
  • using a direct sieve
  • using a recursive one
• Different partitions of the data:
  • per block (for the scan): 11,12,13,14,15 | 16,17,18,19,20 | 21,22,23,24,25
  • cyclic distribution: 11,14,17,20,23 | 12,15,18,21,24 | 13,16,19,22,25
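A small illustration (hypothetical helper names) of how a global index can be mapped to a processor under these two distributions:

    /* Which processor owns element i (0-based) of n elements,
       under a block distribution with roughly equal blocks? */
    static int owner_block(long i, long n, int p) {
        return (int)((i * p) / n);
    }

    /* Which processor owns element i under a cyclic distribution? */
    static int owner_cyclic(long i, int p) {
        return (int)(i % p);
    }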
Scan version
• Method using a scan:
  • each processor computes a local sieve (processor 0 thus holds the first prime numbers)
  • then our scan is applied, and on processor i we eliminate the integers that are multiples of the integers held by processors i−1, i−2, etc.
• Cost: as for a scan (logarithmic number of super-steps)
Direct version
• Method:
  • each processor computes a local sieve
  • then the integers smaller than √n are globally exchanged, and a new sieve is applied to this list of integers (thus giving the first prime numbers)
  • each processor then eliminates, in its own list, the integers that are multiples of these first primes
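A minimal sequential sketch (illustrative names, array-based rather than list-based) of the last step, marking in a local block the multiples of the small primes obtained after the exchange:

    /* Mark, in the local block [lo, hi), every multiple of the small primes
       (the primes up to sqrt(n) known after the total exchange). */
    static void eliminate_multiples(char composite[], long lo, long hi,
                                    const long primes[], int nprimes) {
        for (int i = 0; i < nprimes; ++i) {
            long p = primes[i];
            long first = ((lo + p - 1) / p) * p;   /* first multiple of p >= lo */
            if (first < p * p) first = p * p;      /* never mark p itself */
            for (long k = first; k < hi; k += p)
                composite[k - lo] = 1;
        }
    }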
Inductive version
• Recursive method, by induction over n:
  • we suppose that the inductive step gives the prime numbers smaller than √n
  • we perform a total exchange on them to eliminate the non-primes
  • the end of the induction comes from the BSP cost: we stop when n is small enough for the sequential method to be faster than the parallel one
• Cost:
Presentation
• Each processor holds a set of data (array, list, etc.)
• The goal is that:
  • the data on each processor are ordered
  • the data on processor i are smaller than the data on processor i+1
  • the load is well balanced
• Parallel sorting is not very efficient, because of the many communications it needs
• But it is useful, and more efficient than gathering all the data on one processor and sorting them there
Tiskin’s Sampling Sort
• Initial data on processors 0, 1, 2: 1,11,16,7,14,2,20 | 18,9,13,21,6,12,4 | 15,5,19,3,17,8,10
• Local sort: 1,2,7,11,14,16,20 | 4,6,9,12,13,18,21 | 3,5,8,10,15,17,19
• Select the first samples (p+1 elements, including at least the first and the last one): 1,7,14,20 | 4,9,13,21 | 3,8,15,19
• Total exchange of the first samples: every processor now holds 1,7,14,20,4,9,13,21,3,8,15,19
• Local sort of the samples (on each processor): 1,3,4,7,8,9,13,14,15,19,20,21
Tiskin’s Sampling Sort
• Select the second samples (p+1 elements, including at least the first and the last one) from the sorted sample list 1,3,4,7,8,9,13,14,15,19,20,21; they define the interval for processor 0, the interval for processor 1 and the interval for processor 2
• Each processor splits its locally sorted data according to these intervals and sends each piece to the corresponding processor:
  • 1,2,7,11,14,16,20 → 1,2,7 (to 0), 11,14 (to 1), 16,20 (to 2)
  • 4,6,9,12,13,18,21 → 4,6 (to 0), 9,12,13 (to 1), 18,21 (to 2)
  • 3,5,8,10,15,17,19 → 3,5 (to 0), 8,10 (to 1), 15,17,19 (to 2)
• Fusion (merge) of the received, already sorted, pieces: 1,2,3,4,5,6,7 | 8,9,10,11,12,13,14 | 15,16,17,18,19,20,21
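As a small illustration of the sampling step (one possible choice of positions, not necessarily the exact rule used on the slides), here is one way to pick p+1 regularly spaced samples from a sorted local array while always keeping its first and last elements:

    /* Pick p+1 regularly spaced samples from a sorted array of n elements,
       always including the first and the last element. */
    static void select_samples(const int sorted[], int n, int p,
                               int samples[/* p + 1 */]) {
        for (int i = 0; i < p; ++i)
            samples[i] = sorted[(long)i * (n - 1) / p];   /* regular positions */
        samples[p] = sorted[n - 1];                       /* keep the last one */
    }

On the first local block of the example, 1,2,7,11,14,16,20 (n = 7, p = 3), this picks 1, 7, 14, 20, which matches the first-sample step shown above.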