Embarrassingly Parallel (or pleasantly parallel)
Definition: Problems that scale well to thousands of processors
Characteristics
• Domain is divisible into a large number of independent parts
• Little or no communication between processors
• Each processor performs the same calculation independently
"Nearly embarrassingly parallel"
• Communication is limited to distributing and gathering data
• Computation dominates the communication
Embarrassingly Parallel Examples
[Figure: an embarrassingly parallel application, where processors P0, P1, P2 compute with no interaction, contrasted with a nearly embarrassingly parallel application, where data is first sent to P0–P3, each computes independently, and results are received back at the end.]
Low Level Image Processing
Note: Does not include communication to a graphics adapter
• Storage
• A two dimensional array of pixels
• One bit, one byte, or three bytes may represent each pixel
• Operations may only involve local data
• Image Applications (a short C sketch follows this list)
• Shift: newX = x + deltaX; newY = y + deltaY
• Scale: newX = x * scale; newY = y * scale
• Rotate a point about the origin: newX = x cos θ + y sin θ; newY = −x sin θ + y cos θ
• Clip: newX = x if minx <= x < maxx, 0 otherwise; newY = y if miny <= y < maxy, 0 otherwise
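Each operation reads only one coordinate at a time, which is what makes this embarrassingly parallel. A minimal C sketch of the four transformations (the Point struct and parameter names are illustrative assumptions, not from the slides):

#include <math.h>

typedef struct { int x, y; } Point;

Point shift(Point p, int deltaX, int deltaY) {
    return (Point){ p.x + deltaX, p.y + deltaY };
}

Point scalePoint(Point p, double scale) {
    return (Point){ (int)(p.x * scale), (int)(p.y * scale) };
}

/* Rotate about the origin by theta radians */
Point rotate(Point p, double theta) {
    return (Point){ (int)( p.x * cos(theta) + p.y * sin(theta)),
                    (int)(-p.x * sin(theta) + p.y * cos(theta)) };
}

/* Zero out coordinates that fall outside the clip window */
Point clip(Point p, int minX, int maxX, int minY, int maxY) {
    return (Point){ (p.x >= minX && p.x < maxX) ? p.x : 0,
                    (p.y >= minY && p.y < maxY) ? p.y : 0 };
}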
Non-trivial Image Processing
• Smoothing
• A function that captures important patterns while eliminating noise or artifacts
• Linear smoothing: apply a linear transformation to a picture
• Convolution: Pnew(x,y) = ∑j=0,m−1 ∑k=0,n−1 Pold(x+j, y+k) · f(j,k), where f is an m×n filter kernel (see the sketch after this list)
• Edge Detection
• A function that searches for discontinuities or variations in depth, surface, or color
• Purpose: significantly reduce follow-up processing
• Uses: pattern recognition and computer vision
• One approach: differentiate to identify large changes
• Pattern Matching
• Match an image against a template or a group of features
• Example: ∑i=0,X−1 ∑j=0,Y−1 (Picture(x+i, y+j) − Template(i,j))
Note: This is another digital signal processing application
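A minimal C sketch of the convolution formula above; the function name and the border policy (pixels whose kernel window would fall off the image are copied unchanged) are assumptions, not from the slides:

/* Convolve an H x W grayscale image with an m x n kernel f,
   both stored in row-major order. */
void convolve(const double *old, double *out, int H, int W,
              const double *f, int m, int n) {
    for (int x = 0; x < H; x++)
        for (int y = 0; y < W; y++) {
            if (x + m > H || y + n > W) {          /* kernel window off the edge */
                out[x*W + y] = old[x*W + y];
                continue;
            }
            double sum = 0.0;
            for (int j = 0; j < m; j++)
                for (int k = 0; k < n; k++)
                    sum += old[(x + j)*W + (y + k)] * f[j*n + k];
            out[x*W + y] = sum;
        }
}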
Array Storage
Row-major: values of the leftmost dimension are stored one after another (the rightmost index varies fastest)
Column-major: values of the rightmost dimension are stored one after another (the leftmost index varies fastest)
• The C language stores arrays in row-major order; Matlab and Fortran use column-major order
• Loops can be extremely slow in C if the outer loop iterates over columns (so the inner loop strides down rows), because consecutive accesses defeat the system memory cache
• int A[2][3] = { {1, 2, 3}, {4, 5, 6} };
In memory: 1 2 3 4 5 6
• int A[2][3][2] = {{{1,2}, {3,4}, {5,6}}, {{7,8}, {9,10}, {11,12}}};
In memory: 1 2 3 4 5 6 7 8 9 10 11 12
• Translate multi-dimension indices to single dimension offsets
• Two dimensions: offset = row*COLS + column
• Three dimensions: offset = i*DIM2*DIM3 + j*DIM3 + k
• What is the formula for four dimensions? (One possible answer appears in the sketch below)
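A short C sketch of the offset arithmetic, extending the pattern one more dimension to answer the question (D2..D4 name the sizes of all but the first dimension; the helper names are illustrative):

#include <stdio.h>

int offset2(int row, int col, int COLS)          { return row*COLS + col; }
int offset3(int i, int j, int k, int D2, int D3) { return i*D2*D3 + j*D3 + k; }
/* Four dimensions: each index is scaled by the product of all later dimensions */
int offset4(int i, int j, int k, int l, int D2, int D3, int D4) {
    return i*D2*D3*D4 + j*D3*D4 + k*D4 + l;
}

int main(void) {
    int A[2][3] = { {1, 2, 3}, {4, 5, 6} };
    int *flat = &A[0][0];
    /* A[1][2] sits at flat offset 1*3 + 2 = 5, so this prints 6 */
    printf("%d\n", flat[offset2(1, 2, 3)]);
    return 0;
}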
Process Partitioning
[Figure: a 1024×768 pixel image drawn as a grid, with 128 rows and 128 columns per displayed cell. With 1024 columns per row, row 0 holds pixels 0–1023 and row 2 holds pixels 2048–3071, so pixel 2053 lies at row 2, column 5. In a small 8-column example, row 0 holds pixels 0–7, row 1 holds 8–15, and row 2 holds 16–23, so pixel 21 also lies at row 2, column 5.]
Partitioning might assign groups of rows or columns to processors
Typical Static Partitioning
• Master
• Scatter or broadcast the image along with each processor's assigned rows (see the MPI sketch after this list)
• Gather the updated data back and perform final updates if necessary
• Slave
• Receive data
• Compute translated coordinates
• Participate in the collective gather operation
• Questions
• How does the master decide how much to assign to each processor?
• Is the load balanced (all processors working equally)?
• Notes on the text's shift example
• It employs individual sends and receives, which is much slower
• However, if coordinate positions change or results do not represent contiguous pixel positions, this might be required
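A minimal MPI sketch of the static scatter/compute/gather pattern, assuming a ROWS×COLS byte image whose row count divides evenly among the processes. A purely local operation (brightening) is used rather than a shift, since, as noted above, shifted results are not contiguous and would not fit a simple gather:

#include <mpi.h>
#include <stdlib.h>

#define ROWS 768
#define COLS 1024

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int myRows = ROWS / nprocs;                  /* assumes an even division */
    unsigned char *image = NULL;
    if (rank == 0) image = malloc(ROWS * COLS);  /* master holds the full image */
    unsigned char *strip = malloc(myRows * COLS);

    /* Master scatters one strip of rows to each process */
    MPI_Scatter(image, myRows * COLS, MPI_UNSIGNED_CHAR,
                strip, myRows * COLS, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    /* Each process transforms its strip independently */
    for (int i = 0; i < myRows * COLS; i++)
        strip[i] = strip[i] / 2 + 64;

    /* Master gathers the updated strips back in order */
    MPI_Gather(strip, myRows * COLS, MPI_UNSIGNED_CHAR,
               image, myRows * COLS, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}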
Mandelbrot Set
Definition: Those points C = (x,y) = x + iy in the complex plane for which the iteration zn+1 = zn^2 + C remains bounded
Implementation
z0 = 0 + 0i
For each point C = (x,y) with x, y in [−2,+2], iterate zn until either
• The iteration count reaches a limit (the point is taken to be in the set), or
• zn is out of bounds, |zn| > 2 (the point is not in the set)
Save the iteration count, which maps to a display color
• Complex plane display
• Horizontal axis: real values
• Vertical axis: imaginary values
Scaling and Zooming
• Display range of points
• From cmin = xmin + i*ymin to cmax = xmax + i*ymax
• Display range of pixels
• From the pixel at (0,0) to the pixel at (ROWS−1, COLUMNS−1)
• Pseudo code (rowmin..rowend is this process's assigned strip of rows)
For row = rowmin to rowend
    For col = 0 to COLUMNS−1
        cy = ymin + (ymax − ymin) * row / ROWS
        cx = xmin + (xmax − xmin) * col / COLUMNS
        color = mandelbrot(cx, cy)
        picture[COLUMNS*row + col] = color
Pseudo code (mandelbrot(cx, cy))
SET z = zreal + i*zimaginary = 0 + 0i
SET iterations = 0
DO
    SET z = z^2 + C    // temp = zreal; zreal = zreal^2 − zimaginary^2 + cx
                       // zimaginary = 2 * temp * zimaginary + cy
    SET value = zreal^2 + zimaginary^2
    iterations++
WHILE value <= 4 AND iterations < max
RETURN iterations
Notes:
• The final iteration count determines each point's color
• Some points converge quickly, others slowly, and others not at all
• Non-converging points are in the Mandelbrot Set (black on the previous slide)
• Since 4^(1/2) = 2, testing value <= 4 is equivalent to |z| <= 2, so we never need to compute a square root
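A direct C translation of this pseudo code (MAX_ITER is an assumed iteration limit, not from the slides):

#define MAX_ITER 256

/* Returns the iteration count for the point C = cx + i*cy */
int mandelbrot(double cx, double cy) {
    double zr = 0.0, zi = 0.0, value = 0.0;
    int iterations = 0;
    do {
        double temp = zr;
        zr = zr*zr - zi*zi + cx;      /* real part of z^2 + C */
        zi = 2.0*temp*zi + cy;        /* imaginary part of z^2 + C */
        value = zr*zr + zi*zi;        /* |z|^2, avoiding the square root */
        iterations++;
    } while (value <= 4.0 && iterations < MAX_ITER);
    return iterations;
}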
Parallel Implementation
Both the static and dynamic algorithms are examples of load balancing
• Load balancing
• Algorithms that keep processors from becoming idle
• Note: a balanced load does NOT require identical work assignments, only that processors finish at about the same time
• Static approach
• The load is assigned once, at the start of the run
• Mandelbrot: assign each processor a group of rows
• Deficiency: not load balanced, because some rows iterate far longer than others
• Dynamic approach
• The load is assigned dynamically during the run
• Mandelbrot: slaves ask for more work when they complete a section
The Dynamic Approach
• The master's work is increased somewhat
• It must send rows when it receives requests from slaves
• It must be responsive to slave requests; a separate thread can help, or the master can use MPI's asynchronous receive calls
• Termination
• Slaves terminate when they receive a "no work" indication in a message
• The master must not terminate until all of the slaves complete
• Partitioning of the load
• The master receives back blocks of pixels; slaves receive (x,y) coordinate ranges
• Partitions can be in columns or in rows. Which is better?
• Refinement: ask for the next block before completing the current one (double buffering)
An MPI sketch of this master/slave loop follows.
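A minimal MPI sketch of the dynamic work pool, assuming the mandelbrot() function sketched earlier and at least one row per slave; the tag names and one-row message layout are illustrative assumptions, not from the slides:

#include <mpi.h>

#define ROWS 768
#define COLS 1024
#define TAG_WORK 1
#define TAG_STOP 2

int mandelbrot(double cx, double cy);   /* from the sketch above */

void master(int nprocs) {
    MPI_Status status;
    int nextRow = 0, active = nprocs - 1;
    int results[COLS + 1];              /* results[0] = row, results[1..COLS] = colors */

    for (int p = 1; p < nprocs; p++) {  /* seed every slave with one row */
        MPI_Send(&nextRow, 1, MPI_INT, p, TAG_WORK, MPI_COMM_WORLD);
        nextRow++;
    }
    while (active > 0) {                /* collect a finished row, hand out the next */
        MPI_Recv(results, COLS + 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);
        /* ... copy results[1..COLS] into picture row results[0] ... */
        int tag = (nextRow < ROWS) ? TAG_WORK : TAG_STOP;
        MPI_Send(&nextRow, 1, MPI_INT, status.MPI_SOURCE, tag, MPI_COMM_WORLD);
        if (tag == TAG_WORK) nextRow++;
        else active--;                  /* slave told to stop; wait for the rest */
    }
}

void slave(void) {
    MPI_Status status;
    int row, results[COLS + 1];
    while (1) {
        MPI_Recv(&row, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
        if (status.MPI_TAG == TAG_STOP) break;     /* the "no work" indication */
        results[0] = row;
        for (int col = 0; col < COLS; col++)       /* map the pixel into [-2,2] x [-2,2] */
            results[col + 1] = mandelbrot(-2.0 + 4.0*col/COLS,
                                          -2.0 + 4.0*row/ROWS);
        MPI_Send(results, COLS + 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
}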
Monte Carlo Methods
Pseudo code (throw darts to converge on a solution)
• Compute a definite integral
While more iterations needed
    Pick a random point x in [xmin, xmax]
    total += f(x)
result = (xmax − xmin) * total / iterations
That is, ∫ f(x) dx ≈ (xmax − xmin) · (1/N) ∑i=1,N f(xi)
• Calculation of PI
While more iterations needed
    Randomly pick a point (x,y)
    If the point is in the circle, within++
Compute PI = 4 * within / iterations
Sampling only the upper right quadrant gives the same in-circle fraction (π/4), so the factor of 4 remains
Note: Parallel programs shouldn't use the standard random number generator, since each process could produce identical or overlapping sequences
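A minimal C sketch of the integral estimate; f, the interval, and N are caller-supplied, and rand() is used only for brevity (see the note above and the parallel generator slide):

#include <stdlib.h>

/* Estimate the integral of f over [xmin, xmax] with N random samples:
   (xmax - xmin) * (1/N) * sum of f(x_i). */
double monte_carlo_integral(double (*f)(double), double xmin, double xmax, int N) {
    double total = 0.0;
    for (int i = 0; i < N; i++) {
        double x = xmin + (xmax - xmin) * rand() / (double)RAND_MAX;
        total += f(x);
    }
    return (xmax - xmin) * total / N;
}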
Computation of PI
∫ sqrt(1 − x^2) dx = π/2 for −1 <= x <= 1
∫ sqrt(1 − x^2) dx = π/4 for 0 <= x <= 1
A point is within the circle if point.x^2 + point.y^2 <= 1
Points within / total points ≈ area in shape / total area
• Questions:
• How do we handle the boundary condition (points exactly on the circle)?
• What is the best accuracy that we can achieve?
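A minimal sketch of the dart-throwing estimate over the upper right quadrant (again using rand() purely for illustration):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int N = 10000000;
    int within = 0;
    for (int i = 0; i < N; i++) {
        double x = rand() / (double)RAND_MAX;   /* random point in the unit square */
        double y = rand() / (double)RAND_MAX;
        if (x*x + y*y <= 1.0) within++;         /* inside the quarter circle */
    }
    /* The quarter-circle area is pi/4, so pi is 4 times the in-circle fraction */
    printf("pi ~= %f\n", 4.0 * within / N);
    return 0;
}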
Parallel Random Number Generator
• Numbers of a pseudo random sequence should be uniformly distributed, statistically independent, and repeatable, with a large period
• Each processor must generate a unique sequence
• Accuracy depends upon the precision of the random sequence
• Sequential linear congruential generator (for this popular choice, m is prime and c = 0)
• xi+1 = (a*xi + c) mod m (ex: a = 16807, m = 2^31 − 1, c = 0)
• Many other generators are possible
• Parallel linear generator with unique sequences, where k is the "jump" constant
• xi+k = (A*xi + C) mod m
• A = a^k mod m, C = c*(a^(k−1) + a^(k−2) + … + a + 1) mod m
• If k = P (the number of processors), we can compute A, C, and the first k random numbers to get started; processor j then generates xj, xj+P, xj+2P, …
[Figure: the sequence x1, x2, …, xP−1, xP, xP+1, …, x2P−2, x2P−1 divided so that each processor leapfrogs through every P-th value of the one global sequence.]
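A minimal C sketch of this leapfrog scheme with c = 0, so C = 0 and only the jump multiplier A = a^P mod m is needed; the seed, P, rank j, and the power_mod helper are illustrative assumptions, not from the slides:

#include <stdio.h>
#include <stdint.h>

#define A0 16807ULL                 /* the multiplier from the slide */
#define M  2147483647ULL            /* m = 2^31 - 1 */

/* Compute a^k mod m by repeated squaring */
uint64_t power_mod(uint64_t a, uint64_t k, uint64_t m) {
    uint64_t r = 1;
    while (k) {
        if (k & 1) r = r * a % m;
        a = a * a % m;
        k >>= 1;
    }
    return r;
}

int main(void) {
    int P = 4, j = 2;                    /* illustrative: 4 processors, this is rank 2 */
    uint64_t A = power_mod(A0, P, M);    /* jump constant A = a^P mod m */
    uint64_t x = 1;                      /* seed x0, shared by all processors */
    for (int i = 0; i < j; i++)          /* advance to this processor's start, xj */
        x = x * A0 % M;
    for (int n = 0; n < 5; n++) {        /* generate xj, xj+P, xj+2P, ... */
        printf("%llu\n", (unsigned long long)x);
        x = x * A % M;
    }
    return 0;
}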