High-Performance Grid Computing and Research Networking Classic Examples of Shared Memory Program Presented by Yuming Zhang Instructor: S. Masoud Sadjadi http://www.cs.fiu.edu/~sadjadi/Teaching/ sadjadi At cs Dot fiu Dot edu
Acknowledgements • The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks! • Henri Casanova • Principles of High Performance Computing • http://navet.ics.hawaii.edu/~casanova • henric@hawaii.edu
Domain Decomposition • Now that we know how to create and manage threads, we need to decide which thread does what • This is really the art of parallel computing • Fortunately, in shared memory, it is often quite simple • We’ll look at three examples • “Embarrassingly” parallel application • load-balancing issue • “Non-embarrassingly parallel” application • thread synchronization issue • Shark & Fish simulation • load-balancing AND thread synchronization issue
Embarrassingly Parallel • Embarrassingly parallel applications • Consist of a set of elementary computations • These computations can be done in any order • They are said to be “independent” • Sometimes referred to as “pleasantly” parallel • Trivial Example: Compute all values of a function of two variables over a 2-D domain • function f(x,y) = <requires many flops> • domain = (]0,10],]0,10]) • domain resolution = 0.001 • number of points = (10 / 0.001)² = 10⁸ • number of processors and threads = 4 • each thread performs 25×10⁶ function evaluations • No need for critical sections • No shared output
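A minimal OpenMP sketch of this kind of computation (the function f below is just a stand-in for the expensive flop-heavy function, and the flat array layout is an illustrative choice, not from the slides):

```c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 10000   /* 10 / 0.001 points per dimension, as in the example above */

/* Stand-in for the expensive function of two variables. */
static double f(double x, double y) { return sin(x) * cos(y); }

int main(void)
{
    double *result = malloc((size_t)N * N * sizeof *result);
    if (result == NULL) return 1;

    /* Every evaluation is independent: a plain parallel loop suffices, and
       no critical section is needed since each cell is written exactly once.
       With 4 threads, each performs about 25 million evaluations. */
    #pragma omp parallel for num_threads(4) collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            result[(size_t)i * N + j] = f((i + 1) * 0.001, (j + 1) * 0.001);

    printf("f(10, 10) = %g\n", result[(size_t)N * N - 1]);
    free(result);
    return 0;
}
```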
Mandelbrot Set • In many cases, the “cost” of computing f varies with its input • Example: Mandelbrot • For each complex number c • Define the series • Z₀ = 0 • Zₙ₊₁ = Zₙ² + c • If the series converges, put a black dot at point c • i.e., if it hasn’t diverged after many iterations • If one partitions the domain into 4 squares among 4 threads, some of the threads will have much more work to do than others
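The per-point work can be sketched as follows (MAX_ITER and the escape test |z| > 2 are the usual conventions, not taken from the slides). The variable cost is visible directly: points inside the set always pay the full MAX_ITER iterations, while points far outside escape almost immediately.

```c
#include <stdio.h>
#include <complex.h>

#define MAX_ITER 1000   /* "many iterations" threshold; a conventional choice */

/* Return 1 if the series for c has not diverged after MAX_ITER steps
   (point drawn black), 0 otherwise.  |z| > 2 guarantees divergence. */
static int in_mandelbrot(double complex c)
{
    double complex z = 0;
    for (int n = 0; n < MAX_ITER; n++) {
        z = z * z + c;
        if (cabs(z) > 2.0)
            return 0;
    }
    return 1;
}

int main(void)
{
    printf("%d %d\n", in_mandelbrot(0), in_mandelbrot(2.0 + 0.0 * I));  /* prints: 1 0 */
    return 0;
}
```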
Mandelbrot and Load Balancing • The problem with partitioning the domain into 4 identical tiles is that it leads to load imbalance • i.e., suboptimal use of the hardware resources • Solution: • do not partition the domain in as many tiles as threads • instead use many more tiles than threads • Then have each thread operate as follows • compute a tile • when done “request” another tile • until there are no tiles left to compute • This is called a “master-worker” execution • confusing terminology that will make more sense when we do distributed memory programming
Mandelbrot implementation • Conceptually very simple, but how do we write code to do it? • Pthreads • Use some shared (protected) counter that keeps track of the next tile • the “keeping track” can be easy or difficult depending on the shape of the tiles • Threads read and update the counter each time • When the counter goes over some predefined value, terminate • OpenMP • Could be done in the same way • But OpenMP provides tons of convenient ways to do parallel loops • including “dynamic” scheduling strategies, which do exactly what we need! • Just write the code as a loop over the tiles • Add the proper pragma • And you’re done
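A hedged sketch of the OpenMP route described above: the loop runs over tiles rather than points, and schedule(dynamic) hands the next tile to whichever thread finishes first, which is exactly the master-worker behaviour we want. The names NTILES and compute_tile are illustrative, not from the slides.

```c
#define NTILES 256   /* many more tiles than threads, as suggested above */

/* Placeholder: compute every point of one Mandelbrot tile. */
static void compute_tile(int tile) { (void)tile; /* ... per-tile work ... */ }

int main(void)
{
    /* schedule(dynamic, 1): each thread grabs the next unassigned tile as
       soon as it is done with its current one, so expensive tiles do not
       cause load imbalance. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (int t = 0; t < NTILES; t++)
        compute_tile(t);
    return 0;
}
```

The Pthreads version would replace the pragma with a mutex-protected tile counter that each thread reads and increments until it goes past NTILES.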
Dependent Computations • In many applications, things are not so simple: elementary computations may not be independent • otherwise parallel computing would be pretty easy • A common example: • Consider a (1-D, 2-D, ...) domain that consists of “cells” • Each cell holds some “state”, for example: • temperature, pressure, humidity, wind velocity • RGB color value • The application consists of rule(s) that must be applied to update the cell states • possibly over-and-over in an iterative fashion • CFD, game of life, image processing, etc. • Such applications are often termed Stencil Applications • We have already talked about one example: Heat Transfer
Dependent Computations • A really simple (1-D) case: • Cell values: one floating point number • Program written with two arrays: • f_old • f_new • One simple loop: f_new[i] = f_old[i] + ... • In more “real” cases, the domain is 2-D (or worse), there are more terms, and the values on the right-hand side can be at time step m+1 as well • Example from: http://ocw.mit.edu/NR/rdonlyres/Nuclear-Engineering/22-00JIntroduction-to-Modeling-and-SimulationSpring2002/55114EA2-9B81-4FD8-90D5-5F64F21D23D0/0/lecture_16.pdf
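A minimal sketch of the two-array scheme, assuming a heat-transfer-like 1-D stencil (the coefficient 0.1 and the fixed boundary cells are illustrative choices, not from the slides):

```c
#include <string.h>

#define N     1000      /* number of cells (illustrative) */
#define STEPS 100       /* number of iterations (illustrative) */

static double f_old[N], f_new[N];

int main(void)
{
    for (int step = 0; step < STEPS; step++) {
        /* Interior cells: each new value depends only on old values,
           so all i can be updated independently (and in parallel). */
        for (int i = 1; i < N - 1; i++)
            f_new[i] = f_old[i] + 0.1 * (f_old[i - 1] - 2.0 * f_old[i] + f_old[i + 1]);

        /* Keep the boundary cells fixed (placeholder boundary condition),
           then make the new values the old ones for the next iteration. */
        f_new[0] = f_old[0];
        f_new[N - 1] = f_old[N - 1];
        memcpy(f_old, f_new, sizeof f_old);
    }
    return 0;
}
```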
Wavefront Pattern • Data elements are laid out as multidimensional grids representing a logical plane or space. • The dependency between the elements, often formulated by dynamic programming, results in computations known as a wavefront. • [Figure: a 2-D domain with example stencil shapes; cell (i,j) depends on cells (i-1,j-1), (i-1,j), and (i,j-1)]
The Longest-Common-Subsequence Problem • LCS • Given two sequences A = <a₁, a₂, …, aₙ> and B = <b₁, b₂, …, bₘ>, find the longest sequence that is a subsequence of both A and B. • If A = <c,a,d,b,r,z> and B = <a,s,b,z>, the longest common subsequence of A and B is <a,b,z>. • LCS is a valuable tool for finding information in amino acid sequences in biological genes. • Determine F[n, m] • Let F[i, j] be the length of the longest common subsequence of the first i elements of A and the first j elements of B.
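The recurrence for F is not spelled out on the slide, but it is the standard LCS dynamic program:

```latex
F[i,0] = F[0,j] = 0, \qquad
F[i,j] =
\begin{cases}
F[i-1,j-1] + 1 & \text{if } a_i = b_j,\\
\max\bigl(F[i-1,j],\ F[i,j-1]\bigr) & \text{otherwise.}
\end{cases}
```

Each entry therefore depends only on its north, west, and north-west neighbours, which is what produces the wavefront shown next.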
LCS Wavefront • F[i,j] depends on F[i-1,j-1], F[i-1,j], and F[i,j-1]. • The computation starts from F[0,0] and fills the memoization table diagonal by diagonal.
One example • Computing the LCS of the amino acid sequences <H, E, A, G, A, W, G, H, E, E> and <P, A, W, H, E, A, E>. F[n, m] = 5 is the answer.
Wavefront computation • How can we parallelize a wavefront computation? • We have seen that the computation consists of computing 2n-1 antidiagonals, in sequence. • Computations within each antidiagonal are independent, and can be done in a multithreaded fashion • Algorithm: • for each antidiagonal • use multiple threads to compute its elements • one may need to use a variable number of threads, because some antidiagonals are very small while others can be large • can be implemented with a single array
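A hedged OpenMP sketch of this algorithm, using the LCS table from the example above (the parallel region is re-created for every antidiagonal for clarity; a real implementation would keep the thread team alive and might also use blocking, as discussed next):

```c
#include <stdio.h>

#define N 10   /* length of A */
#define M 7    /* length of B */

static const char A[N + 1] = "HEAGAWGHEE";
static const char B[M + 1] = "PAWHEAE";
static int F[N + 1][M + 1];   /* row 0 and column 0 stay 0 */

static int max(int x, int y) { return x > y ? x : y; }

int main(void)
{
    /* Antidiagonal d contains the cells (i, j) with i + j = d.  The outer
       loop over antidiagonals is sequential; within one antidiagonal every
       cell reads only the two previous antidiagonals, so the inner loop
       can run in parallel. */
    for (int d = 2; d <= N + M; d++) {
        int ilo = (d - M > 1) ? d - M : 1;
        int ihi = (d - 1 < N) ? d - 1 : N;
        #pragma omp parallel for
        for (int i = ilo; i <= ihi; i++) {
            int j = d - i;
            F[i][j] = (A[i - 1] == B[j - 1]) ? F[i - 1][j - 1] + 1
                                             : max(F[i - 1][j], F[i][j - 1]);
        }
    }
    printf("LCS length = %d\n", F[N][M]);   /* 5 for these two sequences */
    return 0;
}
```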
Wavefront computation • What about cache efficiency? • After all, reading only one element from an antidiagonal at a time is probably not good • They are not contiguous in memory! • Solution: blocking • Just like matrix multiply • [Figure: the table partitioned into a 4×4 grid of blocks, one block column per thread p0–p3; block (i,j) is computed in phase i+j-1, so the blocks themselves form a wavefront with phases 1 through 7]
Workload Partitioning • First, the matrix is divided into groups of adjacent columns, as many groups as there are clusters. • Afterwards, the part within each cluster is partitioned again, and the computation is performed in the same wavefront fashion.
Performance Modeling • One thing we’ll need to do often in HPC is building performance models • Given simple assumptions regarding the underlying architecture • e.g., ignore cache effects • Come up with an analytical formula for the parallel speed-up • Let’s try it on this simple application • Let N be the (square) matrix size • Let p be the number of threads/cores, which is fixed
Performance Modeling • [Figure: the p² blocks of the table distributed among threads T0–T3] • What if we use p² blocks? • We assume that p divides N (N > p) • Then the computation proceeds in 2p-1 phases • each phase lasts as long as the time to compute one block (because of concurrency), Tb • Therefore • Parallel time = (2p-1) Tb • Sequential time = p² Tb • Parallel speedup = p² / (2p-1) • Parallel efficiency = p / (2p-1) • Example: • p=2, speedup = 4/3, efficiency = 66% • p=4, speedup = 16/7, efficiency = 57% • p=8, speedup = 64/15, efficiency = 53% • Asymptotically: efficiency = 50%
Performance Modeling • What if we use (b×p)² blocks? • b is some integer between 1 and N/p • We assume that p divides N (N > p) • But performance modeling becomes more complicated • The computation still proceeds in 2bp-1 phases • But a thread can have more than one block to compute during a phase! • During phase i, there are • i blocks to compute for i = 1, ..., bp • 2bp-i blocks to compute for i = bp+1, ..., 2bp-1 • If there are x (> 0) blocks to compute in a phase, then the execution time for that phase is ⌈x/p⌉ = ⌊(x-1)/p⌋ + 1 • Assuming Tb = 1 • Therefore, the parallel execution time is obtained by summing the per-phase times (see below)
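The closing formula did not survive the conversion of these slides; a plausible reconstruction from the phase counts above, writing xᵢ for the number of blocks in phase i and taking Tb = 1, is:

```latex
T_{\mathrm{par}} \;=\; \sum_{i=1}^{2bp-1} \left\lceil \frac{x_i}{p} \right\rceil
\;=\; \sum_{i=1}^{2bp-1} \left( \left\lfloor \frac{x_i - 1}{p} \right\rfloor + 1 \right),
\qquad x_i = \min(i,\; 2bp - i).
```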
Performance Modeling • Example: N = 1000, p = 4 (the accompanying figure is not reproduced in these notes)
Performance Modeling • When b gets larger, speedup increases and tends to p • Since b <= N/p, best speed-up: Np / (N + p -1) • When N is large compared to p, speedup is very close to p • Therefore, use a block size of 1, meaning no blocking! • We’re back to where we started because our performance model ignores cache effects! • Trade-off: • From a parallel efficiency perspective: small block size • From a cache efficiency perspective: big block size • Possible rule of thumb: use the biggest block size that fits in the L1 cache (L2 cache?) • Lesson: full performance modeling is difficult • We could add the cache behavior, but think of a dual-core machine with shared L2 cache, etc. • In practice: do performance modeling for asymptotic behaviors, and then do experiments to find out what works best
Sharks and Fish • Simulation of a population of prey and predators • Each entity follows some behavior • Prey move and breed • Predators move, hunt, and breed • Given the initial populations and the nature of the entity behaviors (e.g., probability of breeding, probability of a successful hunt), what do the populations look like after some time? • This is something computational ecologists do all the time to study ecosystems
Sharks and Fish • There are several possibilities to implement such a simulation • A simple one is to do something that looks like “the game of life” • A 2-D domain, with NxN cells (each cell can be described by many environmental parameters) • Each cell in the domain can hold a shark or a fish • The simulation is iterative • There are several rules for movement, breeding, preying • Why do it in parallel? • Many entities • Entity interactions may be complex • How can one write this in parallel with threads and shared memory?
Space partitioning • One solution is to divide the 2-D domain between threads • Each thread deals with the entities in its domain • [Figure: the domain split into 4 regions, one per thread (4 threads)]
Move conflict? • Threads can make decisions that will lead to conflicts! • e.g., two threads may each decide to move an entity into the same cell near the boundary between their regions
Dealing with conflicts • Concept of shadow cells • [Figure: each thread's region split into an interior (green) part and a border (red) part; only entities in the red regions may cause a conflict] • One possible implementation • Each thread deals with its green region • Thread 1 deals with its red region • Thread 2 deals with its red region • Thread 3 deals with its red region • Thread 4 deals with its red region • Repeat • This scheme still prevents some types of moves • e.g., no swapping of locations between two entities • The implementer must make choices
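A minimal OpenMP sketch of one such implementation, assuming the green/red split described above (update_interior and update_border are hypothetical helpers, and taking turns on the red regions with barriers is just one of the possible design choices):

```c
#include <omp.h>

/* Hypothetical per-thread update steps; in a real simulation they would
   move, breed, and hunt the entities in the given region. */
static void update_interior(int tid) { (void)tid; /* green region of tid */ }
static void update_border(int tid)   { (void)tid; /* red (shadow) region of tid */ }

static void simulation_step(void)
{
    #pragma omp parallel
    {
        int tid      = omp_get_thread_num();
        int nthreads = omp_get_num_threads();

        /* All interior ("green") regions can be updated concurrently:
           no entity there can interact with another thread's cells. */
        update_interior(tid);

        /* The border ("red") regions can conflict with a neighbour, so
           the threads take turns, separated by barriers. */
        for (int owner = 0; owner < nthreads; owner++) {
            #pragma omp barrier
            if (tid == owner)
                update_border(owner);
        }
    }
}

int main(void) { simulation_step(); return 0; }
```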
Load Balancing • What if all the fish end up in the same region? • because they move • because they breed • Then one thread has much more work to do than the others • Solution: dynamic repartitioning • Modify the partitioning so that the load is balanced • But perhaps one good idea would be to not do domain partitioning at all! • How about doing entity partitioning instead? • Better load balancing, but more difficult to deal with conflicts • One may use locks, but they have high overhead
Conclusion • Main lessons • There are many classes of applications, with many domain partitioning schemes • Performance modeling is fun but inherently limited • It’s all about trade-offs • overhead - load balancing • parallelism - cache usage • etc. • Remember, this is the easy side of parallel computing • Things will become much more complex in distributed memory programming