Decomposition, Locality, and Barriers (finishing the synchronization lecture)
Kathy Yelick
yelick@cs.berkeley.edu
www.cs.berkeley.edu/~yelick/cs194f07
Lecture Schedule for Next Two Weeks
• Fri 9/14, 11:00-12:00
• Mon 9/17, 10:30-11:30 Discussion (11:30-12:00 as needed)
• Wed 9/19, 10:30-12:00 Decomposition and Locality
• Fri 9/21, 10:30-12:00 NVIDIA Lecture
• Mon 9/24, 10:30-12:00 NVIDIA Lecture (David Kirk)
• Tue 9/25, 3:00-4:30 NVIDIA research talk in Woz (optional)
• Wed 9/26, 10:30-11:30 Discussion
• Fri 9/27, 11:00-12:00 Lecture, topic TBD
Outline
• Task decomposition
  • Review of parallel decomposition and task graphs
  • Styles of decomposition
  • Extending task graphs with interaction information
• Example: Sharks and Fish (Wator)
  • Parallel vs. sequential versions
• Data decomposition
  • Partitioning rectangular grids (like matrix multiply)
  • Ghost regions (unlike matrix multiply)
• Bulk-synchronous programming
  • Barrier synchronization
Designing Parallel Algorithms
• Parallel software design starts with decomposition
• Decomposition techniques
  • Recursive decomposition
  • Data decomposition: input, output, or intermediate data
  • And others
• Characteristics of tasks and interactions
  • Task generation, granularity, and context
  • Characteristics of task interactions
Recursive Decomposition: Example
• Consider parallel quicksort: once the array is partitioned, each subarray can be processed in parallel (a sketch follows below).
Source: Ananth Grama
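The slide carries no code, so here is a minimal sketch of the idea in C. OpenMP tasks are one possible way to spawn the two independent subarray sorts; OpenMP and the names below are my assumptions, not part of the lecture.

    /* Minimal parallel quicksort sketch. Compile with: cc -fopenmp quicksort.c */
    #include <omp.h>

    static void quicksort(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int pivot = a[hi], i = lo, t;
        for (int j = lo; j < hi; j++)           /* sequential partition step */
            if (a[j] < pivot) { t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        t = a[i]; a[i] = a[hi]; a[hi] = t;      /* place the pivot */
        #pragma omp task shared(a)              /* the two subarrays are now */
        quicksort(a, lo, i - 1);                /* independent tasks ...     */
        #pragma omp task shared(a)
        quicksort(a, i + 1, hi);                /* ... processed in parallel */
        #pragma omp taskwait
    }

    int main(void) {
        int a[] = {5, 2, 9, 1, 7, 3};
        #pragma omp parallel
        #pragma omp single                      /* one thread seeds the task tree */
        quicksort(a, 0, 5);
        return 0;
    }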
Data Decomposition
• Identify the data on which computations are performed.
• Partition this data across tasks.
• This partitioning induces a decomposition of the computation, often via the following rule:
  • Owner computes rule: the thread assigned to a particular data item is responsible for all computation associated with it.
• The owner computes rule is especially common for output data. Why?
Source: Ananth Grama
Input Data Decomposition
• Recall: apply a function sqr to the elements of array A, then compute their sum.
• Conceptualize the decomposition as a task dependency graph: a directed graph with
  • nodes corresponding to tasks, and
  • edges indicating dependencies: the result of one task is required to process the next.
• Here the tasks sqr(A[0]), sqr(A[1]), sqr(A[2]), …, sqr(A[n]) all feed one final sum task.
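As a concrete illustration (mine, not from the slides): each thread squares its own chunk of A, and the partial sums are combined by a reduction. OpenMP is assumed only for brevity.

    /* Input data decomposition sketch: the sqr(A[i]) tasks are the loop
       iterations; the reduction clause realizes the final sum node of
       the task dependency graph. */
    double sum_of_squares(const double *A, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += A[i] * A[i];       /* sqr(A[i]), then feed the sum */
        return sum;
    }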
Output Data Decomposition: Example
• Consider matrix multiply C = A × B with each matrix partitioned into 2×2 blocks.
• Decomposing by output data gives four tasks, one per block of C, e.g. Task 1 computes C1,1 = A1,1·B1,1 + A1,2·B2,1.
Source: Ananth Grama
A Model Problem: Sharks and Fish
• An illustration of parallel programming
• Original version (discrete event only) proposed by Geoffrey Fox
• Called Wator
• Basic idea: sharks and fish living in an ocean
  • Rules for breeding, eating, and death
  • Could also add: forces in the ocean, between creatures, etc.
• Ocean is toroidal (a 2D donut)
Serial vs. Parallel
• Updating points left-to-right in row-wise order gives a serial algorithm
• The dependencies prevent parallelization, so instead we use a "red-black" order (see the sketch below):

    forall black points grid(i,j)
        update …
    forall red points grid(i,j)
        update …

• For 3D or a general graph, use graph coloring
  • Can use repeated maximal independent sets to color
  • Graph(T) is bipartite => 2-colorable (red and black)
  • All nodes of one color can be updated simultaneously
• Can also use two copies of the grid (old and new), but in both cases:
  • Note we have changed the behavior of the original algorithm!
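A minimal sketch of the red-black sweep, assuming a simple 4-point stencil as the per-cell rule. The update() function is a hypothetical stand-in for the actual Wator rule, and OpenMP stands in for "forall".

    #define N 64
    static double grid[N][N];

    /* Hypothetical per-cell rule; any rule that reads only the other
       color's points preserves the red-black independence property. */
    static double update(double g[N][N], int i, int j) {
        return 0.25 * (g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1]);
    }

    static void red_black_sweep(void) {
        for (int color = 0; color < 2; color++) {    /* black sweep, then red */
            #pragma omp parallel for
            for (int i = 1; i < N - 1; i++)
                for (int j = 1; j < N - 1; j++)
                    if ((i + j) % 2 == color)        /* same-color points are independent */
                        grid[i][j] = update(grid, i, j);
        }
    }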
Parallelism in Wator
• The simulation is synchronous
  • Use two copies of the grid (old and new)
  • The value of each new grid cell depends only on 9 cells (itself plus 8 neighbors) in the old grid
  • The simulation proceeds in timesteps; each cell is updated at every step
• Easy to parallelize by dividing the physical domain: domain decomposition (figure: the ocean split into patches P1-P9, one per processor), as in the loop below (a fuller skeleton follows this slide):

    repeat
        compute locally to update local system
        barrier()
        exchange state info with neighbors
    until done simulating

• Locality is achieved by using large patches of the ocean
  • Only boundary values from neighboring patches are needed
• How should we pick the shapes of the domains?
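The slide's pseudocode, fleshed out as a hedged C skeleton of one process's timestep loop. update_patch(), exchange_ghosts(), barrier(), and done are hypothetical helpers, not from the lecture; barrier() could be pthread_barrier_wait or MPI_Barrier.

    /* Bulk-synchronous timestep loop for the owner of one patch. */
    extern void update_patch(double *newg, const double *oldg);  /* 9-point rule */
    extern void exchange_ghosts(double *g);   /* boundary values to/from neighbors */
    extern void barrier(void);                /* e.g., pthread_barrier_wait */
    extern int  done;

    void simulate(double *oldg, double *newg) {
        while (!done) {
            update_patch(newg, oldg);   /* compute locally to update local system */
            barrier();                  /* wait until every patch finishes the step */
            exchange_ghosts(newg);      /* exchange state info with neighbors */
            barrier();                  /* ghosts in place before the next step */
            double *t = oldg; oldg = newg; newg = t;   /* swap old and new grids */
        }
    }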
Regular Meshes (e.g., Game of Life)
• Suppose the graph is an n×n mesh with NSEW neighbor connections
• Which partitioning has less communication (and less potential false sharing)?
  • p horizontal strips: n*(p-1) edge crossings
  • p square blocks: 2*n*(sqrt(p)-1) edge crossings
  • E.g., for n = 1024 and p = 16: strips cross 15,360 edges, blocks only 6,144
• Minimizing communication on a mesh = minimizing the "surface-to-volume ratio" of the partition
Ghost Nodes
• Overview of a (possible!) memory hierarchy optimization
• Normally done with a 1-cell-wide ghost region
• Can you imagine a reason for a wider one?
• (Figure: to compute the green cells, first copy the yellow boundary cells from neighbors, then compute the blue interior; a sketch follows.)
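A sketch of the copy-then-compute pattern in the figure, assuming a square patch with a 1-cell ghost border; all names and the stencil are illustrative, not from the slides.

    #define W 34                          /* 32 interior cells + 2 ghost cells */

    /* Copy neighbors' boundary rows into this patch's ghost rows
       (east/west ghost columns are analogous and omitted for brevity). */
    void fill_ghosts(double p[W][W], const double *north, const double *south) {
        for (int j = 1; j < W - 1; j++) {
            p[0][j]     = north[j];       /* "copy yellow": northern boundary */
            p[W - 1][j] = south[j];       /*                southern boundary */
        }
    }

    /* "Compute green/blue": update interior cells from local memory only. */
    void compute_interior(double out[W][W], const double in[W][W]) {
        for (int i = 1; i < W - 1; i++)
            for (int j = 1; j < W - 1; j++)
                out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] + in[i][j-1] + in[i][j+1]);
    }

One common answer to the slide's question: a k-wide ghost region lets a patch run k timesteps between exchanges, trading some redundant computation for fewer synchronizations.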
Bulk Synchronous Computation
• With this decomposition, the computation is mostly independent
• We need to synchronize between phases
• Picking up from last time: barrier synchronization
Barriers
• Software algorithms are implemented using locks, flags, and counters
• Hardware barriers
  • A wired-AND line separate from the address/data bus
    • Set your input high on arrival; wait for the output to go high before leaving
  • In practice, multiple wires allow reuse
  • Useful when barriers are global and very frequent
  • Difficult to support an arbitrary subset of processors
    • Even harder with multiple processes per processor
  • Difficult to dynamically change the number and identity of participants
    • e.g., the latter due to process migration
  • Not common today on bus-based machines
A Simple Centralized Barrier
• A shared counter maintains the number of processes that have arrived: increment on arrival, then spin until it equals the number of processes.

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;              /* first to reach: reset flag */
        mycount = ++bar_name.counter;       /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                 /* last to arrive */
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = 1;              /* release waiters */
        } else
            while (bar_name.flag == 0) {};  /* busy wait */
    }

• What is the problem if barriers are done back-to-back?
A Working Centralized Barrier
• Consecutively entering the same barrier doesn't work
  • Must prevent a process from entering until all have left the previous instance
  • Could use another counter, but that increases latency and contention
• Sense reversal: wait for the flag to take a different value in consecutive barriers
  • Toggle this value only when all processes have reached the barrier

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);       /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = ++bar_name.counter;       /* mycount is private */
        if (bar_name.counter == p) {        /* last to arrive */
            bar_name.counter = 0;           /* reset for next barrier */
            UNLOCK(bar_name.lock);
            bar_name.flag = local_sense;    /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};
        }
    }
Centralized Barrier Performance
• Latency
  • Centralized has critical path length at least proportional to p
• Traffic
  • About 3p bus transactions
• Storage cost
  • Very low: a centralized counter and flag
• Fairness
  • The same processor should not always be last to exit the barrier
  • No such bias in the centralized scheme
• The key problems for a centralized barrier are latency and traffic
  • Especially with distributed memory, where all traffic goes to the same node
Improved Barrier Algorithms for a Bus
• Separate arrival and exit trees, and use sense reversal
  • Valuable in a distributed network: communication travels along different paths
  • On a bus, all traffic goes over the same bus, with no less total traffic
  • Higher latency (log p steps of work, and O(p) serialized bus transactions)
  • The advantage on a bus is the use of ordinary reads/writes instead of locks
• Software combining tree
  • Only k processors access the same location, where k is the degree of the tree
Combining Tree Barrier (figure)
Tree-Based Barrier Summary (figure)

Tree-Based Barrier Cost (table: latency, traffic, memory)
Dissemination Barrier (figure)

Dissemination Barrier Cost (table: latency, traffic, memory)
Barrier Performance on SGI Challenge
• The centralized barrier does quite well
• We will discuss fancier barrier algorithms for distributed machines
• Helpful hardware support: piggybacking of read misses on the bus
  • Also helps when spinning on highly contended locks
Synchronization Summary
• Rich interaction of hardware-software tradeoffs
  • Must evaluate hardware primitives and software algorithms together
  • The primitives determine which algorithms perform well
• Evaluation methodology is challenging
  • Use of delays and microbenchmarks
  • Should use both microbenchmarks and real workloads
• Simple software algorithms with common hardware primitives do well on a bus
• We will see more sophisticated techniques for distributed machines
• Hardware support is still a subject of debate
  • Theoretical research argues for swap or compare&swap, not fetch&op
  • There are algorithms that ensure constant-time access, but they are complex
References
• John Mellor-Crummey's 422 Lecture Notes
Tree-Based Computation
• The broadcast and reduction operations in MPI are good examples of tree-based algorithms
  • A reduction takes n inputs and produces 1 output
  • A broadcast takes 1 input and produces n outputs
• What can we say about such computations in general?
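Since the slide names MPI's collectives, here is a minimal, self-contained usage example; the choice of MPI_SUM and the payload are arbitrary. MPI libraries typically implement both calls with the log-depth trees described below.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = rank + 1.0, total = 0.0;
        /* n inputs -> 1 output, combined up a tree */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        /* 1 input -> n outputs, fanned out down a tree */
        MPI_Bcast(&total, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d sees total %g\n", rank, total);
        MPI_Finalize();
        return 0;
    }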
A log n lower bound to compute any function of n variables
• Assume we can only use binary operations, one per time unit
• After 1 time unit, an output can depend on at most two inputs
• Use induction to show that after k time units, an output can depend on at most 2^k inputs
• So after log2(n) time units, an output can depend on at most n inputs
• A binary tree performs such a computation, and meets this bound
Broadcasts and Reductions on Trees (figure)
Parallel Prefix, or Scan
• If "+" is an associative operator and x[0], …, x[p-1] are the input data, then the parallel prefix operation computes

    y[j] = x[0] + x[1] + … + x[j]   for j = 0, 1, …, p-1

• Notation: j:k means x[j] + x[j+1] + … + x[k] (in the lecture figure, the final values y[j] are shown in blue)
Mapping Parallel Prefix onto a Tree: Details
• Up-the-tree phase (from leaves to root):
  1) Get values L and R from the left and right children
  2) Save L in a local register Lsave
  3) Pass the sum L+R to the parent
• By induction, Lsave = sum of all leaves in the left subtree
• Down-the-tree phase (from root to leaves):
  1) Get value S from the parent (the root gets 0)
  2) Send S to the left child
  3) Send S + Lsave to the right child
• By induction, S = sum of all leaves to the left of the subtree rooted at the parent
• A sketch of the resulting up/down sweeps appears below
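A sketch of these two phases as an in-place, work-efficient exclusive scan (Blelloch-style), written sequentially for n a power of two; each inner loop's iterations are independent, which is where the tree's parallelism would come from. This is my illustration, not code from the lecture.

    /* Exclusive prefix sum: up-sweep saves left-subtree sums in place
       (the slide's Lsave), down-sweep passes "sum of everything to my
       left" back toward the leaves. */
    void scan_exclusive(double *x, int n) {          /* n = power of 2 */
        for (int d = 1; d < n; d *= 2)               /* up-sweep */
            for (int i = 2*d - 1; i < n; i += 2*d)
                x[i] += x[i - d];                    /* parent gets L+R; L stays saved */
        x[n - 1] = 0;                                /* root receives S = 0 */
        for (int d = n / 2; d >= 1; d /= 2)          /* down-sweep */
            for (int i = 2*d - 1; i < n; i += 2*d) {
                double left = x[i - d];              /* saved left-subtree sum */
                x[i - d] = x[i];                     /* send S to left child */
                x[i] += left;                        /* send S + Lsave to right child */
            }
    }

For x = {1, 2, 3, 4} this produces {0, 1, 3, 6}.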
E.g., Fibonacci via Matrix Multiply Prefix
• Consider computing the Fibonacci numbers: F(n+1) = F(n) + F(n-1)
• Each step can be viewed as a matrix multiplication:

    [ F(n+1) ]   [ 1  1 ] [ F(n)   ]
    [ F(n)   ] = [ 1  0 ] [ F(n-1) ]

• So we can compute all F(n) by running matmul_prefix on [M, M, …, M], where M = [[1 1],[1 0]], then selecting the upper-left entry of each prefix product
Derived from: Alan Edelman, MIT
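A small C illustration (mine, not from the slides): a running product of M yields each F(n) in the upper-left entry; because matrix multiply is associative, matmul_prefix could compute all of these prefix products in O(log n) parallel steps.

    #include <stdio.h>

    typedef struct { long m[2][2]; } Mat2;

    static Mat2 mul(Mat2 a, Mat2 b) {                /* 2x2 matrix product */
        Mat2 c;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                c.m[i][j] = a.m[i][0]*b.m[0][j] + a.m[i][1]*b.m[1][j];
        return c;
    }

    int main(void) {
        Mat2 M = {{{1, 1}, {1, 0}}};
        Mat2 P = {{{1, 0}, {0, 1}}};                 /* identity */
        for (int n = 1; n <= 10; n++) {
            P = mul(M, P);                           /* prefix product M^n */
            printf("F(%d) = %ld\n", n + 1, P.m[0][0]);  /* upper-left entry */
        }
        return 0;
    }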
Adding two n-bit integers in O(log n) time
• Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers
• We want their sum s = a + b = s[n]s[n-1]…s[0]
• The sequential algorithm propagates carries:

    c[-1] = 0                                                       … rightmost carry bit
    for i = 0 to n-1
        c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )  … next carry bit
        s[i] = ( a[i] xor b[i] ) xor c[i-1]

• Challenge: compute all c[i] in O(log n) time via parallel prefix

    for all (0 <= i <= n-1):  p[i] = a[i] xor b[i]   … propagate bit
    for all (0 <= i <= n-1):  g[i] = a[i] and b[i]   … generate bit

• Then c[i] = (p[i] and c[i-1]) or g[i], which (with "and" as * and "or" as +) is

    [ c[i] ]   [ p[i]  g[i] ] [ c[i-1] ]            [ c[i-1] ]
    [  1   ] = [  0     1   ] [   1    ]  =  C[i] * [   1    ]

  a 2-by-2 Boolean matrix multiplication, which is associative, so

    [ c[i] ]                              [ 0 ]
    [  1   ] = C[i] * C[i-1] * … * C[0] * [ 1 ]

• Evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix
• Used in all computers to implement addition: carry look-ahead
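A hedged end-to-end illustration in C of the matrix formulation above. It uses a sequential running product where parallel prefix would go; the 4-bit operands are arbitrary.

    #include <stdio.h>

    typedef struct { int m[2][2]; } BMat;   /* entries are 0/1 */

    static BMat bmul(BMat a, BMat b) {      /* Boolean matmul: and = &, or = | */
        BMat c;
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                c.m[i][j] = (a.m[i][0] & b.m[0][j]) | (a.m[i][1] & b.m[1][j]);
        return c;
    }

    int main(void) {
        int a[4] = {1, 1, 0, 1}, b[4] = {1, 0, 1, 1};   /* bit 0 first: a=11, b=13 */
        int c[5]; c[0] = 0;                  /* c[i+1] holds the carry out of bit i */
        BMat P = {{{1, 0}, {0, 1}}};         /* identity */
        for (int i = 0; i < 4; i++) {
            int p = a[i] ^ b[i], g = a[i] & b[i];        /* propagate, generate */
            BMat Ci = {{{p, g}, {0, 1}}};
            P = bmul(Ci, P);                 /* running product C[i]*...*C[0] */
            c[i + 1] = P.m[0][1];            /* (c[i],1) = P * (0,1) */
        }
        for (int i = 0; i < 4; i++)          /* s[i] = p[i] xor c[i-1] */
            printf("s[%d] = %d\n", i, (a[i] ^ b[i]) ^ c[i]);
        printf("s[4] = %d\n", c[4]);         /* final carry is the top sum bit */
        return 0;                            /* prints 0,0,0,1,1 = 24 = 11+13 */
    }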
Other applications of scans
• There are many applications of scans, some more obvious than others:
  • Adding multi-precision numbers (represented as arrays of digits)
  • Evaluating recurrences and expressions
  • Solving tridiagonal systems (numerically unstable!)
  • Implementing bucket sort and radix sort
  • Dynamically allocating processors
  • Searching for regular expressions (e.g., grep)
• Names: +\ (APL), cumsum (Matlab), MPI_SCAN
• Note: 2n operations are used when only n-1 are needed
Evaluating arbitrary expressions
• Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately
• Think of E as an arbitrary expression tree with n leaves (the variables) and internal nodes labeled by +, -, *, and /
• Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using the laws of commutativity, associativity, and distributivity
• Sketch of the (modern) proof: evaluate the expression tree E greedily by
  • collapsing all leaves into their parents at each time step, and
  • evaluating all "chains" in E with parallel prefix
Multiplying n-by-n matrices in O(log n) time
• For all (1 <= i,j,k <= n): P(i,j,k) = A(i,k) * B(k,j)
  • Cost: 1 time unit, using n^3 processors
• For all (1 <= i,j <= n): C(i,j) = sum of P(i,j,k) over k = 1, …, n
  • Cost: O(log n) time, using a tree with n^3 / 2 processors
Evaluating recurrences
• Let xi = fi(xi-1), with each fi a rational function and x0 given
• How fast can we compute xn?
• Theorem (Kung): Suppose degree(fi) = d for all i
  • If d = 1, xn can be evaluated in O(log n) time using parallel prefix
  • If d > 1, evaluating xn takes Ω(n) time, i.e., no speedup is possible
• Sketch of proof when d = 1:

    xi = fi(xi-1) = (ai * xi-1 + bi) / (ci * xi-1 + di)  can be written as
    xi = numi / deni = (ai * numi-1 + bi * deni-1) / (ci * numi-1 + di * deni-1),  or

    [ numi ]   [ ai  bi ] [ numi-1 ]                            [ num0 ]
    [ deni ] = [ ci  di ] [ deni-1 ]  =  Mi * Mi-1 * … * M1 * [ den0 ]

  so we can use parallel prefix with 2-by-2 matrix multiplication
• Sketch of proof when d > 1:
  • degree(xi) as a function of x0 is d^i
  • After k parallel steps, degree(anything) <= 2^k
  • So computing xi takes Ω(i) steps
Summary
• Message passing programming
  • Maps well to large-scale parallel hardware (clusters)
  • The most popular programming model for these machines
  • A few primitives are enough to get started: send/receive or broadcast/reduce plus initialization
  • More subtle semantics manage message buffers to avoid copying and speed up communication
• Tree-based algorithms
  • An elegant model that is a key piece of data-parallel programming
  • The most common are broadcast/reduce
  • Parallel prefix (aka scan) produces partial answers and can be used for many surprising applications
  • Some of these are of more theoretical than practical interest
Extra Slides
Inverting dense n-by-n matrices in O(log^2 n) time
• Lemma 1 (Cayley-Hamilton Theorem): gives an expression for A^-1 via the characteristic polynomial of A
• Lemma 2 (Newton's Identities): a triangular system of equations for the coefficients of the characteristic polynomial, with matrix entries s_k
• Lemma 3: s_k = trace(A^k) = Σ(i=1..n) A^k[i,i] = Σ(i=1..n) (λi(A))^k
• Csanky's Algorithm (1976):
  1) Compute the powers A^2, A^3, …, A^(n-1) by parallel prefix; cost = O(log^2 n)
  2) Compute the traces s_k = trace(A^k); cost = O(log n)
  3) Solve Newton's identities for the coefficients of the characteristic polynomial; cost = O(log^2 n)
  4) Evaluate A^-1 using the Cayley-Hamilton Theorem; cost = O(log n)
• Completely numerically unstable
Summary of tree algorithms
• Lots of problems can be done quickly, in theory, using trees
• Some algorithms are widely used:
  • Broadcasts, reductions, parallel prefix
  • Carry look-ahead addition
• Some are of theoretical interest only:
  • Csanky's method for matrix inversion
  • Solving general tridiagonals (without pivoting)
  • Both are numerically unstable, and Csanky also needs too many processors
• Embedded in various systems:
  • CM-5 hardware control network
  • MPI, Split-C, Titanium, NESL, and other languages