Parallel Prefix and Data Parallel Operations
Motivation: basic parallel operations that occur repeatedly.
Let $\otimes$ be an associative operation: $(a_1 \otimes a_2) \otimes a_3 = a_1 \otimes (a_2 \otimes a_3)$.
How can we compute $a_1 \otimes a_2 \otimes \cdots \otimes a_n$ in parallel in $O(\log n)$ time?
Approach 1
An entry [i:j] denotes the combined value of a_i through a_j.

         a0     a1     a2     a3     a4     a5     a6     a7
d=1    [0:0]  [0:1]  [1:2]  [2:3]  [3:4]  [4:5]  [5:6]  [6:7]
d=2    [0:0]  [0:1]  [0:2]  [0:3]  [1:4]  [2:5]  [3:6]  [4:7]
d=4    [0:0]  [0:1]  [0:2]  [0:3]  [0:4]  [0:5]  [0:6]  [0:7]

Assume that n = 2^k.
for i = 0 to k-1
    for j = 0 to n-1-2^i do in parallel
        x[j + 2^i] = x[j] + x[j + 2^i]
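The doubling scheme of Approach 1 can be simulated sequentially. The following is a minimal Python sketch, not the slides' own code; the function name `scan_doubling` and the `op` parameter are illustrative assumptions. Each step takes a snapshot of the array so that all "parallel" updates read values from the previous step.

```python
from operator import add


def scan_doubling(values, op=add):
    """Inclusive prefix of `values` under the associative `op` (hypothetical helper)."""
    x = list(values)
    n = len(x)
    d = 1  # d plays the role of 2^i in the pseudocode
    while d < n:
        old = x[:]  # snapshot: every update in this step reads the old values
        for j in range(n - d):  # j = 0 .. n-1-d, conceptually in parallel
            x[j + d] = op(old[j], old[j + d])
        d *= 2
    return x


print(scan_doubling([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

The earlier operand is always passed on the left, so the sketch also works for associative operations that are not commutative, such as string concatenation.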
How to do on Tree Architecture?
(Figure: a tree node with signals St, R, Sl and Sr.)
for each node
    if there is a signal from left and right, St <- Sl + Sr
    if there is a signal R, send R to both its children
    if the node is a leaf and there is a signal R, X <- X + R
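A sequential sketch of the tree scheme is given below. It relies on one assumption that the slide text does not state explicitly: each internal node also sends its left-subtree sum Sl into its right subtree as the signal R. With that assumption, every leaf ends up adding the sums of all subtrees to its left. The name `tree_prefix` is hypothetical.

```python
def tree_prefix(values):
    """Prefix sums computed along an implicit binary tree over the leaves."""
    x = list(values)
    if not x:
        return x

    def visit(lo, hi):
        # Handles the subtree covering leaves lo..hi-1 and returns its sum St.
        if hi - lo == 1:
            return x[lo]
        mid = (lo + hi) // 2
        s_left = visit(lo, mid)
        s_right = visit(mid, hi)
        # Assumed rule: Sl travels down the right subtree as signal R;
        # nodes forward R to both children and each leaf does X <- X + R.
        for leaf in range(mid, hi):
            x[leaf] += s_left
        return s_left + s_right  # St <- Sl + Sr

    visit(0, len(x))
    return x


print(tree_prefix([1, 2, 3, 4, 5, 6, 7, 8]))
```

The inner loop stands in for the broadcast of R through a subtree; on a real tree machine that broadcast takes time proportional to the subtree depth.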
How to do on a Hypercube
A complete binary tree can be embedded into a hypercube.
Simpler solution: each node computes both its prefix and the total sum.
for i = 0 to k-1
    for j = 0 to n-1 do in parallel
        x[j] = x[j] + sum[j_i]    if the i-th bit of j = 1
        sum[j] = sum[j] + sum[j_i]
where j_i and j have the same binary representation except for their i-th bit: the i-th bit of j_i is the complement of the i-th bit of j.
Prefix on Hypercube
An entry [i:j] denotes a_i + ... + a_j. Each node holds a pair (X, SUM).

           d=1               d=2               d=4
        X      SUM        X      SUM        X      SUM
a0    [0:0]  [0:1]      [0:0]  [0:3]      [0:0]  [0:7]
a1    [0:1]  [0:1]      [0:1]  [0:3]      [0:1]  [0:7]
a2    [2:2]  [2:3]      [0:2]  [0:3]      [0:2]  [0:7]
a3    [2:3]  [2:3]      [0:3]  [0:3]      [0:3]  [0:7]
a4    [4:4]  [4:5]      [4:4]  [4:7]      [0:4]  [0:7]
a5    [4:5]  [4:5]      [4:5]  [4:7]      [0:5]  [0:7]
a6    [6:6]  [6:7]      [4:6]  [4:7]      [0:6]  [0:7]
a7    [6:7]  [6:7]      [4:7]  [4:7]      [0:7]  [0:7]

for i = 0 to k-1
    for j = 0 to n-1 do in parallel
        x[j] = x[j] + sum[j_i]    if the i-th bit of j = 1
        sum[j] = sum[j] + sum[j_i]
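The hypercube scheme can be checked with a small sequential simulation. This is a sketch under stated assumptions: n is a power of two, the operation is ordinary addition, and the names `hypercube_prefix` and `total` (standing in for the slide's `sum`) are illustrative.

```python
def hypercube_prefix(values):
    """Return (x, total): per-node prefix sums and the total held by every node."""
    n = len(values)  # assumed to be a power of two
    k = n.bit_length() - 1
    x = list(values)
    total = list(values)
    for i in range(k):
        old_total = total[:]  # snapshot: all nodes exchange before updating
        for j in range(n):  # conceptually in parallel
            partner = j ^ (1 << i)  # j with its i-th bit complemented
            if (j >> i) & 1:
                x[j] = old_total[partner] + x[j]
            total[j] = old_total[j] + old_total[partner]
    return x, total


prefix, total = hypercube_prefix([1, 2, 3, 4, 5, 6, 7, 8])
print(prefix)  # [1, 3, 6, 10, 15, 21, 28, 36]
print(total)   # [36, 36, 36, 36, 36, 36, 36, 36]
```

Each node only ever talks to the neighbor that differs in bit i, which is exactly one hypercube link per step.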
Applications of Data Parallel Operations
Any associative operation. Examples:
• min, max, add
• adding two binary numbers
• finite state automata
• radix sort
• segmented prefix sum
• routing
• packing
• unpacking
• broadcast (copy-scan)
• solving recurrence equations
• straight line computation (parallel arithmetic evaluation)
Adding two n-bit numbers as parallel prefix
• $a = a_{n-1} \ldots a_0$
• $b = b_{n-1} \ldots b_0$
• $s = a + b$
• note that $s_i = a_i \oplus b_i \oplus c_{i-1}$, where $\oplus$ is exclusive or
• to compute $c_i$, define $g$ and $p$ as: $g_i = a_i \wedge b_i$, $p_i = a_i \oplus b_i$
• define $\circ$ as: $(g, p) \circ (g', p') = (g \vee (p \wedge g'),\ p \wedge p')$
Then the carry bit $c_i$ can be computed by:
$(G_i, P_i) = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_0, p_0)$, and $G_i = c_i$.
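To make the generate/propagate formulation concrete, here is a small Python sketch. It assumes g is the AND and p the XOR of the two input bits, and it runs the prefix sequentially as a stand-in for a parallel prefix; the names `carry_op` and `add_bits` are hypothetical.

```python
def carry_op(hi, lo):
    """(g, p) o (g', p') = (g or (p and g'), p and p'); `hi` is the more significant pair."""
    g, p = hi
    g2, p2 = lo
    return (g | (p & g2), p & p2)


def add_bits(a, b, n):
    """Add two n-bit non-negative integers (n >= 1) via a prefix over (g, p) pairs."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    gp = [(x & y, x ^ y) for x, y in zip(a_bits, b_bits)]

    carries = []  # carries[i] = G_i = c_i, the carry out of bit i
    acc = None
    for pair in gp:  # sequential stand-in for the parallel prefix
        acc = pair if acc is None else carry_op(pair, acc)
        carries.append(acc[0])

    result = 0
    for i in range(n):
        carry_in = carries[i - 1] if i > 0 else 0
        result |= (a_bits[i] ^ b_bits[i] ^ carry_in) << i
    return result | (carries[-1] << n)  # the final carry becomes bit n


print(add_bits(13, 11, 4))  # 24
```

Because `carry_op` is associative, the loop over `gp` could be replaced by any O(log n) prefix scheme without changing the result.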
Hardware circuit of recursive look-ahead adder
(Figure: 16-bit adder circuit with inputs a15 b15 ... a0 b0.)
Parsing a regular language
Transition function: δ(q0,b) = q2, δ(q0,c) = q1, δ(q1,b) = q0, δ(q1,c) = qr, δ(q2,b) = qr, δ(q2,c) = q0
qr: reject state
(Figure: each input symbol is replaced by its state map, e.g. for b: q0->q2, q1->q0, q2->qr, and the maps are combined pairwise.)
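The automaton can be run as a reduction over state maps, since composing maps is associative. The sketch below uses the transition function from the slide and assumes that the reject state qr is absorbing (every symbol keeps the automaton in qr); the helper names are illustrative.

```python
STATES = ["q0", "q1", "q2", "qr"]
DELTA = {
    ("q0", "b"): "q2", ("q0", "c"): "q1",
    ("q1", "b"): "q0", ("q1", "c"): "qr",
    ("q2", "b"): "qr", ("q2", "c"): "q0",
}


def symbol_map(ch):
    """State-to-state map induced by one input symbol (qr assumed absorbing)."""
    return {q: DELTA.get((q, ch), "qr") for q in STATES}


def compose(first, second):
    """Map equivalent to applying `first` and then `second`."""
    return {q: second[first[q]] for q in STATES}


def run(text, start="q0"):
    """Final state after reading `text`, computed by pairwise tree reduction."""
    maps = [symbol_map(ch) for ch in text]
    while len(maps) > 1:  # one round per tree level, O(log n) rounds
        nxt = [compose(maps[i], maps[i + 1]) for i in range(0, len(maps) - 1, 2)]
        if len(maps) % 2:
            nxt.append(maps[-1])  # odd element carries over unchanged
        maps = nxt
    return maps[0][start] if maps else start


print(run("bccb"))  # q0
print(run("bb"))    # qr
```

Replacing the reduction with a prefix over the same `compose` operation would give the state after every input position rather than only the final one.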
Segmented Prefix operation
(segment boundaries are marked with |)
before:   1  2 |  3  4  5  6 |  7  8
after:    1  3 |  3  7 12 18 |  7 15
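Segmented Prefix computation
Let $\otimes$ be any associative operation. For the segmented operation of $\otimes$, define $\otimes'$ as follows, where $|x$ denotes an element that starts a segment:

$\otimes'$      $b$                 $|b$
$a$             $a \otimes b$       $|b$
$|a$            $|(a \otimes b)$    $|b$

Then $\otimes'$ is associative and we can compute the segmented operation in $O(\log n)$ time.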
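The segmented operator can be written on (flag, value) pairs, where a true flag marks the start of a segment. This is an illustrative sketch with hypothetical names, and the scan itself is done sequentially as a stand-in for a parallel prefix.

```python
from operator import add


def seg_op(left, right, op):
    """Segmented version of `op` on (flag, value) pairs."""
    left_flag, left_value = left
    right_flag, right_value = right
    if right_flag:
        # The right operand starts a new segment, so the left side is dropped.
        return (True, right_value)
    # Otherwise combine the values and keep the left operand's flag.
    return (left_flag, op(left_value, right_value))


def segmented_scan(values, flags, op=add):
    """Inclusive segmented prefix; flags[i] is truthy where a segment starts."""
    out, acc = [], None
    for pair in zip(flags, values):
        acc = pair if acc is None else seg_op(acc, pair, op)
        out.append(acc[1])
    return out


print(segmented_scan([1, 2, 3, 4, 5, 6, 7, 8], [1, 0, 1, 0, 0, 0, 1, 0]))
# [1, 3, 3, 7, 12, 18, 7, 15]
```

Since `seg_op` is associative whenever `op` is, the pairs can be fed to any ordinary parallel prefix routine unchanged.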
Enumerating
data         = [5 6 3 1 8 3 7 5 9 2]
active procs = [1 0 1 1 0 0 1 0 1 0]
enumerated   = [0 x 1 2 x x 3 x 4 x]
Packing
data         = [5 6 3 1 8 3 7 5 9 2]
active procs = [1 0 1 1 0 0 1 0 1 0]
enumerated   = [0 x 1 2 x x 3 x 4 x]
packed data  = [5 3 1 7 9 x x x x x]
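Enumerating is an exclusive prefix sum over the active flags, and packing writes each active element to its enumerated position. A minimal sketch follows, with `None` standing in for the slides' `x` and with hypothetical function names.

```python
def enumerate_active(active):
    """Rank of each active processor among the active ones; None where inactive."""
    ranks, count = [], 0
    for flag in active:  # sequential stand-in for an exclusive prefix sum
        ranks.append(count if flag else None)
        count += flag
    return ranks


def pack(data, active):
    """Move the active elements to the front, preserving their order."""
    packed = [None] * len(data)
    for value, rank in zip(data, enumerate_active(active)):
        if rank is not None:
            packed[rank] = value  # each active processor writes to its rank
    return packed


data = [5, 6, 3, 1, 8, 3, 7, 5, 9, 2]
active = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(enumerate_active(active))
print(pack(data, active))
```

No two active processors receive the same rank, so the writes in `pack` never collide.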
Packing and Unpacking on Hypercube
Packing
• adjust bit 0
• adjust bit 1
• adjust bit 2
• ...
• adjust bit k-1
Unpacking
• adjust bit k-1
• adjust bit k-2
• ...
• adjust bit 1
• adjust bit 0
How about in the order of adjust bit 0, 1, ..., k-1 for packing?
Unpacking
address       =  0 1 2 3 4 5 6 7 8 9
data          = [6 2 3 5 9 x x x x x]
active procs  = [1 0 1 1 0 0 1 0 1 0]
enumerated    = [0 x 1 2 x x 3 x 4 x]
destination   = [0 2 3 6 8 x x x x x]
unpacked data = [6 x 2 3 x x 5 x 9 x]
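Unpacking is the inverse of packing: the r-th packed element goes to the address of the r-th active processor. The sketch below is illustrative only; `None` again stands in for `x`, and the function names are assumptions.

```python
def destinations(active):
    """Addresses of the active processors, in increasing order."""
    return [addr for addr, flag in enumerate(active) if flag]


def unpack(packed, active):
    """Scatter the leading elements of `packed` to the active addresses."""
    out = [None] * len(active)
    for rank, addr in enumerate(destinations(active)):
        out[addr] = packed[rank]
    return out


packed = [6, 2, 3, 5, 9, None, None, None, None, None]
active = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
print(destinations(active))
print(unpack(packed, active))
```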
Copy Scan (broadcast)
address     =    0  1  2  3  4  5  6  7  8  9
data        = [  6  2  3  5  9  4  1  7  8 10]
segment bit = [  1  0  1  1  0  0  1  0  1  0]
result      = [  6  6  3  5  5  5  1  1  8  8]
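Copy scan can be viewed as a segmented prefix whose operation simply keeps its left operand, so the first value of each segment is broadcast to the rest of the segment. A short sequential sketch follows; the name `copy_scan` is hypothetical, and position 0 is assumed to start a segment.

```python
def copy_scan(data, segment_bit):
    """Broadcast the first value of every segment across that segment."""
    out, current = [], None
    for value, flag in zip(data, segment_bit):
        if flag:
            current = value  # a new segment starts: remember its head
        out.append(current)
    return out


print(copy_scan([6, 2, 3, 5, 9, 4, 1, 7, 8, 10], [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]))
# [6, 6, 3, 5, 5, 5, 1, 1, 8, 8]
```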
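Radix Sort
for j = 0 to k-1    // x has k bits; least significant bit first
    for all i in [0 .. n-1] do in parallel {
        if j-th bit of x[i] is 0 {
            y[i] = enumerate        // rank among elements whose j-th bit is 0
            c = count               // number of elements whose j-th bit is 0
        }
        if j-th bit of x[i] is 1
            y[i] = enumerate + c    // rank among elements whose j-th bit is 1, offset by c
        x[y[i]] = x[i]
    }

Radix sort, another code
for j = 0 to k-1    // x has k bits; least significant bit first
    for all i in [0 .. n-1] do in parallel {
        pack left x[i] if j-th bit of x[i] is 0
        pack right x[i] if j-th bit of x[i] is 1
    }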
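The enumerate-and-count formulation of radix sort can be sketched as follows. The sketch processes bits from least significant to most significant, which is the order a stable split needs in order to yield a sorted result; the names `split_by_bit` and `radix_sort` are illustrative, and the enumerations are done with sequential counters in place of parallel prefix sums.

```python
def split_by_bit(x, j):
    """One stable pass: elements whose j-th bit is 0 first, then those with bit 1."""
    bits = [(v >> j) & 1 for v in x]
    c = bits.count(0)  # "count": number of elements whose j-th bit is 0
    y = [0] * len(x)   # destination index of each element
    zeros_seen = ones_seen = 0  # running "enumerate" values
    for i, bit in enumerate(bits):
        if bit == 0:
            y[i] = zeros_seen
            zeros_seen += 1
        else:
            y[i] = c + ones_seen
            ones_seen += 1
    out = [0] * len(x)
    for i, dest in enumerate(y):
        out[dest] = x[i]  # x[y[i]] = x[i], written to a fresh array
    return out


def radix_sort(x, k):
    """Sort non-negative integers that fit in k bits."""
    for j in range(k):  # least significant bit first
        x = split_by_bit(x, j)
    return x


print(radix_sort([5, 6, 3, 1, 8, 3, 7, 5, 9, 2], 4))
# [1, 2, 3, 3, 5, 5, 6, 7, 8, 9]
```

Writing into a fresh array avoids overwriting elements that have not been moved yet, which the in-place form `x[y[i]] = x[i]` only gets for free when all processors read before any of them writes.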
Quick Sort
1. Pick a pivot p
2. Broadcast p
3. For all PE i, compare A[i] with p {
       if A[i] < p, pack left A[i] in the segment
       if A[i] >= p, pack right A[i] in the segment
   }
4. Mark the segment boundary
5. For each segment, quick sort recursively
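The steps above can be simulated level by level, with a list of segments standing in for the segment boundaries. Two choices in this sketch are assumptions rather than part of the slide: the pivot is the first element of each segment, and the pivot itself is kept as its own one-element segment so that every segment strictly shrinks.

```python
def quick_sort_segments(a):
    """Data-parallel style quicksort, simulated one recursion level at a time."""
    segments = [list(a)]
    while any(len(seg) > 1 for seg in segments):
        next_segments = []
        for seg in segments:  # all segments are handled "in parallel"
            if len(seg) <= 1:
                next_segments.append(seg)
                continue
            p = seg[0]  # pick a pivot and broadcast it within the segment
            left = [v for v in seg[1:] if v < p]    # pack left
            right = [v for v in seg[1:] if v >= p]  # pack right
            # Mark the new segment boundaries, dropping empty segments.
            next_segments.extend(s for s in (left, [p], right) if s)
        segments = next_segments
    return [v for seg in segments for v in seg]


print(quick_sort_segments([5, 6, 3, 1, 8, 3, 7, 5, 9, 2]))
# [1, 2, 3, 3, 5, 5, 6, 7, 8, 9]
```

On a parallel machine, each iteration of the outer loop would correspond to one round of segmented broadcast and segmented packing over the whole array.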
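Solving Linear Recurrence Equations
$f_n = a_{n-1} f_{n-1} + a_{n-2} f_{n-2}$
In matrix form:
$\begin{pmatrix} f_n \\ f_{n-1} \end{pmatrix} = \begin{pmatrix} a_{n-1} & a_{n-2} \\ 1 & 0 \end{pmatrix} \begin{pmatrix} f_{n-1} \\ f_{n-2} \end{pmatrix}$
Matrix multiplication is associative, so the products of these matrices can be computed as a parallel prefix.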
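The recurrence can be rewritten with 2x2 matrices: the pair (f_n, f_{n-1}) is obtained from (f_{n-1}, f_{n-2}) by multiplying with the matrix whose first row is (a_{n-1}, a_{n-2}) and whose second row is (1, 0). The sketch below computes the prefix products of these matrices sequentially as a stand-in for a parallel prefix; the function names and the choice of f_0 and f_1 as inputs are assumptions.

```python
def mat_mul(A, B):
    """Product of two 2x2 matrices given as nested tuples."""
    return (
        (A[0][0] * B[0][0] + A[0][1] * B[1][0], A[0][0] * B[0][1] + A[0][1] * B[1][1]),
        (A[1][0] * B[0][0] + A[1][1] * B[1][0], A[1][0] * B[0][1] + A[1][1] * B[1][1]),
    )


def solve_recurrence(a, f0, f1):
    """Return [f_0, ..., f_N], N = len(a), with f_n = a[n-1]*f_{n-1} + a[n-2]*f_{n-2}."""
    f = [f0, f1]
    prefix = ((1, 0), (0, 1))  # identity; becomes M_n * ... * M_2
    for n in range(2, len(a) + 1):
        m_n = ((a[n - 1], a[n - 2]), (1, 0))
        prefix = mat_mul(m_n, prefix)
        # (f_n, f_{n-1}) = prefix * (f_1, f_0); only the first row is needed.
        f.append(prefix[0][0] * f1 + prefix[0][1] * f0)
    return f


print(solve_recurrence([1, 1, 1, 1, 1, 1], 0, 1))  # [0, 1, 1, 2, 3, 5, 8]
```

With all coefficients equal to 1 the recurrence reduces to the Fibonacci numbers, which gives an easy sanity check.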
Pointer Jumping and Tree Computation
How to compute a prefix on a linked list?
    If NEXT[i] != NIL then
        X[i] <- X[i] + X[NEXT[i]]
        NEXT[i] <- NEXT[NEXT[i]]
Example on the list 1 -> 2 -> ... -> 7:
initial:   1  2  3  4  5  6  7
step 1:    3  5  7  9 11 13  7
step 2:   10 14 18 22 18 13  7
step 3:   28 27 25 22 18 13  7
How can the result be produced in the order 1 3 6 10 15 21 28 instead?
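Pointer jumping can be simulated by applying the update rule to all nodes at once for about log2(n) rounds. In this sketch `None` plays the role of NIL, and the function name and the array-based list representation are illustrative assumptions.

```python
def pointer_jump(values, next_index):
    """Apply the pointer-jumping rule until every pointer has reached NIL."""
    x = list(values)
    nxt = list(next_index)
    n = len(x)
    rounds = max(1, (n - 1).bit_length())  # enough rounds for a list of n nodes
    for _ in range(rounds):
        old_x, old_next = x[:], nxt[:]  # snapshot: all nodes update together
        for i in range(n):
            if old_next[i] is not None:
                x[i] = old_x[i] + old_x[old_next[i]]
                nxt[i] = old_next[old_next[i]]
    return x


values = [1, 2, 3, 4, 5, 6, 7]
successors = [1, 2, 3, 4, 5, 6, None]
print(pointer_jump(values, successors))  # [28, 27, 25, 22, 18, 13, 7]
```

Following successor links yields suffix sums. One possible way to obtain 1 3 6 10 15 21 28, offered here as a suggestion rather than as the slides' answer, is to run the same routine over predecessor links, for example `pointer_jump(values, [None, 0, 1, 2, 3, 4, 5])`.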
Application: Tree computation
Pre-order numbering (figure labels: each node 1, leaf node 1).
Can also be applied to in-order and post-order numbering, number of children, depth, etc., and to biconnected components.
Recurrence Equation Example: LU decomposition on a triangular matrix