Learn the fundamentals of scatter, gather, and parallel scan algorithms, their implementation in various DSLs, and practical applications. Explore how to build complex algorithms from these parallel building blocks.
Gather/Scatter, Parallel Scan, and Applications
Lee Barford
v2011-09-15
Outline
Confidentiality Label
• Continue to add parallel algorithm building blocks
  • Scatter and gather algorithms: array dereferencing in parallel
  • Parallel scan algorithm: what it does
  • How it is implemented will be covered in a paper next week
  • Other algorithms built on parallel scan & scatter/gather
• Demonstrate usefulness of parallel scan and scatter/gather
• Illustrations of how to “think in parallel”
  • Building more complex parallel algorithms from simpler building blocks
Scatter: Parallel assignment to array elements
• Vector-indexed assignment appears in a number of DSLs:
  • Matlab, Scilab, Numeric Python, R, Fortran 90
• A reasonable definition of assignment to a vector, indexed by a vector:
  • x[v] = y
  • x, y vectors of compatible types; v vector of index type; v & y same length
  • x[v[0]] = y[0]; x[v[1]] = y[1]; x[v[2]] = y[2]; …
  • Undefined result if v[i]==v[j] for any i ≠ j
• Corresponding abstract parallel operation is called “scatter”:
  • scatter(v, y, x)
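The scatter definition above can be written as a serial reference loop. This is only a sketch of the semantics in plain C++; `scatter_ref` is a hypothetical name chosen for illustration and is not a Thrust function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for scatter(v, y, x): x[v[i]] = y[i] for every i.
// A parallel version may execute the iterations in any order, which is
// why repeated indices in v leave the result undefined.
void scatter_ref(const std::vector<std::size_t>& v,
                 const std::vector<int>& y,
                 std::vector<int>& x) {
    assert(v.size() == y.size());   // v and y must have the same length
    for (std::size_t i = 0; i < v.size(); ++i) {
        assert(v[i] < x.size());    // every index must fall inside x
        x[v[i]] = y[i];
    }
}
```

Because each iteration writes a different element of x when the indices are distinct, the loop body has no dependencies between iterations, which is what makes the operation easy to parallelize.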
Scatter: Code example
#include <thrust/scatter.h>
#include <thrust/device_vector.h>
// mark even indices with a 1; odd indices with a 0
int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
// map[i] is the destination index of values[i]
int map[10] = {0, 5, 1, 6, 2, 7, 3, 8, 4, 9};
thrust::device_vector<int> d_values(values, values + 10);
thrust::device_vector<int> d_map(map, map + 10);
thrust::device_vector<int> d_output(10);
thrust::scatter(d_values.begin(), d_values.end(), d_map.begin(), d_output.begin());
// d_output is now {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
Scatter_if: Scatter when predicate array element is true
#include <thrust/scatter.h>
int V[8] = {10, 20, 30, 40, 50, 60, 70, 80};  // values to scatter
int M[8] = {0, 5, 1, 6, 2, 7, 3, 4};          // destination indices
int S[8] = {1, 0, 1, 0, 1, 0, 1, 0};          // predicate (stencil) array
int D[8] = {0, 0, 0, 0, 0, 0, 0, 0};          // destination
thrust::scatter_if(V, V + 8, M, S, D);
// D contains [10, 30, 50, 70, 0, 0, 0, 0]
Gather: Build a vector from a vector-indexed vector
• In matrix/vector DSLs: y = x[v]
  • y[0] = x[v[0]]; y[1] = x[v[1]]; …
  • x and y must be compatible types
  • v must be an index type
  • y must be able to store at least length(v) elements
• Corresponding abstract parallel operation is called “gather”:
  • gather(v, x, y)
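As with scatter, the gather definition can be sketched as a serial reference loop in plain C++. The name `gather_ref` is hypothetical and only mirrors the abstract gather(v, x, y) above; it is not a Thrust function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for gather(v, x, y): y[i] = x[v[i]] for every i.
// Unlike scatter, repeated indices in v are harmless here, because each
// iteration writes its own y[i] and only reads from x.
std::vector<int> gather_ref(const std::vector<std::size_t>& v,
                            const std::vector<int>& x) {
    std::vector<int> y(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        assert(v[i] < x.size());    // every index must fall inside x
        y[i] = x[v[i]];
    }
    return y;
}
```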
Gather: Code example
#include <thrust/gather.h>
#include <thrust/device_vector.h>
// mark even indices with a 1; odd indices with a 0
int values[10] = {1, 0, 1, 0, 1, 0, 1, 0, 1, 0};
// map[i] is the source index for output element i
int map[10] = {0, 2, 4, 6, 8, 1, 3, 5, 7, 9};
thrust::device_vector<int> d_values(values, values + 10);
thrust::device_vector<int> d_map(map, map + 10);
thrust::device_vector<int> d_output(10);
thrust::gather(d_map.begin(), d_map.end(), d_values.begin(), d_output.begin());
// d_output is now {1, 1, 1, 1, 1, 0, 0, 0, 0, 0}
Other forms of scatter and gather in Thrust
• scatter_if, where predicate is a unary function <T,bool>
• gather_if, where predicate is an array
• gather_if, where predicate is a unary function <T,bool>
GPU Skeptics Say: • “GPUs are massively data parallel. Therefore GPUs are no good on algorithms with extensive dependencies among the input data.” • “My problem is inherently serial” • Partial Answer: Parallel scan is good at handling many such dependencies in parallel. • “Parallel scan” is also called “parallel prefix”
Parallel scan
scan(+, [1 2 3 -2]) = [1 1+2 1+2+3 1+2+3+(-2)] = [1 3 6 4]
Implementation may non-deterministically evaluate partial sums with any possible parenthesization:
1+(2+(3+(-2)))   (1+2)+(3+(-2))   1+((2+3)+(-2))   (1+(2+3))+(-2)   …
• Properties of integers under + required:
  • Closed: a, b are integers ⇒ a+b is an integer
  • Associative: (a+b)+c = a+(b+c)
• Can use parallel scan for any set & operation w/these properties (that is, any semigroup)
I will only use the word “semigroup” for “closed and associative.”
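To make the "any semigroup" point concrete, here is a minimal serial sketch of scan that takes the operation as a parameter. `scan_ref` is a hypothetical helper, not a library function, and it evaluates strictly left to right; a parallel implementation is free to regroup the operations, which is safe only because the operation is associative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference for an inclusive scan over any associative operation:
// out[0] = in[0], out[i] = op(out[i-1], in[i]).
template <typename T, typename Op>
std::vector<T> scan_ref(Op op, const std::vector<T>& in) {
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = (i == 0) ? in[0] : op(out[i - 1], in[i]);
    }
    return out;
}
```

Passing `std::plus<int>()` gives the running sum from the slide; passing a function that returns the larger of its two arguments gives a running maximum, since integers under max are also closed and associative.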
Example of a finite semigroup
Q: How do I know it’s associative?
A: Method of exhaustion: I tried all 27 possibilities.
• scan(⊕₁, [I I L I I I H H I I I]) = [I I L L L L H H H H H]
• Replaces an I with the last L or H
• “Remembers” whether L or H was last
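The operation table for this semigroup did not survive on the slide, so the sketch below assumes one that matches the two bullets: the right operand wins unless it is I, in which case the left operand is kept. Under that assumption, the method of exhaustion is a triple loop over the 3 × 3 × 3 = 27 combinations. The names `Sym`, `op1`, `op1_is_associative`, and `scan_op1` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Sym { I, L, H };

// Assumed table: an L or H on the right replaces whatever came before;
// an I on the right keeps the value on the left.
Sym op1(Sym a, Sym b) { return b == Sym::I ? a : b; }

// Method of exhaustion: check (a op b) op c == a op (b op c) for all 27 triples.
bool op1_is_associative() {
    const Sym all[] = {Sym::I, Sym::L, Sym::H};
    for (Sym a : all)
        for (Sym b : all)
            for (Sym c : all)
                if (op1(op1(a, b), c) != op1(a, op1(b, c))) return false;
    return true;
}

// Serial inclusive scan with op1.
std::vector<Sym> scan_op1(const std::vector<Sym>& in) {
    std::vector<Sym> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = (i == 0) ? in[0] : op1(out[i - 1], in[i]);
    }
    return out;
}
```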
Parallel scan on finite semigroups computes the same things as finite sequential machines do
[Figure: state diagram of a finite sequential machine with transitions labeled H/H, L/L, and I/I, related to the semigroup by two mappings: f (labeled “Linear # states”) and g (labeled “Exp # elements”).]
scan(⊕₁, v) = Machine(v)
Since all computers are finite sequential machines, parallel scan is a “universal” parallel algorithm. BUT it is not always practical.
What about parallel scan over “infinite” semigroups?
• “Infinite”: integers, floats
• Can do more than with finite sets
• I don’t know of a nice characterization like in the finite case
Unique
Problem: Pack the array x so that it contains the unique values in x. x is assumed to be sorted.
unique(x)
Index i:                           0   1   2   3   4   5   6   7
Input x:                          -7   3   7   7  10  10  10  11
b = map(i → x[i]!=x[i-1], x):      1   1   1   0   1   0   0   1   (b[0] = 1)
ix = scan(+, b):                   1   2   3   3   4   4   4   5   (last value = output length)
ix’ = map(i → i-1, ix):            0   1   2   2   3   3   3   4
Output index:                      0   1   2   3   4
scatter_if(ix’, b, x, y) → y:     -7   3   7  10  11
Output is still sorted
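The map, scan, and scatter_if steps of unique can be sketched serially as follows. This is an illustration in plain C++, not Thrust's implementation; `unique_by_scan` is a hypothetical name, and the flag marks the first element of each run of equal values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> unique_by_scan(const std::vector<int>& x) {
    assert(std::is_sorted(x.begin(), x.end()));   // the slide assumes sorted input
    const std::size_t n = x.size();
    if (n == 0) return {};

    // map: b[i] = 1 when x[i] starts a new run of equal values
    std::vector<std::size_t> b(n);
    for (std::size_t i = 0; i < n; ++i) {
        b[i] = (i == 0 || x[i] != x[i - 1]) ? 1 : 0;
    }

    // inclusive scan(+, b): ix[i] counts the flagged elements in x[0..i]
    std::vector<std::size_t> ix(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        running += b[i];
        ix[i] = running;
    }

    // the last scan value is the output length
    std::vector<int> y(ix[n - 1]);

    // scatter_if: flagged elements go to position ix[i] - 1
    for (std::size_t i = 0; i < n; ++i) {
        if (b[i]) y[ix[i] - 1] = x[i];
    }
    return y;
}
```

Each of the three loops is either a map, a scan, or a scatter, so each one can be replaced by its parallel counterpart without changing the result.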
Copy_if (a/k/a stream compaction)
Problem: Make a packed array y containing the values of the elements x[i] for which pred(x[i]) is true.
copy_if(x, y, pred)
Index i:                             0   1   2   3   4   5   6   7
Input x:                            -7   3  17   0  10  13   1   9
Input pred:                          0   0   1   0   1   1   0   0
ix = scan(+, pred):                  0   0   1   1   2   3   3   3   (last value = output length)
ix’ = map(i → i-1, ix):             -1  -1   0   0   1   2   2   2
Output index:                        0   1   2
scatter_if(ix’, pred, x, y) → y:    17  10  13
copy_if is stable: Relative order in y is the same as in x
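Stream compaction follows the same scan-then-scatter pattern. The sketch below is a serial illustration with a hypothetical name, `copy_if_by_scan`; it evaluates the predicate into a flag array first, which corresponds to the "Input pred" row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

template <typename Pred>
std::vector<int> copy_if_by_scan(const std::vector<int>& x, Pred pred) {
    const std::size_t n = x.size();
    if (n == 0) return {};

    // map: flag[i] = 1 when x[i] should be kept
    std::vector<std::size_t> flag(n);
    for (std::size_t i = 0; i < n; ++i) flag[i] = pred(x[i]) ? 1 : 0;

    // inclusive scan(+, flag)
    std::vector<std::size_t> ix(n);
    std::size_t running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        running += flag[i];
        ix[i] = running;
    }

    // scatter_if: ix[i] - 1 is only used where flag[i] is set, so the
    // "-1" entries of ix' never become an actual index
    std::vector<int> y(ix[n - 1]);
    for (std::size_t i = 0; i < n; ++i) {
        if (flag[i]) y[ix[i] - 1] = x[i];
    }
    return y;
}
```

Stability falls out of the scan: the destination index grows with i, so kept elements land in y in the same relative order they had in x.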
Other similar algorithms in Thrust
• unique_by_key(k, v): k=keys, v=values
  • Retains those k[i]’s & v[i]’s that have unique k[i]’s, and packs k & v
• remove_if(x, pred): Removes x[i] from x if pred(x[i]), and packs
• remove(x, z): Removes x[i] from x if x[i]==z, and packs
• unique_copy(), unique_by_key_copy(), remove_copy(), remove_if_copy():
  • Same as unique(), unique_by_key(), remove(), remove_if() except that the output goes to a separate output iterator instead of overwriting the first input
Segmented parallel scan
Begin the sum over again when indicated by a predicate or predicate vector:
x = [1 7 2 3 2 8 13]
s = [0 0 1 0 0 1 0]
segmented_scan(+, x, s) = [1 8 2 5 7 8 21]
segmented_scan(⊕, x, s) = scan(⊕₂, ((x1,s1), (x2,s2), …))
(⊕ associative ⇒ ⊕₂ is associative.)
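The slide does not spell out the pair operator, so the sketch below assumes one common construction: the right operand's value replaces the running value when its flag is set, otherwise the values are combined, and the flags are OR-ed. `Flagged`, `op2`, and `segmented_scan_sum` are hypothetical names, and the scan itself is written serially.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Flagged {
    int value;
    bool start;   // true when this element begins a new segment
};

// Assumed pair operator built from +: restart at a segment boundary,
// otherwise keep accumulating.
Flagged op2(const Flagged& a, const Flagged& b) {
    return Flagged{b.start ? b.value : a.value + b.value, a.start || b.start};
}

// Segmented inclusive sum expressed as an ordinary scan over (value, flag) pairs.
std::vector<int> segmented_scan_sum(const std::vector<int>& x,
                                    const std::vector<int>& s) {
    assert(x.size() == s.size());
    std::vector<int> out(x.size());
    if (x.empty()) return out;
    Flagged acc{x[0], s[0] != 0};
    out[0] = acc.value;
    for (std::size_t i = 1; i < x.size(); ++i) {
        acc = op2(acc, Flagged{x[i], s[i] != 0});
        out[i] = acc.value;
    }
    return out;
}
```

Because the segment logic lives entirely inside the operator, any scan implementation that accepts a user-defined associative operator can run a segmented scan unchanged.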
Segmented parallel scan: Applications
• Sparse matrix algorithms
• Graph algorithms
Inclusive vs exclusive scan
Inclusive scan: The first output equals the first input (what we’ve used so far)
scan(+, [1 2 3 -2]) = [1 1+2 1+2+3 1+2+3-2]
Exclusive scan: The first output equals an initial value, and the last input is not included
scan(+, [1 2 3 -2], init=0) = [0 0+1 0+1+2 0+1+2+3]
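The two variants can be tried without a GPU using the C++17 standard library, which provides `std::inclusive_scan` and `std::exclusive_scan` in `<numeric>`. The wrapper names `inclusive_sum` and `exclusive_sum` below are hypothetical; this is a host-side sketch, not the Thrust calls.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Inclusive: out[i] includes in[i].
std::vector<int> inclusive_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::inclusive_scan(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive: out[i] is init combined with everything before in[i].
std::vector<int> exclusive_sum(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), init);
    return out;
}
```

With init = 0, the exclusive result is the inclusive result shifted right by one position, which is why exclusive scan directly yields the destination indices needed by stream compaction.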
Scans in Thrust
Plus others where the associative operator defaults to +.
Segmented scans in Thrust
+ 6 other versions with other parameter combinations
Transformed scans in Thrust: Increases computation per memory read
Thrust set operations
• Operate on sorted inputs
• Produce sorted outputs
• Union, intersection, difference
• Symmetric difference(A,B) = {e | e in A or e in B but not both}
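The sorted-in, sorted-out contract is the same one the C++ standard library uses, so symmetric difference can be illustrated on the host with `std::set_symmetric_difference`. The wrapper name `symmetric_difference_sorted` is hypothetical, and this sketch stands in for the Thrust call rather than reproducing it.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Elements that appear in a or in b but not in both; inputs must be sorted.
std::vector<int> symmetric_difference_sorted(const std::vector<int>& a,
                                             const std::vector<int>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    std::vector<int> out;
    std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                  std::back_inserter(out));
    return out;
}
```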
Summary • Scatter and gather: • Parallel assignment • Parallel rearrangement • Parallel removal of irrelevant entries • Scan & segmented scan: • One way to parallelize many (but not all) seemingly difficult to parallelize algorithms • Parsing, graph algorithms, sparse matrix algorithms, … • Can take some creativity to develop an associative operation (semigroup) for a particular algorithm • All of the algorithms described are in Thrust
Theorem: There is a computable, 1-1 correspondence f between finite semigroups (S, ⊕₁) and XXXXXX finite state machines so that (S, ⊕₁) and f((S, ⊕₁)) always compute the same output from the same input.
Parallel CPU Scan Algorithm
scan(+, [0 1 2 3 -2 7]) = [0 1 3 6 4 11]
input:       0      1      |  2      3      |  -2     7
pass 1:      0+0=0  0+1=1  |  2      2+3=5  |  -2     -2+7=5
recursion:   scan(+, [1, 5, 5]) = [1, 6, 11]
pass 2:      0+0=0  0+1=1  |  1+2=3  3+3=6  |  6-2=4  4+7=11
• Only properties of integers under + used:
  • Left identity, 0: 0+a = a
  • Associativity: (a+b)+c = a+(b+c)
• Can use parallel scan for any set & operation w/these properties
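The two-pass structure can be sketched as a blocked scan. The version below is a serial illustration under some assumptions: `blocked_scan_sum` and the `block` parameter are hypothetical names, pass 1 keeps only each block's total rather than its partial sums, and the loops over blocks are written sequentially even though their iterations are independent and could each run on a separate core.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<int> blocked_scan_sum(const std::vector<int>& in, std::size_t block) {
    assert(block > 0);
    const std::size_t n = in.size();
    const std::size_t nblocks = (n + block - 1) / block;

    // Pass 1: each block computes its own total (independent per block).
    std::vector<int> totals(nblocks, 0);
    for (std::size_t b = 0; b < nblocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * block);
        for (std::size_t i = b * block; i < end; ++i) totals[b] += in[i];
    }

    // Recursion step: inclusive scan of the block totals (serial here,
    // since there is only one value per block).
    for (std::size_t b = 1; b < nblocks; ++b) totals[b] += totals[b - 1];

    // Pass 2: each block rescans its inputs, seeded with the scanned total
    // of the preceding block (or the identity 0 for the first block).
    std::vector<int> out(n);
    for (std::size_t b = 0; b < nblocks; ++b) {
        int running = (b == 0) ? 0 : totals[b - 1];
        const std::size_t end = std::min(n, (b + 1) * block);
        for (std::size_t i = b * block; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
    }
    return out;
}
```

Every input element is read twice, once in each pass, and the only extra storage is one total per block.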
Gather: Example use of scan(+,.) (a/k/a stream compaction)
Problem: Make a packed array containing the elements >= 10
Index i:                                  0   1   2   3   4   5   6   7
Input x:                                 -7   3  17   0  10  13   1   9
b = map(i → x[i]>=10, x):                 0   0   1   0   1   1   0   0
ix = scan(+, b):                          0   0   1   1   2   3   3   3   (last value = output length)
map(if b[i] then y[ix[i]-1] = x[i])
Output index:                             0   1   2
Output y:                                17  10  13
Application of segmented scan: Sparse matrix-vector multiplication (SpMV)
• One of the most important operations in scientific code
• Serial: typically graph-like, using adjacency lists
• Sparse matrix has cN non-zeros, for some c<<N
    [ 1 2 0 0 ]
    [ 0 3 9 0 ]
    [ 0 1 4 0 ]
• A = [ 1 2 3 9 1 4 ] The nonzero elements, row by row
• SA = [ 1 0 1 0 1 0 ] 1 when A[i] starts a new row
• JA = [ 0 1 1 2 1 2 ] JA[i] is the column index for A[i]
SpMV as a segmented scan
(a, v) ⊕ (a’, v’) = (a+a’*v’, v’)
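One way to see SpMV as a segmented scan is to split it into a map (multiply each nonzero by the matching vector entry) followed by a segmented sum over the row flags. The sketch below takes that route instead of the pair operator on the slide, so it is an assumption about structure rather than the author's formulation. `spmv_segmented` is a hypothetical name, the loops are serial, and the flag format cannot represent empty rows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A: nonzero values row by row; SA: 1 where a new row starts;
// JA: column index of each nonzero; x: dense input vector.
std::vector<double> spmv_segmented(const std::vector<double>& A,
                                   const std::vector<int>& SA,
                                   const std::vector<std::size_t>& JA,
                                   const std::vector<double>& x) {
    const std::size_t n = A.size();
    assert(SA.size() == n && JA.size() == n);

    // map: one product per nonzero (gather from x through JA)
    std::vector<double> p(n);
    for (std::size_t i = 0; i < n; ++i) {
        assert(JA[i] < x.size());
        p[i] = A[i] * x[JA[i]];
    }

    // segmented inclusive scan(+): restart wherever SA[i] is set
    for (std::size_t i = 1; i < n; ++i) {
        if (!SA[i]) p[i] += p[i - 1];
    }

    // the last element of each segment holds that row's dot product
    std::vector<double> y;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 1 == n || SA[i + 1]) y.push_back(p[i]);
    }
    return y;
}
```

Calling it with the A, SA, and JA arrays from the previous slide and a length-4 vector x returns one entry per matrix row.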
Parallel scan: summary
• Way to parallelize many algorithms that seem at first to have difficult dependencies among the inputs
• Using parallel scan effectively can take some forethought
• “Semigroup design” is not necessarily straightforward
Other Applications of Scan • Sorting • Sparse matrix multiply, iterative solve • ODE’s & PDE’s: CAD & simulation: EE, MechE, physics, chem/biochem • Optimization • Linear programming • Nonlinear programming • Other graph algorithms
Conclusion
• Scan is one tool that can be used to parallelize lots of algorithms that seem at first to be inherently serial
• Usually, other patterns (map, reduce, …) are combined with scan to create a complete algorithm
• Understanding use & performance of scan is one key to judging applicability and performance of GPUs to non-graphics problems
Parallel Transition Localization
Samples = vector of numbers → Vector s of S’s → parallel_scan(⊕, [s0, s1, s2, …]) = [c0, c1, c2, …] = c → Identify transitions in c → Transitions
• Set S with an operator ⊕
• 0_⊕ ∈ S such that 0_⊕ ⊕ a = a
• ⊕ is associative
Barford, “Parallel Transition Localization,” Proc. IEEE I2MTC, 2010.
Implementation
             Core 1             |  Core 2                  |  Core 3
input:       x0     x1          |  x2       x3             |  x4       x5
pass 1:      s0     s0⊕s1       |  s2       s2⊕s3          |  s4       s4⊕s5
recursion:   cumsum(⊕, [s0⊕s1, s2⊕s3, s4⊕s5]) = [c1, c3, c5]
pass 2:      s0     s0⊕s1       |  c1⊕s2    c1⊕s2⊕s3       |  c3⊕s4    c3⊕s4⊕s5
             Transition? Transition? | Transition? Transition? | Transition? Transition?
• Samples read twice
• Other memory needed: constant amount per core (no temp arrays)