New Abstractions For Data Parallel Programming James C. Brodman Department of Computer Science brodman2@illinois.edu In collaboration with: George Almási, Basilio Fraguela, María Garzarán, David Padua
Outline • Introduction • Hierarchically Tiled Arrays • Additional Abstractions for Data Parallel Programming • Conclusions
Going Beyond Arrays • Parallel programming has been well studied for numerical programs • Shared/Distributed Memory APIs, Array languages • Hierarchically Tiled Arrays (HTAs) • However, many important problems today are non-numerical • Examine non-numerical programs and find new abstractions • Data Structures • Parallel Primitives
Array Languages • Many numerical programs were written using Array languages • Popular among scientists and engineers. • Fortran 90 and successors • MATLAB • Parallelism not the reason for this notation.
Array Languages • Convenient notation for linear algebra and other algorithms • More compact • Higher level of abstraction

  do i=1,n                             do i=1,n
    do j=1,n                             do j=1,n
      C(i,j) = A(i,j) + B(i,j)             S = S + A(i,j)
    end do                               end do
  end do                               end do

  C = A + B                            S = S + sum(A)
Data Parallel Programming • Array languages seem a natural fit for parallelism • Parallel programming with aggregate-based or loop-based languages is data centric, or, data parallel • Phrased in terms of performing the same operation on multiple pieces of data • Contrast with task parallelism where parallel tasks may perform completely different operations • Many reasons to prefer data parallel programming over task parallel approaches
Data Parallel Advantages • Data parallel programming is scalable • Scales with an increasing number of processors by increasing the size of the data • Data parallel programs based on array operations resemble conventional, serial programs • Parallelism is encapsulated • Parallelism is structured • Portable • Can run on any class of machine for which the appropriate operators are implemented • Shared/Distributed Memory, Vector Intrinsics, GPUs • Operations implemented as parallel loops in shared memory • Operations implemented as messages in distributed memory • Operations implemented with vector intrinsics for SIMD
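The same aggregate operation can be lowered differently for each machine class. As a rough illustration of the shared-memory case, here is how C = A + B might become a parallel loop; the flat std::vector layout and the OpenMP pragma are assumptions for this sketch, not the HTA library's actual lowering:

  #include <vector>

  // Illustrative only: one way an array operation such as C = A + B can be
  // realized as a parallel loop on a shared-memory machine.
  void array_add(const std::vector<double>& A,
                 const std::vector<double>& B,
                 std::vector<double>& C) {
      #pragma omp parallel for          // iterations are independent
      for (long i = 0; i < static_cast<long>(C.size()); ++i)
          C[i] = A[i] + B[i];
  }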
Data Parallel Advantages • Data parallel programming can both: • Enforce determinacy • Encapsulate non-determinacy • Data parallel programming facilitates autotuning
Numerical Programs and Tiling • Blocking/Tiling is important: • Data Distribution • Locality • Parallelism • Who is responsible for tiling: • The Compiler? • The Programmer?
Tiling and Compilers (Matrix Multiplication) [Chart: MFLOPS vs. matrix size for matrix multiplication built with icc -O3 and icc -O3 -xT, compared against Intel MKL; the roughly 20x gap shows that, clearly, the compiler isn't doing a good job at tiling.]
Tiling and Array Languages • Another option is to leave it up to the programmer • What does the code look like? • Notation can get complicated • Additional Dimensions • Arrays of Arrays • Operators not built to handle tiling
Hierarchically Tiled Arrays • The complexity of the tiling problem directly motivates the Hierarchically Tiled Array (HTA) • Makes tiles first class objects • Referenced explicitly • Extended array operations to operate with tiles
Hierarchically Tiled Arrays [Figure: an HTA's hierarchy of tiles; outer tiles can be distributed across nodes, inner tiles mapped to the cores of a multicore, and the innermost tiles exploited for locality.]
Higher Level Operations • Many operators are part of the library • Map, reduce, circular shift, replicate, transpose, etc. • Programmers can create new complex parallel operators through the hmap primitive (and MapReduce) • Applies a user-defined operator to each tile of the HTA • And to corresponding tiles if multiple HTAs are involved • The operator is applied in parallel across tiles
User Defined Operations hmap( F(), X, Y ) [Figure: hmap applies F() to corresponding tiles of X and Y, tile by tile.]
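To make the picture concrete, here is a minimal sketch of the hmap idea over a simplified tiled container; the TiledArray type and the hmap signature below are illustrative assumptions, not the HTA library's real interface:

  #include <vector>
  #include <functional>

  // Simplified stand-in for an HTA: a 1-D array of tiles, each a vector.
  using Tile       = std::vector<double>;
  using TiledArray = std::vector<Tile>;

  // hmap applies a user-defined operator to corresponding tiles of X and Y.
  // Each tile pair is independent, so the loop over tiles can run in parallel.
  void hmap(const std::function<void(Tile&, const Tile&)>& F,
            TiledArray& X, const TiledArray& Y) {
      #pragma omp parallel for
      for (long t = 0; t < static_cast<long>(X.size()); ++t)
          F(X[t], Y[t]);
  }

  // Example user operator: element-wise accumulate Y's tile into X's tile.
  void accumulate(Tile& x, const Tile& y) {
      for (std::size_t i = 0; i < x.size(); ++i) x[i] += y[i];
  }

  // Usage: hmap(accumulate, X, Y);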
HTA Examples • We can handle many basic types of computations using the HTA library • Cannon’s Matrix Multiplication • Sparse Matrix/Vector Multiplication • We also support more complicated computations • Recursive parallelism • Dynamic partitioning
Cannon's Matrix Multiplication [Figure: 3x3 tilings of A and B; the initial skew shifts row i of A left by i-1 tiles and column i of B up by i-1 tiles, after which each step multiplies and adds the aligned tiles and then circularly shifts A left and B up (shift-multiply-add).]
Cannon's Matrix Multiplication

  HTA A, B, C
  do i = 1:m                                       // initial skew
    A(i,:) = circ_shift( A(i,:), [ 0, -(i-1) ] )   // shift row i left
    B(:,i) = circ_shift( B(:,i), [ -(i-1), 0 ] )   // shift column i up
  end do
  do i = 1:n                                       // main loop
    C = C + A * B                                  // tile matrix mult. and add
    A = circ_shift( A, [ 0, -1 ] )
    B = circ_shift( B, [ -1, 0 ] )
  end do
Sparse Matrix/Vector Multiplication [Figure: the input vector is transposed and replicated across the block rows, multiplied element-by-element (.*) with the sparse matrix tiles, and a row-wise Reduce(+) produces the result vector.]
Sparse Matrix/Vector Multiplication

  Sparse_HTA A
  HTA In, Res
  Res = transpose( In )
  Res = replicate( Res, [3 1] )    // replicate
  Res = map( *, A, Res )           // element-by-element mult.
  Res = reduce( +, Res, [0 1] )    // row reduction
User Defined Operations - Merge

  Merge( HTA input1, HTA input2, HTA output ) {
    ...
    if ( output.size() < THRESHOLD )
      SerialMerge( input1, input2, output )
    else {
      i = input1.size() / 2
      input1.addPartition( i )
      j = input2.location_first_gt( input1[i] )   // first element of input2 greater than the split value
      input2.addPartition( j )
      k = i + j
      output.addPartition( k )
      hmap( Merge(), input1, input2, output )
    }
    ...
  }

[Figure: dynamic partitioning; input1 is split at its midpoint, input2 at the first element greater than input1's split value, and Merge is applied recursively (via hmap) to the corresponding pieces.]
Advantages of tiling as a first class object for optimization • HTAs have been implemented as C++ and MATLAB libraries • For shared and distributed memory machines • A GPU version is planned • Implemented several benchmark suites • Performance is competitive with OpenMP, MPI, and TBB counterparts • Furthermore, the HTA notation produces code that is more readable than other notations and significantly reduces the number of lines of code
Advantages of tiling as a first class object [Chart: lines of code, HTA vs. MPI, for the EP, CG, MG, FT, and LU benchmarks; the HTA versions are consistently shorter.]
Performance Results [Charts: MG, FT, IS, and CG results; with basic compiler optimizations, the HTA versions can match Fortran/MPI.]
Extending Data Parallel Programming • Many of today’s programs are amenable to data parallelism, but not with today’s abstractions • Need to identify new primitives to extend data parallelism to these types of programs • Non-numerical • Non-deterministic • Traditionally task parallel
New Data Structures for Non Numerical Computations • Operations on aggregates do not have to be confined to arrays • Trees • Graphs • Sets
Sets • Sets are a possible aggregate to consider for data parallelism • Have been examined before (The Connection Machine – Hillis) • What primitives do we need? • Map – apply some function to every element of a set • Reduce – apply reductions across a set or multiple sets (Union, Intersection, etc) • MapReduce • Scan – perform a prefix operation on sets
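As a rough illustration of these primitives (not the library's actual API), the sketch below writes map and two reductions over std::set; a parallel implementation would instead partition the set into tiles and apply the operator tile by tile:

  #include <set>
  #include <numeric>
  #include <algorithm>
  #include <iterator>

  // Map: apply a function to every element, producing a new set.
  template <typename T, typename F>
  std::set<T> set_map(const std::set<T>& s, F f) {
      std::set<T> out;
      for (const T& x : s) out.insert(f(x));
      return out;
  }

  // Reduce: combine all elements with a binary operator (here, a sum).
  template <typename T>
  T set_sum(const std::set<T>& s) {
      return std::accumulate(s.begin(), s.end(), T{});
  }

  // Reduction across two sets: set union as a set-valued reduction.
  template <typename T>
  std::set<T> set_union(const std::set<T>& a, const std::set<T>& b) {
      std::set<T> out;
      std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                     std::inserter(out, out.end()));
      return out;
  }

  // Example: std::set<int> doubled = set_map(s, [](int x) { return 2 * x; });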
What problem domains can be solved in parallel using set operations? • We have studied several areas including • Search • Data mining • Mesh Refinement • In all cases, it was possible to obtain a highly parallel and readable version using set operations
Example – Search – 15 Puzzle • 4x4 grid of tiles with a “hole” • Slide tiles to go from a start state to the Goal • States (puzzle configurations) and transitions (moves) form a graph • Solve using a Best-First Search
Parallel Search Algorithms • Best-First search uses a heuristic to guide the search to examine “good” nodes first • If the search space is very large, prefer nodes that are closer to a solution over nodes less likely to quickly reach the goal • Ex. The 15 puzzle search space size is ~16! • For the puzzle, the heuristic function takes a state and gives it a score • Better scores are likely to lead to solutions more quickly • Metric is sum of: • Steps so far • Sum of distances of each tile from its final position
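A small sketch of such a scoring function for the 15 puzzle, assuming Manhattan distance as the per-tile distance and a row-major board with 0 marking the hole (both assumptions for illustration):

  #include <array>
  #include <cstdlib>

  struct State {
      std::array<int, 16> board;  // board[pos] = tile number, 0 = hole
      int steps;                  // moves taken to reach this state
  };

  // Score = steps so far + summed distance of each tile from its goal slot.
  int score(const State& s) {
      int dist = 0;
      for (int pos = 0; pos < 16; ++pos) {
          int tile = s.board[pos];
          if (tile == 0) continue;               // ignore the hole
          int goal = tile - 1;                   // tile t belongs at index t-1
          dist += std::abs(pos / 4 - goal / 4)   // row distance
                + std::abs(pos % 4 - goal % 4);  // column distance
      }
      return s.steps + dist;                     // lower scores are better
  }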
Parallel Search Algorithms [Figure: each iteration selects a set of nodes from the work list W, expands them, and adds the successors back to W.]
Parallel Search Algorithms

  Search( initial_state )
    work_list.add( initial_state )
    while ( work_list not empty )
      n = SELECT( work_list )
      if ( n contains GOAL ) break
      work_list = work_list - n
      successors = expand( n )
      update( work_list, successors )

• The implementation of SELECT determines the type of search: • ALL → Breadth-First • DEEPEST → Depth-First • BEST (Heuristic) → Best-First • The code looks sequential, but the operators can be parallel
Parallel Search Algorithms • One way to efficiently implement the parallel operators is to use tiled sets and a map primitive (as before we used tiled arrays and the HTA’s hmap) • We want to tile for the same reasons as before: • Data distribution • Locality • Parallelism
Mapping and Tiled Sets • Cannot create a tiled set as easily as a tiled array • Specifying a tiling is trivial for arrays • A tiled set requires two parameters: • The number of tiles • A mapping function that takes a piece of data from the set and specifies a destination tile number [Figure: a set whose elements are sent by the mapping function to tiles 1 through 4.]
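A minimal sketch of what such a tiled set might look like, assuming a hash-style mapping function; the names TiledSet and map_fn are invented here for illustration and are not a real API:

  #include <vector>
  #include <unordered_set>
  #include <functional>
  #include <cstddef>

  // A tiled set needs both the number of tiles and a user-supplied mapping
  // function that sends each element to a tile.
  template <typename T>
  class TiledSet {
  public:
      TiledSet(std::size_t num_tiles,
               std::function<std::size_t(const T&)> map_fn)
          : tiles_(num_tiles), map_fn_(std::move(map_fn)) {}

      void insert(const T& x) { tiles_[map_fn_(x) % tiles_.size()].insert(x); }

      std::unordered_set<T>& tile(std::size_t i) { return tiles_[i]; }
      std::size_t num_tiles() const { return tiles_.size(); }

  private:
      std::vector<std::unordered_set<T>> tiles_;
      std::function<std::size_t(const T&)> map_fn_;
  };

  // Example: partition integers across 4 tiles by a hash of their value.
  // TiledSet<int> s(4, [](const int& x) { return std::hash<int>{}(x); });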
Locality vs Load Balance • Choosing a “good” mapping function is important as it affects: • Load Balance • Locality • Load imbalance can occur if data is not evenly mapped to tiles • One possible solution is overdecomposition • A compromise between extra overhead and better load balance • Specify more tiles than processors and have a “smart” runtime (e.g., task stealing in Cilk and Intel TBB, or CHARM++)
Tiled Sets and Locality • The mapping function affects locality • Ideally, all the red nodes would end up in the original tile • Shared Memory – new nodes are in cache for the next iteration • Distributed Memory – minimizes the communication needed to map new nodes • However, this is not always the case [Figure: select and expand on a tiled set; some newly expanded nodes map to other tiles.]
Tiled Sets and Locality [Figure: after expand, the new nodes are routed by a MapReduce step; the mapping function assigns each node to a destination tile and the results are combined with the existing tiles by set union.]
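A small sketch of that scatter step, reusing the illustrative tiled layout assumed above (one std::unordered_set per tile); the Node type and names are placeholders, not a real API:

  #include <vector>
  #include <unordered_set>
  #include <functional>
  #include <cstddef>

  using Node  = long;                                   // placeholder node type
  using Tiles = std::vector<std::unordered_set<Node>>;  // one set per tile

  // After expansion, each new node is sent through the mapping function to a
  // destination tile (the map step) and merged into that tile by set union
  // (the reduce step).
  void scatter_new_nodes(const std::vector<Node>& expanded,
                         const std::function<std::size_t(Node)>& map_fn,
                         Tiles& tiles) {
      for (Node n : expanded) {
          std::size_t t = map_fn(n) % tiles.size();
          tiles[t].insert(n);
      }
  }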
Non Numerical Computations • Many non-numerical computations are amenable to data parallelism when it is properly extended • Search, etc. • Tiling can benefit sets just as it does arrays when properly extended • The mapping function is explicit • The “quality” of the mapping is important
Non Deterministic Computations • Many non-deterministic problems could be amenable to data parallelism with the proper extensions • Need new primitives that can either: • Enforce determinacy • Encapsulate the non-determinacy • Two examples: • Vector operations with indirect indices • Delaunay Mesh Refinement
Vector Operations with Indirect Indices • Consider A( X(i) ) += V(i) : • Fully parallel if X does not contain duplicate values • Potential races if duplicates exist • One possible way to parallelize is to annotate that all updates to A must be atomic
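One way to express that annotation, sketched here with an OpenMP atomic update (an assumption for illustration; the annotation could map to other mechanisms as well):

  #include <vector>

  // Each update to A is atomic, so duplicate indices in X no longer race.
  // Only the colliding updates are serialized, not the whole loop.
  void scatter_add_atomic(std::vector<double>& A,
                          const std::vector<int>& X,
                          const std::vector<double>& V) {
      #pragma omp parallel for
      for (long i = 0; i < static_cast<long>(X.size()); ++i) {
          #pragma omp atomic
          A[X[i]] += V[i];
      }
  }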
Vector Operations with Indirect Indices • A( X(i) ) += V(i) : • Let A represent the balances of accounts • Let the values of X represent the indices of specific accounts in A • Let V be a series of transactions sorted chronologically • If the bank imposes penalties for negative balances, the transactions associated with an individual account cannot be reordered • Can be successfully parallelized if the programmer can specify that updates are not commutative • Allow the parallel update of different accounts, but serialize updates to the same account • Inspector/Executor
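A rough sketch of the inspector/executor idea for this example: the inspector groups transactions by account, and the executor processes different accounts in parallel while replaying each account's transactions in their original order. The overdraft penalty amount is hypothetical, a stand-in for the bank's actual rule:

  #include <vector>
  #include <map>
  #include <utility>
  #include <cstddef>

  void apply_transactions(std::vector<double>& balances,
                          const std::vector<int>& account,    // X
                          const std::vector<double>& amount)  // V, chronological
  {
      // Inspector: for each account, collect its transaction indices in order.
      std::map<int, std::vector<std::size_t>> by_account;
      for (std::size_t i = 0; i < account.size(); ++i)
          by_account[account[i]].push_back(i);
      std::vector<std::pair<int, std::vector<std::size_t>>> groups(
          by_account.begin(), by_account.end());

      // Executor: different accounts are independent and run in parallel;
      // within one account the updates stay in chronological order.
      #pragma omp parallel for
      for (long g = 0; g < static_cast<long>(groups.size()); ++g) {
          int acct = groups[g].first;
          for (std::size_t i : groups[g].second) {
              balances[acct] += amount[i];
              if (balances[acct] < 0.0)
                  balances[acct] -= 25.0;   // hypothetical overdraft penalty
          }
      }
  }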
Delaunay Mesh Refinement • Given a mesh of triangles, we want to refine the mesh so that all triangles meet certain properties • The circumcircle of any triangle does not contain points of any other triangle • The minimum angle of any triangle is at least a certain size • Can be written as a sequence of data parallel operators (see the sketch below) • Given a set of triangles, find those that are “bad” • For each bad triangle, compute the affected neighboring triangles, or cavity • For each cavity, remove the bad triangle and its neighbors and replace them with new triangles • This might create new bad triangles • Repeat until the mesh contains no bad triangles
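A structural sketch of that refinement loop over sets of triangles; the geometric kernels (is_bad, cavity, retriangulate) are passed in as callables since they depend on the mesh representation, and the Triangle handle type is a placeholder:

  #include <set>
  #include <vector>
  #include <functional>

  using Triangle = int;               // placeholder handle type
  using Mesh     = std::set<Triangle>;

  void refine(Mesh& mesh,
              std::function<bool(Triangle)> is_bad,
              std::function<std::vector<Triangle>(Triangle, const Mesh&)> cavity,
              std::function<std::vector<Triangle>(const std::vector<Triangle>&)> retriangulate)
  {
      for (;;) {
          // 1. Find the bad triangles (a filtering map over the mesh).
          std::vector<Triangle> bad;
          for (Triangle t : mesh)
              if (is_bad(t)) bad.push_back(t);
          if (bad.empty()) break;       // stop when no bad triangles remain

          // 2-3. For each bad triangle, compute its cavity, remove it, and
          // insert the replacement triangles; non-overlapping cavities could
          // be processed in parallel.
          for (Triangle t : bad) {
              if (!mesh.count(t)) continue;   // already removed by a neighbor's cavity
              std::vector<Triangle> cav = cavity(t, mesh);
              for (Triangle c : cav) mesh.erase(c);
              for (Triangle n : retriangulate(cav)) mesh.insert(n);
          }
          // New triangles may themselves be bad, so the outer loop repeats.
      }
  }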