New Abstractions For Data Parallel Programming James C. Brodman Department of Computer Science brodman2@illinois.edu In collaboration with: George Almási, Basilio Fraguela, María Garzarán, David Padua
Outline • Introduction • Hierarchically Tiled Arrays • Additional Abstractions for Data Parallel Programming • Conclusions
Going Beyond Arrays • Parallel programming has been well studied for numerical programs • Shared/Distributed Memory APIs, Array languages • Hierarchically Tiled Arrays (HTAs) • However, many important problems today are non-numerical • Examine non-numerical programs and find new abstractions • Data Structures • Parallel Primitives
Array Languages • Many numerical programs were written using Array languages • Popular among scientists and engineers. • Fortran 90 and successors • MATLAB • Parallelism not the reason for this notation.
Array Languages • Convenient notation for linear algebra and other algorithms • More compact • Higher level of abstraction

  do i=1,n                             do i=1,n
    do j=1,n                             do j=1,n
      C(i,j) = A(i,j) + B(i,j)             S = S + A(i,j)
    end do                               end do
  end do                               end do

  C = A + B                            S = S + sum(A)
Data Parallel Programming • Array languages seem a natural fit for parallelism • Parallel programming with aggregate-based or loop-based languages is data centric, or, data parallel • Phrased in terms of performing the same operation on multiple pieces of data • Contrast with task parallelism where parallel tasks may perform completely different operations • Many reasons to prefer data parallel programming over task parallel approaches
Data Parallel Advantages • Data parallel programming is scalable • Scales with an increasing number of processors by increasing the size of the data • Data parallel programs based on array operations resemble conventional, serial programs • Parallelism is encapsulated • Parallelism is structured • Portable • Can run on any class of machine for which the appropriate operators are implemented • Shared/Distributed Memory, Vector Intrinsics, GPUs • Operations implemented as parallel loops in shared memory • Operations implemented as messages in distributed memory • Operations implemented with vector intrinsics for SIMD
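The same aggregate operation can be lowered differently for each machine class. As a rough illustration of the shared-memory case, here is how C = A + B might become a parallel loop; the flat std::vector layout and the OpenMP pragma are assumptions for this sketch, not the HTA library's actual lowering:

  #include <vector>

  // Illustrative only: one way an array operation such as C = A + B can be
  // realized as a parallel loop on a shared-memory machine.
  void array_add(const std::vector<double>& A,
                 const std::vector<double>& B,
                 std::vector<double>& C) {
      #pragma omp parallel for          // iterations are independent
      for (long i = 0; i < static_cast<long>(C.size()); ++i)
          C[i] = A[i] + B[i];
  }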
Data Parallel Advantages • Data parallel programming can both: • Enforce determinacy • Encapsulate non-determinacy • Data parallel programming facilitates autotuning
Numerical Programs and Tiling • Blocking/Tiling is important: • Data Distribution • Locality • Parallelism • Who is responsible for tiling: • The Compiler? • The Programmer?
Tiling and Compilers (Matrix Multiplication) [Chart: MFLOPS vs. matrix size for matrix multiplication built with icc -O3 and icc -O3 -xT, compared against Intel MKL; the roughly 20x gap shows that, clearly, the compiler isn't doing a good job at tiling.]
Tiling and Array Languages • Another option is to leave it up to the programmer • What does the code look like? • Notation can get complicated • Additional Dimensions • Arrays of Arrays • Operators not built to handle tiling
Hierarchically Tiled Arrays • The complexity of the tiling problem directly motivates the Hierarchically Tiled Array (HTA) • Makes tiles first class objects • Referenced explicitly • Extended array operations to operate with tiles
Hierarchically Tiled Arrays [Figure: an HTA's hierarchy of tiles; outer tiles can be distributed across nodes, inner tiles mapped to the cores of a multicore, and the innermost tiles exploited for locality.]
Higher Level Operations • Many operators are part of the library • Map, reduce, circular shift, replicate, transpose, etc. • Programmers can create new complex parallel operators through the hmap primitive (and MapReduce) • Applies a user-defined operator to each tile of the HTA • And to corresponding tiles if multiple HTAs are involved • The operator is applied in parallel across tiles
User Defined Operations hmap( F(), X, Y ) [Figure: hmap applies F() to corresponding tiles of X and Y, tile by tile.]
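To make the picture concrete, here is a minimal sketch of the hmap idea over a simplified tiled container; the TiledArray type and the hmap signature below are illustrative assumptions, not the HTA library's real interface:

  #include <vector>
  #include <functional>

  // Simplified stand-in for an HTA: a 1-D array of tiles, each a vector.
  using Tile       = std::vector<double>;
  using TiledArray = std::vector<Tile>;

  // hmap applies a user-defined operator to corresponding tiles of X and Y.
  // Each tile pair is independent, so the loop over tiles can run in parallel.
  void hmap(const std::function<void(Tile&, const Tile&)>& F,
            TiledArray& X, const TiledArray& Y) {
      #pragma omp parallel for
      for (long t = 0; t < static_cast<long>(X.size()); ++t)
          F(X[t], Y[t]);
  }

  // Example user operator: element-wise accumulate Y's tile into X's tile.
  void accumulate(Tile& x, const Tile& y) {
      for (std::size_t i = 0; i < x.size(); ++i) x[i] += y[i];
  }

  // Usage: hmap(accumulate, X, Y);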
HTA Examples • We can handle many basic types of computations using the HTA library • Cannon’s Matrix Multiplication • Sparse Matrix/Vector Multiplication • We also support more complicated computations • Recursive parallelism • Dynamic partitioning
Cannon's Matrix Multiplication [Figure: 3x3 tilings of A and B; the initial skew shifts row i of A left by i-1 tiles and column i of B up by i-1 tiles, after which each step multiplies and adds the aligned tiles and then circularly shifts A left and B up (shift-multiply-add).]
Cannon's Matrix Multiplication

  HTA A, B, C
  do i = 1:m                                       // initial skew
    A(i,:) = circ_shift( A(i,:), [ 0, -(i-1) ] )   // shift row i left
    B(:,i) = circ_shift( B(:,i), [ -(i-1), 0 ] )   // shift column i up
  end do
  do i = 1:n                                       // main loop
    C = C + A * B                                  // tile matrix mult. and add
    A = circ_shift( A, [ 0, -1 ] )
    B = circ_shift( B, [ -1, 0 ] )
  end do
Sparse Matrix/Vector Multiplication [Figure: the input vector is transposed and replicated across the block rows, multiplied element-by-element (.*) with the sparse matrix tiles, and a row-wise Reduce(+) produces the result vector.]
Sparse Matrix/Vector Multiplication

  Sparse_HTA A
  HTA In, Res
  Res = transpose( In )
  Res = replicate( Res, [3 1] )    // replicate
  Res = map( *, A, Res )           // element-by-element mult.
  Res = reduce( +, Res, [0 1] )    // row reduction
User Defined Operations - Merge

  Merge( HTA input1, HTA input2, HTA output ) {
    ...
    if ( output.size() < THRESHOLD )
      SerialMerge( input1, input2, output )
    else {
      i = input1.size() / 2
      input1.addPartition( i )
      j = input2.location_first_gt( input1[i] )   // first element of input2 greater than the split value
      input2.addPartition( j )
      k = i + j
      output.addPartition( k )
      hmap( Merge(), input1, input2, output )
    }
    ...
  }

[Figure: dynamic partitioning; input1 is split at its midpoint, input2 at the first element greater than input1's split value, and Merge is applied recursively (via hmap) to the corresponding pieces.]
Advantages of tiling as a first class object for optimization • HTAs have been implemented as C++ and MATLAB libraries • For shared and distributed memory machines • A GPU version is planned • Implemented several benchmark suites • Performance is competitive with OpenMP, MPI, and TBB counterparts • Furthermore, the HTA notation produces code that is more readable than other notations and significantly reduces the number of lines of code
Advantages of tiling as a first class object [Chart: lines of code, HTA vs. MPI, for the EP, CG, MG, FT, and LU benchmarks; the HTA versions are consistently shorter.]
Performance Results [Charts: MG, FT, IS, and CG results; with basic compiler optimizations, the HTA versions can match Fortran/MPI.]
Extending Data Parallel Programming • Many of today’s programs are amenable to data parallelism, but not with today’s abstractions • Need to identify new primitives to extend data parallelism to these types of programs • Non-numerical • Non-deterministic • Traditionally task parallel
New Data Structures for Non Numerical Computations • Operations on aggregates do not have to be confined to arrays • Trees • Graphs • Sets
Sets • Sets are a possible aggregate to consider for data parallelism • Have been examined before (The Connection Machine – Hillis) • What primitives do we need? • Map – apply some function to every element of a set • Reduce – apply reductions across a set or multiple sets (Union, Intersection, etc) • MapReduce • Scan – perform a prefix operation on sets
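As a rough illustration of these primitives (not the library's actual API), the sketch below writes map and two reductions over std::set; a parallel implementation would instead partition the set into tiles and apply the operator tile by tile:

  #include <set>
  #include <numeric>
  #include <algorithm>
  #include <iterator>

  // Map: apply a function to every element, producing a new set.
  template <typename T, typename F>
  std::set<T> set_map(const std::set<T>& s, F f) {
      std::set<T> out;
      for (const T& x : s) out.insert(f(x));
      return out;
  }

  // Reduce: combine all elements with a binary operator (here, a sum).
  template <typename T>
  T set_sum(const std::set<T>& s) {
      return std::accumulate(s.begin(), s.end(), T{});
  }

  // Reduction across two sets: set union as a set-valued reduction.
  template <typename T>
  std::set<T> set_union(const std::set<T>& a, const std::set<T>& b) {
      std::set<T> out;
      std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                     std::inserter(out, out.end()));
      return out;
  }

  // Example: std::set<int> doubled = set_map(s, [](int x) { return 2 * x; });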
What problem domains can be solved in parallel using set operations? • We have studied several areas including • Search • Data mining • Mesh Refinement • In all cases, it was possible to obtain a highly parallel and readable version using set operations
Example – Search – 15 Puzzle • 4x4 grid of tiles with a “hole” • Slide tiles to go from a start state to the Goal • States (puzzle configurations) and transitions (moves) form a graph • Solve using a Best-First Search
Parallel Search Algorithms • Best-First search uses a heuristic to guide the search to examine “good” nodes first • If the search space is very large, prefer nodes that are closer to a solution over nodes less likely to quickly reach the goal • Ex. The 15 puzzle search space size is ~16! • For the puzzle, the heuristic function takes a state and gives it a score • Better scores are likely to lead to solutions more quickly • Metric is sum of: • Steps so far • Sum of distances of each tile from its final position
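A small sketch of such a scoring function for the 15 puzzle, assuming Manhattan distance as the per-tile distance and a row-major board with 0 marking the hole (both assumptions for illustration):

  #include <array>
  #include <cstdlib>

  struct State {
      std::array<int, 16> board;  // board[pos] = tile number, 0 = hole
      int steps;                  // moves taken to reach this state
  };

  // Score = steps so far + summed distance of each tile from its goal slot.
  int score(const State& s) {
      int dist = 0;
      for (int pos = 0; pos < 16; ++pos) {
          int tile = s.board[pos];
          if (tile == 0) continue;               // ignore the hole
          int goal = tile - 1;                   // tile t belongs at index t-1
          dist += std::abs(pos / 4 - goal / 4)   // row distance
                + std::abs(pos % 4 - goal % 4);  // column distance
      }
      return s.steps + dist;                     // lower scores are better
  }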
Parallel Search Algorithms [Figure: each iteration selects a set of nodes from the work list W, expands them, and adds the successors back to W.]
Parallel Search Algorithms

  Search( initial_state )
    work_list.add( initial_state )
    while ( work_list not empty )
      n = SELECT( work_list )
      if ( n contains GOAL ) break
      work_list = work_list - n
      successors = expand( n )
      update( work_list, successors )

• The implementation of SELECT determines the type of search: • ALL → Breadth-First • DEEPEST → Depth-First • BEST (Heuristic) → Best-First • The code looks sequential, but the operators can be parallel
Parallel Search Algorithms • One way to efficiently implement the parallel operators is to use tiled sets and a map primitive (as before we used tiled arrays and the HTA’s hmap) • We want to tile for the same reasons as before: • Data distribution • Locality • Parallelism
Mapping and Tiled Sets • Cannot create a tiled set as easily as a tiled array • Specifying a tiling is trivial for arrays • A tiled set requires two parameters: • The number of tiles • A mapping function that takes a piece of data from the set and specifies a destination tile number [Figure: a set whose elements are sent by the mapping function to tiles 1 through 4.]
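A minimal sketch of what such a tiled set might look like, assuming a hash-style mapping function; the names TiledSet and map_fn are invented here for illustration and are not a real API:

  #include <vector>
  #include <unordered_set>
  #include <functional>
  #include <cstddef>

  // A tiled set needs both the number of tiles and a user-supplied mapping
  // function that sends each element to a tile.
  template <typename T>
  class TiledSet {
  public:
      TiledSet(std::size_t num_tiles,
               std::function<std::size_t(const T&)> map_fn)
          : tiles_(num_tiles), map_fn_(std::move(map_fn)) {}

      void insert(const T& x) { tiles_[map_fn_(x) % tiles_.size()].insert(x); }

      std::unordered_set<T>& tile(std::size_t i) { return tiles_[i]; }
      std::size_t num_tiles() const { return tiles_.size(); }

  private:
      std::vector<std::unordered_set<T>> tiles_;
      std::function<std::size_t(const T&)> map_fn_;
  };

  // Example: partition integers across 4 tiles by a hash of their value.
  // TiledSet<int> s(4, [](const int& x) { return std::hash<int>{}(x); });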
Locality vs Load Balance • Choosing a “good” mapping function is important as it affects: • Load Balance • Locality • Load imbalance can occur if data is not evenly mapped to tiles • One possible solution is overdecomposition • A compromise between extra overhead and better load balance • Specify more tiles than processors and have a “smart” runtime (e.g., task stealing in Cilk and Intel TBB, or CHARM++)
Tiled Sets and Locality • The mapping function affects locality • Ideally, all the red nodes would end up in the original tile • Shared Memory – new nodes are in cache for the next iteration • Distributed Memory – minimizes the communication needed to map new nodes • However, this is not always the case [Figure: select and expand on a tiled set; some newly expanded nodes map to other tiles.]
Tiled Sets and Locality [Figure: after expand, the new nodes are routed by a MapReduce step; the mapping function assigns each node to a destination tile and the results are combined with the existing tiles by set union.]
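A small sketch of that scatter step, reusing the illustrative tiled layout assumed above (one std::unordered_set per tile); the Node type and names are placeholders, not a real API:

  #include <vector>
  #include <unordered_set>
  #include <functional>
  #include <cstddef>

  using Node  = long;                                   // placeholder node type
  using Tiles = std::vector<std::unordered_set<Node>>;  // one set per tile

  // After expansion, each new node is sent through the mapping function to a
  // destination tile (the map step) and merged into that tile by set union
  // (the reduce step).
  void scatter_new_nodes(const std::vector<Node>& expanded,
                         const std::function<std::size_t(Node)>& map_fn,
                         Tiles& tiles) {
      for (Node n : expanded) {
          std::size_t t = map_fn(n) % tiles.size();
          tiles[t].insert(n);
      }
  }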
Non Numerical Computations • Many non-numerical computations are amenable to data parallelism when it is properly extended • Search, etc. • Tiling can benefit sets just as it does arrays when properly extended • The mapping function is explicit • The “quality” of the mapping is important
Non Deterministic Computations • Many non-deterministic problems could be amenable to data parallelism with the proper extensions • Need new primitives that can either: • Enforce determinacy • Encapsulate the non-determinacy • Two examples: • Vector operations with indirect indices • Delaunay Mesh Refinement
Vector Operations with Indirect Indices • Consider A( X(i) ) += V(i) : • Fully parallel if X does not contain duplicate values • Potential races if duplicates exist • One possible way to parallelize is to annotate that all updates to A must be atomic
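One way to express that annotation, sketched here with an OpenMP atomic update (an assumption for illustration; the annotation could map to other mechanisms as well):

  #include <vector>

  // Each update to A is atomic, so duplicate indices in X no longer race.
  // Only the colliding updates are serialized, not the whole loop.
  void scatter_add_atomic(std::vector<double>& A,
                          const std::vector<int>& X,
                          const std::vector<double>& V) {
      #pragma omp parallel for
      for (long i = 0; i < static_cast<long>(X.size()); ++i) {
          #pragma omp atomic
          A[X[i]] += V[i];
      }
  }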
Vector Operations with Indirect Indices • A( X(i) ) += V(i) : • Let A represent the balances of accounts • Let the values of X represent the indices of specific accounts in A • Let V be a series of transactions sorted chronologically • If the bank imposes penalties for negative balances, the transactions associated with an individual account cannot be reordered • Can be successfully parallelized if the programmer can specify that updates are not commutative • Allow the parallel update of different accounts, but serialize updates to the same account • Inspector/Executor
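A rough sketch of the inspector/executor idea for this example: the inspector groups transactions by account, and the executor processes different accounts in parallel while replaying each account's transactions in their original order. The overdraft penalty amount is hypothetical, a stand-in for the bank's actual rule:

  #include <vector>
  #include <map>
  #include <utility>
  #include <cstddef>

  void apply_transactions(std::vector<double>& balances,
                          const std::vector<int>& account,    // X
                          const std::vector<double>& amount)  // V, chronological
  {
      // Inspector: for each account, collect its transaction indices in order.
      std::map<int, std::vector<std::size_t>> by_account;
      for (std::size_t i = 0; i < account.size(); ++i)
          by_account[account[i]].push_back(i);
      std::vector<std::pair<int, std::vector<std::size_t>>> groups(
          by_account.begin(), by_account.end());

      // Executor: different accounts are independent and run in parallel;
      // within one account the updates stay in chronological order.
      #pragma omp parallel for
      for (long g = 0; g < static_cast<long>(groups.size()); ++g) {
          int acct = groups[g].first;
          for (std::size_t i : groups[g].second) {
              balances[acct] += amount[i];
              if (balances[acct] < 0.0)
                  balances[acct] -= 25.0;   // hypothetical overdraft penalty
          }
      }
  }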
Delaunay Mesh Refinement • Given a mesh of triangles, we want to refine the mesh so that all triangles meet certain properties • The circumcircle of any triangle does not contain points of any other triangle • The minimum angle of any triangle is at least a certain size • Can be written as a sequence of data parallel operators (see the sketch below) • Given a set of triangles, find those that are “bad” • For each bad triangle, compute the affected neighboring triangles, or cavity • For each cavity, remove the bad triangle and its neighbors and replace them with new triangles • This might create new bad triangles • Repeat until the mesh contains no bad triangles
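A structural sketch of that refinement loop over sets of triangles; the geometric kernels (is_bad, cavity, retriangulate) are passed in as callables since they depend on the mesh representation, and the Triangle handle type is a placeholder:

  #include <set>
  #include <vector>
  #include <functional>

  using Triangle = int;               // placeholder handle type
  using Mesh     = std::set<Triangle>;

  void refine(Mesh& mesh,
              std::function<bool(Triangle)> is_bad,
              std::function<std::vector<Triangle>(Triangle, const Mesh&)> cavity,
              std::function<std::vector<Triangle>(const std::vector<Triangle>&)> retriangulate)
  {
      for (;;) {
          // 1. Find the bad triangles (a filtering map over the mesh).
          std::vector<Triangle> bad;
          for (Triangle t : mesh)
              if (is_bad(t)) bad.push_back(t);
          if (bad.empty()) break;       // stop when no bad triangles remain

          // 2-3. For each bad triangle, compute its cavity, remove it, and
          // insert the replacement triangles; non-overlapping cavities could
          // be processed in parallel.
          for (Triangle t : bad) {
              if (!mesh.count(t)) continue;   // already removed by a neighbor's cavity
              std::vector<Triangle> cav = cavity(t, mesh);
              for (Triangle c : cav) mesh.erase(c);
              for (Triangle n : retriangulate(cav)) mesh.insert(n);
          }
          // New triangles may themselves be bad, so the outer loop repeats.
      }
  }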