Structure-driven Optimizations for Amorphous Data-parallel Programs

Structure-driven Optimizations for Amorphous Data-parallel Programs Mario Méndez-Lojo1Donald Nguyen1 Dimitrios Prountzos1Xin Sui1 M. Amber Hassaan1 Milind Kulkarni2 Martin Burtscher1Keshav Pingali1 1The University of Texas at Austin (USA) 2Purdue University (USA)

Irregular algorithms • Operate on pointer-based data structures like graphs • mesh refinements, min. spanning tree, max-flow… • Plenty of available parallelism [Kulkarni et al., PPoPP’09] • Baseline Galois system [Kulkarni et al., PLDI’07] • uses speculation to exploit this parallelism • may have high overheads for some algorithms • Solution explored in paper • exploit algorithmic structure to reduce overheads of baseline system • We will show: • common algorithmic structures • optimizations that exploit those structures • performance results

Operator formulation of algorithms i3 • Algorithm = repeated application of operator to graph • active node: • node where computation is needed • activity: • application of operator to active node • can add/remove nodes from graph • neighborhood: • set of nodes and edges read/written to perform activity • can be distinct from neighbors in graph • Focus: algorithms in which order of activities does not matter • Amorphous data-parallelism • parallel execution of activities, subject to neighborhood constraints i1 i2 i4 i5 : active node : neighborhood

Delaunay mesh refinement • Iterative refinement to remove badly shaped triangles: add initial bad triangles to workset while workset is not empty: pick a bad triangle find its cavity retriangulate cavity add new bad triangles to workset • Multiple valid solutions • Parallelism: • bad triangles with cavities that do not overlap can be processed in parallel • parallelism is dependent on runtime values • compilers cannot find this parallelism

Baseline execution model • Parallel execution model • shared-memory • optimistic execution of Galois iterators • Implementation • threads get active nodes from workset • apply operator to them • Neighborhood independence • each node/edge has an associated token • graph operations → acquire tokens on read/written nodes • token owned by another thread → conflict → activity rolled back • software TLS/TM variety main() … for node: workset … … … … i3 i1 i2 i4 i5 program concurrent graph

Sources of overhead • Dynamic assignment of work • the centralized worksetrequires synchronization • Enforcing neighborhood constraints • acquiring/releasing tokens on neighborhood • Copying data for rollbacks • when an activity modifies a graph element • Aborted activities • work is wasted + roll back the activity ≡ ≡ ≡ R W R activity time

Proposed optimizations • Baseline execution model is very general • many irregular algorithms do not need its full generality • “optimize the common case” • Identify general-purpose optimizations and evaluate their performance impact • Optimizations • cautious • one-shot • iteration coalescing

Cautious Algorithmic structure: operator reads all elements of its neighborhood before modifying any of them • conflicts detected before modifications occur Optimizations: • Enforcing neighborhood constraints • token acquisition unnecessary after first modification • Copying data for rollbacks examples: Delaunay refinement, Boruvka minimum spanning tree, etc. ≡ ≡ R W R activity time

One-shot Algorithmic structure: neighborhood can be predicted before activity begins Optimizations: • Enforcing neighborhood constraints • token acquisition only necessary when activity starts • Copying data for rollbacks • Aborted activities • waste little computation examples: preflow-push, survey propagation, stencil codes like Jacobi, etc. ≡ ≡ R W R activity time

Iteration coalescing • Iteration coalescing = data-centric loop chunking • place new active nodes in thread-local worksets • release tokens only on abort/commit • Algorithmic structure: • activities generate new active nodes • same token acquired many times across related activities Benefits: • Dynamic assignment of work • less contention • Enforcing neighborhood constraints • locality: thread probably owns the token ≡ ≡ R W R R W R

Iteration coalescing • Iteration coalescing = data-centric loop chunking • place new active nodes in thread-local worksets • release tokens only on abort/commit Benefits: • Dynamic assignment of work • less contention • Enforcing neighborhood constraints • locality: thread probably owns the token Drawback: • number of tokens held by thread increases • higher conflict ratio

Evaluation • Experiments on Niagara (8 cores, 2 threads/core) • Average % improvementover baseline • Delaunay refinement • cautious → 15% • cautious + coalescing → 18% • one-shot not applicable • max speedup: 8.4x • Boruvka (MST) • cautious → 22% • one-shot → 8% • coalescing has no impact • max speedup: 2.7x

Evaluation • Experiments on Niagara (8 cores, 2 threads/core) • Average % improvement over baseline • Preflow-push (max flow) • cautious → 33% • one-shot→ 44% • one-shot + coalescing → 59% • max speedup: 1.12x • Survey propagation (SAT) • baseline times out • one-shot → 28% over cautious • max speedup: 1.04x

Conclusions • There is structure in irregular algorithms • Our optimizations exploit this structure • cautious • one-shot • iteration coalescing • The evaluation confirms the importance of reducing the overheads of speculation • other optimizations waiting to be discovered?

Thank you!धन्यवाद! slides available at http://www.ices.utexas.edu/~marioml

Structure-driven Optimizations for Amorphous Data-parallel Programs

Structure-driven Optimizations for Amorphous Data-parallel Programs

Presentation Transcript

Parallel Structure

Parallel Structure

Analyses and Optimizations for Multithreaded Programs

Verifying Parallel Optimizations with PTRANS

Parallel Structure

ParaMeter : A profiling tool for amorphous data-parallel programs

Parallel Structure

Collaboration and Governance Structure for Data Driven Projects

Parallel Programs

Parallel I/O Optimizations

Parallel Structure

Parallel I/O Optimizations

Parallel Structure

Parallel Structure

Parallel Structure

Communication Optimizations for Parallel Computing Using Data Access Information

Improving Programs through Data-Driven Decisions:

Parallel Structure

Parallel Structure

Parallel Structure