150 likes | 308 Views
Structure-driven Optimizations for Amorphous Data-parallel Programs. Mario Méndez-Lojo 1 Donald Nguyen 1 Dimitrios Prountzos 1 Xin Sui 1 M. Amber Hassaan 1 Milind Kulkarni 2 Martin Burtscher 1 Keshav Pingali 1 1 The University of Texas at Austin (USA) 2 Purdue University (USA).
E N D
Structure-driven Optimizations for Amorphous Data-parallel Programs Mario Méndez-Lojo1Donald Nguyen1 Dimitrios Prountzos1Xin Sui1 M. Amber Hassaan1 Milind Kulkarni2 Martin Burtscher1Keshav Pingali1 1The University of Texas at Austin (USA) 2Purdue University (USA)
Irregular algorithms • Operate on pointer-based data structures like graphs • mesh refinements, min. spanning tree, max-flow… • Plenty of available parallelism [Kulkarni et al., PPoPP’09] • Baseline Galois system [Kulkarni et al., PLDI’07] • uses speculation to exploit this parallelism • may have high overheads for some algorithms • Solution explored in paper • exploit algorithmic structure to reduce overheads of baseline system • We will show: • common algorithmic structures • optimizations that exploit those structures • performance results
Operator formulation of algorithms i3 • Algorithm = repeated application of operator to graph • active node: • node where computation is needed • activity: • application of operator to active node • can add/remove nodes from graph • neighborhood: • set of nodes and edges read/written to perform activity • can be distinct from neighbors in graph • Focus: algorithms in which order of activities does not matter • Amorphous data-parallelism • parallel execution of activities, subject to neighborhood constraints i1 i2 i4 i5 : active node : neighborhood
Delaunay mesh refinement • Iterative refinement to remove badly shaped triangles: add initial bad triangles to workset while workset is not empty: pick a bad triangle find its cavity retriangulate cavity add new bad triangles to workset • Multiple valid solutions • Parallelism: • bad triangles with cavities that do not overlap can be processed in parallel • parallelism is dependent on runtime values • compilers cannot find this parallelism
Baseline execution model • Parallel execution model • shared-memory • optimistic execution of Galois iterators • Implementation • threads get active nodes from workset • apply operator to them • Neighborhood independence • each node/edge has an associated token • graph operations → acquire tokens on read/written nodes • token owned by another thread → conflict → activity rolled back • software TLS/TM variety main() … for node: workset … … … … i3 i1 i2 i4 i5 program concurrent graph
Sources of overhead • Dynamic assignment of work • the centralized worksetrequires synchronization • Enforcing neighborhood constraints • acquiring/releasing tokens on neighborhood • Copying data for rollbacks • when an activity modifies a graph element • Aborted activities • work is wasted + roll back the activity ≡ ≡ ≡ R W R activity time
Proposed optimizations • Baseline execution model is very general • many irregular algorithms do not need its full generality • “optimize the common case” • Identify general-purpose optimizations and evaluate their performance impact • Optimizations • cautious • one-shot • iteration coalescing
Cautious Algorithmic structure: operator reads all elements of its neighborhood before modifying any of them • conflicts detected before modifications occur Optimizations: • Enforcing neighborhood constraints • token acquisition unnecessary after first modification • Copying data for rollbacks examples: Delaunay refinement, Boruvka minimum spanning tree, etc. ≡ ≡ R W R activity time
One-shot Algorithmic structure: neighborhood can be predicted before activity begins Optimizations: • Enforcing neighborhood constraints • token acquisition only necessary when activity starts • Copying data for rollbacks • Aborted activities • waste little computation examples: preflow-push, survey propagation, stencil codes like Jacobi, etc. ≡ ≡ R W R activity time
Iteration coalescing • Iteration coalescing = data-centric loop chunking • place new active nodes in thread-local worksets • release tokens only on abort/commit • Algorithmic structure: • activities generate new active nodes • same token acquired many times across related activities Benefits: • Dynamic assignment of work • less contention • Enforcing neighborhood constraints • locality: thread probably owns the token ≡ ≡ R W R R W R
Iteration coalescing • Iteration coalescing = data-centric loop chunking • place new active nodes in thread-local worksets • release tokens only on abort/commit Benefits: • Dynamic assignment of work • less contention • Enforcing neighborhood constraints • locality: thread probably owns the token Drawback: • number of tokens held by thread increases • higher conflict ratio
Evaluation • Experiments on Niagara (8 cores, 2 threads/core) • Average % improvementover baseline • Delaunay refinement • cautious → 15% • cautious + coalescing → 18% • one-shot not applicable • max speedup: 8.4x • Boruvka (MST) • cautious → 22% • one-shot → 8% • coalescing has no impact • max speedup: 2.7x
Evaluation • Experiments on Niagara (8 cores, 2 threads/core) • Average % improvement over baseline • Preflow-push (max flow) • cautious → 33% • one-shot→ 44% • one-shot + coalescing → 59% • max speedup: 1.12x • Survey propagation (SAT) • baseline times out • one-shot → 28% over cautious • max speedup: 1.04x
Conclusions • There is structure in irregular algorithms • Our optimizations exploit this structure • cautious • one-shot • iteration coalescing • The evaluation confirms the importance of reducing the overheads of speculation • other optimizations waiting to be discovered?
Thank you!धन्यवाद! slides available at http://www.ices.utexas.edu/~marioml