
Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs



Presentation Transcript


  1. Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs Pedro C. Diniz Martin C. Rinard University of California, Santa Barbara Santa Barbara, California 93106 {martin,pedro}@cs.ucsb.edu http://www.cs.ucsb.edu/~{martin,pedro}

  2. Goal Eliminate Synchronization Overhead in Parallel Object-Based Programs • Basic Idea • Interprocedural Synchronization Analysis • Automatically Eliminate Synchronization Constructs • Context: Parallelizing Compiler for C++ • Irregular Computations • Dynamic Data Structures • Commutativity Analysis

  3. Structure of Talk • Commutativity Analysis • Model of Computation • Example • Basic Approach • Synchronization Optimization Techniques • Data Lock Coarsening • Computation Lock Coarsening • Experimental Results • Future Work • Self-Tuning Code

  4. Model of Computation [Figure: operations execute on objects; an invoked operation reads the initial object state, produces a new object state, and may invoke further operations.]

  5. Graph Traversal Example [Figure: example graph with per-node values.]

  class graph {
    int val, sum;
    graph *left, *right;
  };
  void graph::traverse(int v) {
    sum += v;
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }

  • Goal • Execute left and right traverse operations in parallel

  6. 1 1 0 0 2 3 2 3 0 0 0 0 1 4 4 1 0 0 2 3 0 0 1 1 4 1 1 0 2 3 2 3 1 1 0 0 4 4 0 0 Parallel Traversal

  7. 1 1 1 1 2 3 2 3 1 1 1 1 1 1 4 4 1 1 0 2 2 3 2 3 1 1 1 1 1 1 4 4 5 5 0 5 2 3 2 3 1 1 5 5 4 4 0 Commuting Operations in Parallel Traversal 3

  8. Commutativity Analysis • Compiler Chooses A Computation to Parallelize • In Example: Entire graph::traverse Computation • Compiler Computes Extent of the Computation • Representation of all Operations in Computation • Current Representation: Set of Methods • In Example: { graph::traverse } • Do All Pairs of Operations in Extent Commute? • No - Generate Serial Code • Yes - Generate Parallel Code • In Example: All Pairs Commute

  9. Code Generation In Example

  Class Declaration:
  class graph {
    lock mutex;
    int val, sum;
    graph *left, *right;
  };

  Driver Version:
  void graph::traverse(int v) {
    parallel_traverse(v);
    wait();
  }

  10. Parallel Version In Example

  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    if (left != NULL)
      spawn(left->parallel_traverse(val));
    if (right != NULL)
      spawn(right->parallel_traverse(val));
  }

  11. Commutativity Testing

  12. Commutativity Testing Conditions • Do Two Operations A and B Commute? • Compiler Considers Two Execution Orders • A;B - A executes before B • B;A - B executes before A • Compiler Must Check Two Conditions • Instance Variables: New values of instance variables are the same in both execution orders • Invoked Operations: A and B together directly invoke the same set of operations in both execution orders

  13. Commutativity Testing Algorithm • Symbolic Execution: • Compiler Executes Operations • Computes with Expressions not Values • Compiler Symbolically Executes Operations In Both Execution Orders • Expressions for New Values of Instance Variables • Expressions for Multiset of Invoked Operations • Compiler Simplifies, Compares Corresponding Expressions • If All Equal - Operations Commute • If Not All Equal - Operations May Not Commute

  14. Commutativity Testing In Example • Two Operations r->traverse(v1) and r->traverse(v2) • In Order r->traverse(v1); r->traverse(v2) • Instance Variables: New sum = (sum+v1)+v2 • Invoked Operations: if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)), if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)) • In Order r->traverse(v2); r->traverse(v1) • Instance Variables: New sum = (sum+v2)+v1 • Invoked Operations: if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)), if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val))

  15. Compiler Structure • Computation Selection: Entire Computation of Each Method • Extent Computation: Traverse Call Graph to Extract Extent • Commutativity Testing: All Pairs of Operations In Extent • All Operations Commute - Generate Parallel Code • Operations May Not Commute - Generate Serial Code

  16. Traditional Approach • Data Dependence Analysis • Analyzes Reads and Writes • Independent Pieces of Code Execute in Parallel • Demonstrated Success for Array-Based Programs

  17. Data Dependence Analysis in Example • For Data Dependence Analysis To Succeed in Example • left and right traverse Must Be Independent • left and right Subgraphs Must Be Disjoint • Graph Must Be a Tree • Depends on Global Topology of Data Structure • Analyze Code that Builds Data Structure • Extract and Propagate Topology Information • Fails For Graphs

  18. Properties of Commutativity Analysis • Oblivious to Data Structure Topology • Wide Range of Computations • Irregular Computations with Dynamic Data Structures • Lists, Trees and Graphs • Updates to Central Data Structure • General Reductions • Key Issue in Code Generation • Operations Must Execute Atomically • Compiler Automatically Inserts Locking Constructs

  19. Synchronization Optimizations

  20. Default Code Generation Strategy • Each Object Has its Own Mutual Exclusion Lock • Each Operation Acquires and Releases Lock

  class graph {
    lock mutex;
    int val, sum;
    graph *left, *right;
  };
  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    if (left != NULL)
      spawn(left->parallel_traverse(val));
    if (right != NULL)
      spawn(right->parallel_traverse(val));
  }

  21. Data Lock Coarsening Transformation • Give Multiple Objects the Same Lock • Current Policy: Nested Objects Use the Lock in Enclosing Object • Find Sequences of Operations • Access Different Objects • Acquire and Release Same Lock • Transformed Code • Acquires Lock Once At Beginning of Sequence • Releases Lock Once At End of Sequence • Original Code • Each Operation Acquires and Releases Lock

  22. Data Lock Coarsening Example

  Original Code:
  class vector {
    lock mutex;
    double val[NDIM];
  };
  void vector::add(double *v) {
    mutex.acquire();
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
    mutex.release();
  }
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    mutex.release();
    acc.add(v);
  }

  Transformed Code:
  class vector {
    double val[NDIM];
  };
  void vector::add(double *v) {
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
  }
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }

  23. Data Lock Coarsening Tradeoff • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Cause False Exclusion • Multiple Parallel Operations Access Different Objects • But Operations Attempt to Acquire Same Lock • Result: Operations Execute Serially

  24. False Exclusion • Original: Processor 0 executes L.acquire(); A->op(); L.release() while Processor 1 independently executes M.acquire(); B->op(); M.release() • After Data Lock Coarsening: Processor 0 executes L.acquire(); A->op(); L.release() while Processor 1 must wait on L.acquire() (false exclusion) before it can execute B->op(); L.release()

  25. Computation Lock Coarsening Transformation • Finds Sequences of Operations • Acquire and Release Same Lock • Transformed Code • Acquires Lock Once at Beginning of Sequence • Releases Lock Once at End of Sequence • Original Code • Acquires and Releases Lock Once for Each Operation • Result • Replaces Multiple Mutual Exclusion Regions With • One Large Mutual Exclusion Region

  26. Computation Lock Coarsening Example

  Original Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }
  void body::loopsub(body *b) {
    int i;
    for (i = 0; i < N; i++) {
      this->gravsub(b+i);
    }
  }

  Optimized Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
  }
  void body::loopsub(body *b) {
    int i;
    mutex.acquire();
    for (i = 0; i < N; i++) {
      this->gravsub(b+i);
    }
    mutex.release();
  }

  27. Computation Lock Coarsening Tradeoff • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Introduce False Contention • Multiple Processors Attempt to Acquire Same Lock • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region

  28. False Contention • Original: Processor 0 executes two separate regions L.acquire(); A->op(); L.release(), with local computation between them, while Processor 1 executes L.acquire(); A->op(); L.release() in between • After Computation Lock Coarsening: Processor 0 holds L across both A->op() calls and the intervening local computation, so Processor 1 must wait on L.acquire() (false contention) before it can execute A->op(); L.release()

  29. Managing Tradeoff: Lock Coarsening Policies • To Manage Tradeoff, Compiler Must Successfully • Reduce Lock Overhead by Increasing Lock Granularity • Avoid Excessive False Exclusion and False Contention • Original Policy • Use Original Lock Algorithm • Bounded Policy • Apply Transformation Unless Transformed Code • Holds Lock During a Recursive Call, or • Holds Lock During a Loop that Invokes Operations • Aggressive Policy • Always Apply Transformation

  30. Experimental Results

  31. Methodology • Built Prototype Compiler • Integrated Lock Coarsening Transformations into Prototype • Acquired Two Complete Applications • Barnes-Hut N-Body Solver • Water Code • Automatically Parallelized Applications • Generated A Version of Each Application for Each Policy • Original • Bounded • Aggressive • Ran Applications on Stanford DASH Machine

  32. Applications • Barnes-Hut • O(NlgN) N-Body Solver • Space Subdivision Tree • 1500 Lines of C++ Code • Water • Simulates Liquid Water • O(N^2) Algorithm • 1850 Lines of C++ Code

  33. Lock Overhead • Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks [Bar chart: percentage lock overhead of the Original, Bounded and Aggressive versions for Barnes-Hut (16K Particles) and Water (512 Molecules).]

  34. Contention Overhead for Barnes-Hut • Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors [Line charts: contention percentage vs. number of processors (0 to 16) for the Original, Bounded and Aggressive versions.]

  35. Contention Overhead for Water • Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors [Line charts: contention percentage vs. number of processors (0 to 16) for the Original, Bounded and Aggressive versions.]

  36. Speedup [Line charts: speedup vs. number of processors (0 to 16) for Barnes-Hut (16K Particles) and Water (512 Molecules), comparing the Ideal, Aggressive, Bounded and Original versions.]

  37. Recent Work: Choosing Best Policy • Best Policy May Depend On • Topology of Data Structures • Dynamic Schedule Of Computation • Information Required to Choose Best Policy Unavailable at Compile Time • Complications • Different Phases May Have Different Best Policy • In Same Phase, Best Policy May Change Over Time

  38. Solution: Generate Self-Tuning Code • Sampling Phase: Measures Performance of Different Policies • Production Phase: Uses Best Policy From Sampling Phase • Periodically Resample to Discover Changes in Best Policy • Guaranteed Performance Bounds [Figure: overhead over time for the Original, Bounded and Aggressive policies, alternating sampling phases with production phases.]

  39. Conclusion • Synchronization Optimizations • Data Lock Coarsening • Computation Lock Coarsening • Integrated into Prototype Parallelizing Compiler • Object-Based Programs with Dynamic Data Structures • Commutativity Analysis • Experimental Results • Optimizations Have a Significant Performance Impact • With Optimizations, Applications Perform Well
