
Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs



Presentation Transcript


  1. Lock Coarsening: Eliminating Lock Overhead in Automatically Parallelized Object-Based Programs Pedro C. Diniz Martin C. Rinard University of California, Santa Barbara Santa Barbara, California 93106 {martin,pedro}@cs.ucsb.edu http://www.cs.ucsb.edu/~{martin,pedro}

  2. Goal Eliminate Synchronization Overhead in Parallel Object-Based Programs • Basic Idea • Interprocedural Synchronization Analysis • Automatically Eliminate Synchronization Constructs • Context: Parallelizing Compiler for C++ • Irregular Computations • Dynamic Data Structures • Commutativity Analysis

  3. Structure of Talk • Commutativity Analysis • Model of Computation • Example • Basic Approach • Synchronization Optimization Techniques • Data Lock Coarsening • Computation Lock Coarsening • Experimental Results • Future Work • Self-Tuning Code

  4. Model of Computation [Figure: operations execute on objects; an invoked operation reads the initial object state, produces a new object state, and may invoke further operations.]

  5. Graph Traversal Example [Figure: example graph with per-node values.]

  class graph {
    int val, sum;
    graph *left, *right;
  };
  void graph::traverse(int v) {
    sum += v;
    if (left != NULL) left->traverse(val);
    if (right != NULL) right->traverse(val);
  }

  • Goal • Execute left and right traverse operations in parallel

  6. 1 1 0 0 2 3 2 3 0 0 0 0 1 4 4 1 0 0 2 3 0 0 1 1 4 1 1 0 2 3 2 3 1 1 0 0 4 4 0 0 Parallel Traversal

  7. 1 1 1 1 2 3 2 3 1 1 1 1 1 1 4 4 1 1 0 2 2 3 2 3 1 1 1 1 1 1 4 4 5 5 0 5 2 3 2 3 1 1 5 5 4 4 0 Commuting Operations in Parallel Traversal 3

  8. Commutativity Analysis • Compiler Chooses A Computation to Parallelize • In Example: Entire graph::traverse Computation • Compiler Computes Extent of the Computation • Representation of all Operations in Computation • Current Representation: Set of Methods • In Example: { graph::traverse } • Do All Pairs of Operations in Extent Commute? • No - Generate Serial Code • Yes - Generate Parallel Code • In Example: All Pairs Commute

  9. Code Generation In Example

  Class Declaration:
  class graph {
    lock mutex;
    int val, sum;
    graph *left, *right;
  };

  Driver Version:
  void graph::traverse(int v) {
    parallel_traverse(v);
    wait();
  }

  10. Parallel Version In Example

  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    if (left != NULL)
      spawn(left->parallel_traverse(val));
    if (right != NULL)
      spawn(right->parallel_traverse(val));
  }

  11. Commutativity Testing

  12. Commutativity Testing Conditions • Do Two Operations A and B Commute? • Compiler Considers Two Execution Orders • A;B - A executes before B • B;A - B executes before A • Compiler Must Check Two Conditions • Instance Variables: New values of instance variables are the same in both execution orders • Invoked Operations: A and B together directly invoke the same set of operations in both execution orders

  13. Commutativity Testing Algorithm • Symbolic Execution: • Compiler Executes Operations • Computes with Expressions not Values • Compiler Symbolically Executes Operations In Both Execution Orders • Expressions for New Values of Instance Variables • Expressions for Multiset of Invoked Operations • Compiler Simplifies, Compares Corresponding Expressions • If All Equal - Operations Commute • If Not All Equal - Operations May Not Commute

  14. Commutativity Testing In Example • Two Operations r->traverse(v1) and r->traverse(v2) • In Order r->traverse(v1); r->traverse(v2) • Instance Variables: New sum = (sum+v1)+v2 • Invoked Operations: if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)), if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)) • In Order r->traverse(v2); r->traverse(v1) • Instance Variables: New sum = (sum+v2)+v1 • Invoked Operations: if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val)), if(right!=NULL, right->traverse(val)), if(left!=NULL, left->traverse(val))

  15. Compiler Structure • Computation Selection: Entire Computation of Each Method • Extent Computation: Traverse Call Graph to Extract Extent • Commutativity Testing: All Pairs of Operations In Extent • All Operations Commute - Generate Parallel Code • Operations May Not Commute - Generate Serial Code

  16. Traditional Approach • Data Dependence Analysis • Analyzes Reads and Writes • Independent Pieces of Code Execute in Parallel • Demonstrated Success for Array-Based Programs

  17. Data Dependence Analysis in Example • For Data Dependence Analysis To Succeed in Example • left and right traverse Must Be Independent • left and right Subgraphs Must Be Disjoint • Graph Must Be a Tree • Depends on Global Topology of Data Structure • Analyze Code that Builds Data Structure • Extract and Propagate Topology Information • Fails For Graphs

  18. Properties of Commutativity Analysis • Oblivious to Data Structure Topology • Wide Range of Computations • Irregular Computations with Dynamic Data Structures • Lists, Trees and Graphs • Updates to Central Data Structure • General Reductions • Key Issue in Code Generation • Operations Must Execute Atomically • Compiler Automatically Inserts Locking Constructs

  19. Synchronization Optimizations

  20. Default Code Generation Strategy • Each Object Has its Own Mutual Exclusion Lock • Each Operation Acquires and Releases Lock

  class graph {
    lock mutex;
    int val, sum;
    graph *left, *right;
  };
  void graph::parallel_traverse(int v) {
    mutex.acquire();
    sum += v;
    mutex.release();
    if (left != NULL)
      spawn(left->parallel_traverse(val));
    if (right != NULL)
      spawn(right->parallel_traverse(val));
  }

  21. Data Lock Coarsening Transformation • Give Multiple Objects the Same Lock • Current Policy: Nested Objects Use the Lock in Enclosing Object • Find Sequences of Operations • Access Different Objects • Acquire and Release Same Lock • Transformed Code • Acquires Lock Once At Beginning of Sequence • Releases Lock Once At End of Sequence • Original Code • Each Operation Acquires and Releases Lock

  22. Data Lock Coarsening Example

  Original Code:
  class vector {
    lock mutex;
    double val[NDIM];
  };
  void vector::add(double *v) {
    mutex.acquire();
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
    mutex.release();
  }
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    mutex.release();
    acc.add(v);
  }

  Transformed Code:
  class vector {
    double val[NDIM];
  };
  void vector::add(double *v) {
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
  }
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }

  23. Data Lock Coarsening Tradeoff • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Cause False Exclusion • Multiple Parallel Operations Access Different Objects • But Operations Attempt to Acquire Same Lock • Result: Operations Execute Serially

  24. False Exclusion • Original: Processor 0 executes L.acquire(); A->op(); L.release() while Processor 1 independently executes M.acquire(); B->op(); M.release() • After Data Lock Coarsening: Processor 0 executes L.acquire(); A->op(); L.release() while Processor 1 must wait on L.acquire() (false exclusion) before it can execute B->op(); L.release()

  25. Computation Lock Coarsening Transformation • Finds Sequences of Operations • Acquire and Release Same Lock • Transformed Code • Acquires Lock Once at Beginning of Sequence • Releases Lock Once at End of Sequence • Original Code • Acquires and Releases Lock Once for Each Operation • Result • Replaces Multiple Mutual Exclusion Regions With • One Large Mutual Exclusion Region

  26. Computation Lock Coarsening Example

  Original Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
    mutex.release();
  }
  void body::loopsub(body *b) {
    int i;
    for (i = 0; i < N; i++) {
      this->gravsub(b+i);
    }
  }

  Optimized Code:
  class body {
    lock mutex;
    double phi;
    vector acc;
  };
  void body::gravsub(body *b) {
    double p, v[NDIM];
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
  }
  void body::loopsub(body *b) {
    int i;
    mutex.acquire();
    for (i = 0; i < N; i++) {
      this->gravsub(b+i);
    }
    mutex.release();
  }

  27. Computation Lock Coarsening Tradeoff • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Introduce False Contention • Multiple Processors Attempt to Acquire Same Lock • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region

  28. False Contention • Original: Processor 0 executes two separate regions L.acquire(); A->op(); L.release(), with local computation between them, while Processor 1 executes L.acquire(); A->op(); L.release() in between • After Computation Lock Coarsening: Processor 0 holds L across both A->op() calls and the intervening local computation, so Processor 1 must wait on L.acquire() (false contention) before it can execute A->op(); L.release()

  29. Managing Tradeoff: Lock Coarsening Policies • To Manage Tradeoff, Compiler Must Successfully • Reduce Lock Overhead by Increasing Lock Granularity • Avoid Excessive False Exclusion and False Contention • Original Policy • Use Original Lock Algorithm • Bounded Policy • Apply Transformation Unless Transformed Code • Holds Lock During a Recursive Call, or • Holds Lock During a Loop that Invokes Operations • Aggressive Policy • Always Apply Transformation

  30. Experimental Results

  31. Methodology • Built Prototype Compiler • Integrated Lock Coarsening Transformations into Prototype • Acquired Two Complete Applications • Barnes-Hut N-Body Solver • Water Code • Automatically Parallelized Applications • Generated A Version of Each Application for Each Policy • Original • Bounded • Aggressive • Ran Applications on Stanford DASH Machine

  32. Applications • Barnes-Hut • O(NlgN) N-Body Solver • Space Subdivision Tree • 1500 Lines of C++ Code • Water • Simulates Liquid Water • O(N^2) Algorithm • 1850 Lines of C++ Code

  33. Lock Overhead • Percentage of Time that the Single Processor Execution Spends Acquiring and Releasing Mutual Exclusion Locks [Bar chart: percentage lock overhead of the Original, Bounded and Aggressive versions for Barnes-Hut (16K Particles) and Water (512 Molecules).]

  34. Contention Overhead for Barnes-Hut • Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors [Line charts: contention percentage vs. number of processors (0 to 16) for the Original, Bounded and Aggressive versions.]

  35. Contention Overhead for Water • Percentage of Time that Processors Spend Waiting to Acquire Locks Held by Other Processors [Line charts: contention percentage vs. number of processors (0 to 16) for the Original, Bounded and Aggressive versions.]

  36. Speedup [Line charts: speedup vs. number of processors (0 to 16) for Barnes-Hut (16K Particles) and Water (512 Molecules), comparing the Ideal, Aggressive, Bounded and Original versions.]

  37. Recent Work: Choosing Best Policy • Best Policy May Depend On • Topology of Data Structures • Dynamic Schedule Of Computation • Information Required to Choose Best Policy Unavailable at Compile Time • Complications • Different Phases May Have Different Best Policy • In Same Phase, Best Policy May Change Over Time

  38. Solution: Generate Self-Tuning Code • Sampling Phase: Measures Performance of Different Policies • Production Phase: Uses Best Policy From Sampling Phase • Periodically Resample to Discover Changes in Best Policy • Guaranteed Performance Bounds [Figure: overhead over time for the Original, Bounded and Aggressive policies, alternating sampling phases with production phases.]

  39. Conclusion • Synchronization Optimizations • Data Lock Coarsening • Computation Lock Coarsening • Integrated into Prototype Parallelizing Compiler • Object-Based Programs with Dynamic Data Structures • Commutativity Analysis • Experimental Results • Optimizations Have a Significant Performance Impact • With Optimizations, Applications Perform Well
