
Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers

Presentation Transcript


  1. Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers Martin C. Rinard Pedro C. Diniz University of California, Santa Barbara Santa Barbara, California 93106 {martin,pedro}@cs.ucsb.edu http://www.cs.ucsb.edu/~{martin,pedro}

  2. Goal Develop a Parallelizing Compiler for Object-Oriented Computations • Current Focus • Irregular Computations • Dynamic Data Structures • Future • Persistent Data • Distributed Computations • New Analysis Technique: Commutativity Analysis

  3. Structure of Talk • Model of Computation • Example • Commutativity Testing • Steps To Practicality • Experimental Results • Conclusion

  4. Model of Computation [Figure: objects and operations; an executing operation transforms an initial object state into a new object state and generates invoked operations]

  5. Graph Traversal Example

    class graph {
      int val, sum;
      graph *left, *right;
    };
    void graph::traverse(int v) {
      sum += v;
      if (left != NULL) left->traverse(val);
      if (right != NULL) right->traverse(val);
    }

• Goal: Execute left and right traverse operations in parallel

  6. Parallel Traversal [Figure: successive snapshots of the graph as left and right traverse operations execute in parallel]

  7. Commuting Operations in Parallel Traversal [Figure: traverse operations update the same nodes in different orders; the final node values are the same either way]

  8. Model of Computation • Operations: Method Invocations • In Example: Invocations of graph::traverse • left->traverse(3) • right->traverse(2) • Objects: Instances of Classes • In Example: Graph Nodes • Instance Variables Implement Object State • In Example: val, sum, left, right


  10. Separable Operations • Each Operation Consists of Two Sections • Object Section: Only Accesses Receiver Object • Invocation Section: Only Invokes Operations • Both Sections Can Access Parameters
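
In the running example, graph::traverse is separable; here is the same code from slide 5 with the two sections marked (comments added for illustration):

    void graph::traverse(int v) {
      // Object section: accesses only the receiver object (and the parameter v)
      sum += v;
      // Invocation section: only invokes further operations
      if (left != NULL) left->traverse(val);
      if (right != NULL) right->traverse(val);
    }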

  11. Basic Approach • Compiler Chooses A Computation to Parallelize • In Example: Entire graph::traverse Computation • Compiler Computes Extent of the Computation • Representation of all Operations in Computation • Current Representation: Set of Methods • In Example: { graph::traverse } • Do All Pairs of Operations in Extent Commute? • No - Generate Serial Code • Yes - Generate Parallel Code • In Example: All Pairs Commute

  12. Code Generation For Each Method in Parallel Computation • Augments Class Declaration With Mutual Exclusion Lock • Generates Driver Version of Method • Invoked from Serial Code to Start Parallel Execution • Invokes Parallel Version of Operation • Waits for Entire Parallel Computation to Finish • Generates Parallel Version of Method • Object Section: Lock Acquired at Beginning, Lock Released at End, to Ensure Atomic Execution • Invocation Section: Invokes Parallel Versions; Invoked Operations Execute in Parallel

  13. Code Generation In Example • Class Declaration:

    class graph {
      lock mutex;
      int val, sum;
      graph *left, *right;
    };

• Driver Version:

    void graph::traverse(int v) {
      parallel_traverse(v);
      wait();
    }

  14. Parallel Version In Example

    void graph::parallel_traverse(int v) {
      mutex.acquire();
      sum += v;
      mutex.release();
      if (left != NULL) spawn(left->parallel_traverse(val));
      if (right != NULL) spawn(right->parallel_traverse(val));
    }
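
The spawn and wait primitives are provided by the run-time library (slide 47); the talk does not show their implementation. A minimal sketch of one way they might work, using C++11 threads and an atomic count of outstanding tasks (task_counter, spawn_task, and wait_all are hypothetical names; the real library uses task creation primitives with dynamic load balancing rather than one thread per operation):

    #include <atomic>
    #include <functional>
    #include <thread>

    std::atomic<int> task_counter{0};    // hypothetical: number of outstanding tasks

    void spawn_task(std::function<void()> work) {
      task_counter.fetch_add(1);         // count the task before it starts
      std::thread([work]() {
        work();                          // run the spawned operation
        task_counter.fetch_sub(1);       // mark it complete
      }).detach();
    }

    void wait_all() {
      while (task_counter.load() != 0)   // block until every spawned task finishes
        std::this_thread::yield();
    }

The driver would then map wait() to wait_all(), and spawn(left->parallel_traverse(val)) to spawn_task([=]{ left->parallel_traverse(val); }).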

  15. Compiler Structure: Computation Selection (Entire Computation of Each Method) → Extent Computation (Traverse Call Graph to Extract Extent) → Commutativity Testing (All Pairs of Operations In Extent) → If All Operations Commute: Generate Parallel Code; If Operations May Not Commute: Generate Serial Code
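
A minimal sketch of the extent computation described here: a worklist traversal of the call graph that collects every method reachable from the selected computation (the CallGraph type and method-name representation are assumptions for illustration):

    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    // Assumed representation: each method name maps to the methods it may invoke.
    using CallGraph = std::map<std::string, std::vector<std::string>>;

    // Extent = set of all methods reachable from the root of the computation.
    std::set<std::string> computeExtent(const CallGraph &cg, const std::string &root) {
      std::set<std::string> extent;
      std::vector<std::string> worklist{root};
      while (!worklist.empty()) {
        std::string m = worklist.back();
        worklist.pop_back();
        if (!extent.insert(m).second) continue;    // already visited
        auto it = cg.find(m);
        if (it != cg.end())
          for (const auto &callee : it->second)
            worklist.push_back(callee);
      }
      return extent;
    }

For the example, graph::traverse only invokes itself, so computeExtent returns { graph::traverse }, matching slide 11.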

  16. Traditional Approach • Data Dependence Analysis • Analyzes Reads and Writes • Independent Pieces of Code Execute in Parallel • Demonstrated Success for Array-Based Programs

  17. Data Dependence Analysis in Example • For Data Dependence Analysis To Succeed in Example • left and right traverse Must Be Independent • left and right Subgraphs Must Be Disjoint • Graph Must Be a Tree • Depends on Global Topology of Data Structure • Analyze Code that Builds Data Structure • Extract and Propagate Topology Information • Fails For Graphs

  18. Properties of Commutativity Analysis • Oblivious to Data Structure Topology • Local Analysis • Simple Analysis • Wide Range of Computations • Lists, Trees and Graphs • Updates to Central Data Structure • General Reductions • Introduces Synchronization • Relies on Commuting Operations

  19. Commutativity Testing

  20. Commutativity Testing Conditions • Do Two Operations A and B Commute? • Compiler Considers Two Execution Orders • A;B - A executes before B • B;A - B executes before A • Compiler Must Check Two Conditions • Instance Variables: New values of instance variables are same in both execution orders • Invoked Operations: A and B together directly invoke same set of operations in both execution orders

  21. Commutativity Testing Conditions [Figure: two operations applied to a node in either order produce the same final node state]

  22. Commutativity Testing Algorithm • Symbolic Execution: • Compiler Executes Operations • Computes with Expressions not Values • Compiler Symbolically Executes Operations In Both Execution Orders • Expressions for New Values of Instance Variables • Expressions for Multiset of Invoked Operations

  23. Expression Simplification and Comparison • Compiler Applies Rewrite Rules to Simplify Expressions • a*(b+c) → (a*b)+(a*c) • b+(a+c) → (a+b+c) • a+if(b<c,d,e) → if(b<c,a+d,a+e) • Compiler Compares Corresponding Expressions • If All Equal - Operations Commute • If Not All Equal - Operations May Not Commute
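
A minimal runnable sketch of the comparison step, under the simplifying assumption that a sum expression canonicalizes to the multiset of its addends (so (sum+v1)+v2 and (sum+v2)+v1 compare equal); the compiler's rewrite-rule simplifier is more general:

    #include <iostream>
    #include <set>
    #include <string>

    // Canonical form for a sum: the multiset of its addend terms.
    // Addition is associative and commutative, so term order is irrelevant.
    using SumTerms = std::multiset<std::string>;

    bool sameNewValue(const SumTerms &orderAB, const SumTerms &orderBA) {
      return orderAB == orderBA;   // equal multisets => same simplified expression
    }

    int main() {
      // New value of sum in order r->traverse(v1); r->traverse(v2): (sum+v1)+v2
      SumTerms ab = {"sum", "v1", "v2"};
      // New value of sum in order r->traverse(v2); r->traverse(v1): (sum+v2)+v1
      SumTerms ba = {"sum", "v2", "v1"};
      std::cout << (sameNewValue(ab, ba) ? "commute" : "may not commute") << "\n";
      return 0;
    }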

  24. Commutativity Testing Example • Two Operations r->traverse(v1) and r->traverse(v2) • In Order r->traverse(v1);r->traverse(v2) • Instance Variables: New sum = (sum+v1)+v2 • Invoked Operations: if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val)), if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val)) • In Order r->traverse(v2);r->traverse(v1) • Instance Variables: New sum = (sum+v2)+v1 • Invoked Operations: if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val)), if(right!=NULL,right->traverse(val)), if(left!=NULL,left->traverse(val))


  26. Important Special Case • Independent Operations Commute • Conditions for Independence • Operations Have Different Receivers • Neither Operation Writes an Instance Variable that Other Operation Accesses • Detecting Independent Operations • In Type-Safe Languages • Class Declarations • Instance Variable Accesses • Pointer or Alias Analysis
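
A sketch of an independence check along these lines, assuming the compiler has summarized each operation's receiver and its instance variable reads and writes (OpSummary is illustrative; the two slide conditions are read as alternative sufficient conditions, consistent with a model in which an operation's object section accesses only its receiver):

    #include <set>
    #include <string>

    // Illustrative summary of one operation's accesses to its receiver object.
    struct OpSummary {
      const void *receiver;           // receiver object identity
      std::set<std::string> reads;    // instance variables read
      std::set<std::string> writes;   // instance variables written
    };

    static bool overlaps(const std::set<std::string> &a, const std::set<std::string> &b) {
      for (const auto &x : a) if (b.count(x)) return true;
      return false;
    }

    // Independent if the receivers differ, or if neither operation writes an
    // instance variable that the other operation accesses.
    bool independent(const OpSummary &a, const OpSummary &b) {
      if (a.receiver != b.receiver) return true;
      return !overlaps(a.writes, b.writes) &&
             !overlaps(a.writes, b.reads) &&
             !overlaps(b.writes, a.reads);
    }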

  27. Analysis in Current Compiler • Dependence Analysis • Operations on Objects of Different Classes • Independent Operations on Objects of Same Class • Symbolic Commutativity Testing • Dependent Operations on Objects of Same Class • Future • Integrate Pointer or Alias Analysis • Integrate Array Data Dependence Analysis

  28. Steps to Practicality

  29. Programming Model Extensions • Extensions for Read-Only Data • Allow Operations to Freely Access Read-Only Data • Enhances Ability of Compiler to Represent Expressions • Increases Set of Programs that Compiler can Analyze • Analysis Granularity Extensions • Integrate Operations into Callers for Analysis Purposes • Coarsens Commutativity Testing Granularity • Reduces Number of Pairs Tested for Commutativity • Enhances Effectiveness of Commutativity Testing

  30. Optimizations • Synchronization Optimizations • Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data • Reduce Number of Acquire and Release Constructs • Parallel Loop Optimization • Suppress Exploitation of Excess Concurrency

  31. Extent Constants • Motivation: Allow Parallel Operations to Freely Access Read-Only Data • Extent Constant Variable: Global variable or instance variable written by no operation in extent • Extent Constant Expression: Expression whose value depends only on extent constant variables or parameters • Extent Constant Value: Value computed by extent constant expression • Extent Constant: Automatically generated opaque constant used to represent an extent constant value • Requires: Interprocedural Data Usage Analysis (Result Summarizes How Operations Access Instance Variables) and Interprocedural Pointer Analysis for Reference Parameters

  32. Extent Constant Variables In Example

    void graph::traverse(int v) {
      sum += v;
      if (left != NULL) left->traverse(val);
      if (right != NULL) right->traverse(val);
    }

(the slide's labels mark val, an extent constant variable: no operation in the extent writes it)

  33. Advantages of Extent Constants • Extent Constants Extend Programming Model • Enable Direct Global Variable Access • Enable Direct Access of Objects other than Receiver • Extent Constants Make Compiler More Effective • Enable Compact Representations of Large Expressions • Enable Compiler to Represent Values Computed by Otherwise Unanalyzable Constructs

  34. Auxiliary Operations Motivation: Coarsen Granularity of Commutativity Testing • An Operation is an Auxiliary Operation if its Entire Computation • Only Computes Extent Constant Values • Only Externally Visible Writes are to Local Variables of Caller • Auxiliary Operations are Conceptually Part of Caller • Analysis Integrates Auxiliary Operations into Caller • Represents Computed Values using Extent Constants • Requires: • Interprocedural Data Usage Analysis • Interprocedural Pointer Analysis for Reference Parameters • Intraprocedural Reaching Definition Analysis

  35. Auxiliary Operation Example

    int graph::square_and_add(int v) {
      return (val*val + v);
    }
    void graph::traverse(int v) {
      sum += square_and_add(v);
      if (left != NULL) left->traverse(val);
      if (right != NULL) right->traverse(val);
    }

(val is an extent constant variable and v is a parameter, so val*val + v is an extent constant expression)

  36. Advantages of Auxiliary Operations • Coarsen Granularity of Commutativity Testing • Reduces Number of Pairs Tested for Commutativity • Enhances Effectiveness of Commutativity Testing Algorithm • Support Modular Programming

  37. Synchronization Optimizations • Goal: Eliminate or Reduce Synchronization Overhead • Synchronization Elimination: If an Operation Only Computes Extent Constant Values, Then the Compiler Does Not Generate Lock Acquire and Release • Lock Coarsening • Data: Use One Lock for Multiple Objects • Computation: Generate One Lock Acquire and Release for Multiple Operations on the Same Object

  38. Data Lock Coarsening Example

Original Code:

    class vector {
      lock mutex;
      double val[NDIM];
    };
    void vector::add(double *v) {
      mutex.acquire();
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
      mutex.release();
    }
    class body {
      lock mutex;
      double phi;
      vector acc;
    };
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      mutex.release();
      acc.add(v);
    }

Optimized Code:

    class vector {
      double val[NDIM];
    };
    void vector::add(double *v) {
      for (int i = 0; i < NDIM; i++) val[i] += v[i];
    }
    class body {
      lock mutex;
      double phi;
      vector acc;
    };
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
    }

  39. Computation Lock Coarsening Example

Original Code:

    class body {
      lock mutex;
      double phi;
      vector acc;
    };
    void body::gravsub(body *b) {
      double p, v[NDIM];
      mutex.acquire();
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
      mutex.release();
    }
    void body::loopsub(body *b) {
      int i;
      for (i = 0; i < N; i++) {
        this->gravsub(b+i);
      }
    }

Optimized Code:

    class body {
      lock mutex;
      double phi;
      vector acc;
    };
    void body::gravsub(body *b) {
      double p, v[NDIM];
      p = computeInter(b, v);
      phi -= p;
      acc.add(v);
    }
    void body::loopsub(body *b) {
      int i;
      mutex.acquire();
      for (i = 0; i < N; i++) {
        this->gravsub(b+i);
      }
      mutex.release();
    }

  40. Parallel Loops • Goal: Generate Efficient Code for Parallel Loops • If a Loop is in the Following Form

    for (i = exp1; i < exp2; i += exp3) {
      exp4->op(exp5, exp6, ...);
    }

where exp1, exp2, ... are extent constant expressions, then the compiler generates parallel loop code.

  41. Parallel Loop Optimization • Without Parallel Loop Optimization • Each Loop Iteration Generates a Task • Tasks are Created and Scheduled Sequentially • Each Iteration Incurs Task Creation and Scheduling Overhead • With Parallel Loop Optimization • Generated Code Immediately Exposes All Iterations • Scheduler Operates on Chunks of Loop Iterations • Each Chunk of Iterations Incurs Scheduling Overhead • Advantages • Enables Compact Representation for Loop Computation • Reduces Task Creation and Scheduling Overhead • Parallelizes Overhead
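
A sketch of what chunked parallel loop code could look like, reusing the hypothetical spawn_task/wait_all helpers sketched after slide 14 (CHUNK and parallel_loop are illustrative; the actual generated code exposes all iterations to a dynamic scheduler rather than fixing chunks statically):

    #include <algorithm>
    #include <functional>
    // Reuses spawn_task/wait_all from the sketch after slide 14.

    const int CHUNK = 64;   // illustrative chunk size

    void parallel_loop(int lo, int hi, std::function<void(int)> iter) {
      for (int start = lo; start < hi; start += CHUNK) {
        int end = std::min(start + CHUNK, hi);
        spawn_task([=]() {                 // one task per chunk, not per iteration
          for (int i = start; i < end; i++)
            iter(i);                       // iterations within a chunk run serially
        });
      }
      wait_all();                          // all chunks exposed up front, then wait
    }

Scheduling overhead is thus paid once per chunk instead of once per iteration, and task creation itself runs in parallel with early chunks.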

  42. Suppressing Excess Concurrency Goal: Reduce Overhead of Exploiting Parallelism • Goal Achieved by Generating Computations that • Execute Operations Serially with No Parallelization Overhead • Use Synchronization Required to Execute Safely in Parallel Context • Mechanism: Mutex Versions of Methods • Object Section • Acquires Lock at Beginning • Releases Lock at End • Invocation Section • Operations Execute Serially • Invokes Mutex Version • Current Policy: • Each Parallel Loop Iteration Invokes Mutex Version of Operation • Suppresses Parallel Execution Within Iterations of Parallel Loops
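
A sketch of the mutex version of graph::traverse as described above (the name mutex_traverse is illustrative): the object section is still protected by the lock, but the invocation section calls mutex versions serially instead of spawning tasks:

    // Illustrative mutex version: safe in a parallel context, no parallelism inside.
    void graph::mutex_traverse(int v) {
      mutex.acquire();                                  // object section: atomic update
      sum += v;
      mutex.release();
      if (left != NULL) left->mutex_traverse(val);      // serial, no spawn
      if (right != NULL) right->mutex_traverse(val);    // serial, no spawn
    }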

  43. Experimental Results

  44. Methodology • Built Prototype Compiler • Built Run Time System • Concurrency Generation and Task Management • Dynamic Load Balancing • Synchronization • Acquired Two Complete Applications • Barnes-Hut N-Body Solver • Water Code • Automatically Parallelized Applications • Ran Applications on Stanford DASH Machine • Compare Performance with Highly Tuned, Explicitly Parallel Versions from SPLASH-2 Benchmark Suite

  45. Prototype Compiler • Clean Subset of C++ • Sage++ is Front End • Structured As a Source-To-Source Translator • Analysis Finds Parallel Loops and Methods • Compiler Generates Annotation File • Identifies Parallel Loops and Methods • Classes to Augment with Locks • Code Generator Reads Annotation File • Generates Parallel Versions of Methods • Inserts Synchronization and Parallelization Code • Parallelizes Unannotated Programs

  46. Major Restrictions Motivation: Simplify Implementation of Prototype • No Virtual Methods • No Operator or Method Overloading • No Multiple Inheritance or Templates • No typedef, struct, union or enum types • Global Variables must be Class Types • No Static Members or Pointers to Members • No Default Arguments or Variable Numbers of Arguments • No Operation Accesses a Variable Declared in a Class from which its Receiver Class Inherits

  47. Run Time Library Motivation: Provide Basic Concurrency Management • Single Program, Multiple Data Execution Model • Single Address Space • Alternate Serial and Parallel Phases • Library Provides • Task Creation and Synchronization Primitives • Dynamic Load Balancing • Implemented On • Stanford DASH Shared-Memory Multiprocessor • SGI Shared-Memory Multiprocessors

  48. Applications • Barnes-Hut • O(N lg N) N-Body Solver • Space Subdivision Tree • 1500 Lines of C++ Code • Water • Simulates Liquid Water • O(N^2) Algorithm • 1850 Lines of C++ Code

  49. Obtaining Serial C++ Version of Barnes-Hut • Started with Explicitly Parallel Version (SPLASH-2) • Removed Parallel Constructs to get Serial C • Converted to Clean Object-Based C++ • Major Structural Changes • Eliminated Scheduling Code and Data Structures • Split a Loop in Force Computation Phase • Introduced New Field into Particle Data Structure

  50. Obtaining Serial C++ Version of Water • Started with Serial C translated from FORTRAN • Converted to Clean Object-Based C++ • Major Structural Change • Auxiliary Objects for O(N^2) phases
