Commutativity Analysis: A New Analysis Framework for Parallelizing Compilers Martin C. Rinard University of California, Santa Barbara
Goal Automatically Parallelize Irregular, Object-Based Computations That Manipulate Dynamic, Linked Data Structures
Structure of Talk • Model of Computation • Graph Traversal Example • Commutativity Testing • Basic Technique • Practical Extensions • Advanced Techniques • Synchronization Optimizations • Experimental Results • Future Research
Model of Computation [Figure: objects and operations; an executing operation starts from the initial object state, produces a new object state, and invokes further operations]
Example: Weighted In-Degree Computation [Figure: example weighted graph] • Weighted Graph With Weights On Edges • Goal Is to Compute, For Each Node, the Sum of the Weights on All Incoming Edges • Serial Algorithm: Marked Depth-First Traversal
Serial Code For Example

class node {
    node *left, *right;
    int left_weight, right_weight;
    int sum;
    bool marked;
};

void node::traverse(int weight) {
    sum += weight;
    if (!marked) {
        marked = true;
        if (left != NULL) left->traverse(left_weight);
        if (right != NULL) right->traverse(right_weight);
    }
}

Goal: Execute left and right traverse Operations In Parallel
Parallel Traversal [Figure: successive snapshots of the traversal; after the root spawns its children, multiple traverse operations execute concurrently on different nodes of the graph]
Traditional Approach • Data Dependence Analysis • Compiler Analyzes Reads and Writes • Finds Independent Pieces of Code • Independent Pieces of Code Execute in Parallel • Demonstrated Success for Array-Based Programs • Dense Matrices • Affine Access Functions
Data Dependence Analysis in Example • For Data Dependence Analysis to Succeed in Example • left and right Traverse Must Be Independent • left and right Subgraphs Must Be Disjoint • Graph Must Be a Tree • Depends on Global Topology of Data Structure • Analyze Code that Builds Data Structure • Extract and Propagate Topology Information • Fails for Graphs - Computations Are Not Independent!
Commuting Operations In Parallel Traversal [Figure: two traverse operations arriving at the same node in either order leave the graph in the same final state]
Commutativity Analysis • Compiler Computes Extent of the Computation • Representation of all Operations in Computation • Algorithm Traverses Call Graph • In Example: { node::traverse } • Do All Pairs of Operations in Extent Commute? • No - Generate Serial Code • Yes - Generate Parallel Code • In Example: All Pairs Commute
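A minimal sketch of the extent computation, assuming the compiler already has a call graph; the CallGraph representation and method names here are illustrative, not the compiler's internals.

#include <iostream>
#include <map>
#include <set>
#include <stack>
#include <string>
#include <vector>

// Hypothetical call-graph representation: each method name maps to
// the methods it may invoke. In the example, node::traverse may
// invoke itself on the left and right children.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Extent of the computation rooted at `root`: the set of all methods
// reachable in the call graph, found by a worklist traversal.
std::set<std::string> extent(const CallGraph &cg, const std::string &root) {
    std::set<std::string> result;
    std::stack<std::string> work;
    work.push(root);
    while (!work.empty()) {
        std::string m = work.top(); work.pop();
        if (!result.insert(m).second) continue;   // already visited
        auto it = cg.find(m);
        if (it != cg.end())
            for (const std::string &callee : it->second) work.push(callee);
    }
    return result;
}

int main() {
    CallGraph cg = {{"node::traverse", {"node::traverse"}}};
    for (const auto &m : extent(cg, "node::traverse"))
        std::cout << m << "\n";   // prints: node::traverse
}

Every pair drawn from the resulting set is then tested for commutativity; in the example the extent is { node::traverse }, so the only pair to test is (traverse, traverse).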
Generated Code In Example

Class Declaration:

class node {
    lock mutex;
    node *left, *right;
    int left_weight, right_weight;
    int sum;
    bool marked;
};

Driver Version:

void node::traverse(int weight) {
    parallel_traverse(weight);
    wait();
}
Generated Code In Example

void node::parallel_traverse(int weight) {
    mutex.acquire();                 // critical region begins
    sum += weight;
    if (!marked) {
        marked = true;
        mutex.release();             // critical region ends
        if (left != NULL) spawn(left->parallel_traverse(left_weight));
        if (right != NULL) spawn(right->parallel_traverse(right_weight));
    } else {
        mutex.release();
    }
}
Properties of Commutativity Analysis • Oblivious to Data Structure Topology • Local Analysis • Simple Analysis • Suitable for a Wide Range of Programs • Programs that Manipulate Lists, Trees and Graphs • Commuting Updates to Central Data Structure • General Reductions • Incomplete Programs • Introduces Synchronization
Separable Operations • Each Operation Consists of Two Sections • Object Section: Only Accesses the Receiver Object • Invocation Section: Only Invokes Operations • Both Sections May Access Parameters and Local Variables (annotated in the sketch below)
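To make the split concrete, here is the example's traverse operation with the two sections marked; the section boundaries are my annotation, not the compiler's output.

#include <cstddef>

class node {
public:
    node *left, *right;
    int left_weight, right_weight;
    int sum;
    bool marked;
    void traverse(int weight);
};

void node::traverse(int weight) {
    // Object section: all accesses to the receiver's instance
    // variables happen here (sum and marked; the reads of left,
    // right, and the weights can be viewed as copied into locals
    // at the end of this section).
    sum += weight;
    if (!marked) {
        marked = true;
        // Invocation section: only invokes further operations,
        // using parameters and the values read above.
        if (left != NULL) left->traverse(left_weight);
        if (right != NULL) right->traverse(right_weight);
    }
}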
Commutativity Testing Conditions • Do Two Operations A and B Commute? • Compiler Must Consider Two Potential Execution Orders • A Executes Before B • B Executes Before A • Compiler Must Check Two Conditions • Instance Variables: In both execution orders, the new values of the instance variables are the same after the execution of the two object sections • Invoked Operations: In both execution orders, the two invocation sections together directly invoke the same multiset of operations
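In symbols (my notation, not the talk's): write $v[A;B]$ for the new value of instance variable $v$ after executing $A$ then $B$, and $\mathit{Inv}(A;B)$ for the multiset of operations the two invocation sections directly invoke in that order. Then $A$ and $B$ commute if

% Instance variables condition and invoked operations condition
\forall v:\quad v[A;B] \;=\; v[B;A]
\qquad\text{and}\qquad
\mathit{Inv}(A;B) \;=\; \mathit{Inv}(B;A)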
Commutativity Testing Algorithm • Symbolic Execution • Compiler Executes Operations • Computes with Expressions Instead of Values • Compiler Symbolically Executes Operations In Both Execution Orders • Expressions for New Values of Instance Variables • Expressions for Multiset of Invoked Operations
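A minimal sketch of what symbolic execution produces for the example, with expressions represented as plain strings (an illustrative simplification, not the compiler's actual representation):

#include <iostream>
#include <string>

// Symbolic receiver: instance variables hold expressions, not values.
struct SymNode {
    std::string sum = "sum";        // symbolic initial value
    std::string marked = "marked";
    // Symbolically execute traverse(w): sum picks up w, and marked
    // is true afterwards whether or not it was true before.
    void traverse(const std::string &w) {
        sum = "(" + sum + "+" + w + ")";
        marked = "true";
    }
};

int main() {
    SymNode ab, ba;
    ab.traverse("w1"); ab.traverse("w2");   // order A;B
    ba.traverse("w2"); ba.traverse("w1");   // order B;A
    std::cout << "A;B: sum = " << ab.sum << "\n";  // ((sum+w1)+w2)
    std::cout << "B;A: sum = " << ba.sum << "\n";  // ((sum+w2)+w1)
    // The two expressions differ textually; the rewrite rules on the
    // following slides simplify both to sum+w1+w2, so the instance
    // variables condition holds.
}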
Checking Instance Variables Condition • Compiler Generates Two Symbolic Operations n->traverse(w1) and n->traverse(w2) • In Order n->traverse(w1); n->traverse(w2) • New Value of sum = (sum+w1)+w2 • New Value of marked = true • In Order n->traverse(w2); n->traverse(w1) • New Value of sum = (sum+w2)+w1 • New Value of marked = true
Checking Invoked Operations Condition • In Order n->traverse(w1); n->traverse(w2) Multiset of Invoked Operations Is if (!marked&&left!=NULL) left->traverse(left_weight), if (!marked&&right!=NULL) right->traverse(right_weight) • In Order n->traverse(w2); n->traverse(w1) Multiset of Invoked Operations Is if (!marked&&left!=NULL) left->traverse(left_weight), if (!marked&&right!=NULL) right->traverse(right_weight)
Expression Simplification and Comparison • Compiler Applies Rewrite Rules to Simplify Expressions • b+(a+c) => (a+b+c) • a*(b+c) => (a*b)+(a*c) • a+if(b<c,d,e) => if(b<c,a+d,a+e) • Compiler Compares Corresponding Expressions • If All Equal - Operations Commute • If Not All Equal - Operations May Not Commute
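A minimal sketch of how the first rewrite rule can be mechanized: flatten a nested sum into a sorted list of operands, so expressions that differ only by commutative and associative rearrangement compare equal (an assumed representation in which operands are single identifiers):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Flatten a fully parenthesized sum such as ((sum+w1)+w2) into its
// operands and sort them, realizing the rule b+(a+c) => (a+b+c).
std::vector<std::string> canonicalSum(const std::string &expr) {
    std::vector<std::string> terms;
    std::string cur;
    for (char c : expr) {
        if (c == '(' || c == ')') continue;
        if (c == '+') { terms.push_back(cur); cur.clear(); }
        else cur += c;
    }
    terms.push_back(cur);
    std::sort(terms.begin(), terms.end());
    return terms;
}

int main() {
    bool same = canonicalSum("((sum+w1)+w2)") == canonicalSum("((sum+w2)+w1)");
    std::cout << (same ? "expressions equal" : "may not commute") << "\n";
}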
Practical Extensions Exploit Read-Only Data • Recognize When Computed Values Depend Only On • Unmodified Instance Variables or Global Variables • Parameters • Represent Computed Values Using Opaque Constants • Increases Set of Programs that Compiler Can Analyze • Operations Can Freely Access Read-Only Data Coarsen Commutativity Testing Granularity • Integrate Operations into Callers for Analysis Purposes • Mechanism: Interprocedural Symbolic Execution • Increases Effectiveness of Commutativity Testing
Advanced Techniques • Relative Commutativity Recognize Commuting Operations That Generate Equivalent But Not Identical Data Structures • Techniques for Operations that Contain Conditionals • Distribute Conditionals Out of Expressions • Test for Equivalence By Doing Case Analysis • Techniques for Operations that Access Arrays • Use Array Update Expressions to Represent New Values • Rewrite Rules for Array Update Expressions • Techniques for Operations that Execute Loops
Commutativity Testing for Operations With Loops • Prerequisite: Represent Values Computed In Loops • View Body of Loop as an Expression Transformer • Input Expressions: Values Before Iteration Executes • Output Expressions: Values After Iteration Executes • Represent Values Computed In Loop Using Recursively Defined Symbolic Loop Modeling Functions

Example loop:

int t = sum;
for (int i = 0; i < n; i++) t = t + a[i];
sum = t;

Loop modeling function: s(e,0) = e and s(e,i+1) = s(e,i) + a[i], so the new value of sum is s(sum,n)

• Use Nested Induction Proofs to Determine Equivalence of Expressions With Symbolic Loop Modeling Functions
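One representative lemma such an induction establishes (my rendering): adding w_1 to the accumulator before the loop is equivalent to adding it after, which is what lets the loop commute with an operation that adds w_1.

% Claim: s(e + w_1, i) = s(e, i) + w_1, by induction on i.
% Base case:
s(e + w_1, 0) = e + w_1 = s(e, 0) + w_1
% Inductive step, using the definition s(e, i+1) = s(e, i) + a[i]:
s(e + w_1, i+1) = s(e + w_1, i) + a[i]
                = (s(e, i) + w_1) + a[i]   % induction hypothesis
                = (s(e, i) + a[i]) + w_1   % rearrange the sum
                = s(e, i+1) + w_1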
Important Special Case • Independent Operations Commute • Analysis in Current Compiler • Dependence Analysis • Operations on Objects of Different Classes • Independent Operations on Objects of Same Class • Symbolic Commutativity Testing • Dependent Operations on Objects of Same Class • Future • Integrate Shape Analysis • Integrate Array Data Dependence Analysis
Programming Model Extensions • Extensions for Read-Only Data • Allow Operations to Freely Access Read-Only Data • Enhances Ability of Compiler to Represent Expressions • Increases Set of Programs that Compiler Can Analyze • Analysis Granularity Extensions • Integrate Operations Into Callers for Analysis Purposes • Coarsens Commutativity Testing Granularity • Reduces Number of Pairs Tested for Commutativity • Enhances Effectiveness of Commutativity Testing
Optimizations • Parallel Loop Optimization: Suppress Exploitation of Excess Concurrency • Synchronization Optimizations: Eliminate Synchronization Constructs in Methods that Only Access Read-Only Data • Lock Coarsening: Replaces Multiple Mutual Exclusion Regions with a Single Larger Mutual Exclusion Region
Default Code Generation Strategy • Each Object Has Its Own Mutual Exclusion Lock • Each Operation Acquires and Releases the Lock Simple Lock Optimization • Eliminate Lock Constructs In Operations That Only Access Read-Only Data
Data Lock Coarsening Transformation • Compiler Gives Multiple Objects the Same Lock • Current Policy: Nested Objects Use the Lock in the Enclosing Object • Compiler Finds Sequences of Operations that Access Different Objects but Acquire and Release the Same Lock • Original Code: Each Operation Acquires and Releases the Lock • Transformed Code: Acquires the Lock Once At the Beginning of the Sequence and Releases It Once At the End of the Sequence
Data Lock Coarsening Example

Original Code:

class vector {
    lock mutex;
    double val[NDIM];
};

void vector::add(double *v) {
    mutex.acquire();
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
    mutex.release();
}

class body {
    lock mutex;
    double phi;
    vector acc;
};

void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    mutex.release();
    acc.add(v);
}

Transformed Code:

class vector {
    double val[NDIM];   // no lock; protected by the enclosing body's lock
};

void vector::add(double *v) {
    for (int i = 0; i < NDIM; i++) val[i] += v[i];
}

class body {
    lock mutex;
    double phi;
    vector acc;
};

void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);         // now inside the critical region
    mutex.release();
}
Data Lock Coarsening Tradeoff • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Cause False Exclusion • Multiple Parallel Operations Access Different Objects • But Operations Attempt to Acquire Same Lock • Result: Operations Execute Serially
Computation Lock Coarsening Transformation • Compiler Finds Sequences of Operations that Acquire and Release the Same Lock • Transformed Code: Acquires the Lock Once at the Beginning of the Sequence and Releases It Once at the End of the Sequence • Result: Replaces Multiple Mutual Exclusion Regions With One Large Mutual Exclusion Region • Algorithm Based On Local Transformations • Move Lock Acquire and Release To Become Adjacent • Eliminate Adjacent Acquire and Release
Computation Lock Coarsening Example

Original Code:

class body {
    lock mutex;
    double phi;
    vector acc;
};

void body::gravsub(body *b) {
    double p, v[NDIM];
    mutex.acquire();
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
    mutex.release();
}

void body::loopsub(body *b) {
    int i;
    for (i = 0; i < N; i++) {
        this->gravsub(b+i);
    }
}

Optimized Code:

class body {
    lock mutex;
    double phi;
    vector acc;
};

void body::gravsub(body *b) {
    double p, v[NDIM];
    p = computeInter(b,v);
    phi -= p;
    acc.add(v);
}

void body::loopsub(body *b) {
    int i;
    mutex.acquire();      // lock held across the whole loop
    for (i = 0; i < N; i++) {
        this->gravsub(b+i);
    }
    mutex.release();
}
Computation Lock Coarsening Tradeoff • Advantage: • Reduces Number of Executed Acquires and Releases • Reduces Acquire and Release Overhead • Disadvantage: May Introduce False Contention • Multiple Processors Attempt to Acquire Same Lock • Processor Holding the Lock is Executing Code that was Originally in No Mutual Exclusion Region
Managing Tradeoff: Lock Coarsening Policies • To Manage Tradeoff, Compiler Must Successfully • Reduce Lock Overhead by Increasing Lock Granularity • Avoid Excessive False Exclusion and False Contention • Original Policy • Use Original Lock Algorithm • Bounded Policy • Apply Transformation Unless Transformed Code • Holds Lock During a Recursive Call, or • Holds Lock During a Loop that Invokes Operations • Aggressive Policy • Always Apply Transformation
Choosing Best Policy • Best Policy May Depend On • Topology of Data Structures • Dynamic Schedule Of Computation • Information Required to Choose Best Policy Unavailable At Compile Time • Complications • Different Phases May Have Different Best Policy • In Same Phase, Best Policy May Change Over Time
Use Dynamic Feedback to Choose Best Policy • Sampling Phase: Measures Overhead of Different Policies • Production Phase: Uses Best Policy From Sampling Phase • Periodically Resample to Discover Changes in Best Policy • Guaranteed Performance Bounds [Figure: overhead over time; sampling phases try the Original, Bounded, and Aggressive policies, then a production phase runs the best one until the next sampling phase]
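A minimal sketch of the dynamic feedback loop, assuming the compiler emits one version of the computation per lock coarsening policy; runWithPolicy is a hypothetical hook standing in for those generated versions.

#include <chrono>

enum Policy { Original, Bounded, Aggressive, NumPolicies };

// Hypothetical hook: run one fixed-size unit of the computation
// compiled under the given lock coarsening policy.
void runWithPolicy(Policy p) { /* ... */ }

// Sampling phase: time each policy on the same amount of work and
// keep the one with the lowest measured overhead.
Policy samplingPhase() {
    using clock = std::chrono::steady_clock;
    Policy best = Original;
    auto bestTime = clock::duration::max();
    for (int p = 0; p < NumPolicies; ++p) {
        auto start = clock::now();
        runWithPolicy(static_cast<Policy>(p));
        auto elapsed = clock::now() - start;
        if (elapsed < bestTime) { bestTime = elapsed; best = static_cast<Policy>(p); }
    }
    return best;
}

int main() {
    // Alternate sampling and production phases; periodic resampling
    // bounds how long a stale policy choice can persist.
    for (int phase = 0; phase < 10; ++phase) {
        Policy best = samplingPhase();
        for (int i = 0; i < 1000; ++i) runWithPolicy(best);  // production
    }
}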
Methodology • Built Prototype Compiler for Subset of C++ • Built Run Time System for Shared Memory Machines • Concurrency Generation and Task Management • Dynamic Load Balancing and Synchronization • Acquired Three Complete Applications • Barnes-Hut • Water • String • Automatically Parallelized Applications • Ran Applications on Stanford DASH Machine • Compare with Highly Tuned, Explicitly Parallel Versions
Major Assumptions and Restrictions • Assumption: No Violation of Type Declarations • Restrictions: • Conceptually Significant • No Virtual Functions • No Function Pointers • No Exceptions • Operations Access Only • Parameters • Read-Only Data • Data Members Declared in Class of the Receiver • Implementation Convenience • No Multiple Inheritance • No Templates • No union, struct or enum Types • No typedef Declarations • Global Variables Must Be of Class Types • No Static Members • No Default Arguments or Variable Numbers of Arguments • No Numeric Casts
Applications • Barnes-Hut • O(N lg N) N-Body Solver • Space Subdivision Tree • 1500 Lines of C++ Code • Water • Simulates Liquid Water • O(N^2) Algorithm • 1850 Lines of C++ Code • String • Computes Model of Geology Between Two Oil Wells • 2050 Lines of C++ Code
Obtaining Serial C++ Version of Barnes-Hut • Started with Explicitly Parallel Version (SPLASH-2) • Removed Parallel Constructs to get Serial C • Converted to Clean Object-Based C++ • Major Structural Changes • Eliminated Scheduling Code and Data Structures • Split a Loop in Force Computation Phase • Introduced New Field into Particle Data Structure
Obtaining Serial C++ Version of Water • Started with Serial C Translated from Fortran • Converted to Clean Object-Based C++ • Major Structural Change • Auxiliary Objects for O(N^2) phases
Obtaining Serial C++ Version of String • Started With Serial C Translated From Fortran • Converted to Clean C++ • No Major Structural Changes
Performance Results for Barnes-Hut and Water [Figure: speedup versus number of processors, 0 to 16, with ideal, explicitly parallel, and commutativity analysis curves; Barnes-Hut on DASH with 16K particles, Water on DASH with 512 molecules]
Performance Results for String [Figure: speedup versus number of processors, 0 to 16, with ideal, explicitly parallel, and commutativity analysis curves; String on DASH with the Big Well model]
Synchronization Optimizations • Generated A Version of Each Application for Each Lock Coarsening Policy • Original • Bounded • Aggressive • Dynamic Feedback • Ran Applications on Stanford DASH Machine