
Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication

This presentation describes adaptive replication, a technique for eliminating synchronization bottlenecks in object-based programs. The technique replicates the objects that cause bottlenecks, gives each processor its own local copy to update, and combines the copies at the end of the parallel phase. The presentation also discusses when it is legal to replicate an object and how to decide which objects to replicate.


Presentation Transcript


  1. Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication. Martin Rinard, Laboratory for Computer Science, Massachusetts Institute of Technology; Pedro Diniz, Information Sciences Institute, University of Southern California

  2. Context [Diagram: Sequential Program -> Parallelizing Compiler (Commutativity Analysis) -> Parallel Program with Mutual Exclusion Synchronization]

  3. Context [Diagram: Sequential Program -> Parallelizing Compiler (Commutativity Analysis) -> Parallel Program with Mutual Exclusion Synchronization] • Basic Idea: View the computation as atomic operations on objects • If all pairs of operations in a given phase commute (generate the same final result in either execution order), the compiler generates parallel code
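As a rough illustration of what "commute" means here (this example is not from the talk, just a minimal sketch): an update of the form sum += weight produces the same final state regardless of which of two invocations runs first.

  // Minimal sketch of two commuting operations (illustration only, not the
  // compiler's analysis): add(2) then add(5) leaves the same state as
  // add(5) then add(2), because integer addition is commutative and associative.
  #include <cassert>

  struct Node {
      int sum = 0;
      void add(int weight) { sum += weight; }
  };

  int main() {
      Node x, y;
      x.add(2); x.add(5);      // one execution order
      y.add(5); y.add(2);      // the other execution order
      assert(x.sum == y.sum);  // same final result: 7
      return 0;
  }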

  4. Context [Diagram: Sequential Program -> Parallelizing Compiler (Commutativity Analysis) -> Parallel Program with Mutual Exclusion Synchronization -> Synchronization Optimization (Lock Coarsening, Adaptive Replication) -> Optimized Parallel Program with Mutual Exclusion Synchronization and Data Replication]

  5. Outline • Example • Model of Computation • Basic Issues • Interaction with Lock Coarsening • Experimental Results • Conclusion

  6. Example [Figure: a weighted graph; each edge carries a weight and each node accumulates a sum]

  7. Example [Figure: the graph after the complete traversal; every node's sum has been filled in and the object that receives all the updates holds the final total of 14]

  8. Outline of Algorithm Graph Traversal • Acquire Lock in Object • Update Sum • Release Lock • In Parallel, Recursively Traverse Left Child and Right Child

  9. Parallel Program
  void node::traverse(int weight) {
    mutex.acquire();
    sum += weight;
    mutex.release();
    if (left != NULL) spawn left->traverse(left_weight);
    if (right != NULL) spawn right->traverse(right_weight);
  }
  class node {
    lock mutex;
    node *left, *right;
    int left_weight;
    int right_weight;
    int sum;
  };

  10.-22. Example (animation) [Figures: the parallel traversal fills in the sums of the upper nodes, then the operations add the eight leaf weights into the shared object one at a time; its sum grows from 0 through 2, 3, 4, 6, 8, 9, and 12 to the final value of 14]

  23. Synchronization Bottleneck • Lots of Updates to One Object • Because of Mutual Exclusion, Updates Execute Sequentially • Processors Spend Time Waiting to Acquire the Lock in the Object • Performance Suffers

  24. Solution in Example • Replicate Object that Causes Bottleneck • Give Each Processor Its Own Local Copy • Each Processor Updates Local Copy • Combine Copies at End of Parallel Phase [Figure: the shared object, with sum 0, labeled "Replicate This Object"]

  25. Example with Four Processors [Figure: the shared object, with sum 0, is replicated; Processor 0, Processor 1, Processor 2, and Processor 3 each receive a local copy initialized to 0]

  26. Add In First Number [Figure: after each processor adds in its first weight, the local copies hold 2, 1, 2, and 3; the original object still holds 0]

  27. Add In Second Number [Figure: after each processor adds in its second weight, the local copies hold 3, 3, 3, and 5]

  28. Combine To Get Final Result [Figure: the local copies 3, 3, 3, and 5 are added into the original object, giving the final sum of 14]
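The arithmetic on these four slides can be reproduced with a small standalone sketch (illustration only, using std::thread rather than the compiler's runtime; the assignment of weights to processors is an assumption chosen to match the figures): each thread adds its two weights into a private copy without locking, and the copies are summed into the original object after the parallel phase.

  // Illustrative sketch of the four-processor example (not the generated code):
  // each thread updates a private replica without locking, and the replicas
  // are combined into the original object in the serial phase.
  #include <cstdio>
  #include <thread>
  #include <vector>

  int main() {
      // Weight pairs chosen to match the figures (an assumption about which
      // processor handles which leaves); any assignment gives the same total.
      const int weights[4][2] = {{2, 1}, {1, 2}, {2, 1}, {3, 2}};
      int original = 0;               // the shared object's sum
      int replica[4] = {0, 0, 0, 0};  // one local copy per processor

      std::vector<std::thread> workers;
      for (int p = 0; p < 4; ++p) {
          workers.emplace_back([&, p] {
              for (int w : weights[p]) replica[p] += w;  // no lock needed
          });
      }
      for (auto &t : workers) t.join();                  // end of parallel phase

      for (int p = 0; p < 4; ++p) original += replica[p];  // combine copies
      std::printf("final sum = %d\n", original);           // prints 14
      return 0;
  }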

  29. Goal: Automate Technique of Replicating Objects to Eliminate Synchronization Bottlenecks

  30. Object-Based Model of Computation • Objects • Instance variables (left, right, sum, …) represent the state of each object • Operations on Receiver Objects • In the example, traverse is an operation • The updated graph node is the receiver object • Operation Execution • Updates instance variables in the receiver • Invokes other operations

  31. Parallel Execution • Execution of the Application Consists of an Alternating Sequence of Serial Phases and Parallel Phases [Diagram: Serial Phase -> Parallel Phase -> Serial Phase -> Parallel Phase -> Serial Phase]

  32. Operations in Parallel Phases • Instance variable updates execute atomically • Each object has mutual exclusion lock • Lock acquired before updates • Lock released after updates • Invoked operations execute in parallel

  33. Legality of Replicating Objects • Is it always legal to replicate objects? • No. All updates to object in parallel phase must be replicatable • Updates of the form v = v+exp are replicatable, where + is a commutative, associative operator with a zero, and variables in exp are not updated during the parallel phase
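To make the condition concrete, here is a small illustration (the function and field names are invented for this sketch, not taken from the system): the first update fits the v = v + exp pattern, so per-processor copies can be summed at the end; the second reads state that other operations may update during the same parallel phase, so it is not replicatable.

  // Illustration of the legality condition; names are invented for this sketch.
  struct Node {
      int sum = 0;
      int other_sum = 0;
  };

  void replicatable_update(Node &n, int weight) {
      // Fits v = v + exp: + is commutative and associative with zero 0,
      // and weight is not updated during the parallel phase.
      n.sum = n.sum + weight;
  }

  void non_replicatable_update(Node &n, const Node &peer) {
      // Not replicatable: the right-hand side reads peer.sum, which other
      // operations may update during the same parallel phase.
      n.sum = n.sum + peer.sum;
  }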

  34. Which Objects to Replicate? • Why Not Just Replicate All Replicatable Objects? • Some Objects Don’t Cause Bottlenecks • Replication Overhead • Space for Copies • Time to Create and Initialize Copies • Goal • Identify Objects With High Contention • Replicate Only Those Objects

  35. Basic Approach • Dynamically Measure Contention At Each Object • If Contention Is High • Replicate Object (Dynamically) • Perform Update on Local Copy • If No Contention • Perform Update on Original Object • Pay Replication Overhead Only When There is a Payoff in Parallelism

  36. Details • What is the replication policy? • The processor attempts to acquire the lock. • It creates a local copy only if it fails to acquire the lock. • Where are replicas stored? • In a hash table. • Won't the space overhead be too high? • No. Impose a space limit. • If a replication would exceed the space limit, don't replicate the object; wait for the lock.
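A hedged sketch of how the lookup and insert helpers used on the code slides below, together with the space accounting, might be organized (an assumption about the runtime support, not the actual generated code): one replica table per processor mapping original objects to local copies, plus a per-processor byte count for the space limit.

  // Hedged sketch of per-processor replica bookkeeping (assumed runtime
  // support; the real system's data structures may differ).
  #include <cstddef>
  #include <unordered_map>

  class node;  // the application class from the earlier slides

  struct ReplicaTable {
      std::unordered_map<node*, node*> replicas;  // original object -> local copy
      std::size_t allocated = 0;                  // bytes currently used by replicas
      std::size_t limit = 1 << 20;                // space limit for this processor
  };

  // One table per processor, so updates to it need no locking.
  thread_local ReplicaTable replica_table;

  node *lookup(node *original) {
      auto it = replica_table.replicas.find(original);
      return it == replica_table.replicas.end() ? nullptr : it->second;
  }

  void insert(node *original, node *replica) {
      replica_table.replicas[original] = replica;
  }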

  37.-39. More Details • What happens at the end of the parallel phase? • The generated code traverses the hash tables • Finds the replicas • Combines their contributions into the original objects • Deallocates the replicas [Figures: the hash tables on Processor 0 and Processor 1 point to replicas holding 6 and 8; their contributions are combined into the original object, which ends up holding 14]

  40. More Details [Figure: after the combination, the replicas have been deallocated, the hash tables are empty, and the original object holds 14]
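The combination step is shown only pictorially in the talk; here is a hedged sketch of what it could look like, building on the node class from slide 9 and the ReplicaTable assumed above (the helper name is an invention of this sketch).

  // Hedged sketch of the end-of-parallel-phase combination (not shown as code
  // in the talk); assumes sum is the only replicated field of node and that
  // this code has access to node's fields (e.g. as a friend).
  void combine_replicas(ReplicaTable &table) {
      for (auto &entry : table.replicas) {
          node *original = entry.first;   // object in the shared structure
          node *replica  = entry.second;  // this processor's local copy
          original->sum += replica->sum;  // fold the local contribution back in
          delete replica;                 // deallocate the replica
      }
      table.replicas.clear();             // hash table is now empty
      table.allocated = 0;                // replica space is available again
  }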

  41. Generated Code
  void node::traverse(int weight) {
    node *replica = lookup(this);                     // Check for existing copy
    if (replica) replica->replicaTraverse(weight);    // Update existing copy
    else if (mutex.tryAcquire()) {                    // Try to acquire lock
      update: sum += weight;                          // Perform update on original object
      mutex.release();
      if (left != NULL) spawn left->traverse(left_weight);
      if (right != NULL) spawn right->traverse(right_weight);
    } else {                                          // No existing copy, failed to acquire lock
      replica = this->replicate();                    // Try to replicate object
      if (replica) replica->replicaTraverse(weight);  // Update new copy
      else { mutex.acquire(); goto update; }          // Replication failed, wait for lock
    }
  }

  42. Updating A Replica
  void node::replicaTraverse(int weight) {
    sum += weight;
    if (left != NULL) spawn left->traverse(left_weight);
    if (right != NULL) spawn right->traverse(right_weight);
  }
  Updates Execute Without Synchronization

  43. Replicating An Object
  node *node::replicate() {
    // Check to see if the space limit would be exceeded
    if (allocated + sizeof(node) > limit) return(NULL);
    // Allocate a new copy
    node *replica = new node;
    allocated += sizeof(node);
    // Zero out the updated field
    replica->sum = 0;
    // Copy the other fields
    replica->left = left;
    replica->left_weight = left_weight;
    replica->right = right;
    replica->right_weight = right_weight;
    // Insert the replica into the hash table
    insert(this, replica);
    return(replica);
  }

  44. Adaptive Replication Summary • Static Analysis to Discover Replicatable Objects • Dynamic Measurement of Contention to Determine Which Objects to Replicate • Generated Code • Measures Contention • Replicates Objects • Updates Original and Replica Objects • Combines Results in Replicas Back Into Original Objects

  45. Lock Coarsening
  Before:
    obj.mutex.acquire(); update obj; obj.mutex.release();
    unsynchronized computation
    obj.mutex.acquire(); update obj; obj.mutex.release();
    unsynchronized computation
    obj.mutex.acquire(); update obj; obj.mutex.release();
  After:
    obj.mutex.acquire();
    update obj
    unsynchronized computation
    update obj
    unsynchronized computation
    update obj
    obj.mutex.release();

  46. Lock Coarsening
  Before:
    while (c) {
      unsynchronized computation
      obj.mutex.acquire();
      update obj
      obj.mutex.release();
    }
  After:
    obj.mutex.acquire();
    while (c) {
      unsynchronized computation
      update obj
    }
    obj.mutex.release();
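As a concrete, hand-written rendering of the loop case above (an illustration using std::mutex, not the compiler's lock construct):

  // Illustrative rendering of the loop case of lock coarsening
  // (std::mutex stands in for the compiler's lock construct).
  #include <mutex>

  struct Obj {
      std::mutex mutex;
      int value = 0;
  };

  void before(Obj &obj, int n) {
      for (int i = 0; i < n; ++i) {
          int w = i * i;          // unsynchronized computation
          obj.mutex.lock();       // acquire and release on every iteration
          obj.value += w;         // update obj
          obj.mutex.unlock();
      }
  }

  void after(Obj &obj, int n) {
      obj.mutex.lock();           // acquire once for the whole loop
      for (int i = 0; i < n; ++i) {
          int w = i * i;          // unsynchronized computation now inside
          obj.value += w;         //   the critical section
      }
      obj.mutex.unlock();         // release once
  }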

  47. Lock Coarsening Tradeoffs • Advantage: • Fewer Executed Lock Constructs • Acquires • Releases • Less Lock Overhead • Disadvantage: • Critical Sections Larger • May Cause Additional Serialization • In Some Cases, Completely Serializes Parallel Phase

  48. Lock Coarsening Tradeoffs With Adaptive Replication • Advantages: • Fewer Executed Lock and Replication Constructs • Replica Lookups • Lock Acquires and Releases • Less Lock and Replication Overhead • No Additional Serialization • Disadvantage: • Potential For Increased Memory Usage

  49. Result • Automatically Generated Code That Replicates Objects to Eliminate Synchronization Bottlenecks • The Replication Policy Dynamically Adapts to the Amount of Contention for Each Object on Each Processor • Lock Coarsening Plus Adaptive Replication Increases Granularity and Reduces Overhead Without Increasing Serialization

  50. Experimental Results • Prototype Implementation • In Context of Parallelizing Compiler • Commutativity Analysis • Lock Coarsening, Adaptive Replication • Four Versions • Adaptive Replication, Lock Coarsening • Adaptive Replication, No Lock Coarsening • No Replication, Best Lock Coarsening • Full Replication, Lock Coarsening
