This talk discusses communication optimizations for parallel computing, focusing on replication, locality, broadcast, and concurrent fetches. The goal is to reduce communication overhead and improve performance.
Communication Optimizations for Parallel Computing Using Data Access Information
Martin Rinard
Department of Computer Science
University of California, Santa Barbara
martin@cs.ucsb.edu
http://www.cs.ucsb.edu/~martin
Motivation
Communication Overhead Can Substantially Degrade the Performance of Parallel Computations
Communication Optimizations
• Replication
• Locality
• Broadcast
• Concurrent Fetches
• Latency Hiding
Applying Optimizations
• Language Implementation: Automatically
  • Reduces Programming Burden
  • No Portability Problems: Each Implementation Optimized for Current Hardware Platform
• Programmer: By Hand
  • Programming Burden
  • Portability Problems
Key Questions • How does the implementation get the information it needs to apply the communication optimizations? • What communication optimization algorithms does the implementation use? • How well do the optimized computations perform?
Goal of Talk Present Experience Automatically Applying Communication Optimizations in Jade
Talk Outline • Jade Language • Message Passing Implementation • Communication Optimization Algorithms • Experimental Results on iPSC/860 • Shared Memory Implementation • Communication Optimization Algorithms • Experimental Results on Stanford DASH • Conclusion
Jade • Portable, Implicitly Parallel Language • Data Access Information • Programmer starts with serial program • Uses Jade constructs to provide information about how parts of program access data • Jade Implementation Uses Data Access Information to Automatically • Extract Concurrency • Synchronize Computation • Apply Communication Optimizations
Jade Concepts
• Shared Objects
• Tasks
• Access Specifications

withonly { rd(o); wr(p); } do (o, p) {
    computation that reads o and writes p
}

The withonly clause is the access specification, the do block is the task, and o and p are shared object references.
Jade Example
(Animated sequence of slides: three tasks of the form withonly { rd ...; wr ...; } do () { ... }, each declaring the objects it reads and writes. As the animation progresses, the rd and wr declarations accumulate, showing how the implementation tracks each task's accesses to shared objects.)
Result
• At Each Point in the Execution
  • A Collection of Enabled Tasks
  • Each Task Has an Access Specification
• Jade Implementation
  • Exploits Information in Access Specifications to Apply Communication Optimizations
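The enabled-task collection above can be sketched in Python. This is a hypothetical model, not the Jade implementation: an access specification maps each shared object to 'rd' or 'wr', two tasks conflict when they share an object and at least one writes it, and a task is enabled once every earlier conflicting task has completed. The function names and dictionary encoding are illustrative assumptions.

```python
def conflicts(spec_a, spec_b):
    """spec: dict mapping object id -> 'rd' or 'wr'.
    Tasks conflict if they share an object and at least one writes it."""
    for obj, mode_a in spec_a.items():
        mode_b = spec_b.get(obj)
        if mode_b is not None and 'wr' in (mode_a, mode_b):
            return True
    return False

def enabled_tasks(specs, completed):
    """specs: access specifications in serial program order.
    A task is enabled once every earlier conflicting task has completed."""
    enabled = []
    for i, spec in enumerate(specs):
        if i in completed:
            continue
        if all(j in completed or not conflicts(specs[j], spec)
               for j in range(i)):
            enabled.append(i)
    return enabled
```

For example, a writer followed by two readers of the same object enables only the writer at first; once it completes, both readers become enabled together, since rd/rd accesses do not conflict.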
Message Passing Implementation • Model of Computation for Implementation • Implementation Overview • Communication Optimizations • Experimental Results for iPSC/860
Model of Computation
• Each Processor Has a Private Memory
• Processors Communicate by Sending Messages through the Network
(Diagram: processors, each with a private memory, connected by a network.)
Implementation Overview Distributes Objects Across Memories
Implementation Overview
Assigns Enabled Tasks to Idle Processors
Implementation Overview
Transfers Objects to the Accessing Processor
• Replicates Objects that the Task will Read
Implementation Overview
Transfers Objects to the Accessing Processor
• Migrates Objects that the Task will Write
Implementation Overview
When All Remote Objects Arrive, the Task Executes
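The transfer policy on the preceding slides can be sketched as a small object directory. This is a hypothetical sketch of the policy, not Jade's actual data structures: objects a task only reads are replicated to the executing processor, while objects it writes are migrated, invalidating the other copies.

```python
class ObjectDirectory:
    """Tracks which processors hold a copy of each shared object."""
    def __init__(self):
        self.copies = {}   # object id -> set of processors holding a copy

    def allocate(self, obj, proc):
        self.copies[obj] = {proc}

    def prepare_task(self, spec, proc):
        """spec: dict object id -> 'rd' or 'wr'; proc: executing processor."""
        for obj, mode in spec.items():
            if mode == 'rd':
                self.copies[obj].add(proc)   # replicate: readers coexist
            else:
                self.copies[obj] = {proc}    # migrate: single writer

d = ObjectDirectory()
d.allocate('grid', 0)
d.prepare_task({'grid': 'rd'}, 1)   # processors 0 and 1 both hold 'grid'
d.prepare_task({'grid': 'wr'}, 2)   # only processor 2 holds it now
```

Replication is what lets multiple readers proceed concurrently, while migration on writes keeps a single current copy.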
Optimization Goals and Mechanisms
• Adaptive Broadcast. Goal: Parallelize Communication. Mechanism: Broadcast Each New Version of Widely Accessed Objects.
• Replication. Goal: Enable Tasks to Concurrently Read the Same Data. Mechanism: Replicate Data on Reading Processors.
• Latency Hiding. Goal: Overlap Computation and Communication. Mechanism: Assign Multiple Enabled Tasks to the Same Processor.
• Concurrent Fetch. Goal: Parallelize Communication. Mechanism: Concurrently Transfer the Remote Objects that a Task will Access.
• Locality. Goal: Eliminate Communication. Mechanism: Execute Tasks on Processors that Have Locally Available Copies of Accessed Objects.
Application-Based Evaluation
• Water: Evaluates forces and potentials in a system of liquid water molecules
• String: Computes a velocity model of the geology between two oil wells
• Ocean: Simulates the role of eddy and boundary currents in influencing large-scale ocean movements
• Panel Cholesky: Sparse Cholesky factorization algorithm
Impact of Communication Optimizations
(Table: for each application (Water, String, Ocean, Panel Cholesky), the impact of Adaptive Broadcast, Replication, Latency Hiding, and Concurrent Fetch is marked + for significant impact or - for negligible impact; a further marking indicates where an optimization is required to expose concurrency.)
Locality Optimization • Integrated into Online Scheduler • Scheduler • Maintains Pool of Enabled Tasks • Maintains Pool of Idle Processors • Balances Load by Assigning Enabled Tasks to Idle Processors • Locality Algorithm Affects the Assignment
Locality Concepts
• Each Object Has an Owner: the last processor to write the object. The owner has a current copy of the object.
• Each Task Has a Locality Object: currently the first object in its access specification.
• The Locality Object Determines the Target Processor: the owner of the locality object.
• Goal: Execute each task on its target processor.
When a Task Becomes Enabled
• Scheduler Checks the Pool of Idle Processors
• If the Target Processor is Idle: the Target Processor Gets the Task
• If Some Other Processor is Idle: that Processor Gets the Task
• If No Processor is Idle: the Task is Held in the Pool of Enabled Tasks
When a Processor Becomes Idle
• Scheduler Checks the Pool of Enabled Tasks
• If the Processor is the Target of an Enabled Task: the Processor Gets That Task
• If Other Enabled Tasks Exist: the Processor Gets One of Those Tasks
• If No Enabled Tasks Exist: the Processor Stays Idle
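The two scheduling rules above can be sketched together in Python. This is a hypothetical sketch of the policy, not the actual scheduler: prefer the target processor when matching enabled tasks with idle processors, but fall back to any idle processor or any enabled task so that no processor stays idle while work is available.

```python
def task_enabled(task, target, idle, pool):
    """task: task id; target: its target processor; idle: set of idle
    processors; pool: list of (task, target) pairs of held enabled tasks.
    Returns the processor assigned, or None if the task is held."""
    if target in idle:
        idle.remove(target)
        return target              # target processor gets the task
    if idle:
        return idle.pop()          # some other idle processor gets it
    pool.append((task, target))    # no processor idle: hold the task
    return None

def processor_idle(proc, pool, idle):
    """Returns the task assigned to proc, or None if it stays idle."""
    for i, (task, target) in enumerate(pool):
        if target == proc:
            return pool.pop(i)[0]  # proc is the target of this task
    if pool:
        return pool.pop(0)[0]      # otherwise take any enabled task
    idle.add(proc)                 # no enabled tasks: stay idle
    return None
```

The asymmetry matters: locality only biases the matching, so load balance is preserved whenever targets and idle processors do not line up.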
Implementation Versions
• Locality: Implementation Uses the Locality Algorithm
• No Locality: First-Come, First-Served Assignment of Enabled Tasks to Idle Processors
• Task Placement (Ocean and Panel Cholesky): Programmer Assigns Tasks to Processors
(Figure: Percentage of Tasks Executed on Target Processor on iPSC/860, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
(Figure: Communication to Useful Computation Ratio on iPSC/860 (Mbytes/Second/Processor), plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
(Figure: Speedup on iPSC/860, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
Shared Memory Implementation • Model of Computation • Locality Optimization • Locality Performance Results
Model of Computation
• Single Shared Memory Composed of Memory Modules
• Each Memory Module Associated with a Processor
• Each Object Allocated in a Memory Module
• Processors Communicate by Reading and Writing Objects in the Shared Memory
(Diagram: processors accessing objects in a shared memory composed of per-processor memory modules.)
Locality Algorithm • Integrated into Online Scheduler • Scheduler Runs Distributed Task Queue • Each Processor Has a Queue of Enabled Tasks • Idle Processors Search Task Queues • Locality Algorithm Affects Task Queue Algorithm
Locality Concepts
• Each Object Has an Owner: the processor associated with the memory module that holds the object. Accesses to the object from this processor are satisfied from the local memory module.
• Each Task Has a Locality Object: currently the first object in its access specification.
• The Locality Object Determines the Target Processor: the owner of the locality object.
• Goal: Execute each task on its target processor.
When a Processor Becomes Idle
• If Its Task Queue is Not Empty: Execute the First Task in Its Task Queue
• Otherwise, Cyclically Search the Other Task Queues
  • If a Remote Task Queue is Not Empty: Execute the Last Task in That Queue
When a Task Becomes Enabled
• Locality Algorithm Inserts the Task into the Task Queue at the Owner of Its Locality Object
• Tasks with the Same Locality Object are Adjacent in the Queue
• Goals:
  • Enhance memory locality by executing each task on the owner of its locality object.
  • Enhance cache locality by executing tasks with the same locality object consecutively on the same processor.
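The distributed task queue described on the two slides above can be sketched in Python. This is a hypothetical sketch, not the actual scheduler: a task is enqueued at the owner of its locality object, an idle processor dequeues from the front of its own queue, and otherwise steals from the back of remote queues found by cyclic search.

```python
from collections import deque

class DistributedQueues:
    """One task queue per processor; locality-guided insertion, stealing."""
    def __init__(self, nprocs):
        self.queues = [deque() for _ in range(nprocs)]

    def enqueue(self, task, owner):
        # Insert at the owner of the task's locality object.
        self.queues[owner].append(task)

    def next_task(self, proc):
        if self.queues[proc]:
            return self.queues[proc].popleft()    # local work first
        n = len(self.queues)
        for k in range(1, n):                     # cyclic search
            victim = (proc + k) % n
            if self.queues[victim]:
                return self.queues[victim].pop()  # steal from the tail
        return None                               # no enabled tasks: idle
```

Stealing from the tail takes the task least likely to be cache-resident at the victim, while the victim keeps consuming tasks with the same locality object from the front.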
Evaluation • Same Set of Applications • Water • String • Ocean • Panel Cholesky • Same Locality Versions • Locality • No Locality (Single Task Queue) • Explicit Task Placement (Ocean and Panel Cholesky)
(Figure: Percentage of Tasks Executed on Target Processor on DASH, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
(Figure: Task Execution Time on DASH, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
(Figure: Speedup on DASH, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)