This talk discusses communication optimizations for parallel computing, focusing on replication, locality, broadcast, and concurrent fetches. The goal is to reduce communication overhead and improve performance.
Communication Optimizations for Parallel Computing Using Data Access Information
Martin Rinard
Department of Computer Science
University of California, Santa Barbara
martin@cs.ucsb.edu
http://www.cs.ucsb.edu/~martin
Motivation
Communication Overhead Can Substantially Degrade the Performance of Parallel Computations
Communication Optimizations
• Replication
• Locality
• Broadcast
• Concurrent Fetches
• Latency Hiding
Applying Optimizations
• Language Implementation: Automatically
  • Reduces Programming Burden
  • No Portability Problems: Each Implementation Optimized for Current Hardware Platform
• Programmer: By Hand
  • Programming Burden
  • Portability Problems
Key Questions • How does the implementation get the information it needs to apply the communication optimizations? • What communication optimization algorithms does the implementation use? • How well do the optimized computations perform?
Goal of Talk Present Experience Automatically Applying Communication Optimizations in Jade
Talk Outline • Jade Language • Message Passing Implementation • Communication Optimization Algorithms • Experimental Results on iPSC/860 • Shared Memory Implementation • Communication Optimization Algorithms • Experimental Results on Stanford DASH • Conclusion
Jade • Portable, Implicitly Parallel Language • Data Access Information • Programmer starts with serial program • Uses Jade constructs to provide information about how parts of program access data • Jade Implementation Uses Data Access Information to Automatically • Extract Concurrency • Synchronize Computation • Apply Communication Optimizations
Jade Concepts
• Shared Objects
• Tasks
• Access Specifications

withonly { rd(o); wr(p); } do (o, p) {
    computation that reads o and writes p
}

The withonly clause is the access specification, the do block is the task, and o and p are shared object references.
Jade Example
(Animated sequence of slides: three tasks of the form withonly { rd ...; wr ...; } do () { ... }, each declaring the objects it reads and writes. As the animation progresses, the rd and wr declarations accumulate, showing how the implementation tracks each task's accesses to shared objects.)
Result
• At Each Point in the Execution
  • A Collection of Enabled Tasks
  • Each Task Has an Access Specification
• Jade Implementation
  • Exploits Information in Access Specifications to Apply Communication Optimizations
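The enabled-task collection above can be sketched in Python. This is a hypothetical model, not the Jade implementation: an access specification maps each shared object to 'rd' or 'wr', two tasks conflict when they share an object and at least one writes it, and a task is enabled once every earlier conflicting task has completed. The function names and dictionary encoding are illustrative assumptions.

```python
def conflicts(spec_a, spec_b):
    """spec: dict mapping object id -> 'rd' or 'wr'.
    Tasks conflict if they share an object and at least one writes it."""
    for obj, mode_a in spec_a.items():
        mode_b = spec_b.get(obj)
        if mode_b is not None and 'wr' in (mode_a, mode_b):
            return True
    return False

def enabled_tasks(specs, completed):
    """specs: access specifications in serial program order.
    A task is enabled once every earlier conflicting task has completed."""
    enabled = []
    for i, spec in enumerate(specs):
        if i in completed:
            continue
        if all(j in completed or not conflicts(specs[j], spec)
               for j in range(i)):
            enabled.append(i)
    return enabled
```

For example, a writer followed by two readers of the same object enables only the writer at first; once it completes, both readers become enabled together, since rd/rd accesses do not conflict.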
Message Passing Implementation • Model of Computation for Implementation • Implementation Overview • Communication Optimizations • Experimental Results for iPSC/860
Model of Computation
• Each Processor Has a Private Memory
• Processors Communicate by Sending Messages through the Network
(Diagram: processors, each with a private memory, connected by a network.)
Implementation Overview Distributes Objects Across Memories
Implementation Overview
Assigns Enabled Tasks to Idle Processors
Implementation Overview
Transfers Objects to the Accessing Processor
• Replicates Objects that the Task will Read
Implementation Overview
Transfers Objects to the Accessing Processor
• Migrates Objects that the Task will Write
Implementation Overview
When All Remote Objects Arrive, the Task Executes
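The transfer policy on the preceding slides can be sketched as a small object directory. This is a hypothetical sketch of the policy, not Jade's actual data structures: objects a task only reads are replicated to the executing processor, while objects it writes are migrated, invalidating the other copies.

```python
class ObjectDirectory:
    """Tracks which processors hold a copy of each shared object."""
    def __init__(self):
        self.copies = {}   # object id -> set of processors holding a copy

    def allocate(self, obj, proc):
        self.copies[obj] = {proc}

    def prepare_task(self, spec, proc):
        """spec: dict object id -> 'rd' or 'wr'; proc: executing processor."""
        for obj, mode in spec.items():
            if mode == 'rd':
                self.copies[obj].add(proc)   # replicate: readers coexist
            else:
                self.copies[obj] = {proc}    # migrate: single writer

d = ObjectDirectory()
d.allocate('grid', 0)
d.prepare_task({'grid': 'rd'}, 1)   # processors 0 and 1 both hold 'grid'
d.prepare_task({'grid': 'wr'}, 2)   # only processor 2 holds it now
```

Replication is what lets multiple readers proceed concurrently, while migration on writes keeps a single current copy.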
Optimization Goals and Mechanisms
• Adaptive Broadcast. Goal: Parallelize Communication. Mechanism: Broadcast Each New Version of Widely Accessed Objects.
• Replication. Goal: Enable Tasks to Concurrently Read the Same Data. Mechanism: Replicate Data on Reading Processors.
• Latency Hiding. Goal: Overlap Computation and Communication. Mechanism: Assign Multiple Enabled Tasks to the Same Processor.
• Concurrent Fetch. Goal: Parallelize Communication. Mechanism: Concurrently Transfer the Remote Objects that a Task will Access.
• Locality. Goal: Eliminate Communication. Mechanism: Execute Tasks on Processors that Have Locally Available Copies of Accessed Objects.
Application-Based Evaluation
• Water: Evaluates forces and potentials in a system of liquid water molecules
• String: Computes a velocity model of the geology between two oil wells
• Ocean: Simulates the role of eddy and boundary currents in influencing large-scale ocean movements
• Panel Cholesky: Sparse Cholesky factorization algorithm
Impact of Communication Optimizations
(Table: for each application (Water, String, Ocean, Panel Cholesky), the impact of Adaptive Broadcast, Replication, Latency Hiding, and Concurrent Fetch is marked + for significant impact or - for negligible impact; a further marking indicates where an optimization is required to expose concurrency.)
Locality Optimization • Integrated into Online Scheduler • Scheduler • Maintains Pool of Enabled Tasks • Maintains Pool of Idle Processors • Balances Load by Assigning Enabled Tasks to Idle Processors • Locality Algorithm Affects the Assignment
Locality Concepts
• Each Object Has an Owner: the last processor to write the object. The owner has a current copy of the object.
• Each Task Has a Locality Object: currently the first object in its access specification.
• The Locality Object Determines the Target Processor: the owner of the locality object.
• Goal: Execute each task on its target processor.
When a Task Becomes Enabled
• Scheduler Checks the Pool of Idle Processors
• If the Target Processor is Idle: the Target Processor Gets the Task
• If Some Other Processor is Idle: that Processor Gets the Task
• If No Processor is Idle: the Task is Held in the Pool of Enabled Tasks
When a Processor Becomes Idle
• Scheduler Checks the Pool of Enabled Tasks
• If the Processor is the Target of an Enabled Task: the Processor Gets That Task
• If Other Enabled Tasks Exist: the Processor Gets One of Those Tasks
• If No Enabled Tasks Exist: the Processor Stays Idle
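The two scheduling rules above can be sketched together in Python. This is a hypothetical sketch of the policy, not the actual scheduler: prefer the target processor when matching enabled tasks with idle processors, but fall back to any idle processor or any enabled task so that no processor stays idle while work is available.

```python
def task_enabled(task, target, idle, pool):
    """task: task id; target: its target processor; idle: set of idle
    processors; pool: list of (task, target) pairs of held enabled tasks.
    Returns the processor assigned, or None if the task is held."""
    if target in idle:
        idle.remove(target)
        return target              # target processor gets the task
    if idle:
        return idle.pop()          # some other idle processor gets it
    pool.append((task, target))    # no processor idle: hold the task
    return None

def processor_idle(proc, pool, idle):
    """Returns the task assigned to proc, or None if it stays idle."""
    for i, (task, target) in enumerate(pool):
        if target == proc:
            return pool.pop(i)[0]  # proc is the target of this task
    if pool:
        return pool.pop(0)[0]      # otherwise take any enabled task
    idle.add(proc)                 # no enabled tasks: stay idle
    return None
```

The asymmetry matters: locality only biases the matching, so load balance is preserved whenever targets and idle processors do not line up.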
Implementation Versions
• Locality: Implementation Uses the Locality Algorithm
• No Locality: First-Come, First-Served Assignment of Enabled Tasks to Idle Processors
• Task Placement (Ocean and Panel Cholesky): Programmer Assigns Tasks to Processors
(Figure: Percentage of Tasks Executed on Target Processor on iPSC/860, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
(Figure: Communication to Useful Computation Ratio on iPSC/860 (Mbytes/Second/Processor), plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
(Figure: Speedup on iPSC/860, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
Shared Memory Implementation • Model of Computation • Locality Optimization • Locality Performance Results
Model of Computation
• Single Shared Memory Composed of Memory Modules
• Each Memory Module Associated with a Processor
• Each Object Allocated in a Memory Module
• Processors Communicate by Reading and Writing Objects in the Shared Memory
(Diagram: processors accessing objects in a shared memory composed of per-processor memory modules.)
Locality Algorithm • Integrated into Online Scheduler • Scheduler Runs Distributed Task Queue • Each Processor Has a Queue of Enabled Tasks • Idle Processors Search Task Queues • Locality Algorithm Affects Task Queue Algorithm
Locality Concepts
• Each Object Has an Owner: the processor associated with the memory module that holds the object. Accesses to the object from this processor are satisfied from the local memory module.
• Each Task Has a Locality Object: currently the first object in its access specification.
• The Locality Object Determines the Target Processor: the owner of the locality object.
• Goal: Execute each task on its target processor.
When a Processor Becomes Idle
• If Its Task Queue is Not Empty: Execute the First Task in Its Task Queue
• Otherwise, Cyclically Search the Other Task Queues
  • If a Remote Task Queue is Not Empty: Execute the Last Task in That Queue
When a Task Becomes Enabled
• Locality Algorithm Inserts the Task into the Task Queue at the Owner of Its Locality Object
• Tasks with the Same Locality Object are Adjacent in the Queue
• Goals:
  • Enhance memory locality by executing each task on the owner of its locality object.
  • Enhance cache locality by executing tasks with the same locality object consecutively on the same processor.
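The distributed task queue described on the two slides above can be sketched in Python. This is a hypothetical sketch, not the actual scheduler: a task is enqueued at the owner of its locality object, an idle processor dequeues from the front of its own queue, and otherwise steals from the back of remote queues found by cyclic search.

```python
from collections import deque

class DistributedQueues:
    """One task queue per processor; locality-guided insertion, stealing."""
    def __init__(self, nprocs):
        self.queues = [deque() for _ in range(nprocs)]

    def enqueue(self, task, owner):
        # Insert at the owner of the task's locality object.
        self.queues[owner].append(task)

    def next_task(self, proc):
        if self.queues[proc]:
            return self.queues[proc].popleft()    # local work first
        n = len(self.queues)
        for k in range(1, n):                     # cyclic search
            victim = (proc + k) % n
            if self.queues[victim]:
                return self.queues[victim].pop()  # steal from the tail
        return None                               # no enabled tasks: idle
```

Stealing from the tail takes the task least likely to be cache-resident at the victim, while the victim keeps consuming tasks with the same locality object from the front.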
Evaluation • Same Set of Applications • Water • String • Ocean • Panel Cholesky • Same Locality Versions • Locality • No Locality (Single Task Queue) • Explicit Task Placement (Ocean and Panel Cholesky)
(Figure: Percentage of Tasks Executed on Target Processor on DASH, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
(Figure: Task Execution Time on DASH, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)
(Figure: Speedup on DASH, plotted against processor count (0 to 32) for Water, String, Ocean, and Panel Cholesky; curves compare the locality, no locality, and task placement versions.)