On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology
Overview • Motivation • Methods • Experimental evaluation • Conclusion
The problem setting [diagram: the work consists of a set of tasks, some known offline and some created online]
Static Load Balancing [animation: four processors are each statically assigned one task; the tasks later spawn subtasks on the processors that created them]
Dynamic Load Balancing [diagram: the same processors, tasks, and subtasks, with work redistributed between processors at runtime]
Task sharing [flowchart: check whether all work is done; if not, try to acquire a task from the task set; if no task was obtained, retry; if a task was obtained, perform it, add any newly created tasks to the task set, and continue]
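A minimal sketch of this loop, as one thread block might execute it in CUDA. The dequeue, enqueue, and doWork helpers are assumptions standing in for one of the load-balancing methods below and for the application-specific work; none of the names come from the paper.

```cuda
#include <cuda_runtime.h>

// Hypothetical task type; in the paper the tasks are e.g. octree cells
// or game-tree nodes.
struct Task { int data; };

// Assumed helpers: the first two stand in for any of the load-balancing
// methods presented later, the third for the application work.
extern __device__ bool dequeue(void* taskSet, Task* out);
extern __device__ void enqueue(void* taskSet, const Task& t);
extern __device__ int  doWork(const Task& t, Task* newTasks);

// The task-sharing loop, run by every thread block.
__global__ void taskSharingLoop(void* taskSet, volatile int* workDone)
{
    __shared__ Task task;
    __shared__ bool gotTask;

    while (!*workDone) {                           // "Work done?"
        if (threadIdx.x == 0)
            gotTask = dequeue(taskSet, &task);     // "Try to get task"
        __syncthreads();

        if (gotTask) {                             // "Got task?"
            Task newTasks[8];
            int n = (threadIdx.x == 0)
                        ? doWork(task, newTasks)   // "Perform task"
                        : 0;
            for (int i = 0; i < n; ++i)
                enqueue(taskSet, newTasks[i]);     // "Add task"
        }
        __syncthreads();                           // retry / continue
    }
}
```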
System Model • CUDA • Global Memory • Gather and scatter • Compare-And-Swap • Fetch-And-Inc • Multiprocessors • Maximum number of concurrent thread blocks [diagram: thread blocks running on multiprocessors that share global memory]
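The two primitives named on the slide map directly onto CUDA atomics on global memory; a small illustration (kernel and variable names are mine, not from the paper):

```cuda
#include <cuda_runtime.h>

// The two primitives used by the load-balancing methods, as exposed by
// CUDA on global memory.
__global__ void primitives(unsigned int* counter, unsigned int* slot)
{
    // Fetch-And-Inc: atomically read and bump a shared counter, e.g.
    // to claim a unique index into a task array.
    unsigned int myIndex = atomicAdd(counter, 1u);

    // Compare-And-Swap: write myIndex into *slot only if it still
    // holds the expected value 0; the old value is returned either way.
    unsigned int old = atomicCAS(slot, 0u, myIndex);
    if (old == 0u) {
        // This thread's CAS succeeded; it now owns the slot.
    }
}
```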
Synchronization • Blocking • Uses mutual exclusion to allow only one process at a time to access the object. • Lock-free • Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps. • Wait-free • Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
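A tiny, hypothetical contrast between the first two guarantees on a plain counter; the function names are illustrative only, not code from the paper.

```cuda
#include <cuda_runtime.h>

// Blocking: a spinlock serializes access; if the lock holder is
// stalled, no other thread can make progress on the counter.
__device__ void incrementBlocking(int* lock, int* value)
{
    while (atomicCAS(lock, 0, 1) != 0) { /* spin until the lock is free */ }
    *value += 1;
    __threadfence();          // make the update visible before releasing
    atomicExch(lock, 0);      // release the lock
}

// Lock-free: every thread retries a Compare-And-Swap; a failed CAS
// means some other thread succeeded, so the system always advances.
__device__ void incrementLockFree(int* value)
{
    int old = *value;
    int assumed;
    do {
        assumed = old;
        old = atomicCAS(value, assumed, assumed + 1);
    } while (old != assumed);
}
```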
Load Balancing Methods • Blocking Task Queue • Non-blocking Task Queue • Task Stealing • Static Task List
Blocking queue [animation: a queue with Head and Tail pointers protected by a single lock; thread blocks TB 1…TB n take turns acquiring the lock (initially free) to insert or remove a task T1 while the others wait]
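A minimal sketch of an array-based blocking queue guarded by one spinlock, in the spirit of this slide. The struct layout, capacity, and function names are assumptions, and each operation is meant to be called by a single thread on behalf of its thread block.

```cuda
#include <cuda_runtime.h>

#define QSIZE 1024

struct Task { int data; };

// Array-based queue with Head/Tail indices, protected by one spinlock.
struct BlockingQueue {
    int          lock;          // 0 = free, 1 = taken
    unsigned int head;          // next task to remove
    unsigned int tail;          // next free slot
    Task         tasks[QSIZE];  // ring buffer of tasks
};

__device__ void lockQueue(BlockingQueue* q)
{
    while (atomicCAS(&q->lock, 0, 1) != 0) { /* spin until free */ }
}

__device__ void unlockQueue(BlockingQueue* q)
{
    __threadfence();            // make queue updates visible first
    atomicExch(&q->lock, 0);    // release
}

__device__ bool enqueue(BlockingQueue* q, Task t)
{
    lockQueue(q);
    bool ok = (q->tail - q->head) < QSIZE;     // space left?
    if (ok) {
        q->tasks[q->tail % QSIZE] = t;
        q->tail++;
    }
    unlockQueue(q);
    return ok;
}

__device__ bool dequeue(BlockingQueue* q, Task* out)
{
    lockQueue(q);
    bool ok = q->head != q->tail;              // anything queued?
    if (ok) {
        *out = q->tasks[q->head % QSIZE];
        q->head++;
    }
    unlockQueue(q);
    return ok;
}
```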
Non-blocking Queue [animation: thread blocks TB 1…TB n concurrently enqueue and dequeue tasks T1…T5 through the queue's Head and Tail pointers] • Reference: P. Tsigas and Y. Zhang, A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems [SPAA '01]
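To convey the lock-free retry pattern without reproducing the full algorithm, here is a greatly simplified enqueue on a bounded array: a failed Compare-And-Swap always means some other thread block made progress. The real Tsigas–Zhang queue adds lazy head/tail updates and ABA protection, so consult the reference above for the actual algorithm; all names below are illustrative.

```cuda
#include <cuda_runtime.h>

#define QSIZE  1024
#define EMPTY  0u            // sentinel: slot holds no task

struct NBQueue {
    unsigned int tail;           // index of the first free slot
    unsigned int slots[QSIZE];   // task ids (non-zero) or EMPTY
};

// Called by one thread on behalf of its thread block; task != EMPTY.
__device__ bool enqueue(NBQueue* q, unsigned int task)
{
    for (;;) {
        unsigned int t = q->tail;
        if (t >= QSIZE) return false;                    // out of space
        // Try to claim slot t. If another block claimed it first the
        // CAS returns a non-EMPTY value and we retry further along.
        if (atomicCAS(&q->slots[t], EMPTY, task) == EMPTY) {
            atomicCAS(&q->tail, t, t + 1);               // help advance tail
            return true;
        }
        atomicCAS(&q->tail, t, t + 1);                   // help, then retry
    }
}
```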
Task stealing [animation: each thread block TB 1…TB n keeps its own list of tasks, adding and removing work locally (T1, T4, T5); a block whose list runs empty steals a task from another block's list] • Reference: N. S. Arora, R. D. Blumofe, C. G. Plaxton, Thread Scheduling for Multiprogrammed Multiprocessors [SPAA '98]
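A simplified sketch of one work-stealing deque per thread block, in the spirit of the ABP scheduler referenced above: the owner pushes and pops at one end, thieves use Compare-And-Swap on the other. The full algorithm resolves the owner/thief race on the last element (and ABA on the head index) with extra care that is omitted here; all names are illustrative.

```cuda
#include <cuda_runtime.h>

#define DEQUE_SIZE 256

// One deque per thread block; the owning block works on one end,
// other blocks steal from the other end.
struct Deque {
    unsigned int head;               // thieves steal here (via CAS)
    unsigned int tail;               // the owner pushes/pops here
    unsigned int tasks[DEQUE_SIZE];  // task ids
};

// Owner adds a newly created task (called by one thread of the block).
__device__ void push(Deque* d, unsigned int task)
{
    d->tasks[d->tail % DEQUE_SIZE] = task;
    __threadfence();                 // publish the task before the index
    d->tail++;
}

// Owner takes back its most recently pushed task (LIFO end). The full
// ABP algorithm handles the race with a concurrent steal on the last
// element; that case is glossed over in this sketch.
__device__ bool pop(Deque* d, unsigned int* task)
{
    if (d->tail == d->head) return false;      // nothing left locally
    d->tail--;
    *task = d->tasks[d->tail % DEQUE_SIZE];
    return true;
}

// A thief takes the oldest task (FIFO end); only one thief can move
// head from h to h+1, so a failed CAS means someone else got it.
__device__ bool steal(Deque* d, unsigned int* task)
{
    unsigned int h = d->head;
    if (h == d->tail) return false;            // looks empty
    unsigned int t = d->tasks[h % DEQUE_SIZE];
    if (atomicCAS(&d->head, h, h + 1) == h) {
        *task = t;
        return true;
    }
    return false;    // lost the race; the caller may try another deque
}
```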
Static Task List [animation: tasks T1–T4 are read from a fixed In array, one per thread block TB 1–TB 4; newly created tasks T5–T7 are written to a separate Out array]
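A sketch of the static task list idea under my reading of the slide: each kernel launch consumes a fixed In array and appends new tasks to an Out array through an atomic counter, after which the host swaps the arrays and relaunches. The doWork helper and all names are assumptions, not the authors' code.

```cuda
#include <cuda_runtime.h>

struct Task { int data; };

// Assumed application-specific work: processes one task and writes up
// to 8 new subtasks into newTasks, returning how many were created.
extern __device__ int doWork(const Task& t, Task* newTasks);

__global__ void processStaticList(const Task* in, int inCount,
                                  Task* out, unsigned int* outCount)
{
    // One task per thread block: blockIdx.x indexes the In array.
    if ((int)blockIdx.x >= inCount) return;

    Task newTasks[8];
    int n = 0;
    if (threadIdx.x == 0)
        n = doWork(in[blockIdx.x], newTasks);

    // Claim slots in the Out array with Fetch-And-Inc and write the
    // new tasks there; they become the In array of the next launch.
    if (threadIdx.x == 0 && n > 0) {
        unsigned int base = atomicAdd(outCount, (unsigned int)n);
        for (int i = 0; i < n; ++i)
            out[base + i] = newTasks[i];
    }
}

// Host-side driver (sketch): swap the arrays and relaunch until no
// new tasks are produced.
//   while (inCount > 0) {
//       cudaMemset(d_outCount, 0, sizeof(unsigned int));
//       processStaticList<<<inCount, 64>>>(d_in, inCount, d_out, d_outCount);
//       cudaMemcpy(&inCount, d_outCount, sizeof(unsigned int),
//                  cudaMemcpyDeviceToHost);
//       std::swap(d_in, d_out);
//   }
```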
Octree Partitioning • Bandwidth bound
Four-in-a-row • Computation intensive
Graphics Processors • 8800GT: 14 multiprocessors, 57 GB/sec bandwidth • 9600GT: 8 multiprocessors, 57 GB/sec bandwidth