On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology
Overview • Motivation • Methods • Experimental evaluation • Conclusion
The problem setting [diagram: the work consists of a set of tasks, some known offline and some created online]
Static Load Balancing [animation: four processors are each statically assigned one task; the tasks later spawn subtasks on the processors that created them]
Dynamic Load Balancing [diagram: the same processors, tasks, and subtasks, with work redistributed between processors at runtime]
Task sharing [flowchart: check whether all work is done; if not, try to acquire a task from the task set; if no task was obtained, retry; if a task was obtained, perform it, add any newly created tasks to the task set, and continue]
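A minimal sketch of this loop, as one thread block might execute it in CUDA. The dequeue, enqueue, and doWork helpers are assumptions standing in for one of the load-balancing methods below and for the application-specific work; none of the names come from the paper.

```cuda
#include <cuda_runtime.h>

// Hypothetical task type; in the paper the tasks are e.g. octree cells
// or game-tree nodes.
struct Task { int data; };

// Assumed helpers: the first two stand in for any of the load-balancing
// methods presented later, the third for the application work.
extern __device__ bool dequeue(void* taskSet, Task* out);
extern __device__ void enqueue(void* taskSet, const Task& t);
extern __device__ int  doWork(const Task& t, Task* newTasks);

// The task-sharing loop, run by every thread block.
__global__ void taskSharingLoop(void* taskSet, volatile int* workDone)
{
    __shared__ Task task;
    __shared__ bool gotTask;

    while (!*workDone) {                           // "Work done?"
        if (threadIdx.x == 0)
            gotTask = dequeue(taskSet, &task);     // "Try to get task"
        __syncthreads();

        if (gotTask) {                             // "Got task?"
            Task newTasks[8];
            int n = (threadIdx.x == 0)
                        ? doWork(task, newTasks)   // "Perform task"
                        : 0;
            for (int i = 0; i < n; ++i)
                enqueue(taskSet, newTasks[i]);     // "Add task"
        }
        __syncthreads();                           // retry / continue
    }
}
```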
System Model • CUDA • Global Memory • Gather and scatter • Compare-And-Swap • Fetch-And-Inc • Multiprocessors • Maximum number of concurrent thread blocks [diagram: thread blocks running on multiprocessors that share global memory]
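The two primitives named on the slide map directly onto CUDA atomics on global memory; a small illustration (kernel and variable names are mine, not from the paper):

```cuda
#include <cuda_runtime.h>

// The two primitives used by the load-balancing methods, as exposed by
// CUDA on global memory.
__global__ void primitives(unsigned int* counter, unsigned int* slot)
{
    // Fetch-And-Inc: atomically read and bump a shared counter, e.g.
    // to claim a unique index into a task array.
    unsigned int myIndex = atomicAdd(counter, 1u);

    // Compare-And-Swap: write myIndex into *slot only if it still
    // holds the expected value 0; the old value is returned either way.
    unsigned int old = atomicCAS(slot, 0u, myIndex);
    if (old == 0u) {
        // This thread's CAS succeeded; it now owns the slot.
    }
}
```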
Synchronization • Blocking • Uses mutual exclusion to allow only one process at a time to access the object. • Lock-free • Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps. • Wait-free • Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
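A tiny, hypothetical contrast between the first two guarantees on a plain counter; the function names are illustrative only, not code from the paper.

```cuda
#include <cuda_runtime.h>

// Blocking: a spinlock serializes access; if the lock holder is
// stalled, no other thread can make progress on the counter.
__device__ void incrementBlocking(int* lock, int* value)
{
    while (atomicCAS(lock, 0, 1) != 0) { /* spin until the lock is free */ }
    *value += 1;
    __threadfence();          // make the update visible before releasing
    atomicExch(lock, 0);      // release the lock
}

// Lock-free: every thread retries a Compare-And-Swap; a failed CAS
// means some other thread succeeded, so the system always advances.
__device__ void incrementLockFree(int* value)
{
    int old = *value;
    int assumed;
    do {
        assumed = old;
        old = atomicCAS(value, assumed, assumed + 1);
    } while (old != assumed);
}
```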
Load Balancing Methods • Blocking Task Queue • Non-blocking Task Queue • Task Stealing • Static Task List
Blocking queue [animation: a queue with Head and Tail pointers protected by a single lock; thread blocks TB 1…TB n take turns acquiring the lock (initially free) to insert or remove a task T1 while the others wait]
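A minimal sketch of an array-based blocking queue guarded by one spinlock, in the spirit of this slide. The struct layout, capacity, and function names are assumptions, and each operation is meant to be called by a single thread on behalf of its thread block.

```cuda
#include <cuda_runtime.h>

#define QSIZE 1024

struct Task { int data; };

// Array-based queue with Head/Tail indices, protected by one spinlock.
struct BlockingQueue {
    int          lock;          // 0 = free, 1 = taken
    unsigned int head;          // next task to remove
    unsigned int tail;          // next free slot
    Task         tasks[QSIZE];  // ring buffer of tasks
};

__device__ void lockQueue(BlockingQueue* q)
{
    while (atomicCAS(&q->lock, 0, 1) != 0) { /* spin until free */ }
}

__device__ void unlockQueue(BlockingQueue* q)
{
    __threadfence();            // make queue updates visible first
    atomicExch(&q->lock, 0);    // release
}

__device__ bool enqueue(BlockingQueue* q, Task t)
{
    lockQueue(q);
    bool ok = (q->tail - q->head) < QSIZE;     // space left?
    if (ok) {
        q->tasks[q->tail % QSIZE] = t;
        q->tail++;
    }
    unlockQueue(q);
    return ok;
}

__device__ bool dequeue(BlockingQueue* q, Task* out)
{
    lockQueue(q);
    bool ok = q->head != q->tail;              // anything queued?
    if (ok) {
        *out = q->tasks[q->head % QSIZE];
        q->head++;
    }
    unlockQueue(q);
    return ok;
}
```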
Non-blocking Queue [animation: thread blocks TB 1…TB n concurrently enqueue and dequeue tasks T1…T5 through the queue's Head and Tail pointers] • Reference: P. Tsigas and Y. Zhang, A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems [SPAA '01]
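To convey the lock-free retry pattern without reproducing the full algorithm, here is a greatly simplified enqueue on a bounded array: a failed Compare-And-Swap always means some other thread block made progress. The real Tsigas–Zhang queue adds lazy head/tail updates and ABA protection, so consult the reference above for the actual algorithm; all names below are illustrative.

```cuda
#include <cuda_runtime.h>

#define QSIZE  1024
#define EMPTY  0u            // sentinel: slot holds no task

struct NBQueue {
    unsigned int tail;           // index of the first free slot
    unsigned int slots[QSIZE];   // task ids (non-zero) or EMPTY
};

// Called by one thread on behalf of its thread block; task != EMPTY.
__device__ bool enqueue(NBQueue* q, unsigned int task)
{
    for (;;) {
        unsigned int t = q->tail;
        if (t >= QSIZE) return false;                    // out of space
        // Try to claim slot t. If another block claimed it first the
        // CAS returns a non-EMPTY value and we retry further along.
        if (atomicCAS(&q->slots[t], EMPTY, task) == EMPTY) {
            atomicCAS(&q->tail, t, t + 1);               // help advance tail
            return true;
        }
        atomicCAS(&q->tail, t, t + 1);                   // help, then retry
    }
}
```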
Task stealing [animation: each thread block TB 1…TB n keeps its own list of tasks, adding and removing work locally (T1, T4, T5); a block whose list runs empty steals a task from another block's list] • Reference: N. S. Arora, R. D. Blumofe, C. G. Plaxton, Thread Scheduling for Multiprogrammed Multiprocessors [SPAA '98]
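A simplified sketch of one work-stealing deque per thread block, in the spirit of the ABP scheduler referenced above: the owner pushes and pops at one end, thieves use Compare-And-Swap on the other. The full algorithm resolves the owner/thief race on the last element (and ABA on the head index) with extra care that is omitted here; all names are illustrative.

```cuda
#include <cuda_runtime.h>

#define DEQUE_SIZE 256

// One deque per thread block; the owning block works on one end,
// other blocks steal from the other end.
struct Deque {
    unsigned int head;               // thieves steal here (via CAS)
    unsigned int tail;               // the owner pushes/pops here
    unsigned int tasks[DEQUE_SIZE];  // task ids
};

// Owner adds a newly created task (called by one thread of the block).
__device__ void push(Deque* d, unsigned int task)
{
    d->tasks[d->tail % DEQUE_SIZE] = task;
    __threadfence();                 // publish the task before the index
    d->tail++;
}

// Owner takes back its most recently pushed task (LIFO end). The full
// ABP algorithm handles the race with a concurrent steal on the last
// element; that case is glossed over in this sketch.
__device__ bool pop(Deque* d, unsigned int* task)
{
    if (d->tail == d->head) return false;      // nothing left locally
    d->tail--;
    *task = d->tasks[d->tail % DEQUE_SIZE];
    return true;
}

// A thief takes the oldest task (FIFO end); only one thief can move
// head from h to h+1, so a failed CAS means someone else got it.
__device__ bool steal(Deque* d, unsigned int* task)
{
    unsigned int h = d->head;
    if (h == d->tail) return false;            // looks empty
    unsigned int t = d->tasks[h % DEQUE_SIZE];
    if (atomicCAS(&d->head, h, h + 1) == h) {
        *task = t;
        return true;
    }
    return false;    // lost the race; the caller may try another deque
}
```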
Static Task List [animation: tasks T1–T4 are read from a fixed In array, one per thread block TB 1–TB 4; newly created tasks T5–T7 are written to a separate Out array]
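A sketch of the static task list idea under my reading of the slide: each kernel launch consumes a fixed In array and appends new tasks to an Out array through an atomic counter, after which the host swaps the arrays and relaunches. The doWork helper and all names are assumptions, not the authors' code.

```cuda
#include <cuda_runtime.h>

struct Task { int data; };

// Assumed application-specific work: processes one task and writes up
// to 8 new subtasks into newTasks, returning how many were created.
extern __device__ int doWork(const Task& t, Task* newTasks);

__global__ void processStaticList(const Task* in, int inCount,
                                  Task* out, unsigned int* outCount)
{
    // One task per thread block: blockIdx.x indexes the In array.
    if ((int)blockIdx.x >= inCount) return;

    Task newTasks[8];
    int n = 0;
    if (threadIdx.x == 0)
        n = doWork(in[blockIdx.x], newTasks);

    // Claim slots in the Out array with Fetch-And-Inc and write the
    // new tasks there; they become the In array of the next launch.
    if (threadIdx.x == 0 && n > 0) {
        unsigned int base = atomicAdd(outCount, (unsigned int)n);
        for (int i = 0; i < n; ++i)
            out[base + i] = newTasks[i];
    }
}

// Host-side driver (sketch): swap the arrays and relaunch until no
// new tasks are produced.
//   while (inCount > 0) {
//       cudaMemset(d_outCount, 0, sizeof(unsigned int));
//       processStaticList<<<inCount, 64>>>(d_in, inCount, d_out, d_outCount);
//       cudaMemcpy(&inCount, d_outCount, sizeof(unsigned int),
//                  cudaMemcpyDeviceToHost);
//       std::swap(d_in, d_out);
//   }
```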
Octree Partitioning • Bandwidth bound
Four-in-a-row • Computation intensive
Graphics Processors • 8800GT: 14 multiprocessors, 57 GB/sec bandwidth • 9600GT: 8 multiprocessors, 57 GB/sec bandwidth