10 likes | 164 Views
Dynamic Task Parallelism with a GPU Work-Stealing Runtime System. Results. Parallel computing is a necessity GPU is a promising example of heterogeneous hardware Hundreds of cores Memory bandwidth NVIDIA’s CUDA makes general purpose programming on GPUs possible, but not easy
E N D
Dynamic Task Parallelism with a GPU Work-Stealing Runtime System Results Parallel computing is a necessity GPU is a promising example of heterogeneous hardware • Hundreds of cores • Memory bandwidth NVIDIA’s CUDA makes general purpose programming on GPUs possible, but not easy Lots of expert tuning still necessary for: • Data transfer management • Multi-GPU applications • Dynamic task parallelism Background Crypt • Demonstrates low runtime overhead Nqueens • Static subtree assignment shows 2x as many tasks in one subtree as another Quicksort • Poor pivot selection results in unbalanced computation tree Dijstra Shortest Path Algorithm • Any new tasks must be spread across all other SMs Results Test each benchmark with multiple GPUs Diagnostic data allows us to measure how effectively tasks are being balanced between SMs Device: NVIDIA Tesla C2050 • 14 multiprocessor, 2.8 GB global memory, 1.15 GHz • CUDA 3.2 Host: 4 Quad Core AMD CPUs • 2.5 GHz Acknowledgements Approach Work stealing runtime which supports dynamic task parallelism, on hardware intended for data parallelism Showed effectiveness of device work stealing queues in distributing work between device SMs Our runtime incurs little overhead, and applications already optimized for CUDA demonstrate similar performance Provide a simpler runtime API to the device, though work remains to be done integrating the runtime with a true programming model Future Work Investigate more advanced memory management systems (reuse device memory, use constant memory for read-only data, etc) Saw copying data back from device took disproportionate amount of time in the runtime, compared to hand coded. We will investigate the cause Optimize device code by trying to limit cache-volatile operations and memory fences Consider using warps of threads rather than blocks of threads as execution unit Use worker blocks of different dimensions and allocate tasks to optimal worker blocks based on resources required by each kernel Fully integrate this runtime with the CnC-HC programming model being developed at Rice University Habanero Multicore Software Group. http://habanero.rice.edu CnC-CUDA: Declarative Programming for GPU’s. Max Grossman, AlinaSimion, ZoranBudimlić, VivekSarkar. 2010 Workshop on Languages and Compilers for Parallel Computing (LCPC), October 2010. Dynamic Task Parallelism with a GPU Work-Stealing Runtime System. Sanjay Chatterjee, Max Grossman, VivekSarkar. In Progress for LCPC 2011. GPU-Quicksort: A practical Quicksort algorithm for graphics processors. Cederman, Daniel and Tsigas, Philippas. J. Exp. Algorithmics. December 2009. Manage task execution on the GPU device using a hybrid work-stealing/work-sharing runtime • Host system places work for the GPU and in a work-sharing deque • Device consumes work from host and balances load across SMs by maintaining one work-stealing deque per SM, while dynamically creating new tasks on the device Hide device memory allocation and communication from user • Provide simpler runtime API for device memory management • Use asynchronous memcpy’s Manage multiple CUDA devices for the user, distributing tasks evenly across devices Provide domain expert with an abstraction of CUDA • CnC programming model: Max Grossman Rice University Advisor: Dr. VivekSarkar Conclusion <tag> (cpu_step) [out_item] [in_item] {gpu_step} {gpu_step} {gpu_step} {gpu_step} {gpu_step}