Task Based Execution of GPU Applications with Dynamic Data Dependencies Mehmet E. Belviranli, Chih H. Chou, Laxmi N. Bhuyan, Rajiv Gupta
GP-GPU Computing • GPUs enable high-throughput execution of data- and compute-intensive computations • Data is partitioned into a grid of “Thread Blocks” (TBs) • Thousands of TBs in a grid can be executed in any order • No HW support for efficient inter-TB communication • High scalability & throughput for independent data • Challenging & inefficient for inter-TB dependent data
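A minimal sketch (my own illustration, not from the talk) of the standard CUDA model the slides assume: data split across a grid of thread blocks that the hardware may schedule in any order, with no efficient inter-TB communication.

// Minimal sketch of the baseline CUDA model: a grid of thread blocks (TBs),
// each handling an independent tile of data; kernel name and sizes are
// illustrative only.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    // Each TB covers one contiguous chunk; TBs may execute in any order.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;          // independent work scales well
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // grid of TBs
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}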
The Problem • Data-dependent & irregular applications • Simulations (n-body, heat) • Graph algorithms (BFS, SSSP) • Inter-TB synchronization • Sync through global memory • Irregular task graphs • Static partitioning fails • Heterogeneous execution • Unbalanced distribution [Figure: data dependency graph]
The Solution • “Task based execution” • Transition from SIMD -> MIMD
Challenges • Breaking applications into tasks • Task-to-SM assignment • Dependency tracking • Inter-SM communication • Load balancing
Proposed Task Based Execution Framework • Persistent worker TBs (per SM) • Distributed task queues (per SM) • In-GPU dependency tracking & scheduling • Load balancing via different queue insertion policies
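A rough CUDA sketch (my illustration, not the authors' code) of the persistent-worker idea: one long-lived worker TB per SM repeatedly pops task pointers from its SM-local queue until a termination flag is raised. Task, TaskQueue, dequeue_task, execute_task, and done_flag are all assumed names.

// Persistent worker TB loop; single consumer per queue, producer is the scheduler.
#define QUEUE_SIZE 1024

struct Task { int id; /* application-specific payload */ };

struct TaskQueue {                 // one queue per SM, kept in global memory
    Task *slots[QUEUE_SIZE];
    int   head;                    // advanced by this SM's worker (consumer)
    int   tail;                    // advanced by the scheduler (producer)
};

__device__ Task *dequeue_task(TaskQueue *q) {
    __shared__ Task *t;
    if (threadIdx.x == 0) {
        t = nullptr;
        if (q->head < *(volatile int *)&q->tail) {     // new work signaled
            t = q->slots[q->head % QUEUE_SIZE];
            q->head++;                                 // single consumer per queue
        }
    }
    __syncthreads();               // all worker threads see the same task
    return t;
}

__device__ void execute_task(Task *t) { /* application-defined task body */ }

__global__ void worker_kernel(TaskQueue *queues, volatile int *done_flag) {
    TaskQueue *my_q = &queues[blockIdx.x];   // one persistent worker TB per SM
    while (!*done_flag) {
        Task *t = dequeue_task(my_q);
        if (t) execute_task(t);
        __syncthreads();                     // keep the TB in lockstep per task
    }
}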
Overview • (1) Grab a ready task • (2) Queue • (3) Retrieve & execute • (4) Output • (5) Resolve dependencies • (6) Grab new
Concurrent Worker & Scheduler • [Figure: worker and scheduler code paths running concurrently]
Queue Access & Dependency Tracking • IQS and OQS • Efficient signaling mechanism via global memory • Parallel task pointer retrieval • Queues store pointers to tasks • Parallel dependency check
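An illustrative sketch (not the paper's code) of how the parallel dependency check could look: each task carries a counter of unresolved parents; scheduler threads decrement the counters of a finished task's children in parallel and enqueue any child that reaches zero. All field and function names are assumptions; TaskQueue and enqueue_ready stand in for the per-SM queues sketched above.

// Parallel dependency resolution over finished tasks.
struct TaskQueue;                                       // per-SM queue, as above

struct Task {
    int num_unresolved;          // parents that have not finished yet
    int num_children;
    int children[8];             // indices of dependent tasks
};

__device__ void enqueue_ready(TaskQueue *q, Task *t);   // assumed helper

__global__ void scheduler_resolve(Task *tasks, const int *finished,
                                  int n_finished, TaskQueue *queues,
                                  int n_queues) {
    // One thread per finished task; its children are checked in a loop.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_finished) return;
    Task *done = &tasks[finished[i]];
    for (int c = 0; c < done->num_children; ++c) {
        Task *child = &tasks[done->children[c]];
        // The last parent to finish sees the counter drop to zero.
        if (atomicSub(&child->num_unresolved, 1) == 1)
            enqueue_ready(&queues[i % n_queues], child);
    }
}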
Queue Insertion Policy • Round robin: better load balancing, poor cache locality • Tail submit [J. Hoogerbrugge et al.]: the first child task is always processed by the same SM as its parent; increased locality
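A sketch of the two insertion policies (illustrative names, not the paper's API), reusing the Task/TaskQueue shapes from the sketches above: round robin spreads child tasks across all SM queues for balance, while tail submit keeps the first child on the parent's own SM queue for locality.

struct Task;
struct TaskQueue;

__device__ void enqueue(TaskQueue *q, Task *t);          // assumed helper

__device__ void insert_round_robin(TaskQueue *queues, int n_queues,
                                   Task **children, int n, int *next_q) {
    for (int c = 0; c < n; ++c)                          // queues t, t+1, t+2, ...
        enqueue(&queues[atomicAdd(next_q, 1) % n_queues], children[c]);
}

__device__ void insert_tail_submit(TaskQueue *queues, int n_queues,
                                   Task **children, int n,
                                   int parent_sm, int *next_q) {
    for (int c = 0; c < n; ++c) {
        // First child stays on the parent's SM; remaining children spread out.
        int target = (c == 0) ? parent_sm : atomicAdd(next_q, 1) % n_queues;
        enqueue(&queues[target], children[c]);
    }
}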
API • user_task is called by worker_kernel • Application-specific data is added under WorkerContext and Task
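A hedged sketch of how the described API shape might look: the names user_task, worker_kernel, WorkerContext, and Task come from the slide, but every field and signature below is an assumption for illustration.

// User-provided task body, executed cooperatively by all threads of the
// persistent worker TB; application-specific data lives in the two structs.
struct Task {
    int    id;
    float *tile;          // application-specific data added by the user
    int    tile_w, tile_h;
};

struct WorkerContext {
    int    sm_id;         // which SM / queue this worker serves
    float  dt;            // application-specific parameters
};

__device__ void user_task(WorkerContext *ctx, Task *task) {
    for (int i = threadIdx.x; i < task->tile_w * task->tile_h; i += blockDim.x)
        task->tile[i] += ctx->dt;     // placeholder computation
}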
Experimental Results • NVIDIA Tesla C2050: 14 SMs, 3 GB memory • Applications: • Heat 2D: simulation of heat dissipation over a 2D surface • BFS: breadth-first search • Comparison: central queue vs. distributed queues
Applications • Heat 2D: regular dependencies, wavefront parallelism • Each tile is a task; both intra-tile and inter-tile parallelism
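A host-side sketch (an illustration, not the paper's code) of how the Heat 2D wavefront task graph could be built, assuming the usual wavefront pattern where tile (i, j) depends on tiles (i-1, j) and (i, j-1).

// One task per tile; num_unresolved counts unfinished parent tiles.
#include <vector>

struct TileTask {
    int num_unresolved;            // parents not yet finished
    std::vector<int> children;     // dependent tile tasks
};

std::vector<TileTask> build_heat2d_graph(int tiles_x, int tiles_y) {
    std::vector<TileTask> tasks(tiles_x * tiles_y);
    auto id = [&](int i, int j) { return j * tiles_x + i; };
    for (int j = 0; j < tiles_y; ++j)
        for (int i = 0; i < tiles_x; ++i) {
            tasks[id(i, j)].num_unresolved = (i > 0) + (j > 0);  // left + upper
            if (i > 0) tasks[id(i - 1, j)].children.push_back(id(i, j));
            if (j > 0) tasks[id(i, j - 1)].children.push_back(id(i, j));
        }
    return tasks;                  // tile (0, 0) is the only initially ready task
}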
Applications • BFS: irregular dependencies • The unreached neighbors of a node form a task
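A device-side sketch (illustration only) of the BFS tasking idea: when a node's task runs, its not-yet-visited neighbors are gathered into a new task. The CSR arrays (row_ptr, col_idx) and the submit_task helper are assumptions.

__device__ void submit_task(int *nodes, int count);      // assumed helper

__device__ void bfs_node_task(const int *row_ptr, const int *col_idx,
                              int *visited, int node,
                              int *out_nodes, int *out_count) {
    // Worker TB threads scan the node's adjacency list in parallel.
    for (int e = row_ptr[node] + threadIdx.x; e < row_ptr[node + 1];
         e += blockDim.x) {
        int nbr = col_idx[e];
        if (atomicExch(&visited[nbr], 1) == 0) {          // first visit wins
            int slot = atomicAdd(out_count, 1);
            out_nodes[slot] = nbr;                        // joins the new task
        }
    }
    __syncthreads();
    if (threadIdx.x == 0 && *out_count > 0)
        submit_task(out_nodes, *out_count);               // enqueue child task
}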
Future Work • S/W support for: • Better task representation • More task insertion policies • Automated task graph partitioning for higher SM utilization
Future Work • H/W support for: • Fast inter-TB sync • Support for TB-to-SM affinity • “Sleep” support for TBs
Conclusion • Transition from SIMD -> MIMD • Task-based execution model • Per-SM task assignment • In-GPU dependency tracking • Locality-aware queue management • Room for improvement with added HW and SW support