Task Based Execution of GPU Applications with Dynamic Data Dependencies Mehmet E. Belviranli, Chih H. Chou, Laxmi N. Bhuyan, Rajiv Gupta
GP-GPU Computing • GPUs enable high-throughput execution of data- and compute-intensive computations • Data is partitioned into a grid of “Thread Blocks” (TBs) • Thousands of TBs in a grid can be executed in any order • No HW support for efficient inter-TB communication • High scalability & throughput for independent data • Challenging & inefficient for inter-TB dependent data
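A minimal sketch (my own illustration, not from the talk) of the standard CUDA model the slides assume: data split across a grid of thread blocks that the hardware may schedule in any order, with no efficient inter-TB communication.

// Minimal sketch of the baseline CUDA model: a grid of thread blocks (TBs),
// each handling an independent tile of data; kernel name and sizes are
// illustrative only.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    // Each TB covers one contiguous chunk; TBs may execute in any order.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;          // independent work scales well
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc((void **)&d, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);   // grid of TBs
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}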
The Problem • Data-dependent & irregular applications • Simulations (n-body, heat) • Graph algorithms (BFS, SSSP) • Inter-TB synchronization • Sync through global memory • Irregular task graphs • Static partitioning fails • Heterogeneous execution • Unbalanced distribution [Figure: data dependency graph]
The Solution • “Task based execution” • Transition from SIMD -> MIMD
Challenges • Breaking applications into tasks • Task-to-SM assignment • Dependency tracking • Inter-SM communication • Load balancing
Proposed Task Based Execution Framework • Persistent worker TBs (per SM) • Distributed task queues (per SM) • In-GPU dependency tracking & scheduling • Load balancing via different queue insertion policies
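A rough CUDA sketch (my illustration, not the authors' code) of the persistent-worker idea: one long-lived worker TB per SM repeatedly pops task pointers from its SM-local queue until a termination flag is raised. Task, TaskQueue, dequeue_task, execute_task, and done_flag are all assumed names.

// Persistent worker TB loop; single consumer per queue, producer is the scheduler.
#define QUEUE_SIZE 1024

struct Task { int id; /* application-specific payload */ };

struct TaskQueue {                 // one queue per SM, kept in global memory
    Task *slots[QUEUE_SIZE];
    int   head;                    // advanced by this SM's worker (consumer)
    int   tail;                    // advanced by the scheduler (producer)
};

__device__ Task *dequeue_task(TaskQueue *q) {
    __shared__ Task *t;
    if (threadIdx.x == 0) {
        t = nullptr;
        if (q->head < *(volatile int *)&q->tail) {     // new work signaled
            t = q->slots[q->head % QUEUE_SIZE];
            q->head++;                                 // single consumer per queue
        }
    }
    __syncthreads();               // all worker threads see the same task
    return t;
}

__device__ void execute_task(Task *t) { /* application-defined task body */ }

__global__ void worker_kernel(TaskQueue *queues, volatile int *done_flag) {
    TaskQueue *my_q = &queues[blockIdx.x];   // one persistent worker TB per SM
    while (!*done_flag) {
        Task *t = dequeue_task(my_q);
        if (t) execute_task(t);
        __syncthreads();                     // keep the TB in lockstep per task
    }
}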
Overview • (1) Grab a ready task • (2) Queue • (3) Retrieve & execute • (4) Output • (5) Resolve dependencies • (6) Grab new
Concurrent Worker & Scheduler • [Figure: worker and scheduler code paths running concurrently]
Queue Access & Dependency Tracking • IQS and OQS • Efficient signaling mechanism via global memory • Parallel task pointer retrieval • Queues store pointers to tasks • Parallel dependency check
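An illustrative sketch (not the paper's code) of how the parallel dependency check could look: each task carries a counter of unresolved parents; scheduler threads decrement the counters of a finished task's children in parallel and enqueue any child that reaches zero. All field and function names are assumptions; TaskQueue and enqueue_ready stand in for the per-SM queues sketched above.

// Parallel dependency resolution over finished tasks.
struct TaskQueue;                                       // per-SM queue, as above

struct Task {
    int num_unresolved;          // parents that have not finished yet
    int num_children;
    int children[8];             // indices of dependent tasks
};

__device__ void enqueue_ready(TaskQueue *q, Task *t);   // assumed helper

__global__ void scheduler_resolve(Task *tasks, const int *finished,
                                  int n_finished, TaskQueue *queues,
                                  int n_queues) {
    // One thread per finished task; its children are checked in a loop.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_finished) return;
    Task *done = &tasks[finished[i]];
    for (int c = 0; c < done->num_children; ++c) {
        Task *child = &tasks[done->children[c]];
        // The last parent to finish sees the counter drop to zero.
        if (atomicSub(&child->num_unresolved, 1) == 1)
            enqueue_ready(&queues[i % n_queues], child);
    }
}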
Queue Insertion Policy • Round robin: better load balancing, poor cache locality • Tail submit [J. Hoogerbrugge et al.]: the first child task is always processed by the same SM as its parent; increased locality
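A sketch of the two insertion policies (illustrative names, not the paper's API), reusing the Task/TaskQueue shapes from the sketches above: round robin spreads child tasks across all SM queues for balance, while tail submit keeps the first child on the parent's own SM queue for locality.

struct Task;
struct TaskQueue;

__device__ void enqueue(TaskQueue *q, Task *t);          // assumed helper

__device__ void insert_round_robin(TaskQueue *queues, int n_queues,
                                   Task **children, int n, int *next_q) {
    for (int c = 0; c < n; ++c)                          // queues t, t+1, t+2, ...
        enqueue(&queues[atomicAdd(next_q, 1) % n_queues], children[c]);
}

__device__ void insert_tail_submit(TaskQueue *queues, int n_queues,
                                   Task **children, int n,
                                   int parent_sm, int *next_q) {
    for (int c = 0; c < n; ++c) {
        // First child stays on the parent's SM; remaining children spread out.
        int target = (c == 0) ? parent_sm : atomicAdd(next_q, 1) % n_queues;
        enqueue(&queues[target], children[c]);
    }
}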
API • user_task is called by worker_kernel • Application-specific data is added under WorkerContext and Task
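A hedged sketch of how the described API shape might look: the names user_task, worker_kernel, WorkerContext, and Task come from the slide, but every field and signature below is an assumption for illustration.

// User-provided task body, executed cooperatively by all threads of the
// persistent worker TB; application-specific data lives in the two structs.
struct Task {
    int    id;
    float *tile;          // application-specific data added by the user
    int    tile_w, tile_h;
};

struct WorkerContext {
    int    sm_id;         // which SM / queue this worker serves
    float  dt;            // application-specific parameters
};

__device__ void user_task(WorkerContext *ctx, Task *task) {
    for (int i = threadIdx.x; i < task->tile_w * task->tile_h; i += blockDim.x)
        task->tile[i] += ctx->dt;     // placeholder computation
}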
Experimental Results • NVIDIA Tesla C2050: 14 SMs, 3 GB memory • Applications: • Heat 2D: simulation of heat dissipation over a 2D surface • BFS: breadth-first search • Comparison: central queue vs. distributed queues
Applications • Heat 2D: regular dependencies, wavefront parallelism • Each tile is a task; both intra-tile and inter-tile parallelism
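A host-side sketch (an illustration, not the paper's code) of how the Heat 2D wavefront task graph could be built, assuming the usual wavefront pattern where tile (i, j) depends on tiles (i-1, j) and (i, j-1).

// One task per tile; num_unresolved counts unfinished parent tiles.
#include <vector>

struct TileTask {
    int num_unresolved;            // parents not yet finished
    std::vector<int> children;     // dependent tile tasks
};

std::vector<TileTask> build_heat2d_graph(int tiles_x, int tiles_y) {
    std::vector<TileTask> tasks(tiles_x * tiles_y);
    auto id = [&](int i, int j) { return j * tiles_x + i; };
    for (int j = 0; j < tiles_y; ++j)
        for (int i = 0; i < tiles_x; ++i) {
            tasks[id(i, j)].num_unresolved = (i > 0) + (j > 0);  // left + upper
            if (i > 0) tasks[id(i - 1, j)].children.push_back(id(i, j));
            if (j > 0) tasks[id(i, j - 1)].children.push_back(id(i, j));
        }
    return tasks;                  // tile (0, 0) is the only initially ready task
}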
Applications • BFS: irregular dependencies • The unreached neighbors of a node form a task
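A device-side sketch (illustration only) of the BFS tasking idea: when a node's task runs, its not-yet-visited neighbors are gathered into a new task. The CSR arrays (row_ptr, col_idx) and the submit_task helper are assumptions.

__device__ void submit_task(int *nodes, int count);      // assumed helper

__device__ void bfs_node_task(const int *row_ptr, const int *col_idx,
                              int *visited, int node,
                              int *out_nodes, int *out_count) {
    // Worker TB threads scan the node's adjacency list in parallel.
    for (int e = row_ptr[node] + threadIdx.x; e < row_ptr[node + 1];
         e += blockDim.x) {
        int nbr = col_idx[e];
        if (atomicExch(&visited[nbr], 1) == 0) {          // first visit wins
            int slot = atomicAdd(out_count, 1);
            out_nodes[slot] = nbr;                        // joins the new task
        }
    }
    __syncthreads();
    if (threadIdx.x == 0 && *out_count > 0)
        submit_task(out_nodes, *out_count);               // enqueue child task
}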
Future Work • S/W support for: • Better task representation • More task insertion policies • Automated task graph partitioning for higher SM utilization
Future Work • H/W support for: • Fast inter-TB sync • Support for TB-to-SM affinity • “Sleep” support for TBs
Conclusion • Transition from SIMD -> MIMD • Task-based execution model • Per-SM task assignment • In-GPU dependency tracking • Locality-aware queue management • Room for improvement with added HW and SW support