Last time: Runtime infrastructure for hybrid (GPU-based) platforms • Task scheduling • Extracting performance models at runtime • Memory management • Asymmetric Distributed Shared Memory
• StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, Cédric Augonnet, Samuel Thibault, and Raymond Namyst. TR-7240, INRIA, March 2010. [link]
• An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems, Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro, Wen-mei Hwu. ASPLOS’10 [pdf]
Today: • Bridging runtime and language support • ‘Virtualizing GPUs’
• Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, Jungwon Kim, Honggyu Kim, Joo Hwan Lee, Jaejin Lee. PPoPP’11 [pdf]
• Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework, Vignesh T. Ravi et al. HPDC 2011 (best paper!)
Context: clouds shift to support HPC applications • initially, tightly coupled applications were not well suited to cloud platforms • today: • a Chinese cloud with 40 Gbps InfiniBand • Amazon HPC instances • GPU instances: Amazon, Nimbix • Challenge: make GPUs shared resources in the cloud.
Challenge: make GPUs a shared resource in the cloud. • Why do this? • GPUs are costly resources • Multiple VMs on a node share a single GPU • Increase utilization • app level: some apps might not use GPUs much • kernel level: some kernels can be collocated
Two streams of discussion • How? • Evaluation • opportunities • gains • overheads
1. The ‘How?’ • Preamble: Concurrent kernels are supported by today’s GPUs • Each kernel can execute a different task • Tasks can be mapped to different streaming multiprocessors (using the thread-block configuration) • Problem: concurrent execution is limited to the set of kernels invoked within a single process context • Past virtualization solutions • API rerouting / intercept library
1. The ‘How?’ • Architecture
2. Evaluation – The opportunity • Key assumption: under-utilization of GPUs • Sharing • Space-sharing • Kernels occupy different SMs • Time-sharing • Kernels time-share the same SMs (benefiting from hardware support for context switches) • Note: resource conflicts may prevent this • Molding – change the kernel configuration (a different number of thread blocks / threads per block) to improve collocation
Discussion • Limitations • Hardware support
OpenCL vs. CUDA • http://ft.ornl.gov/doku/shoc/level1 • http://ft.ornl.gov/pubs-archive/shoc.pdf