Operating System Abstractions for GPU Programming
Chris Rossbach, Microsoft Research
Emmett Witchel, University of Texas at Austin
September 23, 2010
Motivation
• GPU application domains limited
• CUDA
  • Rich APIs/abstractions
  • Language integration → familiar environment
• Composition/integration into systems
  • Programmer-visible abstractions are expressive
  • Realization/implementations unsatisfactory
• Better OS-level abstractions are required
• (IP/business concerns aside)
Traditional OS-level abstractions

#include <stdio.h>

int main(int argc, char *argv[]) {
    FILE *fp = fopen("quack", "w");
    if (fp == NULL)
        fprintf(stderr, "failure\n");
    …
    return 0;
}

// How do I program just a CPU and a disk?
Stack: programmer-visible interface → OS-level abstractions → hardware interface
GPU Abstractions
• programmer-visible interface
• 1 OS-level abstraction!
The programmer gets to work with great abstractions… why is this a problem?
Poor OS abstractions limit GPUs
Doing fine without OS support:
• Gaming/graphics
  • shader languages
  • DirectX, OpenGL
• “GPU Computing”, née “GPGPU”
  • user-mode/batch
  • scientific algorithms
  • latency-tolerant
  • CUDA
But the application ecosystem is more diverse: no OS abstractions → no traction.
Interactive Applications
• Gestural interface
• Brain-computer interface
• Spatial audio
• Image recognition
Processing user input:
• needs low latency and concurrency
• must be multiplexed by the OS
• needs protection/isolation
Gestural Interface
Raw images → image filtering → point cloud → geometric transform → gesture recognition → “hand” events → HID input → OS
• High data rates
• Noisy input (“Ack! Noise!”)
• Data-parallel algorithms
What I wish I could do
#> catusb | xform | detect | hidinput &
• catusb: captures image data from USB
• xform: noise filtering, geometric transformation
• detect: extracts gestures from the point cloud
• hidinput: sends mouse events (or whatever)
xform and detect are data parallel; catusb and hidinput are inherently sequential.
Could parallelize on a CMP, but…
So use the GPU! (naïve approach)
#> catusb | xform | detect | hidinput &
• Run catusb on the CPU
• Run xform on the GPU
• Run detect on the GPU
• Run hidinput on the CPU
Use CUDA to write xform and detect!
Running a program on a GPU
• GPUs cannot run an OS: different ISA
• Disjoint memory space, no coherence*
• Host CPU must manage execution
• Program inputs explicitly bound at runtime
User-mode apps must implement the cycle themselves: copy inputs from main memory to GPU memory, send commands, copy outputs back.
Technology Stack View
User mode: catusb, xform, detect, hidinput → CUDA Runtime → User Mode Drivers (DXVA)
Kernel: OS Executive → Kernel Mode Drivers → HAL
Hardware: CPU (USB), GPU (run GPU kernel)
• 12 kernel crossings
• 6 copy_to_user
• 6 copy_from_user
• Performance tradeoffs for runtime/abstractions
So, big deal… do it all in the kernel
catusb, xform, detect, hidinput all run in the kernel: OS Executive → Kernel Mode Drivers → HAL → USB, CPU, GPU
• No CUDA, no high-level abstractions
• If you’re MS and/or nVidia, this might be tenable…
• Solution is specialized
• …but there is still a data migration problem
Hardware View
We’d prefer:
• catusb: USB bus → GPU memory
• xform, detect: no transfers
• hidinput: single GPU → main memory transfer
• if GPUs become coherent with main memory…
Today the data takes the long way around (USB → Southbridge → main memory via the Northbridge, then back out over PCI-e to the GPU): cache pollution, wasted bandwidth, wasted power.
The machine can do this; where are the interfaces?
Outline
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
  • CUDA streams
  • Asynchrony
  • GPUDirect™
  • OS guarantees
• New OS abstractions for GPUs
• Related work
• Conclusion
Doesn’t CUDA address these problems?
• CUDA streams, async: overlap capture/transfer/execution
• “Write-combining memory”: uncacheable, avoids polluting CPU caches
• Page-locked host memory: faster DMA
• Portable memory: share page-locked memory across contexts
• Mapped memory: map host memory into GPU space (transparent transfer, app-level upcalls)
• GPUDirect™
CUDA Streams
Overlap communication with computation.
• Stream X: Copy X0 → Kernel Xa → Kernel Xb → Copy X1
• Stream Y: Copy Y0 → Kernel Y → Copy Y1
The copy engine and the compute engine each work through their own queue, so one stream’s copies can overlap another stream’s kernels.
Streams: explicitly scheduled
cudaMemcpyAsync(X0…);
KernelXa<<<…>>>();
KernelXb<<<…>>>();
cudaMemcpyAsync(X1…);
cudaMemcpyAsync(Y0…);
KernelY<<<…>>>();
cudaMemcpyAsync(Y1…);
Resulting schedule: copy engine runs Copy X0, Copy X1, Copy Y0, Copy Y1; compute engine runs Kernel Xa, Kernel Xb, Kernel Y.
Each stream proceeds serially, different streams overlap; naïve program order eliminates potential concurrency.
Reorder code → better schedule
cudaMemcpyAsync(X0…);
KernelXa<<<…>>>();
KernelXb<<<…>>>();
cudaMemcpyAsync(Y0…);
KernelY<<<…>>>();
cudaMemcpyAsync(X1…);
cudaMemcpyAsync(Y1…);
Issuing Y0’s copy before X1’s lets stream Y’s work overlap stream X’s kernels.
• Order sensitive: applications must statically determine the order
• Couldn’t a scheduler with a global view do a better job dynamically?
• Our design can’t use this anyway: in “… xform | detect …”, the CUDA streams live in different processes with different address spaces, and would require additional IPC coordination
CUDA Asynchrony
Frames per second (higher is better) for CUDA, CUDA+async, CUDA+streams, and OS-supported.
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT230
H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication
GPUDirect™
• “Allows 3rd party devices to access CUDA memory” (eliminates a data copy)
• Great! But:
  • requires per-driver support
  • not just CUDA support!
  • no programmer-visible interface
• The OS can generalize this
The Elephant in the Room
Traditional OS guarantees:
• Fairness
• Isolation
No user-space runtime can provide these! It can support them… it cannot guarantee them.
CPU-bound processes hurt GPUs
Frames per second (higher is better): the CPU scheduler and the GPU scheduler are not integrated!
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT230
H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication
GPU-bound processes hurt CPUs
Flatter lines are better.
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT230
Meaningful “GPU Computing” implies GPUs should be managed like CPUs
• Process API analogues
• IPC API analogues
• Scheduler hint analogues
• Must integrate with existing interfaces
  • CUDA/DXGI/DirectX
  • DRI/DRM/OpenGL
Outline
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
• New OS abstractions for GPUs
• Related work
• Conclusion
Proposed OS abstractions
• ptask
  • Like a process or thread, but can exist without a user host process
  • An OS abstraction… not a full CPU process
  • Has a list of mappable input/output resources
• endpoint
  • Globally named kernel object
  • Can be mapped to ptask input/output resources
  • A data source or sink (e.g., a buffer in GPU memory)
• channel
  • Similar to a pipe
  • Connects arbitrary endpoints: 1:1, 1:M, M:1, N:M
  • Generalization of the GPUDirect™ mechanism
• Expand the system-call interface:
  • process API analogues
  • IPC API analogues
  • scheduler hints
Revised technology stack
• 1:1 correspondence between programmer-visible and OS abstractions
• Existing APIs can be built on top of the new OS abstractions
Gestural interface revisited
process: catusb → [usbsrc] → [rawimg] → ptask: xform → [cloud] → [g_input] → ptask: detect → [hands] → [hid_in] → process: hidinput
(endpoints in brackets; arrows are channels)
• Computation expressed as a graph
  • Synthesis [Massalin 89] (streams, pumps)
  • Dryad [Isard 07]
  • StreamIt [Thies 02]
  • Offcodes [Weinsberg 08]
  • others…
Gestural interface revisited
Same graph, with channels placed to eliminate unnecessary communication: a USB → GPU-memory channel feeds xform, and the channels between xform and detect stay entirely in GPU memory.
Gestural interface revisited
New data arriving on a channel triggers new computation.
• Eliminates unnecessary communication
• Eliminates user/kernel crossings and redundant computation
Early Results: potential benefit
Frames per second (higher is better): up to 10x and 3.9x improvements.
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT230
H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication
Outline
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
• New OS abstractions for GPUs
• Related work
• Conclusion
Related Work
• OS support for heterogeneous architectures
  • Helios [Nightingale 09]
  • Barrelfish [Baumann 09]
  • Offcodes [Weinsberg 08]
• Graph-based programming models
  • Synthesis [Massalin 89]
  • Monsoon/Id [Arvind]
  • Dryad [Isard 07]
  • StreamIt [Thies 02]
  • DirectShow
  • TCP offload [Currid 04]
• GPU computing
  • CUDA, OpenCL
Conclusions
• CUDA: the programming interface is right
  • but the OS must get involved
• Current interfaces waste data movement
• Current interfaces inhibit modularity/reuse
• Current interfaces cannot guarantee fairness or isolation
• OS-level abstractions are required

Questions?