Operating System Abstractions for GPU Programming
Chris Rossbach, Microsoft Research
Emmett Witchel, University of Texas at Austin
September 23, 2010
Motivation
• GPU application domains limited
• CUDA
  • Rich APIs/abstractions
  • Language integration → familiar environment
• Composition/integration into systems
  • Programmer-visible abstractions are expressive
  • Realization/implementations unsatisfactory
• Better OS-level abstractions are required
• (IP/business concerns aside)
Traditional OS-level abstractions

#include <stdio.h>

int main(int argc, char *argv[]) {
    FILE *fp = fopen("quack", "w");
    if (fp == NULL)
        fprintf(stderr, "failure\n");
    …
    return 0;
}

// How do I program just a CPU and a disk?
Stack: programmer-visible interface → OS-level abstractions → hardware interface
GPU Abstractions
• programmer-visible interface
• 1 OS-level abstraction!
The programmer gets to work with great abstractions… why is this a problem?
Poor OS abstractions limit GPUs
Doing fine without OS support:
• Gaming/graphics
  • shader languages
  • DirectX, OpenGL
• “GPU Computing”, née “GPGPU”
  • user-mode/batch
  • scientific algorithms
  • latency-tolerant
  • CUDA
But the application ecosystem is more diverse: no OS abstractions → no traction.
Interactive Applications
• Gestural interface
• Brain-computer interface
• Spatial audio
• Image recognition
Processing user input:
• needs low latency and concurrency
• must be multiplexed by the OS
• needs protection/isolation
Gestural Interface
Raw images → image filtering → point cloud → geometric transform → gesture recognition → “hand” events → HID input → OS
• High data rates
• Noisy input (“Ack! Noise!”)
• Data-parallel algorithms
What I wish I could do
#> catusb | xform | detect | hidinput &
• catusb: captures image data from USB
• xform: noise filtering, geometric transformation
• detect: extracts gestures from the point cloud
• hidinput: sends mouse events (or whatever)
xform and detect are data parallel; catusb and hidinput are inherently sequential.
Could parallelize on a CMP, but…
So use the GPU! (naïve approach)
#> catusb | xform | detect | hidinput &
• Run catusb on the CPU
• Run xform on the GPU
• Run detect on the GPU
• Run hidinput on the CPU
Use CUDA to write xform and detect!
Running a program on a GPU
• GPUs cannot run an OS: different ISA
• Disjoint memory space, no coherence*
• Host CPU must manage execution
• Program inputs explicitly bound at runtime
User-mode apps must implement the cycle themselves: copy inputs from main memory to GPU memory, send commands, copy outputs back.
Technology Stack View
User mode: catusb, xform, detect, hidinput → CUDA Runtime → User Mode Drivers (DXVA)
Kernel: OS Executive → Kernel Mode Drivers → HAL
Hardware: CPU (USB), GPU (run GPU kernel)
• 12 kernel crossings
• 6 copy_to_user
• 6 copy_from_user
• Performance tradeoffs for runtime/abstractions
So, big deal… do it all in the kernel
catusb, xform, detect, hidinput all run in the kernel: OS Executive → Kernel Mode Drivers → HAL → USB, CPU, GPU
• No CUDA, no high-level abstractions
• If you’re MS and/or nVidia, this might be tenable…
• Solution is specialized
• …but there is still a data migration problem
Hardware View
We’d prefer:
• catusb: USB bus → GPU memory
• xform, detect: no transfers
• hidinput: single GPU → main memory transfer
• if GPUs become coherent with main memory…
Today the data takes the long way around (USB → Southbridge → main memory via the Northbridge, then back out over PCI-e to the GPU): cache pollution, wasted bandwidth, wasted power.
The machine can do this; where are the interfaces?
Outline
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
  • CUDA streams
  • Asynchrony
  • GPUDirect™
  • OS guarantees
• New OS abstractions for GPUs
• Related work
• Conclusion
Doesn’t CUDA address these problems?
• CUDA streams, async: overlap capture/transfer/execution
• “Write-combining memory”: uncacheable, avoids polluting CPU caches
• Page-locked host memory: faster DMA
• Portable memory: share page-locked memory across contexts
• Mapped memory: map host memory into GPU space (transparent transfer, app-level upcalls)
• GPUDirect™
CUDA Streams
Overlap communication with computation.
• Stream X: Copy X0 → Kernel Xa → Kernel Xb → Copy X1
• Stream Y: Copy Y0 → Kernel Y → Copy Y1
The copy engine and the compute engine each work through their own queue, so one stream’s copies can overlap another stream’s kernels.
Streams: explicitly scheduled
cudaMemcpyAsync(X0…);
KernelXa<<<…>>>();
KernelXb<<<…>>>();
cudaMemcpyAsync(X1…);
cudaMemcpyAsync(Y0…);
KernelY<<<…>>>();
cudaMemcpyAsync(Y1…);
Resulting schedule: copy engine runs Copy X0, Copy X1, Copy Y0, Copy Y1; compute engine runs Kernel Xa, Kernel Xb, Kernel Y.
Each stream proceeds serially, different streams overlap; naïve program order eliminates potential concurrency.
Reorder code → better schedule
cudaMemcpyAsync(X0…);
KernelXa<<<…>>>();
KernelXb<<<…>>>();
cudaMemcpyAsync(Y0…);
KernelY<<<…>>>();
cudaMemcpyAsync(X1…);
cudaMemcpyAsync(Y1…);
Issuing Y0’s copy before X1’s lets stream Y’s work overlap stream X’s kernels.
• Order sensitive: applications must statically determine the order
• Couldn’t a scheduler with a global view do a better job dynamically?
• Our design can’t use this anyway: in “… xform | detect …”, the CUDA streams live in different processes with different address spaces, and would require additional IPC coordination
CUDA Asynchrony
Frames per second (higher is better) for CUDA, CUDA+async, CUDA+streams, and OS-supported.
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT230
H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication
GPUDirect™
• “Allows 3rd party devices to access CUDA memory” (eliminates a data copy)
• Great! But:
  • requires per-driver support
  • not just CUDA support!
  • no programmer-visible interface
• The OS can generalize this
The Elephant in the Room
Traditional OS guarantees:
• Fairness
• Isolation
No user-space runtime can provide these! It can support them… it cannot guarantee them.
CPU-bound processes hurt GPUs
Frames per second (higher is better): the CPU scheduler and the GPU scheduler are not integrated!
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT230
H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication
GPU-bound processes hurt CPUs
Flatter lines are better.
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT230
Meaningful “GPU Computing” implies GPUs should be managed like CPUs
• Process API analogues
• IPC API analogues
• Scheduler hint analogues
• Must integrate with existing interfaces
  • CUDA/DXGI/DirectX
  • DRI/DRM/OpenGL
Outline
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
• New OS abstractions for GPUs
• Related work
• Conclusion
Proposed OS abstractions
• ptask
  • Like a process or thread, but can exist without a user host process
  • An OS abstraction… not a full CPU process
  • Has a list of mappable input/output resources
• endpoint
  • Globally named kernel object
  • Can be mapped to ptask input/output resources
  • A data source or sink (e.g., a buffer in GPU memory)
• channel
  • Similar to a pipe
  • Connects arbitrary endpoints: 1:1, 1:M, M:1, N:M
  • Generalization of the GPUDirect™ mechanism
• Expand the system-call interface:
  • process API analogues
  • IPC API analogues
  • scheduler hints
Revised technology stack
• 1:1 correspondence between programmer-visible and OS abstractions
• Existing APIs can be built on top of the new OS abstractions
Gestural interface revisited
process: catusb → [usbsrc] → [rawimg] → ptask: xform → [cloud] → [g_input] → ptask: detect → [hands] → [hid_in] → process: hidinput
(endpoints in brackets; arrows are channels)
• Computation expressed as a graph
  • Synthesis [Massalin 89] (streams, pumps)
  • Dryad [Isard 07]
  • StreamIt [Thies 02]
  • Offcodes [Weinsberg 08]
  • others…
Gestural interface revisited
Same graph, with channels placed to eliminate unnecessary communication: a USB → GPU-memory channel feeds xform, and the channels between xform and detect stay entirely in GPU memory.
Gestural interface revisited
New data arriving on a channel triggers new computation.
• Eliminates unnecessary communication
• Eliminates user/kernel crossings and redundant computation
Early Results: potential benefit
Frames per second (higher is better): up to 10x and 3.9x improvements.
• Windows 7 x64, 8 GB RAM
• Intel Core 2 Quad 2.66 GHz
• nVidia GeForce GT230
H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication
Outline
• Motivation
• Problems with lack of OS abstractions
• Can CUDA solve these problems?
• New OS abstractions for GPUs
• Related work
• Conclusion
Related Work
• OS support for heterogeneous architectures
  • Helios [Nightingale 09]
  • Barrelfish [Baumann 09]
  • Offcodes [Weinsberg 08]
• Graph-based programming models
  • Synthesis [Massalin 89]
  • Monsoon/Id [Arvind]
  • Dryad [Isard 07]
  • StreamIt [Thies 02]
  • DirectShow
  • TCP offload [Currid 04]
• GPU computing
  • CUDA, OpenCL
Conclusions
• CUDA: the programming interface is right
  • but the OS must get involved
• Current interfaces waste data movement
• Current interfaces inhibit modularity/reuse
• Current interfaces cannot guarantee fairness or isolation
• OS-level abstractions are required

Questions?