
Operating System Abstractions for GPU Programming



  1. Operating System Abstractions for GPU Programming Chris Rossbach, Microsoft Research Emmett Witchel, University of Texas at Austin September 23, 2010

  2. Motivation • GPU application domains limited • CUDA • Rich APIs/abstractions • Language integration → familiar environment • Composition/integration into systems • Programmer-visible abstractions expressive • Realization/implementations unsatisfactory • Better OS-level abstractions are required • (IP/business concerns aside)

  3. Traditional OS-level abstractions int main(int argc, char **argv) { FILE *fp = fopen("quack", "w"); if(fp == NULL) fprintf(stderr, "failure\n"); … return 0; } // How do I program just a CPU and a disk? Layers: programmer-visible interface / OS-level abstractions / hardware interface

  4. GPU Abstractions A rich programmer-visible interface sits atop just 1 OS-level abstraction! The programmer gets to work with great abstractions… Why is this a problem?

  5. Poor OS abstractions limit GPUs Doing fine without OS support: • Gaming/Graphics • Shader languages • DirectX, OpenGL • “GPU Computing”, née “GPGPU” • user-mode/batch • scientific algorithms • latency-tolerant • CUDA • The application ecosystem is more diverse • No OS abstractions → no traction

  6. Interactive Applications • Gestural Interface • Brain-Computer Interface • Spatial Audio • Image Recognition • Processing user input: • need low latency, concurrency • must be multiplexed by OS • protection/isolation

  7. Gestural Interface [Pipeline: raw images → point cloud → “hand” events → HID input (OS); stages: image filtering → geometric transform → gesture recognition] • High data rates • Noisy input • Data-parallel algorithms

  8. What I wish I could do #> catusb | xform | detect | hidinput & • catusb: captures image data from USB • xform: noise filtering, geometric transformation • detect: extract gestures from point cloud • hidinput: send mouse events (or whatever) • xform and detect are data parallel; the other stages are inherently sequential • Could parallelize on a CMP, but…
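The appeal of the pipeline is that each stage is an ordinary filter. A minimal sketch of a hypothetical xform stage in C, assuming an invented fixed frame size and a toy threshold filter (the talk's real xform does noise filtering and geometric transformation; the names here are illustrative only):

```c
#include <stddef.h>
#include <stdio.h>

#define FRAME_BYTES 4096   /* assumed fixed-size raw frame (hypothetical) */

/* Toy noise filter: zero out bytes below a threshold. */
static void denoise(unsigned char *frame, size_t n, unsigned char thresh) {
    for (size_t i = 0; i < n; i++)
        if (frame[i] < thresh)
            frame[i] = 0;
}

/* One stage of the pipeline: consume frames on `in`, emit them on `out`.
 * Because each stage reads stdin and writes stdout, the shell's pipe
 * abstraction composes them: catusb | xform | detect | hidinput */
static void run_stage(FILE *in, FILE *out) {
    unsigned char frame[FRAME_BYTES];
    size_t got;
    while ((got = fread(frame, 1, FRAME_BYTES, in)) > 0) {
        denoise(frame, got, 16);
        fwrite(frame, 1, got, out);
    }
}
```

A real filter's main would just call run_stage(stdin, stdout); the slide's point is that pipes give composition, protection, and OS-mediated scheduling for free, on the CPU.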

  9. So use the GPU! (naïve approach) #> catusb | xform | detect | hidinput & • Run catusb on the CPU • Run xform on the GPU • Run detect on the GPU • Run hidinput on the CPU Use CUDA to write xform and detect!

  10. Running a program on a GPU • GPUs cannot run an OS: different ISA • Disjoint memory space, no coherence* • Host CPU must manage execution • Program inputs explicitly bound at runtime • User-mode apps must implement: copy inputs (main memory → GPU memory), send commands, copy outputs (GPU memory → main memory)
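The copy-in, launch, copy-out choreography the slide describes can be mimicked in plain C. A schematic sketch, with memcpy and a CPU loop deliberately standing in for cudaMemcpy and a real kernel launch so it stays self-contained; the names gpu_dispatch, fake_kernel, and device_buf are invented for illustration:

```c
#include <string.h>

#define N 8
static float device_buf[N];          /* stands in for disjoint GPU memory */

/* Stands in for a kernel launch ("send commands"). */
static void fake_kernel(float *d, int n) {
    for (int i = 0; i < n; i++)
        d[i] = d[i] * d[i];
}

/* The user-mode choreography: inputs are explicitly bound and copied
 * at runtime, commands are sent, outputs are copied back. */
static void gpu_dispatch(const float *host_in, float *host_out, int n) {
    memcpy(device_buf, host_in, n * sizeof(float));   /* copy inputs  */
    fake_kernel(device_buf, n);                       /* run kernel   */
    memcpy(host_out, device_buf, n * sizeof(float));  /* copy outputs */
}
```

The structure, not the arithmetic, is the point: every crossing between host_in/host_out and device_buf is a transfer the OS never sees.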

  11. Technology Stack View • 12 kernel crossings • 6 copy_to_user • 6 copy_from_user • Performance tradeoffs for runtime/abstractions [Stack: user mode: catusb, hidinput, xform, detect, CUDA Runtime, user-mode drivers (DXVA); kernel mode: OS executive, kernel-mode drivers, HAL; hardware: CPU, GPU, USB]

  12. So, big deal… do it all in the kernel [Stack: catusb, xform, detect, hidinput moved into kernel mode, atop the OS executive, kernel-mode drivers, and HAL; hardware: USB, CPU, GPU] • No CUDA, no high-level abstractions • If you’re MS and/or nVidia, this might be tenable… • Solution is specialized • but there is still a data migration problem…

  13. Hardware View • We’d prefer: • catusb: USB bus → GPU memory • xform, detect: no transfers • hidinput: single GPU → main memory transfer • if GPUs become coherent with main memory… • The machine can do this; where are the interfaces? The current path instead causes cache pollution, wasted bandwidth, and wasted power. [Diagram: CPU on FSB to Northbridge (DDR2/3 DIMMs, PCI-e to GPU), DMI to Southbridge (USB 2.0); current task routes catusb and xform traffic through main memory]

  14. Outline • Motivation • Problems with lack of OS abstractions • Can CUDA solve these problems? • CUDA Streams • Asynchrony • GPUDirect™ • OS guarantees • New OS abstractions for GPUs • Related Work • Conclusion

  15. Doesn’t CUDA address these problems? CUDA features mapped onto the hardware: • CUDA streams, async (overlap capture/transfer/execution) • “Write-combining memory” (uncacheable) • Page-locked host memory (faster DMA) • Portable memory (share page-locked memory) • Mapped memory (map host memory into GPU space) • GPUDirect™ (transparent transfer, app-level upcalls)

  16. CUDA Streams Overlap Communication with Computation [Timeline: the copy engine processes Copy X0, Copy Y0, Copy X1, Copy Y1 while the compute engine runs Kernel Xa, Kernel Xb, Kernel Y; work from streams X and Y overlaps across the two engines]

  17. Streams: explicitly scheduled cudaMemcpyAsync(X0…); KernelXa<<<…>>>(); KernelXb<<<…>>>(); cudaMemcpyAsync(X1…); cudaMemcpyAsync(Y0…); KernelY<<<…>>>(); cudaMemcpyAsync(Y1…); Resulting schedule: copy engine runs Copy X0, Copy X1, Copy Y0, Copy Y1; compute engine runs Kernel Xa, Kernel Xb, Kernel Y. Each stream proceeds serially; different streams overlap. Naïve programming eliminates potential concurrency.

  18. Reorder code → better schedule cudaMemcpyAsync(X0…); KernelXa<<<…>>>(); KernelXb<<<…>>>(); cudaMemcpyAsync(Y0…); KernelY<<<…>>>(); cudaMemcpyAsync(X1…); cudaMemcpyAsync(Y1…); Resulting schedule: copy engine runs Copy X0, Copy Y0, Copy X1, Copy Y1; compute engine runs Kernel Xa, Kernel Xb, Kernel Y. • Our design can’t use this anyway! In … xform | detect …, the CUDA streams live in xform and detect: different processes, different address spaces, requiring additional IPC coordination. • Order sensitive: applications must statically determine order. Couldn’t a scheduler with a global view do a better job dynamically?
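The scheduling effect on slides 17 and 18 can be reproduced with a toy simulator. A sketch under invented assumptions (copy operations take 2 time units and kernels 3; two serial engines with head-of-line blocking; two streams), none of which come from the talk:

```c
/* Each op runs on one engine (copy or compute). An engine processes
 * its queue strictly in issue order, and an op must also wait for the
 * previous op in its own stream to finish. */
enum { COPY, COMPUTE };

struct op { int stream, engine, dur; };

/* Returns the makespan of issuing ops[0..n) in that order.
 * Because engine queues and stream order both follow issue order,
 * a single pass computes each op's start time as the later of
 * "my engine is free" and "my stream's predecessor finished". */
static int makespan(const struct op *ops, int n) {
    int engine_free[2] = {0, 0};   /* next free time per engine */
    int stream_done[2] = {0, 0};   /* last completion per stream */
    int end = 0;
    for (int i = 0; i < n; i++) {
        int s = ops[i].stream, e = ops[i].engine;
        int start = engine_free[e] > stream_done[s] ? engine_free[e]
                                                    : stream_done[s];
        int fin = start + ops[i].dur;
        engine_free[e] = fin;
        stream_done[s] = fin;
        if (fin > end) end = fin;
    }
    return end;
}
```

With these invented durations, issuing all of stream X before stream Y (slide 17's order) yields a makespan of 17 units, while interleaving the issues as on slide 18 yields 13, because Copy Y0 no longer sits behind Copy X1 in the copy engine's queue.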

  19. CUDA Asynchrony [Chart: frames per second for CUDA, CUDA+async, and CUDA+streams vs. an OS-supported implementation; higher is better] • Windows 7 x64, 8GB RAM • Intel Core 2 Quad 2.66GHz • nVidia GeForce GT230 • H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication

  20. GPUDirect™ • “Allows 3rd party devices to access CUDA memory” (eliminates a data copy) • Great, but: • requires per-driver support • not just CUDA support! • no programmer-visible interface • the OS can generalize this

  21. The Elephant in the Room Traditional OS guarantees: • Fairness • Isolation • No user-space runtime can provide these! • Can support… • Cannot guarantee

  22. CPU-bound processes hurt GPUs [Chart: GPU frames per second drops as CPU-bound processes are added; higher is better] The CPU scheduler and GPU scheduler are not integrated! • Windows 7 x64, 8GB RAM • Intel Core 2 Quad 2.66GHz • nVidia GeForce GT230 • H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication

  23. GPU-bound processes hurt CPUs [Chart: CPU performance as GPU-bound work increases; flatter lines are better] • Windows 7 x64, 8GB RAM • Intel Core 2 Quad 2.66GHz • nVidia GeForce GT230

  24. Meaningful “GPU Computing” implies GPUs should be managed like CPUs • Process API analogues • IPC API analogues • Scheduler hint analogues • Must integrate with existing interfaces • CUDA/DXGI/DirectX • DRI/DRM/OpenGL

  25. Outline • Motivation • Problems with lack of OS abstractions • Can CUDA solve these problems? • New OS abstractions for GPUs • Related Work • Conclusion

  26. Proposed OS abstractions • ptask • Like a process or thread, but can exist without a user host process • An OS abstraction… not a full CPU process • List of mappable input/output resources • endpoint • Globally named kernel object • Can be mapped to ptask input/output resources • A data source or sink (e.g. a buffer in GPU memory) • channel • Similar to a pipe • Connects arbitrary endpoints • 1:1, 1:M, M:1, N:M • Generalization of the GPUDirect™ mechanism • Expand the system call interface: • process API analogues • IPC API analogues • scheduler hints
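One way to picture the three proposed objects is as plain kernel data structures. A hypothetical sketch in C; the struct layouts and the make_channel helper are invented here for illustration, not taken from the talk:

```c
#include <string.h>

#define MAX_PORTS 4

/* endpoint: globally named kernel object, a data source or sink
 * (e.g. a buffer in GPU memory). */
struct endpoint {
    char name[32];
};

/* ptask: like a process/thread, but can exist without a user host
 * process; carries a list of mappable input/output resources. */
struct ptask {
    const char *kernel_image;              /* GPU code to run */
    struct endpoint *inputs[MAX_PORTS];
    struct endpoint *outputs[MAX_PORTS];
};

/* channel: pipe-like connection between arbitrary endpoints,
 * generalizing GPUDirect-style device-to-device transfers. */
struct channel {
    struct endpoint *src, *dst;
};

static struct channel make_channel(struct endpoint *src,
                                   struct endpoint *dst) {
    struct channel c = { src, dst };
    return c;
}
```

In a real system these would be kernel objects reached through new system calls (the process/IPC analogues the slide lists), so the GPU scheduler can see the whole graph rather than one opaque user-mode runtime per process.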

  27. Revised technology stack • 1-1 correspondence between programmer and OS abstractions • existing APIs can be built on top of new OS abstractions

  28. Gestural interface revisited [Graph: process catusb → ptask xform → ptask detect → process hidinput, connected by channels through endpoints usbsrc, rawimg, cloud, hands, g_input, hid_in; legend: process, ptask, endpoint, channel] • Computation expressed as a graph • Synthesis [Massalin 89] (streams, pumps) • Dryad [Isard 07] • StreamIt [Thies 02] • Offcodes [Weinsberg 08] • others…

  29. Gestural interface revisited [Same graph, with channel placement shown: USB → GPU memory, then GPU memory → GPU memory between xform and detect] • Eliminate unnecessary communication…

  30. Gestural interface revisited [Same graph; new data arriving on a channel triggers new computation] • Eliminates unnecessary communication • Eliminates user/kernel crossings and redundant computation

  31. Early Results: potential benefit [Chart: frames per second, showing speedups of up to 10x and 3.9x over the CUDA-based versions; higher is better] • Windows 7 x64, 8GB RAM • Intel Core 2 Quad 2.66GHz • nVidia GeForce GT230 • H→D: host-to-device only; D→H: device-to-host only; H↔D: duplex communication

  32. Outline • Motivation • Problems with lack of OS abstractions • Can CUDA solve these problems? • New OS abstractions for GPUs • Related Work • Conclusion

  33. Related Work • OS support for heterogeneous architectures: • Helios [Nightingale 09] • Barrelfish [Baumann 09] • Offcodes [Weinsberg 08] • Graph-based programming models • Synthesis [Massalin 89] • Monsoon/Id [Arvind] • Dryad [Isard 07] • StreamIt [Thies 02] • DirectShow • TCP offload [Currid 04] • GPU Computing • CUDA, OpenCL

  34. Conclusions • CUDA: programming interface is right • but OS must get involved • Current interfaces waste data movement • Current interfaces inhibit modularity/reuse • Cannot guarantee fairness, isolation • OS-level abstractions are required Questions?
