
OpenCL ch. 1: intro.







1. OpenCL ch. 1: intro.
Jongeun Lee, Fall 2013

2. Heterogeneity
• A modern platform includes:
  • CPU
  • GPU
  • programmable accelerators
  • ASIC accelerators
• e.g., an ARM-based SoC device

3. Programmable Solutions

4. CPU + Hardware Accelerator
• SoC FPGAs
  • Xilinx Zynq
  • Altera HPS

5. OpenCL Overview
• Open Computing Language
• Software-centric
  • C/C++ for the host program
  • OpenCL C for the accelerator device (C99-based)
• Unified design methodology

6. OpenCL Driving Force
• Consortium
  • http://www.khronos.org
  • royalty-free standard
• Application-driven
  • consumer electronics: image, video, augmented reality
  • military: video & radar analytics
  • scientific/high-performance computing: molecular dynamics, bioinformatics, financial, etc.

7. Software in a Many-Core World
• A model (a high-level abstraction) is needed for manageable parallel programming
• There are many, e.g.:
  • data parallelism
  • task parallelism

8. Mapping onto Hardware
• Heterogeneous computing challenge
  • different ISAs, memory architectures, etc.
• Traditional way
  • different SW modules explicitly tied to distinct HW components
• GPGPU way
  • GPU-centric: all the interesting computation is done on the GPU
• Unified way
  • the user pays for all the OpenCL devices, so why not utilize all of them?

9. How OpenCL Differs
• Rather than increasing abstraction, it exposes hardware heterogeneity
• Supports a wide range of applications and hardware
• Explicitly considers both data parallelism and task parallelism

10. Conceptual Foundations of OpenCL
• Application steps:
  1. Discover the components that make up the heterogeneous system.
  2. Probe the characteristics of these components so that the software can adapt to the specific features of different hardware elements.
  3. Create the blocks of instructions (kernels) that will run on the platform.
  4. Set up and manipulate the memory objects involved in the computation.
  5. Execute the kernels in the right order and on the right components of the system.
  6. Collect the final results.
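The six steps above map roughly onto OpenCL 1.x host-API calls as follows. This is a minimal sketch, not a complete program: error handling is omitted, and `src`, `nbytes`, and `host_ptr` are placeholders (the kernel source string, buffer size, and a host array) that are not taken from the slides.

```c
#include <CL/cl.h>

/* Sketch: the six application steps as host-API calls (no error checks). */
cl_platform_id platform; cl_device_id device;
clGetPlatformIDs(1, &platform, NULL);                              /* 1. discover */
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
/* 2. probe: e.g. clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, ...) */
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
clBuildProgram(prog, 1, &device, NULL, NULL, NULL);                /* 3. create kernels */
cl_kernel k = clCreateKernel(prog, "vadd", NULL);
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, NULL); /* 4. memory objects */
clSetKernelArg(k, 0, sizeof(buf), &buf);
cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);
size_t gsz = 1024;
clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);  /* 5. execute */
clEnqueueReadBuffer(q, buf, CL_TRUE, 0, nbytes, host_ptr, 0, NULL, NULL); /* 6. collect */
```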

11. Platform Model
• Device: where kernels execute; also called a compute device
  • can be anything
  • subdivided into compute units, which are further subdivided into processing elements (PEs)

12. Execution Model
• Host program vs. kernel
• OpenCL defines how the host program interacts with OpenCL-defined objects

13. Execution Model
• How a kernel executes on an OpenCL device:
  • the host program issues a command that submits the kernel
  • the OpenCL runtime creates an integer index space
  • for each point in the index space, an instance of the kernel executes, called a work-item
  • the coordinates of a work-item in the index space are its global ID
• Kernel submission thus creates a collection of work-items, all running the same code
• Work-group
  • an equal-size partition of the index space
  • has a work-group ID; each work-item also has a local ID within its group
  • work-items execute concurrently within a work-group
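A vector addition kernel is the standard illustration of this model: every work-item runs the same code, and `get_global_id(0)` gives each one its coordinate in a 1-D index space. A minimal OpenCL C sketch (the kernel name `vadd` is illustrative):

```c
/* OpenCL C: one work-item per array element. */
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int gid = get_global_id(0);   /* this work-item's global ID */
    c[gid] = a[gid] + b[gid];     /* one element per work-item  */
}
```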

14. NDRange
• NDRange: defines the global ID, local ID, and work-group ID spaces; N = 1, 2, or 3
• Example (N = 2)
  • global index space: (Gx, Gy); global ID of a work-item: (gx, gy)
  • work-group index space: (Wx, Wy); work-group ID: (wx, wy)
  • local index space: (Lx, Ly); local ID: (lx, ly)
• Then it follows:
  • Lx = Gx / Wx and Ly = Gy / Wy
  • gx = wx * Lx + lx and gy = wy * Ly + ly
  • wx = gx / Lx and wy = gy / Ly
  • lx = gx % Lx and ly = gy % Ly
• Note: the global index space may have a non-zero offset (new in OpenCL 1.1)

15. Host Program and Context
• The host program defines:
  • kernels
  • the NDRange
  • the queues that control the details of how and when kernels execute
• Context
  • the environment within which kernels are defined and execute
  • comprises:
    • the devices to be used
    • the kernels to run
    • program objects: the source code and executables for the kernels
    • the memory objects visible to the OpenCL devices

16. Command-Queue
• What is it?
  • one command-queue per device
  • the host posts commands to the command-queue
  • commands are scheduled for execution on the device
• Three command types:
  • kernel execution commands
  • memory commands
  • synchronization commands: put constraints on the order of command execution
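The three command types correspond to three families of `clEnqueue*` calls. A sketch, assuming `ctx`, `dev`, `k`, `buf`, `nbytes`, and `host_ptr` already exist (error checks omitted; OpenCL 1.x API):

```c
/* One in-order queue for the device, with one command of each type. */
cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

size_t gsz = 1024;
clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, nbytes, host_ptr,
                     0, NULL, NULL);                 /* memory command   */
clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL,
                       0, NULL, NULL);               /* kernel execution */
clEnqueueBarrier(q);                                 /* synchronization  */
clFinish(q);        /* host blocks until every posted command completes */
```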

17. More on Execution Order
• Host and device: asynchronous by default
• Within a single command-queue:
  • in-order execution: must be supported
  • out-of-order execution: optional; the programmer explicitly enforces any necessary execution order
• Synchronization mechanisms
  • synchronization commands
  • event objects: e.g., a command waits until certain conditions on an event object hold
  • these can also coordinate execution between host and device

18. Memory Model
• Two types of memory objects:
  • buffer object
    • a contiguous memory block
    • can use pointers and build data structures
  • image object
    • restricted to images
    • the image storage format may be optimized by the device
    • an opaque object, accessible only through API functions
• Subregions of memory objects are first-class objects

19. Memory Regions
• Private memory: private to a work-item
• Local memory: shared within a work-group
• Global memory: visible to all work-groups
• Constant memory: part of the global memory region
• Host memory: visible only to the host
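In OpenCL C, each of these device-side regions has an address-space qualifier. A sketch kernel touching all four (the kernel name `regions` and the use of a local `tile` buffer are illustrative):

```c
/* OpenCL C: one qualifier per memory region. */
__constant float scale = 2.0f;                 /* constant memory */

__kernel void regions(__global float *out,     /* global memory   */
                      __local  float *tile)    /* local memory    */
{
    float x = scale;                           /* private memory  */
    int lid = get_local_id(0);
    tile[lid] = x;                             /* visible to the whole work-group */
    barrier(CLK_LOCAL_MEM_FENCE);
    out[get_global_id(0)] = tile[lid];
}
```

Host memory has no qualifier; it is reached only through the host API (copies and maps, next slide).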

20. Memory Management
• Memory is divided into regions; management is EXPLICIT
• Copying
  • via data transfer commands (issued by the host)
  • blocking or non-blocking
• Mapping/unmapping
  • via memory map commands (issued by the host)
  • the host can map a region of a memory object into its own address space
  • blocking or non-blocking
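The two styles look like this on the host side (a sketch: error checks omitted; `q`, `buf`, `nbytes`, and `host_ptr` are assumed to exist):

```c
/* 1. Copying: CL_TRUE makes the transfer blocking. */
clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, nbytes, host_ptr, 0, NULL, NULL);
clEnqueueReadBuffer (q, buf, CL_TRUE, 0, nbytes, host_ptr, 0, NULL, NULL);

/* 2. Mapping: the buffer region appears in the host address space. */
void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                             0, nbytes, 0, NULL, NULL, NULL);
/* ... host reads/writes through p ... */
clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
```

Mapping can avoid a copy when host and device share physical memory, which is why the slides treat it as a distinct mechanism.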

21. Memory Consistency
• Device memory
  • private: load/store model
  • local: consistent only at work-group synchronization points (e.g., a work-group barrier)
  • global: same as local, but no consistency is enforced between different work-groups
  • the order of loads/stores may appear different to different work-items (relaxed consistency)
• Ordering relative to commands
  • (all work-items complete) -> (kernel command is signaled as finished)
  • loads/stores are completed before the signaling (~release consistency)
• For an out-of-order queue we need more:
  • a command-queue barrier
  • explicitly managing consistency through the event mechanisms
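The "consistent only at synchronization points" rule is why reductions over local memory need a barrier between stages. A sketch work-group sum in OpenCL C (assumes a power-of-two work-group size; the kernel name is illustrative):

```c
/* OpenCL C: local memory is only guaranteed consistent at the barrier,
   so every stage of the tree reduction must synchronize. */
__kernel void group_sum(__global const float *in,
                        __global float *out,
                        __local  float *tmp)
{
    int lid = get_local_id(0);
    tmp[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);          /* writes visible group-wide */

    for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            tmp[lid] += tmp[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);      /* sync before the next stage */
    }
    if (lid == 0)
        out[get_group_id(0)] = tmp[0];
}
```

Note that no barrier can order work-items in different work-groups; combining the per-group results takes a second kernel or the host.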

22. Programming Model
• Not as precisely defined as the execution model
• Primarily two: data parallelism and task parallelism
  • but there may be more, e.g., hybrid models
  • additional models can be created on top of the basic execution model

23. Data-Parallel Programming Model
• What is it?
  • a single logical sequence of instructions applied concurrently to the elements of a data structure
  • in OpenCL, the kernel's instructions are applied concurrently to work-items
• Data sharing
  • Can data be shared among work-items in a work-group? Yes, via the local memory region
  • But how to manage dependencies? With a work-group barrier
  • Between work-items from different work-groups?
• Parallel algorithm = a sequence of concurrent updates

24. Task-Parallel Programming Model
• It can mean different things:
• (a) A task is a kernel that executes as a single work-item
  • why? kernels submitted to an out-of-order queue are dynamically scheduled
  • effective when # of tasks >> # of compute units
  • limitation?
• (b) Tasks connected into a graph using OpenCL's event model
  • commands may generate events; subsequent commands can wait for those events
  • this creates static task graphs
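Variant (b) can be sketched with three kernels, where C depends on A and B. Assuming `q` is an out-of-order queue and `kA`, `kB`, `kC` are pre-built kernels (error checks omitted):

```c
/* Static task graph via events: C waits for A and B, so the runtime
   is free to run A and B concurrently. */
cl_event evA, evB, evAB[2];
size_t gsz = 256;

clEnqueueNDRangeKernel(q, kA, 1, NULL, &gsz, NULL, 0, NULL, &evA);
clEnqueueNDRangeKernel(q, kB, 1, NULL, &gsz, NULL, 0, NULL, &evB);
evAB[0] = evA; evAB[1] = evB;
clEnqueueNDRangeKernel(q, kC, 1, NULL, &gsz, NULL, 2, evAB, NULL);
clFinish(q);
```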

25. Platform API
• Query the system about the available OpenCL frameworks/devices
• Context creation
• Runtime API
• Compiler for OpenCL C
  • not supported:
    • recursive functions, function pointers, bit-fields
    • much of the standard library (including stdio.h, stdlib.h)
  • many other features are added
  • note: floating point; embedded profile
