OpenCL ch. 1: intro • Jongeun Lee • Fall 2013
Heterogeneity • A modern platform includes • CPU • GPU • Programmable accelerators • ASIC accelerators • example: an ARM-based SoC device
CPU + Hardware Accelerator • SoC FPGAs • Xilinx Zynq • Altera HPS
OpenCL Overview • Open Computing Language • Software centric • C/C++ for host program • OpenCL C for accelerator device (C99 based) • Unified design methodology
OpenCL Driving Force • Consortium • http://www.khronos.org • royalty-free standard • Application-driven • consumer electronics • image, video • augmented reality • military • video & radar analytics • scientific/high performance computing • molecular dynamics, bioinformatics, financial, etc.
Software in a Many-Core World • Model (high level abstraction) • needed for manageable parallel programming • There are many, e.g.: • data parallelism • task parallelism
Mapping onto Hardware • Heterogeneous computing challenge • different ISA, memory architecture, etc. • Traditional way • different SW modules explicitly tied to distinct HW components • GPGPU way • GPU centric • All the interesting computation done on GPU • Unified way • user pays for all the OpenCL devices • why not utilize all of them
OpenCL differs • rather than increasing abstraction, it exposes hardware heterogeneity • supports a wide range of applications and hardware • explicitly considers both data parallelism and task parallelism
Conceptual Foundations of OpenCL • application steps • Discover the components that make up the heterogeneous system. • Probe the characteristics of these components so that the software can adapt to the specific features of different hardware elements. • Create the blocks of instructions (kernels) that will run on the platform. • Set up and manipulate memory objects involved in the computation. • Execute the kernels in the right order and on the right components of the system. • Collect the final results.
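A minimal host-side sketch of these six steps (OpenCL 1.1 API). The kernel name vecadd, the embedded source string, and the omission of error checking and resource release are illustrative simplifications, not part of the lecture material:

```c
#include <stdio.h>
#include <CL/cl.h>

/* Illustrative OpenCL C kernel embedded as a source string. */
static const char *src =
    "__kernel void vecadd(__global const float *a,"
    "                     __global const float *b,"
    "                     __global float *c) {"
    "  int i = get_global_id(0);"
    "  c[i] = a[i] + b[i];"
    "}";

int main(void) {
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* 1. Discover the components of the heterogeneous system. */
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    /* 2. Probe their characteristics (here: simply take any device). */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_ALL, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* 3. Create the blocks of instructions (kernels). */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vecadd", NULL);

    /* 4. Set up the memory objects involved in the computation. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, NULL);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, NULL);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);
    clSetKernelArg(k, 0, sizeof da, &da);
    clSetKernelArg(k, 1, sizeof db, &db);
    clSetKernelArg(k, 2, sizeof dc, &dc);

    /* 5. Execute the kernel over a 1-D NDRange of N work-items. */
    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* 6. Collect the final results (blocking read). */
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
    printf("c[1] = %f\n", c[1]);   /* expected: 3.0 */
    return 0;
}
```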
Platform Model • device: where kernels execute, also called compute device • can be anything • subdivided into compute units (CUs) and, within each CU, processing elements (PEs)
Execution Model • host program vs. kernel • OpenCL defines how host program interacts with OpenCL-defined objects
Execution Model • How a kernel executes on an OpenCL device • Host program issues a command that submits the kernel • OpenCL runtime creates an integer index space • for each point in the index space, an instance of the kernel executes, called a work-item • the coordinates in the index space are the global ID of the work-item • kernel submission • creates a collection of work-items • all work-items run the same code • work-group • equal-size partition of the index space • work-group ID, local ID • concurrent execution within a work-group
NDRange • the N-dimensional index space, used for global ID, local ID, work-group ID • N: 1, 2, or 3 • Example (N=2) • global index space: (Gx, Gy) • global ID of a work-item: (gx, gy) • work-group index space: (Wx, Wy) • work-group ID: (wx, wy) • local index space: (Lx, Ly) • local ID: (lx, ly) • Then it follows: • Lx = Gx / Wx, Ly = Gy / Wy • gx = wx * Lx + lx, gy = wy * Ly + ly • wx = gx / Lx, wy = gy / Ly • lx = gx % Lx, ly = gy % Ly • Note: • global index space may have a non-zero offset (new in OpenCL 1.1)
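A short OpenCL C sketch (the kernel name show_ids and the output buffer are hypothetical) showing how the built-in ID functions return the quantities above for a 2-D NDRange; it assumes a zero global offset so the output indexing stays in range:

```c
__kernel void show_ids(__global int *out)
{
    size_t gx = get_global_id(0);      /* global ID, dimension 0       */
    size_t lx = get_local_id(0);       /* local ID within work-group   */
    size_t wx = get_group_id(0);       /* work-group ID                */
    size_t Lx = get_local_size(0);     /* local index space size       */
    size_t ox = get_global_offset(0);  /* global offset (OpenCL 1.1)   */

    /* gx = wx*Lx + lx (+ offset) holds by construction; the same
       relation holds independently in dimension 1 (gy, wy, Ly, ly). */
    size_t gy = get_global_id(1);
    size_t Gx = get_global_size(0);
    out[gy * Gx + gx] = (gx == wx * Lx + lx + ox);   /* always 1 */
}
```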
Host Program and Context • host program defines • kernels • the NDRange • queues that control the details of how and when kernels execute • the context • context: the environment within which kernels are defined and execute • it comprises • devices to be used • kernels to run • program objects: source code and executables for the kernels • memory objects visible to the OpenCL devices
Command-Queue • what is it? • one command-queue per device • host posts commands to the command-queue • commands are scheduled for execution on the device • three command types • kernel execution commands • memory commands • synchronization commands • put constraints on the order of command execution
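A small host-side sketch showing one command of each type posted to a per-device queue. The context, device, kernel, and buffers (ctx, dev, k, dbuf, hbuf) are assumed to have been created earlier; the names are illustrative:

```c
#include <CL/cl.h>

void run_once(cl_context ctx, cl_device_id dev, cl_kernel k,
              cl_mem dbuf, void *hbuf, size_t nbytes, size_t global)
{
    /* One command-queue per device: the host posts commands here and
       the runtime schedules them for execution on the device. */
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Memory command: copy host data into the buffer (non-blocking). */
    clEnqueueWriteBuffer(q, dbuf, CL_FALSE, 0, nbytes, hbuf, 0, NULL, NULL);

    /* Kernel execution command: launch over a 1-D NDRange. */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* Synchronization command: wait until everything above completes. */
    clFinish(q);
    clReleaseCommandQueue(q);
}
```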
More on Execution Order • host – device • asynchronous by default • within a single command-queue • in-order execution: must be supported • out-of-order execution: optional • programmer explicitly enforces any necessary execution order • synchronization mechanisms • synchronization commands • using event objects • e.g., a command waits until certain conditions on the event object exist • can also coordinate execution between host and device
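A sketch of event-based ordering between two kernels in an out-of-order queue; the kernels ka and kb (with their arguments) are assumed to be set up elsewhere, and the names are illustrative:

```c
#include <CL/cl.h>

void ordered_pair(cl_context ctx, cl_device_id dev,
                  cl_kernel ka, cl_kernel kb, size_t global)
{
    cl_command_queue q = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);

    /* Kernel A produces an event object when enqueued. */
    cl_event done_a;
    clEnqueueNDRangeKernel(q, ka, 1, NULL, &global, NULL, 0, NULL, &done_a);

    /* Kernel B waits on that event, so it cannot start before A has
       finished, even though the queue itself is out-of-order. */
    cl_event done_b;
    clEnqueueNDRangeKernel(q, kb, 1, NULL, &global, NULL, 1, &done_a, &done_b);

    /* The host can also coordinate with the device through events. */
    clWaitForEvents(1, &done_b);
    clReleaseEvent(done_a);
    clReleaseEvent(done_b);
    clReleaseCommandQueue(q);
}
```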
Memory Model • two types of memory objects • buffer object • contiguous memory block • can use pointers, build data structures • image object • restricted to images • image storage format may be optimized (by device) • opaque object, accessible only through API functions • subregions of memory objects • first-class object
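A sketch of a buffer object plus a sub-buffer covering a subregion of it (clCreateSubBuffer is OpenCL 1.1); ctx is assumed to exist and the function name is illustrative:

```c
#include <CL/cl.h>

cl_mem make_second_half(cl_context ctx, size_t n_floats, cl_mem *whole_out)
{
    /* Contiguous buffer; kernels can use pointers into it and build
       data structures on top of it. */
    cl_mem whole = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                  n_floats * sizeof(float), NULL, NULL);

    /* Subregion of the buffer, itself a first-class memory object.
       (In practice the origin must respect the device's
       base-address alignment, CL_DEVICE_MEM_BASE_ADDR_ALIGN.) */
    cl_buffer_region region = { (n_floats / 2) * sizeof(float),
                                (n_floats / 2) * sizeof(float) };
    cl_mem half = clCreateSubBuffer(whole, CL_MEM_READ_WRITE,
                                    CL_BUFFER_CREATE_TYPE_REGION,
                                    &region, NULL);
    *whole_out = whole;
    return half;
}
```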
Memory Region • private memory • private to a work-item • local memory • shared within a work-group • global memory • visible to all work-groups • constant memory • part of the global memory region • host memory • visible only to the host
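An OpenCL C sketch showing the address-space qualifiers that map onto these regions (the kernel name and arguments are illustrative):

```c
__kernel void regions(__global float *data,        /* global memory   */
                      __constant float *coeff,     /* constant memory */
                      __local float *scratch)      /* local memory    */
{
    /* Variables without a qualifier are private to the work-item. */
    __private float tmp = data[get_global_id(0)];

    scratch[get_local_id(0)] = tmp * coeff[0];     /* shared in work-group */
    barrier(CLK_LOCAL_MEM_FENCE);

    data[get_global_id(0)] = scratch[get_local_id(0)];
}
```

On the host side, the __local scratch argument would be sized (not filled) by passing a byte count and a NULL pointer to clSetKernelArg.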
Memory Management • memory is divided into distinct regions • memory management is EXPLICIT • copying • using data transfer commands (issued by the host) • blocking or non-blocking • mapping/unmapping • using memory map commands (issued by the host) • host can map a region of a memory object into its own address space • blocking or non-blocking
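A sketch of the two explicit data-movement styles, copying and mapping; the queue q, buffer buf, and function name are illustrative and assumed to exist:

```c
#include <string.h>
#include <CL/cl.h>

void move_data(cl_command_queue q, cl_mem buf,
               const float *src, size_t nbytes)
{
    /* Copying: a data-transfer command issued by the host.
       CL_TRUE makes it blocking; CL_FALSE would return immediately. */
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, nbytes, src, 0, NULL, NULL);

    /* Mapping: bring a region of the memory object into the host's
       own address space, touch it directly, then unmap. */
    float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, nbytes, 0, NULL, NULL, NULL);
    memcpy(p, src, nbytes);
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
}
```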
Memory Consistency • device memory • private: load/store model • local: consistent only at work-group sync points (e.g., work-group barrier) • global: same as local • no consistency enforced between different work-groups • order of loads/stores may appear different for different work-items! (relaxed consistency) • ordering relative to commands • (all work-items complete) -> (kernel command is signaled as finished) • loads/stores are completed before the signaling (~release consistency) • for an out-of-order queue, we need more • using a command-queue barrier • explicitly managing consistency through event mechanisms
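A host-side sketch of a command-queue barrier enforcing consistency between two kernels in an out-of-order queue (clEnqueueBarrier is the OpenCL 1.1 form; q, ka, kb, and global are assumed to exist and are illustrative):

```c
#include <CL/cl.h>

void barrier_between(cl_command_queue q, cl_kernel ka, cl_kernel kb,
                     size_t global)
{
    clEnqueueNDRangeKernel(q, ka, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* All commands queued before the barrier must complete (and their
       loads/stores become visible) before any command queued after it
       starts, even in an out-of-order queue. */
    clEnqueueBarrier(q);

    clEnqueueNDRangeKernel(q, kb, 1, NULL, &global, NULL, 0, NULL, NULL);
}
```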
Programming Model • not as precise as execution model • primarily two: data & task parallelism • but may be more • hybrid model • additional ones can be created on top of basic execution model
Data-Parallel Programming Model • What is it? • single logical sequence of instructions • applied concurrently to elements of data structure • OpenCL • kernel’s instructions are applied concurrently to work-items • Data sharing? • Can data be shared among work-items in work-group? • supported via local memory region • but how to manage dependencies? • work-group barrier • Between work-items from different work-groups? Parallel algorithm = a sequence of concurrent updates
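An OpenCL C sketch of the data-parallel model: one logical instruction sequence applied concurrently to all work-items, with data shared through local memory and dependencies handled by work-group barriers. Combining the per-group partial sums is left to the host (or a second kernel), since work-groups cannot synchronize with each other. The kernel name and arguments are illustrative, and the local size is assumed to be a power of two:

```c
__kernel void wg_sum(__global const float *in,
                     __global float *partial,      /* one value per work-group */
                     __local float *scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);                  /* local data now visible */

    /* Tree reduction within the work-group. */
    for (size_t stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = scratch[0];
}
```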
Task-Parallel Programming Model • It could mean different things • task as a kernel that executes as a single work-item • why? • kernels submitted with out-of-order queue • tasks are dynamically scheduled • effective when # of tasks >> # of compute units • limitation? • tasks connected into a graph using OpenCL’s event model • commands may generate events • subsequent commands can wait for events • this creates static task graphs
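A host-side sketch of the task-parallel style: each kernel runs as a single work-item via clEnqueueTask, and an event wires the two tasks into a static producer/consumer graph. The kernels k_prod and k_cons are assumed to be already built with their arguments set; the names are illustrative:

```c
#include <CL/cl.h>

void task_graph(cl_context ctx, cl_device_id dev,
                cl_kernel k_prod, cl_kernel k_cons)
{
    cl_command_queue q = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, NULL);

    cl_event produced;
    clEnqueueTask(q, k_prod, 0, NULL, &produced);   /* task = one work-item   */
    clEnqueueTask(q, k_cons, 1, &produced, NULL);   /* runs after the producer */

    clFinish(q);
    clReleaseEvent(produced);
    clReleaseCommandQueue(q);
}
```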
Platform API • query the system about the OpenCL frameworks/devices available • context creation • Runtime API • Compiler for OpenCL C • not supported: • recursive functions, function pointers, bit fields • much of the C standard library (including stdio.h, stdlib.h) • many built-ins are added instead • Note: • floating-point requirements • embedded profile
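A sketch of a platform/device query using the platform API (OpenCL 1.1); it prints the name of each device found, up to a small fixed limit chosen here for simplicity:

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_uint np = 0;
    clGetPlatformIDs(0, NULL, &np);                 /* how many platforms? */
    cl_platform_id plats[8];
    clGetPlatformIDs(np < 8 ? np : 8, plats, NULL);

    for (cl_uint i = 0; i < np && i < 8; ++i) {
        cl_uint nd = 0;
        clGetDeviceIDs(plats[i], CL_DEVICE_TYPE_ALL, 0, NULL, &nd);
        cl_device_id devs[8];
        clGetDeviceIDs(plats[i], CL_DEVICE_TYPE_ALL, nd < 8 ? nd : 8, devs, NULL);

        for (cl_uint j = 0; j < nd && j < 8; ++j) {
            char name[256];
            clGetDeviceInfo(devs[j], CL_DEVICE_NAME, sizeof name, name, NULL);
            printf("platform %u, device %u: %s\n", i, j, name);
        }
    }
    return 0;
}
```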