Introduction to OpenCL*
Ohad Shacham, Intel Software and Services Group
Thanks to Elior Malul, Arik Narkis, and Doron Singer
Evolution of OpenCL* • Sequential Programs

    int main() {
        //read input
        scalar_mul(…);
        return 0;
    }

    void scalar_mul(int n, const float *a, const float *b, float *c) {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }
Evolution of OpenCL* • Multi-threaded Programs

    int main() {
        //read input
        pthread_create(…, scalar_mul, …);   // second thread computes one half
        scalar_mul(n/2, …);                 // main thread computes the other half
        pthread_join(…);
        return 0;
    }

    void scalar_mul(int n, const float *a, const float *b, float *c) {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }
Problems – concurrent programs • Writing concurrent programs is hard • Concurrent algorithms • Threads • Work balancing • Programs must be updated when new cores are added to the system • Data races, livelocks, deadlocks • Debugging concurrent programs is even harder
Evolution of OpenCL* • Vector instruction utilization

    int main() {
        //read input
        scalar_mul(…);
        return 0;
    }

    void scalar_mul(int n, const float *a, const float *b, float *c) {
        int i;
        for (i = 0; i < n; i += 4) {
            __m128 a_vec = _mm_load_ps(a + i);
            __m128 b_vec = _mm_load_ps(b + i);
            __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
            _mm_store_ps(c + i, c_vec);
        }
    }
Problems – vector instruction usage • Utilizing vector instructions is also not a trivial task • Vendor-dependent code • Usage is not future proof • New, more efficient instructions appear • Vector registers get wider
GPGPU • GPGPU stands for General-Purpose computation on Graphics Processing Units (GPUs). GPUs are high-performance many-core processors that can be used to accelerate a wide range of applications (www.gpgpu.org) [Photo: http://folding.stanford.edu/English/FAQ-NVIDIA]
GPU utilization • Many cores can be utilized for computation • GPUs became programmable – GPGPU • CUDA* • Problems • Each vendor has its own language • Tweaking is required to get performance • How can I run on both CPUs and GPUs?
What do we need? • Heterogeneous – automatically utilizes all available processing units • Portable • High performance – utilizes hardware characteristics • Future proof • Abstracts concurrency away from the user
OpenCL* – heterogeneous computing [Diagram based on deck presented in the OpenCL* BOF at SIGGRAPH 2010 by Neil Trevett, NVIDIA, OpenCL* Chair]
OpenCL* in a nutshell • An OpenCL* application consists of two parts: • Host code, using a set of C APIs for compiling and running OpenCL* "kernels" • Kernel code, executed on the device by the OpenCL* runtime
Data parallelism • A fundamental pattern in high-performance parallel algorithms • Applying the same computation logic across multiple data elements [Diagram: the same statement C[i] = A[i] * B[i] is executed for every index i = 0 … N-1, each instance independent of the others]
Data parallelism Usage • Client machines • Video transcoding and editing • Pro image editing • Facial recognition • Workstations • CAD tools • 3D data content creation • Servers • Science and simulations • Medical imaging • Oil & Gas • Finance (e.g., Black-Scholes) • …
OpenCL* kernel example

    // Serial C version
    void array_mul(int n, const float *a, const float *b, float *c) {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }

    // OpenCL* kernel: the loop is replaced by one work item per element
    __kernel void array_mul(__global const float *a,
                            __global const float *b,
                            __global float *c) {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }
OpenCL* kernel example

    __kernel void array_mul(__global const float *a,
                            __global const float *b,
                            __global float *c) {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }

[Diagram: each work item, indexed by get_global_id(0), reads one element of a and b and writes one element of c]
Execution Model [Diagram: the global index space is divided into work-groups; each work-group consists of work-items]
The OpenCL* model • The OpenCL* runtime is invoked on the host CPU (using the OpenCL* API) • Choose target device(s) for parallel computation • Data-parallel functions, called kernels, are compiled (on the host) • Compiled for specific target devices (CPU, GPU, etc.) • Data chunks (called buffers) are moved across devices • Kernel "commands" are queued for execution on target devices • Asynchronous execution • See the host-side sketch below
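A minimal host-side sketch of this flow, reusing the array_mul kernel from the earlier slide (a sketch, not a complete application: error handling and resource release are omitted, and the first platform/device found is simply assumed):

    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void array_mul(__global const float *a,\n"
        "                        __global const float *b,\n"
        "                        __global float *c) {\n"
        "    int id = get_global_id(0);\n"
        "    c[id] = a[id] * b[id];\n"
        "}\n";

    int main(void) {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { a[i] = (float)i; b[i] = 2.0f; }

        // 1. Pick a platform and a device (first of each, for brevity)
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

        // 2. Create a context and a command queue for the device
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);

        // 3. JIT-compile the kernel source for the chosen device
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "array_mul", NULL);

        // 4. Move data into device buffers
        cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(a), a, NULL);
        cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(b), b, NULL);
        cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof(c), NULL, NULL);

        // 5. Queue the kernel (asynchronous), then read the result back (blocking)
        clSetKernelArg(k, 0, sizeof(bufA), &bufA);
        clSetKernelArg(k, 1, sizeof(bufB), &bufB);
        clSetKernelArg(k, 2, sizeof(bufC), &bufC);
        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, bufC, CL_TRUE, 0, sizeof(c), c, 0, NULL, NULL);

        printf("c[1] = %f\n", c[1]);   // expect 2.0
        return 0;
    }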
The OpenCL* - C language • Derived from ISO C99 • A few restrictions, e.g., no recursion and no function pointers • Short vector types, e.g., float4, short2, int16 • Built-in functions • math (e.g., sin), geometric, common (e.g., min, clamp) • An illustrative kernel follows
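As a small illustration of the vector types and built-ins (the kernel name saturate_scale and its arguments are made up for this example):

    // One work item processes four floats at once via float4
    __kernel void saturate_scale(__global const float4 *in,
                                 __global float4 *out,
                                 float scale) {
        int id = get_global_id(0);
        float4 v = in[id] * scale;        // component-wise multiply
        out[id] = clamp(v, 0.0f, 1.0f);   // built-in: clamp each component
    }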
OpenCL* key features • Unified programming model for all devices • Develop once, run everywhere • Designed for massive data-parallelism • Implicitly takes care of threading and intrinsics for optimal performance
OpenCL* key features • Dynamic compilation model (Just In Time - JIT) • Future proof, provided vendors update their implementations • Enables heterogeneous computing • A clever application can use all resources of the platform simultaneously
Benefits to User • Hardware abstraction • write once, run everywhere • Cross devices, cross vendors • Automatic parallelization • Good tradeoff between development simplicity and performance • Future proof optimizations • Open standard • Supported by many vendors
Benefits to Hardware Vendor • Enables good hardware ‘time to market’ • Programming model enables good hardware utilization • Applications are automatically portable and future proof • JIT compilation
OpenCL* Cons • Low level – based on C99 • No heap! • Lean framework • An expert tool, in terms of both correctness and performance • OpenCL* is not performance portable • Tweaking is needed for each vendor • Future specs and implementations may remove the need for tweaking
Vector dot multiplication

    void vectorDotMul(int* vecA, int* vecB, int size, int* result) {
        *result = 0;
        for (int i = 0; i < size; ++i)
            *result += vecA[i] * vecB[i];
    }
Single work item [Diagram: one work item walks the vectors serially, multiplying 1 * 2 = 2 at each step and accumulating the running sums 2, 4, 6, 8, 10, 12, 14, 16]
Vector dot multiplication in OpenCL*

    __kernel void vectorDotMul(__global int* vecA, __global int* vecB,
                               int size, __global int* result) {
        // Only work item 0 does any work – the rest stay idle
        if (get_global_id(0) == 0) {
            *result = 0;
            for (int i = 0; i < size; ++i)
                *result += vecA[i] * vecB[i];
        }
    }
Single work group [Diagram: the work items of one work-group each multiply a chunk of the vectors in parallel; their partial sums are then combined into the final result 16]
    __kernel void vectorDotMul(__global int* vecA, __global int* vecB,
                               int size, __global int* result) {
        int id = get_local_id(0);
        __local volatile int partialSum[MAX_SIZE];
        int localSize = get_local_size(0);
        int work = size / localSize;
        int start = id * work;
        int end = start + work;

        // Work item calculation: each work item accumulates its own chunk
        partialSum[id] = 0;
        for (int j = start; j < end; ++j)
            partialSum[id] += vecA[j] * vecB[j];

        barrier(CLK_LOCAL_MEM_FENCE);

        // Reduction: work item 0 serially sums the partial results
        if (id == 0) {
            *result = 0;
            for (int i = 0; i < localSize; ++i)
                *result += partialSum[i];
        }
    }
Efficient reduction [Diagram: instead of a serial sum by work item 0, the partial sums are combined pairwise in a tree, halving the number of active work items at each step until one value remains]
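One common way to code this tree-shaped reduction is to replace the serial loop in the previous kernel's reduction phase with the following (a sketch, assuming the work-group size localSize is a power of two):

    // Tree reduction in local memory: at each step the first "stride" work
    // items add in the element stride positions away, so the number of
    // active work items halves every iteration
    for (int stride = localSize / 2; stride > 0; stride /= 2) {
        if (id < stride)
            partialSum[id] += partialSum[id + stride];
        barrier(CLK_LOCAL_MEM_FENCE);   // every work item must reach the barrier
    }
    if (id == 0)
        *result = partialSum[0];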
Vectorization • Processors provide vector units • SIMD on CPUs • Warps on GPUs • Utilize them to perform several operations in parallel • Arithmetic operations • Binary operations • Memory operations
Loop vectorization

    void mul(int size, float* a, float* b, float* c) {
        for (int i = 0; i < size; ++i) {
            c[i] = a[i] * b[i];
        }
    }
Loop vectorization

    void mul(int size, float* a, float* b, float* c) {
        for (int i = 0; i < size; i += 4) {
            c[i] = a[i] * b[i];
            c[i+1] = a[i+1] * b[i+1];
            c[i+2] = a[i+2] * b[i+2];
            c[i+3] = a[i+3] * b[i+3];
        }
    }
Loop vectorization

    void mul(int size, float* a, float* b, float* c) {
        for (int i = 0; i < size; i += 4) {
            __m128 a_vec = _mm_load_ps(a + i);
            __m128 b_vec = _mm_load_ps(b + i);
            __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
            _mm_store_ps(c + i, c_vec);
        }
    }
Automatic loop vectorization • Is there a dependency among a, b, and c?

    void mul(int size, float* a, float* b, float* c) {
        for (int i = 0; i < size; ++i) {
            c[i] = a[i] * b[i];
        }
    }
Automatic loop vectorization

    void mul(int size, float* a, float* b, float* c) {
        for (int i = 0; i < size; ++i) {
            c[i] = a[i] * b[i];
        }
    }

[Diagram: b and c may overlap in memory (aliasing), so a store to c[i] can change an element of b read in a later iteration]
Automatic loop vectorization

    void mul(int size, float* a, float* b, float* c) {
        for (int i = 0; i < size; i += 4) {
            c[i] = a[i] * b[i];
            c[i+1] = a[i+1] * b[i+1];
            c[i+2] = a[i+2] * b[i+2];
            c[i+3] = a[i+3] * b[i+3];
        }
    }

[Diagram: if b and c overlap, the unrolled loads read values before the corresponding stores happen, so vectorizing changes the program's result – the compiler cannot vectorize without proving there is no aliasing]
Automatic vectorization in OpenCL*

    __kernel void mul(int size, __global const float* a,
                      __global const float* b, __global float* c) {
        int id = get_global_id(0);
        c[id] = a[id] * b[id];
    }
Automatic vectorization in OpenCL* • The runtime effectively executes each work-group as a loop over its work items:

    for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
        c[id] = a[id] * b[id];
    }
Automatic vectorization in OpenCL*

    for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
        c[id] = a[id] * b[id];
        c[id+1] = a[id+1] * b[id+1];
        c[id+2] = a[id+2] * b[id+2];
        c[id+3] = a[id+3] * b[id+3];
    }
Automatic vectorization in OpenCL*

    for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
        __m128 a_vec = _mm_load_ps(a + id);
        __m128 b_vec = _mm_load_ps(b + id);
        __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
        _mm_store_ps(c + id, c_vec);
    }
Single work group (recap) [Diagram: each work item reads its own contiguous chunk, so in any given iteration neighboring work items access memory locations that are far apart – hard for the vectorizer to combine]
Vectorizer friendly [Diagram: work items access interleaved elements – in each loop iteration, consecutive work items touch consecutive memory locations, so their multiplications map directly onto vector lanes]
    __kernel void vectorDotMul(__global int* vecA, __global int* vecB,
                               int size, __global int* result) {
        int id = get_local_id(0);
        __local volatile int partialSum[MAX_SIZE];
        int localSize = get_local_size(0);

        // Work item calculation: interleaved accesses – work item id handles
        // elements id, id + localSize, id + 2*localSize, ...
        partialSum[id] = 0;
        for (int j = id; j < size; j += localSize)
            partialSum[id] += vecA[j] * vecB[j];

        barrier(CLK_LOCAL_MEM_FENCE);

        // Reduction: work item 0 serially sums the partial results
        if (id == 0) {
            *result = 0;
            for (int i = 0; i < localSize; ++i)
                *result += partialSum[i];
        }
    }
Predication

    __kernel void mul(int size, __global const float* a,
                      __global const float* b, __global float* c) {
        int id = get_global_id(0);
        if (id > 6) {
            c[id] = a[id] * b[id];
        } else {
            c[id] = a[id] + b[id];
        }
    }
Predication • How can we vectorize the loop?

    for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
        if (id > 6) {
            c[id] = a[id] * b[id];
        } else {
            c[id] = a[id] + b[id];
        }
    }
Predication • Replace the branch with a mask and a select:

    for (int id = workGroupIdStart; id < workGroupIdEnd; ++id) {
        bool mask = (id > 6);
        float c1 = a[id] * b[id];
        float c2 = a[id] + b[id];
        c[id] = mask ? c1 : c2;
    }
Predication

    for (int id = workGroupIdStart; id < workGroupIdEnd; id += 4) {
        __m128i idVec = _mm_setr_epi32(id, id+1, id+2, id+3);   // consecutive ids
        __m128 mask = _mm_castsi128_ps(_mm_cmpgt_epi32(idVec, _mm_set1_epi32(6)));
        __m128 a_vec = _mm_load_ps(a + id);
        __m128 b_vec = _mm_load_ps(b + id);
        __m128 c1_vec = _mm_mul_ps(a_vec, b_vec);
        __m128 c2_vec = _mm_add_ps(a_vec, b_vec);
        // blendv picks c1 (mul) in lanes where the mask is set, c2 (add) elsewhere
        __m128 c3_vec = _mm_blendv_ps(c2_vec, c1_vec, mask);
        _mm_store_ps(c + id, c3_vec);
    }
General tweaking • Consecutive memory accesses • SIMD, warp • How can we vectorize code with control flow? • Can we somehow create efficient code with control flow? • Uniform control flow is cheap • Control flow that diverges within a SIMD group is costly • Use enough work groups to utilize the machine • A sketch of uniform vs. divergent branching follows
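To illustrate the uniform vs. divergent distinction, here is a hypothetical pair of kernels (the names and the mode argument are assumptions for this example):

    // Divergent: the branch depends on the work-item id, so lanes within one
    // SIMD group take different paths and both sides must be executed (masked)
    __kernel void divergent(__global const float* a, __global const float* b,
                            __global float* c) {
        int id = get_global_id(0);
        if (id % 2 == 0)
            c[id] = a[id] * b[id];
        else
            c[id] = a[id] + b[id];
    }

    // Uniform: the branch depends only on a kernel argument, so every work
    // item takes the same path and a single code path per group suffices
    __kernel void uniform_cf(int mode, __global const float* a,
                             __global const float* b, __global float* c) {
        int id = get_global_id(0);
        if (mode == 0)
            c[id] = a[id] * b[id];
        else
            c[id] = a[id] + b[id];
    }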
Architecture tweaking • CPU • Locality • No local memory (local memory is also slow on some GPUs) • Give each work group enough compute • Overcome thread creation overhead • GPU • Use local memory • Avoid bank conflicts • A local-memory padding sketch follows
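As one example of the local-memory advice, here is a sketch of a tiled matrix transpose using the common +1 padding trick (the kernel name, tile size, and the assumption of a square matrix whose width is a multiple of TILE are all illustrative):

    #define TILE 16

    // Local tile with one element of padding per row: on many GPUs local
    // memory is split into banks, and without the +1 all work items reading
    // a column of the tile would hit the same bank and serialize
    __kernel void transpose_tile(__global const float *in, __global float *out,
                                 int width) {
        __local float tile[TILE][TILE + 1];
        int lx = get_local_id(0), ly = get_local_id(1);
        int gx = get_global_id(0), gy = get_global_id(1);

        tile[ly][lx] = in[gy * width + gx];       // coalesced global read
        barrier(CLK_LOCAL_MEM_FENCE);

        int tx = get_group_id(1) * TILE + lx;     // transposed block origin
        int ty = get_group_id(0) * TILE + ly;
        out[ty * width + tx] = tile[lx][ly];      // column read, banks staggered
    }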
Conclusion • OpenCL* is an open standard that lets developers: • Write the same code for any type of processor • Use all existing resources of a platform in their application • Automatic parallelism • OpenCL* applications are automatically portable and forward compatible • OpenCL* is still an expert tool • OpenCL* is not performance portable • Tweaking still has to be done for each vendor