590 likes | 770 Views
Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments. Xin Huo Department of Computer Science and Engineering The Ohio State University Advisor: Prof. Gagan Agrawal. Motivation - Parallel Architectures. Many-core Architecture.
E N D
Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments • Xin Huo • Department of Computer Science and Engineering • The Ohio State University • Advisor: Prof. Gagan Agrawal
Motivation - Parallel Architectures • Many-core Architecture • Programming Model • CUDA/OpenCL • All x86 based programming models • Running Method • Offload method • Offload/Native method • Memory Configuration • Configurable shared memory + L2 cache • Coherent L2 cache • SIMT vs. SIMD • Support non-continuous and non-aligned accesses, and control dependencies automatically • Not direct support for non-continuous, non-aligned accesses, control/data dependencies
Fusion APU • Host Memory • GPU Memory Motivation - Parallel Architectures • x86 CPU CORES • GPU Engine Arrays • Heterogeneous Architecture • CPU - Decoupled GPU • CPU - Coupled GPU • PCIe bus • Memory • Controller • Physical Memory • Physical Memory
Motivation - Application Patterns • Irregular / Unstructured Reduction • A dwarf in Berkeley view on parallel computing (Molecular Dynamics and Euler) • Challenges in Parallelism • Heavy data dependencies • Challenges in Memory Performance • Indirect memory accesses result in poor data locality • Recursive Control Flow • Conflicts between SIMD architecture and control dependencies • Recursion Support: SSE (No), OpenCL (No), CUDA(Yes) • Generalized Reduction and Stencil Computation • Need to be reconsidered when harness thousands of threads
Thesis Work • Parallel • Architectures • Application • Patterns • GPU • Generalized Reduction • Irregular Reduction • CPU + GPU • Stencil Computation • APU • Recursive Application • Xeon Phi
Thesis Work • Parallel • Architectures • Application • Patterns • Approaches for Parallelizing Reductions on GPUs (HiPC 2010) • An Execution Strategies and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011) • Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011) • Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture (SC 2012) • Efficient scheduling of recursive control flow on GPUs (ICS 2013) • A Programming System for Xeon Phis with Runtime SIMD Parallelization (ICS2014) • GPU • Generalized Reduction • Irregular Reduction • CPU + GPU • Stencil Computation • APU • Recursive Application • Xeon Phi
Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion
for(i = 0; i < n; i += 4) • A[i:(i+4)] = A[i:(i+4)] + B[i:(i+4)]; • for(i = 0; i < n; ++i) • A[i] = A[i] + B[i]; SIMD - Single Instruction Multiple Data • Scalar Loop • SIMD Loop
GPU vs. Xeon Phi • 244 • 3904
Contributions • MIMD and SIMD Programming System • MIMD Parallelism • Pattern knowledge based parallel framework • Task Partitioning & Scheduling Strategies • Data reorganization to help SIMD parallelism • Automatic MIMD API • SIMD Parallelism • Automatic SIMD API • Control flow dependency resolving • Data dependency resolving
job • job • job • job • job MIMD Parallel System • Task Definition • MIMD Parallel Framework • Configuration • Task Type • Task • Parameter Type • Kernel Function • Data Reorganization • Task Partitioner • job • Job Scheduler
MIMD Parallel System • Task Partitioner • Computation Space Partitioning • Achieve good load balance • Need data replication or locking to resolve data dependency • Generalized Reduction, Stencil Computation • Reduction Space Partitioning • No data dependency between partitions • Introduce computation redundancy and load imbalance - can be reduced by appropriate partitioning and reorder method • Irregular Reduction • Job Scheduler • Static, dynamic, or user defined strategies • API Interface • Easy interface • run(Task) : Register Task to MIMD framework, and start to run • join() : Block and wait execution finish
value0 • s • v0-0 • v1-0 • v0-1 • s • value1 • v1-1 • v0-2 • value • s1 • v0-3 • v1-0 • s • value2 • v1-2 • v1-1 • v1-2 • v1-3 • value3 • s • v1-3 • int s = s1 • int s; • vint array[2]; • vint v = s; SIMD Data Type • vint v = v1; • vint v; • 
continuous memory access • non-continuous memory access SIMD API
Input: • float *data; • int i; //the index of one node • //Iterate k clusters to compute the distance to each cluster • for(int j = 0; j < k; ++j){ • float dis = 0.0; • for(int m = 0; m < 3; ++m) • { • dis += (data[i*3+m]-cluster[j*3+m])* • (data[i*3+m]-cluster[j*3+m]); • } • dis = sqrt(dis); • } • Input: • vfloat *data; • int i; //the index of one node • //Iterate k clusters to compute the distance to each cluster • for(int j = 0; j < k; ++j){ • vfloat dis = 0.0; • for(int m = 0; m < 3; ++m) • { • dis += (data[i+m*n]-cluster[j*3+m])* • (data[i+m*n]-cluster[j*3+m]); • } • dis = sqrt(dis); • } Sample Kernels - Kmeans • Sequential Codes • SIMD API
Input : (i, j) the index of one node, vfloat b[N][N] • vfloat Dx = 0.0; • vfloat Dy = 0.0; • //Compute the weight for a node in a 3x3 area • for(int p = -1; p <= 1; p++){ • for(int q = -1; q <= 1; q++){ • Dx+=wH[p+1][q+1]*b[X(i,p,q)][Y(j,p,q)]; • Dy += wV[p+1][q+1]*b[X(i,p,q)][Y(j,p,q)]; • } • } • vfloat z = sqrt(Dx*Dx + Dy*Dy); • z.store(&a[i][j]); • Input : (i, j) the index of one node, float b[N][N] • float Dx = 0.0; • float Dy = 0.0; • //Compute the weight for a node in a 3x3 area • for(int p = -1; p <= 1; p++){ • for(int q = -1; q <= 1; q++){ • Dx += wH[p+1][q+1]*b[i+p][j+q]; • Dy += wV[p+1][q+1]*b[i+p][j+q]; • } • } • float z = sqrt(Dx*Dx + Dy*Dy); • a[i][j] = z; Sample Kernels - Sobel • Sequential Codes • SIMD API
Input : (i, j) the index of one node, float b[N][N] • __m512 Dx = _mm_set1_ps(0.0); • __m512 Dy = _mm_set1_ps(0.0); • //Compute the weight for a node in a 3x3 area • for(int p = -1; p <=1; ++p){ • for(int q = -1; q <=1; ++q){ • __m512 *tmp = (__m512*)&b[i+q][j+p*vec_width]; • __m512 tmpx = _mm512_mul_ps(*tmp, wH[p+1][q+1]); • Dx = _mm512_add_ps(Dx, tmpx); • __m512 tmpy = _mm512_mul_ps(*tmp, wV[p+1][q+1]); • Dy = _mm512_add_ps(Dy, tmpy); • } • } • __m512 sqDX = _mm512_mul_ps(Dx, Dx); • __m512 sqDy = _mm512_mul_ps(Dy, Dy); • __m512 ret = _mm512_add_ps(sqDx, sqDy); • ret = _mm512_sqrt_ps(ret); • _mm512_store_ps(&a[i][j], ret); • Input : (i, j) the index of one node, float b[N][N] • float Dx = 0.0; • float Dy = 0.0; • //Compute the weight for a node in a 3x3 area • for(int p = -1; p <= 1; p++){ • for(int q = -1; q <= 1; q++){ • Dx += wH[p+1][q+1]*b[i+p][j+q]; • Dy += wV[p+1][q+1]*b[i+p][j+q]; • } • } • float z = sqrt(Dx*Dx + Dy*Dy); • a[i][j] = z; Sample Kernels - Sobel • Sequential codes • Manual vectorization codes
Data Layout Reorganization • Whydata reorganization? • Vectorization only support continuous and aligned memory access • Gather/Scatter APIs to load and store non-continuous and unaligned data • Increase more than 60% overhead and introduce programming efforts to users
Data Layout Reorganization • Non-unit-stride memory access • AOS (Array of Structure) to SOA (Structure of Array) • Structure{x, y, z} • [x0, y0, z0, x1, y1, z1,…] => [x0, x1, …, y0, y1, …, z0, z1, …] • Non-aligned memory access • Non-linear data layout transformation proposed in “Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures” • Provide functions to gain the indices of the adjacent nodes after transformation • Irregular/Indirect memory access • Edge reordering method proposed by Bin Ren
Active • Active • Inactive • Inactive • vint v = [0, 2, 4, 6]; • mask = v < 3; • old = 1; • v = 0; • if(v < 3) • v = 0; • else • v = 1; Control Flow Resolution • IMCI first introduces operations with mask parameters • mask: bitset (1: active lanes, 0: inactive lanes) • old value: the default value for inactive lanes • mask = [1, 1, 0, 0] • How to pass mask and • old value to our • operator overload API? • 0 • 0 • 1 • 1
Mask Operation Implementation • IMCI implementation • Mask operations has different API from unmask operations • _mm512_add_ps(v1, v2) and _mm512_mask_add_ps(old, mask, v1, v2) • Our Considerations • Operator overload functions do not support extra parameters • Minimize the complexity of API • Mask Status is one kind of status in execution • It exist during the whole thread life cycle - static • It is private to each thread - thread local
vint v = [0, 2, 4, 6]; • MS::set_mask(v<3, 1); • v = 0; • if(v < 3) • v = 0; • else • v = 1; • vint v = [0, 2, 4, 6]; • mask = v < 3; • old = 1; • v = 0; Mask Operation Implementation • IMCI implementation • Mask operations has different API from unmask operations • _mm512_add_ps(v1, v2) and _mm512_mask_add_ps(old, mask, v1, v2) • Our Considerations • Operator overload functions do not support extra parameters • Minimize the complexity of API • Mask Status is one kind of status in execution • It exist during the whole thread life cycle - static • It is private to each thread - thread local
Mask Operation Implementation • When our API should use mask operation? • API always use mask operation • No effort for users • Introduce extra performance overhead • User decide when to use mask operation • Two vector types: unmask vector type and mask vector type • Change from unmask to mask type by calling mask() function
vint v = [0, 2, 4, 6]; • MS::set_mask(v<3, 1); • v.mask() = 0; • vint v = [0, 2, 4, 6]; • MS::set_mask(v<3, 1); • v = 0; • if(v < 3) • v = 0; • else • v = 1; • vint v = [0, 2, 4, 6]; • mask = v < 3; • old = 1; • v = 0; Mask Operation Implementation • When our API should use mask operation? • API always use mask operation • No effort for users • Introduce extra performance overhead • User decide when to use mask operation • Two vector types: unmask vector type and mask vector type • Change from unmask to mask type by calling mask() function
Data Dependency Resolution • Writing conflits in reduction • Undefined behavior when different SIMD lanes update the same location • Serialized reduction • template<class ReducFunc = func> • void reduction(int *update, int scale, int offset, vint *index, type value) • for(int i = 0; i < simd_width; ++i){ • ReducFunc()(update[index[i]*scale+offset], value[i];) • }
Data Dependency Resolution • Reorder reduction • A • A • A • B • B • B • C • D • A • B • C • D • A • B • Group reduction • Still need gather and scatter for • non-continuous memory access • A • A • B • B • C • D • A • B • A • B • C • D
Experiment Evaluations • Environment configuration • Xeon Phi SE 10P with 61 cores at 1.1 GHz • 8GB DDR5 memory • Running in native model • Intel ICC 13.1.0 with -O3 option • Benchmarks • Generalized reduction: Kmeans, NBC • Stencil computation: Sobel, Heat3D • Experiment Goals • SIMD-API • SIMD-Manual • Pthread-vec: -vec option • Pthread-novec: -no-vec option • OpenMP-vec: -vec option with #pragma vector always • OpenMP-novec: -no-vec option
Speedup - Generalized Reduction • SIMD-API achieves better performance compared to Pthread-nonvec and Pthread-vec • SIMD-API introduces very small overhead compared to SIMD-Manual • Kmeans: the performance of compiler vectorization dependents on K • NBC: large number of divergences limit the performance of compiler vectorization
Speedup - Stencil Computation • Unaligned access is the major challenge for vectorization • Compiler can also do auto-vectorization for stencil computation in some conditions • Sobel: Compiler vectorization fails due to the complicate nest loop • Heat3D: Compiler vectorization achieves the best performance, which is very similar to our API version
Comparison with OpenMP • MIMD + SIMD Parallelism • More than 3 times speedups for most applications • MIMD Parallelism • Gain better performance due to the pattern knowledge based task partitioning and scheduling strategies.
Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion
Irregular Reduction • Irregular Applications • Unstructured grid pattern • Random and indirect memory accesses • Molecular Dynamics • Indirection Array -> Edges (Interactions) • Reduction Objects -> Molecules (Attributes) • Computation Space -> Interactions b/w molecules • Reduction Space -> Attributes of Molecules
Main Issues • Traditional Strategies are not effective • Full Replication (Private copy per thread) • Large memory overhead • Both intra-block and inter-block combination • Shared memory usage unlikely • Locking Scheme (Private copy per block) • Heavy conflicts within a block • Avoid intra-block combination, but not inter-block combination • Shared memory is only available for small data sets • Need to choose Partitioning Strategy • Make sure data can be put into shared memory • Choice of partitioning space (Computation VS. Reduction) • Tradeoffs: Partitioning overhead & Execution efficiency
Contributions • A Novel Partitioning-based Locking Strategy • Efficient shared memory utilization • Eliminate both intra and inter-block combination • Optimized Runtime Support • Multi-Dimensional Partitioning Method • Reordering & Updating components for correctness and memory performance • Significant Performance Improvements • Exhaustive evaluation • Up to 3.3x improvement over traditional strategies
Choice of Partitioning Space • Two partitioning choices: • Computation Space • Partition on edges • Reduction Space • Partition on nodes
5 Computation Space Partitioning • 1 • 3 • Pros: • Load Balance on Computation • Cons: • Unequal reduction size in each partition • Replicated reduction elements (4 out of 16 nodes are replicated) • Combination cost • 2 • Partitioning on the iterations of computation loop • 4 • Partition 1 • 8 • 6 • 5 • 12 • 16 • 2 • 4 • 7 • 9 • 11 • 13 • 10 • 16 • 6 • 12 • Partition 2 • 14 • 7 • 4 • 5 • Shared memory is infeasible • Partition 3 • Partition 4
5 Reduction Space Partitioning • 1 • 3 • Pros: • Balanced reduction space • Independent between each two partitions • Avoid combination cost • Shared memory is feasible • Cons: • Imbalance on computation space • Replicated work caused by the crossing edges • 2 • Partitioning on the Reduction Elements • 4 • 8 • Partition 1 • 6 • 12 • 5 • 16 • 7 • 9 • 7 • 11 • 13 • 10 • 16 • Partition 2 • 14 • Partition 3 • Partition 4
Reduction Space Partitioning - Challenges • Unbalanced & Replicated Computation • Partitioning method can achieve balance between Cost and Efficiency • Cost: Execution time of partitioning method • Efficiency: Reduce number of crossing edges (Replicated work) • Maintain correctness on GPU • Reorder reduction space • Update/Reorder computation space
Runtime Partitioning Approaches • Metis Partitioning (Multi-level k-way Partitioning) • Execute sequentially on CPU • Minimizes crossing edges • Cons: Large overhead for data initialization • GPU-based (Trivial) Partitioning • Parallel execution on GPU • Minimize execution time • Cons: Large number of crossing edges among partitions • Multi-dimensional Partitioning (Coordinate Information) • Execute sequentially on CPU • Balance between cost and efficiency • High Cost • Low Efficiency
Performance Gains • Euler:Comparison between Partitioning-based Locking (PBL) , Locking, Full Replication, and Sequential CPU time Molecular Dynamics:Comparison between Partitioning-based Locking (PBL) , Locking, Full Replication, and Sequential CPU time
Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion
Current Recursion Support on Modern GPUs • Problem definition: intra-warp thread scheduling • Each thread in a warp executes an independent recursive function call • SIMT (Single Instruction Multiple Thread): different from the traditional SSE extensions • Each thread owns a stack for function call • Branches control: different branches execute in serial • Recursion support on current GPUs • AMD GPUs and OpenCL programming model • Not support recursion • NVIDIA GPUs • Support recursion from computing capability 2.0 and SDK 3.1 • Performance is limited by the reconvergence method — immediate post dominator reconvergence
/* General Branch */ • Fib(n) • { • if(n < 2) • { • return 1; • } • else { • x = Fib(n-1); • y = Fib(n-2); • return x+y; • } • } Immediate post-dominator Reconvergence in Recursion • T0 • T1 • Reconvergence can only happen on the same recursion level • Threads with shorter branch cannot return until the threads with longer branch coming back to the reconvergence point
Contributions • Dynamic reconvergence method for recursion • Reconvergence can happen before or after immediate post-dominator • Two kinds of implementations • Dynamic greedy reconvergence • Dynamic majority reconvergence • Significant performance improvements • Exhaustive evaluation on six recursion benchmarks with different characteristics • Up to 6x improvement over immediate post-dominator method
Dynamic Reconvergence Mechanisms • T0 • Divergence • T1