Supporting Applications Involving Irregular Accesses and Recursive Control Flow on Emerging Parallel Environments • Xin Huo • Department of Computer Science and Engineering • The Ohio State University • Advisor: Prof. Gagan Agrawal
Motivation - Parallel Architectures
• Many-core Architectures: GPU vs. Xeon Phi
• Programming Model
  • GPU: CUDA/OpenCL
  • Xeon Phi: all x86-based programming models
• Running Method
  • GPU: offload method
  • Xeon Phi: offload or native method
• Memory Configuration
  • GPU: configurable shared memory + L2 cache
  • Xeon Phi: coherent L2 cache
• SIMT vs. SIMD
  • SIMT (GPU): supports non-contiguous and non-aligned accesses, and control dependencies, automatically
  • SIMD (Xeon Phi): no direct support for non-contiguous or non-aligned accesses, or for control/data dependencies
Motivation - Parallel Architectures
• Heterogeneous Architectures
  • CPU - Decoupled GPU: x86 CPU cores and GPU engine arrays, each with its own physical memory (host memory and GPU memory), connected over the PCIe bus
  • CPU - Coupled GPU: a Fusion APU, where the x86 CPU cores and GPU engine arrays share one physical memory through a common memory controller
Motivation - Application Patterns
• Irregular / Unstructured Reduction
  • A dwarf in the Berkeley view on parallel computing (Molecular Dynamics, Euler)
  • Challenges in parallelism: heavy data dependencies
  • Challenges in memory performance: indirect memory accesses result in poor data locality
• Recursive Control Flow
  • Conflicts between SIMD architecture and control dependencies
  • Recursion support: SSE (no), OpenCL (no), CUDA (yes)
• Generalized Reduction and Stencil Computation
  • Need to be reconsidered when harnessing thousands of threads
Thesis Work
• Parallel Architectures: GPU, CPU + GPU, APU, Xeon Phi
• Application Patterns: Generalized Reduction, Irregular Reduction, Stencil Computation, Recursive Applications
Thesis Work
• Approaches for Parallelizing Reductions on GPUs (HiPC 2010)
• Execution Strategies and Optimized Runtime Support for Parallelizing Irregular Reductions on Modern GPUs (ICS 2011)
• Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations (HiPC 2011)
• Runtime Support for Accelerating Applications on an Integrated CPU-GPU Architecture (SC 2012)
• Efficient Scheduling of Recursive Control Flow on GPUs (ICS 2013)
• A Programming System for Xeon Phis with Runtime SIMD Parallelization (ICS 2014)
Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion
SIMD - Single Instruction Multiple Data
• Scalar Loop
    for(i = 0; i < n; ++i)
        A[i] = A[i] + B[i];
• SIMD Loop (vector width 4; see the intrinsics sketch below)
    for(i = 0; i < n; i += 4)
        A[i:(i+4)] = A[i:(i+4)] + B[i:(i+4)];
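For concreteness, here is a minimal sketch of the same loop written with explicit 512-bit vector intrinsics (16 floats per instruction), as used on Xeon Phi-class hardware; it assumes n is a multiple of 16 and that A and B are 64-byte aligned.

    #include <immintrin.h>

    // Sketch: explicit SIMD version of the element-wise addition above.
    // Assumes n % 16 == 0 and 64-byte aligned arrays.
    void vec_add(float *A, const float *B, int n) {
        for (int i = 0; i < n; i += 16) {
            __m512 va = _mm512_load_ps(&A[i]);   // load 16 contiguous floats
            __m512 vb = _mm512_load_ps(&B[i]);
            _mm512_store_ps(&A[i], _mm512_add_ps(va, vb));
        }
    }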
GPU vs. Xeon Phi
• (Figure comparing hardware parallelism: 244 hardware threads on the Xeon Phi, i.e., 61 cores x 4 threads, vs. 3904 threads on the GPU)
Contributions
• MIMD and SIMD Programming System
• MIMD Parallelism
  • Pattern-knowledge-based parallel framework
  • Task partitioning & scheduling strategies
  • Data reorganization to help SIMD parallelism
  • Automatic MIMD API
• SIMD Parallelism
  • Automatic SIMD API
  • Control-flow dependency resolution
  • Data dependency resolution
MIMD Parallel System
• Task Definition: Task Type, Parameter Type, Kernel Function
• MIMD Parallel Framework: Configuration, Data Reorganization, Task Partitioner, Job Scheduler
• (Figure: a Task enters the framework; the Task Partitioner splits it into jobs, which the Job Scheduler dispatches)
MIMD Parallel System
• Task Partitioner
  • Computation Space Partitioning
    • Achieves good load balance
    • Needs data replication or locking to resolve data dependencies
    • Used for Generalized Reduction, Stencil Computation
  • Reduction Space Partitioning
    • No data dependencies between partitions
    • Introduces redundant computation and load imbalance, which can be reduced by appropriate partitioning and reordering methods
    • Used for Irregular Reduction
• Job Scheduler
  • Static, dynamic, or user-defined strategies
• API Interface (see the sketch below)
  • Easy interface
  • run(Task): registers a Task with the MIMD framework and starts execution
  • join(): blocks until execution finishes
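As an illustration of the run()/join() semantics, here is a minimal sketch, assuming a static computation-space partitioning over worker threads; the Task fields and the MIMDFramework class are illustrative assumptions, not the system's actual internals.

    #include <algorithm>
    #include <functional>
    #include <thread>
    #include <vector>

    // Sketch only: a Task carries an iteration range and a kernel over it.
    struct Task {
        int begin, end;                           // computation space
        std::function<void(int, int)> kernel;     // kernel over [lo, hi)
    };

    class MIMDFramework {                         // hypothetical name
        std::vector<std::thread> workers;
    public:
        // Statically partition the iteration space and launch the kernel.
        void run(const Task &t, int nthreads) {
            int chunk = (t.end - t.begin + nthreads - 1) / nthreads;
            for (int w = 0; w < nthreads; ++w) {
                int lo = t.begin + w * chunk;
                int hi = std::min(t.end, lo + chunk);
                if (lo < hi) workers.emplace_back(t.kernel, lo, hi);
            }
        }
        // Block until every launched job has finished.
        void join() {
            for (auto &th : workers) th.join();
            workers.clear();
        }
    };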
SIMD Data Type
• vint: a vector of integers, one value per SIMD lane (implementation sketch below)
  • vint v;           declaration
  • vint v = s;       broadcast: every lane of v receives the scalar s
  • vint v = v1;      vector copy: each lane of v receives the corresponding lane of v1 (v1-0, v1-1, v1-2, v1-3)
  • vint array[2];    arrays of vector values
  • int s = s1;       ordinary scalar assignment, for comparison
• (Figure: lanes value0..value3 illustrating scalar broadcast and lane-wise copy)
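One plausible way such a vint type could be realized over 512-bit integer intrinsics is sketched below; this is an illustration of the broadcast and copy semantics above, not the system's actual implementation.

    #include <immintrin.h>

    // Sketch: a vint wrapper with broadcast and lane-wise copy semantics,
    // built on 512-bit integer intrinsics (16 ints per vector).
    struct vint {
        __m512i v;
        vint() : v(_mm512_setzero_si512()) {}
        vint(int s) : v(_mm512_set1_epi32(s)) {}      // vint v = s: broadcast
        vint(const vint &o) = default;                // vint v = v1: lane copy
        vint &operator=(int s) { v = _mm512_set1_epi32(s); return *this; }
        vint operator+(const vint &o) const {         // lane-wise addition
            vint r; r.v = _mm512_add_epi32(v, o.v); return r;
        }
    };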
SIMD API
• Provides interfaces for both contiguous and non-contiguous memory accesses (see the sketch below)
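A sketch of the two access flavors the API has to cover, written with 512-bit-style intrinsics; the wrapper names are illustrative, not the system's API.

    #include <immintrin.h>

    // Contiguous access: a single aligned load of 16 adjacent floats.
    __m512 load_contiguous(const float *p) {
        return _mm512_load_ps(p);
    }

    // Non-contiguous access: gather base[idx[0..15]] through an index
    // vector; scale 4 = sizeof(float).
    __m512 load_indexed(const float *base, __m512i idx) {
        return _mm512_i32gather_ps(idx, base, 4);
    }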
Sample Kernels - Kmeans
• Sequential Code
    Input: float *data;
    int i; // the index of one node
    // Iterate over k clusters to compute the distance to each cluster
    for(int j = 0; j < k; ++j){
        float dis = 0.0;
        for(int m = 0; m < 3; ++m)
        {
            dis += (data[i*3+m]-cluster[j*3+m])*
                   (data[i*3+m]-cluster[j*3+m]);
        }
        dis = sqrt(dis);
    }
• SIMD API (after AoS-to-SoA reorganization, the m-th coordinate of node i sits at data[i + m*n], where n is the number of nodes)
    Input: vfloat *data;
    int i; // the index of one node
    // Iterate over k clusters to compute the distance to each cluster
    for(int j = 0; j < k; ++j){
        vfloat dis = 0.0;
        for(int m = 0; m < 3; ++m)
        {
            dis += (data[i+m*n]-cluster[j*3+m])*
                   (data[i+m*n]-cluster[j*3+m]);
        }
        dis = sqrt(dis);
    }
Sample Kernels - Sobel
• Sequential Code
    Input: (i, j) the index of one node, float b[N][N]
    float Dx = 0.0;
    float Dy = 0.0;
    // Compute the weight for a node in a 3x3 area
    for(int p = -1; p <= 1; p++){
        for(int q = -1; q <= 1; q++){
            Dx += wH[p+1][q+1]*b[i+p][j+q];
            Dy += wV[p+1][q+1]*b[i+p][j+q];
        }
    }
    float z = sqrt(Dx*Dx + Dy*Dy);
    a[i][j] = z;
• SIMD API (X() and Y() map neighbor indices into the transformed, non-linear data layout)
    Input: (i, j) the index of one node, vfloat b[N][N]
    vfloat Dx = 0.0;
    vfloat Dy = 0.0;
    // Compute the weight for a node in a 3x3 area
    for(int p = -1; p <= 1; p++){
        for(int q = -1; q <= 1; q++){
            Dx += wH[p+1][q+1]*b[X(i,p,q)][Y(j,p,q)];
            Dy += wV[p+1][q+1]*b[X(i,p,q)][Y(j,p,q)];
        }
    }
    vfloat z = sqrt(Dx*Dx + Dy*Dy);
    z.store(&a[i][j]);
Sample Kernels - Sobel
• Sequential code: same as on the previous slide
• Manual vectorization (IMCI intrinsics; wH and wV hold the filter weights pre-broadcast into __m512 vectors, and the b indices follow the transformed layout)
    Input: (i, j) the index of one node, float b[N][N]
    __m512 Dx = _mm512_set1_ps(0.0);
    __m512 Dy = _mm512_set1_ps(0.0);
    // Compute the weight for a node in a 3x3 area
    for(int p = -1; p <= 1; ++p){
        for(int q = -1; q <= 1; ++q){
            __m512 *tmp = (__m512*)&b[i+q][j+p*vec_width];
            __m512 tmpx = _mm512_mul_ps(*tmp, wH[p+1][q+1]);
            Dx = _mm512_add_ps(Dx, tmpx);
            __m512 tmpy = _mm512_mul_ps(*tmp, wV[p+1][q+1]);
            Dy = _mm512_add_ps(Dy, tmpy);
        }
    }
    __m512 sqDx = _mm512_mul_ps(Dx, Dx);
    __m512 sqDy = _mm512_mul_ps(Dy, Dy);
    __m512 ret = _mm512_add_ps(sqDx, sqDy);
    ret = _mm512_sqrt_ps(ret);
    _mm512_store_ps(&a[i][j], ret);
Data Layout Reorganization
• Why data reorganization?
  • Vectorization directly supports only contiguous, aligned memory accesses
  • Gather/scatter APIs can load and store non-contiguous and unaligned data
  • But they add more than 60% overhead and push extra programming effort onto users
Data Layout Reorganization
• Non-unit-stride memory access
  • AOS (Array of Structures) to SOA (Structure of Arrays); see the sketch below
  • Structure{x, y, z}
  • [x0, y0, z0, x1, y1, z1, ...] => [x0, x1, ..., y0, y1, ..., z0, z1, ...]
• Non-aligned memory access
  • Non-linear data layout transformation proposed in "Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures"
  • Provides functions to obtain the indices of the adjacent nodes after the transformation
• Irregular/indirect memory access
  • Edge reordering method proposed by Bin Ren
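A minimal sketch of the AoS-to-SoA step for the {x, y, z} example, assuming n elements; after the transformation, the m-th coordinate of node i lives at soa[m*n + i], which is exactly the data[i + m*n] indexing used in the Kmeans SIMD kernel earlier.

    // Sketch: AoS-to-SoA transformation for a 3-field structure.
    // [x0,y0,z0, x1,y1,z1, ...] -> [x0,x1,..., y0,y1,..., z0,z1,...]
    void aos_to_soa(const float *aos, float *soa, int n) {
        for (int i = 0; i < n; ++i)
            for (int f = 0; f < 3; ++f)       // field 0 = x, 1 = y, 2 = z
                soa[f * n + i] = aos[i * 3 + f];
    }

After this reorganization, a vector load of 16 consecutive nodes' x-coordinates becomes a single contiguous access instead of a stride-3 gather.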
Control Flow Resolution
• IMCI first introduces operations with mask parameters
  • mask: a bitset (1: active lanes, 0: inactive lanes)
  • old value: the default value for inactive lanes
• Example: scalar control flow
    if(v < 3)
        v = 0;
    else
        v = 1;
• Masked vector equivalent
    vint v = [0, 2, 4, 6];
    mask = v < 3;   // mask = [1, 1, 0, 0]: lanes Active, Active, Inactive, Inactive
    old = 1;
    v = 0;          // active lanes are set to 0, inactive lanes take old
                    // result: v = [0, 0, 1, 1]
• How do we pass the mask and old value to our operator-overloading API?
Mask Operation Implementation
• IMCI implementation
  • Mask operations have a different API from unmasked operations
  • _mm512_add_ps(v1, v2) vs. _mm512_mask_add_ps(old, mask, v1, v2)
• Our considerations
  • Operator-overloading functions do not support extra parameters
  • Minimize the complexity of the API
• Mask status is one kind of execution state
  • It exists during the whole thread life cycle - static
  • It is private to each thread - thread local
• Example
    IMCI style:
        vint v = [0, 2, 4, 6];
        mask = v < 3;
        old = 1;
        v = 0;
    Our API:
        vint v = [0, 2, 4, 6];
        MS::set_mask(v<3, 1);   // record mask and old value in thread-local state
        v = 0;
Mask Operation Implementation
• When should our API use mask operations?
  • Option 1: the API always uses mask operations
    • No effort for users
    • Introduces extra performance overhead
  • Option 2: the user decides when to use mask operations
    • Two vector types: an unmasked vector type and a masked vector type
    • Switch from the unmasked to the masked type by calling the mask() function
• Example (masked, lane-selective assignment; see the sketch below)
    vint v = [0, 2, 4, 6];
    MS::set_mask(v<3, 1);
    v.mask() = 0;   // only the active lanes are written
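To illustrate what MS::set_mask() and a masked assignment could compile down to, here is a minimal sketch using 512-bit mask intrinsics; the thread-local variables and function names are assumptions for illustration, not the system's actual internals.

    #include <immintrin.h>

    // Sketch: thread-local mask status, as motivated above (static
    // lifetime, private to each thread).
    static thread_local __mmask16 cur_mask = 0xFFFF;   // all lanes active
    static thread_local int       cur_old  = 0;        // value for inactive lanes

    // MS::set_mask(v < bound, old): record which lanes are active and
    // what inactive lanes should hold.
    void set_mask(__m512i v, int bound, int old_value) {
        cur_mask = _mm512_cmplt_epi32_mask(v, _mm512_set1_epi32(bound));
        cur_old  = old_value;
    }

    // "v.mask() = s": active lanes take s, inactive lanes take old.
    // For v = [0,2,4,6], bound 3, old 1, s 0, this yields [0,0,1,1].
    __m512i masked_assign(int s) {
        return _mm512_mask_mov_epi32(_mm512_set1_epi32(cur_old), cur_mask,
                                     _mm512_set1_epi32(s));
    }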
Data Dependency Resolution
• Write conflicts in reductions
  • Undefined behavior when different SIMD lanes update the same location
• Serialized reduction: apply each lane's update one at a time (usage sketch below)
    template<class ReducFunc>
    void reduction(int *update, int scale, int offset, vint index, vint value)
    {
        for(int i = 0; i < simd_width; ++i)
            ReducFunc()(update[index[i]*scale+offset], value[i]);
    }
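For instance, a sum functor plugged into this template might look as follows; the Sum name is an assumption for the example, not part of the system's API.

    // Sketch: an illustrative reduction functor for the template above.
    // Because the loop applies one lane at a time, lanes that collide on
    // the same element update it sequentially instead of racing.
    struct Sum {
        void operator()(int &target, int contrib) const { target += contrib; }
    };

    // Usage sketch:
    //   reduction<Sum>(histogram, /*scale=*/1, /*offset=*/0, index, value);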
Data Dependency Resolution
• Reorder reduction (sketch below)
  • Sort the lane updates by reduction target, so updates to the same element (A, B, C, D) become adjacent and can be combined before writing
• Group reduction
  • Combine adjacent updates per group in-register
  • Still needs gather and scatter for the non-contiguous final writes
• (Figure: lanes [A, A, A, B, B, B, C, D] reordered and grouped before one update each to elements A, B, C, D)
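A plain scalar sketch of the reorder-reduction idea, assuming the lane updates are given as (index, value) pairs; names and types are illustrative.

    #include <algorithm>
    #include <utility>

    // Sketch: sort lane updates by target index so duplicates become
    // adjacent, combine each run, then issue one write per distinct
    // element (the remaining non-contiguous accesses).
    void reorder_reduce(int *update, std::pair<int,int> *lanes, int width) {
        std::sort(lanes, lanes + width);          // group equal indices
        for (int i = 0; i < width; ) {
            int idx = lanes[i].first, acc = 0;
            for (; i < width && lanes[i].first == idx; ++i)
                acc += lanes[i].second;           // combine duplicate targets
            update[idx] += acc;                   // single write per element
        }
    }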
Experiment Evaluations
• Environment configuration
  • Xeon Phi SE10P with 61 cores at 1.1 GHz
  • 8 GB GDDR5 memory
  • Running in native mode
  • Intel ICC 13.1.0 with the -O3 option
• Benchmarks
  • Generalized reduction: Kmeans, NBC
  • Stencil computation: Sobel, Heat3D
• Versions compared
  • SIMD-API: our API
  • SIMD-Manual: hand-written intrinsics
  • Pthread-vec / Pthread-novec: pthreads compiled with the -vec / -no-vec option
  • OpenMP-vec / OpenMP-novec: OpenMP with the -vec option and #pragma vector always / with -no-vec
Speedup - Generalized Reduction
• SIMD-API achieves better performance than Pthread-novec and Pthread-vec
• SIMD-API introduces very little overhead compared to SIMD-Manual
• Kmeans: the performance of compiler vectorization depends on K
• NBC: a large number of divergences limits the performance of compiler vectorization
Speedup - Stencil Computation
• Unaligned accesses are the major challenge for vectorization
• The compiler can also auto-vectorize stencil computations under some conditions
• Sobel: compiler vectorization fails due to the complicated nested loops
• Heat3D: compiler vectorization achieves the best performance, very close to our API version
Comparison with OpenMP
• MIMD + SIMD parallelism
  • More than 3x speedup for most applications
• MIMD parallelism
  • Better performance due to the pattern-knowledge-based task partitioning and scheduling strategies
Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion
Irregular Reduction
• Irregular applications
  • Unstructured grid pattern
  • Random and indirect memory accesses
• Molecular Dynamics (see the sketch below)
  • Indirection array -> edges (interactions)
  • Reduction objects -> molecules (attributes)
  • Computation space -> interactions between molecules
  • Reduction space -> attributes of molecules
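To make the pattern concrete, here is a minimal sketch of the canonical edge loop; the array names are illustrative, not taken from the thesis benchmarks.

    // Sketch of an irregular reduction: the indirection (edge) array
    // selects which reduction elements (molecules) each interaction
    // updates, so the memory accesses are indirect and data-dependent.
    void compute_forces(const int (*edge)[2], const float *contrib,
                        float *force, int num_edges) {
        for (int e = 0; e < num_edges; ++e) {
            int a = edge[e][0], b = edge[e][1];  // indirect accesses
            force[a] += contrib[e];              // reduction on molecule a
            force[b] -= contrib[e];              // equal and opposite on b
        }
    }

Parallelizing over edges makes the += updates race whenever two edges share a node; that is exactly the data dependence the partitioning strategies below address.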
Main Issues
• Traditional strategies are not effective
  • Full Replication (private copy per thread)
    • Large memory overhead
    • Requires both intra-block and inter-block combination
    • Shared memory usage is unlikely
  • Locking Scheme (private copy per block)
    • Heavy conflicts within a block
    • Avoids intra-block combination, but not inter-block combination
    • Shared memory is only available for small data sets
• Need to choose a partitioning strategy
  • Ensure data fits into shared memory
  • Choice of partitioning space (computation vs. reduction)
  • Trade-offs: partitioning overhead vs. execution efficiency
Contributions
• A novel Partitioning-Based Locking strategy
  • Efficient shared memory utilization
  • Eliminates both intra- and inter-block combination
• Optimized runtime support
  • Multi-dimensional partitioning method
  • Reordering and updating components for correctness and memory performance
• Significant performance improvements
  • Exhaustive evaluation
  • Up to 3.3x improvement over traditional strategies
Choice of Partitioning Space
• Two partitioning choices:
  • Computation Space: partition on edges
  • Reduction Space: partition on nodes
Computation Space Partitioning
• Partitioning on the iterations of the computation loop
• Pros:
  • Load balance on computation
• Cons:
  • Unequal reduction size in each partition
  • Replicated reduction elements (4 out of 16 nodes are replicated in the example)
  • Combination cost
  • Shared memory is infeasible
• (Figure: a 16-node example mesh split by edges into Partitions 1-4)
Reduction Space Partitioning
• Partitioning on the reduction elements
• Pros:
  • Balanced reduction space
  • Independence between any two partitions
  • Avoids combination cost
  • Shared memory is feasible
• Cons:
  • Imbalance in the computation space
  • Replicated work caused by the crossing edges
• (Figure: the same 16-node mesh split by nodes into Partitions 1-4)
Reduction Space Partitioning - Challenges
• Unbalanced and replicated computation
  • The partitioning method must balance cost and efficiency
  • Cost: execution time of the partitioning method
  • Efficiency: reducing the number of crossing edges (replicated work)
• Maintaining correctness on the GPU
  • Reorder the reduction space
  • Update/reorder the computation space
Runtime Partitioning Approaches
• Metis Partitioning (multi-level k-way partitioning; see the sketch below)
  • Executes sequentially on the CPU
  • Minimizes crossing edges
  • Cons: high cost, with large overhead for data initialization
• GPU-based (Trivial) Partitioning
  • Parallel execution on the GPU
  • Minimizes partitioning time
  • Cons: low efficiency, with a large number of crossing edges among partitions
• Multi-dimensional Partitioning (uses coordinate information)
  • Executes sequentially on the CPU
  • Balances cost and efficiency
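For reference, a hedged sketch of how the Metis option maps onto the METIS 5 k-way interface; the slides do not show the exact call used, and default weights and options are assumed here.

    #include <metis.h>

    // Sketch: partition the reduction-space graph (CSR form xadj/adjncy)
    // into nparts partitions, minimizing the edge cut (crossing edges).
    // part[i] receives the partition id of reduction element i.
    void metis_partition(idx_t nvtxs, idx_t *xadj, idx_t *adjncy,
                         idx_t nparts, idx_t *part) {
        idx_t ncon = 1;       // one balance constraint
        idx_t edgecut;        // resulting number of crossing edges
        METIS_PartGraphKway(&nvtxs, &ncon, xadj, adjncy,
                            NULL, NULL, NULL,   // unit vertex/edge weights
                            &nparts, NULL, NULL, NULL,
                            &edgecut, part);
    }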
Performance Gains
• Euler: comparison between Partitioning-Based Locking (PBL), Locking, Full Replication, and sequential CPU time
• Molecular Dynamics: comparison between Partitioning-Based Locking (PBL), Locking, Full Replication, and sequential CPU time
Outline • Latest work • A Programming System for Xeon Phis with Runtime SIMD Parallelization • Other work • Strategy and runtime support for irregular reductions on GPUs • Improve parallelism of SIMD for recursive applications on GPUs • Different strategies for generalized reduction on GPU • Task scheduling frameworks • Decoupled GPU + CPU • Coupled GPU + CPU • Conclusion
Current Recursion Support on Modern GPUs
• Problem definition: intra-warp thread scheduling
  • Each thread in a warp executes an independent recursive function call
  • SIMT (Single Instruction Multiple Thread): different from traditional SSE extensions
    • Each thread owns a stack for function calls
    • Branch control: divergent branches execute serially
• Recursion support on current GPUs
  • AMD GPUs and the OpenCL programming model: no recursion support
  • NVIDIA GPUs: recursion supported since compute capability 2.0 and SDK 3.1
    • Performance is limited by the reconvergence method: immediate post-dominator reconvergence
Immediate Post-Dominator Reconvergence in Recursion
    /* General Branch */
    Fib(n)
    {
        if(n < 2)
        {
            return 1;
        }
        else {
            x = Fib(n-1);
            y = Fib(n-2);
            return x+y;
        }
    }
• Reconvergence can only happen at the same recursion level
• Threads with the shorter branch cannot return until the threads with the longer branch come back to the reconvergence point
• (Figure: threads T0 and T1 diverging across recursion levels)
Contributions
• Dynamic reconvergence method for recursion
  • Reconvergence can happen before or after the immediate post-dominator
• Two implementations
  • Dynamic greedy reconvergence
  • Dynamic majority reconvergence
• Significant performance improvements
  • Exhaustive evaluation on six recursion benchmarks with different characteristics
  • Up to 6x improvement over the immediate post-dominator method
Dynamic Reconvergence Mechanisms
• (Figure: threads T0 and T1 reaching a divergence point)