300 likes | 311 Views
Explore the efficiency, programmability, and scalability of hybrid cluster systems leveraging both GPU and CPU accelerators. Dive into challenges, programming models, and MapReduce API for optimized performance. Discover insights from experiments on supercomputers and future research directions.
E N D
Co-processing SPMD Computation on GPUs and CPUs cluster Cluster 2013, Indianapolis, 9/24/2013 Hui Li Pervasive Technology Institute Indiana University Bloomington, IN 47408, USA
Acknowledgement FutureGrid project funded by National Science Foundation (NSF) under Grant No. 0910812 • Co Authors: Geoffrey Fox Gregor von Laszewski Arun Chauhan • Ph.D: Andrew Pangborn Jerome Mitchell Adnan Ozsoy Jong Choi 2
Outline • Introduction • Heterogeneous Computation • Programming models for GPU and CPU • System Design and Implementation • Heterogeneous MapReduce API • Scheduling Model and Implementation • Experiment Results • GEMV, C-means, and GMM • Conclusion
Hybrid Cluster with GPU and CPU (Tianhe-2, China) (Titan, USA) (Intel, Xeon Phi) Supercomputers with GPU and multicore CPU Accelerators used for scientific research (NVIDIA, Kepler) (AMD, Fusion)
Heterogeneous Computation • Why Heterogeneous? • CPU are not capable for exascale computation. • Energy and power are the first class constraint on exascale system • There are more overlap between Top 500 and Green 500 machines • How Heterogeneous? • Not only use heterogeneous devices separately, but make use of them together as a system. • Challenges for heterogeneous computing • Programmability • Efficiency • Scalability www.intel.com; www.green500.org
Programmability Challenges for Heterogeneous Devices • Programming Models on CPU • SIMD, MIMD, fast for threading code, • Low Latency, Modest parallelism, • OpenMP, Pthreads • Programming Models on GPU • SIMT, fast for vector code • High Throughput, Massive parallelism, • CUDA, OpenCL • Programming models on GPU and CPU. • Either use a hybrid programming model, or use a uniform programming model. • Difficult to exploit all computing units simultaneously. • OpenACC, Qilin, MAGMA Program on GPU and CPU
Code Samples SPMD for (int tid = 0;tid<num_threads;tid++){ if (pthread_create(NULL,NULL,RunPandaCPUMapThread, panda_cpu_task_info[tid])!=0) perror("Thread creation failed!\n"); }//for for (int tid = 0;tid<num_threads;tid++){ void *exitstat; if (pthread_join(d_g_state->panda_cpu_task[tid],&exitstat)!=0) perror("joining failed"); }//for SIMD void add(uint32_t *a, uint32_t *b, uint32_t *c, int n) { for(int i=0; i<n; i+=4) { //compute c[i], c[i+1], c[i+2], c[i+3] uint32×4_t a4 = vld1q_u32(a+i); uint32×4_t b4 = vld1q_u32(b+i); uint32×4_t c4 = vaddq_u32(a4,b4); vst1q_u32(c+i,c4); } } SIMT __global__ void add(float *a, float *b, float *c) { int i = blockIdx.x * blockDim.x + threadIdx.x; a[i]=b[i]+c[i]; //no loop! }
Heterogeneous MapReduce API • MapReduce • A successful paradigm consist of Map Reduce stages to run SPMD computation • Hide the implementation and optimization details of the underlying system from developers • Heterogeneous MapReduce • Consist GPU and CPU implementation of MapReduce API • Applications should deliver different performance on different types of hardware resources • Leave the programmer the flexibility to optimize the code for either CPU or GPU Heterogeneous MapReduce Based Scheme
Heterogeneous MapReduce API • Three different types of MapReduce interface, each of which consists of map, reduce, combiner, and compare functions • The runtime implementation use the C++ and CUDA as backend • CUDA device and host function can be the same or different
Scheduling SPMD on GPU and CPU • The runtime schedule C++/CUDA binary on GPU and CPU devices simultaneously • Difficulty • More complex than scheduling SPMD on homogeneous resources in order to have good efficiency • Workload balance, Task granularity • Solution • Use roofline model to unify capability of GPU and CPU • Extend roofline model so as to use multiple devices simultaneously • Apply synthesis approaches for this problem: 1)Two level scheduling strategy 2) Dynamical task scheduling
Roofline Model (1) • Performance is how well the kernel’s characteristics map to an architecture’s characteristics • Roofline model can describe the type of SPMD task and the capability of hardware device • Roofline model can integrate in-core performance, memory bandwidth into a single readily understandable performance figure • Metric of interest: • Architecture’s characteristics: Peak Performance of GPUand CPU, DRAM bandwidth, PCI-E bandwidth, Network bandwidth, Disk bandwidth. • kernel’s characteristics: Arithmetic Intensity
Roofline Model (2) O(1) O(N) O(log(N)) Arithmetic Intensity BLAS 1,2 PDEs FFTs CMeans BLAS 3
Extend Roofline Model to Use Multiple Devices Simultaneously • The peak performance is a continuous additive interval function of arithmetic density of app and a set of characteristics of hardware devices • Assign the proper workload to each device according to calculated peak performance of every hardware device Low Latency High Throughput 1000 Roofline Model 100 Unified peak performance for CPU and GPU in flops for target SPMD application
Equations to Calculate the Workload Distribution between GPU and CPU • Use the roofline model to calculate the workload distribution between GPU and CPU • Variable • Fc/Fg flop per second for target application on CPU/GPU, respectively • Ac/Ag flop per byte for target application on CPU/GPU, respectively • B_dram present the bandwidth of DRAM, • B_pcie present the bandwidth of PCI-E, • Pc/Pg present peak performance of CPU/GPU • In Eq(1), when Tg_p ~= Tc_p, Tgc gets the minimal value • In Eq(5), p present the proportion of workload assigned to CPU
Equations to Calculate the Workload Distribution between GPU and CPU • Equation (6)&(7) calculate Fg and Fc value so as to calculate the P value • Compute the value of Fg and Fc for three different situations • The equality is established when data transfer rate is equal to computation speed or when the computation speed reaches the peak performance, the roofline of performance • Equation (8) is the core result of the proposed analytical model, and it can be used to explain the work-load distribution among the CPUs and GPUs for various SPMD applications.
Synthesis Approach for Scheduling SPMD on GPU and CPU Cluster • Scheduling in heterogeneous environment is more complex than that in homogeneous environment • Two-level scheduling • First level schedules computation among nodes, the scheduler is located on master node. • Second level schedules tasks among devices on same node, the scheduler is located on worker node. • Other scheduling strategies • Dynamic task scheduling: GPU and CPU daemons dynamical poll tasks from worker node. • Funnel the tasks: use the GPU/CPU task daemon to manage the task scheduling on GPU/CPU devices.
Two Level Scheduling Strategy Worker Node #1 Worker Node #0 • First level scheduling: • Assume the computation capability of all the nodes are identical. • Split whole job into some tasks, where #tasks = 2*(#nodes) • Second level scheduling: • Split tasks into sub-tasks on each node • Work load distribution among GPU and CPU is calculated by using extended roofline model. Group 1 Group 1 Group 2 Group 2
Tasks Funnel: Use Task Daemon to Schedule Tasks on GPU and CPU • CPU Daemon • Eliminate the cost of creating and terminating thread for CPU tasks • #cpu_pthread = #cores -#gpu_pthread • GPU Daemon • GPU resource lacks of multi-task management in the OS level. • Maintain the GPU context among different streams or kernels invocation. • #gpu_pthread = #gpu Worker Node Slot 2 Slot 1 Task Tracker
Dynamic Task Scheduling and Task Granularity Worker Node Task Pool • There are the CPU task daemons and GPU task daemons on each worker node • CPU task scheduling • CPU daemon dynamical poll task from worker node. • Task granularity: #task = 2*(#cores) • GPU task scheduling • GPU daemon dynamical poll task from worker node • It is more complex to decide the task granularity for the GPUs. • Overlap data transfer with computation • Increase the arithmetic intensity so as to saturate the peak performance of GPU • The minimal task granularity is calculated by equation 10&11 Slot 3 Slot 1 Slot 2
Runtime Framework Parallel Runtime System Framework https://github.com/futuregrid/panda
C-means algorithm with CUDA • C-means algorithm with CUDA: • Configure: • Copy data from the CPU to GPU • Map function: • Calculate the distance matrix • Calculate the membership matrix • Update the centers kernel • Reduce function: • Aggregate the partial cluster centers and compute final cluster centers. • Compute the difference between the current cluster centers and previous iteration. • Main program: • The iteration will stop when the difference is smaller than predefined threshold or it will go to next iteration. • Compute the cluster distance and memberships using final centers.
C-means results: 1) Granularity, 2) Workload Balance, 3) Cache Static Data, 4) Performance Compare 1 2 3 4
GMM algorithm with CUDA • The expectation maximization using a mixture model approach takes the data set as sum of a mixture of multiple distinct events. • Gaussians mixtures from probabilistic models composed of multiple distinct Gaussians distributions as clusters. Each cluster ‘m’ within a D dimensional data set can be characterized by the following parameters GMM results with PRS Nm: the number of samples in the cluster πm : probability that a sample in data set belongs to the cluster μm : a D dimensional mean Rm: a DxD spectral covariance matrix
Performance results on GPUs and CPUs cluster (1) DGEMV (2) C-means (3) GMM weak scalability for GEMV, C-means, and GMM applications with up to 8 nodes on Delta. Y axis represent Gflops per node for each application. (1) GEMV, M=35000, N=10,000 per Node. (2) C-means, N=1000,000 per node , D=100, M=10. (3) GMM, N=100,000 per node, D=60, M=100. The red bard means only using GPUs as computation resources, while blue bar means using both GPUs and CPUs as computation resources
Performance results on GPUs and CPUs cluster • Hardware Configuration • Metrics: bandwidth of DRAM, PCI-E bus, peak performance of GPU and CPU • Uses nvcc 5.0 and gcc 4.4.6 • Workload Distribution • The work load distribution proportions, p values, between GPU and CPU are calculated by using Equation (8). • The error between p values calculated by using Equation (8) and the ones by application profiling is less than 10% • Applications with low arithmetic intensity on the CPU; applications with high arithmetic intensity on the GPU. Hardware Configuration Work Load Distribution among GPU and CPU for Three SPMD Applications using Proposed Model
Contributions and Future Work • Contributions: • We propose an analytic model that scheduling SPMD tasks on heterogeneous cluster and we implement the distributed parallel system based on this model • We extend roofline model so as to schedule computation on heterogeneous resources simultaneously • Future Work • Extend the proposed analytical model by considering the network bandwidth issue • Applying the analytical model to heterogeneous fat nodes
Thank You! Questions? 27
Heat Power restriction Transistor size CPU aren’t getting any faster
Parallel Programming Models on GPU Designed to be used for writing software in a wide variety of application domains on GPU Internal DSLs depends on a host language, whose syntax is both influenced and restricted by the host language. External DSLs can be built from the ground up. Focus on programmer productivity, and high performance with minimal programmer effort Directive-based extensions to existing high-level languages with user-controlled parallelism offload workload from CPU to accelerator, providing portability across operating systems, CPU and GPU
Tasks Funnel: use device daemon to manage GPU and CPU resources • CPU Daemon • Eliminate the cost of creating and terminating thread for CPU tasks • #cpu_pthread = #cores -#gpu_pthread • GPU Daemon • GPU resource lacks of multi-task management by the OS. • Maintain the GPU context among different streams or kernels invocation. • #gpu_pthread = #gpu Worker Node Slot 2 Slot 1 Task Tracker