GPU/CPU Work Sharing with Parallel Language XcalableMP-dev for Accelerated Computing

Tetsuya Odajima, Taisuke Boku, Toshihiro Hanawa, Jinpil Lee, Mitsuhisa Sato
Graduate School of Systems and Information Engineering / Center for Computational Sciences, University of Tsukuba, Japan

P2S2 2012
Outline
• Background & Purpose
• Base language: XcalableMP & XcalableMP-dev
• StarPU
• Implementation of XMP-dev/StarPU
• Preliminary performance evaluation
• Conclusion & Future work
Background
• GPGPU is widely used for HPC
  • Impact of NVIDIA CUDA, OpenCL, etc.
  • Programming on a single node has become easy
  → Many GPU clusters appear on the TOP500 list
• Problem of programming on GPU clusters
  • An inter-node programming model (such as MPI) must be combined with data management between CPU and GPU, so program sources become complex
  → Programmability and productivity are very low
• GPU is very powerful, but...
  • CPU performance has also been improving
  • We cannot neglect its performance
Purpose
• High-productivity programming on GPU clusters
• Utilization of both GPU and CPU resources
  • Work sharing of loop execution
• XcalableMP acceleration device extension (XMP-dev) [Nakao, et al., PGAS10] [Lee, et al., HeteroPar'11]
  • Parallel programming model for GPU clusters
  • Directive-based, low programming cost
• StarPU [INRIA Bordeaux]
  • Runtime system for GPU/CPU work sharing
Related Works
• PGI Fortran and HMPP
  • Source-to-source compilers for a single node with GPUs
  • Not designed for GPU/CPU co-working
• StarPU research
  • "StarPU: a unified platform for task scheduling on heterogeneous multicore architectures" [Augonnet, 2010]
  • "Faster, Cheaper, Better - a Hybridization Methodology to Develop Linear Algebra Software for GPUs" [Emmanuel, 2010]
  • Higher performance than GPU only
  → GPU/CPU work sharing is effective
XcalableMP (XMP)
• A PGAS language designed for distributed-memory systems
• Directive-based; easy to understand
  • Array distribution, inter-node communication, loop work sharing on CPUs, etc.
• Low programming cost
  • Little change from the original sequential program
• XMP is developed by a working group of many Japanese universities and HPC companies
XcalableMP acceleration device extension (XMP-dev)
• Developed at the HPCS laboratory, University of Tsukuba
• XMP-dev is an extension of XMP for accelerator-equipped clusters
• Additional directives to manage accelerators:
  • Mapping data onto the accelerator's device memory
  • Data transfer between CPU and accelerator
  • Work sharing of loops on the accelerator device (e.g., GPU cores)
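For reference, this is the plain sequential C program from which the XMP-dev example on the next slide is derived (assuming the loop body shown there, x[i] += y[i] * 10); XMP-dev parallelizes it by adding directives only, which illustrates the low programming cost:

int x[16], y[16];

int main(void)
{
    /* the original sequential loop; XMP-dev distributes it
       across nodes and offloads it to devices via directives */
    for (int i = 0; i < 16; i++)
        x[i] += y[i] * 10;
    return 0;
}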
Example of XMP-dev

int x[16], y[16];
#pragma xmp nodes p(2)
#pragma xmp template t(0:16-1)
#pragma xmp distribute t(BLOCK) onto p
#pragma xmp align [i] with t(i) :: x, y

int main() {
  …
#pragma xmp device replicate (x, y)
  {
#pragma xmp device replicate_sync in (x, y)
#pragma xmp device loop on t(i)
    for (i = 0; i < 16; i++)
      x[i] += y[i] * 10;
#pragma xmp device replicate_sync out (x)
  }
}

[Figure: x and y distributed across node1 and node2, each with HOST and DEVICE copies. The slide animation proceeds in five steps: (1) device replicate allocates x and y on each device, (2) replicate_sync in transfers data HOST → DEVICE, (3) device loop executes the loop on the devices, (4) replicate_sync out transfers x DEVICE → HOST, (5) leaving the replicate block frees the data on the devices.]
StarPU
• Developed by INRIA Bordeaux, France
• StarPU is a runtime system that
  • allocates and dispatches resources
  • schedules task execution dynamically
• All target data is recorded and managed in a data pool shared by all computation resources
  • This guarantees the coherence of data among multiple task executions
Example of StarPU

starpu_codelet cl = {
  .where     = STARPU_CPU | STARPU_CUDA,
  .cpu_func  = c_func,
  .cuda_func = g_func,
  .nbuffers  = 1 };

double x[N];
starpu_data_handle x_h;
starpu_vector_data_register(&x_h, 0, (uintptr_t)x, …);

struct starpu_data_filter f = {
  .filter_func = starpu_block_filter_func_vector,
  .nchildren   = NSLICEX, … };
starpu_data_partition(x_h, &f);

for (i = 0; i < NSLICEX; i++) {
  struct starpu_task *task = starpu_task_create();
  task->cl = &cl;
  task->buffers[0].handle = starpu_data_get_sub_data(x_h, 1, i);
  …
}

starpu_data_unpartition(x_h, 0);
starpu_data_unregister(x_h);

[Figure: animation showing the array x registered as handle x_h, partitioned into NSLICEX sub-arrays, the resulting tasks dispatched to CPU cores (CPU0, …) and the GPU (GPU0), and finally the data unpartitioned and unregistered.]
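The codelet functions c_func and g_func referenced above are not shown on the slide. Below is a minimal sketch of what they might look like, assuming the StarPU 1.x vector data interface; the element-wise scaling is a placeholder computation, not the benchmark code:

#include <starpu.h>

/* CPU implementation of the codelet: StarPU runs this on one CPU core per task. */
void c_func(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *v = (struct starpu_vector_interface *)buffers[0];
    double *ptr = (double *)STARPU_VECTOR_GET_PTR(v);
    unsigned n  = STARPU_VECTOR_GET_NX(v);
    for (unsigned i = 0; i < n; i++)
        ptr[i] *= 2.0;                      /* placeholder computation */
}

/* CUDA kernel and its host-side wrapper: StarPU runs the wrapper on the GPU worker. */
static __global__ void scale_kernel(double *ptr, unsigned n)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        ptr[i] *= 2.0;                      /* same placeholder computation */
}

void g_func(void *buffers[], void *cl_arg)
{
    struct starpu_vector_interface *v = (struct starpu_vector_interface *)buffers[0];
    double *ptr = (double *)STARPU_VECTOR_GET_PTR(v);
    unsigned n  = STARPU_VECTOR_GET_NX(v);
    scale_kernel<<<(n + 255) / 256, 256, 0, starpu_cuda_get_local_stream()>>>(ptr, n);
    cudaStreamSynchronize(starpu_cuda_get_local_stream());
}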
Implementation of XMP-dev/StarPU
• XMP-dev provides
  • inter-node communication
  • data distribution
• StarPU provides
  • data transfer between GPU and CPU
  • GPU/CPU work sharing within a single node
• We combine XMP-dev and StarPU to enhance functionality and performance
  • GPU/CPU work sharing on a multi-node GPU cluster
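Conceptually, the combined system lowers an XMP-dev device loop over the node-local portion of a distributed array into StarPU task submissions. The sketch below is illustrative only (assuming the StarPU 1.x API), not the actual XMP-dev/StarPU compiler output; loop_cl, x_h, and tc are assumed to be set up as in the StarPU example above:

#include <starpu.h>

/* Illustrative lowering: submit one StarPU task per partition of the
 * node-local array and let the scheduler place each task on a CPU
 * core or the GPU. */
void run_device_loop(struct starpu_codelet *loop_cl,
                     starpu_data_handle_t x_h, int tc)
{
    for (int t = 0; t < tc; t++) {
        struct starpu_task *task = starpu_task_create();
        task->cl = loop_cl;                 /* has both CPU and CUDA funcs */
        task->handles[0] = starpu_data_get_sub_data(x_h, 1, t);
        starpu_task_submit(task);           /* scheduler picks the device */
    }
    starpu_task_wait_for_all();             /* barrier before replicate_sync out */
}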
Task Model of XMP-dev/StarPU
• XMP-dev divides the global array among the nodes
• On each node, StarPU divides the local array further and allocates the resulting tasks to devices (CPU cores and the GPU)

[Figure: two nodes, each with a GPU and multiple CPU cores; the global array is split across node1 and node2 by XMP-dev, and within each node StarPU maps sub-array tasks onto the GPU and the CPU cores.]
Preliminary performance evaluation
• N-body (double precision)
  • Uses hand-compiled code
• XMP-dev/StarPU (GPU/CPU) vs. XMP-dev (GPU only)
• Number of particles: 16K or 32K
• Node specification: [Table: node specification]
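The slides do not list the benchmark code. As a rough illustration only, assuming a direct O(N²) force computation, each task might execute something like the following over its chunk [i_begin, i_end); all identifiers (px, fx, eps, ...) are hypothetical:

#include <stddef.h>
#include <math.h>

/* One task's share of the N-body force computation: particles in
 * [i_begin, i_end) interact with all n particles. */
void nbody_chunk(const double *px, const double *py, const double *pz,
                 const double *m, double *fx, double *fy, double *fz,
                 size_t i_begin, size_t i_end, size_t n, double eps)
{
    for (size_t i = i_begin; i < i_end; i++) {
        double ax = 0.0, ay = 0.0, az = 0.0;
        for (size_t j = 0; j < n; j++) {
            double dx = px[j] - px[i];
            double dy = py[j] - py[i];
            double dz = pz[j] - pz[i];
            double r2 = dx * dx + dy * dy + dz * dz + eps;  /* softening */
            double inv = 1.0 / (r2 * sqrt(r2));             /* 1 / r^3 */
            ax += m[j] * dx * inv;
            ay += m[j] * dy * inv;
            az += m[j] * dz * inv;
        }
        fx[i] = ax; fy[i] = ay; fz[i] = az;
    }
}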
Preliminary performance evaluation
• Assumption and condition
  • StarPU allocates one CPU core to each GPU for control and data management
  • So 15 CPU cores (16 - 1) contribute to computation on the CPU side
• Term definitions
  • Number of tasks (tc): number of partitions the data array is split into (each partition is mapped to a task)
  • Chunk size (N / tc): number of elements per task (e.g., N = 32K particles with tc = 128 tasks gives a chunk size of 256)
Relative performance of XMP-dev/StarPU to XMP-dev
• XMP-dev/StarPU performance is very low: only around 35% of XMP-dev (GPU only)
• The number of tasks and the chunk size affect performance in the hybrid environment
  → Several distinct patterns appear
The relation between number of tasks and chunk size
• With a sufficient number of tasks and a sufficient chunk size, the previous implementation achieves good performance
• With too few tasks or too small a chunk size, devices are left idling
  → It is difficult to keep a good balance between them with a fixed-size chunk

[Figure: CPU/GPU task timelines for a "Bad" case with idle periods and a "Good" case with balanced execution.]
The relation between number of tasks and chunk size
• For a fixed problem size, chunk size and freedom of task allocation trade off: enlarging the chunk reduces the number of tasks, and thus the scheduler's freedom, and vice versa
• We have to balance chunk size against freedom of task allocation
  → An aggressive solution is to provide different task sizes to CPU cores and the GPU
  → To find the task size for each device, calculate a ratio from GPU and CPU performance
Sustained Performance of GPU and CPU

[Figure: sustained performance of GPU-only and CPU-only execution for varying numbers of tasks and chunk sizes.]

• CPU performance improves up to the (32, 64) case, where the number of tasks is 128
• The GPU should be assigned a single task; with two or more tasks, the scheduling overhead becomes large
Optimized hybrid work sharing
• "CPU weight": the fraction of the array allocated to CPU cores
  • CPU cores process N × CPU weight elements; the GPU processes the rest

[Figure: the array of N elements split into a CPU portion of size N × CPU weight and a GPU portion holding the remainder.]
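A minimal sketch of how such a weighted split might be computed; the names cpu_weight, n_cpu_tasks, and split_t are illustrative assumptions, not part of XMP-dev/StarPU:

#include <stddef.h>

/* Hypothetical helper: split N elements between CPU cores and the GPU
 * according to a measured CPU weight (fraction of work for CPU cores). */
typedef struct {
    size_t cpu_elems;   /* total elements assigned to CPU cores */
    size_t gpu_elems;   /* total elements assigned to the GPU   */
    size_t cpu_chunk;   /* chunk (task) size for each CPU task  */
} split_t;

split_t split_work(size_t n, double cpu_weight, unsigned n_cpu_tasks)
{
    split_t s;
    s.cpu_elems = (size_t)(n * cpu_weight);
    s.gpu_elems = n - s.cpu_elems;            /* GPU gets the rest as one task */
    s.cpu_chunk = s.cpu_elems / n_cpu_tasks;  /* equal-size CPU tasks */
    return s;
}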
Improved Relative Performance of XMP-dev/StarPU vs XMP-dev

[Figure: relative performance for 16K and 32K particles.]

• With 32K particles, we achieve more than 1.4 times the performance of XMP-dev (GPU only)
  • The load balance between CPU and GPU is kept well in these cases
• With 16K particles, we achieve only about 1.05 times the performance
  • Performance is unstable; we have to analyze this factor
Conclusion & Future work
• We proposed a programming language and runtime system named XMP-dev/StarPU
  • GPU/CPU hybrid work sharing
• In the preliminary performance evaluation, we achieved up to 1.4 times better performance than XMP-dev
• Future work:
  • Completion of the XMP-dev/StarPU compiler implementation
  • Analysis of dynamic task size management for GPU/CPU load balancing
  • Examination of various applications on larger parallel systems
Thank you for your attention.
• Please ask me slowly
• Thank you!