Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems
Janghaeng Lee, Mehrzad Samadi, Yongjun Park, and Scott Mahlke
University of Michigan - Ann Arbor
Heterogeneity
• Computer systems have become heterogeneous
  • Laptops, servers
  • Mobile devices
  • GPUs integrated with CPUs
• Discrete GPUs added for
  • Gaming
  • Supercomputing
• Co-processors for massive data-parallel workloads
  • Intel Xeon Phi
Example of Heterogeneous System
[Figure: a host with a multi-core CPU (4-16 cores), integrated GPU cores (> 16 cores), a shared L3, and a memory controller attached to host memory (< 20 GB/s), connected over PCIe (< 16 GB/s) to external GPUs: faster GPUs (< 300 cores, > 150 GB/s global memory) and slower GPUs (< 100 cores, > 100 GB/s global memory).]
Typical Heterogeneous Execution
[Figure: execution timeline. The CPU runs the sequential code and transfers the input, the kernel runs on GPU 1 alone while the CPU and GPU 0 sit idle, and finally the output is transferred back.]
Collaborative Heterogeneous Execution
[Figure: execution timeline. The CPU runs the sequential code and transfers the input, then the kernel runs on the CPU, GPU 0, and GPU 1 simultaneously; the partial outputs are merged and transferred back, yielding a speedup over the single-device timeline.]
• 3 issues
  • How to transform the kernel
  • How to merge output efficiently
  • How to partition
OpenCL Execution Model
[Figure: an OpenCL data-parallel kernel is divided into work-groups of work-items. A compute device consists of compute units (CUs), each containing processing elements (PEs) plus shared and constant memory, all backed by global memory; a work-group executes on a compute unit and each work-item on a processing element.]
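To make the hierarchy concrete, here is a minimal OpenCL C kernel (an illustration, not taken from the talk); each work-item handles one element, and the built-in index functions expose where it sits in the NDRange and in its work-group:

    // Illustrative OpenCL C kernel: element-wise vector add.
    // One work-item per element; get_group_id()/get_local_id() would give the
    // work-group index and the position inside that work-group (the PE it maps to).
    __kernel void vector_add(__global const float *a,
                             __global const float *b,
                             __global float *c)
    {
        int gid = get_global_id(0);   // global index of this work-item
        c[gid] = a[gid] + b[gid];
    }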
Virtualizing Computing Device
[Figure: mapping the OpenCL model to a GPU. Work-groups of the data-parallel kernel are scheduled onto an NVIDIA SMX, whose schedulers issue work-items onto the cores; each core executes a work-item, using the SMX's registers and shared memory.]
Virtualization on CPU
[Figure: mapping the OpenCL model to a CPU. Each core (control logic, ALUs, registers, cache) of a four-core CPU executes a work-item, so work-groups are virtualized across the CPU cores.]
Collaborative Execution
[Figure: the OpenCL kernel's work-groups are flattened; the input is copied from host memory (global) to device memory (global), the devices execute their assigned work-groups, and the partial outputs are merged back into host memory.]
OpenCL - Single Kernel Multiple Devices (SKMD)
[Figure: an unmodified application binary launches a kernel through the OpenCL API; the SKMD framework (partitioner, kernel transformer, and buffer manager, driven by work-group-variant profile data) sits between the API and the devices, spreading the work-groups across the multi-core CPU host (4-16 cores), the faster external GPUs (< 300 cores), and the slower external GPUs (< 100 cores) connected over PCIe (< 16 GB/s).]
• Key ideas
  • Kernel transform
    • Work on a subset of the work-groups
    • Efficiently merge outputs in different address spaces
  • Buffer management
    • Manages the working set for each device
  • Partitioning
    • Assigns the optimal workload to each device
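The slides do not show the host-side plumbing, so the following is only a plausible sketch of how a transformed kernel might be dispatched with the OpenCL host API; the helper launch_partial, the num_orig_args parameter, and the example queue/kernel names are assumptions, not SKMD's actual interface.

    #include <CL/cl.h>

    // Hypothetical dispatch helper: enqueue the transformed kernel on one device,
    // appending the assigned work-group range [wg_from, wg_to] as the two extra
    // arguments added by the kernel transform.
    void launch_partial(cl_command_queue queue, cl_kernel kernel,
                        cl_uint num_orig_args,   /* argument count of the original kernel */
                        size_t global_size, size_t local_size,
                        cl_int wg_from, cl_int wg_to)
    {
        clSetKernelArg(kernel, num_orig_args,     sizeof(cl_int), &wg_from);
        clSetKernelArg(kernel, num_orig_args + 1, sizeof(cl_int), &wg_to);

        // The full NDRange is enqueued on every device; work-groups outside
        // [wg_from, wg_to] return immediately inside the transformed kernel.
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);
    }

    // Example split (illustrative): the CPU takes work-groups [0, split-1],
    // a GPU takes [split, num_wgs-1].
    //   launch_partial(cpu_queue, cpu_kernel, n_args, gsz, lsz, 0, split - 1);
    //   launch_partial(gpu_queue, gpu_kernel, n_args, gsz, lsz, split, num_wgs - 1);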
Kernel Transform
• Partial work-group execution
  • Two arguments, int wg_from and int wg_to, are appended to the original kernel to form partial_program
  • The N-dimensional work-group IDs are flattened: flat_id = get_group_id(0) + get_group_id(1) * get_num_groups(0)
  • Before the kernel body, work-groups with flat_id outside [wg_from, wg_to] return immediately, so each device executes only its assigned subset (see the sketch below)
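Reassembling the code fragments on this slide, the transformed kernel looks roughly like the sketch below; the argument list and body of the original kernel are placeholders, and a 2-D work-group space is assumed as in the slide:

    // Sketch of the partial work-group execution transform.
    // Two parameters are appended, the work-group ID is flattened, and
    // work-groups outside the assigned range exit before the kernel body.
    __kernel void partial_program(/* ...original arguments..., */
                                  int wg_from, int wg_to)
    {
        int idx     = get_group_id(0);
        int idy     = get_group_id(1);
        int size_x  = get_num_groups(0);
        int flat_id = idx + idy * size_x;      // flatten the 2-D work-group ID

        if (flat_id < wg_from || flat_id > wg_to)
            return;                            // not assigned to this device

        /* [KERNEL CODE] - original kernel body, unchanged */
    }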
Buffer Management - Classifying Kernels
• Contiguous memory access kernel
  • Each device's work-groups (e.g. a 33% / 50% / 17% split across the CPU, faster GPUs, and slower GPUs) read a contiguous slice of the input address space and write a contiguous slice of the output address space
  • Easy to merge output held in separate address spaces
• Discontiguous memory access kernel
  • Reads and writes are scattered across the address space
  • Hard to merge (illustrated below)
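As an illustration (these kernels are not from the talk), a kernel whose output index follows directly from the work-item ID falls in the contiguous class, while a kernel with data-dependent writes, such as a histogram, falls in the discontiguous class:

    // Contiguous: each work-item writes one predictable element, so each
    // device's slice of the output can be copied back and stitched together.
    __kernel void scale(__global const float *in, __global float *out, float k)
    {
        int gid = get_global_id(0);
        out[gid] = k * in[gid];            // write location follows the work-item ID
    }

    // Discontiguous: the write location depends on the input data, so partial
    // results from different devices overlap and are hard to merge.
    __kernel void histogram(__global const int *in, __global int *bins)
    {
        int gid = get_global_id(0);
        atomic_inc(&bins[in[gid]]);        // data-dependent, scattered writes
    }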
Merging Outputs
• Intuition
  • The CPU would produce the same results as the GPUs if it executed the rest of the work-groups
  • It already has those results from the GPUs
• Simple solution: a merging kernel
  • Enable the work-groups that were enabled on the GPUs
  • Replace each global store with a copy (a load from the GPU result)
• Logging output locations and checking the log while merging is a BAD idea
  • Additional overhead
Merge Kernel Transformation
• partial_program gains one more argument, __global float *gpu_out, to become merge_program (see the sketch below)
• The store to global memory, output[tid] = sum, is replaced with a copy from the GPU result, output[tid] = gpu_out[tid]
• The compute loop that produced sum is now dead and is removed by dead-code elimination
• Merging cost: runs at memory bandwidth (≈ 20 GB/s), < 0.1 ms in most applications
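Putting the pieces of this slide together, the merge kernel is roughly the sketch below. It is launched on the host with the GPU-computed work-group range enabled; the output index tid is assumed to come from the original kernel body (the slide only shows output[tid]):

    // Sketch of the merge kernel derived from partial_program: the global store
    // is replaced by a copy from the GPU's output buffer.
    __kernel void merge_program(/* ...original arguments..., */
                                __global float *output,
                                int wg_from, int wg_to,
                                __global float *gpu_out)
    {
        int flat_id = get_group_id(0) + get_group_id(1) * get_num_groups(0);
        if (flat_id < wg_from || flat_id > wg_to)
            return;

        // The loop that computed `sum` no longer feeds any store, so dead-code
        // elimination removes it; only the copy below survives.
        int tid = get_global_id(0);        // assumed output index
        output[tid] = gpu_out[tid];        // copy the result the GPU already produced
    }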
Partitioning
• Uses profile data
• Done at runtime
  • Must be fast
• Uses a greedy decision-tree heuristic
  • Start from the root node, where all work-groups are assigned to the fastest device
  • Each child node offloads a fixed number of work-groups from the fastest device to another device, relative to its parent node
Decision Tree Example
• At each node, the heuristic also considers
  • The balancing factor: do not choose a child whose execution times are less balanced across devices
  • Data transfer time
  • Merging cost
  • Performance variation with the number of work-groups
[Figure: a decision tree over f(Dev1, Dev2, Dev3), the estimated execution time for a given work-group assignment. Root: f(256, 0, 0) = 200 ms. Level 1: f(255, 1, 0) = 195 ms, f(255, 0, 1) = 197 ms. Level 2: f(254, 2, 0) = 195 ms, f(254, 1, 1) = 193 ms. Level 3: f(253, 2, 1) = 192 ms, f(253, 1, 2) = 190 ms.]
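A highly simplified sketch of the greedy search, in C; the cost function f(), the device count, and the step size are assumptions standing in for the profile-based model (which also folds in transfer time, merging cost, and the balancing factor):

    #define NUM_DEV 3          /* device 0 is the fastest */
    #define STEP    1          /* work-groups moved per greedy step (assumed) */

    /* Assumed cost model: estimated finish time of the slowest device for a given
       assignment, including data transfer, merging cost, and balancing factor. */
    extern double f(const int assign[NUM_DEV]);

    void partition(int total_wgs, int assign[NUM_DEV])
    {
        /* Root node: all work-groups on the fastest device. */
        for (int d = 0; d < NUM_DEV; d++) assign[d] = 0;
        assign[0] = total_wgs;

        double best = f(assign);
        for (;;) {
            int best_dev = -1;
            /* Children: offload STEP work-groups from the fastest device. */
            for (int d = 1; d < NUM_DEV; d++) {
                if (assign[0] < STEP) break;
                assign[0] -= STEP; assign[d] += STEP;     /* try the child */
                double t = f(assign);
                if (t < best) { best = t; best_dev = d; }
                assign[0] += STEP; assign[d] -= STEP;     /* undo the trial */
            }
            if (best_dev < 0) break;                      /* no better child: stop */
            assign[0] -= STEP; assign[best_dev] += STEP;  /* descend to best child */
        }
    }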
Results
[Figure: speedup of SKMD over the Intel Xeon-only baseline across benchmarks; on average, a 29% improvement over single-device execution.]
Results (Breakdown)
[Figure: execution-time breakdown for Vector Add and Matrix Multiplication.]
Summary
• Systems have become more heterogeneous
  • Configured with several types of devices
• Existing CPU + GPU heterogeneous execution
  • A single device executes a single kernel
• Single Kernel Multiple Devices (SKMD)
  • CPUs and GPUs work on a single kernel
  • Transparent framework
    • Partial kernel execution
    • Merging partial outputs
    • Optimal partitioning
  • Performance improvement of 29% over single-device execution