Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission Haicheng Wu*, Gregory Diamos#, Jin Wang*, SrihariCadambi^, SudhakarYalamanchili*, SrimatChakradhar^ *Georgia Institute of Technology #NVIDIA Research ^NEC Laboratories America Sponsors: National Science Foundation, LogicBlox Inc. , IBM, and NVIDIA

The General Purpose GPU • GPU is a many core co-processor • 10s to 100s of cores • 1000s to 10,000s of concurrent threads • CUDA and OpenCL are the dominant programming models • Well suited for data parallel apps • Molecular Dynamics, Options Pricing, Ray Tracing, etc. • Commodity: led by NVIDIA, AMD, and Intel CPU (Multi Core) 2-10 Cores ③Execute ① Input Data GPU ~1500 Cores ② Launch Kernel ④ Result MAIN MEM ~128GB PCI-E GPU MEM ~6GB

Enterprise: Amazon EC2 GPU Instance NVIDIA Tesla Amazon EC2 GPU Instances

Data Warehousing Applications on GPUs • The good • Lots of potential data parallelism • If data fits in GPU mem, 2x—27x speedup has been shown • The bad • Very large data set (will not even fit in host memory) • I/O bound (GPU has no disk) • PCI data transfer takes 15–90% of the total time* • B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. • In TODS, 2009.

This Work • Goal: Demonstrate the benefits of Kernel Fusion/Kernel Fission in enabling Large data warehousing applications on GPUs • Assumptions • In-memory system • Host memory, not GPU memory • Not OLTP (Online Transaction Processing) type simple queries • Focus on data analysis instead of data entry/retrieval

Two Optimizations for Data Movement CPU (Multi Core) 2-10 Cores Our solutions are: Kernel Fusion– Aggregate computation to reuse data Kernel Fission– Overlap computation with PCI transfer GPU ~1500 Cores MAIN MEM ~128GB PCI-E GPU MEM ~6GB ~16GB/s This is the problem!!!

Relational Algebra (RA) Operators RA are building blocks of DB APPs 7

Common RA Combinations of TPC-H 8

Experimental Environment Using a sequence of SELECTs to demonstrate the benefits of Kernel Fusion/Fission 9

PCI Bandwidth vs. GPU Computation Capacity GPU Computation Capacity (1 SELECT) < PCI Bandwidth 10

Kernel Fusion A1: A2: A1: A2: A3: KernelA + +/- A3: Fused Kernel A1 A2 A3 A1 A2 - Kernel A A3 Fused Kernel A , B Kernel B KernelB Result Result

Benefits of Kernel Fusion-Reduce Data Footprint (1) Temporal Locality Spatial Locality GPU GPU A1 A2 temp temp A3 Result GPU temp A1 A2 A3 Result Traverse the data only ONCE 12

Benefits of Kernel Fusion-Reduce Data Footprint (2) Memory Efficiency Reduce Data Transfer A1 A2 A1 A2 Kernel A A3 input1 GPU MEM Temp A3 Temp Kernel B CPU MEM GPU MEM result1 Result A1 input2 A1 A2 A3 A2 result2 Fused Kernel A , B A3 GPU MEM Result

Benefits of Kernel Fusion-Enlarge Optimization Scope Eliminate Common Stages Enable More Opt Kernel A Fused Kernel A, B Kernel B Larger code is good for other optimizations: a) instruction scheduling, b) register assignment, c) constant propagation ……

Examples of Kernel Fusion Original 1 SELECT Fused 2 SELECTs 15

Kernel Fusion-Overall Performance Including PCI PCI-e noise Excluding PCI 1.80x speedup 16

Kernel Fusion-Breakdown Execution Time Not needed Faster filter and gather 17

Kernel Fusion-Sensitivity Lower selected rate is better Fusing more kernels is better 18

Kernel Fission-CUDA Stream • Commands (Kernel or Memcpy) of different CUDA STREAM can run in parallel • Commands in the same CUDA STREAM have to run in sequential 19

Kernel Fission-Stream Pool Stream Pool is a library that abstracts away the details of CUDA STREAM 20

Kernel Fission-Different Ways to Use CUDA Stream Concurrently running two kernels is not always beneficial small uses half resource asbig 21

Example of Kernel Fission 1.37x speedup 22

Kernel Fusion + Kernel Fission 1.41x serial 1.31x fusion only 1.10x fission only 23

Real Queries-Q1 Totally 1.26x speedup Query Plan

Real Queries-Q21 Totally 1.13x speedup Query Plan 25

Conclusions • Two Data movement optimizations (Kernel Fusion & Kernel Fission) saves the memory transfer time and speeds up the computation time for Data Warehousing Apps. • Kernel Fusion • Does not need to dump intermediate temporary data • Enlarge the optimization scope • Kernel Fission works like double buffer that can overlap data transfer with GPU Computation

Thank You Questions? 27

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission