290 likes | 453 Views
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission. Haicheng Wu*, Gregory Diamos # , Jin Wang*, Srihari Cadambi ^, Sudhakar Yalamanchili *, Srimat Chakradhar ^ * Georgia Institute of Technology # NVIDIA Research ^ NEC Laboratories America.
E N D
Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission Haicheng Wu*, Gregory Diamos#, Jin Wang*, SrihariCadambi^, SudhakarYalamanchili*, SrimatChakradhar^ *Georgia Institute of Technology #NVIDIA Research ^NEC Laboratories America Sponsors: National Science Foundation, LogicBlox Inc. , IBM, and NVIDIA
The General Purpose GPU • GPU is a many core co-processor • 10s to 100s of cores • 1000s to 10,000s of concurrent threads • CUDA and OpenCL are the dominant programming models • Well suited for data parallel apps • Molecular Dynamics, Options Pricing, Ray Tracing, etc. • Commodity: led by NVIDIA, AMD, and Intel CPU (Multi Core) 2-10 Cores ③Execute ① Input Data GPU ~1500 Cores ② Launch Kernel ④ Result MAIN MEM ~128GB PCI-E GPU MEM ~6GB
Enterprise: Amazon EC2 GPU Instance NVIDIA Tesla Amazon EC2 GPU Instances
Data Warehousing Applications on GPUs • The good • Lots of potential data parallelism • If data fits in GPU mem, 2x—27x speedup has been shown • The bad • Very large data set (will not even fit in host memory) • I/O bound (GPU has no disk) • PCI data transfer takes 15–90% of the total time* • B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query co-processing on graphics processors. • In TODS, 2009.
This Work • Goal: Demonstrate the benefits of Kernel Fusion/Kernel Fission in enabling Large data warehousing applications on GPUs • Assumptions • In-memory system • Host memory, not GPU memory • Not OLTP (Online Transaction Processing) type simple queries • Focus on data analysis instead of data entry/retrieval
Two Optimizations for Data Movement CPU (Multi Core) 2-10 Cores Our solutions are: Kernel Fusion– Aggregate computation to reuse data Kernel Fission– Overlap computation with PCI transfer GPU ~1500 Cores MAIN MEM ~128GB PCI-E GPU MEM ~6GB ~16GB/s This is the problem!!!
Relational Algebra (RA) Operators RA are building blocks of DB APPs 7
Experimental Environment Using a sequence of SELECTs to demonstrate the benefits of Kernel Fusion/Fission 9
PCI Bandwidth vs. GPU Computation Capacity GPU Computation Capacity (1 SELECT) < PCI Bandwidth 10
Kernel Fusion A1: A2: A1: A2: A3: KernelA + +/- A3: Fused Kernel A1 A2 A3 A1 A2 - Kernel A A3 Fused Kernel A , B Kernel B KernelB Result Result
Benefits of Kernel Fusion-Reduce Data Footprint (1) Temporal Locality Spatial Locality GPU GPU A1 A2 temp temp A3 Result GPU temp A1 A2 A3 Result Traverse the data only ONCE 12
Benefits of Kernel Fusion-Reduce Data Footprint (2) Memory Efficiency Reduce Data Transfer A1 A2 A1 A2 Kernel A A3 input1 GPU MEM Temp A3 Temp Kernel B CPU MEM GPU MEM result1 Result A1 input2 A1 A2 A3 A2 result2 Fused Kernel A , B A3 GPU MEM Result
Benefits of Kernel Fusion-Enlarge Optimization Scope Eliminate Common Stages Enable More Opt Kernel A Fused Kernel A, B Kernel B Larger code is good for other optimizations: a) instruction scheduling, b) register assignment, c) constant propagation ……
Examples of Kernel Fusion Original 1 SELECT Fused 2 SELECTs 15
Kernel Fusion-Overall Performance Including PCI PCI-e noise Excluding PCI 1.80x speedup 16
Kernel Fusion-Breakdown Execution Time Not needed Faster filter and gather 17
Kernel Fusion-Sensitivity Lower selected rate is better Fusing more kernels is better 18
Kernel Fission-CUDA Stream • Commands (Kernel or Memcpy) of different CUDA STREAM can run in parallel • Commands in the same CUDA STREAM have to run in sequential 19
Kernel Fission-Stream Pool Stream Pool is a library that abstracts away the details of CUDA STREAM 20
Kernel Fission-Different Ways to Use CUDA Stream Concurrently running two kernels is not always beneficial small uses half resource asbig 21
Example of Kernel Fission 1.37x speedup 22
Kernel Fusion + Kernel Fission 1.41x serial 1.31x fusion only 1.10x fission only 23
Real Queries-Q1 Totally 1.26x speedup Query Plan
Real Queries-Q21 Totally 1.13x speedup Query Plan 25
Conclusions • Two Data movement optimizations (Kernel Fusion & Kernel Fission) saves the memory transfer time and speeds up the computation time for Data Warehousing Apps. • Kernel Fusion • Does not need to dump intermediate temporary data • Enlarge the optimization scope • Kernel Fission works like double buffer that can overlap data transfer with GPU Computation
Thank You Questions? 27