GPU DECLARATIVE FRAMEWORK: DEFG
Dissertation Defense
Robert Senser
October 29, 2014
PhD Committee: Gita Alaghband (chair), Tom Altman (advisor), Michael Mannino, Boris Stilman, Tam Vu
Presentation Outline • Motivation for Work • Background: Graphical Processing Units (GPUs) and OpenCL • GPU DECLARATIVE FRAMEWORK: DEFG • Diverse GPU Applications using DEFG • Image Filters (Sobel and Median) • Breadth-First Search • Sorting Roughly Sorted Data • Iterative Matrix Inversion • Dissertation Accomplishments • Future Research and Observations
Motivation for Work • GPUs can provide high throughput • Radeon HD 7990: 2 double-precision TFLOPS • Developing parallel HPC software is difficult • Parallel development for GPUs is even more difficult • GPU HPC software development requires: • Understanding of unique GPU hardware characteristics • Use of specialized algorithms • Use of GPU-specific, low-level APIs • OpenCL • CUDA • Driving notion: Let software minimize the complexity and difficulty.
Background: GPUs and OpenCL • Graphical Processing Unit (GPU) • Highly specialized coprocessor • Hundreds of cores • Thousands of hardware-managed threads • SIMT: Single Instruction, Multiple Thread • Variant of the common Single Instruction, Multiple Data (SIMD) model • Threads not on the current execution path pause • Code is executed in a “kernel” • Common GPU programming environments • OpenCL, which is an open standard • CUDA, which is NVIDIA proprietary • DEFG is designed for OpenCL
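To make the kernel concept concrete, here is a minimal OpenCL C kernel sketch (illustrative only; the kernel name scale and its arguments are assumptions, not code from the dissertation). Every GPU thread executes the same kernel body on a different element, which is the SIMT model noted above.

// Minimal OpenCL C kernel: each thread scales one element of the input array.
__kernel void scale(__global const float *in,
                    __global float *out,
                    const float factor,
                    const int n)
{
    int gid = get_global_id(0);   // unique global index of this thread
    if (gid < n)                  // guard threads launched beyond the data
        out[gid] = in[gid] * factor;
}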
High-Level GPU Architecture [Diagram: CPU with RAM and virtual memory connected over the PCIe bus to a GPU with its own global RAM, local memory, and an optional cache] • GPU Characteristics: • Processors often connected by Peripheral Component Interconnect Express (PCIe) bus • GPU has its own fast global RAM • Threads have a small amount of fast local memory • May or may not have a cache • Many hardware-controlled threads • Lacks CPU-style predictive branching, etc.
OpenCL Overview • Specification provided by the Khronos Group • Open standard, multi-vendor • Hardware device support • GPUs • CPUs • Digital signal processors (DSPs) • Field-programmable gate arrays (FPGAs) • Device kernel normally written in C • All threads share a common kernel • CPU-side code • C/C++ • Very low-level, detailed CPU-side application programming interface (API) • Third-party bindings for Java, Python, etc.
GPU Applications • Three components • Application algorithms • CPU-side code • Moves kernel code to GPU • Manages GPU execution and errors • Moves application data between CPU and GPU • May contain a portion of application algorithms • GPU kernel code • Can have multiple kernels per application • Each kernel usually contains an algorithm or algorithm step • Kernel code often uses GPU-specific techniques • This work concentrates on the CPU-side code
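To show what the CPU-side code above involves when written by hand, the sketch below outlines a typical OpenCL host sequence in C (error handling mostly elided; the kernel name process and the single-buffer layout are assumptions for illustration, not DEFG output). These are the steps DEFG later generates automatically.

#include <CL/cl.h>
#include <stdio.h>

/* Illustrative OpenCL host-side skeleton; assumes a kernel named "process"
 * that takes a single float buffer argument. */
int run_process_kernel(const char *kernel_src, float *data, size_t n)
{
    cl_platform_id plat;
    cl_device_id dev;
    cl_int err;

    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, &err);

    /* 1. Move the kernel code to the GPU (build it for the device). */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernel_src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel kern = clCreateKernel(prog, "process", &err);

    /* 2. Move the application data from CPU to GPU. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), data, &err);

    /* 3. Manage GPU execution and errors. */
    clSetKernelArg(kern, 0, sizeof(cl_mem), &buf);
    size_t global_work_size = n;
    err = clEnqueueNDRangeKernel(queue, kern, 1, NULL, &global_work_size,
                                 NULL, 0, NULL, NULL);
    if (err != CL_SUCCESS) { fprintf(stderr, "launch failed: %d\n", err); return err; }

    /* 4. Move the results back from GPU to CPU. */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), data, 0, NULL, NULL);
    clFinish(queue);
    return CL_SUCCESS;
}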
GPU Performance • Major Issues in GPU Performance • Kernel Instruction Path Divergence • Occurs with conditional instructions (ifs, loops, etc.) • Causes some threads to pause • Needs to be minimized, if not avoided entirely • High Memory Latency • Each global RAM access can cost the time of 200-500 instructions • Accesses to global RAM should be coalesced • “Bank conflicts” can occur with local thread memory • Rob Farber’s GPU suggestions [1]: • “Get the data on the GPU and leave it” • “Give the GPU enough work to do” • “Focus on data reuse to avoid memory limits” • Existing HPC code usually must be re-factored for GPU use
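The fragment below illustrates both issues in OpenCL C; it is a generic example, not code from the dissertation. The first form diverges because threads in the same wavefront take different paths, while the branch-free form keeps every thread on the same instruction; the final comment shows what a coalesced access pattern looks like.

// Illustrative OpenCL C fragment showing path divergence and coalesced access.
__kernel void divergence_demo(__global const float *in, __global float *out)
{
    int gid = get_global_id(0);

    // Divergent form: threads whose element is negative take a different path,
    // so the remaining threads in the same warp/wavefront pause.
    if (in[gid] < 0.0f)
        out[gid] = -in[gid];
    else
        out[gid] = in[gid];

    // Branch-free alternative: every thread executes the same instruction.
    out[gid] = fabs(in[gid]);

    // Coalesced access: consecutive threads read consecutive addresses (in[gid]);
    // a strided pattern such as in[gid * 32] would not coalesce.
}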
DEFG Overview • GPU software development tool for OpenCL • Generates the CPU side of GPU applications • Uses a Domain-Specific Language (DSL) • Developer writes CPU code in DEFG’s DSL • DEFG generates the corresponding CPU C/C++ program • Relative to hand-written CPU code • Faster to develop, using a declarative approach • Simpler, by using design patterns and abstraction • Developer provides standard OpenCL GPU kernels
DEFG Translator Architecture [2,3] • DEFG generates the C/C++ code for the CPU • DEFG Translator pipeline: • DEFG Source Input • ANTLR-based Parser • XML-based Tree • Optimizer (Java) • Code Generator (C++), Template Driven • C/C++ Output
DEFG Benefits and Features • Implement OpenCL applications with less effort • Requires many fewer lines of code to be written • Encourages the developer to focus on the kernels • How is this done? • With a Domain-Specific Language (DSL) • Data characteristics are declared • Uses one or more pre-defined DEFG design patterns • Many details managed inside DEFG • Technical Features • Abstracts the OpenCL APIs and their details • Automatic optimization of buffer transfers • Handles error detection • Supports multiple GPU cards • Supports anytime algorithms
DEFG Code Sample

01. declare application sobel
02. declare integer Xdim (0)
03. declare integer Ydim (0)
04. declare integer BUF_SIZE (0)
05. declare gpu gpuone ( any )
06. declare kernel sobel_filter SobelFilter_Kernels ( [[ 2D,Xdim,Ydim ]] )
07. declare integer buffer image1 ( BUF_SIZE )
08.         integer buffer image2 ( BUF_SIZE )
09. call init_input (image1(in) Xdim(out) Ydim(out) BUF_SIZE(out))
10. execute run1 sobel_filter ( image1(in) image2(out) )
11. call disp_output (image2(in) Xdim(in) Ydim(in) )
12. end

Excerpt of the generated CPU-side OpenCL code:
…
status = clSetKernelArg(sobel_filter, 1, sizeof(cl_mem), (void *)&buffer_image2);
if (status != CL_SUCCESS) { handle error }
// *** execution
size_t global_work_size[2];
global_work_size[0] = Xdim;
global_work_size[1] = Ydim;
status = clEnqueueNDRangeKernel(commandQueue, sobel_filter, 2, NULL, global_work_size, NULL, 0, NULL, NULL);
if (status != CL_SUCCESS) { handle error }
// *** result buffers
status = clEnqueueReadBuffer(commandQueue, buffer_image2, CL_TRUE, 0, BUF_SIZE * sizeof(int), image2, 0, NULL, NULL);
…
DEFG Design Patterns • Invocation Patterns (Control Flow) • Sequential Flow • Single-Kernel Repeat Sequence • Multiple-Kernel • Concurrent-GPU Patterns (Multiple-GPU Support) • Multiple-Execution • Divide-Process-Merge • Overlapped-Split-Process-Concatenate • Prefix-Allocation (Buffer Allocation, without Locking) • Dynamic-Swap (Buffer Swapping) • Code-Morsel (C/C++ Code Insertion) • Anytime algorithm (Control Flow Change on External Event) • Design patterns can be combined • Example: Multiple-Kernel + Divide-Process-Merge + Code-Morsel
DEFG Implementation • Lines of Code • ANTLR-based parser: 580 lines • Optimizer: 659 lines of Java • Code Generator: 1,513 lines of C++ • Templates and includes: 1,572 lines of C++ • Number of Diagnostic Programs: 15+ • Testing investment: man-months of effort • Faults tended to be in the C/C++ code generation • Most faults were in multi-GPU buffer management
Diverse New DEFG Applications • Constructed four very diverse GPU applications • Image Processing: Sobel and Median Image Filters • Showcase for multiple-GPU support • Graph Theoretic: Breadth-First Search (BFS), Large Graphs • Novel use of prefix sum to avoid GPU low-level locking • BFS processing with multiple GPUs • Sorting: Sorting Roughly Sorted Data • Implementation of a novel sorting approach • Use of parallel prefix calculations in sorting optimization • Also shows multiple-GPU support • Numerical: Iterative Matrix Inversion, M. Altman’s Method • Demonstrates anytime algorithms and use of OpenCL clMath BLAS (Basic Linear Algebra Subprograms) • When a quality measure is met, the anytime algorithm stops the process • These four applications demonstrate DEFG’s applicability
Filter Application: Sobel Image Filter • Sobel operator detects edges in images • Pixel gradient calculated from a 3x3 mask • Uses a single GPU kernel, invoked once • A baseline test application for multiple GPUs • Example of DEFG Sobel operator processing: [input image and its Sobel edge-detected output]
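The kernel below is a generic Sobel sketch (not the SobelFilter_Kernels code used in the DEFG sample) showing how each thread computes the gradient magnitude of one interior pixel from its 3x3 neighborhood; the single-channel, row-major image layout is an assumption.

// Generic Sobel kernel sketch: one thread per pixel of a width x height image.
__kernel void sobel_sketch(__global const uchar *in,
                           __global uchar *out,
                           const int width, const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return;                                   // skip the image border

    // 3x3 neighborhood values
    float p00 = in[(y-1)*width + (x-1)], p01 = in[(y-1)*width + x], p02 = in[(y-1)*width + (x+1)];
    float p10 = in[ y   *width + (x-1)],                            p12 = in[ y   *width + (x+1)];
    float p20 = in[(y+1)*width + (x-1)], p21 = in[(y+1)*width + x], p22 = in[(y+1)*width + (x+1)];

    // Sobel masks: horizontal (gx) and vertical (gy) gradients
    float gx = (p02 + 2.0f*p12 + p22) - (p00 + 2.0f*p10 + p20);
    float gy = (p20 + 2.0f*p21 + p22) - (p00 + 2.0f*p01 + p02);

    float mag = clamp(sqrt(gx*gx + gy*gy), 0.0f, 255.0f);
    out[y*width + x] = (uchar)mag;                // gradient magnitude as the edge value
}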
Filter Application: Median Filter • Median filter removes “noise” from images • Median determined for a 3x3 or 5x5 mask • Also uses a single GPU kernel, invoked once • Second baseline test application for multiple GPUs • Example of DEFG median 5x5 filter processing: [input image and its median-filtered output]
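For comparison, a hedged sketch of a 3x3 median kernel (again generic, not the dissertation's kernel): each thread gathers its nine neighbors into private memory and selects the middle value with a small insertion sort, which is why the 5x5 case is noticeably more computationally intense.

// Generic 3x3 median filter sketch: one thread per interior pixel.
__kernel void median3x3_sketch(__global const uchar *in,
                               __global uchar *out,
                               const int width, const int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x < 1 || y < 1 || x >= width - 1 || y >= height - 1)
        return;                                   // skip the image border

    // Gather the 3x3 neighborhood into private memory.
    uchar v[9];
    int n = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++)
            v[n++] = in[(y + dy) * width + (x + dx)];

    // Insertion sort of 9 values; the median is the 5th smallest.
    for (int i = 1; i < 9; i++) {
        uchar key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; j--; }
        v[j + 1] = key;
    }
    out[y * width + x] = v[4];
}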
Application: Breadth-First Search (BFS) • Well-studied graph-theoretic problem • Focus: BFS with Very Large Irregular (VLI) Graphs • Social networking, routing, citations, etc. • Many published GPU BFS approaches, starting with Harish [4] • Harish used a “Dijkstra-style” BFS • Vertex frontier kept as a Boolean array • 1 = vertex is on the frontier • A GPU thread assigned to each vertex • Can result in poor thread utilization
BFS Vertex Frontier • Merrill approach to vertex buffer management [5] • A single buffer updated by multiple threads • Uses prefix sum to allocate cells (see the sketch below) • Generalized this buffer management in DEFG • Provided as a set of kernel functions • Useful for shared buffers with multiple GPU cards
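The sketch below illustrates the allocation idea under assumed buffer names (it is not the BFSDP2GPU code): each thread first records how many frontier entries it will produce, a prefix-sum scan over those counts yields each thread's starting offset, and a second kernel writes each thread's entries at that offset, so no atomics or locks are needed.

#define MAX_DEG 32   /* assumed per-vertex cap on staged items, for this sketch only */

// After a scan kernel (or a library scan) has turned counts[] into the
// exclusive prefix sums offsets[], each thread copies its staged items
// into the shared frontier buffer starting at its own offset.
__kernel void write_new_frontier(__global const int *offsets,   // exclusive prefix sum of counts
                                 __global const int *counts,    // items produced per thread
                                 __global const int *staged,    // staged items, MAX_DEG per thread
                                 __global int *next_frontier)   // shared output buffer
{
    int gid  = get_global_id(0);
    int base = offsets[gid];
    for (int i = 0; i < counts[gid]; i++)
        next_frontier[base + i] = staged[gid * MAX_DEG + i];
}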
Application: Sorting Roughly Sorted Data • Goal: Improve on the O(n log n) sorting bound when the sequence is partially sorted • Based on the prior sorting work by T. Altman, et al. [6] • k is a measure of “sortedness” • A sequence is k-sorted if no element is more than k positions out of sequence • Knowing k allows for sorts of O(n log k) • If k is small, we obtain a substantial performance gain • The k-sorted trait can be exploited on the GPU • Prefix sum used in calculating k • Parallel sorts of sub-sequences
Parallel Roughly Sorting Algorithm • Step LR: Left-to-right scan, compute running max • Step RL: Right-to-left scan, compute running min • Step DM: • Uses the LR-max array and RL-min array as inputs • Computes each element’s distance measure • Step UB: Finds the distance measure upper bound • Step Sorting: Using the distance measure • Perform sort pass one • Perform sort pass two • Notion: Convert the large sort problem into many smaller, parallel sort operations. (A sequential sketch of the scan steps appears below.)
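The C sketch below walks through the LR, RL, DM, and UB steps sequentially; it is an illustrative reconstruction, not the dissertation's kernels, which compute the same arrays with parallel scans on the GPU.

/* Sequential sketch of the LR / RL / DM / UB steps (illustrative only; the
 * distance-measure formula here is a reconstruction, not the RSORT code). */
int rough_sortedness_k(const int *a, int n, int *dm)
{
    int lr[n], rl[n];                       /* running max / running min (C99 VLAs) */

    /* Step LR: left-to-right scan, running maximum. */
    lr[0] = a[0];
    for (int i = 1; i < n; i++)
        lr[i] = (a[i] > lr[i - 1]) ? a[i] : lr[i - 1];

    /* Step RL: right-to-left scan, running minimum. */
    rl[n - 1] = a[n - 1];
    for (int i = n - 2; i >= 0; i--)
        rl[i] = (a[i] < rl[i + 1]) ? a[i] : rl[i + 1];

    /* Step DM: distance measure for each position -- how far back an
     * out-of-order element reaches: dm[i] = i minus the first j with lr[j] > rl[i]. */
    int k = 0;
    for (int i = 0; i < n; i++) {
        int j = 0;
        while (j < i && lr[j] <= rl[i]) j++;
        dm[i] = i - j;
        /* Step UB: the upper bound k is the largest distance measure. */
        if (dm[i] > k) k = dm[i];
    }
    return k;   /* the sequence is k-sorted; small parallel sorts finish the job */
}

Applied to the 16-item example on the RSORT Results slide (5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16), this returns k = 4.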
Iterative Matrix Inversion (IMI) • DEFG iterative matrix inversion application using M. Altman’s method [7] • Uses the Anytime-algorithm approach to manage the iterative inversion • Inversion is stoppable at “anytime” • Can balance run time against accuracy • Anytime management in DEFG, not in the application • Requires GPU matrix operations • Uses OpenCL clMath (APPML) • clMath integration into DEFG
M. Altman IMI Approach • The initial inverse approximation R0 can be formed as R0 = αI, where α = 1 / ||A||, ||A|| is the Euclidean norm of A, and I is the identity matrix. • To invert matrix A, each iteration calculates Rn+1 = Rn(3I − 3ARn + (ARn)^2), with the result in Rn+1. • A better R0 estimate provides quicker convergence • The method is self-correcting • The DEFG Anytime facility stops the iterations • When the inversion quality measure is met • When the maximum number of iterations has been reached • When the maximum run time has been reached
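As a concrete illustration of one iteration, here is a sequential CPU sketch in C; it is illustrative only, since the DEFG IMIFLX application performs these products on the GPU through clMath BLAS calls rather than this naive matmul helper.

#include <stdlib.h>

/* Sequential sketch of one Altman iteration: Rn+1 = Rn(3I - 3ARn + (ARn)^2). */
static void matmul(const double *x, const double *y, double *z, int n)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int t = 0; t < n; t++) s += x[i * n + t] * y[t * n + j];
            z[i * n + j] = s;
        }
}

void altman_step(const double *A, const double *Rn, double *Rnext, int n)
{
    double *AR  = malloc(n * n * sizeof(double));
    double *AR2 = malloc(n * n * sizeof(double));
    double *T   = malloc(n * n * sizeof(double));

    matmul(A, Rn, AR, n);          /* AR  = A * Rn       */
    matmul(AR, AR, AR2, n);        /* AR2 = (A * Rn)^2   */

    /* T = 3I - 3*AR + AR2 */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            T[i * n + j] = (i == j ? 3.0 : 0.0) - 3.0 * AR[i * n + j] + AR2[i * n + j];

    matmul(Rn, T, Rnext, n);       /* Rn+1 = Rn * T      */

    free(AR); free(AR2); free(T);
}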
Accomplishments: DEFG Framework • Fully Implemented • Consists of approximately 5,000 code lines • 7 different applications and 15+ diagnostic programs • Complete User’s Guide • Packaged for general use • Design Patterns • 10+ patterns • Patterns range from simple to complex • Delineation of DEFG limits • Explanation for the success of the DSL and design patterns
DEFG’s Performance • DEFG Papers • Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA’13) [2] • Conference: Parallel and Distributed Processing Techniques and Applications (PDPTA’14) [3] • Analysis • Three existing OpenCL applications converted to DEFG • CPU-side re-coded in DEFG and used existing GPU kernels • Three applications: • Breadth-First Search (BFS) • All-Pairs Shortest Path (APSP/FW) • Sobel Image Filter (SOBEL) • Output results carefully verified • Comparisons between “reference” and DEFG • Lines-of-Code Comparison • Run-time Performance Comparison
Lines-of-Code Comparison On average, the DEFG code is 5.6 percent of the reference code.
Run-Time Performance Comparison • Shown are original run times (average of 10 runs) • Later, manual changes were made to understand timing differences • The FW reference version was slow due to a vendor coding error • The CPU-based BFS-4096 run was likely fast due to the CPU’s cache • Summary: DEFG provided equal, or better, performance
New Applications • Application Implementations • Filtering, BFS, Roughly Sorting, Iterative Inversion • Implementation Goals • Show general applicability of DEFG • Multiple-GPU: Filtering, BFS, and R. Sorting • Novel Algorithm: R. Sorting and Iterative Inversion • Proof of Concept: Iterative Inversion (BLAS usage) • Application Performance Results • Run-time performance • Single-GPU and multiple-GPU configurations • Problem-size characteristics • Vast majority of tests run on the Hydra server
Image Filtering Results • Image Applications Implementation • Both Sobel operator and median filter: • Overlapped-split-process-concatenate design pattern • Single and Multiple-GPU versions • Analysis with large images and multiple GPUs • Image neighborhoods • Sobel operator: 3x3 grid • Median filter: • 3x3 grid: less computationally intense • 5x5 grid: more computationally intense
Sobel Operator Application Results • Single-GPU version refactored for multiple-GPU use • Used the existing OpenCL kernel • Three simple DEFG code changes needed • New version used two GPUs • 50% of the image, plus a small overlap, given to each • Produced the same resultant image • Run-time performance was not impressive • Not sufficiently computationally intense • OpenCL transfer times went up • Kernel execution times stayed the same
Median Filter Application Results • CPU-side DEFG code very similar to Sobel • Developed two new OpenCL kernels • 3x3 grid kernel • 5x5 grid kernel • Performance with the 3x3 grid similar to Sobel • Performance with multiple-GPU, 5x5 median • Run-time improvement with all test images • With the large 7k x 7k image: 1.062 seconds (1 GPU) down to 0.794 seconds (2 GPUs) • Speedup of 1.34 with 2 GPUs on the 7k x 7k image • Also, the 2-GPU median filter handled larger images (22k x 22k) • Performance analysis with 2 GPUs • Kernel execution run times dropped • CPU-to-GPU OpenCL transfer times increased • A pthreads experiment showed the need for O.S. threads
Breadth-First Search Results • Breadth-First Search (BFS) Summary • DEFG generalization of the Merrill approach • Prefix-scan based buffer allocation • “Virtual pointers” to nodes between GPUs • Used Harish’s sparse data structure approach • Analysis of the BFS application • Characteristics • Capabilities • Run-time performance
Multiple-GPU BFS Implementation • BFSDP2GPU DEFG Application • Re-factoring of the previously-ported DEFG BFS application • DEFG implementation of a complex OpenCL application • Management of shared buffers with prefix sum • Run-time communications between GPUs • Tested application against VLI graphs • Test graphs from SNAP and DIMACS repositories • Stanford Network Analysis Package (SNAP) [8] • Center for Discrete Mathematics and Theoretical Computer Science (DIMACS) [9] • Very large graph datasets: millions of vertices and edges
BFSDP2GPU Results • Analysis of the BFSDP2GPU application • Characteristics • Kernel count went from 2 kernels to 6 • Used 2 GPUs • Capabilities and run-time performance • Single-card versus multi-card performance • Performance relative to the existing BFS application • Application results with VLI graphs • Processed large graphs (4.8M nodes, 69M edges) • However, unimpressive run-time performance • Run times increased by factors of 6 to 17 • Issue: OpenCL’s lack of direct GPU-to-GPU communications • Lesser issue: mixing of sparse and dense data structures
Roughly Sorting • RSORT application implementation • Divide-process-merge pattern utilized • Implementation contains five kernels: (LRmax, RLmin, DM, UB, and comb_sort) • Sort selected for the OpenCL kernel: comb sort • Sort-in-place design • Non-recursive • Similar to bubble sort but much faster • Elements are compared a “gap” apart (see the sketch below)
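For reference, a sequential comb sort sketch is shown below; it is illustrative only, since the RSORT OpenCL kernel applies the same compare-gap-apart idea to its sub-sequences in place on the GPU.

/* Sequential comb sort sketch: the gap starts large and shrinks by a factor
 * of ~1.3 until the final passes degenerate into a bubble sort. */
void comb_sort(int *a, int n)
{
    int gap = n;
    int swapped = 1;
    while (gap > 1 || swapped) {
        gap = (gap * 10) / 13;          /* shrink factor of about 1.3 */
        if (gap < 1) gap = 1;
        swapped = 0;
        for (int i = 0; i + gap < n; i++) {
            if (a[i] > a[i + gap]) {    /* compare elements "gap" apart */
                int tmp = a[i];
                a[i] = a[i + gap];
                a[i + gap] = tmp;
                swapped = 1;
            }
        }
    }
}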
RSORT Results • Run-time comparisons using large datasets • Generated with a set k value and dataset size • Datasets either fully perturbed or singly perturbed • Example with a k value of 4, 16 perturbed items: 5 4 3 2 1 10 9 8 7 6 15 14 13 12 11 16 • Performance analysis over 3 configurations • QSORT on the CPU, used as the baseline • RSORT with 1 GPU • RSORT with 2 GPUs
RSORT Results Summary • Application implemented in 1-GPU and 2-GPU forms • RSORT run times generally faster when k is small • At k = 1,000, the 2-GPU to 1-GPU speedup is 1.73 • 2-GPU RSORT handles larger datasets than 1-GPU
Iterative Matrix Inversion • IMIFLX application implementation • Used the DEFG blas statement to access clMath functions • Multiple-kernel-loop pattern used • Multiple blas statements per iteration • Blend of blas statements and kernels • Anytime used to end iterations at a time limit • Analysis of application • Inversion accuracy • Range of matrices: size and type • Used data from the University of Florida Sparse Matrix Collection [10]
Application Sample Result • IMIFLX uses 13 iterations for this 500 x 500 matrix • Norm value: ||(A*Rn) - I|| • Graph shows convergence to a solution • Run time was 0.259 seconds
Iterative Matrix Inversion Results • Required BLAS support in DEFG • Anytime support: traded accuracy for less run time • Hydra’s NVIDIA T20 GPU • Available RAM: 2,678 MB • Limits a double-precision matrix to just over 8,000 x 8,000
Dissertation Accomplishments • Designed, Implemented, and Tested DEFG • Produced DEFG “Enabling” Design Patterns • Compared DEFG Applications to Hand-Written • DEFG applications required less code • DEFG applications produced equal run times • Applied DEFG to Diverse GPU Applications • Four diverse applications fully implemented • Impressive application run-time results • with the exception of BFS, due to an OpenCL limit
Future Research • Additional DEFG Design Patterns • Multiple-GPU load balancing • Resource sharing • Suggest DEFG support for NVIDIA’s CUDA • Suggest a re-factored DEFG • As an internal DSL • More-standard programming environment • Enables support for more environments • Not optimistic about a declarative approach for the GPU side • Potential for other technical improvements
DEFG’s Success • DEFG is a DSL focused on HPC with GPUs • Note the Farber suggestions for GPU performance: • “Get the data on the GPU and leave it” • “Give the GPU enough work to do” • “Focus on data reuse to avoid memory limits” • The CPU becomes the orchestrator • DEFG provides the CPU code to orchestrate • Declarations to describe the data • Design patterns to describe the orchestration • Optimization to minimize data transfers
Raw Performance Numbers for Three Applications, in Milliseconds
Sample DEFG Code Showing a Sequence

01. declare application floydwarshall
02. declare integer NODE_CNT (0)
03. declare integer BUF_SIZE (0)
04. declare gpu gpuone ( any )
05. declare kernel floydWarshallPass FloydWarshall_Kernels ( [[ 2D,NODE_CNT ]] )
06. declare integer buffer buffer1 ( BUF_SIZE )
07.         integer buffer buffer2 ( BUF_SIZE )
08. call init_input (buffer1(in) buffer2(in) NODE_CNT(out) BUF_SIZE(out))
09. sequence NODE_CNT times
10. execute run1 floydWarshallPass ( buffer1(inout) buffer2(out) NODE_CNT(in) DEFG_CNT(in) )
11. call disp_output (buffer1(in) buffer2(in) NODE_CNT(in))
12. end
Sample DEFG Code Showing a Loop-While

declare application bfs
declare integer NODE_CNT (0)
declare integer EDGE_CNT (0)
declare integer STOP (0)
declare gpu gpuone ( any )
declare kernel kernel1 bfs_kernel ( [[ 1D,NODE_CNT ]] )
        kernel kernel2 bfs_kernel ( [[ 1D,NODE_CNT ]] )
declare struct (4) buffer graph_nodes ( NODE_CNT )
        integer buffer graph_edges ( EDGE_CNT )
        integer buffer graph_mask ( NODE_CNT )
        integer buffer updating_graph_mask ( $NODE_CNT )
        integer buffer graph_visited ( NODE_CNT )
        integer buffer cost ( NODE_CNT )
// note: init_input handles setting "source" node
call init_input (graph_nodes(out) graph_edges(out) graph_mask(out) updating_graph_mask(out) graph_visited(out) cost(out) NODE_CNT(out) EDGE_CNT(out))
loop
  execute part1 kernel1 ( graph_nodes(in) graph_edges(in) graph_mask(in) updating_graph_mask(out) graph_visited(in) cost(inout) $NODE_CNT(in) )
  // set STOP to zero each time thru...
  set STOP (0)
  // note: STOP value is returned...
  execute part2 kernel2 ( graph_mask(inout) updating_graph_mask(inout) graph_visited(inout) STOP(inout) NODE_CNT(in) )
while STOP eq 1
call disp_output (cost(in) NODE_CNT(in))
end