Approaches for Parallelizing Reductions on Modern GPUs Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210
Outline • Motivation • Challenges • Contributions • Generalized Reductions • Parallelization Approaches • Full Replication • Locking Scheme • Hybrid Scheme • Evaluation • Conclusions
Motivation • Deluge of scientific and data-intensive applications • Floating-point/double data types and use of shared memory are commonplace • Different applications require different synchronization mechanisms • State-of-the-art mechanisms to avoid race conditions in CUDA applications (compute capability <= 1.3) • Replication: a private copy for each thread • Fine-grained atomic operations on device memory (integer only) • Fine-grained atomic operations on shared memory (integer only) • Visible gap between application requirements and CUDA support • No floating-point atomic operations • No robust coarse-grained locking • Disadvantages of existing mechanisms • Replication: huge memory and combination overhead • Atomic operations: introduce heavy conflicts
Challenges • Provide additional mechanisms to avoid race conditions • Enable floating-point atomic operations (on both device and shared memory) • Enable coarse-grained locking • Overheads of Replication • Memory requirement grows with data size, number of threads, and application parameters • Combination overhead increases with the number of threads • Mostly obviates the use of shared memory • Overheads of Locking • Heavy conflicts per word can occur with large thread counts • How to improve the existing mechanisms? • Provide a mechanism that balances the trade-offs between Replication and Locking
Contributions • Additional locking mechanisms • A wrapper providing floating-point fine-grained locking • A robust, deadlock-free coarse-grained locking, via • Explicit conditional branch • Explicit warp serialization • A novel Hybrid Scheme combining Replication and Locking • Balances the overheads of both replication and locking • All schemes are handled transparently for the user
Generalized Reduction Computations • Similar to the MapReduce model, but with only one stage, Reduction • The reduction object, RObj, is exposed to the programmer • Large intermediate results are avoided • The reduction operation, Reduc, is associative and commutative • The order of processing can be arbitrary • This work targets this particular class of applications

{* Outer sequential loop *}
While (unfinished) {
    {* Reduction loop *}
    Foreach (element e) {
        (i, val) = compute(e)
        RObj(i) = Reduc(RObj(i), val)
    }
}
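A minimal CUDA-style sketch of this reduction loop (not from the slides; compute(), reduc(), and NUM_BINS are hypothetical stand-ins for application-specific code) shows where the race condition addressed by the following schemes arises:

```cuda
// Hypothetical sketch of the generalized reduction loop on the GPU.
// compute() and reduc() are trivial stand-ins for application-specific code.
#define NUM_BINS 10                                   // assumed reduction-object size

__device__ void compute(float e, int *i, float *val)  // e.g., histogram-style mapping
{
    *i   = ((int)e) % NUM_BINS;                       // illustrative only
    *val = e;
}

__device__ float reduc(float cur, float val)          // associative, commutative op
{
    return cur + val;
}

__global__ void generalized_reduction(const float *data, int n, float *robj)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Each thread processes a strided subset of the input elements.
    for (int e = tid; e < n; e += stride) {
        int   i;
        float val;
        compute(data[e], &i, &val);
        // This update is the step that needs a race-avoidance scheme
        // (replication, locking, or the hybrid scheme described later).
        robj[i] = reduc(robj[i], val);   // unsafe as written: data race
    }
}
```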
Parallelization Schemes – Full Replication [Diagram: on the device, every thread in every block holds its own private copy of the reduction object; all private copies are then combined on the host into the final result.]
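A minimal sketch of full replication, reusing NUM_BINS and compute() from the earlier sketch: each thread writes only its own slice of the replicated reduction object, so no synchronization is needed inside the loop, but all copies must later be combined.

```cuda
// Full replication sketch: robj_priv holds one private copy per thread,
// laid out contiguously as numThreads * NUM_BINS floats.
__global__ void full_replication(const float *data, int n, float *robj_priv)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float *my  = robj_priv + tid * NUM_BINS;   // this thread's private copy

    for (int e = tid; e < n; e += stride) {
        int   i;
        float val;
        compute(data[e], &i, &val);
        my[i] += val;                          // race-free: only this thread writes here
    }
    // A separate combination step (on the device or the host) must merge all
    // private copies; this is the memory and combination overhead noted above.
}
```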
Parallelization Schemes – Locking Schemes • Fine-grained locking • Coarse-grained locking [Diagram: all threads in one block share the same copy of the reduction object on the device; the per-block copies are combined on the host into the final result.]
Fine-grained Locking • A lock for updating a particular word (atomic operation) • Supports float and double computation • Implemented by wrapping the atomicCAS operation provided by CUDA • Flow of atomically executing "*address + val": read old_value = *address and compute new_value = *address + val; call AtomicCAS(address, old_value, new_value); if *address still equals old_value, the swap succeeds and *address = new_value; otherwise re-read *address and retry
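This is essentially the classic atomicCAS-based workaround for missing floating-point atomics; a minimal sketch of such a wrapper, assuming the update operation is an addition:

```cuda
// Fine-grained floating-point "atomic add" built on integer atomicCAS,
// for hardware (compute capability <= 1.3) without native float atomics.
__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int  old = *address_as_int;
    int  assumed;
    do {
        assumed = old;
        float updated = __int_as_float(assumed) + val;              // *address + val
        // Swap in the new value only if *address is still the value we read.
        old = atomicCAS(address_as_int, assumed, __float_as_int(updated));
    } while (assumed != old);                                        // conflict: retry
    return __int_as_float(old);
}
```

A double-precision variant follows the same pattern using atomicCAS on unsigned long long with __double_as_longlong / __longlong_as_double conversions.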
Coarse-grained Locking • A lock for a group of operations (critical section) • A lock word encodes the state: locking = 0 (free), locking = 1 (busy) • Flow: AtomicCAS(locking, 0, 1); if the old value was 0, the thread owns the lock (locking = 1), executes the critical section, and then frees the lock (locking = 0); otherwise it spins and retries • The spin lock causes thread divergence within a warp and is deadlock-prone
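A minimal sketch of this naive coarse-grained lock (lock_var is a hypothetical global lock word, 0 = free, 1 = busy); as the slide notes, the spin loop can deadlock when the lock owner and a waiter sit in the same warp:

```cuda
__device__ int lock_var = 0;   // hypothetical lock word: 0 = free, 1 = busy

// Deadlock-prone form: if the lock owner and a waiter belong to the same warp,
// the warp may keep re-executing the spin loop and the owner never reaches the
// release, so the warp hangs.
__device__ void critical_update_naive(float *robj, int i, float val)
{
    while (atomicCAS(&lock_var, 0, 1) != 0)
        ;                       // spin until the lock looks free
    robj[i] += val;             // critical section (a group of operations)
    atomicExch(&lock_var, 0);   // free the lock
}
```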
Deadlock-free solutions • Explicit Warp Serialization: loop over thread IDs 0–31 so that only one thread of the warp attempts the lock at a time (Get lock(), critical section, Release lock(), threadID++) • Explicit Conditional Branch: each thread loops with a flag (Do = true); once Get lock() succeeds, it executes the critical section, releases the lock, and sets Do = false
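A minimal sketch of the explicit-conditional-branch variant (same hypothetical lock word as above, passed as a pointer). The acquire, critical section, and release all live inside one branch, so the lock owner always reaches the release even when the warp diverges; the warp-serialization variant instead iterates over the 32 lane IDs so only one thread of the warp attempts the lock per iteration.

```cuda
// Deadlock-free coarse-grained locking via an explicit conditional branch.
__device__ void critical_update_safe(float *robj, int i, float val, int *lock)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // try to get the lock
            robj[i] += val;                 // critical section
            atomicExch(lock, 0);            // free the lock
            done = true;                    // leave the retry loop
        }
    }
}
```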
Hybrid Scheme • Balances between Full Replication and the Locking Scheme • Inserts a middle layer, the "group", into the thread organization • Intra-group: Locking Scheme • Inter-group: Full Replication (a sketch follows below) • The benefit of the Hybrid Scheme varies with group size • Advantages with an appropriate group size: • Reduced memory overhead • Reduced combination cost • Better use of shared memory • Reduced conflicts
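A minimal sketch of the hybrid layout, reusing NUM_BINS, compute(), and atomicAddFloat() from the earlier sketches (GROUP_SIZE is an assumed parameter): each group owns one private copy of the reduction object, updates within a group use the fine-grained atomic wrapper, and only the per-group copies need a final combination.

```cuda
#define GROUP_SIZE 32                        // assumed group size (e.g., one warp)

// robj_groups holds one private reduction object per group of threads.
__global__ void hybrid_reduction(const float *data, int n, float *robj_groups)
{
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int stride  = gridDim.x * blockDim.x;
    int group   = tid / GROUP_SIZE;                  // which group this thread belongs to
    float *robj = robj_groups + group * NUM_BINS;    // that group's private copy

    for (int e = tid; e < n; e += stride) {
        int   i;
        float val;
        compute(data[e], &i, &val);
        atomicAddFloat(&robj[i], val);               // contention only within the group
    }
    // Far fewer copies than full replication, far less contention than a single
    // shared copy: the replication/locking trade-off the slide describes.
}
```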
Model of Hybrid Scheme [Diagram: on the device, the threads of each block are organized into groups; each group has a private copy of the reduction object and produces intermediate results; the intermediate results of all groups and blocks are combined on the host into the final result.]
Experiment Setup • Setup • NVIDIA Tesla C1060 • 4 GB device memory, 16 KB shared memory • Compute capability 1.3 • AMD Opteron 2218 processors • Applications • K-Means Clustering • Principal Component Analysis (PCA) • K-Nearest Neighbor Search (KNN) • Evaluate the performance of the three parallelization techniques: • Full Replication • Locking Scheme • Hybrid Scheme • Analyze the factors influencing the performance
K-Means: K=10, data size = 2 GB [Chart: results over varying numbers of groups; low memory and combination overhead, but high contention.]
K-Means: K=10 • Hybrid outperforms both Full Replication and the Locking Scheme • Comparison of the best configurations
KNN: K=10, data size = 20 MB • High contention with a small reduction object (K=10) • Locking uses explicit warp serialization, which performs better than the explicit conditional branch • The Hybrid Scheme uses 32 groups, matching the number of threads in a warp • No two threads in the same warp go to the same group, so race conditions exist only among threads in different warps • Locking is much more sensitive to the number of threads due to the high overhead of coarse-grained locking
KNN: K=10 • Full Replication: 9.6 times faster than the Locking Scheme • Hybrid: 62.3 times faster than the Locking Scheme • Comparison of the best configurations
KNN: CUDA Profiler Results • Hybrid achieves a balance between divergent branches and global stores • 1.6% of the divergent branches of Locking • 0.7% of the global stores of Full Replication
KNN: CUDA Profiler Results • As K increases, the number of divergent branches in Locking does not change much, but the number of global stores in Full Replication increases dramatically
KNN: varying K • Locking will outperform Full Replication as K increases, based on the observed trends in divergent branches and global stores
Conclusions • Performance depends on • Characteristics of the application • Thread configuration • Choice of scheme • Full Replication • Viable when the reduction object is small and the combination cost is low • Locking Scheme • Viable when the number of memory locations is large enough to keep contention overhead low • Hybrid Scheme • Balances combination and memory overhead against synchronization cost • Achieves the best performance for the benchmarks we considered
Thank you Questions? Contacts: Xin Huo huox@cse.ohio-state.edu Vignesh Ravi raviv@cse.ohio-state.edu Wenjing Ma mawe@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu