Approaches for Parallelizing Reductions on Modern GPUs Xin Huo, Vignesh T. Ravi, Wenjing Ma and Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus, OH 43210
Outline • Motivation • Challenges • Contributions • Generalized Reductions • Parallelization Approaches • Full Replication • Locking Scheme • Hybrid Scheme • Evaluation • Conclusions
Motivation • Deluge of scientific and data-intensive applications • Floating-point/double data types and use of shared memory are commonplace • Different applications require different synchronization mechanisms • State-of-the-art mechanisms to avoid race conditions in CUDA applications (compute capability <= 1.3) • Replication: a private copy for each thread • Fine-grained atomic operations on device memory (integer only) • Fine-grained atomic operations on shared memory (integer only) • Visible gap between application requirements and CUDA support • No floating-point atomic operations • No robust coarse-grained locking • Disadvantages of existing mechanisms • Replication: huge memory and combination overhead • Atomic operations: introduce heavy conflicts
Challenges • Provide additional mechanisms to avoid race conditions • Enable floating-point atomic operations (on both device and shared memory) • Enable coarse-grained locking • Overheads of Replication • Memory requirement grows with data size, number of threads, and application parameters • Combination overhead increases with the number of threads • Mostly obviates the use of shared memory • Overheads of Locking • Heavy conflicts per word can occur with large thread counts • How to improve the existing mechanisms? • Provide a mechanism that balances the trade-offs between Replication and Locking
Contributions • Additional locking mechanisms • A wrapper providing floating-point fine-grained locking • A robust, deadlock-free coarse-grained locking, via • Explicit conditional branch • Explicit warp serialization • A novel Hybrid Scheme combining Replication and Locking • Balances the overheads of both replication and locking • All schemes are handled transparently for the user
Generalized Reduction Computations • Similar to the MapReduce model, but with only one stage, Reduction • The reduction object, RObj, is exposed to the programmer • Large intermediate results are avoided • The reduction operation, Reduc, is associative and commutative • The order of processing can be arbitrary • This work targets this particular class of applications

{* Outer sequential loop *}
While (unfinished) {
    {* Reduction loop *}
    Foreach (element e) {
        (i, val) = compute(e)
        RObj(i) = Reduc(RObj(i), val)
    }
}
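A minimal CUDA-style sketch of this reduction loop (not from the slides; compute(), reduc(), and NUM_BINS are hypothetical stand-ins for application-specific code) shows where the race condition addressed by the following schemes arises:

```cuda
// Hypothetical sketch of the generalized reduction loop on the GPU.
// compute() and reduc() are trivial stand-ins for application-specific code.
#define NUM_BINS 10                                   // assumed reduction-object size

__device__ void compute(float e, int *i, float *val)  // e.g., histogram-style mapping
{
    *i   = ((int)e) % NUM_BINS;                       // illustrative only
    *val = e;
}

__device__ float reduc(float cur, float val)          // associative, commutative op
{
    return cur + val;
}

__global__ void generalized_reduction(const float *data, int n, float *robj)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Each thread processes a strided subset of the input elements.
    for (int e = tid; e < n; e += stride) {
        int   i;
        float val;
        compute(data[e], &i, &val);
        // This update is the step that needs a race-avoidance scheme
        // (replication, locking, or the hybrid scheme described later).
        robj[i] = reduc(robj[i], val);   // unsafe as written: data race
    }
}
```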
Parallelization Schemes – Full Replication [Diagram: on the device, every thread in every block holds its own private copy of the reduction object; all private copies are then combined on the host into the final result.]
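A minimal sketch of full replication, reusing NUM_BINS and compute() from the earlier sketch: each thread writes only its own slice of the replicated reduction object, so no synchronization is needed inside the loop, but all copies must later be combined.

```cuda
// Full replication sketch: robj_priv holds one private copy per thread,
// laid out contiguously as numThreads * NUM_BINS floats.
__global__ void full_replication(const float *data, int n, float *robj_priv)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float *my  = robj_priv + tid * NUM_BINS;   // this thread's private copy

    for (int e = tid; e < n; e += stride) {
        int   i;
        float val;
        compute(data[e], &i, &val);
        my[i] += val;                          // race-free: only this thread writes here
    }
    // A separate combination step (on the device or the host) must merge all
    // private copies; this is the memory and combination overhead noted above.
}
```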
Parallelization Schemes – Locking Schemes • Fine-grained locking • Coarse-grained locking [Diagram: all threads in one block share the same copy of the reduction object on the device; the per-block copies are combined on the host into the final result.]
Fine-grained Locking • A lock for updating a particular word (atomic operation) • Supports float and double computation • Implemented by wrapping the atomicCAS operation provided by CUDA • Flow of atomically executing "*address + val": read old_value = *address and compute new_value = *address + val; call AtomicCAS(address, old_value, new_value); if *address still equals old_value, the swap succeeds and *address = new_value; otherwise re-read *address and retry
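This is essentially the classic atomicCAS-based workaround for missing floating-point atomics; a minimal sketch of such a wrapper, assuming the update operation is an addition:

```cuda
// Fine-grained floating-point "atomic add" built on integer atomicCAS,
// for hardware (compute capability <= 1.3) without native float atomics.
__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int  old = *address_as_int;
    int  assumed;
    do {
        assumed = old;
        float updated = __int_as_float(assumed) + val;              // *address + val
        // Swap in the new value only if *address is still the value we read.
        old = atomicCAS(address_as_int, assumed, __float_as_int(updated));
    } while (assumed != old);                                        // conflict: retry
    return __int_as_float(old);
}
```

A double-precision variant follows the same pattern using atomicCAS on unsigned long long with __double_as_longlong / __longlong_as_double conversions.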
Coarse-grained Locking • A lock for a group of operations (critical section) • A lock word encodes the state: locking = 0 (free), locking = 1 (busy) • Flow: AtomicCAS(locking, 0, 1); if the old value was 0, the thread owns the lock (locking = 1), executes the critical section, and then frees the lock (locking = 0); otherwise it spins and retries • The spin lock causes thread divergence within a warp and is deadlock-prone
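A minimal sketch of this naive coarse-grained lock (lock_var is a hypothetical global lock word, 0 = free, 1 = busy); as the slide notes, the spin loop can deadlock when the lock owner and a waiter sit in the same warp:

```cuda
__device__ int lock_var = 0;   // hypothetical lock word: 0 = free, 1 = busy

// Deadlock-prone form: if the lock owner and a waiter belong to the same warp,
// the warp may keep re-executing the spin loop and the owner never reaches the
// release, so the warp hangs.
__device__ void critical_update_naive(float *robj, int i, float val)
{
    while (atomicCAS(&lock_var, 0, 1) != 0)
        ;                       // spin until the lock looks free
    robj[i] += val;             // critical section (a group of operations)
    atomicExch(&lock_var, 0);   // free the lock
}
```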
Deadlock-free solutions • Explicit Warp Serialization: loop over thread IDs 0–31 so that only one thread of the warp attempts the lock at a time (Get lock(), critical section, Release lock(), threadID++) • Explicit Conditional Branch: each thread loops with a flag (Do = true); once Get lock() succeeds, it executes the critical section, releases the lock, and sets Do = false
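A minimal sketch of the explicit-conditional-branch variant (same hypothetical lock word as above, passed as a pointer). The acquire, critical section, and release all live inside one branch, so the lock owner always reaches the release even when the warp diverges; the warp-serialization variant instead iterates over the 32 lane IDs so only one thread of the warp attempts the lock per iteration.

```cuda
// Deadlock-free coarse-grained locking via an explicit conditional branch.
__device__ void critical_update_safe(float *robj, int i, float val, int *lock)
{
    bool done = false;
    while (!done) {
        if (atomicCAS(lock, 0, 1) == 0) {   // try to get the lock
            robj[i] += val;                 // critical section
            atomicExch(lock, 0);            // free the lock
            done = true;                    // leave the retry loop
        }
    }
}
```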
Hybrid Scheme • Balances between Full Replication and the Locking Scheme • Inserts a middle layer, the "group", into the thread organization • Intra-group: Locking Scheme • Inter-group: Full Replication (a sketch follows below) • The benefit of the Hybrid Scheme varies with group size • Advantages with an appropriate group size: • Reduced memory overhead • Reduced combination cost • Better use of shared memory • Reduced conflicts
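A minimal sketch of the hybrid layout, reusing NUM_BINS, compute(), and atomicAddFloat() from the earlier sketches (GROUP_SIZE is an assumed parameter): each group owns one private copy of the reduction object, updates within a group use the fine-grained atomic wrapper, and only the per-group copies need a final combination.

```cuda
#define GROUP_SIZE 32                        // assumed group size (e.g., one warp)

// robj_groups holds one private reduction object per group of threads.
__global__ void hybrid_reduction(const float *data, int n, float *robj_groups)
{
    int tid     = blockIdx.x * blockDim.x + threadIdx.x;
    int stride  = gridDim.x * blockDim.x;
    int group   = tid / GROUP_SIZE;                  // which group this thread belongs to
    float *robj = robj_groups + group * NUM_BINS;    // that group's private copy

    for (int e = tid; e < n; e += stride) {
        int   i;
        float val;
        compute(data[e], &i, &val);
        atomicAddFloat(&robj[i], val);               // contention only within the group
    }
    // Far fewer copies than full replication, far less contention than a single
    // shared copy: the replication/locking trade-off the slide describes.
}
```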
Model of Hybrid Scheme [Diagram: on the device, the threads of each block are organized into groups; each group has a private copy of the reduction object and produces intermediate results; the intermediate results of all groups and blocks are combined on the host into the final result.]
Experiment Setup • Setup • NVIDIA Tesla C1060 • 4 GB device memory, 16 KB shared memory • Compute capability 1.3 • AMD Opteron 2218 processors • Applications • K-Means Clustering • Principal Component Analysis (PCA) • K-Nearest Neighbor Search (KNN) • Evaluate the performance of the three parallelization techniques: • Full Replication • Locking Scheme • Hybrid Scheme • Analyze the factors influencing the performance
K-Means: K=10, data size = 2 GB [Chart: results over varying numbers of groups; low memory and combination overhead, but high contention.]
K-Means: K=10 • Hybrid outperforms both Full Replication and the Locking Scheme • Comparison of the best configurations
KNN: K=10, data size = 20 MB • High contention with a small reduction object (K=10) • Locking uses explicit warp serialization, which performs better than the explicit conditional branch • The Hybrid Scheme uses 32 groups, matching the number of threads in a warp • No two threads in the same warp go to the same group, so race conditions exist only among threads in different warps • Locking is much more sensitive to the number of threads due to the high overhead of coarse-grained locking
KNN: K=10 • Full Replication: 9.6 times faster than the Locking Scheme • Hybrid: 62.3 times faster than the Locking Scheme • Comparison of the best configurations
KNN: CUDA Profiler Results • Hybrid achieves a balance between divergent branches and global stores • 1.6% of the divergent branches of Locking • 0.7% of the global stores of Full Replication
KNN: CUDA Profiler Results • As K increases, the number of divergent branches in Locking does not change much, but the number of global stores in Full Replication increases dramatically
KNN: varying K • Locking will outperform Full Replication as K increases, based on the observed trends in divergent branches and global stores
Conclusions • Performance depends on • Characteristics of the application • Thread configuration • Choice of scheme • Full Replication • Viable when the reduction object is small and the combination cost is low • Locking Scheme • Viable when the number of memory locations is large enough to keep contention overhead low • Hybrid Scheme • Balances combination and memory overhead against synchronization cost • Achieves the best performance for the benchmarks we considered
Thank you Questions? Contacts: Xin Huo huox@cse.ohio-state.edu Vignesh Ravi raviv@cse.ohio-state.edu Wenjing Ma mawe@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu