Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering The Ohio State University
Irregular Reduction - Context • A dwarf in the Berkeley view on parallel computing • Unstructured grid pattern • More random and irregular accesses • Indirect memory references • Previous efforts in porting to different architectures • Distributed memory machines • Distributed shared memory machines • Shared memory machines • Cache performance improvement on uniprocessors • Many-core architectures - GPGPU (our work at ICS '11) • No systematic study on heterogeneous architectures (CPU + GPU)
Why CPU + GPU? – A Glimpse • Dominant positions in different areas, but tightly connected • GPU: computation-intensive problems; large numbers of threads executing in SIMD • CPU: data-intensive problems, branch processing, high-precision and complicated computation • The GPU depends on the CPU for scheduling and data • One of the most popular heterogeneous architectures • 3 of the 5 fastest supercomputers are based on CPU + GPU architectures (Top500 list, 11/2011) • Fusion from AMD, Sandy Bridge from Intel, and Denver from NVIDIA • Cloud compute instances from Amazon
Outline • Background • Irregular Reduction Structure • Partitioning-based Locking Scheme • Main Issues • Contributions • Multi-level Partitioning Framework • Runtime Support • Pipelining Scheme • Task Scheduling • Experimental Results • Conclusions
Irregular Reduction Structure • Robj: reduction object • e: iteration of the computation loop, iterating over the computation space • IA(e,x): indirection array, mapping iteration e to the reduction elements it touches • Robj: accessed only through the indirection array (reduction space)

/* Outer Sequence Loop */
while( ) {
  /* Reduction Loop */
  Foreach(element e) {
    (IA(e,0), val1) = Process(IA(e,0));
    (IA(e,1), val2) = Process(IA(e,1));
    Robj(IA(e,0)) = Reduce(Robj(IA(e,0)), val1);
    Robj(IA(e,1)) = Reduce(Robj(IA(e,1)), val2);
  }
  /* Global Reduction to Combine Robj */
}
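A minimal concrete instance of this skeleton, with the indirection array realized as an edge list and Reduce specialized to addition; all names, sizes, and the stand-in Process() are illustrative:

#include <cstdio>

const int NUM_NODES = 16, NUM_EDGES = 20;

int main() {
    int   IA[NUM_EDGES][2];        // indirection array: 2 nodes per edge
    float Robj[NUM_NODES] = {0};   // reduction objects, one per node

    // arbitrary edge list, standing in for a real unstructured grid
    for (int e = 0; e < NUM_EDGES; e++) {
        IA[e][0] = e % NUM_NODES;
        IA[e][1] = (e * 7 + 3) % NUM_NODES;
    }

    // reduction loop: iteration e updates only nodes IA(e,0) and IA(e,1),
    // which are unknown until runtime -- the source of irregularity
    for (int e = 0; e < NUM_EDGES; e++) {
        float val = 1.0f / (1 + e);   // stand-in for Process()
        Robj[IA[e][0]] += val;        // Reduce() specialized to "+"
        Robj[IA[e][1]] += val;
    }

    printf("Robj[0] = %f\n", Robj[0]);
    return 0;
}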
Application Context • Molecular Dynamics • Indirection array -> edges (interactions) • Reduction objects -> molecules (attributes) • Computation space -> interactions b/w molecules • Reduction space -> attributes of molecules
Partitioning-based Locking Strategy • Reduction space partitioning • Efficient shared memory utilization • Eliminates intra- and inter-block combination (see the CUDA sketch below) • Multi-dimensional partitioning method • Balances minimizing cut edges against partitioning time Huo et al., 25th International Conference on Supercomputing
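A minimal CUDA sketch of the idea, assuming the reduction space has already been partitioned so that thread block b owns the nodes in [nodeStart[b], nodeStart[b+1]) and its edges sit in [edgeStart[b], edgeStart[b+1]), with edge endpoints renumbered relative to the partition; all array names and the tiny single-partition driver are hypothetical:

#include <cstdio>

#define MAX_PART_NODES 1024   // partition sized to fit in shared memory

__global__ void reduce_partition(const int2 *edges, const float *vals,
                                 const int *edgeStart, const int *nodeStart,
                                 float *Robj)
{
    __shared__ float sRobj[MAX_PART_NODES];
    int b      = blockIdx.x;
    int nNodes = nodeStart[b + 1] - nodeStart[b];

    // stage this partition's reduction objects into shared memory
    for (int i = threadIdx.x; i < nNodes; i += blockDim.x)
        sRobj[i] = Robj[nodeStart[b] + i];
    __syncthreads();

    // threads cooperate on the partition's edges; atomics on shared memory
    // resolve conflicts inside the block, and no inter-block combination is
    // needed because partitions are disjoint
    for (int e = edgeStart[b] + threadIdx.x; e < edgeStart[b + 1];
         e += blockDim.x) {
        atomicAdd(&sRobj[edges[e].x], vals[e]);
        atomicAdd(&sRobj[edges[e].y], vals[e]);
    }
    __syncthreads();

    // write the partition back to device memory
    for (int i = threadIdx.x; i < nNodes; i += blockDim.x)
        Robj[nodeStart[b] + i] = sRobj[i];
}

int main() {
    // tiny single-partition example: 4 nodes, 3 edges
    int2  hEdges[3]     = { {0,1}, {1,2}, {2,3} };
    float hVals[3]      = { 1.f, 2.f, 3.f };
    int   hEdgeStart[2] = { 0, 3 }, hNodeStart[2] = { 0, 4 };
    float hRobj[4]      = { 0.f, 0.f, 0.f, 0.f };

    int2 *dE; float *dV, *dR; int *dES, *dNS;
    cudaMalloc(&dE, sizeof(hEdges));
    cudaMalloc(&dV, sizeof(hVals));
    cudaMalloc(&dES, sizeof(hEdgeStart));
    cudaMalloc(&dNS, sizeof(hNodeStart));
    cudaMalloc(&dR, sizeof(hRobj));
    cudaMemcpy(dE, hEdges, sizeof(hEdges), cudaMemcpyHostToDevice);
    cudaMemcpy(dV, hVals, sizeof(hVals), cudaMemcpyHostToDevice);
    cudaMemcpy(dES, hEdgeStart, sizeof(hEdgeStart), cudaMemcpyHostToDevice);
    cudaMemcpy(dNS, hNodeStart, sizeof(hNodeStart), cudaMemcpyHostToDevice);
    cudaMemcpy(dR, hRobj, sizeof(hRobj), cudaMemcpyHostToDevice);

    reduce_partition<<<1, 64>>>(dE, dV, dES, dNS, dR);
    cudaMemcpy(hRobj, dR, sizeof(hRobj), cudaMemcpyDeviceToHost);
    printf("Robj = %f %f %f %f\n", hRobj[0], hRobj[1], hRobj[2], hRobj[3]);
    return 0;
}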
Main Issues • Device memory limitation on the GPU • Partitioning overhead • Partitioning cost increases with data volume • GPU sits idle waiting for partitioning results • Low utilization of the CPU • CPU only conducts partitioning • CPU sits idle while the GPU computes
Contributions • A Novel Multi-level Partitioning Framework • Parallelizes irregular reductions on a heterogeneous architecture (CPU + GPU) • Eliminates the device memory limitation on the GPU • Runtime Support Scheme • Pipelining scheme • Work-stealing based scheduling strategy • Significant Performance Improvements • Exhaustive evaluations • 11% and 22% improvements for Euler and Molecular Dynamics, respectively
Computation Space Partitioning • Partitioning on the iterations of the computation loop • Pros • Load balance on computation • Cons • Unequal reduction size in each partition • Replicated reduction elements (4 out of 16 nodes in the example; a sketch of this cost follows below) • Combination cost • Between CPU and GPU (first level) • Between different thread blocks (second level) [Figure: a 16-node mesh split into 4 computation-space partitions; nodes on partition boundaries are replicated.]
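A small host-side sketch of the replication cost, with hypothetical names and an arbitrary edge list: iterations are split evenly across partitions, and we count the reduction elements touched by more than one partition, which is exactly what later has to be combined.

#include <cstdio>
#include <set>
#include <utility>
#include <vector>

int main() {
    const int P = 4, NUM_EDGES = 20, NUM_NODES = 16;
    std::vector<std::pair<int,int>> edges(NUM_EDGES);
    for (int e = 0; e < NUM_EDGES; e++)
        edges[e] = { e % NUM_NODES, (e * 7 + 3) % NUM_NODES };

    // even split of iterations (computation space) across P partitions
    std::vector<std::set<int>> touched(P);
    int chunk = (NUM_EDGES + P - 1) / P;
    for (int e = 0; e < NUM_EDGES; e++) {
        int p = e / chunk;
        touched[p].insert(edges[e].first);
        touched[p].insert(edges[e].second);
    }

    // a node touched by k partitions is replicated and must be combined
    std::vector<int> owners(NUM_NODES, 0);
    for (int p = 0; p < P; p++)
        for (int n : touched[p]) owners[n]++;
    int replicated = 0;
    for (int n = 0; n < NUM_NODES; n++)
        if (owners[n] > 1) replicated++;
    printf("%d of %d nodes need combination\n", replicated, NUM_NODES);
    return 0;
}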
Reduction Space Partitioning • Partitioning on reduction elements • Pros • Balanced reduction space • Shared memory becomes feasible on the GPU • Partitions are mutually independent • No communication between CPU and GPU (first level) • No communication between thread blocks (second level) • Avoids combination cost • Between different thread blocks • Between CPU and GPU • Cons • Imbalance in the computation space • Replicated work caused by crossing edges (see the sketch below) [Figure: the same 16-node mesh split into 4 reduction-space partitions; crossing edges are processed by both partitions they touch.]
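The dual trade-off in a matching host-side sketch, again with hypothetical names: nodes are split evenly, each edge is assigned to the partition(s) owning its endpoints, and a crossing edge is replicated into both, which is the redundant work this scheme trades for eliminating combination.

#include <cstdio>
#include <vector>

int main() {
    const int P = 4, NUM_EDGES = 20, NUM_NODES = 16;
    const int nodesPer = NUM_NODES / P;   // even split of the reduction space

    std::vector<std::vector<int>> edgesOf(P);  // per-partition edge lists
    int crossing = 0;
    for (int e = 0; e < NUM_EDGES; e++) {
        int u = e % NUM_NODES, v = (e * 7 + 3) % NUM_NODES;
        int pu = u / nodesPer, pv = v / nodesPer;
        edgesOf[pu].push_back(e);
        // a crossing edge is processed by both partitions: redundant work,
        // but each partition updates only the nodes it owns
        if (pv != pu) { edgesOf[pv].push_back(e); crossing++; }
    }
    printf("%d of %d edges are replicated\n", crossing, NUM_EDGES);
    return 0;
}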
Task Scheduling Framework • Pipelining Scheme • K blocks assigned to the GPU in one global loading • Pipelining between partitioning and computation within a global loading • Work Stealing Scheduling • Scheduling granularity (large for GPU; small for CPU) • Too large: better pipelining effect, but worse load balance • Too small: better load balance, but shorter pipelining length • Work stealing achieves both maximum pipelining length and good load balance (see the sketch below)
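The granularity trade-off in a minimal host-side sketch; for brevity this uses one shared atomic counter (self-scheduling with mixed chunk sizes) rather than a full work-stealing deque, and all names and the simulated work are illustrative:

#include <algorithm>
#include <atomic>
#include <cstdio>
#include <thread>

const int NUM_PARTS = 64, GPU_CHUNK = 8, CPU_CHUNK = 1;
std::atomic<int> next_part{0};

void worker(const char *who, int chunk) {
    for (;;) {
        int lo = next_part.fetch_add(chunk);   // atomically claim a chunk
        if (lo >= NUM_PARTS) break;            // queue drained
        int hi = std::min(lo + chunk, NUM_PARTS);
        for (int p = lo; p < hi; p++)          // stand-in for real work
            printf("%-3s processes partition %d\n", who, p);
    }
}

int main() {
    // coarse chunks keep the GPU pipeline long; fine chunks let the CPU
    // absorb whatever is left, giving load balance at the tail
    std::thread gpu(worker, "GPU", GPU_CHUNK);
    std::thread cpu(worker, "CPU", CPU_CHUNK);
    gpu.join();
    cpu.join();
    return 0;
}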
Experiment Setup • Platform • GPU • NVIDIA Tesla C2050 "Fermi" (14 x 32 = 448 cores) • 2.68GB device memory • 64KB configurable shared memory • CPU • Intel quad-core Xeon E5520, 2.27 GHz • 48GB memory • x16 PCI Express 2.0 • Applications • Euler (computational fluid dynamics) • MD (molecular dynamics)
Scalability – Molecular Dynamics • Scalability of Molecular Dynamics on Multi-core CPU and GPU across Different Datasets (MD)
Pipelining – Euler • Effect of Pipelining CPU Partitioning and GPU Computation (EU) • Partitioning overhead is hidden for all but the first partition • Performance increases as the pipelining length increases
Heterogeneous Performance – EU and MD • Benefits From Dividing Computations Between CPU and GPU for EU and MD • 11% improvement for EU; 22% for MD
Work Stealing - Euler • Comparison of Fine-grained, Coarse-grained, and Work Stealing Strategies • Granularity = 1: good load balance, but bad pipelining effect • Granularity = 5: good pipelining effect, but bad load balance • Work stealing: good pipelining effect and good load balance
Conclusions • Multi-level Partitioning Framework for porting irregular reductions onto heterogeneous architectures • The Pipelining Scheme overlaps partitioning on the CPU with computation on the GPU • Work Stealing Scheduling achieves the best pipelining effect and load balance
Thank you Questions? Contacts: Xin Huo huox@cse.ohio-state.edu Vignesh T. Ravi raviv@cse.ohio-state.edu Gagan Agrawal agrawal@cse.ohio-state.edu