Load Balancing Hybrid Programming Models for SMP Clusters and Fully Permutable Loops
Nikolaos Drosinos and Nectarios Koziris
National Technical University of Athens, Computing Systems Laboratory
{ndros,nkoziris}@cslab.ece.ntua.gr
www.cslab.ece.ntua.gr
Motivation
• fully permutable loops are always a computational challenge for HPC
• hybrid parallelization is attractive for DSM architectures
• currently, popular free message passing libraries provide limited multi-threading support
• SPMD hybrid parallelization suffers from intrinsic load imbalance
ICPP-HPSEC 2005
Contribution
• two static thread load balancing schemes (constant and variable) for coarse-grain funneled hybrid parallelization of fully permutable loops
    • generic
    • simple to implement
• experimental evaluation against micro-kernel benchmarks of different programming models
    • message passing
    • fine-grain hybrid
    • coarse-grain hybrid (unbalanced, balanced)
Algorithmic model
    foracross tile_1 do
      …
        foracross tile_N do
          for tile_{n-1} do
            Receive(tile);
            Compute(A, tile);
            Send(tile);
Restrictions:
• fully permutable loops
• unitary inter-process dependencies
Message passing parallelization
• tiling transformation
• (overlapped?) computation and communication phases
• pipelined execution
• portable
• scalable
• highly optimized
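The pipelined execution of tiles can be illustrated with a small scheduling model. The sketch below is an assumption-laden illustration, not the authors' code: it supposes a 2D process grid, tiles streamed along a third dimension, unitary dependencies between neighbouring processes, and one time step per tile. The function name `pipeline_schedule` and the grid shape are hypothetical.

```python
from itertools import product

def pipeline_schedule(px, py, tiles):
    """Earliest step at which process (i, j) can run its k-th tile.

    Assumed unitary dependencies: tile k on (i, j) needs tile k from
    (i-1, j) and (i, j-1), plus tile k-1 on the same process.
    """
    step = {}
    # Lexicographic order guarantees predecessors are already scheduled.
    for i, j, k in product(range(px), range(py), range(tiles)):
        preds = []
        if i > 0:
            preds.append(step[(i - 1, j, k)])
        if j > 0:
            preds.append(step[(i, j - 1, k)])
        if k > 0:
            preds.append(step[(i, j, k - 1)])
        step[(i, j, k)] = 1 + max(preds, default=-1)
    return step

schedule = pipeline_schedule(px=2, py=2, tiles=4)
total_steps = 1 + max(schedule.values())
sequential_steps = len(schedule)
print(total_steps, sequential_steps)
```

Under these assumptions each tile runs at step i + j + k, so the pipeline fills diagonally across the process grid and the total step count grows with the sum of the grid extents rather than with their product.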
Hybrid parallelization
So… why bother?
Hybrid parallelization: why bother I
shared memory programming model vs message passing programming model for shared memory architecture
Hybrid parallelization: why bother II
DSM architectures are popular!
Fine-grain hybrid parallelization
• incremental parallelization of loops
• relatively easy to implement
• popular
• Amdahl’s law restricts parallel efficiency
• overhead of thread structures re-initialization
• restrictive programming model for many applications
Coarse-grain hybrid parallelization
• generic SPMD programming style
• good parallelization efficiency
• no thread re-initialization overhead
• more difficult to implement
• intrinsic load imbalance assuming common funneled thread support level
[Figure: execution timelines. Fine-grain hybrid: comp, comp, comm, comm, …, comp. Coarse-grain hybrid: comp, comm, comp, …, comp.]
MPI thread support levels
• single
• masteronly
• funneled
• serialized
• multiple
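The intrinsic load imbalance of the funneled coarse-grain model can be made concrete with a simple timing model. The sketch below assumes that only the master thread communicates, that computation time scales linearly with the share of the tile, and that communication is not overlapped with the master's own computation. The function name and the numbers are hypothetical illustrations, not measurements from the experiments.

```python
def step_time_unbalanced(t_comp, t_comm, threads):
    """Per-tile time when every thread gets an equal 1/T share.

    t_comp: time for one thread to compute the whole process tile
    t_comm: time the master thread spends communicating per tile
    Returns (time of the slowest thread, idle time of each worker).
    """
    share = t_comp / threads
    master = share + t_comm   # master computes and communicates
    worker = share            # other threads only compute
    return max(master, worker), master - worker

total, idle = step_time_unbalanced(t_comp=8.0, t_comm=1.0, threads=2)
print(total, idle)
```

In this model the non-master threads wait at every tile for as long as the master spends communicating, which is the imbalance a smaller master share is meant to remove.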
Load balancing
Idea
Consequence: master thread assumes a smaller fraction of the process tile computational load compared to other threads
Load balancing (2)
Assuming
It follows
T: total number of threads
p: current process id
Load balancing (3)
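One way to quantify a static balancing factor is sketched below. It is a derivation under stated assumptions and not necessarily the formula used on the slides: the master thread is assigned a fraction bal/T of the process tile, the remaining T-1 threads split the rest equally, and computation time is linear in the assigned fraction. Requiring the master's computation plus communication to equal a worker's computation gives bal = 1 - (T-1)·t_comm/t_comp. The names `balance_factor` and `thread_times`, and the reading of "constant" as one factor for all processes and "variable" as a per-process factor, are assumptions.

```python
def balance_factor(t_comp, t_comm, threads):
    """Factor bal in [0, 1]; the master takes bal/T of the tile.

    Assumed balance condition:
      bal/T * t_comp + t_comm == (T - bal) / (T * (T - 1)) * t_comp
    """
    if threads < 2:
        raise ValueError("balancing needs at least two threads")
    bal = 1.0 - (threads - 1) * t_comm / t_comp
    return min(1.0, max(0.0, bal))  # clamp to a valid fraction

def thread_times(t_comp, t_comm, threads):
    """Return (master time, worker time) per tile for the balanced split."""
    bal = balance_factor(t_comp, t_comm, threads)
    master_share = bal / threads
    worker_share = (1.0 - master_share) / (threads - 1)
    return master_share * t_comp + t_comm, worker_share * t_comp

# Constant scheme (assumed): one factor shared by every process.
constant = balance_factor(t_comp=8.0, t_comm=1.0, threads=2)

# Variable scheme (assumed): one factor per process id p, using that
# process's own communication time (boundary processes communicate less).
comm_per_process = [0.5, 1.0, 1.0, 0.5]   # hypothetical values
variable = [balance_factor(8.0, c, 2) for c in comm_per_process]

master_time, worker_time = thread_times(8.0, 1.0, 2)
print(constant, variable, master_time, worker_time)
```

With these illustrative numbers the master and the worker both finish a tile in 4.5 time units, compared with 5.0 for an equal split under the same model.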
Experimental Results
• 8-node dual SMP Linux Cluster (800 MHz PIII, 256 MB RAM, kernel 2.4.26)
• MPICH v.1.2.6 (--with-device=ch_p4, --with-comm=shared, P4_SOCKBUFSIZE=104KB)
• Intel C++ compiler 8.1 (-O3 -static -mcpu=pentiumpro)
• FastEthernet interconnection network
Alternating Direction Implicit (ADI)
• Stencil computation used for solving partial differential equations
• Unitary data dependencies
• 3D iteration space (X × Y × Z)
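To show what unitary data dependencies in a 3D iteration space permit, the sketch below uses a generic stand-in update, not the actual ADI kernel: each point depends on its three predecessors at distance one. Because all dependence components are non-negative, the loops are fully permutable, so a permuted loop order and a tiled traversal give the same result as the original order. All names and the update formula are hypothetical.

```python
from itertools import product

def sweep(order):
    """Apply the stand-in update to points in the given visiting order."""
    a = {}

    def get(i, j, k):
        # Points outside the iteration space act as boundary value 1;
        # an in-space point must already be computed (KeyError otherwise).
        return 1 if min(i, j, k) < 0 else a[(i, j, k)]

    for i, j, k in order:
        a[(i, j, k)] = (get(i - 1, j, k) + 2 * get(i, j - 1, k)
                        + 3 * get(i, j, k - 1)) % 1000003
    return a

def tiled(X, Y, Z, tx, ty, tz):
    """Visit tiles in lexicographic order, and points within each tile."""
    order = []
    for ti, tj, tk in product(range(0, X, tx), range(0, Y, ty),
                              range(0, Z, tz)):
        order.extend(product(range(ti, min(ti + tx, X)),
                             range(tj, min(tj + ty, Y)),
                             range(tk, min(tk + tz, Z))))
    return order

X, Y, Z = 6, 5, 4
reference = sweep(product(range(X), range(Y), range(Z)))
# Loop permutation: k outermost, i innermost.
permuted = [(i, j, k) for k, j, i in product(range(Z), range(Y), range(X))]
same_permuted = sweep(permuted) == reference
same_tiled = sweep(tiled(X, Y, Z, 2, 2, 2)) == reference
print(same_permuted, same_tiled)
```

An order that violates a dependence (for example, visiting points in reverse) raises a `KeyError` in this sketch instead of silently producing different values.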
ADI
Synthetic benchmark
Conclusions
• fine-grain hybrid parallelization is inefficient
• unbalanced coarse-grain hybrid parallelization is also inefficient
• balancing improves hybrid model performance
• the variable balanced coarse-grain hybrid model is the most efficient approach overall
• the relative performance improvement increases as communication needs grow relative to computation
Thank You! Questions?