Parallel Scaling of Parsec Circuit: Examining Efficiency in Multi-Node Processing

Parallel Scaling of parsparsecircuit3.c (Tim Warburton)

1 Process Per Node • In these tests we use only one of the two processors on each node.

blackbear: 16 processors, 16 nodes

blackbear: 16 processors, 16 nodes. Apart from the mpi_allreduce calls, this is an almost perfect picture of parallelism.

2 Processes Per Node • We use both processors on each node.

blackbear: 8 nodes, 16 processes. Notice the prevalence of waitany. Clearly this code is not working as well as it does when running with 1 process per node.

blackbear: 8 nodes, 16 processes (zoom in). I suspect that the threaded MPI communicators for the non-blocking isend and irecv are competing for CPU time with the user code. There could also be competition for the memory bus and the network bus between the two processors on a node.

Timings for M=1024 (N=1024^2) (blackbear, -O3)

Timings for Two Processes Per Node on Los Lobos. Timings courtesy of Zhaoxian Zhou.
