AN EXTENDED OPENMP TARGETING ON THE HYBRID ARCHITECTURE OF SMP-CLUSTER

Authors: Y. Zhao, C. Hu, S. Wang, S. Zhang
Source: Proceedings of the 2nd IASTED International Conference on Advances in Computer Science and Technology
Speaker: Cheng-Jung Wu

Outline
• Introduction
• Extensions in EOMP
  • Computing Resource Definition
  • Hierarchical Data Layout and Data Mapping
• Execution Model for EOMP
• Experiments and Results
  • Dot Product
  • Matrix multiplication under EOMP execution model on SMP cluster
• Conclusion

Introduction
• Clusters of shared-memory multiprocessors (SMPs)
  • Increasingly popular in high-performance computing
• Hybrid architecture of SMP clusters
  • Supports a wide range of parallel paradigms
• Three programming paradigms
  • Standard message passing
  • Hybrid paradigm corresponding to the underlying architecture
  • Shared-memory paradigm built on Software Distributed Shared Memory (SDSM)
• Three major metrics
  • Performance, portability, programmability

Introduction
• None of the three parallel programming paradigms satisfies all three metrics
• New parallel paradigm (EOMP)
  • A compromise model
  • Balances the three major metrics
• Features
  • Good programmability
  • Acceptable performance
• Improves memory behavior
  • Data locality
  • For programs running on an SMP cluster
    • Inter-node and intra-node data locality

Extensions in EOMP
• OpenMP
  • Targets shared-memory systems
  • Lacks support for distributed-memory systems
• New directives
  • Computing resource definition
  • Data mapping

Computing Resource Definition
• Definitions
  • Virtual node (VN)
  • Virtual processor (VP)
• VNs
  • Map to physical nodes
  • Target units of inter-node data distribution
• VPs
  • Map to physical processors
  • Target units of intra-node data reallocation and task scheduling during compilation

Computing Resource Definition
• Semantics of computing resource definition directives
• Examples of processor mapping are given

Hierarchical Data Layout and Data Mapping: Inter-node Data Mapping
• Scalar data defined in EOMP
  • Shared by default
  • Every node gets its own copy of the data
• Inter-node task parallelism
  • Allows shared scalar data to be modified on certain nodes
• Global addresses of distributed arrays
  • Local addresses
• Inter-node data mapping distributes the mapped arrays to VNs
• Semantics for inter-node data distribution directive:
  • #pragma eomp distribute a (BLOCK*) onto N
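The relation between global and local addresses of a block-distributed array can be sketched as follows. This is only an illustration under assumptions: it uses the conventional BLOCK rule (contiguous chunks of size ceil(n / num_vn) per virtual node), which the slides do not spell out, and the names block_loc, block_size, block_locate and block_global are hypothetical helpers, not part of EOMP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type; not part of EOMP. */
typedef struct {
    size_t owner;  /* index of the virtual node (VN) that holds the element */
    size_t local;  /* index of the element inside that node's local block */
} block_loc;

/* Assumed BLOCK rule: each VN holds ceil(n / num_vn) consecutive elements. */
size_t block_size(size_t n, size_t num_vn) {
    assert(num_vn > 0);
    return (n + num_vn - 1) / num_vn;
}

/* Translate a global index into (owner VN, local index). */
block_loc block_locate(size_t global, size_t n, size_t num_vn) {
    assert(global < n);
    size_t bs = block_size(n, num_vn);
    block_loc loc = { global / bs, global % bs };
    return loc;
}

/* Inverse translation: global index of a local element on a given VN. */
size_t block_global(size_t owner, size_t local, size_t n, size_t num_vn) {
    return owner * block_size(n, num_vn) + local;
}
```

For example, with 10 elements over 4 virtual nodes the block size is 3, so global element 7 lives on node 2 at local index 1.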

Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping
• The shared-memory data layout takes advantage of global addressing
  • Technically, no further data mapping is required inside a node
• In certain cases
  • An improper order of data accesses decreases cache performance
    • False sharing or long-stride access
  • For instance, when two threads always access neighboring array elements in memory at the same time
    • Cache performance may be very poor due to severe false sharing
• Optimizations of the intra-node data layout are then necessary

Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping
• An extreme example
• An experiment under that circumstance shows an overall 90% reduction in L1 cache misses after the intra-node data reallocation optimization
  • (On a 4-CPU IA64 SMP; the array a is of size 1M)

Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping
• Two strategies can be adopted to reduce cache misses
  • Rearrange the access order of each thread
    • Not always possible as a compiler optimization
    • Depends closely on the structure of the source program
    • In the interleaved-data case above, this means avoiding simultaneous accesses to neighboring data in memory
  • Reallocate the data layout in memory
    • Does not change the data dependencies of the source program
    • This assures the correctness of the optimization
    • Stores the data accessed by the same thread in a contiguous memory block

Hierarchical Data Layout and Data Mapping: Intra-node Data Mapping
• Intra-node data reallocation
  • Programmer-specified directives
  • Compiler reference analysis
• Reallocating data in memory inside a node
  • Additional time and space overheads
    • To be accounted for when evaluating the performance speedup of this optimization
  • The data locations have been changed
    • The reallocated data should be forbidden
• Semantics for intra-node data reallocation directive
  • #pragma eomp distribute a (CYCLIC,*) intra
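A minimal sketch of the reallocation idea, assuming the interleaved case in which thread t touches elements t, t + P, t + 2P, ... of an array shared by P threads. The index formula and the names realloc_index and reallocate are assumptions made for illustration; the slides do not give the exact layout the EOMP compiler or runtime produces.

```c
#include <assert.h>
#include <stddef.h>

/* New position of original element i so that all elements touched by the
   same thread (i % num_threads) end up in one contiguous chunk.
   Assumed layout: chunk t starts at t * ceil(n / num_threads). */
size_t realloc_index(size_t i, size_t n, size_t num_threads) {
    assert(num_threads > 0 && i < n);
    size_t chunk = (n + num_threads - 1) / num_threads;
    return (i % num_threads) * chunk + i / num_threads;
}

/* Copy src into the reallocated layout. dst must have room for
   num_threads * ceil(n / num_threads) elements. The copy itself is the
   additional time overhead, and dst is the additional space overhead. */
void reallocate(const double *src, double *dst, size_t n, size_t num_threads) {
    for (size_t i = 0; i < n; ++i)
        dst[realloc_index(i, n, num_threads)] = src[i];
}
```

With 8 elements and 4 threads, the original order 0..7 becomes 0, 4, 1, 5, 2, 6, 3, 7: each thread's two elements now sit next to each other instead of being interleaved with the other threads' data.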

Execution Model for EOMP
• Inter-node barriers and broadcasts
  • Modifications of shared variables in the parallel section at the edges of the task-parallel region
  • Maintain data consistency
• Inter-node communications use explicit message passing

Execution Model for EOMP
• The compiler generates a message-passing and multithreaded program
  • It first distributes data and schedules the tasks across nodes
  • It then deals with the intra-node data reallocation and task scheduling
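The two-level order described above (nodes first, then threads inside each node) can be illustrated with a simple static partition of a loop's iteration space. This is a sketch under assumptions: the even static split and the names range, split_range and thread_range are hypothetical, and the actual EOMP scheduling policy is not specified in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open iteration range [lo, hi). Hypothetical helper type. */
typedef struct { size_t lo, hi; } range;

/* Split r into `parts` nearly equal pieces and return piece k.
   The first (len % parts) pieces get one extra iteration. */
range split_range(range r, size_t parts, size_t k) {
    assert(parts > 0 && k < parts && r.lo <= r.hi);
    size_t len = r.hi - r.lo;
    size_t base = len / parts, extra = len % parts;
    size_t lo = r.lo + k * base + (k < extra ? k : extra);
    size_t hi = lo + base + (k < extra ? 1 : 0);
    range out = { lo, hi };
    return out;
}

/* Level 1: the node's share of [0, n). Level 2: the thread's share of that. */
range thread_range(size_t n, size_t nodes, size_t node,
                   size_t threads, size_t thread) {
    range all = { 0, n };
    return split_range(split_range(all, nodes, node), threads, thread);
}
```

For 100 iterations on 4 nodes with 4 threads each, node 1 receives [25, 50), and its thread 0 receives [25, 32).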

Experiments and Results: Dot Product

Experiments and Results: Dot Product
• The experimental results show that the efficiency of EOMP based on the runtime library is similar to that of the MPI+OpenMP program (better in some cases)
• But it is not as good as pure MPI, because the dot product involves too little computation compared with the cost of intra-node scheduling

Matrix multiplication under EOMP execution model on SMP cluster
• C = A * B
• A and C are distributed by rows
• B is distributed by columns
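One way to picture this distribution is the serial simulation below, where each "node" owns a block of rows of A and C and a block of columns of B. It is a sketch under assumptions: the ring-style rotation of B's column blocks, the sizes N and NODES, and the name distributed_matmul are all hypothetical, and the slides do not describe how the column blocks of B actually travel between nodes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes; NODES must divide N evenly in this sketch. */
enum { N = 4, NODES = 2, BLK = N / NODES };

/* Serial simulation of the distribution. Node p owns rows
   [p*BLK, (p+1)*BLK) of A and C and columns [p*BLK, (p+1)*BLK) of B.
   In step s it uses the column block originally owned by node
   (p + s) % NODES, as if the blocks of B were passed around a ring. */
void distributed_matmul(double A[N][N], double B[N][N], double C[N][N]) {
    for (size_t p = 0; p < NODES; ++p) {
        for (size_t s = 0; s < NODES; ++s) {
            size_t q = (p + s) % NODES;  /* owner of the B block in use */
            for (size_t i = p * BLK; i < (p + 1) * BLK; ++i) {
                for (size_t j = q * BLK; j < (q + 1) * BLK; ++j) {
                    double sum = 0.0;
                    for (size_t k = 0; k < N; ++k)
                        sum += A[i][k] * B[k][j];
                    C[i][j] = sum;  /* row i of C is local to node p */
                }
            }
        }
    }
}
```

Every node writes only its own rows of C, so in this scheme only blocks of B would need to move between nodes.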

Matrix multiplication under EOMP execution model on SMP cluster
• When the matrix size is small
  • The cost of inter-node scheduling and communication is relatively high (compared with the computation cost)
  • The three distributed-memory models cannot achieve a speedup
• When the matrix size becomes larger
  • The three distributed-memory models achieve reasonable speedups

Matrix multiplication under EOMP execution model on SMP cluster
• Notice that the EOMP model after intra-node data reallocation
  • Achieves a high speedup when the matrix size is large
  • This shows that improved intra-node cache performance can greatly benefit the overall performance of the program on SMP clusters

Matrix multiplication under EOMP execution model on SMP cluster
• Peaks of the EOMP-INDR curves in the 500*500 and 1000*1000 cases
  • The effect of data reallocation is related to both the cache line size and the size of local b
  • As the number of nodes increases, the size of local b on each node becomes smaller
  • A cache line can then hold more rows of local b, so cache misses are reduced
  • This explains why the peak in the 500*500 case comes earlier than that in the 1000*1000 case

Conclusion
• The experimental results show
  • The feasibility of our execution model
  • The benefit gained from intra-node data reallocation
• For future work, we plan to develop a complete source-to-source EOMP compiler
  • Based on ORC (Open Research Compiler for IA64)
  • Our current runtime library prototype
  • Focusing on communication generation and data management
