OPAL: Open Source Parallel Algorithm Library
Designing High-Performance Algorithms for SMP Clusters
David A. Bader
Electrical & Computer Engineering Department
Albuquerque High Performance Computing Center
University of New Mexico
dbader@eece.unm.edu
http://hpc.eece.unm.edu/
High-Performance Applications using SMP Clusters
• Long-term Earth science studies using terascale remotely-sensed global satellite imagery (4 km AVHRR GAC)
• Computational Ecological Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles
• Computational Bioinformatics: Large Scale Phylogeny Reconstruction
High Performance Algorithms for SMP Clusters, Prof. David A. Bader
Research Collaborators
• Joseph JáJá, University of Maryland
• Bernard Moret, CS (Experimental Algorithmics), University of New Mexico
• Bruce Milne, Biology (Landscape Ecology), University of New Mexico
• Tandy Warnow, CS, University of Texas-Austin
• IBM ACTC Group (David Klepacki, John Levesque, and others)
• Current Graduate Students:
  • Mi Yan, Niranjan Prabhu, Vinila Yarlagadda
• Laboratory Alumni:
  • Kavita Balakavi (Intel), Ajith Illendula (Intel)
Acknowledgment of Support
• NSF CISE Postdoctoral Research Associate in Experimental Computer Science No. 96-25668
• NSF BIO Division of Environmental Biology DEB 99-10123
• Department of Energy Sandia-University New Assistant Professorship Program (SUNAPP) Award AX-3006
• IBM SUR Grant (UNM Vista-Azul Project)
• NPACI/SDSC and NCSA/Alliance
• NSF 00-* Algorithms for Irregular Discrete Computations on SMPs
Outline
• Motivation
• SMP Cluster Programming (SIMPLE)
  • Complexity model
  • Message-Passing
  • Shared-Memory
• OPAL Facets (parallel libraries)
• OPAL Setting (programming framework)
• Example SMP Algorithms
Motivation
• High performance computing has been leveraging COTS workstation technologies
  • Commodity microprocessors
  • High-performance networks
  • Operating system and compiler technology
• Symmetric multiprocessor (SMP)
  • Hardware support for hierarchical memory management
  • Multithreaded operating system kernels
  • Optimizing compilers and runtime systems
SMP Cluster Architectures
• IBM SP (NPACI Blue Horizon 144x8)
• Linux Clusters
• Compaq AlphaServers (PSC/NSF Terascale 682x4)
• Sun Ultra HPC (4x64)
Systems pictured: LLNL ASCI White IBM SP (512x16); UNM/Alliance LosLobos IBM Netfinity (256x2); UNM/Alliance Roadrunner Linux SuperCluster (64x2)
Message-Passing Performance
Shared-Memory Performance
• One Sun HPC E10K processor
• Contiguous array; each element read exactly once
• C, X = cyclic read (stride X) of contiguous array
• R = random access of array
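The two access patterns named above can be sketched as index orderings over one contiguous array. This is only an illustration under an assumed reading of "cyclic read (stride X)": visit indices 0, X, 2X, ..., then 1, 1+X, ..., so that every element is still read exactly once. The function names are hypothetical and do not come from the benchmark behind the slide.

```python
import random


def cyclic_order(n, stride):
    """Assumed meaning of 'C, X': stride through the array, wrapping to the next offset."""
    return [i for start in range(stride) for i in range(start, n, stride)]


def random_order(n, seed=0):
    """'R': a random permutation of the indices, so each element is read once."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return order


def read_all(array, order):
    """Read every element in the given order and return the sum of the values read."""
    total = 0
    for i in order:
        total += array[i]
    return total
```

Timing `read_all` over a large array for each ordering is one way to observe the cache effects the slide measures, although an interpreted language will blur them considerably.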
High Performance Algorithms for SMP Clusters
• “SIMPLE” Model
  • Use a hybrid, natural combination of message-passing and shared-memory
  • Message passing interface between nodes
  • Shared-memory programming (OpenMP, POSIX Threads) on each SMP node
• Methodology for adapting message-passing algorithms for SMP Clusters
• Freely-available open source implementation of parallel algorithms, libraries, and programming environment, for C/C++/Fortran under the GNU General Public License (GPL)
Optimizing from MPI to SIMPLE (Regular or Irregular Algorithms)
• Similar Single-Program Multiple-Data (SPMD) paradigm
• Replace multiple MPI tasks per node with a single task and multiple shared-memory threads
• Parallelize sequential work into equivalent shared-memory algorithms
• Replace MPI communication primitives with corresponding “SIMPLE” primitives
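The restructuring listed above, one task per node with several shared-memory threads inside it, can be sketched for a global sum. This is a minimal illustration, not the SIMPLE API: `node_sum` and `cluster_sum` are hypothetical names, the inter-node reduce is replaced by an ordinary loop over per-node lists, and Python threads show only the structure (they will not give a speedup for this kind of work).

```python
import threading


def node_sum(data, num_threads):
    """Sum one node's data with threads that share a single address space."""
    partial = [0] * num_threads  # one slot per thread, so no locking is needed
    chunk = (len(data) + num_threads - 1) // num_threads

    def worker(tid):
        lo = tid * chunk
        hi = min(lo + chunk, len(data))
        partial[tid] = sum(data[lo:hi])  # each thread reads its own block

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # joining all threads plays the role of a node-level barrier
    return sum(partial)


def cluster_sum(per_node_data, threads_per_node):
    """Stand-in for an inter-node reduce: combine one partial result per node."""
    return sum(node_sum(d, threads_per_node) for d in per_node_data)
```

For example, `cluster_sum([[1, 2, 3], [4, 5], [6]], 2)` combines three node-level partial sums, where a pure message-passing version would have combined one value per processor.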
Portability: Access from User Space
Parallel Complexity Models
SIMPLE Complexity Model: Message Passing Primitives
Comparison of PRAM to SMP
PRAM (theory)
• O(n) processors
• Global clock
• Synchronous shared-memory
• Unit cost for computation or memory access
• Ideal Read/Write models (EREW, CREW, CRCW)
SMP (practice)
• “P” processors (2 to 64)
• Asynchronous lock-step operation
• Uniform memory access to main memory (< 600 ns), faster access to local cache (10-40 ns)
• Cache-coherency at external caches
• Contention for shared memory
OPAL Complexity Model
• SMP Complexity model motivated by Helman and JáJá, Ramachandran
• Complexity given by the triplet (MA, ME, TC)
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged between any processor and memory,
  • TC is the computational complexity.
OPAL Facets
Common Primitives
• Read/Write
• Replicate
• Barrier
• Scan
• Reduce
• Broadcast
• Allreduce
Techniques
• Pointer-jumping
• Balanced Trees (Prefix-Sums)
• Symmetry Breaking (3-Coloring)
• Parallel Prefix (List Ranking)
Graph Algorithms
• Spanning Tree
• Euler Tour
• Tree Functions
• Ear Decomposition
Combinatorics
• Sorting
• Selection
Bioinformatics
• (Minimum Evolution) Phylogeny Trees
• Computational Genomics: Breakpoints, Inversions, Translocations
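Pointer-jumping and list ranking both appear in the lists above, and the technique can be sketched in a few lines. The version below is a sequential simulation of the synchronous rounds (every node updates from the previous round's values), written only to show the idea; `list_rank` is a hypothetical name and not an OPAL routine. It assumes the list is given as a successor array in which the tail points to itself.

```python
def list_rank(succ):
    """Return each node's distance to the tail of a linked list.

    succ[i] is the successor of node i; the tail satisfies succ[i] == i.
    """
    n = len(succ)
    nxt = list(succ)
    # Invariant: rank[i] is the number of links between i and nxt[i].
    rank = [0 if nxt[i] == i else 1 for i in range(n)]
    # Keep jumping until every pointer has reached the tail.
    while any(nxt[nxt[i]] != nxt[i] for i in range(n)):
        # Build new arrays from the old ones to mimic a synchronous parallel step.
        new_rank = [rank[i] + rank[nxt[i]] for i in range(n)]
        new_nxt = [nxt[nxt[i]] for i in range(n)]
        rank, nxt = new_rank, new_nxt
    return rank
```

Each round doubles the distance covered by every pointer, so a list of n nodes needs about log2(n) rounds; on a real SMP each round would be split across threads with a barrier between rounds.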
SMP Complexity Model: SMP Node Primitives
• Read/Write
• Replicate
• Barrier
• Scan
• Reduce
• Broadcast
• Allreduce
• Etc.
• SMP Complexity model motivated by Helman and JáJá
• Complexity given by the triplet (MA, ME, TC)
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged between any processor and memory,
  • TC is the computational complexity.
OPAL Setting: Programming Environment
Local Context Parameters for Each Thread
Control Primitives
Memory Management Primitives
Example Application: Radixsort
• Stable sort of n integers spread evenly across a cluster of p shared-memory r-way nodes
• Decompose each b-bit key into digits of a fixed bit width
• Perform one pass of counting sort per digit, from least significant digit (LSD) to most significant digit (MSD)
• Counting Sort
  • Compute histogram of local keys
  • Communicate: Alltoall primitive of histograms
  • Locally compute prefix-sums of histograms
  • Communicate: (Inverse) Alltoall of prefix-sums
  • Rank each local element
  • Perform a personalized communication (1-relation) rearranging elements into sorted order
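The counting-sort steps above can be sketched as a single-process simulation in which each "node" is simply a Python list. This is an illustration under stated assumptions, not the OPAL implementation: the two Alltoall exchanges and the per-node prefix sums are collapsed into one loop that computes the same global offsets, and the names `counting_sort_pass`, `radix_sort`, `key_bits`, and `digit_bits` are hypothetical.

```python
def counting_sort_pass(nodes, shift, digit_bits):
    """One stable counting-sort pass over keys distributed across `nodes`."""
    p = len(nodes)
    buckets = 1 << digit_bits
    mask = buckets - 1

    # Compute a histogram of the local keys on every node.
    hist = [[0] * buckets for _ in range(p)]
    for j, keys in enumerate(nodes):
        for k in keys:
            hist[j][(k >> shift) & mask] += 1

    # Stand-in for Alltoall + prefix sums + inverse Alltoall: the first output
    # slot for digit d on node j follows all smaller digits on every node and
    # the same digit on lower-numbered nodes.
    offset = [[0] * buckets for _ in range(p)]
    running = 0
    for d in range(buckets):
        for j in range(p):
            offset[j][d] = running
            running += hist[j][d]

    # Rank each local element and place it at its global position.
    out = [None] * running
    for j, keys in enumerate(nodes):
        for k in keys:
            d = (k >> shift) & mask
            out[offset[j][d]] = k
            offset[j][d] += 1

    # Stand-in for the personalized communication: node j receives block j.
    result, start = [], 0
    for keys in nodes:
        result.append(out[start:start + len(keys)])
        start += len(keys)
    return result


def radix_sort(nodes, key_bits, digit_bits):
    """Sort by running one pass per digit, least significant digit first."""
    for shift in range(0, key_bits, digit_bits):
        nodes = counting_sort_pass(nodes, shift, digit_bits)
    return nodes
```

For instance, `radix_sort([[170, 45, 75], [90, 802, 24], [2, 66]], 10, 4)` runs three passes over 4-bit digits and leaves the keys in globally sorted order with the original number of keys on each node.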
Execution Time of Radix Sort on an SMP Cluster
SMP Example: Ear Decomposition
• Ear decomposition
  • Partitions the edges of a graph, useful in parallel processing
  • “Like peeling the layers of an onion”
• Applied to scientific computing problems
  • Computational mechanics (structural rigidity)
  • Computational biology (molecular structure, atoms in DNA chains)
  • Computational fluid dynamics
• Similar to other parallel algorithms for combinatorial problems
  • Trivial and fast sequential algorithm
  • Efficient PRAM algorithm
  • But no known practical, parallel algorithm
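To make the "trivial and fast sequential algorithm" concrete, here is a minimal sketch of one well-known sequential approach, a DFS-based chain decomposition (often attributed to Schmidt). It is a swapped-in technique chosen for brevity and is not necessarily the sequential algorithm the slides have in mind. It assumes a simple, connected, 2-edge-connected graph with vertices numbered 0..n-1; `ear_decomposition` is a hypothetical name.

```python
def ear_decomposition(n, edges):
    """Return a list of ears, each given as a list of vertices along a path or cycle."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Iterative depth-first search from vertex 0 builds a spanning tree.
    parent = [-1] * n
    index = [-1] * n  # DFS discovery number of each vertex
    order = [0]
    index[0] = 0
    stack = [(0, iter(adj[0]))]
    while stack:
        v, neighbours = stack[-1]
        for w in neighbours:
            if index[w] == -1:
                parent[w] = v
                index[w] = len(order)
                order.append(w)
                stack.append((w, iter(adj[w])))
                break
        else:
            stack.pop()

    # Each non-tree edge starts one ear: follow tree edges upward from its
    # lower endpoint until a vertex already on an earlier ear is reached.
    visited = [False] * n
    ears = []
    for v in order:
        for w in adj[v]:
            if index[w] > index[v] and parent[w] != v:  # non-tree edge, v is the ancestor
                visited[v] = True
                ear = [v, w]
                while not visited[w]:
                    visited[w] = True
                    w = parent[w]
                    ear.append(w)
                ears.append(ear)
    return ears
```

On a square with one diagonal, `ear_decomposition(4, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)])` yields the outer cycle as the first ear and the diagonal as the second, and together the ears cover all five edges.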
Ear Decomposition Example
[Figure: input graph, its spanning tree, and the output ears; n = number of vertices, m = number of edges]
Ear Decomposition Complexities
[Table: complexity of the Spanning Tree and Ear Decomposition steps under Message Passing and Shared Memory, and of the Sequential algorithm; the expressions did not survive in this transcript]
Comparison of Ear Decomposition Algorithms
Performance of SMP Ear Decomposition on a Variety of Input Graphs (n = 8192)
SMP Ear Decomposition Algorithms
Conclusions
• New hybrid model for SMP Clusters
• Open Source Parallel Algorithm Library (OPAL)
• High-Performance methodology
• Fastest known algorithms on SMPs and SMP clusters
• Preliminary experimental results
Future Work
• Algorithms for SMP Clusters
  • Validate complexity model
  • Identify classes of efficient algorithms
  • Library of SMP algorithms
  • Methodology for algorithm-engineering
• Clusters of Heterogeneous SMP Nodes
  • Varying node sizes
  • Nodes from different vendors & architectures
  • Hierarchical clusters of SMPs
• Scientific Applications
  • Bioinformatics and Genomics
  • Landscape Ecology and Remote Sensing
  • Computational Fluid Dynamics