Explore parallel algorithms like dense matrix multiplication, parallel sum, and master/worker models for designing efficient parallel programs. Learn about load balancing, functional parallelism, and recursive bisection techniques for application scaling. Discover parallel random number generation and case studies on QCD applications at UNC.
Parallel Algorithms Research Computing UNC - Chapel Hill Instructor: Mark Reed Email: markreed@unc.edu
Overview • Parallel Algorithms • Parallel Random Numbers • Application Scaling • MPI Bandwidth
Domain Decomposition • Partition data across processors • Most widely used • “Owner” computes credit: George Karypis – Principles of Parallel Algorithm Design
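A minimal sketch (not from the slides) of a 1-D block domain decomposition with the owner-computes rule: each MPI rank derives the contiguous index range it owns and updates only those elements. The global size N and the per-element work are illustrative placeholders.

/* 1-D block domain decomposition with "owner computes" (illustrative). */
#include <mpi.h>
#include <stdio.h>

#define N 1000   /* global problem size (assumed for illustration) */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns a contiguous block of indices [lo, hi). */
    int base = N / nprocs, rem = N % nprocs;
    int lo = rank * base + (rank < rem ? rank : rem);
    int hi = lo + base + (rank < rem ? 1 : 0);

    double local_sum = 0.0;
    for (int i = lo; i < hi; i++)        /* owner computes its own elements */
        local_sum += (double)i * i;      /* stand-in for real work */

    printf("rank %d owns [%d,%d), local sum %g\n", rank, lo, hi, local_sum);
    MPI_Finalize();
    return 0;
}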
Dense Matrix Multiply • Data sharing for MM with different partitionings • The shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C. credit: George Karypis – Principles of Parallel Algorithm Design
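A hedged sketch of the data-sharing pattern for a row-block partitioning: each rank owns a block of rows of C (and the matching rows of A) and needs all of B, which is simply broadcast here. The matrix size and fill values are toy assumptions; a 2-D (checkerboard) partitioning would need less of A and B per process.

/* Row-block parallel matrix multiply sketch: C_local = A_local * B. */
#include <mpi.h>
#include <stdlib.h>

#define N 256   /* assume N is divisible by the number of ranks */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = N / nprocs;                                  /* rows of A and C owned here */
    double *A = malloc((size_t)rows * N * sizeof(double));
    double *B = malloc((size_t)N * N * sizeof(double));     /* replicated on every rank */
    double *C = calloc((size_t)rows * N, sizeof(double));

    /* Fill local A; rank 0 fills B and broadcasts it to everyone. */
    for (int i = 0; i < rows * N; i++) A[i] = 1.0;
    if (rank == 0)
        for (int i = 0; i < N * N; i++) B[i] = 2.0;
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Owner computes its block of C. */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}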
Parallel Sum • Sum for Nprocs = 8 • Completes after log2(Nprocs) steps credit: Designing and Building Parallel Programs – Ian Foster
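A sketch of the log2(Nprocs) parallel sum using recursive doubling, a butterfly variant in which every rank ends up with the total. It assumes a power-of-two number of ranks; in practice MPI_Reduce / MPI_Allreduce implement this pattern for you.

/* Recursive-doubling parallel sum: log2(Nprocs) pairwise exchange steps. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sum = (double)(rank + 1);   /* each rank's local value */

    /* At each step, partners a distance 2^s apart exchange and add partial sums. */
    for (int dist = 1; dist < nprocs; dist <<= 1) {
        int partner = rank ^ dist;     /* butterfly partner */
        double recvd;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &recvd, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += recvd;
    }

    if (rank == 0) printf("global sum = %g\n", sum);
    MPI_Finalize();
    return 0;
}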
Master/Workers Model • Often embarrassingly parallel • Master: • decomposes the problem into small tasks • distributes them to workers • gathers partial results to produce the final result • Workers: • perform the work • pass results back to the master • request more work (optional) • Mapping/Load Balancing • Static • Dynamic (diagram: one master, four workers)
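A hedged master/worker sketch with dynamic (self-scheduling) assignment: the master hands out one task index at a time and retires workers with a DONE tag. The task count and the "work" (squaring the index) are placeholders.

/* Master/worker with dynamic task assignment (illustrative). */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100
#define TAG_WORK 1
#define TAG_DONE 2

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                            /* master */
        int next = 0, active = 0;
        double total = 0.0;
        /* Seed each worker with one task (or retire it if none remain). */
        for (int w = 1; w < nprocs; w++) {
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                MPI_Send(&next, 1, MPI_INT, w, TAG_DONE, MPI_COMM_WORLD);
            }
        }
        while (active > 0) {
            double result; MPI_Status st;
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {                /* hand out the next task */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {                            /* no work left: retire worker */
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_DONE, MPI_COMM_WORLD);
                active--;
            }
        }
        printf("total = %g\n", total);
    } else {                                    /* worker */
        for (;;) {
            int task; MPI_Status st;
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            double result = (double)task * task;    /* the "work" */
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}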
Master/Workers Load Balance • Iterations may have different and unpredictable run times • Systematic variance • Algorithmic variance • Goal is to balance the load evenly against the scheduling overhead • Some schemes: • Block decomposition (static chunking) • Round-robin decomposition • Self scheduling • assign one iteration at a time • Guided dynamic self-scheduling • assign 1/P of the remaining iterations (P = # procs), as sketched below
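The guided self-scheduling rule from the last bullet, sketched as the chunk schedule a master would hand out: each request receives roughly 1/P of the remaining iterations, so chunks shrink as the loop drains and the tail stays balanced. The iteration count and P are illustrative.

/* Guided self-scheduling: print the shrinking chunk sizes (illustrative). */
#include <stdio.h>

int main(void)
{
    int P = 4, remaining = 1000, start = 0;

    while (remaining > 0) {
        int chunk = (remaining + P - 1) / P;   /* ceil(remaining / P) */
        if (chunk < 1) chunk = 1;
        printf("assign iterations [%d, %d)\n", start, start + chunk);
        start += chunk;
        remaining -= chunk;
    }
    return 0;
}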
Functional Parallelism • map tasks onto sets of processors • further decompose each function over data domain credit: Designing and Building Parallel Programs – Ian Foster
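A hedged sketch of functional parallelism with MPI: the ranks are split into two groups, standing in for two functional components, each of which can then decompose its own data domain over its sub-communicator. The half/half split is an arbitrary assumption.

/* Functional parallelism: map ranks onto component sub-communicators. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* First half of the ranks runs component 0, second half component 1. */
    int component = (rank < nprocs / 2) ? 0 : 1;
    MPI_Comm comp_comm;
    MPI_Comm_split(MPI_COMM_WORLD, component, rank, &comp_comm);

    int crank, csize;
    MPI_Comm_rank(comp_comm, &crank);
    MPI_Comm_size(comp_comm, &csize);
    printf("world rank %d -> component %d, local rank %d of %d\n",
           rank, component, crank, csize);

    MPI_Comm_free(&comp_comm);
    MPI_Finalize();
    return 0;
}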
Recursive Bisection • Orthogonal Recursive Bisection (ORB) • good for decomposing irregular grids with mostly local communication • partition the domain into equal parts of work by successively subdividing it along orthogonal coordinate directions • the cutting direction is varied at each level of the recursion • ORB partitioning is restricted to p = 2^k processors.
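A serial sketch of orthogonal recursive bisection on 2-D points, assuming p = 2^k partitions and equal work per point: at each level the cut direction alternates and the point set is split at the median coordinate. The point count and coordinates are illustrative.

/* Orthogonal recursive bisection on 2-D points (serial, illustrative). */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; int part; } Point;

static int cut_dim;                         /* 0 = cut in x, 1 = cut in y */
static int cmp(const void *a, const void *b)
{
    const Point *p = a, *q = b;
    double da = cut_dim ? p->y : p->x, db = cut_dim ? q->y : q->x;
    return (da > db) - (da < db);
}

static void orb(Point *pts, int n, int level, int part)
{
    if (level == 0) {                       /* leaf: label the partition */
        for (int i = 0; i < n; i++) pts[i].part = part;
        return;
    }
    cut_dim = level % 2;                    /* alternate cut direction */
    qsort(pts, n, sizeof(Point), cmp);      /* split at the median coordinate */
    int half = n / 2;
    orb(pts, half, level - 1, 2 * part);
    orb(pts + half, n - half, level - 1, 2 * part + 1);
}

int main(void)
{
    enum { N = 16 };
    Point pts[N];
    for (int i = 0; i < N; i++) { pts[i].x = rand() % 100; pts[i].y = rand() % 100; }
    orb(pts, N, 2, 0);                      /* 2 levels -> 4 partitions */
    for (int i = 0; i < N; i++)
        printf("(%.0f,%.0f) -> partition %d\n", pts[i].x, pts[i].y, pts[i].part);
    return 0;
}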
ORB Example – Groundwater modeling at UNC-Ch Two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; and (right) orthogonal recursive bisection (ORB) decomposition. Geometry of the homogeneous sphere-packed medium (a) 3D isosurface view; and (b) 2D cross section view. Blue and white areas stand for solid and fluid spaces, respectively. “A high-performance lattice Boltzmann implementation to model flow in porous media” by Chongxun Pan, Jan F. Prins, and Cass T. Miller
Parallel Random Numbers • Example: Parallel Monte Carlo • Additional requirements: • usable for an arbitrary (large) number of processors • pseudo-random across processors – streams uncorrelated • generated independently for efficiency • Rule of thumb • the usable sample size is at most the square root of the period
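A toy illustration of per-processor streams (deliberately not SPRNG, which the next slide introduces): each MPI rank hashes a base seed with its rank (splitmix64) and then draws from a local xorshift64* generator. Real parallel Monte Carlo should rely on a library with guaranteed independent streams.

/* Per-rank random streams, toy version (not a substitute for SPRNG). */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

static uint64_t splitmix64(uint64_t x)      /* seed-scrambling hash */
{
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

static uint64_t state;
static double uniform01(void)               /* xorshift64* -> [0,1) */
{
    state ^= state >> 12;
    state ^= state << 25;
    state ^= state >> 27;
    return (double)((state * 0x2545f4914f6cdd1dULL) >> 11) / 9007199254740992.0;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    uint64_t base_seed = 12345;
    state = splitmix64(base_seed + (uint64_t)rank);   /* per-rank stream */

    printf("rank %d first draw %f\n", rank, uniform01());

    MPI_Finalize();
    return 0;
}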
Parallel Random Numbers • Scalable Parallel Random Number Generators Library (SPRNG) • free and source available • collects 5 RNGs together in one package • http://sprng.cs.fsu.edu
QCD Application • MILC • (MIMD Lattice Computation) • quarks and gluons formulated on a space-time lattice • mostly asynchronous point-to-point (PTP) communication using persistent requests • MPI_Send_init, MPI_Start, MPI_Startall • MPI_Recv_init, MPI_Wait, MPI_Waitall
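A hedged sketch of the persistent point-to-point pattern named above: MPI_Send_init / MPI_Recv_init set the requests up once, and MPI_Startall / MPI_Waitall reuse them every iteration. This is a generic ring exchange, not MILC code.

/* Persistent point-to-point communication reused across iterations. */
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs, left = (rank + nprocs - 1) % nprocs;
    double sendbuf, recvbuf;
    MPI_Request reqs[2];

    /* Set up the persistent requests once. */
    MPI_Send_init(&sendbuf, 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(&recvbuf, 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[1]);

    for (int iter = 0; iter < 10; iter++) {
        sendbuf = rank + iter;              /* refresh the send buffer */
        MPI_Startall(2, reqs);              /* reuse the same requests */
        /* ... overlap local computation here ... */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}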
UNC Capability Computing - Topsail • Compute nodes: 520, each with dual-socket, quad-core Intel “Clovertown” processors • 4 MB L2 cache per socket • 2.66 GHz processors • 4160 cores total • 12 GB memory/node • Shared disk: 39 TB IBRIX parallel file system • Interconnect: InfiniBand • 64-bit OS cluster photos: Scott Sawyer, Dell
MPI PTP on baobab • Need large messages to achieve high transfer rates • Latency dominates for small messages • MPI_Send crosses over from buffered to synchronous mode • These results are instructional only • not a benchmark
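A minimal ping-pong sketch in the same instructional spirit (not a benchmark): ranks 0 and 1 bounce a message back and forth to estimate one-way latency and bandwidth. The message size and repeat count are illustrative, and at least two ranks are required.

/* Ping-pong timing between ranks 0 and 1 (instructional only). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int nbytes = 1 << 20, reps = 100;   /* 1 MB messages, 100 round trips */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(nbytes);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("avg one-way time %g s, bandwidth %g MB/s\n",
               dt / (2.0 * reps), (double)nbytes * 2 * reps / dt / 1e6);
    free(buf);
    MPI_Finalize();
    return 0;
}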
MPI PTP on Topsail • InfiniBand (IB) interconnect • Note the higher bandwidth • and lower latency • Two modes of standard send
Community Atmosphere Model (CAM) • global atmosphere model for the weather and climate research communities (from NCAR) • atmospheric component of the Community Climate System Model (CCSM) • hybrid MPI/OpenMP • run here with MPI only • running the Eulerian dynamical core with spectral truncation of 31 or 42 • T31: 48x96x26 (lat x lon x nlev) • T42: 64x128x26 • the spectral dynamical core's domain is decomposed over latitude only
CAM Performance • performance plots for the T31 and T42 resolutions