Parallel Algorithms Research Computing UNC - Chapel Hill Instructor: Mark Reed Email: markreed@unc.edu
Overview • Parallel Algorithms • Parallel Random Numbers • Application Scaling • MPI Bandwidth
Domain Decomposition • Partition data across processors • Most widely used • “Owner” computes credit: George Karypis – Principles of Parallel Algorithm Design
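A minimal owner-computes sketch of this idea, assuming MPI and a 1-D block partition; the global size N and the update rule are placeholders, not part of the slides:

```c
/* 1-D block domain decomposition sketch: a global array of N cells is
 * partitioned across ranks and each rank updates only the cells it
 * "owns".  N and the update rule are illustrative only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1000000;                 /* global problem size (assumed) */
    int chunk = N / nprocs;
    int lo = rank * chunk;
    int hi = (rank == nprocs - 1) ? N : lo + chunk;  /* last rank takes the remainder */

    double *u = malloc((hi - lo) * sizeof *u);
    for (int i = lo; i < hi; i++)          /* "owner" computes its own block */
        u[i - lo] = 2.0 * i;

    free(u);
    MPI_Finalize();
    return 0;
}
```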
Dense Matrix Multiply • Data sharing for MM with different partitionings • Shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C. credit: George Karypis – Principles of Parallel Algorithm Design
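As a sketch of the row-block case only (not the slide's figure), each rank owns a block of rows of A and C and needs all of B; the matrix order n is an assumed example value and data initialization is elided:

```c
/* 1-D (row-block) partitioned C = A*B: each rank holds nrows rows of A
 * and C plus a full replicated copy of B.  n is assumed divisible by
 * the number of ranks for brevity. */
#include <mpi.h>
#include <stdlib.h>

static void local_matmul(int nrows, int n,
                         const double *A, const double *B, double *C)
{
    for (int i = 0; i < nrows; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 512;                               /* global matrix order (assumed) */
    int nrows = n / nprocs;                          /* rows owned by this rank */

    double *A = calloc((size_t)nrows * n, sizeof *A);  /* my rows of A */
    double *B = calloc((size_t)n * n, sizeof *B);      /* full B, replicated */
    double *C = calloc((size_t)nrows * n, sizeof *C);  /* my rows of C */

    /* ... rank 0 would fill B here before sharing it ... */
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    local_matmul(nrows, n, A, B, C);

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}
```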
Parallel Sum • Sum for Nprocs=8 • Completes after log2(Nprocs) steps credit: Designing and Building Parallel Programs – Ian Foster
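A recursive-doubling sketch of this sum, assuming a power-of-two number of ranks; in practice MPI_Allreduce provides the same result:

```c
/* Recursive-doubling global sum: every rank holds the total after
 * log2(P) exchange steps.  Assumes P is a power of two. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double sum = (double)(rank + 1);       /* each rank contributes one value */

    /* At step s, swap partial sums with the partner whose rank differs
     * in bit s, then add. */
    for (int mask = 1; mask < nprocs; mask <<= 1) {
        int partner = rank ^ mask;
        double recv;
        MPI_Sendrecv(&sum, 1, MPI_DOUBLE, partner, 0,
                     &recv, 1, MPI_DOUBLE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum += recv;
    }

    if (rank == 0)
        printf("global sum = %g (expected %g)\n",
               sum, nprocs * (nprocs + 1) / 2.0);

    MPI_Finalize();
    return 0;
}
```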
Master/Workers Model • Often embarrassingly parallel • Master: • decomposes the problem into small tasks • distributes to workers • gathers partial results to produce the final result • Workers: • work • pass results back to master • request more work (optional) • Mapping/Load Balancing • Static • Dynamic (diagram: one master coordinating several workers)
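A minimal dynamic master/worker sketch in MPI, assuming the task count NTASKS, the stand-in do_task(), and at least one task per worker; it is illustrative only:

```c
/* Rank 0 hands out task indices one at a time; workers send back a
 * result and implicitly request more work until a stop tag arrives. */
#include <mpi.h>
#include <stdio.h>

#define NTASKS   100     /* assumed: at least one task per worker */
#define TAG_WORK 1
#define TAG_STOP 2

static double do_task(int t) { return (double)t * t; }   /* stand-in work */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                        /* master */
        int next = 0, active = 0;
        double total = 0.0, result;
        MPI_Status st;

        /* seed every worker with one task */
        for (int w = 1; w < nprocs && next < NTASKS; w++, next++, active++)
            MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);

        while (active > 0) {                /* gather results, hand out more */
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 0, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                active--;
            }
        }
        printf("total = %g\n", total);
    } else {                                /* worker */
        MPI_Status st;
        int task;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            double r = do_task(task);
            MPI_Send(&r, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```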
Master/Workers Load Balance • Iterations may have different and unpredictable run times • Systematic variance • Algorithmic variance • Goal is to trade off load balance against scheduling overhead • Some Schemes: • Block decomposition, static chunking • Round Robin decomposition • Self scheduling • assign one iteration at a time • Guided dynamic self-scheduling • Assign 1/P of the remaining iterations (P = # procs; sketched below)
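A small sketch of the guided self-scheduling chunk rule only; the iteration count and worker count are assumed example values, and the actual send to a worker is elided:

```c
/* Guided self-scheduling: hand out 1/P of the remaining iterations per
 * request, so chunk sizes shrink as the loop drains.  This just prints
 * the schedule. */
#include <stdio.h>

int main(void)
{
    int n_iterations = 1000;   /* total loop iterations (assumed) */
    int nprocs = 8;            /* number of workers (assumed)     */
    int remaining = n_iterations, start = 0;

    while (remaining > 0) {
        int chunk = remaining / nprocs;     /* 1/P of what is left */
        if (chunk < 1) chunk = 1;           /* never hand out zero work */
        printf("assign iterations [%d, %d)\n", start, start + chunk);
        /* ... send [start, start + chunk) to the next idle worker ... */
        start += chunk;
        remaining -= chunk;
    }
    return 0;
}
```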
Functional Parallelism • map tasks onto sets of processors • further decompose each function over data domain credit: Designing and Building Parallel Programs – Ian Foster
Recursive Bisection • Orthogonal Recursive Bisection (ORB) • good for decomposing irregular grids with mostly local communication • partition the domain into equal parts of work by successively subdividing along orthogonal coordinate directions • the cutting direction is varied at each level of the recursion • ORB partitioning is restricted to p = 2^k processors
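A serial ORB sketch on a 2-D point set, assuming a power-of-two processor count and random coordinates purely for illustration; each level cuts at the median along alternating coordinate directions:

```c
/* Orthogonal recursive bisection of 2-D points: split at the median of
 * the current coordinate direction, alternate the direction, and recurse
 * until each of p = 2^k partitions holds an equal share of the points. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x[2]; int owner; } Point;

static int cmp_dim;   /* coordinate direction for the current split */

static int cmp(const void *a, const void *b)
{
    double d = ((const Point *)a)->x[cmp_dim] - ((const Point *)b)->x[cmp_dim];
    return (d > 0) - (d < 0);
}

/* Assign points [lo, hi) to processors [p0, p0 + np). */
static void orb(Point *pts, int lo, int hi, int p0, int np, int dim)
{
    if (np == 1) {                        /* leaf: one processor owns these */
        for (int i = lo; i < hi; i++) pts[i].owner = p0;
        return;
    }
    cmp_dim = dim;
    qsort(pts + lo, hi - lo, sizeof *pts, cmp);
    int mid = lo + (hi - lo) / 2;         /* median cut: equal-work halves */
    orb(pts, lo, mid, p0,          np / 2, 1 - dim);   /* alternate cut direction */
    orb(pts, mid, hi, p0 + np / 2, np / 2, 1 - dim);
}

int main(void)
{
    const int n = 32, nprocs = 8;         /* nprocs must be a power of two */
    Point pts[32];
    for (int i = 0; i < n; i++) {
        pts[i].x[0] = rand() / (double)RAND_MAX;
        pts[i].x[1] = rand() / (double)RAND_MAX;
    }
    orb(pts, 0, n, 0, nprocs, 0);
    for (int i = 0; i < n; i++)
        printf("(%.2f, %.2f) -> proc %d\n", pts[i].x[0], pts[i].x[1], pts[i].owner);
    return 0;
}
```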
ORB Example – Groundwater modeling at UNC-Ch Two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; and (right) orthogonal recursive bisection (ORB) decomposition. Geometry of the homogeneous sphere-packed medium (a) 3D isosurface view; and (b) 2D cross section view. Blue and white areas stand for solid and fluid spaces, respectively. “A high-performance lattice Boltzmann implementation to model flow in porous media” by Chongxun Pan, Jan F. Prins, and Cass T. Miller
Parallel Random Numbers • Example: Parallel Monte Carlo • Additional Requirements: • usable for an arbitrary (large) number of processors • pseudo-random across processors – streams uncorrelated • generated independently for efficiency • Rule of thumb • max usable sample size is at most the square root of the period
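One generic way to get disjoint per-rank streams (not the SPRNG library described next) is leapfrogging a single linear congruential generator: rank r draws every P-th element of one global sequence. A sketch, assuming Knuth's 64-bit LCG constants and an arbitrary seed:

```c
/* Leapfrog splitting of a 64-bit LCG across MPI ranks: rank r draws the
 * subsequence x_r, x_{r+P}, x_{r+2P}, ... so the per-rank streams are
 * disjoint by construction.  Arithmetic is mod 2^64 via unsigned overflow. */
#include <mpi.h>
#include <stdio.h>
#include <stdint.h>

#define LCG_A 6364136223846793005ULL
#define LCG_C 1442695040888963407ULL

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    uint64_t x = 12345ULL;                 /* common seed on every rank (assumed) */

    /* Advance the state 'rank' steps so rank r starts at x_r. */
    for (int i = 0; i < rank; i++)
        x = LCG_A * x + LCG_C;

    /* Compose the update P times: the per-rank step becomes x -> A*x + C. */
    uint64_t A = 1, C = 0;
    for (int i = 0; i < nprocs; i++) {
        C = LCG_A * C + LCG_C;
        A = LCG_A * A;
    }

    /* Each draw now jumps P places in the global sequence. */
    double u[4];
    for (int i = 0; i < 4; i++) {
        x = A * x + C;
        u[i] = (x >> 11) * (1.0 / 9007199254740992.0);  /* top 53 bits -> [0,1) */
    }
    printf("rank %d: %f %f %f %f\n", rank, u[0], u[1], u[2], u[3]);

    MPI_Finalize();
    return 0;
}
```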
Parallel Random Numbers • Scalable Parallel Random Number Generators Library (SPRNG) • free and source available • collects 5 RNGs together in one package • http://sprng.cs.fsu.edu
QCD Application • MILC • (MIMD Lattice Computation) • quarks and gluons formulated on a space-time lattice • mostly asynchronous PTP communication • MPI_Send_init, MPI_Start, MPI_Startall • MPI_Recv_init, MPI_Wait, MPI_Waitall
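A hedged sketch of the persistent point-to-point pattern behind the calls listed above: MPI_Send_init / MPI_Recv_init set the transfers up once, and MPI_Startall / MPI_Waitall rerun them each iteration. The ring exchange, buffer size, and iteration count are illustrative, not MILC's actual communication layout:

```c
/* Persistent point-to-point ring exchange.  Run with two or more ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1024;                        /* boundary buffer size (assumed) */
    double *sendbuf = calloc(n, sizeof *sendbuf);
    double *recvbuf = calloc(n, sizeof *recvbuf);
    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    MPI_Request req[2];
    MPI_Send_init(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Recv_init(recvbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);

    for (int iter = 0; iter < 100; iter++) {
        /* ... fill sendbuf with this iteration's boundary data ... */
        MPI_Startall(2, req);                  /* restart both persistent requests */
        /* ... overlap interior computation here ... */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        /* ... use recvbuf ... */
    }

    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```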
UNC Capability Computing - Topsail • Compute nodes: 520 dual socket, quad core Intel “Clovertown” processors. • 4M L2 cache per socket • 2.66 GHz processors • 4160 processors • 12 GB memory/node • Shared Disk : 39TB IBRIX Parallel File System • Interconnect: Infiniband • 64 bit OS cluster photos: Scott Sawyer, Dell
MPI PTP on baobab • Need large messages to achieve high rates • Latency cost dominates small messages • MPI_Send crossover from buffered to synchronous • These are instructional only • not a benchmark
MPI PTP on Topsail • InfiniBand (IB) interconnect • Note higher bandwidth • lower latency • Two modes of standard send
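A simple two-rank ping-pong along the lines of these measurements, showing how small messages are dominated by latency while large ones approach link bandwidth. Message sizes and repetition counts are arbitrary; like the slides, this is instructional only, not a benchmark:

```c
/* Ping-pong timing between ranks 0 and 1.  Run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;
    for (int bytes = 8; bytes <= (1 << 22); bytes *= 4) {
        char *buf = calloc(bytes, 1);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = (MPI_Wtime() - t0) / (2.0 * reps);   /* one-way time */
        if (rank == 0)
            printf("%8d bytes  %10.2f us  %8.2f MB/s\n",
                   bytes, dt * 1e6, bytes / dt / 1e6);
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```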
Community Atmosphere Model (CAM) • global atmosphere model for weather and climate research communities (from NCAR) • atmospheric component of Community Climate System Model (CCSM) • hybrid MPI/OpenMP • run here with MPI only • running Eulerian dynamical core with spectral truncation of 31 or 42 • T31: 48x96x26 (lat x lon x nlev) • T42: 64x128x26 • the spectral dynamical core is domain decomposed over latitude only
CAM Performance • scaling plots for the T31 and T42 resolutions