
Parallel Algorithms

Presentation Transcript


  1. Parallel Algorithms Research Computing UNC - Chapel Hill Instructor: Mark Reed Email: markreed@unc.edu

  2. Overview • Parallel Algorithms • Parallel Random Numbers • Application Scaling • MPI Bandwidth

  3. Domain Decomposition • Partition data across processors • Most widely used • “Owner” computes (credit: George Karypis – Principles of Parallel Algorithm Design)
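The sketch below illustrates the "owner computes" rule with a 1-D block decomposition in MPI C: each rank works out which contiguous block of a global array it owns and updates only those elements. The global size N and the per-element work are illustrative, not taken from the slides.

```c
/* Minimal sketch of 1-D block domain decomposition ("owner computes").
 * Assumes a global array of N elements divided as evenly as possible
 * across ranks; N and the per-element work are illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int N = 1000;               /* global problem size (illustrative) */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each rank owns a contiguous block; early ranks absorb the remainder. */
    int base  = N / nprocs;
    int rem   = N % nprocs;
    int count = base + (rank < rem ? 1 : 0);
    int start = rank * base + (rank < rem ? rank : rem);

    /* "Owner computes": this rank updates only the elements it owns. */
    double sum = 0.0;
    for (int i = start; i < start + count; i++)
        sum += (double)i;             /* stand-in for real work on element i */

    printf("rank %d owns [%d,%d), local sum %g\n", rank, start, start + count, sum);

    MPI_Finalize();
    return 0;
}
```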

  4. Dense Matrix Multiply • Data sharing for MM with different partitionings • Shaded regions of the input matrices (A, B) are required by the process that computes the shaded portion of the output matrix C. (credit: George Karypis – Principles of Parallel Algorithm Design)
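As a concrete instance of this data-sharing picture, the hedged sketch below uses the simplest 1-D row partitioning of C: the process that computes a strip of rows of C needs the same strip of A but all of B, hence a broadcast. Matrix order, fill values, and the assumption that nprocs divides n are all illustrative.

```c
/* Sketch of data sharing for dense matrix multiply with a 1-D row
 * partitioning of C: each process needs its own rows of A but all of B.
 * Sizes and fill values are illustrative; assumes nprocs divides n. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int n = 512;                         /* matrix order (illustrative) */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int rows = n / nprocs;                     /* my strip of rows */
    double *A = malloc((size_t)rows * n * sizeof *A);  /* my rows of A   */
    double *B = malloc((size_t)n * n * sizeof *B);     /* full copy of B */
    double *C = calloc((size_t)rows * n, sizeof *C);   /* my rows of C   */

    /* Illustrative fill; a real code would scatter A and B from rank 0. */
    for (int i = 0; i < rows * n; i++) A[i] = 1.0;
    for (int i = 0; i < n * n; i++)    B[i] = (rank == 0) ? 1.0 : 0.0;

    /* Every process needs all of B for its strip of C, hence the broadcast. */
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < rows; i++)             /* local strip of C = A_strip * B */
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i * n + j] += A[i * n + k] * B[k * n + j];

    free(A); free(B); free(C);
    MPI_Finalize();
    return 0;
}
```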

  5. Dense Matrix Multiply

  6. Parallel Sum • Sum for Nprocs = 8 • Completes after log2(Nprocs) steps (credit: Designing and Building Parallel Programs – Ian Foster)
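A minimal sketch of that log2(P)-step tree sum is below: at step s, ranks with bit s set send their partial sum to the partner s below them and drop out, so the result accumulates at rank 0 in a logarithmic number of rounds. The local values are illustrative; in practice MPI_Reduce performs the same tree-structured reduction.

```c
/* Tree-based parallel sum in ceil(log2(nprocs)) steps (recursive halving).
 * Each rank contributes its own rank value as an illustrative local sum. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double partial = (double)rank;   /* each rank's local value (illustrative) */

    for (int step = 1; step < nprocs; step <<= 1) {
        if (rank & step) {           /* sender: ship partial sum and drop out */
            MPI_Send(&partial, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (rank + step < nprocs) {   /* receiver: accumulate */
            double incoming;
            MPI_Recv(&incoming, 1, MPI_DOUBLE, rank + step, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            partial += incoming;
        }
    }

    if (rank == 0)
        printf("sum = %g on rank 0\n", partial);

    MPI_Finalize();
    return 0;
}
```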

  7. Master/Workers Model • Often embarrassingly parallel • Master: • decomposes the problem into small tasks • distributes them to workers • gathers partial results to produce the final result • Workers: • do the work • pass results back to master • request more work (optional) • Mapping/Load Balancing: • Static • Dynamic [diagram: one master connected to four workers]
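The sketch below shows one common MPI realization of this pattern with dynamic assignment: the master primes each worker with a task, then hands out the next task to whichever worker returns a result, until a stop tag is sent. Tags, task count, and do_task are illustrative and assume at least two ranks.

```c
/* Hedged master/workers sketch with dynamic (self) scheduling.
 * Assumes nprocs >= 2; tags and task contents are illustrative. */
#include <mpi.h>

#define TAG_WORK 1
#define TAG_STOP 2

static double do_task(int task) { return (double)task * task; }  /* stand-in work */

int main(int argc, char **argv)
{
    const int ntasks = 100;                       /* illustrative task count */
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                              /* master */
        int next = 0, done = 0;
        double result, total = 0.0;
        MPI_Status st;

        /* Prime each worker with one task (or a stop if none remain). */
        for (int w = 1; w < nprocs; w++) {
            int tag = (next < ntasks) ? TAG_WORK : TAG_STOP;
            MPI_Send(&next, 1, MPI_INT, w, tag, MPI_COMM_WORLD);
            if (tag == TAG_WORK) next++;
        }
        /* Collect results; reassign work until every task is finished. */
        while (done < ntasks) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            total += result;
            done++;
            int tag = (next < ntasks) ? TAG_WORK : TAG_STOP;
            MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, tag, MPI_COMM_WORLD);
            if (tag == TAG_WORK) next++;
        }
    } else {                                      /* worker */
        int task;
        MPI_Status st;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;    /* no more work */
            double result = do_task(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```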

  8. Master/Workers Load Balance • Iterations may have different and unpredictable run times • Systematic variance • Algorithmic variance • Goal is to trade off load balance against scheduling overhead • Some schemes: • Block decomposition, static chunking • Round-robin decomposition • Self scheduling • assign one iteration at a time • Guided dynamic self-scheduling • assign 1/P of the remaining iterations (P = # procs), as sketched below
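The short serial sketch below only illustrates how guided self-scheduling chunk sizes shrink: each request is handed roughly 1/P of the iterations still remaining, so early chunks are large (low overhead) and late chunks are small (good balance). The iteration count and P are illustrative.

```c
/* Chunk-size behavior of guided dynamic self-scheduling: each request
 * receives about 1/P of the remaining iterations.  Serial illustration. */
#include <stdio.h>

int main(void)
{
    int remaining = 1000;     /* total loop iterations (illustrative) */
    const int P = 8;          /* number of processors */

    while (remaining > 0) {
        int chunk = remaining / P;
        if (chunk < 1) chunk = 1;          /* never hand out an empty chunk */
        printf("assign %4d iterations, %4d left\n", chunk, remaining - chunk);
        remaining -= chunk;
    }
    return 0;
}
```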

  9. Functional Parallelism • map tasks onto sets of processors • further decompose each function over data domain credit: Designing and Building Parallel Programs – Ian Foster

  10. Recursive Bisection • Orthogonal Recursive Bisection (ORB) • good for decomposing irregular grids with mostly local communication • partition the domain into equal parts of work by successively subdividing along orthogonal coordinate directions • cutting direction is varied at each level of the recursion • ORB partitioning is restricted to p = 2^k processors
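The sketch below applies ORB to a set of 2-D points, using point count as a stand-in for "work": split at the median along x, recurse on each half along y, alternating directions until 2^k partitions remain. The point set, the qsort-based median split, and all names are illustrative.

```c
/* Hedged ORB sketch on 2-D points: alternate cutting directions and split
 * each block at its median (equal point count as the proxy for equal work).
 * Requires P to be a power of two. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { double x, y; } Point;

static int axis;   /* 0 = cut along x, 1 = cut along y (used by cmp) */

static int cmp(const void *a, const void *b)
{
    const Point *p = a, *q = b;
    double da = axis ? p->y : p->x, db = axis ? q->y : q->x;
    return (da > db) - (da < db);
}

/* Assign points [lo,hi) to partitions [part, part + nparts). */
static void orb(Point *pts, int *owner, int lo, int hi,
                int part, int nparts, int dir)
{
    if (nparts == 1) {                    /* leaf: one processor owns the block */
        for (int i = lo; i < hi; i++) owner[i] = part;
        return;
    }
    axis = dir;
    qsort(pts + lo, hi - lo, sizeof *pts, cmp);   /* order along cut direction */
    int mid = lo + (hi - lo) / 2;                 /* median split */
    orb(pts, owner, lo, mid, part,              nparts / 2, 1 - dir);
    orb(pts, owner, mid, hi, part + nparts / 2, nparts / 2, 1 - dir);
}

int main(void)
{
    enum { N = 16, P = 4 };               /* P must be a power of two */
    Point pts[N];
    int owner[N];

    for (int i = 0; i < N; i++) {         /* scattered sample points */
        pts[i].x = (double)(i * 7 % N);
        pts[i].y = (double)(i * 3 % N);
    }
    orb(pts, owner, 0, N, 0, P, 0);
    for (int i = 0; i < N; i++)
        printf("point (%2g,%2g) -> proc %d\n", pts[i].x, pts[i].y, owner[i]);
    return 0;
}
```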

  11. ORB Example – Groundwater modeling at UNC-Ch Two-dimensional examples of the non-uniform domain decompositions on 16 processors: (left) rectilinear partitioning; and (right) orthogonal recursive bisection (ORB) decomposition. Geometry of the homogeneous sphere-packed medium (a) 3D isosurface view; and (b) 2D cross section view. Blue and white areas stand for solid and fluid spaces, respectively. “A high-performance lattice Boltzmann implementation to model flow in porous media” by Chongxun Pan, Jan F. Prins, and Cass T. Miller

  12. Parallel Random Numbers • Example: Parallel Monte Carlo • Additional requirements: • usable for an arbitrary (large) number of processors • pseudo-random across processors – streams uncorrelated • generated independently for efficiency • Rule of thumb: • max usable sample size is at most the square root of the period
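The minimal parallel Monte Carlo sketch below (estimating pi) shows why each rank needs its own stream. Note the hedge: the naive rank-offset seeding used here does not guarantee uncorrelated streams; that is exactly the problem a library such as SPRNG (next slide) solves. Sample counts and seeds are illustrative.

```c
/* Parallel Monte Carlo pi estimate with one random stream per rank.
 * The rank-offset seeding is a naive placeholder, NOT a guarantee of
 * uncorrelated streams; use a parallel RNG library (e.g. SPRNG) for that. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const long samples_per_rank = 1000000;   /* illustrative */
    int rank, nprocs;
    long hits = 0, total_hits = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    unsigned int seed = 12345u + 1000u * (unsigned int)rank;  /* naive per-rank seed */
    for (long i = 0; i < samples_per_rank; i++) {
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if (x * x + y * y <= 1.0) hits++;     /* point fell inside quarter circle */
    }

    MPI_Reduce(&hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi ~= %f\n", 4.0 * total_hits / (samples_per_rank * nprocs));

    MPI_Finalize();
    return 0;
}
```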

  13. Parallel Random Numbers • Scalable Parallel Random Number Generators Library (SPRNG) • free and source available • collects 5 RNGs in one package • http://sprng.cs.fsu.edu

  14. QCD Application • MILC • (MIMD Lattice Computation) • quarks and gluons formulated on a space-time lattice • mostly asynchronous PTP communication • MPI_Send_init, MPI_Start, MPI_Startall • MPI_Recv_init, MPI_Wait, MPI_Waitall
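The slide names MPI's persistent point-to-point calls; the hedged sketch below shows the general pattern they enable, using a simple ring halo exchange as a stand-in for MILC's actual communication: set up the send/receive requests once, then start and complete them each iteration.

```c
/* Persistent point-to-point pattern (MPI_Send_init / MPI_Recv_init /
 * MPI_Startall / MPI_Waitall) illustrated with a ring exchange; the
 * buffer contents and loop structure are illustrative, not MILC's. */
#include <mpi.h>

int main(int argc, char **argv)
{
    enum { N = 1024, NITER = 100 };
    double sendbuf[N], recvbuf[N];
    int rank, nprocs;
    MPI_Request req[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;          /* ring neighbors */
    int left  = (rank - 1 + nprocs) % nprocs;

    /* Set up persistent requests once, outside the iteration loop. */
    MPI_Send_init(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Recv_init(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[1]);

    for (int iter = 0; iter < NITER; iter++) {
        for (int i = 0; i < N; i++) sendbuf[i] = rank + iter;  /* fill boundary data */
        MPI_Startall(2, req);                 /* launch both transfers */
        /* ... overlap interior computation here ... */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        /* ... use recvbuf for boundary computation ... */
    }

    MPI_Request_free(&req[0]);
    MPI_Request_free(&req[1]);
    MPI_Finalize();
    return 0;
}
```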

  15. MILC – Strong Scaling

  16. MILC – Strong Scaling

  17. UNC Capability Computing - Topsail • Compute nodes: 520 nodes, each with two quad-core Intel “Clovertown” sockets • 4 MB L2 cache per socket • 2.66 GHz processors • 4160 cores total • 12 GB memory/node • Shared disk: 39 TB IBRIX parallel file system • Interconnect: InfiniBand • 64-bit OS (cluster photos: Scott Sawyer, Dell)

  18. MPI PTP on baobab • Need large messages to achieve high rates • Latency cost dominates small messages • MPI_Send crossover from buffered to synchronous • These are instructional only • not a benchmark
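A simple ping-pong sketch of the kind behind these curves is shown below: time round trips of increasing message size between two ranks, so latency visibly dominates small messages and bandwidth only approaches its peak for large ones. Like the slide's plots, this is instructional only, not a benchmark; the repetition count and size range are illustrative.

```c
/* Ping-pong timing sketch: run with 2 ranks, sweep message sizes, and
 * report approximate one-way bandwidth per size.  Instructional only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int reps = 100;
    int rank;
    char *buf = calloc(1, 1 << 22);           /* up to 4 MB messages */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* intended for exactly 2 ranks */

    for (int size = 1; size <= (1 << 22); size *= 4) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);   /* approx one-way time */
        if (rank == 0)
            printf("%8d bytes  %10.2f MB/s\n", size, size / t / 1.0e6);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```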

  19. MPI PTP on Topsail • InfiniBand (IB) interconnect • Note higher bandwidth and lower latency • Two modes of standard send

  20. Community Atmosphere Model (CAM) • global atmosphere model for the weather and climate research communities (from NCAR) • atmospheric component of the Community Climate System Model (CCSM) • hybrid MPI/OpenMP • run here with MPI only • running the Eulerian dynamical core with spectral truncation of 31 or 42 • T31: 48x96x26 (lat x lon x nlev) • T42: 64x128x26 • the spectral dynamical core is domain decomposed over latitude only

  21. CAM Performance (T31 and T42)
