
Parallel Application Scaling, Performance, and Efficiency

This talk by David Skinner focuses on parallel scaling of MPI codes, discussing topics such as distribution of work, computation placement, performance costs of communication, and understanding scaling performance terminology.


Presentation Transcript


  1. Parallel Application Scaling, Performance, and Efficiency David Skinner NERSC/LBL

  2. Parallel Scaling of MPI Codes A practical talk on using MPI with focus on: • Distribution of work within a parallel program • Placement of computation within a parallel computer • Performance costs of various types of communication • Understanding scaling performance terminology

  3. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling

  4. Let’s introduce these topics through a familiar example: Sharks and Fish II • Sharks and Fish II: N^2 force summation in parallel • E.g. 4 CPUs evaluate forces for a global collection of 125 fish • Domain decomposition: each CPU is “in charge” of ~31 fish (31, 31, 31, and 32 in the figure), but keeps a fairly recent copy of all the fishes’ positions (replicated data) • It is not possible to uniformly decompose problems in general, especially in many dimensions • Luckily this problem has fine granularity and is 2D, so let’s see how it scales

  5. Sharks and Fish II: Program
     Data: n_fish is global; my_fish is local; fish_i = {x, y, …}
     Dynamics:
       MPI_Allgatherv(myfish_buf, len[rank], ..
       for (i = 0; i < my_fish; ++i) {
         for (j = 0; j < n_fish; ++j) { // i != j
           a_i += g * mass_j * (fish_i - fish_j) / r_ij
         }
       }
       Move fish
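The slide-5 loop can be emulated serially to check the decomposition. The sketch below is a minimal Python model, not the talk's code. It assumes each fish is an (x, y, mass) tuple, gives the remainder of an uneven split to the last task (which reproduces the 31/31/31/32 split of 125 fish over 4 tasks), and follows the slide's formula literally, dividing by r_ij. All function names are hypothetical and no MPI is involved: the replicated `fish` list stands in for the result of MPI_Allgatherv.

```python
import math


def split_counts(n_fish, n_tasks):
    """Fish owned by each task: an even share, with the remainder on the last task."""
    base = n_fish // n_tasks
    counts = [base] * n_tasks
    counts[-1] += n_fish - base * n_tasks
    return counts


def accelerations(owned, fish, g=1.0):
    """Run the slide's double loop for the fish indices in `owned`.

    `fish` is the replicated list of (x, y, mass) tuples for ALL fish.
    Returns one (ax, ay) pair per owned fish.
    """
    result = []
    for i in owned:
        xi, yi, _ = fish[i]
        ax = ay = 0.0
        for j, (xj, yj, mass_j) in enumerate(fish):
            if i == j:  # a fish exerts no force on itself
                continue
            r_ij = math.hypot(xi - xj, yi - yj)
            ax += g * mass_j * (xi - xj) / r_ij
            ay += g * mass_j * (yi - yj) / r_ij
        result.append((ax, ay))
    return result


def parallel_step(fish, n_tasks):
    """Emulate each rank's share of the work in turn and concatenate the results."""
    out, start = [], 0
    for count in split_counts(len(fish), n_tasks):
        out.extend(accelerations(range(start, start + count), fish))
        start += count
    return out
```

Each emulated rank does my_fish x n_fish work, so the total stays N^2 no matter how many tasks share it. Only the replicated position data grows with the number of tasks.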

  6. Sharks and Fish II: How fast? Running on a machine ~seaborg.nersc.gov • 100 fish can move 1000 steps in: 1 task → 5.459 s, 32 tasks → 2.756 s (1.98x speedup) • 1000 fish can move 1000 steps in: 1 task → 511.14 s, 32 tasks → 20.815 s (24.6x speedup) • What’s the “best” way to run? • How many fish do we really have? • How large a computer do we have? • How much “computer time”, i.e. allocation, do we have? • How quickly, in real wall time, do we need the answer?

  7. Scaling: Good 1st Step: Do runtimes make sense? Running fish_sim for 100-1000 fish on 1-32 CPUs we see [plot: runtimes from 1 task … 32 tasks]

  8. Scaling: Walltimes Walltime is (all)important but let’s define some other scaling metrics

  9. Scaling: definitions • Scaling studies involve changing the degree of parallelism. Will we change the problem size as well? • Strong scaling: fixed problem size • Weak scaling: problem size grows with additional resources • Speedup = Ts/Tp(n), where Ts is the serial runtime and Tp(n) is the parallel runtime on n tasks • Efficiency = Ts/(n*Tp(n)) Be aware that there are multiple definitions for these terms
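As a worked example of these two definitions, the sketch below applies them to the walltimes quoted on slide 6. The function names are illustrative, and the 1-task time is taken as Ts, which is one of several possible conventions.

```python
def speedup(t_serial, t_parallel):
    """Speedup = Ts / Tp(n)."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_tasks):
    """Efficiency = Ts / (n * Tp(n)), i.e. speedup divided by task count."""
    return t_serial / (n_tasks * t_parallel)


if __name__ == "__main__":
    # Walltimes from slide 6: (fish, 1-task seconds, 32-task seconds)
    for n_fish, t1, t32 in [(100, 5.459, 2.756), (1000, 511.14, 20.815)]:
        print(n_fish, speedup(t1, t32), efficiency(t1, t32, 32))
```

With these numbers the 100-fish run is about 6% efficient on 32 tasks, while the 1000-fish run is about 77% efficient: the larger problem gives each task enough work to amortize the communication.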

  10. Scaling: Speedups

  11. Scaling: Efficiencies Remarkably smooth! Often algorithm and architecture make the efficiency landscape quite complex

  12. Scaling: Analysis • Why does efficiency drop? • Serial code sections → Amdahl’s law • Surface to Volume → Communication bound • Algorithm complexity or switching • Communication protocol switching → Whoa!
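Amdahl's law, the first cause listed, can be written down in a few lines. The sketch below uses the standard form S(n) = 1 / (s + (1 - s)/n), where s is the fraction of runtime that stays serial. The 5% value in the usage lines is a made-up illustration, not a measurement from the talk.

```python
def amdahl_speedup(serial_fraction, n_tasks):
    """Best-case speedup when `serial_fraction` of the runtime cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_tasks)


def amdahl_efficiency(serial_fraction, n_tasks):
    """Amdahl speedup divided by the number of tasks."""
    return amdahl_speedup(serial_fraction, n_tasks) / n_tasks


if __name__ == "__main__":
    # Hypothetical code that is 5% serial
    for n in (1, 32, 1024):
        print(n, amdahl_speedup(0.05, n), amdahl_efficiency(0.05, n))
```

However many tasks are added, the speedup never exceeds 1/s, so a 5% serial section caps the speedup at 20.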

  13. Scaling: Analysis • In general, changing problem size and concurrency expose or remove compute resources. Bottlenecks shift. • In general, first bottleneck wins. • Scaling brings additional resources too. • More CPUs (of course) • More cache(s) • More memory BW in some cases

  14. Scaling: Superlinear Speedup # CPUs (OMP)

  15. Scaling: Communication Bound 64 tasks: 52% comm; 192 tasks: 66% comm; 768 tasks: 79% comm • MPI_Allreduce buffer size is 32 bytes. • Q: What resource is being depleted here? • A: Small message latency • Compute per task is decreasing • Synchronization rate is increasing • Surface : Volume ratio is increasing
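The surface-to-volume trend can be illustrated with a simple geometric model. The sketch below assumes a cubic grid split into equal cubic subdomains, which is a hypothetical decomposition chosen for illustration and not necessarily that of the application measured on this slide. Cells on a subdomain's faces stand for data that must be communicated, and interior cells stand for compute.

```python
def surface_to_volume(cells_per_side, tasks_per_side):
    """Face cells divided by total cells for one cubic subdomain.

    The global grid has cells_per_side**3 cells and is split over
    tasks_per_side**3 tasks (an assumed cubic decomposition).
    """
    side = cells_per_side / tasks_per_side
    surface = 6.0 * side ** 2  # six faces per cubic subdomain
    volume = side ** 3
    return surface / volume


if __name__ == "__main__":
    # Strong scaling: fixed 256^3 grid, growing task count
    for tasks_per_side in (1, 2, 4, 8):
        print(tasks_per_side ** 3, surface_to_volume(256, tasks_per_side))
```

Under strong scaling the ratio grows in proportion to tasks_per_side, so each task spends a larger share of its time on communication as the task count rises.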

  16. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O

  17. Load Balance : cartoon Unbalanced: Universal App Balanced: Time saved by load balance

  18. Load Balance : performance data Communication Time: 64 tasks show 200s, 960 tasks show 230s MPI ranks sorted by total communication time

  19. Load Balance: ~code [figure labels: 960 x, 64 x] while(1) { do_flops(N_i); MPI_Alltoall(); MPI_Allreduce(); }

  20. Load Balance: real code [plot: time spent in Flops, Exchange, and Sync; axes: MPI Rank →, Time →]

  21. Load Balance: analysis • The 64 slow tasks (with more compute work) cause 30 seconds more “communication” in 960 tasks • This leads to 28800 CPU*seconds (8 CPU*hours) of unproductive computing • All imbalance requires is one slow task and a synchronizing collective! • Pair problem size and concurrency well. • Parallel computers allow you to waste time faster!
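The arithmetic on this slide, and the reason one slow task is enough, can be sketched as follows. The first function reproduces the slide's own estimate (extra wait multiplied by task count). The second is a simple assumed model in which every task waits at a synchronizing collective until the slowest one arrives. Both function names are hypothetical.

```python
def wasted_cpu_seconds(extra_wait_s, n_tasks):
    """The slide's estimate: every task is charged for the extra wait."""
    return extra_wait_s * n_tasks


def idle_cpu_seconds(compute_times):
    """Idle time summed over tasks when all must wait for the slowest one.

    Assumed model: one synchronizing collective per step and no other overheads.
    """
    slowest = max(compute_times)
    return sum(slowest - t for t in compute_times)


if __name__ == "__main__":
    waste = wasted_cpu_seconds(30, 960)
    print(waste, "CPU*seconds =", waste / 3600, "CPU*hours")
    # Hypothetical step: three tasks need 1 s, one task needs 2 s
    print(idle_cpu_seconds([1.0, 1.0, 1.0, 2.0]))
```

In the second model the idle time scales with the number of fast tasks, which is why the same imbalance costs more as concurrency grows.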

  22. Load Balance: FFT Q: When is imbalance good? A: When it leads to a faster algorithm.

  23. Dynamical Load Balance: Motivation [plot: time spent in Flops, Exchange, and Sync; axes: MPI Rank →, Time →]

  24. Load Balance: Summary • Imbalance is most often a byproduct of data decomposition • Must be addressed before further MPI tuning can happen • Good software exists for graph partitioning / remeshing • Dynamical load balance may be required for adaptive codes • For regular grids consider padding or contracting

  25. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling

  26. Scaling of MPI_Barrier() four orders of magnitude

  27. Synchronization: definition MPI_Barrier(MPI_COMM_WORLD); T1 = MPI_Wtime(); e.g. MPI_Allreduce(); T2 = MPI_Wtime() - T1; How synchronizing is MPI_Allreduce? • For a code running on N tasks what is the distribution of the T2’s? • The average and width of this distribution tell us how synchronizing e.g. MPI_Allreduce is • Completion semantics of MPI functions • Local: leave based on local logic (MPI_Comm_rank) • Partially synchronizing: leave after messaging M<N tasks (MPI_Bcast, MPI_Reduce) • Fully synchronizing: leave after everyone else enters (MPI_Barrier, MPI_Allreduce)
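Once the T2 values have been collected from every task, the "average and width" of their distribution can be computed as below. This is a minimal post-processing sketch. It assumes the per-task timings are already gathered into a list, uses the population standard deviation as the width (one possible choice), and the sample values are invented for illustration.

```python
import statistics


def summarize_t2(samples):
    """Average, width, and range of per-task T2 timings in seconds."""
    return {
        "mean": statistics.fmean(samples),
        "width": statistics.pstdev(samples),  # population standard deviation
        "min": min(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Hypothetical T2 values from 8 tasks, not measurements from the talk
    t2 = [41e-6, 40e-6, 43e-6, 39e-6, 95e-6, 42e-6, 40e-6, 41e-6]
    print(summarize_t2(t2))
```

A narrow distribution suggests every task leaves the call at about the same time, whereas a long tail such as the single 95-microsecond value in the invented data points to noise or contention affecting individual tasks.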

  28. seaborg.nersc.gov • It’s very hard to discuss synchronization outside the context of a particular parallel computer • So we will examine parallel application scaling on an IBM SP, which is largely applicable to other clusters

  29. seaborg.nersc.gov basics: IBM SP [diagram: 380 x 16-way SMP NHII nodes, each with main memory and GPFS; Colony switch with CSS0 and CSS1; HPSS] • 6080 dedicated CPUs, 96 shared login CPUs • Hierarchy of caching, speeds not balanced • Bottleneck determined by first depleted resource

  30. MPI on the IBM SP [diagram: 16-way SMP NHII node with main memory and GPFS; Colony switch with CSS0 and CSS1; HPSS] • 2-4096 way concurrency • MPI-1 and ~MPI-2 • GPFS aware MPI-IO • Thread safety • Ranks on same node bypass the switch

  31. Seaborg: point to point messaging [diagram: two 16-way SMP NHII nodes, each with main memory and GPFS, joined by the interconnect; intranode and internode paths] Switch bandwidth is often stated in optimistic terms

  32. MPI: seaborg.nersc.gov [diagram: 16-way SMP NHII node with main memory and GPFS; adapters css0 and css1] • Lower latency → can satisfy more syncs/sec • What is the benefit of two adapters? • Can a single csss

  33. Inter-Node Bandwidth • Tune message size to optimize throughput • Aggregate messages when possible [plot labels: csss, css0]
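Why aggregation helps can be seen from the common latency-plus-bandwidth cost model, time = latency + bytes / bandwidth. The sketch below is a rough illustration under that assumed model. The latency and bandwidth figures are placeholders rather than measured seaborg values, and real interconnects also change protocol with message size.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed cost model: fixed per-message latency plus size over bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def total_time(n_messages, bytes_each, latency_s, bandwidth_bytes_per_s,
               aggregate=False):
    """Time to move n_messages payloads, sent separately or as one message."""
    if aggregate:
        return transfer_time(n_messages * bytes_each, latency_s,
                             bandwidth_bytes_per_s)
    return n_messages * transfer_time(bytes_each, latency_s,
                                      bandwidth_bytes_per_s)


if __name__ == "__main__":
    latency, bandwidth = 20e-6, 1e9  # placeholder values: 20 us, 1 GB/s
    print(total_time(100, 32, latency, bandwidth))
    print(total_time(100, 32, latency, bandwidth, aggregate=True))
```

For small messages the latency term dominates, so sending 100 separate 32-byte messages pays the latency 100 times while the aggregated send pays it once.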

  34. MPI performance is often hierarchical: message size and task placement are key [plot labels: Intra, Inter]

  35. MPI: Latency not always 1 or 2 numbers The set of all possible latencies describes the interconnect from the application perspective

  36. Synchronization: measurement MPI_Barrier(MPI_COMM_WORLD); T1 = MPI_Wtime(); e.g. MPI_Allreduce(); T2 = MPI_Wtime()-T1; How synchronizing is MPI_Allreduce? For a code running on N tasks what is the distribution of the T2’s? Let’s measure this…

  37. Synchronization: MPI Collectives Beyond load balance there is a distribution of MPI timings intrinsic to the MPI call [plot: 2048 tasks]

  38. Synchronization: Architecture …and from the machine itself t is the frequency kernel process scheduling Unix : cron et al.

  39. Intrinsic Synchronization : Alltoall

  40. Intrinsic Synchronization: Alltoall Architecture makes a big difference!

  41. This leads to variability in Execution Time

  42. Synchronization: Summary • As a programmer you can control • Which MPI calls you use (it’s not required to use them all). • Message sizes, Problem size (maybe) • The temporal granularity of synchronization • Language Writers and System Architects control • How hard it is to do the last two above • The intrinsic amount of noise in the machine

  43. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling

  44. Simple Stuff Parallel programs are easier to mess up than serial ones. Here are some common pitfalls.

  45. What’s wrong here?

  46. MPI_Barrier • Is MPI_Barrier time bad? Probably. Is it avoidable? • ~three cases: • The stray / unknown / debug barrier • The barrier which is masking compute balance • Barriers used for I/O ordering Often very easy to fix

  47. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling

  48. Parallel File I/O : Strategies MPI Disk Some strategies fall down at scale

  49. Parallel File I/O: Metadata • A parallel file system is great, but it is also another place to create contention. • Avoid unneeded disk I/O; know your file system • Often avoid file-per-task I/O strategies when running at scale

  50. Topics • Introduction • Load Balance • Synchronization • Simple stuff • File I/O • Performance profiling
