Learn how to optimize MPI codes on the IBM SP Seaborg for faster scientific computations and better load balancing. Understand MPI performance tuning and upcoming tools.
Writing, Running and Tuning MPI Codes on the IBM SP
David Skinner, NERSC, Berkeley Lab
Outline • Seaborg Basics • Parallel Environment • Latency, Bandwidth and the Colony switch • Load Balance • Synchronization • Gotchas • NERSC Parallel Profiling tools • Upcoming Parallel Environment and MPI Library
Seaborg: HPC Resource. Seaborg is a tool for scientific computation. Workstations can deliver: • Moore’s law speedup for serial computations • Dedicated access to compute resources. Large scale parallelism can deliver: • Reliable compute resources • Scientific results in shorter time • Concurrency offsets serial performance • Scientific results not attainable on workstations • Fast switch, TBs of RAM, parallel data movement. Use the right tool for the job.
seaborg.nersc.gov: IBM SP [Diagram: 380 x 16-way SMP NHII nodes, each with main memory and GPFS access, attached to the Colony switch through adapters CSS0 and CSS1; HPSS for archival storage] • 6080 dedicated CPUs, 96 shared login CPUs • Hierarchy of caching, prefetching to hide latency • Bottleneck determined by first depleted resource
Running on the SP: Fundamentals

Fortran (hello.f):

      PROGRAM hello
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: rank, size, ierr
      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
      PRINT *, rank, " of ", size
      CALL MPI_FINALIZE(ierr)
      END PROGRAM hello

C (hello.c):

    #include <stdio.h>
    #include <stdlib.h>   /* needed for 64-bit builds */
    #include <mpi.h>

    int main(int argc, char *argv[]) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("%d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }

Compile and run:

    mpcc hello.c
    mpxlf hello.f
    ./a.out -nodes 2 -tasks_per_node 16

• mp compiled programs run on dedicated batch nodes
• Parallel operating environment (POE) makes this mostly transparent
• See http://www.nersc.gov/nusers/resources/SP/running_jobs/
Parallel Start Up • Prior to poe, your command or batch script is serial • After poe, execution is parallel • POE and MPI are distinct but related • Example: poe /bin/hostname
MPI on the IBM SP • 2-4096 way concurrency • MPI-1 and ~MPI-2 • GPFS aware MPI-IO • Thread safety • Ranks on same node bypass the switch [Diagram: 16-way SMP NHII node with main memory and GPFS, Colony switch adapters CSS0 and CSS1, HPSS]
MPI: seaborg.nersc.gov • What is the benefit of two adapters? • This is for a single pair of tasks [Figure: 16-way SMP NHII node with main memory and GPFS; curves labeled css0, css1 and csss]
Inter-Node Bandwidth • Tune message size to optimize throughput • Aggregate messages when possible [Figure: curves labeled csss and css0]
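The advice on message size and aggregation can be illustrated with a simple latency plus size/bandwidth cost model. This is only a sketch: the model itself, the function names, and any latency or bandwidth numbers fed into it are assumptions for illustration, not measurements of the Colony switch.

```c
/* Time (seconds) to deliver one message under an assumed
 * latency + size/bandwidth model. */
double message_time(double latency_s, double bandwidth_Bps, double bytes)
{
    return latency_s + bytes / bandwidth_Bps;
}

/* Bandwidth (bytes/s) actually achieved by a message of the given size. */
double effective_bandwidth(double latency_s, double bandwidth_Bps, double bytes)
{
    return bytes / message_time(latency_s, bandwidth_Bps, bytes);
}

/* Time saved by sending nmsg messages of bytes_each as one combined
 * message; under this model it equals (nmsg - 1) * latency_s. */
double aggregation_saving(double latency_s, double bandwidth_Bps,
                          double bytes_each, int nmsg)
{
    double separate = nmsg * message_time(latency_s, bandwidth_Bps, bytes_each);
    double combined = message_time(latency_s, bandwidth_Bps, nmsg * bytes_each);
    return separate - combined;
}
```

With a hypothetical 20 microsecond latency and a 100 MB/s link, a 2000-byte message spends as long in latency as in transfer, so it reaches only half of the link bandwidth. Larger or aggregated messages amortize the latency term.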
MPI: General Considerations • For data-decomposed applications with some locality, partition the problem along SMP boundaries (minimize the surface-to-volume ratio) • Use MP_SHAREDMEMORY to minimize switch traffic • csss is most often the best route to the switch
MPI: Memory, 64 bit MPI • 32 bit MPI has inconvenient memory limits • 256MB per task default and 2GB maximum (-bmaxdata) • 1.7GB can be used in practice, but depends on MPI usage • The scaling of this internal usage is complicated, but larger concurrency jobs have more of their memory “stolen” by MPI’s internal buffers and pipes • 64 bit MPI removes these barriers • 64 bit MPI is fully supported • Just remember to use “_r” compilers and “-q64” • Seaborg nodes have 16, 32 or 64 GB available • Virtual memory is ~not an option
Getting the Most out of The SP switch What we do: • Use MP_SHAREDMEMORY=yes (default) • Use MP_EUIDEVICE=csss (default) What you can do: • Load balance • Tune message sizes • Reduce synchronizing MPI calls
Load Balance • If one task lags the others in time to complete, synchronization suffers; e.g. a 3% slowdown in one task can mean a 50% slowdown for the code overall • Seek out and eliminate sources of variation • Decompose problem uniformly among nodes/cpus
Load Balance: contd. [Figure: task timelines for the unbalanced and balanced cases, with a key; the difference between them is the time saved by load balance]
Load Balance: Real World Application [Figure: axes are MPI rank and time]
Load Balance: Summary • Imbalance most often a byproduct of data decomposition • Must be addressed before further MPI tuning can happen • How to quickly identify and quantify imbalance? • NERSC consultant can help with visual analysis • poe+ provides a simple quantitative means • Good software exists to help with graph partitioning / remeshing • For regular grids consider padding or contracting
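One common way to quantify imbalance is the ratio of the slowest task's time to the average task time. The sketch below assumes you already have one elapsed time per task. The function names and the choice of max/average as the metric are illustrative assumptions and are not taken from poe+.

```c
#include <stddef.h>

/* Load imbalance as max/average of per-task times.
 * 1.0 means perfectly balanced; returns 0.0 for empty or all-zero input. */
double load_imbalance(const double *t, size_t ntasks)
{
    double sum = 0.0, max = 0.0;
    size_t i;
    for (i = 0; i < ntasks; i++) {
        sum += t[i];
        if (t[i] > max) max = t[i];
    }
    if (ntasks == 0 || sum <= 0.0) return 0.0;
    return max * (double)ntasks / sum;
}

/* Fraction of the total task-time spent waiting for the slowest task,
 * assuming every task idles until the slowest one finishes. */
double idle_fraction(const double *t, size_t ntasks)
{
    double imb = load_imbalance(t, ntasks);
    return imb > 0.0 ? 1.0 - 1.0 / imb : 0.0;
}
```

For four tasks taking 10, 10, 10 and 14 seconds, the imbalance is 14 * 4 / 44, roughly 1.27, and about 21% of the allocated task-time is spent waiting.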
Code Topics Continued • Once load is balanced move on to • Message Sizes • Synchronization • MPI Implementation gotchas
MPI: Synchronization • On the SP each SMP image is scheduled independently, and while user code is waiting the OS will schedule other tasks • A fully synchronizing MPI call requires everyone’s attention • By analogy, imagine trying to go to lunch with 1024 people • Probability that everyone is ready at any given time scales poorly • Isend/Irecv < Send/Recv < Bcast < Reduce < Barrier, Allreduce
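The lunch analogy can be made concrete with a toy model. If each task is ready at a given instant with probability p, and tasks are independent (a simplifying assumption that real schedulers do not guarantee), the chance that all N are ready at once is p^N. The function name is hypothetical.

```c
/* Probability that all ntasks are simultaneously ready when each one
 * is independently ready with probability p (toy model). */
double all_ready_probability(double p, int ntasks)
{
    double prob = 1.0;
    int i;
    for (i = 0; i < ntasks; i++)
        prob *= p;
    return prob;
}
```

Even with p = 0.999, the probability is still above 0.98 for 16 tasks but falls to roughly 0.36 for 1024 tasks, which is the poor scaling the slide refers to.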
Synchronization: User Driven Improvements • Performance variation reported by users running with > 1024 tasks • USG/NSG/IBM identified and resolved slight asymmetries in how the CWS polls batch nodes about their health. • Direct benefit for highly parallel applications • Process driven by feedback from users about performance variability.
Synchronization: Hidden Multithreading • ESSL and IBM Fortran (RANDOM_NUMBER) have autotasking like “features” which function via creation of unspecified numbers of threads. • Synchronization problems are unpredictable using these features. Performance and variability impacted by thread congestion. • Generally, HPC users want control over concurrency. We now enforce this through our default runtime settings.
Synchronization (continued) • MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above • Use MPI_Bcast if possible which is not fully synchronizing • Remove un-needed MPI_Barrier calls • Many algorithms may be tuned or modified to deal with synchronization • Use Immediate Sends and Asynchronous I/O when possible
Synchronization : threads and OpenMP • Using a mixed model, even when no underlying fine grained parallelism is present, can take strain off of the MPI implementation at high concurrency. Consider SMP libraries when possible (esslsmp, FFTW, etc.), e.g. on seaborg a 2048 way job can run with only 128 MPI tasks and 16 OpenMP threads per task • No good if threads are out of sync • Having hybrid code whose concurrencies can be tuned between MPI and OpenMP tasks has portability advantages
Gotchas: MP_PIPE_SIZE memory = 2*PIPE_SIZE*(ntasks-1)
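The formula above can be evaluated directly to see how pipe memory grows with task count. In this sketch the function name is hypothetical, reading the result as a per-task figure in bytes is an assumption, and any pipe size passed in (such as 64 KB) is an illustrative value rather than a documented default.

```c
/* memory = 2 * PIPE_SIZE * (ntasks - 1), as given on the slide.
 * pipe_size_bytes is whatever MP_PIPE_SIZE resolves to in bytes. */
unsigned long long pipe_memory_bytes(unsigned long long pipe_size_bytes,
                                     unsigned int ntasks)
{
    if (ntasks < 2)
        return 0ULL;   /* a single task has no partners */
    return 2ULL * pipe_size_bytes * (unsigned long long)(ntasks - 1);
}
```

With an assumed 64 KB pipe, 2048 tasks give 268,304,384 bytes (about 256 MB), while 128 tasks give 16,646,144 bytes (about 16 MB). This is one way to see why running 128 MPI tasks with 16 threads each can relieve memory pressure at high concurrency.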
Gotchas: How to measure MPI memory usage? 2048 tasks
Gotchas: MP_LABELIO, phost • PE’s hostlist environment variable breaks for large jobs • Run NERSC tool /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks • MPI and LAPI versions available • Hostlists are useful for large runs (I/O perf, failure) • Labeled I/O will let you know which task generated the message “segmentation fault”, gave the wrong answer, etc. export MP_LABELIO=yes
Gotchas: Core files • Core dumps don’t scale • MP_COREDIR=none: no corefile I/O • MP_COREFILE_FORMAT=light_core: less I/O • LL script to save just one full-fledged core file and throw away the others:

    …
    if [ "$MP_CHILD" -ne 0 ]; then
      export MP_COREDIR=/dev/null
    fi
    …
Debugging • In general, debugging at 512 tasks and above is error prone and cumbersome. • Debug at a smaller scale when possible. • Use shared memory device MPICH on a workstation with lots of memory as a mock-up of a high concurrency environment. • For crashed jobs, examine LL logs for memory usage history (ask a NERSC consultant for help with this).
poe+ : Motivation • A tool for quickly getting at all the issues just discussed • Provides an easy to use low overhead (to user and code) interface to performance metrics. • Uses hpmcount to gather and aggregate HPM data. • Can generate an MPI Profile • Load balance information • Clear, concise performance reports to user and to NERSC center • There are other options PMAPI / PAPI / HPMLIB • How do you feel about ERCAP question 16? • poe+ emits the numbers that are requested ERCAP GFLOP/S : 502.865839 GFLOP/S ERCAP MB/TASK : 96.01953125 MB
poe+ : Usage usage: poe+ [-hpm_group n] [-mpi] executable • “-hpm_group” selects HPM group • Default group 1 for flops and TLB • Group 2 for L1 cache load/store misses • “-mpi” maps MPI* calls to PMPI* • MPI calls get wrapped to record data movement and timings • ~1.5 microsecond overhead to each MPI call • When MPI_Finalize is reached: • Application level summary • Task level summary • Load balance histogram • Don’t use for shell scripts
poe+ : rusage Output

Execution time (wall clock time): 133.128812 seconds on 64 tasks

## Resource Usage Statistics        Average        Total          MIN          MAX ##
Wall Clock Time (in sec.)    :   132.465758  8477.808501   130.002884   133.128812 s
Time in user mode (in sec.)  :   116.304219  7443.470000   107.020000   117.990000 s
Time in system mode (in sec.):     2.216562   141.860000     1.000000     4.990000 s
Maximum resident set size    :        98324      6292764        97952        98996 KB
Shared mem use in text seg.  :        37889      2424926        35043        38309 KB*s
Unshared mem use in data seg.:     11265782    721010109     10498632     11365248 KB*s
Page faults w/out IO activity:        26440      1692189        26320        27002
Page faults with IO activity :           14          942            8           37
Times process was swapped out:            0            0            0            0
Times file system perf. INPUT:            0            0            0            0
Times file system perf.OUTPUT:            0            0            0            0
IPC messages sent            :            0            0            0            0
IPC messages received        :            0            0            0            0
signals delivered            :          315        20196          314          317
voluntary context switches   :         2530       161961          594         7705
involuntary context switches :         1322        84613          429         8737
poe+ : HPM Output

Utilization rate                                 : 86.870 %
% TLB misses per cycle                           : 0.039 %
number of loads per TLB miss                     : 849.357
Total load and store operations                  : 1445855.716 M
Instructions per load/store                      : 1.967
MIPS                                             : 21357.726
Instructions per cycle                           : 1.024
HW Float point instructions per Cycle            : 0.331
Total Floating point instructions + FMAs (flips) : 1446556.207 M
Flip rate (flips / WCT)                          : 10865.839 Mflip/sec
Flips / avg user time                            : 12508.231 Mflip/sec
FMA percentage                                   : 72.870 %
Computation intensity                            : 1.000
ERCAP GFLOP/S                                    : 10.865839 GFLOP/S
ERCAP MB/TASK                                    : 96.01953125 MB
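The derived ERCAP figures appear to follow from simple ratios of the reported quantities. The sketch below reproduces them under two assumptions: that the HPM report belongs to the same run as the rusage report (133.128812 s wall clock, 98324 KB average maximum resident set size), and that poe+ computes them this way. The function names are hypothetical.

```c
/* Flip rate in Mflip/sec: total flips (in millions) / wall clock time. */
double flip_rate_mflips(double total_mflips, double wall_s)
{
    return wall_s > 0.0 ? total_mflips / wall_s : 0.0;
}

/* The same rate expressed in GFLOP/S. */
double ercap_gflops(double total_mflips, double wall_s)
{
    return flip_rate_mflips(total_mflips, wall_s) / 1000.0;
}

/* Memory per task in MB from an average resident set size in KB. */
double ercap_mb_per_task(double avg_max_rss_kb)
{
    return avg_max_rss_kb / 1024.0;
}
```

Feeding in 1446556.207 M flips and 133.128812 s gives a rate that agrees with the reported 10865.839 Mflip/sec to within rounding, and 98324 KB / 1024 matches the reported 96.01953125 MB.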
poe+ : MPI Output

0 : ---------------Times-----------------------------------------------
0 : 63  Wall 119.701  Usr 107.230  Sys 1.710  MPI 30.732
0 : -------------------------------------------------------------------
0 : MPI Routine        #calls    avg. bytes    time(sec)
0 : ---------------Aggregate-------------------------------------------
0 : MPI_Comm_size           1           0.0        0.000
0 : MPI_Comm_rank        1345           0.0        0.004
0 : MPI_Bcast              15         559.7        0.003
0 : MPI_Barrier             1           0.0        0.000
0 : MPI_Allgather           8           4.0        0.095
0 : MPI_Reduce            104        3988.0        0.430
0 : MPI_Alltoall          115      131584.0       30.200
0 : ---------------Distribution----------------------------------------
0 : MPI_Bcast               3           4.0        0.000
0 : MPI_Bcast               8          24.0        0.001
0 : MPI_Bcast               4        2048.0        0.002
0 : MPI_Allgather           8           4.0        0.095
0 : MPI_Reduce             58           4.0        0.342
0 : MPI_Reduce              7          28.6        0.051
0 : MPI_Reduce              2          96.0        0.000
0 : MPI_Reduce              4        2048.0        0.002
0 : MPI_Reduce             33       12301.1        0.035
0 : MPI_Alltoall          115      131584.0       30.200
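A profile like this is usually read by asking which routine dominates the MPI time. The sketch below encodes the Aggregate rows from the output above and picks out the largest contributor. The struct layout and function names are hypothetical and are not part of poe+.

```c
#include <stddef.h>

/* One row of the Aggregate section of the MPI profile. */
struct mpi_entry {
    const char *name;
    long calls;
    double avg_bytes;
    double seconds;
};

/* Aggregate rows copied from the poe+ output shown above. */
static const struct mpi_entry profile[] = {
    { "MPI_Comm_size",    1,      0.0,  0.000 },
    { "MPI_Comm_rank", 1345,      0.0,  0.004 },
    { "MPI_Bcast",       15,    559.7,  0.003 },
    { "MPI_Barrier",      1,      0.0,  0.000 },
    { "MPI_Allgather",    8,      4.0,  0.095 },
    { "MPI_Reduce",     104,   3988.0,  0.430 },
    { "MPI_Alltoall",   115, 131584.0, 30.200 },
};

/* Sum of time spent in all listed routines. */
double total_seconds(const struct mpi_entry *e, size_t n)
{
    double sum = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += e[i].seconds;
    return sum;
}

/* Index of the routine with the largest time (0 if n is 0). */
size_t dominant_routine(const struct mpi_entry *e, size_t n)
{
    size_t i, best = 0;
    for (i = 1; i < n; i++)
        if (e[i].seconds > e[best].seconds)
            best = i;
    return best;
}
```

For this task the rows sum to the reported 30.732 s of MPI time, and MPI_Alltoall accounts for about 98% of it, so the all-to-all exchange is the place to look first.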
poe+ : Summary. Easy to use, low overhead. Performance Profiling Benefits Everyone • User Applications: • Scalar performance (HPM) • Parallel efficiency (MPI) • Disk I/O performance (TBA) • Center Policies: • Runtime settings • Queues • SW Versioning • Compute Resources: • System settings • Parallel efficiency (MPI) • Future Machines! • Understanding workload. Getting more science done! • Feature requests, feedback, suggestions welcome.
Parallel Environment 4.1 • New MPI library is available. • Not based on the MPCI PIPES layer but rather on LAPI. Solves PIPES memory issues. • Latency is currently higher than PE 3.2; IBM is working on this • Several improvements to MPI collectives • Though LAPI uses threads, your code need not • A pass through library for non “_r” is provided
PE 4.1 : usage • Simply do “module load USG pe” • No need to recompile • This is still beta software (but close to release) • We turn off threading by default for performance reasons. To get it back, e.g. to use certain MPI-2 features, unset MP_SINGLE_THREAD • Best way to estimate impact on your code is to try it
Parallel I/O • Can be a significant source of variation in task completion prior to synchronization • Limit the number of readers or writers when appropriate. Pay attention to file creation rates. • Output reduced quantities when possible
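One simple way to limit the number of writers is to let only every k-th rank touch the file system and have the other ranks forward their data to it. The sketch below shows only the rank arithmetic. The helper names are hypothetical, and choosing a stride of 16 to get one writer per node assumes ranks are placed on 16-way nodes in consecutive blocks.

```c
/* 1 if this rank performs I/O when every stride-th rank is a writer. */
int is_writer(int rank, int stride)
{
    return stride > 0 && rank >= 0 && rank % stride == 0;
}

/* Rank of the writer responsible for this rank's data (-1 on bad input). */
int writer_for(int rank, int stride)
{
    if (stride <= 0 || rank < 0)
        return -1;
    return rank - rank % stride;
}

/* Number of writers among ntasks ranks (0 on bad input). */
int writer_count(int ntasks, int stride)
{
    if (stride <= 0 || ntasks <= 0)
        return 0;
    return (ntasks + stride - 1) / stride;
}
```

Under these assumptions a 2048-task job with a stride of 16 has 128 writers instead of 2048, which also reduces the file creation rate if each writer creates its own file.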
Blocked data: Often memory address space is segmented by a logical blocking of how data is distributed on disk. Block cyclic, multidimensional, etc. [Figure: layout in memory versus layout on disk, with the block size marked]
I/O methods: block data

MPI-IO version:

    t0 = MPI_Wtime();
    /* n/bn blocks of bn doubles, separated by a stride of size*bn doubles */
    MPI_Type_vector(n/bn, bn, size*bn, MPI_DOUBLE, &vectype);
    MPI_Type_commit(&vectype);
    MPI_Type_size(vectype, &bvect);
    bvect /= sizeof(int);   /* bytes -> number of MPI_INT items to write */
    MPI_File_open(MPI_COMM_WORLD, fname, MPI_MODE_CREATE | MPI_MODE_RDWR,
                  MPI_INFO_NULL, &fh);
    /* each rank's view starts at its own first block */
    MPI_File_set_view(fh, rank*bn*sizeof(double), MPI_BYTE, vectype,
                      "native", MPI_INFO_NULL);
    /* MPI_File_preallocate(fh,nbyte*size); */
    MPI_File_write_all(fh, data, bvect, MPI_INT, &s);
    MPI_File_sync(fh);
    MPI_File_close(&fh);
    MPI_Type_free(&vectype);
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();

POSIX stdio version:

    t0 = MPI_Wtime();
    fp = fopen(fname, "w");
    MPI_Barrier(MPI_COMM_WORLD);   /* all ranks have opened the file before any writes */
    for (i = 0; i < n/bn; i++) {
      /* block i of this rank lands at global block index i*size + rank */
      fseek(fp, ((i*size + rank)*bn*sizeof(DATA_T)), SEEK_SET);
      fwrite(data + i*bn, bn*sizeof(DATA_T), 1, fp);
    }
    fclose(fp);
    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
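Both versions place local block i of a given rank at global block index i*size + rank in the shared file. The helper below isolates that offset arithmetic so it can be checked separately from the I/O calls. The function name and the explicit elem_bytes parameter are illustrative assumptions; in the slide code the element size is sizeof(double) or sizeof(DATA_T).

```c
#include <stddef.h>

/* Byte offset in the shared file of local block i written by `rank`,
 * for a block-cyclic layout over `size` ranks with `bn` elements per
 * block and `elem_bytes` bytes per element. */
size_t block_offset(size_t i, size_t rank, size_t size,
                    size_t bn, size_t elem_bytes)
{
    return (i * size + rank) * bn * elem_bytes;
}
```

With 4 ranks and blocks of 8 doubles (64 bytes each), rank 1 writes its first block at byte 64, and rank 0 writes its second block at byte 256, right after the first block of rank 3.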