Scaling Up User Codes on the SP
David Skinner, NERSC Division, Berkeley Lab
Motivation • NERSC’s focus is on capability computation • Capability == jobs that use ¼ or more of the machine’s resources • Scientists whose work involves large-scale computation or HPC should keep ahead of workstation-sized problems • “Big Science” problems are more interesting!
Challenges • CPUs are outpacing memory bandwidth and switches, leaving FLOPs increasingly isolated. • Vendors often have machines < ½ the size of NERSC machines: system software may be operating in uncharted regimes • MPI implementation • Filesystem metadata systems • Batch queue system • Users need information on how to mitigate the impact of these issues for large-concurrency applications.
Switch Adapter Performance (chart comparing the csss and css0 adapters)
Switch considerations • For data-decomposed applications with some locality, partition the problem along SMP boundaries (minimize the surface-to-volume ratio); see the sketch below • Use MP_SHAREDMEMORY to minimize switch traffic • csss is most often the best route to the switch
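One way to line a decomposition up with SMP boundaries is to give the tasks on each node a communicator of their own, so intra-node exchanges stay in shared memory and only the inter-node "surface" crosses the switch. The sketch below is an illustration, not part of the original talk: the function name and the hostname hash are assumptions, and a production code would exchange hostnames rather than trust a hash.

    /* Sketch: group tasks that share an SMP node into one communicator.
     * The hostname hash is illustrative only; collisions are possible. */
    #include <mpi.h>

    MPI_Comm node_communicator(MPI_Comm comm)
    {
        char host[MPI_MAX_PROCESSOR_NAME];
        int len, rank, i;
        unsigned color = 0;
        MPI_Comm nodecomm;

        MPI_Comm_rank(comm, &rank);
        MPI_Get_processor_name(host, &len);

        /* Tasks on the same node compute the same color. */
        for (i = 0; i < len; i++)
            color = color * 31u + (unsigned char)host[i];

        MPI_Comm_split(comm, (int)(color % 0x7fffffffu), rank, &nodecomm);
        return nodecomm;
    }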
Synchronization • On the SP each SMP image is scheduled independently, and while user code is waiting the OS will schedule other tasks • A fully synchronizing MPI call requires everyone’s attention • By analogy, imagine trying to go to lunch with 1024 people • The probability that everyone is ready at any given time scales poorly
Synchronization (continued) • MPI_Alltoall and MPI_Allreduce can be particularly bad in the range of 512 tasks and above • Use MPI_Bcast if possible (see the sketch below) • Not fully synchronizing • Remove unneeded MPI_Barrier calls • Use asynchronous I/O when possible
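As an illustration of the broadcast advice, here is a minimal C sketch in which only task 0 produces the data and MPI_Bcast distributes it, with no surrounding MPI_Barrier. The variable names and the 16-element payload are invented for the example.

    /* Sketch: broadcast from task 0 instead of a fully synchronizing
     * exchange; no MPI_Barrier is needed around the collective. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, i;
        double params[16];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Only task 0 produces (or reads) the data ... */
            for (i = 0; i < 16; i++) params[i] = (double)i;
        }

        /* ... and MPI_Bcast shares it; each task can proceed as soon
         * as it has received the data. */
        MPI_Bcast(params, 16, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }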
Load Balance • If one task lags the others in time to complete, synchronization suffers; e.g. a 3% slowdown in one task can mean a 50% slowdown for the code overall • Seek out and eliminate sources of variation • Distribute the problem uniformly among nodes/cpus (see the sketch below)
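One common source of imbalance is an uneven split of the problem itself. The C sketch below spreads the remainder of a block decomposition one item per task, so no task carries more than one extra item; the function and variable names are illustrative. For example, 10 items over 4 tasks gives ranges of 3, 3, 2, and 2 items.

    /* Sketch: give task 'rank' of 'nprocs' its [first, last] range of
     * n items, with the remainder spread one item per task. */
    void block_range(long n, int nprocs, int rank, long *first, long *last)
    {
        long base  = n / nprocs;   /* items every task gets            */
        long extra = n % nprocs;   /* leftover items for the first few */

        *first = rank * base + (rank < extra ? rank : extra);
        *last  = *first + base + (rank < extra ? 1 : 0) - 1;
    }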
Alternatives to MPI • CHARM++ and NAMD • Spatially decomposed molecular dynamics with periodic load balancing; the data decomposition is adaptive • AMPI http://charm.cs.uiuc.edu/ • An automatic approach to load balancing • BlueGene/L-type machines with > 10K CPUs will require re-examining these issues altogether
The SP switch • Use MP_SHAREDMEMORY=yes (default) • Use MP_EUIDEVICE=csss for 32-bit applications (default)
64-bit MPI • 32-bit MPI has inconvenient memory limits • 256 MB per task by default and 2 GB maximum • 1.7 GB can be used in practice, but it depends on MPI usage • The scaling of this internal usage is complicated, but larger-concurrency jobs have more of their memory “stolen” by MPI’s internal buffers and pipes • 64-bit MPI removes these barriers • But it must run on css0 only, with less switch bandwidth • Seaborg has 16, 32, and 64 GB per node available
64-bit MPI Howto
At compile time:
  module load mpi64
  compile with the "-q64" option using mpcc_r, mpxlf_r, or mpxlf90_r
At run time:
  module load mpi64
  use "#@ network.MPI = css0,us,shared" in your job scripts (the multilink adapter "csss" is not currently supported)
  run your POE code as you normally would
MP_LABELIO, phost • Labeled I/O will tell you which task generated the message “segmentation fault”, gave the wrong answer, etc.: export MP_LABELIO=yes • Run /usr/common/usg/bin/phost prior to your parallel program to map machine names to POE tasks • MPI and LAPI versions available • Host lists are useful in general; a sketch of the same mapping from within MPI follows below
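For reference, the task-to-host mapping that phost reports can also be printed from inside an MPI code. This is only a sketch of the idea, not the phost utility itself.

    /* Sketch: report which node each POE task landed on. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(host, &len);
        printf("task %d runs on %s\n", rank, host);
        MPI_Finalize();
        return 0;
    }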
Core files • Core dumps don’t scale (no parallel work) • MP_COREDIR=/dev/null (no corefile I/O) • MP_COREFILE_FORMAT=light_core (less I/O) • LL script fragment to save just one full-fledged core file and throw the others away:
  …
  if [ "$MP_CHILD" != "0" ]; then
      export MP_COREDIR=/dev/null
  fi
  …
Debugging • In general, debugging at 512 tasks and above is error-prone and cumbersome. • Debug at a smaller scale when possible. • Use the shared-memory device of MPICH on a workstation with lots of memory to simulate 1024 CPUs. • For crashed jobs, examine the LL logs for the memory usage history.
Parallel I/O • Can be a significant source of variation in task completion prior to synchronization • Limit the number of readers or writers when appropriate (see the sketch below); pay attention to file creation rates • Output reduced quantities when possible
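One way to limit the number of writers is to funnel output through every Nth task. The sketch below is an illustration only: the group size, file naming, and function name are assumptions rather than a NERSC recipe, and the writer reuses its buffer as a receive buffer.

    /* Sketch: one writer per TASKS_PER_WRITER tasks; the other tasks
     * send their data to their writer instead of opening files. */
    #include <mpi.h>
    #include <stdio.h>

    #define TASKS_PER_WRITER 16    /* illustrative group size */

    void write_reduced(double *buf, int n, MPI_Comm comm)
    {
        int rank, nprocs, writer, src, last;
        char fname[64];
        FILE *fp;
        MPI_Status st;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);
        writer = (rank / TASKS_PER_WRITER) * TASKS_PER_WRITER;

        if (rank == writer) {
            sprintf(fname, "out.%06d", writer);
            fp = fopen(fname, "w");
            if (fp == NULL) MPI_Abort(comm, 1);

            fwrite(buf, sizeof(double), n, fp);       /* writer's own data */
            last = writer + TASKS_PER_WRITER;
            if (last > nprocs) last = nprocs;
            for (src = writer + 1; src < last; src++) {
                MPI_Recv(buf, n, MPI_DOUBLE, src, 0, comm, &st);
                fwrite(buf, sizeof(double), n, fp);   /* neighbors' data */
            }
            fclose(fp);
        } else {
            MPI_Send(buf, n, MPI_DOUBLE, writer, 0, comm);
        }
    }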
OpenMP • Using a mixed model, even when no underlying fine-grained parallelism is present, can take strain off the MPI implementation; e.g. on seaborg a 2048-way job can run with only 128 MPI tasks and 16 OpenMP threads (a sketch follows below) • Having hybrid code whose concurrencies can be tuned between MPI and OpenMP tasks has portability advantages
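A minimal hybrid sketch of the idea: coarse-grained decomposition across MPI tasks, fine-grained loop parallelism across OpenMP threads within each task. The array size, loop body, and the mpcc_r -qsmp=omp compile line are assumptions made for illustration.

    /* Sketch: MPI between tasks, OpenMP threads within a task.
     * Build with something like: mpcc_r -qsmp=omp hybrid.c (assumption). */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    static double x[N];

    int main(int argc, char **argv)
    {
        int rank, nprocs, lo, hi, chunk, i;
        double local = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Coarse-grained decomposition across MPI tasks ... */
        chunk = N / nprocs;
        lo = rank * chunk;
        hi = (rank == nprocs - 1) ? N : lo + chunk;

        /* ... fine-grained parallelism across OpenMP threads. */
        #pragma omp parallel for reduction(+:local)
        for (i = lo; i < hi; i++) {
            x[i] = (double)i;
            local += x[i];
        }

        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %g\n", total);

        MPI_Finalize();
        return 0;
    }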
Summary • Resources are available to meet the challenges posed by scaling up MPI applications on seaborg. • Scientists should expand the scope of their problems to tackle increasingly challenging computational problems. • NERSC consultants can provide help in achieving scaling goals.