Evolution of the NERSC SP System
NERSC User Services

• Original Plans
• Phase 1
• Phase 2
• Programming Models and Code Porting
• Using the System
Original Plans: The NERSC-3 Procurement

• Complete, reliable, high-end scientific system
• High availability and MTBF
• Fully configured: processing, storage, software, networking, support
• Commercially available components
• The greatest amount of computational power for the money
• Can be integrated with the existing computing environment
• Can be evolved with the product line
• Much careful benchmarking and acceptance testing done
Original Plans: The NERSC-3 Procurement

• What we wanted:
  • >1 teraflop of peak performance
  • 10 terabytes of storage
  • 1 terabyte of memory
• What we got in Phase 1:
  • 410 gigaflops of peak performance
  • 10 terabytes of storage
  • 512 gigabytes of memory
• What we will get in Phase 2:
  • 3 teraflops of peak performance
  • 15 terabytes of storage
  • 1 terabyte of memory
Hardware, Phase 1

• 304 Power 3+ nodes (Winterhawk I)
• Node usage:
  • 256 compute/batch nodes = 512 CPUs
  • 8 login nodes = 16 CPUs
  • 16 GPFS nodes = 32 CPUs
  • 8 network nodes = 16 CPUs
  • 16 service nodes = 32 CPUs
• 2 processors/node
• 200 MHz clock
• 4 flops/clock (2 multiply-add ops) = 800 Mflops/CPU, 1.6 Gflops/node
• 64 KB L1 data cache per CPU @ 5 nsec & 3.2 GB/sec
• 4 MB L2 cache per CPU @ 45 nsec & 6.4 GB/sec
• 1 GB RAM per node @ 175 nsec & 1.6 GB/sec
• 150 MB/sec switch bandwidth
• 9 GB local disk (two-way RAID)
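A quick consistency check, derived only from the figures above: 200 MHz × 4 flops/clock = 800 Mflop/s per CPU; 2 CPUs × 800 Mflop/s = 1.6 Gflop/s per node; 256 compute nodes × 1.6 Gflop/s ≈ 410 Gflop/s, matching the Phase 1 peak quoted earlier.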
Hardware, Phase 2

• 152 Power 3+ nodes (Nighthawk II)
• Node usage:
  • 128 compute/batch nodes = 2048 CPUs
  • 2 login nodes = 32 CPUs
  • 16 GPFS nodes = 256 CPUs
  • 2 network nodes = 32 CPUs
  • 4 service nodes = 64 CPUs
• 16 processors/node
• 375 MHz clock
• 4 flops/clock (2 multiply-add ops) = 1.5 Gflops/CPU, 24 Gflops/node
• 64 KB L1 data cache per CPU @ 5 nsec & 3.2 GB/sec
• 8 MB L2 cache per CPU @ 45 nsec & 6.4 GB/sec
• 8 GB RAM per node @ 175 nsec & 14.0 GB/sec
• ~2000 MB/sec switch bandwidth (estimated)
• 9 GB local disk (two-way RAID)
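The same check for Phase 2: 375 MHz × 4 flops/clock = 1.5 Gflop/s per CPU; 16 CPUs × 1.5 Gflop/s = 24 Gflop/s per node; 128 compute nodes × 24 Gflop/s ≈ 3.1 Tflop/s, consistent with the 3 teraflop Phase 2 figure quoted earlier.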
Programming Models, Phase 1

• Phase 1 will rely on MPI, with threading available:
  • OpenMP directives
  • Pthreads
  • IBM SMP directives
• MPI now does intra-node communication efficiently
• Mixed-model programming is not currently very advantageous
• PVM and LAPI messaging systems are also available
• SHMEM is "planned"...
• The SP has cache and virtual memory, which means:
  • There are more ways to reduce code performance (see the loop-ordering sketch below)
  • There are more ways to lose portability
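Because the SP's caches now matter, loop ordering matters. The following is a minimal sketch of the kind of change involved (the array names and sizes are illustrative, not from these slides): Fortran stores arrays column-major, so the inner loop should run over the leftmost index.

      program loop_order
      integer, parameter :: n = 2000
      real*8 A(n,n), B(n,n)
      integer i, j
      B = 1.0d0
! Cache-unfriendly: the inner loop strides across rows, so nearly every
! iteration touches a new cache line.
      do i = 1, n
         do j = 1, n
            A(i,j) = 2.0d0 * B(i,j)
         enddo
      enddo
! Cache-friendly: the inner loop runs down a column (stride 1), so
! consecutive iterations reuse the cache lines already loaded.
      do j = 1, n
         do i = 1, n
            A(i,j) = 2.0d0 * B(i,j)
         enddo
      enddo
      print *, A(n,n)
      end program loop_order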
Programming Models, Phase 2

• Phase 2 will offer more payback for mixed-model programming
• Single-node parallelism is a good target for PVP users
• Vector and shared-memory codes can be "expanded" into MPI
• MPI codes can be ported from the T3E
• Threading can be added within MPI
• In both cases, re-engineering will be required to exploit new and different levels of granularity
• This can be done along with increasing problem sizes
Porting Considerations, part 1

Things to watch out for in porting codes to the SP:
• Cache
  • Not enough on the T3E to make worrying about it worth the trouble
  • Enough on the SP to boost performance, if it is used well
  • Tuning for cache is different from tuning for vectorization
  • False sharing caused by cache can reduce performance
• Virtual memory
  • Gives you access to 1.75 GB of (virtual) RAM address space
  • To use all of virtual (or even real) memory, you must explicitly request "segments"
  • Causes performance degradation due to paging
• Data types
  • Default sizes are different on PVP, T3E, and SP systems
  • "integer", "int", "real", and "float" must be used carefully
  • Best to say what you mean: "real*8", "integer*4" (see the sketch after this list)
  • Do the same in MPI calls: "MPI_REAL8", "MPI_INTEGER4"
  • Be careful with intrinsic function use as well
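A minimal sketch of the data-type advice above (the program and variable names are illustrative only): declare the precision you mean and pass the matching MPI datatype.

      program explicit_types
      include 'mpif.h'
      integer*4 ierr, my_id
      real*8 local_sum, global_sum
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
! real*8 data must be described to MPI as MPI_REAL8; plain MPI_REAL
! follows the compiler's default real size, which differs across the
! PVP, T3E, and SP systems.
      local_sum = dble(my_id)
      call MPI_ALLREDUCE(local_sum, global_sum, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierr)
      if (my_id .eq. 0) print *, 'global sum = ', global_sum
      call MPI_FINALIZE(ierr)
      end program explicit_types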
Porting Considerations, part 2

More things to watch out for in porting codes to the SP:
• Arithmetic
  • Architecture tuning can help exploit special processor instructions
  • Both T3E and SP can optimize beyond IEEE arithmetic
  • T3E and PVP can also do fast reduced-precision arithmetic
  • Compiler options on the T3E and SP can force IEEE compliance
  • Compiler options can also throttle other optimizations for safety
  • Special libraries offer faster intrinsics
• MPI
  • SP compilers and runtime will catch loose usage that was accepted on the T3E
  • Communication bandwidth on the SP Phase 1 is lower than on the T3E
  • Message latency on the SP Phase 1 is higher than on the T3E
  • We expect approximate parity with the T3E in these areas on the Phase 2 system
  • Limited number of communication ports per node: approximately one per CPU
  • "Default" versus "eager" buffer management in MPI_SEND (see the sketch below)
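To illustrate the last point: code that implicitly relies on MPI_SEND buffering the message (which the eager protocol does for small messages) can deadlock on the SP once messages are large enough to fall back to the rendezvous protocol. A minimal sketch, with hypothetical buffer names and an assumed even number of tasks, of an exchange that stays correct regardless of the buffering policy:

      program safe_exchange
      include 'mpif.h'
      integer ierr, my_id, nprocs, partner, request
      integer status(MPI_STATUS_SIZE)
      real*8 sendbuf(100000), recvbuf(100000)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
! Pair up ranks (0<->1, 2<->3, ...); assumes an even number of tasks.
      if (mod(my_id, 2) .eq. 0) then
         partner = my_id + 1
      else
         partner = my_id - 1
      endif
      sendbuf = dble(my_id)
! Post the receive first, then send: no deadlock even if MPI_SEND
! blocks until the matching receive is posted (rendezvous protocol).
      call MPI_IRECV(recvbuf, 100000, MPI_REAL8, partner, 0, MPI_COMM_WORLD, request, ierr)
      call MPI_SEND(sendbuf, 100000, MPI_REAL8, partner, 0, MPI_COMM_WORLD, ierr)
      call MPI_WAIT(request, status, ierr)
      call MPI_FINALIZE(ierr)
      end program safe_exchange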
Porting Considerations, part 3

• Compiling & linking
  • The compiler invocation ("version") depends on language and parallelization scheme
  • Language versions:
    • Fortran 77: f77, xlf
    • Fortran 90: xlf90
    • Fortran 95: xlf95
    • C: cc, xlc, c89
    • C++: xlC
    • MPI-included: mpxlf, mpxlf90, mpcc, mpCC
    • Thread-safe: xlf_r, xlf90_r, xlf95_r, mpxlf_r, mpxlf90_r
  • Preprocessing can be ordered by compiler flag or source-file suffix
  • Use consistently, for all related compilations; the following may NOT produce a parallel executable:
      mpxlf90 -c *.F
      xlf90 -o foo *.o
  • Use the -bmaxdata:bytes option to get more than a single 256 MB data segment (up to 7 segments, or ~1.75 GB, can be specified; only 3, or 0.75 GB, are real)
Porting: MPI

• MPI codes should port relatively well
• Use one MPI task per node or per processor:
  • One per node during porting
  • One per processor during production
  • Let MPI worry about where it is communicating to
• Environment variables, execution parameters, and/or batch options can specify:
  • # tasks per node
  • Total # tasks
  • Total # processors
  • Total # nodes
  • The communication subsystem in use
    • User Space is best for batch jobs
    • IP may be best for interactive development runs
• There is a debug queue/class in batch
Porting: Shared Memory

• Don't throw away old shared-memory directives
  • OpenMP will work as is
  • Cray tasking directives will be useful for documentation
• We recommend porting Cray directives to OpenMP (see the sketch below)
• Even small-scale parallelism can be useful
• Larger-scale parallelism will be available next year
• If your problems and/or algorithms will scale to larger granularities and greater parallelism, prepare for message passing
  • We recommend MPI
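As a small illustration of the recommendation above, here is a loop carrying an old Cray autotasking directive (shown from memory; the exact Cray syntax in your code may differ) converted to OpenMP. The array names are hypothetical, and the -qsmp=omp switch mentioned in the comment should be checked against the SP documentation.

      program cray_to_openmp
      integer, parameter :: n = 1000000
      real*8 a(n), b(n)
      integer i
      a = 0.0d0
      b = 1.0d0
! Old Cray autotasking directive (approximate syntax), kept only to
! document the original intent:
!CMIC$ DO ALL PRIVATE(i) SHARED(a, b)
! OpenMP replacement; build with a thread-safe compiler (e.g. xlf90_r)
! and its OpenMP switch (-qsmp=omp on the SP, as far as we know).
!$OMP PARALLEL DO PRIVATE(i) SHARED(a, b)
      do i = 1, n
         a(i) = a(i) + b(i)
      enddo
!$OMP END PARALLEL DO
      print *, a(n)
      end program cray_to_openmp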
From Loop-slicing to MPI, before...

      allocate(A(1:imax, 1:jmax))
!$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, imax, jmax)
      do I = 1, imax
         do J = 1, jmax
            A(I,J) = deep_thought(A, I, J, …)
         enddo
      enddo

• Sanity checking
  • Run the program on one CPU to get baseline answers
  • Run on several CPUs to see parallel speedups and answers
• Optimization
  • Consider changing memory access patterns to improve cache usage
  • How big can your problem get before you run out of real memory?
From Loop-slicing to MPI, after...

      call MPI_COMM_RANK(MPI_COMM_WORLD, my_id, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call my_indices(my_id, nprocs, my_imin, my_imax, my_jmin, my_jmax)
      allocate(A(my_imin : my_imax, my_jmin : my_jmax))

!$OMP PARALLEL DO PRIVATE(I, J), SHARED(A, my_imin, my_imax, my_jmin, my_jmax)
      do I = my_imin, my_imax
         do J = my_jmin, my_jmax
            A(I,J) = deep_thought(A, I, J, …)
         enddo
      enddo

! Communicate the shared values with neighbors…
! (left_id, right_id, top_id, bottom_id are the neighbor task ranks, and
! status is an MPI status array; both are assumed to be set up elsewhere.)
      if (odd(my_id)) then
         call MPI_SEND(my_left(...),   leftsize,   MPI_REAL, left_id,   tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_right(...),  rightsize,  MPI_REAL, right_id,  tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_top(...),    topsize,    MPI_REAL, top_id,    tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
      else
         call MPI_RECV(my_right(...),  rightsize,  MPI_REAL, right_id,  tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_left(...),   leftsize,   MPI_REAL, left_id,   tag, MPI_COMM_WORLD, ierr)
         call MPI_RECV(my_bottom(...), bottomsize, MPI_REAL, bottom_id, tag, MPI_COMM_WORLD, status, ierr)
         call MPI_SEND(my_top(...),    topsize,    MPI_REAL, top_id,    tag, MPI_COMM_WORLD, ierr)
      endif
From Loop-slicing to MPI, after...

• You now have one MPI task and many OpenMP threads per node
  • The MPI task does all the communicating between nodes
  • The OpenMP threads do the parallelizable work
  • Do NOT use MPI within an OpenMP parallel region
• Sanity checking
  • Run on one node and one CPU to check baseline answers
  • Run on one node and several CPUs to see parallel speedup and answers
  • Run on several nodes, one CPU per node, and check answers
  • Run on several nodes, several CPUs per node, and check answers
• Scaling checking
  • Run a larger version of a similar problem on the same set of ensemble sizes
  • Run the same-sized problem on a larger ensemble
• (Re-)consider your I/O strategy…
From MPI to Loop-slicing

• Add OpenMP directives to existing code
• Perform sanity and scaling checks, as before
• This results in the same overall code structure as on the previous slides:
  • One MPI task and several OpenMP threads per node
• For irregular codes, Pthreads may serve better, at the cost of increased complexity
• Nobody really expects it to be this easy...
Using the Machine, part 1

• Somewhat similar to the Crays
• Interactive and batch jobs are possible
Using the Machine, part 2

• Interactive runs
  • Sequential executions run immediately on your login node
  • Every login will likely put you on a different node, so be careful when looking for your executions: "ps" returns information only about the node you are logged into
  • Small-scale parallel jobs may be rejected if LoadLeveler cannot find the resources
  • There are two pools of nodes that can be used for interactive jobs:
    • Login nodes
    • A small subset of the compute nodes
  • Parallel execution can often be achieved by:
    • Trying again, after an initial rejection
    • Changing communication mechanisms from User Space to IP
    • Using the other pool
Using the Machine, part 3

• Batch jobs
  • Currently very similar in capability to the T3E
    • Similar run times and processor counts
    • More memory available on the SP
    • Limits and capabilities may change as we learn the machine
  • LoadLeveler is similar to, but simpler than, NQE/NQS on the T3E
  • Jobs are submitted, monitored, and cancelled by special commands
  • Each batch job requires a script that is essentially a shell script
    • The first few lines contain batch options that look like comments to the shell
    • The rest of the script can contain any shell constructs
    • Scripts can be debugged by executing them interactively
  • Users are limited to 3 running jobs, 10 queued jobs, and 30 submitted jobs at any given time
Using the Machine, part 4

• File systems
  • Use the environment variables to let the system manage your file usage (see the sketch below)
  • Sequential work can be done in $HOME (not backed up) or $TMPDIR (transient)
    • Medium performance, node-local
  • Parallel work can be done in $SCRATCH (transient) or /scratch/username (purgeable)
    • High performance, located in GPFS
  • HPSS is available from batch jobs via HSI, and interactively via FTP, PFTP, and HSI
  • There are quotas on space and inode usage
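As a small illustration of the first point, a sketch that lets $SCRATCH decide where a large unformatted file lands (the file name and unit number are arbitrary, and GETENV is a common vendor extension we assume is available in the SP's Fortran):

      program scratch_file
      character*256 scratch_dir
      real*8 buffer(1000)
      buffer = 0.0d0
! GETENV (assumed available) retrieves the $SCRATCH path that NERSC
! sets up, so the code need not hard-wire a file system path.
      call getenv('SCRATCH', scratch_dir)
      open(unit=10, file=trim(scratch_dir)//'/example.dat', form='unformatted', status='unknown')
      write(10) buffer
      close(10)
      end program scratch_file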
Using the Machine, part 5

• The future?
  • The allowed scale of parallelism (CPU counts) may change
    • Max now = 512 CPUs, the same as on the T3E
  • The allowed duration of runs may change
    • Max now = 4 hours; max on the T3E = 12 hours
  • The size of possible problems will definitely change
    • More CPUs in Phase 1 than on the T3E
    • More memory per CPU, in both phases, than on the T3E
  • The amount of work possible per unit time will definitely change
    • CPUs in both phases are faster than those on the T3E
    • The Phase 2 interconnect will be faster than Phase 1's
  • Better machine management
    • Checkpointing will be available
    • We will learn what can be adjusted in the batch system
  • There will be more and better tools for monitoring and tuning
    • HPM, KAP, Tau, PAPI...
  • Some current problems will go away (e.g., memory-mapped files)