SDSC Blue Gene: Overview
Mahidhar Tatineni
Introduction to and Optimization for SDSC Systems Workshop
October 16, 2006
BG System Overview: Novel, massively parallel system from IBM
• Full system installed at LLNL from 4Q04 to 3Q05
  • 65,000+ compute nodes in 64 racks
  • Each node has two low-power PowerPC processors plus memory
  • Compact footprint with very high processor density
  • Slow processors & modest memory per processor
  • Very high peak speed of 367 Tflop/s
  • #1 Linpack speed of 280 Tflop/s
  • Two applications (CPMD & ddcMD) run at over 100 Tflop/s
• 1024 compute nodes in a single rack installed at SDSC in 4Q04
  • Maximum I/O configuration with 128 I/O nodes for data-intensive computing
  • Has achieved more than 4 GB/s for writes using GPFS
• Systems at 14 sites outside IBM & 4 within IBM as of 2Q06
• Need to select apps carefully
  • Must scale (at least weakly) to many processors (because they're slow)
  • Must fit in limited memory
BG System Overview: Processor Chip (= System-on-a-chip)
• Two 700-MHz PowerPC 440 processors
  • Each with two floating-point units
  • Each with 32-kB L1 data caches that are not coherent
  • 4 flops/proc-clock peak (= 2.8 Gflop/s per processor)
  • 2 8-B loads or stores per proc-clock peak in L1 (= 11.2 GB/s per processor)
• Shared 2-kB L2 cache (or prefetch buffer)
• Shared 4-MB L3 cache
• Five network controllers (though not all wired to each node)
  • 3-D torus (for point-to-point MPI operations: 175 MB/s nominal x 6 links x 2 ways)
  • Tree (for most collective MPI operations: 350 MB/s nominal x 3 links x 2 ways)
  • Global interrupt (for MPI_Barrier: low latency)
  • Gigabit Ethernet (for I/O)
  • JTAG (for machine control)
• Memory controller for 512 MB of off-chip, shared memory
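As a quick consistency check, the headline rates quoted on these slides follow directly from the clock rate and the per-cycle capabilities above (counting a fused multiply-add as 2 flops, the usual convention; node and rack counts are taken from the system overview slide):

  2 FPUs x 2 flops (fused multiply-add) x 0.7 GHz = 2.8 Gflop/s per processor
  2 x 8 B x 0.7 GHz = 11.2 GB/s per processor from L1
  2 processors x 2.8 Gflop/s = 5.6 Gflop/s per node
  1,024 nodes x 5.6 Gflop/s ≈ 5.7 Tflop/s per rack
  64 racks x 1,024 nodes x 5.6 Gflop/s ≈ 367 Tflop/s for the full LLNL system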
BG System Overview: SDSC's single-rack system
• 1024 compute nodes & 128 I/O nodes (each with 2 processors)
  • Most I/O-rich configuration possible (8:1 compute-to-I/O node ratio)
  • Identical hardware in each node type, but with different networks wired
    • Compute nodes connected to: torus, tree, global interrupt, & JTAG
    • I/O nodes connected to: tree, global interrupt, Gigabit Ethernet, & JTAG
• Two half racks (also confusingly called midplanes)
  • Connected via link chips
• Front-end nodes (2 B80s, each with 4 pwr3 processors, plus 1 pwr5 node)
• Service node (Power 275 with 2 Power4+ processors)
• Two parallel file systems using GPFS
  • Shared /gpfs-wan served by 58 NSD nodes (each with 2 IA-64s)
  • Local /bggpfs served by 12 NSD nodes (each with 2 IA-64s)
BG System Overview: Multiple operating systems & functions
• Compute nodes: run the Compute Node Kernel (CNK = blrts)
  • Each runs only one job at a time
  • Each uses very little memory for the CNK
• I/O nodes: run Embedded Linux
  • Run CIOD to manage compute nodes
  • Perform file I/O
  • Run GPFS
• Front-end nodes: run SuSE Linux
  • Support user logins
  • Run cross compilers & linker
  • Run parts of mpirun to submit jobs & LoadLeveler to manage jobs
• Service node: runs SuSE Linux
  • Uses DB2 to manage four system databases
  • Runs control system software, including MMCS
  • Runs other parts of mpirun & LoadLeveler
• Software comes in drivers: we are currently running Driver V1R3
SDSC Blue Gene: Getting Started
Logging on & moving files
• Logging on
  ssh bglogin.sdsc.edu
  or
  ssh -l username bglogin.sdsc.edu
• Moving files
  scp file username@bglogin.sdsc.edu:~
  or
  scp -r directory username@bglogin.sdsc.edu:~
SDSC Blue Gene: Getting Started
Places to store your files
• /users (home directory)
  • 1.1-TB NFS-mounted file system
  • Recommended for storing source and other important files
  • Do not write data/output to this area: slow and limited in size!
  • Regular backups
• /bggpfs available for parallel I/O via GPFS
  • ~18.5 TB accessed via IA-64 NSD servers
  • No backups
• /gpfs-wan (225 TB) available for parallel I/O; shared with DataStar and the TG IA-64 cluster
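A common pattern is to keep source under the home directory and stage large input and output files on /bggpfs. A minimal sketch (the directory layout under /bggpfs is illustrative, not predefined):

  # create a personal work area on the parallel file system (path is just an example)
  mkdir -p /bggpfs/$USER/run01
  # copy input data there before the run; job output should also be written here, not to /users
  cp ~/inputs/config.dat /bggpfs/$USER/run01/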
Using the compilers: Important programming considerations
• Front-end nodes have different processors & run a different OS than the compute nodes
  • Hence codes must be cross-compiled
• Care must be taken with configure scripts
  • Discovery of system characteristics during compilation (e.g., via configure) may require modifications to the configure script
  • Make sure that if code has to be executed during the configure step, it runs on the compute nodes
  • Alternatively, system characteristics can be specified by the user and the configure script modified to take this into account (see the sketch below)
• Some system calls are not supported by the compute node kernel
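A minimal sketch of cross-compiling an autoconf-based package on the front end; the host triplet and flag settings shown are illustrative and will vary from package to package:

  # point configure at the BG cross compilers instead of the front-end compilers
  ./configure --host=powerpc-bgl-blrts-gnu \
              CC=blrts_xlc FC=blrts_xlf90 \
              CFLAGS="-O3 -qarch=440" FFLAGS="-O3 -qarch=440"
  # if configure tries to run test programs, they will not execute on the front end;
  # pre-answer those checks instead (e.g., by setting the corresponding cache variables by hand)
  make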
Using the compilers: Compiler versions, paths, & wrappers
• Compilers (version numbers the same as on DataStar)
  XL Fortran V10.1: blrts_xlf & blrts_xlf90
  XL C/C++ V8.0: blrts_xlc & blrts_xlC
• Paths to compilers in default .bashrc
  export PATH=/opt/ibmcmp/xlf/bg/10.1/bin:$PATH
  export PATH=/opt/ibmcmp/vac/bg/8.0/bin:$PATH
  export PATH=/opt/ibmcmp/vacpp/bg/8.0/bin:$PATH
• Compilers with MPI wrappers (recommended)
  mpxlf, mpxlf90, mpcc, & mpCC
• Path to MPI-wrapped compilers in default .bashrc
  export PATH=/usr/local/apps/bin:$PATH
Using the compilers: Options & example
• Compiler options
  -qarch=440 uses only a single FPU per processor (minimum option)
  -qarch=440d allows both FPUs per processor (alternate option)
  -qtune=440 (after -qarch) seems superfluous, but avoids warnings
  -O3 gives minimal optimization with no SIMDization
  -O3 -qhot=simd adds SIMDization (seems to be the same as -O5)
  -O4 adds compile-time interprocedural analysis
  -O5 adds link-time interprocedural analysis (but sometimes has problems)
  -qdebug=diagnostic gives SIMDization info
• Current recommendation
  -O3 -qarch=440
• Example using MPI-wrapped compiler
  mpxlf90 -O3 -qarch=440 -o hello hello.f
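If you want to experiment with the second FPU, a hedged example combining the options above (the source file name is just a placeholder):

  # compile with both FPUs enabled and SIMDization, printing SIMDization diagnostics
  mpxlf90 -O3 -qhot=simd -qarch=440d -qtune=440 -qdebug=diagnostic -o solver solver.f
  # fall back to the recommended flags if performance or correctness regresses
  mpxlf90 -O3 -qarch=440 -o solver solver.f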
Using libraries
• ESSL
  • Version 4.2 is available in /usr/local/apps/lib
• MASS/MASSV
  • Version 4.3 is available in /usr/local/apps/lib
• FFTW
  • Versions 2.1.5 and 3.1.2 are available in both single & double precision; the libraries are located in /usr/local/apps/V1R3
• NETCDF
  • Versions 3.6.0p1 and 3.6.1 are available in /usr/local/apps/V1R3
• Example link paths
  -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lesslbg
  -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f
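Putting the pieces together, a hedged example of a full compile-and-link line (the source file name is illustrative; link only the libraries your code actually uses):

  # compile and link a Fortran MPI code against ESSL, MASS/MASSV, and single-precision FFTW 3.1.2
  mpxlf90 -O3 -qarch=440 -o mycode mycode.f \
      -Wl,--allow-multiple-definition \
      -L/usr/local/apps/lib -lmassv -lmass -lesslbg \
      -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f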
Running jobs: Overview
• There are two compute modes
  • Coprocessor (CO) mode: one compute processor per node
  • Virtual node (VN) mode: two compute processors per node
• Jobs run in partitions or blocks
  • These are typically powers of two
  • Blocks must be allocated (or booted) before a run & are restricted to a single user at a time
• Only batch jobs are supported
  • Batch jobs are managed by LoadLeveler
  • Users can monitor jobs using llq -b & llq -x
Running jobs: LoadLeveler for batch jobs
• Here is an example LoadLeveler run script (test.cmd)
  #!/usr/bin/ksh
  #@ environment = COPY_ALL;
  #@ job_type = BlueGene
  #@ class = parallel
  #@ bg_partition = <partition name; for example: top>
  #@ output = file.$(jobid).out
  #@ error = file.$(jobid).err
  #@ notification = complete
  #@ notify_user = <your email address>
  #@ wall_clock_limit = 00:10:00
  #@ queue
  mpirun -mode VN -np <number of procs> -exe <your executable> -cwd <working directory>
• Submit as follows: llsubmit test.cmd
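A typical submit-and-monitor session from the front end might look like the following sketch (job IDs will of course differ; <jobid> is a placeholder):

  llsubmit test.cmd      # submit the job to LoadLeveler
  llq -b                 # list Blue Gene jobs and the blocks they are using
  llq -x                 # show extended job information
  llcancel <jobid>       # remove a job from the queue if needed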
Running jobs: mpirun options
• Key mpirun options are
  -mode     compute mode: CO or VN
  -connect  connectivity: TORUS or MESH
  -np       number of compute processors
  -mapfile  logical mapping of processors
  -cwd      full path of current working directory
  -exe      full path of executable
  -args     arguments of executable (in double quotes)
  -env      environment variables (in double quotes)
  (These are mostly different from those on TeraGrid)
• See the mpirun user's manual for syntax
Running jobs: mpirun options
• -mode may be CO (default) or VN
  • Generally you must specify a partition to run in VN mode
  • For a given number of nodes, VN mode is usually faster than CO mode
  • Memory per processor in VN mode is half that in CO mode
  • Recommendation: use VN mode unless there is not enough memory
• -connect may be TORUS or MESH
  • This option applies only if -partition is not specified (MESH is the default)
  • Performance is generally better with TORUS
  • Recommendation: use a predefined partition, which ensures TORUS
Running jobs: mpirun options
• -np gives the number of processors
  • Must fit in the available partition
• -mapfile gives the logical mapping of processors
  • Can improve MPI performance in some cases
  • Can be used to ensure VN mode for 2*np ≤ partition size
  • Can be used to change the ratio of compute nodes to I/O nodes
  • Recommendation: contact SDSC if you want to use a mapfile, since no documentation is available
• -cwd gives the current working directory
  • Needed if there is an input file
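As an illustration, a complete mpirun line inside a LoadLeveler script might look like the following sketch (paths, arguments, and the environment variable are placeholders):

  # with #@ bg_partition set to a 256-node partition (e.g., top256-1),
  # VN mode provides 2 x 256 = 512 processors
  mpirun -mode VN -np 512 \
         -exe /bggpfs/$USER/run01/mycode \
         -cwd /bggpfs/$USER/run01 \
         -args "input.dat 1000" \
         -env "MYVAR=1"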
Running jobs: Partition layout and usage guidelines
• To make effective use of the Blue Gene, production runs should generally use one-fourth or more of the machine, i.e., 256 or more compute nodes. The following predefined partitions are provided for production runs:
  • rack: all 1,024 nodes
  • top & bot: 512 nodes each (top and bottom midplanes)
  • top256-1 & top256-2: 256 nodes in each half of the top midplane
  • bot256-1 & bot256-2: 256 nodes in each half of the bottom midplane
• Smaller 64-node (bot64-1, ..., bot64-8) and 128-node (bot128-1, ..., bot128-4) partitions are available for test runs.
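For example, a small test run might pair one of the 64-node partitions with CO mode in the script from the LoadLeveler slide (the executable path is a placeholder; the rest of test.cmd is unchanged):

  # in test.cmd, select a 64-node test partition ...
  #@ bg_partition = bot64-1
  # ... and request all 64 nodes in CO mode (VN mode would allow up to 128 processors)
  mpirun -mode CO -np 64 -exe /bggpfs/$USER/test/mycode -cwd /bggpfs/$USER/test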
BG System Overview: References
• Blue Gene Web site at SDSC
  http://www.sdsc.edu/user_services/bluegene
• LoadLeveler guide
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.loadl.doc/loadl331/am2ug30305.html
• Blue Gene application development guide (from IBM Redbooks)
  http://www.redbooks.ibm.com/abstracts/sg247179.html