Blue Gene extreme I/O
Giri Chukkapalli, San Diego Supercomputer Center
July 29, 2005
BlueGene/L design
• Almost 4 years of collaboration among LLNL, IBM, SDSC, and others
• Applications people and computational scientists were involved in the design and implementation of the machine
• The machine's computation/communication characteristics are designed to match those of the underlying physics and algorithms
BG/L: Broad design concepts
• Communication characteristics of scientific computing algorithms
  • Large amount of nearest-neighbor communication
  • Small amount of global communication
• Second processor on each node to allow truly overlapped communication and computation
• Flexible I/O
• Designed to exploit fine-grain parallelism
• Lean kernel, low-latency interconnects, small memory
BG System Overview: Novel, massively parallel system from IBM
• Full system scheduled for installation at LLNL in 3Q05
  • 65,000+ compute nodes in 64 racks
  • Each node has two low-power PowerPC processors + memory
  • Compact footprint with very high processor density
  • Slow processors & modest memory per processor
  • Very high peak speed of 360 Tflops
  • Half built now, with #1 Linpack speed of 137 Tflops
• 1,024 compute nodes in a single rack installed at SDSC
  • Maximum I/O configuration with 128 I/O nodes for data-intensive computing
  • Has already achieved more than 3 GB/s read rate
• Systems at >10 other sites
• Need to select apps carefully
  • Must scale (at least weakly) to many processors (because they're slow)
  • Must fit in limited memory
BG System Overview: Processor chip (1) (= system-on-a-chip)
• Two 700-MHz PowerPC 440 processors
  • Each with two floating-point units
  • Each with 32-kB L1 data caches that are noncoherent
  • 4 flops/proc-clock peak (= 2.8 Gflops/proc)
  • Two 8-B loads or stores per proc-clock peak in L1 (= 11.2 GBps/proc)
• Shared 2-kB L2 cache (or prefetch buffer)
• Shared 4-MB L3 cache
• Five network controllers (though not all are wired to each node)
  • 3-D torus (for point-to-point MPI operations: 175 MBps nominal x 6 links x 2 ways)
  • Tree (for most collective MPI operations: 350 MBps nominal x 3 links x 2 ways)
  • Global interrupt (for MPI_Barrier: low latency)
  • Gigabit Ethernet (for I/O)
  • JTAG (for machine control)
• Memory controller for 512 MB of off-chip, shared memory
  • No concept of virtual memory or TLB miss
BG System Overview: SDSC's single-rack system (1)
• 1,024 compute nodes & 128 I/O nodes (each with 2p)
  • Most I/O-rich configuration possible (8:1 compute-to-I/O node ratio)
  • Identical hardware in each node type, with different networks wired
  • Compute nodes connected to: torus, tree, global interrupt, & JTAG
  • I/O nodes connected to: tree, global interrupt, Gigabit Ethernet, & JTAG
• Two half racks (also confusingly called midplanes)
  • Connected via link chips
• Front-end nodes (4 B80s, each with 4p)
• Service node (p275 with 2p)
• Large file system (~400 TB in /idgpfs) serviced by NSD nodes (IA-64s, each with 2p)
BG System Overview: Multiple operating systems & functions
• Compute nodes: run Compute Node Kernel (CNK = blrts)
  • Each runs only one job at a time
  • Each uses very little memory for CNK
• I/O nodes: run Embedded Linux
  • Run CIOD to manage compute nodes
  • Perform file I/O
  • Run GPFS
• Front-end nodes: run SuSE SLES9 Linux/PPC64
  • Support user logins
  • Run cross compilers & linker
  • Run parts of mpirun to submit jobs & LoadLeveler to manage jobs
• Service node: runs SuSE SLES8 Linux/PPC64
  • Uses DB2 to manage four system databases
  • Runs control system software, including MMCS
  • Runs other parts of mpirun & LoadLeveler
• (Software comes in drivers: currently running Driver 202)
Getting started: Logging on & moving files
• Logging on
  ssh bglogin.sdsc.edu
  or
  ssh -l username bglogin.sdsc.edu
• Moving files
  scp file username@bglogin.sdsc.edu:~
  or
  scp -r directory username@bglogin.sdsc.edu:~
Getting started: Places to store your files
• /users (home directory)
  • 18-GB file system on a front-end node
  • Will increase soon to ~1 TB
  • Still won't be able to store much there
  • Regular backups
• /gpfs-wan available for parallel I/O via GPFS
  • ~400 TB accessed via IA-64 NSD servers
  • GPFS shared as a Global File System across other high-end systems at SDSC
  • No backups
Using the compilers: Important programming considerations
• Front-end nodes have different processors & run a different OS than compute nodes
  • Hence codes must be cross compiled
  • Discovery of system characteristics during compilation (e.g., via configure) may require code changes
• Some system calls are not supported by the compute node kernel
Using the compilers: Dual FPUs & SIMDization
• Good performance depends upon using
  • both FPUs per processor*
  • SIMD vectorization
• These work only for
  • data that are 16-B aligned to support quadword (16-B = 128-b) loads & stores
• Full bandwidth is obtained only for stride-one accesses

*All floating-point computations are done in double precision, even though the rest of the processor is 32-bit
Using the compilers: Compiler versions, paths, & wrappers
• Compilers (version numbers the same as on DataStar)
  • XL Fortran V9.1: blrts_xlf & blrts_xlf90
  • XL C/C++ V7.0: blrts_xlc & blrts_xlC
• Paths to compilers in default .bashrc
  export PATH=/opt/ibmcmp/xlf/9.1/bin:$PATH
  export PATH=/opt/ibmcmp/vac/7.0/bin:$PATH
  export PATH=/opt/ibmcmp/vacpp/7.0/bin:$PATH
• Compilers with MPI wrappers (recommended): mpxlf, mpxlf90, mpcc, & mpCC
• Path to MPI-wrapped compilers in default .bashrc
  export PATH=/usr/local/apps/bin:$PATH
Using the compilers: Options & example
• Compiler options
  -qarch=440            uses only a single FPU per processor (minimum option)
  -qarch=440d           allows both FPUs per processor (alternate option)
  -qtune=440            (after -qarch) seems superfluous, but avoids warnings
  -O3                   gives minimal optimization with no SIMDization
  -O3 -qhot=simd        adds SIMDization (seems to be the same as -O5)
  -O4                   adds compile-time interprocedural analysis
  -O5                   adds link-time interprocedural analysis (but sometimes has problems)
  -qdebug=diagnostic    gives SIMDization info
• Big problem now! The second FPU is seldom used, i.e., -O5 is seldom better than -O3
• Current recommendation: -O3 -qarch=440
• Example using an MPI-wrapped compiler (see also the sketch below)
  mpxlf90 -O3 -qarch=440 -o hello hello.f
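For illustration, here is a hedged sketch of trying the dual-FPU/SIMDization options listed above on a hypothetical source file mycode.f; whether any loops actually SIMDize must be checked in the -qdebug=diagnostic output.

  # Attempt dual-FPU code generation and SIMDization (diagnostics report what was vectorized)
  mpxlf90 -O3 -qarch=440d -qtune=440 -qhot=simd -qdebug=diagnostic -o mycode mycode.f

  # Conservative fallback, matching the current recommendation above
  mpxlf90 -O3 -qarch=440 -qtune=440 -o mycode mycode.f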
Using libraries: Math libraries
• ESSL
  • ~500 routines implemented
  • Mostly optimized for -O3 -qarch=440 now
  • Beta version available; formal release in October 05
  • Currently having correctness problems
• MASS/MASSV
  • Initial release of Version 4.2 available
  • Still being optimized
• FFTW
  • Versions 2.1.5 & 3.0.1 available in both single & double precision
  • Performance comparisons with ESSL in progress
• Example link paths (see the sketch below)
  -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lessln -L/usr/local/apps/fftw301s/lib -lfftw3f
• Reference: Ramendra Sahoo's slides (SSW16)
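For illustration, the link flags above might be combined with an MPI-wrapped compile as sketched below; the source file myfft.f is hypothetical, and the exact set of libraries needed depends on which ESSL/MASS/FFTW routines the code actually calls.

  # Hypothetical compile-and-link line using the library paths listed above
  mpxlf90 -O3 -qarch=440 -o myfft myfft.f \
    -Wl,--allow-multiple-definition \
    -L/usr/local/apps/lib -lmassv -lmass -lessln \
    -L/usr/local/apps/fftw301s/lib -lfftw3f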
Using libraries: Message passing via MPI
• MPI is based on MPICH2 from ANL
• All MPICH2 routines (other than MPI-IO) are implemented; some are still being optimized
• Most MPI-IO routines are implemented; optimization is underway
• Compilation & linking are facilitated by MPI wrappers at /usr/local/apps/bin
• References: George Almási's slides (SSW09), Rusty Lusk's slides (SS10), & Hao Yu's slides (SSW11)
Running jobs: Overview
• There are two compute modes
  • Coprocessor (CO) mode: one compute processor per node
  • Virtual node (VN) mode: two compute processors per node
• Jobs run in partitions or blocks
  • These are typically powers of two
  • Blocks must be allocated (or booted) before a run & are restricted to a single user at a time
  • Job submission & block allocation are done by mpirun
  • Sys admins may also boot blocks with MMCS; this avoids allocation overhead for repeat runs
• Interactive & batch jobs are both supported
  • Batch jobs are managed by LoadLeveler
Running jobs: mpirun (1)
• Jobs are submitted from front-end nodes via mpirun
• Here are two examples for interactive runs
  mpirun -partition bot64-1 -np 8 -exe /users/pfeiffer/hello/hello
  mpirun -partition bot256-1 -mode VN -np 512 -exe /users/pfeiffer/NPB2.4/NPB2.4-MPI/binO3fix2/cg.C.512 | tee cg.C.256v.out
• mpirun occasionally hangs, but control-C usually allows exit
Running jobs: mpirun (2)
• Key mpirun options are
  -partition    predefined partition name
  -mode         compute mode: CO or VN
  -connect      connectivity: TORUS or MESH
  -np           number of compute processors
  -mapfile      logical mapping of processors
  -cwd          full path of current working directory
  -exe          full path of executable
  -args         arguments of executable (in double quotes)
  -env          environment variables (in double quotes)
• (These are mostly different than for TeraGrid)
• See the mpirun user's manual for syntax
Running jobs: mpirun (3)
• -partition may be specified explicitly (or not)
  • If specified, the partition must be predefined in the database (which can be viewed via a Web page)
  • If not specified, the partition will be at least a half rack
  • Recommendation: Always use a predefined partition
• -mode may be CO (default) or VN
  • Generally you must specify a partition to run in VN mode
  • For a given number of nodes, VN mode is usually faster than CO mode
  • Memory per processor in VN mode is half that of CO mode
  • Recommendation: Use VN mode unless there is not enough memory
• -connect may be TORUS or MESH
  • Option only applies if -partition is not specified (with MESH the default)
  • Performance is generally better with TORUS
  • Recommendation: Use a predefined partition, which ensures TORUS
Running jobs: mpirun (4)
• -np gives the number of processors
  • Must fit in the available partition
• -mapfile gives the logical mapping of processors
  • Can improve MPI performance in some cases
  • Can be used to ensure VN mode for 2*np ≤ partition size
  • Can be used to change the ratio of compute nodes to I/O nodes
  • Recommendation: Contact SDSC if you want to use a mapfile, since no documentation is available
• -cwd gives the current working directory
  • Needed if there is an input file (see the example below)
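As a hedged illustration of combining -cwd and -args with the options above (the paths, executable name, and argument string are hypothetical; bot128-1 is one of the predefined test partitions listed later):

  # Hypothetical interactive VN-mode run that reads an input file from the working directory
  mpirun -partition bot128-1 -mode VN -np 256 \
    -cwd /users/username/run1 \
    -exe /users/username/run1/mycode \
    -args "-in input.dat"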
Running jobs: LoadLeveler for batch jobs (1)
• Batch jobs are managed with LoadLeveler (in a similar manner as on DataStar)
  • You generate a LoadLeveler run script that includes mpirun
  • Then you submit the job via llsubmit
  • You can monitor status with llq -x & llstatus
  • Additional BG-specific commands are available
• Problem: the scheduler is not working now!
  • Once this is fixed, LoadLeveler will be the recommended way to make production runs
• See the LoadLeveler user guide for more information
Running jobs: LoadLeveler for batch jobs (2)
• Here is an example LoadLeveler run script, say cg.C.512v.run

  #!/usr/bin/ksh
  #@ environment = COPY_ALL;MMCS_SERVER_IP=bgsn-e.sdsc.edu; BACKEND_MPIRUN_PATH=/usr/bin/mpirun_be;
  #@ job_type = parallel
  #@ class = parallel
  #@ input = /dev/null
  #@ output = cg.C.512v.$(jobid).out
  #@ error = cg.C.512v.$(jobid).err
  #@ wall_clock_limit = 00:10:00
  #@ queue
  mpirun -partition bot256-1 -mode VN -np 512 -exe /users/pfeiffer/NPB2.4/NPB2.4-MPI/binO3fix2/cg.C.512

• Submit as follows: llsubmit cg.C.512v.run
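Once the scheduler problem noted above is fixed, a typical submit-and-monitor session might look like this sketch; llsubmit, llq -x, & llstatus are the commands named on these slides, and canceling by the job ID reported by llq is standard LoadLeveler usage.

  # Submit the example script, then check job & machine status
  llsubmit cg.C.512v.run
  llq -x        # detailed listing of queued & running jobs
  llstatus      # status of LoadLeveler machines
  # Cancel a job if needed, using the job ID reported by llq
  llcancel <job_id>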
Running jobs: Predefined partitions
• Production or test partitions
  rack                       all 1,024 nodes
  top & bot                  512 nodes in top & 512 nodes in bottom
  top256-1 & top256-2        256 nodes in each half of top
  bot256-1 & bot256-2        256 nodes in each half of bottom
• Test partitions
  bot128-1, …, bot128-4      128-node quarters of bottom
  bot64-1, …, bot64-8        64-node eighths of bottom
• GPFS partitions
  rackGPFS                   all 1,024 nodes
  topGPFS & botGPFS          512 nodes in top & 512 nodes in bottom
Monitoring jobs: Block & job status via Web
• Web site at bgsn.sdsc.edu (password protected)
Monitoring jobs: Life cycles of blocks & jobs
• Successive block states
  • FREE
  • ALLOCATED
  • CONFIGURING
  • BOOTING (may hang in this state; control-C to exit)
  • INITIALIZED
• Successive job states
  • QUEUED
  • STARTING
  • RUNNING
  • DYING (if killed by user)
  • TERMINATED
BG System Overview: References
• Special Blue Gene issue of IBM Journal of Research and Development, v. 49 (2/3), March/May 2005
  www.research.ibm.com/journal/rd49-23.html
• Blue Gene Web site at SDSC
  www.sdsc.edu/user_services/bluegene
• Slides from the Blue Gene System Software Workshop
  www-unix.mcs.anl.gov/~beckman/bluegene/SSW-Utah-2005
BG/L Architecture: Drawbacks
• Code has to be memory scalable
• Code must be reengineered to overlap computation and communication
• Understanding the process geometry takes effort
• I/O must be parallel
• Cross-compilation issues
• Codes have to exhibit fine-grain parallelism
BG/L Architecture: Advantages
• High intra- and inter-node bytes-to-flops ratio
• Two separate networks to handle two distinct types of communication
• Global reductions performed in the network
• Extremely repeatable performance
• Very low OS overhead
• Very high I/O bytes per flop in the I/O-rich configuration
• Very low watts per flop and square feet per flop
Advantages (cont'd)
• Truly RISC architecture
• Quad-load instructions
• Familiar environment
  • Linux front end
  • XL compilers
  • TotalView debugger
  • HPM/MPI profiling tools
Good matching codes
• Spectral element codes, CFD
• QCD, QM/MD
• Codes involving data streaming
• Ab initio protein folding codes
• Early production runs
  • NREL simulation of a cellulase linker domain using CPMD
  • Caltech spectral element simulation of the Sumatra earthquake
  • DOT, a protein-protein docking code
  • MPCUGLES, an LES CFD code
  • MD simulations using NAMD
Very preliminary experience
• The compiler doesn't generate dual floating-point or quad-load instructions yet
• MPI calls are not yet fully optimized for the tree
• Performance numbers are very preliminary
• The user environment is still rough
• The parallel file system is still not user friendly
NAMD doesn’t scale as well on BG as on p655s;VN mode is only a little worse than CO mode (per p)
ENZO
• Example of the performance impact on codes containing O(N^2) or O(N^3) algorithms, where N is the number of processors
I/O performance with GPFS has been measured for two benchmarks & one application; max rates on Blue Gene are comparable to DataStar for the benchmarks, but slower for the application in VN mode

  Code & quantity      DS p655s     BG CO        BG VN        BG VN, 2048p
                       8p/node      8p/IO node   16p/IO node  16p/IO node
                       (MB/s)       (MB/s)       (MB/s)       (MB/s)
  IOR write            1,793        1,797        1,478        1,585
  IOR read             1,755        2,291        2,165        2,306
  mpi-tile-io write    2,175        2,040        1,720        1,904
  mpi-tile-io read     1,698        3,481        2,929        2,933
  mpcugles write       1,391        905          387          —

  IOR & mpi-tile-io results are on 1024p, except for the last column; mpcugles results are on 512p
IOR weak-scaling scans with GPFS show BG has a higher max for reads (2.3 vs 1.9 GB/s), while DS has a higher max (than BG VN) for writes (1.8 vs 1.6 GB/s)
User-related issues
• Effectiveness of computation/communication overlap using coprocessor mode is still not tested
• Performance variability based on
  • The physical slice given by the job scheduler
  • The mapping of the MPI tasks onto the physical slice