Running on the SDSC Blue Gene
Mahidhar Tatineni, Blue Gene Workshop, SDSC, April 5, 2007
BG System Overview: Multiple operating systems & functions
• Compute nodes: run the Compute Node Kernel (CNK = blrts)
  • Each runs only one job at a time
  • Each uses very little memory for the CNK
• I/O nodes: run Embedded Linux
  • Run CIOD to manage the compute nodes
  • Perform file I/O
  • Run GPFS
• Front-end nodes: run SuSE Linux
  • Support user logins
  • Run the cross compilers & linker
  • Run parts of mpirun to submit jobs & LoadLeveler to manage jobs
• Service node: runs SuSE Linux
  • Uses DB2 to manage the four system databases
  • Runs the control system software, including MMCS
  • Runs the other parts of mpirun & LoadLeveler
• Software comes in drivers: we are currently running Driver V1R3M1
SDSC Blue Gene: Getting Started
Logging on & moving files
• Logging on
  ssh bglogin.sdsc.edu
  or
  ssh -l username bglogin.sdsc.edu
  Alternate login node: bg-login4.sdsc.edu (we will use bg-login4 for the workshop)
• Moving files
  scp file username@bglogin.sdsc.edu:~
  or
  scp -r directory username@bglogin.sdsc.edu:~
  (A sample session combining these commands is sketched below.)
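As a concrete sketch, a first session might look like the following; the username, directory, and file names here are placeholders, not actual workshop accounts:

  # Copy a source directory from your workstation to your Blue Gene home directory
  scp -r my_project username@bg-login4.sdsc.edu:~

  # Log in to the alternate front-end node used for the workshop
  ssh -l username bg-login4.sdsc.edu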
SDSC Blue Gene: Getting Started
Places to store your files
• /users (home directory)
  • 1.1 TB NFS-mounted file system
  • Recommended for storing source & important files
  • Do not write data/output to this area: slow and limited in size!
  • Regular backups
• /bggpfs available for parallel I/O via GPFS
  • ~18.5 TB accessed via IA-64 NSD servers
  • No backups
• 700 TB /gpfs-wan available for parallel I/O and shared with DataStar and the TG IA-64 cluster
SDSC Blue Gene: Checking your allocation
• Use the reslist command to check your allocation on the SDSC Blue Gene
• Sample output is as follows:

  bg-login1 mahidhar/bg_workshop> reslist -u ux452208
  Querying database, this may take several seconds ...
  Output shown is local machine usage. For full usage on roaming accounts, please use tgusage.

  SBG: Blue Gene at SDSC
                                              SU Hours    SU Hours
  Name                UID     ACID  ACC PCTG  ALLOCATED   USED
  USER ux452208       452208  1606  U   100   99999       0
  Guest8, Hpc MKG000          1606             99999       40
Accessing HPSS from the Blue Gene
• What is HPSS?
  The centralized, long-term data storage system at SDSC is the High Performance Storage System (HPSS)
• Set up your authentication: run the 'get_hpss_keytab' script
• Use the hsi and htar clients to connect to HPSS. For example:
  hsi put mytar.tar
  htar -c -f mytar.tar -L file_or_directory
  (Commands for retrieving data are sketched below.)
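For completeness, the corresponding retrieval commands are sketched below; mytar.tar is just the example archive name from above, and the exact options available may depend on the installed hsi/htar client version:

  hsi get mytar.tar        # copy the file back from HPSS to the current directory
  htar -t -f mytar.tar     # list the contents of an archive stored in HPSS
  htar -x -f mytar.tar     # extract the contents of the archive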
Using the compilers: Important programming considerations
• Front-end nodes have different processors & run a different OS than the compute nodes
  • Hence codes must be cross compiled
• Care must be taken with configure scripts (see the sketch after this list)
  • Discovery of system characteristics during compilation (e.g., via configure) may require modifications to the configure script
  • Make sure that if code has to be executed during configure, it runs on the compute nodes
  • Alternately, system characteristics can be specified by the user and configure modified to take this into account
• Some system calls are not supported by the compute node kernel
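As a rough sketch of cross compiling an autoconf-based package, one might point configure at the Blue Gene cross compilers along these lines; the --host triplet, install prefix, and whether a given package honors these settings are assumptions to verify for your package:

  ./configure --host=powerpc-bgl-blrts-gnu \
              --prefix=$HOME/bgl-install \
              CC=blrts_xlc CXX=blrts_xlC \
              F77=blrts_xlf FC=blrts_xlf90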
Using the compilers: Compiler versions, paths, & wrappers
• Compilers (version numbers are the same as on DataStar)
  XL Fortran V10.1: blrts_xlf & blrts_xlf90
  XL C/C++ V8.0: blrts_xlc & blrts_xlC
• Paths to the compilers in the default .bashrc
  export PATH=/opt/ibmcmp/xlf/bg/10.1/bin:$PATH
  export PATH=/opt/ibmcmp/vac/bg/8.0/bin:$PATH
  export PATH=/opt/ibmcmp/vacpp/bg/8.0/bin:$PATH
• Compilers with MPI wrappers (recommended)
  mpxlf, mpxlf90, mpcc, & mpCC
• Path to the MPI-wrapped compilers in the default .bashrc
  export PATH=/usr/local/apps/bin:$PATH
Using the compilers: Options
• Compiler options
  -qarch=440        uses only a single FPU per processor (minimum option)
  -qarch=440d       allows both FPUs per processor (alternate option)
  -qtune=440        tunes for the 440 processor
  -O3               gives minimal optimization with no SIMDization
  -O3 -qarch=440d   adds backend SIMDization
  -O3 -qhot         adds TPO (a high-level inter-procedural optimizer) SIMDization and more loop optimization
  -O4               adds compile-time interprocedural analysis
  -O5               adds link-time interprocedural analysis
  (TPO SIMDization is the default with -O4 and -O5)
• Current recommendation (see the example compile line below):
  Start with -O3 -qarch=440d -qtune=440
  Try -O4, -O5 next
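For instance, compiling a Fortran source with the recommended starting flags might look like the following; mysolver.f90 is a placeholder file name:

  mpxlf90 -O3 -qarch=440d -qtune=440 -o mysolver mysolver.f90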
Using libraries
• ESSL
  • Version 4.2 is available in /usr/local/apps/lib
• MASS/MASSV
  • Version 4.3 is available in /usr/local/apps/lib
• FFTW
  • Versions 2.1.5 and 3.1.2 are available in both single & double precision. The libraries are located in /usr/local/apps/V1R3
• NETCDF
  • Versions 3.6.0p1 and 3.6.1 are available in /usr/local/apps/V1R3
• Example link paths (a full link line is sketched below)
  -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lesslbg -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f
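Putting the compiler options and library paths together, a full compile-and-link command might look like the following; the source and output names are placeholders, and you should keep only the -l flags for the libraries your code actually uses:

  mpxlf -O3 -qarch=440d -qtune=440 -o fftw_app fftw_app.f \
        -Wl,--allow-multiple-definition \
        -L/usr/local/apps/lib -lmassv -lmass -lesslbg \
        -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f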
Running jobs: Overview
• There are two compute modes
  • Coprocessor (CO) mode: one compute processor per node
  • Virtual node (VN) mode: two compute processors per node
• Jobs run in partitions or blocks
  • These are typically powers of two
  • Blocks must be allocated (or booted) before a run & are restricted to a single user at a time
• Only batch jobs are supported
• Batch jobs are managed by LoadLeveler
• Users can monitor jobs using llq -b & llq -x (see the examples below)
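A few typical monitoring commands are sketched below; the job ID shown is a placeholder, and the exact output format depends on the installed LoadLeveler version:

  llq                    # list all jobs in the queue
  llq -b                 # show Blue Gene block (partition) information, per the slide above
  llq -u $USER           # show only your own jobs
  llq -x bgsn.123.0      # extended information for a specific job step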
Running jobs: LoadLeveler for batch jobs
• Here is an example LoadLeveler run script (test.cmd):

  #!/usr/bin/ksh
  #@ environment = COPY_ALL;
  #@ job_type = BlueGene
  #@ account_no = <your user account>
  #@ class = parallel
  #@ bg_partition = <partition name; for example: top>
  #@ output = file.$(jobid).out
  #@ error = file.$(jobid).err
  #@ notification = complete
  #@ notify_user = <your email address>
  #@ wall_clock_limit = 00:10:00
  #@ queue
  mpirun -mode VN -np <number of procs> -exe <your executable> -cwd <working directory>

• Submit as follows: llsubmit test.cmd
Running jobs: mpirun options
• Key mpirun options are
  -mode     compute mode: CO or VN
  -np       number of compute processors
  -mapfile  logical mapping of processors
  -cwd      full path of the current working directory
  -exe      full path of the executable
  -args     arguments of the executable (in double quotes)
  -env      environment variables (in double quotes)
  (These options are mostly different from those on the TeraGrid systems.)
• A sample invocation is sketched below.
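A hypothetical invocation combining several of these options; the paths, process count, executable arguments, and the environment variable shown are placeholders rather than required settings:

  mpirun -mode CO -np 128 \
         -cwd /bggpfs/projects/my_run \
         -exe /bggpfs/projects/my_run/mysolver \
         -args "-n 1000 -tol 1e-6" \
         -env "MY_SOLVER_DEBUG=0"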
Running jobs: Partition Layout and Usage Guidelines
• To make effective use of the Blue Gene, production runs should generally use one-fourth or more of the machine, i.e., 256 or more compute nodes. Predefined partitions are therefore provided for production runs:
  • SDSC: all 3,072 nodes
  • R01R02: 2,048 nodes combining racks 1 & 2
  • rack, R01, R02: all 1,024 nodes of rack 0, rack 1, and rack 2, respectively
  • top, bot; R01-top, R01-bot; R02-top, R02-bot: 512 nodes each
  • top256-1 & top256-2: 256 nodes in each half of the top midplane of rack 0
  • bot256-1 & bot256-2: 256 nodes in each half of the bottom midplane of rack 0
• Smaller 64-node (bot64-1, ..., bot64-8) and 128-node (bot128-1, ..., bot128-4) partitions are available for test runs.
• Use the /usr/local/apps/utils/showq command to get more information on the partition requests of jobs in the queue.
Running Jobs: Reservation
• There is a reservation in place for today's workshop for all the guest users.
• The reservation ID is bgsn.76.r
• Set the LL_RES_ID variable to bgsn.76.r. This will automatically bind jobs to the reservation.
  • csh/tcsh: setenv LL_RES_ID bgsn.76.r
  • bash: export LL_RES_ID=bgsn.76.r
Running Jobs: Example 1
• The examples featured in today's talk are included in the following directory: /bggpfs/projects/bg_workshop
• Copy them to your directory using the following command:
  cp -r /bggpfs/projects/bg_workshop /users/<your_dir>
• In the first example we will compile a simple MPI program (mpi_hello_c.c / mpi_hello_f.f) and use the sample LoadLeveler script (example1.cmd) to submit and run the job. (A sketch of such a program is shown below.)
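For reference, a minimal MPI "hello world" in C along the lines of mpi_hello_c.c might look like the following; this is a sketch, not the exact contents of the workshop file:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, size;

      MPI_Init(&argc, &argv);                  /* start MPI */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this task's rank */
      MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of tasks */

      printf("Hello from rank %d of %d\n", rank, size);

      MPI_Finalize();                          /* shut down MPI */
      return 0;
  }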
Example 1 (contd.)
• Compile the example files using the mpcc / mpxlf wrappers:
  mpcc -o hello mpi_hello_c.c
  mpxlf -o hello mpi_hello_f.f
• Modify the LoadLeveler submit file (example1.cmd): add the account number, partition name, email address, and mpirun options
• Use llsubmit to put the job in the queue:
  llsubmit example1.cmd
Running Jobs: Example 2
• In example 2 we will use an I/O benchmark (IOR) to illustrate the use of arguments with mpirun
• The mpirun line is as follows:
  mpirun -np 64 -mode CO -cwd /bggpfs/projects/bg_workshop -exe /bggpfs/projects/bg_workshop/IOR -args "-a MPIIO -b 32m -t 4m -i 3"
• The -mode, -exe, and -args options are used in this example. The -args option passes options to the IOR executable.
Checkpoint-Restart on the Blue Gene
• Checkpoint and restart are among the primary techniques for fault recovery on the Blue Gene.
• The current version of the checkpoint library requires users to manually insert checkpoint calls at the appropriate places in their code.
• The process is initialized by calling the BGLCheckpointInit() function.
• Checkpoint files are written by calling BGLCheckpoint(). This can be done any number of times; the checkpoint files are distinguished by a sequence number.
• The environment variables BGL_CHKPT_RESTART_SEQNO and BGL_CHKPT_DIR_PATH control the restart sequence number and the checkpoint file location.
• A sketch of how these calls fit into a code is shown below.
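A rough C sketch of how these calls might be placed in a time-stepping code; the header name and the exact argument of BGLCheckpointInit() are assumptions to check against the IBM application development guide listed in the references:

  #include <mpi.h>
  #include <bgl_checkpoint.h>   /* checkpoint library header; exact name is an assumption */

  int main(int argc, char *argv[])
  {
      int step;

      MPI_Init(&argc, &argv);

      /* Initialize checkpointing once, early in the run. Passing NULL is
         assumed to select the default checkpoint directory (or the one
         given by BGL_CHKPT_DIR_PATH). */
      BGLCheckpointInit(NULL);

      for (step = 1; step <= 10000; step++) {
          /* ... advance the solution one step ... */

          if (step % 1000 == 0) {
              /* Write a checkpoint; files are tagged with node IDs and a
                 sequence number, as in the workshop example. */
              BGLCheckpoint();
          }
      }

      MPI_Finalize();
      return 0;
  }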
Example for Checkpoint-Restart
• Let us look at the entire checkpoint-restart process using the example provided in the /bggpfs/projects/bg_workshop directory.
• We use a simple Poisson solver to illustrate the checkpoint process (file: poisson-chkpt.f)
• Compile the program using mpxlf, including the checkpoint library:
  mpxlf -o pchk poisson-chkpt.f /bgl/BlueLight/ppcfloor/bglsys/lib/libchkpt.rts.a
• Use the chkpt.cmd file to submit the job
• The program writes checkpoint files every 1000 steps. The checkpoint files are tagged with the node IDs and the sequence number. For example: ckpt.x06-y01-z00.1.2
Example for Checkpoint-Restart (Contd.)
• Verify that the checkpoint restart works
• From the first run (when the checkpoint files were written):
  Done Step # 3997 ; Error= 1.83992678887004613
  Done Step # 3998 ; Error= 1.83991115295111185
  Done Step # 3999 ; Error= 1.83989551716504351
  Done Step # 4000 ; Error= 1.83987988151185511
  Done Step # 4001 ; Error= 1.83986424599153198
  Done Step # 4002 ; Error= 1.83984861060408078
  Done Step # 4003 ; Error= 1.83983297534951951
• From the second run (continued from step 4000, sequence 4):
  Done Step # 4000 ; Error= 1.83987988151185511
  Done Step # 4001 ; Error= 1.83986424599153198
  Done Step # 4002 ; Error= 1.83984861060408078
• We get identical results from both runs
BG System Overview: References
• Blue Gene Web site at SDSC
  http://www.sdsc.edu/us/resources/bluegene
• LoadLeveler guide
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.loadl.doc/loadl331/am2ug30305.html
• Blue Gene Application Development guide (IBM Redbooks)
  http://www.redbooks.ibm.com/abstracts/sg247179.html