
Running on the SDSC Blue Gene

Mahidhar Tatineni, Blue Gene Workshop, SDSC, April 5, 2007.


Presentation Transcript


  1. Running on the SDSC Blue Gene. Mahidhar Tatineni, Blue Gene Workshop, SDSC, April 5, 2007

  2. BG System Overview: SDSC’s three-rack system

  3. BG System Overview: Integrated system

  4. BG System Overview: Multiple operating systems & functions
     • Compute nodes: run Compute Node Kernel (CNK = blrts)
         • Each runs only one job at a time
         • Each uses very little memory for CNK
     • I/O nodes: run Embedded Linux
         • Run CIOD to manage compute nodes
         • Perform file I/O
         • Run GPFS
     • Front-end nodes: run SuSE Linux
         • Support user logins
         • Run cross compilers & linker
         • Run parts of mpirun to submit jobs & LoadLeveler to manage jobs
     • Service node: runs SuSE Linux
         • Uses DB2 to manage four system databases
         • Runs control system software, including MMCS
         • Runs other parts of mpirun & LoadLeveler
     • Software comes in drivers: we are currently running Driver V1R3M1

  5. SDSC Blue Gene: Getting started, logging on & moving files
     • Logging on
         ssh bglogin.sdsc.edu
       or
         ssh -l username bglogin.sdsc.edu
       Alternate login node: bg-login4.sdsc.edu (we will use bg-login4 for the workshop)
     • Moving files
         scp file username@bglogin.sdsc.edu:~
       or
         scp -r directory username@bglogin.sdsc.edu:~

  6. SDSC Blue Gene: Getting started, places to store your files
     • /users (home directory)
         • 1.1 TB NFS-mounted file system
         • Recommended for storing source and other important files
         • Do not write data/output to this area: it is slow and limited in size!
         • Regular backups
     • /bggpfs, available for parallel I/O via GPFS
         • ~18.5 TB accessed via IA-64 NSD servers
         • No backups
     • /gpfs-wan, 700 TB available for parallel I/O and shared with DataStar and the TG IA-64 cluster

  7. SDSC Blue Gene: Checking your allocation
     • Use the reslist command to check your allocation on the SDSC Blue Gene
     • Sample output is as follows:
         bg-login1 mahidhar/bg_workshop> reslist -u ux452208
         Querying database, this may take several seconds ...
         Output shown is local machine usage.
         For full usage on roaming accounts, please use tgusage.

         SBG: Blue Gene at SDSC
                                             SU Hours   SU Hours
         Name      UID     ACID  ACC  PCTG  ALLOCATED  USED      USER
         ux452208  452208  1606  U    100   99999      0         Guest8, Hpc
         MKG000            1606             99999      40

  8. Accessing HPSS from the Blue Gene
     • What is HPSS? The High Performance Storage System (HPSS) is the centralized, long-term data storage system at SDSC.
     • Set up your authentication: run the ‘get_hpss_keytab’ script
     • Use the hsi and htar clients to connect to HPSS. For example:
         hsi put mytar.tar
         htar -c -f mytar.tar -L file_or_directory

  9. Using the compilers: Important programming considerations
     • Front-end nodes have different processors & run a different OS than compute nodes
         • Hence codes must be cross compiled
     • Care must be taken with configure scripts
         • Discovery of system characteristics during compilation (e.g., via configure) may require modifications to the configure script.
         • If code has to be executed during configure, make sure that it runs on the compute nodes.
         • Alternatively, system characteristics can be specified by the user and the configure script modified to take this into account.
     • Some system calls are not supported by the compute node kernel

  10. Using the compilers: Compiler versions, paths, & wrappers
     • Compilers (version numbers the same as on DataStar)
         XL Fortran V10.1: blrts_xlf & blrts_xlf90
         XL C/C++ V8.0: blrts_xlc & blrts_xlC
     • Paths to compilers in default .bashrc
         export PATH=/opt/ibmcmp/xlf/bg/10.1/bin:$PATH
         export PATH=/opt/ibmcmp/vac/bg/8.0/bin:$PATH
         export PATH=/opt/ibmcmp/vacpp/bg/8.0/bin:$PATH
     • Compilers with MPI wrappers (recommended)
         mpxlf, mpxlf90, mpcc, & mpCC
     • Path to MPI-wrapped compilers in default .bashrc
         export PATH=/usr/local/apps/bin:$PATH

  11. Using the compilers: Options
     • Compiler options
         -qarch=440        uses only a single FPU per processor (minimum option)
         -qarch=440d       allows both FPUs per processor (alternate option)
         -qtune=440        tunes for the 440 processor
         -O3               gives minimal optimization with no SIMDization
         -O3 -qarch=440d   adds backend SIMDization
         -O3 -qhot         adds TPO (a high-level inter-procedural optimizer) SIMDization, more loop optimization
         -O4               adds compile-time interprocedural analysis
         -O5               adds link-time interprocedural analysis
         (TPO SIMDization is the default with -O4 and -O5)
     • Current recommendation:
         Start with -O3 -qarch=440d -qtune=440
         Try -O4, -O5 next
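The recommended progression can be scripted so that each optimization level is built and timed in turn. The sketch below only prints the compile commands. `compile_cmds` is a hypothetical helper, and keeping -qarch=440d and -qtune=440 at -O4 and -O5 is an assumption rather than something the slide states.

```sh
#!/bin/sh
# Sketch: print one compile command per optimization level to try.
# compile_cmds is a hypothetical helper, not part of the XL compilers.
# Assumption: -qarch=440d and -qtune=440 are kept at every level.
compile_cmds() {
    src=$1
    exe=$2
    for level in -O3 -O4 -O5; do
        echo "mpxlf90 $level -qarch=440d -qtune=440 -o $exe $src"
    done
}

# Hypothetical source and executable names.
compile_cmds prog.f90 prog
```

Check the program's results at each level before moving to the next, since more aggressive optimization can change floating-point results.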

  12. Using libraries
     • ESSL
         • Version 4.2 is available in /usr/local/apps/lib
     • MASS/MASSV
         • Version 4.3 is available in /usr/local/apps/lib
     • FFTW
         • Versions 2.1.5 and 3.1.2 are available in both single & double precision. The libraries are located in /usr/local/apps/V1R3
     • NETCDF
         • Versions 3.6.0p1 and 3.6.1 are available in /usr/local/apps/V1R3
     • Example link paths
         -Wl,--allow-multiple-definition -L/usr/local/apps/lib -lmassv -lmass -lesslbg
         -L/usr/local/apps/V1R3/fftw-3.1.2s/lib -lfftw3f

  13. Running jobs: Overview
     • There are two compute modes
         • Coprocessor (CO) mode: one compute processor per node
         • Virtual node (VN) mode: two compute processors per node
     • Jobs run in partitions or blocks
         • These are typically powers of two
         • Blocks must be allocated (or booted) before a run & are restricted to a single user at a time
     • Only batch jobs are supported
         • Batch jobs are managed by LoadLeveler
         • Users can monitor jobs using llq -b & llq -x
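Because the compute mode sets how many processes share a node, the same -np value occupies a different number of nodes in CO and VN mode. A minimal sketch of that arithmetic follows. `nodes_needed` is a hypothetical helper, not a Blue Gene or LoadLeveler command.

```sh
#!/bin/sh
# Sketch: compute nodes occupied by a job for a given -np and compute mode.
# CO mode places one compute process per node, VN mode places two.
# nodes_needed is a hypothetical helper written for illustration.
nodes_needed() {
    np=$1
    mode=$2
    case "$mode" in
        CO) echo "$np" ;;
        VN) echo $(( ($np + 1) / 2 )) ;;   # round up when np is odd
        *)  echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

nodes_needed 1024 CO   # prints 1024
nodes_needed 1024 VN   # prints 512
```

The result tells you the smallest block the job can fit in, which is useful when choosing a partition name.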

  14. Running jobs: LoadLeveler for batch jobs
     • Here is an example LoadLeveler run script (test.cmd)
         #!/usr/bin/ksh
         #@ environment = COPY_ALL;
         #@ job_type = BlueGene
         #@ account_no = <your user account>
         #@ class = parallel
         #@ bg_partition = <partition name; for example: top>
         #@ output = file.$(jobid).out
         #@ error = file.$(jobid).err
         #@ notification = complete
         #@ notify_user = <your email address>
         #@ wall_clock_limit = 00:10:00
         #@ queue
         mpirun -mode VN -np <number of procs> -exe <your executable> -cwd <working directory>
     • Submit as follows:
         llsubmit test.cmd
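When the same script is reused for many runs, the placeholders can be filled in by a small generator. This is a sketch under stated assumptions: `make_ll_script` and its argument order are hypothetical, the notification keywords are left out for brevity, and the account, partition, and paths in the usage line are illustrative values only.

```sh
#!/bin/sh
# Sketch: print a LoadLeveler command file with the placeholders filled in.
# make_ll_script is a hypothetical helper, not a LoadLeveler tool.
# The directives are the ones from the example script (notification omitted).
make_ll_script() {
    account=$1
    partition=$2
    nprocs=$3
    exe=$4
    cwd=$5
    cat <<EOF
#!/usr/bin/ksh
#@ environment = COPY_ALL;
#@ job_type = BlueGene
#@ account_no = $account
#@ class = parallel
#@ bg_partition = $partition
#@ output = file.\$(jobid).out
#@ error = file.\$(jobid).err
#@ wall_clock_limit = 00:10:00
#@ queue
mpirun -mode VN -np $nprocs -exe $exe -cwd $cwd
EOF
}

# Illustrative values: 128 processes in VN mode on a 64-node test partition.
make_ll_script MKG000 bot64-1 128 /users/your_dir/hello /users/your_dir
```

Redirect the output to a file such as test.cmd and submit that file with llsubmit.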

  15. Running jobs: mpirun options
     • Key mpirun options are
         -mode      compute mode: CO or VN
         -np        number of compute processors
         -mapfile   logical mapping of processors
         -cwd       full path of current working directory
         -exe       full path of executable
         -args      arguments of executable (in double quotes)
         -env       environment variables (in double quotes)
       (These are mostly different from those used on TeraGrid)

  16. Running jobs: Partition Layout and Usage Guidelines
     • To make effective use of the Blue Gene, production runs should generally use one-fourth or more of a rack, i.e., 256 or more compute nodes. Thus predefined partitions are provided for production runs.
         • SDSC: all 3,072 nodes
         • R01R02: 2,048 nodes combining racks 1 & 2
         • rack, R01, R02: all 1,024 nodes of rack 0, rack 1, and rack 2, respectively
         • top, bot; R01-top, R01-bot; R02-top, R02-bot: 512 nodes each
         • top256-1 & top256-2: 256 nodes in each half of the top midplane of rack 0
         • bot256-1 & bot256-2: 256 nodes in each half of the bottom midplane of rack 0
     • Smaller 64-node (bot64-1, …, bot64-8) and 128-node (bot128-1, …, bot128-4) partitions are available for test runs.
     • Use the /usr/local/apps/utils/showq command to get more information on partition requests of jobs in the queue.
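Choosing a partition amounts to finding the smallest predefined size that holds the job. The sketch below shows that lookup. `partition_size` is a hypothetical helper, and the full machine is taken as three racks of 1,024 nodes, i.e. 3,072 nodes.

```sh
#!/bin/sh
# Sketch: smallest predefined partition size (in nodes) that fits a request.
# partition_size is a hypothetical helper written for illustration.
# Sizes follow the partition list; the full machine is 3 x 1024 = 3072 nodes.
partition_size() {
    want=$1
    for size in 64 128 256 512 1024 2048 3072; do
        if [ "$want" -le "$size" ]; then
            echo "$size"
            return 0
        fi
    done
    echo "request exceeds the 3072-node machine" >&2
    return 1
}

partition_size 300    # prints 512
```

A 300-node request therefore maps to one of the 512-node partitions such as top or bot.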

  17. Running jobs: Partition Layout

  18. Running Jobs: Reservation
     • There is a reservation in place for today’s workshop for all the guest users.
     • The reservation ID is bgsn.76.r
     • Set the LL_RES_ID variable to bgsn.76.r. This will automatically bind jobs to the reservation.
         • csh/tcsh: setenv LL_RES_ID bgsn.76.r
         • bash: export LL_RES_ID=bgsn.76.r

  19. Running Jobs: Example 1
     • The examples featured in today’s talk are included in the following directory:
         /bggpfs/projects/bg_workshop
     • Copy them to your directory by using the following command:
         cp -r /bggpfs/projects/bg_workshop /users/<your_dir>
     • In the first example we will compile a simple MPI program (mpi_hello_c.c/mpi_hello_f.f) and use the sample LoadLeveler script (example1.cmd) to submit and run the job.

  20. Example 1 (contd.)
     • Compile the example files using the mpcc/mpxlf wrappers
         mpcc -o hello mpi_hello_c.c
         mpxlf -o hello mpi_hello_f.f
     • Modify the LoadLeveler submit file (example1.cmd). Add the account number, partition name, email address, and mpirun options
     • Use llsubmit to put the job in the queue
         llsubmit example1.cmd

  21. Running Jobs: Example 2
     • In example 2 we will use an I/O benchmark (IOR) to illustrate the use of arguments with mpirun
     • The mpirun line is as follows:
         mpirun -np 64 -mode CO -cwd /bggpfs/projects/bg_workshop -exe /bggpfs/projects/bg_workshop/IOR -args "-a MPIIO -b 32m -t 4m -i 3"
     • The -np, -mode, -cwd, -exe, and -args options are used in this example. The -args option is used to pass options to the IOR executable.
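The point of -args is that everything meant for the executable travels as one double-quoted string, separate from the options mpirun itself reads. A small sketch of assembling such a line is shown here. `mpirun_cmd` is a hypothetical helper that only prints the command and does not run it.

```sh
#!/bin/sh
# Sketch: assemble an mpirun line, keeping the executable's own arguments
# together inside one double-quoted -args string.
# mpirun_cmd is a hypothetical helper; it prints the command only.
mpirun_cmd() {
    mode=$1
    np=$2
    cwd=$3
    exe=$4
    args=$5
    printf 'mpirun -np %s -mode %s -cwd %s -exe %s -args "%s"\n' \
        "$np" "$mode" "$cwd" "$exe" "$args"
}

# Reproduces the IOR line from example 2.
mpirun_cmd CO 64 /bggpfs/projects/bg_workshop \
    /bggpfs/projects/bg_workshop/IOR "-a MPIIO -b 32m -t 4m -i 3"
```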

  22. Checkpoint-Restart on the Blue Gene
     • Checkpoint and restart are among the primary techniques for fault recovery on the Blue Gene.
     • The current version of the checkpoint library requires users to manually insert checkpoint calls at the proper places in their code.
     • The process can be initialized by calling the BGLCheckpointInit() function.
     • Checkpoint files can be written by making a call to BGLCheckpoint(). This can be done any number of times and the checkpoint files are distinguished by a sequence number.
     • The environment variables BGL_CHKPT_RESTART_SEQNO and BGL_CHKPT_DIR_PATH control the restart number and location.
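One plausible way to set the two restart variables for a rerun is through mpirun's -env option. The sketch below assumes that -env accepts several space-separated VAR=value pairs in one quoted string, which should be checked against the mpirun documentation for the installed driver. `restart_env`, the checkpoint directory, and the executable path are hypothetical.

```sh
#!/bin/sh
# Sketch: build the -env string for restarting from a given checkpoint.
# restart_env is a hypothetical helper. Assumption: mpirun -env takes
# space-separated VAR=value pairs inside one double-quoted string.
restart_env() {
    seqno=$1
    dir=$2
    printf 'BGL_CHKPT_RESTART_SEQNO=%s BGL_CHKPT_DIR_PATH=%s' "$seqno" "$dir"
}

# Hypothetical directory holding the checkpoint files and the executable.
ckpt_dir=/bggpfs/your_dir/bg_workshop
echo "mpirun -mode VN -np 64 -cwd $ckpt_dir -exe $ckpt_dir/pchk -env \"$(restart_env 4 "$ckpt_dir")\""
```

The script prints the mpirun line instead of running it, so the line can be pasted into a LoadLeveler command file after review.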

  23. Example for Checkpoint-Restart
     • Let us look at the entire checkpoint restart process using the example provided in the /bggpfs/projects/bg_workshop directory.
     • We are using a simple Poisson solver to illustrate the checkpoint process (file: poisson-chkpt.f)
     • Compile the program using mpxlf and including the checkpoint library:
         mpxlf -o pchk poisson-chkpt.f /bgl/BlueLight/ppcfloor/bglsys/lib/libchkpt.rts.a
     • Use the chkpt.cmd file to submit the job
     • The program writes checkpoint files after every 1000 steps. The checkpoint files are tagged with the node ids and the sequence number. For example:
         ckpt.x06-y01-z00.1.2

  24. Example for Checkpoint-Restart (Contd.)
     • Verify that the checkpoint restart works
     • From the first run (when the checkpoint files were written):
         Done Step # 3997 ; Error= 1.83992678887004613
         Done Step # 3998 ; Error= 1.83991115295111185
         Done Step # 3999 ; Error= 1.83989551716504351
         Done Step # 4000 ; Error= 1.83987988151185511
         Done Step # 4001 ; Error= 1.83986424599153198
         Done Step # 4002 ; Error= 1.83984861060408078
         Done Step # 4003 ; Error= 1.83983297534951951
     • From the second run (continued from step 4000, sequence 4):
         Done Step # 4000 ; Error= 1.83987988151185511
         Done Step # 4001 ; Error= 1.83986424599153198
         Done Step # 4002 ; Error= 1.83984861060408078
     • We get identical results from both runs
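The comparison can be automated by checking that every "Done Step" line of the restarted run also appears verbatim in the first run. This is a minimal sketch. `restart_matches` and the temporary log files are hypothetical, and the sample lines are copied from the output above.

```sh
#!/bin/sh
# Sketch: succeed only if every "Done Step" line of the restarted run
# appears verbatim in the log of the first run.
# restart_matches is a hypothetical helper written for illustration.
restart_matches() {
    first=$1
    second=$2
    # grep -F -x -v -f prints lines of the second log that match no whole
    # line of the first log; an empty result means the runs agree.
    extra=$(grep 'Done Step' "$second" | grep -F -x -v -f "$first")
    [ -z "$extra" ]
}

# Demo with temporary files standing in for the two job output files.
first=$(mktemp)
second=$(mktemp)
cat > "$first" <<'EOF'
Done Step # 3999 ; Error= 1.83989551716504351
Done Step # 4000 ; Error= 1.83987988151185511
Done Step # 4001 ; Error= 1.83986424599153198
EOF
cat > "$second" <<'EOF'
Done Step # 4000 ; Error= 1.83987988151185511
Done Step # 4001 ; Error= 1.83986424599153198
EOF
if restart_matches "$first" "$second"; then
    echo "restart reproduces the first run"
fi
rm -f "$first" "$second"
```

The check compares the printed digits exactly, so it only passes when the restarted run reproduces the first run to the last printed digit.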

  25. BG System Overview: References
     • Blue Gene Web site at SDSC
         http://www.sdsc.edu/us/resources/bluegene
     • LoadLeveler guide
         http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=/com.ibm.cluster.loadl.doc/loadl331/am2ug30305.html
     • Blue Gene Application development guide (from IBM Redbooks)
         http://www.redbooks.ibm.com/abstracts/sg247179.html
