Computing Workshop for Users of NCAR’s SCD machines

Christiane Jablonowski (cjablono@ucar.edu), NCAR ASP/SCD
31 January 2006
ML Mesa Lab, Chapman Room; video conference facilities: FL EOL Atrium and CG1 3150

Overview • Current machine architectures at NCAR (SCD) • Some basics on parallel computing • Batch queuing systems at NCAR • GAU resources & how to obtain a GAU account • Insights into GAU charges • The Mass Storage System • How to monitor the GAUs • Some practical tips on benchmarks, debugging tools, restarts… • ???

Computer architectures • SCD’s machines are UNIX-based parallel computing architectures • Two types: • Hybrid (shared and distributed memory) machines like bluesky (IBM Power4), bluevista (IBM Power5), lightning (AMD Opteron system running Linux) • Shared memory system like tempest (SGI, 128 CPUs), predominantly used for post-processing jobs

Parallel Programming • Parallel machines require parallel programming techniques in the user application: • MPI (Message Passing Interface) for distributed memory systems, can also be used on shared memory systems • OpenMP for shared memory systems • Hybrid (MPI & OpenMP) programming technique common on the IBMs at NCAR • Pure MPI parallelization often the fastest option, computational domain is split into pieces that can communicate over the network (via messages) • OpenMP: Parallelization of (mostly) loops via compiler directives • Parallelization provided in CAM/CCSM/WRF

Most common: Hybrid hardware architectures • Combined shared and distributed memory architecture: • Shared-memory symmetric multiprocessor (SMP) nodes, processors on a node have direct access to memory • Nodes are connected via the network (distributed memory)

MPI example Processors communicate via messages

MPI Example • Initialize & finalize MPI in your program via function/subroutine calls to the MPI library. Examples include: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize Example from previous page in C notation (unoptimized): Important to note: such an operation (computing a global sum) is very common, therefore MPI provides a highly optimized function, also called a ‘reduction operation’, MPI_Reduce (…) that can replace the example above

Example: domain decompositions for MPI Each color represents a processor

OpenMP Example
Parallel loops via compiler directives (here: in Fortran notation)
Before the program is called, set: setenv OMP_NUM_THREADS #proc
Add compiler directives in your code:

    !$OMP PARALLEL DO
    DO i = 1, n
       a(i) = b(i) + c(i)
    END DO
    !$OMP END PARALLEL DO

[Figure: master thread -> team of threads -> master thread]
Assume n=1000 & #proc=4: The loop will be split into 4 ‘threads’ that run in parallel with loop indices 1…250, 251…500, 501…750, 751…1000
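The index ranges quoted for n=1000 and 4 threads can be reproduced with a short Python sketch. It only mimics an even, contiguous block split; the helper name `static_chunks` is made up for illustration, and the real distribution depends on the schedule the OpenMP runtime applies.

```python
def static_chunks(n, nthreads):
    """Split loop indices 1..n into contiguous blocks, one per thread.

    Hypothetical helper: mimics an even block split. The actual OpenMP
    runtime may distribute iterations differently.
    """
    base, extra = divmod(n, nthreads)
    chunks = []
    start = 1
    for t in range(nthreads):
        # The first `extra` threads take one additional iteration.
        size = base + (1 if t < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


print(static_chunks(1000, 4))
```

When n is not divisible by the number of threads, some threads get one more iteration than others, which is one source of load imbalance.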

SCD’s machines • Bluesky (web page) • ‘Oldest’ machine at NCAR (2002) • Lots of user experience at NCAR, easy access to help • CAM/CCSM/WRF are set up for this architecture (Makefiles) • Batch queuing system LoadLeveler, short interactive runs possible • Batch queues are listed under http://www.cisl.ucar.edu/computers/bluesky/queue.charge.html • Lots of additional software available: e.g. math libraries, graphics packages, Totalview debugger

SCD’s machines • Bluevista (web page) • Newest machine on the floor (Jan. 2006) • CAM/CCSM/WRF are (probably) set up for this architecture • Batch queuing system LSF (Load Sharing Facility) • Queue names different from bluesky: premium, regular, economy, standby, debug, share http://www.cisl.ucar.edu/computers/bluevista/queue.charge.html • Some additional software available: e.g. math libraries, Totalview debugger

SCD’s machines • Lightning (web page) • Linux cluster • Compilers different from the IBMs: Portland Group or Pathscale • Batch queuing system LSF • Same queue names as bluevista • Some support software • Tempest (web page) • For data post-processing, with yet another batch queuing system (NQS) • Lots of support software • Interactive use possible

Example of a LoadLeveler job script
Parallel job with 32 MPI processes, com_rg32 queue (regular queue, 32-way node)

    #@ class            = com_rg32
    #@ node             = 1
    #@ tasks_per_node   = 32
    #@ output           = out.$(jobid)
    #@ error            = out.$(jobid)
    #@ job_type         = parallel
    #@ wall_clock_limit = 00:20:00
    #@ network.MPI      = csss,not_shared,us
    #@ node_usage       = not_shared
    #@ account_no       = 54042108
    #@ ja_report        = yes
    #@ queue
    …
    setenv OMP_NUM_THREADS 1
    …

Notes: class = com_rg32 selects the regular queue; tasks_per_node = 32 gives 32 MPI processes per 32-way node
Submit the job via: llsubmit job_script

Example of a LoadLeveler job script
Hybrid parallel job with 8 MPI processes and 4 OpenMP threads

    #@ class            = com_ec32
    #@ node             = 1
    #@ tasks_per_node   = 8
    #@ output           = out.$(jobid)
    #@ error            = out.$(jobid)
    #@ job_type         = parallel
    #@ wall_clock_limit = 00:20:00
    #@ network.MPI      = csss,not_shared,us
    #@ node_usage       = not_shared
    #@ account_no       = 54042108
    #@ ja_report        = yes
    #@ queue
    …
    setenv OMP_NUM_THREADS 4
    …

Notes: class = com_ec32 selects the economy queue; tasks_per_node = 8 gives 8 MPI processes per 32-way node
Submit the job via: llsubmit job_script
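Both LoadLeveler examples occupy one full 32-way node: 32 MPI tasks with 1 thread each, or 8 MPI tasks with 4 OpenMP threads each. A minimal Python sketch of that bookkeeping follows; the function names are illustrative only and are not part of LoadLeveler.

```python
def processors_used(nodes, tasks_per_node, omp_threads):
    """Total processors occupied by a hybrid MPI/OpenMP job."""
    return nodes * tasks_per_node * omp_threads


def fits_node(tasks_per_node, omp_threads, procs_per_node):
    """True if MPI tasks x OpenMP threads do not oversubscribe one node."""
    return tasks_per_node * omp_threads <= procs_per_node


# Pure MPI job: node = 1, tasks_per_node = 32, OMP_NUM_THREADS = 1
print(processors_used(1, 32, 1))
# Hybrid job: node = 1, tasks_per_node = 8, OMP_NUM_THREADS = 4
print(processors_used(1, 8, 4))
```

Checking the product before submitting helps avoid requesting a dedicated node and then leaving part of it idle.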

Example of an LSF job script (lightning)
Parallel job with 8 MPI processes (on 4 2-way nodes)

    #! /bin/csh
    ##
    #BSUB -a 'mpich_gm'
    #BSUB -P 54042108
    #BSUB -q regular
    #BSUB -W 00:30
    #BSUB -x
    #BSUB -n 8
    #BSUB -R "span[ptile=2]"
    #BSUB -o fvcore_amr.out.%J
    #BSUB -e fvcore_amr.err.%J
    #BSUB -J test0.path
    ##
    mpirun.lsf -v ./dycore

Notes:
    -a 'mpich_gm'         select on lightning
    -q regular            regular queue
    -W 00:30              wallclock limit 30 min
    -n 8                  8 MPI processes (total)
    -R "span[ptile=2]"    2 MPI processes per node
    -J test0.path         name of the job (listed in the SCD Portal)
Submit the job via: bsub < job_script

Example of an LSF job script (bluevista)
Parallel job with 8 MPI processes (on 1 8-way node)

    #! /bin/csh
    ##
    #BSUB -a poe
    #BSUB -P 54042108
    #BSUB -q economy
    #BSUB -W 00:30
    #BSUB -x
    #BSUB -n 8
    #BSUB -R "span[ptile=8]"
    #BSUB -o fvcore_amr.out.%J
    #BSUB -e fvcore_amr.err.%J
    #BSUB -J test0.path
    ##
    mpirun.lsf -v ./dycore

Notes:
    -a poe                select ‘poe’ on bluevista
    -q economy            economy queue
    -x                    exclusive use (not shared)
    -R "span[ptile=8]"    allows up to 8 MPI processes on a node
Submit the job via: bsub < job_script
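The node counts in the two LSF examples follow from the total task count (-n) and the tasks-per-node limit (ptile). A small Python sketch, with the hypothetical helper `lsf_nodes`, assuming the scheduler fills each node up to ptile tasks:

```python
import math


def lsf_nodes(n_tasks, ptile):
    """Nodes needed when at most `ptile` of `n_tasks` MPI tasks share a node."""
    return math.ceil(n_tasks / ptile)


print(lsf_nodes(8, 2))  # lightning example: -n 8, ptile=2
print(lsf_nodes(8, 8))  # bluevista example: -n 8, ptile=8
```

With exclusive use (-x), the job is charged for whole nodes, so the node count matters for GAU charges as well as for placement.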

More information on SCD’s machines • Web page: SCD’s Support and Consulting services • SCD’s customer support: sometimes you even get help on weekends or in the evenings • Email: consult1@ucar.edu • Phone: 303 497 1278 • Walk-in support at the Mesa Lab • Check out SCD’s Daily Bulletin (scheduled machine downtimes, etc.) • Subscribe to the hpcstatus mailing list (short e-mails about machine status, system updates)

GAU resources • ASP has a monthly allocation of 3850 GAUs (General Accounting Units) • A GAU is a measure of compute time on the supercomputers maintained by NCAR’s Scientific Computing Division (SCD): http://www.cisl.ucar.edu/ • Access to these machines requires • an SCD login account (dbs@ucar.edu or 303-497-1225) • a GAU account (for ASP: contact Maura, otherwise contact your division / apply for a university account) • ssh environment • and a crypto card (for secure access) • SCD contacts: Dick Valent & Mike Page (here today), Juli Rew, Siddhartha Gosh, Ginger Caldwell (GAUs)

GAU resources • GAUs: ‘use it or lose it’ strategy • In ASP: We share the resource among the ASP postdocs & graduate fellows • Distribution is flexible and will be discussed occasionally, e.g. monthly, either in meetings or by e-mail: asp-gau-users@asp.ucar.edu • GAUs are also charged for • storing files in the Mass Storage System (MSS) • file transfers from MSS to other machines

ASP GAU account • ASP GAU account number: 54042108 (also project number) • Needs to be specified in the batch job scripts • ASP account number is not your default account number • Therefore: everybody needs a second (default) GAU account: • divisional GAU account • so-called University account (small request form for 1500 GAUs http://www.cisl.ucar.edu/resources/compServ.html); these GAUs do not expire every month, one-time allocation • Second GAU account should be used for the accumulating MSS charges • automatic when using CAM / CCSM’s MSS option

GAU charges on SCD’s supercomputers • You are charged GAUs for how much time you use a processor (on bluesky, bluevista, lightning, tempest) • On bluesky, there are actually two formulas: • Shared-node usage: GAUs charged = CPU hours used × computer factor × class charging factor • Dedicated-node usage: GAUs charged = wallclock hours used × number of nodes used × number of processors in that node × computer factor × class charging factor
Slides on GAU charges: Modified from an earlier presentation by George Bryan, NCAR MMM

“Number of nodes used” and “Number of processors in that node” • Self explanatory (?) • Bluesky: • 76 8-way (processors) nodes • 25 32-way (processors) nodes • Bluevista: • 78 8-way (processors) nodes • Lightning: • 128 2-way (processors) nodes

“CPU hours used” and “Wallclock hours used” • Measure of how long you “used” a processor • NOTE: This includes all time you were allocated the use of a processor, whether you actually used it or not • Example: you used two 8-processor nodes on bluesky. The job started at 1:00 PM and finished at 2:30 PM. You are charged for 1.5 wallclock hours on each of the 16 processors

“Computer factor” • A measure of how powerful a computer is • Bluesky: 0.24 • Bluevista: 0.5 • Lightning: 0.34 • This “levels the playing field”

“Class charging factor” • Tied to queuing system: “How quickly do you want your results, and how much are you willing to pay for it?” • Current setting on all SCD supercomputers: • Premium = 1.5 (highest priority, fastest turnaround) • Regular = 1.0 • Economy = 0.5 • Standby = 0.1 (lowest priority, slow turnaround)

Example • Recall dedicated-node usage on bluesky • GAUs charged = wallclock hours used × number of nodes used × number of processors in that node × computer factor × class charging factor • 1.5 hours using two 8-processor nodes • Bluesky regular queue • GAUs used = 1.5 × 2 × 8 × 0.24 × 1.0 = 5.76 GAUs • In premium queue, this would be 8.64 GAUs • In standby queue, this would be 0.576 GAUs
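The two bluesky charging formulas and the worked example can be checked with a few lines of Python. The factors are the ones listed on the preceding slides; the function and dictionary names are chosen here for illustration only.

```python
# Factors as quoted on the "Computer factor" and "Class charging factor" slides.
COMPUTER_FACTOR = {"bluesky": 0.24, "bluevista": 0.5, "lightning": 0.34}
CLASS_FACTOR = {"premium": 1.5, "regular": 1.0, "economy": 0.5, "standby": 0.1}


def dedicated_node_gaus(wallclock_hours, nodes, procs_per_node, machine, queue):
    """Dedicated-node usage: every processor of every node is charged."""
    return (wallclock_hours * nodes * procs_per_node
            * COMPUTER_FACTOR[machine] * CLASS_FACTOR[queue])


def shared_node_gaus(cpu_hours, machine, queue):
    """Shared-node usage: only the CPU hours actually used are charged."""
    return cpu_hours * COMPUTER_FACTOR[machine] * CLASS_FACTOR[queue]


# 1.5 hours on two 8-processor bluesky nodes, in three different queues.
for queue in ("regular", "premium", "standby"):
    print(queue, round(dedicated_node_gaus(1.5, 2, 8, "bluesky", queue), 3))
```

The queue choice alone changes the cost of the same run by a factor of 15 between standby and premium.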

Recommendations: Queuing systems • Check the queue before you submit any job: • If the queue is not busy, try using the standby or economy queues • The queue tends to be “emptier” evenings, weekends, and holidays • Job will start sooner when specifying a wallclock limit in the job script (scheduler tries to ‘squeeze in’ short jobs) • The fewer processors you request, the sooner you start • Use the premium queue sparingly • Short debug jobs (there is also a special debug queue on lightning) • When that conference paper is due

Recommendations: # of processors vs. run times • If you are using more processors, you might wait longer in the queue, but usually the actual runtime of your job is reduced • Caveat: it usually costs more GAUs • Example: you run the same job with two different processor counts • Using 8 processors, the job ran in 24 hours • Using 64 processors, the job ran in 4 hours • 1st run used 46 GAUs • 2nd run used 61 GAUs
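The 46 and 61 GAUs are consistent with the dedicated-node formula on bluesky's regular queue (computer factor 0.24, class factor 1.0). That reading is an assumption, since the slide does not name the machine or queue. Under it, a short Python sketch shows both the cost and the parallel efficiency of the larger run relative to the smaller one:

```python
def gaus(procs, hours, computer_factor=0.24, class_factor=1.0):
    """GAUs for a dedicated-node run (defaults assume bluesky, regular queue)."""
    return procs * hours * computer_factor * class_factor


def relative_efficiency(p_small, t_small, p_large, t_large):
    """Speedup of the larger run divided by the increase in processors."""
    speedup = t_small / t_large
    return speedup / (p_large / p_small)


print(round(gaus(8, 24), 2), round(gaus(64, 4), 2))
print(relative_efficiency(8, 24, 64, 4))
```

Eight times the processors buy a sixfold speedup here, i.e. 75% relative efficiency, which is why the faster run costs about a third more GAUs.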

The Mass Storage System • MSS: Mass storage system (disks and cartridges) for your big data sets • MSS connected to the SCD machines, sometimes also to divisional computers • MSS users have directories like mss:/LOGIN_NAME/ • Quick online reference (mss commands): http://www.cisl.ucar.edu/docs/mss/mss-commandlist.html • You are charged GAUs for using the MSS • The GAU equation for MSS is more complicated ....

MSS Charges • GAUs charged = .0837 × R + .0012 × A + N × (.1195 × W + .2050 × S) • where: • R = Gigabytes read • W = Gigabytes created or written • A = Number of disk drive or tape cartridge accesses • S = Data stored, in gigabyte-years • N = Number of copies of file: 1 if economy reliability selected; 2 if standard reliability selected
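A minimal Python sketch of the MSS formula, using the coefficients from the slide. The function name and the sample numbers (100 GB written and kept for one year, 10 accesses) are hypothetical and serve only to show the order of magnitude.

```python
def mss_gaus(read_gb, written_gb, accesses, stored_gb_years, copies=1):
    """MSS charge: .0837*R + .0012*A + N*(.1195*W + .2050*S).

    copies (N) is 1 for economy reliability, 2 for standard reliability.
    """
    return (0.0837 * read_gb
            + 0.0012 * accesses
            + copies * (0.1195 * written_gb + 0.2050 * stored_gb_years))


# Hypothetical case: write 100 GB, keep it one year, 10 accesses, no reads.
print(round(mss_gaus(0, 100, 10, 100, copies=1), 3))
print(round(mss_gaus(0, 100, 10, 100, copies=2), 3))
```

In this sample case storage dominates the charge, and selecting standard reliability roughly doubles it.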

Recommendations: The MSS • MSS charges seem small, but they add up! • Examples: FY04 MSS usage • ACD: 24,000 of 60,000 GAUs • CGD: 94,500 of 181,000 GAUs • HAO: 22,000 of 122,000 GAUs • MMM: 34,000 of 139,000 GAUs • RAP: 32,000 of 35,000 GAUs

Recommendations: The MSS • Recommendation for ASP users: • use an account in your home division or your so-called ‘university’ account (1500 GAUs for postdocs, you need to apply) for MSS charges – leave ASP GAUs for supercomputing

GAU Usage Strategy: 30-day and 90-day averages • The allocation actually works through 30-day and 90-day averages • Limits: 120% for 30-day use, 105% for 90-day use • It is helpful to spread usage out evenly • How to check GAU usage: • Type “charges” on command line of a supercomputer • Check the “daily summary” output (next page) • SCD Portal: look for the link on SCD’s main page: http://www.cisl.ucar.edu/

Web page: http://www.cisl.ucar.edu/dbsg/dbs/ASP/

ASP 30 Day Percent = 57.0 %      ASP 90 Day Percent = 48.3 %
30 Day Allocation  = 3850        90 Day Allocation  = 11550
30 Day Use         = 2193        90 Day Use         = 5575

90 DAY ST -- 30 DAY ST -- LAST DAY
01-NOV-05    31-DEC-05    29-JAN-06

ASP Gaus Used by Day
01-NOV-05     9.36
03-NOV-05      .03
04-NOV-05   141.45
…
22-JAN-06      .04
23-JAN-06    44.29
24-JAN-06   170.83
25-JAN-06   120.30
26-JAN-06    91.67
27-JAN-06    41.97
28-JAN-06    15.59
29-JAN-06    16.95
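The percentages in the daily summary are simply use divided by allocation, and they can be compared with the 120% (30-day) and 105% (90-day) limits. A short Python sketch with illustrative function names:

```python
def usage_percent(used, allocation):
    """GAU use as a percentage of the allocation for the same period."""
    return 100.0 * used / allocation


def within_limits(pct_30, pct_90, limit_30=120.0, limit_90=105.0):
    """True if both averages are below the limits quoted on the slides."""
    return pct_30 <= limit_30 and pct_90 <= limit_90


pct_30 = usage_percent(2193, 3850)    # 30-day use / 30-day allocation
pct_90 = usage_percent(5575, 11550)   # 90-day use / 90-day allocation
print(round(pct_30, 1), round(pct_90, 1), within_limits(pct_30, pct_90))
```

Note that the 90-day allocation of 11550 GAUs is three times the monthly 3850 GAUs.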

What happens when we use too many GAUs? • Your jobs will be thrown into a very low priority: the dreaded hold queue • It will be hard to get work done • But, jobs will still run • ASP Users: You can use more than 3850 GAUs / month • Experience says, it’s better to use too many than not enough

What happens when we use too many/too few GAUs? Too many: • Recommendation: when the 30- and 90-day averages are running high, use the economy or standby queue ... conserve GAUs • But, don’t worry about going over Too few: • ASP’s allocation will be cut in the long run if the 3850 GAUs per month allocation is not used

How to catch up when behind • Be wasteful: • Use the premium queue • Use more processors than you need • Have fun: • Try something you always wanted to do, but never had the resources

How to conserve GAUs • Be frugal: • Use the economy and standby queues • Use fewer processors • Use divisional GAUs (if possible) or your ‘university’ GAU account

How to share & monitor GAUs in ASP • Communicate! • Occasionally, we (ASP postdocs) use the e-mail list asp-gau-users@asp.ucar.edu to announce a ‘busy’ GAU period • Keep watching the ASP GAU usage on the webpage http://www.cisl.ucar.edu/dbsg/dbs/ASP/ or in the SCD Portal • Look for the SCD Portal link on the SCD page: http://www.cisl.ucar.edu/

SCD Portal • Online tool that helps you monitor the GAU charges and the current machine status (e.g. batch queues), display can be customized • Information on the machine status requires a setup-command on roy.scd.ucar.edu via the crypto-card access, just enter ‘scdportalkey hostname’ (e.g. lightning) after logging on with the crypto-card • At this time (Jan/31/2006) the GAU charges on bluevista are not itemized: will be included in the next release in Spring 2006

Other IBM resources • Sources of information on the IBM machine bluesky (from the command line); batchview also works on bluevista & lightning • batchview for overview of jobs with their rankings • llq for list of all submitted jobs, no ranking • spinfo: queue limits, memory quotas on home file system and the temporary file system /ptmp • Useful IBM LoadLeveler keywords in the script: #@account_no=54042108 -> ASP account, #@ja_report=yes -> job report (see example on the next page) • Useful LoadLeveler commands: llsubmit script_file, llcancel job_id

Example: IBM Job Report • If selected, one email per job is sent to you at midnight • Output on the IBM machines, here blackforest (meanwhile decommissioned):

Job Accounting - Summary Report
===============================
Operating System                 : blackforest AIX51
User Name (ID)                   : cjablono (7568)
Group Name (ID)                  : ncar (100)
Account Name                     : 54042108
Job Name                         : bf0913en.26921
Job Sequence Number              : bf0913en.26921
Job Starts                       : 12/20/04 17:56:33
Job Ends                         : 12/20/04 23:26:34
Elapsed Time (Wall-Clock * #CPU) : 633632 s
Number of Nodes (not_shared)     : 8
Number of CPUs                   : 32
Number of Steps                  : 1

IBM Job Report (continued)

Charge Components
Wall-clock Time                     : 5:30:01
Wall-clock CPU hours                : 176.00889 hrs
Multiplier for com_ec Queue         : 0.50
Charge before Computer Factor       : 88.00444 GAUs
Multiplier for computer blackforest : 0.10
Charged against Allocation          : 8.80044 GAUs

Project GAUs Allocated              : 5000.00 GAUs
Project GAUs Used, as of 12/16/04   : 1889.20 GAUs
Division GAUs 30-Day Average        : 103.3%
Division GAUs 90-Day Average        : 58.6%
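The charge components in the job report can be retraced step by step from the wall-clock time, the number of CPUs and the two multipliers. A Python sketch (variable names are illustrative):

```python
# Values taken from the job report.
wall_clock_s = 5 * 3600 + 30 * 60 + 1   # 5:30:01 in seconds
n_cpus = 32
queue_multiplier = 0.50                 # com_ec queue
computer_multiplier = 0.10              # blackforest

elapsed_cpu_s = wall_clock_s * n_cpus            # "Wall-Clock * #CPU"
cpu_hours = elapsed_cpu_s / 3600.0               # wall-clock CPU hours
charge_before = cpu_hours * queue_multiplier     # before computer factor
charged = charge_before * computer_multiplier    # charged against allocation

print(elapsed_cpu_s)
print(round(cpu_hours, 5), round(charge_before, 5), round(charged, 5))
```

Repeating this arithmetic for your own reports is a quick way to confirm which queue and processor count a job was actually charged for.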

How to increase the efficiency • Get a feel for the GAUs for long jobs: benchmark the application on the target machine • Run a short but relevant test problem and measure the run time (wall clock time) via MPI commands (function MPI_WTIME) or UNIX timing commands like time or timex (output formats are shell-script dependent) • Vary the number of processors to assess the scaling • If the application scales poorly, avoid using a large number of processors (waste of GAUs), instead use a smaller number with numerous restarts • Make sure your job fits into the queue (finishes before the max. time is up) • Use compiler options, especially the optimization options • In case of programming problems: the Totalview debugger can save you days, weeks or even months; on the IBMs, compile your program with the compiler options: -g -qfullpath -d
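The benchmarking step can be sketched in Python: time a short test and extrapolate to the full run. Inside an MPI code the timing would use MPI_WTIME as noted above; here `time.perf_counter` stands in for it. The helper names and the assumption that run time grows linearly with the number of time steps are illustrative only.

```python
import time


def time_call(func, *args):
    """Wall-clock seconds taken by one call of func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def extrapolate_hours(test_seconds, test_steps, full_steps):
    """Estimated wall-clock hours of the full run, assuming linear scaling in steps."""
    return test_seconds * (full_steps / test_steps) / 3600.0


# Hypothetical numbers: a 100-step test took 90 s; the full run needs 24000 steps.
print(extrapolate_hours(90.0, 100, 24000))
# Timing a stand-in workload:
print(time_call(sum, range(100000)) >= 0.0)
```

The estimated hours can then be multiplied by the processor count and the charging factors to obtain a GAU estimate before the long run is submitted.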

Restarts • Restart files are important for long simulations • Queue limits are up to 6 wallclock hours (hard limit, job fails afterwards), then a restart becomes necessary • Get information on the queue limits (SCD web page) and select the job’s integration time accordingly • Restarts are built into CAM/CCSM/WRF and only need to be activated • Restarts for other user applications probably have to be programmed by the user
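For user applications without built-in restarts, the idea can be sketched in Python: each job segment reads the last saved state, advances the integration, and writes a new restart file before the queue limit is reached. The file layout, function names and the toy "model step" are assumptions for illustration, not the mechanism used in CAM/CCSM/WRF.

```python
import json
import math
import os


def segments_needed(total_hours, queue_limit_hours=6.0):
    """Number of job segments required to fit under the queue limit."""
    return math.ceil(total_hours / queue_limit_hours)


def run_segment(state_file, steps_per_segment, total_steps):
    """Advance a toy integration by one job segment, resuming from state_file."""
    if os.path.exists(state_file):
        with open(state_file) as f:
            state = json.load(f)          # restart: continue where we stopped
    else:
        state = {"step": 0, "value": 0.0}  # initial run
    last = min(state["step"] + steps_per_segment, total_steps)
    while state["step"] < last:
        state["step"] += 1
        state["value"] += 1.0              # stand-in for one model time step
    with open(state_file, "w") as f:
        json.dump(state, f)                # write the restart file
    return state


print(segments_needed(24))
```

Each batch job then calls `run_segment` once with a step count chosen so that the segment finishes safely within the wallclock limit.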

Questions ?
