
Using ITaP clusters for large scale statistical analysis with R


Presentation Transcript


  1. Using ITaP clusters for large scale statistical analysis with R Doug Crabill Purdue University

  2. Topics • Running multiple R jobs on departmental Linux servers serially, and in parallel • Cluster concepts and terms • Use cluster to run same R program many times • Use cluster to run same R program many times with different parameters • Running jobs on an entire cluster node

  3. Invoking R in a batch mode • R CMD BATCH vs Rscript • Rscript t.R > t.out • Rscript t.R > t.out & # run in background • nohup Rscript t.R > t.out & # run in background, even if you log out • Can launch several such jobs simultaneously with &, but best to stick with between 2 and 8 jobs per departmental server. • Invoking them manually doesn’t scale well for running dozens of jobs
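
Here t.R stands for whatever analysis you want to run; the t.R used later in this talk (slide 16) is a one-line simulation, so a minimal stand-in along those lines might be:

# t.R -- a self-contained script whose printed output is captured in t.out
# (any R program that writes its results to standard output works here)
summary(1 + rgeom(10^7, 1/1000))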

  4. Launching several R jobs serially • Create a file like “run.sh” that contains several Rscript invocations. The first job will run until it completes, then the second job will run, and so on. • Do NOT use “&” at the end of each line or it could crash the server.
Rscript t.R > tout.001
Rscript t.R > tout.002
Rscript t.R > tout.003
Rscript t.R > tout.004
Rscript t.R > tout.005
Rscript t.R > tout.006
Rscript t.R > tout.007
Rscript t.R > tout.008

  5. Creating the “run.sh” script programmatically • Extra cool points for creating the “run.sh” script using R:
> sprintf("Rscript t.R > tout.%03d", 1:8)
[1] "Rscript t.R > tout.001" "Rscript t.R > tout.002" "Rscript t.R > tout.003"
[4] "Rscript t.R > tout.004" "Rscript t.R > tout.005" "Rscript t.R > tout.006"
[7] "Rscript t.R > tout.007" "Rscript t.R > tout.008"
> write(sprintf("Rscript t.R > tout.%03d", 1:8), "run.sh")

  6. Invoking “run.sh” • sh run.sh # run every job in run.sh one at a time • nohup sh run.sh & # run every job in run.sh one at a time, and keep running even if you log out • nohup xargs -d '\n' -n1 -P4 sh -c < run.sh & # run every job in run.sh, keeping 4 jobs running simultaneously, and keep running even if you log out

  7. Supercomputers and clusters • Supercomputer = a collection of computer nodes in a cluster managed by a scheduler • Node = one computer in a cluster (dozens to hundreds) • Core = one CPU core in a node (often 8 to 48 cores per node) • Front end = one or more computers used for launching jobs on the cluster • PBS / Torque is the scheduling software. PBS is like a maître d’, seating groups of various sizes for varying times at available tables, with a waiting list, a bar, reservations, and bad customers that spoil the party

  8. ITaP / RCAC clusters • Conte – was the fastest supercomputer on any academic campus in the world when built in June 2013. Has Intel Phi coprocessors • Carter – has NVIDIA GPU-accelerated nodes • Hansen, Rossmann, Coates • Scholar – uses part of Carter, for instructional use • Hathi – Hadoop • Radon – accessible by all researchers on campus • Anecdotes… <cue patriotic music>

  9. More info on the radon cluster • https://www.rcac.purdue.edu/ then select Computation->Radon • Read the User’s Guide in the left sidebar

  10. Logging into radon • Get an account on the RCAC website previously mentioned, or ask me • Make an SSH connection to radon.rcac.purdue.edu • From Linux (or a Mac terminal), type this to log into one of the cluster front ends (as user dgc): • ssh -X radon.rcac.purdue.edu -l dgc • Do not run big jobs on the front ends! They are only for submitting jobs to the cluster and for light testing and debugging

  11. File storage on radon • Home directory quota is ~10GB (type “myquota”) • Can be increased to 100GB via Boiler Backpack settings at http://www.purdue.edu/boilerbackpack • Scratch storage of around 1TB per user. This directory differs per user, and its path is held in the $RCAC_SCRATCH environment variable: • All nodes can see all files in home and scratch
radon-fe01 ~ $ cd $RCAC_SCRATCH
radon-fe01 /scratch/radon/d/dgc $
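
Because $RCAC_SCRATCH is an ordinary environment variable, an R job can pick it up with Sys.getenv() and send large output to scratch instead of the quota-limited home directory. A minimal sketch (the file name and the computation are hypothetical stand-ins):

scratch <- Sys.getenv("RCAC_SCRATCH")              # e.g. /scratch/radon/d/dgc
outfile <- file.path(scratch, "sim-results.rds")   # hypothetical output file
res <- summary(1 + rgeom(10^5, 1/1000))            # small stand-in computation
saveRDS(res, outfile)                              # large results go to scratch, not $HOME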

  12. Software on radon • The list of applications installed on radon can be found in the user’s guide previously mentioned • The module command is used to “load” software packages for use by the current login session • module avail # See the list of available applications • module load r # Add “R” • The module load r command must be included as part of every R job run on the cluster

  13. PBS scheduler commands • qstat # See list of jobs in the queue • qstat -u dgc # See list of jobs in the queue submitted by dgc • qsub jobname.sh # Submit jobname.sh to run on the cluster • qdel JOBIDNUMBER # delete a previously submitted job from the queue

  14. Simple qsub submission file • qsub accepts options on the command line, or as embedded #PBS comments that are ignored by the shell but honored by qsub. • The JOBID of this particular job is 683369
radon-fe01 ~/cluster $ cat myjob.sh
#!/bin/sh -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
/bin/hostname
radon-fe01 ~/cluster $ qsub myjob.sh
683369.radon-adm.rcac.purdue.edu

  15. Viewing status and the results • Use qstat or qstat -u dgc to check job status • Output of job 683369 goes to myjob.sh.o683369 • Errors from job 683369 go to myjob.sh.e683369 • It is inconvenient to collect results from a dynamically named file like myjob.sh.o683369. Best to write output to a filename of your choosing, either directly in your R program or by redirecting output to a file in your job submission file (a sketch follows below)
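
One way to write to a filename of your choosing from inside the R program itself is to pass the desired output name on the command line, the same commandArgs() mechanism the t2.R example on slide 19 uses for its numeric arguments. A minimal sketch (script name, file names, and computation are hypothetical):

# chosen-output.R -- run as:  Rscript chosen-output.R out.042
args <- commandArgs(TRUE)
outfile <- if (length(args) >= 1) args[1] else "out.default"
res <- mean(1 + rgeom(10^5, 1/1000))   # stand-in computation
write(res, outfile)                    # results land in a predictably named file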

  16. Our first R job submission • Say we want to run the R program t.R on radon, where t.R contains: summary(1 + rgeom(10^7, 1/1000)) • Create R1.sh with contents:
#!/bin/sh -l
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:10:00
cd $PBS_O_WORKDIR
module add r
Rscript t.R > out1
• Submit using qsub R1.sh

  17. Let’s do that 100 times • Using our “R1.sh” file as a template, create files prog001.sh through prog100.sh, changing the output file for each job to out.NNN. In R:
s <- scan("R1.sh", what='c', sep="\n")
sapply(1:100, function(i) { s[6] = sprintf("Rscript t.R > out.%03d", i); write(s, sprintf("prog%03d.sh", i)); })
write(sprintf("qsub prog%03d.sh", 1:100), "runall.sh")
• Submit all 100 jobs by typing sh -x runall.sh • Generating the files using bash instead (all on one line):
for i in `seq -w 1 100`; do (head -5 R1.sh; echo "Rscript t.R > out.$i") > prog$i.sh; echo "qsub prog$i.sh"; done > runall.sh
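
Before running sh -x runall.sh, a quick sanity check in R confirms the generated driver script contains exactly what the sprintf() call above produces:

head(readLines("runall.sh"), 3)
# [1] "qsub prog001.sh" "qsub prog002.sh" "qsub prog003.sh"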

  18. Coupon collector problem • I want to solve the coupon collector problem with large parameters, but it will take much too long on a single computer (around 2.5 days): sum(sapply(1:10000, function(y) {mean(1 + rgeom(10^8, y/10000))})) • The obvious approach is to break it into 10,000 smaller R jobs and submit them to the cluster. • Better to break it into 250 jobs, each operating on 40 numbers (see the chunking sketch below). • Create an R script that accepts command line arguments so it can process many numbers at a time. Estimate walltime carefully!
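
The split into 250 jobs of 40 consecutive parameters each can be sanity-checked in R before any submission scripts are generated; a minimal sketch:

# Partition the 10,000 parameter values into 250 consecutive blocks of 40
chunks <- split(1:10000, ceiling(seq_along(1:10000) / 40))
length(chunks)   # 250 jobs
chunks[[1]]      # 1 ... 40       -> the arguments for prog001.sh
chunks[[250]]    # 9961 ... 10000 -> the arguments for prog250.sh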

  19. Coupon collector R code • t2.R reads its arguments into “args”, then processes each:
args <- commandArgs(TRUE)
sapply(as.integer(args), function(y) {mean(1 + rgeom(10^8, y/10000))})
• Can test via: Rscript t2.R 100 125 200 # Change reps from 10^8 to 10^5 for testing
• Generate 250 scripts with 40 arguments each:
s <- scan("R2.sh", what='c', sep="\n")
sapply(1:250, function(y) { s[6] = sprintf("Rscript t2.R %s > out.%03d", paste((y*40-39):(y*40), collapse=" "), y); write(s, sprintf("prog%03d.sh", y)); })
write(sprintf("qsub prog%03d.sh", 1:250), "runall.sh")

  20. Coupon collector results • Output is in the files out.001 through out.250:
radon-fe00 ~/cluster/R2done $ cat out.001
[1] 9999.8856 5000.4830 3333.0443 2499.8564 1999.7819 1666.2517 1428.6594
[8] 1249.9841 1110.9790 1000.0408 909.1430 833.3409 769.1818 714.2486
[15] 666.6413 624.9357 588.3044 555.5487 526.3795 500.0021 476.2695
[22] 454.5702 434.7949 416.6470 399.9255 384.5739 370.3412 357.1366
[29] 344.8375 333.2978 322.5507 312.5258 303.0307 294.1573 285.7368
[36] 277.8168 270.2709 263.1612 256.3872 249.9905
• It’s hard to read 250 files with that stupid leading column. UNIX tricks to the rescue!
sum(scan(pipe("cat out* | colrm 1 5"))) # works for small indexes only
sum(scan(pipe("cat out* | sed -e 's/.*]//'"))) # works for all index sizes
• Cha-ching!
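
An R-only alternative to the shell pipes (not from the slides) is to strip the leading index column with a regular expression and scan what remains, assuming the 250 files all match the pattern out.NNN in the working directory:

# Read every out.NNN file, drop the "[k]" index column, and total the values
files <- list.files(pattern = "^out\\.[0-9]+$")
vals <- unlist(lapply(files, function(f) {
  scan(text = gsub("\\[[0-9]+\\]", "", readLines(f)), quiet = TRUE)
}))
sum(vals)   # same quantity as the one-liners above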

  21. Using all cores on a single node • When your job runs on a single core of a node shared with strangers’ jobs, some of those jobs may misbehave and use too much RAM or CPU. The solution is to request entire nodes and fill them with just your jobs, so you never share a node with anyone else. • The job submission file should include: #PBS -l nodes=1:ppn=8 • This forces PBS to schedule a node exclusively for you. If you then run a single R job, you are using just one core! Use xargs or a similar trick to launch 8 simultaneous R jobs, and submit only 1/8th as many jobs.
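
An alternative to launching eight separate Rscript processes (not shown in these slides) is a single R job that spreads its own work across the node’s cores with parallel::mclapply; the workload below is a hypothetical stand-in:

# One R process using all 8 cores requested by ppn=8
library(parallel)
res <- mclapply(1:8, function(i) {
  mean(1 + rgeom(10^5, 1/1000))   # stand-in for one job's worth of work
}, mc.cores = 8)
unlist(res)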

  22. All cores example one
#!/bin/sh -l
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
module add r
Rscript t3.R >out1 &
Rscript t3.R >out2 &
Rscript t3.R >out3 &
Rscript t3.R >out4 &
Rscript t3.R >out5 &
Rscript t3.R >out6 &
Rscript t3.R >out7 &
Rscript t3.R >out8 &
wait

  23. All cores example two
#!/bin/sh -l
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:30:00
cd $PBS_O_WORKDIR
module add r
xargs -d '\n' -n1 -P8 sh -c < batch1.sh
• Where batch1.sh contains (could be > 8 lines!):
Rscript t3.R >out1
Rscript t3.R >out2
Rscript t3.R >out3
Rscript t3.R >out4
Rscript t3.R >out5
Rscript t3.R >out6
Rscript t3.R >out7
Rscript t3.R >out8

  24. Thanks! • Thanks to Prof. Mark Daniel Ward for all his help with the examples used in this talk! • URLs for these notes: • http://www.stat.purdue.edu/~dgc/cluster.pptx • http://www.stat.purdue.edu/~dgc/cluster.pdf (copy and paste works poorly with PDF!)
