Introduction to Supercomputing at ARSC Kate Hedstrom, Arctic Region Supercomputing Center (ARSC) kate@arsc.edu Jan, 2004
Topics • Introduction to Supercomputers at ARSC • Computers • Accounts • Getting an account • Kerberos • Getting help • Architectures of parallel computers • Programming models • Running Jobs • Compilers • Storage • Interactive and batch
Introduction to ARSC Supercomputers • They’re all Parallel Computers • Three Classes: • Shared Memory • Distributed Memory • Distributed & Shared Memory
Cray X1: klondike • 128 MSPs • 4 MSP/node • 4 Vector CPU/MSP, 800 MHz • 512 GB Total • 21 TB Disk • 1600 GFLOPS peak • NAC required
Cray SX-6: rime • 8 NEC vector CPUs, 500 MHz • 64 GB of shared memory • 1 TB RAID-5 Disk • 64 GFLOPS peak • Only one in the USA • On loan from Cray • Non-NAC
Cray SV1ex: chilkoot • 32 Vector CPUs, 500 MHz • 32 GB Shared memory • 2 TB Disk • 64 GFLOPS peak • NAC required
Cray T3E: yukon • 272 CPUs, 450 MHz • 256 MB per processor • 69.6 GB total distributed memory • 230 GFLOPS peak • NAC required
IBM Power4: iceberg • 2 nodes of 32-way p690+, 1.7 GHz (2 cabinets), 256 GB memory each • 92 nodes of 8-way p655+, 1.5 GHz (6 cabinets) • 6 nodes of 8-way p655, 1.1 GHz (1 cabinet) • 16 GB memory per p655 node • 22 TB Disk • 5000 GFLOPS peak • NAC required
IBM Regatta: iceflyer • 8-way, 16 GB front end coming soon • 32 × 1.7 GHz Power4 CPUs, currently split into: • 24-way SMP node • 7-way interactive node • 1 test node • 32-way SMP node soon • 256 GB Memory • 217 GFLOPS • Non-NAC
IBM SP Power3: icehawk • 50 4-Way SMP Nodes => 200 CPUs, 375 MHz • 2 GB Memory/Node • 36 GB Disk/Node • 264 GFLOPS peak for 176 CPUs (max per job) • Leaving soon • NAC required
Storing Files • Robotic tape silos • Two Sun storage servers • Nanook • Non-NAC systems • Seawolf • NAC systems
Accounts, Logging In • Getting an Account/Project • Doing a NAC • Logging in with Kerberos
Getting an Account/Project • Academic: the applicant for resources is a PI • Full-time faculty or staff research person • Non-commercial work; must reside in the USA • PI may add users to their project • http://www.arsc.edu/support/accounts/acquire.html • DoD Applicant • http://www.hpcmo.hpc.mil/Htdocs/SAAA • Commercial, Federal, State • Contact the User Services Director • Barbara Horner-Miller, horner@arsc.edu • Academic guidelines apply
Doing a National Agency Check (NAC) • Required for HPCMO resources only • Not required for workstations, the Cray SX-6, or the IBM Regatta • Not a security clearance • But there are detailed questions covering the last 5-7 years • Electronic Personnel Security Questionnaire (EPSQ) • Windows-only software • Fill out the EPSQ cover sheet • http://www.arsc.edu/support/policy/pdf/OPM_Cover.pdf • Fingerprinting, proof of citizenship (passport, visa, etc.) • See http://www.arsc.edu/support/policy/accesspolicy.html
Logging in with Kerberos • On non-ARSC systems, download the Kerberos 5 clients • http://www.arsc.edu/support/howtos/krbclients.html • Used with SecureID • Uses a PIN to generate a key at login time • Login requires user name, pass phrase, & key • Don't share your PIN or SecureID with anyone • Foreign nationals or others with problems • Contact ARSC to use ssh to connect to the ARSC gateway • Still need Kerberos & SecureID after connecting
From ARSC System • Enter username • Enter <return> for principal • Enter pass phrase • Enter SecureID passcode • From that system: ssh iceflyer • ssh handles X11 handshaking
From Your System • Get the Kerberos clients installed • Get a ticket: kinit username@ARSC.EDU • See your tickets: klist • Log into an ARSC system: krlogin -l username iceflyer, ssh -l username iceflyer, or ktelnet -l username iceflyer
Rime and Rimegate • Log into rimegate as usual, with your rimegate username (arscxxx): ssh -l arscxxx rimegate • Compile on rimegate (sxf90, sxc++) • Log into rime from rimegate: ssh rime • Your rimegate $HOME is /rimegate/users/username on rime
Supercomputer Architectures • They’re all Parallel Computers • Three Classes: • Shared Memory • Distributed Memory • Distributed & Shared Memory
Cluster Architecture: IBM iceberg, icehawk, Cray X1 • Scalable, distributed, shared-memory parallel processor
Programming Models • Vector Processing • compiler detection or manual directives • Threaded Processing (SMP) • OpenMP, Pthreads, java threads • shared memory only • Distributed Processing (MPP) • message passing with MPI • shared or distributed memory
Vector Programming • Vector CPUs are specialized for array/matrix operations • 64-element (SV1, X1) and 256-element (SX-6) vector registers • Operations proceed in assembly-line (pipelined) fashion • High memory-to-CPU bandwidth • Less CPU time wasted waiting for data from memory • Once the pipeline is full, it produces one result per clock cycle • Compiler does a lot of the work
Vector Programming • Codes will run without modification • Cray compilers automatically detect loops that are safe to vectorize • Request a listing file to find out what vectorized (see the sketch below) • Programmer can assist the compiler: • Directives and pragmas can force vectorization • Eliminate conditions that inhibit vectorization (e.g., subroutine calls and data dependencies in loops)
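A minimal sketch of checking what vectorized (the -rm loopmark-listing flag and the .lst file name are recalled from the Cray compilers and may vary by release; model.f90 is a made-up file name):

f90 -rm -o model model.f90    # SV1: compile and write a loopmark listing (model.lst)
ftn -rm -o model model.f90    # X1: same idea with the ftn driver
grep -i vector model.lst      # scan the listing for loops that vectorized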
Threaded Programming on Shared-Memory Systems • OpenMP • Directives/pragmas added to serial programs • A portable standard implemented on Cray (one node), SGI, IBM (one node), etc... • Other Threaded Paradigms • Java Threads • Pthreads
OpenMP Fortran Example
!$omp parallel do
do n = 1,10000
   A(n) = x * B(n) + c
end do
___________________________________________________
On 2 CPUs, this directive divides the work as follows:
CPU 1:
   do n = 1,5000
      A(n) = x * B(n) + c
   end do
CPU 2:
   do n = 5001,10000
      A(n) = x * B(n) + c
   end do
OpenMP C Example
#pragma omp parallel for
for (n = 0; n < 10000; n++)
   A[n] = x * B[n] + c;
___________________________________________________
On 2 CPUs, this pragma divides the work as follows:
CPU 1:
   for (n = 0; n < 5000; n++)
      A[n] = x * B[n] + c;
CPU 2:
   for (n = 5000; n < 10000; n++)
      A[n] = x * B[n] + c;
Threads Dynamically Appear and Disappear • Number of threads set by the environment (OMP_NUM_THREADS)
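As an illustration, a hedged sketch of building and running the OpenMP loop above on the IBM systems, letting the environment set the thread count (saxpy.f90 is a made-up file name; the _r compiler and -qsmp=omp flag are the ones listed under IBM Compilers later):

xlf90_r -qsmp=omp -o saxpy saxpy.f90    # thread-safe compiler plus the OpenMP flag
export OMP_NUM_THREADS=8                # the environment sets the number of threads
./saxpy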
Distributed Processing • Concept: 1) Divide the problem explicitly 2) CPUs perform tasks concurrently 3) Recombine results 4) All processors may or may not be doing the same thing • (Diagram credit: Branimir Gjetvaj)
Distributed Processing • Data needed by a given CPU must be stored in the memory associated with that CPU • Performed on distributed or shared memory computer • Multiple copies of code are running • Messages/data are passed between CPUs • Multi-level: can be combined with vector and/or OpenMP
Distributed Processing using MPI (Fortran) • Initialization • Simple send/receive

call mpi_init(ierror)
call mpi_comm_size(MPI_COMM_WORLD, npes, ierror)
call mpi_comm_rank(MPI_COMM_WORLD, my_rank, ierror)

! Processor 0 sends individual messages to the others
if (my_rank == 0) then
   do dest = 1, npes-1
      call mpi_send(x, max_size, MPI_REAL, dest, 0, MPI_COMM_WORLD, ierror)
   end do
else
   call mpi_recv(x, max_size, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierror)
end if
Distributed Processing using MPI (C) • Initialization • Simple send/receive

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

/* Processor 0 sends individual messages to the others */
if (my_rank == 0) {
   for (dest = 1; dest < npes; dest++) {
      MPI_Send(x, max_size, MPI_FLOAT, dest, 0, MPI_COMM_WORLD);
   }
} else {
   MPI_Recv(x, max_size, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
}
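As a rough sketch, the C example might be compiled and run interactively on the IBM systems as follows (send_recv.c is a made-up file name; mpcc and poe appear on the compiler and batch slides, and -procs is the usual POE option for the task count):

mpcc -o send_recv send_recv.c    # MPI wrapper compiler
poe ./send_recv -procs 4         # run 4 MPI tasks under POE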
Cluster Programming • Shared-memory between processors on one node: • OpenMP, threads, or MPI • Distributed-memory methods between processors on multiple nodes • MPI • Mixed mode • MPI distributes to nodes, OpenMP within node
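A hedged sketch of mixed mode on the IBM: compile with both the MPI wrapper and the OpenMP flag, then give each MPI task several threads (hybrid.f90 and the task/thread counts are only for illustration):

mpxlf90_r -qsmp=omp -o hybrid hybrid.f90    # MPI between nodes, OpenMP within a node
export OMP_NUM_THREADS=8                    # 8 OpenMP threads per MPI task
poe ./hybrid -procs 4                       # 4 MPI tasks, e.g. one per node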
Programming Environments • Compilers • File Systems • Running jobs • Interactive • Batch • See individual machine documentation • http://www.arsc.edu/support/resources/hardware.html
Cray Compilers • SV1, T3E • f90, cc, CC • X1 • ftn, cc, CC • SX-6 front end (rimegate) • sxf90, sxc++ • SX-6 (rime) • f90, cc, c++ • No extra flags for MPI, OpenMP
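For instance (model.f90 is a made-up file name), the same plain invocation serves for serial, MPI, and OpenMP code on the Crays, since no extra flags are needed:

f90 -o model model.f90      # SV1, T3E, and on rime itself
ftn -o model model.f90      # X1
sxf90 -o model model.f90    # cross-compile on rimegate for the SX-6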
IBM Compilers • Serial • xlf, xlf90, xlf95, xlc, xlC • OpenMP • Add -qsmp=omp, _r extension for thread-safe libraries, e.g. xlf_r • MPI • mpxlf, mpxlf90, mpxlf95, mpcc, mpCC • Might be best to always use _r extension (mpxlf90_r)
File Systems • Local storage • $HOME • /tmp or /wrktmp or /wrkdir -> $WRKDIR • /scratch -> $SCRATCH • Permanent storage • $ARCHIVE • Quotas • quota -v on Cray • qcheck on IBM
Running a job • Get files from $ARCHIVE to system’s disk • Keep source in $HOME, but run in $WRKDIR • Use $SCRATCH for local-to-node temporary files, clean up before job ends • Put results out to $ARCHIVE • $WRKDIR is purged
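A minimal sketch of that workflow inside a job script (the directory and file names are made up; the environment variables are the ones ARSC defines):

cd $WRKDIR/mycase                  # run in the work directory, not $HOME
cp $ARCHIVE/mycase/input.dat .     # stage input in from permanent storage
$HOME/mycase/my_job                # executable kept with the source in $HOME
cp output.dat $ARCHIVE/mycase/     # save results before $WRKDIR is purged
rm -f $SCRATCH/mycase_tmp.*        # clean up node-local scratch files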
Iceflyer Filesystems • Smallish $HOME • Larger /wrkdir/username • $ARCHIVE for long-term storage, especially larger files • qcheck to check quotas
SX6 Filesystems • Separate from the rest of the ARSC systems • Rimegate has /home and /scratch • Rime mounts them as /rimegate/home and /rimegate/scratch • Rime has its own /home, /tmp, /atmp, etc.
Interactive • Works on the command line • Limits exist on resources (time, # cpus, memory) • Good for debugging • Larger jobs must be submitted to the batch system
Batch Schedulers • Cray: NQS • Commands: • qsub, qstat, qdel • IBM: LoadLeveler • Commands: • llclass, llq, llsubmit, llcancel, llmap, xloadl
NQS Script (rime)
#@$-q batch        # job queue class
#@$-s /bin/ksh     # which shell
#@$-eo             # stdout and stderr together
#@$-lM 100MW       # memory limit (100 MW)
#@$-lT 30:00       # time requested h:m:s
#@$-c 8            # 8 cpus
#@$                # required last command
# beginning of shell script
cd $QSUB_WORKDIR   # cd to submission directory
export F_PROGINF=DETAIL
export OMP_NUM_THREADS=8
./my_job
NQS Commands • qstat to find out job status, list of queues • qsub to submit job • qdel to delete job from queue
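Typical usage might look like this (my_job.nqs and the request id are placeholders):

qsub my_job.nqs    # submit the script; NQS prints a request id
qstat              # check job status and the list of queues
qdel 1234.rime     # delete the job with that request id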
LoadLeveler Script (iceflyer)
#!/bin/ksh
#@ total_tasks = 4
#@ node_usage = shared
#@ wall_clock_limit = 1:00:00
#@ job_type = parallel
#@ output = out.$(jobid)
#@ error = err.$(jobid)
#@ class = large
#@ notification = error
#@ queue

poe ./my_job
LoadLeveler Commands • llclass to find the list of classes • llq to see the list of jobs in the queue • llsubmit to submit a job • llcancel to delete a job from the queue • llmap is a local program to see the load on the machine • xloadl is an X11 interface to LoadLeveler
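Typical usage might be (my_job.ll and the job id are placeholders):

llclass                     # list the available classes
llsubmit my_job.ll          # submit; LoadLeveler prints a job id
llq -u $USER                # list your jobs in the queue
llcancel iceflyer.1234.0    # cancel that job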