Advanced High Performance Computing Workshop HPC 201. Dr. Charles J Antonelli LSAIT RSG October 30, 2013. Roadmap. Flux review More advanced troubleshooting Array & dependent scheduling Graphical output GPUs on Flux Scientific applications R, Python, MATLAB Parallel programming
Advanced High Performance Computing Workshop
HPC 201
Dr. Charles J Antonelli
LSAIT RSG
October 30, 2013

Roadmap
• Flux review
• More advanced troubleshooting
• Array & dependent scheduling
• Graphical output
• GPUs on Flux
• Scientific applications
  • R, Python, MATLAB
• Parallel programming
• Debugging & tracing
cja 2013

Flux review
The Flux cluster
[Diagram: login nodes, compute nodes, data transfer node, storage, …]

A Flux node
[Diagram: 12-40 Intel cores, 48 GB to 1 TB RAM, local disk, 8 GPUs (optional)]

Programming Models
• Two basic parallel programming models
• Message-passing
  The application consists of several processes running on different nodes and communicating with each other over the network
  • Used when the data are too large to fit on a single node, and simple synchronization is adequate
  • “Coarse parallelism”
  • Implemented using MPI (Message Passing Interface) libraries
• Multi-threaded
  The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives
  • Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable
  • “Fine-grained parallelism” or “shared-memory parallelism”
  • Implemented using OpenMP (Open Multi-Processing) compilers and libraries
• Both

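The contrast between the two models can be sketched on a single machine. The example below is only an analogy, written with the Python standard library: it uses neither MPI nor OpenMP, and the function names (message_passing_sum, shared_memory_sum) and the worker program are made up for illustration. The first function sends each chunk of data to a separate process over a pipe and reads back a reply; the second lets threads inside one process add to a shared total, guarded by a lock.

```python
import subprocess
import sys
import threading

# Worker program run in each child process: read integers from stdin and
# write their sum to stdout. It shares no memory with the parent process.
WORKER_SOURCE = "import sys; print(sum(int(tok) for tok in sys.stdin.read().split()))"


def message_passing_sum(values, nworkers=2):
    """Sum values by sending chunks to separate processes over pipes."""
    chunks = [values[i::nworkers] for i in range(nworkers)]
    workers = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in chunks
    ]
    total = 0
    for worker, chunk in zip(workers, chunks):
        # Send this worker its chunk as a message, then read its reply.
        reply, _ = worker.communicate(" ".join(str(v) for v in chunk))
        total += int(reply)
    return total


def shared_memory_sum(values, nthreads=2):
    """Sum values with threads that update one shared total under a lock."""
    total = 0
    lock = threading.Lock()

    def add_chunk(chunk):
        nonlocal total
        partial = sum(chunk)
        with lock:  # synchronization primitive guarding the shared variable
            total += partial

    threads = [
        threading.Thread(target=add_chunk, args=(values[i::nthreads],))
        for i in range(nthreads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    data = list(range(1, 101))
    print(message_passing_sum(data), shared_memory_sum(data))
```

In the message-passing version every byte of data crosses a process boundary, which is the overhead the slide refers to; in the threaded version the data stay in place and only the update of the shared total needs synchronization.
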
Using Flux
• Three basic requirements:
  A Flux login account
  A Flux allocation
  An MToken (or a Software Token)
• Logging in to Flux
  ssh flux-login.engin.umich.edu
  Campus wired or MWireless
  VPN
  ssh login.itd.umich.edu first

Cluster batch workflow
• You create a batch script and submit it to PBS
• PBS schedules your job, and it enters the flux queue
• When its turn arrives, your job will execute the batch script
• Your script has access to all Flux applications and data
• When your script completes, anything it sent to standard output and standard error is saved in files stored in your submission directory
• You can ask that email be sent to you when your job starts, ends, or aborts
• You can check on the status of your job at any time, or delete it if it’s not doing what you want
• A short time after your job completes, it disappears from PBS

Loosely-coupled batch script
    #PBS -N yourjobname
    #PBS -V
    #PBS -A youralloc_flux
    #PBS -l qos=flux
    #PBS -q flux
    #PBS -l procs=12,walltime=00:05:00
    #PBS -M youremailaddress
    #PBS -m abe
    #PBS -j oe
    #Your Code Goes Below:
    cat $PBS_NODEFILE
    cd $PBS_O_WORKDIR
    mpirun ./c_ex01

Tightly-coupled batch script
    #PBS -N yourjobname
    #PBS -V
    #PBS -A youralloc_flux
    #PBS -l qos=flux
    #PBS -q flux
    #PBS -l nodes=1:ppn=12,walltime=00:05:00
    #PBS -M youremailaddress
    #PBS -m abe
    #PBS -j oe
    #Your Code Goes Below:
    cd $PBS_O_WORKDIR
    matlab -nodisplay -r script

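The two batch scripts above differ only in the resource request (procs=12 versus nodes=1:ppn=12) and in the command they run. One way to make that explicit is to generate both from a single template. The sketch below is a hypothetical helper, not part of PBS or Flux: render_pbs_script and its parameters are invented here, and it simply reproduces the directives shown on the slides.

```python
def render_pbs_script(job_name, allocation, resources, commands, email=None):
    """Return the text of a PBS batch script using the directives from the slides.

    resources is the value of the resource request, for example
    "procs=12,walltime=00:05:00" or "nodes=1:ppn=12,walltime=00:05:00".
    """
    lines = [
        f"#PBS -N {job_name}",
        "#PBS -V",
        f"#PBS -A {allocation}",
        "#PBS -l qos=flux",
        "#PBS -q flux",
        f"#PBS -l {resources}",
    ]
    if email:
        # Mail on abort, begin, and end, as in the slide scripts.
        lines += [f"#PBS -M {email}", "#PBS -m abe"]
    lines.append("#PBS -j oe")
    # Run from the directory the job was submitted from.
    lines.append("cd $PBS_O_WORKDIR")
    lines.extend(commands)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Loosely coupled: 12 cores anywhere in the cluster.
    print(render_pbs_script("yourjobname", "youralloc_flux",
                            "procs=12,walltime=00:05:00",
                            ["mpirun ./c_ex01"]))
    # Tightly coupled: 12 cores on one node.
    print(render_pbs_script("yourjobname", "youralloc_flux",
                            "nodes=1:ppn=12,walltime=00:05:00",
                            ["matlab -nodisplay -r script"]))
```

The generated text would still be saved to a file and submitted with qsub in the usual way.
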
Copying data
Three ways to copy data to/from Flux
• Use scp from login server:
  scp flux-login.engin.umich.edu:hpc201/example563.png .
• Use scp from transfer host:
  scp flux-xfer.engin.umich.edu:hpc201/example563.png .
• Use Globus Connect

Globus Online
• Features
  • High-speed data transfer, much faster than SCP or SFTP
  • Reliable & persistent
  • Minimal client software: Mac OS X, Linux, Windows
• GridFTP Endpoints
  • Gateways through which data flow
  • Exist for XSEDE, OSG, …
  • UMich: umich#flux, umich#nyx
  • Add your own client endpoint!
  • Add your own server endpoint: contact flux-support@umich.edu
• More information
  • http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp

Job Arrays
• Submit copies of identical jobs
• Invoked via qsub -t:
    qsub -t array-spec pbsbatch.txt
  where array-spec can be
    m-n
    a,b,c
    m-n%slotlimit
  e.g.
    qsub -t 1-50%10
  Fifty jobs, numbered 1 through 50, of which only ten can run simultaneously
• $PBS_ARRAYID records array identifier

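The array-spec forms and the role of $PBS_ARRAYID can be illustrated with a short sketch. It is an approximation for illustration only, not Torque's own parser: expand_array_spec and task_input are hypothetical names, and the input file naming pattern is an assumption. The first function shows which job numbers a spec such as 1-50%10 stands for; the second shows how a program run inside an array job might use its array identifier to choose its own input.

```python
import os


def expand_array_spec(spec):
    """Return (ids, slot_limit) for specs like '1-50', '1,3,7', or '1-50%10'.

    slot_limit is None when the spec has no %slotlimit suffix.
    """
    slot_limit = None
    if "%" in spec:
        spec, limit = spec.split("%", 1)
        slot_limit = int(limit)
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            ids.extend(range(int(low), int(high) + 1))
        else:
            ids.append(int(part))
    return ids, slot_limit


def task_input(environ=os.environ, pattern="input_{:03d}.dat"):
    """Pick this task's input file from PBS_ARRAYID (defaults to 1 outside PBS).

    The file naming pattern is an assumption made for this example.
    """
    array_id = int(environ.get("PBS_ARRAYID", "1"))
    return pattern.format(array_id)


if __name__ == "__main__":
    ids, limit = expand_array_spec("1-50%10")
    print(len(ids), "jobs, at most", limit, "running at once")
    print("this task would read", task_input())
```

Because every copy of the job runs the same script, selecting work by array identifier is what lets fifty identical submissions process fifty different inputs.
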
Dependent scheduling
• Submit jobs whose scheduling depends on other jobs
• Invoked via qsub -W:
    qsub -W depend=type:jobid[:jobid]…
  where type can be
    after         Schedule this job after jobids have started
    afterany      Schedule this job after jobids have finished
    afterok       Schedule this job after jobids have finished with no errors
    afternotok    Schedule this job after jobids have finished with errors
    before        jobids scheduled after this job starts
    beforeany     jobids scheduled after this job completes
    beforeok      jobids scheduled after this job completes with no errors
    beforenotok   jobids scheduled after this job completes with errors

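The depend=type:jobid[:jobid]… syntax is easy to get wrong when it is assembled by hand in a pipeline of several jobs. The sketch below builds the qsub argument list from its parts. It is a hypothetical helper: qsub_depend_args is not a PBS function, the job identifiers and script name in the usage lines are made up, and the code only constructs the command rather than running it.

```python
# Dependency types listed on the slide.
DEPENDENCY_TYPES = {
    "after", "afterany", "afterok", "afternotok",
    "before", "beforeany", "beforeok", "beforenotok",
}


def qsub_depend_args(script, dep_type, job_ids):
    """Build the argument list for: qsub -W depend=type:jobid[:jobid]... script"""
    if dep_type not in DEPENDENCY_TYPES:
        raise ValueError(f"unknown dependency type: {dep_type!r}")
    if not job_ids:
        raise ValueError("at least one job id is required")
    dependency = ":".join([dep_type, *(str(job_id) for job_id in job_ids)])
    return ["qsub", "-W", f"depend={dependency}", script]


if __name__ == "__main__":
    # Hypothetical job ids and script name, for illustration only.
    args = qsub_depend_args("postprocess.pbs", "afterok", ["1234", "1235"])
    print(" ".join(args))
```

On a system where qsub is installed, the returned list could be passed to subprocess.run; in a real workflow the job identifiers would come from the output of the earlier qsub calls rather than being typed in.
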
Troubleshooting
• showq [-r][-i][-b][-w user=uniq]   # running/idle/blocked jobs
• qstat -f jobno                     # full info
• qstat -n jobno                     # nodes/cores where job running
• diagnose -p                        # job priority and components
• pbsnodes                           # nodes, states, properties
• pbsnodes -l                        # list nodes marked down
• checkjob [-v] jobno                # why job jobno is not running
• mdiag -a                           # allocations & users (flux)
• tracejob jobno                     # info for completed jobs (not flux)
• freenodes                          # aggregate node/core busy/free
• mdiag -u uniq                      # allocations for uniq (flux)
• mdiag -a alloc_flux                # cores active, allocated (flux)

GPUs on Flux
Scientific applications
Parallel programming
Debugging & tracing
Debugging with GDB
• Command-line debugger
• Start programs or attach to running programs
• Display source program lines
• Display and change variables or memory
• Plant breakpoints, watchpoints
• Examine stack frames
• Excellent tutorial documentation
  • http://www.gnu.org/s/gdb/documentation/

GDB symbols
• GDB requires program symbols to be generated by the compiler
  • GDB will work without symbols
  • But you’d better be fluent in machine instructions and hexadecimal
• Add -g flag to your compilation
  • gcc -g hello.c -o chello
  • gfortran -g hello.f90 -o fhello
• Do not use -O with -g
  • Most compilers won’t optimize code compiled for debugging
  • gcc and gfortran will, but you often won’t recognize the resulting source code

Useful GDB commands
    gdb exec            start gdb on executable exec
    gdb exec core       start gdb on executable exec with core file core
    l [m,n]             list source
    disas               disassemble function enclosing current instruction
    disas func          disassemble function func
    b func              set breakpoint at entry to func
    b line#             set breakpoint at source line#
    b *0xaddr           set breakpoint at address addr
    i b                 show breakpoints
    d bp#               delete breakpoint bp#
    r [args]            run program with optional args
    bt                  show stack backtrace
    c                   continue execution from breakpoint
    step                single-step one source line
    next                single-step, don’t step into function
    stepi               single-step one instruction
    p var               display contents of variable var
    p *var              display value pointed to by var
    p &var              display address of var
    p arr[idx]          display element idx of array arr
    x 0xaddr            display hex word at addr
    x *0xaddr           display hex word pointed to by addr
    x/20x 0xaddr        display 20 words in hex starting at addr
    i r                 display registers
    i r ebp             display register ebp
    set var = expression   set variable var to expression
    q                   quit gdb

Resources
• http://cac.engin.umich.edu/started
  • Cluster news, RSS feed and outages listed here
• http://cac.engin.umich.edu/
  • Getting an account, training, basic tutorials
• http://cac.engin.umich.edu/resources/systems/flux/
  • Getting an allocation, Flux On-Demand, Flux info
• For assistance: flux-support@umich.edu
  • Read by a team of people
  • Cannot help with programming questions, but can help with scheduler issues
