Advanced High Performance Computing Workshop
HPC 201
Charles J Antonelli, Mark Champe, Seth Meyer
LSAIT ARS
September, 2014
Roadmap
• Flux review
• Globus Connect
• Advanced PBS
• Array & dependent scheduling
• Tools
• GPUs on Flux
• Scientific applications
• R, Python, MATLAB
• Parallel programming
• Debugging & profiling
cja 2014
Flux review
The Flux cluster
[Diagram: login nodes, compute nodes, data transfer node, storage, …]
A Flux node
[Diagram: 12-40 Intel cores; 48 GB to 1 TB RAM; local disk; 8 GPUs (GPU Flux), each containing 2,688 GPU cores]
Programming Models
• Two basic parallel programming models
• Message-passing
  The application consists of several processes running on different nodes and communicating with each other over the network
  • Used when the data are too large to fit on a single node, and simple synchronization is adequate
  • “Coarse parallelism” or “SPMD”
  • Implemented using MPI (Message Passing Interface) libraries
• Multi-threaded
  The application consists of a single process containing several parallel threads that communicate with each other using synchronization primitives
  • Used when the data can fit into a single process, and the communications overhead of the message-passing model is intolerable
  • “Fine-grained parallelism” or “shared-memory parallelism”
  • Implemented using OpenMP (Open Multi-Processing) compilers and libraries
• Both
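As a sketch of how the two models are launched from a Flux batch script: Torque writes one line per assigned core to $PBS_NODEFILE, so the core count can be derived from that file and handed either to mpirun (one process per core) or to OpenMP (one thread per core). The helper name cores_in_nodefile and the program names mpi_program and openmp_program are hypothetical.

```shell
# Count the cores assigned to this job: $PBS_NODEFILE has one line per core.
# An explicit file argument may be passed instead (useful for testing).
cores_in_nodefile() {
  wc -l < "${1:-$PBS_NODEFILE}" | tr -d ' '   # tr strips padding some wc versions add
}

# Message passing (MPI): one process per core, possibly spread across nodes.
#   mpirun -np "$(cores_in_nodefile)" ./mpi_program
#
# Multi-threaded (OpenMP): one process on a single node, one thread per core.
#   export OMP_NUM_THREADS="$(cores_in_nodefile)"
#   ./openmp_program
```

The OpenMP form only makes sense with a single-node request such as nodes=1:ppn=12, since all threads must share one process's memory.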
Using Flux
• Three basic requirements:
  A Flux login account
  A Flux allocation
  An MToken (or a Software Token)
• Logging in to Flux:
  ssh flux-login.engin.umich.edu
  from Campus wired or MWireless
  Otherwise:
  • VPN
  • ssh login.itd.umich.edu first
Cluster batch workflow
• You create a batch script and submit it to PBS
• PBS schedules your job, and it enters the flux queue
• When its turn arrives, your job will execute the batch script
• Your script has access to all Flux applications and data
• When your script completes, anything it sent to standard output and standard error is saved in files stored in your submission directory
• You can ask that email be sent to you when your job starts, ends, or aborts
• You can check on the status of your job at any time, or delete it if it’s not doing what you want
• A short time after your job completes, it disappears from PBS
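From a login node, the submit / check / delete cycle comes down to three Torque commands: qsub submits a script and prints the new job id, qstat reports the job's status, and qdel deletes it. This sketch wraps the first two in a helper; the name submit_and_show and the script name yourscript.pbs are placeholders.

```shell
# Submit a batch script, report the job id, and show its current status.
submit_and_show() {
  local jobid
  jobid=$(qsub "$1") || return 1   # qsub prints the new job id on stdout
  echo "submitted $jobid"
  qstat "$jobid"                   # status line: Q = queued, R = running, C = completed
}

# Usage:   submit_and_show yourscript.pbs
# Later, if the job is not doing what you want:   qdel <jobid>
```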
Loosely-coupled batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=12,pmem=1gb,walltime=00:05:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe

#Your Code Goes Below:
cat $PBS_NODEFILE
cd "$PBS_O_WORKDIR"
mpirun ./c_ex01
Tightly-coupled batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l nodes=1:ppn=12,mem=4gb,walltime=00:05:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe

#Your Code Goes Below:
cd "$PBS_O_WORKDIR"
matlab -nodisplay -r script
GPU batch script
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l nodes=1:gpus=1,walltime=00:05:00
#PBS -M youremailaddress
#PBS -m abe
#PBS -j oe

#Your Code Goes Below:
cat $PBS_NODEFILE
cd "$PBS_O_WORKDIR"
matlab -nodisplay -r gpuscript
Copying data
Three ways to copy data to/from Flux
• Use scp from the login server:
  scp flux-login.engin.umich.edu:hpc201cja/example563.png .
• Use scp from the transfer host:
  scp flux-xfer.engin.umich.edu:hpc201cja/example563.png .
• Use Globus Online
Globus Online
• Features
  • High-speed data transfer, much faster than SCP or SFTP
  • Reliable & persistent
  • Minimal client software: Mac OS X, Linux, Windows
• GridFTP Endpoints
  • Gateways through which data flow
  • Exist for XSEDE, OSG, …
  • UMich: umich#flux, umich#nyx
  • Add your own client endpoint!
  • Add your own server endpoint: contact flux-support@umich.edu
• More information
  • http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp
Job Arrays
• Submit copies of identical jobs
• Invoked via qsub -t:
  qsub -t array-spec pbsbatch.txt
  where array-spec can be
  m-n
  a,b,c
  m-n%slotlimit
  e.g. qsub -t 1-50%10 pbsbatch.txt
  Fifty jobs, numbered 1 through 50; only ten can run simultaneously
• $PBS_ARRAYID records the array identifier
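A minimal array-job script, assuming a hypothetical naming scheme input_001.dat through input_050.dat in the submission directory and a placeholder program yourprogram: every array member runs the same script and uses $PBS_ARRAYID to pick its own input. The fallback values after :- exist only so the script can be dry-run outside PBS.

```shell
#PBS -N yourjobname
#PBS -V
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -l procs=1,pmem=1gb,walltime=00:05:00
#PBS -j oe
# Submit with:   qsub -t 1-50%10 pbsbatch.txt

cd "${PBS_O_WORKDIR:-.}"

# Each array member sees its own index (1..50 here); default to 1 outside PBS.
task=${PBS_ARRAYID:-1}
infile=$(printf 'input_%03d.dat' "$task")   # hypothetical input naming scheme

echo "array member $task processing $infile"
# ./yourprogram "$infile"
```

Zero-padding the index keeps the input files sorted in the same order as the array members.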
Dependent scheduling
• Submit a job to become eligible for execution at a given time
• Invoked via qsub -a:
  qsub -a [[[[CC]YY]MM]DD]hhmm[.SS] …
  qsub -a 201412312359 j1.pbs
  j1.pbs becomes eligible one minute before New Year’s Day 2015
  qsub -a 1800 j2.pbs
  j2.pbs becomes eligible at six PM today (or tomorrow, if submitted after six PM)
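Rather than typing the [[[[CC]YY]MM]DD]hhmm timestamp by hand, it can be generated with date. This sketch assumes GNU date (for its -d option), as found on Linux login nodes; eligible_at is a hypothetical helper name.

```shell
# Convert a human-readable time into the CCYYMMDDhhmm form that qsub -a accepts.
# Relies on GNU date's -d option to parse the input string.
eligible_at() {
  date -d "$1" +%Y%m%d%H%M
}

when=$(eligible_at '2014-12-31 23:59')
echo "$when"   # prints 201412312359

# On a Flux login node you would then submit with, e.g.:
#   qsub -a "$when" j1.pbs
#   qsub -a "$(eligible_at 'tomorrow 06:00')" j2.pbs
```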
Dependent scheduling
• Submit a job to run after specified job(s)
• Invoked via qsub -W:
  qsub -W depend=type:jobid[:jobid]…
  where type can be
  after       schedule this job after jobids have started
  afterany    schedule this job after jobids have finished
  afterok     schedule this job after jobids have finished with no errors
  afternotok  schedule this job after jobids have finished with errors
  qsub first.pbs   # assume it receives jobid 12345
  qsub -W depend=afterany:12345 second.pbs
  Schedule second.pbs after first.pbs completes
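Copying job ids by hand is error-prone; since qsub prints the new job id on stdout, it can be captured and passed straight to -W depend=. A minimal sketch; submit_chain is a hypothetical helper name.

```shell
# Submit two scripts so that the second waits for the first.
# afterany: run once the first job ends, with or without errors
# (use afterok instead to require a clean exit).
submit_chain() {
  local first_script=$1 second_script=$2 first_id
  first_id=$(qsub "$first_script") || return 1   # e.g. 12345.nyx...
  qsub -W depend=afterany:"$first_id" "$second_script"
}

# Usage on a Flux login node:
#   submit_chain first.pbs second.pbs
```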
Dependent scheduling
• Submit a job to run before specified job(s)
• Requires the dependent jobs to be scheduled first
• Invoked via qsub -W:
  qsub -W depend=type:jobid[:jobid]…
  where type can be
  before       jobids scheduled after this job starts
  beforeany    jobids scheduled after this job completes
  beforeok     jobids scheduled after this job completes with no errors
  beforenotok  jobids scheduled after this job completes with errors
  on:N         wait for N dependencies to be satisfied
  qsub -W depend=on:1 second.pbs   # assume it receives jobid 12345
  qsub -W depend=beforeany:12345 first.pbs
  Schedule second.pbs after first.pbs completes
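The before forms require submitting in reverse order: the later job goes in first, held with depend=on:1, and the earlier job then names it. A minimal sketch, assuming qsub prints the new job id on stdout (as Torque's does); submit_before is a hypothetical helper name.

```shell
# Arrange for second_script to run after first_script completes,
# using the "before" style of dependency.
submit_before() {
  local first_script=$1 second_script=$2 second_id
  # 1. Submit the later job first; on:1 holds it until one dependency is satisfied.
  second_id=$(qsub -W depend=on:1 "$second_script") || return 1
  # 2. Submit the earlier job, declaring that second_id may start once it completes.
  qsub -W depend=beforeany:"$second_id" "$first_script"
}

# Usage on a Flux login node:
#   submit_before first.pbs second.pbs
```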
Troubleshooting
• showq [-r][-i][-b][-w user=uniq]  # running/idle/blocked jobs
• qstat -f jobno                    # full info, incl. GPUs
• qstat -n jobno                    # nodes/cores where job is running
• diagnose -p                       # job priority and components
• pbsnodes                          # nodes, states, properties
• pbsnodes -l                       # list nodes marked down
• checkjob [-v] jobno               # why job jobno is not running
• mdiag -a                          # allocations & users (flux)
• freenodes                         # aggregate node/core busy/free
• mdiag -u uniq                     # allocations for uniq (flux)
• mdiag -a alloc_flux               # cores active, allocation (flux)
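Because qstat -f prints one "attribute = value" line per job attribute, a small awk filter can extract a single field for use in scripts. This sketch assumes Torque's job_state attribute and its one-letter states (Q queued, R running, C completed, among others); the helper name job_state is hypothetical.

```shell
# Print only the state letter of a job, e.g. "R" for running.
# Lines look like "    job_state = R"; split on " = " and match the key.
job_state() {
  qstat -f "$1" | awk -F' = ' '$1 ~ /job_state$/ {print $2}'
}

# Usage:   job_state 12345
```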
Scientific applications
Scientific Applications
• R (incl. snow and multicore)
• R with GPU (GpuLm, dist)
• SAS, Stata
• Python, SciPy, NumPy, BioPy
• MATLAB with GPU
• CUDA Overview
• CUDA C (matrix multiply)
Python
• Python software available on Flux
• EPD
  The Enthought Python Distribution provides scientists with a comprehensive set of tools to perform rigorous data analysis and visualization.
  https://www.enthought.com/products/epd/
• biopython
  Python tools for computational molecular biology
  http://biopython.org/wiki/Main_Page
• numpy
  Fundamental package for scientific computing
  http://www.numpy.org/
• scipy
  Python-based ecosystem of open-source software for mathematics, science, and engineering
  http://www.scipy.org/
Debugging & profiling
Debugging with GDB
• Command-line debugger
  • Start programs or attach to running programs
  • Display source program lines
  • Display and change variables or memory
  • Plant breakpoints, watchpoints
  • Examine stack frames
• Excellent tutorial documentation
  • http://www.gnu.org/s/gdb/documentation/
Compiling for GDB
• Debugging is easier if you ask the compiler to generate extra source-level debugging information
  • Add the -g flag to your compilation:
    icc -g serialprogram.c -o serialprogram
    or
    mpicc -g mpiprogram.c -o mpiprogram
• GDB will work without symbols
  • But you need to be fluent in machine instructions and hexadecimal
• Be careful using -O with -g
  • Some compilers won’t optimize code when debugging
  • Most will, but you sometimes won’t recognize the resulting source code at optimization level -O2 and higher
  • Use -O0 -g to suppress optimization
Running GDB
Two ways to invoke GDB:
• Debugging a serial program:
  gdb ./serialprogram
• Debugging an MPI program:
  mpirun -np N xterm -e gdb ./mpiprogram
  • This gives you N separate GDB sessions, each debugging one rank of the program
  • Remember to use the -X or -Y option to ssh when connecting to Flux, or you can’t start xterms there
Useful GDB commands
gdb exec                    start gdb on executable exec
gdb exec core               start gdb on executable exec with core file core
l [m,n]                     list source
disas                       disassemble function enclosing current instruction
disas func                  disassemble function func
b func                      set breakpoint at entry to func
b line#                     set breakpoint at source line#
b *0xaddr                   set breakpoint at address addr
i b                         show breakpoints
d bp#                       delete breakpoint bp#
r [args]                    run program with optional args
bt                          show stack backtrace
c                           continue execution from breakpoint
step                        single-step one source line
next                        single-step, don’t step into function
stepi                       single-step one instruction
p var                       display contents of variable var
p *var                      display value pointed to by var
p &var                      display address of var
p arr[idx]                  display element idx of array arr
x 0xaddr                    display hex word at addr
x *0xaddr                   display hex word pointed to by addr
x/20x 0xaddr                display 20 words in hex starting at addr
i r                         display registers
i r ebp                     display register ebp
set var name = expression   set variable name to expression
q                           quit gdb
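When no X display is available (for example, inside a batch job), the same commands can be run non-interactively: gdb's -batch option exits after executing the commands given with -ex, and --args passes the rest of the command line to the program. The wrapper name gdb_backtrace is hypothetical; it runs the program (r) and prints a backtrace (bt) wherever it stops, such as at a segmentation fault.

```shell
# Run a program under GDB without interaction and print a stack backtrace
# at the point where it stops. All arguments after the program name are
# passed through to the program itself.
gdb_backtrace() {
  gdb -batch -ex run -ex bt --args "$@"
}

# Usage (program built with -g):
#   gdb_backtrace ./serialprogram arg1 arg2
```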
Debugging with DDT
• Allinea’s Distributed Debugging Tool is a comprehensive graphical debugger designed for the complex task of debugging parallel code
• Advantages include
  • Provides a graphical interface to debugging
  • Capabilities similar to those of, e.g., Eclipse or Visual Studio
  • Supports parallel debugging of MPI programs
  • Scales much better than GDB
Running DDT
• Compile with -g:
  mpicc -g mpiprogram.c -o mpiprogram
• Load the DDT module:
  module load ddt
• Start DDT:
  ddt mpiprogram
  • This starts a DDT session, debugging all ranks concurrently
  • Remember to use the -X or -Y option to ssh when connecting to Flux, or you can’t start ddt there
• http://arc.research.umich.edu/flux-and-other-hpc-resources/flux/software-library/
• http://content.allinea.com/downloads/userguide.pdf
Application Profiling with MAP
• Allinea’s MAP Tool is a statistical application profiler designed for the complex task of profiling parallel code
• Advantages include
  • Provides a graphical interface to profiling
  • Observe cumulative results, drill down for details
  • Supports parallel profiling of MPI programs
  • Handles most of the details under the covers
Running MAP
• Compile with -g:
  mpicc -g mpiprogram.c -o mpiprogram
• Load the MAP module:
  module load ddt
• Start MAP:
  map mpiprogram
  • This starts a MAP session
  • Runs your program, gathers profile data, displays summary statistics
  • Remember to use the -X or -Y option to ssh when connecting to Flux, or you can’t start map there
• http://content.allinea.com/downloads/userguide.pdf
Resources
• http://arc.research.umich.edu/flux-and-other-hpc-resources/flux/software-library/
  • U-M Advanced Research Computing Flux software library
• http://arc.research.umich.edu/resources-services/flux/
  • U-M Advanced Research Computing Flux pages
• http://cac.engin.umich.edu/
  • CAEN HPC Flux pages
• http://www.youtube.com/user/UMCoECAC
  • CAEN HPC YouTube channel
• For assistance: flux-support@umich.edu
  • Read by a team of people including unit support staff
  • Cannot help with programming questions, but can help with operational Flux and basic usage questions
References
• Supported Flux software, http://arc.research.umich.edu/flux-and-other-hpc-resources/flux/software-library/ (accessed June 2014).
• Free Software Foundation, Inc., “GDB User Manual,” http://www.gnu.org/s/gdb/documentation/ (accessed June 2014).
• Intel C and C++ Compiler 14 User and Reference Guide, https://software.intel.com/en-us/compiler_14.0_ug_c (accessed June 2014).
• Intel Fortran Compiler 14 User and Reference Guide, https://software.intel.com/en-us/compiler_14.0_ug_f (accessed June 2014).
• Torque Administrator’s Guide, http://docs.adaptivecomputing.com/torque/4-2-8/torqueAdminGuide-4.2.8.pdf (accessed June 2014).
• Submitting GPGPU Jobs, https://sites.google.com/a/umich.edu/engin-cac/resources/systems/flux/gpgpus (accessed June 2014).