Advanced Computing Facility Introduction
Overview
• The Advanced Computing Facility (ACF) houses High Performance Computing (HPC) resources dedicated to scientific research
• 458 nodes, 8568 processing cores, and 49.78 TB memory
• 20 nodes have over 500 GB memory per node
• 13 nodes have 64 AMD cores per node and 109 nodes have 24 Intel cores per node
• Coprocessors: Nvidia K80: 52, Nvidia K40C: 2, Nvidia K40m: 4, Nvidia K20m: 2, Nvidia M2070: 1
• Virtual machine operating system: Linux
http://ittc.ku.edu/cluster/acf_cluster_hardware.html
Cluster Usage Website http://ganglia.acf.ku.edu/
Useful Links
• ACF Cluster computing resources
  http://ittc.ku.edu/cluster/acf_cluster_hardware.html
• Advanced Computing Facility (ACF) documentation main page
  https://acf.ku.edu/wiki/index.php/Main_Page
• Cluster Jobs Submission Guide
  https://acf.ku.edu/wiki/index.php/Cluster_Jobs_Submission_Guide
• Advanced guide
  http://www.adaptivecomputing.com/support/documentation-index/torque-resource-manager-documentation/
• ACF Portal Website
  http://portal.acf.ku.edu/
• Cluster Usage Website
  http://ganglia.acf.ku.edu/
ACF Portal Website http://portal.acf.ku.edu/
ACF Portal Website • Monitor jobs • View cluster loads • Download files • Upload files • ...
Access Cluster System via Linux Terminal
• Access the cluster in Nichols Hall
  1. Log in to a login server (login1 or login2).
  2. Submit cluster jobs or start an interactive session from the login server.
• The cluster creates a virtual machine to run your job or to host your interactive session.
• Access the cluster from off campus
  • Connect through the KU Anywhere VPN first: http://technology.ku.edu/software/ku-anywhere-0
Access Cluster System via Linux Terminal
• Log in to a login server
• Use “ssh” to connect directly to one of the cluster login servers: login1 or login2
• Examples:
    ssh login1                 # log in with your default Linux account
    ssh -X login1              # "-X" enables X11 forwarding to the login server
    ssh <username>@login1      # log in with a different Linux account
    ssh -X <username>@login1
• A login server is an entry point to the cluster and cannot support computationally intensive tasks
Access Cluster System via Linux Terminal
• Submit a cluster job
• Run “qsub” on a login server to submit your job script
• A job script has PBS parameters in the top portion and the commands to run in the bottom portion
• PBS parameters (lines beginning with #PBS) describe the resources and options of the job
• Basic example:
    qsub script.sh
  “script.sh” file:
    #!/bin/bash
    #
    #PBS -N JobName
    #PBS -l nodes=2:ppn=4,mem=8000m,walltime=24:00:00
    #PBS -M user1@ittc.ku.edu,user2@ittc.ku.edu
    #PBS -m abe
    echo Hello World!
• PBS parameters can also be passed as “qsub” arguments:
    qsub -l nodes=2:ppn=4,mem=8000m,walltime=24:00:00 <yourscript>
  (Virtual machine with 2 nodes, 4 cores per node, and 8G memory)
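The two-part job script layout (PBS parameters on top, commands below) can be generated from a handful of values. The following is a minimal sketch, not an ACF-provided tool: the function name `make_job_script` and its argument order are hypothetical, and it only prints the script text rather than submitting anything.

```shell
#!/bin/bash
# Hypothetical helper (not part of the ACF tools): prints a PBS job script.
# Arguments: job name, nodes, cores per node, memory, walltime, command to run.
make_job_script() {
    local name="$1" nodes="$2" ppn="$3" mem="$4" walltime="$5" cmd="$6"
    cat <<EOF
#!/bin/bash
#
#PBS -N ${name}
#PBS -l nodes=${nodes}:ppn=${ppn},mem=${mem},walltime=${walltime}
${cmd}
EOF
}

# Print a script requesting 2 nodes, 4 cores per node, 8000m memory, 24 hours.
make_job_script JobName 2 4 8000m 24:00:00 'echo Hello World!'
```

Redirecting the output to a file (for example `> script.sh`) and running `qsub script.sh` on a login server would then submit it.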
Access Cluster System via Linux Terminal
• Start an interactive session on the cluster
• Basic command
    qlogin      (= “qsub -I -q interactive -l nodes=1:ppn=1”)
  (Interactive session virtual machine with 1 node, 1 core per node, and 2G memory)
• Advanced command
  • Run “qsub” to submit an interactive job.
  • Example:
      qsub -I -q interactive -l nodes=3:ppn=4,mem=8000m
    (Interactive session virtual machine with 3 nodes, 4 cores per node, and 8G memory)
• Further reading
  • https://acf.ku.edu/wiki/index.php/Cluster_Jobs_Submission_Guide
  • http://www.adaptivecomputing.com/support/documentation-index/torque-resource-manager-documentation/
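The resource list after `-l` has to reach qsub as a single shell word; any whitespace inside it makes the shell split it into separate arguments. A small sketch of building the interactive command from its parts follows. The function name `interactive_cmd` is hypothetical, and it only prints the command instead of running it.

```shell
#!/bin/bash
# Hypothetical helper: prints the qsub command for an interactive session.
# Arguments: nodes, cores per node, memory.
interactive_cmd() {
    local nodes="$1" ppn="$2" mem="$3"
    local resources="nodes=${nodes}:ppn=${ppn},mem=${mem}"
    # Refuse resource lists that the shell would split into several words.
    case "$resources" in
        *[[:space:]]*)
            echo "resource list contains whitespace: '$resources'" >&2
            return 1 ;;
    esac
    echo "qsub -I -q interactive -l ${resources}"
}

# 3 nodes, 4 cores per node, 8000m memory.
interactive_cmd 3 4 8000m
```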
Monitoring Jobs
• Run one of the following commands from a login server:
    qstat -n1u <username>
    qstat -nu <username>
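When many jobs are queued, a per-state count can be easier to read than the full listing. The sketch below assumes the usual Torque layout for `qstat -u` output, in which job lines start with a numeric job ID and the state letter (Q, R, C, ...) is the tenth column; verify this against the output on your system. The function name `summarize_jobs` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: counts jobs per state from "qstat -n1u <username>" output on stdin.
# Assumption: job lines begin with a numeric job ID and hold the state in column 10.
summarize_jobs() {
    awk '$1 ~ /^[0-9]+/ { count[$10]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# On a login server:
#   qstat -n1u "$USER" | summarize_jobs
```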
Application Support
• All installed applications can be found in /tools/cluster/6.2/
• Manage software-specific environment variables:
  • Run the script “env-selector-menu” to select your combined environment variables
    • This creates a file called “.env-selector” in your home directory containing the selections. Remove this file to clear the selections.
  • Or run “module load {module_name}” to load the environment variables for a specific software package into the current shell
    • Example: module load cuda/7.5 caffe/1.0rc3
    • Loads the environment variables for cuda 7.5 and caffe 1.0rc3
• Find available modules:
  • Run “module avail” or check the contents of the folder “/tools/cluster/6.2/modules”
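Checking the modules folder by hand can be scripted. This is a rough sketch that assumes the folder is laid out as `<name>/<version>` files, as names like `cuda/7.5` suggest; the layout and the function name `list_modules` are assumptions, and `module avail` remains the authoritative listing.

```shell
#!/bin/bash
# Hypothetical helper: lists modules as name/version by walking a modules folder.
# Assumption: the folder is laid out as <modules_dir>/<name>/<version>.
list_modules() {
    local dir="${1:-/tools/cluster/6.2/modules}"
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd "$dir" && find . -mindepth 2 -maxdepth 2 -type f | sed 's|^\./||' | sort )
}

# On the cluster:
#   list_modules                      # uses /tools/cluster/6.2/modules
#   list_modules | grep '^cuda/'      # only the cuda versions
```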
Rules for Job-Related Cluster Folders
Folders writable without asking an administrator for permission
• ~/ : the most heavily used filesystem on the cluster and throughout ITTC. When running cluster jobs, you may use ~/ for your compiled programs and cluster job organization, but it is important to store and access data on other filesystems.
• /tmp : each node has local storage that is freely accessible in /tmp. It is often useful to write output from cluster jobs to the local disk, archive the results, and copy the archive to another cluster filesystem.
Folders writable only with an administrator's permission
• /data : best suited for storing large data sets. The intended use case for /data is files that are written once and read multiple times.
• /work : best suited for recording output from cluster jobs. If a researcher has a batch of cluster jobs that will generate large amounts of output, space will be assigned in /work.
• /projects : used for organizing group collaborations.
• /scratch : the only cluster filesystem that is not backed up. This space is used for storing data temporarily during processing on the cluster. Exceptionally large data sets or large amounts of cluster job output may pose difficulty for the storage backup system and are stored in /scratch during processing.
• /library : read-only space for researchers who need copies of data on each node of the cluster. Email clusterhelp@acf.ku.edu to ask for data sets to be copied to /library.
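The /tmp workflow (write output to the node's local disk, archive it, copy the archive to another filesystem) might look like the sketch below. The function name `stage_out` and all paths in the usage comments are placeholders, not ACF conventions.

```shell
#!/bin/bash
# Hypothetical helper: archives an output directory next to itself on local disk,
# copies the archive to a destination directory, and prints the archive's new path.
stage_out() {
    local out_dir="$1" dest_dir="$2"
    local parent name archive
    parent="$(dirname "$out_dir")"
    name="$(basename "$out_dir")"
    archive="${parent}/${name}.tar.gz"
    # Archive relative to the parent so the tarball holds "<name>/..." entries.
    tar -czf "$archive" -C "$parent" "$name" || return 1
    mkdir -p "$dest_dir" && cp "$archive" "$dest_dir/" || return 1
    echo "${dest_dir}/${name}.tar.gz"
}

# Inside a job script (placeholder paths):
#   out_dir="/tmp/${PBS_JOBID}_out"; mkdir -p "$out_dir"
#   ./my_program > "$out_dir/result.txt"
#   stage_out "$out_dir" /work/<your_space>
```

Writing many small files to local disk and moving a single archive afterwards keeps most of the job's I/O off the shared filesystems.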
Useful GUI Software in the Cluster System
• matlab: technical computing
• nautilus: file explorer
• gedit: text editor
• nsight: IDE for debugging C++ and CUDA code
  • Requires a GPU virtual machine. Load the CUDA module before running nsight: module load cuda/7.5
Installed Deep Learning Software in Cluster
• Caffe: GPU version only
  • module load cuda/7.5 caffe/1.0rc3
  • Input layer: only the 'hdf5' file format is supported
• TensorFlow: both GPU and CPU versions
  • Example: module load tensorflow/0.8_cpu
Interactive GUI Example
• Matlab
    ssh -X login1
    qsub -X -I -q interactive -l nodes=2:ppn=4,mem=8000m
  (Starts an interactive virtual machine with 2 nodes, 4 cores per node, and 8G memory)
    matlab&
• Nsight
    ssh -X login1
    qsub -X -I -q gpu -l nodes=1:k40:ppn=4:gpus=2,mem=8000m
  (Starts an interactive virtual machine with 1 node, 4 cores per node, 2 K40 GPUs, and 8G memory)
    module load cuda/7.5
    nsight&
Caffe 'qsub' script Example
    #!/bin/bash
    #
    # This is an example script
    #
    # These commands set up the Cluster Environment for your job:
    #PBS -S /bin/bash
    #PBS -N mnist_train_test1
    #PBS -q gpu
    #PBS -l nodes=1:ppn=1:k40,gpus=1
    #PBS -M username@ittc.ku.edu
    #PBS -m abe
    #PBS -d ~/mnist/scripts
    #PBS -e ~/mnist/logs/${PBS_JOBNAME}-${PBS_JOBID}.err
    #PBS -o ~/mnist/logs/${PBS_JOBNAME}-${PBS_JOBID}.out

    # Loading modules
    module load cuda/7.5 caffe/1.0rc3

    # Save job specific information for troubleshooting
    echo "Job ID is ${PBS_JOBID}"
    echo "Running on host $(hostname)"
    echo "Working directory is ${PBS_O_WORKDIR}"
    echo "The following processors are allocated to this job:"
    echo $(cat $PBS_NODEFILE)

    # Run the program
    echo "Start: $(date +%F_%T)"
    source ${PBS_O_WORKDIR}/train_lenet_hdf5.sh
    echo "Stop: $(date +%F_%T)"
Full example: mnist.tar.gz
ACF Virtual Machine vs. Desktop
• ACF
  • Many software packages are installed in /tools/cluster/6.2
  • You must add the corresponding paths to the shell environment variables manually, or use “env-selector-menu” or the module loader to set these variables.
• Desktop
  • Software is installed in /usr/bin and /usr/lib
  • These folders are included in the search path by default
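Adding a path by hand usually means prepending the application's directory to PATH. A minimal sketch follows; the function name `prepend_path` and the application directory in the usage comment are placeholders, and the module loader is normally the simpler route.

```shell
#!/bin/bash
# Hypothetical helper: prepends a directory to PATH only if it is not already listed,
# so sourcing the same setup file twice does not grow PATH.
prepend_path() {
    case ":${PATH}:" in
        *":$1:"*) ;;                 # already on PATH, nothing to do
        *) PATH="$1:${PATH}" ;;
    esac
    export PATH
}

# Placeholder application directory under /tools/cluster/6.2:
#   prepend_path /tools/cluster/6.2/<application>/bin
```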