HPC USERS AND ADMINISTRATORS WORKSHOP
M.S MABAKANE
INDUCTION COURSE
APPLY FOR SYSTEM ACCOUNT
The account application process is summarised below:
• Apply online using: http://www.chpc.ac.za/index.php/contact-us/apply-for-resources-form
• The application is recorded in the helpdesk system
• A committee approves or rejects the application
• The user signs the CHPC Use Policy
• The user account is created in the system
LOGIN INTO THE SYSTEMS
To log in to the systems, ssh to one of the following hostnames:
• Sun cluster: sun.chpc.ac.za
• GPU cluster: gpu.chpc.ac.za
Users connect to the systems over a network with 10 GB/s of bandwidth from anywhere in South Africa and 100 MB/s from outside the country. The available bandwidth determines how quickly data can be accessed on, or copied to, the systems. For more information about logging in and using the HPC systems, please visit: wiki.chpc.ac.za.
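As a minimal sketch of the login step, the helper below only prints the ssh command for one of the two hostnames listed on this slide. The function name login_command and the account name "username" are placeholders introduced for illustration, not part of the CHPC setup.

```shell
# Print the ssh command for one of the two clusters named on this slide.
# "username" is a placeholder for the account name issued by CHPC.
login_command() {
    cluster="$1"
    user="$2"
    case "$cluster" in
        sun|gpu) printf 'ssh %s@%s.chpc.ac.za\n' "$user" "$cluster" ;;
        *) echo "unknown cluster: $cluster" >&2; return 1 ;;
    esac
}

login_command sun username   # prints: ssh username@sun.chpc.ac.za
```

Running the printed command opens the session; any cluster name other than sun or gpu is rejected.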
COPYING DATA
Data can be copied into or out of the HPC systems in different ways:
• Linux machine: scp -r username@hostname:/directory-to-copy <local-destination>
• Windows machine: WinSCP can be used to copy data from the user's computer to the HPC clusters, and from the clusters back to the user's computer.
Basic information on how to install the Ubuntu Linux operating system can be found at: https://help.ubuntu.com/community/Installation. For more information on how to install and use WinSCP, please visit: http://winscp.net/eng/docs/guide_install.
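The two copy directions can be sketched as follows. This is an illustration under assumptions: the account name, the local directories input-data and results, and the run helper are hypothetical, and the remote path is the scratch directory named on the FILESYSTEM STORAGE slide. By default the helper prints each command instead of executing it.

```shell
# DRY_RUN=1 (the default here) prints each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

remote="username@sun.chpc.ac.za"   # hypothetical account on the Sun cluster

# Copy a local directory to the cluster (upload).
run scp -r ./input-data "${remote}:/export/home/username/scratch/"

# Copy a directory from the cluster to the local machine (download).
run scp -r "${remote}:/export/home/username/scratch/results" ./results
```

Setting DRY_RUN=0 before running the block performs the transfers for real.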
HPC SYSTEMS
• Sun cluster
• Blue Gene/P
• SA Grid system
• GPU cluster
DIFFERENT COMPONENTS OF SUN SYSTEM
[Diagram of the Sun system components: Nehalem, Harpertown, Sparc, visualization node, Dell, Westmere]
DIFFERENT COMPONENTS OF SUN SYSTEM (cont…)
For more information visit: http://wiki.chpc.ac.za/quick:start#what_you_have
FILESYSTEM STORAGE
The Sun system provides five main directories for storing data on the Lustre filesystem:
• /opt/gridware/compilers
• /opt/gridware/libraries
• /opt/gridware/applications
• /export/home/username
• /export/home/username/scratch
STORAGE & BACK-UP POLICIES
CHPC has implemented quota policies to govern the storage capacity of the supercomputing system:
• Users may store a maximum of 10 GB of data in their home directories.
• The system provides a grace period of seven days for users who exceed 10 GB in their home directories.
• Users' home directories are backed up.
• Scratch directories are not backed up.
• Data older than 90 days is deleted from the user's scratch directory.
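Users can check their own usage against these limits before the system enforces them. The sketch below is not a CHPC tool: the function names are hypothetical, 10 GB is assumed to mean 10 * 1024 * 1024 KB, and file age is assumed to be measured by modification time, which may differ from the criteria the system actually applies.

```shell
# Report whether a directory is within the 10 GB home quota.
# Assumption: 10 GB is counted as 10 * 1024 * 1024 KB.
check_quota() {
    dir="$1"
    limit_kb=$((10 * 1024 * 1024))
    used_kb=$(du -sk "$dir" | awk '{print $1}')
    if [ "$used_kb" -gt "$limit_kb" ]; then
        echo "over quota: ${used_kb} KB used"
    else
        echo "within quota: ${used_kb} KB used"
    fi
}

# List scratch files whose last modification is more than 90 days old.
old_scratch_files() {
    find "$1" -type f -mtime +90
}

# Example calls (paths from the FILESYSTEM STORAGE slide):
#   check_quota /export/home/username
#   old_scratch_files /export/home/username/scratch
```

Files listed by old_scratch_files are candidates for deletion and should be copied elsewhere if they are still needed, since scratch is not backed up.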
COMPILERS
Different compilers are available for building parallel programs on the Sun system. Environment modules load the following compilers into the user environment:
• GCC 4.1.2 (gfortran, gcc and g++) -> /usr/bin
• GCC 4.7.2 (gfortran, gcc and g++) -> module add gcc/4.7.2
• Intel 12.1.0 with MKL and IntelMPI (ifort, icc, icpc) -> module add intel2012
• Intel 13.0.1 with MKL and IntelMPI -> module add intel-XE/13.0
• Sun Studio (suncc, sunf95, c++filt) -> module add sunstudio
MESSAGE PASSING INTERFACE (MPI)
Several Message Passing Interface (MPI) implementations are available to parallelise applications on the Sun system:
• OpenMPI 1.6.1 compiled with GNU (mpicc, mpiCC, mpif90 and mpif77) -> module add openmpi/openmpi-1.6.1-gnu
• OpenMPI 1.6.1 compiled with Intel (mpicc, mpiCC, mpif90 and mpif77) -> module add openmpi/openmpi-1.6.1-intel
• OpenMPI 1.6.5 compiled with GNU (mpicc, mpiCC, mpif90 and mpif77) -> module add openmpi/openmpi-1.6.5-gnu & module add openmpi/openmpi-1.6.5_gcc-4.7.2
• OpenMPI 1.6.5 compiled with Intel (mpicc, mpiCC, mpif90 and mpif77) -> module add openmpi/openmpi-1.6.5-intel & module add openmpi/openmpi-1.6.5-intel-XE-13.0
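A build with the modules above might look like the following sketch. The function name build_mpi_program and the source file hello.c are hypothetical, and pairing gcc/4.7.2 with openmpi/openmpi-1.6.5_gcc-4.7.2 is an assumption based only on the module names.

```shell
# Load a compiler and an OpenMPI module, then compile with the mpicc wrapper.
# Assumption: the two modules below belong together (inferred from their names).
build_mpi_program() {
    src="$1"
    exe="$2"
    module add gcc/4.7.2
    module add openmpi/openmpi-1.6.5_gcc-4.7.2
    mpicc -O2 -o "$exe" "$src"
}

# Example call on a login node (hello.c is a hypothetical source file):
#   build_mpi_program hello.c hello
```

The same pattern should apply to the Intel builds by swapping in the corresponding compiler and OpenMPI modules from the lists above.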
APPLICATIONS
Different scientific and commercial groups use various parallel applications to perform computational calculations on the Sun cluster. The most popular applications on the cluster are:
• Weather Research and Forecasting (WRF)
• WRF-Chem
• DL_POLY
• ROMS (Regional Ocean Modeling System)
• Gaussian
• VASP
• Gadget
• Materials Studio and Discovery Studio
• CAM
• Quantum Espresso
MOAB AND TORQUE
Moab Cluster Suite is the scheduling tool used to control jobs on the Sun cluster, while Torque is used to monitor the computational resources available in the clusters. Basic Moab commands:
• msub - submit a job
• showq - check the status of jobs
• canceljob - cancel a job in the cluster
EXAMPLE OF MOAB SCRIPT
#!/bin/bash
#MSUB -l nodes=3:ppn=12
#MSUB -l walltime=2:00:00
#MSUB -l feature=dell|westmere
#MSUB -m be
#MSUB -V
#MSUB -o /lustre/SCRATCH2/users/username/file.out
#MSUB -e /lustre/SCRATCH2/users/username/file.err
#MSUB -d /lustre/SCRATCH2/users/username/
#MSUB -mb
##### Running commands
# Count the processor slots allocated to the job (one line per slot)
nproc=$(wc -l < "$PBS_NODEFILE")
mpirun -np $nproc <executable> <output>
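Submitting and tracking such a script can be sketched as below. The helper submit_job and the file name job.msub are hypothetical, and the sketch assumes msub writes only the new job ID, possibly surrounded by blank lines, to standard output.

```shell
# Submit a Moab job script and print the job ID that msub reports.
# Assumption: msub prints only the job ID, possibly with blank lines around it.
submit_job() {
    msub "$1" | tr -d '[:space:]'
    echo
}

# Example session on a login node (job.msub is a hypothetical file name):
#   jobid=$(submit_job job.msub)
#   showq | grep "$jobid"      # check the job's state in the queue
#   canceljob "$jobid"         # cancel the job if necessary
```

Capturing the job ID at submission time avoids having to look it up in the showq listing later.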
GPU CLUSTER
COMPUTE NODES
• 24 GB of memory
• 16 Intel Xeon processors (2.4 GHz)
• 4 x Nvidia Tesla GPU cards
• 96 GB local hard drive capacity
FILE SYSTEM
The GPU cluster is attached to 14 terabytes of storage, managed using GPFS and shared across the entire supercomputer. The cluster has the following file system structure:
• /GPU/home
• /GPU/opt
All the libraries and compilers are located in /GPU/opt, while /GPU/home is used to store users' applications and output data. There are no back-up or storage policies on the GPU system.
COMPILERS AND MPI
The following compilers and MPI libraries are used to compile and execute parallel applications on the GPU system:
• Intel compiler 12.1.0 with MKL and MPI (ifort, icc, icpc) -> module load intel/12.0
• Intel compiler 13.0.1 with MKL and IntelMPI -> module load intel/13.0
• GCC 4.6.3 (gfortran, gcc and g++) -> module load gcc/4.6.3
• MPICH2 1.5 compiled with GNU (mpirun, mpicc, mpif90) -> module load mpich2/1.5
• MVAPICH2 compiled with Intel (mpirun, mpicc, mpif90) -> module load mvapich2/intel/1.9
APPLICATIONS
The following applications are available on the GPU cluster:
• EMBOSS
• NAMD
• AMBER
• GROMACS
QUESTIONS
SLA BETWEEN CHPC & USERS
The CHPC has developed a service level agreement with its users to ensure smooth operation and utilisation of the supercomputing system. To this end, CHPC is responsible for ensuring that users' queries are responded to and resolved on time. The resolution time for each type of query is:
• Creating user accounts, login and ssh - 1 day
• Network - 1 day
• Storage - 2 days
• Installing software and libraries - 3 days
• Compiling and porting applications - 3 days
All queries and incidents are recorded in the helpdesk system.
HELPDESK STATISTICS
• L1 - all calls resolved within 24 hrs, e.g. ssh, login and user creation
• L2 - all calls such as storage, third-party software and hardware resolved within 48 hrs
• L3 - all calls resolved within 72 hrs, e.g. compilers, software and applications
CUSTOMER SATISFACTION
In the second quarter of 2013/14, a customer satisfaction survey was conducted to understand how satisfied users are with the services provided by the centre. The survey also aimed to collect suggestions, complaints and compliments that may help improve the overall operational services within the CHPC. The survey was organised into the following pillars:
• Helpdesk
• System performance
• Scheduler
• Training and workshop
• CHPC website and wiki
DIFFERENT TYPES OF SUPERCOMPUTERS
Supercomputers are regarded as the fastest computers, able to perform millions or trillions of calculations within a short period of time. These systems can be classified into categories such as:
• Distributed-memory systems
• Shared-memory machines
• Vector systems
Most scientists use these computers to run parallel applications and generate scientific output as quickly as possible.
PERFORMANCE OF PROGRAMS ON SUPERCOMPUTERS
The performance of parallel applications is mainly affected by the interplay of factors such as (Adve and Vernon, 2004; Ebnenasir and Beik, 2009):
• Limited network bandwidth
• Uneven distribution of message passing
• Slow read/write requests within the storage
• Logic of the parallel code
• High memory latency in the processing nodes
• High processor utilisation in the execution nodes
Adve, V.S. and Vernon, M.K. (2004). Parallel program performance prediction using deterministic task graph analysis. ACM Transactions on Computer Systems, 22(1): 94-136.
Ebnenasir, A. and Beik, R. (2009). Developing parallel programs: a design-oriented perspective. Proceedings of the 2009 IEEE 31st International Conference on Software Engineering. Vancouver: IEEE Computer Society, pp. 1-8.
THE PERFORMANCE OF THE SUPERCOMPUTERS
Different components (e.g. processor, memory, network) play an important role in determining the performance of supercomputers such as clusters, massively parallel processing systems and shared-memory machines. In the CHPC Sun cluster, execution nodes are equipped with different processors and amounts of memory to run applications. The performance of the execution nodes therefore determines how quickly parallel programs are computed and output is generated. In this case, we look at the statistics of the computational resources used to process applications in the Sun cluster.
Sun Storage
LUSTRE FILESYSTEM
Lustre is a distributed parallel file system used to manage, share and monitor data in the storage. In the Sun cluster, it is configured to administer the following storage capacity:
• 1 petabyte (SCRATCH 5)
• 480 terabytes (SCRATCH 1, 2, 3 and 4)
Both sub-storage systems (480 terabytes and 1 petabyte) are shared across the entire cluster. ClusterStor Manager is used to monitor and report the status of the 1 petabyte sub-storage. Different scripts are used to monitor and control the shared 480 terabytes.