
Cluster Workshop



  1. Cluster Workshop For COMP RPG students 17 May, 2010 High Performance Cluster Computing Centre (HPCCC) Faculty of Science Hong Kong Baptist University

  2. Outline
  • Overview of Cluster Hardware and Software
  • Basic Login and Running Programs in a Job Queuing System
  • Introduction to Parallelism: Why Parallelism, Cluster Parallelism, OpenMP, Message Passing Interface
  • Parallel Program Examples
  • Policy for using sciblade.sci.hkbu.edu.hk
  • http://www.sci.hkbu.edu.hk/hpccc/sciblade

  3. Overview of Cluster Hardware and Software

  4. Cluster Hardware
  This 256-node PC cluster (sciblade) consists of:
  • Master node x 2
  • IO nodes x 3 (storage)
  • Compute nodes x 256
  • Blade chassis x 16
  • Management network
  • Interconnect fabric
  • 1U console & KVM switch
  • Emerson Liebert Nxa 120kVA UPS

  5. Sciblade Cluster A 256-node cluster supported by funding from the RGC

  6. Hardware Configuration
  • Master node: Dell PE1950, 2x Xeon E5450 3.0GHz (Quad Core), 16GB RAM, 73GB x 2 SAS drives
  • IO nodes (storage): Dell PE2950, 2x Xeon E5450 3.0GHz (Quad Core), 16GB RAM, 73GB x 2 SAS drives, 3TB storage on Dell PE MD3000
  • Compute nodes x 256: each a Dell PE M600 blade server with InfiniBand network, 2x Xeon E5430 2.66GHz (Quad Core), 16GB RAM, 73GB SAS drive

  7. Hardware Configuration
  • Blade chassis x 16: Dell PE M1000e, each hosting 16 blade servers
  • Management network: Dell PowerConnect 6248 (Gigabit Ethernet) x 6
  • Interconnect fabric: Qlogic SilverStorm 9120 switch
  • Console and KVM switch: Dell AS-180 KVM, Dell 17FP rack console
  • Emerson Liebert Nxa 120kVA UPS

  8. Software List
  • Operating System: ROCKS 5.1 Cluster OS, CentOS 5.3 (kernel 2.6.18)
  • Job Management System: Portable Batch System (PBS) with MAUI scheduler
  • Compilers, Languages: Intel Fortran/C/C++ Compiler for Linux V11, GNU 4.1.2/4.4.0 Fortran/C/C++ Compiler

  9. Software List
  • Message Passing Interface (MPI) libraries: MVAPICH 1.1, MVAPICH2 1.2, Open MPI 1.3.2
  • Mathematical libraries: ATLAS 3.8.3, FFTW 2.1.5/3.2.1, SPRNG 2.0a (C/Fortran) / 4.0 (C++/Fortran)

  10. Software List
  • Molecular Dynamics & Quantum Chemistry: Gromacs 4.0.7, Gamess 2009R1, Gaussian 03, Namd 2.7b1
  • Third-party applications: FDTD simulation, MATLAB 2008b, TAU 2.18.2, VisIt 1.11.2, Xmgrace 5.1.22, etc.

  11. Software List
  • Queuing system: Torque/PBS with Maui scheduler
  • Editors: vi, emacs

  12. Hostnames
  • Master node: external sciblade.sci.hkbu.edu.hk, internal frontend-0
  • IO nodes (storage): pvfs2-io-0-0, pvfs2-io-0-1, pvfs2-io-0-2
  • Compute nodes: compute-0-0.local, ..., compute-0-255.local

  13. Basic Login and Running Program in a Job Queuing System

  14. Basic login
  • Remote login to the master node
  • Terminal login using secure shell:
    ssh -l username sciblade.sci.hkbu.edu.hk
  • Graphical login: PuTTY & vncviewer, e.g.
    [username@sciblade]$ vncserver
    New 'sciblade.sci.hkbu.edu.hk:3 (username)' desktop is sciblade.sci.hkbu.edu.hk:3
    This means that your session will run on display 3.

  15. Graphical login Using PuTTY to set up a secure connection: Host Name = sciblade.sci.hkbu.edu.hk

  16. Graphical login (con’t) SSH protocol version

  17. Graphical login (con’t) Port = 5900 + display number (i.e. 5903 when the display number is 3, as in this case)

  18. Graphical login (con’t)
  • Next, click Open, and log in to sciblade
  • Finally, run VNC Viewer on your PC and enter "localhost:3" (3 is the display number)
  • You should terminate your VNC session after you have finished your work. To terminate the VNC session running on sciblade, run the command:
    [username@sciblade]$ vncserver -kill :3
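  Putting slides 14 to 18 together, a typical command-line session looks roughly like the following (a sketch only; display number 3 is an example, use whatever number vncserver reports):
    # on your own PC: log in to the master node
    ssh -l username sciblade.sci.hkbu.edu.hk
    # on sciblade: start a VNC server and note the display number it reports
    vncserver
    # on your PC: forward port 5900+display (e.g. 5903) in PuTTY, then point VNC Viewer at localhost:3
    # when you have finished your work, back on sciblade:
    vncserver -kill :3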

  19. Linux commands Both the master and compute nodes run Linux. Frequently used Linux commands for the PC cluster are listed at http://www.sci.hkbu.edu.hk/hpccc/sciblade/faq_sciblade.php

  20. ROCKS specific commands
  ROCKS provides the following commands for users to run programs on all compute nodes (see the usage sketch after this list):
  • cluster-fork: run a program on all compute nodes
  • cluster-fork ps: check user processes on each compute node
  • cluster-kill: kill user processes on all nodes at once
  • tentakel: similar to cluster-fork but runs faster
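  For instance, the usage pattern is to pass the remote command as an argument (a sketch; the exact options are documented in the ROCKS manual on sciblade):
    # run 'uptime' on every compute node, one node after another
    cluster-fork uptime
    # list your own processes on each compute node
    cluster-fork "ps -u $USER"
    # the same kind of check, but executed on all nodes in parallel
    tentakel uptime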

  21. Ganglia • Web-based cluster management and monitoring • http://sciblade.sci.hkbu.edu.hk/ganglia

  22. Why Parallelism

  23. Why Parallelism – Passively Suppose you are already using the most efficient algorithm and an optimal implementation, but the program still takes too long to run or does not even fit into your machine's memory. Parallelization is then the last resort.

  24. Why Parallelism – Initiative
  • Faster: finish the work earlier, i.e. the same work in a shorter time
  • Do more work: more work in the same time
  • Most importantly, you want to predict the result before the event occurs

  25. Examples
  Many scientific and engineering problems require enormous computational power. The following are a few such fields:
  • Quantum chemistry, statistical mechanics, and relativistic physics
  • Cosmology and astrophysics
  • Computational fluid dynamics and turbulence
  • Material design and superconductivity
  • Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
  • Medicine, and modeling of human organs and bones
  • Global weather and environmental modeling
  • Machine vision

  26. Parallelism The upper bound on the computing power that can be obtained from a single processor is set by the fastest processor available at any given time. This upper bound can be increased dramatically by integrating a set of processors together. Synchronization and the exchange of partial results among processors are therefore unavoidable.

  27. Multiprocessing Clustering
  [Figure: parallel computer architecture diagrams comparing shared memory (n CPUs connected through an interconnecting network to a single shared memory) with distributed memory (n processing units, each with its own local memory, exchanging data over an interconnecting network).]
  • Distributed memory – Cluster
  • Shared memory – Symmetric multiprocessors (SMP)

  28. Clustering: Pros and Cons
  Advantages
  • Memory is scalable with the number of processors: increasing the number of processors increases the total memory size and bandwidth as well.
  • Each processor can rapidly access its own memory without interference.
  Disadvantages
  • It is difficult to map existing data structures to this memory organization.
  • The user is responsible for sending and receiving data among processors.

  29. TOP500 Supercomputer Sites (www.top500.org)

  30. Cluster Parallelism

  31. Parallel Programming Paradigm
  • Multithreading: OpenMP (shared memory only)
  • Message Passing: MPI (Message Passing Interface), PVM (Parallel Virtual Machine) (shared memory or distributed memory)

  32. Distributed Memory
  • Programmer's view: several CPUs, several blocks of memory, several threads of action
  • Parallelization: done by hand
  • Example: MPI
  [Figure: a serial process (P1, P2, P3) compared with three processes (0, 1, 2) running P1, P2 and P3 concurrently and exchanging data over the interconnection by message passing.]

  33. Message Passing Model
  [Figure: the same serial-versus-message-passing timeline as on the previous slide, with processes 0, 1 and 2 exchanging data.]
  • Process: a set of executable instructions (a program) which runs on a processor. Message passing systems generally associate only one process with each processor, and the terms "process" and "processor" are used interchangeably.
  • Message passing: the method by which data from one processor's memory is copied to the memory of another processor.
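  The slides do not include a point-to-point example, but a minimal sketch of such a copy, using the standard MPI_Send/MPI_Recv calls (the variable names and message tag are made up for illustration), would be:
    #include <mpi.h>
    #include <stdio.h>
    int main(int argc, char *argv[])
    {
      int rank, value;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
        value = 42;                                        /* data in process 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d from process 0\n", value);  /* copy now in process 1's memory */
      }
      MPI_Finalize();
      return 0;
    }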

  34. OpenMP

  35. OpenMP Mission The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix and Windows NT platforms. It is jointly defined by a group of major computer hardware and software vendors. OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.

  36. OpenMP compiler choice
  • gcc 4.4.0 or above: compile with -fopenmp
  • Intel 10.1 or above: compile with -Qopenmp on Windows, -openmp on Linux
  • PGI compiler: compile with -mp
  • Absoft Pro Fortran: compile with -openmp

  37. Sample OpenMP example
  #include <omp.h>
  #include <stdio.h>
  int main()
  {
    #pragma omp parallel
    printf("Hello from thread %d, nthreads %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    return 0;
  }
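  To try the example with the gcc flag from the previous slide (a sketch; the source file name omp-hello.c and the thread count of 8 are assumptions):
    gcc -fopenmp -o omp-hello omp-hello.c
    export OMP_NUM_THREADS=8
    ./omp-hello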

  38. serial-pi.c
  #include <stdio.h>
  static long num_steps = 10000000;
  double step;
  int main()
  {
    int i;
    double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++) {
      x = (i + 0.5) * step;
      sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
    printf("Est Pi= %f\n", pi);
  }

  39. OpenMP version of spmd-pi.c
  #include <omp.h>
  #include <stdio.h>
  static long num_steps = 10000000;
  double step;
  #define NUM_THREADS 8
  int main()
  {
    int i, nthreads;
    double pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
      int i, id, nthrds;
      double x;
      id = omp_get_thread_num();
      nthrds = omp_get_num_threads();
      if (id == 0) nthreads = nthrds;
      for (i = id, sum[id] = 0.0; i < num_steps; i = i + nthrds) {
        x = (i + 0.5) * step;
        sum[id] += 4.0/(1.0 + x*x);
      }
    }
    for (i = 0, pi = 0.0; i < nthreads; i++)
      pi += sum[i] * step;
    printf("Est Pi= %f using %d threads\n", pi, nthreads);
  }
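  The SPMD version above distributes loop iterations and combines the partial sums by hand. For comparison only (this variant is not part of the workshop material), the same computation can be written more compactly with OpenMP's parallel for and a reduction clause:
    #include <omp.h>
    #include <stdio.h>
    static long num_steps = 10000000;
    int main()
    {
      int i;
      double x, pi, sum = 0.0;
      double step = 1.0/(double) num_steps;
      /* each thread accumulates a private sum; OpenMP combines them at the end */
      #pragma omp parallel for private(x) reduction(+:sum)
      for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0/(1.0 + x*x);
      }
      pi = step * sum;
      printf("Est Pi= %f\n", pi);
      return 0;
    }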

  40. Message Passing Interface (MPI)

  41. MPI
  • MPI is a library, not a language, for parallel programming.
  • An MPI implementation consists of: a subroutine library with all MPI functions; include files for the calling application program; some startup script (usually called mpirun, but not standardized).
  • Include the header file mpi.h (or whatever it is called in your implementation) in the source code.
  • Libraries are available for all major imperative languages (C, C++, Fortran, ...).

  42. General MPI Program Structure
  #include <mpi.h>                          /* MPI include file                        */
  void main (int argc, char *argv[])
  {
    int np, rank, ierr;                     /* variable declarations                   */
    ierr = MPI_Init(&argc, &argv);          /* initialize MPI environment              */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    /* do work and make message passing calls */
    ierr = MPI_Finalize();                  /* terminate MPI environment               */
  }

  43. Sample Program: Hello World! In this modified version of the "Hello World" program, each processor prints its rank as well as the total number of processors in the communicator MPI_COMM_WORLD. Notes: it makes use of the pre-defined communicator MPI_COMM_WORLD, and it does not test the error status of the routines!

  44. Sample Program: Hello World!
  #include <stdio.h>
  #include "mpi.h"                          // MPI compiler header file
  void main(int argc, char **argv)
  {
    int nproc, myrank, ierr;
    ierr = MPI_Init(&argc, &argv);          // MPI initialization
    // Get number of MPI processes
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    // Get process id for this processor
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("Hello World!! I'm process %d of %d\n", myrank, nproc);
    ierr = MPI_Finalize();                  // Terminate all MPI processes
  }
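  Compiled and launched as described on the following slides, the program might produce output like this (a sketch; the source file name hello.c is assumed, and the line order is nondeterministic because all processes print concurrently):
    mpicc -o hello hello.c
    mpirun -np 4 -machinefile machines ./hello
    Hello World!! I'm process 2 of 4
    Hello World!! I'm process 0 of 4
    Hello World!! I'm process 3 of 4
    Hello World!! I'm process 1 of 4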

  45. Performance
  When we write a parallel program, it is important to identify the fraction of the program that can be parallelized and to maximize it, since that fraction bounds the achievable speedup (see the sketch after this list). The goals are:
  • load balance
  • memory usage balance
  • minimized communication overhead
  • reduced sequential bottlenecks
  • scalability
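  The parallelizable fraction caps the speedup via Amdahl's law (added here only as an illustration, not part of the slides): speedup(n) = 1 / ((1 - p) + p/n), where p is the parallel fraction and n the number of processors. A small C check of what this implies for p = 0.9:
    #include <stdio.h>
    /* Amdahl's law: upper bound on speedup for parallel fraction p on n processors */
    static double amdahl(double p, int n)
    {
      return 1.0 / ((1.0 - p) + p / n);
    }
    int main()
    {
      int i;
      int procs[] = {8, 64, 256};
      for (i = 0; i < 3; i++)
        printf("p = 0.9, n = %3d  ->  speedup <= %.1f\n", procs[i], amdahl(0.9, procs[i]));
      /* even with 256 processors the speedup stays below 10 when 10% of the work is serial */
      return 0;
    }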

  46. Compiling & Running MPI Programs
  • Using mvapich 1.1
  • Set the path; at the command prompt, type:
    export PATH=/u1/local/mvapich1/bin:$PATH
    (or uncomment this line in .bashrc)
  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
    mpicc -o cpi cpi.c
  • Prepare a hostfile (e.g. machines) listing the compute nodes, one hostname per line:
    compute-0-0
    compute-0-1
    compute-0-2
    compute-0-3
  • Run the program on a number of processors/nodes:
    mpirun -np 4 -machinefile machines ./cpi

  47. Compiling & Running MPI Programs
  • Using mvapich2 1.2
  • Prepare .mpd.conf and .mpd.passwd and save them in your home directory:
    MPD_SECRETWORD=gde1234-3
    (you may set your own secret word)
  • Set the environment for mvapich2 1.2:
    export MPD_BIN=/u1/local/mvapich2
    export PATH=$MPD_BIN:$PATH
    (or uncomment these lines in .bashrc)
  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
    mpicc -o cpi cpi.c
  • Prepare a hostfile (e.g. machines), one hostname per line as in the previous section

  48. Compiling & Running MPI Programs
  • Boot the MPD ring with the hostfile:
    mpdboot -n 4 -f machines
  • Run the program on a number of processors/nodes:
    mpiexec -np 4 ./cpi
  • Remember to clean up after running jobs with mpdallexit:
    mpdallexit

  49. Compiling & Running MPI Programs
  • Using openmpi 1.2
  • Set the environment for openmpi:
    export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH
    export PATH=/u1/local/openmpi/bin:$PATH
    (or uncomment these lines in .bashrc)
  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
    mpicc -o cpi cpi.c
  • Prepare a hostfile (e.g. machines), one hostname per line as in the previous section
  • Run the program on a number of processors/nodes:
    mpirun -np 4 -machinefile machines ./cpi

  50. Submit parallel jobs into the Torque batch queue
  • Prepare a job script, say omp.pbs, like the following:
    #!/bin/sh
    ### Job name
    #PBS -N OMP-spmd
    ### Declare job non-rerunable
    #PBS -r n
    ### Mail to user
    ##PBS -m ae
    ### Queue name (small, medium, long, verylong)
    ### Number of nodes
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=00:08:00
    cd $PBS_O_WORKDIR
    export OMP_NUM_THREADS=8
    ./omp-test
    ./serial-pi
    ./omp-spmd-pi
  • Submit it using qsub:
    qsub omp.pbs
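  The script above runs the OpenMP examples on a single node. For an MPI program such as cpi, a similar script (a sketch only, assuming the mvapich 1.1 setup from slide 46; the resource requests are illustrative) could look like:
    #!/bin/sh
    #PBS -N MPI-cpi
    #PBS -r n
    #PBS -l nodes=4:ppn=8
    #PBS -l walltime=00:08:00
    cd $PBS_O_WORKDIR
    # Torque lists the allocated nodes in $PBS_NODEFILE
    NPROCS=`wc -l < $PBS_NODEFILE`
    mpirun -np $NPROCS -machinefile $PBS_NODEFILE ./cpi
  Submit it with qsub in the same way; qstat shows the state of queued and running jobs.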
