Parallel Programming On the IUCAA Clusters Sunu Engineer
IUCAA Clusters • The Cluster – Cluster of Intel Machines on Linux • Hercules – Cluster of HP ES45 quad processor nodes • References: http://www.iucaa.ernet.in/
The Cluster • Four Single Processor Nodes with 100 Mbps Ethernet interconnect. • 1.4 GHz, Intel Pentium 4 • 512 MB RAM • Linux 2.4 Kernel (Redhat 7.2 Distribution) • MPI – LAM 6.5.9 • PVM – 3.4.3
Hercules • Four quad processor nodes with Memory Channel interconnect • 1.25 GHz Alpha 21264D RISC Processor • 4 GB RAM • Tru64 5.1A with TruCluster software • Native MPI • LAM 7.0 • PVM 3.4.3
Expected Computational Performance • ES45 Cluster • Processor – 679/960 • System GFLOPS ~ 30 • Algorithm/Benchmark Used – Specint/float/HPL • Intel Cluster • Processor – 512/590 • System GFLOPS ~ 2 • Algorithm/Benchmark Used – Specint/float/HPL
Parallel Programs • Move towards large scale distributed programs • Larger class of problems with higher resolution • Enhanced levels of details to be explored • …
The Starting Point • Model → Single Processor Program → Multiprocessor Program • Model → Multiprocessor Program
Decomposition of a Single Processor Program • Temporal • Initialization • Control • Termination • Spatial • Functional • Modular • Object based
Multi Processor Programs • Spatial delocalization – dissolving the boundary • Single spatial coordinate – invalid • Single time coordinate – invalid • Temporal multiplicity • Multiple streams at different rates w.r.t. an external clock.
In comparison • Multiple points of initialization • Distributed control • Multiple points and times of termination • Distribution of the activity in space and time
Degrees of refinement • Fine parallelism • Instruction level • Program statement level • Loop level • Coarse parallelism • Process level • Task level • Region level
Patterns and Frameworks • Patterns - Documented solutions to recurring design problems. • Frameworks – Software and hardware structures implementing the infrastructure
Processes and Threads • From heavy multitasking to lightweight multitasking on a single processor • Isolated memory spaces to shared memory space
Posix Threads in Brief • pthread_create(pthread_t *id, const pthread_attr_t *attributes, void *(*thread_function)(void *), void *arguments) • pthread_exit • pthread_join • pthread_self • pthread_mutex_init • pthread_mutex_lock/unlock • Link with -lpthread
Multiprocessing architectures • Symmetric Multiprocessing • Shared memory • Space Unified • Different temporal streams • OpenMP standard
OpenMP Programming • Set of directives to the compiler to express shared memory parallelism • Small library of functions • Environment variables. • Standard language bindings defined for FORTRAN, C and C++
An OpenMP example (Fortran):

      program openmp
!$OMP PARALLEL
      print *, "Hello world from", omp_get_thread_num()
!$OMP END PARALLEL
      stop
      end

An OpenMP example (C):

#include <stdio.h>
#include <omp.h>

int main(int argc, char **argv)
{
#pragma omp parallel
  {
    printf("Hello World from %d\n", omp_get_thread_num());
  }
  return 0;
}
OpenMP directives – Parallel and work sharing • OMP parallel [clauses] • OMP do [clauses] • OMP sections [clauses] • OMP section • OMP single
Combined work sharing / Synchronization • OMP parallel do • OMP parallel sections • OMP master • OMP critical • OMP barrier • OMP atomic • OMP flush • OMP ordered • OMP threadprivate
OpenMP Directive clauses • shared(list) • private(list)/threadprivate • firstprivate/lastprivate(list) • default(private|shared|none) (Fortran) • default(shared|none) (C/C++) • reduction(operator|intrinsic : list) • copyin(list) • if(expr) • schedule(type[,chunk]) • ordered/nowait
Open MP Library functions • omp_get/set_num_threads() • omp_get_max_threads() • omp_get_thread_num() • omp_get_num_procs() • omp_in_parallel() • omp_get/set_(dynamic/nested)() • omp_init/destroy/test_lock() • omp_set/unset_lock()
OpenMP environment variables • OMP_SCHEDULE • OMP_NUM_THREADS • OMP_DYNAMIC • OMP_NESTED
OpenMP Reduction and Atomic Operators • Reduction : +,-,*,&,|,&&,|| • Atomic : ++,--,+,*,-,/,&,>>,<<,|
Simple loops

Serial:
      do I=1,N
         z(I) = a * x(I) + y
      end do

Parallel:
!$OMP parallel do
      do I=1,N
         z(I) = a * x(I) + y
      end do
Data Scoping • Loop index private by default • Declare as shared, private or reduction
Private variables

!$OMP parallel do private(a,b,c)
      do I=1,m
         do j=1,n
            b=f(I)
            c=k(j)
            call abc(a,b,c)
         end do
      end do

In C: #pragma omp parallel for private(a,b,c)
Dependencies • Data dependencies (lexical/dynamic extent) • Flow dependencies • Classifying and removing the dependencies • Non-removable dependencies

Example – a flow dependence, not parallelizable (each iteration reads a(I-1) written by the previous iteration):
      do I=2,n
         a(I) = a(I) + a(I-1)
      end do

Example – stride 2 removes the dependence (no iteration writes a value that another iteration reads):
      do I=2,N,2
         a(I) = a(I) + a(I-1)
      end do
Making sure everyone has enough work • Parallel overhead – creation of threads and synchronization vs. work done in the loop

!$OMP parallel do schedule(dynamic,3)

Schedule types: static, dynamic, guided, runtime
Parallel regions – from fine to coarse parallelism • !$OMP parallel • threadprivate and copyin • Work sharing constructs – do, sections, section, single • Synchronization – critical, atomic, barrier, ordered, master
To distributed memory systems • MPI, PVM, BSP …
Some Parallel Libraries Existing parallel libraries and toolkits include: • PUL, the Parallel Utilities Library from EPCC. • The Multicomputer Toolbox from Tony Skjellum and colleagues at LLNL and MSU. • PETSc, the Portable, Extensible Toolkit for Scientific Computation, from ANL. • ScaLAPACK from ORNL and UTK. • ESSL and PESSL on AIX. • PBLAS, PLAPACK, ARPACK.