Experiencing Cluster Computing
Class 1: Introduction to Parallelism
Outline • Why Parallelism • Types of Parallelism • Drawbacks • Concepts • Starting Parallelization • Simple Example
Why Parallelism – Passively Suppose you are already using the most efficient algorithm with an optimal implementation, and the program still takes too long or does not even fit onto your machine. Parallelization is then the last resort.
Why Parallelism – Proactively • Faster • Finish the work earlier • Same work in shorter time • Do more work • More work in the same time • Most importantly, you want to predict the result before the event occurs
Examples Many scientific and engineering problems require enormous computational power. A few fields to mention: • Quantum chemistry, statistical mechanics, and relativistic physics • Cosmology and astrophysics • Computational fluid dynamics and turbulence • Material design and superconductivity • Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling • Medicine, and modeling of human organs and bones • Global weather and environmental modeling • Machine vision
Parallelism • The upper bound for the computing power that can be obtained from a single processor is limited by the fastest processor available at any given time. • The upper bound for the available computing power can be dramatically increased by integrating a set of processors together. • Synchronization and exchange of partial results among processors are therefore unavoidable.
Computer Architecture Flynn's taxonomy defines 4 categories: • SISD: Single Instruction, Single Data • SIMD: Single Instruction, Multiple Data • MISD: Multiple Instruction, Single Data • MIMD: Multiple Instruction, Multiple Data
Computer Architecture – Processor Organizations [Diagram: taxonomy tree] • SISD: uniprocessor (single-processor computer) • SIMD: vector processor, array processor • MISD: (few practical machines) • MIMD: shared memory (SMP, NUMA), distributed memory (cluster)
Parallel Computer Architecture [Diagram: multiprocessing – n CPUs connected to one shared memory; clustering – n processing units, each with its own local memory, connected by an interconnecting network] • Shared Memory – Symmetric multiprocessors (SMP) • Distributed Memory – Cluster
Parallel Programming Paradigm • Multithreading, OpenMP – shared memory only • Message Passing: MPI (Message Passing Interface), PVM (Parallel Virtual Machine) – shared memory or distributed memory
Threads • In computer programming, a thread is the placeholder information associated with a single use of a program that can serve multiple concurrent users. • From the program's point of view, a thread is the information needed to serve one individual user or one particular service request. • If multiple users are using the program, or concurrent requests arrive from other programs, a thread is created and maintained for each of them. • Threads allow the program to know which user is being served as it is alternately re-entered on behalf of different users.
Fork-Join Model [Diagram: the master thread FORKs a team of parallel threads (1, 2, 3, 4) at a parallel region, then JOINs back into a single thread] Threads • Programmer's view: single CPU, single block of memory, several threads of action • Parallelization: done by the compiler
Shared Memory [Diagram: a single-threaded process executes P1, P2, P3 in sequence; a multi-threaded process runs P1, P2, P3 concurrently, exchanging data via shared memory] • Programmer's view: several CPUs, single block of memory, several threads of action • Parallelization: done by the compiler • Example: OpenMP
Multithreaded Parallelization [Diagram: the master thread reaches !$OMP PARALLEL, a team of parallel threads executes Parallel Region 1 until !$OMP END PARALLEL; the fork-join pattern repeats for Parallel Region 2]
Distributed Memory [Diagram: a serial process executes P1, P2, P3 in sequence; with message passing, processes 0, 1, 2 each run their own part, exchanging data via the interconnection] • Programmer's view: several CPUs, several blocks of memory, several threads of action • Parallelization: done by hand • Example: MPI
Drawbacks of Parallelism • Traps • Deadlocks • Process synchronization • Programming effort • Few tools support automated parallelization and debugging • Task distribution (load balancing)
Deadlock • The earliest computer operating systems ran only one program at a time. • All of the resources of the system were available to this one program. • Later, operating systems ran multiple programs at once, interleaving them. • Programs were required to specify in advance what resources they needed, so that they could avoid conflicts with other programs running at the same time. • Eventually some operating systems offered dynamic allocation of resources: programs could request further resources after they had begun running. This led to the problem of deadlock.
Deadlock • Parallel tasks require resources to accomplish their work. If the resources are not available, the work cannot be finished. Each resource can be locked (controlled) by exactly one task at any given point in time. • Consider the situation: • Two tasks both need the same two resources. • Each task manages to gain control over just one resource, but not the other. • Neither task releases the resource it already holds. • This situation is called deadlock, and the program will never terminate.
Deadlock [Diagram: two tasks, each holding one resource and waiting for the resource held by the other]
Dining Philosophers • Each philosopher either thinks or eats. • In order to eat, he requires two forks. • Each philosopher tries to pick up the right fork first. • If successful, he waits for the left fork to become available. • If all philosophers hold their right fork at the same time, every left fork is already taken: deadlock.
Dining Philosophers Demo • Problem • http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/deadlock/Diners.htm • Solution • http://www.sci.hkbu.edu.hk/tdgc/tutorial/ExpClusterComp/deadlock/FixedDiners.htm
Speedup Given a fixed problem size, let TS be the sequential wall clock execution time (in seconds) and TN the parallel wall clock execution time using N processors (in seconds). Speedup: SN = TS / TN. Ideally, SN = N (linear speedup).
Speedup • Absolute speedup = (sequential time on 1 processor) / (parallel time on N processors) • Relative speedup = (parallel time on 1 processor) / (parallel time on N processors) • The two differ because parallel code on 1 processor carries unnecessary MPI overhead • It may be slower than the sequential code on 1 processor
Parallel Efficiency Efficiency is a measure of processor utilization in a parallel program, relative to the serial program. Parallel efficiency is the speedup per processor: EN = SN / N = TS / (N × TN). Ideally, EN = 1.
Amdahl’s Law It states that the potential program speedup is determined by the fraction of code (f) that can be parallelized: speedup = 1 / (1 − f). If none of the code can be parallelized, f = 0 and the speedup = 1 (no speedup). If all of the code is parallelized, f = 1 and the speedup is infinite (in theory).
Amdahl’s Law Introducing the number of processors performing the parallel fraction of work, the relationship can be modeled by the equation speedup = 1 / (S + P/N) where: P: parallel fraction (P + S = 1) S: serial fraction N: number of processors
Amdahl’s Law When N → ∞, speedup → 1/S. Interpretation: no matter how many processors are used, the upper bound for the speedup is determined by the sequential section.
Amdahl’s Law – Example If the sequential section of a program amounts to 5% of the run time, then S = 0.05 and hence speedup = 1 / (0.05 + 0.95/N) ≤ 1 / 0.05 = 20, no matter how many processors are used.
Behind Amdahl’s Law • How much faster can a given problem be solved? • Which problem size can be solved on a parallel machine in the same time as on a sequential one? (Scalability)
Parallelization – Option 1 • Starting from an existing, sequential program • Easy on shared memory architectures (OpenMP) • Potentially adequate for small number of processes (moderate speed-up) • Does not scale to large number of processes • Restricted to trivially parallel problems on distributed memory machines
Parallelization – Option 2 • Starting from scratch • Not popular, but often inevitable • Needs a new program design • Increased complexity (data distribution) • Widely applicable • Often the best choice for large scale problems
Goals for Parallelization • Avoid or reduce • synchronization • communication • Try to maximize the computationally intensive sections.
Summation Given a vector of len integers. // Initialization // int vec[len]; int sum = 0; for (int i = 0; i < len; i++) vec[i] = i*i; // Sum Calculation // for (int i = 0; i < len; i++) sum += vec[i];
Parallel Algorithm • Divide the vector into parts, one per CPU • Each CPU initializes its own part • Use a global reduction to calculate the sum of the vector
OpenMP Compiler directives (#pragma omp) are inserted to tell the compiler to perform parallelization. The compiler is then responsible for automatically parallelizing certain types of loops. #pragma omp parallel for for (int i = 0; i < len; i++) vec[i] = i*i; #pragma omp parallel for reduction(+: sum) for (int i = 0; i < len; i++) sum += vec[i];
MPI [Diagram: no. of processors np = 3, ranks 0, 1, 2; each rank holds a round-robin slice of vec and computes a localsum; MPI_Reduce combines the local sums into sum on rank 0] // in each process, do the initialization for (int i = rank; i < len; i += np) vec[i] = i*i; // calculate the local sum for (int i = rank; i < len; i += np) localsum += vec[i]; // perform global reduction MPI_Reduce(&localsum, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);