Introduction to High Performance Computing Francisco Almeida and Francisco de Sande Departamento de Estadística, I.O. y Computación, Universidad de La Laguna La Laguna, February 12, 2004
Questions • Why Parallel Computers? • How Can the Quality of the Algorithms be Analyzed? • How Should Parallel Computers Be Programmed? • Why the Message Passing Programming Paradigm? • Why the Shared Memory Programming Paradigm?
OUTLINE • Introduction to Parallel Computing • Performance Metrics • Models of Parallel Computers • The MPI Message Passing Library • Examples • The OpenMP Shared Memory Library • Examples • Improvements in black hole detection using parallelism
Why Parallel Computers? • Applications Demanding more Computational Power: • Artificial Intelligence • Weather Prediction • Biosphere Modeling • Processing of Large Amounts of Data (from sources such as satellites) • Combinatorial Optimization • Image Processing • Neural Networks • Speech Recognition • Natural Language Understanding • etc. [Chart: supercomputer performance vs. cost from the 1960s to the 1990s]
Top500 • www.top500.org
Speed-up • Ts = Sequential Run Time: Time elapsed between the beginning and the end of the execution on a sequential computer. • Tp = Parallel Run Time: Time that elapses from the moment a parallel computation starts to the moment the last processor finishes its execution. • Speed-up: S = T*s / Tp ≤ p • T*s = Time of the best sequential algorithm that solves the problem.
Speed-up [Chart: speed-up as a function of the number of processors, indicating the optimal number of processors]
Efficiency • In practice, the ideal behavior of a speed-up equal to p is not achieved, because while executing a parallel algorithm the processing elements cannot devote 100% of their time to the computations of the algorithm. • Efficiency: measure of the fraction of time for which a processing element is usefully employed. • E = (Speed-up / p) x 100 %
Amdahl's Law • Amdahl's law gives an upper bound on the speed-up attainable, based on the nature of the algorithm chosen for the parallel implementation. • Seq = Proportion of the time that must be spent in purely sequential parts. • Par = Proportion of the time that can be done in parallel. • Seq + Par = 1 (normalized for algebraic simplicity) • Maximum Speed-up = (Seq + Par) / (Seq + Par / p) = 1 / (Seq + Par / p) [Chart: the bound plotted for p = 1000]
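A quick worked instance of this bound (the 10% sequential fraction is illustrative, not a figure from the course):

\[
  S_{\max}(p) = \frac{1}{Seq + Par/p}, \qquad
  Seq = 0.1:\quad S_{\max}(1000) = \frac{1}{0.1 + 0.9/1000} \approx 9.9, \qquad
  \lim_{p \to \infty} S_{\max}(p) = 10.
\]

However many processors are added, the sequential 10% caps the speed-up at 10.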
Example • A problem to be solved many times over several different inputs. • Evaluate F(x,y,z) • x in {1, ..., 20}; y in {1, ..., 10}; z in {1, ..., 3} • The total number of evaluations is 20*10*3 = 600. • The cost of evaluating F at one point (x, y, z) is t. • The total running time is t * 600. • If t is equal to 3 hours, the total running time for the 600 evaluations is 1800 hours ≈ 75 days.
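A sketch of how the parallel time for this example would scale, assuming the 600 evaluations are shared evenly among p processors and communication cost is ignored:

\[
  T_p \approx \left\lceil \tfrac{600}{p} \right\rceil \, t
  \;\Longrightarrow\;
  T_{30} \approx 20 \times 3\,\mathrm{h} = 60\,\mathrm{h} \approx 2.5\ \mathrm{days},
  \qquad \text{versus} \qquad
  T_1 = 1800\,\mathrm{h} = 75\ \mathrm{days}.
\]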
The Sequential Model • The RAM model expresses computations on von Neumann architectures. • The von Neumann architecture is universally accepted for sequential computations. [Diagram: RAM model / von Neumann architecture]
The Parallel Model • Computational Models: PRAM, BSP, LogP • Programming Models: PVM, MPI, HPF, Threads, OpenMP • Architectural Models: Parallel Architectures
Digital AlphaServer 8400 Hardware • Shared Memory • Bus Topology • C4-CEPBA • 10 Alpha 21164 processors • 2 GB memory • 8.8 Gflop/s
SGI Origin 2000 Hardware • Distributed Shared Memory • Hypercubic Topology • C4-CEPBA • 64 R10000 processors • 8 GB memory • 32 Gflop/s
The SGI Origin 3000 Architecture (1/2) jen50.ciemat.es • 160 MIPS R14000 processors at 600 MHz • 40 nodes with 4 processors each • Data and instruction caches on-chip • IRIX Operating System • Hypercubic Network
The SGI Origin 3000 Architecture (2/2) • cc-NUMA memory architecture • 1 Gflop/s peak speed • 8 MB external cache • 1 GB main memory per processor • 1 TB hard disk
Beowulf Computers • COTS: Commercial-Off-The-Shelf computers • Distributed Memory
Towards Grid Computing… Source: www.globus.org (updated)
The Parallel Model • Computational Models: PRAM, BSP, LogP • Programming Models: PVM, MPI, HPF, Threads, OpenMP • Architectural Models: Parallel Architectures
Drawbacks that arise when solving Problems using Parallelism • Parallel programming is more complex than sequential programming. • Results may vary as a consequence of intrinsic non-determinism. • New problems appear: deadlocks, starvation... • It is more difficult to debug parallel programs. • Parallel programs are less portable.
The Message Passing Model [Diagram: processors connected by an interconnection network, exchanging messages with Send(parameters) and Recv(parameters)]
[Diagram: parallel applications built on parallel languages and parallel libraries — CMMD, pvm, Express, Zipcode, p4, PARMACS, EUI — converging on MPI]
MPI • What Is MPI? • Message Passing Interface standard • The first standard and portable message passing library with good performance • "Standard" by consensus of MPI Forum participants from over 40 organizations • Finished and published in May 1994, updated in June 1995 • What does MPI offer? • Standardization - on many levels • Portability - to existing and new systems • Performance - comparable to vendors' proprietary libraries • Richness - extensive functionality, many quality implementations
A Simple MPI Program: hello.c

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int name, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    printf("Hello from processor %d of %d\n", name, p);
    MPI_Finalize();
    return 0;
}

$> mpicc -o hello hello.c
$> mpirun -np 4 hello
Hello from processor 2 of 4
Hello from processor 3 of 4
Hello from processor 1 of 4
Hello from processor 0 of 4
A Simple MPI Program: helloms.c

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int name, p, source, dest, tag = 0;
    char message[100];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    if (name != 0) {
        printf("Processor %d of %d\n", name, p);
        sprintf(message, "greetings from process %d!", name);
        dest = 0;
        MPI_Send(message, strlen(message)+1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    } else {
        printf("processor 0, p = %d\n", p);
        for (source = 1; source < p; source++) {
            MPI_Recv(message, 100, MPI_CHAR, source, tag, MPI_COMM_WORLD, &status);
            printf("%s\n", message);
        }
    }
    MPI_Finalize();
    return 0;
}

$> mpirun -np 4 helloms
Processor 2 of 4
Processor 3 of 4
Processor 1 of 4
processor 0, p = 4
greetings from process 1!
greetings from process 2!
greetings from process 3!
Linear Model to Predict Communication Performance • Time to send n bytes = β + n·τ, where β is the startup time (latency) and τ is the transfer time per byte.
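For example, with illustrative parameters (not measurements from the course machines) β = 50 µs and τ = 10 ns/byte, the model predicts:

\[
  T(n) = \beta + n\,\tau, \qquad
  T(10^6\ \mathrm{bytes}) = 50\,\mu\mathrm{s} + 10^6 \times 10\,\mathrm{ns} \approx 10.05\,\mathrm{ms},
\]

so large messages are dominated by the per-byte term, while short messages pay mostly the latency β.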
One-to-all Broadcast / Single-node Accumulation [Diagram: a message M is broadcast from processor 0 to processors 1, ..., p; the reverse flow, gathering and combining one message per processor back into processor 0 in steps 1, ..., p, is a single-node accumulation]
Broadcast on Hypercubes [Diagram: a three-dimensional hypercube with nodes 0-7; the broadcast proceeds in three steps, crossing one dimension of the hypercube per step]
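The three steps drawn above can be coded directly with point-to-point MPI calls. The following is a minimal sketch of my own (hypercube_bcast is a made-up helper, and the number of processes is assumed to be a power of two), not code from the course:

#include <mpi.h>

/* One-to-all broadcast from rank 0: in step k, every rank that already holds
 * the message forwards a copy to its neighbour across dimension k, so p ranks
 * are covered in log2(p) steps. */
void hypercube_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);

    for (int mask = 1; mask < p; mask <<= 1) {      /* one step per dimension */
        int partner = rank ^ mask;                  /* neighbour across that dimension */
        if (rank < 2 * mask && partner < p) {       /* only ranks reached so far take part */
            if (rank < mask)
                MPI_Send(buf, count, type, partner, 0, comm);
            else
                MPI_Recv(buf, count, type, partner, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}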
MPI Broadcast • int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm); • Broadcasts a message from the process with rank "root" to all other processes of the group.
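A minimal usage sketch of MPI_Bcast (my illustration, not from the slides): rank 0 sets a problem size and the broadcast delivers it to every process:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        n = 600;                               /* e.g. the number of evaluations */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Process %d got n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}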
Reduction on Hypercubes • @ is a commutative and associative operator • Ai resides in processor i • Every processor has to obtain A0@A1@...@A(p-1) [Diagram: on a three-dimensional hypercube, each step combines partial results with the neighbour across one dimension, e.g. A6@A7, then A6@A7@A4@A5, until every node holds the full reduction]
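The exchange pattern on this slide can be written as a recursive-doubling loop. A minimal sketch of my own (assuming p is a power of two and taking '@' to be addition), not course code:

#include <mpi.h>

/* All-reduce on a hypercube: in step k every rank exchanges its partial result
 * with its neighbour across dimension k and combines the two, so after
 * log2(p) steps every rank holds A0 @ A1 @ ... @ A(p-1). */
double hypercube_allreduce_sum(double a, MPI_Comm comm)
{
    int rank, p;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);               /* assumed to be a power of two */

    for (int mask = 1; mask < p; mask <<= 1) {
        int partner = rank ^ mask;         /* neighbour across the current dimension */
        double other;
        MPI_Sendrecv(&a, 1, MPI_DOUBLE, partner, 0,
                     &other, 1, MPI_DOUBLE, partner, 0,
                     comm, MPI_STATUS_IGNORE);
        a += other;                        /* '+' stands for the operator '@' */
    }
    return a;
}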
Reductions with MPI • int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm); • Reduces values on all processes to a single value on the root process. • int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm); • Combines values from all processes and distributes the result back to all processes.
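A minimal usage sketch of MPI_Allreduce (my illustration, not from the slides): every process contributes a local value and all of them receive the global maximum:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    double local_max, global_max;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_max = (double) rank;             /* stand-in for a value computed locally */
    MPI_Allreduce(&local_max, &global_max, 1, MPI_DOUBLE,
                  MPI_MAX, MPI_COMM_WORLD);

    printf("Process %d sees global maximum %g\n", rank, global_max);
    MPI_Finalize();
    return 0;
}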
All-to-All Broadcast / Multinode Accumulation [Diagram: each processor 0, ..., p starts with its own message Mi; after the all-to-all broadcast every processor holds M0, M1, ..., Mp; the dual operation is a multinode accumulation, used for reductions and prefix sums]
MPI Collective Operations

MPI Operator    Operation
---------------------------------------------
MPI_MAX         maximum
MPI_MIN         minimum
MPI_SUM         sum
MPI_PROD        product
MPI_LAND        logical and
MPI_BAND        bitwise and
MPI_LOR         logical or
MPI_BOR         bitwise or
MPI_LXOR        logical exclusive or
MPI_BXOR        bitwise exclusive or
MPI_MAXLOC      max value and location
MPI_MINLOC      min value and location
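MPI_MAXLOC and MPI_MINLOC operate on (value, index) pairs; below is a minimal sketch of my own (not from the slides) using the predefined MPI_DOUBLE_INT pair type:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    struct { double value; int rank; } in, out;   /* layout matches MPI_DOUBLE_INT */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in.value = 1.0 / (rank + 1);           /* stand-in for a locally computed value */
    in.rank  = rank;

    /* Root 0 learns the largest value together with the rank that owns it. */
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %g found on process %d\n", out.value, out.rank);

    MPI_Finalize();
    return 0;
}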
Computing π: Sequential • π = ∫_0^1 4/(1+x²) dx

double h, x, pi = 0.0;
long i, n = ...;
h = 1.0 / (double) n;
for (i = 0; i < n; i++) {
    x = (i + 0.5) * h;
    pi += f(x);          /* f(x) = 4 / (1 + x*x) */
}
pi *= h;

[Plot: f(x) = 4/(1+x²) on the interval [0, 1]]
Computing π: Parallel • π = ∫_0^1 4/(1+x²) dx

MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
h = 1.0 / (double) n;
mypi = 0.0;
for (i = name; i < n; i += numprocs) {   /* cyclic distribution of the intervals */
    x = (i + 0.5) * h;
    mypi += f(x);
}
mypi *= h;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
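Putting the fragments together, a complete, runnable version might look as follows (my assembly of the slide code, keeping the slide's variable names; the value of n is chosen arbitrarily for illustration):

#include <stdio.h>
#include "mpi.h"

double f(double x) { return 4.0 / (1.0 + x * x); }

int main(int argc, char *argv[])
{
    int name, numprocs;
    long i, n = 1000000;                     /* number of intervals (illustrative) */
    double h, x, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &name);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    MPI_Bcast(&n, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    h = 1.0 / (double) n;
    mypi = 0.0;
    for (i = name; i < n; i += numprocs) {   /* each process takes every numprocs-th interval */
        x = (i + 0.5) * h;
        mypi += f(x);
    }
    mypi *= h;

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (name == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}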
The Master Slave Paradigm [Diagram: a master process distributing work to a set of slave processes]
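A minimal master/slave sketch with MPI (my illustration, not code from the course; do_work and the task bookkeeping are hypothetical): the master hands out task indices on demand and collects results, and a STOP sentinel ends each slave:

#include <stdio.h>
#include "mpi.h"

#define NTASKS 600     /* e.g. the 600 evaluations of F(x,y,z); assumed >= number of slaves */
#define STOP   -1      /* sentinel task index telling a slave to finish */

double do_work(int task) { return (double) task; }   /* hypothetical placeholder */

int main(int argc, char *argv[])
{
    int rank, p, task, next = 0, src;
    double result;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    if (rank == 0) {                               /* master */
        for (int s = 1; s < p; s++) {              /* seed every slave with one task */
            MPI_Send(&next, 1, MPI_INT, s, 0, MPI_COMM_WORLD);
            next++;
        }
        for (int done = 0; done < NTASKS; done++) {
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, &status);
            src = status.MPI_SOURCE;
            task = (next < NTASKS) ? next++ : STOP;   /* more work, or stop signal */
            MPI_Send(&task, 1, MPI_INT, src, 0, MPI_COMM_WORLD);
        }
    } else {                                       /* slave */
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            if (task == STOP) break;
            result = do_work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}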
Condor University of Wisconsin-Madison. www.cs.wisc.edu/condor • A problem to be solved many times over several different inputs. • The problem to be solved is computationally expensive.