NORA/Clusters AMANO, Hideharu Textbook pp.140-147
NORA (No Remote Access Memory Model)
• No hardware shared memory
• Data exchange is done by messages (or packets)
• A dedicated synchronization mechanism is provided
• High peak performance
• Message passing libraries (MPI, PVM) are provided
Message passing (blocking: rendezvous) [diagram of matched Send/Receive operations]
Message passing (with buffer) [diagram of Send/Receive through a buffer]
Message passing (non-blocking) [diagram: the sender continues with another job]
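The three styles above correspond to different send modes in MPI. A minimal sketch follows; MPI_Ssend, MPI_Bsend, and MPI_Isend are standard MPI calls, while the message contents and tags are made up for illustration.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int pid, data = 42, recv_data;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    if (pid == 0) {
        /* Blocking (rendezvous): MPI_Ssend completes only after the
           receiver has started the matching receive. */
        MPI_Ssend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* With buffer: MPI_Bsend copies the message into a user-attached
           buffer and returns without waiting for the receiver. */
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);

        /* Non-blocking: MPI_Isend returns at once; other work can be
           done before MPI_Wait confirms completion. */
        MPI_Isend(&data, 1, MPI_INT, 1, 2, MPI_COMM_WORLD, &req);
        /* ... other job ... */
        MPI_Wait(&req, &status);
    } else if (pid == 1) {
        MPI_Recv(&recv_data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(&recv_data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
        MPI_Recv(&recv_data, 1, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}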
PVM (Parallel Virtual Machine)
• A buffer is provided for the sender.
• Both blocking and non-blocking receive are provided.
• Barrier synchronization
MPI (Message Passing Interface)
• A superset of PVM for one-to-one communication.
• Group communication
• Various communication patterns are supported.
• Error checking with communication tags.
• Details will be introduced later.
Programming style using MPI
• SPMD (Single Program Multiple Data streams)
  • Multiple processes execute the same program.
  • Independent processing is done based on the process number.
• Program execution using MPI
  • A specified number of processes are generated.
  • They are distributed to the nodes of the NORA machine or PC cluster.
Communication methods
• Point-to-point communication
  • A sender and a receiver execute functions for sending and receiving.
  • The send and receive calls must be strictly matched.
• Collective communication
  • Communication among multiple processes.
  • The same function is executed by every participating process.
  • Can be replaced with a sequence of point-to-point communications, but is sometimes more efficient (see the sketch below).
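A minimal sketch of collective communication, assuming the value 100 known only to process 0 is purely illustrative; MPI_Bcast and MPI_Reduce are standard MPI calls.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int pid, nprocs, value = 0, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) value = 100;   /* data known only to process 0 */

    /* Collective: every process calls the same function.
       MPI_Bcast distributes value from process 0 to all others,
       which would otherwise need nprocs-1 point-to-point sends. */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* MPI_Reduce combines one contribution per process into a sum
       delivered to process 0. */
    MPI_Reduce(&pid, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (pid == 0)
        printf("broadcast value = %d, sum of ranks = %d\n", value, sum);

    MPI_Finalize();
    return 0;
}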
Fundamental MPI functions
• Most programs can be described using six fundamental functions
  • MPI_Init() … MPI initialization
  • MPI_Comm_rank() … get the process #
  • MPI_Comm_size() … get the total process #
  • MPI_Send() … message send
  • MPI_Recv() … message receive
  • MPI_Finalize() … MPI termination
Other MPI functions
• Functions for measurement
  • MPI_Barrier() … barrier synchronization
  • MPI_Wtime() … get the clock time
• Non-blocking functions
  • Consisting of a communication request and a completion check
  • Other calculations can be executed while waiting (see the sketch below).
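A minimal sketch combining these functions: the barrier aligns the processes, MPI_Wtime measures elapsed time, and a non-blocking Isend/Irecv pair is completed with MPI_Wait. The transferred value is arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int pid, data = 0;
    double t0, t1;
    MPI_Request req;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);

    MPI_Barrier(MPI_COMM_WORLD);   /* align all processes before timing */
    t0 = MPI_Wtime();

    if (pid == 0) {
        data = 123;
        MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... other calculation can be executed here ... */
        MPI_Wait(&req, &status);   /* completion check */
    } else if (pid == 1) {
        MPI_Irecv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other calculation can be executed here ... */
        MPI_Wait(&req, &status);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    t1 = MPI_Wtime();
    if (pid == 0)
        printf("elapsed time: %f seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}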
An Example

#include <stdio.h>
#include <mpi.h>

#define MSIZE 64

int main(int argc, char **argv)
{
    char msg[MSIZE];
    int pid, nprocs, i;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(msg, MSIZE, MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
            fputs(msg, stdout);
        }
    }
    else {
        sprintf(msg, "Hello, world! (from process #%d)\n", pid);
        MPI_Send(msg, MSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();

    return 0;
}
Initialize and Terminate

int MPI_Init(
    int *argc,     /* pointer to argc */
    char ***argv   /* pointer to argv */
);

mpi_init(ierr)
    integer ierr   ! return code

The command-line arguments must be passed directly to argc and argv.

int MPI_Finalize();

mpi_finalize(ierr)
    integer ierr   ! return code
Communicator functions

It returns the rank (process ID) in the communicator comm.

int MPI_Comm_rank(
    MPI_Comm comm,   /* communicator */
    int *rank        /* process ID (output) */
);

mpi_comm_rank(comm, rank, ierr)
    integer comm, rank
    integer ierr     ! return code

It returns the total number of processes in the communicator comm.

int MPI_Comm_size(
    MPI_Comm comm,   /* communicator */
    int *size        /* number of processes (output) */
);

mpi_comm_size(comm, size, ierr)
    integer comm, size
    integer ierr     ! return code

• Communicators are used for sharing a communication space among a subset of processes. MPI_COMM_WORLD is a pre-defined communicator for all processes (see the sketch below).
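A sub-communicator can be derived from MPI_COMM_WORLD. The sketch below uses the standard call MPI_Comm_split; the even/odd grouping is only an illustration.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, sub_rank, sub_size;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split MPI_COMM_WORLD into two communicators:
       one for even ranks, one for odd ranks. */
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);

    /* Ranks are renumbered from 0 inside each sub-communicator. */
    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);
    printf("world rank %d -> rank %d of %d in sub-communicator\n",
           world_rank, sub_rank, sub_size);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}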
MPI_Send

It sends data to process "dest".

int MPI_Send(
    void *buf,              /* send buffer */
    int count,              /* # of elements to send */
    MPI_Datatype datatype,  /* datatype of elements */
    int dest,               /* destination (receiver) process ID */
    int tag,                /* tag */
    MPI_Comm comm           /* communicator */
);

mpi_send(buf, count, datatype, dest, tag, comm, ierr)
    <type> buf(*)
    integer count, datatype, dest, tag, comm
    integer ierr   ! return code

• Tags are used for identification of messages.
MPI_Recv

int MPI_Recv(
    void *buf,              /* receive buffer */
    int count,              /* # of elements to receive */
    MPI_Datatype datatype,  /* datatype of elements */
    int source,             /* source (sender) process ID */
    int tag,                /* tag */
    MPI_Comm comm,          /* communicator */
    MPI_Status *status      /* status (output) */
);

mpi_recv(buf, count, datatype, source, tag, comm, status, ierr)
    <type> buf(*)
    integer count, datatype, source, tag, comm, status(mpi_status_size)
    integer ierr   ! return code

• The same tag as the sender's must be passed to MPI_Recv.
• Pass a pointer to an MPI_Status variable. It is a structure with three members, MPI_SOURCE, MPI_TAG and MPI_ERROR, which store the process ID of the sender, the tag, and the error code (see the sketch below).
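The status members are most useful when a receiver accepts messages from any process. A minimal sketch: MPI_ANY_SOURCE and MPI_ANY_TAG are standard MPI constants; the values being sent are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int pid, nprocs, i, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (pid == 0) {
        /* Accept one message from each of the other processes,
           in whatever order they arrive. */
        for (i = 1; i < nprocs; i++) {
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from process %d (tag %d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    } else {
        value = pid * 10;
        MPI_Send(&value, 1, MPI_INT, 0, pid, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}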
datatype and count
• The size of the message is identified by count and datatype.
  • MPI_CHAR    char
  • MPI_INT     int
  • MPI_FLOAT   float
  • MPI_DOUBLE  double
  … etc.
Compile and Execution

% icc -o hello hello.c -lmpi
% mpirun -np 8 ./hello
Hello, world! (from process #1)
Hello, world! (from process #2)
Hello, world! (from process #3)
Hello, world! (from process #4)
Hello, world! (from process #5)
Hello, world! (from process #6)
Hello, world! (from process #7)
Shared memory model vs. Message passing model
• Benefits
  • Shared memory
    • A distributed OS is easy to implement.
    • Automatic parallelizing compilers.
    • OpenMP
  • Message passing
    • Formal verification is easy (blocking).
    • No side effects (a shared variable is itself a side effect).
    • Small cost
OpenMP

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int tid, npes;
        tid = omp_get_thread_num();
        npes = omp_get_num_threads();
        printf("Hello World from %d of %d\n", tid, npes);
    }
    return 0;
}

• Multiple threads are generated by using a pragma.
• Variables declared globally can be shared.
Convenient pragma for parallel execution

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++) {
        c[i] = a[i] + b[i];
    }
}

• The assignment of loop index i to threads is adjusted automatically so that the load on each thread becomes even (a reduction sketch follows below).
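Loops that accumulate a single result additionally need a reduction clause, so that each thread keeps a private partial sum. A minimal sketch; the array name x, its size, and the test data are arbitrary.

#include <stdio.h>

int main()
{
    double x[1000], sum = 0.0;
    int i;

    for (i = 0; i < 1000; i++)
        x[i] = 1.0;   /* arbitrary test data */

    /* Each thread accumulates a private partial sum; the partial sums
       are combined at the end of the loop by the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000; i++)
        sum += x[i];

    printf("sum = %f\n", sum);
    return 0;
}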
Automatic parallelizing compilers
• Automatically translate code written for uniprocessors into code for multiprocessors.
• Loop-level parallelism is the main target of parallelization.
• Fortran codes have been the main targets
  • No pointers
  • The array structure is simple
• Recently, restricted C has become a target language
• OSCAR compiler (Waseda Univ.), COINS
PC Clusters
• Multicomputers
  • Dedicated hardware (CPU, network)
  • High performance but expensive
  • Hitachi's SR8000, Cray T3E, etc.
• WS/PC clusters
  • Using standard CPU boards
  • High performance/cost
  • The standard bus often forms a bottleneck
• Beowulf clusters
  • Standard CPU boards, standard components
  • LAN + TCP/IP
  • Free software
  • A cluster with a standard System Area Network (SAN) like Myrinet is often called a Beowulf cluster
PC clusters in supercomputing
• Clusters occupy more than 80% of the top 500 supercomputers (as of 2008/11).
• Let's check http://www.top500.org
SAN (System Area Network) for PC clusters
• Virtual cut-through routing
• High throughput / low latency
• Outside the cabinet but within the machine-room floor
• Also used for connecting disk subsystems
  • Sometimes called System Area Network
• Infiniband, Myrinet, Quadrics
• GbEthernet / 10Gb Ethernet: store & forward, tree-based topologies
Infiniband
• Point-to-point direct serial interconnection.
• Uses 8b/10b coding.
• Various types of topologies can be supported.
• Multicasting/atomic transactions are supported.
• The maximum throughput
Remote DMA (user level) [diagram: data moves from the sender's buffer on the local node to a buffer on the remote node through the network interfaces' protocol engines at user level, bypassing the system call / kernel agent path]
RHiNET Cluster
• Node
  • CPU: Pentium III 933MHz
  • Memory: 1Gbyte
  • PCI bus: 64bit/66MHz
  • OS: Linux kernel 2.4.18
  • SCore: version 5.0.1
• RHiNET-2 with 64 nodes
• Network: optical → Gigabit Ethernet
Cluster of supercomputers
• The recent trend in supercomputers is connecting powerful components with high-speed SANs like Infiniband
• Roadrunner by IBM
  • General-purpose CPUs + Cell BE
Grid Computing
• Supercomputers connected via the Internet can be treated virtually as a single big supercomputer.
• Middleware or a toolkit is used to manage it.
  • Globus Toolkit
• GEO (Global Earth Observation Grid)
  • AIST, Japan
Contest/Homework
• Assume that there is an array x[100]. Write MPI code that computes the sum of all elements of x with 8 processes (one possible sketch follows below).
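One possible sketch, not a required solution: it assumes every process initializes x identically (alternatively MPI_Bcast or MPI_Scatter could distribute it) and gathers the result with MPI_Reduce; since 100 is not divisible by 8, the last process takes the remainder.

#include <mpi.h>
#include <stdio.h>

#define N 100

int main(int argc, char **argv)
{
    double x[N], local_sum = 0.0, total = 0.0;
    int pid, nprocs, i, start, end, chunk;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &pid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* run with mpirun -np 8 */

    /* Every process fills x identically here for simplicity. */
    for (i = 0; i < N; i++)
        x[i] = (double)i;

    /* Divide the N elements among the processes. */
    chunk = (N + nprocs - 1) / nprocs;
    start = pid * chunk;
    end = (start + chunk < N) ? start + chunk : N;

    for (i = start; i < end; i++)
        local_sum += x[i];

    /* Combine the partial sums into 'total' on process 0. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (pid == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}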