An Introduction to Parallel Programming and MPICH Nikolaos Hatzopoulos
What is Serial Computing? • Traditionally, software has been written for serial computation: • To be run on a single computer having a single Central Processing Unit (CPU); • A problem is broken into a discrete series of instructions. • Instructions are executed one after another. • Only one instruction may execute at any moment in time.
What is Parallel Computing? • In the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem: • To be run using multiple CPUs • A problem is broken into discrete parts that can be solved concurrently • Each part is further broken down to a series of instructions • Instructions from each part execute simultaneously on different CPUs
Computer Architecture (von Neumann) • Comprised of four main components: • Memory • Control Unit • Arithmetic Logic Unit • Input/Output • Read/write, random access memory is used to store both program instructions and data • Program instructions are coded data which tell the computer to do something • Data is simply information to be used by the program • The Control Unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task. • The Arithmetic Logic Unit performs basic arithmetic operations • Input/Output is the interface to the human operator
UMA, or Uniform Memory Access In the UMA memory architecture, all processors access shared memory through a bus (or another type of interconnect).
UMA, or Uniform Memory Access UMA gets its name from the fact that each processor must use the same shared bus to access memory, resulting in a memory access time that is uniform across all processors. Note that access time is also independent of data location within memory. That is, access time remains the same regardless of which shared memory module contains the data to be retrieved.
NUMA (Non-Uniform Memory Access) • In the NUMA shared memory architecture, each processor has its own local memory module that it can access directly and with a distinctive performance advantage. At the same time, it can also access any memory module belonging to another processor using a shared bus (or some other type of interconnect).
NUMA (Non-Uniform Memory Access) What gives NUMA its name is that memory access time varies with the location of the data to be accessed. If data resides in local memory, access is fast. If data resides in remote memory, access is slower. The advantage of the NUMA architecture as a hierarchical shared memory scheme is its potential to improve average case access time through the introduction of fast, local memory.
Modern multiprocessor systems • In this complex hierarchical scheme, processors are grouped by the multi-core CPU package, or "node", in which they physically reside. Processors within a node share access to memory modules as per the UMA shared memory architecture. At the same time, they may also access memory from a remote node over a shared interconnect, but with slower performance, as per the NUMA shared memory architecture.
Distributed computing • A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable.
Parallel algorithm for Distributed Memory Computing • Assume we have these numbers: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] and we want to add them with a parallel algorithm on 4 CPUs • Solution: CPU0: 1, 2, 3 → 6 CPU1: 4, 5, 6 → 15 CPU2: 7, 8, 9 → 24 CPU3: 10, 11, 12 → 33 Finally CPU0 combines: 6 + 15 + 24 + 33 = 78
What are the benefits of a parallel program? • Assume Tn is the time to pass a message through the network and To is the time to execute one operation. • In our example we would need: Tn + 3To + Tn + 4To = 2Tn + 7To • For a serial program: 12To • Assuming To = 1 and Tn = 10: Parallel = 2x10 + 7x1 = 27 Serial = 12x1 = 12 • Conclusion: serial is faster than parallel
What are the benefits of a parallel program? • Now assume we have 12,000 numbers to add • Parallel: Tn + 3,000To + Tn + 4To = 10 + 3,000x1 + 10 + 4x1 = 3,024 • Serial: 12,000To = 12,000x1 = 12,000 • Conclusion: parallel is about 4 times faster than serial • Parallel computing becomes beneficial for large-scale computational problems
MPICH • MPICH is a freely available, portable implementation of MPI, a message-passing standard for distributed-memory applications used in parallel computing. • Message Passing Interface (MPI) is a specification for an API that allows many computers to communicate with one another. • MPICH provides libraries for C/C++ and Fortran
Installing MPICH on Linux • Web page: http://www.mcs.anl.gov/research/projects/mpi/mpich1/ • Download for Linux: ftp://ftp.mcs.anl.gov/pub/mpi/mpich.tar.gz • Untar: tar xvfz mpich.tar.gz • Configure: as root: ./configure --prefix=/usr/local --rsh=ssh as user: ./configure --prefix=/home/username --rsh=ssh • Compile: make • Install: make install
Testing MPICH • $ which mpicc It should give the path of mpicc where we installed it, like: ~/bin/mpicc, and the same for mpirun • To run a test: from the mpich installation dir $cd examples/basic $make $mpirun -np 2 cpi result: Process 0 of 2 on localhost.localdomain pi is approximately 3.1415926544231318, Error is 0.0000000008333387 wall clock time = 0.000000 Process 1 of 2 on localhost.localdomain
Possible Errors • mpicc not found in the PATH: $cd ~ $gedit .bashrc add the following line at the bottom export PATH=$PATH:/path_of_mpich/bin save and relogin • When we run mpirun -np 2 cpi and get: p0_29223: p4_error: Could not gethostbyname for host buster.localdomain; may be invalid name it means the system cannot resolve buster.localdomain, which is our hostname. As root: gedit /etc/hosts, locate the 127.0.0.1 line and add the hostname at the end of it, for example: 127.0.0.1 localhost.localdomain localhost buster.localdomain
ssh login without password • To avoid typing our password as many times as the np value, we can set up a login without a password • $ssh-keygen finishing this process creates two files $cd ~/.ssh $ls id_rsa id_rsa.pub known_hosts $cp id_rsa.pub authorized_keys2 So when we do $ssh localhost it will log in without a password
hello.c MPICH program #include <stdio.h> #include "mpi.h" int main(int argc, char** argv){ int my_rank; int size; int namelen; char proc_name[MPI_MAX_PROCESSOR_NAME]; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); MPI_Comm_size(MPI_COMM_WORLD, &size); MPI_Get_processor_name(proc_name, &namelen); if (my_rank == 2) printf("Hello - I am process 2\n"); else printf("Hello from process %d of %d on %s\n", my_rank, size, proc_name); MPI_Finalize(); return 0; }
Run hello.c • $mpicc hello.c • $mpirun -np 4 a.out result: Hello from process 0 of 4 on localhost.localdomain Hello from process 1 of 4 on localhost.localdomain Hello from process 3 of 4 on localhost.localdomain Hello - I am process 2 • NOTE: the results may be displayed in a different order; that depends on how the operating system schedules the processes
From Documentation of MPICH • http://www.mcs.anl.gov/research/projects/mpi/mpich1/docs.html • MPI_MAX_PROCESSOR_NAME Maximum length of name returned by MPI_GET_PROCESSOR_NAME • MPI_Init Initialize the MPI execution environment • MPI_Comm_rank Determines the rank of the calling process in the communicator • MPI_Comm_size Determines the size of the group associated with a communicator • MPI_Get_processor_name Gets the name of the processor • MPI_Finalize Terminates MPI execution environment
Prepare data for parallel sum if (my_rank == 0){ //ON CPU0 array_size = 12; for(i=0;i<array_size;i++) data[i] = i+1 ; //FILL THE data array 1,2,3,4.. 12 for (target = 1; target < p; target++) MPI_Send(&array_size, 1, MPI_INT, target, tag1, MPI_COMM_WORLD); //send array size to the other CPUs loc_array_size = array_size/p; //calculate local array size k = loc_array_size; for(target = 1; target < p; target++){ MPI_Send(&data[k], loc_array_size, MPI_INT, target, tag2, MPI_COMM_WORLD); //send data to the other CPUs k+=loc_array_size; } //k = 3,6,9,12 for(k=0; k<loc_array_size; k++) data_loc[k]=data[k]; //initialize CPU0 local array } else{ MPI_Recv(&array_size, 1, MPI_INT, 0, tag1, MPI_COMM_WORLD, &status); //receive array size from CPU0 loc_array_size = array_size/p; MPI_Recv(&data_loc[0], loc_array_size, MPI_INT, 0, tag2, MPI_COMM_WORLD, &status); //receive local array from CPU0 }
Parallel sum res = 0; //parallel sum for (k=0; k<loc_array_size; k++) res = res + data_loc[k]; if (my_rank != 0){ MPI_Send(&res, 1, MPI_INT, 0, tag3, MPI_COMM_WORLD); //send result to CPU0 } else{ finres = res; //res of CPU0 printf("\n Result of process %d: %d\n", my_rank, res); for (source = 1; source < p; source++) { MPI_Recv(&res, 1, MPI_INT, source, tag3, MPI_COMM_WORLD, &status); //receive results from CPUs finres = finres + res; printf("\n Result of process %d: %d\n", source, res); } printf("\n\n\n Final Result: %d\n", finres); } MPI_Finalize();
Parallel Sum Output $ mpirun -np 4 a.out Result of process 0: 6 Result of process 1: 15 Result of process 2: 24 Result of process 3: 33 Final Result: 78
MPI_Send • Performs a basic send • Synopsis • #include "mpi.h" int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm ) • Input Parameters • buf: initial address of send buffer (choice) • count: number of elements in send buffer (nonnegative integer) • datatype: datatype of each send buffer element (handle) • dest: rank of destination (integer) • tag: message tag (integer) • comm: communicator (handle)
MPI_Recv • Basic receive • Synopsis • #include "mpi.h" int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status ) • Output Parameters • buf: initial address of receive buffer (choice) • status: status object (Status) • Input Parameters • count: maximum number of elements in receive buffer (integer) • datatype: datatype of each receive buffer element (handle) • source: rank of source (integer) • tag: message tag (integer) • comm: communicator (handle)