Presentation Transcript


  1. High Performance Computation --- A Practical Introduction
     Chunlin Tian, NAOC, Beijing, 2011

  2. Outline
     Parallelization techniques:
     - OpenMP: do-loop based
     - MPI: communication
     - Auto-parallelization, CUDA
     Remark: this is at the introduction level; it is NOT a comprehensive introduction.

  3. Introduction
     Speed up the computing: mathematics, physics, computation.
     Hardware: number of CPUs, size of memory
     - CPU: multi-processor vs. cluster; GPU
     - Memory: distributed vs. shared
     Software:
     - Auto-parallelization by the compiler
     - OpenMP
     - MPI
     - CUDA

  4. Shared vs. Distributed
     Hardware: desktop (shared memory) vs. supercomputer (distributed memory)
     Software: distributed = shared

  5. Auto-parallelization
     Easy to employ:
     - Set an environment variable: setenv OMP_NUM_THREADS 2
     - Compiler options: pgf77 -mp -static ... or ifort -parallel ...
     Not smart enough:
     - Only efficient for dual-core CPUs
     - Sometimes even slower than the single thread
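
     As a minimal illustration (not from the slides; the array names a, b, c and the size n are assumed), the loops that auto-parallelization handles well are those whose iterations are independent of each other:

        ! independent iterations: -parallel / -mp can split them across threads
        do i = 1, n
           c(i) = sqrt(a(i)**2 + b(i)**2)
        enddo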

  6. OpenMP - Introduction
     Open Multi-Processing: an API supporting multi-platform shared-memory multiprocessing programming.
     It consists of a set of compiler directives, library routines and environment variables.
     History:
     1997, version 1.0 in Fortran
     1998, version 1.0 in C, C++
     2000, version 2.0 in Fortran
     2002, version 2.0 in C, C++
     2005, version 2.5 in Fortran, C, C++
     2008, version 3.0 in Fortran, C, C++
     ...
     Compilers: GNU, Intel, IBM, PGI, MS, ...

  7. Coding with OpenMP
     Step 1: define the parallel region
     Step 2: define the types (shared/private) of the variables
     Step 3: mark the do-loops to be parallelized
     Remark: you can parallelize your code incrementally, part by part.
     The number of parallel regions should be as small as possible.

  8. Example of OpenMP code

     !$omp parallel
     !$omp& default (shared)
     !$omp& private (tmp)
     !$omp do
     do i=1,nx
        tmp=a(i)**2+b(i)**2
        tmp=sqrt(tmp)
        c(i)=a(i)/tmp
        d(i)=b(i)/tmp
     enddo
     !$omp end do
     !$omp single
     write(*,*) maxval(c), maxval(b)
     !$omp end single
     !$omp do
     do j=1,ny
        tmp=a(j)**2+b(j)**2
        tmp=sqrt(tmp)
        c(j)=b(j)/tmp
        d(j)=a(j)/tmp
     enddo
     !$omp end do
     !$omp end parallel
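
     A common variation (not on the slides; a sketch that reuses the arrays above) is a reduction, where each thread accumulates a private partial result that OpenMP combines when the loop ends:

        s = 0.0d0
        !$omp parallel do default(shared) private(tmp) reduction(+:s)
        do i = 1, nx
           tmp = a(i)**2 + b(i)**2
           s   = s + sqrt(tmp)      ! partial sums are combined automatically
        enddo
        !$omp end parallel do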

  9. Run the OpenMP code
     Set the environment variable: setenv OMP_NUM_THREADS 4
     Compile: ifort -openmp -static-intel *.f -o openbbs1.e
     Run: ./openbbs1.e

  10. Scalability of OpenMP code
      Ideally the speed-up should be linear in the number of threads.
      In practice, initialization, finalization, synchronization, etc. take time.
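
      To measure the actual speed-up, time the parallel part and compare runs with different OMP_NUM_THREADS values (a minimal sketch, not from the slides; the subroutine name compute is assumed):

         use omp_lib                    ! provides omp_get_wtime()
         double precision :: t0, t1
         t0 = omp_get_wtime()
         call compute()                 ! the parallelized work
         t1 = omp_get_wtime()
         write(*,*) 'elapsed seconds:', t1 - t0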

  11. MPI
      Message Passing Interface: a specification for an API that allows many computers to communicate with one another.
      Language-independent protocol, programmer interface, semantic specification.
      History:
      1994 May, version 1.0, the final report of the MPI Forum
      1995 June, version 1.1
      1997 July, version 1.2 (MPI-1); version 2.0 (MPI-2)
      2008 May, version 1.3
      2008 June, version 2.1
      2009 Sept., version 2.2
      Remark: Open MPI ≠ OpenMP
      Implementations: MPICH, HP MPI, Intel MPI, MS MPI, ...

  12. Coding with MPI
      1: determine the number of blocks
      2: define the virtual CPU topology
      3: define the parallel region
      4: assign tasks to the different threads
      5: communication between threads
      6: manage the threads: master-slave or non-master

  13. Example of MPI coding

      include 'mpif.h'
      nx=100; ny=100                  ! number of grids
      mx=2;  my=5                     ! number of blocks
      call MPI_INIT(ierr)                             ! initialize the parallelization
      call MPI_COMM_RANK(MPI_COMM_WORLD,myid,ierr)    ! get the ID of this thread
      ! myid -> myidx, myidy : the IDs of myid's neighbours (virtual topology)
      ... ... ... ... ... ... ...
      call MPI_SEND(vb,nx*2,MPI_REAL8,receiverid,tag,MPI_COMM_WORLD,ierr)        ! send data
      call MPI_RECV(va,nx*2,MPI_REAL8,senderid,tag,MPI_COMM_WORLD,status,ierr)   ! receive data
      call MPI_Finalize(ierr)                         ! finalize the parallelization

  14. CPU Virtual Topology
      1. each thread has a unique ID;
      2. each thread has more than one neighbour;
      3. CPUs can be arranged as a one- or multi-dimensional array;
      4. the topology should be as simple as possible.
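
      MPI can set up such a topology itself. The sketch below (not from the slides) builds the 2 x 5 process grid of slide 13 with MPI_CART_CREATE and recovers each thread's coordinates, from which the neighbour IDs follow:

         include 'mpif.h'
         integer :: ierr, myid, comm2d, dims(2), coords(2)
         logical :: periods(2)
         dims    = (/ 2, 5 /)           ! mx by my blocks
         periods = .false.              ! non-periodic boundaries
         call MPI_INIT(ierr)
         call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods, .true., comm2d, ierr)
         call MPI_COMM_RANK(comm2d, myid, ierr)
         call MPI_CART_COORDS(comm2d, myid, 2, coords, ierr)   ! -> (myidx, myidy)
         call MPI_FINALIZE(ierr)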

  15. MPI Communication
      Point-to-point: one CPU to one CPU
      Collective: one to multiple: broadcast, scatter, gather, reduce, etc.
      Blocking: send and then check the receiving buffer
      Non-blocking: send and return
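
      For illustration (a sketch, not from the slides; the buffers and rank variables follow slide 13, while req1, req2 and cmax_local are assumed names), a collective reduction and a non-blocking exchange look like:

         integer :: req1, req2, status(MPI_STATUS_SIZE), ierr
         ! collective: every thread receives the global maximum of its local value
         call MPI_ALLREDUCE(cmax_local, cmax, 1, MPI_REAL8, MPI_MAX, MPI_COMM_WORLD, ierr)
         ! non-blocking: post the transfers, overlap them with work, then wait
         call MPI_ISEND(vb, nx*2, MPI_REAL8, receiverid, tag, MPI_COMM_WORLD, req1, ierr)
         call MPI_IRECV(va, nx*2, MPI_REAL8, senderid, tag, MPI_COMM_WORLD, req2, ierr)
         ! ... computation that touches neither va nor vb ...
         call MPI_WAIT(req1, status, ierr)
         call MPI_WAIT(req2, status, ierr)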

  16. Run the MPI code
      Compile: mpif77 -O3 *.f -o mpimod4.e
      Start the mpd daemon: mpdboot
      Run the code: mpirun -n 7 mpimod4.e

  17. CUDA
      What's next? GPU supercomputing.
      It is also a do-loop based method: do-loop <==> CUDA subroutine (kernel).

  18. Summary
      Parallelization: three levels (compiler, OpenMP, MPI)
      Employment: easy <---> difficult
      Scalability: inefficient <---> efficient?
      Principles: do-loop based parallelization; message passing
      Thanks!!
