High Performance Computation --- A Practical Introduction
Chunlin Tian, NAOC, Beijing, 2011
Outline
Parallelization techniques:
  OpenMP: do-loop based
  MPI: communication
  Auto-parallelization, CUDA
Remark: this is at an introductory level; it is NOT a comprehensive introduction.
Introduction
Speed up the computing: mathematics, physics, computation
Hardware:
  number of CPUs, size of memory
  CPU: multi-processor vs. cluster; GPU
  Memory: distributed vs. shared
Software:
  auto-parallelization by the compiler
  OpenMP
  MPI
  CUDA
Shared vs. Distributed
Hardware: desktop vs. supercomputer
Software: distributed = shared
Auto-parallelization
Easy to employ:
  set environment variable: setenv OMP_NUM_THREADS 2
  compiler options: pgf77 -mp -static ...  or  ifort -parallel ...
Not smart enough:
  only efficient for dual-core CPUs
  sometimes even slower than a single thread
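For illustration, a minimal sketch of a loop the compiler may auto-parallelize without any source changes (the program name, array size, and loop body are made up for this example):

      program autopar_demo
      implicit none
      integer, parameter :: n = 1000000
      real*8, allocatable :: a(:), b(:)
      integer :: i
      allocate(a(n), b(n))
! independent iterations: a candidate for compiler auto-parallelization
      do i = 1, n
         a(i) = dble(i)
         b(i) = sqrt(a(i)) + 1.d0
      enddo
      write(*,*) b(n)
      end program autopar_demo

Compile with, e.g., ifort -parallel autopar_demo.f -o autopar.e and set OMP_NUM_THREADS before running.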
OpenMP - introduction
Open Multi-Processing: an API supporting multi-platform shared-memory multiprocessing programming. It consists of a set of compiler directives, library routines and environment variables.
History:
  1997, version 1.0 in Fortran
  1998, version 1.0 in C, C++
  2000, version 2.0 in Fortran
  2002, version 2.0 in C, C++
  2005, version 2.5 in Fortran, C, C++
  2008, version 3.0 in Fortran, C, C++
  ...
Compilers: GNU, Intel, IBM, PGI, MS, ...
Coding with OpenMP
Step 1: define the parallel region
Step 2: define the types (shared/private) of the variables
Step 3: mark the do-loops to be parallelized
Remark: you can parallelize your code incrementally, part by part. The number of parallel regions should be kept as small as possible.
Example of OpenMP code

!$omp parallel
!$omp& default (shared)
!$omp& private (tmp)
!$omp do
      do i=1,nx
         tmp=a(i)**2+b(i)**2
         tmp=sqrt(tmp)
         c(i)=a(i)/tmp
         d(i)=b(i)/tmp
      enddo
!$omp end do
!$omp single
      write(*,*) maxval(c), maxval(b)
!$omp end single
!$omp do
      do j=1,ny
         tmp=a(j)**2+b(j)**2
         tmp=sqrt(tmp)
         c(j)=b(j)/tmp
         d(j)=a(j)/tmp
      enddo
!$omp end do
!$omp end parallel
Run the OpenMP code
Set environment variable: setenv OMP_NUM_THREADS 4
Compile: ifort -openmp -intel-static *.f -o openbbs1.e
Run: ./openbbs1.e
Scalability of OpenMP code
Ideally the speed-up should be linear in the number of threads, but initializing, finalizing, synchronization, etc. take time.
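One way to check the scaling is to time the parallel region itself; a minimal sketch using the standard OpenMP routine omp_get_wtime (the array size and loop body are arbitrary):

      program scaling_demo
      use omp_lib
      implicit none
      integer, parameter :: n = 1000000
      real*8, allocatable :: a(:)
      real*8 :: t0, t1
      integer :: i
      allocate(a(n))
      t0 = omp_get_wtime()
!$omp parallel do
      do i = 1, n
         a(i) = sqrt(dble(i))
      enddo
!$omp end parallel do
      t1 = omp_get_wtime()
      write(*,*) 'threads =', omp_get_max_threads(),
     &           ' wall time =', t1 - t0
      end program scaling_demo

Running it with OMP_NUM_THREADS set to 1, 2, 4, ... shows how close the speed-up stays to linear.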
MPI
Message Passing Interface: a specification for an API that allows many computers to communicate with one another.
Language-independent protocol, programmer interface, semantic specification.
History:
  1994 May, version 1.0, the final report of the MPI Forum
  1995 June, version 1.1
  1997 July, version 1.2 (MPI-1); version 2.0 (MPI-2)
  2008 May, version 1.3
  2008 June, version 2.1
  2009 Sept., version 2.2
Implementations: Open MPI, MPICH, HP MPI, Intel MPI, MS MPI, ...
Remark: Open MPI ≠ OpenMP
Coding with MPI
1: determine the number of blocks
2: define the virtual CPU topology
3: define the parallel region
4: assign tasks to the different threads
5: communication between threads
6: manage the threads: master-slave or non-master
Example of MPI coding

      include 'mpif.h'
      nx = 100                 ! number of grids
      ny = 100
      mx = 2                   ! number of blocks
      my = 5
      call MPI_INIT(ierr)      ! initialize the parallelization
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)    ! get id
! myid -> (myidx, myidy) -> the IDs of myid's neighbours (virtual topology)
      call MPI_SEND(vb, nx*2, MPI_REAL8, receiverid, tag,
     &              MPI_COMM_WORLD, ierr)               ! send data
      call MPI_RECV(va, nx*2, MPI_REAL8, senderid, tag,
     &              MPI_COMM_WORLD, status, ierr)       ! receive data
      ... ... ...
      call MPI_Finalize(ierr)  ! finalize the parallelization
CPU Virtual Topology
1. each thread has a unique ID;
2. each thread has more than one neighbour;
3. CPUs can be arranged as a one- or multi-dimensional array;
4. the topology should be as simple as possible.
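MPI's Cartesian-topology routines can do this bookkeeping; a minimal sketch for a 2 x 5 process grid (variable names are illustrative, error checking omitted):

      include 'mpif.h'
      integer :: ierr, myid, comm2d
      integer :: dims(2), coords(2), left, right, down, up
      logical :: periods(2)
      call MPI_INIT(ierr)
      dims    = (/ 2, 5 /)             ! mx x my blocks
      periods = (/ .false., .false. /) ! non-periodic boundaries
      call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods,
     &                     .true., comm2d, ierr)
      call MPI_COMM_RANK(comm2d, myid, ierr)
      call MPI_CART_COORDS(comm2d, myid, 2, coords, ierr) ! myidx,myidy
      call MPI_CART_SHIFT(comm2d, 0, 1, left, right, ierr) ! x nbrs
      call MPI_CART_SHIFT(comm2d, 1, 1, down, up, ierr)    ! y nbrs
      call MPI_FINALIZE(ierr)

Each rank then knows its own coordinates and the ranks of its neighbours, which is all the point-to-point exchange in the earlier example needs.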
MPI Communication
Point-to-point: one CPU to one CPU
Collective: one to many: broadcast, scatter, gather, reduce, etc.
Blocking: send and then check the receiving buffer
Non-blocking: send and return immediately
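For illustration, a short sketch of both flavours mentioned above (a collective broadcast plus a non-blocking send completed later with MPI_WAIT); the buffers, ranks and tag are made up:

      include 'mpif.h'
      integer :: ierr, myid, req, stat(MPI_STATUS_SIZE)
      real*8  :: dt, buf(100)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
! collective: rank 0 broadcasts dt to all ranks
      if (myid .eq. 0) dt = 1.d-3
      call MPI_BCAST(dt, 1, MPI_REAL8, 0, MPI_COMM_WORLD, ierr)
! non-blocking: post the send, overlap work, then wait
      if (myid .eq. 0) then
         call MPI_ISEND(buf, 100, MPI_REAL8, 1, 99,
     &                  MPI_COMM_WORLD, req, ierr)
!        ... computation overlapping the communication ...
         call MPI_WAIT(req, stat, ierr)
      else if (myid .eq. 1) then
         call MPI_RECV(buf, 100, MPI_REAL8, 0, 99,
     &                 MPI_COMM_WORLD, stat, ierr)
      endif
      call MPI_FINALIZE(ierr)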
Run the MPI code
Compile: mpif77 -O3 *.f -o mpimod4.e
Start mpd: mpdboot
Run the code: mpirun -n 7 mpimod4.e
CUDA
What's next? GPU supercomputing.
CUDA is also a do-loop based method: do-loop <==> CUDA subroutine (kernel)
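As a rough sketch of that mapping, here is what the first do-loop of the OpenMP example might look like as a CUDA Fortran kernel (requires the PGI/NVIDIA compiler; the names and launch configuration are illustrative, not taken from the talk):

      module norm_mod
      use cudafor
      contains
      attributes(global) subroutine norm_kernel(a, b, c, d, n)
      real*8 :: a(*), b(*), c(*), d(*)
      integer, value :: n
      integer :: i
      real*8 :: tmp
! one thread handles one former loop iteration
      i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      if (i .le. n) then
         tmp = sqrt(a(i)**2 + b(i)**2)
         c(i) = a(i) / tmp
         d(i) = b(i) / tmp
      endif
      end subroutine norm_kernel
      end module norm_mod

On the host side the loop body disappears; one launches the kernel over device copies of the arrays, e.g. call norm_kernel<<<(nx+255)/256, 256>>>(a_d, b_d, c_d, d_d, nx).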
Summary
Parallelization: three levels (compiler auto-parallelization, OpenMP, MPI)
  Employment: easy <---> difficult
  Scalability: inefficient <---> efficient?
Principles:
  do-loop based parallelization
  message passing
Thanks!!