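Parallel Programming Concepts and the Computer Center's High-Performance Computing Environment • National Taiwan University Computer Center, Operations Management Division • Programmer: 張傑生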
Outline • Parallel programming concepts • Introduction • OpenMP • MPI • The Computer Center's high-performance computing environment
Advantages of Parallel Programming • Need to solve larger problems • more memory intensive • more computation • more data intensive • Parallel programming provides • more CPU resources • more memory resources • the ability to solve problems that were not possible with a serial program • the ability to solve problems more quickly
Parallel Computer Architectures • Two Basic Architectures • Shared Memory Computer • multiple processors • share a global memory space • processors can efficiently exchange/share data • Distributed Memory (e.g. Beowulf cluster) • collection of serial computers (nodes) • each node uses its own local memory • work together to solve a problem • communicate between nodes via messages • nodes are networked together • Latest Parallel Computers: mixed shared/distributed memory architecture • nodes have more than 1 processor • dual/quad core processors
Shared Memory Computer • Bottleneck • memory
Distributed Memory • Bottleneck • network
Parallel vs. Distributed Computing • Data dependency • Data exchange • Clock/execution synchronization • Parallel computation is usually confined to a single host or cluster • Avoids data-exchange latency • Avoids being held back by slow hosts
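How to Set Up a Parallel Programming Environment • Compiler, library • GCC, Intel compiler • Support both MPI and OpenMP • MPICH • Existing equipment is sufficient • Learning • Verifying program correctness • Performance is not a concern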
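How to Parallelize a Program • Only the user knows where the program's bottlenecks are • Algorithm • function, loop, I/O • First confirm that the serial code is correct • Prepare several sets of test data so that parallel results can be verified later • Floating-point precision issues • Consider changes to the algorithm • Partitioning the computation, partitioning the data • Evaluate the approach • MPI, OpenMP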
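The floating-point precision point matters when a parallel result is checked against the serial one: a parallel sum adds the same numbers in a different order, so the last few bits may differ even when both programs are correct. The following is a minimal sketch in plain C, assuming a contiguous split of the data; the function names `sum_forward`, `sum_chunked`, and `nearly_equal` are hypothetical and not part of the original slides.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reference: add the elements from left to right. */
double sum_forward(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Mimic a parallel run (assumption: contiguous chunks): each of `parts`
 * workers sums its own chunk, then the partial sums are combined.
 * The additions happen in a different order than in sum_forward. */
double sum_chunked(const double *x, size_t n, size_t parts)
{
    assert(parts > 0);
    double total = 0.0;
    for (size_t p = 0; p < parts; p++) {
        size_t lo = n * p / parts;
        size_t hi = n * (p + 1) / parts;
        double partial = 0.0;
        for (size_t i = lo; i < hi; i++)
            partial += x[i];
        total += partial;
    }
    return total;
}

/* Compare two results with a relative tolerance instead of ==. */
int nearly_equal(double a, double b, double rel_tol)
{
    double diff  = a > b ? a - b : b - a;
    double abs_a = a < 0 ? -a : a;
    double abs_b = b < 0 ? -b : b;
    double scale = abs_a > abs_b ? abs_a : abs_b;
    return diff <= rel_tol * scale;
}
```

When validating against the prepared test data, a check such as `nearly_equal(serial_sum, parallel_sum, 1e-12)` is usually more appropriate than exact equality; the tolerance is problem dependent and is an assumption here.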
OpenMP • OpenMP: An application programming interface (API) for parallel programming on multiprocessors • Compiler directives • Library of support functions • OpenMP works in conjunction with Fortran, C, or C++
Shared-memory Model • Processors interact and synchronize with each other through shared variables.
Fork/Join Parallelism • Initially only master thread is active • Master thread executes sequential code • Fork: Master thread creates or awakens additional threads to execute parallel code • Join: At end of parallel code created threads die or are suspended
Parallel for Loops • C programs often express data-parallel operations as for loops
    for (i = first; i < size; i += prime)
        marked[i] = 1;
• OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel • Compiler takes care of generating code that forks/joins threads and allocates the iterations to threads
Pragmas • Pragma: a compiler directive in C or C++ • Stands for “pragmatic information” • A way for the programmer to communicate with the compiler • Compiler free to ignore pragmas • Syntax: • #pragma omp <rest of pragma>
Hello World!
    #include <omp.h>
    #include <stdio.h>

    int main (int argc, char *argv[])
    {
        int id, nthreads;
        /* fork: every thread runs this block with its own copy of id */
        #pragma omp parallel private(id)
        {
            id = omp_get_thread_num();
            printf("Hello World from thread %d\n", id);
            /* wait until all threads have printed their greeting */
            #pragma omp barrier
            if ( id == 0 ) {
                nthreads = omp_get_num_threads();
                printf("There are %d threads\n", nthreads);
            }
        }
        return 0;
    }
Parallel for Pragma • Format:
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
Reductions • Reductions are so common that OpenMP provides support for them • May add reduction clause to parallel for pragma • Specify reduction operation and reduction variable • OpenMP takes care of storing partial results in private variables and combining partial results after the loop
Reductions Example
    #include <stdio.h>
    #define N 10000                /* size of a */
    void calculate(double a[]);    /* function that calculates the elements of a */

    int main(void)
    {
        int i;
        long w;
        double sum;
        static double a[N];
        calculate(a);
        sum = 0.0;
        /* forks off the threads and starts the work-sharing construct */
        #pragma omp parallel for private(w) reduction(+:sum) schedule(static,1)
        for (i = 0; i < N; i++) {
            w = (long)i * i;
            sum = sum + w * a[i];
        }
        printf("\n %f", sum);
        return 0;
    }
Functional Parallelism Example
    v = alpha();
    w = beta();
    x = gamma(v, w);
    y = delta();
    printf ("%6.2f\n", epsilon(x,y));
• May execute alpha, beta, and delta in parallel
Example of parallel sections
    #pragma omp parallel sections
    {
        #pragma omp section    /* Optional */
        v = alpha();
        #pragma omp section
        w = beta();
        #pragma omp section
        y = delta();
    }
    x = gamma(v, w);
    printf ("%6.2f\n", epsilon(x,y));
Incremental Parallelization • Sequential program a special case of a shared-memory parallel program • Parallel shared-memory programs may only have a single parallel loop • Incremental parallelization: process of converting a sequential program to a parallel program a little bit at a time
Pros and Cons of OpenMP • Pros • Simple: need not deal with message passing as MPI does • Data layout and decomposition are handled automatically by directives. • Incremental parallelism: can work on one portion of the program at a time; no dramatic change to the code is needed. • Unified code for both serial and parallel applications: OpenMP constructs are treated as comments when sequential compilers are used. • Original (serial) code statements need not, in general, be modified when parallelized with OpenMP. This reduces the chance of inadvertently introducing bugs. • Cons • Currently only runs efficiently on shared-memory multiprocessor platforms. • Requires a compiler that supports OpenMP. • Scalability is limited by memory architecture. • Reliable error handling is missing. • Lacks fine-grained mechanisms to control thread-processor mapping. • Synchronization between a subset of threads is not allowed.
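The "unified code" point can be illustrated with a single source file that builds either as a serial or as a parallel program. This is only a sketch: the helper names `worker_count` and `vector_add` are hypothetical, and it relies on the standard `_OPENMP` macro, which an OpenMP-enabled compiler defines.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>   /* only pulled in when the compiler has OpenMP enabled */
#endif

/* Upper bound on the threads a parallel region may use; 1 in a serial build. */
int worker_count(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* a[i] = b[i] + c[i]. A compiler without OpenMP support ignores the pragma,
 * so the very same loop then runs serially with identical results. */
void vector_add(double *a, const double *b, const double *c, int n)
{
    int i;
    assert(n >= 0);
#pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}
```

With GCC, for example, the parallel build is selected by adding `-fopenmp`; without that flag the same file compiles to a plain serial program.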
MPI • Message Passing Interface • A standard message passing library for parallel computers • MPI was designed for high performance on both massively parallel machines and on workstation clusters. • SPMD programming model • Single Program Multiple Data (SPMD) • A single program running on different sets of data.
Hello World!
    #include <stdio.h>
    #include <mpi.h>

    int main (int argc, char *argv[])
    {
        int rank, size;
        MPI_Init (&argc, &argv);                  /* starts MPI */
        MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* get current process id */
        MPI_Comm_size (MPI_COMM_WORLD, &size);    /* get number of processes */
        printf( "Hello world from process %d of %d\n", rank, size );
        MPI_Finalize();
        return 0;
    }
Output
    andes:~> mpirun -np 4 hello_world_c
    Hello world from process 1 of 4
    Hello world from process 2 of 4
    Hello world from process 3 of 4
    Hello world from process 0 of 4
• The order of the lines may differ from run to run
Initialization and Clean-up • MPI_Init • Initialize the MPI execution environment. • The first MPI routine in your program. • The argc and argv parameters are from the standard C command line interface. • MPI_Finalize • Terminate the MPI execution environment. • The last MPI routine in your program.
Configuration • MPI_Comm_size • Tells the number of processes in the system. • MPI_COMM_WORLD means all the processes in the system. • MPI_Comm_rank • Tells the rank of the calling process.
Communication • Point to point • MPI_Send, MPI_Recv • MPI_Send ((void *)&data, icount, DATA_TYPE, idest, itag, MPI_COMM_WORLD); • data: start of the data to send; may be a scalar or an array • icount: number of items to send; when icount is greater than 1, data must be an array • DATA_TYPE: type of the data to send • idest: CPU id of the receiver • itag: tag attached to the data being sent • Envelope • CPU id of the sender • CPU id of the receiver • data tag • communicator • Collective • Broadcast • Multicast
Serial Code • Compute a[i] and sum the results
    suma = 0.0;
    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i] * d[i];
        suma += a[i];
    }
    printf( "sum of A=%f\n", suma);
MPI code • The computation is partitioned but the data is not
Data distribution
    if ( myid == 0 ) {
        /* CPU 0 sends every other CPU its slice of b, c and d */
        for (idest = 1; idest < nproc; idest++) {
            istart1 = gstart[idest];
            icount1 = gcount[idest];
            itag = 10;
            MPI_Send ((void *)&b[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
            itag = 20;
            MPI_Send ((void *)&c[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
            itag = 30;
            MPI_Send ((void *)&d[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
        }
    } else {
        /* every other CPU receives its own slice from CPU 0 */
        icount = gcount[myid];
        isrc = 0;
        itag = 10;
        MPI_Recv ((void *)&b[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
        itag = 20;
        MPI_Recv ((void *)&c[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
        itag = 30;
        MPI_Recv ((void *)&d[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
    }
Compute and send back the results
    /* every CPU computes its own slice of a */
    for (i = istart; i <= iend; i++) {
        a[i] = b[i] + c[i] * d[i];
    }
    itag = 110;
    if (myid > 0) {
        /* send the slice back to CPU 0 */
        icount = gcount[myid];
        idest = 0;
        MPI_Send((void *)&a[istart], icount, MPI_DOUBLE, idest, itag, comm);
    } else {
        /* CPU 0 collects the slices from all other CPUs */
        for ( isrc = 1; isrc < nproc; isrc++ ) {
            icount1 = gcount[isrc];
            istart1 = gstart[isrc];
            MPI_Recv((void *)&a[istart1], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
        }
    }
Print the result
    if (myid == 0) {
        suma = 0.0;
        for (i = 0; i < n; i++)
            suma += a[i];
        printf( "sum of A=%f\n", suma);
    }
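The send/receive code above relies on `gstart`, `gcount`, `istart`, and `iend`, which the slides do not define. One plausible way to fill them in is a contiguous block partition, sketched below; the function name `block_partition` is hypothetical, and handing the remainder to the lowest-numbered CPUs is an assumption rather than something the slides specify.

```c
#include <assert.h>

/* Split n elements into nproc contiguous blocks.
 * gstart[p] is the first index owned by CPU p, gcount[p] the number of
 * elements it owns. When n is not divisible by nproc, the first
 * (n % nproc) CPUs each take one extra element (an assumed policy). */
void block_partition(int n, int nproc, int *gstart, int *gcount)
{
    assert(n >= 0 && nproc > 0);
    int base  = n / nproc;
    int extra = n % nproc;
    int next  = 0;
    for (int p = 0; p < nproc; p++) {
        gcount[p] = base + (p < extra ? 1 : 0);
        gstart[p] = next;
        next += gcount[p];
    }
}
```

Each CPU could then take `istart = gstart[myid]` and `iend = istart + gcount[myid] - 1` before entering the compute loop.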
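Deadlock Problem • CPU 0: SEND(1), then RECV(1) • CPU 1: SEND(0), then RECV(0) • If SEND blocks until the matching RECV has been posted, each CPU waits for the other and neither proceeds • Solutions • Order the SEND/RECV calls carefully • Use non-blocking functions • MPI_ISEND • MPI_IRECV • Use MPI_SENDRECV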
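MPI_Scatter • MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm); • Every segment must have the same length. The arguments are, in order: • t: start of the array to be sent • n: number of items sent to each CPU • MPI_DOUBLE: type of the data to be sent • b: start of the receive buffer; when n is greater than 1, b must be an array • n: number of items received • MPI_DOUBLE: type of the received data • iroot: CPU id of the sender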
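MPI_Reduce • MPI_Reduce ((void *)&suma, (void *)&sumall, count, MPI_DOUBLE, MPI_SUM, iroot, comm); • suma: the variable to be reduced (summed) • sumall: holds the result of the reduction (the sum of suma over all CPUs) • count: number of items to be reduced (summed) • MPI_DOUBLE: data type of suma and sumall • MPI_SUM: the reduction operation • iroot: CPU id that stores the result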
MPI code • Both the computation and the data are partitioned • The number of data items must be evenly divisible by the number of processes
    n = total / nproc;
    iroot = 0;
    MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);
    MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&c, n, MPI_DOUBLE, iroot, comm);
    MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&d, n, MPI_DOUBLE, iroot, comm);
MPI code
    suma = 0.0;
    for (i = istart; i <= iend; i++) {
        a[i] = b[i] + c[i] * d[i];
        suma = suma + a[i];
    }
    idest = 0;
    MPI_Gather((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);
    MPI_Reduce((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, idest, comm);
MPI code if(myid == 0) { printf( "sum of A=%f\n",sumall); }
Shared-memory Model vs. Message-passing Model (#1) • Shared-memory model • Number of active threads is 1 at start and finish of program, and changes dynamically during execution • Message-passing model • All processes active throughout execution of program
Shared-memory Model vs. Message-passing Model (#2) • Shared-memory model • Execute and profile sequential program • Incrementally make it parallel • Stop when further effort not warranted • Message-passing model • Sequential-to-parallel transformation requires major effort • Transformation done in one giant step rather than many tiny steps
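Computer Center High-Performance Computing Equipment • Installed: 2003/11 • Compute nodes: 50 • Nexcom Blade Server • Dual Intel Xeon 2.0GHz • 1GB memory • Performance: • Rpeak: 400GFlops • Rmax: 200GFlops • Planned to be repurposed for education and training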
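Computer Center High-Performance Computing Equipment • Installed: 2005/05 • Compute nodes: 78 • IBM Blade Server • Dual Intel Xeon 3.2GHz • 5GB memory • Performance: • Rpeak: 998GFlops • Rmax: 500GFlops • Current mainstay of the service • Suitable for: • Serial jobs (non-parallelized programs) • Programs already parallelized with MPI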
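Computer Center High-Performance Computing Equipment • Installed: 2006/11 • Compute node: • IBM p595 • 64*Power5 1.9GHz CPU • 256GB memory • AIX 5.3 • Performance: • Rpeak: 486GFlops • Rmax: 421GFlops • Current mainstay of the service • Suitable for: • Programs already parallelized with OpenMP • Programs with large memory requirements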
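The Rpeak figures on the three equipment slides can be reproduced with the usual peak-performance estimate: number of CPUs times clock rate times floating-point operations per cycle. The per-cycle values used below (2 for the Xeon systems, 4 for Power5) are assumptions chosen because they match the quoted numbers; the slides do not state them. The last line gives the ratio of Rmax to Rpeak for each system.

```latex
\begin{align*}
R_{\text{peak}} &= N_{\text{CPU}} \times f_{\text{clock}} \times (\text{FLOPs per cycle}) \\
\text{2003/11:}\quad & 50 \times 2 \times 2.0\,\text{GHz} \times 2 = 400\ \text{GFlops} \\
\text{2005/05:}\quad & 78 \times 2 \times 3.2\,\text{GHz} \times 2 = 998.4 \approx 998\ \text{GFlops} \\
\text{2006/11:}\quad & 64 \times 1.9\,\text{GHz} \times 4 = 486.4 \approx 486\ \text{GFlops} \\
\frac{R_{\max}}{R_{\text{peak}}} &:\quad \frac{200}{400} = 50\%,\quad \frac{500}{998} \approx 50\%,\quad \frac{421}{486} \approx 87\%
\end{align*}
```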