
Parallel Architecture Models


Presentation Transcript


  1. Parallel Architecture Models
     - Shared Memory: Dual/Quad Pentium, Cray T90, IBM Power3 node
     - Distributed Memory: Cray T3E, IBM SP2, networks of workstations
     - Distributed-Shared Memory: SGI Origin 2000, Convex Exemplar

  2. Shared Memory Systems (SMP)
     [Figure: four processors (P), each with a cache (c), connected by a bus to a shared memory]
     - Any processor can access any memory location at equal cost (Symmetric Multi-Processor)
     - Tasks "communicate" by writing/reading common locations (see the sketch after this list)
     - Easier to program
     - Cannot scale beyond around 30 PEs (bus bottleneck)
     - Most workstation vendors make SMPs today (SGI, Sun, HP, Digital; Pentium)
     - Cray Y-MP, C90, T90 (cross-bar between PEs and memory)
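
As a minimal illustration of "communicating by writing/reading common locations", the free-form Fortran/OpenMP sketch below (not from the slides; the program name, the 64-slot bound, and the printed quantity are arbitrary choices) has each thread write its contribution into a shared array, which the initial thread then reads after the parallel region ends. No send or receive appears anywhere; the shared address space itself is the communication medium.

      program shared_demo
        use omp_lib
        implicit none
        integer, parameter :: maxt = 64        ! assumed upper bound on the thread count
        real    :: partial(maxt)
        integer :: t, nthreads

        partial  = 0.0
        nthreads = 1
        !$omp parallel private(t) shared(partial, nthreads)
        t = omp_get_thread_num() + 1           ! this thread's slot in the shared array
        !$omp single
        nthreads = omp_get_num_threads()       ! one thread records the team size
        !$omp end single
        partial(t) = real(t)                   ! each thread writes a common (shared) location
        !$omp end parallel
        ! after the join, the initial thread reads what every thread wrote
        print *, 'sum of thread contributions:', sum(partial(1:nthreads))
      end program shared_demo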

  3. Cache Coherence in SMPs
     [Figure: the same bus-based SMP, with processors (P) and caches (c) on a shared bus]
     - Each processor's cache holds the most recently accessed values
     - If a word cached in several caches is modified, all copies must be made consistent
     - Bus-based SMPs use an efficient mechanism for this: the snoopy bus
     - The snoopy bus monitors all writes and marks the other copies invalid
     - When a processor finds a cache word invalid, it fetches a fresh copy from shared memory (see the toy sketch after this list)
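
The following is a toy, purely illustrative sketch of the write-invalidate idea in ordinary sequential Fortran; it is not how the hardware is built, and every name in it is invented. A write by one "processor" marks the other cached copies invalid, and a later read of an invalid copy refetches the word from shared memory.

      program snoop_toy
        implicit none
        integer, parameter :: nproc = 4
        real    :: shared_mem            ! the single shared-memory word
        real    :: cache(nproc)          ! each processor's cached copy of it
        logical :: valid(nproc)          ! is that copy still valid?
        integer :: p
        real    :: x

        shared_mem = 1.0
        cache = shared_mem               ! every processor has read the word once
        valid = .true.

        call write_word(2, 5.0)          ! P2 writes a new value
        do p = 1, nproc
          x = read_word(p)
          print *, 'P', p, 'reads', x
        end do

      contains

        subroutine write_word(writer, val)
          integer, intent(in) :: writer
          real,    intent(in) :: val
          valid = .false.                ! the snoopy bus sees the write and
          valid(writer) = .true.         ! invalidates everyone else's copy
          cache(writer) = val
          shared_mem = val               ! write-through, purely for simplicity
        end subroutine write_word

        real function read_word(reader)
          integer, intent(in) :: reader
          if (.not. valid(reader)) then
            print *, '  P', reader, 'refetches the word from shared memory'
            cache(reader) = shared_mem
            valid(reader) = .true.
          end if
          read_word = cache(reader)
        end function read_word

      end program snoop_toy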

  4. Distributed Memory Systems
     [Figure: four nodes, each with a processor (P), cache (c), memory (M), and network interface card (NIC), joined by an interconnection network]
     - Each processor can only access its own memory
     - Explicit communication by sending and receiving messages (see the MPI sketch after this list)
     - More tedious to program
     - Can scale to hundreds/thousands of processors
     - Cache coherence is not needed
     - Examples: IBM SP-2, Cray T3E, workstation clusters
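
A minimal sketch of this explicit-communication style using MPI's Fortran bindings (the two-process exchange, message length, and tag are arbitrary choices; only the standard MPI_Init/MPI_Comm_rank/MPI_Send/MPI_Recv calls are assumed):

      program mpi_exchange
        use mpi
        implicit none
        integer :: ierr, me, nprocs
        integer :: status(MPI_STATUS_SIZE)
        real    :: mine(4), theirs(4)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

        mine   = real(me)            ! data that exists only in this process's own memory
        theirs = -1.0

        if (me == 0 .and. nprocs > 1) then
          call MPI_Send(mine,   4, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
          call MPI_Recv(theirs, 4, MPI_REAL, 1, 0, MPI_COMM_WORLD, status, ierr)
        else if (me == 1) then
          call MPI_Recv(theirs, 4, MPI_REAL, 0, 0, MPI_COMM_WORLD, status, ierr)
          call MPI_Send(mine,   4, MPI_REAL, 0, 0, MPI_COMM_WORLD, ierr)
        end if

        if (me <= 1 .and. nprocs > 1) print *, 'rank', me, 'received', theirs(1)
        call MPI_Finalize(ierr)
      end program mpi_exchange

Each array lives only in its owner's memory; the only way rank 1 ever sees rank 0's values is through the explicit receive.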

  5. Distributed Shared Memory
     [Figure: processors (P) with caches (c), each with a local memory (M), joined by an interconnection network]
     - Each processor can directly access any memory location
     - Physically distributed memory; many simultaneous accesses
     - Non-uniform memory access costs
     - Examples: Convex Exemplar, SGI Origin 2000
     - Complex hardware and high cost for cache coherence
     - Software DSM systems (e.g. TreadMarks) implement the shared-memory abstraction on top of distributed-memory systems

  6. Parallel Programming Models
     - Shared-address-space models: BSP (Bulk Synchronous Parallel), HPF (High Performance Fortran), OpenMP
     - Message passing (partitioned address space): PVM, MPI [Ch. 8 of I. Foster's book "Designing and Building Parallel Programs" (available online)]
     - Higher-level programming environments: PETSc (Portable, Extensible Toolkit for Scientific Computation), POOMA (Parallel Object-Oriented Methods and Applications)

  7. OpenMP
     - Standard sequential Fortran/C model
     - Single global view of data
     - Automatic parallelization by the compiler
     - User can provide loop-level directives
     - Easy to program
     - Only available on shared-memory machines

  8. High Performance Fortran (HPF)
     - Global shared address space, similar to the sequential programming model
     - User provides data-mapping directives
     - User can provide information on loop-level parallelism
     - Portable: available on all three types of architectures
     - Compiler automatically synthesizes message-passing code where needed
     - Restricted to dense arrays and regular distributions
     - Performance is not consistently good

  9. Message Passing
     - Program is a collection of tasks
     - Each task can only read/write its own data
     - Tasks communicate data by explicitly sending/receiving messages
     - Porting a sequential program requires translating the global shared view into a local partitioned view
     - Tedious to program/debug
     - Very good performance

  10. Illustrative Example: iterative 4-point averaging over arrays a(20,20) and b(20,20), global view (a complete runnable version follows below)

            Real a(n,n), b(n,n)
            Do k = 1, NumIter
              Do i = 2, n-1
                Do j = 2, n-1
                  a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) / 4
                End Do
              End Do
              Do i = 2, n-1
                Do j = 2, n-1
                  b(i,j) = a(i,j)
                End Do
              End Do
            End Do
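
The slide shows only the kernel. A complete, runnable sequential version also needs declarations and some initialization; the sketch below assumes n = 20 (matching the pictured arrays), an arbitrary NumIter = 100, and arbitrary initial values, none of which come from the slides.

      program smooth_seq
        implicit none
        integer, parameter :: n = 20         ! matches the pictured a(20,20), b(20,20)
        integer, parameter :: NumIter = 100  ! assumed; the slide leaves it unspecified
        real    :: a(n,n), b(n,n)
        integer :: i, j, k

        b = 0.0                              ! arbitrary data: hot boundary, cold interior
        b(1,:) = 1.0;  b(n,:) = 1.0;  b(:,1) = 1.0;  b(:,n) = 1.0
        a = b

        do k = 1, NumIter
          do i = 2, n-1
            do j = 2, n-1
              a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) / 4
            end do
          end do
          do i = 2, n-1
            do j = 2, n-1
              b(i,j) = a(i,j)
            end do
          end do
        end do
        print *, 'b(n/2,n/2) =', b(n/2, n/2)
      end program smooth_seq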

  11. Example: OpenMP (global shared view of data, a(20,20) and b(20,20))

            Real a(n,n), b(n,n)
      c$omp parallel shared(a,b) private(i,j,k)
            Do k = 1, NumIter
      c$omp do
              Do i = 2, n-1
                Do j = 2, n-1
                  a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) / 4
                End Do
              End Do
      c$omp do
              Do i = 2, n-1
                Do j = 2, n-1
                  b(i,j) = a(i,j)
                End Do
              End Do
            End Do
      c$omp end parallel
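
The slide uses fixed-form c$omp sentinels and omits the surrounding program. A complete free-form equivalent, under the same assumed n, NumIter, and initialization as the sequential sketch above, might look like this; the implicit barrier at the end of each omp do keeps the two loop nests, and successive k iterations, correctly ordered.

      program smooth_omp
        implicit none
        integer, parameter :: n = 20, NumIter = 100   ! assumed, as in the sequential sketch
        real    :: a(n,n), b(n,n)
        integer :: i, j, k

        b = 0.0
        b(1,:) = 1.0;  b(n,:) = 1.0;  b(:,1) = 1.0;  b(:,n) = 1.0
        a = b

        !$omp parallel shared(a, b) private(i, j, k)
        do k = 1, NumIter
          !$omp do
          do i = 2, n-1
            do j = 2, n-1
              a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) / 4
            end do
          end do
          !$omp do
          do i = 2, n-1
            do j = 2, n-1
              b(i,j) = a(i,j)
            end do
          end do
        end do
        !$omp end parallel

        print *, 'b(n/2,n/2) =', b(n/2, n/2)
      end program smooth_omp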

  12. Example: HPF, 1D partition (global shared view of data; rows of a(20,20) and b(20,20) block-distributed across P0-P3)

            Real a(n,n), b(n,n)
      chpf$ Distribute a(block,*), b(block,*)
            Do k = 1, NumIter
      chpf$ independent, new(j)
              Do i = 2, n-1
                Do j = 2, n-1
                  a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) / 4
                End Do
              End Do
      chpf$ independent, new(j)
              Do i = 2, n-1
                Do j = 2, n-1
                  b(i,j) = a(i,j)
                End Do
              End Do
            End Do

  13. Example: HPF, 2D partition (global shared view of data; a(20,20) and b(20,20) block-distributed in both dimensions)

            Real a(n,n), b(n,n)
      chpf$ Distribute a(block,block)
      chpf$ Distribute b(block,block)
            Do k = 1, NumIter
      chpf$ independent, new(j)
              Do i = 2, n-1
                Do j = 2, n-1
                  a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1)) / 4
                End Do
              End Do
      chpf$ independent, new(j)
              Do i = 2, n-1
                Do j = 2, n-1
                  b(i,j) = a(i,j)
                End Do
              End Do
            End Do

  14. Message Passing: Local View
      [Figure: the global shared view a(20,20)/b(20,20) is split by rows across P0-P3; each process holds a local partitioned view al(5,20)/bl(5,20), extended with ghost cells to bl(0:6,20); communication is required across the partition boundaries]
      - Global shared view vs. local partitioned view with ghost cells (an index-mapping sketch follows below)
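
For the pictured decomposition (P = 4 processes, n = 20 rows, so NdivP = 5 rows per process), the translation between the global shared view and the local partitioned view is just an index shift. The short sketch below (the sampled rows are arbitrary) prints which process owns a global row and what its local row index is, assuming rows are block-distributed in order and n is a multiple of P.

      program index_map
        implicit none
        integer, parameter :: n = 20, P = 4, NdivP = n / P   ! the pictured decomposition
        integer :: g

        do g = 1, n, 7                                       ! a few sample global rows
          print '(a,i3,a,i2,a,i2)', ' global row', g, &
                ' -> process P', (g - 1) / NdivP, &
                ', local row', g - ((g - 1) / NdivP) * NdivP
        end do
      end program index_map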

  15. Example: Message Passing (local partitioned view with ghost cells, al(5,20) and bl(0:6,20); ghost cells are communicated by message passing)

            Real al(NdivP,n), bl(0:NdivP+1,n)
            me = get_my_procnum()
            Do k = 1, NumIter
              if (me /= P-1) send(me+1, bl(NdivP,1:n))
              if (me /= 0)   recv(me-1, bl(0,1:n))
              if (me /= 0)   send(me-1, bl(1,1:n))
              if (me /= P-1) recv(me+1, bl(NdivP+1,1:n))
              i1 = 1
              if (me == 0)   i1 = 2
              i2 = NdivP
              if (me == P-1) i2 = NdivP - 1
              Do i = i1, i2
                Do j = 2, n-1
                  al(i,j) = (bl(i-1,j) + bl(i,j-1) + bl(i+1,j) + bl(i,j+1)) / 4
                End Do
              End Do
              ...
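
The send/recv above is generic pseudocode. One way it could map onto real MPI calls is sketched below: MPI_Sendrecv with MPI_PROC_NULL neighbours keeps the ghost-row exchange deadlock-free, rows are copied through contiguous buffers because a row section of a Fortran array is strided, and n, NumIter, the uniform initial values, and the even division n/P are assumptions carried over from the earlier sketches rather than anything stated on the slide.

      program smooth_mpi
        use mpi
        implicit none
        integer, parameter :: n = 20, NumIter = 100      ! assumed sizes, as in the earlier sketches
        integer :: ierr, me, P, NdivP, up, down
        integer :: status(MPI_STATUS_SIZE)
        integer :: i, j, k, i1, i2
        real, allocatable :: al(:,:), bl(:,:)
        real :: sbuf(n), rbuf(n)

        call MPI_Init(ierr)
        call MPI_Comm_rank(MPI_COMM_WORLD, me, ierr)
        call MPI_Comm_size(MPI_COMM_WORLD, P, ierr)
        NdivP = n / P                                    ! assumes P divides n evenly
        allocate(al(NdivP,n), bl(0:NdivP+1,n))
        bl = 1.0                                         ! arbitrary uniform initial values
        al = bl(1:NdivP,:)

        up   = me - 1;  if (me == 0)   up   = MPI_PROC_NULL
        down = me + 1;  if (me == P-1) down = MPI_PROC_NULL

        do k = 1, NumIter
          ! send my last interior row down, receive my upper ghost row from above
          sbuf = bl(NdivP,1:n)
          call MPI_Sendrecv(sbuf, n, MPI_REAL, down, 0, &
                            rbuf, n, MPI_REAL, up,   0, MPI_COMM_WORLD, status, ierr)
          if (up /= MPI_PROC_NULL) bl(0,1:n) = rbuf
          ! send my first interior row up, receive my lower ghost row from below
          sbuf = bl(1,1:n)
          call MPI_Sendrecv(sbuf, n, MPI_REAL, up,   1, &
                            rbuf, n, MPI_REAL, down, 1, MPI_COMM_WORLD, status, ierr)
          if (down /= MPI_PROC_NULL) bl(NdivP+1,1:n) = rbuf

          i1 = 1;      if (me == 0)   i1 = 2             ! skip the global boundary rows
          i2 = NdivP;  if (me == P-1) i2 = NdivP - 1
          do i = i1, i2
            do j = 2, n-1
              al(i,j) = (bl(i-1,j) + bl(i,j-1) + bl(i+1,j) + bl(i,j+1)) / 4
            end do
          end do
          bl(i1:i2, 2:n-1) = al(i1:i2, 2:n-1)            ! copy back, as in the sequential version
        end do

        if (me == 0) print *, 'bl(1,n/2) =', bl(1, n/2)
        call MPI_Finalize(ierr)
      end program smooth_mpi

Run with 4 processes, this reproduces the pictured al(5,20)/bl(0:6,20) layout.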

  16. Comparison of Models
      - Program porting/development effort: OpenMP = HPF << MPI
      - Portability across systems: HPF = MPI >> OpenMP (OpenMP is shared-memory only)
      - Applicability: MPI = OpenMP >> HPF (HPF handles only dense arrays)
      - Performance: MPI > OpenMP >> HPF

  17. PETSc
      - Higher-level parallel programming model
      - Aims to provide both ease of use and high performance for numerical PDE solution
      - Uses an efficient message-passing implementation underneath, but:
        - provides a global view of data arrays
        - the system takes care of the needed message passing
      - Portable across shared- and distributed-memory systems
