Parallel Architecture Models
- Shared Memory: Dual/Quad Pentium, Cray T90, IBM Power3 Node
- Distributed Memory: Cray T3E, IBM SP2, Network of Workstations
- Distributed-Shared Memory: SGI Origin 2000, Convex Exemplar
Shared Memory Systems (SMP)
[Figure: processors (P), each with a cache (c), connected by a bus to a single shared memory]
- Symmetric Multi-Processor: any processor can access any memory location at equal cost
- Tasks "communicate" by writing/reading common locations
- Easier to program
- Cannot scale beyond around 30 PEs (bus bottleneck)
- Most workstation vendors make SMPs today (SGI, Sun, HP, Digital; Pentium)
- Cray Y-MP, C90, T90 (cross-bar between PEs and memory)
Cache Coherence in SMPs
- Each processor's cache holds the most recently accessed values
- If a word cached in several caches is modified, all copies must be made consistent
- Bus-based SMPs use an efficient mechanism: the snoopy bus
- The snoopy bus monitors all writes and marks other copies invalid
- When a processor finds an invalid cache word, it fetches a fresh copy from shared memory
Distributed Memory Systems
[Figure: nodes, each with a processor (P), cache (c), memory (M), and network interface card (NIC), connected by an interconnection network]
- Each processor can only access its own memory
- Explicit communication by sending and receiving messages
- More tedious to program
- Can scale to hundreds/thousands of processors
- Cache coherence is not needed
- Examples: IBM SP-2, Cray T3E, workstation clusters
Distributed Shared Memory
[Figure: processors (P) with caches (c) and physically distributed memories (M) connected by an interconnection network]
- Each processor can directly access any memory location
- Physically distributed memory; many simultaneous accesses possible
- Non-uniform memory access costs
- Examples: Convex Exemplar, SGI Origin 2000
- Complex hardware and high cost for cache coherence
- Software DSM systems (e.g. TreadMarks) implement the shared-memory abstraction on top of distributed-memory systems
Parallel Programming Models
- Shared-Address-Space Models: BSP (Bulk Synchronous Parallel model), HPF (High Performance Fortran), OpenMP
- Message Passing (partitioned address space): PVM, MPI [Ch. 8, I. Foster's book: Designing and Building Parallel Programs (available online)]
- Higher-Level Programming Environments: PETSc (Portable Extensible Toolkit for Scientific computation), POOMA (Parallel Object-Oriented Methods and Applications)
OpenMP
- Standard sequential Fortran/C model
- Single global view of data
- Automatic parallelization by compiler
- User can provide loop-level directives
- Easy to program
- Only available on shared-memory machines
High Performance Fortran (HPF)
- Global shared address space, similar to the sequential programming model
- User provides data mapping directives
- User can provide information on loop-level parallelism
- Portable: available on all three types of architectures
- Compiler automatically synthesizes message-passing code if needed
- Restricted to dense arrays and regular distributions
- Performance is not consistently good
Message Passing
- Program is a collection of tasks
- Each task can only read/write its own data
- Tasks communicate data by explicitly sending/receiving messages
- Porting a sequential program requires translating from the global shared view to a local partitioned view
- Tedious to program/debug
- Very good performance
Illustrative Example

      Real a(n,n), b(n,n)
      Do k = 1, NumIter
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

[Figure: arrays a(20,20) and b(20,20) shown as global grids]
Example: OpenMP (global shared view of data)

      Real a(n,n), b(n,n)
c$omp parallel shared(a,b) private(i,j,k)
      Do k = 1, NumIter
c$omp do
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
c$omp do
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do
c$omp end parallel

Global shared view of data: a(20,20) and b(20,20) remain single global arrays.
Example: HPF (1D partition)

      Real a(n,n), b(n,n)
chpf$ Distribute a(block,*)
chpf$ Distribute b(block,*)
      Do k = 1, NumIter
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

Global shared view of data: the rows of a(20,20) and b(20,20) are distributed block-wise across P0-P3.
Example: HPF (2D partition)

      Real a(n,n), b(n,n)
chpf$ Distribute a(block,block)
chpf$ Distribute b(block,block)
      Do k = 1, NumIter
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

Global shared view of data: a(20,20) and b(20,20) are distributed in 2D blocks across the processors.
Message Passing: Local View
[Figure: the global shared view of a(20,20) and b(20,20) is partitioned row-wise across P0-P3; each processor holds local arrays al(5,20) and bl(0:6,20), where the extra rows of bl are ghost cells that must be filled by communication with the neighboring processors]
Example: Message Passing (local partitioned view, with ghost cells filled by message passing)

      Real al(NdivP,n), bl(0:NdivP+1,n)
      me = get_my_procnum()
      Do k = 1, NumIter
        if (me .ne. P-1) send(me+1, bl(NdivP,1:n))
        if (me .ne. 0)   recv(me-1, bl(0,1:n))
        if (me .ne. 0)   send(me-1, bl(1,1:n))
        if (me .ne. P-1) recv(me+1, bl(NdivP+1,1:n))
        i1 = 1
        if (me .eq. 0) i1 = 2
        i2 = NdivP
        if (me .eq. P-1) i2 = NdivP-1
        Do i = i1, i2
          Do j = 2, n-1
            al(i,j) = (bl(i-1,j) + bl(i,j-1) + bl(i+1,j) + bl(i,j+1))/4
          End Do
        End Do
        ……...

For n=20 and P=4 processors each processor holds al(5,20) and bl(0:6,20); send and recv stand for the message-passing primitives of the underlying library.
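The send/recv calls above are schematic. As an illustration (not part of the original slides), the sketch below shows how one of the two ghost-row exchanges could be written with actual MPI Fortran calls; it assumes n = 20 and exactly 4 processes (so NdivP = 5, as in the figure), copies the strided row sections through a contiguous buffer, and uses an arbitrary message tag of 0.

      program halo_sketch
c     A minimal sketch, not the course's code: one of the two ghost-row
c     exchanges from the slide, written with real MPI Fortran calls.
c     Assumes n = 20 and exactly 4 processes, so NdivP = n/4 = 5.
      include 'mpif.h'
      integer n, NdivP
      parameter (n = 20, NdivP = 5)
      real bl(0:NdivP+1,n), buf(n)
      integer me, P, ierr, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, P, ierr)
      bl = real(me)

c     Send my last owned row to the next processor, where it becomes
c     the ghost row bl(0,1:n); rows are strided in memory, so they are
c     copied through the contiguous buffer buf.
      if (me .ne. P-1) then
        buf(1:n) = bl(NdivP,1:n)
        call MPI_SEND(buf, n, MPI_REAL, me+1, 0, MPI_COMM_WORLD, ierr)
      end if
      if (me .ne. 0) then
        call MPI_RECV(buf, n, MPI_REAL, me-1, 0, MPI_COMM_WORLD,
     &                status, ierr)
        bl(0,1:n) = buf(1:n)
      end if
c     The exchange in the other direction (bl(1,1:n) sent to me-1, into
c     that processor's bl(NdivP+1,1:n)) is symmetric.

      call MPI_FINALIZE(ierr)
      end

Production halo-exchange code would more likely use MPI_SENDRECV or non-blocking MPI_ISEND/MPI_IRECV to avoid serializing the shifts, but the blocking form above matches the structure of the slide.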
Comparison of Models
- Program porting/development effort: OpenMP = HPF << MPI
- Portability across systems: HPF = MPI >> OpenMP (OpenMP runs only on shared memory)
- Applicability: MPI = OpenMP >> HPF (HPF handles only dense arrays)
- Performance: MPI > OpenMP >> HPF
PETSc
- Higher-level parallel programming model
- Aims to provide both ease of use and high performance for numerical PDE solution
- Uses an efficient message-passing implementation underneath, but:
  - provides a global view of data arrays
  - the system takes care of the needed message passing
- Portable across shared- and distributed-memory systems
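To make the global-view idea concrete, here is a minimal sketch (not taken from the course material) of the style of code a PETSc user writes: the matrix, vectors, and Krylov solver are single global objects, and PETSc distributes them across processes and performs the message passing internally. The sketch assumes a reasonably recent PETSc (3.8 or later) Fortran interface; header paths and some call signatures differ between PETSc versions, and the tiny diagonal system solved here is only a placeholder for a real PDE discretization.

program petsc_sketch
! A hedged sketch assuming the PETSc (>= 3.8) Fortran interface; it is
! not part of the original slides.  A x = b is solved for a placeholder
! diagonal matrix; a real application would assemble a PDE operator.
#include <petsc/finclude/petscksp.h>
  use petscksp
  implicit none
  Vec            x, b
  Mat            A
  KSP            ksp
  PetscErrorCode ierr
  PetscInt       n, i, Istart, Iend, row(1), col(1)
  PetscScalar    v(1), one

  n = 100
  call PetscInitialize(PETSC_NULL_CHARACTER, ierr)

  ! The matrix is one global n x n object; PETSc decides how to split
  ! its rows across the processes in PETSC_COMM_WORLD.
  call MatCreate(PETSC_COMM_WORLD, A, ierr)
  call MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n, ierr)
  call MatSetFromOptions(A, ierr)
  call MatSetUp(A, ierr)

  ! Each process fills only the rows it owns (global indices Istart..Iend-1).
  call MatGetOwnershipRange(A, Istart, Iend, ierr)
  do i = Istart, Iend-1
     row(1) = i
     col(1) = i
     v(1)   = 2.0
     call MatSetValues(A, 1, row, 1, col, v, INSERT_VALUES, ierr)
  end do
  call MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY, ierr)
  call MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY, ierr)

  ! Global vectors x and b, distributed the same way as the matrix rows.
  call VecCreate(PETSC_COMM_WORLD, x, ierr)
  call VecSetSizes(x, PETSC_DECIDE, n, ierr)
  call VecSetFromOptions(x, ierr)
  call VecDuplicate(x, b, ierr)
  one = 1.0
  call VecSet(b, one, ierr)

  ! Krylov solver: the method and preconditioner can be chosen at run
  ! time with command-line options such as -ksp_type and -pc_type.
  call KSPCreate(PETSC_COMM_WORLD, ksp, ierr)
  call KSPSetOperators(ksp, A, A, ierr)
  call KSPSetFromOptions(ksp, ierr)
  call KSPSolve(ksp, b, x, ierr)

  call KSPDestroy(ksp, ierr)
  call VecDestroy(x, ierr)
  call VecDestroy(b, ierr)
  call MatDestroy(A, ierr)
  call PetscFinalize(ierr)
end program petsc_sketch

Such a program is launched like any MPI program (e.g. mpiexec -n 4 ./petsc_sketch), and the same source runs unchanged on shared-memory and distributed-memory systems, which is the portability point made above.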