Parallel Architecture Models
- Shared Memory: Dual/Quad Pentium, Cray T90, IBM Power3 Node
- Distributed Memory: Cray T3E, IBM SP2, Network of Workstations
- Distributed-Shared Memory: SGI Origin 2000, Convex Exemplar
Shared Memory Systems (SMP)
[Figure: processors (P), each with a cache (c), connected by a bus to a single shared memory]
- Symmetric Multi-Processor: any processor can access any memory location at equal cost
- Tasks "communicate" by writing/reading common locations
- Easier to program
- Cannot scale beyond around 30 PEs (bus bottleneck)
- Most workstation vendors make SMPs today (SGI, Sun, HP, Digital; Pentium)
- Cray Y-MP, C90, T90 (cross-bar between PEs and memory)
Cache Coherence in SMPs
- Each processor's cache holds the most recently accessed values
- If a word cached in several caches is modified, all copies must be made consistent
- Bus-based SMPs use an efficient mechanism: the snoopy bus
- The snoopy bus monitors all writes and marks other copies invalid
- When a processor finds an invalid cache word, it fetches a fresh copy from shared memory
Distributed Memory Systems
[Figure: nodes, each with a processor (P), cache (c), memory (M), and network interface card (NIC), connected by an interconnection network]
- Each processor can only access its own memory
- Explicit communication by sending and receiving messages
- More tedious to program
- Can scale to hundreds/thousands of processors
- Cache coherence is not needed
- Examples: IBM SP-2, Cray T3E, workstation clusters
Distributed Shared Memory
[Figure: processors (P) with caches (c) and physically distributed memories (M) connected by an interconnection network]
- Each processor can directly access any memory location
- Physically distributed memory; many simultaneous accesses possible
- Non-uniform memory access costs
- Examples: Convex Exemplar, SGI Origin 2000
- Complex hardware and high cost for cache coherence
- Software DSM systems (e.g. TreadMarks) implement the shared-memory abstraction on top of distributed-memory systems
Parallel Programming Models
- Shared-Address-Space Models: BSP (Bulk Synchronous Parallel model), HPF (High Performance Fortran), OpenMP
- Message Passing (partitioned address space): PVM, MPI [Ch. 8, I. Foster's book: Designing and Building Parallel Programs (available online)]
- Higher-Level Programming Environments: PETSc (Portable Extensible Toolkit for Scientific computation), POOMA (Parallel Object-Oriented Methods and Applications)
OpenMP
- Standard sequential Fortran/C model
- Single global view of data
- Automatic parallelization by compiler
- User can provide loop-level directives
- Easy to program
- Only available on shared-memory machines
High Performance Fortran (HPF)
- Global shared address space, similar to the sequential programming model
- User provides data mapping directives
- User can provide information on loop-level parallelism
- Portable: available on all three types of architectures
- Compiler automatically synthesizes message-passing code if needed
- Restricted to dense arrays and regular distributions
- Performance is not consistently good
Message Passing
- Program is a collection of tasks
- Each task can only read/write its own data
- Tasks communicate data by explicitly sending/receiving messages
- Porting a sequential program requires translating from the global shared view to a local partitioned view
- Tedious to program/debug
- Very good performance
Illustrative Example

      Real a(n,n), b(n,n)
      Do k = 1, NumIter
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

[Figure: arrays a(20,20) and b(20,20) shown as global grids]
Example: OpenMP (global shared view of data)

      Real a(n,n), b(n,n)
c$omp parallel shared(a,b) private(i,j,k)
      Do k = 1, NumIter
c$omp do
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
c$omp do
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do
c$omp end parallel

Global shared view of data: a(20,20) and b(20,20) remain single global arrays.
Example: HPF (1D partition)

      Real a(n,n), b(n,n)
chpf$ Distribute a(block,*)
chpf$ Distribute b(block,*)
      Do k = 1, NumIter
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

Global shared view of data: the rows of a(20,20) and b(20,20) are distributed block-wise across P0-P3.
Example: HPF (2D partition)

      Real a(n,n), b(n,n)
chpf$ Distribute a(block,block)
chpf$ Distribute b(block,block)
      Do k = 1, NumIter
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            a(i,j) = (b(i-1,j) + b(i,j-1) + b(i+1,j) + b(i,j+1))/4
          End Do
        End Do
chpf$ independent, new(j)
        Do i = 2, n-1
          Do j = 2, n-1
            b(i,j) = a(i,j)
          End Do
        End Do
      End Do

Global shared view of data: a(20,20) and b(20,20) are distributed in 2D blocks across the processors.
Message Passing: Local View
[Figure: the global shared view of a(20,20) and b(20,20) is partitioned row-wise across P0-P3; each processor holds local arrays al(5,20) and bl(0:6,20), where the extra rows of bl are ghost cells that must be filled by communication with the neighboring processors]
Example: Message Passing (local partitioned view, with ghost cells filled by message passing)

      Real al(NdivP,n), bl(0:NdivP+1,n)
      me = get_my_procnum()
      Do k = 1, NumIter
        if (me .ne. P-1) send(me+1, bl(NdivP,1:n))
        if (me .ne. 0)   recv(me-1, bl(0,1:n))
        if (me .ne. 0)   send(me-1, bl(1,1:n))
        if (me .ne. P-1) recv(me+1, bl(NdivP+1,1:n))
        i1 = 1
        if (me .eq. 0) i1 = 2
        i2 = NdivP
        if (me .eq. P-1) i2 = NdivP-1
        Do i = i1, i2
          Do j = 2, n-1
            al(i,j) = (bl(i-1,j) + bl(i,j-1) + bl(i+1,j) + bl(i,j+1))/4
          End Do
        End Do
        ……...

For n=20 and P=4 processors each processor holds al(5,20) and bl(0:6,20); send and recv stand for the message-passing primitives of the underlying library.
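The send/recv calls above are schematic. As an illustration (not part of the original slides), the sketch below shows how one of the two ghost-row exchanges could be written with actual MPI Fortran calls; it assumes n = 20 and exactly 4 processes (so NdivP = 5, as in the figure), copies the strided row sections through a contiguous buffer, and uses an arbitrary message tag of 0.

      program halo_sketch
c     A minimal sketch, not the course's code: one of the two ghost-row
c     exchanges from the slide, written with real MPI Fortran calls.
c     Assumes n = 20 and exactly 4 processes, so NdivP = n/4 = 5.
      include 'mpif.h'
      integer n, NdivP
      parameter (n = 20, NdivP = 5)
      real bl(0:NdivP+1,n), buf(n)
      integer me, P, ierr, status(MPI_STATUS_SIZE)

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, me, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, P, ierr)
      bl = real(me)

c     Send my last owned row to the next processor, where it becomes
c     the ghost row bl(0,1:n); rows are strided in memory, so they are
c     copied through the contiguous buffer buf.
      if (me .ne. P-1) then
        buf(1:n) = bl(NdivP,1:n)
        call MPI_SEND(buf, n, MPI_REAL, me+1, 0, MPI_COMM_WORLD, ierr)
      end if
      if (me .ne. 0) then
        call MPI_RECV(buf, n, MPI_REAL, me-1, 0, MPI_COMM_WORLD,
     &                status, ierr)
        bl(0,1:n) = buf(1:n)
      end if
c     The exchange in the other direction (bl(1,1:n) sent to me-1, into
c     that processor's bl(NdivP+1,1:n)) is symmetric.

      call MPI_FINALIZE(ierr)
      end

Production halo-exchange code would more likely use MPI_SENDRECV or non-blocking MPI_ISEND/MPI_IRECV to avoid serializing the shifts, but the blocking form above matches the structure of the slide.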
Comparison of Models
- Program porting/development effort: OpenMP = HPF << MPI
- Portability across systems: HPF = MPI >> OpenMP (OpenMP runs only on shared memory)
- Applicability: MPI = OpenMP >> HPF (HPF handles only dense arrays)
- Performance: MPI > OpenMP >> HPF
PETSc
- Higher-level parallel programming model
- Aims to provide both ease of use and high performance for numerical PDE solution
- Uses an efficient message-passing implementation underneath, but:
  - provides a global view of data arrays
  - the system takes care of the needed message passing
- Portable across shared- and distributed-memory systems
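To make the global-view idea concrete, here is a minimal sketch (not taken from the course material) of the style of code a PETSc user writes: the matrix, vectors, and Krylov solver are single global objects, and PETSc distributes them across processes and performs the message passing internally. The sketch assumes a reasonably recent PETSc (3.8 or later) Fortran interface; header paths and some call signatures differ between PETSc versions, and the tiny diagonal system solved here is only a placeholder for a real PDE discretization.

program petsc_sketch
! A hedged sketch assuming the PETSc (>= 3.8) Fortran interface; it is
! not part of the original slides.  A x = b is solved for a placeholder
! diagonal matrix; a real application would assemble a PDE operator.
#include <petsc/finclude/petscksp.h>
  use petscksp
  implicit none
  Vec            x, b
  Mat            A
  KSP            ksp
  PetscErrorCode ierr
  PetscInt       n, i, Istart, Iend, row(1), col(1)
  PetscScalar    v(1), one

  n = 100
  call PetscInitialize(PETSC_NULL_CHARACTER, ierr)

  ! The matrix is one global n x n object; PETSc decides how to split
  ! its rows across the processes in PETSC_COMM_WORLD.
  call MatCreate(PETSC_COMM_WORLD, A, ierr)
  call MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n, ierr)
  call MatSetFromOptions(A, ierr)
  call MatSetUp(A, ierr)

  ! Each process fills only the rows it owns (global indices Istart..Iend-1).
  call MatGetOwnershipRange(A, Istart, Iend, ierr)
  do i = Istart, Iend-1
     row(1) = i
     col(1) = i
     v(1)   = 2.0
     call MatSetValues(A, 1, row, 1, col, v, INSERT_VALUES, ierr)
  end do
  call MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY, ierr)
  call MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY, ierr)

  ! Global vectors x and b, distributed the same way as the matrix rows.
  call VecCreate(PETSC_COMM_WORLD, x, ierr)
  call VecSetSizes(x, PETSC_DECIDE, n, ierr)
  call VecSetFromOptions(x, ierr)
  call VecDuplicate(x, b, ierr)
  one = 1.0
  call VecSet(b, one, ierr)

  ! Krylov solver: the method and preconditioner can be chosen at run
  ! time with command-line options such as -ksp_type and -pc_type.
  call KSPCreate(PETSC_COMM_WORLD, ksp, ierr)
  call KSPSetOperators(ksp, A, A, ierr)
  call KSPSetFromOptions(ksp, ierr)
  call KSPSolve(ksp, b, x, ierr)

  call KSPDestroy(ksp, ierr)
  call VecDestroy(x, ierr)
  call VecDestroy(b, ierr)
  call MatDestroy(A, ierr)
  call PetscFinalize(ierr)
end program petsc_sketch

Such a program is launched like any MPI program (e.g. mpiexec -n 4 ./petsc_sketch), and the same source runs unchanged on shared-memory and distributed-memory systems, which is the portability point made above.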