
Computational Methods in Astrophysics



  1. Computational Methods in Astrophysics Dr Rob Thacker (AT319E) thacker@ap

  2. Today’s Lecture Distributed Memory Computing I • Key concepts – • Differences between shared & distributed memory • Message passing • A few network details • General comment – the overall computing model has not changed in decades, but the APIs have…

  3. API Evolution • From the 1980s through to the early 2000s much HPC evolution was driven by the math & physics communities • Notable focus on regular arrays and data structures • Big forums working on standards, etc. • Starting in the 2000s the growth of data analytics + computational biology introduced different requirements • C++, Java or Python • Irregular data, able to start designs from scratch We’ll start with the traditional viewpoint/API - very broad use in astrophysics. Then we’ll look at alternatives later.

  4. Shared vs distributed memory • The key difference is data decomposition • Commonly called “domain decomposition” • Numerous possible ways to break up data space • Each has different compromises in terms of the required communication patterns that result • The comms pattern determines the overall complexity of the parallel code • The decomposition can be handled in implicit or explicit ways

  5. Parallel APIs from the decomposition–communication perspective [Slide diagram: a 2×2 chart with Decomposition (implicit vs explicit) on one axis and Communication (implicit vs explicit) on the other. MPI, PVM, SHMEM and CAF/UPC lie toward the explicit-communication side, HPF and OpenMP toward the implicit side; the message passing APIs operate effectively on distributed memory architectures, while OpenMP is shared memory only]

  6. Message Passing • Concept of sequential processes communicating via messages was developed by Hoare in the 70’s • Hoare, CAR, Comm ACM, 21, 666 (1978) • Each process has its own local memory store • Remote data needs are served by passing messages containing the desired data • Naturally carries over to distributed memory architectures • Two ways of expressing message passing: • Coordination of message passing at the language level (e.g. Occam) • Calls to a message passing library • Two types of message passing • Point-to-point (one-to-one) • Broadcast (one-to-all,all-to-all)

  7. Broadcast versus point-to-point • Broadcast (one-to-all) is a collective operation: it involves a group of processes • Point-to-point (one-to-one) is a non-collective operation: it involves a pair of processes [Slide diagram: four processes, shown once with a broadcast from one process to all the others and once with a single message passing between one pair of processes]

  8. Message passing APIs • Message passing APIs dominate • Often reflect underlying hardware design • Legacy codes can frequently be converted more easily • Allows explicit management of memory hierarchy • Message Passing Interface (MPI) is the predominant API • Parallel Virtual Machine (PVM) is an earlier API that possesses some useful features over MPI • Useful paradigm for heterogeneous systems; there’s even a Python version

  9. http://www.csm.ornl.gov/pvm/ PVM – An overview • API can be traced back to 1989(!) • Geist & Sunderam developed experimental version • Daemon based • Each host runs a daemon that controls resources • Processes can be dynamically created and destroyed • PVM Console • Each user may actively configure their host environment • Process groups for domain decomposition • PVM group server controls this aspect • Limited number of collective operations • Barriers, broadcast, reduction • Roughly 40 functions in the API

  10. PVM API and programming model • PVM most naturally fits a master-worker model • Master process responsible for I/O • Workers are spawned by master • Each process has a unique identifier • Messages are typed and tagged • System is aware of data-type, allows easy portability across heterogeneous network • Messages are passed via a three-phase process • Clear (initialize) buffer • Pack buffer • Send buffer

  11. Example code

      tid = pvm_mytid();
      if (tid == source) {                         /* Sender */
          bufid = pvm_initsend(PvmDataDefault);    /* clear (initialize) buffer */
          info  = pvm_pkint(&i1, 1, 1);            /* pack buffer */
          info  = pvm_pkfloat(vec1, 2, 1);
          info  = pvm_send(dest, tag);             /* send buffer */
      } else if (tid == dest) {                    /* Receiver */
          bufid = pvm_recv(source, tag);
          info  = pvm_upkint(&i2, 1, 1);           /* unpack in the same order */
          info  = pvm_upkfloat(vec2, 2, 1);
      }

  12. MPI – An overview • API can be traced back to 1992 • First unofficial meeting of MPI forum at Supercomputing 92 • Mechanism for creating processes is not specified within API • Different mechanism on different platforms • MPI 1.x standard does not allow for creating or destroying processes • Process groups central to parallel model • ‘Communicators’ • Richer set of collective operations than PVM • Derived data-types important advance • Can specify a data-type to control pack-unpack step implicitly • 125 functions in the API (v1.0)

  13. MPI API and programming model • More naturally a true SPMD type programming model • Oriented toward HPC applications • Master-worker model can still be implemented effectively • As for PVM, each process has a unique identifier • Messages are typed, tagged and flagged with a communicator • Messaging can be a single stage operation • Can send specific variables without need for packing • Packing is still an option

  14. Remote Direct Memory Access • Message passing involves a number of expensive operations: • CPUs must be involved (possibly OS kernel too) • Buffers are often required • RDMA cuts down on the CPU overhead • CPU sets up channels for the DMA engine to write directly to the buffer and avoid constantly taxing the CPU • Frequently discussed under the “zero-copy” euphemism • Message passing APIs have been designed around this concept (but usually called remote memory access) • Cray SHMEM
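
As an illustration of the one-sided model, here is a minimal OpenSHMEM-style sketch in C (an assumption on my part: the lecture names Cray SHMEM, whereas the calls below follow the OpenSHMEM interface; values and variable names are illustrative). PE 0 writes directly into PE 1's memory and PE 1 never posts a matching receive.

      #include <shmem.h>
      #include <stdio.h>

      int main(void)                          /* run with at least 2 PEs */
      {
          static int target = 0;              /* static => symmetric: exists on every PE */
          shmem_init();
          int me = shmem_my_pe();

          if (me == 0) {
              int value = 42;
              shmem_int_put(&target, &value, 1, 1);   /* put one int into PE 1's memory */
          }
          shmem_barrier_all();                /* complete the put and synchronize */
          if (me == 1) printf("PE 1 sees target = %d\n", target);

          shmem_finalize();
          return 0;
      }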

  15. RDMA illustrated [Slide diagram: HOST A and HOST B, each with a memory/buffer, a CPU and a NIC with an RDMA engine; the packet travels from one host's memory through the NICs directly into the other host's memory/buffer, bypassing the CPUs]

  16. Networking issues • Networks have played a profound role in the evolution of parallel APIs • Examine network fundamentals in more detail • Provides better understanding of programming issues • Reasons for library design (especially RDMA)

  17. OSI network model • Grew out of 1982 attempt by ISO to develop Open Systems Interconnect (too many vendor proprietary protocols at that time) • Motivated from theoretical rather than practical standpoint • System of layers taken together = protocol stack • Each layer communicates with its peer layer on the remote host • Proposed stack was too complex and had too much freedom: not adopted • e.g. X.400 email standard required several books of definitions • Simplified Internet TCP/IP protocol stack eventually grew out of the OSI model • e.g. SMTP email standard takes a few pages

  18. Conceptual structure of OSI network
      Layer 7. Application (http, ftp, …)    [upper level]
      Layer 6. Presentation (data std)
      Layer 5. Session (application)
      Layer 4. Transport (TCP, UDP, ...)     [data transfer]
      Layer 3. Network (IP, …)               [routing]
      Layer 2. Data link (Ethernet, …)       [lower level]
      Layer 1. Physical (signal)

  19. Internet Protocol Suite • Protocol stack on which the internet runs • Occasionally called TCP/IP protocol stack • Doesn’t map perfectly to OSI model • OSI model lacks richness at lower levels • Motivated by engineering rather than concepts • Higher levels of OSI model were mapped into a single application layer • Expanded some layering concepts within the OSI model (e.g. internetworking was added to the network layer)

  20. Internet Protocol Suite
      “Layer 7” Application    e.g. FTP, HTTP, DNS
      Layer 4. Transport       e.g. TCP, UDP, RTP, SCTP
      Layer 3. Network         IP
      Layer 2. Data link       e.g. Ethernet, token ring
      Layer 1. Physical        e.g. T1, E1

  21. Internet Protocol (IP) • Data-oriented protocol used by hosts for communicating data across a packet-switched inter-network • Addressing and routing are handled at this level • IP sends and receives data between two IP addresses • Data segment = packet (or datagram) • Packet delivery is unreliable – packets may arrive corrupted, duplicated, out of order, or not at all • Lack of delivery guarantees allows fast switching

  22. IP Addressing • On an ethernet network routing at the data link layer occurs between 6 byte MAC (Media Access Control) addresses • IP adds its own configurable address scheme on top of this • 4 byte address, expressed as 4 decimals on 0-255 • Note 0 and 255 are both reserved numbers • Division of numbers determines network number versus node • Subnet masks determine how these are divided • Classes of networks are described by the first number in the IP address and the number of network addresses • [192:255].35.91.* = class C network (254 hosts) (subnet mask 255.255.255.0) • [128:191].132.*.* = class B network (65,534 hosts) ( “ 255.255.0.0) • [1:126].*.*.* = class A network (16 million hosts) ( “ 255.0.0.0) Note the 35.91 in the class C example, and the 132. in the class B example can be different, but are filled in to show how the network address is defined
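
As a worked illustration of how a subnet mask splits an address into network and node parts, a small C sketch (the address 192.168.35.91 and the class C style mask 255.255.255.0 are hypothetical values chosen for the example):

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
          uint32_t ip   = (192u<<24) | (168u<<16) | (35u<<8) | 91u;   /* 192.168.35.91 */
          uint32_t mask = 0xFFFFFF00u;                                /* 255.255.255.0 */
          uint32_t net  = ip & mask;          /* network number */
          uint32_t node = ip & ~mask;         /* node (host) part, here 91 */
          printf("network = %u.%u.%u.%u   node = %u\n",
                 (net >> 24) & 0xFF, (net >> 16) & 0xFF,
                 (net >> 8) & 0xFF, net & 0xFF, node);
          return 0;
      }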

  23. Transmission Control Protocol (TCP) • TCP is responsible for division of the application’s data-stream, error correction and opening the channel (port) between applications • Applications send a byte stream to TCP • TCP divides the byte stream into appropriately sized segments (set by the MTU* of the IP layer) • Each segment is given two sequence numbers to enable the byte stream to be reconstructed • Each segment also has a checksum to ensure correct packet delivery • Segments are passed to IP layer for delivery *maximum transmission unit
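
The checksum used by TCP (and by IP and UDP) is a 16-bit one's-complement sum; a hedged C sketch of the core calculation (the function name is mine, and a real TCP implementation also sums a pseudo-header and deals with byte order):

      #include <stdint.h>
      #include <stddef.h>

      /* RFC 1071 style Internet checksum over a byte buffer */
      uint16_t inet_checksum(const void *data, size_t len)
      {
          const uint16_t *p = data;
          uint32_t sum = 0;
          while (len > 1) { sum += *p++; len -= 2; }   /* sum 16-bit words */
          if (len)
              sum += *(const uint8_t *)p;              /* odd trailing byte */
          while (sum >> 16)
              sum = (sum & 0xFFFF) + (sum >> 16);      /* fold carries back in */
          return (uint16_t)~sum;                       /* one's complement */
      }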

  24. Humour: TCP joke "Hi, I'd like to hear a TCP joke." "Hello, would you like to hear a TCP joke?" "Yes, I'd like to hear a TCP joke." "OK, I'll tell you a TCP joke." "Ok, I will hear a TCP joke." "Are you ready to hear a TCP joke?" "Yes, I am ready to hear a TCP joke." "Ok, I am about to send the TCP joke. It will last 10 seconds, it has two characters, it does not have a setting, it ends with a punchline." "Ok, I am ready to get your TCP joke that will last 10 seconds, has two characters, does not have an explicit setting, and ends with a punchline." "I'm sorry, your connection has timed out. Hello, would you like to hear a TCP joke?"

  25. UDP: Alternative to TCP • UDP=User Datagram Protocol • Only adds a checksum and multiplexing capability – limited functionality allows a streamlined implementation: faster than TCP • No confirmation of delivery • Unreliable protocol: if you need reliability you must build on top of this layer • Suitable for real-time applications where error correction is irrelevant (e.g. streaming media, voice over IP) • DNS and DHCP both use UDP
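
A minimal UDP sender in C shows the fire-and-forget character of the protocol (the destination address and port are hypothetical and error checking is omitted):

      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <unistd.h>

      int main(void)
      {
          int s = socket(AF_INET, SOCK_DGRAM, 0);     /* UDP: no connection set-up */
          struct sockaddr_in dst;
          memset(&dst, 0, sizeof dst);
          dst.sin_family = AF_INET;
          dst.sin_port   = htons(5000);               /* hypothetical port */
          inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

          const char msg[] = "hello";
          /* sendto() returns once the datagram is queued: no acknowledgement,
             no retransmission, no guarantee of delivery */
          sendto(s, msg, sizeof msg, 0, (struct sockaddr *)&dst, sizeof dst);
          close(s);
          return 0;
      }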

  26. Encapsulation of layers [Slide diagram: the application’s data is wrapped successively on the way down the stack – the transport layer prepends a TCP header, the network layer prepends an IP header, and the data link layer prepends an Ethernet (enet) header, each layer treating everything above it as its data]

  27. Link Layer • For high performance clusters the link layer frequently determines the networking above it • All high performance interconnects emulate IP • Each data link thus brings its own networking layer with it

  28. Overview of interconnect fabrics • Broadly speaking, interconnects break down into two camps: commodity vs specialist • Commodity: gigabit ethernet (cost < 50 dollars per port) • Specialist: everything else (cost > 200 dollars per port) • Specialist interconnects primarily provide two features over gigabit: • Higher bandwidth • Lower message latency

  29. 10Gigabit Ethernet • Expected to become commodity any year now (estimates still in the range of $1000 per port) • A lot of the early implementations were from companies with HPC backgrounds e.g. Myrinet, Mellanox • The problem has always been finding a technological driver outside HPC – few people need a GB/s out of their desktop

  30. Infiniband • Infiniband (Open) standard is designed to cover many arenas, from database servers to HPC • 237 systems on top500 (Nov. 2015) • Has essentially become commoditized • Serial bus, can add bandwidth by adding more channels (“lanes”) and increasing channel speed • FDR (“fourteen data rate”, 14 Gb/s per lane) option now available • 56Gb/s (14*4) ports now common, higher available • Necessary for “fat nodes” with lots of cores • 600 Gb/s projected for 2017

  31. History of MPI • Many different message passing standards circa 1992 • Most designed for high performance distributed memory systems • Following SC92 the MPI Forum was started • Open participation encouraged (e.g. PVM working group was asked for input) • Goal was to produce as portable an interface as possible • Vendors included but not given control – specific hardware optimizations were avoided • Web address: http://www.mpi-forum.org • MPI-1 standard released 1994 • Forum reconvened in 1995-97 to define MPI-2 • Fully functional MPI-2 implementations did not appear until 2002 though • Reference guide is available for download • http://www.netlib.org/utk/papers/mpi-book/mpi-book.ps

  32. C vs FORTRAN interface • As much effort as possible was expended to keep the interfaces similar • Only significant difference is that C functions return the error code as their value • FORTRAN versions pass it back in a separate argument • Arguments to C functions may be more strongly typed than FORTRAN equivalents • FORTRAN interface relies upon integers

  33. MPI Communication model • Messages are typed and tagged • Don’t need to explicitly define buffer • Interface is provided if you want to use it • Specify start point of a message using memory address • Packing interface available if necessary (MPI_PACK datatype) • Communicators (process groups) are a vital component of the MPI standard • Destination processes must be identified within a specific process group • Messages must therefore specify: (address,count,datatype,destination,tag,communicator) • (address,count,datatype) define the message data, the remaining variables define the message envelope

  34. MPI-2 • Significant advance over the 1.2 standard • Defines remote memory access (RMA) interface • Two types of modes of operation • Active target: all processes participate in a single communication phase (although point-to-point messaging is allowed) • Passive target: Individual processes participate in point-to-point messaging • Parallel I/O • Dynamic process management (MPI_SPAWN)

  35. Missing pieces • MPI-1 did not specify how processes start • PVM defined its own console • Start-up is done using a vendor/open source supplied package • MPI-2 defines mpiexec – a standardized startup routine • Standard buffer interface is implementation specific • Process groups are static – they can only be created or destroyed, not modified • No mechanism for obtaining details about the hosts involved in the computation

  36. Getting started: enrolling & exiting from the MPI environment • Every program must initialize by executing MPI_INIT(ierr) or int MPI_Init(int *argc, char ***argv) • argc, argv are historical hangovers for the C version which may be set to NULL • Default communicator is MPI_COMM_WORLD • Determine the process id by calling • MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr) • Note PVM essentially puts enrollment and id resolution into one call • Determine total number of processes via • MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr) • To exit, processes must call MPI_FINALIZE(ierr)

  37. Minimal MPI program

      program main
      include "mpif.h"
      integer ierr, myid
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
      print *, 'Hello, world from ', myid
      call MPI_FINALIZE( ierr )
      end

      #include "mpi.h"
      #include <stdio.h>
      int main( int argc, char *argv[] )
      {
          int myid;
          MPI_Init( &argc, &argv );
          MPI_Comm_rank( MPI_COMM_WORLD, &myid );
          printf( "Hello, world from %d\n", myid );
          MPI_Finalize();
          return 0;
      }

      Normally execute by: mpirun -np 4 my_program
      Output (ordering is not deterministic):
      Hello, world from 2
      Hello, world from 1
      Hello, world from 0
      Hello, world from 3

  38. Compiling MPI codes • Some implementations (e.g. MPICH) define additional wrappers for the compiler: • mpif77, mpif90 for F77,F90 • mpicc, mpicxx for C/C++ • Code is then compiled using mpif90 (e.g.) rather than f90, libraries are linked in automatically • Usually best policy when machine specific libraries are required • Linking can always be done by hand

  39. What needs to go in a message? • Things that need specifying: • How will “data” be described? - specify • How will processes be identified? – where? • How will the receiver recognize/screen messages? - tagging • What will it mean for these operations to complete? – confirmed completion

  40. MPI Basic (Blocking) Send MPI_SEND (start, count, datatype, dest, tag, comm) • The message buffer is described by (start, count, datatype). • The target process is specified by dest, which is the rank of the target process in the communicator specified by comm. • When this function returns, the data has been delivered to the system and the buffer can be reused. The message may not have been received by the target process. From Bill Gropp’s slides
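
A hedged C sketch of a matching blocking send and receive between ranks 0 and 1 (the buffer contents, count and tag are illustrative, not taken from the lecture):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char *argv[])        /* run with at least 2 ranks */
      {
          int rank;
          double vec[2] = {0.0, 0.0};
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0) {                    /* sender */
              vec[0] = 3.14; vec[1] = 2.72;
              /* (start,count,datatype) describe the buffer; (dest,tag,comm) the envelope */
              MPI_Send(vec, 2, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD);
          } else if (rank == 1) {             /* receiver */
              MPI_Status status;
              MPI_Recv(vec, 2, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status);
              printf("rank 1 received %f %f\n", vec[0], vec[1]);
          }
          MPI_Finalize();
          return 0;
      }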

  41. Subtleties of point-to-point messaging This kind of communication is `unsafe’: whether it works correctly depends upon whether the system has enough buffer space.
      Process A            Process B
      MPI_Send(B)          MPI_Send(A)
      MPI_Recv(B)          MPI_Recv(A)
This code leads to a deadlock, since the MPI_Recv blocks execution until it is completed.
      Process A            Process B
      MPI_Recv(B)          MPI_Recv(A)
      MPI_Send(B)          MPI_Send(A)
You should always try and write communication patterns like this, where a send is matched by a recv:
      Process A            Process B
      MPI_Send(B)          MPI_Recv(A)
      MPI_Recv(B)          MPI_Send(A)

  42. Buffered Mode communication • Buffered sends avoid the issue of whether enough internal buffering is available • Programmer explicitly defines buffer space sufficient to allow all messages to be sent • MPI_Bsend has same semantics as MPI_Send • MPI_Buffer_attach(buffer,size,ierr) must be called to define the buffer space • Frequently better to rely on non-blocking communication though
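
A sketch of how buffered mode might be used from C, sizing the attached buffer with MPI_Pack_size plus MPI_BSEND_OVERHEAD (the helper function and its arguments are my own illustration, not part of the lecture):

      #include <mpi.h>
      #include <stdlib.h>

      void buffered_send(double *vec, int n, int dest, int tag)
      {
          int size;
          void *buf;
          MPI_Pack_size(n, MPI_DOUBLE, MPI_COMM_WORLD, &size);
          size += MPI_BSEND_OVERHEAD;          /* room for MPI's bookkeeping */
          buf = malloc(size);

          MPI_Buffer_attach(buf, size);        /* user-supplied buffer space */
          MPI_Bsend(vec, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD);
          MPI_Buffer_detach(&buf, &size);      /* blocks until buffered sends are done */
          free(buf);
      }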

  43. Non-blocking communication • Helps alleviate two issues: • Blocking communication can potentially starve a process for data while it could be doing useful work • Problems related to buffering are circumvented, since the user must explicitly ensure the buffer is available • MPI_Isend adds a handle to the subroutine call which is later used to determine whether the operation has succeeded • MPI_Irecv is the matching non-blocking receive operation • MPI_Test can be used to detect whether the send/receive has completed • MPI_Wait is used to wait for an operation to complete • The handle is used to identify which particular message • MPI_Waitall is used to wait for a series of operations to complete (an array of handles is used)

  44. Solutions to deadlocking • If sends and receives need to be matched use MPI_Sendrecv • The non-blocking versions (Isend and Irecv) will prevent deadlocks • Advice: Use buffered mode sends (Ibsend) so you know for sure that buffer space is available
      Process A            Process B
      MPI_Sendrecv(B)      MPI_Sendrecv(A)

      Process A            Process B
      MPI_Isend(B)         MPI_Isend(A)
      MPI_Irecv(B)         MPI_Irecv(A)
      MPI_Waitall          MPI_Waitall
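
A hedged C sketch of the non-blocking exchange pattern above (the function name and the use of MPI_STATUSES_IGNORE are my own choices):

      #include <mpi.h>

      /* Both ranks post a receive and a send without blocking,
         then wait for both requests to complete: no deadlock. */
      void exchange(double *sendbuf, double *recvbuf, int n, int other)
      {
          MPI_Request req[2];
          MPI_Irecv(recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req[0]);
          MPI_Isend(sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &req[1]);
          MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
      }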

  45. Other sending modes • Synchronous send (MPI_Ssend) • Only returns when the receiver has started receiving the message • On return indicates that send buffer can be reused, and also that receiver has started processing the message • Non-local communication mode: dependent upon speed of remote processing • (Receiver) Ready send (MPI_Rsend) • Used to eliminate unnecessary handshaking on some systems • If posted before receiver is ready then outcome is undefined (dangerous!) • Semantically, Rsend can be replaced by standard send

  46. Collective Operations • Collectives apply to all processes within a given communicator • Three main categories: • Data movement (e.g. broadcast) • Synchronization (e.g. barrier) • Global reduction operations • All processes must have a matching call • Size of data sent must match size of data received • Unless specifically a synchronization function, these routines do not imply synchronization • Blocking mode only – but unaware of status of remote operations • No tags are necessary

  47. Collective Data Movement • Types of data movement: • Broadcast (one to all, or all to all) • Gather (collect to single process) • Scatter (send from one processor to all) • MPI_Bcast(buff,count,datatype,root,comm,ierr) [Slide diagram: before MPI_Bcast only the root processor holds the data item A0; after the call every processor holds a copy of A0]
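
A minimal C sketch of a broadcast (the value 42 and the choice of rank 0 as root are arbitrary illustrations):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char *argv[])
      {
          int rank, nval = 0;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0) nval = 42;                     /* only the root has the value */
          MPI_Bcast(&nval, 1, MPI_INT, 0, MPI_COMM_WORLD);
          printf("rank %d now has nval = %d\n", rank, nval);   /* now every rank does */

          MPI_Finalize();
          return 0;
      }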

  48. Gather/scatter • MPI_Gather(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,root,comm,ierr) • MPI_Scatter has the same semantics • Note MPI_Allgather removes the root argument and all processes receive the result • Think of it as a gather followed by a broadcast • MPI_Alltoall(sendbuf,sendcount,sendtype,recvbuf,recvcount,recvtype,comm,ierr) • Each process sends a set of distinct data elements to the others – useful for transposing a matrix [Slide diagram: MPI_Scatter distributes items A0,A1,A2,A3 from the root so that processor i receives Ai; MPI_Gather is the inverse operation, collecting the Ai back onto the root]
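
A hedged C sketch of a scatter / local work / gather pattern (the chunk size and the doubling step are purely illustrative):

      #include <mpi.h>
      #include <stdlib.h>

      int main(int argc, char *argv[])
      {
          int rank, nprocs, i;
          const int chunk = 4;                 /* elements per process (illustrative) */
          double *full = NULL, local[4];

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

          if (rank == 0) {                     /* only the root needs the full array */
              full = malloc(nprocs * chunk * sizeof(double));
              for (i = 0; i < nprocs * chunk; i++) full[i] = i;
          }

          MPI_Scatter(full, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
          for (i = 0; i < chunk; i++) local[i] *= 2.0;        /* local work */
          MPI_Gather(local, chunk, MPI_DOUBLE, full, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

          if (rank == 0) free(full);
          MPI_Finalize();
          return 0;
      }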

  49. Global Reduction Operations • Plenty of operations covered:
      Name of Operation   Action
      MPI_MAX             maximum
      MPI_MIN             minimum
      MPI_SUM             sum
      MPI_PROD            product
      MPI_LAND            logical and
      MPI_BAND            bit-wise and
      MPI_LOR             logical or
      MPI_BOR             bit-wise or
      MPI_LXOR            logical xor
      MPI_BXOR            bit-wise xor
      MPI_MAXLOC          max value and location
      MPI_MINLOC          minimum value and location
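
A short C sketch of a global sum with MPI_Reduce (the per-rank values are arbitrary; MPI_Allreduce has the same form minus the root argument and leaves the result on every rank):

      #include <mpi.h>
      #include <stdio.h>

      int main(int argc, char *argv[])
      {
          int rank;
          double local, total = 0.0;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          local = rank + 1.0;                  /* each rank contributes a partial value */
          MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
          if (rank == 0) printf("global sum = %f\n", total);

          MPI_Finalize();
          return 0;
      }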
