Advanced Computer Architecture CSE 8383

Advanced Computer Architecture CSE 8383, April 24, 2008, Session 12

Contents • Message Passing Systems (Chapters 5 & 7) • Communication Patterns • Network Computing • Client/Server System • Clusters • Grid • Interconnection Networks

Message Passing Mechanisms • Message Format • Message: an arbitrary number of fixed-length packets • Packet: the basic unit, containing the destination address; a sequence number is needed so the message can be reassembled • A packet can be further divided into flits (flow control digits) • Routing and sequence information occupy the header flit
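The message format above can be sketched in a few lines. This is a minimal illustration, not a real router format: the names `Flit`, `Packet`, `packetize`, and `reassemble` are hypothetical, the header flit is assumed to hold one byte each for destination and sequence number, and the last packet is allowed to be shorter than the others.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flit:
    kind: str       # "header" or "data"
    payload: bytes


@dataclass
class Packet:
    dest: int
    seq: int
    flits: List[Flit]


def packetize(message: bytes, dest: int, packet_size: int, flit_size: int) -> List[Packet]:
    """Split a message into packets, and each packet into flits."""
    packets = []
    for seq, start in enumerate(range(0, len(message), packet_size)):
        chunk = message[start:start + packet_size]
        # Assumption: the header flit carries destination and sequence number
        # as single bytes, so both must be below 256 in this sketch.
        header = Flit("header", bytes([dest, seq]))
        data = [Flit("data", chunk[i:i + flit_size])
                for i in range(0, len(chunk), flit_size)]
        packets.append(Packet(dest, seq, [header] + data))
    return packets


def reassemble(packets: List[Packet]) -> bytes:
    """Rebuild the message; sequence numbers restore the original order."""
    ordered = sorted(packets, key=lambda p: p.seq)
    return b"".join(f.payload for p in ordered for f in p.flits if f.kind == "data")
```

Because packets may arrive out of order, `reassemble` sorts by sequence number before joining the data flits, which is why the sequence number is needed.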

[Figure: Message, Packets, Flits. A message is divided into packets; a packet is divided into flits, with destination and sequence information followed by data flits.]

Store and Forward Routing • Packets are the basic units of information flow • Each node uses a packet buffer • A packet is transferred from S to D through a sequence of intermediate nodes • Channel and buffer must be available

Wormhole Routing • Flits are the basic units of information flow • Each node uses a flit buffer • Flits are transferred from S to D through a sequence of intermediate routers in order (Pipeline) • Can be visualized as a railroad train • Flits from different packets cannot be mixed up

Latency Analysis • L: packet length (in bits) • W: channel bandwidth (bits/sec) • D: distance (number of hops) • F: flit length (in bits)

[Figure: Store and Forward Latency over a distance of D hops]

[Figure: Wormhole (WH) Latency over a distance of D hops]

Latency Analysis (cont.) • With L, W, D, and F as defined above: • Store and forward: $T_{SF} = D \cdot L/W$ • Wormhole: $T_{WH} = L/W + D \cdot F/W \approx L/W$ if $L \gg F$ (nearly independent of D)
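A short numeric check of the two latency formulas may help. The function names are hypothetical and the parameter values are made up for illustration; the formulas are the ones from the slide.

```python
def store_and_forward_latency(L: float, W: float, D: int) -> float:
    """T_SF = D * L / W: the whole packet is received at every hop."""
    return D * L / W


def wormhole_latency(L: float, W: float, D: int, F: float) -> float:
    """T_WH = L / W + D * F / W: only the header flit pays the per-hop cost."""
    return L / W + D * F / W


if __name__ == "__main__":
    L, W, F = 1000, 100, 10   # bits, bits/sec, bits (illustrative values)
    for D in (1, 4, 16):
        print(D, store_and_forward_latency(L, W, D), wormhole_latency(L, W, D, F))
```

With these values the store-and-forward latency grows linearly in D, while the wormhole latency stays close to L/W because F is much smaller than L.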

Communication Patterns • Point-to-point: 1 to 1 • Multicast: 1 to n • Broadcast: 1 to all • Conference: n to n

Potential Routing Problems Deadlock: • When two messages each hold resources that the other needs in order to advance, both messages are blocked (a cyclic dependency for resources) • A straightforward (but inefficient) solution is rerouting • Another solution is to avoid deadlock altogether by imposing a strict monotonic order on network resources • The channel dependency graph (CDG) is a technique for developing a deadlock-free routing algorithm.
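The CDG idea can be sketched as a cycle check: a routing algorithm whose channel dependency graph has no cycle cannot deadlock. The sketch below assumes the CDG is given as a dictionary mapping each channel to the channels a message may request next; the function name `has_cycle` and the two example graphs are hypothetical and are not the graphs from the lecture figure.

```python
def has_cycle(cdg: dict) -> bool:
    """Return True if the channel dependency graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    nodes = set(cdg) | {v for targets in cdg.values() for v in targets}
    color = {n: WHITE for n in nodes}

    def visit(u) -> bool:
        color[u] = GREY
        for v in cdg.get(u, ()):
            if color[v] == GREY:          # back edge: cyclic dependency
                return True
            if color[v] == WHITE and visit(v):
                return True
        color[u] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in nodes)


# Hypothetical example: four channels forming a ring of dependencies.
ring = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": ["c1"]}
# Same channels with one dependency removed by a routing restriction.
restricted = {"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": []}
```

Here `has_cycle(ring)` is true and `has_cycle(restricted)` is false, which mirrors the step from a cyclic CDG to a deadlock-free version.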

[Figure: A 4-node network and its CDGs. (a) A 4-node network (nodes 0 to 3, channels c1 to c8); (b) channel dependency graph (CDG); (c) CDG for a deadlock-free version of the network]

Livelock: • A message goes around the network and never reaches its destination • It results from using adaptive routing algorithms with dynamic injection, where nodes inject their messages into the network at arbitrary times • Policies to avoid livelock are based on assigning a priority to each message injected into the network: • Messages are routed according to their priorities • Once a message is injected, only a finite number of messages will be injected with higher or equal priority.

Starvation: • A node suffers from starvation if it has a message to inject into the network but is never allowed to do so. • The simplest policy to avoid starvation is to allow each node to have an injection queue that competes with the queues of the incoming links to the same node. • The main disadvantage is that a node with a high message injection rate can slow down all the other nodes in the network.

Routing Efficiency • Two Parameters • Channel Traffic (number of channels used to deliver the message involved) • Communication Latency (distance)
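The two parameters can be computed for a small mesh example. This sketch assumes dimension-ordered (XY) routing, counts traffic as the number of channels used, and takes latency as the longest distance to any destination; the function names and the example coordinates are hypothetical and do not reproduce the patterns in the lecture figures.

```python
def xy_path(src, dst):
    """Channels used by XY routing from src to dst on a 2D mesh."""
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                        # travel along X first
        nx = x + (1 if dx > x else -1)
        hops.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                        # then along Y
        ny = y + (1 if dy > y else -1)
        hops.append(((x, y), (x, ny)))
        y = ny
    return hops


def unicast_cost(src, dests):
    """Separate unicasts: every message pays for every channel it crosses."""
    paths = [xy_path(src, d) for d in dests]
    return sum(len(p) for p in paths), max(len(p) for p in paths)


def tree_multicast_cost(src, dests):
    """Multicast along the union of the XY paths: shared channels count once."""
    paths = [xy_path(src, d) for d in dests]
    channels = {hop for p in paths for hop in p}
    return len(channels), max(len(p) for p in paths)


if __name__ == "__main__":
    src, dests = (0, 0), [(2, 0), (2, 1), (2, 2)]
    print(unicast_cost(src, dests))         # (traffic, latency)
    print(tree_multicast_cost(src, dests))  # (traffic, latency)
```

For this example the three unicasts use 9 channels while the shared tree uses 4, with the same distance-based latency of 4 hops in both cases.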

Multicast on a mesh (5 unicasts) Traffic ? Latency ?

Multicast on a mesh (multicast pattern 1) Traffic ? Latency ?

Multicast on a mesh (multicast pattern 2) Traffic ? Latency ?

[Figure: Broadcast (tree structure), with numeric labels 1 to 4 on the tree]

Message Passing in PVM (Revisit) [Figure: a sending task and a receiving task, each consisting of a user application and the library, with a daemon on each side; points 1 to 8 mark the steps of a send and a receive]

Standard PVM asynchronous communication • A sending task issues a send command (point 1) • The message is transferred to the daemon (point 2) • Control is returned to the user application (points 3 & 4) • The daemon transmits the message on the physical wire sometime after returning control to the user application (point 3)

Standard PVM asynchronous communication (cont.) • The receiving task issues a receive command (point 5) at some other time • In the case of a blocking receive, the receiving task blocks on the daemon waiting for a message (point 6). After the message arrives, control is returned to the user application (points 7 & 8) • In the case of a non-blocking receive, control is returned to the user application immediately (points 7 & 8)

Send (3 steps) • A send buffer must be initialized • The message is packed into the buffer • The completed message is sent to its destination(s)

Receive (2 steps) • The message is received • The received items are unpacked

Message Buffers • Buffer creation (before packing): bufid = pvm_initsend(encoding_option) or bufid = pvm_mkbuf(encoding_option) • Encoding options:
  Encoding option   Meaning
  0                 XDR
  1                 No encoding
  2                 Leave data in place

Message Buffers (cont.) • Data Packing pvm_pk*() • pvm_pkstr(): one argument, e.g. pvm_pkstr("This is my data"); • Others: three arguments • Pointer to the first item • Number of items to be packed • Stride, e.g. pvm_pkint(my_array, n, 1); • Packing functions can be called multiple times to pack data into a single message

Sending a message • Point-to-point (one receiver): info = pvm_send(tid, tag) • Broadcast (multiple receivers): info = pvm_mcast(tids, n, tag) or info = pvm_bcast(group_name, tag) • Pack and send (one step): info = pvm_psend(tid, tag, my_array, length, data_type)

Receiving a message • Blocking: bufid = pvm_recv(tid, tag); -1 is a wildcard in either tid or tag • Nonblocking: bufid = pvm_nrecv(tid, tag); bufid = 0 means no message was received • Timeout: bufid = pvm_trecv(tid, tag, timeout); bufid = 0 means no message was received

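[Figure: Different Receives in PVM, shown on a time axis starting when the function is called. pvm_recv() (blocking): wait, then resume execution on message arrival. pvm_nrecv() (non-blocking): continue execution. pvm_trecv() (timeout): wait, then resume execution on message arrival or when the time has expired.]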
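The three receive behaviors can be imitated with a thread-safe queue. This is only an analogy in Python, not PVM code: `mailbox` stands in for the daemon's message queue, the function names are hypothetical, and `None` plays the role of the `bufid = 0` result.

```python
import queue


def recv_blocking(mailbox: queue.Queue):
    """Like pvm_recv(): wait until a message arrives."""
    return mailbox.get()


def recv_nonblocking(mailbox: queue.Queue):
    """Like pvm_nrecv(): return at once, with None if nothing has arrived."""
    try:
        return mailbox.get_nowait()
    except queue.Empty:
        return None


def recv_timeout(mailbox: queue.Queue, seconds: float):
    """Like pvm_trecv(): wait at most `seconds`, then give up with None."""
    try:
        return mailbox.get(timeout=seconds)
    except queue.Empty:
        return None
```

A caller that can do useful work while waiting would poll with `recv_nonblocking`, while `recv_timeout` bounds the wait when the sender might have failed.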

Data unpacking pvm_upk*() • pvm_upkstr() – one argument pvm_upkstr(string); • Others – three arguments • Pointer to the first item • Number of items to be unpacked • Stride pvm_upkint(my_array, n, 1);
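The pointer, count, and stride arguments, and the rule that items are unpacked in the order they were packed, can be illustrated without PVM. The sketch below is an analogy using Python's `struct` module with big-endian 4-byte integers (the byte order XDR uses); `pack_ints` and `unpack_ints` are hypothetical helpers, not PVM functions.

```python
import struct


def pack_ints(items, n, stride):
    """Pack n integers, taking every stride-th element starting at items[0]."""
    selected = [items[i * stride] for i in range(n)]
    return struct.pack(f">{n}i", *selected)


def unpack_ints(buffer, offset, n):
    """Unpack n integers at offset; return them and the next read position."""
    values = struct.unpack_from(f">{n}i", buffer, offset)
    return list(values), offset + 4 * n


if __name__ == "__main__":
    # Two packing calls build a single message.
    message = pack_ints([1, 2, 3, 4, 5, 6], 3, 2) + pack_ints([7, 8], 2, 1)
    first, pos = unpack_ints(message, 0, 3)     # stride 2 picked items 1, 3, 5
    second, pos = unpack_ints(message, pos, 2)
    print(first, second)
```

The receiver has to know the packing order and counts in advance, since the buffer itself carries no type or length markers in this sketch.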

Network Computing • Four categories • WAN • MAN • LAN • SAN • Internet • TCP/IP

Other Network technologies • Fast Ethernet and Gigabit Ethernet • The Fiber Distributed Data Interface (FDDI) • High-Performance Parallel Interface (HIPPI) • Asynchronous Transfer Mode (ATM) • Scalable Coherent Interface (SCI)

[Figure: A representation of network technologies. Vertical axis: bandwidth (10 Mbps, 100 Mbps, 1000 Mbps, 10 Gbps); horizontal axis: SAN, LAN, MAN, WAN; technologies shown: 10 Base T, 100 Base T, 1000 Base T, FDDI, ATM, HiPPI, SCI]

[Figure: Client/Server Systems. Clients and servers (with server threads) connected through an interconnection network]

Client Server Sockets • Sockets provide the capability of making connections from an application running on one machine to an application running on a different machine. • Once a socket is created, it can be used to wait for an incoming connection (passive socket) or to initiate a connection (active socket). [Figure: A Socket Connection]
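The passive and active roles can be sketched with Python's standard `socket` module. For a self-contained example both ends run on the loopback address in one process, which is an assumption made here for convenience; the function names and the upper-casing "service" are hypothetical.

```python
import socket
import threading


def serve_once(passive: socket.socket) -> None:
    """Passive side: wait for one incoming connection and answer it."""
    conn, _addr = passive.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def demo() -> bytes:
    # Passive socket: bound to an address and listening for connections.
    passive = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    passive.bind(("127.0.0.1", 0))        # port 0 lets the OS pick a free port
    passive.listen(1)
    port = passive.getsockname()[1]

    server = threading.Thread(target=serve_once, args=(passive,))
    server.start()

    # Active socket: initiates the connection to the passive one.
    with socket.create_connection(("127.0.0.1", port)) as active:
        active.sendall(b"hello")
        reply = active.recv(1024)

    server.join()
    passive.close()
    return reply
```

Between two machines the only change would be the address: the client connects to the server's host name or IP address instead of the loopback address.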

[Figure: A Client Server Framework for Parallel Applications. The client acts as master (supervisor) and servers 1 to n act as slaves (workers), connected through an interconnection network]
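The supervisor and worker roles in this framework can be sketched on a single machine with a thread pool standing in for the servers. This is an illustration under that assumption; `master`, `worker`, and the sum-of-squares task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(chunk):
    """Work done by one server: here, a partial sum of squares."""
    return sum(x * x for x in chunk)


def master(data, n_workers):
    """Supervisor: split the job, hand out the pieces, combine the results."""
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(master(list(range(10)), 3))
```

In a real cluster the pool would be replaced by connections to remote servers, but the split, distribute, and combine structure stays the same.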

Computer Clusters • Advances in commodity processors and network technology • Network of PCs and workstations connected via LAN or WAN forms a Parallel System • Compete favorably (cost/performance)

[Figure: Cluster Architecture (home cluster). Nodes, each with P, C, M, I/O, and its own OS, are connected by an interconnection network; middleware and a programming environment sit on top of the node operating systems]

Internet Grids Geographically distributed platforms. Dependable, consistent, pervasive, and inexpensive access to high end computing.

Interconnection Networks Ethernet • A packet-switched LAN technology. • All hosts connected to an Ethernet receive every transmission, making it possible to broadcast a packet to all hosts at the same time. • Ethernet uses a distributed access control scheme called Carrier Sense Multiple Access with Collision Detection (CSMA/CD). • Each computer connected to an Ethernet network is assigned a unique 48-bit address known as its Ethernet address, also called the media access control (MAC) address.

Switches • An n1 x n2 switch consists of: • n1 input ports • n2 output ports • Links connecting each input to every output • Control logic to select a specific connection • Internal buffers • The connections between input ports and output ports may be: • One-to-one (point-to-point) • One-to-many (multicast or broadcast) • Many-to-one: may cause conflicts at the output ports and needs arbitration.

When only one-to-one connections are allowed, the switch is called a crossbar. • An n x n crossbar switch can establish n! distinct connection patterns. • If we allow both one-to-one and one-to-many connections in an n x n switch, the number of connection patterns that can be established is $n^n$. (We discussed this before, remember?)
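Both counts can be confirmed by brute-force enumeration for small n. The sketch assumes that in the one-to-many case every output is driven by exactly one input and an input may drive several outputs; the function names are hypothetical.

```python
from itertools import permutations, product


def one_to_one_patterns(n: int) -> int:
    """Each output gets a distinct input: the permutations of n inputs."""
    return sum(1 for _ in permutations(range(n)))


def one_to_many_patterns(n: int) -> int:
    """Each output independently picks any of the n inputs."""
    return sum(1 for _ in product(range(n), repeat=n))


if __name__ == "__main__":
    for n in range(1, 5):
        print(n, one_to_one_patterns(n), one_to_many_patterns(n))
```

For n = 4 the enumeration gives 24 and 256, matching 4! and 4 to the power 4.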

Routing can be achieved using two mechanisms: • Source-path: the entire path to the destination is stored in the packet header at the source location. • Table-based: the switch must have a complete routing table that determines the corresponding port for each destination. [Figure: Source-path Routing versus Table-based Routing. Two 8-port switches (ports 0 to 7): one forwards on the port list carried in the packet header, the other looks up the destination id in its routing table (Dest-id, Port)]
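The difference between the two mechanisms is where the forwarding decision lives, which a small simulation can make concrete. The topology, switch names, and port numbers below are hypothetical and are not taken from the lecture figure.

```python
def deliver_source_path(links, start, path):
    """Source-path routing: the header lists the output port for every hop."""
    node = start
    remaining = list(path)
    while remaining:
        port = remaining.pop(0)           # each switch consumes one header entry
        node = links[(node, port)]
    return node


def deliver_table_based(links, tables, start, dest):
    """Table-based routing: each switch looks up the port for the destination."""
    node = start
    while node != dest:
        port = tables[node][dest]
        node = links[(node, port)]
    return node


# Hypothetical topology: switch S0 port 5 leads to S1, S1 port 6 leads to host H6.
links = {("S0", 5): "S1", ("S1", 6): "H6"}
tables = {"S0": {"H6": 5}, "S1": {"H6": 6}}
```

Both `deliver_source_path(links, "S0", [5, 6])` and `deliver_table_based(links, tables, "S0", "H6")` reach `H6`. Source-path routing keeps switches simple at the cost of a longer header, while table-based routing keeps the header short but requires every switch to store a table.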

Myrinet Clos network • Myrinet is a high-performance, packet communication and switching technology. • Myrinet switches are multiple-port components that route a packet entering on an input channel of a port to the output channel of the port selected by the packet.

[Figure: Myrinet Clos network. 128-host Clos network using 16-port Myrinet switches: a Clos "spreader" network connects the network spine (upper 8 switches) to the leaves (16 lower switches), serving 128 hosts]

[Figure: Myrinet Clos network. 64-host Clos network using 16-port Myrinet switches; each line between the network spine and a leaf represents 2 links]

[Figure: Myrinet Clos network. 32-host Clos network using 16-port Myrinet switches; each line between the network spine and a leaf represents 4 links]
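The switch counts and link multiplicities in the three Clos figures follow from simple port arithmetic. The sketch below assumes each 16-port leaf switch uses half of its ports for hosts and half for uplinks, and that every spine port is used; that assumption reproduces the 8 spine and 16 leaf switches stated for the 128-host network, but it is an inference, not something the slides spell out. The function name `clos_plan` is hypothetical.

```python
def clos_plan(hosts: int, ports: int = 16):
    """Return (leaf switches, spine switches, links per leaf-spine pair)."""
    down = ports // 2                     # assumption: half the ports face hosts
    up = ports - down                     # the other half are uplinks
    leaves = hosts // down
    spines = (leaves * up) // ports       # all spine ports carry one uplink
    links_per_pair = up // spines         # uplinks from one leaf to one spine
    return leaves, spines, links_per_pair


if __name__ == "__main__":
    for hosts in (128, 64, 32):
        print(hosts, clos_plan(hosts))
```

Under this assumption the smaller networks keep full bandwidth between leaves and spine by doubling or quadrupling the links between each pair of switches, which is what the "2 links each" and "4 links each" labels indicate.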

The Quadrics network (QsNet) Consists of two hardware building blocks • A programmable network interface called Elan: • Connects the Quadrics network to a processing node containing one or more CPUs • Provides substantial local processing power to implement high-level message passing protocols (e.g., MPI). • A high-bandwidth, low-latency communication switch called Elite: • QsNet connects Elite switches in a quaternary fat-tree topology.

[Figure: The Quadrics network (QsNet) connecting the processing nodes]
