Asynchronous Interconnection Network and Communication Chapter 3 of Casanova et al.
Interconnection Network Topologies • The processors in a distributed memory parallel system are connected using an interconnection network. • All computers have specialized coprocessors that route messages and place data in local memories • Nodes consist of a (computing) processor, a memory, and a communications coprocessor • Nodes are often called processors, when not ambiguous.
Network Topology Types • Static Topologies • A fixed network that cannot be changed • Nodes connected directly to each other by point-to-point communications links • Dynamic Topologies • Topology can change at runtime • One or more nodes can request direct communication be established between them. • Done using switches
Some Static Topologies • Fully connected network (or clique) • Ring • Two-Dimensional grid • Torus • Hypercube • Fat tree
Static Topology Features • Fixed number of nodes • Degree: • Number of edges incident to a node • Distance between nodes: • Length of the shortest path between two nodes • Diameter: • Largest distance between two nodes • Number of links: • Total number of edges • Bisection Width: • Minimum number of edges that must be removed to partition the network into two disconnected networks of the same size.
Classical Interconnection Network Features • Clique (or fully connected network) • All pairs of processors are directly connected • p(p-1)/2 edges • Ring • Very simple and very useful topology • 2D Grid • Degree of interior processors is 4 • Not symmetric, as edge processors have different properties • Very useful when computations are local and communications are between neighbors • Was heavily used in earlier machines
Classical Networks • 2D Torus • Easily formed from the 2D mesh by connecting matching end points. • Hypercube • Has been extensively used • Using its recursive definition, one can design simple but very efficient algorithms • Has a small diameter that is logarithmic in the number of nodes • Degree and total number of edges grow too quickly to be useful for massively parallel machines.
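To make these characteristics concrete, here is a small Python sketch tabulating the usual textbook values for the topologies above; the hypercube assumes p = 2^d nodes, the torus a square q × q layout, and the helper names and sample p are ours, not from the text.

```python
import math

# Standard characteristics of p-node topologies (textbook formulas).
# Hypercube assumes p = 2**d; torus assumes p = q*q for an integer q.
def characteristics(p):
    d = int(math.log2(p))      # hypercube dimension
    q = int(math.isqrt(p))     # side of the square torus
    return {
        #  name          (degree, diameter,     links,            bisection)
        "clique":    (p - 1,  1,            p * (p - 1) // 2, (p // 2) * ((p + 1) // 2)),
        "ring":      (2,      p // 2,       p,                2),
        "2D torus":  (4,      2 * (q // 2), 2 * p,            2 * q),
        "hypercube": (d,      d,            p * d // 2,       p // 2),
    }

for name, (deg, diam, links, bisec) in characteristics(64).items():
    print(f"{name:10s} degree={deg:3d} diameter={diam:3d} links={links:5d} bisection={bisec:4d}")
```

For p = 64 this prints, e.g., the hypercube's logarithmic diameter (6) next to the ring's linear one (32), which is the trade-off the slide describes.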
Dynamic Topologies • The fat tree is different from the other networks presented • The compute nodes are only at the leaves • Nodes at higher levels do not perform computation • Topology is a binary tree – both in the 2D front view and in the side view • Provides extra bandwidth near the root • Used by Thinking Machines Corp. on the CM-5 • Crossbar Switch • Has p² switches, which is very expensive for large p • Can connect n processors to any permutation of n processors • Cost rises with the number of switches, which is quadratic in the number of processors.
Dynamic Topologies (cont) • Benes Networks & Omega Networks • Use smaller crossbars arranged in stages • Only crossbars in adjacent stages are connected together • Called multi-stage networks; cheaper to build than a full crossbar • Configuring a multi-stage network is more difficult than configuring a crossbar • Dynamic networks are now the most commonly used topologies.
A Simple Communications Performance Model • Assume a processor Pi sends a message of length m to Pj. • The cost to transfer the message along one network link is roughly linear in the message length. • As a result, the cost to transfer the message along a particular route is roughly linear in m. • Let ci,j(m) denote the time to transfer this message.
Hockney Performance Model for Communications • The time ci,j(m) to transfer this message can be modeled by ci,j(m) = Li,j + m/Bi,j = Li,j + m·bi,j • m is the size of the message • Li,j is the startup time, also called latency • Bi,j is the bandwidth, in bytes per second • bi,j is 1/Bi,j, the inverse of the bandwidth • Proposed by Hockney in 1994 to evaluate the performance of the Intel Paragon. • Probably the most commonly used model.
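A minimal Python sketch of the Hockney cost function; the numeric parameters below are illustrative, not taken from the text.

```python
def hockney_cost(m, L, B):
    """Hockney model: time to send an m-byte message is L + m/B."""
    return L + m / B

# Example: a link with 1 microsecond latency and 1 GB/s bandwidth
# (illustrative numbers).  Large messages are bandwidth-dominated,
# small ones latency-dominated.
print(hockney_cost(1e6, L=1e-6, B=1e9))   # ~1.001e-3 s for 1 MB
print(hockney_cost(8,   L=1e-6, B=1e9))   # ~1.008e-6 s for 8 bytes
```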
Hockney Performance Model (cont.) • Factors that Li,j and Bi,j depend on • Length of route • Communication protocol used • Communications software overhead • Ability to use links in parallel • Whether links are half or full duplex • Etc.
Store and Forward Protocol • SF is a point-to-point protocol • Each intermediate node receives and stores the entire message before retransmitting it • Implemented in earliest parallel machines in which nodes did not have communications coprocessors. • Intermediate nodes are interrupted to handle messages and route them towards their destination.
Store and Forward Protocol (cont) • If d(i,j) is the number of links between Pi and Pj, the formula for ci,j(m) can be rewritten as ci,j(m) = d(i,j)(L + m/B) = d(i,j)L + d(i,j)m·b where • L is the latency and b is the reciprocal of the bandwidth for one link. • This protocol produces poor latency & bandwidth • The communication cost can be reduced using pipelining.
Store and Forward Protocol using Pipelining • The message is split into r packets of size m/r. • The packets are sent one after another from Pi to Pj. • The first packet reaches Pj after ci,j(m/r) time units. • The remaining r-1 packets arrive in (r-1)(L + mb/r) additional time units • Simplifying, the total communication time reduces to [d(i,j) - 1 + r][L + mb/r] • Casanova et al. derive the optimal value for r (see the sketch below).
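A short sketch of the pipelined cost and the packet count that minimizes it. Setting the derivative of (d(i,j) - 1 + r)(L + mb/r) with respect to r to zero gives r = sqrt((d(i,j) - 1)·m·b / L); this matches the optimization Casanova et al. carry out, though the numbers below are purely illustrative.

```python
import math

def sf_pipelined(d, m, L, b, r):
    """Pipelined store-and-forward time: (d - 1 + r) * (L + m*b/r)."""
    return (d - 1 + r) * (L + m * b / r)

def r_opt(d, m, L, b):
    """Zero of the derivative of the expression above."""
    return math.sqrt((d - 1) * m * b / L)

# Illustrative numbers: 5 hops, 1 MB message, L = 10 us, b = 1 ns/byte.
d, m, L, b = 5, 1e6, 1e-5, 1e-9
print(sf_pipelined(d, m, L, b, r=1))                   # no pipelining: ~5.05e-3 s
print(sf_pipelined(d, m, L, b, r=r_opt(d, m, L, b)))   # r_opt = 20: ~1.44e-3 s
```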
Two Cut-Through Protocols • Common performance model: ci,j(m) = L + d(i,j)·δ + m/B where • L is the one-time cost of creating the message • δ is the routing management overhead • Generally δ << L, as routing management is performed by hardware while L involves software overhead • m/B is the time required to transmit the message through the entire route
Circuit-Switching Protocol • The first cut-through protocol • A route is created before the first message is sent • The message is sent directly to the destination through this route • The nodes along this route cannot be used for any other communication while the transmission is in progress
Wormhole (WH) Protocol • A second cut-through protocol • The destination address is stored in the header of the message. • Routing is performed dynamically at each node. • Message is split into small packets called flits • If two flits arrive at the same time, flits are stored in intermediate nodes’ internal registers
Point-to-Point Communication Comparisons • Store and forward is not used in physical networks, only at the application level • Cut-through protocols are more efficient (compared in the sketch below) • Hide the distance between nodes • Avoid large buffer requirements at intermediate nodes • Almost no message loss • For small networks, a flow-control mechanism is not needed • Wormhole is generally preferred to circuit switching • Its latency is normally much lower
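A small sketch contrasting the two cost models, using illustrative parameters; it shows how cut-through "hides" the distance d(i,j) (only the small δ term scales with hops) while store-and-forward multiplies the full transfer time by the hop count.

```python
def store_and_forward(d, m, L, b):
    """SF: every hop stores and re-sends the whole message: d*(L + m*b)."""
    return d * (L + m * b)

def cut_through(d, m, L, delta, b):
    """Cut-through: L + d*delta + m*b, with b = 1/B and delta << L."""
    return L + d * delta + m * b

# Illustrative parameters; delta << L since routing is done in hardware.
L, b, delta, m = 1e-5, 1e-9, 1e-8, 1e6
for d in (1, 4, 16):
    print(d, store_and_forward(d, m, L, b), cut_through(d, m, L, delta, b))
# At d = 16, SF costs ~16.2 ms while cut-through stays near ~1.01 ms.
```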
LogP Model • Models based on the LogP model are more precise than the Hockney model • Involves the three components of a communication – the sender, the network, and the receiver • At times, some of these components may be busy while others are not • Some parameters for LogP • m is the message size (in bytes) • w is the size of the packets the message is split into • L is an upper bound on the latency • o is the overhead • Defined to be the time that a node is engaged in the transmission or reception of a packet
LogP Model (cont) • Parameters for LogP (cont) • g, or gap, is the minimal time interval between consecutive packet transmissions or receptions • During this time, a node may not use the communication coprocessor (i.e., network card) • 1/g is the communication bandwidth available per node • P is the number of nodes in the platform • The model yields the cost of sending m bytes with packet size w, and the time the sender/receiver is occupied (see the sketch below)
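A hedged sketch of these two quantities. Exact LogP formulas vary slightly between authors, so the expressions below are one commonly used form, not necessarily the text's: the sender injects a packet every max(o, g), and the last packet still needs o + L + o to cross the network and be absorbed.

```python
import math

def logp_time(m, w, L, o, g):
    """End-to-end time for m bytes sent as ceil(m/w) packets
    (one common LogP estimate; details differ between authors)."""
    k = math.ceil(m / w)                  # number of packets
    return (k - 1) * max(o, g) + 2 * o + L

def sender_occupation(m, w, o, g):
    """Time the sending node is busy and cannot compute."""
    k = math.ceil(m / w)
    return (k - 1) * max(o, g) + o

# Illustrative parameters: 4 KB message, 512-byte packets.
print(logp_time(4096, w=512, L=5e-6, o=1e-6, g=2e-6))        # ~2.1e-5 s
print(sender_occupation(4096, w=512, o=1e-6, g=2e-6))        # ~1.5e-5 s
```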
Other LogP-Related Models • LogP attempts to capture in a few parameters the characteristics of parallel platforms. • Platforms are fine-tuned and may use different protocols for short & long messages • LogGP is an extension of LogP in which G captures the bandwidth for long messages • pLogP is an extension of LogP where L, o, and g depend on the message size m • It also separates the sender overhead os from the receiver overhead or.
Affine Models • The use of floor functions in LogP-type models makes them nonlinear • This causes many problems in analytic & theoretical studies • It has led to the proposal of several fully linear models • The time that Pi is busy sending a message is expressed as an affine function of the message size • An affine function of m has the form f(m) = a·m + b, where a and b are constants. If b = 0, then f is a linear function • Similarly, the time Pj is busy receiving the message is expressed as an affine function of the message size (the snippet below contrasts the two model styles) • We will postpone further coverage of affine models for the present
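A tiny sketch of why the distinction matters: the packet rounding in a LogP-style cost makes it a step function of m, while an affine model varies smoothly; all parameter values here are illustrative.

```python
import math

def logp_like(m, w=512, L=5e-6, o=1e-6, g=2e-6):
    # The rounding makes the cost a step function of m -- not affine.
    return (math.ceil(m / w) - 1) * max(o, g) + 2 * o + L

def affine(m, a=2e-9, b=7e-6):
    # f(m) = a*m + b: smooth, easy to manipulate analytically.
    return a * m + b

# The step model gives the same cost for 1 byte and 512 bytes,
# then jumps at 513; the affine model varies smoothly.
for m in (1, 512, 513):
    print(m, logp_like(m), affine(m))
```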
Modeling Concurrent Communications • Multi-port model • Assumes that communications are contention-free and do not interfere with each other. • A consequence is that a node may communicate with an unlimited number of nodes without any degradation in performance. • Would require a clique interconnection network to fully support. • May simplify proofs that certain problems are hard • If hard under ideal communications conditions, then hard in general. • Assumption not realistic - communication resources are always limited. • See Casanova text for additional information.
Concurrent Communications Models (2/5) • Bounded multi-port model • Proposed by Hong and Prasanna • For applications that use threads (e.g., on multi-core technology), a network link can be shared by several incoming and outgoing communications • The sum of the bandwidths allotted to all communications cannot exceed the bandwidth of the network card • An unbounded number of communications can take place, provided they share the total available bandwidth • The application defines the bandwidth allotted to each communication • Bandwidth sharing by the application is unusual, as it is usually handled by the operating system.
Concurrent Communications Models (3/5) • 1-port (unidirectional or half-duplex) model • Avoids unrealistically optimistic assumptions • Forbids concurrent communication at a node • A node can either send data or receive it, but not simultaneously • This model is very pessimistic, as real-world platforms can achieve some concurrency in communication • The model is simple, and it is easy to design algorithms that follow it.
Concurrent Communications Models (4/5) • 1-port (bidirectional or full-duplex) model • Currently, most network cards are full-duplex • Allows a single emission and a single reception simultaneously • Introduced by Bhat et al. • Current hardware does not easily enable multiple messages to be transmitted simultaneously • Multiple sends and receives are claimed to be eventually serialized by the single hardware port to the network • Experimental work by Saif & Parashar suggests that asynchronous sends become serialized once message sizes exceed a few megabytes.
Concurrent Communications Models (5/5) • k-ports model • A node may have k>1 network cards • This model allows a node to be involved in a maximum of one emission and one reception on each network card. • This model is used in Chapters 4 & 5.
Bandwidth Sharing • The previous concurrent communication models only consider contention at the nodes • Other parts of the network can also limit performance • It may be useful to determine constraints on each network link • This type of network model is useful for performance evaluation purposes, but is too complicated for algorithm design purposes • The Casanova text evaluates algorithms using 2 models: • The Hockney model, or even simplified versions (e.g., assuming no latency) • The multi-port model (ignoring contention) or the 1-port model.
Case Study: Unidirectional Ring • We first consider the platform of p processors arranged in a unidirectional ring. • Processors are denoted Pk for k = 0, 1, … , p-1. • Each PE can find its logical index by calling My_Num().
Unidirectional Ring Basics • A processor can determine the number of PEs by calling NumProcs() • Both of the preceding functions are supported in MPI, a message-passing library implemented on most asynchronous systems. • Each processor has its own memory • All processors execute the same program, which acts on the data in their local memories • Single Program, Multiple Data, or SPMD • Processors communicate by message passing • explicitly sending and receiving messages.
Unidirectional Ring Basics (cont – 2/5) • A processor sends a message using the function send(addr, m) • addr is the memory address (in the sending processor) of the first data item to be sent • m is the message length (i.e., the number of items to be sent) • A processor receives a message using the function receive(addr, m) • addr is the local address in the receiving processor where the first data item is to be stored • If processor Pi executes a receive, then its predecessor P(i-1) mod p must execute a send • Since each processor has a unique predecessor and successor, they do not have to be specified (see the mpi4py sketch below)
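A minimal SPMD sketch of these ideas using mpi4py (assuming it is installed); Get_rank() and Get_size() play the roles of My_Num() and NumProcs() from the text, and sendrecv stands in for the paired send/receive so the ring step does not deadlock.

```python
# Minimal SPMD ring step with mpi4py (run with: mpiexec -n 4 python ring.py).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()          # My_Num()
p = comm.Get_size()             # NumProcs()
succ = (rank + 1) % p           # unique successor on the ring
pred = (rank - 1) % p           # unique predecessor

# Each processor sends its rank to its successor and receives from
# its predecessor; sendrecv pairs the two to avoid ordering deadlocks.
received = comm.sendrecv(rank, dest=succ, source=pred)
print(f"P{rank} received {received} from P{pred}")
```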
Unidirectional Ring Basics (cont – 3/5) • A restrictive assumption is that both the send and the receive are blocking • Then the participating processors cannot continue until the communication is complete • The blocking assumption is typical of first-generation platforms • A classical assumption is to keep the receive blocking but to allow the send to be non-blocking • The processor executing a send can continue while the data transfer takes place • To implement this, one function is used to initiate the send and another to determine when the communication has finished.
Unidirectional Ring Basics (cont – 4/5) • In algorithms, we simply indicate which operations are blocking and which are non-blocking • A more recently proposed assumption is that a single processor can send data, receive data, and compute simultaneously • All three can occur concurrently only if no race condition exists • It is convenient to think of three logical threads of control running on a processor • One for computing • One for sending data • One for receiving data • We will usually use this less restrictive third assumption
Unidirectional Ring Basics (cont – 5/5) • Timings for Send/Receive • We use a simplified version of the Hockney model • The time to send or receive a message over one link is c(m) = L + mb • m is the length of the message • L is the startup cost in seconds, due to the physical latency and the software overhead • b is the inverse of the data transfer rate.
The Broadcast Operation • The broadcast operation allows a processor Pk to send the same message of length m to all other processors. • At the beginning of the broadcast, the message is stored at address addr in the memory of the sending processor Pk. • At the end of the broadcast, the message will be stored at address addr in the memory of all processors. • All processors must call the following function: Broadcast(k, addr, m)
Broadcast Algorithm Overview • The message goes around the ring from processor to processor: from Pk to Pk+1 to Pk+2 to … to Pk-1. • We assume the processor numbers are taken modulo p, where p is the number of processors. For example, if k = 0 and p = 8, then k-1 = p-1 = 7. • Note there is no parallelism in this algorithm, since the message advances around the ring only one processor per step. • The predecessor of Pk (i.e., Pk-1) does not send the message back to Pk.
Analysis of Broadcast Algorithm • For the algorithm to be correct, the “receive” in Step 10 must execute before the send in Step 11. • Running Time: • Since we have a sequence of p-1 communications, the time to broadcast a message of length m is (p-1)(L+mb) • MPI does not typically use a ring topology for its communication primitives • It instead uses various tree topologies that are more efficient on modern parallel platforms. • However, these primitives are simpler on a ring. • This prepares readers to implement their own primitives when doing so is more efficient than using MPI's. • A sketch of the algorithm follows.
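A sketch of the ring broadcast in mpi4py; the step numbers above refer to the book's pseudocode, and ring_broadcast is our hypothetical rendering of it, not the book's own code.

```python
from mpi4py import MPI

def ring_broadcast(k, value, comm):
    """Pass value around the ring: P_k starts, everyone else receives
    from its predecessor and forwards, except the predecessor of P_k."""
    rank, p = comm.Get_rank(), comm.Get_size()
    succ, pred = (rank + 1) % p, (rank - 1) % p
    if rank == k:
        comm.send(value, dest=succ)        # originator starts the chain
    else:
        value = comm.recv(source=pred)     # receive before forwarding
        if succ != k:                      # P_{k-1} does not send back to P_k
            comm.send(value, dest=succ)
    return value

comm = MPI.COMM_WORLD
msg = ring_broadcast(0, "hello" if comm.Get_rank() == 0 else None, comm)
print(f"P{comm.Get_rank()} now holds {msg!r}")
```

The p-1 sequential link transfers give exactly the (p-1)(L+mb) running time derived above.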
Scatter Algorithm • The scatter operation allows Pk to send a different message of length m to each processor. • Initially, Pk holds the message of length m to be sent to Pq at location addr[q]. • To keep the array of addresses uniform, space for a message from Pk to itself is also provided. • At the end of the algorithm, each processor stores its message from Pk at location msg. • The efficient way to implement this algorithm is to pipeline the messages. • The message to the most distant processor (i.e., Pk-1) is sent first, followed by the message to Pk-2.
Discussion of Scatter Algorithm • In Steps 5-6, Pk successively sends messages to the other p-1 processors in decreasing order of their distance from Pk. • In Step 7, Pk stores its own message. • The other processors concurrently move messages along as they arrive, in Steps 9-12. • Each processor uses two buffers, with addresses tempS and tempR. • This allows a processor to send one message and receive the next in parallel in Step 12.
Discussion of Scatter Algorithm (cont) • In Step 11, tempS ↔ tempR means the two addresses are swapped so the value just received can be sent on to the next processor. • When a processor receives its own message from Pk, it stops forwarding (Step 10). • Whatever is in the receive buffer tempR at the end is stored as the processor's message from Pk (Step 13). • The running time of the scatter algorithm is the same as for the broadcast, namely (p-1)(L+mb) • A sketch of the algorithm follows.
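A sketch of the pipelined scatter in mpi4py; the book's version overlaps the send and receive of Step 12, which this simplified rendering serializes (correct, but it forgoes the overlap), and ring_scatter is our hypothetical helper, not the book's code.

```python
from mpi4py import MPI

def ring_scatter(k, addr, comm):
    """P_k sends addr[q] to each P_q, most distant processor first;
    the others forward messages until their own arrives."""
    rank, p = comm.Get_rank(), comm.Get_size()
    succ, pred = (rank + 1) % p, (rank - 1) % p
    if rank == k:
        # Steps 5-6: decreasing distance, so P_{k-1}'s message goes first.
        for i in range(1, p):
            comm.send(addr[(k + p - i) % p], dest=succ)
        return addr[k]                     # Step 7: message to self
    # Steps 9-13: every other processor forwards (k - 1 - rank) mod p messages.
    temp_r = comm.recv(source=pred)
    for _ in range((k - 1 - rank) % p):
        temp_s = temp_r                    # Step 11: swap buffers (tempS <-> tempR)
        comm.send(temp_s, dest=succ)       # forward the earlier message...
        temp_r = comm.recv(source=pred)    # ...then receive the next one
    return temp_r                          # Step 13: the last message is our own

comm = MPI.COMM_WORLD
data = [f"for P{q}" for q in range(comm.Get_size())] if comm.Get_rank() == 4 else None
# msg = ring_scatter(4, data, comm)   # every rank calls with the same k
```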
Example for Scatter Algorithm • Example: In Figure 3.7, let p=6 and k=4. • Steps 5-6: For i = 1 to p-1 do send(addr[(k+p-i) mod p], m) • Let PE = (k+p-i) mod p = (10 – i) mod 6 • For i=1, PE = 9 mod 6 = 3 • For i=2, PE = 8 mod 6 = 2 • For i=3, PE = 7 mod 6 = 1 • For i=4, PE = 6 mod 6 = 0 • For i=5, PE = 5 mod 6 = 5 • Note messages are sent to processors in the order 3, 2, 1, 0, 5 • That is, messages to most distant processors sent first.
Example for Scatter Algorithm (cont) • Example: In Figure 3.7, let p=6 and k=4. • Step 10: For i = 1 to (k-1-q) mod p do • Compute: (k-1-q) mod p = (3-q) mod 6 for each q • Note: q ≠ k, which is 4 • q = 5: i = 1 to 4, since (3-5) mod 6 = 4 • PE 5 forwards values in the loop from i = 1 to 4 • q = 0: i = 1 to 3, since (3-0) mod 6 = 3 • PE 0 forwards values from i = 1 to 3 • q = 1: i = 1 to 2, since (3-1) mod 6 = 2 • PE 1 forwards values from i = 1 to 2 • q = 2: i = 1 to 1, since (3-2) mod 6 = 1 • PE 2 is active in the loop only when i = 1 • q = 3: i = 1 to 0, since (3-3) mod 6 = 0 • PE 3 forwards nothing; the first message it receives is its own • The snippet below verifies this arithmetic.
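A quick check of the worked example's modular arithmetic (p = 6, k = 4):

```python
p, k = 6, 4

# Steps 5-6: order in which P_k sends (most distant processor first).
order = [(k + p - i) % p for i in range(1, p)]
print(order)                       # [3, 2, 1, 0, 5]

# Step 10: number of messages each other processor forwards.
for q in range(p):
    if q != k:
        print(q, (k - 1 - q) % p)  # P0:3, P1:2, P2:1, P3:0, P5:4
```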