Communication Networks of Parallel & Distributed Systems: Low Latency & High Bandwidth Come to Clusters & Grids Dong Lu Dept. of Computer Science Northwestern University http://www.cs.northwestern.edu/~donglu
Introduction • Communication networks play a vital role in parallel & distributed systems. • Modern communication networks support low latency & high bandwidth communication services. http://www.cs.northwestern.edu/~donglu
How is low latency & high bandwidth achieved? • DMA-based zero copy and OS bypass, which give applications direct access to the network interface card (NIC). • Communication protocol processing is offloaded to a helper processor on the NIC or channel adapter; often, TCP/IP is not used at all. • Switched networks or hypercube topologies with high-speed routers that support cut-through routing. http://www.cs.northwestern.edu/~donglu
TREND • These technologies are migrating from inside parallel systems to clusters. • Example: the InfiniBand Architecture. • Lower-latency & higher-bandwidth communication networks are becoming available to Grid computing. • Example: high-speed optical networking is becoming dominant in the Internet. http://www.cs.northwestern.edu/~donglu
Outline of this talk • Communication networks in parallel systems. • Communication networks in clusters and system area networks. • New and improved communication network protocols & technologies in Grid computing. • Trend: low latency & high bandwidth come to clusters & the Grid. http://www.cs.northwestern.edu/~donglu
Communication networks in parallel systems • IBM SP2 • SGI Origin 2000 http://www.cs.northwestern.edu/~donglu
IBM SP2 • Any-to-any packet-switched, multistage network with excellent scalability. • The Micro Channel adapter has an onboard microprocessor that offloads some of the protocol processing load. • The adapter can move messages to and from processor memory directly via direct memory access (DMA), thus supporting zero-copy message passing. http://www.cs.northwestern.edu/~donglu
64-node IBM SP2 network topology http://www.cs.northwestern.edu/~donglu
SGI Origin 2000 • Origin is a distributed shared memory, cc-NUMA multiprocessor; cc-NUMA stands for cache-coherent non-uniform memory access. • Hypercube network connected by SPIDER routers, which support wormhole "cut-through" routing: a router starts forwarding a packet before it has received the whole packet, in contrast to "store and forward". • Low-latency remote memory access is supported, and the ratio of remote-memory to local-memory latency is very low. http://www.cs.northwestern.edu/~donglu
128-processor SGI Origin 2000 network http://www.cs.northwestern.edu/~donglu
Communication Networks in Clusters • Gigabit Ethernet • Myrinet • Virtual Interface Architecture • InfiniBand Architecture http://www.cs.northwestern.edu/~donglu
Gigabit Ethernet • Can be switch based --- higher bandwidth and a smaller collision domain. • Jumbo frames are supported, up to 9 KB. • Some NICs are programmable, with an onboard processor. • Zero-copy, OS-bypass message passing can be supported with a programmable NIC and DMA, making Gigabit Ethernet a low-cost, low-latency, high-bandwidth architecture! http://www.cs.northwestern.edu/~donglu
High-performance Gigabit Ethernet architecture: the EMP system, which appeared in HPDC 2001 http://www.cs.northwestern.edu/~donglu
Myrinet • Developed from the network technology of a parallel system --- the Intel Paragon. • The first commercial LAN technology able to provide zero-copy message passing and to offload protocol processing to the interface processor. • Switch based; cut-through routing is supported. http://www.cs.northwestern.edu/~donglu
Myrinet • LANai is the host interface; it has a processor and a DMA engine onboard. • High bandwidth & low latency, but very expensive and not very stable. http://www.cs.northwestern.edu/~donglu
Myrinet http://www.cs.northwestern.edu/~donglu
Virtual Interface Architecture • Supports zero copy and OS bypass to provide low-latency and high-bandwidth communication services. Message send/receive operations and remote DMA (RDMA) are supported. • To a user process, VIA provides direct access to the network interface in a fully protected fashion. http://www.cs.northwestern.edu/~donglu
Remote DMA http://www.cs.northwestern.edu/~donglu
Virtual Interface Architecture • Each process owns a VI, and each VI consists of one send queue and one receive queue. • Memory regions are registered before data transfer, alongside the Open/Connect operations. After connection setup and memory registration, user data can be transferred without involving the operating system. • Memory protection is provided by a protection-tag mechanism: protection tags are associated with both VIs and memory regions. http://www.cs.northwestern.edu/~donglu
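To make the queue-pair structure and protection-tag check concrete, here is a toy Python model. All class and field names are hypothetical illustrations (the real VIA interface is the C-level VI Provider Library); the sketch only shows the relationships described above.

```python
# Toy model of a VI: a send/receive queue pair guarded by a protection tag.
# All names here are hypothetical illustrations, not the real VIPL API.
from collections import deque

class MemoryRegion:
    """A buffer registered with the NIC under a protection tag."""
    def __init__(self, data, ptag):
        self.data, self.ptag = data, ptag

class VirtualInterface:
    """One VI: one send queue plus one receive queue, owned by a process."""
    def __init__(self, ptag):
        self.ptag = ptag
        self.send_queue, self.recv_queue = deque(), deque()

    def post_send(self, region):
        # Memory protection: the tags on the VI and the registered region
        # must match, or the transfer is refused.
        if region.ptag != self.ptag:
            raise PermissionError("protection tag mismatch")
        self.send_queue.append(region)

# After connection setup and registration, posting a send descriptor
# involves no operating-system call in real VIA; here it is a queue append.
vi = VirtualInterface(ptag=42)
vi.post_send(MemoryRegion(b"payload", ptag=42))
```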
Virtual Interface Architecture http://www.cs.northwestern.edu/~donglu
InfiniBand Architecture http://www.cs.northwestern.edu/~donglu
InfiniBand Architecture • Encompasses a system area network for connecting multiple independent processor and I/O platforms. • Defines the communication and management infrastructure supporting both I/O and inter-processor communication. http://www.cs.northwestern.edu/~donglu
InfiniBand Architecture • Components: host channel adapters (HCAs), target channel adapters (TCAs), and fabric switches. • Channel adapters offload the protocol processing load from the CPU. • DMA/RDMA is supported. • Zero-copy data transfers proceed without kernel involvement, and hardware provides highly reliable, fault-tolerant communication. http://www.cs.northwestern.edu/~donglu
Communication Networks in the Grid • IPv6 • High-performance TCP Reno • TCP tuning for distributed applications on the WAN • TCP Vegas vs. TCP Reno • Random Early Detection gateways • Aggressive TCP Reno: what I have done in the Linux kernel http://www.cs.northwestern.edu/~donglu
IPv6 • Expanded addressing capabilities: 128-bit addresses vs. 32 bits in IPv4. • Flow labeling capability --- good news for real-time and high-performance applications. • Header format simplification. • Improved support for extensions and options. • Authentication and privacy capabilities. http://www.cs.northwestern.edu/~donglu
IPv6 header (after RFC 2460):

+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Version| Traffic Class |           Flow Label                  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|         Payload Length        |  Next Header  |   Hop Limit   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                    Source Address (128 bits)                  +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                 Destination Address (128 bits)                +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

http://www.cs.northwestern.edu/~donglu
High performance TCP Reno (RFC 1323) • TCP extensions for high performance. • The TCP header uses a 16-bit field to report the receive window size to the sender, so the largest window that can be used is 2**16 = 65 KB. • TCP performance depends not on the transfer rate itself, but on the bandwidth*delay product, which is growing quickly and is now much larger than 65 KB on many paths. http://www.cs.northwestern.edu/~donglu
High performance TCP Reno (RFC 1323) • A TCP option, "Window Scale", allows windows larger than 2**16 bytes. • However, high transfer rates alone can threaten TCP reliability by violating the assumptions behind TCP's mechanisms for duplicate detection and sequencing: any sequence number may eventually be reused, and errors may result from an accidental reuse of TCP sequence numbers in data segments. http://www.cs.northwestern.edu/~donglu
High performance TCP Reno (RFC 1323) • The PAWS (Protect Against Wrapped Sequence numbers) mechanism was proposed to avoid this potential problem. http://www.cs.northwestern.edu/~donglu
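To make these numbers concrete, here is a small sketch that computes the window a path needs from its bandwidth*delay product and derives the Window Scale shift factor that would cover it. The 100 Mbps / 25 ms path is an assumed example, echoing the figures used later in this talk.

```python
# Hedged sketch: how large a window a high bandwidth*delay path needs, and
# which RFC 1323 window-scale factor would cover it. The link speed and RTT
# below are illustrative assumptions, not measurements from the talk.
import math

bandwidth_bytes_per_sec = 100e6 / 8   # assume a 100 Mbps path
rtt_sec = 0.025                       # assume 25 ms round-trip time

# The sender can keep at most bandwidth*RTT bytes in flight.
required_window = bandwidth_bytes_per_sec * rtt_sec

# Without window scaling, the 16-bit header field caps the window here.
max_unscaled = 2**16 - 1

# The Window Scale option left-shifts the advertised value by `scale` bits.
scale = max(0, math.ceil(math.log2(required_window / max_unscaled)))

print(f"required window: {required_window / 1024:.0f} KB")
print(f"window-scale factor needed: {scale}")
```

On this assumed path the required window is about 305 KB, so a scale factor of 3 (an 8x multiplier) is already needed.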
TCP tuning for distributed applications • TCP uses the congestion window size to control how many packets are sent into the network; the send & receive buffer sizes, together with the network congestion status, determine the congestion window size. • Many operating systems use a default TCP buffer size of either 24 or 32 KB (Linux defaults to only 8 KB). http://www.cs.northwestern.edu/~donglu
TCP tuning for distributed applications • Suppose the slowest hop from site A to site B is 100 Mbps (about 12 MB/s), and the typical latency across the US is about 25 ms: 12 MB/s * 25 ms = 300 KB. If the default 24 KB is used as the TCP buffer, then 24/300 = 8%, so only a small portion of the bandwidth is used! • Rule of thumb: buffer size = 2 * bandwidth * delay (one-way), or equivalently buffer size = bandwidth * RTT. http://www.cs.northwestern.edu/~donglu
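A minimal sketch of applying this rule with the standard BSD socket API: size the send and receive buffers to the bandwidth*delay product before connecting. The path numbers and the endpoint are hypothetical placeholders.

```python
# Minimal sketch, assuming the path figures from the slide above; real code
# would measure bandwidth and RTT rather than hard-coding them.
import socket

bandwidth = 12 * 1024 * 1024        # assume slowest hop ~12 MB/s (100 Mbps)
delay = 0.0125                      # assume ~12.5 ms one-way delay
bufsize = int(2 * bandwidth * delay)   # buffer size = 2 * bandwidth * delay

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Request send/receive buffers sized to the path instead of the 8-32 KB
# defaults; set them before connect() so the window negotiated at handshake
# time can take advantage of them.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bufsize)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bufsize)
# sock.connect(("siteB.example.org", 5000))   # hypothetical endpoint
```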
TCP Vegas vs. TCP Reno • Researchers have shown that aggregate network traffic can be characterized as self-similar or fractal, which is generally bad for Internet performance. Several researchers claim that the primary source of this self-similarity is TCP Reno's "additive increase, multiplicative decrease" (AIMD) congestion-control mechanism. • Instead of reacting to congestion as TCP Reno does, TCP Vegas tries to avoid congestion. http://www.cs.northwestern.edu/~donglu
TCP Vegas • Vegas has two threshold values, A and B, with defaults A=1 and B=3. Let ESR be the expected sending rate and ASR the actual sending rate. • Let diff = ESR – ASR. • If diff < A, increase the congestion window linearly during the next round-trip time. • If diff > B, decrease the window linearly during the next RTT. • Otherwise, leave the congestion window size unchanged. http://www.cs.northwestern.edu/~donglu
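A toy sketch of this rule, one decision per RTT. Following the standard Vegas formulation, the rate difference is scaled by the base RTT so that diff is measured in segments and comparable to the thresholds A and B; the rate estimates and the one-segment steps are deliberately simplified.

```python
# Toy sketch of the Vegas adjustment rule stated on the slide above.
# Windows are in segments; rate estimation is simplified.
def vegas_update(cwnd, base_rtt, current_rtt, A=1, B=3):
    """Return the congestion window to use during the next RTT."""
    esr = cwnd / base_rtt        # expected sending rate (no queueing)
    asr = cwnd / current_rtt     # actual sending rate observed this RTT
    # Scale the rate difference by the base RTT so diff counts the extra
    # segments queued in the network, matching the units of A and B.
    diff = (esr - asr) * base_rtt

    if diff < A:                 # network looks underused: grow linearly
        return cwnd + 1
    elif diff > B:               # queues are building: back off linearly
        return cwnd - 1
    return cwnd                  # between A and B: hold steady
```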
TCP Vegas vs. TCP Reno • Some researchers have shown that, with proper values for A and B, Vegas behaves better than Reno in the Grid computing environment. • The problem with Vegas is that it has not been validated on large-scale networks, and the optimal values of A and B are not easy to determine. http://www.cs.northwestern.edu/~donglu
Random Early Detection gateways • RED gateways maintain a weighted average of the queue length, minimum and maximum thresholds (REDmin, REDmax), and an early-drop rate P. Packets are then queued as follows (see the sketch below): • If the average queue length < REDmin, queue all arriving packets. • If REDmin <= average queue length < REDmax, drop arriving packets with probability P (in full RED, the drop probability ramps up from 0 to P as the average approaches REDmax). • If the average queue length >= REDmax, drop all arriving packets. • RED increases fairness and overall network performance, so it is widely deployed in Internet routers. Since the Grid is built on top of the Internet, the effect of RED routers should be considered. http://www.cs.northwestern.edu/~donglu
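A minimal sketch of the queueing rule above. The weight and threshold values are illustrative assumptions, and the drop probability is held constant at P, matching the slide's simplified description rather than full RED's linear ramp.

```python
# Minimal RED sketch following the slide's simplified rule.
import random

RED_MIN, RED_MAX = 5, 15    # queue-length thresholds (packets), assumed
P = 0.1                     # early-drop probability, assumed
W = 0.002                   # weight for the moving average, assumed

avg_qlen = 0.0

def red_enqueue(queue, packet):
    """Queue or drop one arriving packet according to RED."""
    global avg_qlen
    # Weighted (exponential) average of the instantaneous queue length.
    avg_qlen = (1 - W) * avg_qlen + W * len(queue)

    if avg_qlen < RED_MIN:            # light load: accept everything
        queue.append(packet)
    elif avg_qlen < RED_MAX:          # early-detection region:
        if random.random() >= P:      # drop with probability P
            queue.append(packet)
    # else: average beyond REDmax -> drop the packet (do nothing)
```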
Aggressive TCP Reno: What I have done • Linux kernel modification of TCP congestion control. • Some studies have shown that TCP Reno's congestion control is too conservative, so bandwidth is not fully utilized. • So, make it more aggressive. How? http://www.cs.northwestern.edu/~donglu
Aggressive TCP Reno: What I have done • Start the window at more than one packet (for example, 20), and increase it more quickly during slow start. • Keep congestion avoidance the same. • Whenever there is a packet loss, don't drop to one packet; instead, drop to 80% of the window size, and set the new threshold to 90% of the current window size. http://www.cs.northwestern.edu/~donglu
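The actual change was a patch to the Linux kernel's TCP code, which is not reproduced here; the sketch below restates the modified rules as a toy model. The slow-start increment of 2 is an assumed placeholder, since the slide says only "more quickly".

```python
# Toy restatement of the aggressive-Reno rules above (windows in segments).
# The real implementation was a Linux kernel modification; constants follow
# the slide except `increment`, which is an assumed placeholder.
INITIAL_WINDOW = 20              # start from ~20 packets instead of 1

def on_loss(cwnd):
    """On packet loss, shrink instead of collapsing to one segment."""
    ssthresh = 0.9 * cwnd        # new threshold: 90% of the current window
    return 0.8 * cwnd, ssthresh  # window drops to 80%, not to 1 packet

def on_ack_in_slow_start(cwnd, increment=2):
    """Grow faster than standard slow start's one segment per ACK."""
    return cwnd + increment
```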
Aggressive TCP Reno: What I have done • Built into the Linux kernel. [Chart: Aggressive TCP vs. TCP Reno] http://www.cs.northwestern.edu/~donglu
Some performance gains for virtualized audio http://www.cs.northwestern.edu/~donglu
Aggressive TCP Reno • Through the kernel modification, some performance gains were achieved without modifying the application code. • But the results are still not very satisfactory or stable. Why? http://www.cs.northwestern.edu/~donglu
Aggressive TCP Reno • That can be due to three reasons. • First, virtualized audio is very computationally intensive, so further improving communication performance will not change overall performance much (most of the time is spent computing). • Second, the bandwidth*delay product on the cluster is small, which implies that this technique may be more effective on the WAN (with much larger RTTs). • Third, the effect of fast retransmit and fast recovery is not considered here, but it turns out to be important. http://www.cs.northwestern.edu/~donglu
Conclusion • Some technologies used by parallel systems are migrating into clusters, making low latency and high bandwidth widely available. • With the development of the Internet, new & improved protocols are being proposed and tested to provide lower latency & higher bandwidth for Grid computing. • A new aggressive TCP proposal has been implemented and tested, and some performance gains were achieved. More work is needed to make it more effective and stable. http://www.cs.northwestern.edu/~donglu
Questions? http://www.cs.northwestern.edu/~donglu