Internetworking: Hardware/Software Interface CS 213, LECTURE 16 L.N. Bhuyan
Protocols: HW/SW Interface
• Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently
• Enabling technologies: SW standards that allow reliable communication without reliable networks
• Hierarchy of SW layers, giving each layer responsibility for a portion of the overall communication task, called protocol families or protocol suites
• Transmission Control Protocol/Internet Protocol (TCP/IP)
• This protocol family is the basis of the Internet
• IP makes a best effort to deliver; TCP guarantees delivery
• TCP/IP is used even when communicating locally: NFS uses IP even though it is communicating across a homogeneous LAN
TCP/IP packet
• Application sends a message
• TCP breaks it into 64 KB segments and adds a 20 B header
• IP adds a 20 B header and sends the segment to the network
• If Ethernet, the segment is broken into 1500 B packets with headers and trailers
• Headers and trailers carry a length field, destination, window number, version, ...
(Figure: an Ethernet frame carrying the IP header, then the TCP header, then the TCP data of up to 64 KB as IP data.)
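A minimal sketch of the arithmetic behind these header sizes, assuming no IP or TCP options and the standard 1500 B Ethernet MTU from the slide; the macro names are illustrative, not from any particular stack.

```c
/* Illustrative only: how the 20 B IP and 20 B TCP headers from the slide
 * nest inside a 1500 B Ethernet MTU. Header sizes assume no options. */
#include <stdio.h>

#define ETH_MTU        1500      /* Ethernet payload per frame (bytes)     */
#define IP_HDR_LEN       20      /* IPv4 header, no options                */
#define TCP_HDR_LEN      20      /* TCP header, no options                 */
#define TCP_SEGMENT  (64*1024)   /* one TCP segment from the 64 KB message */

int main(void)
{
    int payload_per_frame = ETH_MTU - IP_HDR_LEN - TCP_HDR_LEN;  /* 1460 B */
    int frames_per_segment =
        (TCP_SEGMENT + payload_per_frame - 1) / payload_per_frame;

    printf("TCP/IP payload per Ethernet frame: %d bytes\n", payload_per_frame);
    printf("Frames needed for one 64 KB segment: %d\n", frames_per_segment);
    return 0;
}
```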
Communicating with the Server: The O/S Wall
(Figure: CPU and NIC connected over the PCI bus; the user/kernel boundary sits between the application and the protocol stack.)
• Problems:
• O/S overhead to move a packet between the network and the application level => protocol stack (TCP/IP)
• O/S interrupt
• Data copying from kernel space to user space and vice versa
• Oh, the PCI bottleneck!
The Send/Receive Operation
• The application writes the transmit data to the TCP/IP sockets interface for transmission, in payload sizes ranging from 4 KB to 64 KB.
• The data is copied from user space to kernel space.
• The OS segments the data into maximum transmission unit (MTU)–size packets and adds TCP/IP header information to each packet.
• The OS copies the data onto the network interface card (NIC) send queue.
• The NIC performs a direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC, and interrupts the CPU to indicate completion of the transfer.
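A minimal sketch of the first step as the application sees it: one write() to a connected TCP socket, after which the copy, segmentation, and DMA described above happen inside the kernel. The address and port are placeholders, not from the slides.

```c
/* Sketch of the application side of the send path: one write() to a
 * connected TCP socket; copying, MTU segmentation, and the NIC DMA all
 * happen inside the kernel. Address and port are placeholders. */
#include <arpa/inet.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    char payload[4096];                       /* 4 KB payload, as on the slide */
    memset(payload, 'x', sizeof(payload));

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(5001);                    /* placeholder port   */
    inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);  /* placeholder server */

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) == 0) {
        ssize_t n = write(fd, payload, sizeof(payload)); /* user->kernel copy */
        printf("queued %zd bytes for the TCP/IP stack\n", n);
    }
    close(fd);
    return 0;
}
```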
Transmitting data across the memory bus using a standard NIC http://www.dell.com/downloads/global/power/1q04-her.pdf
Timing Measurement in UDP Communication
X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005
I/O Acceleration Techniques
• TCP Offload: offload TCP/IP checksum and segmentation to interface hardware or a programmable device (e.g., TOEs) – a TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.
• O/S Bypass: user-level software techniques to bypass the protocol stack – zero-copy protocol (needs a programmable device in the NIC for direct user-level memory access, i.e., virtual-to-physical memory mapping; e.g., VIA)
• Architectural Techniques: instruction set optimization, multithreading, copy engines, onloading, prefetching, etc.
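A small illustration of the zero-copy idea using a stock syscall rather than the TOE/RDMA hardware named above: sendfile(2) on Linux hands file data to a socket without routing it through user space. This is only an analogy for the copy-avoidance principle, not the offload mechanism on the slide; the path and socket descriptor are placeholders.

```c
/* Zero-copy idea with a standard syscall: sendfile(2) moves file data to a
 * socket without a round trip through user space. Not TOE/RDMA offload,
 * just the same copy-avoidance principle. Paths/fds are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send an entire file over an already-connected socket with no user-space copy. */
static ssize_t send_file_zero_copy(int sock_fd, const char *path)
{
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size);

    close(file_fd);
    return sent;   /* bytes handed to the stack, or -1 on error */
}
```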
Comparing standard TCP/IP and TOE-enabled TCP/IP stacks (http://www.dell.com/downloads/global/power/1q04-her.pdf)
Chelsio 10 Gbps TOE
Cluster (Network) of Workstations/PCs
Myrinet Interface Card
InfiniBand Interconnection
• Zero-copy mechanism. The zero-copy mechanism enables a user-level application to perform I/O on the InfiniBand fabric without being required to copy data between user space and kernel space.
• RDMA. RDMA facilitates transferring data from remote memory to local memory without the involvement of host CPUs.
• Reliable transport services. The InfiniBand architecture implements reliable transport services so the host CPU is not involved in protocol-processing tasks like segmentation, reassembly, NACK/ACK, etc.
• Virtual lanes. The InfiniBand architecture provides 16 virtual lanes (VLs) to multiplex independent data lanes onto the same physical lane, including a dedicated VL for management operations.
• High link speeds. The InfiniBand architecture defines three link speeds, characterized as 1X, 4X, and 12X, yielding data rates of 2.5 Gbps, 10 Gbps, and 30 Gbps, respectively.
Reprinted from Dell Power Solutions, October 2004. By Onur Celebioglu, Ramesh Rajagopalan, and Rizwan Ali.
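A short worked computation of the 1X/4X/12X figures quoted above: each lane signals at 2.5 Gbps, so the aggregate rate scales with lane count. The 8b/10b factor in the sketch (roughly 2 Gbps of usable data per lane) is an assumption about the original SDR encoding and is not stated on the slide.

```c
/* Worked arithmetic for the 1X/4X/12X link widths: 2.5 Gbps signaling per
 * lane, aggregate rate scaling with lane count. The 8b/10b usable-data
 * factor is an assumption, not taken from the slide. */
#include <stdio.h>

int main(void)
{
    const double lane_gbps = 2.5;             /* signaling rate per lane */
    const int widths[] = {1, 4, 12};          /* 1X, 4X, 12X             */

    for (int i = 0; i < 3; i++) {
        double raw  = widths[i] * lane_gbps;  /* 2.5, 10, 30 Gbps        */
        double data = raw * 8.0 / 10.0;       /* assumed 8b/10b overhead */
        printf("%2dX: %5.1f Gbps signaling, ~%4.1f Gbps data\n",
               widths[i], raw, data);
    }
    return 0;
}
```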
InfiniBand system fabric
UDP Communication – Life of a Packet
X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," Journal of Parallel and Distributed Computing (JPDC), Special issue on Design and Performance of Networks for Super-, Cluster-, and Grid-Computing, Vol. 65, Issue 10, October 2005, pp. 1290-1298.
Timing Measurement in UDP Communication
X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005
Network Bandwidth is Increasing
(Figure: network bandwidth in Gbps and CPU frequency in GHz on a log scale, 1990 to 2010; network bandwidth outpaces Moore's Law.)
• TCP requirements – rule of thumb: 1 GHz of CPU for 1 Gbps of network bandwidth
• The gap between the rate at which network applications can be processed and the fast-growing network bandwidth is increasing
Profile of a Packet
(Figure: breakdown of per-packet processing into system overheads, descriptor & header accesses, IP processing, TCB accesses, TCP processing, and memory copy, grouped into compute and memory components.)
• Total average clocks per packet: ~21K
• Effective bandwidth: 0.6 Gb/s (1 KB receive)
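A back-of-the-envelope check that these two numbers are consistent: at ~21K clocks per 1 KB packet, a core clocked around 1.5 GHz sustains roughly 0.6 Gb/s. The 1.5 GHz figure is an inference used for the sketch, not a number from the slide.

```c
/* Consistency check of the profile above: ~21K clocks per 1 KB packet on
 * an assumed 1.5 GHz core gives roughly 0.6 Gb/s of receive bandwidth. */
#include <stdio.h>

int main(void)
{
    const double clocks_per_pkt = 21e3;       /* from the profile above */
    const double pkt_bits       = 1024 * 8;   /* 1 KB receive           */
    const double cpu_hz         = 1.5e9;      /* assumed core clock     */

    double pkts_per_sec = cpu_hz / clocks_per_pkt;
    double gbps         = pkts_per_sec * pkt_bits / 1e9;

    printf("%.0f packets/s -> %.2f Gb/s effective bandwidth\n",
           pkts_per_sec, gbps);               /* ~71K pkt/s, ~0.59 Gb/s */
    return 0;
}
```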
Five Emerging Technologies
• Optimized Network Protocol Stack (ISSS+CODES, 2003)
• Cache Optimization (ISSS+CODES, 2003; ANCHOR, 2004)
• Network Stack Affinity Scheduling
• Direct Cache Access
• Lightweight Threading
• Memory Copy Engine (ICCD 2005 and IEEE TC)
Stack Optimizations (Instruction Count)
• Separate data & control paths
• TCP data-path focused
• Reduce # of conditionals
• NIC assist logic (L3/L4 stateless logic)
• Basic memory optimizations
• Cache-line aware data structures
• SW prefetches
• Optimized computation
• Standard compiler capability
3X reduction in instructions per packet
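A minimal form of the "SW prefetches" item above: issue a prefetch for the next packet descriptor while the current one is processed, so the load latency overlaps useful work. The descriptor layout and process_one() are hypothetical stand-ins for real stack code; __builtin_prefetch is a GCC/Clang intrinsic.

```c
/* Software prefetching sketch: touch the next descriptor ahead of time so
 * its cache miss overlaps processing of the current packet. The descriptor
 * layout and process_one() are hypothetical. */
#include <stddef.h>

struct pkt_desc {              /* hypothetical, cache-line sized descriptor */
    void   *buf;
    size_t  len;
    char    pad[48];
};

void process_one(struct pkt_desc *d);    /* provided elsewhere (assumed) */

void process_ring(struct pkt_desc *ring, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(&ring[i + 1], 0 /* read */, 1 /* low locality */);
        process_one(&ring[i]);           /* runs in the shadow of the prefetch */
    }
}
```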
Network Stack Affinity
(Figure: multi-core CPUs, chipset, and memory; one CPU's cores are dedicated to network I/O through the I/O interface.)
• Assigns network I/O workloads to designated devices
• Separates network I/O from application work
• Reduces scheduling overheads
• More efficient cache utilization
• Increases pipeline efficiency
• A CPU dedicated to network I/O – Intel calls this onloading
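A user-space flavor of the same idea, assuming Linux: pin the thread that owns all network I/O to one core with sched_setaffinity so its cache state stays warm and application work is scheduled elsewhere. Core 0 is an arbitrary choice; real onloading involves far more than this sketch.

```c
/* Sketch of dedicating a core to network I/O on Linux: pin the calling
 * thread to core 0 with sched_setaffinity. Core number is arbitrary. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                          /* dedicate core 0 to net I/O */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = this thread */
        perror("sched_setaffinity");
        return 1;
    }
    printf("network I/O thread pinned to core 0\n");
    /* ... run the packet-processing loop here ... */
    return 0;
}
```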
Direct Cache Access
• Normal DMA write: (1) NIC DMA write to memory, (2) snoop invalidate of the cache, (3) memory write, (4) CPU read from memory.
• Direct Cache Access (DCA): (1) NIC DMA write, (2) cache update, (3) CPU read from the cache.
Eliminates 3 to 25 memory accesses by placing packet data directly into the cache.
Lightweight Threading
• Builds on helper threads; reduces CPU stall
(Figure: a single-core pipeline with a single hardware context; a thread manager switches between S/W-controlled threads when a memory informing event, e.g. a cache miss, occurs.)
• Continue computing in the single pipeline in the shadow of a cache miss
Potential Efficiencies (10X)
(Figure: projected gains, separating the benefits of affinity from the benefits of architectural techniques.)
Greg Regnier, et al., "TCP Onloading for DataCenter Servers," IEEE Computer, vol. 37, Nov. 2004
On-CPU, multi-gigabit, line-speed network I/O is possible
I/O Acceleration – Problem Magnitude
(Figure: workload domains – security services, storage over IP, networking – and their data movement and transformation operations: parsing and tree construction, crypto, memory copies and effects of streaming, CRCs.)
I/O processing rates are significantly limited by the CPU in the face of data movement and transformation operations.