TCP Servers: Offloading TCP/IP Processing in Internet Servers Liviu Iftode Department of Computer Science University of Maryland and Rutgers University
My Research: Network-Centric Systems • TCP Servers and Split-OS [NSF CAREER] • Migratory TCP and Service Continuations • Federated File Systems • Smart Messages [NSF ITR-2] and Spatial Programming for Networks of Embedded Systems • http://discolab.rutgers.edu
Networking and Performance • The transport-layer protocol must be efficient [Diagram: clients reach Internet servers over TCP across an IP WAN, and the servers reach storage over a SAN; IP or not IP? TCP or not TCP?]
The Scalability Problem Apache web server on 1-way and 2-way 300 MHz Intel Pentium II SMPs, with clients repeatedly accessing a static 16 KB file
The TCP/IP Stack [Diagram: the application enters the kernel through system calls. Send path: copy_from_application_buffers → TCP_send → IP_send → packet_scheduler → setup_DMA → packet_out. Receive path: packet_in → hardware_interrupt_handler → software_interrupt_handler → IP_receive → TCP_receive → copy_to_application_buffers.]
Serialized Networking Actions [Diagram: the same send and receive paths, with the operations from the system call down through protocol processing and the interrupt handlers marked as serialized.]
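To make the serialization concrete, here is a minimal user-space sketch (not the Linux code): the send path, running in system-call context, and the receive path, running in interrupt/softirq context, contend for the same per-connection lock, so the two halves of the stack cannot make progress on a connection at the same time. All names, sizes, and the lock granularity are illustrative assumptions.

```c
#include <pthread.h>
#include <string.h>

struct sock_state {
    pthread_mutex_t lock;           /* per-connection lock shared by both paths */
    char   send_buf[2048];
    size_t send_len;
    char   recv_buf[2048];
    size_t recv_len;
};

static struct sock_state sk = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* system-call path: copy from the application buffer, then the send-side
 * protocol work (TCP_send, IP_send, DMA setup) runs while holding the lock
 * (bounds checks omitted in this sketch) */
void tcp_send_path(const char *data, size_t len)
{
    pthread_mutex_lock(&sk.lock);   /* blocks if the receive path holds it */
    memcpy(sk.send_buf, data, len);
    sk.send_len = len;
    pthread_mutex_unlock(&sk.lock);
}

/* interrupt-driven path: receive-side protocol work for an incoming packet */
void tcp_receive_path(const char *pkt, size_t len)
{
    pthread_mutex_lock(&sk.lock);   /* serialized against the send path */
    memcpy(sk.recv_buf, pkt, len);
    sk.recv_len = len;
    pthread_mutex_unlock(&sk.lock);
}
```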
TCP/IP Processing is Very Expensive • Protocol processing can take up to 70% of the CPU cycles • For Apache web server on uniprocessors [Hu 97] • Can lead to Receive Livelock [Mogul 95] • Interrupt handling consumes a significant amount of time • Soft Timers [Aron 99] • Serialization affects scalability
Outline • Motivation • TCP Offloading using TCP Server • TCP Server for SMP Servers • TCP Server for Cluster-based Servers • Prototype Evaluation
TCP Offloading Approach • Offload network processing from application hosts to dedicated processors/nodes/I-NICs • Reduce OS intrusion • network interrupt handling • context switches • serializations in the networking stack • cache and TLB pollution • Should adapt to changing load conditions • Software or hardware solution?
The TCP Server Idea [Diagram: the client connects over TCP/IP to the server; on the server, a dedicated TCP Server component runs the TCP/IP processing and communicates with the application and OS on the host processor over a fast communication channel.]
TCP Server Performance Factors • Efficiency of the TCP server implementation • event-based server, no interrupts • Efficiency of communication between host(s) and TCP server • non-intrusive, low-overhead • API • asynchronous, zero-copy • Adaptiveness to load
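A rough sketch of what such an asynchronous, zero-copy send interface could look like on the host side; the ts_* names, the completion array, and the in-place completion are hypothetical stand-ins for illustration, not the actual TCP Server API. The application posts a descriptor that points at a pre-registered buffer and later polls for completion, so the send call itself neither copies data nor blocks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef int ts_handle_t;                 /* handle to an offloaded socket */

struct ts_send_req {
    ts_handle_t sock;
    const void *registered_buf;          /* pre-registered, DMA-able region */
    size_t      length;
    unsigned    id;                      /* completions are matched by id */
};

static bool completed[128];              /* stand-in for a completion queue */

/* post a send descriptor and return immediately; in this self-contained
 * sketch the "TCP server" is simulated by completing the request on the spot */
static int ts_send_async(const struct ts_send_req *req)
{
    printf("posted send of %zu bytes on socket %d\n", req->length, req->sock);
    completed[req->id % 128] = true;     /* real system: set by the TCP server */
    return 0;
}

/* non-blocking completion check */
static bool ts_send_done(unsigned id) { return completed[id % 128]; }

int main(void)
{
    static char page[4096];              /* would be registered/pinned once */
    struct ts_send_req r = { .sock = 3, .registered_buf = page,
                             .length = sizeof page, .id = 7 };
    ts_send_async(&r);                   /* no copy, no blocking here */
    while (!ts_send_done(7)) { /* overlap other work */ }
    return 0;
}
```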
TCP Servers for Multiprocessor Systems [Diagram: on an SMP server, CPUs 0..N run the application and host OS while a dedicated CPU runs the TCP Server; the two sides communicate through shared memory, and the client connects to the TCP Server.]
TCP Servers for Clusters with Memory-to-Memory Interconnects [Diagram: in a cluster-based server, the host node runs the application and a separate node runs the TCP Server; the two nodes are connected by a memory-to-memory interconnect, and the client connects to the TCP Server node.]
SMP-based Implementation [Diagram: the IO APIC routes network and clock interrupts to the TCP Server processor, while disk and other interrupts go to the host processors running the application and OS.]
SMP-based Implementation (cont’d) [Diagram: the host side (application and OS) enqueues a send request on a shared queue; the TCP Server dequeues and executes the send request.]
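A minimal sketch of the shared send-request queue, assuming one host producer and one TCP Server consumer per queue; the structure, field names, and sizes are illustrative rather than taken from the kernel implementation, and a real version would also need memory barriers around the head/tail updates.

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 256

struct send_request {
    int    sock;                    /* connection the data belongs to */
    void  *buf;                     /* buffer to transmit from */
    size_t len;
};

struct shared_queue {
    volatile unsigned    head;      /* next slot the consumer will read */
    volatile unsigned    tail;      /* next slot the producer will write */
    struct send_request  slot[QUEUE_SLOTS];
};

/* host side: enqueue a send request instead of running the TCP/IP code itself */
static bool enqueue_send(struct shared_queue *q, struct send_request r)
{
    unsigned next = (q->tail + 1) % QUEUE_SLOTS;
    if (next == q->head)
        return false;               /* full: a sign the server set is overloaded */
    q->slot[q->tail] = r;
    q->tail = next;                 /* publish to the TCP Server processor */
    return true;
}

/* TCP Server side: dequeue and execute the request (TCP_send, IP_send, ...) */
static bool dequeue_send(struct shared_queue *q, struct send_request *out)
{
    if (q->head == q->tail)
        return false;               /* empty: the dispatcher may go idle */
    *out = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    return true;
}
```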
TCP Server Event-Driven Architecture [Diagram: a Dispatcher, guided by a Monitor, schedules the Send Handler, Receive Handler, and Asynchronous Event Handler; the handlers interact with the NIC and, through the Shared Queue, with the application processors.]
Dispatcher • Kernel thread executing at the highest priority level in the kernel • Schedules the different handlers based on input from the Monitor • Executes an infinite loop and does not yield the processor • No other activity can execute on the TCP Server processor
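A sketch of the dispatcher's main loop under the architecture above; it is a fragment, and monitor_next_hint(), aeh_poll_nic(), and the handler functions are placeholders for the corresponding kernel-thread components rather than the real entry points.

```c
enum hint { HINT_NONE, HINT_SEND, HINT_RECEIVE };

extern enum hint monitor_next_hint(void);     /* hint supplied by the Monitor */
extern void aeh_poll_nic(void);               /* Asynchronous Event Handler */
extern void run_send_handler(void);
extern void run_receive_handler(void);

void dispatcher(void)
{
    for (;;) {                                /* never yields the processor */
        aeh_poll_nic();                       /* highest-priority work first */
        switch (monitor_next_hint()) {
        case HINT_SEND:    run_send_handler();    break;
        case HINT_RECEIVE: run_receive_handler(); break;
        default:           break;             /* nothing queued right now */
        }
    }
}
```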
Asynchronous Event Handler (AEH) • Handles asynchronous network events • Interacts with the NIC • Can be an interrupt service routine or a polling routine • Is a short-running thread • Has the highest priority among the TCP server modules • The clock interrupt is used as a guaranteed trigger for the AEH when polling
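A sketch of the polling variant of the AEH, with the clock interrupt as the guaranteed fallback trigger; nic_has_packet() and process_incoming_packet() are invented placeholders for the driver-level calls.

```c
#include <stdbool.h>

extern bool nic_has_packet(void);          /* device-specific status check */
extern void process_incoming_packet(void); /* IP_receive, TCP_receive, ... */

static volatile bool clock_tick_pending;   /* set from the clock interrupt */

/* called from the clock interrupt handler on the TCP Server processor */
void on_clock_interrupt(void) { clock_tick_pending = true; }

/* short-running: drains whatever the NIC has, then returns to the dispatcher */
void aeh_poll_nic(void)
{
    if (!clock_tick_pending && !nic_has_packet())
        return;                            /* nothing to do this pass */
    clock_tick_pending = false;            /* this poll pass serves the tick */
    while (nic_has_packet())
        process_incoming_packet();
}
```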
Send and Receive Handlers • Scheduled in response to a request in the Shared Memory queues • Run at the priority of the network protocol • Interact with the Host processors
Monitor • Observes the state of the system queues and provides hints to the Dispatcher about which handler to schedule • Used for bookkeeping and dynamic load balancing • Scheduled periodically or when an exception occurs • Queue overflow or empty • Bad checksum for a network packet • Retransmissions on a connection • Can be used to reconfigure the set of TCP servers in response to load variation
Cluster-based Implementation [Diagram: on the host, a socket stub tunnels each socket request over VI channels to the TCP Server node, which dequeues and executes the socket request.]
SAN TCP Server Architecture [Diagram: the TCP Server node contains a Request Handler, Socket Call Processor, Eager Processor, Resource Manager, and TCP/IP Provider, with a VI Connection Handler toward the host over the SAN and a NIC toward the WAN.]
Sockets and VI Channels • Pool of VIs created at initialization • Avoids the cost of creating VIs in the critical path • Registered memory regions associated with each VI • Send and receive buffers associated with the socket • Also used to exchange control data • Socket mapped to a VI on the first socket operation • All subsequent operations on the socket tunneled through the same VI to the TCP server
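One way the lazy socket-to-VI binding could look in the host library; the vi_t type, pool size, and mapping table are assumptions made for illustration, not the actual implementation.

```c
#include <stddef.h>

#define VI_POOL_SIZE 64
#define MAX_SOCKETS  1024

typedef struct vi vi_t;                      /* opaque VI channel handle */

static vi_t *vi_pool[VI_POOL_SIZE];          /* filled at library init time */
static int   vi_next_free;
static vi_t *sock_to_vi[MAX_SOCKETS];        /* one VI per offloaded socket */

/* the first operation on a socket grabs a VI from the pool; all later
 * operations on that socket are tunneled through the same channel */
vi_t *vi_for_socket(int sockfd)
{
    if (sockfd < 0 || sockfd >= MAX_SOCKETS)
        return NULL;
    if (sock_to_vi[sockfd] == NULL) {
        if (vi_next_free >= VI_POOL_SIZE)
            return NULL;                     /* pool exhausted */
        sock_to_vi[sockfd] = vi_pool[vi_next_free++];
    }
    return sock_to_vi[sockfd];
}
```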
Socket Call Processing • Host library intercepts socket call • Socket call parameters are tunneled to the TCP server over a VI channel • TCP server performs socket operation and returns results to the host • Library returns control to the application immediately or when the socket call completes (asynchronous vs synchronous processing).
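A sketch of the synchronous case of such an intercepted call, reusing the hypothetical vi_for_socket() mapping from the previous sketch; vi_send()/vi_recv() are invented wrappers around the VIA primitives and the message layout is made up for illustration. In the asynchronous case the library would return right after posting the request instead of waiting for the reply.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

typedef struct vi vi_t;
extern vi_t *vi_for_socket(int sockfd);
extern int   vi_send(vi_t *vi, const void *msg, size_t len);
extern int   vi_recv(vi_t *vi, void *reply, size_t len);

#define OP_SEND 1

struct sock_call_msg {                  /* tunneled request header + payload */
    int    opcode;                      /* e.g. OP_SEND */
    int    sockfd;
    size_t data_len;
    char   data[8192];
};

/* replacement for send(): parameters travel to the TCP server over the VI */
ssize_t ts_send(int sockfd, const void *buf, size_t len, int flags)
{
    struct sock_call_msg msg;
    ssize_t result;
    (void)flags;

    if (len > sizeof msg.data)
        len = sizeof msg.data;          /* single-message sketch only */
    msg.opcode   = OP_SEND;
    msg.sockfd   = sockfd;
    msg.data_len = len;
    memcpy(msg.data, buf, len);

    vi_send(vi_for_socket(sockfd), &msg, sizeof msg - sizeof msg.data + len);
    vi_recv(vi_for_socket(sockfd), &result, sizeof result);  /* synchronous case */
    return result;
}
```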
Design Issues for TCP Servers • Splitting of the TCP/IP processing • Where to split? • Asynchronous event handling • Interrupt or polling? • Asynchronous API • Event scheduling and resource allocation • Adaptation to different workloads
SMP-based Prototype • Modified the Linux 2.4.9 SMP kernel on the Intel x86 platform to implement the TCP server • Most parts of the system are kernel modules, with small inline changes to the TCP stack, software interrupt handlers, and the task structures • Instrumented the kernel using on-chip performance-monitoring counters to profile the system
Evaluation Testbed • Server • 4-way 550 MHz Intel Pentium II Xeon system with 1 GB DRAM and 1 MB on-chip L2 cache • Clients • 4-way SMPs • 2-way 300 MHz Intel Pentium II system with 512 MB RAM and 256 KB on-chip L2 cache • NIC: 3Com 996-BT Gigabit Ethernet • Server application: Apache 1.3.20 web server • Client program: sclients [Banga 97] • Trace-driven execution of clients
Splitting TCP/IP Processing [Diagram: the send and receive paths from the stack diagram, divided between application processors and dedicated processors at three candidate split points (C1, C2, C3) along the path from the system calls down through protocol processing and interrupt handling.]
Adapting TCP Servers to Changing Workloads • Monitor the queues • Identify low and high water marks to change the size of the processor set • Execute a special handler for exceptional events • Queue length lower than the low water mark • Set a flag which dispatcher checks • Dispatcher sleeps if the flag is set • Reroute the interrupts • Queue length higher than the high water mark • Wake up the dispatcher on the chosen processor • Reroute the interrupts
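A sketch of the watermark policy on this slide; the threshold values and the sleep-flag, wake-up, and interrupt-rerouting hooks are illustrative placeholders for the corresponding kernel mechanisms.

```c
#define LOW_WATER   16
#define HIGH_WATER 224

extern unsigned shared_queue_length(void);
extern void set_dispatcher_sleep_flag(int cpu);   /* dispatcher checks this flag */
extern void wake_dispatcher(int cpu);
extern void reroute_network_interrupts_to(int cpu);

/* run periodically by the Monitor to grow or shrink the TCP Server set */
void monitor_adapt(int spare_cpu, int busy_cpu)
{
    unsigned qlen = shared_queue_length();

    if (qlen < LOW_WATER) {               /* shrink: release a TCP Server CPU */
        set_dispatcher_sleep_flag(spare_cpu);
        reroute_network_interrupts_to(busy_cpu);
    } else if (qlen > HIGH_WATER) {       /* grow: recruit another CPU */
        wake_dispatcher(spare_cpu);
        reroute_network_interrupts_to(spare_cpu);
    }
}
```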
Cluster-based Prototype • User-space implementation (bypasses the host kernel) • Entire socket operation offloaded to the TCP Server • C1, C2, and C3 offloaded by default • Optimizations • Asynchronous processing: AsyncSend • Processing ahead: Eager Receive, Eager Accept • Avoiding data copy at the host using pre-registered buffers • requires a different API: MemNet
Evaluation Testbed • Server • Host and TCP Server: 2-way 300 MHz Intel Pentium II systems with 512 MB RAM and 256 KB on-chip L2 cache • Clients • 4-way 550 MHz Intel Pentium II Xeon system with 1 GB DRAM and 1 MB on-chip L2 cache • NIC: 3Com 996-BT Gigabit Ethernet • Server application: custom web server • Flexibility in modifying the application to use our API • Client program: httperf
Related Work • TCP Offloading Engines • Communication Services Platform (CSP) • System architecture for scalable cluster-based servers, using a VIA-based SAN to tunnel TCP/IP packets inside the cluster • Piglet - A vertical OS for multiprocessors • Queue Pair IP - A new endpoint mechanism for inter-network processing inspired by memory-to-memory communication
Conclusions • Offloading networking functionality to a set of dedicated TCP servers yields up to 30% performance improvement • Performance Essentials: • TCP Server architecture • event driven • polling instead of interrupts • adaptive to load • API • asynchronous, zero-copy