Network Server Performance and Scalability. Scott Rixner Rice Computer Architecture Group http://www.cs.rice.edu/CS/Architecture/. June 9, 2005. Rice Computer Architecture. Faculty Scott Rixner Students Mike Calhoun Hyong-youb Kim Jeff Shafer Paul Willmann Research Focus
Network Server Performance and Scalability
Scott Rixner
Rice Computer Architecture Group
http://www.cs.rice.edu/CS/Architecture/
June 9, 2005
Rice Computer Architecture
• Faculty
  • Scott Rixner
• Students
  • Mike Calhoun
  • Hyong-youb Kim
  • Jeff Shafer
  • Paul Willmann
• Research Focus
  • System architecture
  • Embedded systems
• http://www.cs.rice.edu/CS/Architecture/
Network Servers Today
• Content types
  • Mostly text, small images
  • Low quality video (300-500 Kbps)
[Diagram labels: Clients, 3 Mbps, Internet, 1 Gbps, Network Server]
Network Servers in the Future
• Content types
  • Diverse multimedia content
  • DVD quality video (10 Mbps)
[Diagram labels: Clients, 100 Mbps, Internet, 100 Gbps, Network Server]
TCP Performance Issues
• Network Interfaces
  • Limited flexibility
  • Serialized access
• Computation
  • Only about 3000 instructions per packet
  • However, very low IPC, parallelization difficulties
• Memory
  • Large connection data structures (about 1KB each)
  • Low locality, high DRAM latency
Selected Research
• Network Interfaces
  • Programmable NIC design
  • Firmware parallelization
  • Network interface data caching
• Operating Systems
  • Connection handoff to the network interface
  • Parallelizing network stack processing
• System Architecture
  • Memory controller design
Designing a 10 Gigabit NIC
• Programmability for performance
  • Computation offloading improves performance
• NICs have power, area concerns
  • Architecture solutions should be efficient
• Above all, must support 10 Gbps links
  • What are the computation and memory requirements?
  • What architecture efficiently meets them?
  • What firmware organization should be used?
Aggregate Requirements: 10 Gbps – Maximum-sized Frames
• 1514-byte frames at 10 Gbps: 812,744 frames/s
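The frame rate on this slide can be reproduced from the per-frame overhead on the wire. The sketch below assumes that the 1514-byte figure excludes the 4-byte frame check sequence, the 8-byte preamble and start delimiter, and the 12-byte inter-frame gap; those constants are standard Ethernet values and are not stated on the slide.

```python
# Assumed per-frame wire overhead (standard Ethernet values, not from the slide).
FCS_BYTES = 4              # frame check sequence
PREAMBLE_BYTES = 8         # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # mandatory idle time between frames


def frames_per_second(link_bps: float, frame_bytes: int) -> float:
    """Return how many frames of the given size fit on the link each second."""
    wire_bytes = frame_bytes + FCS_BYTES + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES
    return link_bps / (wire_bytes * 8)


max_frame_rate = frames_per_second(10e9, 1514)
print(round(max_frame_rate))  # 812744
```

The result applies to one direction of the link; a full-duplex NIC would have to sustain this rate for sending and receiving at the same time.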
Meeting 10 Gbps Requirements
• Processor Architecture
  • At least 435 MIPS within embedded device
  • Limited instruction-level parallelism
  • Abundant task-level parallelism
• Memory Architecture
  • Control data needs low latency, small capacity
  • Frame data needs high bandwidth, large capacity
  • Must partition storage
Processor Architecture
• 2x performance costly
  • Branch prediction, reorder buffer, renaming logic, wakeup logic
  • Overheads translate to greater than 2x core power, area costs
  • Great for a GP processor; not for an embedded device
• Are there other opportunities for parallelism?
  • Many steps to process a frame – run them simultaneously
  • Many frames need processing – process simultaneously
• Solution: use parallel single-issue cores
Control Data Caching
[Figure: SMPCache trace analysis of a 6-processor NIC architecture]
A Programmable 10 Gbps NIC
[Block diagram labels: Instruction Memory; I-Cache 0 to I-Cache P-1; CPU 0 to CPU P-1; (P+4)x(S) Crossbar (32-bit); Scratchpad 0 to Scratchpad S-1; PCI Interface; Ethernet Interface; External Memory Interface (off-chip); PCI Bus; DRAM]
Network Interface Firmware
• NIC processing steps are well defined
• Must provide high latency tolerance
  • DMA to host
  • Transfer to/from network
• Event mechanism is the obvious choice
  • How do you process and distribute events?
Task Assignment with an Event Register
• PCI Interface finishes work
• Processor(s) inspect transactions
• Processor(s) need to enqueue TX data
• Processor(s) pass data to Ethernet Interface
[Figure: event register with a PCI Read Bit, a SW Event Bit, and other bits; bit values shown: 0 1 0 1 0]
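One way to picture the event register is as a bitmask that hardware and firmware set and that idle processors poll. The sketch below is a minimal model under stated assumptions: the class and function names are hypothetical, the mapping of the SW Event Bit to the "pass data to Ethernet Interface" step is a guess made for illustration, and a real NIC would need the claim step to be atomic across processors.

```python
from enum import IntFlag


class Event(IntFlag):
    # Only the "PCI Read" and "SW Event" bit names come from the slide.
    PCI_READ = 1 << 0  # set when the PCI interface finishes work
    SW_EVENT = 1 << 1  # set by firmware to request a follow-up step


class EventRegister:
    """Hypothetical model of the event register as a plain bitmask."""

    def __init__(self):
        self.bits = 0

    def raise_event(self, event):
        self.bits |= int(event)

    def claim(self):
        """Return and clear the lowest pending event, or None when idle."""
        if self.bits == 0:
            return None
        lowest = self.bits & -self.bits  # isolate the lowest set bit
        self.bits &= ~lowest
        return Event(lowest)


def on_pci_read(reg, log):
    log.append("inspect transactions, enqueue TX data")
    reg.raise_event(Event.SW_EVENT)  # assumed hand-off to the next step


def on_sw_event(reg, log):
    log.append("pass data to Ethernet interface")


HANDLERS = {Event.PCI_READ: on_pci_read, Event.SW_EVENT: on_sw_event}


def run_processor(reg, log):
    """Poll the register and run one handler per claimed event until idle."""
    while (event := reg.claim()) is not None:
        HANDLERS[event](reg, log)


reg, log = EventRegister(), []
reg.raise_event(Event.PCI_READ)  # the PCI interface finishes work
run_processor(reg, log)
print(log)
```

Because a handler can raise further events, the register doubles as a simple work queue: whichever processor polls next picks up the follow-up step.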
Task-level Parallel Firmware
[Timeline figure: rows for PCI Read HW Status, PCI Read Bit, Proc 0, and Proc 1 against time. The hardware row shows "Transfer DMAs 0-4" and "Transfer DMAs 5-9"; the processor rows show "Process DMAs 0-4", "Process DMAs 5-9", and idle periods; the PCI Read Bit toggles between 0 and 1.]
Frame-level Parallel Firmware
[Timeline figure: rows for PCI RD HW Status, Proc 0, and Proc 1 against time. The hardware row shows "Transfer DMAs 0-4" and "Transfer DMAs 5-9"; the processor rows show "Build Event", "Process DMAs 0-4", "Build Event", "Process DMAs 5-9", and idle periods.]
Scaling in Two Dimensions
[Figure: plot with an axis in Gbps]
A Programmable 10 Gbps NIC
• This NIC architecture relies on:
  • Data Memory System – Partitioned organization, not coherent caches
  • Processor Architecture – Parallel scalar processors
  • Firmware – Frame-level parallel organization
  • RMW Instructions – reduce ordering overheads
• A programmable NIC: A substrate for offload services
NIC Offload Services
• Network Interface Data Caching
• Connection Handoff
• Virtual Network Interfaces
• …
Network Interface Data Caching
• Cache data in network interface
• Reduces interconnect traffic
• Software-controlled cache
• Minimal changes to the operating system
• Prototype web server
  • Up to 57% reduction in PCI traffic
  • Up to 31% increase in server performance
  • Peak 1571 Mbps of content throughput
  • Breaks PCI bottleneck
Results: PCI Traffic
[Figure annotations: PCI saturated; 30% overhead; ~60% content traffic; 60% utilization; 1198 Mb/s of HTTP content; ~1260 Mb/s is limit!]
Content Locality
• Block cache with 4KB block size
• 8-16MB caches capture locality
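A locality study of this kind can be approximated by replaying a request trace against a simulated block cache. The sketch below is illustrative only: the 4KB block size comes from the slide, but the LRU replacement policy, the class name, and the request format are assumptions, since the slides say only that the cache is software-controlled.

```python
from collections import OrderedDict

BLOCK_SIZE = 4 * 1024  # 4KB blocks, as on the slide


class BlockCache:
    """Hypothetical block cache simulator; LRU replacement is an assumption."""

    def __init__(self, capacity_bytes):
        self.max_blocks = capacity_bytes // BLOCK_SIZE
        self.blocks = OrderedDict()  # (file_id, block_no) keys in LRU order
        self.hits = 0
        self.misses = 0

    def access(self, file_id, offset, length):
        """Touch every block covered by a read of `length` bytes at `offset`."""
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for block_no in range(first, last + 1):
            key = (file_id, block_no)
            if key in self.blocks:
                self.blocks.move_to_end(key)  # mark as most recently used
                self.hits += 1
            else:
                self.misses += 1
                self.blocks[key] = True
                if len(self.blocks) > self.max_blocks:
                    self.blocks.popitem(last=False)  # evict least recently used


cache = BlockCache(8 * 1024 * 1024)  # 8MB, the low end of the slide's range
cache.access("index.html", 0, 10_000)  # first request: three cold misses
cache.access("index.html", 0, 10_000)  # repeat request: three hits
print(cache.hits, cache.misses)  # 3 3
```

Replaying a real web trace through `access` at several capacities would show where the hit rate levels off, which is the kind of measurement the slide summarizes.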
Results: PCI Traffic Reduction
• 36-57% reduction with four traces
• Up to 31% performance improvement
[Figure annotations: Good temporal reuse; CPU bottleneck; Low temporal reuse; Low PCI utilization]
Connection Handoff to the NIC
• No magic processor on NIC
  • OS must control work between itself and NIC
• Move established connections between OS and NIC
  • Connection: unit of control
  • OS decides when and what
• Benefits
  • Sockets are intact – no need to change applications
  • Zero-copy
  • No port allocation or routing on NIC
  • Can adapt to route changes
• Handoff interface:
  • Handoff
  • Send
  • Receive
  • Ack
  • …
[Diagram labels: OS side: Sockets, TCP, Handoff, IP, Ethernet, Driver; NIC side: Handoff, TCP, IP, Ethernet / Lookup]
Connection Handoff
• Traditional offload
  • NIC replicates entire network stack
  • NIC can limit connections due to resource limitations
• Connection handoff
  • OS decides which subset of connections NIC should handle
  • NIC resource limitations limit amount of offload, not number of connections
[Diagram labels: OS, NIC]
Establishment and Handoff
• OS establishes connections
• OS decides whether or not to hand off each connection
[Diagram: 1. Establish a connection (OS Connection); 2. Handoff (NIC Connection)]
Data Transfer
• Offloaded connections require minimal support from OS for data transfers
  • Socket layer for interface to applications
  • Driver layer for interrupts, buffer management
[Diagram: 3. Send, Receive, Ack, … between OS Connection (Data) and NIC Connection (Data)]
Connection Teardown
• Teardown requires both NIC and OS to deallocate connection data structures
[Diagram: steps 4 and 5, De-alloc, applied to NIC Connection and OS Connection]
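The numbered steps on the handoff slides (establish, handoff, data transfer, de-allocate) can be sketched as a small bookkeeping model. Everything named here is hypothetical: `HandoffSketch`, its methods, and the fixed `nic_capacity` are illustrative assumptions and are not the prototype's interface. The order of the two de-allocations is also assumed.

```python
class HandoffSketch:
    """Hypothetical bookkeeping model of the connection lifecycle."""

    def __init__(self, nic_capacity):
        self.nic_capacity = nic_capacity  # assumed limit on NIC-resident connections
        self.os_conns = set()   # connections with state in the OS
        self.nic_conns = set()  # subset currently handed off to the NIC

    def establish(self, conn_id):
        """Step 1: the OS always establishes the connection."""
        self.os_conns.add(conn_id)

    def handoff(self, conn_id):
        """Step 2: the OS tries to hand off; the connection survives either way."""
        if conn_id not in self.os_conns:
            raise KeyError(conn_id)
        if len(self.nic_conns) >= self.nic_capacity:
            return False  # NIC is full, so the OS stack keeps handling it
        self.nic_conns.add(conn_id)
        return True

    def send(self, conn_id):
        """Step 3: report which side would run TCP for this connection."""
        return "NIC" if conn_id in self.nic_conns else "OS"

    def teardown(self, conn_id):
        """Steps 4 and 5: both sides release their state (order assumed)."""
        self.nic_conns.discard(conn_id)
        self.os_conns.discard(conn_id)


stack = HandoffSketch(nic_capacity=1)
stack.establish(1)
stack.establish(2)
first = stack.handoff(1)   # accepted: the NIC has room
second = stack.handoff(2)  # declined: stays on the OS stack
paths = (stack.send(1), stack.send(2))
stack.teardown(1)
retry = stack.handoff(2)   # room again after teardown
print(first, second, paths, retry)
```

The declined handoff illustrates the point made on the comparison slide: a full NIC limits how much is offloaded, not how many connections the server can hold.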
Connection Handoff Status
• Working prototype built on FreeBSD
• Initial results for web workloads
  • Reductions in cycles and cache misses on host
  • Transparently handle multiple NICs
  • Fewer messages on PCI
    • 1.4 per packet to 0.6 per packet
    • Socket-level instead of packet-level communication
  • ~17% throughput increase (simulations)
• To do
  • Framework for offload policies
  • Test zero-copy, more workloads
  • Port to Linux
Virtual Network Interfaces
• Traditionally used for user-level network access
  • Each process has its own “virtual NIC”
  • Provide protection among processes
• Can we use this concept to improve network stack performance within the OS?
  • Possibly, but we need to understand the behavior of the OS on networking workloads first
Networking Workloads
• Performance is influenced by
  • The operating system’s network stack
  • The increasing number of connections
  • Microprocessor architecture trends
Networking Performance
• Bound by TCP/IP processing
• 2.4 GHz Intel Xeon: 2.5 Gbps for one nttcp stream (Hurwitz and Feng, IEEE Micro 2004)
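The cited measurement can be turned into a rough per-byte processing budget. This is back-of-the-envelope arithmetic only, assuming the processor is fully occupied by TCP/IP processing at that rate, which is how the slide's "bound by TCP/IP processing" is read here.

```python
cpu_hz = 2.4e9          # 2.4 GHz Xeon, from the slide
throughput_bps = 2.5e9  # 2.5 Gbps for one nttcp stream, from the slide

# Cycles spent per byte moved, assuming the CPU is saturated at this rate.
cycles_per_byte = cpu_hz / (throughput_bps / 8)
print(f"{cycles_per_byte:.2f} cycles per byte")  # 7.68 cycles per byte
```

At that cost, a single core of this class would fall well short of a 10 Gbps link, which motivates the parallelization discussion on the later slides.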
Throughput vs. Connections
• Faster links lead to more connections
• More connections lead to worse performance
[Figure: HTTP content throughput (Mb/s, 0 to 1200) versus number of connections (4 to 2048) for the CS, IBM, NASA, and WC traces]
The End of the Uniprocessor?
• Uniprocessors have become too complicated
  • Clock speed increases have slowed down
  • Increasingly complicated architectures for performance
• Multi-core processors are becoming the norm
  • IBM Power 4 – 2 cores (2001)
  • Intel Pentium 4 – 2 hyperthreads (2002)
  • Sun UltraSPARC IV – 2 cores (2004)
  • AMD Opteron – 2 cores (2005)
  • Sun Niagara – 8 cores, 4 threads each (est. 2006)
• How do we use these cores for networking?
Parallelism with Data-Synchronized Stacks
• Linux 2.4.20+, FreeBSD 5+
Parallelism with Control-Synchronized Stacks
• DragonflyBSD, Solaris 10
Parallelization Challenges
• Data-Synchronous
  • Lots of thread parallelism
  • Significant locking overheads
• Control-Synchronous
  • Reduces locking
  • Load balancing issues
• Which approach is better?
  • Throughput? Scalability?
  • We’re optimizing both schemes in FreeBSD 5 to find out
• Network Interface
  • Serialization point
  • Can virtualization help?
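The trade-off between the two schemes can be illustrated with a toy accounting model. It assumes that a data-synchronized stack lets any CPU handle any packet while locking shared connection state, and that a control-synchronized stack pins each connection to one owning CPU so that no lock is needed. The function names, the round-robin spread, and the modulo ownership rule are all illustrative assumptions and do not describe any of the operating systems named on the slides.

```python
from collections import Counter


def data_synchronized(packets, num_cpus):
    """Any CPU may take any packet; each packet locks its connection state."""
    per_cpu = Counter()
    lock_acquisitions = 0
    for i, _conn_id in enumerate(packets):
        per_cpu[i % num_cpus] += 1  # assumed even spread across CPUs
        lock_acquisitions += 1      # shared connection state must be locked
    return per_cpu, lock_acquisitions


def control_synchronized(packets, num_cpus):
    """Each connection has one owner CPU, so its state needs no lock."""
    per_cpu = Counter()
    for conn_id in packets:
        per_cpu[conn_id % num_cpus] += 1  # assumed ownership rule
    return per_cpu, 0


# Hypothetical trace: connection 0 is busy, connections 1 and 2 are quiet.
packets = [0, 0, 0, 0, 0, 0, 1, 2]
data_load, data_locks = data_synchronized(packets, num_cpus=2)
ctrl_load, ctrl_locks = control_synchronized(packets, num_cpus=2)
print(dict(data_load), data_locks)  # {0: 4, 1: 4} 8
print(dict(ctrl_load), ctrl_locks)  # {0: 7, 1: 1} 0
```

In this trace the data-synchronized model balances load but pays a lock acquisition per packet, while the control-synchronized model avoids locks and leaves one CPU with most of the work, matching the "locking overheads" and "load balancing issues" bullets.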
Memory Controller Architecture
• Improve DRAM efficiency
  • Memory access scheduling
  • Virtual Channels
• Improve copy performance
  • 45-61% of kernel execution time can be copies
  • Best copy algorithm dependent on copy size, cache residency, cache state
  • Probe copy
  • Hardware copy acceleration
• Improve I/O performance…
Summary
• Our focus is on system-level architectures for networking
• Network interfaces must evolve
  • No longer just a PCI-to-Ethernet bridge
  • Need to provide capabilities to help the operating system
• Operating systems must evolve
  • Future systems will have 10s to 100s of processors
  • Networking must be parallelized – many bottlenecks remain
• Synergy between the NIC and OS cannot be ignored
• Memory performance is also increasingly a critical factor