Network Driver Performance
Outline
• Software features for high performance NICs
  • Some of the top features include:
    • Scatter-Gather DMA
    • Automatic tuning of resources
    • Task offloading support for IPv6
• Hardware features for high performance NICs
  • Some of the top features include:
    • Task offloading support
    • Receive-Side Scaling (RSS) support
• Performance tools
  • NTttcp
  • Kernrate profiler
Goals
• Use this information to tune your network driver to work with your hardware for the best networking performance
• Use this information to fine-tune your hardware features so they operate at their optimal performance point
• Learn how to use NTttcp to isolate network performance problems
• Learn how to use Kernrate to identify bottlenecks on hot paths

Note: References to packets apply to NDIS 5.x drivers and translate to NetBuffers and NetBufferLists for NDIS 6.0 drivers on Windows codenamed "Longhorn".
Network Software Optimizations
• Scatter-Gather DMA
  • SG DMA yields optimum performance with the NDIS 6.0 model
  • It is highly recommended to pre-allocate the buffer that hosts the SCATTER_GATHER_LIST as part of the Transmit Control Block during initialization and to reuse it (see the sketch below)
  • Use the maximum buffer size for the MaximumPhysicalMapping parameter of NdisMInitializeScatterGatherDma to avoid buffer allocation and copies
• Use cached memory to allocate NIC receive buffers
  • x86, IA64, and x64 hardware guarantees DMA cache coherency, so there is no need to call NdisFlushBuffer; it would be a no-op

    NdisMAllocateSharedMemory(
        Adapter->AdapterHandle,   // miniport adapter handle
        pMpRxbuf->AllocSize,
        TRUE,                     // CACHED
        &pMpRxbuf->AllocVa,
        &pMpRxbuf->AllocPa);
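A minimal sketch of the pre-allocation pattern for an NDIS 5.x miniport. The Adapter context, TCB array, and the constants MAX_PHYSICAL_MAPPING, NIC_MAX_SG_ELEMENTS, and NIC_TAG are illustrative names, not part of the NDIS API:

    NDIS_STATUS Status;
    ULONG       SglSize;
    ULONG       i;

    // Register for scatter/gather DMA; MaximumPhysicalMapping is the
    // largest send buffer the driver will ever hand to the HAL.
    Status = NdisMInitializeScatterGatherDma(
                 Adapter->AdapterHandle,
                 TRUE,                   // Dma64BitAddresses: NIC decodes > 4 GB
                 MAX_PHYSICAL_MAPPING);

    if (Status == NDIS_STATUS_SUCCESS) {
        // Pre-allocate one SCATTER_GATHER_LIST buffer per Transmit Control
        // Block so the send path never has to allocate one on the fly.
        SglSize = FIELD_OFFSET(SCATTER_GATHER_LIST, Elements) +
                  NIC_MAX_SG_ELEMENTS * sizeof(SCATTER_GATHER_ELEMENT);

        for (i = 0; i < Adapter->NumTcbs; i++) {
            NdisAllocateMemoryWithTag(&Adapter->Tcb[i].pSgListBuffer,
                                      SglSize,
                                      NIC_TAG);    // error handling omitted
        }
    }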
More Network Software Optimizations
• NDIS Safe APIs
  • Required for the NDIS 6.0 model!
  • They have shown overall TCP/IP improvements of up to 7% in kernel-mode scenarios (e.g., IIS 6.0)
  • They eliminate the need to call into the kernel to probe and lock buffers
  • Set the NDIS_ATTRIBUTE_USES_SAFE_BUFFER_APIS flag in NdisMSetAttributesEx for NDIS 5.x drivers; the flag does not need to be set for NDIS 6.0 drivers
  • Example: when using NdisQueryBufferSafe, set the VirtualAddress parameter to NULL to avoid mapping buffers sent down by NDIS
• 64-bit DMA support
  • Avoid copies for addresses above the 4 GB range by setting Dma64BitAddresses to TRUE in NdisMInitializeScatterGatherDma
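A minimal sketch of both points for an NDIS 5.x bus-master PCI miniport; the Adapter context and the Buffer variable on the send path are illustrative:

    // During MiniportInitialize: advertise use of the safe buffer APIs
    // (NDIS 6.0 drivers do not need this flag).
    NdisMSetAttributesEx(Adapter->AdapterHandle,
                         (NDIS_HANDLE)Adapter,
                         0,                                    // default check-for-hang period
                         NDIS_ATTRIBUTE_DESERIALIZE |
                         NDIS_ATTRIBUTE_BUS_MASTER |
                         NDIS_ATTRIBUTE_USES_SAFE_BUFFER_APIS,
                         NdisInterfacePci);

    // On the send path: query only the length of an NDIS_BUFFER, passing
    // NULL for VirtualAddress so NDIS never maps the buffer.
    UINT BufferLength;
    NdisQueryBufferSafe(Buffer,
                        NULL,                // no virtual address needed; avoid mapping
                        &BufferLength,
                        NormalPagePriority);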
Locking Mechanisms Optimizations
• Locks are an expensive hit to system performance if not used properly
  • Measurements show approximately 160 cycles per lock acquire and 140 cycles per lock release
  • Spinlocks should be used to protect data, not code
• Locking at DPC level
  • When already running at DISPATCH_LEVEL, avoid extra code by using:
    • NdisDprAcquireSpinLock
    • NdisDprReleaseSpinLock
• Read-write locks
  • To minimize the number of spinlock acquire and release operations, use the NDIS read-write lock functions for scalability (see the sketch below):
    • NdisInitializeReadWriteLock
    • NdisAcquireReadWriteLock
    • NdisReleaseReadWriteLock
  • Read-write locks allow multiple concurrent readers on a single lock and limit write access to a single writer thread; no read access is allowed during a write
  • They still behave like a spinlock and raise IRQL to DISPATCH_LEVEL when acquired
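A minimal sketch of the read-write lock pattern, protecting a mostly-read structure such as a multicast address list; the data structure names are illustrative:

    NDIS_RW_LOCK  McastLock;
    LOCK_STATE    LockState;

    NdisInitializeReadWriteLock(&McastLock);     // once, at initialization time

    // Readers: many can hold the lock concurrently.
    NdisAcquireReadWriteLock(&McastLock, FALSE /* read */, &LockState);
    // ... read the multicast list ...
    NdisReleaseReadWriteLock(&McastLock, &LockState);

    // Writer: exclusive access; no readers are allowed while it is held.
    NdisAcquireReadWriteLock(&McastLock, TRUE /* write */, &LockState);
    // ... update the multicast list ...
    NdisReleaseReadWriteLock(&McastLock, &LockState);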
Auto-Tuning Network Drivers
• Static: driver and NIC hardware parameters are based on the system configuration, such as whether the machine is a client or a server, the CPU, the memory, and the NIC's capabilities
• Dynamic: system conditions dictate what type of tuning is necessary for optimum performance; resource utilization and network load are the metrics for determining the best operating points for the NIC and driver
• Some of the primary auto-tuning parameters include:
  • Interrupt moderation
  • Receive buffer allocation
  • Small-buffer coalescing
  • Packets processed per DPC
• Drivers can obtain the current processor utilization by using the NdisGetCurrentProcessorCounts function (see the sketch below)
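A minimal sketch of sampling the processor counts twice and deriving a rough utilization figure to drive dynamic tuning. It assumes the kernel-plus-user tick count includes idle ticks and therefore serves as the total for the interval; consult the DDK documentation for the exact semantics. The variable names and the tuning policy comment are illustrative:

    ULONG Idle1, Total1, Cpu1;
    ULONG Idle2, Total2, Cpu2;
    ULONG Utilization = 0;     // percent busy over the sampling interval

    NdisGetCurrentProcessorCounts(&Idle1, &Total1, &Cpu1);
    // ... wait for the next tuning interval, e.g., a periodic timer ...
    NdisGetCurrentProcessorCounts(&Idle2, &Total2, &Cpu2);

    if (Cpu1 == Cpu2 && (Total2 - Total1) != 0) {
        // Percentage of non-idle ticks in the sampling interval.
        Utilization = 100 - (100 * (Idle2 - Idle1)) / (Total2 - Total1);
    }
    // Example policy: if Utilization is high, increase interrupt moderation
    // and packets processed per DPC; if it is low, favor lower latency.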
Task Offload Support
• Checksum offload
  • Has been shown to improve overall TCP/IP performance by up to 20%
    • Improves the caching effect and eliminates churning: 8% increase
    • Reduces code path length: 12% improvement
• TCP segmentation offload
  • Has been shown to improve overall TCP/IP performance by up to 11%
  • Reduces the sender's cycles-per-byte cost by 2x (it goes below 1.5)
  • NDIS 6.0 supports its successor, Giant Send Offload (> 64 KB)
  • NDIS 6.0 adds IPv6 support for TCP segmentation offload
  • NDIS 6.0 offers support for IPsec offload
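A minimal sketch of how an NDIS 5.x miniport might read the per-packet checksum request on the send path so it can program the hardware accordingly; Packet is the NDIS_PACKET being transmitted, and the descriptor-programming steps are placeholders. On NDIS 6.0 the equivalent information travels on the NET_BUFFER_LIST instead:

    NDIS_TCP_IP_CHECKSUM_PACKET_INFO ChecksumInfo;

    // The stack attaches its checksum offload request as per-packet info.
    ChecksumInfo.Value = (ULONG)(ULONG_PTR)
        NDIS_PER_PACKET_INFO_FROM_PACKET(Packet, TcpIpChecksumPacketInfo);

    if (ChecksumInfo.Transmit.NdisPacketChecksumV4) {
        if (ChecksumInfo.Transmit.NdisPacketTcpChecksum) {
            // program the transmit descriptor to insert the TCP checksum
        }
        if (ChecksumInfo.Transmit.NdisPacketIpChecksum) {
            // program the transmit descriptor to insert the IP checksum
        }
    }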
Message Signaled Interrupts (MSI)
• MSI has the following attributes:
  • No acknowledgment is necessary for the message
  • No sharing is usually necessary
  • There is support for many interrupts per PCI function
  • Caveat: it only works on P4 and later chipsets
• Advantages of MSI
  • With no sharing in place, latency is lower because only a single ISR runs
  • Bus utilization goes down by eliminating some read operations from the device
  • The device can target interrupts at designated processors (e.g., for RSS)
  • It guarantees data buffer coherency because the message follows the DMA traffic on the bus
Receive-Side Scaling (RSS)
• The existing stack limits receive processing to one CPU
  • Restricts the scalability of a Web server to the number of short-lived connections a single CPU can process (per NIC)
  • Limits transaction throughput to the packet receive processing rate of one CPU
  • Example: a four-processor machine cannot use more than 25% of its overall CPU cycles when hosting a single NIC
• RSS helps both long- and short-lived connections
  • When CPU processing is dominated by connection setup, RSS improves performance
  • Connection setup tasks map well to a general-purpose CPU
• RSS gives us parallel receive processing = parallel DPCs (see the conceptual sketch below)
• Planned availability in the Scalable Networking Pack add-on for Windows Server 2003 and in Longhorn
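A conceptual sketch of how RSS steers received packets, not a specific NDIS API: the NIC computes a Toeplitz hash over the TCP/IP 4-tuple with a host-supplied key, and the low-order bits of the hash index an indirection table of CPU numbers, so all packets of a given connection are processed on the same processor. The table size and names below are illustrative:

    #define RSS_INDIRECTION_TABLE_SIZE 64            // illustrative size

    ULONG IndirectionTable[RSS_INDIRECTION_TABLE_SIZE];  // populated by the host stack

    ULONG RssSelectCpu(ULONG ToeplitzHash)
    {
        // Mask off the low-order bits of the 32-bit hash and use them as an
        // index; the table entry is the CPU that will run the receive DPC.
        return IndirectionTable[ToeplitzHash & (RSS_INDIRECTION_TABLE_SIZE - 1)];
    }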
Receive Side Scaling
• Diagram: today, a NIC's ISR and DPC run on a single processor (one processor per NIC, e.g., CPU0); with RSS, receive processing runs as parallel DPCs across CPU0, CPU1, CPU2, and so on, fed by parallel ReceivePacket queues (multiple processors per NIC)
Network Performance Tools
• NTttcp benchmark
  • Uses publicly available Winsock 2.x APIs
  • Uses an overlapped I/O and multithreading model
  • Transfers random data from memory to memory
  • Provides throughput, CPU utilization, and interrupt rate
  • Provides a cycles-per-byte metric, key for measuring performance and catching regressions
  • Provides the packet-to-ACK ratio to detect link condition
  • Provides the number of segment retransmits and errors
  • Supports all Windows hardware architectures
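As a rough illustration of how NTttcp is typically invoked (assuming the publicly released ntttcp.exe; switch names differ between NTttcp versions, so consult the tool's built-in help):

    rem Receiver side (192.168.0.10 is the receiver's address): 4 threads, 60 seconds
    ntttcp.exe -r -m 4,*,192.168.0.10 -t 60

    rem Sender side, pointed at the same receiver address
    ntttcp.exe -s -m 4,*,192.168.0.10 -t 60

The throughput, CPU, cycles-per-byte, and retransmit figures in the output are the numbers to track across driver changes to catch regressions.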
More Network Performance Tools
• Kernrate profiling tool
  • General-purpose profiler for tracking CPU utilization
  • Samples periodically (programmable interval) to see what is executing
  • Adjustable granularity: per-processor, per-process, and total
  • Supports all Windows hardware architectures
  • Supports Windows 2000 and later
  • Highly customizable (numerous options)
• The profiling tool and its viewer (KrView) can be downloaded from:
  • http://www.microsoft.com/whdc/system/sysperf/krview.mspx
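As a rough illustration (the -z switch restricts, or "zooms", profiling to specific modules; other switches vary by Kernrate version, so consult its documentation):

    rem Sample kernel CPU time, zooming into the NDIS and TCP/IP modules;
    rem stop sampling (e.g., with Ctrl+C) to print per-module hit counts
    kernrate.exe -z ndis -z tcpip

The per-module and per-function hit counts point at the hot paths where driver cycles are being spent.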
Call To Action
• NDIS 6.0 driver developers need to implement task offloading support for IPv6
• Fine-tune your hardware so it operates at its optimal performance point
• Fine-tune your network driver to work optimally with your hardware for the best performance
• For questions, please e-mail ndis6fb@microsoft.com; include your name, company name, and phone number
Additional Resources
• E-mail: ndis6fb@microsoft.com
• Web resources:
  • Analyzing Driver Performance: http://www.microsoft.com/whdc/driver/perform/drvperf.mspx
  • High Performing Adapters and Drivers whitepaper: http://www.microsoft.com/whdc/device/network/NetAdapters-Drvs.mspx
  • Kernrate is available for download from: http://www.microsoft.com/whdc/system/sysperf/krview.mspx
© 2005 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.