The ‘zero-copy’ initiative A look at the ‘zero-copy’ concept and an x86 Linux implementation for the case of outgoing packets
From Wikipedia, the free encyclopedia: Zero-copy is an adjective that refers to computer operations in which the CPU does not perform the task of copying data from one area of memory to another. The availability of zero-copy versions of operating system elements such as device drivers, file systems and network protocol stacks greatly increases the performance of many applications, since using a CPU that is capable of complex operations just to make copies of data can be a great waste of resources. Zero-copy also reduces the number of context-switches from User space to Kernel space and vice-versa. Several OS like Linux support zero copying of files through specific API's like sendfile, sendfile64, etc. Techniques for creating zero-copy software include the use of DMA-based copying, and memory-mapping through an MMU. These features require specific hardware support and usually involve particular memory alignment requirements. Zero-copy protocols are especially important for high-speed networks, as memory copies would cause a serious workload for the host cpu. Still, such protocols have some initial overhead so that avoiding programmed IO (PIO) there only makes sense for large messages.
Application source-code

    #include <stdio.h>      // printf(), perror()
    #include <stdlib.h>     // exit()
    #include <string.h>     // strlen()
    #include <fcntl.h>      // open(), O_RDWR
    #include <unistd.h>     // write()

    char message[] = "This is a test of network-packet transmission \n";

    int main( void )
    {
        int fd = open( "/dev/nic", O_RDWR );
        if ( fd < 0 ) { perror( "/dev/nic" ); exit(1); }

        int msglen = strlen( message );
        int nbytes = write( fd, message, msglen );
        if ( nbytes < 0 ) { perror( "write" ); exit(1); }

        printf( "Transmitted %d bytes \n", nbytes );
    }
Transmit operation [Diagram: in user space, the application program hands its user data-buffer to write() in the runtime library; in kernel space, the Linux file subsystem calls the nic device-driver’s my_write(), which uses copy_from_user() to fill the kernel packet buffer that the hardware then reads by DMA.] We want to eliminate this copying-operation
Our driver’s packet-layout [Diagram: the packet-buffer in kernel-space holds destn-address, source-address, TYPE/LENGTH, count, and then the data.] Format for Legacy Transmit-Descriptor (16 bytes): base-address (64-bits), Packet-length, CSO, cmd, status, CSS, special
Can zero-copy be transparent? • We would like to implement the zero-copy concept in our ‘nic2.c’ character driver in such a manner that no changes would be required to an ‘application’ program’s code • We will show how to do this for ‘outgoing’ packets (i.e., by modifying ‘my_write()’), but achieving zero-copy with ‘incoming’ packets would be a lot more complicated!
TX Descriptor’s CMD byte
Command-Byte Format (bit 7 down to bit 0): IDE | VLE | 0 | 0 | RS | IC | IFCS | EOP
EOP = End-Of-Packet (1=yes, 0=no)
RS = Report Status (1=yes, 0=no)
VLE = VLAN-tag Enable
Key question: What will the NIC do if we don’t set the EOP-bit in a TX Descriptor?
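The bit positions in the command byte can be written down as C constants. This is a minimal sketch: the bit order is taken from the slide, while the constant names and the helper `make_cmd()` are hypothetical and are not taken from the course's driver.

```c
#include <stdint.h>

/* Bit positions in the legacy TX descriptor's CMD byte, as laid out on the
 * slide (bit 7 ... bit 0 = IDE VLE 0 0 RS IC IFCS EOP).
 * The constant names are illustrative, not the driver's own. */
enum {
    CMD_EOP  = 1 << 0,   /* End-Of-Packet */
    CMD_IFCS = 1 << 1,
    CMD_IC   = 1 << 2,
    CMD_RS   = 1 << 3,   /* Report Status */
    CMD_VLE  = 1 << 6,   /* VLAN-tag Enable */
    CMD_IDE  = 1 << 7
};

/* Hypothetical helper: build a CMD byte from the two flags the slide
 * discusses, leaving every other bit clear. */
static uint8_t make_cmd(int end_of_packet, int report_status)
{
    uint8_t cmd = 0;
    if (end_of_packet)
        cmd |= CMD_EOP;
    if (report_status)
        cmd |= CMD_RS;
    return cmd;
}
```

With this layout, `make_cmd(1, 1)` gives 0x09 and `make_cmd(0, 0)` gives 0x00. The second value is the case the key question asks about: a descriptor whose EOP-bit is clear.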
Splitting our packet-layout [Diagram: the packet-buffer in kernel-space, with destn-address, source-address, TYPE/LENGTH and count forming a header of HDR bytes, followed by LEN bytes of data.] Format for Legacy Transmit-Descriptor Pair: first descriptor: base-address (64-bits), Packet-Length (=HDR), CSO, cmd (EOP=0), status, CSS, special; second descriptor: base-address (64-bits), Packet-Length (=LEN), CSO, cmd (EOP=1), status, CSS, special
Splitting our packet-buffer [Diagram: a packet-buffer in kernel-space holds the header (destn-address, source-address, TYPE/LENGTH, count; HDR bytes), while a packet-buffer in user-space holds the data (LEN bytes).] Format for Legacy Transmit-Descriptor Pair: first descriptor: base-address (64-bits), Packet-Length (=HDR), CSO, cmd (EOP=0), status, CSS, special; second descriptor: base-address (64-bits), Packet-Length (=LEN), CSO, cmd (EOP=1), status, CSS, special. Two physical packet-buffers comprise one logical packet that gets transmitted!
Transmitting a ‘split-packet’ The 82573L controller ‘merges’ the contents of these separate buffers into just a single ethernet-packet [Diagram: the NIC hardware uses DMA to fetch the packet-data buffer from the application-program in user-space, and uses DMA to fetch the packet-header buffer from the device-driver module in kernel-space.]
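The descriptor pair can be sketched in C as follows. The field order follows the legacy transmit-descriptor layout shown on the slides. The type name `tx_desc_t`, the helper `fill_split_pair()` and the choice to leave every command bit other than EOP clear are assumptions made for illustration; the course's driver may set further bits.

```c
#include <stdint.h>
#include <string.h>

#define CMD_EOP 0x01    /* End-Of-Packet bit of the CMD byte */

/* Legacy transmit descriptor, fields in the order shown on the slides. */
typedef struct {
    uint64_t base_address;   /* physical address of the buffer */
    uint16_t packet_length;  /* number of bytes in this buffer */
    uint8_t  cso;
    uint8_t  cmd;
    uint8_t  status;
    uint8_t  css;
    uint16_t special;
} tx_desc_t;

_Static_assert(sizeof(tx_desc_t) == 16, "legacy TX descriptor is 16 bytes");

/* Hypothetical helper: describe one logical packet as two buffers.
 * pair[0] covers the header (EOP=0), pair[1] covers the data (EOP=1). */
static void fill_split_pair(tx_desc_t pair[2],
                            uint64_t hdr_phys, uint16_t hdr_len,
                            uint64_t data_phys, uint16_t data_len)
{
    memset(pair, 0, 2 * sizeof(tx_desc_t));

    pair[0].base_address  = hdr_phys;
    pair[0].packet_length = hdr_len;
    pair[0].cmd           = 0;          /* more of this packet follows */

    pair[1].base_address  = data_phys;
    pair[1].packet_length = data_len;
    pair[1].cmd           = CMD_EOP;    /* last buffer of this packet */
}
```

Only the second descriptor carries EOP=1, so the controller treats both buffers as parts of the same ethernet-packet.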
The ‘virt_to_phys()’ macro • Linux provides a convenient macro which kernel-module code can employ to obtain the physical-address for a memory-region from its virtual-address – but it only works for addresses that aren’t in ‘high’ memory • For ‘normal’ memory-regions, conversion between ‘virtual’ and ‘physical’ addresses amounts to a simple addition/subtraction
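The ‘simple addition/subtraction’ can be illustrated with a user-space model. This is a sketch only: it assumes the usual 32-bit x86 split in which kernel virtual addresses start at 0xC0000000, and the `SIM_`/`sim_` names are hypothetical stand-ins, not the kernel's own macros.

```c
#include <stdint.h>

/* Assumed values for a 32-bit x86 kernel with a 3GB/1GB split. */
#define SIM_PAGE_OFFSET  0xC0000000u      /* start of kernel virtual space */
#define SIM_LOWMEM_SIZE  (896u << 20)     /* 896-MB of 'normal' memory */

/* For 'normal' memory the conversion is a fixed offset in each direction. */
static uint32_t sim_virt_to_phys(uint32_t vaddr)
{
    return vaddr - SIM_PAGE_OFFSET;
}

static uint32_t sim_phys_to_virt(uint32_t paddr)
{
    return paddr + SIM_PAGE_OFFSET;
}

/* The fixed-offset rule only applies below the 'high' memory boundary. */
static int sim_is_lowmem(uint32_t paddr)
{
    return paddr < SIM_LOWMEM_SIZE;
}
```

Under these assumptions the kernel virtual address 0xC0100000 corresponds to physical address 0x00100000, and any physical address at or above 0x38000000 (896-MB) lies outside the range the fixed-offset rule covers.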
Linux memory-mapping [Diagram: the CPU’s virtual address-space (user space and kernel space) drawn beside physical RAM, with the first 896-MB and the HMA marked; the legend distinguishes a persistent mapping from transient mappings.] There is more physical RAM in our classroom’s systems than can be ‘mapped’ into the available address-range for kernel virtual addresses
Two-Level Translation Scheme [Diagram: CR3 points to the PAGE DIRECTORY, whose entries point to PAGE TABLES, whose entries point to PAGE FRAMES.]
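Linear to Physical [Diagram: the linear address is split into dir-index, table-index and offset; CR3 locates the page directory, dir-index selects the page table, table-index selects the page frame, and offset selects the byte within that frame in the physical address-space.]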
Address-translation • The CPU examines any virtual address it encounters, subdividing it into three fields
bits 31-22 (10-bits): index into page-directory. This field selects one of the 1024 array-entries in the Page-Directory
bits 21-12 (10-bits): index into page-table. This field selects one of the 1024 array-entries in that Page-Table
bits 11-0 (12-bits): offset into page-frame. This field provides the offset to one of the 4096 bytes in that Page-Frame
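The three-field split is plain shifting and masking, as this small sketch shows. The struct and function names are hypothetical; only the bit ranges come from the slide.

```c
#include <stdint.h>

/* The three fields of a 32-bit linear address (names are illustrative). */
typedef struct {
    uint32_t dindex;   /* bits 31-22: index into the page-directory */
    uint32_t pindex;   /* bits 21-12: index into the page-table */
    uint32_t offset;   /* bits 11-0:  offset into the page-frame */
} linear_fields_t;

static linear_fields_t split_linear(uint32_t vaddr)
{
    linear_fields_t f;
    f.dindex = (vaddr >> 22) & 0x3FF;
    f.pindex = (vaddr >> 12) & 0x3FF;
    f.offset = vaddr & 0xFFF;
    return f;
}
```

For example, the address 0x08049F10 splits into dindex 0x020, pindex 0x049 and offset 0xF10.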
Format of a Page-Table entry
bits 31-12: PAGE-FRAME BASE ADDRESS
bits 11-9: AVAIL
bits 8-7: 0 0
bit 6: D
bit 5: A
bit 4: PCD
bit 3: PWT
bit 2: U
bit 1: W
bit 0: P
LEGEND
P = Present (1=yes, 0=no)
W = Writable (1 = yes, 0 = no)
U = User (1 = yes, 0 = no)
A = Accessed (1 = yes, 0 = no)
D = Dirty (1 = yes, 0 = no)
PWT = Page Write-Through (1=yes, 0 = no)
PCD = Page Cache-Disable (1 = yes, 0 = no)
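A page-table entry can be taken apart with masks that mirror the format above. This is a sketch with hypothetical names (`PTE_*`, `pte_pfn()` and so on); only the bit positions come from the slide.

```c
#include <stdint.h>

/* Flag bits of a 32-bit page-table entry, per the format above. */
#define PTE_P    0x001u   /* Present */
#define PTE_W    0x002u   /* Writable */
#define PTE_U    0x004u   /* User */
#define PTE_PWT  0x008u   /* Page Write-Through */
#define PTE_PCD  0x010u   /* Page Cache-Disable */
#define PTE_A    0x020u   /* Accessed */
#define PTE_D    0x040u   /* Dirty */

/* Page-frame base address: bits 31-12, with the flag bits cleared. */
static uint32_t pte_frame_base(uint32_t pte)
{
    return pte & 0xFFFFF000u;
}

/* Page-Frame Number: the base address shifted down by 12 bits. */
static uint32_t pte_pfn(uint32_t pte)
{
    return pte >> 12;
}

/* True when the page is present, writable and user-accessible. */
static int pte_is_user_writable(uint32_t pte)
{
    uint32_t need = PTE_P | PTE_W | PTE_U;
    return (pte & need) == need;
}
```

An entry of 0x12345067, for instance, has frame base 0x12345000 and PFN 0x12345, with the P, W, U, A and D bits set.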
Finding the user-buffer’s PFN • To program the ‘base-address’ field in the second TX-Descriptor, our driver’s ‘write()’ function will need to know which physical Page-Frame the application’s buffer lies in • And its PFN (Page-Frame Number) can be found from its virtual address by ‘walking-the-cpu-page-tables’ – even when Linux puts some page-tables in ‘high’ memory
Performing ‘virt_to_phys()’

    ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
    {
        unsigned int _cr3, *pgdir, *pgtbl, pfn_pgtbl, pfn_frame;
        unsigned int dindex, pindex, offset;

        // take apart the virtual-address of the user's 'buf' variable
        dindex = ((unsigned long)buf >> 22) & 0x3FF;   // pgdir-index (10-bits)
        pindex = ((unsigned long)buf >> 12) & 0x3FF;   // pgtbl-index (10-bits)
        offset = ((unsigned long)buf >>  0) & 0xFFF;   // frame-offset (12-bits)

        // then walk the CPU's paging-tables to get buf's physical-address
        asm( " mov %%cr3, %%eax \n mov %%eax, %0 " : "=m"(_cr3) : : "ax" );
        pgdir = (unsigned int *)phys_to_virt( _cr3 & ~0xFFF );
        pfn_pgtbl = (pgdir[ dindex ] >> 12);
        pgtbl = (unsigned int *)kmap( &mem_map[ pfn_pgtbl ] );   // page-table may be in 'high' memory
        pfn_frame = (pgtbl[ pindex ] >> 12);
        kunmap( &mem_map[ pfn_pgtbl ] );
        txring[ txtail + 1 ].base_address = (pfn_frame << 12) + offset;
        ...
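The same walk can be tried out in user space against a simulated memory. This is a sketch under stated assumptions: `sim_ram` is a tiny made-up ‘physical memory’ of eight frames, and `sim_walk()` is a hypothetical function, not the driver's code. Unlike the excerpt above, the sketch also checks the Present bit at each level, which the excerpt does not show.

```c
#include <stdint.h>

#define SIM_FRAMES 8    /* size of the simulated physical memory, in frames */

/* Each row is one 4096-byte frame, viewed as 1024 32-bit entries. */
static uint32_t sim_ram[SIM_FRAMES][1024];

/* Walk a two-level page table held in sim_ram.
 * Returns 0 and stores the physical address on success, or -1 when an
 * entry is not present or points outside the simulated memory. */
static int sim_walk(uint32_t cr3, uint32_t vaddr, uint32_t *paddr)
{
    uint32_t dindex = (vaddr >> 22) & 0x3FF;
    uint32_t pindex = (vaddr >> 12) & 0x3FF;
    uint32_t offset = vaddr & 0xFFF;
    uint32_t pde, pte;

    if ((cr3 >> 12) >= SIM_FRAMES)
        return -1;
    pde = sim_ram[cr3 >> 12][dindex];        /* page-directory entry */
    if (!(pde & 1) || (pde >> 12) >= SIM_FRAMES)
        return -1;

    pte = sim_ram[pde >> 12][pindex];        /* page-table entry */
    if (!(pte & 1))
        return -1;

    *paddr = (pte & 0xFFFFF000u) + offset;   /* frame base plus offset */
    return 0;
}
```

To use it, place a page-directory in one frame and a page-table in another, then call `sim_walk()` with the directory's address as `cr3`. With the directory in frame 1, the table in frame 2 and the target page in frame 5, the address 0x08049F10 translates to 0x5F10.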
Can’t cross a ‘page-boundary’ • In order for the NIC to fetch the user’s data using its Bus-Master DMA capability, the buffer needs to reside in a physically contiguous memory-region • But we can’t be sure Linux will have set up the CPU’s page-tables that way – unless the ‘buf’ is confined to a single page-frame
Truncate ‘len’ if necessary

    ssize_t my_write( struct file *file, const char *buf, size_t len, loff_t *pos )
    {
        if ( offset + len > PAGE_SIZE ) len = PAGE_SIZE - offset;
        ...

[Diagram: three consecutive PAGE_SIZE regions, with ‘buf’ starting at ‘offset’ and spanning ‘len’ bytes.]
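The truncation rule can be checked in isolation with a small user-space function. This is a sketch: `SIM_PAGE_SIZE` assumes 4096-byte pages and `clamp_to_page()` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

#define SIM_PAGE_SIZE 4096u   /* assumed page size */

/* Return how many of 'len' bytes starting at address 'buf' lie within
 * the page-frame that contains 'buf'. */
static size_t clamp_to_page(uintptr_t buf, size_t len)
{
    size_t offset = buf & (SIM_PAGE_SIZE - 1);   /* frame-offset of buf */

    if (offset + len > SIM_PAGE_SIZE)
        len = SIM_PAGE_SIZE - offset;
    return len;
}
```

A 512-byte request that starts at offset 0xF00 within its page is cut to 256 bytes, for example. Since ‘write()’ then reports fewer bytes than were requested, an application that needs the whole message sent would have to call it again for the remainder.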
‘zerocopy.c’ • We created this modification of our ‘nic2.c’ device-driver so its ‘my_write()’ function lets an application perform transmissions without performing a memory-to-memory copy-operation (i.e., ‘copy_from_user()’) • It is not so easy to implement ‘zero-copy’ for receiving packets – can you say why?
Website article • We’ve posted a link on our CS686 website to a frequently cited research-article about the various issues that arise when trying to implement the ‘zero-copy’ concept for the case of ‘incoming’ network-packets: The Need for Asynchronous, Zero-Copy Network I/O, by Ulrich Drepper, Red Hat, Inc.