
Implementation of TCP/IP in Linux (kernel 2.2)

Rishi Sinha. Goals: to help you implement your customized stack by identifying key points of the code structure, and to point out some tricks and optimizations that evolved after 4.3BSD and that are part of the Linux TCP/IP code.



  1. Implementation of TCP/IP in Linux (kernel 2.2) Rishi Sinha

  2. Goals • To help you implement your customized stack by identifying key points of the code structure • To point out some tricks and optimizations that evolved after 4.3BSD and that are part of Linux TCP/IP code

  3. TCP/IP source code • /usr/src/linux/net/ • All relative pathnames in this document are relative to /usr/src/linux/ • http://lxr.linux.no cross-references all the Linux kernel code • You can install and run it locally; I haven’t tried

  4. The various layers (yawn…) BSD socket INET socket Appletalk IPX TCP/UDP IP (Link) (Physical)

  5. Address families supported • include/linux/socket.h • UNIX Unix domain sockets • INET TCP/IP • AX25 Amateur radio • IPX Novell IPX • APPLETALK Appletalk • X25 X.25 • More; about 24 in all

  6. Setting things up – socket-side How the INET address family registers itself with BSD socket layer

  7. struct socket • BSD socket • short type – SOCK_DGRAM, SOCK_STREAM • struct proto_ops *ops – TCP/UDP operations for this socket; bind, close, read, write etc. • struct inode *inode – the file inode associated with this socket • struct sock *sk – the INET socket associated with this socket

  8. No connections BSD socket INET socket? Operations to use? (How to create socket?)

  9. struct sock • INET socket • struct socket *socket – associated BSD socket • struct sock *next, **pprev – socks are in linked lists • struct dst_entry *dst_cache – pointer to the route cache entry used by this socket • struct sk_buff_head *receive_queue – head of the receive queue • struct sk_buff_head *write_queue – head of the send queue

  10. struct sock continued • __u32 daddr – foreign IP address • __u32 rcv_saddr – bound local IP address • __u16 dport – destination port • unsigned short num – local port • struct proto *prot – contains TCP/UDP specific operations (repetition with struct socket’s ops field)

  11. No connections BSD socket? INET socket Reaching transport layer?

  12. protocols vector • Array of struct net_proto, which has • name, say INET, UNIX, IPX, etc • initialization function, say inet_proto_init • This protocols array is static in net/protocols.c • This file uses conditional compilation to include protocols as chosen in make config

  13. inet_proto_init • protocols vector is traversed at system init time, and each init function called • Each of these protocol init functions registers itself with BSD sockets by giving its name and socket create function • Where does the BSD socket layer store this information?

  14. net_families • BSD socket layer stores info for each registering protocol in this array • This is an array of struct net_proto_family, which is • int family • int (*create)(struct socket *sock, int protocol)

  15. BSD socket layer now has INET inet_create() IPX ipx_create() UNIX unix_create()

  16. So in socket() call • BSD socket layer looks for specified address family, say INET • BSD socket layer calls create function for that family, say inet_create() • inet_create() does switch (BSD_socket->type) • case SOCK_DGRAM: fill BSD_socket->ops with UDP operations • case SOCK_STREAM: fill BSD_socket->ops with TCP operations
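The registration and dispatch described in slides 13 to 16 can be sketched in user-space C. This is a minimal illustration, not kernel code: every name ending in `_sketch`, the `SK_` constants, and the table size are assumptions made for the example, and the structures are cut down to the fields the slides mention.

```c
#include <stddef.h>

#define AF_MAX_SKETCH 32                       /* assumed table size */
enum { SK_AF_UNIX = 1, SK_AF_INET = 2 };       /* illustrative family ids */
enum { SK_SOCK_STREAM = 1, SK_SOCK_DGRAM = 2 };

/* Stand-in for struct proto_ops: just a label instead of function pointers. */
struct proto_ops_sketch { const char *name; };

/* Stand-in for the BSD socket. */
struct socket_sketch {
    short type;
    const struct proto_ops_sketch *ops;
};

/* Stand-in for struct net_proto_family: family id plus create function. */
struct net_proto_family_sketch {
    int family;
    int (*create)(struct socket_sketch *sock, int protocol);
};

/* Where the socket layer remembers each registered family. */
static const struct net_proto_family_sketch *net_families_sketch[AF_MAX_SKETCH];

int sock_register_sketch(const struct net_proto_family_sketch *fam)
{
    if (fam->family < 0 || fam->family >= AF_MAX_SKETCH)
        return -1;
    net_families_sketch[fam->family] = fam;
    return 0;
}

static const struct proto_ops_sketch tcp_ops_sketch = { "tcp" };
static const struct proto_ops_sketch udp_ops_sketch = { "udp" };

/* The family's create function picks operations from the socket type. */
static int inet_create_sketch(struct socket_sketch *sock, int protocol)
{
    (void)protocol;
    switch (sock->type) {
    case SK_SOCK_STREAM: sock->ops = &tcp_ops_sketch; return 0;
    case SK_SOCK_DGRAM:  sock->ops = &udp_ops_sketch; return 0;
    default:             return -1;
    }
}

static const struct net_proto_family_sketch inet_family_sketch = {
    SK_AF_INET, inet_create_sketch
};

/* Called once at init time, like the init function in the protocols vector. */
void inet_proto_init_sketch(void)
{
    sock_register_sketch(&inet_family_sketch);
}

/* What a socket() call does: find the family, then call its create function. */
int socket_create_sketch(int family, int type, int protocol,
                         struct socket_sketch *sock)
{
    if (family < 0 || family >= AF_MAX_SKETCH ||
        net_families_sketch[family] == NULL)
        return -1;                             /* family never registered */
    sock->type = (short)type;
    sock->ops = NULL;
    return net_families_sketch[family]->create(sock, protocol);
}
```

Before `inet_proto_init_sketch()` runs, a request for the INET family fails because nothing has registered; afterwards a stream socket ends up with the TCP operations and a datagram socket with the UDP ones.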

  17. Socket layer is satisfied TCP’s proto_ops BSD socket: AF_INET, SOCK_STREAM INET socket Write queue Receive queue Lots of other TCP data

  18. Reaching sockets through file descriptors • Per process file table > inode > BSD socket etc. • Not describing here

  19. Setting things up – device side How network interfaces come up and attach themselves to the stack

  20. No connections What is my name (since I don’t have a /dev file)? Give packets to whom? Network interface card

  21. struct device • No device file for network devices • Why? Design choice, probably because network devices “push” data • Each interface is represented by a struct device • All struct devices are chained and the chain head is called dev_base

  22. struct device continued • char *name – say eth0 • unsigned long base_addr – I/O address • unsigned int irq – IRQ number • struct device *next • int (*init)(struct device *dev) • int (*hard_start_xmit)(struct sk_buff *skb, struct device *dev) – transmission function

  23. dev_base • drivers/net/Space.c cleverly threads struct devices for all possible interfaces into a list starting at dev_base (static data structure declaration, no code execution yet) • List includes limited number of devices of each type, i.e. eth0 to eth7 and no more possible

  24. ethif_probe() • For each of these 8 struct devices, names are eth0 to eth7 and init function is ethif_probe() • During system init time the list of struct devices is traversed, and the init function called for each • So ethif_probe() called for eth0; calls probe_list()

  25. probe_list() • probe_list() goes through a list of all ethernet devices the system has drivers for • The probe function for each driver is called, and • if success, assign proper function pointers from the driver code to this struct device (ethx) • if failure, no more eth devices exist, remove this struct device from the list and return
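The init-time walk over dev_base from slides 24 and 25 can be sketched as a linked-list traversal that unlinks every device whose init function fails. This is a user-space illustration under assumptions: the `_sketch` names, the simplified transmit signature, and the two fake probe functions are invented for the example and are not driver code.

```c
#include <stddef.h>

/* Cut-down stand-in for struct device. */
struct device_sketch {
    const char *name;
    int (*init)(struct device_sketch *dev);    /* returns 0 if hardware found */
    int (*hard_start_xmit)(const void *frame, size_t len,
                           struct device_sketch *dev);
    struct device_sketch *next;
};

/* Hypothetical transmit routine a driver would install on success. */
int fake_xmit_sketch(const void *frame, size_t len, struct device_sketch *dev)
{
    (void)frame; (void)len; (void)dev;
    return 0;
}

/* Hypothetical probe that finds a card and fills in the function pointer. */
int probe_found_sketch(struct device_sketch *dev)
{
    dev->hard_start_xmit = fake_xmit_sketch;
    return 0;
}

/* Hypothetical probe that finds nothing. */
int probe_missing_sketch(struct device_sketch *dev)
{
    (void)dev;
    return -1;
}

/*
 * Walk the list once, calling each init function. A device whose init
 * fails is unlinked; the pointer-to-pointer walk avoids special-casing
 * the list head. Returns the number of devices kept.
 */
int probe_all_sketch(struct device_sketch **head)
{
    struct device_sketch **pp = head;
    int kept = 0;

    while (*pp != NULL) {
        struct device_sketch *dev = *pp;
        if (dev->init != NULL && dev->init(dev) != 0) {
            *pp = dev->next;                   /* probe failed: drop it */
        } else {
            kept++;
            pp = &dev->next;                   /* keep it and move on */
        }
    }
    return kept;
}
```

With a list eth0, eth1, lo in which only eth1 fails to probe, the walk leaves eth0 linked directly to lo and eth0 carries the transmit function its probe installed.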

  26. After all devices in Space.c traversed through Give packets to whom? functions from 3com driver functions from HP driver lo0 dev_base eth0, 3Com card eth1, HP card

  27. Modularized driver • Much simpler, because the driver’s probe is executed at module load time • If it finds a device, it appends a struct device to the end of the dev_base list

  28. backlog queue • Very very distinct from socket listen backlog queue! • Systemwide queue that interfaces immediately drop packets onto • Device driver writers simply call netif_rx(), which does the actual queueing

  29. Link layer is satisfied backlog queue functions from 3com driver functions from HP driver lo0 dev_base eth0, 3Com card eth1, HP card

  30. Setting things up – between link and network layers How packets reach the correct protocol stack

  31. No connections IP? ARP? IPX? BOOTP? Who takes packets off the backlog queue? Who gets these packets? backlog queue

  32. net_bh() • Bottom-half handler for the network interrupt • Executes when network interrupt is not masked • So the fast handler (actual ISR) is driver code that calls netif_rx() to queue the packet onto backlog queue, and marks net_bh() for execution • net_bh() takes packets off backlog and passes to the protocol specified in ethernet header
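The split between the fast handler and the bottom half can be sketched as a small FIFO plus a "pending" flag. This is a single-threaded user-space illustration under assumptions: the `_sketch` names, the fixed queue length, and the counting handler are invented here, and real interrupt masking and locking are left out.

```c
#include <stddef.h>

#define BACKLOG_MAX_SKETCH 8                   /* assumed queue limit */

/* Stand-in for a received packet. */
struct pkt_sketch {
    unsigned short protocol;                   /* type from the link header */
    int len;
};

static struct pkt_sketch *backlog_sketch[BACKLOG_MAX_SKETCH];
static int backlog_head_sketch;
static int backlog_count_sketch;
static int net_bh_pending_sketch;              /* set by the fast half */

/* Fast half: queue the packet and mark the bottom half for execution. */
int netif_rx_sketch(struct pkt_sketch *p)
{
    int tail;

    if (backlog_count_sketch == BACKLOG_MAX_SKETCH)
        return -1;                             /* backlog full: drop */
    tail = (backlog_head_sketch + backlog_count_sketch) % BACKLOG_MAX_SKETCH;
    backlog_sketch[tail] = p;
    backlog_count_sketch++;
    net_bh_pending_sketch = 1;
    return 0;
}

/* Bottom half: drain the backlog, handing each packet to `deliver`. */
int net_bh_sketch(void (*deliver)(struct pkt_sketch *))
{
    int handled = 0;

    if (!net_bh_pending_sketch)
        return 0;
    net_bh_pending_sketch = 0;
    while (backlog_count_sketch > 0) {
        struct pkt_sketch *p = backlog_sketch[backlog_head_sketch];
        backlog_head_sketch = (backlog_head_sketch + 1) % BACKLOG_MAX_SKETCH;
        backlog_count_sketch--;
        deliver(p);
        handled++;
    }
    return handled;
}

/* Example handler: just totals the bytes it was given. */
int delivered_len_sketch;

void count_delivery_sketch(struct pkt_sketch *p)
{
    delivered_len_sketch += p->len;
}
```

The point of the split is that `netif_rx_sketch()` does a constant amount of work, while the per-packet protocol processing happens later in `net_bh_sketch()`.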

  33. ptype_base • ptype_base is the head of a list of possible packet types the link layer may receive (IP, ARP, IPX, BOOTP, etc.) that the system can handle • How is it built? • For every protocol in the protocols vector, when its init function is called (inet_proto_init), it calls functions like ip_init(), tcp_init() and arp_init()

  34. dev_add_pack completes the picture • Those subprotocols interested in registering a packet type (IP, ARP), get their init functions (ip_init(), arp_init()) to call dev_add_pack(), specifying a handler function • This adds the packet type to ptype_base • So net_bh( ) hands off packets to the right protocol stack
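Registration of packet types and the resulting hand-off can be sketched as a linked list keyed by the link-layer type field. This is a user-space illustration under assumptions: the `_sketch` names and the handler return values are invented for the example, and the kernel's actual structure may be organized differently (for example, hashed by type). The EtherType values 0x0800 (IP) and 0x0806 (ARP) are the standard ones.

```c
#include <stddef.h>

/* Stand-in for a received buffer: only the type field matters here. */
struct pbuf_sketch { unsigned short ethertype; };

/* Stand-in for struct packet_type: type plus handler function. */
struct packet_type_sketch {
    unsigned short type;
    int (*func)(struct pbuf_sketch *p);
    struct packet_type_sketch *next;
};

static struct packet_type_sketch *ptype_base_sketch;

/* Add a packet type to the head of the list. */
void dev_add_pack_sketch(struct packet_type_sketch *pt)
{
    pt->next = ptype_base_sketch;
    ptype_base_sketch = pt;
}

/* What the bottom half does per packet: find the handler for its type. */
int dispatch_sketch(struct pbuf_sketch *p)
{
    struct packet_type_sketch *pt;

    for (pt = ptype_base_sketch; pt != NULL; pt = pt->next)
        if (pt->type == p->ethertype)
            return pt->func(p);
    return -1;                                 /* nobody registered: drop */
}

/* Illustrative handlers; the return values only identify which one ran. */
static int ip_rcv_sketch(struct pbuf_sketch *p)  { (void)p; return 1; }
static int arp_rcv_sketch(struct pbuf_sketch *p) { (void)p; return 2; }

static struct packet_type_sketch ip_pt_sketch  = { 0x0800, ip_rcv_sketch,  NULL };
static struct packet_type_sketch arp_pt_sketch = { 0x0806, arp_rcv_sketch, NULL };

/* Each subprotocol's init function registers its own packet type. */
void ip_init_sketch(void)  { dev_add_pack_sketch(&ip_pt_sketch); }
void arp_init_sketch(void) { dev_add_pack_sketch(&arp_pt_sketch); }
```

A packet whose type nobody registered is simply not delivered, which is why a protocol left out at configuration time never sees traffic.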

  35. Setting things up – between network and transport layers How packets reach the correct transport protocol

  36. inet_protos • An array of transport layer protocols in INET • Built at the time of inet_proto_init() • By calling inet_add_protocol() for every transport protocol • Registers handlers for transport protocols
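The transport-level demultiplexing can be sketched the same way, keyed this time by the protocol number in the IP header (6 for TCP, 17 for UDP). This is a user-space illustration under assumptions: the `_sketch` names, the table size, the masking hash, and the handler return values are all choices made for the example.

```c
#include <stddef.h>

#define MAX_PROTOS_SKETCH 32                   /* assumed size, power of two */

/* Stand-in for a transport protocol entry. */
struct inet_protocol_sketch {
    unsigned char protocol;                    /* IP protocol number */
    int (*handler)(const unsigned char *seg, int len);
    struct inet_protocol_sketch *next;         /* chain for hash collisions */
};

static struct inet_protocol_sketch *inet_protos_sketch[MAX_PROTOS_SKETCH];

/* Register a transport protocol under its protocol number. */
void inet_add_protocol_sketch(struct inet_protocol_sketch *prot)
{
    unsigned int h = prot->protocol & (MAX_PROTOS_SKETCH - 1);

    prot->next = inet_protos_sketch[h];
    inet_protos_sketch[h] = prot;
}

/* Local delivery: look up the handler for the header's protocol field. */
int ip_local_deliver_sketch(unsigned char protocol,
                            const unsigned char *seg, int len)
{
    struct inet_protocol_sketch *p;

    for (p = inet_protos_sketch[protocol & (MAX_PROTOS_SKETCH - 1)];
         p != NULL; p = p->next)
        if (p->protocol == protocol)
            return p->handler(seg, len);
    return -1;                                 /* no such transport protocol */
}

/* Illustrative handlers; return values only identify which one ran. */
static int tcp_rcv_sketch(const unsigned char *seg, int len)
{
    (void)seg; (void)len;
    return 6;
}

static int udp_rcv_sketch(const unsigned char *seg, int len)
{
    (void)seg; (void)len;
    return 17;
}

static struct inet_protocol_sketch tcp_protocol_sketch = { 6,  tcp_rcv_sketch, NULL };
static struct inet_protocol_sketch udp_protocol_sketch = { 17, udp_rcv_sketch, NULL };

/* Done once while the INET family initializes. */
void inet_protos_init_sketch(void)
{
    inet_add_protocol_sketch(&tcp_protocol_sketch);
    inet_add_protocol_sketch(&udp_protocol_sketch);
}
```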

  37. Packet movement through stack Transmission and reception, queues, interrupts

  38. struct sk_buff • Each packet that arrives on the wire is encased in a buffer called sk_buff • An sk_buff is just the data with a lot of additional information about the packet • There is a one-to-one relationship between packets and sk_buffs, i.e. one packet, one buffer • sk_buffs can be allocated in multiples of 16 bytes
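Allocating in multiples of 16 bytes amounts to rounding the requested size up to the next 16-byte boundary. A minimal sketch of that arithmetic, with a hypothetical helper name and constant:

```c
#include <stddef.h>

#define SKB_ALIGN_SKETCH 16u                   /* allocation granularity */

/* Round `size` up to the next multiple of 16 (a power of two). */
size_t skb_round_up_sketch(size_t size)
{
    return (size + (SKB_ALIGN_SKETCH - 1)) & ~(size_t)(SKB_ALIGN_SKETCH - 1);
}
```

For example, a 1500-byte request becomes 1504 bytes, while a request that is already a multiple of 16 is left unchanged.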

  39. struct sk_buff continued • INET sock queues are queues of sk_buffs • Data coming from the socket calls are copied into sk_buffs • Data arriving from the network is copied into sk_buffs • sk_buff picture with fields

  40. struct sk_buff continued

  41. Queues • backlog queue • INET sock queues • TCP has a number of queues for out-of-order, connection backlog, error packets (?)

  42. Packet reception • Packet received by hardware • Receive interrupt generated • Driver handler copies data from hardware into fresh sk_buff • Calls netif_rx() to queue on backlog • Schedules net_bh() with mark_bh(NET_BH) • net_bh() executes the next time the scheduler is run or a system call returns or a slow interrupt handler returns

  43. Packet reception continued • net_bh() tries to send any pending packets, then dequeues packets from the backlog and passes them to correct handler, say ip_rcv() • ip_rcv() may call ip_local_deliver() or ip_forward() • ip_local_deliver() results in call to tcp_v4_rcv() through the inet_protos list • tcp_v4_rcv() queues data at the correct socket’s queue

  44. Packet reception continued • When the socket’s owner reads, tcp_recvmsg() is invoked through BSD socket’s proto_ops • If instead the socket’s owner had blocked on a read, that process will be woken using wake_up (wait queue)

  45. Packet transmission • Quite different for TCP and UDP in terms of copying of user data to kernel space • TCP does its own checksumming, while IP does checksumming for UDP. Why? Next section. • net_bh() again takes care of flushing out packets that have piled up at the device’s queue

  46. Tricks and optimizations TCP/IP enhancements, most due to Van Jacobson, arrived after 4.3BSD

  47. Checksum and copy

  48. Checksum and copy continued • Linux goes over every byte of data only once (if the packet does not get fragmented) • Uses checksum_and_copy() • TCP data from socket gets filled into MSS-sized segments by TCP, so checksum-copying happens here
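The idea of touching each byte only once can be sketched as a single loop that copies the data and accumulates the Internet checksum (the 16-bit one's-complement sum) at the same time. This is a portable user-space illustration with a hypothetical function name; the kernel uses architecture-specific routines, and copying from user space involves more than a plain byte copy.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy `len` bytes from src to dst and return the Internet checksum of
 * the data, computed in the same pass. Bytes are summed as big-endian
 * 16-bit words; an odd trailing byte is padded with a zero byte.
 * The 32-bit accumulator is sufficient for buffers below 64 KB.
 */
uint16_t csum_and_copy_sketch(const uint8_t *src, uint8_t *dst, size_t len)
{
    uint32_t sum = 0;
    size_t i = 0;

    for (; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    if (i < len) {                             /* odd trailing byte */
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    while (sum >> 16)                          /* fold carries back in */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```

A separate copy followed by a separate checksum would read the data twice; folding the two loops into one is the whole optimization.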

  49. Checksum and copy continued • Figure: User Buffer (ubuff), INET Socket (struct sock), write_queue • sk_buff structure • partially used sk_buff • newly allocated sk_buff

  50. Checksum and copy continued • UDP, on the other hand, does not stuff anything into MSS-sized buffers, so there is no need to copy data from user space at UDP layer • UDP passes data and a callback function to IP • IP copies this data into an sk_buff, using the callback function, which is a checksum_and_copy function • Large ping replies from a Linux host arrive in reverse order of fragments! Why?
