LINUX NETWORK IMPLEMENTATION Jianyong Zhang. Introduction. The layer structure of network: BSD socket layer: general data structure for different protocols. INET socket layer: end points for the IP-based protocols TCP and UDP ARP layer Link layer: Ethernet, SLIP, PLIP
Introduction • The layered structure of the network stack: • BSD socket layer: general data structures shared by the different protocols. • INET socket layer: end points for the IP-based protocols TCP and UDP. • ARP layer. • Link layer: Ethernet, SLIP, PLIP. • Hardware: NIC, serial port, parallel port.
Socket system call • C interface system call routines: socket(), bind(), listen(), connect(), accept(), send(), sendto(), recv(), recvfrom(), getsockopt(), setsockopt(). • All are based on the system call socketcall(). • socket() returns a file descriptor, so read(), write(), select() and ioctl() work on it through struct file: file → f_op → sock_read. • Socket inode: struct socket *sock_alloc(void) • {… inode->i_mode = S_IFSOCK|S_IRWXUGO; • inode->i_sock = 1; • inode->i_uid = current->fsuid; • inode->i_gid = current->fsgid; • sock->inode = inode; … • }
Generic system call • socketcall() function: • asmlinkage int sys_socketcall(int call, unsigned long *args) • {… • unsigned long a[6]; • unsigned long a0,a1; • /* copy_from_user should be SMP safe. */ • if (copy_from_user(a, args, nargs[call])) • return -EFAULT; • a0=a[0]; • a1=a[1]; • switch(call) • { • case SYS_SOCKET: • err = sys_socket(a0,a1,a[2]); • break; • case SYS_BIND: • err = sys_bind(a0,(struct sockaddr *)a1, a[2]); • break; … } … • }
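The multiplexing idea behind socketcall() can be tried in user space with a mock dispatcher. This is only a sketch: the MOCK_* call numbers, the mock_nargs table and the two handlers are invented for the example, and memcpy() stands in for copy_from_user().

```c
#include <errno.h>
#include <string.h>

/* Hypothetical call numbers, playing the role of SYS_SOCKET and SYS_BIND. */
enum { MOCK_SOCKET = 1, MOCK_BIND = 2, MOCK_MAX = MOCK_BIND };

/* Number of argument bytes to copy for each call (index 0 is unused). */
static const unsigned char mock_nargs[MOCK_MAX + 1] = {
    0, 3 * sizeof(unsigned long), 3 * sizeof(unsigned long)
};

/* Pretend handler: always hands out descriptor 3. */
static long mock_socket(unsigned long family, unsigned long type,
                        unsigned long protocol)
{
    (void)family; (void)type; (void)protocol;
    return 3;
}

/* Pretend handler: succeeds only for the descriptor handed out above. */
static long mock_bind(unsigned long fd, unsigned long addr,
                      unsigned long addrlen)
{
    (void)addr; (void)addrlen;
    return fd == 3 ? 0 : -EBADF;
}

/* One entry point, many operations: copy the argument block, then
 * switch on the call number, as sys_socketcall() does. */
long mock_socketcall(int call, const unsigned long *args)
{
    unsigned long a[6];

    if (call < 1 || call > MOCK_MAX)
        return -EINVAL;
    if (args == NULL)
        return -EFAULT;
    memcpy(a, args, mock_nargs[call]); /* stands in for copy_from_user() */

    switch (call) {
    case MOCK_SOCKET:
        return mock_socket(a[0], a[1], a[2]);
    case MOCK_BIND:
        return mock_bind(a[0], a[1], a[2]);
    }
    return -EINVAL;
}
```

The per-call size table is what lets one system call entry serve operations with different argument counts.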
Important structures • 1. struct socket { • socket_state state; /* SS_FREE, SS_UNCONNECTED, SS_CONNECTING, SS_CONNECTED, SS_DISCONNECTING */ • unsigned long flags; • struct proto_ops *ops; • struct inode *inode; • struct fasync_struct *fasync_list; /* Asynchronous wake up list */ • struct file *file; /* File back pointer */ • struct sock *sk; • struct wait_queue *wait; • short type; /* SOCK_STREAM, SOCK_DGRAM, SOCK_RAW */ • unsigned char passcred; • unsigned char tli; • };
Important structures • 2. struct proto_ops { • int family; • int (*dup) (struct socket *newsock, struct socket *oldsock); • int (*release) (struct socket *sock, struct socket *peer); • int (*bind) (); • int (*connect) (); • int (*socketpair) (struct socket *sock1, struct socket *sock2); • int (*accept) (); • int (*getname) (); • unsigned int (*poll) (); • int (*ioctl) (); • int (*listen) (struct socket *sock, int len); • int (*shutdown) (struct socket *sock, int flags); • int (*setsockopt) (struct socket *sock, int level, int optname, …); • int (*getsockopt) (); • int (*fcntl) (); • int (*sendmsg) (); • int (*recvmsg) (); • };
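The role of struct proto_ops, a table of function pointers that lets the generic socket layer call protocol-specific code, can be shown with a reduced mock. Everything prefixed mock_ or demo_ is hypothetical and carries only two of the operations listed above.

```c
#include <stddef.h>

struct mock_socket;

/* Reduced stand-in for struct proto_ops: one table per protocol family. */
struct mock_proto_ops {
    int family;
    int (*listen)(struct mock_socket *sock, int len);
    int (*shutdown)(struct mock_socket *sock, int flags);
};

/* Reduced stand-in for struct socket: it only remembers its ops table
 * and what the demo handlers recorded. */
struct mock_socket {
    const struct mock_proto_ops *ops;
    int backlog;
    int shut_flags;
};

static int demo_listen(struct mock_socket *sock, int len)
{
    sock->backlog = len;
    return 0;
}

static int demo_shutdown(struct mock_socket *sock, int flags)
{
    sock->shut_flags = flags;
    return 0;
}

/* The table a hypothetical protocol family would register. */
static const struct mock_proto_ops demo_ops = {
    .family = 2,
    .listen = demo_listen,
    .shutdown = demo_shutdown,
};

/* Generic layer: knows nothing about the protocol, only the table. */
int generic_listen(struct mock_socket *sock, int len)
{
    if (sock == NULL || sock->ops == NULL || sock->ops->listen == NULL)
        return -1;
    return sock->ops->listen(sock, len);
}

int generic_shutdown(struct mock_socket *sock, int flags)
{
    if (sock == NULL || sock->ops == NULL || sock->ops->shutdown == NULL)
        return -1;
    return sock->ops->shutdown(sock, flags);
}
```

Supporting another protocol family then means providing another table, with no change to the generic functions.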
Important structures • 3. struct sk_buff {…}: • manages individual communication packets, • kept in a doubly linked list. • 4. struct sock {…}: • the INET socket. • 5. struct device {…}: • controls an abstract network device: the network interface.
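To make the "doubly linked list" of packet buffers concrete, here is a minimal user-space queue with tail insertion and head removal. The mock_pkt names are invented for this sketch, and a sentinel head node is one common way to build such a list, not necessarily the exact kernel layout.

```c
#include <stddef.h>

/* Hypothetical packet node with the two links a doubly linked list needs. */
struct mock_pkt {
    struct mock_pkt *next;
    struct mock_pkt *prev;
    int id;
};

/* Queue with a sentinel head: an empty queue points at itself. */
struct mock_pkt_queue {
    struct mock_pkt head;
    unsigned int qlen;
};

void mock_queue_init(struct mock_pkt_queue *q)
{
    q->head.next = &q->head;
    q->head.prev = &q->head;
    q->qlen = 0;
}

/* Append p at the tail of the queue. */
void mock_queue_tail(struct mock_pkt_queue *q, struct mock_pkt *p)
{
    p->next = &q->head;
    p->prev = q->head.prev;
    q->head.prev->next = p;
    q->head.prev = p;
    q->qlen++;
}

/* Remove and return the packet at the head, or NULL if the queue is empty. */
struct mock_pkt *mock_dequeue(struct mock_pkt_queue *q)
{
    struct mock_pkt *p = q->head.next;

    if (p == &q->head)
        return NULL;
    q->head.next = p->next;
    p->next->prev = &q->head;
    p->next = NULL;
    p->prev = NULL;
    q->qlen--;
    return p;
}
```

Both operations touch only a fixed number of pointers, which is why such a list suits packet queues that are filled at one end and drained at the other.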
Getting the data from A to B • 1. A and B each call socket(), then connect to each other by calling connect() and accept(). • 2. A: write(socket, data, len): verify_area(). • {… file = fget(socket); inode = file->f_dentry->d_inode; • if (!file->f_op || !(write= file->f_op->write)) goto out; • down(&inode->i_sem); • ret = write(file, data, len, &file->f_pos); • up(&inode->i_sem);… } • 3. sock_write() { …struct socket *sock; • sock = socki_lookup(file->f_dentry->d_inode); … • msg.msg_iov=&iov; iov.iov_base=(void *)ubuf; … • return sock_sendmsg(sock, &msg, size); } • 4. For an INET socket, it will call inet_sendmsg().
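Step 3 wraps the user buffer in a message header with a single iovec before handing it to the send path. The same shape is visible from user space through sendmsg(). A minimal sketch, where send_as_msghdr() is a hypothetical helper name:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: sends len bytes from buf by describing them with
 * a one-element iovec inside a msghdr, the structure sock_write()
 * fills in before calling sock_sendmsg().
 * Returns the number of bytes sent, or -1 on error. */
long send_as_msghdr(int fd, const void *buf, size_t len)
{
    struct iovec iov;
    struct msghdr msg;

    iov.iov_base = (void *)buf; /* mirrors iov.iov_base = (void *)ubuf */
    iov.iov_len = len;

    memset(&msg, 0, sizeof msg); /* no destination address, no control data */
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return (long)sendmsg(fd, &msg, 0);
}
```

On a connected socket this should behave like write(fd, buf, len), which is the equivalence the slide relies on.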
Getting the data from A to B • 5. inet_sendmsg() { • struct sock *sk = sock->sk; … • return sk->prot->sendmsg(sk, msg, size); } • /* call tcp_v4_sendmsg() */ • 6. Call tcp_do_sendmsg(sk, msg) {… • struct sk_buff *skb; • tmp = MAX_HEADER + sk->prot->max_header; • skb = sock_wmalloc(sk, tmp, 0, GFP_KERNEL); • skb_reserve(skb, MAX_HEADER + sk->prot->max_header); • skb->csum = csum_and_copy_from_user(from, skb_put(skb, copy), copy, 0, &err); • /*TCP data bytes are SKB_PUT() on top, later TCP+IP+DEV headers are SKB_PUSH()'d beneath. */ • tcp_send_skb(sk, skb, queue_it); …}
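The reserve, put and push pattern in step 6 is easier to follow with a toy buffer. The sketch below is a user-space mock, not the kernel's sk_buff: the struct layout, the 256-byte size and the mock_skb_* names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy packet buffer: valid bytes live in [data, tail) inside buf. */
struct mock_skb {
    unsigned char buf[256];
    unsigned char *data;
    unsigned char *tail;
    size_t len;
};

void mock_skb_init(struct mock_skb *skb)
{
    skb->data = skb->buf;
    skb->tail = skb->buf;
    skb->len = 0;
}

/* Leave n bytes of headroom for headers added later (empty buffer only). */
void mock_skb_reserve(struct mock_skb *skb, size_t n)
{
    assert(skb->len == 0);
    assert(n <= (size_t)(skb->buf + sizeof skb->buf - skb->tail));
    skb->data += n;
    skb->tail += n;
}

/* Append n bytes at the tail; returns where the caller should write them. */
unsigned char *mock_skb_put(struct mock_skb *skb, size_t n)
{
    unsigned char *old_tail = skb->tail;

    assert(n <= (size_t)(skb->buf + sizeof skb->buf - skb->tail));
    skb->tail += n;
    skb->len += n;
    return old_tail;
}

/* Prepend n bytes in the headroom; returns the new start of the packet. */
unsigned char *mock_skb_push(struct mock_skb *skb, size_t n)
{
    assert(n <= (size_t)(skb->data - skb->buf));
    skb->data -= n;
    skb->len += n;
    return skb->data;
}
```

Reserving 40 bytes, putting a payload and then pushing 20 bytes twice leaves the payload where it was written: each lower layer prepends its header without copying the data.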
Getting the data from A to B • 5. tcp_send_skb() calls tcp_transmit_skb(sk, skb_clone(skb, GFP_KERNEL)); • 6. tcp_transmit_skb(struct sock *sk, struct sk_buff *skb) {… struct tcp_opt *tp = &(sk->tp_pinfo.af_tcp); • /* Build TCP header and checksum it. */ … • tp->af_specific->queue_xmit(skb); …} • 7. ip_queue_xmit() /* Queues a packet to be sent, and starts the transmitter if necessary. This routine also needs to put in the total length and compute the checksum. */ • {… • /* Make sure we can route this packet. */ • skb->dst = dst_clone(sk->dst_cache); • /* OK, we know where to send it, allocate and build IP header. */… • /* Do we need to fragment. Again this is inefficient. We need to somehow lock the original buffer and use bits of it. */… • /* Add an IP checksum. */…
Getting the data from A to B • skb->dst->output(skb); … } • 7. Bottom-half (BH) synchronization with a barrier: • start_bh_atomic(void), end_bh_atomic(void) • 8. dev_queue_xmit() {… • start_bh_atomic(); q = dev->qdisc; • if (q->enqueue) { • q->enqueue(skb, q); • qdisc_wakeup(dev); • end_bh_atomic(); … return;} • if (dev->flags&IFF_UP) { • dev->hard_start_xmit(skb, dev); • end_bh_atomic(); • return;} • } • 9. For the WD8013 card, this calls ei_start_xmit(), which passes the data to the network adapter, which in turn sends the packet onto the Ethernet.
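The decision made in dev_queue_xmit(), queue the packet if the device has a queueing discipline and otherwise hand it straight to the driver when the interface is up, can be mocked as follows. This is a simplified sketch: the mock_* types are invented, a NULL qdisc pointer stands in for "no enqueue function", packets are plain integers, and the bottom-half locking is left out.

```c
#include <stddef.h>

#define MOCK_IFF_UP 0x1
#define MOCK_QLEN   8

/* Hypothetical queueing discipline: a small fixed array of packet ids. */
struct mock_qdisc {
    int pkts[MOCK_QLEN];
    int count;
};

/* Hypothetical device: flags, optional queue, and a record of what the
 * pretend driver transmitted. */
struct mock_dev {
    unsigned int flags;
    struct mock_qdisc *qdisc; /* NULL models a device without a queue */
    int last_sent;
    int sent_count;
};

enum mock_xmit_result { MOCK_XMIT_QUEUED, MOCK_XMIT_SENT, MOCK_XMIT_DROPPED };

/* Pretend driver transmit routine, in the role of hard_start_xmit(). */
static void mock_hard_start_xmit(struct mock_dev *dev, int pkt)
{
    dev->last_sent = pkt;
    dev->sent_count++;
}

enum mock_xmit_result mock_dev_queue_xmit(struct mock_dev *dev, int pkt)
{
    struct mock_qdisc *q = dev->qdisc;

    if (q != NULL) {
        /* Queued path: the packet waits for the device to be ready. */
        if (q->count == MOCK_QLEN)
            return MOCK_XMIT_DROPPED;
        q->pkts[q->count++] = pkt;
        return MOCK_XMIT_QUEUED;
    }
    if (dev->flags & MOCK_IFF_UP) {
        /* Direct path: no queue, so call the driver immediately. */
        mock_hard_start_xmit(dev, pkt);
        return MOCK_XMIT_SENT;
    }
    return MOCK_XMIT_DROPPED;
}
```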
Getting the data from A to B • 10. The data, embedded in an Ethernet packet, are received by the NIC in B (assumed to be a WD8013). • 11. The NIC triggers an interrupt, which is handled by ei_interrupt(). It calls ei_receive() (the ei_* functions are chip-specific code for many 8390-based Ethernet adapters). • 12. ei_receive() { … struct sk_buff *skb; • skb = dev_alloc_skb(pkt_len+2);… • netif_rx(skb); …} • 13. netif_rx() receives a packet from a device driver and queues it for the upper (protocol) levels. It calls {skb_queue_tail(&backlog,skb); mark_bh(NET_BH); } • 14. There is only one backlog list in the entire system. • 15. do_bottom_half() calls net_bh()
Getting the data from A to B • 10. net_bh() {… • skb = skb_dequeue(&backlog); • /* Bump the pointer to the next structure. skb->data and skb->nh.raw point to the MAC and encapsulated data */ • skb->h.raw = skb->nh.raw = skb->data; • /* Fetch the packet protocol ID. */ • type = skb->protocol; • /* We got a packet ID. Now loop over the "known protocols" list. There are two lists. The ptype_all list of taps (normally empty) and the main protocol list which is hashed perfectly for normal protocols. */… • if (ptype->type == type && (ptype->dev==skb->dev)) • {/*We already have a match queued. Deliver to it*/ • skb2=skb_clone(skb, GFP_ATOMIC); • pt_prev->func(skb2, skb->dev, pt_prev);…}
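The protocol lookup in net_bh() boils down to comparing the packet's protocol ID (and device) against a table of registered handlers. A user-space sketch under simplifying assumptions: the table is a flat array instead of a hashed list, devices are plain integers, and all mock_* names are hypothetical. The two protocol IDs are the standard EtherType values for IPv4 and ARP.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_ETH_P_IP  0x0800 /* EtherType for IPv4 */
#define MOCK_ETH_P_ARP 0x0806 /* EtherType for ARP */

/* Hypothetical received packet: only the fields the lookup needs. */
struct mock_rx_pkt {
    uint16_t protocol;
    int dev;
};

/* Hypothetical handler registration, in the role of a packet type entry. */
struct mock_ptype {
    uint16_t type;
    int dev;
    void (*func)(const struct mock_rx_pkt *pkt, int *hits);
    int hits;
};

/* Demo handler: just counts how many packets it was given. */
static void mock_count_hit(const struct mock_rx_pkt *pkt, int *hits)
{
    (void)pkt;
    (*hits)++;
}

/* Deliver pkt to every handler whose type and device match.
 * Returns the number of handlers that received the packet. */
int mock_deliver(const struct mock_rx_pkt *pkt, struct mock_ptype *table,
                 size_t n)
{
    int delivered = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (table[i].type == pkt->protocol && table[i].dev == pkt->dev) {
            table[i].func(pkt, &table[i].hits);
            delivered++;
        }
    }
    return delivered;
}
```

A packet whose protocol ID matches no entry is simply not delivered, which corresponds to dropping frames of unknown protocols.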
Getting the data from A to B • 10. Call ip_rcv() {… • /* check the header for correctness and deal with all the IP options. ip_forward() and ip_defrag() */ … • return skb->dst->input(skb); } • 11. ip_local_deliver() {… • /* Reassemble IP fragments.*/ skb = ip_defrag(skb); • /*Deliver to raw sockets. This is fun as to avoid copies we want to make no surplus copies. */ … • /* Pass on the datagram to each protocol that wants it, based on the datagram protocol. */... • ipprot->handler(skb2, ntohs(iph->tot_len) - (iph->ihl * 4)); …} • 12. tcp_v4_rcv(), udp_rcv(), icmp_rcv()
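Two pieces of arithmetic appear on this slide: checking the IP header for correctness, which includes the header checksum, and the payload length ntohs(iph->tot_len) - (iph->ihl * 4) passed to the protocol handler. The sketch below works on a raw byte array in user space. The mock_* helper names are invented, the checksum is the standard Internet checksum (one's-complement sum of 16-bit words), and the kernel uses its own optimized routines rather than this loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum: one's-complement of the one's-complement sum of
 * all 16-bit big-endian words in data. */
uint16_t mock_inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8; /* pad the odd byte with zero */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16); /* fold carries back in */
    return (uint16_t)~sum;
}

/* Basic sanity checks on an IPv4 header held in hdr[0..avail).
 * Returns 1 if the header looks valid, 0 otherwise. */
int mock_ip_header_ok(const uint8_t *hdr, size_t avail)
{
    size_t hlen;

    if (avail < 20)
        return 0;
    if ((hdr[0] >> 4) != 4) /* version field */
        return 0;
    hlen = (size_t)(hdr[0] & 0x0f) * 4; /* ihl counts 32-bit words */
    if (hlen < 20 || hlen > avail)
        return 0;
    /* A correct header sums to zero when its checksum field is included. */
    return mock_inet_checksum(hdr, hlen) == 0;
}

/* Payload length handed to the transport protocol: tot_len - ihl * 4. */
int mock_ip_payload_len(const uint8_t *hdr)
{
    int tot_len = (hdr[2] << 8) | hdr[3];
    int hlen = (hdr[0] & 0x0f) * 4;

    return tot_len - hlen;
}
```

The same checksum function serves the sending side: computed over the header with the checksum field set to zero, its result is the value to store in that field.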
Getting the data from A to B • 13. tcp_v4_rcv() {… • /* check the header for correctness */ … • if (!atomic_read(&sk->sock_readers)) • return tcp_v4_do_rcv(sk, skb); • __skb_queue_tail(&sk->back_log, skb); • do_time_wait: case TCP_TW_ACK: tcp_v4_send_ack(); • …} • 14. tcp_v4_do_rcv() • { …__skb_queue_tail(&nsk->back_log, skb); • if (sk->state == TCP_ESTABLISHED) { /* Fast path */ • if (tcp_rcv_established(sk, skb, skb->h.th, skb->len)) • goto reset; • return 0; } • tcp_rcv_state_process(sk, skb, skb->h.th, skb->len);…}
Getting the data from A to B • 15. tcp_rcv_established(): the TCP receive function for the ESTABLISHED state. • It is split into a fast path and a slow path. The fast path is disabled when: • A zero window was announced from us (zero window probing is only handled properly in the slow path). • Out of order segments arrived. • Urgent data is expected. • There is no buffer space left. • Unexpected TCP flags/window values/header lengths are received (detected by checking the TCP header against pred_flags). • Data is sent in both directions. The fast path only supports pure senders or pure receivers (this means either the sequence number or the ack value must stay constant). • When any of these conditions holds, it drops into a standard receive procedure patterned after RFC 793 to handle all cases. • The first three cases are guaranteed by proper pred_flags setting, the rest is checked inline. Fast processing is turned on in tcp_data_queue when everything is OK.
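The idea of header prediction can be reduced to a few comparisons. The sketch below is a heavily simplified user-space model, not the kernel's code: the mock_tcp and mock_seg structures, their fields and both function names are assumptions. It shows only how a zeroed pred_flags value disables the fast path for the first three cases in the list, and how a header or sequence mismatch sends a segment to the slow path.

```c
#include <stdint.h>

/* Hypothetical per-connection state for the sketch. */
struct mock_tcp {
    uint32_t rcv_nxt;           /* next sequence number expected */
    uint32_t pred_flags;        /* predicted header word, 0 = fast path off */
    uint32_t expected_flags;    /* header word a "normal" segment carries */
    int zero_window_announced;
    int out_of_order_segments;
    int urgent_expected;
};

/* Hypothetical view of an incoming segment. */
struct mock_seg {
    uint32_t seq;
    uint32_t flag_word;         /* flags, header length and window together */
};

/* Recompute pred_flags: any of the first three conditions turns the
 * fast path off by leaving no header word that could match. */
void mock_update_pred_flags(struct mock_tcp *tp)
{
    if (tp->zero_window_announced || tp->out_of_order_segments > 0 ||
        tp->urgent_expected)
        tp->pred_flags = 0;
    else
        tp->pred_flags = tp->expected_flags;
}

/* Returns 1 if the segment may take the fast path, 0 for the slow path. */
int mock_fast_path_ok(const struct mock_tcp *tp, const struct mock_seg *seg)
{
    return tp->pred_flags != 0 &&
           seg->flag_word == tp->pred_flags &&
           seg->seq == tp->rcv_nxt;
}
```

The remaining conditions on the slide, such as buffer space, are described as being checked inline and are not modeled here.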
Getting the data from A to B • 16. tcp_data() enters the sk_buff into the socket's list. • 17. data_ready() wakes up the waiting processes. • 18. The preceding actions are carried out in the kernel, outside of any process. • 19. B executes read(socket, data, len). • 20. This goes through sys_read() → sock_read() → inet_recvmsg() → tcp_recvmsg(). • 21. This completes the data's travels from process A to process B. • 22. The data is copied only four times: • 1) From the user space of A to kernel memory. • 2) From kernel memory to the network card. • 3) From the network card to the other computer's kernel memory. • 4) From B's kernel memory to B's user space.
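The whole journey can be exercised from user space with two TCP endpoints on the loopback interface: A connects and calls write(), B accepts and calls read(). This is a sketch under assumptions: loopback_transfer() is a hypothetical helper, both endpoints live in one process for brevity, and traffic over loopback never reaches a real network card, so copies 2) and 3) in the list above are not exercised.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: sends msg from endpoint A to endpoint B over a
 * TCP connection on 127.0.0.1 and stores it in out.
 * Returns the number of bytes received, or -1 on error. */
long loopback_transfer(const char *msg, char *out, size_t outlen)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    size_t len = strlen(msg);
    size_t got = 0;
    int lfd = -1, a = -1, b = -1;
    long ret = -1;

    if (len > outlen)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */

    /* B's side: socket(), bind(), listen(). */
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) != 0)
        goto out;
    if (listen(lfd, 1) != 0)
        goto out;
    if (getsockname(lfd, (struct sockaddr *)&addr, &alen) != 0)
        goto out;

    /* A's side: socket(), connect(); then B accepts. */
    a = socket(AF_INET, SOCK_STREAM, 0);
    if (a < 0 || connect(a, (struct sockaddr *)&addr, sizeof addr) != 0)
        goto out;
    b = accept(lfd, NULL, NULL);
    if (b < 0)
        goto out;

    /* A writes; B reads until the whole message has arrived. */
    if (write(a, msg, len) != (ssize_t)len)
        goto out;
    while (got < len) {
        ssize_t n = read(b, out + got, len - got);
        if (n <= 0)
            goto out;
        got += (size_t)n;
    }
    ret = (long)got;
out:
    if (a >= 0)
        close(a);
    if (b >= 0)
        close(b);
    if (lfd >= 0)
        close(lfd);
    return ret;
}
```

The read loop is there because TCP is a byte stream: a single read() is allowed to return fewer bytes than were written.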