Socket Layer. COMS W6998 Spring 2010 Erich Nahum. Outline. Sockets API Refresher Linux Sockets Architecture Interface between BSD sockets and AF_INET Interface between AF_INET and TCP/UDP Receive Path Send Path. BSD Socket API. Originally developed by UC Berkeley at the dawn of time
Socket Layer COMS W6998 Spring 2010 Erich Nahum
Outline • Sockets API Refresher • Linux Sockets Architecture • Interface between BSD sockets and AF_INET • Interface between AF_INET and TCP/UDP • Receive Path • Send Path
BSD Socket API • Originally developed by UC Berkeley at the dawn of time • Used by 90% of network oriented programs • Standard interface across operating systems • Simple, well understood by programmers
User Space Socket API • socket() / bind() / accept() / listen(): initialization, addressing, and handshaking • select() / poll() / epoll(): waiting for events • send() / recv(): stream-oriented (e.g. TCP) Rx / Tx • sendto() / recvfrom(): datagram-oriented (e.g. UDP) Rx / Tx • close() / shutdown(): closing down an association
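The "waiting for events" group can be illustrated with a small sketch. It assumes a connected AF_UNIX pair from socketpair() rather than a real TCP connection, and the helper names wait_readable() and poll_demo() are made up for this example.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if fd is readable within timeout_ms, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(fd >= 0);
    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Wait-then-recv on a socket pair. Returns 0 if the socket was idle before
 * the send and readable after it, -1 otherwise. */
static int poll_demo(void)
{
    int sv[2];
    char byte = 0;
    int before, after = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    before = wait_readable(sv[1], 0);          /* nothing has been sent yet */
    if (send(sv[0], "x", 1, 0) == 1) {
        after = wait_readable(sv[1], 1000);    /* data is now queued */
        if (after == 1 && recv(sv[1], &byte, 1, 0) != 1)
            after = -1;
    }

    close(sv[0]);
    close(sv[1]);
    return (before == 0 && after == 1 && byte == 'x') ? 0 : -1;
}
```

Calling poll_demo() from main() is enough to try it. select() and epoll() follow the same wait-then-read pattern with different bookkeeping.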
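Standard Socket Sequence (diagram) • The ‘server’ application: socket(), bind(), listen(), accept() • The ‘client’ application: socket(), bind(), connect() • connect() on the client and accept() on the server are paired by the 3-way handshake • Data flow to server: client write(), server read() • Data flow to client: server write(), client read() • close() on each side, completed by the 4-way handshake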
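A minimal sketch of this call sequence, run inside one process. To stay independent of network configuration it uses an AF_UNIX stream socket with a Linux abstract name instead of TCP, so there is no real 3-way handshake here; the order of calls is the same. The function name and the socket name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Runs the server and client call sequences back to back.
 * Returns the number of bytes the server side read, or -1 on error. */
static ssize_t socket_sequence_demo(char *out, size_t outlen)
{
    struct sockaddr_un addr;
    socklen_t len;
    ssize_t n = -1;
    int namelen;
    int lfd = -1, cfd = -1, afd = -1;

    assert(out != NULL && outlen > 0);

    /* Abstract address: sun_path starts with a NUL byte (Linux-specific). */
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    namelen = snprintf(addr.sun_path + 1, sizeof(addr.sun_path) - 1,
                       "socket-seq-demo-%ld", (long)getpid());
    len = (socklen_t)(offsetof(struct sockaddr_un, sun_path) + 1 + (size_t)namelen);

    /* Server: socket(), bind(), listen() */
    lfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lfd < 0 || bind(lfd, (struct sockaddr *)&addr, len) < 0 || listen(lfd, 1) < 0)
        goto out;

    /* Client: socket(), connect() */
    cfd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, len) < 0)
        goto out;

    /* Server: accept() returns a new descriptor for this connection. */
    afd = accept(lfd, NULL, NULL);
    if (afd < 0)
        goto out;

    /* Data flow to server: client write(), server read(). */
    if (write(cfd, "hello", 5) == 5)
        n = read(afd, out, outlen);
out:
    if (afd >= 0) close(afd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return n;
}
```

With AF_INET and a struct sockaddr_in the same sequence would go through TCP and the handshakes shown on the slide.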
Socket() System Call • A socket is created from user space with the socket() system call: • int socket (int family, int type, int protocol); • On success, a file descriptor for the new socket is returned. • The open() system call for files also returns a file descriptor. • This is the “everything is a file” Unix paradigm. • The first parameter, family, is sometimes also called “domain”.
Socket(): Family • A family is a suite of protocols • Each family is a subdirectory of linux/net • E.g., linux/net/ipv4, linux/net/decnet, linux/net/packet • IPv4: PF_INET • IPv6: PF_INET6. • Packet sockets: PF_PACKET • Operate at the device driver layer. • pcap library for Linux uses PF_PACKET sockets • pcap library is in use by sniffers such as tcpdump. • Protocol Family == Address Family • PF_INET == AF_INET (in /include/linux/socket.h)
Address/Protocol Families
/* Supported address families. */
#define AF_UNSPEC      0
#define AF_UNIX        1   /* Unix domain sockets */
#define AF_LOCAL       1   /* POSIX name for AF_UNIX */
#define AF_INET        2   /* Internet IP Protocol */
#define AF_AX25        3   /* Amateur Radio AX.25 */
#define AF_IPX         4   /* Novell IPX */
#define AF_APPLETALK   5   /* AppleTalk DDP */
#define AF_NETROM      6   /* Amateur Radio NET/ROM */
#define AF_BRIDGE      7   /* Multiprotocol bridge */
#define AF_ATMPVC      8   /* ATM PVCs */
#define AF_X25         9   /* Reserved for X.25 project */
#define AF_INET6       10  /* IP version 6 */
#define AF_ROSE        11  /* Amateur Radio X.25 PLP */
#define AF_DECnet      12  /* Reserved for DECnet project */
#define AF_NETBEUI     13  /* Reserved for 802.2LLC project */
#define AF_SECURITY    14  /* Security callback pseudo AF */
#define AF_KEY         15  /* PF_KEY key management API */
..
#define AF_ISDN        34  /* mISDN sockets */
#define AF_PHONET      35  /* Phonet sockets */
#define AF_IEEE802154  36  /* IEEE802154 sockets */
#define AF_MAX         37  /* For now.. */
include/linux/socket.h
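The PF_* / AF_* equivalence is easy to confirm from user space. The sketch below only compares the constants exposed by <sys/socket.h> on Linux; the function name families_alias() is made up for this example.

```c
#include <assert.h>      /* for callers that want to assert on the result */
#include <sys/socket.h>

/* Returns 1 if each PF_* constant checked equals its AF_* counterpart
 * and AF_LOCAL is an alias for AF_UNIX, 0 otherwise. */
static int families_alias(void)
{
    return PF_INET == AF_INET &&
           PF_INET6 == AF_INET6 &&
           PF_UNIX == AF_UNIX &&
           AF_LOCAL == AF_UNIX &&
           PF_PACKET == AF_PACKET;
}
```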
Socket(): Type • SOCK_STREAM and SOCK_DGRAM are the most commonly used types. • SOCK_STREAM for TCP, SCTP • SOCK_DGRAM for UDP. • SOCK_RAW for RAW sockets. • Some families support both SOCK_STREAM and SOCK_DGRAM; for example, Unix domain sockets (AF_UNIX).
Socket(): Protocol • Protocol is the protocol number within a family. • Internet protocol numbers are assigned by IANA • http://www.iana.org/assignments/protocol-numbers/ • For AF_INET, it’s usually 0, which selects the default protocol for the given type. • IPPROTO_IP is 0, see: include/linux/in.h. • For SCTP: • protocol is IPPROTO_SCTP (132) sockfd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP); • For UDP-Lite: • protocol is IPPROTO_UDPLITE (136)
Socket Layer Architecture (diagram, top to bottom) • User: Application • Kernel: BSD Socket Layer (Socket Interface) • Protocol families: PF_INET, PF_PACKET, PF_UNIX, PF_IPX, ... • Socket types within a family: SOCK_STREAM, SOCK_DGRAM, SOCK_RAW • Protocol Layers: TCP, UDP, IPv4 • Network Device Layer • Device Layer: Ethernet, Token Ring, PPP, SLIP, FDDI • Hardware: Intel E1000
Key Concepts • Function pointer tables (“ops”) • In-kernel interfaces for socket functions • Binding between BSD sockets and AF_XXX families • Binding between AF_INET and transports (TCP, UDP) • Socket data structures • struct socket (BSD socket) • struct sock (protocol family socket, network state) • struct packet_sock (PF_PACKET) • struct inet_sock (PF_INET) • struct udp_sock • struct tcp_sock
Socket Data Structures • For every socket created by a user space application, there is a corresponding struct socket and struct sock in the kernel. • The two are easy to confuse. • struct socket: include/linux/net.h • Data common to the BSD socket layer • Has only 8 members • Any variable “sock” always refers to a struct socket • struct sock: include/net/sock.h • Data common to the Network Protocol layer (i.e., AF_INET) • Has more than 30 members, and is one of the biggest structures in the networking stack. • Any variable “sk” always refers to a struct sock.
struct socket
struct socket {
    socket_state            state;        // SS_CONNECTING etc.
    short                   type;         // SOCK_STREAM etc.
    unsigned long           flags;
    struct fasync_struct    *fasync_list;
    wait_queue_head_t       wait;         // tasks waiting
    struct file             *file;        // back ptr to inode
    struct sock             *sk;          // AF specific state
    const struct proto_ops  *ops;         // AF specific operations
};
include/linux/net.h
Socket State
typedef enum {
    SS_FREE = 0,        /* not allocated */
    SS_UNCONNECTED,     /* unconnected to any socket */
    SS_CONNECTING,      /* in process of connecting */
    SS_CONNECTED,       /* connected to socket */
    SS_DISCONNECTING    /* in process of disconnecting */
} socket_state;
include/linux/net.h
• These states are not layer 4 states (like TCP_ESTABLISHED or TCP_CLOSE).
Socket Types
enum sock_type {
    SOCK_STREAM    = 1,
    SOCK_DGRAM     = 2,
    SOCK_RAW       = 3,
    SOCK_RDM       = 4,
    SOCK_SEQPACKET = 5,
    SOCK_DCCP      = 6,
    SOCK_PACKET    = 10,
};
include/linux/net.h
Comment in include/net/sock.h /* * This structure really needs to be cleaned up. * Most of it is for TCP, and not used by any of * the other protocols. */
struct sock_common
/* minimal network layer representation of sockets */
struct sock_common {
    /*
     * first fields are not copied in sock_copy()
     */
    union {
        struct hlist_node        skc_node;          // main hash linkage for lookup
        struct hlist_nulls_node  skc_nulls_node;    // main hash for TCP/UDP
    };
    atomic_t                 skc_refcnt;
    int                      skc_tx_queue_mapping;  // tx queue for this connection
    union {
        unsigned int         skc_hash;              // hash value for lookup
        __u16                skc_u16hashes[2];
    };
    unsigned short           skc_family;            // network address family
    volatile unsigned char   skc_state;             // Connection state
    unsigned char            skc_reuse;             // SO_REUSEADDR setting
    int                      skc_bound_dev_if;      // bound if !=0
    union {
        struct hlist_node        skc_bind_node;       // bind hash linkage
        struct hlist_nulls_node  skc_portaddr_node;   // bind hash for UDP/Lite
    };
    struct proto             *skc_prot;             // protocol handlers in a net family
};
include/net/sock.h
BSD Socket AF Interface • Main data structures • struct net_proto_family • struct proto_ops • Key function: sock_register(struct net_proto_family *ops) • Each address family: • Implements a struct net_proto_family. • Calls sock_register() when the protocol family is initialized. • Implements a struct proto_ops that binds the BSD socket layer to the protocol family layer.
net_proto_family (BSD Socket Layer / AF Socket Layer)
• Describes each of the supported protocol families
struct net_proto_family {
    int family;
    int (*create)(struct net *net, struct socket *sock, int protocol, int kern);
    struct module *owner;
};
• Specifies the handler for socket creation
• The create() function is called whenever a new socket of this family is created
INET and PACKET proto_family (BSD Socket Layer / AF Socket Layer)
static const struct net_proto_family inet_family_ops = {
    .family = PF_INET,
    .create = inet_create,
    .owner  = THIS_MODULE,    /* af_inet.c */
};
static const struct net_proto_family packet_family_ops = {
    .family = PF_PACKET,
    .create = packet_create,
    .owner  = THIS_MODULE,    /* af_packet.c */
};
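The registration pattern can be modeled in user space. This is a simplified sketch, not kernel code: every model_* name, the table size MODEL_NPROTO, and the two-argument create() signature are assumptions made for illustration. The error codes are chosen to resemble what a family table would plausibly return.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_NPROTO 8    /* hypothetical table size for this model */

struct model_socket { int family; int type; int protocol; };

/* Model of struct net_proto_family: a family number plus a create handler. */
struct model_proto_family {
    int family;
    int (*create)(struct model_socket *sock, int protocol);
};

static const struct model_proto_family *model_families[MODEL_NPROTO];

/* Model of sock_register(): claim the table slot for this family. */
static int model_sock_register(const struct model_proto_family *ops)
{
    assert(ops != NULL && ops->create != NULL);
    if (ops->family < 0 || ops->family >= MODEL_NPROTO)
        return -ENOBUFS;
    if (model_families[ops->family])
        return -EEXIST;
    model_families[ops->family] = ops;
    return 0;
}

/* Model of the socket() path: look up the family, then call its create(). */
static int model_sock_create(int family, int type, int protocol,
                             struct model_socket *sock)
{
    if (family < 0 || family >= MODEL_NPROTO || !model_families[family])
        return -EAFNOSUPPORT;
    sock->type = type;
    return model_families[family]->create(sock, protocol);
}

/* A stand-in for inet_create(); 2 is the AF_INET value from the table above. */
static int model_inet_create(struct model_socket *sock, int protocol)
{
    sock->family = 2;
    sock->protocol = protocol;
    return 0;
}

static const struct model_proto_family model_inet_family_ops = {
    .family = 2,
    .create = model_inet_create,
};
```

The design point is that the generic layer never names a family directly: it only indexes the table and calls through the function pointer.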
proto_ops (BSD Socket Layer / AF Socket Layer) • Defines the binding between the BSD socket layer and the address family (AF_*) layer. • The proto_ops tables contain the functions exported by the AF socket layer to the BSD socket layer. • Each table consists of the address family type and a set of pointers to socket operation routines specific to that address family.
struct proto_ops (BSD Socket Layer / AF Socket Layer)
/* Function pointer members are shown without their argument lists. */
struct proto_ops {
    int family;
    struct module *owner;
    int (*release);
    int (*bind);
    int (*connect);
    int (*socketpair);
    int (*accept);
    int (*getname);
    unsigned int (*poll);
    int (*ioctl);
    int (*compat_ioctl);
    int (*listen);
    int (*shutdown);
    int (*setsockopt);
    int (*getsockopt);
    int (*compat_setsockopt);
    int (*compat_getsockopt);
    int (*sendmsg);
    int (*recvmsg);
    int (*mmap);
    ssize_t (*sendpage);
    ssize_t (*splice_read);
};
include/linux/net.h
PF_PACKET proto_ops (BSD Socket Layer / AF Socket Layer)
static const struct proto_ops packet_ops = {
    .family     = PF_PACKET,
    .owner      = THIS_MODULE,
    .release    = packet_release,
    .bind       = packet_bind,
    .connect    = sock_no_connect,
    .socketpair = sock_no_socketpair,
    .accept     = sock_no_accept,
    .getname    = packet_getname,
    .poll       = packet_poll,
    .ioctl      = packet_ioctl,
    .listen     = sock_no_listen,
    .shutdown   = sock_no_shutdown,
    .setsockopt = packet_setsockopt,
    .getsockopt = packet_getsockopt,
    .sendmsg    = packet_sendmsg,
    .recvmsg    = packet_recvmsg,
    .mmap       = packet_mmap,
    .sendpage   = sock_no_sendpage,
};
net/packet/af_packet.c
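The ops-table idea, including the sock_no_* stubs for unsupported calls, can be sketched in user space. This is a model under stated assumptions, not kernel code: all model_* names and the reduced signatures are made up, and the stub is assumed to report -EOPNOTSUPP.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_socket;

/* Model of struct proto_ops, reduced to two operations. */
struct model_proto_ops {
    int family;
    int (*bind)(struct model_socket *sock);
    int (*listen)(struct model_socket *sock, int backlog);
};

struct model_socket {
    const struct model_proto_ops *ops;   /* AF specific operations */
    int bound;
};

/* Generic stub for families that do not support listen(). */
static int model_no_listen(struct model_socket *sock, int backlog)
{
    (void)sock;
    (void)backlog;
    return -EOPNOTSUPP;
}

/* Family-specific bind. */
static int model_packet_bind(struct model_socket *sock)
{
    sock->bound = 1;
    return 0;
}

static const struct model_proto_ops model_packet_ops = {
    .family = 17,                     /* PF_PACKET value on Linux */
    .bind   = model_packet_bind,
    .listen = model_no_listen,        /* every slot is filled, so callers never test for NULL */
};

/* The generic layer only dispatches through sock->ops. */
static int model_sys_bind(struct model_socket *sock)
{
    assert(sock != NULL && sock->ops != NULL);
    return sock->ops->bind(sock);
}

static int model_sys_listen(struct model_socket *sock, int backlog)
{
    assert(sock != NULL && sock->ops != NULL);
    return sock->ops->listen(sock, backlog);
}
```

Filling unsupported slots with a stub rather than NULL keeps the dispatch code free of per-call checks.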
PF_INET proto_ops (BSD Socket Layer / AF Socket Layer) • Defined in net/ipv4/af_inet.c
Outline • Sockets API Refresher • Linux Sockets Architecture • Interface between BSD sockets and AF_INET • Interface between AF_INET and TCP/UDP • Binding between IP and TCP/UDP (upcall) • Binding between AF_INET and TCP (downcall) • Receive Path • Send Path
AF_INET Transport API (AF_INET Layer / Transport Layer) • inet_protos (array of struct net_protocol) • Interface between IP and the transport layer • Is the upcall binding from IP to transport • Method for demultiplexing IP packets to the proper transport • struct proto • Defines the interface for individual protocols (TCP, UDP, etc.) • Is the downcall binding from AF_INET to transport • Transport-specific functions for the socket API • struct inet_protosw • Describes the PF_INET protocols • Defines the different SOCK types for PF_INET • SOCK_STREAM (TCP), SOCK_DGRAM (UDP), SOCK_RAW
Recall IP’s inet_protos (BSD Socket Layer / AF Socket Layer) • Receive binding from the IP layer to the transport layer. • inet_init() calls inet_add_protocol(p) to add each protocol to the hash queues. • Diagram: inet_protos[MAX_INET_PROTOS] is an array of pointers to struct net_protocol; each net_protocol holds handler, err_handler, gso_send_check, gso_segment, gro_receive, and gro_complete. • Example entries from the diagram: one with handler = udp_rcv() and err_handler = udp_err(); one with handler = igmp_rcv() and err_handler = NULL.
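The upcall binding amounts to a table lookup keyed by the IP protocol number. The sketch below is a user-space model under stated assumptions: the model_* names, the 256-entry table, the mask-based slot computation, and the reduced handler signature are illustrative and are not the kernel's definitions.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>

#define MODEL_MAX_PROTOS 256    /* assumed table size, a power of two */

/* Model of struct net_protocol, reduced to the two handlers. */
struct model_net_protocol {
    int  (*handler)(const unsigned char *payload, size_t len);
    void (*err_handler)(int info);
};

static const struct model_net_protocol *model_protos[MODEL_MAX_PROTOS];
static size_t model_udp_bytes;    /* bytes delivered to the UDP handler */

/* Model of inet_add_protocol(): claim the slot for a protocol number. */
static int model_add_protocol(const struct model_net_protocol *p,
                              unsigned char protocol)
{
    unsigned int slot = protocol & (MODEL_MAX_PROTOS - 1);

    assert(p != NULL);
    if (model_protos[slot])
        return -1;                /* slot already taken */
    model_protos[slot] = p;
    return 0;
}

/* Model of the IP receive path: demultiplex on the protocol field. */
static int model_deliver(unsigned char protocol,
                         const unsigned char *payload, size_t len)
{
    const struct model_net_protocol *p =
        model_protos[protocol & (MODEL_MAX_PROTOS - 1)];

    if (!p || !p->handler)
        return -1;                /* no transport registered */
    return p->handler(payload, len);
}

/* Stand-in for udp_rcv(). */
static int model_udp_rcv(const unsigned char *payload, size_t len)
{
    (void)payload;
    model_udp_bytes += len;
    return 0;
}

static const struct model_net_protocol model_udp_protocol = {
    .handler     = model_udp_rcv,
    .err_handler = NULL,
};
```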
struct proto (BSD Socket Layer / AF Socket Layer)
/* Networking protocol blocks we attach to sockets.
 * socket layer -> transport layer interface
 * (function pointer members are shown without their argument lists)
 */
struct proto {
    void (*close);
    int (*connect);
    int (*disconnect);
    struct sock * (*accept);
    int (*ioctl);
    int (*init);
    void (*destroy);
    void (*shutdown);
    int (*setsockopt);
    int (*getsockopt);
    int (*sendmsg);
    int (*recvmsg);
    int (*sendpage);
    int (*bind);
    int (*backlog_rcv);
    void (*hash);
    void (*unhash);
    int (*get_port);
};
include/net/sock.h
udp_prot (BSD Socket Layer / AF Socket Layer)
struct proto udp_prot = {
    .name             = "UDP",
    .owner            = THIS_MODULE,
    .close            = udp_lib_close,
    .connect          = ip4_datagram_connect,
    .disconnect       = udp_disconnect,
    .ioctl            = udp_ioctl,
    .destroy          = udp_destroy_sock,
    .setsockopt       = udp_setsockopt,
    .getsockopt       = udp_getsockopt,
    .sendmsg          = udp_sendmsg,
    .recvmsg          = udp_recvmsg,
    .sendpage         = udp_sendpage,
    .backlog_rcv      = __udp_queue_rcv_skb,
    .hash             = udp_lib_hash,
    .unhash           = udp_lib_unhash,
    .get_port         = udp_v4_get_port,
    .memory_allocated = &udp_memory_allocated,
    .sysctl_mem       = sysctl_udp_mem,
    .sysctl_wmem      = &sysctl_udp_wmem_min,
    .sysctl_rmem      = &sysctl_udp_rmem_min,
    .obj_size         = sizeof(struct udp_sock),
    .slab_flags       = SLAB_DESTROY_BY_RCU,
    .h.udp_table      = &udp_table,
#ifdef CONFIG_COMPAT
    .compat_setsockopt = compat_udp_setsockopt,
    .compat_getsockopt = compat_udp_getsockopt,
#endif
};
net/ipv4/udp.c
inet_protosw (BSD Socket Layer / AF Socket Layer)
static struct inet_protosw inetsw_array[] = {
    {
        .type     = SOCK_STREAM,
        .protocol = IPPROTO_TCP,
        .prot     = &tcp_prot,
        .ops      = &inet_stream_ops,
        .no_check = 0,
        .flags    = INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK,
    },
    {
        .type     = SOCK_DGRAM,
        .protocol = IPPROTO_UDP,
        .prot     = &udp_prot,
        .ops      = &inet_dgram_ops,
        .no_check = UDP_CSUM_DEFAULT,
        .flags    = INET_PROTOSW_PERMANENT,
    },
    {
        .type     = SOCK_RAW,
        .protocol = IPPROTO_IP,    /* wild card */
        .prot     = &raw_prot,
        .ops      = &inet_sockraw_ops,
        .no_check = UDP_CSUM_DEFAULT,
        .flags    = INET_PROTOSW_REUSE,
    }
};
net/ipv4/af_inet.c
• On startup (inet_init()), the TCP, UDP, and raw socket entries of inetsw_array[] are registered.
• Other protocols call inet_register_protosw().
• inet_unregister_protosw() will not remove protocols with PERMANENT set.
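How a (type, protocol) pair from socket() might be matched against such a table, including the IPPROTO_IP wild card, can be sketched as follows. This is a user-space model loosely patterned on the table above; the model_* names and the exact matching rules are assumptions, not a copy of the kernel's lookup.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <sys/socket.h>

/* Model of struct inet_protosw, reduced to the lookup keys and a label. */
struct model_protosw {
    int type;
    int protocol;
    const char *name;
};

static const struct model_protosw model_inetsw[] = {
    { SOCK_STREAM, IPPROTO_TCP, "TCP" },
    { SOCK_DGRAM,  IPPROTO_UDP, "UDP" },
    { SOCK_RAW,    IPPROTO_IP,  "RAW" },   /* wild card entry */
};

/* Returns the matching entry, or NULL if the combination is unsupported.
 * Assumed rules: an exact non-zero protocol match wins; a request for
 * protocol 0 takes the entry's protocol as the default; a wild card entry
 * accepts any non-zero protocol. */
static const struct model_protosw *model_inetsw_lookup(int type, int protocol)
{
    size_t i;

    assert(protocol >= 0);
    for (i = 0; i < sizeof(model_inetsw) / sizeof(model_inetsw[0]); i++) {
        const struct model_protosw *e = &model_inetsw[i];

        if (e->type != type)
            continue;
        if (protocol == e->protocol) {
            if (protocol != IPPROTO_IP)
                return e;               /* exact match */
        } else {
            if (protocol == IPPROTO_IP)
                return e;               /* caller asked for the default */
            if (e->protocol == IPPROTO_IP)
                return e;               /* wild card entry */
        }
    }
    return NULL;
}
```

Under these rules, (SOCK_STREAM, 0) resolves to the TCP entry, (SOCK_RAW, IPPROTO_ICMP) resolves to the raw entry, and (SOCK_STREAM, IPPROTO_UDP) has no match.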
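Relationships (diagram) • struct socket: state, type, flags, fasync_list, wait, file, sk (points to struct sock), ops (points to struct proto_ops) • struct sock: sk_common (embedded struct sock_common), sk_lock, sk_backlog, ..., (*sk_prot_creator), sk_socket (points back to struct socket), sk_send_head, ... • struct sock_common: skc_node, skc_refcnt, skc_hash, ..., skc_prot (points to struct proto), skc_net • struct proto (UDP example): udp_lib_close, ip4_datagram_connect, udp_sendmsg, udp_recvmsg, ... • struct proto_ops (PF_INET, af_inet.c): inet_release, inet_bind, inet_accept, ...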
Example: inet_accept()
int inet_accept(struct socket *sock, struct socket *newsock, int flags)
{
    struct sock *sk1 = sock->sk;
    int err = -EINVAL;
    struct sock *sk2 = sk1->sk_prot->accept(sk1, flags, &err);

    if (!sk2)
        goto do_err;

    lock_sock(sk2);

    WARN_ON(!((1 << sk2->sk_state) &
              (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT | TCPF_CLOSE)));

    sock_graft(sk2, newsock);

    newsock->state = SS_CONNECTED;
    err = 0;
    release_sock(sk2);
do_err:
    return err;
}
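The two-level dispatch in inet_accept() (the address family function delegating to the transport through sk_prot, then grafting the new sock onto the new socket) can be modeled in user space. This is a sketch under stated assumptions: all model_* names are made up, locking is omitted, and a single "pending" pointer stands in for the transport's accept queue.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_sock;
struct model_socket;

/* Transport-level table (model of struct proto). */
struct model_proto {
    struct model_sock *(*accept)(struct model_sock *sk, int *err);
};

/* Protocol-level socket (model of struct sock). */
struct model_sock {
    const struct model_proto *prot;
    struct model_socket *socket;    /* back pointer, set when grafted */
    struct model_sock *pending;     /* one queued connection, if any */
};

/* BSD-level socket (model of struct socket). */
struct model_socket {
    int state;
    struct model_sock *sk;
};

enum { MODEL_SS_UNCONNECTED = 1, MODEL_SS_CONNECTED = 3 };

/* Stand-in for the transport's accept: hand over the queued child. */
static struct model_sock *model_tcp_accept(struct model_sock *sk, int *err)
{
    struct model_sock *child = sk->pending;

    if (!child) {
        *err = -EAGAIN;
        return NULL;
    }
    sk->pending = NULL;
    return child;
}

static const struct model_proto model_tcp_prot = { .accept = model_tcp_accept };

/* Model of inet_accept(): delegate downward, then link both structures. */
static int model_inet_accept(struct model_socket *sock,
                             struct model_socket *newsock)
{
    struct model_sock *sk1 = sock->sk;
    int err = -EINVAL;
    struct model_sock *sk2;

    assert(sk1 != NULL && sk1->prot != NULL);
    sk2 = sk1->prot->accept(sk1, &err);
    if (!sk2)
        return err;

    /* Model of sock_graft(): link the two structures in both directions. */
    sk2->socket = newsock;
    newsock->sk = sk2;
    newsock->state = MODEL_SS_CONNECTED;
    return 0;
}
```

The address family layer never needs to know which transport it is talking to: the only coupling is the function pointer stored in the sock.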