Returning to zombies and parents and children. • What happens to the children if the parent dies? /* ILLUSTRATING THAT WHEN THE PARENT IS KILLED THE ORPHANED CHILDREN ARE ASSIGNED A PARENT PID OF 1 (INIT). */ #include <stdio.h> #include <stdlib.h> #include <unistd.h> int main(int argc, char **argv) { int pid, i=1, j=1; ( pid=fork() ) ? printf("%d\n", pid) : printf("%d\n", pid) ; system("ps -l > befor_fork "); pid=fork(); system("ps -l > after_fork ");
/* in parent */ if( pid ) { printf("printing from the parent, fork returned = %d\n",pid); system("ps -l > parent_start "); for( ; ; ) { sleep(2); printf("my parent is %d -----<>\n", getppid()); } system("ps -l > parent_end "); }
/* in child */ if( !pid ) { printf("printing from the child, fork returned = %d\n",pid); /* printf("printing from the child: %d\n",getppid());*/ system("ps -l > child_start "); for( ; ; ) { sleep(1); printf("my parent is %d -----<><>\n", getppid()); system("ps -l > child_inside "); } } }
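The reparenting can also be observed without running ps. The following is a minimal sketch, not part of the original program: the helper name orphan_new_ppid is hypothetical, and it assumes a POSIX system. An intermediate parent forks a child and exits at once; the orphaned child reports the parent PID it sees afterwards. On a classic Unix that value is 1 (init), while some modern systems hand orphans to a different reaper process, so the sketch only reports the value.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Fork an intermediate parent that forks a child and exits immediately.
 * Returns the parent PID the orphaned child sees afterwards and stores
 * the original parent's PID in *old_ppid. Returns -1 on error.
 */
pid_t orphan_new_ppid(pid_t *old_ppid)
{
    int report[2];
    assert(old_ppid != NULL);
    if (pipe(report) < 0)
        return -1;

    pid_t mid = fork();
    if (mid < 0) {
        close(report[0]);
        close(report[1]);
        return -1;
    }

    if (mid == 0) {                          /* intermediate parent */
        pid_t self = getpid();
        pid_t child = fork();
        if (child == 0) {                    /* child, soon to be orphaned */
            struct timespec nap = { 0, 10000000 };   /* 10 ms */
            close(report[0]);
            while (getppid() == self)        /* wait until the parent is gone */
                nanosleep(&nap, NULL);
            pid_t pair[2] = { self, getppid() };
            if (write(report[1], pair, sizeof pair) != (ssize_t)sizeof pair)
                _exit(1);
            _exit(0);
        }
        _exit(child < 0 ? 1 : 0);            /* the parent dies immediately */
    }

    close(report[1]);
    waitpid(mid, NULL, 0);                   /* reap the intermediate parent */

    pid_t pair[2];
    ssize_t n = read(report[0], pair, sizeof pair);
    close(report[0]);
    if (n != (ssize_t)sizeof pair)
        return -1;
    *old_ppid = pair[0];
    return pair[1];
}
```

Calling orphan_new_ppid(&old) from a main function and printing both values shows the change of parent directly.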
Interesting question from Gilbert Rahme: • > It says in the book that if host A and host B are exchanging packets, A announces its MSS to the peer TCP which tells the peer the maximum amount of data that it can send per segment. It also says that MSS is often set to path MTU minus the fixed sizes of the IP and TCP headers. > > My question is: the above will work if the path followed to send packets from A to B is the same as the path followed to send packets from B to A but most of the time this path is not the same. So, how would host A (who announces its MSS to host B) know what is the path MTU for packets going from B to A if A has no idea about what path it might be ?? • MTU is IP; MSS is TCP
The TCP MSS value specifies the maximum amount of TCP data in a single IP datagram that the local system can accept (reassemble). • The IP datagram may be fragmented into multiple packets when sent. Theoretically, this value could be as large as 65495, but such a large value is never used. Typically, an end system will use the "outgoing interface MTU" minus 40 as its reported MSS. For example, an Ethernet MSS value would be 1460 (1500 - 40 = 1460).
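The arithmetic above can be written down as a small helper. This is only a sketch: the function name mss_from_mtu is hypothetical, and it assumes IPv4 and TCP headers of 20 bytes each with no options.

```c
#include <assert.h>

enum { IPV4_HEADER_BYTES = 20, TCP_HEADER_BYTES = 20 };

/* MSS reported for an interface with the given MTU, assuming no options. */
int mss_from_mtu(int mtu)
{
    assert(mtu > IPV4_HEADER_BYTES + TCP_HEADER_BYTES);
    return mtu - IPV4_HEADER_BYTES - TCP_HEADER_BYTES;
}
```

With an Ethernet MTU of 1500 this gives 1460, and with the largest possible IP datagram (65535) it gives the theoretical maximum of 65495.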
Interesting Answer: (RFC 1191) • When one IP host has a large amount of data to send to another host, the data is transmitted as a series of IP datagrams. It is usually preferable that these datagrams be of the largest size that does not require fragmentation anywhere along the path from the source to the destination. (For the case against fragmentation, see [5].) This datagram size is referred to as the Path MTU (PMTU), and it is equal to the minimum of the MTUs of each hop in the path. A shortcoming of the current Internet protocol suite is the lack of a standard mechanism for a host to discover the PMTU of an arbitrary path. • In this memo, we describe a technique for using the Don't Fragment (DF) bit in the IP header to dynamically discover the PMTU of a path. The basic idea is that a source host initially assumes that the PMTU of a path is the (known) MTU of its first hop, and sends all datagrams on that path with the DF bit set.
If any of the datagrams are too large to be forwarded without fragmentation by some router along the path, that router will discard them and return ICMP Destination Unreachable messages with a code meaning "fragmentation needed and DF set" [7]. Upon receipt of such a message (henceforth called a "Datagram Too Big" message), the source host reduces its assumed PMTU for the path. • The PMTU of a path may change over time, due to changes in the routing topology. Reductions of the PMTU are detected by Datagram Too Big messages, except on paths for which the host has stopped setting the DF bit. To detect increases in a path's PMTU, a host periodically increases its assumed PMTU (and if it had stopped, resumes setting the DF bit).
This will almost always result in datagrams being discarded and Datagram Too Big messages being generated, because in most cases the PMTU of the path will not have changed, so it should be done infrequently. • When the IP sender receives this Internet Control Message Protocol (ICMP) message, it should learn to use a smaller IP MTU for packets sent to this destination, and subsequent packets should be able to get through. • Since this mechanism essentially guarantees that host will not receive any fragments from a peer doing PMTU Discovery, it may aid in interoperating with certain hosts that (improperly) are unable to reassemble fragmented datagrams.
Various problems can cause the PMTUD algorithm to fail, so that the IP sender will never learn the smaller path MTU but will continue unsuccessfully to retransmit the too-large packet, until the retransmissions time out. Some problems include the following: • The router with the too-small next hop path fails to generate the necessary ICMP error message. • Some router in the reverse path between the small-MTU router and the IP sender discards the ICMP error message before it can reach the IP sender. • Confusion in the IP sender's stack in which it ignores the received ICMP error message. • With the above problems, a workaround is to configure the IP sender to disable PMTUD. This causes the IP sender to send their datagrams with the DF flag clear. When the large packets reach the small-MTU router, that router fragments the packets into multiple smaller ones. The smaller, fragmented data reaches the destination where it is reassembled into the original large packet.
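The discovery loop described in RFC 1191 can be simulated in a few lines. This is a sketch under stated assumptions, not a network implementation: the function name discover_pmtu and the hop_mtu array are hypothetical, and every router is assumed to report its next-hop MTU in the Datagram Too Big message so that the sender can lower its estimate in one step.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simulated Path MTU Discovery. hop_mtu[i] is the MTU of hop i.
 * The sender starts from the first-hop MTU and sends with DF set.
 * Each time a hop cannot forward the datagram, the sender lowers its
 * estimate to that hop's MTU and tries again.
 * *too_big_msgs receives the number of simulated Datagram Too Big messages.
 */
int discover_pmtu(const int *hop_mtu, size_t hops, int *too_big_msgs)
{
    assert(hops > 0);
    int pmtu = hop_mtu[0];
    *too_big_msgs = 0;

    for (;;) {
        size_t i;
        for (i = 0; i < hops; i++) {
            if (pmtu > hop_mtu[i])
                break;                  /* this router drops the datagram */
        }
        if (i == hops)
            return pmtu;                /* datagram crossed every hop */
        pmtu = hop_mtu[i];              /* reduce the assumed PMTU */
        (*too_big_msgs)++;
    }
}
```

For hops with MTUs 1500, 1400, 576 and 1500 the estimate drops twice and ends at 576, the minimum of the hop MTUs. If the ICMP messages were lost, as in the failure cases listed above, the estimate would never drop and the sender would keep retransmitting.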
Some general comments on a Unix executable. • Loaded during an exec. • Three regions: text, data and stack. • Text is the program text (machine code). • Data is the bss (scratch pad memory for use by uninitialized variables - a size indicator), as well as the constants. • Stack. Stack frames, which contain the parameters to a function, local vars, and the data necessary to recover the previous stack frame. • Since a Unix process can execute in either kernel or user mode it has separate stacks for each mode. • The kernel stack contains the stack frames for functions executing within the kernel mode (system calls).
Some notes on the context switch and the exec. • A context switch represents the storing of the current context as expressed in the PC, the IR, and the GP register set. This definition is from an assembly language programmer perspective. • A call results in a context switch. (BALR). Hence there are opcodes for the pushing and popping of the stack. Along with these calls are defined techniques for using these instructions. • From the Unix perspective the context of a process is its state as defined by its text, the global user variables and data structures, the register set, the values in the process table slot and the content of its user and kernel stacks.
When the Unix kernel begins to execute another process a context switch is performed (notice this is NOT a call but rather the kernel selecting another process to execute). • This means that the system executes in the context of another process. • Moving between user and kernel mode is a change in mode, not a context switch. • Special processes are not scheduled to handle interrupts. • An exec does NOT result in a context switch. Instead the exec’ed program overlays the current context. This means that the current context is LOST.
Therefore on the accept / fork transition when the exec is invoked within the child the child context is lost. • The descriptors are NOT lost on an exec. They remain assigned within the NEW CONTEXT. The standard within Unix (BUT NOT TCP SERVERS) is to close all descriptors before an exec. This prevents nasty surprises. • A read of a pipe returns EOF only if no processes have the pipe open for writing. • An exit on a child will remove the descriptors of the child, but will not result in a close. The close can only occur when a reference count is equal to zero.
wait() and waitpid() functions. • Both return two values: • process id of the terminated child • termination status of the child (int) returned through a pointer. pid_t wait(int *statloc); pid_t waitpid(pid_t pid, int *statloc, int options); Both return the process ID on success, -1 on error. • Three macros are available to examine the termination status and determine whether the child terminated normally, was killed by a signal, or was job-control stopped: • WIFEXITED • WIFSIGNALED • WIFSTOPPED • WEXITSTATUS then fetches the exit status of a child that terminated normally. • If there are no terminated children for the process calling wait AND the process has children executing, wait blocks until the first one terminates.
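A minimal sketch of decoding the termination status follows. The helper name child_exit_status is hypothetical; the calls and macros are the standard POSIX ones named above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`; return the status the parent decodes. */
int child_exit_status(int code)
{
    assert(code >= 0 && code <= 255);   /* only 8 bits reach the parent */

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                    /* child terminates normally */

    int stat;
    if (waitpid(pid, &stat, 0) != pid)  /* blocks until this child terminates */
        return -1;
    if (WIFEXITED(stat))
        return WEXITSTATUS(stat);
    return -1;                          /* killed by a signal or stopped */
}
```

The status word itself is not the exit code; it has to be passed through WIFEXITED and WEXITSTATUS before the value given to _exit is visible.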
wait() and waitpid() • Modify TCP client (loop) to establish five connections. • Use only first connection in the call to str_cli. • When the client terminates five FINs will be sent and five SIGCHLD signals will be sent at about the same time. • The code below has been added to the server to handle the SIGCHLD: void sig_chld(int signo) { pid_t pid; int stat; pid = wait(&stat); printf("child %d terminated\n", pid); return; }
wait() and waitpid() • The signal catch has been established with the following code in the server: Signal(SIGCHLD, sig_chld); • After the five children terminate only one printf fires. • Execute ps and four other children are still extant. • The problem is that the signal handler is only executed one time because Unix signals are not normally queued. • Which child gets handled is non-deterministic. • If the server and client are on different hosts the signal handler may be executed two times.
wait() and waitpid() void sig_chld (int signo) { pid_t pid; int stat; while ( (pid = waitpid(-1, &stat, WNOHANG) ) > 0) { printf ("child %d terminated\n", pid); } return; } • waitpid in a loop will fetch the status of any children that have terminated. • Specify WNOHANG; thus telling waitpid not to block if there exist any running children not yet terminated. • wait, by contrast, cannot be prevented from blocking when running children have not yet terminated.
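The effect of the waitpid loop can be checked in isolation. The sketch below is an assumption-laden stand-in for the echo server: the helper spawn_and_reap and the counter reaped are hypothetical, the handler is installed with sigaction rather than the book's Signal wrapper, and the children simply exit instead of serving clients. The handler counts instead of calling printf, which is not async-signal-safe.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t reaped;

/* One SIGCHLD may stand for several dead children, so loop until none is left. */
static void sig_chld(int signo)
{
    int saved_errno = errno;
    int stat;
    (void)signo;
    while (waitpid(-1, &stat, WNOHANG) > 0)
        reaped++;
    errno = saved_errno;
}

/* Fork n children that exit at once; return how many the handler reaped. */
int spawn_and_reap(int n)
{
    struct sigaction sa, old;
    assert(n >= 0);

    sa.sa_handler = sig_chld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, &old) < 0)
        return -1;

    reaped = 0;
    int started = 0;
    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);
        if (pid > 0)
            started++;
    }

    struct timespec nap = { 0, 1000000 };       /* 1 ms */
    while (reaped < started)
        nanosleep(&nap, NULL);                  /* interrupted by SIGCHLD */

    sigaction(SIGCHLD, &old, NULL);
    return reaped;
}
```

With a handler that calls wait only once, the same experiment can leave zombies behind whenever several SIGCHLD signals collapse into one delivery.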
TCP echo server/client • Connection abort before accept returns • The 3 way handshake is complete and then the client TCP sends a RST. • On the server side the connection is in the complete queue waiting for an accept to be called. • Posix.1g specifies that the return must be ECONNABORTED (software caused error). The server can then ignore the error and call accept again. [Figure: client/server timeline. Client sends SYN; server replies SYN, ack; client sends ack; client aborts and sends RST; server calls accept after the RST.]
TCP Echo server • Server process (not host) crash scenario (server on separate host from client). • When the server process is killed with 'kill -9' a FIN will be sent to the client and the client TCP will respond with an ACK. • SIGCHLD is sent to the server parent. • Client is blocked in a call to fgets. • Client will still send data; receipt of FIN by the client TCP only indicates that the server has closed its end of the connection and will not be sending any more data. • When the server receives the data it responds with an RST. • Client will not see the RST because it calls readline which returns 0 (EOF) because of the FIN. • The client quits with "server terminated prematurely" error message. • Client is working with two descriptors and should block on BOTH.
SIGPIPE signal. • When a process writes to a socket that has received an RST the SIGPIPE signal is sent to the process. • Default action of this signal is to terminate the process. • Process must therefore catch this signal to avoid being terminated. • Set the action on SIGPIPE to SIG_IGN. • Interesting example of machines with different architectures being unable to communicate (big-endian Solaris and little-endian Intel) due to differences in what constitutes an integer. Page 138 - 139
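Ignoring SIGPIPE turns the fatal signal into an ordinary error return. The sketch below makes one loud assumption: it uses a pipe whose read end has been closed in place of a socket that has received an RST, because that condition is easy to reproduce in one process and raises SIGPIPE in the same way. The helper name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Write to a pipe with no reader while SIGPIPE is ignored; return errno. */
int write_errno_with_sigpipe_ignored(void)
{
    int fd[2];
    struct sigaction ign, old;

    ign.sa_handler = SIG_IGN;           /* set the action on SIGPIPE to SIG_IGN */
    sigemptyset(&ign.sa_mask);
    ign.sa_flags = 0;
    if (sigaction(SIGPIPE, &ign, &old) < 0)
        return -1;
    if (pipe(fd) < 0)
        return -1;
    close(fd[0]);                       /* stand-in for the peer's RST */

    errno = 0;
    ssize_t n = write(fd[1], "x", 1);   /* would kill the process by default */
    int err = errno;
    assert(n == -1);

    close(fd[1]);
    sigaction(SIGPIPE, &old, NULL);     /* restore the previous action */
    return err;
}
```

The process survives and write reports EPIPE, which the application can then handle like any other error.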
ASSIGNMENT: • DUE: FRIDAY, SEPTEMBER 20, 2002. • Undergrads: Problems 5.1, 5.2, 5.3, 5.4 • Graduates: Problems 5.1 through 5.7. • All code in both printed format (paper) and binary executable on disc. • Forward email homework to: vvenkatra@yahoo.com • In email correspondence please put your NAME and Class number in the title; Please describe yourself as either • U (undergrad) • G (graduate)
Chapter Six: I/O multiplexing • The developed code requires the ability to have the kernel notify the process if one or more I/O conditions are ready • (input ready to be read, descriptor ready for more data to be output). • I/O multiplexing is used to: • Handle multiple descriptors • Handle both a listening and a connected socket. • Handle TCP and UDP simultaneously. • Handle multiple services and multiple protocols (Inetd). • I/O multiplexing may be used by any application - it is not restricted to network applications.
I/O multiplexing • Five I/O models available under Unix. • Blocking I/O • Non-Blocking I/O • I/O Multiplexing (select and poll) • Signal driven I/O (SIGIO) • Asynchronous I/O (Posix.1g) • Blocking I/O model: the most prevalent usage. • In the blocking model the system does not return from the call until the data is read/written or a socket connected (or error). • The process is ‘blocked’ the entire time. • This is the model used thus far in our client/server designs. • The problem is that the blocking I/O call, which is a system call, may be interrupted by a signal.
Normally two distinct phases for input operations: • waiting for the data to be ready • copying the data from the kernel to the process. (kernel buffer to application buffer). • Non-Blocking I/O model. • Tells the kernel that upon invocation if not successful immediately then return an error. • By successful it is implied that data is either pushed or pulled. • When an app sits in a loop calling 'recvfrom' on a non-blocking descriptor the result is referred to as 'polling'. • The ‘polling’ model. • Moronic in extremis. • Only seen in dedicated systems such as war fighting engines, aircraft, real-time control systems, et cetera.
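The "return an error immediately" behavior can be seen with a few lines of code. This is a sketch with hypothetical helper names, and it assumes a pipe instead of a UDP socket with recvfrom; the descriptor is switched to non-blocking mode with fcntl and O_NONBLOCK.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Read from an empty non-blocking pipe and return the resulting errno. */
int empty_nonblocking_read_errno(void)
{
    int fd[2];
    char buf[16];

    if (pipe(fd) < 0)
        return -1;
    if (set_nonblocking(fd[0]) < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    errno = 0;
    ssize_t n = read(fd[0], buf, sizeof buf);   /* no data: returns at once */
    int err = errno;
    assert(n == -1);

    close(fd[0]);
    close(fd[1]);
    return err;
}
```

The error is EAGAIN (EWOULDBLOCK on some systems). A polling application would simply call read again in a loop, burning CPU time until data arrives.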
Signal Driven I/O • Can use SIGIO signal to determine when a descriptor is ready. • Install a signal handler using sigaction system call. • When the datagram is ready the SIGIO signal is generated and the signal handler catches it. • Advantage of this model is that there is no blocking waiting for the signal to arrive. Asynchronous I/O model • Main difference between Signal model and Asynch model is that the asynch model communicates with the process when the entire operation is COMPLETE. • In signal driven I/O the signal communicates when the I/O operation may be initiated.
I/O Models • A synchronous I/O operation causes the requesting process to be blocked until the I/O operation is complete. • Asynchronous I/O does not cause the requesting process to be blocked. • SELECT function int select (int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, const struct timeval *timeout) • The select function allows the process to instruct the kernel to wait for any one of multiple events to occur and to wake up the process upon this event occurring. • The descriptors referenced do NOT have to be sockets; any descriptor is eligible for monitoring.
Select function • timeval struct struct timeval { long tv_sec; /* seconds */ long tv_usec; /* microseconds */ }; • Three possibilities: • Wait forever - specify timeout arg as a null pointer. • Wait a fixed amount of time • Do not wait at all - specify a timeval struct that is set to zero. • This approach is effectively the same as polling. • Select will be interrupted if a signal is caught; on BSD will never auto restart - SVR4 will restart if the SA_RESTART flag is specified in the signal handler.
Select function • The const qualifier on timeval means that the argument is not modified by a return on an event occurring. • Therefore can only know time remaining by getting sys time prior to select invoke, get sys time again on return and then subtracting the diff from timeval value. • Linux modifies timeval. • The readset, writeset and exceptset specify the descriptors to be tested by the kernel. • Each bit in these integers corresponds to a descriptor. • The first element of the 'set' array refers to descriptors 0 to 31, second element refers to 32 to 63 et cetera. • Max number of descriptors is 1024 though most systems specify far fewer. (32 , 32 bit integers in the set). • FD_SETSIZE is the number of descriptors in the fd_set datatype.
Select function • select modifies the descriptor sets. • When select is called the descriptors of interest are specified; upon return the result indicates which descriptors are ready. • Must turn on ALL the bits of interest whenever select is called since any fd that is not ready will have its bit cleared in the descriptor set. • maxfdp1 is the number of descriptors to be tested - therefore it must be ONE more than the largest descriptor we are using, since descriptors start at 0 (specifying the NUMBER of descriptors and not the largest value - similar to array math). • Can use select as a precision timer by setting all fd sets to null pointer and then using a timeval structure.
Select function • The return value from select indicates the number of bits that are ready across all the sets. • If the timer expires before anything becomes ready 0 is returned. • A return of -1 indicates an error (probably signal caught). • Implementation details are hidden in the fd_set datatype. There are four supporting macros; • void FD_ZERO (fd_set *fdset); // clear all the bits in fdset • void FD_SET (int fd, fd_set *fdset); // turn on bit fd in fdset • void FD_CLR (int fd, fd_set *fdset); // turn off the bit for fd in fdset • int FD_ISSET (int fd, fd_set *fdset); // test if bit fd is on in fdset.
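The macros and the maxfdp1 rule fit together as follows. This is a minimal sketch with a hypothetical helper name, wait_readable, that watches a single descriptor; a real client or server would set several bits before each call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 1 if fd becomes readable within timeout_ms, 0 on timeout, -1 on error. */
int wait_readable(int fd, int timeout_ms)
{
    assert(fd >= 0 && fd < FD_SETSIZE);     /* fd_set cannot hold larger values */

    fd_set rset;
    FD_ZERO(&rset);                         /* clear all the bits */
    FD_SET(fd, &rset);                      /* turn on the bit of interest */

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* maxfdp1 is the largest descriptor plus one */
    int nready = select(fd + 1, &rset, NULL, NULL, &tv);
    if (nready < 0)
        return -1;                          /* error, possibly EINTR */
    return (nready > 0 && FD_ISSET(fd, &rset)) ? 1 : 0;
}
```

Called on the read end of an empty pipe it returns 0 once the timeout expires; after one byte is written to the pipe it returns 1 at once. Because select clears the bits of descriptors that are not ready, the set is rebuilt on every call.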
Select function • Sockets and ready descriptors. A socket is ready for reading if any of the following obtain. • Number of bytes of data in the receive buffer >= the current size of the low-water mark for the receive buffer. (low-water mark defaults to 1 for TCP) • The read half of the connection is closed ( has received a FIN). • The socket is a listening socket and the number of completed connections (?) is nonzero. • A socket error is pending. errno will be set to the specific error condition. Pending errors will be cleared.
select function • Socket is ready for writing (descriptor indicates ready) iff, • Number of bytes of available space in the socket send buffer is >= the current low-water mark for the send buffer AND either the socket is connected or the socket doesn’t require a connection. Low water mark default is 2048 for TCP. • Write half of connection is closed. An attempt to write will generate SIGPIPE. • A socket error is pending. A write op will return with errno set to the spec’ed error. • When an error occurs on a socket it is marked both readable and writable by select.
select function • str_cli function (slight return) • Now instead of blocking in a call to fgets (fputs) we block on the call to select and can demultiplex the signals. • Why is this good? • Because we no longer have an interrupted system call. • The client now can handle three separate conditions: • If the peer TCP sends data the socket becomes readable and select will notify the process. • If the peer TCP sends a FIN (peer terminates) the socket becomes readable and read returns EOF. • If the peer TCP sends a RST (peer has bounced) the socket again becomes readable and read returns a -1 and errno has the specific error code.
In the example there only needs to be one descriptor set, the read set (rset). • Sample code; void str_cli (FILE *fp, int sockfd) { int maxfdp1; fd_set rset; FD_ZERO (&rset); for (; ;) { FD_SET (fileno(fp), &rset); FD_SET (sockfd, &rset); maxfdp1 = max(fileno(fp), sockfd) + 1; Select (maxfdp1, &rset, NULL, NULL, NULL); …. if (FD_ISSET(sockfd, &rset)) { // socket is readable
If in our echo server-client model we redirect the input/output to files then the size of the output file is always smaller than the input file. • The problem arises in our example because the last of the input will be transmitted prior to the receipt of the last reply. (There is a finite amount of time spent in the pipe). • If we close the client on the transmission of the last of the input then the server has not replied to all inputs. (on the transmission of the EOF there is a return to main and a subsequent termination). • Therefore what is needed is a way to close one-half of the TCP connection.
shutdown function • Half-close. Desired behavior is to send a FIN to the server to let the server know that data transmission has ended but at the same time we wish to leave the socket descriptor open for reading. • Thus we have the shutdown function. • close() decrements the descriptor reference count (and closes the socket ONLY if the count is = 0). • shutdown() allows the initiation of the normal TCP termination sequence regardless of the reference count. • close() terminates BOTH directions of data transfer.
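int shutdown (int sockfd, int howto); [Figure: half-close timeline between client and server. The client calls write, write, shutdown, which sends DATA, DATA, FIN. The server acknowledges the data and the FIN, and its read returns 0. The server then calls write, write, close, which sends DATA, DATA, FIN, and the client acknowledges the data and the FIN.]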
shutdown() • howto argument: • SHUT_RD where the read-half of the connection is closed. No more data will be received on that socket and any buffer data is lost (for the issuing socket). • SHUT_WR write half of the connection is closed. Any data currently in the socket send buffer will be sent, followed by TCP’s normal termination sequence (done without reference to the socket descriptors reference count). • SHUT_RDWR read and write halves are both closed. Equivalent to calling shutdown twice (SHUT_RD, SHUT_WR).
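The half-close can be tried out in a single process. The sketch below rests on an explicit assumption: it uses a connected AF_UNIX stream pair from socketpair in place of a TCP connection, so no FIN segments are involved, but shutdown with SHUT_WR has the same visible effect (the peer reads EOF while the other direction stays open). The helper name half_close_demo is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the number of reply bytes the "client" reads after half-closing. */
int half_close_demo(void)
{
    int sv[2];
    char buf[16];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    int client = sv[0], server = sv[1];

    if (write(client, "ping", 4) != 4)
        return -1;
    if (shutdown(client, SHUT_WR) < 0)      /* done sending, still reading */
        return -1;

    if (read(server, buf, sizeof buf) != 4) /* buffered data arrives first */
        return -1;
    if (read(server, buf, sizeof buf) != 0) /* then EOF from the half-close */
        return -1;

    if (write(server, "pong", 4) != 4)      /* the reply still flows back */
        return -1;
    close(server);

    ssize_t n = read(client, buf, sizeof buf);
    assert(n < 0 || memcmp(buf, "pong", (size_t)n) == 0);
    close(client);
    return (int)n;
}
```

Had the client called close instead of shutdown, the final read would have been impossible, which is exactly the truncated-output problem of the redirected echo client.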
return to the echo server (page 162) • rewrite the server as a single process which uses select() to handle any number of clients. • This is done as opposed to forking one child per client. • The server maintains only a read descriptor set. • descriptors 0, 1,2 are for stdin, stdout, stderr. • Therefore listening socket fd will be 3 • Setup an array of int named client. • All elements in this array are initialized to -1. • When first client establishes a connection with the server, the listening descriptor becomes readable and the server calls accept(). • The new connected descriptor returned by accept() will have an fd of 4. • This value will be stored in the client array.
When a connected client sends a FIN the descriptor will become readable. • The server will then close the socket after a readline returns 0. • client[0] is set to -1 • descriptor 4 in the descriptor set is set to 0. nready = Select(maxfd + 1, &rset, NULL, NULL, NULL); /* block in select; wait for FIN, RST, connect or data becoming available */ if ( FD_ISSET (listenfd, &rset) ) { /* new client connection */ connfd = Accept(listenfd, (SA *) &cliaddr, &clilen); /* add descriptor to client array */ } for (i = 0; i <= maxi; i++) { /* check all clients for data */ if ( (sockfd = client[i]) < 0) continue; if (FD_ISSET(sockfd, &rset)) { /* socket is readable */ /* read and close or write as necessary */ } }
pselect function (Posix.1g) int pselect (int maxfdp1, fd_set *readset, fd_set *writeset, fd_set *exceptset, const struct timespec *timeout, const sigset_t *sigmask); • Two changes from the select function; • uses the timespec struct instead of timeval (specifies nanoseconds rather than microseconds). • Adds a 6th argument, a pointer to a signal mask. • The signal mask lets the program disable the delivery of certain signals, test some global variables that the handlers for those signals set, and then call pselect, telling it to reset the signal mask.
The pselect function permits the following problem to be resolved; (the signal handler for SIGINT sets the global intr_flag and returns). if (intr_flag) handle_intr(); /* handle the signal */ if ( (nready = select( ……) ) < 0 ) { if (errno == EINTR) { if (intr_flag) handle_intr(); } } • If the signal occurs between the test of intr_flag and the call to select, it will be lost if select blocks.
Using the pselect. sigset_t newmask, oldmask, zeromask; sigemptyset (&zeromask); sigemptyset (&newmask); sigaddset (&newmask, SIGINT); sigprocmask(SIG_BLOCK, &newmask, &oldmask); /* block SIGINT */ if (intr_flag) handle_intr(); if ( (nready = pselect (…., &zeromask)) < 0) { if (errno == EINTR) { if (intr_flag) handle_intr(); } }
pselect (continued) • Before testing intr_flag, block SIGINT. When pselect is called, it replaces the signal mask of the process with an empty set (zeromask). • pselect then checks the descriptors, possibly going to sleep. • When pselect returns, the signal mask of the process is reset to its value before pselect was called (SIGINT is blocked). • Net result is that the signal won’t be lost while pselect blocks.
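The "signal is not lost" claim can be demonstrated directly. The sketch below makes several assumptions that are not in the slides: it uses SIGUSR1 instead of SIGINT so that it does not interfere with the terminal, it raises the signal itself while the signal is blocked to stand in for one arriving between the flag test and the call, and it passes no descriptors to pselect. All names except the standard calls are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t intr_flag;

static void on_sigusr1(int signo)
{
    (void)signo;
    intr_flag = 1;
}

/* Returns 1 if a signal raised while blocked interrupts pselect, 0 if not. */
int pselect_sees_pending_signal(void)
{
    struct sigaction sa, old_sa;
    sigset_t newmask, oldmask, zeromask;

    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, &old_sa) < 0)
        return -1;

    sigemptyset(&zeromask);
    sigemptyset(&newmask);
    sigaddset(&newmask, SIGUSR1);
    sigprocmask(SIG_BLOCK, &newmask, &oldmask);   /* block before testing */

    intr_flag = 0;
    raise(SIGUSR1);             /* signal arrives after the flag was tested */
    assert(intr_flag == 0);     /* still blocked, so the handler has not run */

    /* pselect installs zeromask atomically, so the pending signal is delivered */
    struct timespec timeout = { 2, 0 };
    int nready = pselect(0, NULL, NULL, NULL, &timeout, &zeromask);
    int interrupted = (nready < 0 && errno == EINTR && intr_flag == 1);

    sigprocmask(SIG_SETMASK, &oldmask, NULL);
    sigaction(SIGUSR1, &old_sa, NULL);
    return interrupted;
}
```

The call returns with EINTR instead of sleeping for the full two seconds. With plain select and an unblocked signal, a signal arriving just before the call would have run the handler too early and select would have slept regardless.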
poll function int poll (struct pollfd *fdarray, unsigned long nfds, int timeout); • First arg is a pointer to the first element of an array of structures. Each element of the array is a struct that specifies the conditions to be tested for a given descriptor fd. struct pollfd { int fd; /* descriptor to be checked */ short events; /* events of interest on fd */ short revents; /* events that occurred on fd */ }; • The conditions to be tested for are spec’ed by the events member.
poll continued • The poll() function will return the status of the descriptor in the corresponding revents member. • This avoids value / result arguments. • Each of the two members (events, revents) is composed of one or more bits that spec a defined condition. • POLLIN, POLLRDNORM, POLLRDBAND, POLLPRI are input events. • POLLIN normal or priority band data can be read • POLLRDNORM normal data can be read • POLLRDBAND priority band data can be read. • POLLPRI high priority data can be read.
poll continued; • POLLOUT, POLLWRNORM, POLLWRBAND deal with output events. • POLLOUT normal data can be written • POLLWRNORM normal data can be written (equivalent to POLLOUT) • POLLWRBAND priority band data can be written. • POLLERR, POLLHUP, POLLNVAL deal with errors. • POLLERR indicates an error has occurred. • POLLHUP indicates that there has been a hangup. • POLLNVAL the fd is not an open file.
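A minimal sketch of the events / revents pair in use follows. The helper name poll_read_events is hypothetical, and it watches a single descriptor for POLLIN only; error conditions such as POLLERR, POLLHUP and POLLNVAL are reported in revents even though they are never requested in events.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Returns the revents bits for fd, 0 on timeout, or -1 on error. */
int poll_read_events(int fd, int timeout_ms)
{
    assert(fd >= 0);            /* poll silently ignores negative descriptors */

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.events = POLLIN;        /* condition we are interested in */
    pfd.revents = 0;            /* filled in by the kernel */

    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0)
        return -1;
    return n == 0 ? 0 : pfd.revents;
}
```

On the read end of an empty pipe the call times out and returns 0; once a byte has been written the result has the POLLIN bit set. Unlike select, the events member is left untouched, so the array does not have to be rebuilt before the next call.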