Transmission Control Protocol (TCP) (Reliable Byte-Stream)
Outline • Transport Protocols (multiplexing and demultiplexing) • Sliding Window Revisited • Flow Control • Adaptive Timeout • Connection Establishment/Termination
End-to-End Protocols • The underlying best-effort IP network may: • drop messages • reorder messages • deliver duplicate copies of a given message • limit messages to some finite size • deliver messages after an arbitrarily long delay • Common end-to-end services • guarantee message delivery • deliver messages in the same order they are sent • deliver at most one copy of each message • support arbitrarily large messages • allow the receiver to flow control the sender • support multiple application processes on each host
Distinguishing between processes • Several processes may use the same transport (end-to-end) protocol • How can we distinguish between them? (demultiplexing) • OS (e.g. Unix) process IDs • Not good, we want the protocol to be OS independent • Port #’s: the transport protocol has a list of port #’s (or mailboxes) • Good, it is OS independent • The application requests a port • The transport layer (i.e., TCP in the OS) gives it an unused port number
Multiplexing/demultiplexing • Multiplexing at send host: • gather data from multiple ports (applications) • envelope data with TCP header • TCP header contains port # info • Demultiplexing at rcv host: • deliver received segments to correct port (application) • use port # in TCP header to decide which application • Figure: hosts 1, 2, and 3, each with a protocol stack of application, transport (TCP), network (IP), and datalink (LAN); application processes P1a, P1b, P2a, P2b, P2c, P3c; each port corresponds to one process
How demultiplexing works • IP datagram header: • has source IP address, • destination IP address, • transport protocol number • identifies the transport protocol of the application (e.g. TCP, UDP) • Figure: IP header layout (bit offsets 0, 4, 8, 16, 19, 31): Version, HLen, TOS, Length; Ident, Flags, Offset; TTL, Protocol, Checksum; SourceAddr; DestinationAddr; Options (variable), Pad (variable); Data
How demultiplexing works (continued) • Each IP datagram carries 1 transport-layer segment (see figure) • transport-layer header has • source port # • destination port # (well-known for specific applications) • Transport protocol uses port # to give data to the application process • Note the absence of source/destination addresses • Figure: TCP/UDP segment format (32 bits wide): source port #, dest port #; other header fields; application data (message)
Simple Demultiplexor (UDP) • Unreliable and unordered datagram service • Adds multiplexing/demultiplexing • No flow control • Endpoints identified by ports • servers have well-known ports • see /etc/services on Unix • Header format • four 16-bit fields (bit offsets 0, 16, 31): SrcPort, DstPort, Length, Checksum, followed by Data • Optional checksum • pseudo header + UDP header + data
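The UDP header described on this slide can be sketched with Python's `struct` module. This is a minimal illustration, not a full UDP implementation: the field order follows RFC 768 (source port, destination port, length, checksum), the function names are hypothetical, and the checksum is left at 0 (meaning "not computed") rather than calculated over the pseudo header.

```python
import struct

# Four 16-bit fields in network byte order: SrcPort, DstPort, Length, Checksum.
UDP_HEADER = struct.Struct("!HHHH")


def build_udp_datagram(src_port, dst_port, payload, checksum=0):
    """Prepend a UDP header to payload (checksum 0 = not computed)."""
    length = UDP_HEADER.size + len(payload)  # Length covers header + data
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload


def parse_udp_datagram(datagram):
    """Return (src_port, dst_port, checksum, payload)."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(datagram)
    return src_port, dst_port, checksum, datagram[UDP_HEADER.size:length]
```

The receiver only needs `dst_port` from the parsed header to pick the application queue, which is all the demultiplexing UDP adds on top of IP.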
Connectionless demux (cont’d) • Client A sends message to server using “well-known” destination port 64 • Client expects a response • Server responds using as destination port # the source port # of the request • Figure: Client A (process P1) sends SP: 9157, DP: 64 and the server replies with SP: 64, DP: 9157; Client B sends SP: 5775, DP: 64 and the server replies with SP: 64, DP: 5775
Well-Known Ports • What if you want to talk to a server running a typical application (HTTP, FTP, etc.)? • Each application has a fixed well-known port # (standardized) • The application first sends messages to the well-known port
Multiple Processes at Server • An interaction between two machines is identified by the five-tuple (source-IPAddr, dest-IPAddr, transport prot #, src port #, dst port#) • The OS maintains a queue of data (stream) for each of these tuples • If the server needs to create a special process to handle a new connection • Spawns a new process • The new process gets a pointer to all the streams (i.e. connections) of the parent process. • The new process continues to serve the remote application • The parent process closes the stream of the remote application. • The remote application does not notice the difference
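The per-tuple queues mentioned above can be sketched as a dictionary keyed by the five-tuple. This is only an illustration: the class and method names are hypothetical, and a real OS keeps far more state per connection than a queue of byte strings.

```python
from collections import defaultdict, deque


class ConnectionTable:
    """Maps each five-tuple to its own queue of received data."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def deliver(self, src_ip, dst_ip, proto, src_port, dst_port, data):
        # The five-tuple is the demultiplexing key.
        key = (src_ip, dst_ip, proto, src_port, dst_port)
        self.queues[key].append(data)
        return key

    def read(self, key):
        # Return the oldest queued data for this connection, if any.
        return self.queues[key].popleft() if self.queues[key] else None
```

Two clients that happen to pick the same source port still land in different queues, because their source IP addresses differ.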
TCP Overview • Connection-oriented • Byte-stream • app writes bytes • TCP sends segments • app reads bytes • Full duplex • Flow control: keep sender from overrunning receiver • Congestion control: keep sender from overrunning network • Figure: the sending application process writes bytes into TCP's send buffer; TCP transmits segments; the receiving TCP places them in its receive buffer, from which the application process reads bytes
Data Link Versus Transport Protocols • Transport protocols have to handle all of the following, which may not be necessary in a datalink protocol. • Connect to many different hosts • need explicit connection establishment and termination • Different RTT’s of different sources • need adaptive timeout mechanism • Tolerate long delay in network • need to be prepared for arrival of very old packets • Handle packet reorder • Different capacity at destination • need to accommodate different node capacity • Different network capacity • need to be prepared for network congestion
TCP segment structure • Layout (32 bits wide): source port #, dest port #; sequence number; acknowledgement number; head len, not used, flags U A P R S F, receive window; checksum, urg data pointer; options (variable length); application data (variable length) • sequence and acknowledgement numbers: counting by bytes of data (not segments!) • URG: urgent data (generally not used) • ACK: ACK # valid • PSH: push data now (generally not used) • RST, SYN, FIN: connection estab (setup, teardown commands) • receive window: # bytes rcvr willing to accept • checksum: Internet checksum (as in UDP)
TCP services/components • Connection management • Reliability • Sequence numbers and ACKs • Time out mechanism • Flow control • sender will not overwhelm receiver • Congestion control • Prevent overflow along the path to the destination
Data Transfer • Each connection identified with 4-tuple: • (SrcPort, SrcIPAddr, DstPort, DstIPAddr) • Sliding window + flow control • acknowledgment, SequenceNum, AdvertisedWindow • Checksum • pseudo header + TCP header + data • Figure: sender to receiver: Data (SequenceNum); receiver to sender: Acknowledgment (SequenceNum) + AdvertisedWindow
TCP seq. #’s and ACKs • Seq. #’s: byte stream “number” of first byte in segment’s data • ACKs: seq # of next byte expected from other side • cumulative ACK • Q: how receiver handles out-of-order segments • A: TCP spec doesn’t say, up to implementor • Simple telnet scenario: • User at Host A types ‘C’; A to B: Seq=42, ACK=79, data = ‘C’ • Host B ACKs receipt of ‘C’, echoes back ‘C’; B to A: Seq=79, ACK=43, data = ‘C’ (ACKs are piggybacked) • Host A ACKs receipt of echoed ‘C’; A to B: Seq=43, ACK=80
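The numbers in the telnet scenario follow from one rule: the ACK field is the sequence number of the received segment plus the number of data bytes it carried. A small sketch (the `Segment` type and `reply` helper are hypothetical names used only for illustration):

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seq: int           # byte-stream number of the first data byte
    ack: int           # next byte expected from the other side
    data: bytes = b""


def reply(received, next_seq, data=b""):
    """Build a segment that acknowledges `received` (piggybacked ACK)."""
    return Segment(seq=next_seq,
                   ack=received.seq + len(received.data),
                   data=data)


# Host A sends 'C'; Host B echoes it; Host A acknowledges the echo.
a1 = Segment(seq=42, ack=79, data=b"C")
b1 = reply(a1, next_seq=a1.ack, data=b"C")
a2 = reply(b1, next_seq=a1.seq + len(a1.data))
```

Here `b1` carries Seq=79, ACK=43 and `a2` carries Seq=43, ACK=80, matching the slide.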
TCP: retransmission scenarios • Lost ACK scenario: • A to B: Seq=92, 8 bytes data • B to A: ACK=100, lost (X) • Seq=92 timeout expires at A; A resends Seq=92, 8 bytes data • B to A: ACK=100 • LastByteAcked = 100 • Premature timeout scenario: • A to B: Seq=92, 8 bytes data, then Seq=100, 20 bytes data • B to A: ACK=100, then ACK=120 • Seq=92 timeout expires at A before the ACKs arrive; A resends Seq=92, 8 bytes data • LastByteAcked = 100, then LastByteAcked = 120 as the ACKs arrive • B to A: ACK=120 in response to the retransmission
TCP retransmission scenarios (more) • Cumulative ACK scenario: • A to B: Seq=92, 8 bytes data, then Seq=100, 20 bytes data • B to A: ACK=100, lost (X) • B to A: ACK=120, arrives before the timeout • LastByteAcked = 120, so nothing is retransmitted
TCP ACK generation [RFC 1122, RFC 2581] • Event at receiver: arrival of in-order segment with expected seq #, all data up to expected seq # already ACKed. TCP receiver action: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK • Event at receiver: arrival of in-order segment with expected seq #, one other segment has ACK pending. TCP receiver action: immediately send single cumulative ACK, ACKing both in-order segments • Event at receiver: arrival of out-of-order segment with higher-than-expected seq. #, gap detected. TCP receiver action: immediately send duplicate ACK, indicating seq. # of next expected byte • Event at receiver: arrival of segment that partially or completely fills gap. TCP receiver action: immediately send ACK, provided that segment starts at lower end of gap
Example • Assume bytes 0 .. 79 have been sent, received, and acknowledged. • Assume the sender sends the following segments (of 20 bytes each) • 80, 100, 120, 140, 160 • Assume they are received in the following order • 80 – receiver does not send an ack (delayed ACK) • 100 – receiver sends an ack(120) because 80 was not acked • 140 – gap detected, immediately send ack(120) • 160 – gap detected, immediately send ack(120) • 120 – gap closed, immediately send ack(180)
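The example can be replayed with a small receiver model. This is a sketch under simplifying assumptions: the 500 ms delayed-ACK timer is not modeled (a delayed ACK is reported as `None`), segments do not overlap, and the class name and fields are hypothetical.

```python
class AckingReceiver:
    """Decides which ACK (if any) to send for each arriving segment."""

    def __init__(self, next_expected):
        self.next_expected = next_expected
        self.buffered = {}        # seq -> length of out-of-order segments
        self.ack_pending = False  # True while a delayed ACK is outstanding

    def on_segment(self, seq, length):
        """Return the ACK number sent now, or None if the ACK is delayed."""
        if seq > self.next_expected:
            # Gap detected: buffer the data, send a duplicate ACK at once.
            self.buffered[seq] = length
            self.ack_pending = False
            return self.next_expected
        if seq < self.next_expected:
            # Old duplicate: re-acknowledge what we already have.
            return self.next_expected
        filled_gap = bool(self.buffered)
        self.next_expected += length
        # Absorb any buffered segments that are now in order.
        while self.next_expected in self.buffered:
            self.next_expected += self.buffered.pop(self.next_expected)
        if filled_gap or self.ack_pending:
            self.ack_pending = False
            return self.next_expected
        self.ack_pending = True   # first in-order segment: delay the ACK
        return None


rx = AckingReceiver(next_expected=80)
acks = [rx.on_segment(seq, 20) for seq in (80, 100, 140, 160, 120)]
```

With the arrival order from the slide, `acks` comes out as no ACK, then 120, 120, 120, and finally 180.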
Fast Retransmit • Time-out period often relatively long: • long delay before resending lost packet • Detect lost segments via duplicate ACKs. • Sender often sends many segments back-to-back • If segment is lost, there will likely be many duplicate ACKs. • If sender receives 3 duplicate ACKs for the same data, it supposes that segment after ACKed data was lost: • fast retransmit: resend segment before timer expires • Note, this assumes that reordering is rare (which may not necessarily be true).
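The duplicate-ACK rule can be sketched as a counter on the sender. This is a minimal illustration with hypothetical names; it ignores the retransmission timer and congestion control, which a real sender combines with this logic.

```python
class FastRetransmitSender:
    DUP_ACK_THRESHOLD = 3  # duplicates, not counting the original ACK

    def __init__(self):
        self.last_ack = None
        self.dup_count = 0

    def on_ack(self, ack):
        """Return the seq # to retransmit now, or None."""
        if ack == self.last_ack:
            self.dup_count += 1
            if self.dup_count == self.DUP_ACK_THRESHOLD:
                # The segment starting at `ack` is presumed lost.
                return ack
        else:
            # New cumulative ACK: reset the duplicate counter.
            self.last_ack = ack
            self.dup_count = 0
        return None


tx = FastRetransmitSender()
decisions = [tx.on_ack(a) for a in (100, 120, 120, 120, 120)]
```

The retransmission fires on the fourth ACK for 120, that is, the original plus three duplicates.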
Sliding Window Revisited • Sending side • LastByteAcked ≤ LastByteSent ≤ LastByteWritten • buffer bytes between LastByteAcked+1 and LastByteWritten • Receiving side • LastByteRead < NextByteExpected ≤ LastByteRcvd+1 • buffer bytes between LastByteRead+1 and LastByteRcvd • Figure: the sending application writes into the sending TCP's buffer (pointers LastByteAcked, LastByteSent, LastByteWritten); the receiving application reads from the receiving TCP's buffer (pointers LastByteRead, NextByteExpected, LastByteRcvd)
Preventing Buffer Overflow • Send buffer size: MaxSendBuffer • Receive buffer size: MaxRcvBuffer • Restrictions on the speed of the sender’s application • Application increases LastByteWritten when it creates more data. • Block sender’s application if the following would be violated: • LastByteWritten - LastByteAcked ≤ MaxSendBuffer • Restrictions caused by a slow receiver’s application • Receiver application removes bytes and increases LastByteRead • Drop new packets if accepting them would violate • LastByteRcvd – LastByteRead ≤ MaxRcvBuffer
Flow Control • AdvertisedWindow is received in every ACK • If ack(X) has AdvertisedWindow = 10, then • rcvr has buffer space for bytes X up to X + 9. • Restrictions on sender (due to advertised window) • LastByteSent≤LastByteAcked + AdvertisedWindow • LastByteSent≤LastByteAcked + CongestionWindow • More on the congestion window in a later chapter • Therefore, two “windows” restrict the sender’s ability to send a new message
Flow Control (cont) • Receiving side • LastByteRcvd – LastByteRead ≤ MaxRcvBuffer • AdvertisedWindow sent in every ACK(NextByteExpected) is the buffer size minus the “green” portion (the in-order bytes not yet read by the application) • AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead) • Sender persists when AdvertisedWindow = 0 (send a message with 1 byte to get back an ack, to prevent deadlock) • Figure: receiving application reading from the TCP receive buffer, with pointers LastByteRead, NextByteExpected, LastByteRcvd
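The two flow-control formulas can be checked with a few lines of arithmetic. This is a sketch: the function names are hypothetical, the advertised window is taken as the buffer size minus the in-order bytes not yet read, that is, MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead), and `sendable_bytes` is simply the sender constraint LastByteSent ≤ LastByteAcked + AdvertisedWindow rearranged.

```python
def advertised_window(max_rcv_buffer, next_byte_expected, last_byte_read):
    """Free receive-buffer space announced in ACK(next_byte_expected)."""
    unread_in_order = (next_byte_expected - 1) - last_byte_read
    return max_rcv_buffer - unread_in_order


def sendable_bytes(advertised, last_byte_sent, last_byte_acked):
    """How many more bytes the sender may put in flight."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, advertised - in_flight)


# Receiver: 100-byte buffer, bytes 41..50 received in order but unread.
win = advertised_window(100, next_byte_expected=51, last_byte_read=40)
# Sender: bytes 51..70 already sent and not yet acknowledged.
room = sendable_bytes(win, last_byte_sent=70, last_byte_acked=50)
```

In this example the receiver advertises 90 bytes and the sender may still send 70.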
Silly Window Syndrome • Maximum Segment Size (MSS) • What if window (congestion and advertised) allows us to send << MSS for the next msg? • Transmit? Wait for window to increase? (congestion-window or advertised-window growth) • Small segments can “hang around” forever: • If my congestion window is closed (LastByteSent – LastByteAcked = Cwindow), and I receive an ack for 10 more bytes, then I can only send 10 more bytes (i.e. a small packet)
Receiver’s help for Silly Window Syndrome • Rcvr does not send an advertised-window update with a value less than min(MSS, half empty buffer) • This takes care of the advertised window; what about the congestion window?
Nagle’s Algorithm • How long to wait before sending data less than MSS? Use ack’s as a clocking mechanism • You want to have at most one small message in flight • When the application produces data to send : • If both the available data and the window ≥ MSS • Send a full segment • else • If there is unACKed data in flight • Don’t send the data (just buffer it) • else • Send the new data now
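The decision rule above fits in one function. This is a sketch of the rule as stated on the slide, with hypothetical parameter names; real stacks add details such as timers and the option to disable the algorithm per socket.

```python
def nagle_should_send(available_bytes, window, mss, unacked_in_flight):
    """Return True if the sender should transmit now, False to keep buffering."""
    if available_bytes >= mss and window >= mss:
        return True            # a full segment can go out
    if unacked_in_flight:
        return False           # wait for the ACK to clock out the small data
    return True                # nothing in flight: send the small segment now
```

For example, with an MSS of 1460 a 10-byte write is sent immediately on an idle connection, but is held back while earlier data is still unacknowledged.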
Adaptive Retransmission (Original Algorithm) • The RTT is different for each destination • Compute an average of the RTT, and set your timer accordingly • Measure SampleRTT for each segment/ACK pair • Compute weighted average of RTT • EstRTT = a x EstRTT + b x SampleRTT • where a + b = 1 • a between 0.8 and 0.9 • b between 0.1 and 0.2 • Set timeout based on EstRTT • TimeOut = 2 x EstRTT
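The weighted average can be written directly from the formula. A minimal sketch, assuming b = 1 - a and a default of a = 0.875 (a value inside the 0.8 to 0.9 range given above, chosen here only for illustration); the function names are hypothetical.

```python
def update_est_rtt(est_rtt, sample_rtt, a=0.875):
    """EstRTT = a x EstRTT + b x SampleRTT, with b = 1 - a."""
    return a * est_rtt + (1 - a) * sample_rtt


def retransmit_timeout(est_rtt):
    """TimeOut = 2 x EstRTT."""
    return 2 * est_rtt
```

With EstRTT = 100 ms, a = 0.9 and a new sample of 200 ms, the estimate moves to 110 ms and the timeout to 220 ms, which shows how slowly the average reacts to a single large sample.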
Jacobson/Karels Algorithm • New calculations for average RTT: take variance into account • If variance is too big/small, EstRTT may not be useful. • Diff = SampleRTT - EstRTT • EstRTT = EstRTT + (d x Diff) (i.e., a = 1 - d, b = d) • Dev = Dev + d x (|Diff| - Dev) (i.e., Dev is avg, |Diff| is sample) • where d is a small factor between 0 and 1 • Consider variance when setting timeout value • TimeOut = m x EstRTT + f x Dev • where m = 1 and f = 4 • Notes • algorithm only as good as granularity of execution (500ms on Unix) • accurate timeout mechanism important to congestion control (later)
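The update equations above translate into a small estimator. This is a sketch: the class name is hypothetical, d = 0.125 is an assumed value (the slide only says d is a small factor between 0 and 1), and the estimator is seeded with the first sample and a deviation of 0, which is also an assumption.

```python
class RttEstimator:
    def __init__(self, first_sample, d=0.125, m=1, f=4):
        self.est_rtt = float(first_sample)
        self.dev = 0.0
        self.d, self.m, self.f = d, m, f

    def update(self, sample_rtt):
        """Fold one SampleRTT into EstRTT and Dev; return the new TimeOut."""
        diff = sample_rtt - self.est_rtt
        self.est_rtt += self.d * diff
        self.dev += self.d * (abs(diff) - self.dev)
        return self.timeout()

    def timeout(self):
        # TimeOut = m x EstRTT + f x Dev
        return self.m * self.est_rtt + self.f * self.dev


est = RttEstimator(first_sample=100)
rto = est.update(180)
```

After one sample of 180 ms against an estimate of 100 ms, EstRTT is 110, Dev is 10, and the timeout is 150 ms; a noisy path widens the timeout through the Dev term rather than through a fixed factor of 2.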
Karn/Partridge Algorithm • Do not sample RTT when retransmitting • Sender can’t distinguish between the two scenarios in the figure • Also, double timeout after each retransmission (exponential back-off) • Figure: two sender/receiver timelines, each with an original transmission, a retransmission, and one ACK; in one the SampleRTT is measured from the original transmission, in the other from the retransmission, and the sender cannot tell which transmission the ACK belongs to
Protection Against Wrap Around • Problem: seq numbers may wrap around quickly • 32-bit SequenceNum (max. segment lifetime is 2 min.) • Time until wrap around, by bandwidth: • T1 (1.5 Mbps): 6.4 hours • Ethernet (10 Mbps): 57 minutes • T3 (45 Mbps): 13 minutes • FDDI (100 Mbps): 6 minutes • STS-3 (155 Mbps): 4 minutes • STS-12 (622 Mbps): 55 seconds • STS-24 (1.2 Gbps): 28 seconds
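The wrap-around times follow from dividing the size of the sequence space (2^32 bytes) by the link rate in bytes per second. A short sketch with a hypothetical function name:

```python
def wraparound_seconds(bandwidth_bps, seq_bits=32):
    """Seconds to send 2**seq_bits bytes at the given bit rate."""
    return (2 ** seq_bits) * 8 / bandwidth_bps


ethernet_minutes = wraparound_seconds(10e6) / 60
t1_hours = wraparound_seconds(1.5e6) / 3600
sts12_seconds = wraparound_seconds(622e6)
```

These come out at roughly 57 minutes, 6.4 hours, and 55 seconds. The last value is already below the 2-minute maximum segment lifetime, which is the problem the slide points at.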
Keeping the Pipe Full • Problem: advertised window too small to maintain throughput • Must be at least as big as the bandwidth-delay product • 16-bit AdvertisedWindow (64KB) • Delay x bandwidth product (100ms RTT), by bandwidth: • T1 (1.5 Mbps): 18KB • Ethernet (10 Mbps): 122KB • T3 (45 Mbps): 549KB • FDDI (100 Mbps): 1.2MB • STS-3 (155 Mbps): 1.8MB • STS-12 (622 Mbps): 7.4MB • STS-24 (1.2 Gbps): 14.8MB
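The delay x bandwidth values can be reproduced with one multiplication. A sketch with a hypothetical function name, using 1 KB = 1024 bytes (which matches the figures on the slide) and 65535 bytes as the largest value a 16-bit window field can hold:

```python
MAX_16_BIT_WINDOW = 2 ** 16 - 1  # largest value of a 16-bit AdvertisedWindow


def bdp_bytes(bandwidth_bps, rtt_seconds=0.1):
    """Bytes that must be in flight to keep the pipe full."""
    return bandwidth_bps * rtt_seconds / 8


t1_kb = bdp_bytes(1.5e6) / 1024
ethernet_kb = bdp_bytes(10e6) / 1024
```

At a 100 ms RTT a T1 needs about 18 KB in flight, which fits in the 16-bit window, while 10 Mbps Ethernet already needs about 122 KB, which does not.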
TCP Extensions • Implemented as header options • Accurate round-trip estimation • Store timestamp in outgoing segments • Receiver echoes it back, improving timeout accuracy • Even retransmissions can now be counted towards the sample • Extend sequence space with 32-bit timestamp (PAWS) • Decide if a message is old based on the timestamp above • Shift (scale) advertised window • How many bits to shift the advertised window value to the left • E.g., if scale factor = 4, then advertised window = 16 * adv. window field in ack.
TCP Connection Management • Connection establishment: initialize TCP variables: • seq. #s • buffers, flow control info (e.g. AdvWindow) • Three way handshake: • Step 1: client host sends TCP SYN segment to server • specifies initial seq # • no data • Step 2: server host receives SYN, replies with SYNACK segment • server allocates buffers • specifies server initial seq. # • Step 3: client receives SYNACK, replies with ACK segment, which may contain data • Figure (connection establishment): client to server: SYN=1, ACK=0, SeqNo=SNa; server to client: SYN=1, ACK=1, SeqNo=SNb, ACKNo=SNa+1; client to server: SYN=0, ACK=1, SeqNo=SNa+1, ACKNo=SNb+1, data
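The three segments of the handshake can be generated from the two initial sequence numbers. This is a sketch, not real socket code: segments are plain dictionaries, the function name is hypothetical, and the arithmetic is done modulo 2^32 because sequence numbers are 32 bits wide.

```python
SEQ_MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around


def three_way_handshake(sn_a, sn_b):
    """Return the SYN, SYN+ACK, and ACK segments for client ISN sn_a and server ISN sn_b."""
    syn = {"SYN": 1, "ACK": 0, "SeqNo": sn_a}
    syn_ack = {"SYN": 1, "ACK": 1, "SeqNo": sn_b,
               "ACKNo": (sn_a + 1) % SEQ_MOD}
    ack = {"SYN": 0, "ACK": 1, "SeqNo": (sn_a + 1) % SEQ_MOD,
           "ACKNo": (sn_b + 1) % SEQ_MOD}
    return [syn, syn_ack, ack]


segments = three_way_handshake(sn_a=1000, sn_b=5000)
```

Each side acknowledges the other's initial sequence number plus one, because the SYN flag itself consumes one sequence number.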
Sequence Number Selection • Initial sequence number (ISN) selection • Why not simply choose 0? • Must avoid overlap with an earlier incarnation (i.e., an earlier connection between the same hosts and the same port numbers). • How do you know if a SYN message is from an old connection or from a new one? • The server does not keep information about old connections after it closes them • Thus, when you receive an ack for the ISN of this connection, you know it is not from an old connection
ISN and Quiet Time • Use local clock to select ISN • Clock wraparound must be greater than max segment lifetime (MSL) • Upon startup, cannot assign sequence numbers for MSL seconds (initial clock value could be off) • Or use a random ISN • Either way, we can still have sequence number overlap! (within the same connection) • If the sequence number space is not large enough for high-bandwidth connections, it will wrap before MSL • Extending the sequence number space should help to mitigate this problem
Connection Tear-down Features • Allow unilateral close • TCP must continue to receive data even after closing the forward connection • Timed-Wait at the end of the connection • Performed by the party that first closes one side of the connection (usually the client). • Cannot forget about a connection immediately • Ensures both end-points are closed • Protection against old messages with same seq #
Connection Tear-Down • Step 1: client end system sends TCP FIN control segment to server • Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN. • Step 3: client receives FIN, replies with ACK. Enters “timed wait”: will respond with ACK to received FINs • Step 4: server receives ACK. Connection closed. • Figure (closing connection): client closes and sends FIN; server replies with ACK, closes, and sends FIN (ACK and FIN can be combined into one segment); client replies with ACK and enters timed wait; closed
Tear-down Packet Exchange • Figure: message sequence between Sender and Receiver, in order: FIN, ACK, data send, data ack, FIN, ACK
Timed-wait: clean connection close (book) • Figure: message sequence between Sender and Receiver, in order: FIN, ACK, FIN, ACK, SYN, FIN • Sender thinks the second connection has been terminated
Timed-wait: clean connection close (me) • Figure: message sequence between Sender and Receiver, in order: FIN, ACK, FIN (connection officially closed, normally), ACK, FIN, RST (connection closed abnormally)
Timed-wait: protection against data msgs from earlier connections at the server • SYN msgs from earlier connections do not cause problems at the server • The ISN of the server is different from the previous connection • ISN’s are not reused within 2 message lifetimes • Thus, the ACK from the client (if old) will not contain the new ISN • However, what about old DATA messages? (see next …)
Timed-wait: protection against data msgs from earlier connections at the server (continued) • Assume you are using a clock for an ISN • Assume the connection is very fast • Seq # of the packets from previous connection have gone slightly beyond the clock’s value • If a new connection is made (same ports) • ISN will be similar to the SN of the old messages • Old messages could then be confused with new ones • Timed-Wait prevents this. • All messages of the connection will die before you start a new connection (see ISN discussion in previous slide)
Need for Random ISN • Since we have timed-wait, do we still need a random/nondecreasing initial sequence number? • Think replay! • Can a connection be replayed at the client? • Can a connection be replayed at the server? • What if both client and server did a timed-wait? • Can a connection be replayed at the client? • Can a connection be replayed at the server?
TCP Connection Management (cont) • Figures: TCP server lifecycle; TCP client lifecycle
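TCP state-transition diagram (edges labeled event/action) • CLOSED to LISTEN: Passive open • CLOSED to SYN_SENT: Active open/SYN • LISTEN to CLOSED: Close • LISTEN to SYN_RCVD: SYN/SYN + ACK • LISTEN to SYN_SENT: Send/SYN • SYN_SENT to CLOSED: Close • SYN_SENT to SYN_RCVD: SYN/SYN + ACK • SYN_SENT to ESTABLISHED: SYN + ACK/ACK • SYN_RCVD to ESTABLISHED: ACK • SYN_RCVD to FIN_WAIT_1: Close/FIN • ESTABLISHED to FIN_WAIT_1: Close/FIN • ESTABLISHED to CLOSE_WAIT: FIN/ACK • FIN_WAIT_1 to FIN_WAIT_2: ACK • FIN_WAIT_1 to CLOSING: FIN/ACK • FIN_WAIT_1 to TIME_WAIT: ACK + FIN/ACK • FIN_WAIT_2 to TIME_WAIT: FIN/ACK • CLOSING to TIME_WAIT: ACK • CLOSE_WAIT to LAST_ACK: Close/FIN • LAST_ACK to CLOSED: ACK • TIME_WAIT to CLOSED: Timeout after two segment lifetimes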
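The state-transition diagram can be encoded as a lookup table and stepped through. This is a sketch: the edges follow the standard TCP state diagram that the labels on this slide come from, while the event names (`recv_SYN`, `timeout_2msl`, and so on) and the `run` helper are hypothetical.

```python
# (state, event) -> (next state, segment sent in response or None)
TRANSITIONS = {
    ("CLOSED", "passive_open"): ("LISTEN", None),
    ("CLOSED", "active_open"): ("SYN_SENT", "SYN"),
    ("LISTEN", "close"): ("CLOSED", None),
    ("LISTEN", "recv_SYN"): ("SYN_RCVD", "SYN+ACK"),
    ("LISTEN", "send"): ("SYN_SENT", "SYN"),
    ("SYN_SENT", "close"): ("CLOSED", None),
    ("SYN_SENT", "recv_SYN"): ("SYN_RCVD", "SYN+ACK"),
    ("SYN_SENT", "recv_SYN+ACK"): ("ESTABLISHED", "ACK"),
    ("SYN_RCVD", "recv_ACK"): ("ESTABLISHED", None),
    ("SYN_RCVD", "close"): ("FIN_WAIT_1", "FIN"),
    ("ESTABLISHED", "close"): ("FIN_WAIT_1", "FIN"),
    ("ESTABLISHED", "recv_FIN"): ("CLOSE_WAIT", "ACK"),
    ("FIN_WAIT_1", "recv_ACK"): ("FIN_WAIT_2", None),
    ("FIN_WAIT_1", "recv_FIN"): ("CLOSING", "ACK"),
    ("FIN_WAIT_1", "recv_ACK+FIN"): ("TIME_WAIT", "ACK"),
    ("FIN_WAIT_2", "recv_FIN"): ("TIME_WAIT", "ACK"),
    ("CLOSING", "recv_ACK"): ("TIME_WAIT", None),
    ("CLOSE_WAIT", "close"): ("LAST_ACK", "FIN"),
    ("LAST_ACK", "recv_ACK"): ("CLOSED", None),
    ("TIME_WAIT", "timeout_2msl"): ("CLOSED", None),
}


def run(events, state="CLOSED"):
    """Apply events in order; return the final state and the segments sent."""
    sent = []
    for event in events:
        state, segment = TRANSITIONS[(state, event)]
        if segment is not None:
            sent.append(segment)
    return state, sent


client = run(["active_open", "recv_SYN+ACK", "close",
              "recv_ACK", "recv_FIN", "timeout_2msl"])
server = run(["passive_open", "recv_SYN", "recv_ACK",
              "recv_FIN", "close", "recv_ACK"])
```

Both walks end in CLOSED: the client (the side that closes first) passes through TIME_WAIT, while the server goes through CLOSE_WAIT and LAST_ACK. An event that is not valid in the current state raises `KeyError`, which is a deliberate simplification.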