Transmission Control Protocol (TCP) (Reliable Byte-Stream)
Outline • Transport Protocols (multiplexing and demultiplexing) • Sliding Window Revisited • Flow Control • Adaptive Timeout • Connection Establishment/Termination
Internet Protocol (IP) Graph [Figure: protocol graph with application protocols (BGP, DHCP, FTP, HTTP, NV, TFTP) above TCP and UDP, which run over IP, which runs over link technologies (Ethernet, FDDI, ATM, Modem)]
End-to-End Protocols • Underlying best-effort IP network • drops messages • reorders messages • delivers duplicate copies of a given message • limits messages to some finite size • delivers messages after an arbitrarily long delay • Common end-to-end services (we need a transport protocol to accomplish this) • guarantee message delivery • deliver messages in the same order they are sent • deliver at most one copy of each message • support arbitrarily large messages • allow the receiver to flow control the sender • support multiple application processes on each host
Demultiplexing • At each protocol layer, there may be several protocol choices at the next level up. • E.g., both TCP and UDP reside on top of IP • When a packet arrives, how do you know who to give it to?
IP demultiplexing • The IP datagram header has: • source IP address • destination IP address • transport protocol number • identifies the transport protocol of the application (e.g. TCP, UDP) [Figure: IP header layout, bit offsets 0, 4, 8, 16, 19, 31: Version, HLen, TOS, Length; Ident, Flags, Offset; TTL, Protocol, Checksum; SourceAddr; DestinationAddr; Options (variable), Pad (variable); Data]
Transport level demultiplexing • Several processes may use the same transport (end-to-end) protocol • How can we distinguish between them? (demultiplexing) • OS (e.g. Unix) process id’s – • Not good, we desire protocol to be OS independent • Port #’s – transport protocol has a list of port #’s (or mailboxes) • Good, it is OS independent • The application requests a port through the OS • The transport layer (i.e., TCP or UDP in the OS) gives the application an unused port number • The application may request a random port # or a specific port #
Transport level (mul/demul)-tiplexing • Multiplexing at sending host (transport layer): • gather data from multiple ports (applications) • envelope data with TCP header • TCP header contains port # info • Demultiplexing at receiving host (transport layer): • deliver received segments to correct port (application) • use port # in TCP header to decide [Figure: three hosts (host 1, host 2, host 3), each with application, transport (TCP), network (IP), and datalink (LAN) layers; each port corresponds to a process (P1a, P1b, P2a, P2b, P2c, P3c)]
Transport level (mul/demul)-tiplexing: header • Each IP datagram carries 1 transport-layer segment (see figure) • The transport-layer header has a source port # and a destination port # • The transport protocol uses the port # to give data to the application process • Note the absence of source/destination IP addresses [Figure: TCP/UDP segment format, 32 bits wide: source port #, dest port #; other header fields; application data (message)]
Simple Demultiplexor (UDP) • Unreliable and unordered datagram service • Adds multiplexing/demultiplexing (i.e. port #s) • No flow control or reliability • Endpoints identified by ports • servers have well-known ports • see /etc/services on Unix • Header format • Optional checksum • pseudo header + UDP header + data [Figure: UDP header, bit offsets 0, 16, 31, with fields SrcPort, DstPort, Length, and Checksum, followed by Data]
Connectionless demux (cont’d) • Client A acquires port 9157 from the OS • Client A sends a message to the server using the “well-known” destination port 64 • The client expects a response • The server responds using the source port # of the request as the destination port # [Figure: Client A (port 9157) and Client B (port 5775) both talk to the server on port 64. The requests carry SP: 9157, DP: 64 and SP: 5775, DP: 64; the replies carry SP: 64, DP: 9157 and SP: 64, DP: 5775]
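The reply rule on this slide (the destination port of the response is the source port of the request) can be sketched in a few lines. This is a toy model, not a real socket API: `PortDemux` and `build_reply` are hypothetical names, and each port is just a list acting as a mailbox.

```python
from collections import defaultdict

class PortDemux:
    """Toy transport layer: one mailbox (list) per destination port."""

    def __init__(self):
        self.mailboxes = defaultdict(list)

    def deliver(self, src_port, dst_port, payload):
        # Demultiplex on the destination port; keep the source port
        # so the application knows where to send the reply.
        self.mailboxes[dst_port].append((src_port, payload))

def build_reply(server_port, request):
    """Swap the ports: the reply goes back to the requester's source port."""
    src_port, payload = request
    return {"SP": server_port, "DP": src_port, "data": "re: " + payload}

server = PortDemux()
server.deliver(src_port=9157, dst_port=64, payload="from A")  # Client A
server.deliver(src_port=5775, dst_port=64, payload="from B")  # Client B

replies = [build_reply(64, req) for req in server.mailboxes[64]]
for reply in replies:
    print(reply)
```

Both requests land in the same mailbox (port 64), and only the remembered source port tells the two clients apart.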
Well-Known Ports • What if you want to talk to a server over a typical application (http, ftp, etc?) • Each application has a fixed well-known port # (standardized) • The application first sends messages to the well-known port
TCP Overview • Connection-oriented • Byte-stream • app writes bytes • TCP sends segments • app reads bytes • Full duplex • Flow control: keep sender from overrunning receiver • Congestion control: keep sender from overrunning network [Figure: the sending application process writes bytes into the TCP send buffer, TCP transmits segments, and the receiving application process reads bytes from the TCP receive buffer]
Data Link Versus Transport Protocols • Transport protocols have to handle all of the following, which may not be necessary in a datalink protocol. • Connect to many different hosts • need explicit connection establishment and termination • Different RTT’s of different sources • need adaptive timeout mechanism • Tolerate long delay in network • need to be prepared for arrival of very old packets • Handle packet reorder • Different capacity at destination • need to provide flow control • Different network capacity • need to provide for congestion control
TCP segment structure [Figure: TCP header, 32 bits wide: source port #, dest port #; sequence number; acknowledgement number; head len, not used, flags (URG, ACK, PSH, RST, SYN, FIN), receive window; checksum, urg data pointer; options (variable length); application data (variable length)] • Sequence and acknowledgement numbers: counting by bytes of data (not segments!) • URG: urgent data (generally not used) • ACK: ACK # valid • PSH: push data now (generally not used) • RST, SYN, FIN: connection establishment (setup, teardown commands) • Receive window: # bytes rcvr willing to accept • Checksum: Internet checksum (as in UDP)
TCP services/components • Connection management (we will cover this later) • Reliability • Sequence numbers and ACKs • Time out mechanism • Flow control • sender will not overwhelm receiver • Congestion control • Prevent overflow along the path to the destination
Data Transfer • Each TCP connection is identified with a 4-tuple: • (SrcIPAddr, SrcPort, DstIPAddr, DstPort) • Sliding window + flow control • Acknowledgment(SequenceNum, AdvertisedWindow) • Checksum • pseudo header + TCP header + data [Figure: the sender sends Data (SequenceNum) to the receiver; the receiver returns Acknowledgment (SequenceNum, AdvertisedWindow)]
TCP seq. #’s and ACKs • Seq. #’s: byte stream “number” of first byte in segment’s data • ACKs: seq # of next byte expected from other side; cumulative ACK • Q: how does the receiver handle out-of-order segments? • A: the TCP spec doesn’t say; it is up to the implementor • Simple telnet scenario (ACKs are piggybacked on data): • User at Host A types ‘CDE’: Host A sends Seq=42, ACK=79, data = ‘CDE’ • Host B ACKs receipt of ‘CDE’ and sends back ‘XYZ’: Seq=79, ACK=45, data = ‘XYZ’ • Host A ACKs receipt of ‘XYZ’: Seq=45, ACK=82
TCP: retransmission scenarios • Lost ACK scenario: Host A sends Seq=92, 8 bytes data; Host B replies ACK=100, which is lost (X); the Seq=92 timer expires and Host A resends Seq=92, 8 bytes data; Host B replies ACK=100 again, so LastByteAcked = 100 • Premature timeout: Host A sends Seq=92, 8 bytes data and Seq=100, 20 bytes data; Host B replies ACK=100 and ACK=120, but the Seq=92 timer expires before the ACKs arrive, so Host A resends Seq=92, 8 bytes data; the ACKs move LastByteAcked to 100 and then 120, and Host B answers the duplicate segment with ACK=120
TCP retransmission scenarios (more) • Cumulative ACK scenario: Host A sends Seq=92, 8 bytes data and Seq=100, 20 bytes data; the ACK=100 from Host B is lost (X), but ACK=120 arrives before the timeout; because ACKs are cumulative, LastByteAcked = 120 and nothing is retransmitted
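The arithmetic in the telnet scenario can be checked with a small sketch. The `Segment` class and `reply` helper are hypothetical names for illustration; the only rules encoded are that the ACK number is the sender's Seq plus the number of data bytes received, and that a reply's Seq is the ACK number it was just sent.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    seq: int           # byte-stream number of the first data byte
    ack: int           # next byte expected from the other side
    data: bytes = b""

def reply(received: Segment, data: bytes = b"") -> Segment:
    """Build the piggybacked reply to a received segment."""
    return Segment(
        seq=received.ack,                       # continue our own byte stream
        ack=received.seq + len(received.data),  # cumulative ACK of their bytes
        data=data,
    )

first = Segment(seq=42, ack=79, data=b"CDE")   # Host A: user types 'CDE'
second = reply(first, b"XYZ")                   # Host B: ACKs 'CDE', sends 'XYZ'
third = reply(second)                           # Host A: ACKs 'XYZ', no data

for seg in (first, second, third):
    print(seg)
```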
TCP ACK generation [RFC 1122, RFC 2581]
• Event at receiver: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed. TCP receiver action: delayed ACK; wait up to 500 ms for next segment; if no next segment, send ACK.
• Event at receiver: arrival of in-order segment with expected seq #; one other segment has ACK pending. TCP receiver action: immediately send a single cumulative ACK, ACKing both in-order segments.
• Event at receiver: arrival of out-of-order segment with higher-than-expected seq #; gap detected. TCP receiver action: immediately send a duplicate ACK, indicating the seq # of the next expected byte.
• Event at receiver: arrival of segment that partially or completely fills the gap. TCP receiver action: immediately send ACK, provided that the segment starts at the lower end of the gap.
Example • Assume bytes 0 .. 79 have been sent, received, and acknowledged. • Assume the sender sends the following segments (of 20 bytes each) • 80, 100, 120, 140, 160 • Assume they are received in the following order • 80 – receiver does not send an ack (delayed ACK) • 100 – receiver sends an ack(120) because 80 was not acked • 140 – gap detected, immediately send ack(120) • 160 – gap detected, immediately send ack(120) • 120 – gap closed, immediately send ack(180)
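The example above can be replayed with a small simulation of the receiver rules. This is only a sketch: the `Receiver` class is a hypothetical name, and the 500 ms delayed-ACK timer is left out, so a delayed ACK is only ever released by the arrival of another segment.

```python
class Receiver:
    """Toy TCP receiver that decides which ACK (if any) to send per segment."""

    def __init__(self, next_expected):
        self.next_expected = next_expected  # seq # of next in-order byte
        self.ack_pending = False            # a delayed ACK is outstanding
        self.out_of_order = {}              # seq -> length, buffered past a gap

    def on_segment(self, seq, length):
        """Return the ACK number to send now, or None if the ACK is delayed."""
        if seq > self.next_expected:
            # Gap detected: buffer the segment and send a duplicate ACK.
            self.out_of_order[seq] = length
            self.ack_pending = False
            return self.next_expected
        if seq < self.next_expected:
            # Old data: just repeat the cumulative ACK.
            return self.next_expected
        # In-order segment.
        filling_gap = bool(self.out_of_order)
        self.next_expected += length
        while self.next_expected in self.out_of_order:
            self.next_expected += self.out_of_order.pop(self.next_expected)
        if filling_gap or self.ack_pending:
            self.ack_pending = False
            return self.next_expected       # immediate cumulative ACK
        self.ack_pending = True
        return None                         # delayed ACK

rcv = Receiver(next_expected=80)            # bytes 0..79 already ACKed
arrival_order = [80, 100, 140, 160, 120]    # 20-byte segments
acks = [rcv.on_segment(seq, 20) for seq in arrival_order]
print(acks)
```

The returned list has one entry per arriving segment, with `None` standing for "no ACK sent yet".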
Fast Retransmit • The time-out period is often relatively long: long delay before resending a lost packet • Detect lost segments via duplicate ACKs. How? • The sender often sends many segments back-to-back • If a segment is lost, there will likely be many duplicate ACKs • If the sender receives 3 duplicate ACKs for the same data, it supposes that the segment after the ACKed data was lost • Fast retransmit: resend the segment before the timer expires • Note: this assumes that reordering is rare (which may not necessarily be true)
Fast Retransmit • Problem: coarse-grain TCP timeouts lead to idle periods • Fast retransmit: use duplicate ACKs to trigger retransmission (3 of them in case there is reordering) [Figure: the sender sends Packets 10, 20, 30, 40, 50, 60 and Packet 30 is lost. The receiver returns ACK 20 and ACK 30, then a duplicate ACK 30 for each of Packets 40, 50, and 60. After the third duplicate the sender retransmits Packet 30, and the receiver returns ACK 70]
Sliding Window • You should already know how the sliding window protocol works (with cumulative ack’s), except that now seq #’s are bytes. • LastByteSent ≤ LastByteAcked + CongestionWindow • The sender window is also known as the “congestion” window if it is allowed to change size over time (more on this in the next chapter). [Figure: sender byte stream with LastByteAcked and LastByteSent marked]
Sliding Window Revisited • Sending side • LastByteAcked ≤ LastByteSent ≤ LastByteWritten • buffer bytes between LastByteAcked+1 and LastByteWritten • Receiving side • LastByteRead < NextByteExpected ≤ LastByteRcvd + 1 • buffer bytes between LastByteRead+1 and LastByteRcvd [Figure: the sending application writes into the sender's TCP buffer (LastByteAcked, LastByteSent, LastByteWritten marked); the receiving application reads from the receiver's TCP buffer (LastByteRead, NextByteExpected, LastByteRcvd marked)]
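A minimal sketch of the duplicate-ACK counter, assuming the sender only sees a stream of cumulative ACK numbers. The function name `fast_retransmit` and the `threshold` parameter are illustrative; the ACK stream is the one from the trace on this slide.

```python
def fast_retransmit(acks, threshold=3):
    """Return the seq numbers to retransmit, given cumulative ACK numbers.

    A retransmission is triggered when the same ACK number has been seen
    `threshold` times after its first (non-duplicate) arrival.
    """
    retransmits = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == threshold:
                retransmits.append(ack)  # resend the segment the receiver wants
        else:
            last_ack = ack
            dup_count = 0
    return retransmits

# ACK 20, ACK 30, then three duplicate ACK 30s, then ACK 70.
acks_seen = [20, 30, 30, 30, 30, 70]
print(fast_retransmit(acks_seen))
```

With fewer than three duplicates (for example mild reordering), nothing is retransmitted and the sender falls back on the timeout.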
Preventing Buffer Overflow • Sender buffer size: MaxSendBuffer • Receiver buffer size: MaxRcvBuffer • Restrictions on the speed of the sender’s application • The application increases LastByteWritten when it creates more data. • Block the sender’s application if the following would be violated: • LastByteWritten - LastByteAcked ≤ MaxSendBuffer • Restrictions caused by a slow receiver’s application • The receiver application removes bytes and increases LastByteRead • Drop new incoming packets if they would violate • LastByteRcvd – LastByteRead ≤ MaxRcvBuffer
Flow Control • Objective: prevent dropping of packets at receiver. • AdvertisedWindow is received in every ACK • If ack X has AdvertisedWindow = 10, written ack(X,10), then • rcvr has buffer space for bytes X up to X + 9. • How do you compute the advertised window?
Flow Control (cont) • LastByteRcvd – LastByteRead ≤ MaxRcvBuffer • The receiver sends ACK(NextByteExpected, AdvertisedWindow) • AdvertisedWindow is the buffer size minus the “green” portion of the figure (in-order bytes received but not yet read): AdvertisedWindow = MaxRcvBuffer – ((NextByteExpected – 1) – LastByteRead) • The sender persists when AdvertisedWindow = 0 (it sends a message with 1 byte to get back an ack with a new advertised window, which prevents deadlock) [Figure: receive buffer of size MaxRcvBuffer between TCP and the receiving application, with LastByteRead, NextByteExpected, and LastByteRcvd marked; the advertised window is the remaining free space]
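As a worked example of the advertised-window computation, the sketch below subtracts the in-order bytes that are buffered but not yet read, i.e. (NextByteExpected - 1) - LastByteRead, from the buffer size. The function name and the sample numbers are assumptions chosen for illustration.

```python
def advertised_window(max_rcv_buffer, next_byte_expected, last_byte_read):
    """Free receive-buffer space to advertise in the next ACK."""
    buffered_in_order = (next_byte_expected - 1) - last_byte_read
    return max_rcv_buffer - buffered_in_order

# 4096-byte buffer, application has read through byte 1000,
# TCP has received in-order data through byte 2000.
print(advertised_window(4096, next_byte_expected=2001, last_byte_read=1000))

# Application has caught up: the whole buffer is free again.
print(advertised_window(4096, next_byte_expected=2001, last_byte_read=2000))
```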
Restrictions on sender (due to advertised window) • LastByteSent - LastByteAcked≤ AdvertisedWindow • LastByteSent - LastByteAcked≤ CongestionWindow • More on the congestion window in a later chapter • Therefore, two “windows” restrict the sender’s ability to send a new message • Congestion window is determined by the sender • Advertised window is determined by the receiver • Usually the congestion window is the one that slows you down, not the receiver.
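The two restrictions combine into a single limit: the sender may have at most min(AdvertisedWindow, CongestionWindow) unacknowledged bytes in flight. A small sketch, with `usable_window` as a hypothetical helper name:

```python
def usable_window(last_byte_sent, last_byte_acked,
                  advertised_window, congestion_window):
    """Bytes the sender may still transmit without violating either window."""
    in_flight = last_byte_sent - last_byte_acked
    limit = min(advertised_window, congestion_window)
    return max(0, limit - in_flight)

# Congestion window (4000) is the tighter limit: 2000 bytes in flight,
# so 2000 more may be sent.
print(usable_window(5000, 3000, advertised_window=8000, congestion_window=4000))

# A slow receiver advertising only 1500 bytes stops the sender entirely.
print(usable_window(5000, 3000, advertised_window=1500, congestion_window=4000))
```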
Silly Window Syndrome • Maximum Segment Size (MSS) – largest packet TCP will create • What if the window (congestion and advertised) allows us to send << MSS for the next msg? • Transmit? Wait for the window to increase? (congestion-window or advertised-window growth) • Small segments can “hang around” forever: • If my congestion window is closed (LastByteSent – LastByteAcked = Cwindow), and I receive an ack for 10 more bytes, then I can only send 10 more bytes (i.e. a small packet)
Receiver’s help for Silly Window S. • Rcvr does not send an advertised-window update with a value less than a MSS • This takes care of advertised window, what about the congestion window?
Nagle’s Algorithm • How long to wait before sending data less than MSS? Use ack’s as a clocking mechanism • You want to have at most one small message in flight • When the application produces data to send:
    if both the available data and the window ≥ MSS
        send a full segment
    else
        if there is unACKed data in flight
            don’t send the data (just buffer it and wait)
        else
            send the new data now
Adaptive Retransmission (Original Algorithm) • The RTT is different for each destination • Compute an average of the RTT, and set your timer accordingly • Measure SampleRTT for each segment/ACK pair • Compute weighted average of RTT • EstRTT = a × EstRTT + b × SampleRTT • where a + b = 1 • a between 0.8 and 0.9 • b between 0.1 and 0.2 • Set timeout based on EstRTT • TimeOut = 2 × EstRTT
Jacobson/Karels Algorithm • New calculations for average RTT: take variance into account • If the samples vary a lot, a timeout based only on EstRTT is too aggressive; if they vary little, it is needlessly conservative. • EstRTT = same as before • Diff = SampleRTT - EstRTT • Dev = a × Dev + b × |Diff| (i.e., Dev is the average, |Diff| is the sample) • Consider variance when setting timeout value • TimeOut = m × EstRTT + f × Dev • where m = 1 and f = 4 • Notes • algorithm only as good as granularity of execution (500ms on Unix) • accurate timeout mechanism important to congestion control (later)
Karn/Partridge Algorithm • Do not sample RTT when retransmitting • The sender can’t distinguish between the two scenarios in the figure • Also, double the timeout after each retransmission (exponential back-off) [Figure: two sender/receiver timelines. In one, the ACK answers the original transmission but SampleRTT is measured from the retransmission; in the other, the ACK answers the retransmission but SampleRTT is measured from the original transmission]
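The decision rule can be written as one small function. This is a sketch of the rule as stated on the slide, with hypothetical parameter names; it returns whether TCP should transmit right now.

```python
def nagle_should_send(available_bytes, window, mss, unacked_in_flight):
    """Decide whether to transmit now, following the rule on the slide."""
    if available_bytes >= mss and window >= mss:
        return True            # a full segment can go out
    if unacked_in_flight:
        return False           # buffer the small data until an ACK arrives
    return True                # nothing in flight: send the small segment

MSS = 1460  # illustrative value, not taken from the slides

print(nagle_should_send(1460, 4000, MSS, unacked_in_flight=True))   # full segment
print(nagle_should_send(10, 4000, MSS, unacked_in_flight=True))     # wait
print(nagle_should_send(10, 4000, MSS, unacked_in_flight=False))    # send small
```

The returning ACK acts as the clock: it clears the in-flight condition and releases whatever was buffered in the meantime.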
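A short numeric sketch of the original estimator. The choice a = 0.875 (so b = 0.125) is an assumption picked from the ranges on the slide, and the millisecond values are made up for illustration.

```python
def update_est_rtt(est_rtt, sample_rtt, a=0.875):
    """Weighted average: EstRTT = a * EstRTT + b * SampleRTT, with b = 1 - a."""
    b = 1 - a
    return a * est_rtt + b * sample_rtt

def timeout(est_rtt):
    """Original rule: TimeOut = 2 * EstRTT."""
    return 2 * est_rtt

est = 100.0                       # current estimate in ms
est = update_est_rtt(est, 200.0)  # one slow sample pulls the estimate up a bit
print(est, timeout(est))
```

A single outlier moves the estimate by only b times the difference, which is why the average reacts slowly to sudden RTT changes.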
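The same update with the deviation term added, as a sketch. It uses one weight pair (a, b) for both averages, as the slide does, with a = 0.875 assumed; the function name and the sample values are illustrative.

```python
def jacobson_karels(est_rtt, dev, sample_rtt, a=0.875, m=1, f=4):
    """One update step; returns (new EstRTT, new Dev, TimeOut)."""
    b = 1 - a
    diff = sample_rtt - est_rtt               # Diff uses the old estimate
    est_rtt = a * est_rtt + b * sample_rtt    # same average as before
    dev = a * dev + b * abs(diff)             # running average of |Diff|
    return est_rtt, dev, m * est_rtt + f * dev

# Estimate 100 ms, deviation 10 ms, new sample 140 ms.
est, dev, rto = jacobson_karels(100.0, 10.0, 140.0)
print(est, dev, rto)
```

When samples are steady, Dev shrinks and the timeout approaches EstRTT; when they jump around, the 4 × Dev term dominates.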
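A sketch of both Karn/Partridge rules layered on the original estimator (TimeOut = 2 × EstRTT). The `KarnTimer` class, its method names, and a = 0.875 are assumptions for illustration.

```python
class KarnTimer:
    """Toy retransmission timer: skip ambiguous samples, back off on timeout."""

    def __init__(self, est_rtt, a=0.875):
        self.est_rtt = est_rtt
        self.a = a
        self.timeout = 2 * est_rtt

    def on_timeout(self):
        # Exponential back-off: double the timeout after each retransmission.
        self.timeout *= 2

    def on_ack(self, sample_rtt, was_retransmitted):
        if was_retransmitted:
            return  # cannot tell which transmission the ACK answers
        self.est_rtt = self.a * self.est_rtt + (1 - self.a) * sample_rtt
        self.timeout = 2 * self.est_rtt

timer = KarnTimer(est_rtt=100.0)
timer.on_timeout()                              # first retransmission
timer.on_timeout()                              # second retransmission
timer.on_ack(50.0, was_retransmitted=True)      # ignored: ambiguous sample
backed_off = timer.timeout
timer.on_ack(200.0, was_retransmitted=False)    # clean sample resets the timer
print(backed_off, timer.est_rtt, timer.timeout)
```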
Protection Against Wrap Around • Problem: seq numbers may wrap around quickly • 32-bit SequenceNum (max. segment lifetime is 2 min.)
  Bandwidth            Time Until Wrap Around
  T1 (1.5 Mbps)        6.4 hours
  Ethernet (10 Mbps)   57 minutes
  T3 (45 Mbps)         13 minutes
  FDDI (100 Mbps)      6 minutes
  STS-3 (155 Mbps)     4 minutes
  STS-12 (622 Mbps)    55 seconds
  STS-24 (1.2 Gbps)    28 seconds
Keeping the Pipe Full • Problem: advertised window too small to maintain throughput • Must be at least as big as the bandwidth-delay product • 16-bit AdvertisedWindow (64KB)
  Bandwidth            Delay x Bandwidth Product (100ms RTT)
  T1 (1.5 Mbps)        18KB
  Ethernet (10 Mbps)   122KB
  T3 (45 Mbps)         549KB
  FDDI (100 Mbps)      1.2MB
  STS-3 (155 Mbps)     1.8MB
  STS-12 (622 Mbps)    7.4MB
  STS-24 (1.2 Gbps)    14.8MB
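The wrap-around times follow from dividing the size of the sequence space (2^32 bytes) by the link rate in bytes per second. A quick check, with `wraparound_seconds` as a hypothetical helper:

```python
def wraparound_seconds(bandwidth_bps, seq_bits=32):
    """Seconds until a byte-counting sequence number wraps at full link rate."""
    return (2 ** seq_bits) * 8 / bandwidth_bps

MSL_SECONDS = 120  # max. segment lifetime of 2 min, as stated on the slide

for name, bps in [("T1", 1.5e6), ("Ethernet", 10e6), ("STS-12", 622e6)]:
    seconds = wraparound_seconds(bps)
    print(name, round(seconds), "s, wraps within one MSL:", seconds < MSL_SECONDS)
```

At STS-12 speeds and above, the sequence space wraps in less than the maximum segment lifetime, so an old segment could still be in the network when its sequence number is reused.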
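The bandwidth-delay figures can be reproduced with integer arithmetic: bits per second times RTT, divided by 8 to get bytes. The helper names below are illustrative; 65,535 is the largest value a 16-bit window field can hold.

```python
MAX_16_BIT_WINDOW = 65535  # largest value of a 16-bit AdvertisedWindow field

def bdp_bytes(bandwidth_bps, rtt_ms=100):
    """Bandwidth-delay product in bytes for an integer bit rate and RTT in ms."""
    return bandwidth_bps * rtt_ms // 8000

def fits_in_16_bit_window(bandwidth_bps, rtt_ms=100):
    return bdp_bytes(bandwidth_bps, rtt_ms) <= MAX_16_BIT_WINDOW

for name, bps in [("T1", 1_500_000), ("Ethernet", 10_000_000), ("FDDI", 100_000_000)]:
    print(name, bdp_bytes(bps), "bytes, fits:", fits_in_16_bit_window(bps))
```

Only the T1 row fits in an unscaled 16-bit window; every faster link in the table needs more than 64KB in flight to stay busy.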
TCP Extensions • Implemented as header options • Accurate round-trip estimation • Store timestamp in outgoing segments • Receiver echoes it back, improving timeout accuracy • Even retransmissions can now be counted towards the sample • Extend sequence space with 32-bit timestamp (PAWS) • Decide if a message is old based on the timestamp above • Shift (scale) advertised window • How many bits to shift the advertised window value to the left • E.g., if scale factor = 4, then advertised window = 16 * adv. window field in ack.
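The window-scale arithmetic is a left shift of the 16-bit field. The sketch below also searches for the smallest shift that covers a given bandwidth-delay product; the helper names are hypothetical, and the cap of 14 on the shift is the limit set by the TCP window scale option.

```python
def scaled_window(window_field, scale_shift):
    """Effective advertised window: the 16-bit field shifted left."""
    return window_field << scale_shift

def min_shift(required_bytes, max_field=65535, max_shift=14):
    """Smallest shift whose maximum window covers required_bytes."""
    shift = 0
    while scaled_window(max_field, shift) < required_bytes and shift < max_shift:
        shift += 1
    return shift

print(scaled_window(1000, 4))   # scale factor 4 multiplies the field by 16
print(min_shift(1_250_000))     # 100 Mbps at 100 ms RTT needs about 1.25 MB
```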
TCP Connection Management: Connection Establishment • Initialize TCP variables: seq. #s, buffers, flow control info (e.g. AdvWindow) • Three way handshake: • Step 1: client host sends TCP SYN segment to server; specifies initial seq #; no data • Step 2: server host receives SYN, replies with SYNACK segment; server allocates buffers; specifies server initial seq. # • Step 3: client receives SYNACK, replies with ACK segment, which may contain data [Figure: client to server: SYN=1, ACK=0, SeqNo=SNa; server to client: SYN=1, ACK=1, SeqNo=SNb, ACKNo=SNa+1; client to server: SYN=0, ACK=1, SeqNo=SNa+1, ACKNo=SNb+1, data]
Both ends may open at the same time • What if both end points try to connect to each other at the same time? • Yes, it is possible. What would it look like?
Sequence Number Selection • Initial sequence number (ISN) selection • Why not simply choose 0? • Must avoid overlap with an earlier incarnation (i.e., an earlier connection between the same pair of hosts and the same port numbers). • Messages from old connections could be mistaken for messages of the new connection • Entire connections can be “replayed”.
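The three segments of the handshake can be generated from the two initial sequence numbers. This is a sketch only: the function name and the dictionary layout are assumptions, and the arithmetic is done modulo 2^32 because the sequence-number field is 32 bits wide.

```python
SEQ_SPACE = 2 ** 32  # sequence numbers are 32-bit and wrap around

def three_way_handshake(sn_a, sn_b):
    """Return the SYN, SYNACK, and ACK segments for client ISN sn_a, server ISN sn_b."""
    syn = {"SYN": 1, "ACK": 0, "SeqNo": sn_a}
    synack = {"SYN": 1, "ACK": 1, "SeqNo": sn_b,
              "ACKNo": (sn_a + 1) % SEQ_SPACE}
    ack = {"SYN": 0, "ACK": 1, "SeqNo": (sn_a + 1) % SEQ_SPACE,
           "ACKNo": (sn_b + 1) % SEQ_SPACE}
    return [syn, synack, ack]

for segment in three_way_handshake(sn_a=1000, sn_b=5000):
    print(segment)
```

Each side's SYN consumes one sequence number, which is why both acknowledgement numbers are the peer's ISN plus one.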
ISN selection • Different connections should use different ISNs • Use local clock to select ISN • Clock wraparound must be greater than max segment lifetime (MSL) • Upon startup, cannot assign sequence numbers for MSL seconds (initial clock value could be off) • Or simply use a random ISN
SYN msgs: old or new? • How do you know if a SYN message is from an old connection or from a new one? • The server does not keep information about old connections after it closes them (it actually may, but in general it doesn’t, more on this later) • Thus, when you receive an ack for the ISN of this connection then you know the ack is not from an old connection
Connection Tear-down Features • Allow unilateral close • TCP must continue to receive data even after closing the forward connection • Timed-Wait at the end of the connection • Cannot forget about a connection immediately • Ensures both end-points are closed • Timed-wait is performed by the party which initiated the disconnection (usually, not always, client)
Connection Tear-Down (closing a connection) • Step 1: client end system sends TCP FIN control segment to server • Step 2: server receives FIN, replies with ACK. Closes connection, sends FIN. • Step 3: client receives FIN, replies with ACK. Enters “timed wait”: will respond with ACK to received FINs • Step 4: server receives ACK. Connection closed. [Figure: client closes and sends FIN=1; server replies ACK=1, closes, and sends FIN=1 (the ACK and FIN can be combined into one segment); client replies ACK=1 and enters timed wait, after which the connection is closed]
Timed-Wait (are you sure the ack was received?) • Timed-wait gets reset to 2*MSL with each new FIN received [Figure: client closes and sends FIN=1; server replies ACK=1, closes, and sends FIN=1 (can be combined into one segment); client replies ACK=1 and enters timed wait; a second FIN=1 arrives during timed wait and is answered with another ACK=1; closed]
Need for Timed-wait: clean connection close (book) [Figure: Client/Server message sequence FIN, ACK, FIN, ACK, SYN, FIN; the client thinks the second connection has been terminated]
Need for Timed-wait: clean connection close (me) [Figure: Sender/Receiver message sequence FIN, ACK, FIN (connection officially closed, normally), ACK, FIN, RST (connection closed abnormally)]
Each side is closed separately [Figure: message sequence FIN, ACK, Data send, Data ack, FIN, ACK]
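The reset rule can be sketched with a small timer object. The `TimedWait` class and its methods are hypothetical, time is passed in explicitly as seconds, and the MSL value of 120 seconds is the 2-minute maximum segment lifetime quoted on the wrap-around slide.

```python
MSL_SECONDS = 120  # assumption: 2-minute max. segment lifetime, as quoted earlier

class TimedWait:
    """Toy timed-wait state for the side that initiated the close."""

    def __init__(self, now):
        self.expires = now + 2 * MSL_SECONDS

    def on_fin(self, now):
        # A repeated FIN means our ACK may have been lost:
        # ACK again and restart the 2*MSL wait.
        self.expires = now + 2 * MSL_SECONDS
        return "ACK"

    def is_closed(self, now):
        return now >= self.expires

tw = TimedWait(now=0)
print(tw.is_closed(239))   # still waiting
print(tw.on_fin(now=100))  # retransmitted FIN arrives, timer restarts
print(tw.is_closed(240))   # would have expired without the reset
print(tw.is_closed(340))   # closed 2*MSL after the last FIN
```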