360 likes | 578 Views
Transport Layer. Michalis Faloutsos Many slides from Kurose-Ross. Transport Layer Functionality. Hide network from application layer Transport layer resides at end points Sees the network as a black box. Transport Layers of the Internet. TCP: reliable protocol
E N D
Transport Layer Michalis Faloutsos Many slides from Kurose-Ross
Transport Layer Functionality • Hide network from application layer • Transport layer resides at end points • Sees the network as a black box
Transport Layers of the Internet • TCP: reliable protocol • Guarantees end-to-end delivery • Self-controls rate: congestion and flow control • Connection oriented: handshake, state • Ordered delivery of packets to application • UDP: unreliable protocol • Non-regulated sending rate • Multiplexing-demultiplexing
full duplex data: bi-directional data flow in same connection MSS: maximum segment size connection-oriented: handshaking (exchange of control msgs) init’s sender, receiver state before data exchange flow controlled: sender will not overwhelm receiver point-to-point: one sender, one receiver reliable, in-order byte steam: no “message boundaries” pipelined: TCP congestion and flow control set window size send & receive buffers TCP: What and How For more: RFCs: 793, 1122, 1323, 2018, 2581
32 bits source port # dest port # sequence number acknowledgement number head len not used rcvr window size U A P R S F checksum ptr urgent data Options (variable length) application data (variable length) TCP segment structure URG: urgent data (generally not used) counting by bytes of data (not segments!) ACK: ACK # valid PSH: push data now (generally not used) # bytes rcvr willing to accept RST, SYN, FIN: connection estab (setup, teardown commands) Internet checksum (as in UDP)
TCP overview • TCP is a sliding window protocol • Sender can have (Window) bytes in flight • Operates with cumulative ACKs • It includes control for the sending rate • Flow control: receiver-set sending rate • Congestion control: network-aware sending rate Congwin
Seq. #’s: byte stream “number” of first byte in segment’s data ACKs: seq # of next byte expected from other side cumulative ACK Q: how receiver handles out-of-order segments A: TCP spec doesn’t say, - up to implementor time TCP seq. #’s and ACKs Host B Host A User types ‘C’ Seq=42, ACK=79, data = ‘C’ host ACKs receipt of ‘C’, echoes back ‘C’ Seq=79, ACK=43, data = ‘C’ host ACKs receipt of echoed ‘C’ Seq=43, ACK=80 simple telnet scenario
TCP in a nutshell I. Slow start phase (actually this is fast increase) • Start with a window of 1 (or 2) • Successful ACK: Increase window by one 1 max size segment • Do this up to a threshold: sshthresh II. Congestion control phase • Increase window by 1 max size segment every RTT • Drop window in half, if there is congestion • Packet loss: duplicate ACKs • Time expiration
end-end control (no network assistance) transmission rate limited by congestion window size, Congwin, over segments: w * MSS throughput = Bytes/sec RTT TCP Congestion Control Congwin w segments, each with MSS bytes sent in one RTT:
TCP is “probing” for usable bandwidth: ideally: transmit as fast as possible (Congwin as large as possible) without loss increaseCongwin until loss (congestion) loss: decreaseCongwin, then begin probing (increasing) again TCP congestion control: Intuition
TCP has two “phases” slow start: start from small, increase quickly congestion avoidance: Additive Increase Multiplicative Decrease important variables: Congwin threshold: defines threshold between two slow start phase, congestion control phase TCP congestion control:
exponential increase (per RTT) in window size loss event: timeout (Tahoe TCP) and/or or three duplicate ACKs (Reno TCP) Slowstart algorithm time TCP Slowstart Host A Host B one segment RTT initialize: Congwin = 1 for (each segment ACKed) Congwin++ until (loss event OR CongWin > threshold) two segments four segments
Why Call it Slow Start ? • The original version of TCP suggested that the sender transmit as much as the Advertised Window permitted. • Routers may not be able to cope with this “burst” of transmissions. • Slow start is slower than the above version -- ensures that a transmission burst does not happen at once.
TCP Congestion Avoidance Congestion avoidance /* slowstart is over */ /* Congwin > threshold */ Until (loss event) { every w segments ACKed: Congwin++ } threshold = Congwin/2 Congwin = 1 perform slowstart 1 1: TCP Reno skips slowstart (fast recovery) after three duplicate ACKs
Remember: bytes vs packets! CW += MSS * MSS/CW Thres = Max( 2* MSS, InFlightData/2) MSS: max segment size InFlighData: un-ACK-ed data RFC 2581: TCP Congestion Control TCP Congestion: Real Life is Hairy! Congestion avoidance /* slowstart is over */ /* Congwin > threshold */ Until (loss event) { every w segments ACKed: Congwin++ } threshold = Congwin/2 Congwin = 1 perform slowstart 1
Fairness goal: if N TCP sessions share same bottleneck link, each should get 1/N of link capacity TCP congestion avoidance: AIMD:additive increase, multiplicative decrease increase window by 1 per RTT decrease “window” by factor of 2 on loss event TCP Fairness and AIMD TCP connection 1 bottleneck router capacity R TCP connection 2
Two competing sessions: Additive increase gives slope of 1, as throughout increases multiplicative decrease decreases throughput proportionally Why is TCP fair? equal bandwidth share R loss: decrease window by factor of 2 congestion avoidance: additive increase Connection 2 throughput loss: decrease window by factor of 2 congestion avoidance: additive increase Connection 1 throughput R
Macroscopic Description of Throughput • Assume window toggling: W/2 to W • High rate: W * MSS / RTT • Low rate: W * MSS / 2 RTT • Rate increase is linearly between two extremes • Average throughput: • 0.75 * W * MSS / RTT
TCP: reliable data transfer event: data received from application above Simplified sender, assuming create, send segment • one way data transfer • no flow, congestion control wait for event event: timer timeout for segment with seq # y wait for event retransmit segment event: ACK received, with ACK # y ACK processing
TCP sender 00sendbase = initial_sequence number 01 nextseqnum = initial_sequence number 02 03 loop (forever) { 04 switch(event) 05 event: data received from application above (not exceeding window) 06 create TCP segment with sequence number nextseqnum 07 start timer for segment nextseqnum 08 pass segment to IP 09 nextseqnum = nextseqnum + length(data) 10 event: timer timeout for segment with sequence number y 11 retransmit segment with sequence number y 12 compute new timeout interval for segment y 13 restart timer for sequence number y 14 event: ACK received, with ACK field value of y 15 if (y > sendbase) { /* cumulative ACK of all data up to y */ 16 cancel all timers for segments with sequence numbers < y 17 sendbase = y 18 } 19 else { /* a duplicate ACK for already ACKed segment */ 20 increment number of duplicate ACKs received for y 21 if (number of duplicate ACKS received for y == 3) { 22 /* TCP fast retransmit */ 23 resend segment with sequence number y 24 restart timer for segment y 25 } 26 } /* end of loop forever */ Simplified TCP sender
TCP Receiver: ACK generation[RFC 1122, RFC 2581] TCP Receiver action delayed ACK. Wait up to 500ms for next segment. If no next segment, send ACK immediately send single cumulative ACK send duplicate ACK, indicating seq. # of next expected byte immediate ACK if segment starts at lower end of gap Event in-order segment arrival, no gaps, everything else already ACKed in-order segment arrival, no gaps, one delayed ACK pending out-of-order segment arrival higher-than-expect seq. # gap detected arrival of segment that partially or completely fills gap
Host A Host B Seq=92, 8 bytes data ACK=100 timeout X loss Seq=92, 8 bytes data ACK=100 time time lost ACK scenario TCP: retransmission scenarios Host A Host B Seq=92, 8 bytes data Seq=100, 20 bytes data Seq=92 timeout ACK=100 ACK=120 Seq=100 timeout Seq=92, 8 bytes data ACK=120 premature timeout, cumulative ACKs
Q: how to set TCP timeout value? longer than RTT note: RTT will vary too short: premature timeout unnecessary retransmissions too long: slow reaction to segment loss Q: how to estimate RTT? SampleRTT: measured time from segment transmission until ACK receipt ignore retransmissions, cumulatively ACKed segments SampleRTT will vary, want estimated RTT “smoother” use several recent measurements, not just current SampleRTT TCP Round Trip Time and Timeout
Setting the timeout EstimtedRTT plus “safety margin” large variation in EstimatedRTT -> larger safety margin TCP Round Trip Time and Timeout EstimatedRTT = (1-x)*EstimatedRTT + x*SampleRTT Exponential weighted moving average influence of given sample decreases exponentially fast typical value of x: 0.1 Timeout = EstimatedRTT + 4*Deviation Deviation = (1-x)*Deviation + x*|SampleRTT-EstimatedRTT|
A problem • When there are retransmissions, it is unclear if the ACK is for the original transmission or for a retransmission. • How do we overcome this ?
The Karn Patridge Algorithm • Take SampleRTT measurements only for segments that have been sent once ! • This eliminates the possibility that wrong RTT estimates are factored into the estimation. • Another change -- Each time TCP retransmits, it sets the next timeout to 2 X Last timeout --> This is called the Exponential Back-off (primarily for avoiding congestion).
Jacobson Karels Algorithm • An issue with the Karn/Patridge scheme is that it does not take into account the variation between RTT samples. • New method proposed -- the Jacobson Karels Algorithm. • Estimated RTT = Estimated RTT + d X Difference • Difference = Sample RTT - Estimated RTT • Deviation = Deviation + d (|Difference| - deviation) • Timeout = m Estimated RTT + f deviation. • The values of m and f are computed based on experience -- Typically m = 1 and f = 4.
Silly Window Syndrome • Suppose a MSS worth of data is collected and advertised window is MSS/2. • What should the sender do ? -- transmit half full segments or wait to send a full MSS when window opens ? • Early implementations were aggressive -- transmit MSS/2. • Aggressively doing this, would consistently result in small segment sizes -- called the Silly Window Syndrome.
Issues .. • We cannot eliminate the possibility of small segments being sent. • However, we can introduce methods to coalesce small chunks. • Delaying ACKs -- receiver does not send ACKs as soon as it receives segments. • How long to delay ? • Ultimate solution falls to the sender -- when should I transmit ?
Nagle’s Algorithm • If sender waits too long --> bad for interactive connections. • If it does not wait long enough -- silly window syndrome. • How do we solve this? • Timer -- clock based • If both available data and Window ≥ MSS, send full segment. • Else, if there is unACKed data in flight, buffer new data until ACK returns. • Else, send new data now. • Note -- Socket interface allows some applications to turn off Nagle’s algorithm by setting the TCP-NODELAY option.
Recall:TCP sender, receiver establish “connection” before exchanging data segments initialize TCP variables: seq. #s buffers, flow control info (e.g. RcvWindow) client: connection initiator Socket clientSocket = new Socket("hostname","port number"); server: contacted by client Socket connectionSocket = welcomeSocket.accept(); TCP Connection Management
TCP Set-up Three way handshake: Step 1:client end system sends TCP SYN control segment to server • specifies initial seq # Step 2:server end system receives SYN, replies with SYNACK control segment • ACKs received SYN • allocates buffers • specifies server-> receiver initial seq. # Step 3: Client replies with an ACK (using servers seq number)
Closing a connection: client closes socket:clientSocket.close(); Step 1:client end system sends TCP FIN control segment to server Step 2:server receives FIN, replies with ACK. Closes connection, sends FIN. Last ACK is never ACK-ed!! client server close FIN ACK close FIN ACK timed wait closed TCP Connection Management (cont.)
Step 3:client receives FIN, replies with ACK. Enters “timed wait” - will respond with ACK to received FINs Step 4:server, receives ACK. Connection closed. Sends FIN. Last ACK is never ACK-ed TCP Connection Management (cont.) client server closing FIN ACK closing FIN ACK timed wait closed closed