Chapter 7 Internet Transport Protocols. Our goals: understand principles behind transport layer services: Multiplexing / demultiplexing data streams of several applications reliable data transfer flow control congestion control. Chapter 6: rdt principles Chapter 7: multiplex/ demultiplex
Transport Layer Our goals: understand principles behind transport layer services: Multiplexing /demultiplexing data streams of several applications reliable data transfer flow control congestion control Chapter 6: rdt principles Chapter 7: multiplex/ demultiplex Internet transport layer protocols: UDP: connectionless transport TCP: connection-oriented transport connection setup data transfer flow control congestion control Transport Layer 2
Transport vs. network layer • Transport layer uses network layer services • adds value to these services
receive segment from L3deliver each received segment to the right socket gather data from multiple sockets, envelop data with headers (later used for demultiplexing), pass to L3 = socket = process application application application transport transport transport P3 P1 P2 P1 P4 network network network link link link physical physical physical Multiplexing at send host: Demultiplexing at rcv host: Multiplexing/demultiplexing host 3 host 2 host 1
host receives IP datagrams each datagram has source IP address, destination IP address in its header used by network to get it there each datagram carries one transport-layer segment each segment has source, destination port number in its header host uses port #s(*) to direct segment to correct socket from socket data gets to the relevant application process How demultiplexing works 32 bits source IP addr dest IP addr. L3 hdr other IP header fields source port # dest port # L4 header other header fields application data (message) appl. msg TCP/UDP segment format (*) to find a TCP socket on server, source & dest. IP address is also needed, see details later
Processes create sockets with port numbers a UDP socket is identified by a pair of numbers: (my IP address , my port number) Client decides to contact: a server (peer IP-address) an application ( “WKP”) puts those into the UDP packet sent, written as: destIP address - in the IP header of the packet dest port number- in its UDP header When server receives a UDP segment: checks destination portnumber in segment directs UDP segment to the socket with that port number single server socket per application type (packets from different remote sockets directed to same socket) msg waits in socket queue and processed in its turn. answer sent to the client socket (listed in Source fields of query packet) Connectionless demultiplexing (UDP) Realtime UDP applications have individual server sockets per client. However their port numbers are distinct, since they are coordinated in advance by some signaling protocol. This is possible since port number is not used to specify the application.
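The UDP demultiplexing rule above (look only at the destination port, queue the message in the one socket bound to it) can be sketched as a toy model. The class `UdpDemux` and its methods `bind` and `deliver` are hypothetical names invented for this illustration; they are not a real socket API.

```python
from collections import deque


class UdpDemux:
    """Toy model of UDP demultiplexing: one socket queue per local port."""

    def __init__(self):
        # local port -> queue of (source IP, source port, payload)
        self.sockets = {}

    def bind(self, port):
        """Create a socket (here just a queue) for the given local port."""
        self.sockets[port] = deque()

    def deliver(self, src_ip, src_port, dst_port, payload):
        """Direct a segment to the socket bound to dst_port, if any."""
        queue = self.sockets.get(dst_port)
        if queue is None:
            return False  # no socket on this port: the segment is dropped
        # Source fields are kept as the "return address" for the reply.
        queue.append((src_ip, src_port, payload))
        return True


server = UdpDemux()
server.bind(53)
server.deliver("A", 9157, 53, b"query-1")
server.deliver("B", 5775, 53, b"query-2")
print(list(server.sockets[53]))
```

Both clients land in the same queue, which mirrors the single server socket per application type described on the slide.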
client socket: port=5775, IP=B client socket: port=9157, IP=A server socket: port=53, IP = C S-IP: C S-IP: C D-IP: B D-IP: A SP: 53 SP: 53 DP: 9157 DP: 5775 message S-IP: B Client IP:B D-IP: C S-IP: A D-IP: C SP: 9157 P2 P3 P1 DP: 53 Connectionless demux (cont) L5 L4 Reply Reply L3 L2 message message L1 Wait for application server IP: C client IP: A Getting Service SP: 5775 Getting Service DP: 53 SP = Source port number DP= Destination port number S-IP= Source IP Address D-IP=Destination IP Address message IP-Header UDP-Header SP and S-IP provide “return address”
TCP socket identified by 4-tuple: local (my) IP address local (my) port number remote (peer) IP address remote (peer) port # host receiving a packet uses all four values to direct the segment to appropriate socket Connection-oriented demux (TCP) • Server host may support many simultaneous TCP sockets: • each socket identified by its own 4-tuple • Web server dedicates a different socket to each connecting client • If you open two browser windows, you generate 2 sockets at each end
client socket: LP= 9157, L-IP= A RP= 80 , R-IP= C server socket: LP= 80 , L-IP= C RP= 9157, R-IP= A server socket: LP= 80 , L-IP= C RP= 5775, R-IP= B packet: S-IP: B D-IP: C message packet: SP: 5775 S-IP: A packet: P6 P2 P4 P5 P1 P1 P3 H3 DP: 80 D-IP: C S-IP: B SP: 9157 H4 D-IP: C DP: 80 server socket: LP= 80 , L-IP= C RP= 9157, R-IP= B client socket: LP= 9157, L-IP= B RP= 80 , R-IP= C client socket: LP= 5775, L-IP= B RP= 80 , R-IP= C SP: 9157 message DP: 80 message Connection-oriented demux (cont) L5 L4 L3 L2 L1 server IP: C Client IP: B client IP: A LP= Local Port , RP= Remote Port L-IP= Local IP , R-IP= Remote IP “L”= Local = My“R”= Remote = Peer
Connection-oriented Sockets
• Client socket has a port number unique within its host
• a packet for a client socket is directed by the host OS based on the destination port only
• each server application has an always-active waiting socket; that socket receives all packets not belonging to any established connection; these are the packets that open new connections
• when the waiting socket accepts a 'new connection' segment, a new socket is generated at the server with the same port number; this is the working socket for that connection
• subsequent segments arriving at the server on that connection are directed to the working socket
• the working socket is identified using all 4 identifiers
• the previous slide shows working sockets on the server side
Note: Client IP + Client Port are globally unique
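A minimal sketch of the TCP lookup described above: try the full 4-tuple first (working socket), and fall back to the waiting socket for segments that belong to no established connection. `TcpDemux`, `listen`, `accept` and `lookup` are hypothetical names for this illustration, and sockets are represented by plain strings.

```python
class TcpDemux:
    """Toy model of TCP demultiplexing on one host."""

    def __init__(self):
        # (local IP, local port, remote IP, remote port) -> working socket
        self.connections = {}
        # (local IP, local port) -> waiting socket
        self.listeners = {}

    def listen(self, ip, port, name):
        self.listeners[(ip, port)] = name

    def accept(self, local_ip, local_port, remote_ip, remote_port, name):
        key = (local_ip, local_port, remote_ip, remote_port)
        self.connections[key] = name

    def lookup(self, src_ip, src_port, dst_ip, dst_port):
        """Return the socket that should receive an arriving segment."""
        key = (dst_ip, dst_port, src_ip, src_port)
        if key in self.connections:
            return self.connections[key]  # established connection
        # Otherwise the segment can only open a new connection.
        return self.listeners.get((dst_ip, dst_port))


host_c = TcpDemux()
host_c.listen("C", 80, "waiting")
host_c.accept("C", 80, "A", 9157, "working-1")
host_c.accept("C", 80, "B", 9157, "working-2")
print(host_c.lookup("A", 9157, "C", 80))  # same port 9157, different IP
print(host_c.lookup("B", 9157, "C", 80))
print(host_c.lookup("B", 5775, "C", 80))  # no connection yet
```

Two clients using the same source port are still separated, because the remote IP address is part of the key.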
simple transport protocol “best effort” service, UDP segments may be: lost delivered out of order to application with no correction by UDP UDP will discard bad checksum segments if so configured by application connectionless: no handshaking between UDP sender, receiver each UDP segment handled independently of others Why is there a UDP? no connection establishment saves delay no congestion control: better delay & BW simple: less memory & RT small segment header UDP: User Datagram Protocol [RFC 768] • typical usage: realtime appl. • loss tolerant • rate sensitive • other uses (why?): • DNS • SNMP
UDP segment structure
• Header (32 bits per row): source port #, dest port #; length (total length of segment in bytes), checksum; followed by application data (variable length)
• Checksum computed over:
• the whole segment, plus
• part of IP header (the pseudo-header): both IP addresses, protocol field, UDP length
• Checksum usage:
• computed at destination to detect errors
• on error, discard segment
• checksum is optional
• if not used, sender puts checksum = all zeros
• a computed checksum of zero is sent as all ones
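The checksum rules can be illustrated with a short sketch of the 16-bit one's-complement Internet checksum. The pseudo-header layout used here (source address, destination address, a zero byte, protocol number 17, UDP length) follows RFC 768 for IPv4; the function names `ones_complement_sum`, `udp_checksum` and `verify` are made up for this example.

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum, padding odd-length input with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def pseudo_header(src_ip: bytes, dst_ip: bytes, udp_len: int) -> bytes:
    return src_ip + dst_ip + struct.pack("!BBH", 0, 17, udp_len)


def udp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum of a segment whose checksum field is currently zero."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    checksum = ~ones_complement_sum(data) & 0xFFFF
    return checksum or 0xFFFF  # a computed zero is sent as all ones


def verify(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Receiver side: the sum over everything must be all ones."""
    data = pseudo_header(src_ip, dst_ip, len(segment)) + segment
    return ones_complement_sum(data) == 0xFFFF


src, dst = bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2])
payload = b"hello"
length = 8 + len(payload)
csum = udp_checksum(src, dst, struct.pack("!HHHH", 9157, 53, length, 0) + payload)
segment = struct.pack("!HHHH", 9157, 53, length, csum) + payload
print(verify(src, dst, segment))
```

Flipping any byte of `segment` makes `verify` fail, which is the case where the receiver discards the segment.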
TCP: Overview (RFCs: 793, 1122, 1323, 2018, 2581)
• point-to-point: one sender, one receiver; works between sockets
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set window size
• send & receive buffers
• full duplex data: bi-directional data flow in same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) init's sender, receiver state before data exchange
• flow controlled: sender will not overwhelm receiver
32 bits source port # dest port # sequence number acknowledgement number head len not used rcvr window size R S F A U P checksum ptr urgent data Options (variable length) application data (variable length) TCP segment structure hdr length in 32 bit words FLAGS counting by bytes of data (not segments!) ACK: ACK # valid PSH, URG seldom used not clearly defined URG: indicates startof urgent data # bytes rcvr willing to accept PSH: indicates urgent data ends in this segm. ptr = end urgent data SYN: initialize conn., synchronize SN FIN: I wish to disconn. RST: break conn. immediately Internet checksum (as in UDP)
TCP sequence # (SN) and ACK # (AN)
• SN: byte-stream "number" of the first byte in the segment's data
• AN: SN of the next byte expected from the other side; it is a cumulative ACK
• Qn: how does the receiver handle out-of-order segments? It puts them in the receive buffer but does not acknowledge them
Simple data transfer scenario between Host A and Host B (some time after conn. setup):
• host A sends 100 data bytes: SN=42, AN=79, 100 data bytes
• host B ACKs 100 bytes and sends 50 data bytes: SN=79, AN=142, 50 data bytes
• host A ACKs receipt of data, sends no data (WHY?): SN=142, AN=129, no data
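The numbers in this scenario follow from one rule: the AN sent back equals the SN of the received segment plus the number of data bytes it carried. A tiny sketch, with the helper name `next_ack` chosen only for this example:

```python
def next_ack(seq: int, data_len: int) -> int:
    """AN to send after receiving data_len in-order bytes starting at seq."""
    return seq + data_len


a_seq, b_seq = 42, 79

# A -> B: SN=42, 100 data bytes, so B answers with AN = 142.
an_from_b = next_ack(a_seq, 100)

# B -> A: SN=79, 50 data bytes, so A answers with AN = 129.
an_from_a = next_ack(b_seq, 50)

# A's pure ACK carries no data; its SN is simply the next unused byte number.
a_next_seq = a_seq + 100

print(an_from_b, an_from_a, a_next_seq)
```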
Connection Setup: Objective Agree on initial sequence numbers a sender should not reuse a seq# before it is sure that all packets with the seq# are purged from the network the network guarantees that a packet too old will be purged from the network: network bounds the life time of each packet To avoid waiting for them to disappear, choose initial SN (ISN) far away from previous session needs connection setup so that the sender tells the receiver initial seq# Agree on other initial parameters e.g. Maximum Segment Size
TCP Connection Management
Setup: establish connection between the hosts before exchanging data segments
• called: 3 way handshake
• initialize TCP variables: seq. #s; buffers, flow control info (e.g. RcvWindow)
• client: connection initiator; opens socket and cmds OS to connect it to server
• server: contacted by client; has waiting socket; accepts connection; generates working socket
Teardown: end of connection (we skip the details)
Three way handshake:
• Step 1: client host sends TCP SYN segment to server; specifies initial seq # (ISN); no data
• Step 2: server host receives SYN, replies with SYNACK segment (also no data); allocates buffers; specifies server initial SN & window size
• Step 3: client receives SYNACK, replies with ACK segment, which may contain data
B A SYN , SN = X SYNACK , SN = Y, AN = X+1 ACK , SN = X+1 , AN = Y+1 TCP Three-Way Handshake (TWH) X+1 Y+1 Send Buffer Send Buffer Y+1 X+1 Receive Buffer Receive Buffer
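The SN/AN values exchanged in the handshake can be traced with a short sketch. Segments are modeled as plain dictionaries, and the function name `three_way_handshake` is an assumption of this example, not part of any TCP API.

```python
def three_way_handshake(client_isn: int, server_isn: int):
    """Return the three handshake segments as dictionaries."""
    # Step 1: client announces its initial sequence number X.
    syn = {"flags": "SYN", "SN": client_isn}
    # Step 2: server announces Y and acknowledges X (the SYN consumes one number).
    synack = {"flags": "SYN+ACK", "SN": server_isn, "AN": syn["SN"] + 1}
    # Step 3: client acknowledges Y; its own SN is now X+1.
    ack = {"flags": "ACK", "SN": synack["AN"], "AN": synack["SN"] + 1}
    return [syn, synack, ack]


segments = three_way_handshake(client_isn=1000, server_isn=5000)
for segment in segments:
    print(segment)
```

After the exchange each side's send buffer starts at its own ISN + 1 and its receive buffer at the peer's ISN + 1, matching the X+1 and Y+1 labels in the figure.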
Connection Close Objective of closure handshake: each side can release resource and remove state about the connection Close the socket FIN I am done. Are you done too? client server initial close : release resource? no data fromclient close release resource FIN : I am done too. Goodbye! close release resource
TCP creates reliable service on top of IP’s unreliable service pipelined segments cumulative acks single retransmission timer receiver accepts out of order segments but does not acknowledge them Retransmissions are triggered by timeout events in some versions of TCP also by triple duplicate ACKs (see later) Initially consider simplified TCP sender: ignore flow control, congestion control TCP reliable data transfer
data rcvd from app: create segment with seq # seq # is byte-stream number of first data byte in segment start timer if not already running (timer relates to oldest unACKed segment) expiration interval: TimeOutInterval timeout (*): retransmit segment that caused timeout restart timer ACK rcvd: if ACK acknowledges previously unACKed segments update what is known to be ACKedNote: Ack is cumulative start timer if there are outstanding segments TCP sender events: (*) retransmission done also on triple duplicate Ack (see later)
TCP sender (simplified)
NextSeqNum = InitialSeqNum
SendBase = InitialSeqNum
loop (forever) {
  switch(event)
  event: data received from application above
    if (NextSeqNum - SendBase < N) then {
      create TCP segment with sequence number NextSeqNum
      if (timer currently not running)
        start timer
      pass segment to IP
      NextSeqNum = NextSeqNum + length(data)
    }
    else reject data /* in truth: keep in send buffer until new Ack */
  event: timer timeout
    retransmit not-yet-acknowledged segment with smallest sequence number
    start timer
  event: ACK received, with ACK field value of y
    if (y > SendBase) {
      SendBase = y
      if (there are currently not-yet-acknowledged segments)
        start timer
    }
} /* end of loop forever */
• Comment: SendBase-1 is the last cumulatively ACKed byte; N is the window size
• Example: SendBase-1 = 71; y = 73, so the rcvr wants 73+; y > SendBase, so that new data is ACKed
application takes data: free the room in buffer give the freed cells new numbers circular numbering WIN increases by the number of bytes taken TCP actions on receiver events: data rcvd from IP: • if Checksum fails, ignore segment • If checksum OK, then : • if data came in order: • update AN &WIN, as follows: • AN grows by the number of new in-order bytes • WIN decreases by same # • if data out of order: • Put in buffer, but don’t count it for AN/ WIN
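The receiver rules above can be sketched as a small state machine that tracks AN and WIN. This is a simplification under stated assumptions: checksums are taken to be valid, segments never overlap, and nothing checks that an arriving segment fits in the buffer. The class `Receiver` and its method names are hypothetical.

```python
class Receiver:
    """Toy TCP receiver tracking AN (next expected byte) and WIN."""

    def __init__(self, initial_an: int, buffer_size: int):
        self.an = initial_an        # next in-order byte expected
        self.read_ptr = initial_an  # first byte not yet taken by the application
        self.buffer_size = buffer_size
        self.out_of_order = {}      # seq -> length of buffered out-of-order segments

    @property
    def win(self) -> int:
        # Out-of-order data is not counted: only in-order, unread bytes use up WIN.
        return self.buffer_size - (self.an - self.read_ptr)

    def on_segment(self, seq: int, length: int):
        if seq == self.an:
            self.an += length
            # A filled gap may make buffered segments in-order as well.
            while self.an in self.out_of_order:
                self.an += self.out_of_order.pop(self.an)
        elif seq > self.an:
            self.out_of_order[seq] = length  # buffer, but do not advance AN
        return self.an, self.win

    def on_app_read(self, nbytes: int) -> int:
        nbytes = min(nbytes, self.an - self.read_ptr)
        self.read_ptr += nbytes  # freed room increases WIN
        return self.win


rcv = Receiver(initial_an=100, buffer_size=1000)
print(rcv.on_segment(100, 20))  # in order
print(rcv.on_segment(140, 20))  # out of order: AN and WIN unchanged
print(rcv.on_segment(120, 20))  # fills the gap: AN jumps past buffered data
print(rcv.on_app_read(30))
```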
TCP: retransmission scenarios SN=92, 8 bytes data SN=92, 8 bytes data AN=100 AN=100 X SN=100 , 20 bytes data loss TIMEOUT AN=120 SN=92, 8 bytes data stop timer time timer setting actual timer run Host A Host B Host A Host B start timer for SN 92 start timer for SN 92 stop timer start timer for SN 100 start timer for new SN 92 stop timer NO timer AN=100 A. normal scenario NO timer time B. lost ACK + retransmission
TCP retransmission scenarios (more) SN=92, 8 bytes data SN=92, 8 bytes data SN=100, 20 bytes data AN=100 SN=100, 20 bytes data X AN=100 AN=120 loss TIMEOUT SN=92, 8 bytes data stop timer AN=120 stop stop AN=120 time Host A Host B Host A Host B start timer for SN 92 start timer for SN 92 start for 92 DROP ! start for 100 NO timer NO timer time redundant ACK C. lost ACK, NO retransmission D. premature timeout Transport Layer 7-28
TCP ACK generation (Receiver rules) [RFC 1122, RFC 2581]
• Event: arrival of in-order segment with expected seq #; all data up to expected seq # already ACKed. Action: delayed ACK. Wait up to 500ms for next segment. If no data segment to send, then send ACK
• Event: arrival of in-order segment with expected seq #; one other segment has ACK pending. Action: immediately send single cumulative ACK, ACKing both in-order segments
• Event: arrival of out-of-order segment with higher-than-expected seq. #; gap detected. Action: immediately send duplicate ACK, indicating seq. # of next expected byte. This ACK carries no data & no new WIN
• Event: arrival of segment that partially or completely fills gap. Action: immediately send ACK, provided that segment starts at lower end of 1st gap
Fast Retransmit (Sender Rules)
• time-out period often relatively long: causes long delay before resending lost packet
• idea: detect lost segments via duplicate ACKs
• sender often sends many segments back-to-back
• if segment is lost, there will likely be many duplicate ACKs for that segment
• Rule: if sender receives 4 ACKs for same data (= 3 duplicates), it assumes that the segment after the ACKed data was lost
• fast retransmit: resend segment immediately (before timer expires)
Fast Retransmit scenario Host A Host B seq # x1 seq # x2 seq # x3 X ACK # x2 seq # x4 seq # x5 ACK # x2 * no data in segment * no window change ACK # x2 ACK # x2 triple duplicate ACKs resend seq X2 timeout time Transport Layer
Fast retransmit algorithm:
event: ACK received, with ACK field value of y
  if (y > SendBase) {
    SendBase = y
    if (there are currently not-yet-acknowledged segments)
      start timer
  }
  else {   /* a duplicate ACK for already ACKed segment */
    if (segment carries no data & doesn't change WIN)
      increment count of dup ACKs received for y
    if (count of dup ACKs received for y = 3) {   /* fast retransmit */
      resend segment with sequence number y
      count of dup ACKs received for y = 0
    }
  }
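The duplicate-ACK counting can be sketched in a few lines. Timers and actual transmission are left out; retransmissions are only recorded in a list. The class name `FastRetransmitSender` and the flags `carries_data` and `changes_win` are assumptions of this example, and resetting the counter when new data is ACKed is a simplification of "count per value of y".

```python
class FastRetransmitSender:
    """Toy sender that reacts to ACKs and counts duplicates."""

    def __init__(self, send_base: int):
        self.send_base = send_base
        self.dup_acks = 0
        self.retransmitted = []  # sequence numbers resent by fast retransmit

    def on_ack(self, y: int, carries_data: bool = False, changes_win: bool = False):
        if y > self.send_base:
            # New data is acknowledged: slide the window, forget old duplicates.
            self.send_base = y
            self.dup_acks = 0
        elif y == self.send_base and not carries_data and not changes_win:
            # A pure duplicate ACK for already ACKed data.
            self.dup_acks += 1
            if self.dup_acks == 3:
                self.retransmitted.append(y)  # resend segment starting at y
                self.dup_acks = 0


sender = FastRetransmitSender(send_base=100)
for ack in (200, 200, 200, 200):  # one new ACK followed by three duplicates
    sender.on_ack(ack)
print(sender.retransmitted)
```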
Q: how to set TCP timeout interval? should be longer than RTT but: RTT will vary if too short: premature timeout unnecessary retransmissions if too long: slow reaction to segment loss Set timeout = average + safe margin : General idea Timeout Interval Average margin
Estimating Round Trip Time • SampleRTT: measured time from segment transmission until receipt of ACK for it • SampleRTT will vary, want a "smoother" estimated RTT: use several recent measurements, not just current SampleRTT • EstimatedRTT = (1 - α)*EstimatedRTT + α*SampleRTT • Exponential weighted moving average • influence of past sample decreases exponentially fast • typical value: α = 0.125
Setting Timeout
• Problem: using the average of SampleRTT will generate many timeouts due to network variations
• Solution: EstimatedRTT plus "safety margin"; large variation in EstimatedRTT -> requires larger safety margin
• Estimate average deviation of RTT: DevRTT = (1 - β)*DevRTT + β*|SampleRTT - EstimatedRTT| (typically, β = 0.25)
• Then set timeout interval: TimeoutInterval = EstimatedRTT + 4*DevRTT
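The two moving averages and the timeout formula can be combined into a small estimator. Two details are assumptions borrowed from RFC 6298 rather than from the slides: the deviation starts at half of the first sample, and it is updated before the average so that it uses the previous EstimatedRTT. The class name `RttEstimator` is made up for this sketch.

```python
class RttEstimator:
    """EWMA-based RTT estimator producing a retransmission timeout."""

    def __init__(self, first_sample: float, alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha
        self.beta = beta
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2  # assumed initial value (RFC 6298 style)

    @property
    def timeout_interval(self) -> float:
        return self.estimated_rtt + 4 * self.dev_rtt

    def update(self, sample_rtt: float) -> float:
        # Deviation first, so it is measured against the old average.
        self.dev_rtt = ((1 - self.beta) * self.dev_rtt
                        + self.beta * abs(sample_rtt - self.estimated_rtt))
        self.estimated_rtt = ((1 - self.alpha) * self.estimated_rtt
                              + self.alpha * sample_rtt)
        return self.timeout_interval


estimator = RttEstimator(first_sample=100.0)  # milliseconds
for sample in (100.0, 200.0):
    print(sample, estimator.update(sample))
```

A single slow sample raises the timeout mostly through DevRTT, which is the "safety margin" reacting to variation.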
TCP at A sends data to B The picture below shows the TCP receive-buffer at B flow control matches the send rate of A to the receiving application’s drain rate at B Receive buffer size set by OS at connection init WIN = window size = number bytes A may send starting at AN flow control sender won’t overflow receiver’s buffer by transmitting too much, too fast TCP Flow Control: Simple Case AN Receive Buffer data from IP TCP datain buffer spare room (sent by TCP at A) data taken by application WIN node B : Receive process • application process at B may be slow at reading from buffer
TCP Flow control: General Case
Formulas:
• AN = first byte not received yet; sent to A in TCP header
• AckedRange = AN - FirstByteNotReadByAppl = # bytes rcvd in sequence & not taken
• WIN = RcvBuffer - AckedRange = "SpareRoom"
• AN and WIN sent to A in TCP header
• Data received out of sequence is considered part of the 'spare room' range
Procedure:
• Rcvr advertises "spare room" by including value of WIN in its segments
• Sender A is allowed to send at most WIN bytes in the range starting with AN
• guarantees that receive buffer doesn't overflow
[Figure: receive buffer at node B (receive process): data taken by application, ACKed data in buffer, AN, spare room of size WIN; non-ACKed data in buffer that arrived out of order is ignored]
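A worked example of the flow-control arithmetic, seen from both sides: the receiver derives WIN from its buffer state, and the sender derives how many more bytes it may send inside the range [AN, AN + WIN). The function names and the sample numbers are illustrative assumptions.

```python
def advertised_window(rcv_buffer: int, an: int, first_unread: int) -> int:
    """Receiver side: WIN = RcvBuffer - AckedRange."""
    acked_range = an - first_unread  # bytes received in sequence and not yet taken
    return rcv_buffer - acked_range


def usable_window(an: int, win: int, next_seq: int) -> int:
    """Sender side: bytes that may still be sent without passing AN + WIN."""
    return max(0, an + win - next_seq)


# Receiver B: 4000-byte buffer, application has read up to byte 1000,
# in-order data has arrived up to byte 2500.
win = advertised_window(rcv_buffer=4000, an=2500, first_unread=1000)

# Sender A: has already sent bytes up to 3200 (700 of them not yet ACKed).
can_send = usable_window(an=2500, win=win, next_seq=3200)

print(win, can_send)
```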
TCP Congestion Control Overview (1) • Closed-loop, end-to-end, window-based congestion control • Designed by Van Jacobson in late 1980s, based on the AIMD algorithm of Dah-Ming Chiu and Raj Jain • Works well so far: the bandwidth of the Internet has increased by more than 200,000 times • Many versions • TCP-Tahoe: this is a less optimized version • TCP-Reno: many OSs today implement Reno type congestion control • TCP-Vegas: not currently used For more details: see Stevens: TCP/IP Illustrated; K-R chapter 6.7, or read: http://lxr.linux.no/source/net/ipv4/tcp_input.c for the Linux implementation
TCP Congest’n Ctrl Overview (2) • Dynamic window size [Van Jacobson] • Initialization: MI (Multiplicative Increase) • Slow start • Steady state: AIMD (Additive Increase / Multiplicative Decrease) • Congestion Avoidance • “Congestion is timeout || 3 duplicate ACK” • TCP Tahoe: treats both cases identically • TCP Reno: treat each case differently • “Congestion = (also) higher latency” • TCP Vegas
General method
• sender limits rate by limiting number of unACKed bytes "in pipeline": LastByteSent - LastByteAcked ≤ cwnd (*)
• cwnd: differs from WIN (how, why?)
• (*) sender is limited by ewnd ≡ min(cwnd, WIN) (effective window)
• roughly, rate = ewnd / RTT bytes/sec
• cwnd is dynamic, function of perceived network congestion
[Figure: sender transmits cwnd bytes, then receives ACK(s) one RTT later]
The Basic Two Phases MSS Congestion avoidance Additive Increase cwnd Slow start Multiplicative Increase
loss, so decrease rate fast X Pure AIMD: Bandwidth Probing Principle • “probing for bandwidth”: increase transmission rate on receipt of ACK, until eventually loss occurs, then decrease transmission rate • continue to increase on ACK, decrease on loss (since available bandwidth is changing, depending on other connections in network) ACKs being received, so increase rate slowly X X X AI TCP’s “sawtooth” behavior AI MD X sending rate MD this model ignores Slow Start time • Q: how fast to increase/decrease? • details to follow Transport Layer
TCP Slowstart: MI (used in all TCP versions)
Slowstart algorithm:
• initialize: cwnd = 1 MSS
• for (each segment ACKed) cwnd += MSS (*)
• until (congestion event OR cwnd ≥ threshold)
• On congestion event: { Threshold = cwnd/2; cwnd = 1 MSS }
(*) doubled per RTT: exponential increase in window size (very fast!); therefore slowstart lasts a short time
[Figure: Host A sends one segment to Host B, then two segments, then four segments, one batch per RTT]
TCP: congestion avoidance (CA): AIMD
• when cwnd > ssthresh grow cwnd linearly, as long as all ACKs arrive
• increase cwnd by ≈1 MSS per RTT
• approach possible congestion slower than in slowstart
• implementation: cwnd += MSS^2/cwnd for each ACK received
• ACKs: increase cwnd by 1 MSS per RTT: additive increase
• loss (*): cut cwnd in half: multiplicative decrease
• true in macro picture
• in actual algorithm may have Slow Start first to grow up to this value (+)
(*) = Timeout or 3 Duplicate ACKs
(+) depends on case & TCP type
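To see why adding MSS^2/cwnd per ACK gives roughly one MSS per RTT, the per-ACK rule can be applied for one round trip. The sketch assumes one ACK per full-size segment and an MSS of 1460 bytes; both are assumptions of this example, as is the function name.

```python
MSS = 1460  # bytes; a common value, assumed here for illustration


def congestion_avoidance_rtt(cwnd: float) -> float:
    """Apply the per-ACK increase for one RTT worth of ACKs."""
    acks = int(cwnd // MSS)  # one ACK per full-size segment in the window
    for _ in range(acks):
        cwnd += MSS * MSS / cwnd
    return cwnd


start = 10 * MSS
after_one_rtt = congestion_avoidance_rtt(start)
print(start / MSS, after_one_rtt / MSS)
```

The growth comes out slightly below one full MSS, because cwnd is already a little larger each time the next ACK is processed.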
TCP Tahoe
• Initialize with SlowStart state with cwnd = 1 MSS
• When cwnd ≥ ssthresh change to CA state
• When sense congestion (*):
• set ssthresh = ewnd/2 (+)
• set cwnd = 1 MSS
• change state to SlowStart
• (*) Timeout or Triple Duplicate Ack
• (+) recall ewnd = min(cwnd, WIN); in our discussion here we assume that WIN > cwnd, so ewnd = cwnd
[Figure: TCP Tahoe cwnd over time: SSt (slow start), CA with AI, then MD on T/O or 3 Dup, back to SSt and CA]
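The Tahoe state changes can be traced with a coarse per-RTT simulation in units of MSS. Several simplifications are assumptions of this sketch: cwnd doubles once per RTT in slow start and is capped at ssthresh, it grows by exactly 1 MSS per RTT in congestion avoidance, ssthresh never drops below 1 MSS, and WIN is large enough that ewnd = cwnd. Loss rounds are supplied by the caller; the function name `tahoe_trace` is hypothetical.

```python
def tahoe_trace(rtts: int, ssthresh: int, loss_rounds: set) -> list:
    """Return cwnd (in MSS) at the start of each RTT under TCP Tahoe."""
    cwnd = 1
    trace = []
    for rtt in range(rtts):
        trace.append(cwnd)
        if rtt in loss_rounds:
            # Timeout or triple duplicate ACK: Tahoe treats both the same way.
            ssthresh = max(cwnd // 2, 1)
            cwnd = 1
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start
        else:
            cwnd += 1                       # congestion avoidance
    return trace


trace = tahoe_trace(rtts=10, ssthresh=8, loss_rounds={6})
print(trace)
```

In this run the window grows 1, 2, 4, 8 in slow start, climbs linearly to 11, and restarts from 1 after the loss with the threshold lowered to 5.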