Ch. 7 : Internet Transport Protocols
Transport Layer
Our goals: understand the principles behind transport-layer services:
• multiplexing / demultiplexing the data streams of several applications
• reliable data transfer
• flow control
• congestion control
(Chapter 6: rdt principles; Chapter 7: multiplexing / demultiplexing)
Internet transport-layer protocols:
• UDP: connectionless transport
• TCP: connection-oriented transport — connection setup, data transfer, flow control, congestion control
Transport vs. network layer
• Transport layer uses Network layer services
• adds more value to these services
Multiplexing & Demultiplexing
Multiplexing/demultiplexing
• Multiplexing at the sending host: gather data from multiple sockets, envelop the data with headers (later used for demultiplexing), pass to L3.
• Demultiplexing at the receiving host: receive segments from L3, deliver each received segment to the correct socket.
[Figure: three hosts, each with application / transport / network / link / physical layers; processes P1–P4 attach to sockets at the transport layer (socket = endpoint owned by a process).]
How demultiplexing works
• host receives IP datagrams
  - each datagram has source IP address and destination IP address in its header
  - each datagram carries one transport-layer segment
  - each segment has source and destination port numbers in its header
• host uses port numbers, and sometimes also IP addresses, to direct the segment to the correct socket
• from the socket, the data gets to the relevant application process
[Figure: TCP/UDP segment format, 32 bits wide — L3 header (source IP addr, dest IP addr, other IP header fields), then L4 header (source port #, dest port #, other header fields), then application data (message).]
Connectionless demultiplexing (UDP)
• Processes create sockets with port numbers; a UDP socket is identified by a pair of numbers: (my IP address, my port number)
• The client decides to contact a server (peer IP address) + an application (peer port #), and puts these into the UDP packet it sends:
  - dest IP address — in the IP header of the packet
  - dest port number — in its UDP header
• When the server receives a UDP segment, it:
  - checks the destination port number in the segment
  - directs the UDP segment to the socket with that port number (packets from different remote sockets are directed to the same socket)
  - the UDP message waits in the socket queue and is processed in its turn
  - the answer message is sent to the client's UDP socket (listed in the Source fields of the query packet)
Connectionless demux (cont.)
[Figure: two clients get service from a server. Client A (IP A) uses a socket with port 9157; client B (IP B) uses a socket with port 5775; the server socket at IP C uses port 53. Client packets carry SP=9157 or 5775, DP=53, S-IP=A or B, D-IP=C; the server's replies swap these fields (SP=53, DP=9157 or 5775, S-IP=C, D-IP=A or B). Both clients' messages are demultiplexed to the same server socket (port 53), where they wait for the application.]
SP = Source port number, DP = Destination port number, S-IP = Source IP Address, D-IP = Destination IP Address.
SP and S-IP provide the "return address".
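The figure's behaviour can be sketched with Python's standard socket module; the loopback address, port 9999, and the payloads below are illustrative choices, not part of the lecture material.

    import socket

    # --- server side: one UDP socket, identified by (my IP, my port) ---
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 9999))          # all datagrams with dest port 9999 land here

    # --- client side: the OS picks an ephemeral source port for this socket ---
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.sendto(b"query", ("127.0.0.1", 9999))

    # The server sees the client's (S-IP, SP) and uses it as the return address.
    data, client_addr = server.recvfrom(2048)
    server.sendto(b"reply to " + data, client_addr)

    print(client.recvfrom(2048))              # (b'reply to query', ('127.0.0.1', 9999))
    client.close(); server.close()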
Connection-oriented demux (TCP)
• A TCP socket is identified by a 4-tuple: local (my) IP address, local (my) port number, remote IP address, remote port number
• The receiving host uses all four values to direct a segment to the appropriate socket
• A server host may support many simultaneous TCP sockets: each socket is identified by its own 4-tuple
• Web servers have a different socket for each connecting client
  - if you open two browser windows, you generate 2 sockets at each end
  - non-persistent HTTP will open a different socket for each request
Connection-oriented demux (cont.)
[Figure: server C runs a web-server process with three working sockets, all with local port 80 and local IP C, distinguished by the remote endpoint:
  - LP=80, L-IP=C, RP=9157, R-IP=A — serves client A's socket (LP=9157, L-IP=A, RP=80, R-IP=C)
  - LP=80, L-IP=C, RP=5775, R-IP=B — serves client B's socket (LP=5775, L-IP=B, RP=80, R-IP=C)
  - LP=80, L-IP=C, RP=9157, R-IP=B — serves client B's second socket (LP=9157, L-IP=B, RP=80, R-IP=C)
Packets such as (SP=9157, DP=80, S-IP=B, D-IP=C) are demultiplexed by the full 4-tuple, so the two connections with source port 9157 (one from A, one from B) reach different sockets.]
LP = Local Port, RP = Remote Port, L-IP = Local IP, R-IP = Remote IP. "L" = Local = My, "R" = Remote = Peer.
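A short sketch of the same idea with Python sockets: accept() hands the server a new working socket per connection, so demultiplexing uses the full 4-tuple. Port 8080 and the loopback address are arbitrary choices for illustration (the slides use port 80, which requires privileges).

    import socket, threading, time

    def serve():
        listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        listener.bind(("127.0.0.1", 8080))
        listener.listen()
        for _ in range(2):
            # accept() returns a NEW working socket; its 4-tuple includes the
            # remote (IP, port), so each connecting client gets its own socket
            conn, peer = listener.accept()
            conn.sendall(f"hello {peer}".encode())
            conn.close()
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    time.sleep(0.2)                         # sketch-level synchronization only

    for _ in range(2):                      # two clients -> two working sockets at the server
        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        c.connect(("127.0.0.1", 8080))      # the OS picks a distinct ephemeral local port
        print(c.recv(1024))
        c.close()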
UDP: User Datagram Protocol [RFC 768]
• simple transport protocol with "best effort" service; UDP segments may be:
  - lost
  - delivered out of order to the application
  with no correction by UDP
• UDP will discard bad-checksum segments if so configured by the application
• connectionless:
  - no handshaking between UDP sender and receiver
  - each UDP segment handled independently of the others
Why is there a UDP?
• no connection establishment — saves delay
• no congestion control — better delay & BW
• simple — small segment header
Typical usage: real-time applications (loss tolerant, rate sensitive). Other uses (why?): DNS, SNMP.
UDP segment structure
[Figure: 32-bit wide segment — source port #, dest port #; length (total length of segment, in bytes), checksum; application data (variable length).]
• Checksum computed over:
  - the whole segment
  - part of the IP header: both IP addresses, the protocol field, the total IP packet length
• Checksum usage:
  - computed at the destination to detect errors
  - in case of error, UDP will discard the segment, or pass it to the application with a warning
UDP checksum
Goal: detect "errors" (e.g., flipped bits) in the transmitted segment
Sender:
• treat segment contents as a sequence of 16-bit integers
• checksum: addition (1's complement sum) of the segment contents
• sender puts the checksum value into the UDP checksum field
Receiver:
• compute the checksum of the received segment
• check if the computed checksum equals the checksum field value:
  - NO — error detected
  - YES — no error detected
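As an illustration of the 1's complement sum, here is a small Python sketch; it checksums a raw byte string only, and leaves out the UDP pseudo-header fields listed on the previous slide.

    def checksum16(data: bytes) -> int:
        if len(data) % 2:                       # pad odd-length data with a zero byte
            data += b"\x00"
        total = 0
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]      # next 16-bit word
            total = (total & 0xFFFF) + (total >> 16)   # fold the carry back in
        return ~total & 0xFFFF                  # one's complement of the sum

    segment = b"example payload"
    cksum = checksum16(segment)

    # Receiver check: adding the checksum to the data sum must give 0xFFFF.
    padded = segment + b"\x00" if len(segment) % 2 else segment
    total = cksum
    for i in range(0, len(padded), 2):
        total += (padded[i] << 8) | padded[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    assert total == 0xFFFF                      # no error detected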
TCP: Overview   RFCs: 793, 1122, 1323, 2018, 2581
• point-to-point: one sender, one receiver, between sockets
• reliable, in-order byte stream: no "message boundaries"
• pipelined: TCP congestion and flow control set the window size
• send & receive buffers
• full-duplex data: bi-directional data flow in the same connection; MSS: maximum segment size
• connection-oriented: handshaking (exchange of control msgs) initializes sender and receiver state before data exchange
• flow controlled: sender will not overwhelm the receiver
TCP segment structure
[Figure: 32-bit wide segment — source port #, dest port #; sequence number; acknowledgement number; header length, unused bits, flags U A P R S F, receiver window size; checksum, pointer to urgent data; options (variable length); application data (variable length).]
• sequence and acknowledgement numbers count bytes of data (not segments!)
• header length is given in 32-bit words
• URG: urgent data (generally not used); PSH: push data now (generally not used)
• ACK: ACK # valid
• RST, SYN, FIN: connection establishment (setup, teardown commands)
• receiver window size: # bytes the receiver is willing to accept
• checksum: Internet checksum (as in UDP)
TCP sequence # (SN) and ACK (AN)
• SN: byte-stream "number" of the first byte in the segment's data
• AN: SN of the next byte expected from the other side — a cumulative ACK
• Qn: how does the receiver handle out-of-order segments? It puts them in the receive buffer but does not acknowledge them.
Simple data transfer scenario (some time after connection setup), Hosts A and B:
• host A sends 100 data bytes: SN=42, AN=79, 100 data bytes
• host B ACKs the 100 bytes and sends 50 data bytes: SN=79, AN=142, 50 data bytes
• host A ACKs receipt of the data and sends no data (WHY?): SN=142, AN=129, no data
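The cumulative-ACK arithmetic in this scenario can be checked with a tiny helper (the function name is ours, not part of any TCP API):

    # AN = sequence number of the next byte expected from the peer
    def next_an(peer_sn: int, peer_data_len: int) -> int:
        return peer_sn + peer_data_len

    assert next_an(42, 100) == 142   # B's AN after A sends SN=42, 100 bytes
    assert next_an(79, 50) == 129    # A's AN after B sends SN=79, 50 bytes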
Connection Setup: Objective
• Agree on initial sequence numbers
  - a sender should not reuse a seq # before it is sure that all packets carrying that seq # have been purged from the network
  - the network guarantees that a packet that is too old will be purged: the network bounds the lifetime of each packet
  - to avoid waiting for old seq #s to disappear, start a new session with a seq # far away from the previous one
  - this needs connection setup, so that the sender tells the receiver its initial seq #
• Agree on other initial parameters, e.g. Maximum Segment Size
TCP Connection Management
Setup: establish the connection between the hosts before exchanging data segments — called a 3-way handshake
• initialize TCP variables: seq. #s, buffers, flow-control info (e.g. RcvWindow)
• client (connection initiator): opens a socket and commands the OS to connect it to the server
• server (contacted by the client): has a waiting socket, accepts the connection, generates a working socket
Three-way handshake:
• Step 1: client host sends a TCP SYN segment to the server — specifies the client's initial seq #, carries no data
• Step 2: server host receives the SYN, replies with a SYNACK segment (also no data) — allocates buffers, specifies the server's initial SN & window size
• Step 3: client receives the SYNACK, replies with an ACK segment, which may contain data
Teardown: end of connection (we skip the details)
TCP Three-Way Handshake (TWH)
[Figure: hosts A and B, each with send and receive buffers.
  A → B: SYN, SN = X
  B → A: SYNACK, SN = Y, AN = X+1
  A → B: ACK, SN = X+1, AN = Y+1
After the handshake, A's send buffer and B's receive buffer start at byte X+1, and B's send buffer and A's receive buffer start at byte Y+1.]
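In the Berkeley socket API the handshake is carried out by the OS: connect() sends the SYN and returns once the SYNACK/ACK exchange completes. A minimal sketch, assuming the loopback address and an arbitrary port 6000:

    import socket

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 6000))
    listener.listen()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", 6000))   # blocks until the 3-way handshake is done

    conn, _ = listener.accept()           # handshake already completed by the kernel
    conn.close(); client.close(); listener.close()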
Connection Close
Objective of the closure handshake: each side can release its resources and remove state about the connection.
[Figure: the client closes its socket and sends "I am done. Are you done too?" (initial close); the server replies "I am done too. Goodbye!" (close); each side releases its resources after the exchange.]
Ch. 7 : Internet Transport Protocols Part B
TCP reliable data transfer
• TCP creates a reliable service on top of IP's unreliable service
• pipelined segments
• cumulative ACKs
• a single retransmission timer
• receiver accepts out-of-order segments but does not acknowledge them
• retransmissions are triggered by timeout events
• initially, consider a simplified TCP sender: ignore flow control and congestion control
TCP sender events:
data received from application:
• create segment with seq #
• seq # is the byte-stream number of the first data byte in the segment
• start the timer if not already running (think of the timer as being for the oldest unACKed segment)
• expiration interval: TimeOutInterval
timeout:
• retransmit the segment that caused the timeout
• restart the timer
ACK received:
• if it acknowledges previously unACKed segments:
  - update what is known to be ACKed
  - start the timer if there are outstanding segments
TCP sender (simplified)

    NextSeqNum = InitialSeqNum
    SendBase = InitialSeqNum

    loop (forever) {
      switch(event)

      event: data received from application above
        create TCP segment with sequence number NextSeqNum
        if (timer currently not running)
          start timer
        pass segment to IP
        NextSeqNum = NextSeqNum + length(data)

      event: timer timeout
        retransmit not-yet-acknowledged segment with smallest sequence number
        start timer

      event: ACK received, with ACK field value of y
        if (y > SendBase) {
          SendBase = y
          if (there are currently not-yet-acknowledged segments)
            start timer
        }
    } /* end of loop forever */

Comment: SendBase-1 is the last cumulatively ACKed byte.
Example: SendBase-1 = 71; y = 73, so the receiver wants byte 73 and beyond; y > SendBase, so new data is ACKed.
TCP actions on receiver events:
data received from IP:
• if the checksum fails, ignore the segment
• if the checksum is OK, then:
  - if the data came in order: update AN and WIN — AN grows by the number of new in-order bytes, WIN decreases by the same number
  - if the data is out of order: put it in the buffer, but don't count it for AN / WIN
application takes data:
• free the room in the buffer; give the freed cells new numbers (circular numbering)
• WIN increases by the number of bytes taken
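A toy model of this bookkeeping (class and field names are ours; real TCP tracks bytes in kernel buffers): AN advances only over contiguous in-order data, while out-of-order data is buffered and not counted.

    class ToyReceiver:
        def __init__(self, initial_an: int, rcv_buffer: int):
            self.an = initial_an              # next byte expected (cumulative ACK)
            self.rcv_buffer = rcv_buffer
            self.out_of_order = {}            # sn -> data, waiting for the gap to fill
            self.unread = 0                   # in-order bytes not yet taken by the app

        def segment_arrives(self, sn: int, data: bytes):
            if sn == self.an:                 # in order: advance AN
                self.an += len(data)
                self.unread += len(data)
                while self.an in self.out_of_order:    # gap filled? pull buffered data
                    buffered = self.out_of_order.pop(self.an)
                    self.an += len(buffered)
                    self.unread += len(buffered)
            elif sn > self.an:                # out of order: buffer, don't ACK it
                self.out_of_order[sn] = data
            return self.an, self.rcv_buffer - self.unread   # (AN, WIN) to advertise

        def app_reads(self, nbytes: int):     # application drains the buffer -> WIN grows
            self.unread -= min(nbytes, self.unread)

    r = ToyReceiver(initial_an=100, rcv_buffer=4096)
    print(r.segment_arrives(120, b"x" * 20))  # out of order: AN stays 100
    print(r.segment_arrives(100, b"y" * 20))  # fills the gap: AN jumps to 140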
TCP: retransmission scenarios
[Figure, Hosts A and B, showing both the timer setting and the actual timer run:
A. Normal scenario — A sends SN=92, 8 bytes of data and starts the timer for SN 92; B replies AN=100 and A stops the timer. A then sends SN=100, 20 bytes of data and starts the timer for SN 100; B replies AN=120 and A stops the timer (no timer running afterwards).
B. Lost ACK + retransmission — A sends SN=92, 8 bytes of data and starts the timer for SN 92; B's AN=100 is lost; A times out, retransmits SN=92, 8 bytes, and starts the timer for the new SN 92; B replies AN=100 and A stops the timer.]
TCP retransmission scenarios (more)
[Figure, Hosts A and B:
C. Lost ACK, NO retransmission — A sends SN=92 (8 bytes) and SN=100 (20 bytes), starting the timer for SN 92; B's AN=100 is lost, but the cumulative AN=120 arrives before the timeout and covers both segments, so A stops the timer and retransmits nothing.
D. Premature timeout — A sends SN=92 (8 bytes) and SN=100 (20 bytes); the ACKs are delayed; A's timer for SN 92 expires and A retransmits SN=92; AN=100 and AN=120 then arrive and stop the timers, and B answers the retransmission with a redundant AN=120.]
TCP ACK generation [RFC 1122, RFC 2581]

Event at receiver: arrival of in-order segment with expected seq #; all data up to the expected seq # already ACKed.
TCP receiver action: delayed ACK — wait up to 500 ms for the next segment; if there is no data segment to send, then send the ACK.

Event at receiver: arrival of in-order segment with expected seq #; one other segment has an ACK pending.
TCP receiver action: immediately send a single cumulative ACK, ACKing both in-order segments.

Event at receiver: arrival of an out-of-order segment with higher-than-expected seq #; gap detected.
TCP receiver action: immediately send a duplicate ACK, indicating the seq # of the next expected byte. This ACK carries no data and no new WIN.

Event at receiver: arrival of a segment that partially or completely fills the gap.
TCP receiver action: immediately send an ACK, provided that the segment starts at the lower end of the gap.
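The first three table rows can be expressed as a small decision function; this is a sketch with invented names, and the gap-filling row is omitted because it needs the out-of-order buffer shown earlier.

    def on_segment(rcv, sn, length):
        if sn == rcv["expected"]:                  # in-order segment
            rcv["expected"] += length
            if rcv["ack_pending"]:                 # another in-order segment was pending
                rcv["ack_pending"] = False
                return f"immediately send cumulative ACK {rcv['expected']}"
            rcv["ack_pending"] = True              # start the (up to 500 ms) delayed-ACK timer
            return "delay ACK up to 500 ms"
        if sn > rcv["expected"]:                   # out of order: gap detected
            return f"immediately send duplicate ACK {rcv['expected']}"
        return f"immediately send ACK {rcv['expected']} (data already received)"

    rcv = {"expected": 100, "ack_pending": False}
    print(on_segment(rcv, 100, 50))   # delay ACK up to 500 ms
    print(on_segment(rcv, 150, 50))   # immediately send cumulative ACK 200
    print(on_segment(rcv, 300, 50))   # immediately send duplicate ACK 200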
Fast Retransmit
• the timeout period is often relatively long: this causes a long delay before resending a lost packet
• detect lost segments via duplicate ACKs
  - a sender often sends many segments back-to-back
  - if a segment is lost, there will likely be many duplicate ACKs for that segment
• if the sender receives 3 ACKs for the same data, it assumes that the segment after the ACKed data was lost
• fast retransmit: resend the segment before the timer expires
[Figure, Hosts A and B: A sends segments with seq #s x1, x2, x3, x4, x5 back-to-back; the segment with seq # x2 is lost. B ACKs x1 with ACK #x2 and answers each of x3, x4, x5 with another ACK #x2. After the triple duplicate ACKs, A resends seq x2 well before the timeout expires.]
Fast retransmit algorithm:

    event: ACK received, with ACK field value of y
      if (y > SendBase) {                      /* new data is ACKed */
        SendBase = y
        if (there are currently not-yet-acknowledged segments)
          start timer
      }
      else {                                   /* a duplicate ACK for already ACKed data */
        if (segment carries no data & doesn't change WIN)
          increment count of dup ACKs received for y
        if (count of dup ACKs received for y == 3)
          resend segment with sequence number y    /* fast retransmit */
      }
TCP Flow Control
• flow control: the sender won't overflow the receiver's buffer by transmitting too much, too fast
• the receive side of the TCP connection at B has a receive buffer for A's data; its size is set by the OS at connection init
• the application process at B may be slow at reading from the buffer
• flow control matches A's send rate to the receiving application's drain rate at B
• WIN = window size = number of bytes A may send starting at AN
[Figure: node B's receive buffer — TCP data in the buffer (sent by TCP at A, arriving from IP) not yet taken by the receiving application process, followed by spare room; AN marks the first byte not yet received and WIN is the spare room.]
TCP Flow control: how it works
Formulas (at node B, the receiver):
• AN = first byte not received yet
• AckedRange = AN – FirstByteNotReadByAppl = # bytes received in sequence & not yet taken by the application
• WIN = RcvBuffer – AckedRange = SpareRoom
• AN and WIN are sent to A in the TCP header
• data received out of sequence is considered part of the "spare room" range
Procedure:
• the receiver advertises the spare room by including the value of WIN in its segments
• sender A is allowed to send at most WIN bytes in the range starting with AN
• this guarantees that the receive buffer doesn't overflow
[Figure: node B's receive buffer — data taken by the application, then ACKed data in the buffer (sent by TCP at A, arriving from IP), then spare room of size WIN starting at AN; non-ACKed data in the buffer (arrived out of order) is ignored for the WIN computation.]
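A one-line check of the window formula, with made-up numbers:

    # WIN = RcvBuffer - (AN - FirstByteNotReadByAppl); the values are illustrative only
    def advertised_window(rcv_buffer, an, first_byte_not_read):
        acked_range = an - first_byte_not_read        # in-sequence bytes not yet taken
        return rcv_buffer - acked_range               # spare room

    # 4096-byte buffer, AN = 5000, application has read up to byte 3999
    assert advertised_window(4096, 5000, 4000) == 3096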
TCP Round Trip Time and Timeout
Q: how to set the TCP timeout value?
• longer than the RTT — but note: the RTT will vary
• too short: premature timeout, unnecessary retransmissions
• too long: slow reaction to segment loss
Q: how to estimate the RTT?
• SampleRTT: measured time from segment transmission until ACK receipt
  - ignore retransmissions and cumulatively ACKed segments
• SampleRTT will vary; we want the estimated RTT to be "smoother"
  - use several recent measurements, not just the current SampleRTT
High-level idea: set timeout = average RTT + a safety margin
Estimating the Round Trip Time
• SampleRTT: measured time from segment transmission until ACK receipt
• SampleRTT will vary; we want a "smoother" estimated RTT — use several recent measurements, not just the current SampleRTT

    EstimatedRTT = (1 - α) * EstimatedRTT + α * SampleRTT

• exponential weighted moving average
• the influence of a past sample decreases exponentially fast
• typical value: α = 0.125
Setting the Timeout
Problem: using just the average of SampleRTT will generate many timeouts due to network variations.
Solution: EstimatedRTT plus a "safety margin" — a large variation in EstimatedRTT calls for a larger safety margin. Estimate how much SampleRTT deviates from EstimatedRTT:

    DevRTT = (1 - β) * DevRTT + β * |SampleRTT - EstimatedRTT|      (typically, β = 0.25)

Then set the timeout interval:

    TimeoutInterval = EstimatedRTT + 4 * DevRTT
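The two formulas combine into a small estimator; the α and β defaults are the typical values above, and the RTT samples are made up. DevRTT is updated first so that it uses the previous EstimatedRTT.

    def update_rtt(estimated_rtt, dev_rtt, sample_rtt, alpha=0.125, beta=0.25):
        dev_rtt = (1 - beta) * dev_rtt + beta * abs(sample_rtt - estimated_rtt)
        estimated_rtt = (1 - alpha) * estimated_rtt + alpha * sample_rtt
        timeout_interval = estimated_rtt + 4 * dev_rtt
        return estimated_rtt, dev_rtt, timeout_interval

    est, dev = 0.100, 0.005                      # seconds; arbitrary starting values
    for sample in (0.110, 0.095, 0.180, 0.105):  # made-up SampleRTT measurements
        est, dev, timeout = update_rtt(est, dev, sample)
        print(f"EstimatedRTT={est:.3f}s  DevRTT={dev:.3f}s  Timeout={timeout:.3f}s")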
TCP Congestion Control
• Closed-loop, end-to-end, window-based congestion control
• Designed by Van Jacobson in the late 1980s, based on the AIMD algorithm of Dah-Ming Chiu and Raj Jain
• Works well so far: the bandwidth of the Internet has increased by more than 200,000 times
• Many versions:
  - TCP Tahoe: a less optimized version
  - TCP Reno: many OSs today implement Reno-type congestion control
  - TCP Vegas: not currently used
For more details: see TCP/IP Illustrated, or read http://lxr.linux.no/source/net/ipv4/tcp_input.c for the Linux implementation
TCP & AIMD: congestion
• Dynamic window size [Van Jacobson]
  - Initialization: MI (multiplicative increase) — slow start
  - Steady state: AIMD (additive increase, multiplicative decrease) — congestion avoidance
• What counts as congestion?
  - Congestion = timeout: TCP Tahoe
  - Congestion = timeout || 3 duplicate ACKs: TCP Reno & TCP NewReno
  - Congestion = higher latency: TCP Vegas
Visualization of the Two Phases
[Figure: CongWin (in MSS) vs. time — rapid growth during slow start up to the congestion-avoidance threshold, then linear growth during congestion avoidance.]
TCP Slowstart: MI
Slowstart algorithm:

    initialize: CongWin = 1 MSS
    for (each segment ACKed)
      CongWin += MSS
    until (congestion event OR CongWin > threshold)

• exponential increase (per RTT) in window size — not so slow!
• in case of timeout: Threshold = CongWin / 2
[Figure, Hosts A and B: A sends one segment in the first RTT, two segments in the second, four in the third, ...]
TCP Tahoe Congestion Avoidance

    Congestion avoidance:
    /* slowstart is over */
    /* CongWin > threshold */
    until (timeout) {                 /* loss event */
      on every ACK:  CongWin += MSS * (MSS / CongWin)
                     /* i.e., measuring the window in MSS units, W += 1/W,
                        which adds about one MSS per RTT */
    }
    threshold = CongWin / 2
    CongWin = 1 MSS
    perform slowstart
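A toy, round-by-round simulation of the Tahoe rules above (window measured in MSS units; the loss round is hard-coded just to show the sawtooth):

    def tahoe(rounds, loss_rounds, init_threshold=16):
        cwnd, threshold = 1.0, float(init_threshold)
        trace = []
        for r in range(rounds):
            if r in loss_rounds:              # timeout: threshold = cwnd/2, restart slow start
                threshold, cwnd = cwnd / 2, 1.0
            elif cwnd < threshold:            # slow start: doubles each RTT
                cwnd = min(cwnd * 2, threshold)
            else:                             # congestion avoidance: +1 MSS per RTT
                cwnd += 1
            trace.append(round(cwnd, 1))
        return trace

    print(tahoe(rounds=20, loss_rounds={12}))
    # -> window grows 2, 4, 8, 16, 17, 18, ... then drops to 1 at round 12 and climbs again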