Please indicate the source: http://blog.csdn.net/gaoxiangnumber1
Welcome to my github: https://github.com/gaoxiangnumber1
3.1 Introduction and Transport-Layer Services
- A transport-layer protocol provides for logical communication between application processes running on different hosts. Application processes use the logical communication to send messages to each other.
- Transport-layer protocols are implemented in the end systems but not in network routers.
- On the sending side, the transport layer converts the application-layer messages it receives from a sending application process into transport-layer segments by breaking the application messages into smaller chunks and adding a transport-layer header to each chunk.
- The transport layer then passes the segment to the network layer at the sending end system, where the segment is encapsulated within a network-layer packet (a datagram) and sent to the destination. Network routers act only on the network-layer fields of the datagram; they do not examine the fields of the transport-layer segment encapsulated within the datagram.
- On the receiving side, the network layer extracts the transport-layer segment from the datagram and passes the segment up to the transport layer. The transport layer then processes the received segment, making the data in the segment available to the receiving application.
3.1.1 Relationship Between Transport and Network Layers
- A transport-layer protocol provides logical communication between processes running on different hosts; a network-layer protocol provides logical communication between hosts. Within an end system, a transport protocol moves messages from application processes to the network layer and vice versa.
- A computer network may have multiple transport protocols, with each protocol offering a different service model to applications.
- The services that a transport protocol can provide are often constrained by the service model of the underlying network-layer protocol. If the network-layer protocol can’t provide delay or bandwidth guarantees for transport-layer segments sent between hosts, then the transport-layer protocol cannot provide delay or bandwidth guarantees for application messages sent between processes. But certain services can be offered by a transport protocol even when the underlying network protocol doesn’t offer the corresponding service at the network layer.
3.1.2 Overview of the Transport Layer in the Internet
- A TCP/IP network makes two distinct transport-layer protocols available to the application layer. One is UDP (User Datagram Protocol), which provides an unreliable, connectionless service to the invoking application. The other is TCP (Transmission Control Protocol), which provides a reliable, connection-oriented service to the invoking application.
- The Internet’s network-layer protocol has a name—IP, for Internet Protocol. IP provides logical communication between hosts. The IP service model does not guarantee orderly delivery of segments, and it does not guarantee the integrity of the data in the segments. So IP is said to be an unreliable service. Every host has at least one network-layer address, a so-called IP address.
- The most fundamental responsibility of UDP and TCP is to extend IP’s delivery service between two end systems to a delivery service between two processes running on the end systems. Extending host-to-host delivery to process-to-process delivery is called transport-layer multiplexing and demultiplexing.
- UDP and TCP provide integrity checking by including error-detection fields in their segments’ headers.
- These two minimal transport-layer services (process-to-process data delivery and error checking) are the only two services that UDP provides. Moreover, UDP is an unreliable service—it does not guarantee that data sent by one process will arrive intact to the destination process.
- TCP provides reliable data transfer by using flow control, sequence numbers, acknowledgments, and timers. TCP thus converts IP’s unreliable service between end systems into a reliable data transport service between processes.
- TCP also provides congestion control that prevents any one TCP connection from swamping the links and routers between communicating hosts with an excessive amount of traffic. TCP strives to give each connection traversing a congested link an equal share of the link bandwidth. This is done by regulating the rate at which the sending sides of TCP connections can send traffic into the network.
- UDP traffic is unregulated. An application using UDP transport can send at any rate it pleases, for as long as it pleases.
3.2 Multiplexing and Demultiplexing
- Transport-layer multiplexing and demultiplexing is extending the host-to-host delivery service provided by the network layer to a process-to-process delivery service for applications running on the hosts. A multiplexing/demultiplexing service is needed for all computer networks.
- At the destination host, the transport layer receives segments from the network layer just below. The transport layer must deliver the data in these segments to the appropriate application process running in the host.
- Suppose your computer has four network application processes running—two Telnet processes, one FTP process, and one HTTP process. When the transport layer in your computer receives data from the network layer below, it needs to direct the received data to one of these four processes.
- A process (as part of a network application) can have one or more sockets, doors through which data passes from the network to the process and through which data passes from the process to the network.
- Thus the transport layer in the receiving host does not deliver data directly to a process, but instead to an intermediary socket. Because at any given time there can be more than one socket in the receiving host, each socket has a unique identifier. The format of the identifier depends on whether the socket is a UDP or a TCP socket.
- Each transport-layer segment has a set of fields in the segment. At the receiving end, the transport layer examines these fields to identify the receiving socket and then directs the segment to that socket.
- The job of gathering data chunks at the source host from different sockets, encapsulating each data chunk with header information (that will later be used in demultiplexing) to create segments, and passing the segments to the network layer is called multiplexing.
- The job of delivering the data in a transport-layer segment to the correct socket at the receiving host is called demultiplexing.
- Multiplexing and demultiplexing are concerns whenever a single protocol at one layer (at the transport layer or elsewhere) is used by multiple protocols at the next higher layer.
- Transport-layer multiplexing requires that
(1) sockets have unique identifiers;
(2) each segment have special fields that indicate the socket to which the segment is to be delivered.
- These special fields are the source port number field and the destination port number field. Each port number is a 16-bit number, ranging from 0 to 65535. The port numbers ranging from 0 to 1023 are called well-known port numbers and are restricted, which means that they are reserved for use by well-known application protocols such as HTTP (port number 80) and FTP (port number 21).
- When we develop a new application, we must assign the application a port number. Each socket in the host could be assigned a port number, and when a segment arrives at the host, the transport layer examines the destination port number in the segment and directs the segment to the corresponding socket. The segment’s data then passes through the socket into the attached process.
- When a UDP socket is created, the transport layer automatically assigns the socket a port number in the range 1024 to 65535 that is not currently being used by any other UDP port on the host.
- If the application developer writing the code were implementing the server side of a “well-known protocol,” then the developer would have to assign the corresponding well-known port number.
- Suppose a process in Host A, with UDP port 19157, wants to send a chunk of application data to a process with UDP port 46428 in Host B. The transport layer in Host A creates a transport-layer segment that includes the application data, the source port number (19157), the destination port number (46428), and two other values. The transport layer then passes the resulting segment to the network layer. The network layer encapsulates the segment in an IP datagram and delivers the segment to the receiving host.
- When the segment arrives at the receiving Host B, the transport layer at the receiving host examines the destination port number in the segment (46428) and delivers the segment to its socket identified by port 46428. Note that Host B could be running multiple processes, each with its own UDP socket and associated port number. As UDP segments arrive from the network, Host B directs (demultiplexes) each segment to the appropriate socket by examining the segment’s destination port number.
- A UDP socket is fully identified by a two-tuple consisting of a destination IP address and a destination port number. So if two UDP segments have different source IP addresses and/or source port numbers, but have the same destination IP address and destination port number, then the two segments will be directed to the same destination process via the same destination socket.
- In the A-to-B segment the source port number serves as part of a “return address”: when B wants to send a segment back to A, the destination port in the B-to-A segment will take its value from the source port value of the A-to-B segment, as the sketch below illustrates.
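As a rough illustration (not part of the book's text), the Python sketch below shows the “return address” idea with sockets: the receiver learns the sender's source IP and port from recvfrom() and uses that pair as the destination of its reply. The port numbers 19157 and 46428 are taken from the example above; the function names and everything else are hypothetical.

```python
import socket

# Receiver (Host B): bind a UDP socket to port 46428 and send each reply
# back to whatever (IP, port) the datagram came from.
def run_receiver():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 46428))                  # destination port B demultiplexes on
    while True:
        data, addr = sock.recvfrom(2048)    # addr = (source IP, source port), e.g. (A, 19157)
        sock.sendto(b"got: " + data, addr)  # source port of A-to-B becomes dest port of B-to-A

# Sender (Host A): bind to 19157 so the source port field is predictable,
# then send to Host B's port 46428 and wait for the reply.
def run_sender(host_b_ip):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", 19157))                  # without bind(), the OS picks an ephemeral port in 1024-65535
    sock.sendto(b"hello", (host_b_ip, 46428))
    reply, _ = sock.recvfrom(2048)
    print(reply)
```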
- One difference between a TCP socket and a UDP socket is that a TCP socket is identified by a four-tuple: (source IP address, source port number, destination IP address, destination port number).
- When a TCP segment arrives from the network to a host, the host uses all four values to direct (demultiplex) the segment to the appropriate socket. In contrast with UDP, two arriving TCP segments with different source IP addresses or source port numbers will (with the exception of a TCP segment carrying the original connection-establishment request) be directed to two different sockets.
- The server host can support many simultaneous TCP connection sockets, with each socket attached to a process, and with each socket identified by its own four-tuple.
- Host C initiates two HTTP sessions to server B, and Host A initiates one HTTP session to B. Hosts A and C and server B each have their own unique IP address—A, C, and B, respectively. Host C assigns two different source port numbers (26145 and 7532) to its two HTTP connections.
- Because Host A is choosing source port numbers independently of C, it might also assign a source port of 26145 to its HTTP connection. But server B will still be able to correctly demultiplex the two connections having the same source port number, since the two connections have different source IP addresses.
- Figure 3.5 shows a web server that spawns a new process for each connection. Each of these processes has its own connection socket through which HTTP requests arrive and HTTP responses are sent. There is not always a one-to-one correspondence between connection sockets and processes. Today’s high-performing web servers often use only one process, and create a new thread with a new connection socket for each new client connection. For such a server, at any given time there may be many connection sockets (with different identifiers) attached to the same process.
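A minimal sketch (assumed code, not the book's) of the thread-per-connection design just described: a single process owns one welcoming socket plus one connection socket per accepted client, and each connection socket is distinguished by its own four-tuple. Port 8080 and the canned response are illustrative only.

```python
import socket
import threading

def handle(conn, addr):
    # conn is a connection socket identified by the four-tuple
    # (client IP addr[0], client port addr[1], server IP, server port).
    data = conn.recv(4096)
    conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()

def serve(port=8080):                               # arbitrary example port
    welcoming = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    welcoming.bind(("", port))
    welcoming.listen()
    while True:
        conn, addr = welcoming.accept()             # a new socket for each client connection
        threading.Thread(target=handle, args=(conn, addr), daemon=True).start()
```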
- If the client and server are using persistent HTTP, then throughout the duration of the persistent connection the client and server exchange HTTP messages via the same server socket.
- If the client and server use non-persistent HTTP, then a new TCP connection is created and closed for every request/response, and hence a new socket is created and later closed for every request/response. This frequent creating and closing of sockets can severely impact the performance of a busy web server.
3.3 Connectionless Transport: UDP
- Aside from the multiplexing/demultiplexing function and error checking, UDP adds nothing to IP. UDP takes messages from the application process, attaches source and destination port number fields for the multiplexing/demultiplexing service, adds two other small fields, and passes the resulting segment to the network layer. The network layer encapsulates the transport-layer segment into an IP datagram and then delivers the segment to the receiving host.
- When the segment arrives at the receiving host, UDP uses the destination port number to deliver the segment’s data to the correct application process. With UDP there is no handshaking between sending and receiving transport-layer entities before sending a segment. So UDP is said to be connectionless.
- DNS is an application-layer protocol that typically uses UDP. When the DNS application in a host wants to make a query, it constructs a DNS query message and passes the message to UDP. Without performing any handshaking with the UDP entity running on the destination end system, the host-side UDP adds header fields to the message and passes the resulting segment to the network layer. The network layer encapsulates the UDP segment into a datagram and sends the datagram to a name server. The DNS application at the querying host then waits for a reply to its query. If it doesn’t receive a reply (possibly because the underlying network lost the query), either it tries sending the query to another name server, or it informs the invoking application that it can’t get a reply.
Many applications are better suited for UDP for the following reasons:
- Finer application-level control over what data is sent, and when.
- Under UDP, as soon as an application process passes data to UDP, UDP will package the data inside a UDP segment and immediately pass the segment to the network layer.
- TCP has a congestion-control mechanism that throttles the transport-layer TCP sender when one or more links between the source and destination hosts become congested.
- TCP will continue to resend a segment until the receipt of the segment has been acknowledged by the destination, regardless of how long reliable delivery takes.
- Since real-time applications often require a minimum sending rate, don’t want to overly delay segment transmission, and can tolerate some data loss, TCP’s service model is not well matched to these applications’ needs. So these applications can use UDP and implement, as part of the application, any additional functionality that is needed beyond UDP’s no-frills segment-delivery service.
- No connection establishment.
TCP uses a three-way handshake before it starts to transfer data. UDP just blasts away without any formal preliminaries. Thus UDP does not introduce any delay to establish a connection. This is the principal reason why DNS runs over UDP rather than TCP: DNS would be much slower if it ran over TCP. HTTP uses TCP rather than UDP, since reliability is critical for web pages with text. But, the TCP connection-establishment delay in HTTP is an important contributor to the delays associated with downloading web documents.
- No connection state.
TCP maintains connection state in the end systems which includes receive and send buffers, congestion-control parameters, and sequence and acknowledgment number parameters that are needed to implement TCP’s reliable data transfer service and to provide congestion control. UDP does not maintain connection state and does not track any of these parameters. So a server devoted to a particular application can typically support many more active clients when the application runs over UDP rather than TCP.
- Small packet header overhead.
The TCP segment has 20 bytes of header overhead, whereas the UDP segment has only 8 bytes of overhead.
- It is possible for an application to have reliable data transfer when using UDP if reliability is built into the application itself (for example, by adding acknowledgment and retransmission mechanisms). But this is a nontrivial task that would keep an application developer busy debugging for a long time. Nevertheless, by building reliability directly into the application, application processes can communicate reliably without being subjected to the transmission-rate constraints imposed by TCP’s congestion-control mechanism.
3.3.1 UDP Segment Structure
- The UDP header has four fields, each consisting of two bytes.
- The port numbers allow the destination host to pass the application data to the correct process running on the destination end system (that is, to perform the demultiplexing function).
- The length field specifies the number of bytes in the UDP segment (header + data). An explicit length value is needed since the size of the data field may differ from one UDP segment to the next.
- The checksum is used by the receiving host to check whether errors have been introduced into the segment; it is calculated over a few of the fields in the IP header in addition to the UDP segment itself.
- The application data occupies the data field of the UDP segment. For example, for DNS, the data field contains either a query message or a response message.
3.3.2 UDP Checksum
- The checksum is used to determine whether bits within the UDP segment have been altered as it moved from source to destination.
- UDP at the sender side performs the 1s complement of the sum of all the 16-bit words in the segment, with any overflow encountered during the sum being wrapped around. This result is put in the checksum field of the UDP segment.
- Suppose that we have the following three 16-bit words:
0110011001100000
0101010101010101
1000111100001100
The sum of first two of these 16-bit words is
0110011001100000 + 0101010101010101 = 1011101110110101
Adding the third word to the above sum gives
1011101110110101 + 1000111100001100 = 0100101011000010
Note that this last addition had overflow, which was wrapped around.
- The 1s complement is obtained by flipping each bit (0 -> 1, 1 -> 0). Thus the 1s complement of the sum 0100101011000010 is 1011010100111101, which becomes the checksum.
- At the receiver, all four 16-bit words are added, including the checksum. If no errors are introduced into the packet, then clearly the sum at the receiver will be 1111111111111111. If one of the bits is a 0, then we know that errors have been introduced into the packet.
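The following short Python sketch implements the 16-bit 1s-complement sum with wraparound described above; running it on the three example words reproduces the checksum 1011010100111101, and the receiver-side check sums to all 1s.

```python
def ones_complement_checksum(words):
    """16-bit 1s-complement sum of 16-bit words, with any overflow wrapped around."""
    total = 0
    for w in words:
        total += w
        total = (total & 0xFFFF) + (total >> 16)    # wrap the carry back into the low 16 bits
    return ~total & 0xFFFF                          # 1s complement of the sum

words = [0b0110011001100000, 0b0101010101010101, 0b1000111100001100]
checksum = ones_complement_checksum(words)
print(format(checksum, "016b"))                     # -> 1011010100111101

# Receiver check: the sum of all words plus the checksum should be all 1s.
recv_sum = 0
for w in words + [checksum]:
    recv_sum += w
    recv_sum = (recv_sum & 0xFFFF) + (recv_sum >> 16)
print(format(recv_sum, "016b"))                     # -> 1111111111111111
```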
- Why does UDP provide a checksum in the first place, given that many link-layer protocols (e.g., Ethernet) also provide error checking?
- The reason is that there is no guarantee that all the links between source and destination provide error checking; that is, one of the links may use a link-layer protocol that does not provide error checking.
- Furthermore, even if segments are correctly transferred across a link, it’s possible that bit errors could be introduced when a segment is stored in a router’s memory.
- Given that neither link-by-link reliability nor in-memory error detection is guaranteed, UDP must provide error detection at the transport layer, on an end-end basis, if the end-end data transfer service is to provide error detection. Because IP is supposed to run over just about any layer-2 protocol, it is useful for the transport layer to provide error checking as a safety measure.
- Although UDP provides error checking, it does not do anything to recover from an error. Some implementations of UDP simply discard the damaged segment; others pass the damaged segment to the application with a warning.
3.4 Principles of Reliable Data Transfer
- The service abstraction provided to the upper-layer entities is that of a reliable channel through which data can be transferred. With a reliable channel, no transferred data bits are corrupted or lost, and all are delivered in the order in which they were sent. It is the responsibility of a reliable data transfer protocol to implement this service abstraction.
- This task is made difficult by the fact that the layer below the reliable data transfer protocol may be unreliable. For example, TCP is a reliable data transfer protocol that is implemented on top of an unreliable (IP) end-to-end network layer.
- Assumption: packets will be in the order in which they were sent, with some packets possibly being lost; that is, the underlying channel will not reorder packets.
- Figure 3.8(b) illustrates the interfaces for our data transfer protocol. The sending side of the data transfer protocol will be invoked from above by a call to rdt_send(). It will pass the data to be delivered to the upper layer at the receiving side. (rdt: reliable data transfer protocol; rdt_send: the sending side of rdt is being called.)
- On the receiving side, rdt_rcv() will be called when a packet arrives from the receiving side of the channel. When the rdt protocol wants to deliver data to the upper layer, it will do so by calling deliver_data().
- In addition to exchanging packets containing the data to be transferred, the sending and receiving sides of rdt will also need to exchange control packets back and forth. Both the send and receive sides of rdt send packets to the other side by a call to udt_send() (udt: unreliable data transfer).
3.4.1 Building a Reliable Data Transfer Protocol
Reliable Data Transfer over a Perfectly Reliable Channel: rdt1.0
- We first consider the underlying channel is completely reliable. The protocol itself (rdt1.0) is trivial. The finite-state machine (FSM) definitions for the rdt1.0 sender and receiver are shown in Figure 3.9.
- The sender and receiver FSMs in Figure 3.9 each have just one state. The arrows in the FSM description indicate the transition of the protocol from one state to another.
- The event causing the transition is shown above the horizontal line labeling the transition, and the actions taken when the event occurs are shown below the horizontal line. When no action is taken on an event, or no event occurs and an action is taken, we’ll use the symbol Λ below or above the horizontal, respectively, to explicitly denote the lack of an action or event.
- The initial state of the FSM is indicated by the dashed arrow(- - ->) and the FSMs in Figure 3.9 have only one state.
- The sending side of rdt accepts data from the upper layer via the rdt_send(data) event, creates a packet containing the data (via the action make_pkt(data)) and sends the packet into the channel. In practice, the rdt_send(data) event would result from a procedure call (for example, to rdt_send()) by the upper-layer application.
- On the receiving side, rdt receives a packet from the underlying channel via the rdt_rcv(packet) event, extracts the data from the packet (via the action extract(packet,data)) and passes the data up to the upper layer (via the action deliver_data(data)). In practice, the rdt_rcv(packet) event would result from a procedure call (for example, to rdt_rcv()) from the lower-layer protocol.
- In this simple protocol, there is no difference between a unit of data and a packet. Also, all packet flow is from the sender to receiver; with a perfectly reliable channel there is no need for the receiver side to provide any feedback to the sender since nothing can go wrong! Note that we have also assumed that the receiver is able to receive data as fast as the sender happens to send data. Thus, there is no need for the receiver to ask the sender to slow down!
Reliable Data Transfer over a Channel with Bit Errors: rdt2.0
- A more realistic model of the underlying channel is one in which bits in a packet may be corrupted. We’ll continue to assume that all transmitted packets are received in the order in which they were sent.
- Consider how you yourself might dictate a long message over the phone. In a typical scenario, the message taker might say “OK” after each sentence has been heard, understood, and recorded. If the message taker hears a garbled sentence, you’re asked to repeat the garbled sentence. This message-dictation protocol uses both positive acknowledgments (“OK”) and negative acknowledgments (“Please repeat that.”). These control messages allow the receiver to let the sender know what has been received correctly, and what has been received in error and thus requires repeating. In a computer network setting, reliable data transfer protocols based on such retransmission are known as ARQ (Automatic Repeat re-Quest) protocols.
- Three additional protocol capabilities are required in ARQ protocols to handle the presence of bit errors:
- Error detection.
A mechanism is needed to allow the receiver to detect when bit errors have occurred. UDP uses the Internet checksum field for this purpose. Error-detection and error-correction techniques require that extra bits (beyond the bits of original data to be transferred) be sent from the sender to the receiver; these bits will be gathered into the packet checksum field of the rdt2.0 data packet.
- Receiver feedback.
Since the sender and receiver are typically executing on different end systems, the only way for the sender to learn of the receiver’s view of the world (i.e., whether or not a packet was received correctly) is for the receiver to provide explicit feedback to the sender. The positive (ACK) and negative (NAK) acknowledgment replies in the message-dictation scenario are examples of such feedback. In principle, these packets need only be one bit long; for example, a 0 value could indicate a NAK and a value of 1 could indicate an ACK.
- Retransmission.
A packet that is received in error at the receiver will be retransmitted by the sender.
- Error detection.
- The send side of rdt2.0 has two states:
- In the leftmost state, the send-side protocol is waiting for data to be passed down from the upper layer. When the rdt_send(data) event occurs, the sender will create a packet (sndpkt) containing the data to be sent, along with a packet checksum, and then send the packet via the udt_send(sndpkt) operation.
- In the rightmost state, the sender protocol is waiting for an ACK or a NAK packet from the receiver.
—If an ACK packet is received (rdt_rcv(rcvpkt) && isACK (rcvpkt)), the sender knows that the most recently transmitted packet has been received correctly and thus the protocol returns to the state of waiting for data from the upper layer.
—If a NAK is received, the protocol retransmits the last packet and waits for an ACK or NAK to be returned by the receiver in response to the retransmitted data packet.
- When the sender is in the wait-for-ACK-or-NAK state, it cannot get more data from the upper layer; i.e., the rdt_send() event can not occur; that will happen only after the sender receives an ACK and leaves this state. So the sender will not send a new piece of data until it is sure that the receiver has correctly received the current packet. Because of this behavior, protocols such as rdt2.0 are known as stop-and-wait protocols.
- The receiver-side FSM for rdt2.0 still has a single state. On packet arrival, the receiver replies with either an ACK or a NAK, depending on whether or not the received packet is corrupted.
- Protocol rdt2.0 has a drawback: the ACK or NAK packet itself could be corrupted. We will need to add checksum bits to ACK/NAK packets in order to detect such errors. But if an ACK or NAK is corrupted, the sender can’t know whether or not the receiver has correctly received the last piece of transmitted data, so the sender has to resend the packet.
- So we add a new field to the data packet and make the sender number its data packets by putting a sequence number into this field. The receiver then need only check this sequence number to determine whether or not the received packet is a retransmission: if the received packet has the same sequence number as the most recently received packet, the sender is resending the previously transmitted packet; if the sequence number has changed, it is a new packet.
- For this stop-and-wait protocol, a 1-bit sequence number suffices: the sender has at most one packet outstanding, and it will not send the next packet until it knows the outstanding packet was received successfully, so the receiver only needs to distinguish a retransmission of the current packet (same sequence number) from a new packet (sequence number flipped). Thus 0 and 1 are enough.
Figures 3.11 and 3.12 show the FSM description for rdt2.1, our fixed version of rdt2.0.
- The rdt2.1 sender and receiver FSMs each now have twice as many states as before. This is because the protocol state must now reflect whether the packet currently being sent by the sender or expected at the receiver should have a sequence number of 0 or 1.
- Protocol rdt2.1 uses both positive and negative acknowledgments from the receiver to the sender. When an out-of-order packet is received, the receiver sends a positive acknowledgment for the packet it has received. When a corrupted packet is received, the receiver sends a negative acknowledgment. We can achieve the same effect as a NAK if we send an ACK for the last correctly received packet. A sender that receives two ACKs for the same packet knows that the receiver did not correctly receive the packet following the packet that is being ACKed twice.
- Our NAK-free reliable data transfer protocol for a channel with bit errors is rdt2.2, shown in Figures 3.13 and 3.14. One change between rdt2.1 and rdt2.2 is that the receiver must now include the sequence number of the packet being acknowledged by an ACK message (this is done by including the (ACK, 0) or (ACK, 1) argument in make_pkt() in the receiver FSM), and the sender must now check the sequence number of the packet being acknowledged by a received ACK message (this is done by including the 0 or 1 argument in isACK()in the sender FSM).
Reliable Data Transfer over a Lossy Channel with Bit Errors: rdt3.0
- Suppose now that in addition to corrupting bits, the underlying channel can lose packets as well. Two additional concerns: how to detect packet loss and what to do when packet loss occurs.
- Suppose that the sender transmits a data packet and either that packet, or the receiver’s ACK of that packet, gets lost. In either case, no reply is coming at the sender from the receiver. If the sender waits long enough to be certain that a packet has been lost, it can retransmit the data packet.
- The approach in practice is for the sender to choose a time value such that packet loss is likely, although not guaranteed, to have happened. If an ACK is not received within this time, the packet is retransmitted. This introduces the possibility of duplicate data packets in the sender-to-receiver channel. Protocol rdt2.2 already has enough functionality (sequence numbers) to handle the case of duplicate packets.
- Implementing a time-based retransmission mechanism requires a countdown timer that can interrupt the sender after a given amount of time has expired. The sender will thus need to be able to
(1) start the timer each time a packet (either a first-time packet or a retransmission) is sent.
(2) respond to a timer interrupt (taking appropriate actions).
(3) stop the timer.
- Because packet sequence numbers alternate between 0 and 1, protocol rdt3.0 is sometimes known as the alternating-bit protocol.
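A minimal, hedged sketch of the rdt3.0 (alternating-bit) sender logic in Python. The packet format, toy checksum, and the udt_send/timer callbacks are placeholders of my own, not the FSM code of the book's figures.

```python
# Toy helpers (placeholders, not the textbook's primitives).
def make_pkt(seq, data=b"", ack=False):
    return {"seq": seq, "data": data, "ack": ack,
            "checksum": seq + len(data) + int(ack)}     # toy checksum, not the Internet checksum

def corrupt(pkt):
    return pkt["checksum"] != pkt["seq"] + len(pkt["data"]) + int(pkt["ack"])

def is_ack(pkt, seq):
    return pkt["ack"] and pkt["seq"] == seq

class Rdt30Sender:
    """Stop-and-wait (alternating-bit) sender: one outstanding packet, 1-bit sequence number."""
    def __init__(self, udt_send, start_timer, stop_timer):
        self.udt_send = udt_send            # send over the unreliable channel (callback)
        self.start_timer = start_timer      # (re)start the countdown timer (callback)
        self.stop_timer = stop_timer
        self.seq = 0                        # alternates between 0 and 1
        self.sndpkt = None                  # copy of the last packet, kept for retransmission
        self.waiting_for_ack = False

    def rdt_send(self, data):
        if self.waiting_for_ack:
            return False                    # stop-and-wait: refuse new data until current packet is ACKed
        self.sndpkt = make_pkt(self.seq, data)
        self.udt_send(self.sndpkt)
        self.start_timer()
        self.waiting_for_ack = True
        return True

    def rdt_rcv(self, rcvpkt):
        # Corrupted ACKs and ACKs for the wrong sequence number are simply ignored;
        # the timeout will trigger a retransmission if needed.
        if self.waiting_for_ack and not corrupt(rcvpkt) and is_ack(rcvpkt, self.seq):
            self.stop_timer()
            self.seq = 1 - self.seq         # flip the alternating bit for the next packet
            self.waiting_for_ack = False

    def timeout(self):
        self.udt_send(self.sndpkt)          # retransmit the outstanding packet
        self.start_timer()
```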
3.4.2 Pipelined Reliable Data Transfer Protocols
- Rather than operate in a stop-and-wait manner, the sender is allowed to send multiple packets without waiting for acknowledgments, as illustrated in Figure 3.17(b). Since the many in-transit sender-to-receiver packets can be visualized as filling a pipeline, this technique is known as pipelining.
- The range of sequence numbers must be increased, since each in-transit packet (not counting retransmissions) must have a unique sequence number and there may be multiple, in-transit, unacknowledged packets.
- The sender and receiver sides of the protocols may have to buffer more than one packet. Minimally, the sender will have to buffer packets that have been transmitted but not yet acknowledged. Buffering of correctly received packets may also be needed at the receiver.
- The range of sequence numbers needed and the buffering requirements will depend on the manner in which a data transfer protocol responds to lost, corrupted, and overly delayed packets. Two basic approaches toward pipelined error recovery can be identified: Go-Back-N and selective repeat.
3.4.3 Go-Back-N (GBN)
- In a Go-Back-N (GBN) protocol, the sender is allowed to transmit multiple packets (when available) without waiting for an acknowledgment, but is constrained to have no more than some maximum allowable number, N, of unacknowledged packets in the pipeline.
- [0, base-1]: packets that have already been transmitted and acknowledged.
- [base, nextseqnum-1]: packets that have been sent but not yet acknowledged.
- [nextseqnum, base+N-1]: packets that can be sent immediately, should data arrive from the upper layer.
- [base+N, +INF): packets cannot be used until an unacknowledged packet currently in the pipeline has been acknowledged.
- The range of permissible sequence numbers for transmitted but not yet acknowledged packets can be viewed as a window of size N over the range of sequence numbers. This window slides forward over the sequence number space. N is often referred to as the window size and the GBN protocol itself as a sliding-window protocol.
- In practice, a packet’s sequence number is carried in a fixed-length field in the packet header. If k is the number of bits in the packet sequence number field, the range of sequence numbers is thus [0, 2^k – 1]. With a finite range of sequence numbers, all arithmetic involving sequence numbers must then be done using modulo 2^k arithmetic. (That is, the sequence number space can be thought of as a ring of size 2^k, where sequence number 2^k – 1 is immediately followed by sequence number 0.) TCP has a 32-bit sequence number field.
- Figures 3.20 and 3.21 give an extended FSM description of the sender and receiver sides of an ACK-based, NAK-free, GBN protocol. We refer to this FSM description as an extended FSM because we have added variables for base and nextseqnum, and added operations on these variables and conditional actions involving these variables.
The GBN sender must respond to three types of events:
- Invocation from above.
When rdt_send() is called from above, the sender first checks to see if the window is full, that is, whether there are N unacknowledged packets. If the window is
—not full, a packet is created and sent, and variables are updated.
—full, the sender simply returns the data back to the upper layer, an implicit indication that the window is full. The upper layer would then have to try again later. In a real implementation, the sender would have either buffered (but not immediately sent) this data, or would have a synchronization mechanism (a semaphore or a flag) that would allow the upper layer to call rdt_send() only when the window is not full.
- Receipt of an ACK.
In our GBN protocol, an acknowledgment for a packet with sequence number n will be taken to be a cumulative acknowledgment, indicating that all packets with a sequence number up to and including n have been correctly received at the receiver.
- A timeout event.
A timer will be used to recover from lost data or acknowledgment packets. If a timeout occurs, the sender resends all packets that have been previously sent but that have not yet been acknowledged. Our sender in Figure 3.20 uses only a single timer, which can be thought of as a timer for the oldest transmitted but not yet acknowledged packet. If an ACK is received but there are still additional transmitted but not yet acknowledged packets, the timer is restarted. If there are no unacknowledged packets, the timer is stopped.
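The following Python sketch illustrates these three sender events under simplifying assumptions of my own (no sequence-number wraparound, corrupted ACKs ignored, udt_send and the timer given as placeholder callbacks); it is not the extended FSM of Figure 3.20.

```python
class GBNSender:
    """Go-Back-N sender: window of size N, cumulative ACKs, a single timer."""
    def __init__(self, N, udt_send, start_timer, stop_timer):
        self.N = N
        self.base = 0                       # oldest unacknowledged sequence number
        self.nextseqnum = 0                 # next sequence number to use
        self.sndpkt = {}                    # copies of sent-but-unACKed packets, keyed by seq
        self.udt_send = udt_send
        self.start_timer = start_timer
        self.stop_timer = stop_timer

    def rdt_send(self, data):
        """Invocation from above."""
        if self.nextseqnum >= self.base + self.N:
            return False                    # window full: refuse the data (caller must retry later)
        pkt = {"seq": self.nextseqnum, "data": data}
        self.sndpkt[self.nextseqnum] = pkt
        self.udt_send(pkt)
        if self.base == self.nextseqnum:
            self.start_timer()              # timer runs for the oldest unACKed packet
        self.nextseqnum += 1
        return True

    def rdt_rcv(self, ack):
        """Receipt of a cumulative ACK: 'ack' acknowledges all packets up to and including it."""
        if ack < self.base:
            return                          # old/duplicate ACK
        for s in range(self.base, ack + 1):
            self.sndpkt.pop(s, None)
        self.base = ack + 1
        if self.base == self.nextseqnum:
            self.stop_timer()               # nothing left outstanding
        else:
            self.start_timer()              # restart for the new oldest unACKed packet

    def timeout(self):
        """A timeout event: go back N, resending every unacknowledged packet."""
        self.start_timer()
        for s in range(self.base, self.nextseqnum):
            self.udt_send(self.sndpkt[s])
```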
- The receiver’s actions in GBN:
—If a packet with sequence number n is received correctly and is in order (that is, the data last delivered to the upper layer came from a packet with sequence number n–1), the receiver sends an ACK for packet n and delivers the data portion of the packet to the upper layer.
—In all other cases, the receiver discards the packet and resends an ACK for the most recently received in-order packet. Since packets are delivered one at a time to the upper layer, if packet k has been received and delivered, then all packets with a sequence number lower than k have also been delivered. Thus, the use of cumulative acknowledgments is a natural choice for GBN.
- In our GBN protocol, the receiver discards out-of-order packets. Recall that the receiver must deliver data in order to the upper layer. Suppose now that packet n is expected, but packet n + 1 arrives. Because data must be delivered in order, the receiver could buffer (save) packet n + 1 and then deliver this packet to the upper layer after it had later received and delivered packet n. But if packet n is lost, both it and packet n + 1 will eventually be retransmitted as a result of the GBN retransmission rule at the sender. Thus, the receiver can simply discard packet n + 1.
- The advantage is the simplicity of receiver buffering—the receiver need not buffer any out-of-order packets. So while the sender must maintain the upper and lower bounds of its window and the position of nextseqnum within this window, the only piece of information the receiver need maintain is the sequence number of the next in-order packet. This value is held in the variable expectedseqnum, shown in the receiver FSM in Figure 3.21.
- The disadvantage of throwing away a correctly received packet is that the subsequent retransmission of that packet might be lost or garbled and thus even more retransmissions would be required.
- Figure 3.22 shows the operation of the GBN protocol for the case of a window size of four packets.
- Because of this window size limitation, the sender sends packets 0 through 3 but then must wait for one or more of these packets to be acknowledged before proceeding. As each successive ACK (for example, ACK0 and ACK1) is received, the window slides forward and the sender can transmit one new packet (pkt4 and pkt5, respectively). On the receiver side, packet 2 is lost and thus packets 3, 4, and 5 are found to be out of order and are discarded.
3.4.4 Selective Repeat (SR)
- The GBN protocol allows the sender to potentially “fill the pipeline” with packets, thus avoiding the channel utilization problems with stop-and-wait protocols. When the window size and bandwidth-delay product are both large, many packets can be in the pipeline. A single packet error can thus cause GBN to retransmit a large number of packets, many unnecessarily.
- Selective-repeat protocols avoid unnecessary retransmissions by having the sender retransmit only those packets that it suspects were received in error (that is, were lost or corrupted) at the receiver. This individual, as-needed, retransmission will require that the receiver individually acknowledge correctly received packets.
- A window size of N will be used to limit the number of outstanding, unacknowledged packets in the pipeline. Unlike GBN, the sender will have already received ACKs for some of the packets in the window.
Figure 3.24 SR sender events and actions
- Data received from above.
When data is received from above, the SR sender checks the next available sequence number for the packet. If the sequence number is within the sender’s window, the data is packetized and sent; otherwise it is either buffered or returned to the upper layer for later transmission.
- Timeout.
Timers are used to protect against lost packets. However, each packet must now have its own logical timer, since only a single packet will be transmitted on timeout. A single hardware timer can be used to mimic the operation of multiple logical timers.
- ACK received.
If an ACK is received, the SR sender marks that packet as having been received, provided it is in the window.
If the packet’s sequence number is equal to send_base, the window base is moved forward to the unacknowledged packet with the smallest sequence number. If the window moves and there are untransmitted packets with sequence numbers that now fall within the window, these packets are transmitted.
Figure 3.25 SR receiver events and actions
- Packet with sequence number in [rcv_base, rcv_base + N - 1] is correctly received.
In this case, the received packet falls within the receiver’s window and a selective ACK packet is returned to the sender. If the packet was not previously received, it is buffered. If this packet has a sequence number equal to the base of the receive window (rcv_base), then this packet, and any previously buffered and consecutively numbered (beginning with rcv_base) packets are delivered to the upper layer. The receive window is then moved forward by the number of packets delivered to the upper layer.
E.g., if packets 3, 4, and 5 have already been received and buffered, then when a packet with sequence number rcv_base = 2 arrives, packets 2, 3, 4, and 5 can all be delivered to the upper layer.
- Packet with sequence number in [rcv_base - N, rcv_base - 1] is correctly received.
In this case, an ACK must be generated, even though this is a packet that the receiver has previously acknowledged.
- Otherwise. Ignore the packet.
- The SR receiver will acknowledge a correctly received packet whether or not it is in order. Out-of-order packets are buffered until any missing packets (that is, packets with lower sequence numbers) are received, at which point a batch of packets can be delivered in order to the upper layer.
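A minimal sketch of the SR receiver rules above, assuming unbounded (non-modulo) sequence numbers and placeholder send_ack/deliver_data callbacks of my own.

```python
class SRReceiver:
    """Selective-repeat receiver: ACK every correctly received in-window packet, buffer out-of-order ones."""
    def __init__(self, N, send_ack, deliver_data):
        self.N = N
        self.rcv_base = 0                   # smallest not-yet-delivered sequence number
        self.buffer = {}                    # out-of-order packets, keyed by sequence number
        self.send_ack = send_ack
        self.deliver_data = deliver_data

    def rdt_rcv(self, pkt):
        seq = pkt["seq"]
        if self.rcv_base <= seq < self.rcv_base + self.N:
            self.send_ack(seq)              # selective ACK for this packet
            self.buffer.setdefault(seq, pkt)
            # Deliver the in-order prefix starting at rcv_base, then slide the window.
            while self.rcv_base in self.buffer:
                self.deliver_data(self.buffer.pop(self.rcv_base)["data"])
                self.rcv_base += 1
        elif self.rcv_base - self.N <= seq < self.rcv_base:
            self.send_ack(seq)              # re-ACK: the sender may not know this was received
        # otherwise: ignore the packet
```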
Mechanism | Use, Comments |
---|---|
Checksum | Used to detect bit errors in a transmitted packet. |
Timer | Used to timeout/retransmit a packet, possibly because the packet (or its ACK) was lost within the channel. Because timeouts can occur when a packet is delayed but not lost (premature timeout), or when a packet has been received by the receiver but the receiver-to-sender ACK has been lost, duplicate copies of a packet may be received by a receiver. |
Sequence number | Used for sequential numbering of packets of data flowing from sender to receiver. Gaps in the sequence numbers of received packets allow the receiver to detect a lost packet. Packets with duplicate sequence numbers allow the receiver to detect duplicate copies of a packet. |
Acknowledgment | Used by the receiver to tell the sender that a packet or set of packets has been received correctly. Acknowledgments will typically carry the sequence number of the packet or packets being acknowledged. Acknowledgments may be individual or cumulative, depending on the protocol. |
Negative acknowledgment | Used by the receiver to tell the sender that a packet has not been received correctly. Negative acknowledgments will typically carry the sequence number of the packet that was not received correctly. |
Window, pipelining | The sender may be restricted to sending only packets with sequence numbers that fall within a given range. By allowing multiple packets to be transmitted but not yet acknowledged, sender utilization can be increased over a stop-and-wait mode of operation. We’ll see shortly that the window size may be set on the basis of the receiver’s ability to receive and buffer messages, or the level of congestion in the network, or both. |
3.5 Connection-Oriented Transport: TCP
3.5.1 The TCP Connection
- The TCP protocol runs only in the end systems and not in the intermediate network elements (routers and link-layer switches); the intermediate network elements do not maintain TCP connection state.
- A TCP connection provides a full-duplex service: If there is a TCP connection between Process A on one host and Process B on another host, then data can flow from A to B at the same time as data flows from B to A. A TCP connection is point-to-point between a single sender and a single receiver.
- Suppose a process running in one host wants to initiate a connection with another process in another host. The process that is initiating the connection is called the client process, while the other process is called the server process.
- The client application process first informs the client transport layer that it wants to establish a connection to a process in the server. TCP in the client then proceeds to establish a TCP connection with TCP in the server. The client first sends a special TCP segment; the server responds with a second special TCP segment; and finally the client responds again with a third special segment.
- The first two segments carry no application-layer data; the third of these segments may carry application-layer data. This connection-establishment procedure is often referred to as a three-way handshake.
- The client process passes a stream of data through the socket. TCP directs this data to the connection’s send buffer. From time to time, TCP will grab chunks of data from the send buffer and pass the data to the network layer.
- The maximum amount of data that can be grabbed and placed in a segment is limited by the maximum segment size (MSS). The MSS is typically set by first determining the length of the largest link-layer frame that can be sent by the local sending host (the maximum transmission unit, MTU), and then setting the MSS to ensure that a TCP segment (when encapsulated in an IP datagram) plus the TCP/IP header length (typically 40 bytes) will fit into a single link-layer frame. For example, with an Ethernet MTU of 1,500 bytes, the MSS is typically 1,460 bytes. Note that the MSS is the maximum amount of application-layer data in the segment, not the maximum size of the TCP segment including headers.
- TCP pairs each chunk of client data with a TCP header, thereby forming TCP segments. The segments are passed down to the network layer, where they are separately encapsulated within network-layer IP datagrams. The IP datagrams are then sent into the network.
- When TCP receives a segment at the other end, the segment’s data is placed in the TCP connection’s receive buffer, as shown in Figure 3.28. The application reads the stream of data from this buffer. Each side of the connection has its own send buffer and its own receive buffer.
- Application-layer message: data which an application wants to send and passed onto the transport layer;
Transport-layer segment: generated by the transport layer and encapsulates application-layer message with transport layer header;
Network-layer datagram: encapsulates transport-layer segment with a network-layer header;
Link-layer frame: encapsulates network-layer datagram with a link-layer header.
3.5.2 TCP Segment Structure
- The TCP segment consists of header fields and a data field. The data field contains a chunk of application data. The MSS(maximum segment size) limits the maximum size of a segment’s data field. When TCP sends a large file, it typically breaks the file into chunks of size MSS (except for the last chunk, which will often be less than the MSS).
- The header includes source and destination port numbers, which are used for multiplexing/demultiplexing data from/to upper-layer applications.
- A TCP segment header also contains the following fields.
- The 32-bit sequence number field and the 32-bit acknowledgment number field are used by the TCP sender and receiver in implementing a reliable data transfer service.
- The 16-bit receive window field is used for flow control. It is used to indicate the number of bytes that a receiver is willing to accept.
- The 4-bit header length field specifies the length of the TCP header in 32-bit words. The TCP header can be of variable length due to the TCP options field. (Typically, the options field is empty, so that the length of the typical TCP header is 20 bytes.)
- The optional and variable-length options field is used when a sender and receiver negotiate the maximum segment size (MSS) or as a window scaling factor for use in high-speed networks. A time-stamping option is also defined.
- The flag field contains 6 bits.
- The URG bit is used to indicate that there is data in this segment that the sending-side upper-layer entity has marked as “urgent.”
- The ACK bit is an acknowledgment for a segment that has been successfully received.
- Setting the PSH bit indicates that the receiver should pass the data to the upper layer immediately.
- The RST, SYN, and FIN bits are used for connection setup and teardown.
- The location of the last byte of urgent data is indicated by the 16-bit urgent data pointer field. TCP must inform the receiving-side upper-layer entity when urgent data exists and pass it a pointer to the end of the urgent data. (In practice, the PSH, URG, and the urgent data pointer are not used.)
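As a concrete illustration of the field layout described above, the sketch below packs a 20-byte, option-free TCP header with Python's struct module. This is assumed code: the port numbers (1500 is arbitrary, 23 is Telnet's well-known port) and the flag value are examples, and the checksum is left at zero.

```python
import struct

def build_tcp_header(src_port, dst_port, seq, ack, flags, recv_window,
                     checksum=0, urgent_ptr=0):
    """Pack a 20-byte TCP header (no options). 'flags' holds the 6 URG/ACK/PSH/RST/SYN/FIN bits."""
    data_offset = 5                                 # header length in 32-bit words (5 * 4 = 20 bytes)
    offset_and_flags = (data_offset << 12) | (flags & 0x3F)
    return struct.pack("!HHIIHHHH",
                       src_port, dst_port,          # 16-bit source and destination ports
                       seq, ack,                    # 32-bit sequence and acknowledgment numbers
                       offset_and_flags,            # 4-bit header length + reserved + 6 flag bits
                       recv_window,                 # 16-bit receive window (flow control)
                       checksum, urgent_ptr)        # 16-bit checksum and urgent data pointer

# Example: an ACK-only segment (ACK flag = 0x10) with sequence number 42 and acknowledgment number 79.
hdr = build_tcp_header(1500, 23, 42, 79, 0x10, 65535)
print(len(hdr))   # -> 20
```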
Sequence Numbers and Acknowledgment Numbers
- The sequence number field and the acknowledgment number fields are a critical part of TCP’s reliable data transfer service.
- TCP views data as an unstructured, but ordered, stream of bytes. TCP’s sequence numbers are based on the stream of transmitted bytes not based on the series of transmitted segments. The sequence number for a segment is therefore the byte-stream number of the first byte in the segment.
- Suppose that a process in Host A wants to send a stream of data to a process in Host B over a TCP connection. The TCP in Host A will implicitly number each byte in the data stream. Suppose that the data stream consists of a file consisting of 500,000 bytes, that the MSS is 1,000 bytes, and that the first byte of the data stream is numbered 0. As shown in Figure 3.30, TCP constructs 500 segments out of the data stream. The first segment gets assigned sequence number 0, the second segment gets assigned sequence number 1,000, the third segment gets assigned sequence number 2,000, and so on. Each sequence number is inserted in the sequence number field in the header of the appropriate TCP segment.
- TCP is full-duplex, so that Host A may be receiving data from Host B while it sends data to Host B (as part of the same TCP connection). Each of the segments that arrive from Host B has a sequence number for the data flowing from B to A. The acknowledgment number that Host A puts in its segment is the sequence number of the next byte Host A is expecting from Host B.
- Suppose that Host A has received all bytes numbered 0 through 535 from B and suppose that it is about to send a segment to Host B. Host A is waiting for byte 536 and all the subsequent bytes in Host B’s data stream. So Host A puts 536 in the acknowledgment number field of the segment it sends to B.
- Now suppose Host A received one segment from Host B containing bytes 900 through 1,000. For some reason Host A has not received bytes 536 through 899. In this example, Host A is still waiting for byte 536 (and beyond) in order to recreate B’s data stream. Thus, A’s next segment to B will contain 536 in the acknowledgment number field. Because TCP only acknowledges bytes up to the first missing byte in the stream, TCP is said to provide cumulative acknowledgments.
- Host A received the third segment (bytes 900 through 1,000) before receiving the second segment (bytes 536 through 899). Thus, the third segment arrived out of order. What does a host do when it receives out-of-order segments in a TCP connection? The TCP RFCs do not impose any rules here and leave the decision up to the people programming a TCP implementation.
- There are two choices:
(1) the receiver discards out-of-order segments (which can simplify receiver design);
(2) the receiver keeps the out-of-order bytes and waits for the missing bytes to fill in the gaps.
The latter choice is more efficient in terms of network bandwidth, and is the approach taken in practice.
- In Figure 3.30, we assumed that the initial sequence number was zero. In truth, both sides of a TCP connection randomly choose an initial sequence number. This is done to minimize the possibility that a segment that is still present in the network from an earlier, already-terminated connection between two hosts is mistaken for a valid segment in a later connection between these same two hosts.
Telnet: A Case Study for Sequence and Acknowledgment Numbers
- Telnet is an application-layer protocol used for remote login. It runs over TCP and is designed to work between any pair of hosts.
- Suppose Host A (client) initiates a Telnet session with Host B (server). Each character typed by the user (at the client) will be sent to the remote host; the remote host will send back a copy of each character, which will be displayed on the Telnet user’s screen. This “echo back” is used to ensure that characters seen by the Telnet user have already been received and processed at the remote site. Each character thus traverses the network twice between the time the user hits the key and the time the character is displayed on the user’s monitor.
- Now suppose the user types a single letter, ‘C’, and then grabs a coffee. Let’s examine the TCP segments that are sent between the client and server.
- We suppose the starting sequence numbers are 42 and 79 for the client and server, respectively. Recall that the sequence number of a segment is the sequence number of the first byte in the data field and the acknowledgment number is the sequence number of the next byte of data that the host is waiting for. After the TCP connection is established but before any data is sent, the client is waiting for byte 79 and the server is waiting for byte 42.
- Three segments are sent.
- The first segment is sent from the client to the server, containing the 1-byte ASCII representation of the letter ‘C’ in its data field. This first segment has 42 in its sequence number field. Because the client has not yet received any data from the server, this first segment will have 79 in its acknowledgment number field.
- The second segment is sent from the server to the client. It serves two purposes. First it provides an acknowledgment of the data the server has received by putting 43 in the acknowledgment field. The second purpose is to echo back the letter ‘C.’ This second segment has the sequence number 79, the initial sequence number of the server-to-client data flow of this TCP connection.
- The third segment is sent from the client to the server. Its only purpose is to acknowledge the data it has received from the server. This segment has an empty data field. The segment has 80 in the acknowledgment number field because the client has received the stream of bytes up through byte sequence number 79 and it is now waiting for bytes 80 onward.
3.5.3 Round-Trip Time Estimation and Timeout
- TCP uses a timeout/retransmit mechanism to recover from lost segments. The timeout should be larger than the connection’s round-trip time (RTT), that is, the time from when a segment is sent until it is acknowledged. Otherwise, unnecessary retransmissions would be sent.
Estimating the Round-Trip Time
- The sample RTT, denoted SampleRTT, for a segment is the amount of time between when the segment is sent (passed to IP) and when an acknowledgment for the segment is received.
- Most TCP implementations take only one SampleRTT measurement at a time. That is, at any point in time, the SampleRTT is being estimated for only one of the transmitted but currently unacknowledged segments, leading to a new value of SampleRTT approximately once every RTT. TCP never computes a SampleRTT for a segment that has been retransmitted; it only measures SampleRTT for segments that have been transmitted once.
- In order to estimate a typical RTT, TCP takes some sort of average of the SampleRTT values. TCP maintains an average, called EstimatedRTT, of the SampleRTT values (an exponentially weighted moving average). Upon obtaining a new SampleRTT, TCP updates EstimatedRTT according to the following formula:
EstimatedRTT = (1 – α) · EstimatedRTT + α · SampleRTT
- The recommended value of α is α = 0.125, in which case the formula above becomes:
EstimatedRTT = 0.875 · EstimatedRTT + 0.125 · SampleRTT
- It is also valuable to have a measure of the variability of the RTT. [RFC 6298] defines the RTT variation, DevRTT, as an estimate of how much SampleRTT typically deviates from EstimatedRTT:
DevRTT = (1 – β) × DevRTT + β × | SampleRTT – EstimatedRTT |
The recommended value of β is 0.25.
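- A minimal Python sketch of these two exponentially weighted updates (the class layout, initial seeding, and sample values below are illustrative, not taken from RFC 6298):
ALPHA = 0.125  # recommended weight for EstimatedRTT
BETA = 0.25    # recommended weight for DevRTT

class RttEstimator:
    def __init__(self, first_sample):
        # Seed both estimates from the first SampleRTT (an illustrative choice).
        self.estimated_rtt = first_sample
        self.dev_rtt = first_sample / 2

    def update(self, sample_rtt):
        # DevRTT = (1 - beta) * DevRTT + beta * |SampleRTT - EstimatedRTT|
        self.dev_rtt = (1 - BETA) * self.dev_rtt + BETA * abs(sample_rtt - self.estimated_rtt)
        # EstimatedRTT = (1 - alpha) * EstimatedRTT + alpha * SampleRTT
        self.estimated_rtt = (1 - ALPHA) * self.estimated_rtt + ALPHA * sample_rtt

est = RttEstimator(first_sample=0.100)   # a 100 ms first measurement
for s in (0.120, 0.095, 0.180):          # subsequent SampleRTT values, in seconds
    est.update(s)
print(est.estimated_rtt, est.dev_rtt)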
Setting and Managing the Retransmission Timeout Interval
- The interval should be greater than or equal to EstimatedRTT, otherwise unnecessary retransmissions would be sent. But the timeout interval should not be too much larger than EstimatedRTT; otherwise, when a segment is lost, TCP would not quickly retransmit the segment, leading to large data transfer delays. It is desirable to set the timeout equal to the EstimatedRTT plus some margin. The margin should be large when there is a lot of fluctuation in the SampleRTT values; it should be small when there is little fluctuation.
TimeoutInterval = EstimatedRTT + 4 × DevRTT
- An initial TimeoutInterval value of 1 second is recommended. When a timeout occurs, the value of TimeoutInterval is doubled to avoid a premature timeout occurring for a subsequent segment that will soon be acknowledged. But as soon as a segment is received and EstimatedRTT is updated, the TimeoutInterval is again computed using the formula above.
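- A short illustrative sketch of how the timer value is managed (all numbers and names are made up for the example):
def timeout_interval(estimated_rtt, dev_rtt):
    # TimeoutInterval = EstimatedRTT + 4 * DevRTT
    return estimated_rtt + 4 * dev_rtt

rto = 1.0                             # recommended initial value: 1 second
rto = 2 * rto                         # on a timeout: double the interval before retransmitting
rto = timeout_interval(0.110, 0.020)  # on the next ACK: recompute from fresh estimates
print(rto)                            # 0.19 seconds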
3.5.4 Reliable Data Transfer
- TCP creates a reliable data transfer service on top of IP’s unreliable service, ensuring that the byte stream a process reads out of its TCP receive buffer is exactly the same byte stream that was sent by the end system on the other side of the connection.
- In our development of reliable data transfer techniques, it was conceptually easiest to assume that an individual timer is associated with each transmitted but not yet acknowledged segment. But timer management can require considerable overhead. Thus, the recommended TCP timer management procedures use only a single retransmission timer, even if there are multiple transmitted but not yet acknowledged segments.
- We will discuss how TCP provides reliable data transfer in two incremental steps. We suppose that data is being sent in only one direction, from Host A to Host B, and that Host A is sending a large file.
/*
Assume sender is not constrained by TCP flow or congestion control, that data from
above is less than MSS in size, and that data transfer is in one direction only.
*/
- First Event:
TCP receives data from the application, encapsulates the data in a segment, and passes the segment to IP. Each segment includes a sequence number that is the byte-stream number of the first data byte in the segment. If the timer is not already running for some other segment, TCP starts the timer when the segment is passed to IP. Think of the timer as being associated with the oldest unacknowledged segment. The expiration interval for this timer is the TimeoutInterval, which is calculated from EstimatedRTT and DevRTT.
- Second Event:
Timeout. TCP responds to the timeout event by retransmitting the segment that caused the timeout, then restarts the timer.
- Third Event:
The arrival of an acknowledgment segment (ACK) from the receiver (a segment containing a valid ACK field value). TCP compares the ACK value y with its variable SendBase. The TCP state variable SendBase is the sequence number of the oldest unacknowledged byte; thus SendBase – 1 is the sequence number of the last byte that is known to have been received correctly. Because TCP uses cumulative acknowledgments, y acknowledges the receipt of all bytes before byte number y. If y > SendBase, then the ACK is acknowledging one or more previously unacknowledged segments. The sender then updates its SendBase variable; it also restarts the timer if there currently are any not-yet-acknowledged segments.
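- The three events above can be sketched, very loosely, as the following Python skeleton (send_to_ip, start_timer, and stop_timer are placeholder helpers; the class is illustrative, not the book’s Figure 3.33 code):
def send_to_ip(segment): pass   # placeholder: hand the segment to the network layer
def start_timer(): pass          # placeholder: (re)start the single retransmission timer
def stop_timer(): pass           # placeholder: stop the timer

class SimplifiedTcpSender:
    def __init__(self, initial_seq):
        self.next_seq_num = initial_seq   # byte-stream number of the next byte to send
        self.send_base = initial_seq      # oldest unacknowledged byte
        self.unacked = {}                 # seq -> data, transmitted but not yet ACKed

    def on_data_from_application(self, data):      # first event
        if not self.unacked:
            start_timer()                 # timer tracks the oldest unacked segment
        self.unacked[self.next_seq_num] = data
        send_to_ip((self.next_seq_num, data))
        self.next_seq_num += len(data)

    def on_timeout(self):                           # second event
        oldest = min(self.unacked)        # retransmit only the oldest unacked segment
        send_to_ip((oldest, self.unacked[oldest]))
        start_timer()

    def on_ack(self, y):                            # third event (cumulative ACK)
        if y > self.send_base:
            for seq in [s for s in self.unacked if s < y]:
                del self.unacked[seq]
            self.send_base = y
            if self.unacked:
                start_timer()
            else:
                stop_timer()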
- Figure 3.35
Suppose Host A sends two segments back to back: the first has sequence number 92 and carries 8 bytes of data, the second has sequence number 100 and carries 20 bytes of data. Suppose that both segments arrive intact at B, and B sends two separate acknowledgments for these segments. The first acknowledgment number is 100; the second acknowledgment number is 120. Suppose now that neither of the acknowledgments arrives at Host A before the timeout. When the timeout event occurs, Host A resends the first segment with sequence number 92 and restarts the timer. As long as the ACK for the second segment arrives before the new timeout, the second segment will not be retransmitted.
- Figure 3.36
The acknowledgment of the first segment is lost in the network, but just before the timeout event, Host A receives an acknowledgment with acknowledgment number 120. Host A therefore knows that Host B has received everything up through byte 119; so Host A does not resend either of the two segments.
- A few modifications that most TCP implementations employ.
Doubling the Timeout Interval
- Whenever the timeout event occurs, TCP retransmits the not-yet-acknowledged segment with the smallest sequence number. But each time TCP retransmits, it sets the next timeout interval to twice the previous value. For example, suppose the TimeoutInterval associated with the oldest not-yet-acknowledged segment is 0.75 sec when the timer first expires. TCP will then retransmit this segment and set the new expiration time to 1.5 sec.
- However, whenever the timer is started after either of the two other events (data received from application above or ACK received), the TimeoutInterval is derived from the most recent values of EstimatedRTT and DevRTT.
- This modification provides a limited form of congestion control. The timer expiration is most likely caused by congestion in the network, that is, too many packets arriving at one (or more) router queues in the path between the source and destination, causing packets to be dropped and/or long queuing delays. In times of congestion, if the sources continue to retransmit packets persistently, the congestion may get worse. Instead, TCP acts more politely, with each sender retransmitting after longer and longer intervals.
Fast Retransmit
- One of the problems with timeout-triggered retransmissions is that the timeout period can be relatively long. The sender can often detect packet loss before the timeout event occurs by “duplicate ACKs”.
- A duplicate ACK is an ACK that reacknowledges a segment for which the sender has already received an earlier acknowledgment.
- When a TCP receiver receives a segment with a sequence number that is larger than the next expected, in-order sequence number, it detects a missing segment (a gap) in the data stream. This gap could be the result of lost or reordered segments within the network. Since TCP does not use negative acknowledgments, it simply reacknowledges (that is, generates a duplicate ACK for) the last in-order byte of data it has received.
- Because a sender often sends a large number of segments back to back, if one segment is lost, there will likely be many back-to-back duplicate ACKs. If the TCP sender receives three duplicate ACKs for the same data, it takes this as an indication that the segment following the segment that has been ACKed three times has been lost. In the case that three duplicate ACKs are received, the TCP sender performs a fast retransmit, retransmitting the missing segment before that segment’s timer expires. This is shown in Figure 3.37, where the second segment is lost, then retransmitted before its timer expires.
- For TCP with fast retransmit, the following code replaces the ACK received event in Figure 3.33:
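- That pseudocode is not reproduced in these notes; the following is a hedged Python sketch of the idea (send_base, dup_ack_count, retransmit, and restart_timer are illustrative stand-ins, not real APIs):
send_base = 92        # oldest unacknowledged byte (illustrative value)
dup_ack_count = 0     # number of duplicate ACKs seen for send_base

def retransmit(seq):
    print(f"fast retransmit of the segment starting at byte {seq}")

def restart_timer():
    pass              # placeholder for (re)starting the retransmission timer

def on_ack(y, have_unacked_segments):
    global send_base, dup_ack_count
    if y > send_base:
        send_base = y              # cumulative ACK: new data acknowledged
        dup_ack_count = 0
        if have_unacked_segments:
            restart_timer()
    else:
        dup_ack_count += 1         # duplicate ACK for already-ACKed data
        if dup_ack_count == 3:
            retransmit(y)          # resend the missing segment before its timer expires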
Go-Back-N or Selective Repeat?
- TCP acknowledgments are cumulative, and correctly received but out-of-order segments are not individually ACKed by the receiver. The TCP sender need only maintain the smallest sequence number of a transmitted but unacknowledged byte (SendBase) and the sequence number of the next byte to be sent (NextSeqNum). In this sense, TCP looks like a GBN-style protocol.
- But there are some differences between TCP and Go-Back-N. Many TCP implementations will buffer correctly received but out-of-order segments.
For example, the sender sends a sequence of segments 1, 2, … , N, and all of the segments arrive in order without error at the receiver. Suppose that the acknowledgment for packet n (n < N) gets lost, but the remaining N – 1 acknowledgments arrive at the sender before their respective timeouts. In this example, GBN would retransmit not only packet n, but also all of the subsequent packets n + 1, n + 2, … , N. TCP, on the other hand, would retransmit at most one segment, namely, segment n. Moreover, TCP would not even retransmit segment n if the acknowledgment for segment n + 1 arrived before the timeout for segment n.
- A proposed modification to TCP, the so-called selective acknowledgment [RFC 2018], allows a TCP receiver to acknowledge out-of-order segments selectively rather than just cumulatively acknowledging the last correctly received, in-order segment. When combined with selective retransmission—skipping the retransmission of segments that have already been selectively acknowledged by the receiver—TCP looks a lot like our generic SR protocol.
- Thus, TCP’s error-recovery mechanism is probably best categorized as a hybrid of GBN and SR protocols.
3.5.5 Flow Control
- The hosts on each side of a TCP connection set aside a receive buffer for the connection. When the TCP connection receives bytes that are correct and in sequence, it places the data in the receive buffer. If the application is slow at reading the data, the sender can very easily overflow the connection’s receive buffer by sending too much data too quickly.
- Flow control is a speed-matching service: matching the rate at which the sender is sending against the rate at which the receiving application is reading.
Suppose the TCP receiver discards out-of-order segments.
- TCP provides flow control by having the sender maintain a variable called the receive window which is used to give the sender an idea of how much free buffer space is available at the receiver. Because TCP is full-duplex, the sender at each side of the connection maintains a distinct receive window.
- Suppose that Host A is sending a large file to Host B over a TCP connection. Host B allocates a receive buffer to this connection; denote its size by RcvBuffer. The application process in Host B reads from the buffer. Define the following variables:
*LastByteRead: the number of the last byte in the data stream read from the buffer by the application process in B.
*LastByteRcvd: the number of the last byte in the data stream that has arrived from the network and has been placed in the receive buffer at B.
- Because TCP is not permitted to overflow the allocated buffer, we must have
LastByteRcvd – LastByteRead <= RcvBuffer
The receive window, denoted rwnd, is set to the amount of spare room in the buffer:
rwnd = RcvBuffer – [LastByteRcvd – LastByteRead]
Because the spare room changes with time, rwnd is dynamic.
- Host B tells Host A how much spare room it has in the connection buffer by placing its current value of rwnd in the receive window field of every segment it sends to A. Initially, Host B sets rwnd = RcvBuffer.
- Host A keeps track of two variables, LastByteSent and LastByteAcked. The difference between these two variables, LastByteSent – LastByteAcked, is the amount of unacknowledged data that A has sent into the connection. By keeping the amount of unacknowledged data less than the value of rwnd, Host A is assured that it is not overflowing the receive buffer at Host B. Thus, Host A makes sure throughout the connection’s life that
LastByteSent – LastByteAcked <= rwnd
- Suppose Host B’s receive buffer becomes full so that rwnd = 0. After advertising rwnd = 0 to Host A, suppose that B has nothing to send to A. As the application process at B empties the buffer, TCP does not send new segments with new rwnd values to Host A, because TCP sends a segment to Host A only if it has data to send or if it has an acknowledgment to send. Therefore, Host A is never informed that some space has opened up in Host B’s receive buffer—Host A is blocked and can transmit no more data.
- To solve this problem, the TCP specification requires Host A to continue to send segments with one data byte when B’s receive window is zero. These segments will be acknowledged by the receiver. Eventually the buffer will begin to empty and the acknowledgments will contain a nonzero rwnd value.
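- A tiny illustrative calculation of the receive-window bookkeeping described above (all byte counts are made up):
RCV_BUFFER = 4096            # RcvBuffer: receiver's allocated buffer size, in bytes
last_byte_rcvd = 3000        # LastByteRcvd at Host B
last_byte_read = 1000        # LastByteRead at Host B

# rwnd = RcvBuffer - (LastByteRcvd - LastByteRead)
rwnd = RCV_BUFFER - (last_byte_rcvd - last_byte_read)
print(rwnd)                  # 2048 bytes of spare room, advertised in every segment sent to A

# Sender-side constraint at Host A:
last_byte_sent, last_byte_acked = 5000, 3500
assert last_byte_sent - last_byte_acked <= rwnd   # unACKed data must fit in B's spare room

# Zero-window case: when rwnd == 0, A keeps sending one-data-byte segments so that
# B's eventual ACKs can carry a nonzero rwnd once the application drains the buffer.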
- UDP does not provide flow control. Consider sending a series of UDP segments from a process on Host A to a process on Host B. UDP appends the segments to a finite-sized buffer that “precedes” the corresponding socket. The process reads one entire segment at a time from the buffer. If the process does not read the segments fast enough from the buffer, the buffer will overflow and segments will get dropped.
3.5.6 TCP Connection Management
- Suppose a process running in one host (client) wants to initiate a connection with another process in another host (server). The client application process first informs the client TCP that it wants to establish a connection to a process in the server. The TCP in the client then proceeds to establish a TCP connection with the TCP in the server in the following manner:
- The client-side TCP first sends a special TCP segment to the server-side TCP. This special segment contains no application-layer data. But one of the flag bits in the segment’s header, the SYN bit, is set to 1. For this reason, this special segment is referred to as a SYN segment. In addition, the client randomly chooses an initial sequence number (client_isn) and puts this number in the sequence number field of the initial TCP SYN segment. This segment is encapsulated within an IP datagram and sent to the server.
- Once the IP datagram containing the TCP SYN segment arrives at the server host, the server extracts the TCP SYN segment from the datagram, allocates the TCP buffers and variables to the connection, and sends a connection-granted segment to the client TCP. This connection-granted segment also contains no application-layer data, but it does contain three important pieces of information in the segment header.
- First, the SYN bit is set to 1.
- Second, the acknowledgment field of the TCP segment header is set to client_isn + 1.
- Finally, the server chooses its own initial sequence number (server_isn) and puts this value in the sequence number field of the TCP segment header.
This connection-granted segment is saying, “I received your SYN packet to start a connection with your initial sequence number, client_isn. I agree to establish this connection. My own initial sequence number is server_isn.” The connection-granted segment is referred to as a SYNACK segment.
- Upon receiving the SYNACK segment, the client also allocates buffers and variables to the connection. The client host then sends the server another segment; this last segment acknowledges the server’s connection-granted segment (the client does so by putting the value server_isn + 1 in the acknowledgment field of the TCP segment header). The SYN bit is set to 0, since the connection is established. This third stage of the three-way handshake may carry client-to-server data in the segment payload.
- Once these three steps have been completed, the client and server hosts can send segments containing data to each other. In each of these future segments, the SYN bit will be set to zero. In order to establish the connection, three packets are sent between the two hosts, as illustrated in Figure 3.39.
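- An illustrative sketch of the three handshake segments (the dict fields are a simplification of the real TCP header, and the payload in step 3 is just an example):
import random

client_isn = random.randint(0, 2**32 - 1)
syn = {"SYN": 1, "seq": client_isn, "data": b""}                # step 1: client -> server

server_isn = random.randint(0, 2**32 - 1)
synack = {"SYN": 1, "seq": server_isn, "ack": client_isn + 1,   # step 2: server -> client
          "data": b""}

ack = {"SYN": 0, "seq": client_isn + 1, "ack": server_isn + 1,  # step 3: client -> server;
       "data": b"GET / HTTP/1.1\r\n"}                           # may already carry data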
- Either of the two processes participating in a TCP connection can end the connection. When a connection ends, the “resources” (that is, the buffers and variables) in the hosts are deallocated.
- Suppose the client decides to close the connection.
- The client application process issues a close command. This causes the client TCP to send a special TCP segment to the server process; this segment has a flag bit in its header, the FIN bit, set to 1.
- When the server receives this segment, it sends the client an acknowledgment segment in return.
- The server then sends its own shutdown segment, which has the FIN bit set to 1.
- Finally, the client acknowledges the server’s shutdown segment. At this point, all the resources in the two hosts are now deallocated.
- The client TCP begins in the CLOSED state.
- The application on the client side initiates a new TCP connection. This causes TCP in the client to send a SYN segment to TCP in the server.
- After having sent the SYN segment, the client TCP enters the SYN_SENT state.
- While in the SYN_SENT state, the client TCP waits for a segment from the server TCP that includes an acknowledgment for the client’s previous segment and has the SYN bit set to 1.
- Having received such a segment, the client TCP enters the ESTABLISHED state.
- While in the ESTABLISHED state, the TCP client can send and receive TCP segments containing payload (that is, application-generated) data.
- Suppose that the client application decides it wants to close the connection. This causes the client TCP to send a TCP segment with the FIN bit set to 1 and to enter the FIN_WAIT_1 state.
- While in the FIN_WAIT_1 state, the client TCP waits for a TCP segment from the server with an acknowledgment. When it receives this segment, the client TCP enters the FIN_WAIT_2 state.
- While in the FIN_WAIT_2 state, the client waits for another segment from the server with the FIN bit set to 1; after receiving this segment, the client TCP acknowledges the server’s segment and enters the TIME_WAIT state.
- The TIME_WAIT state lets the TCP client resend the final acknowledgment in case the ACK is lost. The time spent in the TIME_WAIT state is implementation-dependent, but typical values are 30 seconds, 1 minute, and 2 minutes.
- After the wait, the connection formally closes and all resources on the client side (including port numbers) are released.
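- The client-side states above can be summarized in a small illustrative transition table (the event descriptions are informal paraphrases, not protocol field names):
CLIENT_TRANSITIONS = {
    ("CLOSED",      "application initiates a connection / send SYN"):          "SYN_SENT",
    ("SYN_SENT",    "receive SYN+ACK / send ACK"):                             "ESTABLISHED",
    ("ESTABLISHED", "application closes / send FIN"):                          "FIN_WAIT_1",
    ("FIN_WAIT_1",  "receive ACK of FIN"):                                     "FIN_WAIT_2",
    ("FIN_WAIT_2",  "receive FIN / send ACK"):                                 "TIME_WAIT",
    ("TIME_WAIT",   "wait (typically 30 s to 2 min), then release resources"): "CLOSED",
}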
- Our discussion above has assumed that both the client and server are prepared to communicate, i.e., that the server is listening on the port to which the client sends its SYN segment. Let’s consider what happens when a host receives a TCP segment whose port numbers or source IP address do not match with any of the ongoing sockets in the host.
- For example, suppose a host receives a TCP SYN packet with destination port 7188, but the host is not accepting connections on port 7188. Then the host will send a special reset segment to the source. This TCP segment has the RST flag bit set to 1. Thus, when a host sends a reset segment, it is telling the source “I don’t have a socket for that segment. Please do not resend the segment.” When a host receives a UDP packet whose destination port number doesn’t match with an ongoing UDP socket, the host sends a special ICMP datagram.
3.6 Principles of Congestion Control
3.6.1 The Causes and the Costs of Congestion
Scenario 1: Two Senders, a Router with Infinite Buffers
- Two hosts (A and B) each have a connection that shares a single hop between source and destination, as shown in Figure 3.43.
- Let’s assume that the application in Host A is sending data into the connection at an average rate of λin bytes/sec. These data are original in the sense that each unit of data is sent into the socket only once.
- The underlying transport-level protocol is a simple one: Data is encapsulated and sent; no error recovery (for example, retransmission), flow control, or congestion control is performed.
- Ignoring the additional overhead due to adding transport- and lower-layer header information, the rate at which Host A offers traffic to the router in this first scenario is thus λin bytes/sec.
- Host B operates in a similar manner, and it too is sending at a rate of λin bytes/sec.
- Packets from Hosts A and B pass through a router and over a shared outgoing link of capacity R. The router has buffers that allow it to store incoming packets when the packet-arrival rate exceeds the outgoing link’s capacity. We assume that the router has an infinite amount of buffer space.
- The left graph plots the per-connection throughput (number of bytes per second at the receiver) as a function of the connection-sending rate. For a sending rate between 0 and R/2, the throughput at the receiver equals the sender’s sending rate—everything sent by the sender is received at the receiver with a finite delay.
When the sending rate is above R/2, the throughput is only R/2. This upper limit on throughput is a consequence of the sharing of link capacity between two connections. No matter how high Hosts A and B set their sending rates, they will each never see a throughput higher than R/2.
- The right-hand graph shows the consequence of operating near link capacity. As the sending rate approaches R/2 (from the left), the average delay becomes larger and larger. When the sending rate exceeds R/2, the average number of queued packets in the router is unbounded, and the average delay between source and destination becomes infinite (assuming that the connections operate at these sending rates for an infinite period of time and there is an infinite amount of buffering available).
- While operating at an aggregate throughput of near R may be ideal from a throughput standpoint, it is far from ideal from a delay standpoint. We’ve already found one cost of a congested network—large queuing delays are experienced as the packet-arrival rate nears the link capacity.
Scenario 2: Two Senders and a Router with Finite Buffers
- First, the amount of router buffering is assumed to be finite, that is, packets will be dropped when arriving to an already full buffer.
- Second, each connection is reliable. If a packet containing a transport-level segment is dropped at the router, the sender will eventually retransmit it. Denote the rate at which the application sends original data into the socket by λin bytes/sec. The rate at which the transport layer sends segments (containing original data and retransmitted data) into the network will be denoted λ’in bytes/sec. λ’in is sometimes referred to as the offered load to the network.
- The performance realized under scenario 2 will now depend strongly on how retransmission is performed.
- First, consider the unrealistic case that Host A is able to determine whether or not a buffer is free in the router and thus sends a packet only when a buffer is free. In this case, no loss would occur, λin would be equal to λ’in, and the throughput of the connection would be equal to λin. This case is shown in Figure 3.46(a).
- From a throughput standpoint, performance is ideal—everything that is sent is received. Note that the average host sending rate cannot exceed R/2 under this scenario, since packet loss is assumed never to occur.
- Consider next the more realistic case that the sender retransmits only when a packet is known for certain to be lost. In this case, the performance might look something like that shown in Figure 3.46(b).
- Consider the case that the offered load, λ’in (the rate of original data transmission plus retransmissions), equals R/2. According to (b), at this value of the offered load, the rate at which data are delivered to the receiver application is R/3. Thus, out of the R/2 units of data transmitted, R/3 bytes/sec (on average) are original data and R/6 bytes/sec (on average) are retransmitted data. We see here another cost of a congested network—the sender must perform retransmissions in order to compensate for dropped (lost) packets due to buffer overflow.
- Finally, consider the case that the sender may time out prematurely and retransmit a packet that has been delayed in the queue but not yet lost.
- In this case, both the original data packet and the retransmission may reach the receiver. The receiver needs only one copy of this packet and will discard the retransmission. The work done by the router in forwarding the retransmitted copy of the original packet was wasted, as the receiver will have already received the original copy of this packet. The router would have better used the link transmission capacity to send a different packet instead. Here is another cost of a congested network: unneeded retransmissions by the sender in the face of large delays may cause a router to use its link bandwidth to forward unneeded copies of a packet.
- (c) shows the throughput versus offered load when each packet is assumed to be forwarded (on average) twice by the router. Since each packet is forwarded twice, the throughput will have a value of R/4 as the offered load approaches R/2.
Scenario 3: Four Senders, Routers with Finite Buffers, and Multihop Paths
- In our final congestion scenario, four hosts transmit packets, each over an overlapping two-hop path, as shown in Figure 3.47.
- We again assume that each host uses a timeout/retransmission mechanism to implement a reliable data transfer service, that all hosts have the same value of λin, and that all router links have capacity R bytes/sec.
- Consider the connection from Host A to Host C, passing through routers R1 and R2. The A–C connection shares router R1 with the D–B connection and shares router R2 with the B–D connection. For small values of λin, buffer overflows are rare, and the throughput approximately equals the offered load. For slightly larger values of λin, the corresponding throughput is also larger, since more original data is being transmitted into the network and delivered to the destination, and overflows are still rare. Thus, for small values of λin, an increase in λin results in an increase in λout.
- Consider the case that λin (and hence λ’in ) is extremely large. Consider router R2. The A–C traffic arriving to router R2 (which arrives at R2 after being forwarded from R1) can have an arrival rate at R2 that is at most R, the capacity of the link from R1 to R2, regardless of the value of λin.
- If λ’in is extremely large for all connections (including the B–D connection), then the arrival rate of B–D traffic at R2 can be much larger than that of the A–C traffic. Because the A–C and B–D traffic must compete at router R2 for the limited amount of buffer space, the amount of A–C traffic that successfully gets through R2 (that is, is not lost due to buffer overflow) becomes smaller and smaller as the offered load from B–D gets larger and larger.
- In the limit, as the offered load approaches infinity, an empty buffer at R2 is immediately filled by a B–D packet, and the throughput of the A–C connection at R2 goes to zero. This implies that the A–C end-to-end throughput goes to zero in the limit of heavy traffic. These considerations give rise to the offered load versus throughput trade-off shown in Figure 3.48.
- The reason for the eventual decrease in throughput with increasing offered load is evident when one considers the amount of wasted work done by the network. In the high-traffic scenario, whenever a packet is dropped at a second-hop router, the work done by the first-hop router in forwarding a packet to the second-hop router ends up being “wasted.” The transmission capacity used at the first router to forward the packet to the second router could have been much more profitably used to transmit a different packet. So another cost of dropping a packet due to congestion: when a packet is dropped along a path, the transmission capacity that was used at each of the upstream links to forward that packet to the point at which it is dropped ends up having been wasted.
3.6.2 Approaches to Congestion Control
- At the broadest level, we can distinguish among congestion-control approaches by whether the network layer provides any explicit assistance to the transport layer for congestion-control purposes:
- End-to-end congestion control.
In an end-to-end approach to congestion control, the network layer provides no explicit support to the transport layer for congestion-control purposes. TCP must take this end-to-end approach toward congestion control, since the IP layer provides no feedback to the end systems regarding network congestion. TCP segment loss (as indicated by a timeout or a triple duplicate acknowledgment) is taken as an indication of network congestion, and TCP decreases its window size accordingly. A more recent proposal for TCP congestion control uses increasing round-trip delay values as indicators of increased network congestion.
- Network-assisted congestion control.
With network-assisted congestion control, network-layer components (i.e., routers) provide explicit feedback to the sender regarding the congestion state in the network. This feedback may be as simple as a single bit indicating congestion at a link. This approach was recently proposed for TCP/IP networks, and is used in ATM available bit-rate (ABR) congestion control as well. More sophisticated network feedback is also possible.
- For network-assisted congestion control, congestion information is typically fed back from the network to the sender in one of two ways, as shown in Figure 3.49.
- Direct feedback may be sent from a network router to the sender. This form of notification typically takes the form of a choke packet (essentially saying, “I’m congested!”).
- The second form of notification occurs when a router marks/updates a field in a packet flowing from sender to receiver to indicate congestion. Upon receipt of a marked packet, the receiver then notifies the sender of the congestion indication. Note that this latter form of notification takes at least a full round-trip time.
3.6.3 Network-Assisted Congestion-Control Example: ATM (Asynchronous Transfer Mode) ABR (Available Bit Rate) Congestion Control
- We briefly examine the congestion-control algorithm in ATM ABR—a protocol that takes a network-assisted approach toward congestion control. Our goal is to illustrate a protocol that takes a markedly different approach toward congestion control from that of the Internet’s TCP protocol.
- Fundamentally ATM takes a virtual-circuit (VC) oriented approach toward packet switching. This means that each switch on the source-to-destination path will maintain state about the source-to-destination VC. This per-VC state allows a switch to track the behavior of individual senders (e.g., tracking their average transmission rate) and to take source-specific congestion-control actions (such as explicitly signaling to the sender to reduce its rate when the switch becomes congested). This per-VC state at network switches makes ATM ideally suited to perform network-assisted congestion control.
- When the network is underloaded, ABR service should be able to take advantage of the spare available bandwidth; when the network is congested, ABR service should throttle its transmission rate to some predetermined minimum transmission rate.
- Figure 3.50 shows the framework for ATM ABR congestion control. In our discussion we adopt ATM terminology (for example, using the term switch rather than router, and the term cell rather than packet).
- With ATM ABR service, data cells are transmitted from a source to a destination through a series of intermediate switches. Along with the data cells are resource-management cells (RM cells); these RM cells can be used to convey congestion-related information among the hosts and switches.
- When an RM cell arrives at a destination, it will be turned around and sent back to the sender (possibly after the destination has modified the contents of the RM cell). It is also possible for a switch to generate an RM cell itself and send this RM cell directly to a source. RM cells can thus be used to provide both direct network feedback and network feedback via the receiver.
- ATM ABR congestion control is a rate-based approach. The sender explicitly computes a maximum rate at which it can send and regulates itself accordingly.
- ABR provides three mechanisms for signaling congestion-related information from the switches to the receiver:
- EFCI bit.
Each data cell contains an explicit forward congestion indication (EFCI) bit. A congested network switch can set the EFCI bit in a data cell to 1 to signal congestion to the destination host. The destination must check the EFCI bit in all received data cells. When an RM cell arrives at the destination, if the most recently received data cell had the EFCI bit set to 1, then the destination sets the congestion indication bit (the CI bit) of the RM cell to 1 and sends the RM cell back to the sender. Using the EFCI in data cells and the CI bit in RM cells, a sender can thus be notified about congestion at a network switch.
- CI and NI bits.
Sender-to-receiver RM cells are interspersed with the data cells. The RM cell interspersion rate is a tunable parameter, with the default value being one RM cell for every 32 data cells. These RM cells have a congestion indication (CI) bit and a no increase (NI) bit that can be set by a congested network switch. A switch can set the NI bit in a passing RM cell to 1 under mild congestion and can set the CI bit to 1 under severe congestion conditions. When a destination host receives an RM cell, it will send the RM cell back to the sender with its CI and NI bits intact (except that CI may be set to 1 by the destination as a result of the EFCI mechanism described above).
- ER setting.
Each RM cell also contains a 2-byte explicit rate (ER) field. A congested switch may lower the value contained in the ER field in a passing RM cell. In this manner, the ER field will be set to the minimum supportable rate of all switches on the source-to-destination path.
- An ATM ABR source adjusts the rate at which it can send cells as a function of the CI, NI, and ER values in a returned RM cell.
3.7 TCP Congestion Control
- TCP must use end-to-end congestion control rather than network-assisted congestion control, since the IP layer provides no explicit feedback to the end systems regarding network congestion.
- The approach taken by TCP is to have each sender limit the rate at which it sends traffic into its connection as a function of perceived network congestion. If a TCP sender perceives that there is little congestion on the path between itself and the destination, then the TCP sender increases its send rate; if it perceives congestion, it reduces its send rate.
How does a TCP sender limit the rate at which it sends traffic into its connection?
What algorithm should the sender use to change its send rate as a function of perceived end-to-end congestion?
- The TCP congestion-control mechanism operating at the sender keeps track of a variable, the congestion window. The congestion window, denoted cwnd, imposes a constraint on the rate at which a TCP sender can send traffic into the network. The amount of unacknowledged data at a sender may not exceed the minimum of cwnd and rwnd, that is:
LastByteSent – LastByteAcked <= min{cwnd, rwnd}
- Recall from Section 3.5.5 that the receive window rwnd = RcvBuffer – [LastByteRcvd – LastByteRead] is the amount of spare room in the receive buffer advertised by the receiver.
- Assume that the TCP receive buffer is so large that the receive-window constraint can be ignored; thus, the amount of unacknowledged data at the sender is only limited by cwnd. Also assume that the sender always has data to send, i.e., that all segments in the congestion window are sent.
- The constraint above limits the amount of unacknowledged data at the sender and therefore limits the sender’s send rate. Consider a connection for which loss and packet transmission delays are negligible. At the beginning of every RTT, the constraint permits the sender to send cwnd bytes of data into the connection; at the end of the RTT the sender receives acknowledgments for the data. Thus the sender’s send rate is roughly cwnd/RTT bytes/sec. By adjusting the value of cwnd, the sender can therefore adjust the rate at which it sends data into its connection.
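- A quick back-of-the-envelope example of this rate (the numbers are illustrative):
MSS = 1460               # bytes, an illustrative maximum segment size
cwnd = 10 * MSS          # 14,600 bytes allowed in flight per RTT
rtt = 0.100              # seconds
rate = cwnd / rtt        # ~146,000 bytes/sec
print(rate, rate * 8 / 1e6)   # roughly 1.17 Mbps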
how does a TCP sender perceive that there is congestion on the path between itself and the destination?
- Let us define a “loss event” at a TCP sender as the occurrence of either a timeout or the receipt of three duplicate ACKs from the receiver. When there is heavy congestion, then one (or more) router buffers along the path overflows, causing a datagram (containing a TCP segment) to be dropped. The dropped datagram results in a loss event at the sender— either a timeout or the receipt of three duplicate ACKs —which is taken by the sender to be an indication of congestion on the sender-to-receiver path.
- Consider when the network is congestion-free, that is, when a loss event doesn’t occur. In this case, acknowledgments for previously unacknowledged segments will be received at the TCP sender. TCP will take the arrival of these acknowledgments as an indication that all is well—that segments being transmitted into the network are being successfully delivered to the destination—and will use acknowledgments to increase its congestion window size (and hence its transmission rate). Note that if acknowledgments arrive at a relatively slow rate (e.g., if the end-end path has high delay or contains a low-bandwidth link), then the congestion window will be increased at a relatively slow rate. On the other hand, if acknowledgments arrive at a high rate, then the congestion window will be increased more quickly.
- Because TCP uses acknowledgments to trigger (or clock) its increase in congestion window size, TCP is said to be self-clocking.
How then do the TCP senders determine their sending rates such that they don’t congest the network but at the same time make use of all the available bandwidth?
- TCP answers these questions using the following guiding principles:
- A lost segment implies congestion, and hence, the TCP sender’s rate should be decreased when a segment is lost. A timeout event or the receipt of four acknowledgments for a given segment (one original ACK and then three duplicate ACKs) is interpreted as an implicit “loss event” indication of the segment following the quadruply ACKed segment, triggering a retransmission of the lost segment. From a congestion-control standpoint, the question is how the TCP sender should decrease its congestion window size, and hence its sending rate, in response to this inferred loss event.
- An acknowledged segment indicates that the network is delivering the sender’s segments to the receiver, and hence, the sender’s rate can be increased when an ACK arrives for a previously unacknowledged segment. The arrival of acknowledgments is taken as an implicit indication that all is well—segments are being successfully delivered from sender to receiver, and the network is thus not congested. The congestion window size can thus be increased.
- Bandwidth probing. Given ACKs indicating a congestion-free source-to-destination path and loss events indicating a congested path, TCP’s strategy for adjusting its transmission rate is to increase its rate in response to arriving ACKs until a loss event occurs, at which point the transmission rate is decreased. The TCP sender thus increases its transmission rate to probe for the rate at which congestion begins, backs off from that rate, and then begins probing again to see if the congestion onset rate has changed. Note that there is no explicit signaling of congestion state by the network—ACKs and loss events serve as implicit signals—and that each TCP sender acts on local information asynchronously from other TCP senders.
- The TCP congestion-control algorithm has three major components: (1) slow start, (2) congestion avoidance, and (3) fast recovery. Slow start and congestion avoidance are mandatory components of TCP, differing in how they increase the size of cwnd in response to received ACKs. Fast recovery is recommended, but not required, for TCP senders.
Slow Start
- When a TCP connection begins, the value of cwnd is typically initialized to a small value of 1 MSS, resulting in an initial sending rate of roughly MSS/RTT. Since the available bandwidth to the TCP sender may be much larger than MSS/RTT, the TCP sender would like to find the amount of available bandwidth quickly. Thus, in the slow-start state, the value of cwnd begins at 1 MSS and increases by 1 MSS every time a transmitted segment is first acknowledged.
- In the example of Figure 3.51, TCP sends the first segment into the network and waits for an acknowledgment. When this acknowledgment arrives, the TCP sender increases the congestion window by one MSS and sends out two maximum-sized segments. These segments are then acknowledged, with the sender increasing the congestion window by 1 MSS for each of the acknowledged segments, giving a congestion window of 4 MSS, and so on.
- This process results in a doubling of the sending rate every RTT. Thus, the TCP send rate starts slow but grows exponentially during the slow start phase.
- When should this exponential growth end? Slow start provides several answers to this question.
- If there is a loss event indicated by a timeout, the TCP sender sets the value of a state variable, ssthresh (“slow start threshold”), to cwnd/2 (half of the value of the congestion window when congestion was detected), and then sets the value of cwnd to 1 MSS and begins the slow start process anew.
- A second way in which slow start may end is tied directly to the value of ssthresh. Since ssthresh is half the value of cwnd when congestion was last detected, it might be a bit dangerous to keep doubling cwnd when it reaches or surpasses the value of ssthresh. Thus, when the value of cwnd equals ssthresh, slow start ends and TCP transitions into congestion avoidance mode.
- The final way slow start can end is if three duplicate ACKs are detected, in which case TCP performs a fast retransmit and enters the fast-recovery state.
Congestion Avoidance
- On entry to the congestion-avoidance state, the value of cwnd is half its value when congestion was last encountered. Rather than doubling the value of cwnd every RTT, TCP increases the value of cwnd by just a single MSS every RTT. This can be accomplished in several ways.
- A common approach is for the TCP sender to increase cwnd by MSS × (MSS/cwnd) bytes whenever a new acknowledgment arrives. For example, if MSS is 1,460 bytes and cwnd is 14,600 bytes, then 10 segments are being sent within an RTT. Each arriving ACK (assuming one ACK per segment) increases the congestion window size by 1/10 MSS, and thus the value of the congestion window will have increased by one MSS after ACKs for all 10 segments have been received.
- When a timeout occurs, the value of cwnd is set to 1 MSS, and the value of ssthresh is updated to half the value of cwnd when the loss event occurred.
- Recall that a loss event also can be triggered by a triple duplicate ACK event. In this case, the network is continuing to deliver segments from sender to receiver. TCP halves the value of cwnd and records the value of ssthresh to be half the value of cwnd when the triple duplicate ACKs were received. The fast-recovery state is then entered.
Fast Recovery
- In fast recovery, the value of cwnd is increased by 1 MSS for every duplicate ACK received for the missing segment that caused TCP to enter the fast-recovery state. Eventually, when an ACK arrives for the missing segment, TCP enters the congestion-avoidance state after deflating cwnd. If a timeout event occurs, fast recovery transitions to the slow-start state after performing the same actions as in slow start and congestion avoidance: the value of cwnd is set to 1 MSS, and the value of ssthresh is set to half the value of cwnd when the loss event occurred.
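- A hedged Python sketch of the three-state loop described above (state names, the initial ssthresh, and the event granularity are illustrative; real implementations differ in details such as how cwnd is deflated):
MSS = 1460   # illustrative maximum segment size, in bytes

class CongestionControl:
    def __init__(self):
        self.cwnd = 1 * MSS            # start slow: 1 MSS
        self.ssthresh = 64 * 1024      # illustrative initial slow-start threshold
        self.state = "slow_start"

    def on_new_ack(self):
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh  # deflate cwnd and leave fast recovery
            self.state = "congestion_avoidance"
        elif self.state == "slow_start":
            self.cwnd += MSS           # exponential growth: +1 MSS per newly ACKed segment
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:                          # congestion avoidance
            self.cwnd += MSS * MSS // self.cwnd   # roughly +1 MSS per RTT

    def on_duplicate_ack(self, count):
        if count == 3:                 # triple duplicate ACK: fast retransmit
            self.ssthresh = self.cwnd // 2
            self.cwnd = self.ssthresh  # halve cwnd (some implementations add 3 MSS here)
            self.state = "fast_recovery"
        elif self.state == "fast_recovery":
            self.cwnd += MSS           # inflate cwnd for each additional duplicate ACK

    def on_timeout(self):
        self.ssthresh = self.cwnd // 2
        self.cwnd = 1 * MSS
        self.state = "slow_start"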
- An early version of TCP, known as TCP Tahoe, unconditionally cut its congestion window to 1 MSS and entered the slow-start phase after either a timeout-indicated or triple-duplicate-ACK-indicated loss event. The newer version of TCP, TCP Reno, incorporated fast recovery.
- Figure 3.53 illustrates the evolution of TCP’s congestion window for both Reno and Tahoe. In this figure, the threshold is initially equal to 8 MSS. For the first eight transmission rounds, Tahoe and Reno take identical actions. The congestion window climbs exponentially fast during slow start and hits the threshold at the fourth round of transmission. The congestion window then climbs linearly until a triple duplicate-ACK event occurs, just after transmission round 8. Note that the congestion window is 12 × MSS when this loss event occurs. The value of ssthresh is then set to 0.5 × cwnd = 6 × MSS. Under TCP Reno, the congestion window is set to cwnd = 6 × MSS and then grows linearly. Under TCP Tahoe, the congestion window is set to 1 MSS and grows exponentially until it reaches the value of ssthresh, at which point it grows linearly.
TCP Congestion Control: Retrospective
- Ignoring the initial slow-start period when a connection begins and assuming that losses are indicated by triple duplicate ACKs rather than timeouts, TCP’s congestion control consists of a linear (additive) increase in cwnd of 1 MSS per RTT and then a halving (multiplicative decrease) of cwnd on a triple duplicate-ACK event. For this reason, TCP congestion control is often referred to as an additive-increase, multiplicative-decrease (AIMD) form of congestion control.
- AIMD congestion control gives rise to the “saw tooth” behavior shown in Figure 3.54, which also illustrates our earlier intuition of TCP “probing” for bandwidth: TCP linearly increases its congestion window size (and hence its transmission rate) until a triple duplicate-ACK event occurs. It then decreases its congestion window size by a factor of two but then again begins increasing it linearly, probing to see if there is additional available bandwidth.
Macroscopic Description of TCP Throughput
TCP Over High-Bandwidth Paths
3.7.1 Fairness
- Consider K TCP connections, each with a different end-to-end path, but all passing through a bottleneck link with transmission rate R bps. (By bottleneck link, we mean that for each connection, all the other links along the connection’s path are not congested and have abundant transmission capacity as compared with the transmission capacity of the bottleneck link.) Suppose each connection is transferring a large file and there is no UDP traffic passing through the bottleneck link. A congestion-control mechanism is said to be fair if the average transmission rate of each connection is approximately R/K; that is, each connection gets an equal share of the link bandwidth.
- Is TCP’s AIMD algorithm fair, particularly given that different TCP connections may start at different times and thus may have different window sizes at a given point in time? The fact is that TCP congestion control converges to provide an equal share of a bottleneck link’s bandwidth among competing TCP connections.
- Consider two TCP connections sharing a single link with transmission rate R, as shown in Figure 3.55.
- Assume that the two connections have the same MSS and RTT (so that if they have the same congestion window size, then they have the same throughput), that they have a large amount of data to send, and that no other TCP connections or UDP datagrams traverse this shared link. Also, ignore the slow-start phase of TCP.
- If TCP is to share the link bandwidth equally between the two connections, then the realized throughput should fall along the 45-degree arrow (equal bandwidth share) emanating from the origin. Ideally, the sum of the two throughputs should equal R. So the goal should be to have the achieved throughputs fall somewhere near the intersection of the equal bandwidth share line and the full bandwidth utilization line in Figure 3.56.
- Suppose that the TCP window sizes are such that at a given point in time, connections 1 and 2 realize throughputs indicated by point A in Figure 3.56. Because the amount of link bandwidth jointly consumed by the two connections is less than R, no loss will occur, and both connections will increase their window by 1 MSS per RTT as a result of TCP’s congestion-avoidance algorithm. Thus, the joint throughput of the two connections proceeds along a 45-degree line (equal increase for both connections) starting from point A. Eventually, the link bandwidth jointly consumed by the two connections will be greater than R, and eventually packet loss will occur.
- Suppose that connections 1 and 2 experience packet loss when they realize throughputs indicated by point B. Connections 1 and 2 then decrease their windows by a factor of two. The resulting throughputs realized are thus at point C, halfway along a vector starting at B and ending at the origin. Because the joint bandwidth use is less than R at point C, the two connections again increase their throughputs along a 45-degree line starting from C. Eventually, loss will again occur, for example, at point D, and the two connections again decrease their window sizes by a factor of two, and so on.
- The bandwidth realized by the two connections eventually fluctuates along the equal bandwidth share line and the two connections will converge to this behavior regardless of where they are in the two-dimensional space.
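- A toy simulation of this convergence argument (the link rate, starting rates, and step sizes are arbitrary illustrative units):
R = 100.0                    # bottleneck link rate, arbitrary units
x1, x2 = 10.0, 70.0          # two connections with very unequal starting rates
for _ in range(200):         # iterate over RTT-sized steps
    if x1 + x2 > R:          # joint load exceeds capacity -> loss for both connections
        x1, x2 = x1 / 2, x2 / 2      # multiplicative decrease (halve the rates)
    else:
        x1, x2 = x1 + 1, x2 + 1      # additive increase (1 unit per RTT each)
print(round(x1, 1), round(x2, 1))    # the gap halves at every loss, so the rates end up nearly equal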
Fairness and UDP
- From the perspective of TCP, the connections running over UDP are not being fair—they do not cooperate with the other connections nor adjust their transmission rates appropriately. It is possible for UDP sources to crowd out TCP traffic.
Fairness and Parallel TCP Connections
- Even if we could force UDP traffic to behave fairly, the fairness problem would still not be completely solved. This is because there is nothing to stop a TCP-based application from using multiple parallel connections. For example, web browsers often use multiple parallel TCP connections to transfer the multiple objects within a web page. When an application uses multiple parallel connections, it gets a larger fraction of the bandwidth in a congested link.
3.8 Summary
Please indicate the source: http://blog.csdn.net/gaoxiangnumber1
Welcome to my github: https://github.com/gaoxiangnumber1