1. Introduction to TCP
TCP (Transmission Control Protocol) is a transport-layer protocol; its position in the TCP/IP protocol suite is shown in Figure 1. It was designed specifically to provide a connection-oriented, reliable, end-to-end (process-to-process) byte stream on top of an unreliable internetwork. An internetwork differs from a single network in that its different parts may have wildly different topologies, bandwidths, delays, packet sizes, and other parameters. TCP was designed to adapt dynamically to these properties and to remain robust in the face of many kinds of failure.
Figure 1 The position of TCP in the TCP/IP protocol suite
The TCP service provides applications with flow control, error control, congestion control, multiplexing, and full-duplex communication. "Connection-oriented" means that a communication connection is established through a three-way handshake (see Figure 2). "Reliable byte-stream service" means that the sender keeps data in its send buffer until the receiver acknowledges it; if no acknowledgment arrives before a timeout, the unacknowledged bytes are retransmitted. TCP uses port numbers to achieve process-to-process communication, a sliding-window protocol for flow control, and acknowledgment segments plus timeout-based retransmission for error control. These are the basic concepts behind the service TCP provides.
Figure 2 Establishing a TCP connection with the three-way handshake
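Several of these mechanisms can be observed from user space through the Linux-specific TCP_INFO socket option. The following is only a minimal sketch: the peer address 127.0.0.1:80 is a placeholder chosen for the example, and only a few of the struct tcp_info fields are printed.

/* Sketch: inspect TCP's congestion/error-control state from user space
 * via the TCP_INFO socket option (Linux-specific). 127.0.0.1:80 is just
 * a placeholder peer for this example. */
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in peer = { .sin_family = AF_INET, .sin_port = htons(80) };
    inet_pton(AF_INET, "127.0.0.1", &peer.sin_addr);

    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
        perror("connect");
        return 1;
    }

    struct tcp_info info;
    socklen_t len = sizeof(info);
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
        /* A few fields reflecting the mechanisms described above. */
        printf("snd_cwnd=%u rtt=%uus retransmits=%u\n",
               info.tcpi_snd_cwnd, info.tcpi_rtt, info.tcpi_retransmits);
    }
    close(fd);
    return 0;
}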
2. Initialization of the TCP/IP Protocol Stack
The TCP-related code lives mainly under linux-5.0.1/net/ipv4/. The entry point for initializing the TCP/IP protocol stack, inet_init, can be found in linux-5.0.1/net/ipv4/af_inet.c:
static int __init inet_init(void)
{
    ...
    rc = proto_register(&tcp_prot, 1);
    if (rc)
        goto out;
    ...
    /*
     *  Tell SOCKET that we are alive...
     */
    (void)sock_register(&inet_family_ops);
    ...
    /*
     *  Add all the base protocols.
     */
    ...
    if (inet_add_protocol(&tcp_protocol, IPPROTO_TCP) < 0)
        pr_crit("%s: Cannot add TCP protocol\n", __func__);
    ...
    /* Register the socket-side information for inet_create. */
    for (r = &inetsw[0]; r < &inetsw[SOCK_MAX]; ++r)
        INIT_LIST_HEAD(r);

    for (q = inetsw_array; q < &inetsw_array[INETSW_ARRAY_LEN]; ++q)
        inet_register_protosw(q);
    ...
    /*
     *  Set the IP module up
     */
    ip_init();

    /* Setup TCP slab cache for open requests. */
    tcp_init();
    ...
}

fs_initcall(inet_init);
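The inetsw_array loop above is what links SOCK_STREAM sockets to TCP. For reference, the TCP entry of inetsw_array in af_inet.c looks roughly like the following; it is reproduced from memory, so treat the field values as a sketch rather than an exact quote of the source.

/* Sketch of the TCP entry in inetsw_array (net/ipv4/af_inet.c).
 * inet_create() later consults this table to pick the proto_ops and
 * struct proto for a socket(AF_INET, SOCK_STREAM, 0) call. */
static struct inet_protosw inetsw_array[] = {
    {
        .type     = SOCK_STREAM,
        .protocol = IPPROTO_TCP,
        .prot     = &tcp_prot,          /* becomes sk->sk_prot          */
        .ops      = &inet_stream_ops,   /* becomes sock->ops            */
        .flags    = INET_PROTOSW_PERMANENT | INET_PROTOSW_ICSK,
    },
    /* ... entries for UDP, ICMP/raw sockets, etc. ... */
};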
The tcp_prot structure is defined in linux-5.0.1/net/ipv4/tcp_ipv4.c. It specifies the interface functions through which the socket layer accesses the TCP protocol stack: the .connect and .accept members registered here are what the socket layer ultimately reaches through the sk->sk_prot->connect and sk->sk_prot->accept function pointers, so connect() ends up in tcp_v4_connect and accept() in inet_csk_accept. The tcp_init function is defined in linux-5.0.1/net/ipv4/tcp.c; among other things it calls tcp_tasklet_init, which sets up the per-CPU tasklets used when transmitting the byte stream (the TCP Small Queues mechanism defers and resumes sending from softirq context, i.e. deferred work rather than a dedicated kernel thread). The relevant code is shown below:
struct proto tcp_prot = {
    .name           = "TCP",
    .owner          = THIS_MODULE,
    .close          = tcp_close,
    .pre_connect    = tcp_v4_pre_connect,
    .connect        = tcp_v4_connect,
    .disconnect     = tcp_disconnect,
    .accept         = inet_csk_accept,
    .ioctl          = tcp_ioctl,
    .init           = tcp_v4_init_sock,
    .destroy        = tcp_v4_destroy_sock,
    .shutdown       = tcp_shutdown,
    .setsockopt     = tcp_setsockopt,
    .getsockopt     = tcp_getsockopt,
    .keepalive      = tcp_set_keepalive,
    .recvmsg        = tcp_recvmsg,
    .sendmsg        = tcp_sendmsg,
    .sendpage       = tcp_sendpage,
    .backlog_rcv    = tcp_v4_do_rcv,
    .release_cb     = tcp_release_cb,
    ...
};
void __init tcp_init(void)
{
    ...
    tcp_v4_init();
    tcp_metrics_init();
    BUG_ON(tcp_register_congestion_control(&tcp_reno) != 0);
    tcp_tasklet_init();
}
3. Source Code Analysis of the TCP Three-Way Handshake
During connection establishment with the three-way handshake, the client calls socket() and then connect(), while the server calls socket(), bind(), listen(), and accept() in turn. The figure below ties the order of these Socket API calls on the server and client to the three-way handshake itself, and shows through the SYN/ACK exchange how the two reliable byte streams (client to server and server to client) are set up; a minimal user-space sketch of this call sequence follows the figure.
Figure 3 The TCP three-way handshake and the corresponding Socket API call sequence
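As a reminder of what the kernel paths analyzed below are serving, here is a minimal sketch of the two sides. Error handling is omitted, and the port number 5001 and address 127.0.0.1 are arbitrary choices for the example.

/* Minimal sketch of the user-space call sequences behind the handshake.
 * Error handling omitted; port 5001 / 127.0.0.1 are example values. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PORT 5001

/* Server side: socket() -> bind() -> listen() -> accept().
 * accept() ends up in inet_csk_accept() inside the kernel. */
static int server(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(PORT),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 16);                      /* passive open: LISTEN state    */
    int cfd = accept(lfd, NULL, NULL);    /* blocks until the 3WHS is done */
    close(cfd);
    close(lfd);
    return 0;
}

/* Client side: socket() -> connect().
 * connect() ends up in tcp_v4_connect(), which sends the SYN. */
static int client(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(PORT) };
    inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
    connect(fd, (struct sockaddr *)&addr, sizeof(addr)); /* active open: SYN_SENT */
    close(fd);
    return 0;
}

int main(int argc, char **argv)
{
    return (argc > 1 && strcmp(argv[1], "server") == 0) ? server() : client();
}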
Next we trace, verify, and analyze the three-way handshake. In one terminal, start MenuOS in qemu; in another terminal, load the linux-5.0.1 vmlinux into gdb, connect to qemu through port 1234, and set breakpoints at the functions of interest.
Type the replyhi command in qemu, then repeatedly enter continue in gdb until execution no longer stops; then type the hello command in qemu and again repeatedly enter continue in gdb until it no longer stops. At this point both commands have finished running in qemu, as shown in the screenshot below:
The order in which the breakpointed functions are hit, as shown in gdb, is the following:
From the application's point of view, the three-way handshake is the work done behind the scenes when the server's accept and the client's connect establish a connection. In the kernel's socket layer these two socket API calls enter through the connect and accept system calls (__sys_connect and __sys_accept4), pass through the sock->ops layer, and finally reach the sk->sk_prot->connect and sk->sk_prot->accept function pointers; for TCP these point to tcp_v4_connect and inet_csk_accept respectively.
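Putting the pieces together, the two paths look roughly as follows. The socket-layer intermediaries are filled in from memory and may differ in detail between kernel versions.

/* Hedged sketch of the two call chains (intermediate names recalled from
 * memory; see net/socket.c and net/ipv4/af_inet.c for the exact code):
 *
 * client: connect(2)
 *   -> __sys_connect()                              net/socket.c
 *     -> sock->ops->connect == inet_stream_connect()
 *       -> sk->sk_prot->connect == tcp_v4_connect()     sends the SYN
 *
 * server: accept(2)
 *   -> __sys_accept4()                              net/socket.c
 *     -> sock->ops->accept == inet_accept()
 *       -> sk->sk_prot->accept == inet_csk_accept()     dequeues a completed connection
 */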
The inet_csk_accept function called on the server side takes a connection request off the accept queue; if the queue is empty, it blocks in inet_csk_wait_for_connect and waits for a client to connect. Its source can be found in linux-5.0.1/net/ipv4/inet_connection_sock.c:
/*
 * This will accept the next outstanding connection.
 */
struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct request_sock_queue *queue = &icsk->icsk_accept_queue;
    struct request_sock *req;
    struct sock *newsk;
    int error;

    lock_sock(sk);

    /* We need to make sure that this socket is listening,
     * and that it has something pending.
     */
    error = -EINVAL;
    if (sk->sk_state != TCP_LISTEN)
        goto out_err;

    /* Find already established connection */
    if (reqsk_queue_empty(queue)) {
        long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);

        /* If this is a non blocking socket don't sleep */
        error = -EAGAIN;
        if (!timeo)
            goto out_err;

        error = inet_csk_wait_for_connect(sk, timeo);
        if (error)
            goto out_err;
    }
    req = reqsk_queue_remove(queue, sk);
    newsk = req->sk;

    if (sk->sk_protocol == IPPROTO_TCP &&
        tcp_rsk(req)->tfo_listener) {
        spin_lock_bh(&queue->fastopenq.lock);
        if (tcp_rsk(req)->tfo_listener) {
            /* We are still waiting for the final ACK from 3WHS
             * so can't free req now. Instead, we set req->sk to
             * NULL to signify that the child socket is taken
             * so reqsk_fastopen_remove() will free the req
             * when 3WHS finishes (or is aborted).
             */
            req->sk = NULL;
            req = NULL;
        }
        ...
    }
    ...
    return newsk;
    ...
}
EXPORT_SYMBOL(inet_csk_accept);
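The branch that sets error = -EAGAIN when timeo is zero is exactly what a non-blocking listener observes from user space. A small hedged sketch (the port 5001 is again an arbitrary example value, and it assumes no connection is pending on the listening socket):

/* Sketch: the non-blocking path of inet_csk_accept() seen from user space.
 * With O_NONBLOCK set, accept() fails with errno == EAGAIN when the accept
 * queue is empty, instead of sleeping in inet_csk_wait_for_connect(). */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(5001),   /* example port */
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 16);

    fcntl(lfd, F_SETFL, fcntl(lfd, F_GETFL, 0) | O_NONBLOCK);

    int cfd = accept(lfd, NULL, NULL);
    if (cfd < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("accept queue empty: kernel returned -EAGAIN\n");
    else if (cfd >= 0)
        close(cfd);

    close(lfd);
    return 0;
}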
The tcp_v4_connect function called on the client side initiates an outgoing TCP connection. Establishing a connection naturally requires support from the layer below, and in this function we can see it using services provided by the IP layer, such as ip_route_connect and ip_route_newports. We can also see that it sets the socket state to TCP_SYN_SENT and then calls tcp_connect(sk), which actually builds the SYN segment and sends it out. The source of tcp_v4_connect can be found in linux-5.0.1/net/ipv4/tcp_ipv4.c:
/* This will initiate an outgoing connection. */
int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
    ...
    rt = ip_route_connect(fl4, nexthop, inet->inet_saddr,
                          RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
                          IPPROTO_TCP,
                          orig_sport, orig_dport, sk);
    ...
    /* Socket identity is still unknown (sport may be zero).
     * However we set state to SYN-SENT and not releasing socket
     * lock select source port, enter ourselves into the hash tables and
     * complete initialization after this.
     */
    tcp_set_state(sk, TCP_SYN_SENT);
    ...
    rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
                           inet->inet_sport, inet->inet_dport, sk);
    ...
    err = tcp_connect(sk);
    ...
}
EXPORT_SYMBOL(tcp_v4_connect);
The tcp_connect function called from tcp_v4_connect is responsible for building a TCP header with the SYN flag set, sending it, and arming the retransmission timer so that the SYN is resent on timeout. It lives in linux-5.0.1/net/ipv4/tcp_output.c, together with the two helpers shown below: tcp_connect_init, which initializes the connection parameters (MSS, window, sequence numbers, RTO), and tcp_connect_queue_skb, which accounts the SYN skb against the socket and advances write_seq:
/* Do all connect socket setups that can be done AF independent. */
static void tcp_connect_init(struct sock *sk)
{
    const struct dst_entry *dst = __sk_dst_get(sk);
    struct tcp_sock *tp = tcp_sk(sk);
    __u8 rcv_wscale;
    u32 rcv_wnd;

    /* We'll fix this up when we get a response from the other end.
     * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
     */
    tp->tcp_header_len = sizeof(struct tcphdr);
    if (sock_net(sk)->ipv4.sysctl_tcp_timestamps)
        tp->tcp_header_len += TCPOLEN_TSTAMP_ALIGNED;

#ifdef CONFIG_TCP_MD5SIG
    if (tp->af_specific->md5_lookup(sk, sk))
        tp->tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
#endif

    /* If user gave his TCP_MAXSEG, record it to clamp */
    if (tp->rx_opt.user_mss)
        tp->rx_opt.mss_clamp = tp->rx_opt.user_mss;
    tp->max_window = 0;
    tcp_mtup_init(sk);
    tcp_sync_mss(sk, dst_mtu(dst));

    tcp_ca_dst_init(sk, dst);

    if (!tp->window_clamp)
        tp->window_clamp = dst_metric(dst, RTAX_WINDOW);
    tp->advmss = tcp_mss_clamp(tp, dst_metric_advmss(dst));

    tcp_initialize_rcv_mss(sk);

    /* limit the window selection if the user enforce a smaller rx buffer */
    if (sk->sk_userlocks & SOCK_RCVBUF_LOCK &&
        (tp->window_clamp > tcp_full_space(sk) || tp->window_clamp == 0))
        tp->window_clamp = tcp_full_space(sk);

    rcv_wnd = tcp_rwnd_init_bpf(sk);
    if (rcv_wnd == 0)
        rcv_wnd = dst_metric(dst, RTAX_INITRWND);

    tcp_select_initial_window(sk, tcp_full_space(sk),
                              tp->advmss - (tp->rx_opt.ts_recent_stamp ?
                                            tp->tcp_header_len - sizeof(struct tcphdr) : 0),
                              &tp->rcv_wnd,
                              &tp->window_clamp,
                              sock_net(sk)->ipv4.sysctl_tcp_window_scaling,
                              &rcv_wscale,
                              rcv_wnd);

    tp->rx_opt.rcv_wscale = rcv_wscale;
    tp->rcv_ssthresh = tp->rcv_wnd;

    sk->sk_err = 0;
    sock_reset_flag(sk, SOCK_DONE);
    tp->snd_wnd = 0;
    tcp_init_wl(tp, 0);
    tcp_write_queue_purge(sk);
    tp->snd_una = tp->write_seq;
    tp->snd_sml = tp->write_seq;
    tp->snd_up = tp->write_seq;
    tp->snd_nxt = tp->write_seq;

    if (likely(!tp->repair))
        tp->rcv_nxt = 0;
    else
        tp->rcv_tstamp = tcp_jiffies32;
    tp->rcv_wup = tp->rcv_nxt;
    tp->copied_seq = tp->rcv_nxt;

    inet_csk(sk)->icsk_rto = tcp_timeout_init(sk);
    inet_csk(sk)->icsk_retransmits = 0;
    tcp_clear_retrans(tp);
}

static void tcp_connect_queue_skb(struct sock *sk, struct sk_buff *skb)
{
    struct tcp_sock *tp = tcp_sk(sk);
    struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);

    tcb->end_seq += skb->len;
    __skb_header_release(skb);
    sk->sk_wmem_queued += skb->truesize;
    sk_mem_charge(sk, skb->truesize);
    tp->write_seq = tcb->end_seq;
    tp->packets_out += tcp_skb_pcount(skb);
}
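For context, the top-level tcp_connect() (also in tcp_output.c) ties these helpers together. The following is only an annotated outline reconstructed from memory, not a verbatim copy of the kernel source; details such as the repair and Fast Open paths are omitted.

/* Rough outline of tcp_connect() in net/ipv4/tcp_output.c (5.0-era kernels).
 * Reconstructed from memory; see the actual source for the exact code. */
int tcp_connect(struct sock *sk)
{
    struct sk_buff *buff;

    tcp_connect_init(sk);                       /* MSS, windows, seq numbers, RTO */

    buff = sk_stream_alloc_skb(sk, 0, sk->sk_allocation, true);
    if (unlikely(!buff))
        return -ENOBUFS;

    /* Build an empty segment carrying only the SYN flag. */
    tcp_init_nondata_skb(buff, tcp_sk(sk)->write_seq++, TCPHDR_SYN);
    tcp_connect_queue_skb(sk, buff);            /* charge the skb to the socket   */

    /* Transmit the SYN (Fast Open would send SYN+data here instead). */
    tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);

    /* Arm the retransmission timer so the SYN is resent on timeout. */
    inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                              inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
    return 0;
}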