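The transport layer of each protocol family keeps the information a socket needs in its own transport control block, for example the TCP, UDP, and raw IP transport control blocks.
The Linux kernel's definition of the transport control block is quite clever: according to the characteristics of the protocol family and the transport protocol, it defines several structures in layers and composes them into a transport control block. For the IPv4 protocol family these are sock_common, sock, inet_sock, inet_connection_sock, tcp_sock, request_sock, inet_request_sock, tcp_request_sock, inet_timewait_sock, tcp_timewait_sock, udp_sock, and raw_sock.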
sock_common
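This structure is the minimal set of transport control block information. It consists of the leading members that sock and inet_timewait_sock have in common, factored out on their own, so it is only used to build those two structures.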
sock
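This structure is a fairly generic network-layer descriptor. It forms the basis of every transport control block and is independent of any particular protocol family. It holds the information common to the transport protocols of all protocol families, so it cannot be used directly as a transport control block: the transport layer of each protocol family extends it to suit its own transport characteristics. For example, inet_sock consists of a sock plus additional fields and forms the basis of the IPv4 transport control blocks.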
inet_sock
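This structure is the fairly generic IPv4 descriptor. It holds the information shared by the basic IPv4 transport control blocks, that is, the UDP, TCP, and raw transport control blocks.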
inet_connection_sock
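This structure is the descriptor for connection-oriented protocols and forms the basis of the IPv4 TCP control block. It adds connection support on top of inet_sock.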
tcp_sock
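This structure is the TCP transport control block. It supports the full set of TCP features and holds all the state that TCP maintains for each connection.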
inet_timewait_sock
This structure describes the TCP_TIME_WAIT state of a connection-oriented socket and is the basis of tcp_timewait_sock.
tcp_timewait_sock
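This structure is the descriptor for the TCP_TIME_WAIT state, a rather special kind of transport control block: when a TCP connection enters TCP_TIME_WAIT, its tcp_sock degenerates into a tcp_timewait_sock.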
udp_sock
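This structure is the UDP transport control block and supports the full set of UDP features. Almost everything UDP needs is already described in inet_sock.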
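The layering described above works because each structure embeds the more generic one as its first member, so a pointer to the outermost block is also a valid pointer to every inner layer. The following is a minimal user-space sketch of that idea, not the kernel definitions: the `_m` structures are hypothetical and reduced to one or two fields each, and `inet_sk_m()` and `tcp_sk_m()` only imitate the style of the kernel's cast helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct sock_common_m { unsigned short skc_family; };
struct sock_m { struct sock_common_m sk_common; int sk_sndbuf; };
struct inet_sock_m { struct sock_m sk; unsigned short sport; };
struct inet_connection_sock_m { struct inet_sock_m icsk_inet; int icsk_retransmits; };
struct tcp_sock_m { struct inet_connection_sock_m inet_conn; unsigned int snd_nxt; };

/* Cast helpers: valid only because each layer is the first member of the next. */
static struct inet_sock_m *inet_sk_m(struct sock_m *sk)
{
    return (struct inet_sock_m *)sk;
}

static struct tcp_sock_m *tcp_sk_m(struct sock_m *sk)
{
    return (struct tcp_sock_m *)sk;
}

/* Returns 1 when writes made through each layer's view land in one object. */
int layers_share_one_object(void)
{
    struct tcp_sock_m tp = {0};
    struct sock_m *sk = (struct sock_m *)&tp;   /* generic view of the TCP block */

    assert(offsetof(struct tcp_sock_m, inet_conn) == 0);
    assert(offsetof(struct inet_connection_sock_m, icsk_inet) == 0);
    assert(offsetof(struct inet_sock_m, sk) == 0);

    sk->sk_common.skc_family = 2;   /* written through the generic layer */
    inet_sk_m(sk)->sport = 80;      /* written through the IPv4 layer */
    tcp_sk_m(sk)->snd_nxt = 1000;   /* written through the TCP layer */

    return tp.inet_conn.icsk_inet.sk.sk_common.skc_family == 2 &&
           tp.inet_conn.icsk_inet.sport == 80 &&
           tp.snd_nxt == 1000;
}
```

Generic code can therefore take a plain `struct sock_m *`, while protocol code recovers its own layer with a cast and no extra pointer needs to be stored.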
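The basic transport control block, the IPv4-specific transport control blocks, and the functions common to the transport layer involve the following files:
include/net/sock.h defines the basic transport control block structures, macros, and function prototypes
include/net/inet_sock.h defines the IPv4-specific transport control blocks
net/core/sock.c implements the functions common to the transport layer
net/socket.c implements the socket-layer calls
Memory management of transport control blocks
Allocating and freeing transport control blocks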
sk_alloc()
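When a socket is created, TCP, UDP, and raw IP each allocate a transport control block by calling sk_alloc(). When the transport control block reaches the end of its life, it is released with sk_free().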
/**
 * sk_alloc - All socket objects are allocated here
 * @net: the applicable net namespace
 * @family: protocol family
 * @priority: for allocation (%GFP_KERNEL, %GFP_ATOMIC, etc)
 * @prot: struct proto associated with this new sock instance
 */
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
                      struct proto *prot)
{
    struct sock *sk;

    sk = sk_prot_alloc(prot, priority | __GFP_ZERO, family);
    if (sk) {
        sk->sk_family = family;
        /*
         * See comment in struct sock definition to understand
         * why we need sk_prot_creator -acme
         */
        sk->sk_prot = sk->sk_prot_creator = prot;
        sock_lock_init(sk);
        sock_net_set(sk, get_net(net));
        atomic_set(&sk->sk_wmem_alloc, 1);
    }

    return sk;
}
sk_free()
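sk_free() releases the given transport control block. It is normally called from sock_put(), which invokes it only once the reference count of the control block has dropped to 0.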
static void __sk_free(struct sock *sk)
{
    struct sk_filter *filter;

    if (sk->sk_destruct)
        sk->sk_destruct(sk);

    filter = rcu_dereference(sk->sk_filter);
    if (filter) {
        sk_filter_uncharge(sk, filter);
        rcu_assign_pointer(sk->sk_filter, NULL);
    }

    sock_disable_timestamp(sk, SOCK_TIMESTAMP);
    sock_disable_timestamp(sk, SOCK_TIMESTAMPING_RX_SOFTWARE);

    if (atomic_read(&sk->sk_omem_alloc))
        printk(KERN_DEBUG "%s: optmem leakage (%d bytes) detected.\n",
               __func__, atomic_read(&sk->sk_omem_alloc));

    put_net(sock_net(sk));
    sk_prot_free(sk->sk_prot_creator, sk);
}

void sk_free(struct sock *sk)
{
    /*
     * We subtract one from sk_wmem_alloc and can know if
     * some packets are still in some tx queue.
     * If not null, sock_wfree() will call __sk_free(sk) later
     */
    if (atomic_dec_and_test(&sk->sk_wmem_alloc))
        __sk_free(sk);
}
General send buffer allocation
sock_alloc_send_skb()
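Mainly allocates SKBs for output on UDP and raw sockets. Compared with sock_wmalloc(), it handles more details during allocation: it checks for errors already recorded on the transport control block, checks the socket's shutdown flag, supports blocking, and so on. It is implemented as a direct call to sock_alloc_send_pskb().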
/*
 * Generic send/receive buffer handlers
 */

struct sk_buff *sock_alloc_send_pskb(struct sock *sk,
                                     unsigned long header_len,
                                     unsigned long data_len,
                                     int noblock, int *errcode)
{
    struct sk_buff *skb;
    gfp_t gfp_mask;
    long timeo;
    int err;

    gfp_mask = sk->sk_allocation;
    if (gfp_mask & __GFP_WAIT)
        gfp_mask |= __GFP_REPEAT;

    timeo = sock_sndtimeo(sk, noblock);
    while (1) {
        err = sock_error(sk);
        if (err != 0)
            goto failure;

        err = -EPIPE;
        if (sk->sk_shutdown & SEND_SHUTDOWN)
            goto failure;

        if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
            skb = alloc_skb(header_len, gfp_mask);
            if (skb) {
                int npages;
                int i;

                /* No pages, we're done... */
                if (!data_len)
                    break;

                npages = (data_len + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
                skb->truesize += data_len;
                skb_shinfo(skb)->nr_frags = npages;
                for (i = 0; i < npages; i++) {
                    struct page *page;
                    skb_frag_t *frag;

                    page = alloc_pages(sk->sk_allocation, 0);
                    if (!page) {
                        err = -ENOBUFS;
                        skb_shinfo(skb)->nr_frags = i;
                        kfree_skb(skb);
                        goto failure;
                    }

                    frag = &skb_shinfo(skb)->frags[i];
                    frag->page = page;
                    frag->page_offset = 0;
                    frag->size = (data_len >= PAGE_SIZE ?
                                  PAGE_SIZE :
                                  data_len);
                    data_len -= PAGE_SIZE;
                }

                /* Full success... */
                break;
            }
            err = -ENOBUFS;
            goto failure;
        }
        set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
        set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
        err = -EAGAIN;
        if (!timeo)
            goto failure;
        if (signal_pending(current))
            goto interrupted;
        timeo = sock_wait_for_wmem(sk, timeo);
    }

    skb_set_owner_w(skb, sk);
    return skb;

interrupted:
    err = sock_intr_errno(timeo);
failure:
    *errcode = err;
    return NULL;
}

struct sk_buff *sock_alloc_send_skb(struct sock *sk, unsigned long size,
                                    int noblock, int *errcode)
{
    return sock_alloc_send_pskb(sk, size, 0, noblock, errcode);
}
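The page loop in sock_alloc_send_pskb() is easier to follow in isolation. The sketch below reproduces only its arithmetic in user space: how many page fragments a given data_len needs and how large each one is. It assumes a 4096-byte page (the `MODEL_PAGE_*` constants), and the function names are illustrative rather than kernel APIs.

```c
#include <assert.h>

/* Assumed page geometry for the model; the kernel uses PAGE_SHIFT/PAGE_SIZE. */
#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Number of page fragments needed for data_len bytes (rounds up). */
unsigned long page_frag_count(unsigned long data_len)
{
    return (data_len + (MODEL_PAGE_SIZE - 1)) >> MODEL_PAGE_SHIFT;
}

/* Size of fragment i, or 0 when i is past the last fragment. */
unsigned long page_frag_size(unsigned long data_len, unsigned long i)
{
    unsigned long remaining;

    if (i >= page_frag_count(data_len))
        return 0;
    remaining = data_len - i * MODEL_PAGE_SIZE;
    /* Every fragment is a full page except possibly the last one. */
    return remaining >= MODEL_PAGE_SIZE ? MODEL_PAGE_SIZE : remaining;
}

/* Sum of all fragment sizes; always equals data_len. */
unsigned long page_frag_total(unsigned long data_len)
{
    unsigned long i, total = 0;

    for (i = 0; i < page_frag_count(data_len); i++)
        total += page_frag_size(data_len, i);
    assert(total == data_len);
    return total;
}
```

For a data_len of 10000 bytes this gives three fragments of 4096, 4096, and 1808 bytes. The kernel loop subtracts PAGE_SIZE from data_len unconditionally, so data_len wraps after the last, partial fragment; this is harmless there because the loop has already ended.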
Allocating and freeing send buffers
sock_wmalloc()
sock_wmalloc() also allocates send buffers. In TCP it is used only when building a SYN+ACK; when sending user data, the send buffer is normally allocated with sk_stream_alloc_pskb().
/*
 * Allocate a skb from the socket's send buffer.
 */
struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
                             gfp_t priority)
{
    if (force || atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf) {
        struct sk_buff *skb = alloc_skb(size, priority);
        if (skb) {
            skb_set_owner_w(skb, sk);
            return skb;
        }
    }
    return NULL;
}
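One consequence of the check in sock_wmalloc() is that the limit is tested before the buffer is charged, so the accounted total can exceed sk_sndbuf by up to one buffer. The sketch below models just that admission rule with plain integers. `count_admitted` is a hypothetical helper, the counter starts at 1 to mirror the bias set by sk_alloc(), and every buffer is assumed to have the same truesize.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Counts how many of `attempts` equally sized allocations pass the
 * "force || sk_wmem_alloc < sk_sndbuf" test when none are freed in between.
 */
int count_admitted(int sndbuf, int truesize, int attempts, bool force)
{
    int wmem_alloc = 1;   /* bias of 1, as set by sk_alloc() */
    int admitted = 0;
    int i;

    assert(truesize > 0 && attempts >= 0);
    for (i = 0; i < attempts; i++) {
        if (force || wmem_alloc < sndbuf) {
            wmem_alloc += truesize;   /* the charge made by skb_set_owner_w() */
            admitted++;
        }
    }
    return admitted;
}
```

With a limit of 1000 and buffers of 400, three allocations are admitted and the counter ends at 1201, above the limit; with force set, every attempt is admitted regardless of the limit.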
skb_set_owner_w()
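Every SKB used for output must be associated with a transport control block. This lets the kernel adjust the total size of all SKB data allocated for sending on that control block, and it sets the SKB's destructor.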
/*
 * Queue a received datagram if it will fit. Stream and sequenced
 * protocols can't normally use this as they need to fit buffers in
 * and play with them.
 *
 * Inlined as it's very short and called for pretty much every
 * packet ever received.
 */
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
    skb_orphan(skb);
    skb->sk = sk;
    skb->destructor = sock_wfree;
    /*
     * We used to take a refcount on sk, but following operation
     * is enough to guarantee sk_free() wont free this sock until
     * all in-flight packets are completed
     */
    atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}
sock_wfree()
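sock_wfree() is normally installed as the destructor of an output SKB and is called when that SKB is freed. It updates the total size of SKB data allocated for sending on the owning transport control block, calls the sk_write_space callback to wake processes sleeping while they wait on this socket, and drops the SKB's reference on the owning transport control block.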
/*
 * Simple resource managers for sockets.
 */

/*
 * Write buffer destructor automatically called from kfree_skb.
 */
void sock_wfree(struct sk_buff *skb)
{
    struct sock *sk = skb->sk;
    unsigned int len = skb->truesize;

    if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) {
        /*
         * Keep a reference on sk_wmem_alloc, this will be released
         * after sk_write_space() call
         */
        atomic_sub(len - 1, &sk->sk_wmem_alloc);
        sk->sk_write_space(sk);
        len = 1;
    }
    /*
     * if sk_wmem_alloc reaches 0, we must finish what sk_free()
     * could not do because of in-flight packets
     */
    if (atomic_sub_and_test(len, &sk->sk_wmem_alloc))
        __sk_free(sk);
}
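Taken together, sk_alloc(), skb_set_owner_w(), sock_wfree(), and sk_free() use sk_wmem_alloc both as a byte counter and as a reference count: the initial value of 1 stands for the socket itself, and whichever of sk_free() or the last sock_wfree() brings the counter to 0 performs the real free. The following single-threaded sketch models that protocol with a plain int instead of atomic_t. The `model_` names and the `freed` flag are assumptions made for illustration, and the SOCK_USE_WRITE_QUEUE case is left out.

```c
#include <assert.h>
#include <stdbool.h>

struct sock_model {
    int sk_wmem_alloc;       /* atomic_t in the kernel */
    bool freed;              /* set where the kernel would call __sk_free() */
    int write_space_calls;   /* how often sk_write_space() would have run */
};

static void model_sk_alloc(struct sock_model *sk)
{
    sk->sk_wmem_alloc = 1;   /* bias: one unit held by the socket itself */
    sk->freed = false;
    sk->write_space_calls = 0;
}

static void model_set_owner_w(struct sock_model *sk, int truesize)
{
    sk->sk_wmem_alloc += truesize;
}

static void model_sk_free(struct sock_model *sk)
{
    if (--sk->sk_wmem_alloc == 0)   /* atomic_dec_and_test() */
        sk->freed = true;
}

static void model_sock_wfree(struct sock_model *sk, int truesize)
{
    /* Keep one unit so the socket cannot vanish during write_space. */
    sk->sk_wmem_alloc -= truesize - 1;
    sk->write_space_calls++;
    if (--sk->sk_wmem_alloc == 0)   /* atomic_sub_and_test(1, ...) */
        sk->freed = true;
}

/* sk_free() runs while one SKB is still queued: bit 0 = freed early, bit 1 = freed at end. */
int free_while_in_flight(int truesize)
{
    struct sock_model sk;
    int freed_early;

    assert(truesize > 0);
    model_sk_alloc(&sk);
    model_set_owner_w(&sk, truesize);
    model_sk_free(&sk);                 /* counter drops to truesize, not 0 */
    freed_early = sk.freed;
    model_sock_wfree(&sk, truesize);    /* the last SKB completes the free */
    return (freed_early ? 1 : 0) | (sk.freed ? 2 : 0);
}

/* The SKB is released first, then sk_free() runs: same bit encoding. */
int free_after_drain(int truesize)
{
    struct sock_model sk;
    int freed_early;

    assert(truesize > 0);
    model_sk_alloc(&sk);
    model_set_owner_w(&sk, truesize);
    model_sock_wfree(&sk, truesize);    /* counter returns to the bias of 1 */
    freed_early = sk.freed;
    model_sk_free(&sk);
    return (freed_early ? 1 : 0) | (sk.freed ? 2 : 0);
}
```

Both scenarios return 2: the block is never freed while a charge is outstanding, and it is freed exactly once when the last holder lets go.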
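Allocating and freeing receive buffers
SKBs used for input are allocated in the driver layer with dev_alloc_skb() or alloc_skb(). Until they are passed up to the transport layer they do not belong to any particular transport control block, but once they enter the transport layer the SKB's owner must be set.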
skb_set_owner_r()
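When the SKB of a UDP datagram is delivered and added to the receive queue of a UDP transport control block, skb_set_owner_r() is called to set the SKB's owner and destructor and to update the total length of packet data in the receive queue.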
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
    skb_orphan(skb);
    skb->sk = sk;
    skb->destructor = sock_rfree;
    atomic_add(skb->truesize, &sk->sk_rmem_alloc);
    sk_mem_charge(sk, skb->truesize);
}
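Asynchronous I/O mechanism
Although combining blocking and non-blocking operations with select is effective for polling a device in most cases, in some situations it still does not solve the problem completely.
For example, consider a process that runs a long computation loop at low priority but needs to handle input data as soon as possible. If the process obtains its data in response to a peripheral, it should know immediately when new data is available. An application can call select() regularly to check for data, but to handle the data more promptly it can use asynchronous notification, so that it receives a signal instead of having to poll.
A user program must perform two steps to enable asynchronous notification from an input file. First, it designates a process as the owner of the file: when a process issues the F_SETOWN command through the fcntl system call, the owner's process ID is saved in filp->f_owner for later use, which tells the kernel whom to notify. Second, to actually enable asynchronous notification, the program must set the FASYNC flag on the device with the fcntl F_SETFL command. After these two calls, the process handling asynchronous I/O can take over the SIGIO signal; from then on, whenever new data arrives, the signal is sent to the process stored in filp->f_owner.
For example, the following user-program code makes the kernel send asynchronous notifications for standard input to the current process: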
signal(SIGIO, &input_handler);
fcntl(STDIN_FILENO, F_SETOWN, getpid());
oflags = fcntl(STDIN_FILENO, F_GETFL);
fcntl(STDIN_FILENO, F_SETFL, oflags | FASYNC);
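The fragment above leaves out the handler and the declarations. A fuller sketch of the same two fcntl steps follows, assuming a Linux system. It applies them to one end of an AF_UNIX socket pair instead of standard input so that the effect can be observed without a terminal. `sigio_demo` and `on_sigio` are illustrative names, and O_ASYNC is used as the portable spelling of the FASYNC flag.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t sigio_seen;

/* SIGIO handler: only records that the notification arrived. */
static void on_sigio(int signo)
{
    (void)signo;
    sigio_seen = 1;
}

/* Returns 1 if SIGIO arrived after data was written, 0 if it did not
 * arrive within about one second, and -1 on a setup error. */
int sigio_demo(void)
{
    int sv[2];
    int oflags, i, seen;
    struct sigaction sa;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    sigio_seen = 0;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigio;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGIO, &sa, NULL) < 0)
        goto fail;

    /* Step 1: name this process as the owner to be notified. */
    if (fcntl(sv[0], F_SETOWN, getpid()) < 0)
        goto fail;

    /* Step 2: turn on asynchronous notification for the descriptor. */
    oflags = fcntl(sv[0], F_GETFL);
    if (oflags < 0 || fcntl(sv[0], F_SETFL, oflags | O_ASYNC) < 0)
        goto fail;

    /* Data arriving on sv[0] should now raise SIGIO. */
    if (write(sv[1], "x", 1) != 1)
        goto fail;

    for (i = 0; i < 100 && !sigio_seen; i++) {
        struct timespec ts = { 0, 10 * 1000 * 1000 };   /* 10 ms */
        nanosleep(&ts, NULL);
    }
    seen = sigio_seen ? 1 : 0;

    close(sv[0]);
    close(sv[1]);
    return seen;

fail:
    close(sv[0]);
    close(sv[1]);
    return -1;
}
```

The handler is installed before the flag is set, since the default action of SIGIO terminates the process.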
sk_wake_async()
Sends the SIGIO or SIGURG signal to the processes registered on the socket, telling them that the file can be read or written.
/* This function may be called only under socket lock or callback_lock */

int sock_wake_async(struct socket *sock, int how, int band)
{
    if (!sock || !sock->fasync_list)
        return -1;
    switch (how) {
    case SOCK_WAKE_WAITD:
        if (test_bit(SOCK_ASYNC_WAITDATA, &sock->flags))
            break;
        goto call_kill;
    case SOCK_WAKE_SPACE:
        if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sock->flags))
            break;
        /* fall through */
    case SOCK_WAKE_IO:
call_kill:
        __kill_fasync(sock->fasync_list, SIGIO, band);
        break;
    case SOCK_WAKE_URG:
        __kill_fasync(sock->fasync_list, SIGURG, band);
    }
    return 0;
}

static inline void sk_wake_async(struct sock *sk, int how, int band)
{
    if (sk->sk_socket && sk->sk_socket->fasync_list)
        sock_wake_async(sk->sk_socket, how, band);
}
how
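enum {
    SOCK_WAKE_IO,     /* no check; send SIGIO to the waiting processes directly */
    SOCK_WAKE_WAITD,  /* check the flag that says whether the application is waiting for data in a call such as recv */
    SOCK_WAKE_SPACE,  /* check whether the send queue of the transport control block has ever reached its limit */
    SOCK_WAKE_URG,    /* send SIGURG to the waiting processes */
};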
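The effect of each how value can be checked against the switch in sock_wake_async() with a small user-space model. The sketch below returns which signal, if any, would be sent for a given how and flag state. The `MODEL_` names are hypothetical, the two booleans stand in for the SOCK_ASYNC_WAITDATA and SOCK_ASYNC_NOSPACE bits, and nothing is actually signalled.

```c
#include <assert.h>
#include <stdbool.h>

enum model_how { MODEL_WAKE_IO, MODEL_WAKE_WAITD, MODEL_WAKE_SPACE, MODEL_WAKE_URG };
enum model_sig { MODEL_SIG_NONE, MODEL_SIGIO, MODEL_SIGURG };

struct model_flags {
    bool async_waitdata;   /* a reader is already blocked waiting for data */
    bool async_nospace;    /* a writer previously found the send buffer full */
};

/* Mirrors the decision structure of sock_wake_async() without sending anything. */
enum model_sig model_wake_signal(enum model_how how, struct model_flags *f)
{
    switch (how) {
    case MODEL_WAKE_WAITD:
        /* A blocked reader is woken through the wait queue, so no signal. */
        return f->async_waitdata ? MODEL_SIG_NONE : MODEL_SIGIO;
    case MODEL_WAKE_SPACE:
        /* Signal only once per "buffer was full" episode. */
        if (!f->async_nospace)
            return MODEL_SIG_NONE;
        f->async_nospace = false;   /* test_and_clear_bit() */
        return MODEL_SIGIO;
    case MODEL_WAKE_IO:
        return MODEL_SIGIO;
    case MODEL_WAKE_URG:
        return MODEL_SIGURG;
    }
    return MODEL_SIG_NONE;
}

/* Number of SIGIOs produced by two back-to-back SPACE wakeups after one
 * "no space" episode. */
int space_wake_twice(void)
{
    struct model_flags f = { false, true };
    int sent = 0;

    sent += model_wake_signal(MODEL_WAKE_SPACE, &f) == MODEL_SIGIO;
    sent += model_wake_signal(MODEL_WAKE_SPACE, &f) == MODEL_SIGIO;
    assert(!f.async_nospace);
    return sent;
}
```

Because the NOSPACE bit is cleared by the first wakeup, `space_wake_twice()` yields a single signal.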
band
/*
 * SIGPOLL si_codes
 */
#define POLL_IN   (__SI_POLL|1)  /* data input available */
#define POLL_OUT  (__SI_POLL|2)  /* output buffers available */
#define POLL_MSG  (__SI_POLL|3)  /* input message available */
#define POLL_ERR  (__SI_POLL|4)  /* i/o error */
#define POLL_PRI  (__SI_POLL|5)  /* high priority input available */
#define POLL_HUP  (__SI_POLL|6)  /* device disconnected */
sock_def_wakeup()
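Wakes the processes sleeping on the sk_sleep queue of a transport control block. It is the default function for waking processes that wait on the socket, it is installed as the sk_state_change callback, and it is normally called when the state of the transport control block changes.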
/*
 * Default Socket Callbacks
 */

static void sock_def_wakeup(struct sock *sk)
{
    read_lock(&sk->sk_callback_lock);
    if (sk_has_sleeper(sk))
        wake_up_interruptible_all(sk->sk_sleep);
    read_unlock(&sk->sk_callback_lock);
}
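Notifying processes after a FIN segment is received
TCP also notifies the processes on the socket's fasync_list in several other places. For example, when TCP receives a FIN segment and the socket is not in the DEAD state, it wakes the processes waiting on the socket. If both the send and receive directions have been shut down, or the transport control block is in the CLOSE state, it tells the processes waiting asynchronously on the socket that the connection has terminated (POLL_HUP); otherwise it tells them that the connection can be read (POLL_IN).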
/*
 * Process the FIN bit. This now behaves as it is supposed to work
 * and the FIN takes effect when it is validly part of sequence
 * space. Not before when we get holes.
 *
 * If we are ESTABLISHED, a received fin moves us to CLOSE-WAIT
 * (and thence onto LAST-ACK and finally, CLOSE, we never enter
 * TIME-WAIT)
 *
 * If we are in FINWAIT-1, a received FIN indicates simultaneous
 * close and we go into CLOSING (and later onto TIME-WAIT)
 *
 * If we are in FINWAIT-2, a received FIN moves us to TIME-WAIT.
 */
static void tcp_fin(struct sk_buff *skb, struct sock *sk, struct tcphdr *th)
{
    struct tcp_sock *tp = tcp_sk(sk);

    inet_csk_schedule_ack(sk);

    sk->sk_shutdown |= RCV_SHUTDOWN;
    sock_set_flag(sk, SOCK_DONE);

    switch (sk->sk_state) {
    case TCP_SYN_RECV:
    case TCP_ESTABLISHED:
        /* Move to CLOSE_WAIT */
        tcp_set_state(sk, TCP_CLOSE_WAIT);
        inet_csk(sk)->icsk_ack.pingpong = 1;
        break;

    case TCP_CLOSE_WAIT:
    case TCP_CLOSING:
        /* Received a retransmission of the FIN, do
         * nothing.
         */
        break;
    case TCP_LAST_ACK:
        /* RFC793: Remain in the LAST-ACK state. */
        break;

    case TCP_FIN_WAIT1:
        /* This case occurs when a simultaneous close
         * happens, we must ack the received FIN and
         * enter the CLOSING state.
         */
        tcp_send_ack(sk);
        tcp_set_state(sk, TCP_CLOSING);
        break;
    case TCP_FIN_WAIT2:
        /* Received a FIN -- send ACK and enter TIME_WAIT. */
        tcp_send_ack(sk);
        tcp_time_wait(sk, TCP_TIME_WAIT, 0);
        break;
    default:
        /* Only TCP_LISTEN and TCP_CLOSE are left, in these
         * cases we should never reach this piece of code.
         */
        printk(KERN_ERR "%s: Impossible, sk->sk_state=%d\n",
               __func__, sk->sk_state);
        break;
    }

    /* It _is_ possible, that we have something out-of-order _after_ FIN.
     * Probably, we should reset in this case. For now drop them.
     */
    __skb_queue_purge(&tp->out_of_order_queue);
    if (tcp_is_sack(tp))
        tcp_sack_reset(&tp->rx_opt);
    sk_mem_reclaim(sk);

    if (!sock_flag(sk, SOCK_DEAD)) {
        sk->sk_state_change(sk);

        /* Do not send POLL_HUP for half duplex close. */
        if (sk->sk_shutdown == SHUTDOWN_MASK ||
            sk->sk_state == TCP_CLOSE)
            sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_HUP);
        else
            sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
    }
}
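The two decisions made by tcp_fin(), the next connection state and the band passed to sk_wake_async(), can be written as a small table-like model. The sketch below is a user-space approximation with hypothetical `M_` names. It treats TIME_WAIT as a label for the connection (in the kernel that step is carried out by tcp_time_wait()) and takes the shutdown bits that were set before the FIN arrived as an input.

```c
#include <assert.h>

enum m_state {
    M_SYN_RECV, M_ESTABLISHED, M_CLOSE_WAIT, M_CLOSING, M_LAST_ACK,
    M_FIN_WAIT1, M_FIN_WAIT2, M_TIME_WAIT, M_LISTEN, M_CLOSE
};

enum m_band { M_POLL_IN, M_POLL_HUP };

#define M_RCV_SHUTDOWN  1
#define M_SEND_SHUTDOWN 2
#define M_SHUTDOWN_MASK 3

/* State the connection moves to when a FIN is received in state s. */
enum m_state fin_next_state(enum m_state s)
{
    switch (s) {
    case M_SYN_RECV:
    case M_ESTABLISHED:
        return M_CLOSE_WAIT;   /* peer closed first */
    case M_FIN_WAIT1:
        return M_CLOSING;      /* simultaneous close */
    case M_FIN_WAIT2:
        return M_TIME_WAIT;    /* our FIN was already acknowledged */
    default:
        return s;              /* retransmitted FIN or unexpected state */
    }
}

/* Band reported to asynchronous waiters, given the shutdown bits before the FIN. */
enum m_band fin_band(enum m_state s, int shutdown_before)
{
    int shutdown_after = shutdown_before | M_RCV_SHUTDOWN;

    assert((shutdown_before & ~M_SHUTDOWN_MASK) == 0);
    /* Hang-up is reported only for a full close, not a half-duplex one. */
    if (shutdown_after == M_SHUTDOWN_MASK || fin_next_state(s) == M_CLOSE)
        return M_POLL_HUP;
    return M_POLL_IN;
}
```

An ESTABLISHED connection that has not shut down its send side moves to CLOSE_WAIT and reports POLL_IN, whereas a connection in FIN_WAIT2, which has already shut down sending, reports POLL_HUP.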
sock_fasync()
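Adds entries to and removes entries from the socket's asynchronous notification list. The list is modified in process context but may be used from softirq context, so access to it must be locked: an update takes the socket lock and also the sk_callback_lock of the transport control block.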
/*
 * Update the socket async list
 *
 * Fasync_list locking strategy.
 *
 * 1. fasync_list is modified only under process context socket lock
 *    i.e. under semaphore.
 * 2. fasync_list is used under read_lock(&sk->sk_callback_lock)
 *    or under socket lock.
 * 3. fasync_list can be used from softirq context, so that
 *    modification under socket lock have to be enhanced with
 *    write_lock_bh(&sk->sk_callback_lock).
 *                                                  --ANK (990710)
 */

static int sock_fasync(int fd, struct file *filp, int on)
{
    struct fasync_struct *fa, *fna = NULL, **prev;
    struct socket *sock;
    struct sock *sk;

    if (on) {
        fna = kmalloc(sizeof(struct fasync_struct), GFP_KERNEL);
        if (fna == NULL)
            return -ENOMEM;
    }

    sock = filp->private_data;

    sk = sock->sk;
    if (sk == NULL) {
        kfree(fna);
        return -EINVAL;
    }

    lock_sock(sk);

    spin_lock(&filp->f_lock);
    if (on)
        filp->f_flags |= FASYNC;
    else
        filp->f_flags &= ~FASYNC;
    spin_unlock(&filp->f_lock);

    prev = &(sock->fasync_list);

    for (fa = *prev; fa != NULL; prev = &fa->fa_next, fa = *prev)
        if (fa->fa_file == filp)
            break;

    if (on) {
        if (fa != NULL) {
            write_lock_bh(&sk->sk_callback_lock);
            fa->fa_fd = fd;
            write_unlock_bh(&sk->sk_callback_lock);

            kfree(fna);
            goto out;
        }
        fna->fa_file = filp;
        fna->fa_fd = fd;
        fna->magic = FASYNC_MAGIC;
        fna->fa_next = sock->fasync_list;
        write_lock_bh(&sk->sk_callback_lock);
        sock->fasync_list = fna;
        write_unlock_bh(&sk->sk_callback_lock);
    } else {
        if (fa != NULL) {
            write_lock_bh(&sk->sk_callback_lock);
            *prev = fa->fa_next;
            write_unlock_bh(&sk->sk_callback_lock);
            kfree(fa);
        }
    }

out:
    release_sock(sock->sk);
    return 0;
}
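The list handling in sock_fasync() relies on a pointer-to-pointer walk: prev always points at the link that refers to the current entry, so removing an entry is a single assignment with no special case for the head. The sketch below isolates that idiom in user space. All locking is omitted, the `fasync_model` names are hypothetical, and the new entry is allocated only when needed, whereas the kernel allocates it up front before taking the locks.

```c
#include <assert.h>
#include <stdlib.h>

struct fasync_model {
    int fa_fd;
    const void *fa_file;             /* identity of the open file */
    struct fasync_model *fa_next;
};

/* Adds (on != 0) or removes (on == 0) the entry for `file`; returns 0, or -1 if out of memory. */
int fasync_model_update(struct fasync_model **list, const void *file, int fd, int on)
{
    struct fasync_model *fa, **prev = list;

    /* prev tracks the link pointing at fa, so unlinking needs no head check. */
    for (fa = *prev; fa != NULL; prev = &fa->fa_next, fa = *prev)
        if (fa->fa_file == file)
            break;

    if (on) {
        if (fa != NULL) {            /* already registered: refresh the fd only */
            fa->fa_fd = fd;
            return 0;
        }
        fa = malloc(sizeof(*fa));
        if (fa == NULL)
            return -1;
        fa->fa_file = file;
        fa->fa_fd = fd;
        fa->fa_next = *list;         /* push at the head, as sock_fasync() does */
        *list = fa;
    } else if (fa != NULL) {
        *prev = fa->fa_next;         /* unlink, whether head or interior */
        free(fa);
    }
    return 0;
}

int fasync_model_len(const struct fasync_model *list)
{
    int n = 0;

    for (; list != NULL; list = list->fa_next)
        n++;
    return n;
}

/* Registers two files, re-registers the first, then removes both.
 * Encodes the list length after each phase as three decimal digits. */
int fasync_model_scenario(void)
{
    static const char file_a = 'a', file_b = 'b';   /* only their addresses matter */
    struct fasync_model *list = NULL;
    int after_dup, after_first_remove;

    if (fasync_model_update(&list, &file_a, 3, 1) < 0 ||
        fasync_model_update(&list, &file_b, 4, 1) < 0 ||
        fasync_model_update(&list, &file_a, 5, 1) < 0)
        return -1;
    after_dup = fasync_model_len(list);
    assert(list->fa_next->fa_fd == 5);   /* file_a's entry was updated in place */

    fasync_model_update(&list, &file_a, 0, 0);
    after_first_remove = fasync_model_len(list);
    fasync_model_update(&list, &file_b, 0, 0);

    return after_dup * 100 + after_first_remove * 10 + fasync_model_len(list);
}
```

Registering the same file twice leaves a single entry with the newer fd, which matches the branch in sock_fasync() that updates fa_fd and frees the unused allocation.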