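On Unix systems a socket is treated like an ordinary file, because it can be read and written through the same interface. It still has properties of its own: the read/write position of a file can be set, for example, whereas a socket can only be read and written sequentially. How does Unix implement this?
As the figure below shows, the important data structures are proc, filedesc, file and a few others. Once these structures and the relationships between them are clear, the question above answers itself. The operating system source used in this article is 4.4BSD-Lite, the version used by the book "TCP/IP Illustrated, Volume 2: The Implementation". It is also smaller and simpler than the Linux source in use today, which makes it better suited to study.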
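Before turning to the kernel structures, the user-visible difference described above can be checked from an ordinary program. The following is a minimal sketch, not code from 4.4BSD-Lite: the helper names seek_errno and socket_io_then_seek are made up for illustration, and it assumes a POSIX system where lseek() on a socket fails with ESPIPE.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 0 if lseek() on fd succeeds, otherwise the errno it reports. */
static int seek_errno(int fd)
{
	assert(fd >= 0);
	if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
		return errno;
	return 0;
}

/*
 * Sends a short message through a connected socket pair using plain
 * write()/read(), exactly as one would with a file, then tries to
 * seek on the socket.
 * Returns the errno from lseek(), or -1 if the setup or the I/O failed.
 */
static int socket_io_then_seek(void)
{
	int sv[2];
	char buf[8] = {0};
	int err = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
		return -1;

	if (write(sv[0], "ping", 4) == 4 &&
	    read(sv[1], buf, sizeof(buf)) == 4 &&
	    memcmp(buf, "ping", 4) == 0)
		err = seek_errno(sv[1]);

	close(sv[0]);
	close(sv[1]);
	return err;
}
```

Calling socket_io_then_seek() from main() and comparing the result with ESPIPE shows both halves of the claim: read() and write() work on the socket descriptor, while setting the position does not.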
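(1) Every process has a corresponding data structure in the OS (struct proc), usually called the PCB (process control block). It defines all the data needed to control the process. Here we only need to look at one member, struct filedesc *p_fd, which points to a structure describing the files the process has open. struct proc is defined in 4.4BSD-Lite\sys\sys\proc.h as follows: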
struct proc {
	struct proc *p_forw;		/* Doubly-linked run/sleep queue. */
	struct proc *p_back;
	struct proc *p_next;		/* Linked list of active procs */
	struct proc **p_prev;		/* and zombies. */

	/* substructures: */
	struct pcred *p_cred;		/* Process owner's identity. */
	struct filedesc *p_fd;		/* Ptr to open files structure. */
	struct pstats *p_stats;		/* Accounting/statistics (PROC ONLY). */
	struct plimit *p_limit;		/* Process limits. */
	struct vmspace *p_vmspace;	/* Address space. */
	struct sigacts *p_sigacts;	/* Signal actions, state (PROC ONLY). */

#define p_ucred p_cred->pc_ucred
#define p_rlimit p_limit->pl_rlimit

	int p_flag;			/* P_* flags. */
	char p_stat;			/* S* process status. */
	char p_pad1[3];

	pid_t p_pid;			/* Process identifier. */
	struct proc *p_hash;		/* Hashed based on p_pid for kill+exit+... */
	struct proc *p_pgrpnxt;		/* Pointer to next process in process group. */
	struct proc *p_pptr;		/* Pointer to process structure of parent. */
	struct proc *p_osptr;		/* Pointer to older sibling processes. */

	/* The following fields are all zeroed upon creation in fork. */
#define p_startzero p_ysptr
	struct proc *p_ysptr;		/* Pointer to younger siblings. */
	struct proc *p_cptr;		/* Pointer to youngest living child. */
	pid_t p_oppid;			/* Save parent pid during ptrace. XXX */
	int p_dupfd;			/* Sideways return value from fdopen. XXX */

	/* scheduling */
	u_int p_estcpu;			/* Time averaged value of p_cpticks. */
	int p_cpticks;			/* Ticks of cpu time. */
	fixpt_t p_pctcpu;		/* %cpu for this process during p_swtime */
	void *p_wchan;			/* Sleep address. */
	char *p_wmesg;			/* Reason for sleep. */
	u_int p_swtime;			/* Time swapped in or out. */
	u_int p_slptime;		/* Time since last blocked. */

	struct itimerval p_realtimer;	/* Alarm timer. */
	struct timeval p_rtime;		/* Real time. */
	u_quad_t p_uticks;		/* Statclock hits in user mode. */
	u_quad_t p_sticks;		/* Statclock hits in system mode. */
	u_quad_t p_iticks;		/* Statclock hits processing intr. */

	int p_traceflag;		/* Kernel trace points. */
	struct vnode *p_tracep;		/* Trace to vnode. */

	int p_siglist;			/* Signals arrived but not delivered. */

	struct vnode *p_textvp;		/* Vnode of executable. */

	long p_spare[5];		/* pad to 256, avoid shifting eproc. */

	/* End area that is zeroed on creation. */
#define p_endzero p_startcopy

	/* The following fields are all copied upon creation in fork. */
#define p_startcopy p_sigmask
	sigset_t p_sigmask;		/* Current signal mask. */
	sigset_t p_sigignore;		/* Signals being ignored. */
	sigset_t p_sigcatch;		/* Signals being caught by user. */

	u_char p_priority;		/* Process priority. */
	u_char p_usrpri;		/* User-priority based on p_cpu and p_nice. */
	char p_nice;			/* Process "nice" value. */
	char p_comm[MAXCOMLEN+1];

	struct pgrp *p_pgrp;		/* Pointer to process group. */

	/* End area that is copied on creation. */
#define p_endcopy p_thread
	int p_thread;			/* Id for this "thread"; Mach glue. XXX */
	struct user *p_addr;		/* Kernel virtual addr of u-area (PROC ONLY). */
	struct mdproc p_md;		/* Any machine-dependent fields. */

	u_short p_xstat;		/* Exit status for wait; also stop signal. */
	u_short p_acflag;		/* Accounting flags. */
	struct rusage *p_ru;		/* Exit information. XXX */
};
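(2) struct filedesc describes all the files the process has open. The two array members to focus on are struct file **fd_ofiles and char *fd_ofileflags (see the figure above). Each element of fd_ofiles holds the address of the file structure for one file the process has open; each element of fd_ofileflags holds the descriptor flags for one open file. The descriptor flags are encoded as bits, so the 8 bits of each char can represent up to 8 different flags, for example close-on-exec and mapped-from-device. The two arrays are indexed in parallel, that is:
fd_ofiles[n] holds the address of the file structure for the n-th open file (file descriptor n) of the current process;
fd_ofileflags[n] holds the descriptor flags for the n-th open file (file descriptor n) of the current process;
struct filedesc is defined in 4.4BSD-Lite\sys\sys\filedesc.h as follows: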
struct filedesc {
	struct file **fd_ofiles;	/* file structures for open files */
	char *fd_ofileflags;		/* per-process open file flags */
	struct vnode *fd_cdir;		/* current directory */
	struct vnode *fd_rdir;		/* root directory */
	int fd_nfiles;			/* number of open files allocated */
	u_short fd_lastfile;		/* high-water mark of fd_ofiles */
	u_short fd_freefile;		/* approx. next free file */
	u_short fd_cmask;		/* mask for file creation */
	u_short fd_refcnt;		/* reference count */
};
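The parallel indexing of fd_ofiles and fd_ofileflags can be sketched in user space as follows. This is an illustration under stated assumptions, not kernel code: the sim_ types and functions are hypothetical reduced stand-ins, and the flag values UF_EXCLOSE and UF_MAPPED are given as I recall them from filedesc.h, so check them against your copy of the source.

```c
#include <assert.h>
#include <stddef.h>

/* Descriptor flag bits (values assumed from filedesc.h). */
#define UF_EXCLOSE 0x01 /* close-on-exec */
#define UF_MAPPED  0x02 /* mapped from device */

/* Hypothetical, heavily reduced stand-ins for struct file and struct filedesc. */
struct sim_file {
	short f_type;
};

struct sim_filedesc {
	struct sim_file **fd_ofiles; /* file structures for open files */
	char *fd_ofileflags;         /* per-descriptor flags, same index */
	int fd_nfiles;               /* number of slots allocated */
};

/* Returns the file structure for fd, or NULL if fd is out of range or unused. */
static struct sim_file *sim_fd_lookup(const struct sim_filedesc *fdp, int fd)
{
	assert(fdp != NULL);
	if (fd < 0 || fd >= fdp->fd_nfiles)
		return NULL;
	return fdp->fd_ofiles[fd];
}

/* Stores fp in slot fd and clears that slot's flags. Returns 0, or -1 if fd is out of range. */
static int sim_fd_install(struct sim_filedesc *fdp, int fd, struct sim_file *fp)
{
	assert(fdp != NULL);
	if (fd < 0 || fd >= fdp->fd_nfiles)
		return -1;
	fdp->fd_ofiles[fd] = fp;
	fdp->fd_ofileflags[fd] = 0;
	return 0;
}

/* Sets (on != 0) or clears (on == 0) one flag bit for descriptor fd. */
static void sim_fd_setflag(struct sim_filedesc *fdp, int fd, int flag, int on)
{
	assert(fdp != NULL && fd >= 0 && fd < fdp->fd_nfiles);
	if (on)
		fdp->fd_ofileflags[fd] |= (char)flag;
	else
		fdp->fd_ofileflags[fd] &= (char)~flag;
}

/* Returns nonzero if the flag bit is set for descriptor fd. */
static int sim_fd_testflag(const struct sim_filedesc *fdp, int fd, int flag)
{
	assert(fdp != NULL && fd >= 0 && fd < fdp->fd_nfiles);
	return (fdp->fd_ofileflags[fd] & flag) != 0;
}
```

Because the flags live in the per-process table and not in struct file, two descriptors that share one file structure can still carry different close-on-exec settings.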
(3) struct file represents one open file of the current process. The members of interest here are short f_type, struct fileops *f_ops and caddr_t f_data:
short f_type gives the type of the open file. The file 4.4BSD-Lite\sys\sys\file.h defines two values for short f_type:
#define DTYPE_VNODE 1 /* file */
#define DTYPE_SOCKET 2 /* communications endpoint */
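For example, the value DTYPE_SOCKET means the open file is a socket, and DTYPE_VNODE means it is an ordinary file;
struct fileops *f_ops points to a table of 5 function pointers, which are set according to the file type given by f_type. When the open file is a socket (f_type is DTYPE_SOCKET), they point to the 5 socket functions soo_read, soo_write, soo_ioctl, soo_select and soo_close; when it is an ordinary file (f_type is DTYPE_VNODE), they point to the 5 vnode functions vn_read, vn_write, vn_ioctl, vn_select and vn_closefile;
caddr_t f_data points to the data behind the open file, which is either a vnode or a socket structure. The type caddr_t is really just char *; it is defined as: typedef char * caddr_t;.
struct file is defined in 4.4BSD-Lite\sys\sys\file.h as follows: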
struct file {
	struct file *f_filef;	/* list of active files */
	struct file **f_fileb;	/* list of active files */
	short f_flag;		/* see fcntl.h */
#define DTYPE_VNODE 1		/* file */
#define DTYPE_SOCKET 2		/* communications endpoint */
	short f_type;		/* descriptor type */
	short f_count;		/* reference count */
	short f_msgcount;	/* references from message queue */
	struct ucred *f_cred;	/* credentials associated with descriptor */
	struct fileops {
		int (*fo_read) __P((struct file *fp, struct uio *uio, struct ucred *cred));
		int (*fo_write) __P((struct file *fp, struct uio *uio, struct ucred *cred));
		int (*fo_ioctl) __P((struct file *fp, int com, caddr_t data, struct proc *p));
		int (*fo_select) __P((struct file *fp, int which, struct proc *p));
		int (*fo_close) __P((struct file *fp, struct proc *p));
	} *f_ops;
	off_t f_offset;
	caddr_t f_data;		/* vnode or socket */
};
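The way f_ops lets one generic read path serve both files and sockets can be sketched as follows. This is a user-space illustration only: the sim_ names are hypothetical, the operation table is cut down to two of the five entries, and the handlers merely report which implementation was reached.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DTYPE_VNODE  1 /* file */
#define DTYPE_SOCKET 2 /* communications endpoint */

struct sim_file;

/* Reduced operation table: only two of the five fileops entries. */
struct sim_fileops {
	const char *(*fo_read)(struct sim_file *fp);
	const char *(*fo_close)(struct sim_file *fp);
};

struct sim_file {
	short f_type;                    /* descriptor type */
	const struct sim_fileops *f_ops; /* operations for that type */
	void *f_data;                    /* vnode or socket */
};

/* Stand-in handlers; each one just names the kernel function it represents. */
static const char *sim_vn_read(struct sim_file *fp)      { (void)fp; return "vn_read"; }
static const char *sim_vn_closefile(struct sim_file *fp) { (void)fp; return "vn_closefile"; }
static const char *sim_soo_read(struct sim_file *fp)     { (void)fp; return "soo_read"; }
static const char *sim_soo_close(struct sim_file *fp)    { (void)fp; return "soo_close"; }

static const struct sim_fileops sim_vnops     = { sim_vn_read, sim_vn_closefile };
static const struct sim_fileops sim_socketops = { sim_soo_read, sim_soo_close };

/* Fills in fp at creation time: the type decides which table f_ops points to. */
static void sim_file_init(struct sim_file *fp, short type, void *data)
{
	assert(fp != NULL);
	assert(type == DTYPE_VNODE || type == DTYPE_SOCKET);
	fp->f_type = type;
	fp->f_ops = (type == DTYPE_SOCKET) ? &sim_socketops : &sim_vnops;
	fp->f_data = data;
}

/* Generic paths: no switch on f_type, only an indirect call through f_ops. */
static const char *sim_read(struct sim_file *fp)
{
	assert(fp != NULL && fp->f_ops != NULL);
	return (*fp->f_ops->fo_read)(fp);
}

static const char *sim_close(struct sim_file *fp)
{
	assert(fp != NULL && fp->f_ops != NULL);
	return (*fp->f_ops->fo_close)(fp);
}
```

The point of the design is that the type is examined once, when the file structure is set up. After that, callers such as sim_read() never need to know whether they hold a file or a socket.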
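(4) caddr_t f_data refers to the actual data behind the open file (caddr_t is defined as char *). When the open file is an ordinary file (f_type is DTYPE_VNODE), f_data points to a struct vnode; when it is a socket (f_type is DTYPE_SOCKET), f_data points to a struct socket. For now only two members of struct socket matter, short so_type and caddr_t so_pcb: so_type gives the socket type, for example SOCK_DGRAM for a datagram (UDP) socket and SOCK_STREAM for a stream (TCP) socket; so_pcb points to the socket's protocol control block, which is itself a node in a doubly linked list of protocol control blocks.
struct socket is defined in 4.4BSD-Lite\sys\sys\socketvar.h as follows: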
struct socket {
	short so_type;			/* generic type, see socket.h */
	short so_options;		/* from socket call, see socket.h */
	short so_linger;		/* time to linger while closing */
	short so_state;			/* internal state flags SS_*, below */
	caddr_t so_pcb;			/* protocol control block */
	struct protosw *so_proto;	/* protocol handle */
	/*
	 * Variables for connection queueing.
	 * Socket where accepts occur is so_head in all subsidiary sockets.
	 * If so_head is 0, socket is not related to an accept.
	 * For head socket so_q0 queues partially completed connections,
	 * while so_q is a queue of connections ready to be accepted.
	 * If a connection is aborted and it has so_head set, then
	 * it has to be pulled out of either so_q0 or so_q.
	 * We allow connections to queue up based on current queue lengths
	 * and limit on number of queued connections for this socket.
	 */
	struct socket *so_head;		/* back pointer to accept socket */
	struct socket *so_q0;		/* queue of partial connections */
	struct socket *so_q;		/* queue of incoming connections */
	short so_q0len;			/* partials on so_q0 */
	short so_qlen;			/* number of connections on so_q */
	short so_qlimit;		/* max number queued connections */
	short so_timeo;			/* connection timeout */
	u_short so_error;		/* error affecting connection */
	pid_t so_pgid;			/* pgid for signals */
	u_long so_oobmark;		/* chars to oob mark */
	/*
	 * Variables for socket buffering.
	 */
	struct sockbuf {
		u_long sb_cc;		/* actual chars in buffer */
		u_long sb_hiwat;	/* max actual char count */
		u_long sb_mbcnt;	/* chars of mbufs used */
		u_long sb_mbmax;	/* max chars of mbufs to use */
		long sb_lowat;		/* low water mark */
		struct mbuf *sb_mb;	/* the mbuf chain */
		struct selinfo sb_sel;	/* process selecting read/write */
		short sb_flags;		/* flags, see below */
		short sb_timeo;		/* timeout for read/write */
	} so_rcv, so_snd;
#define SB_MAX (256*1024)	/* default for max chars in sockbuf */
#define SB_LOCK 0x01		/* lock on data queue */
#define SB_WANT 0x02		/* someone is waiting to lock */
#define SB_WAIT 0x04		/* someone is waiting for data/space */
#define SB_SEL 0x08		/* someone is selecting */
#define SB_ASYNC 0x10		/* ASYNC I/O, need signals */
#define SB_NOTIFY (SB_WAIT|SB_SEL|SB_ASYNC)
#define SB_NOINTR 0x40		/* operations not interruptible */

	caddr_t so_tpcb;		/* Wisc. protocol control block XXX */
	void (*so_upcall) __P((struct socket *so, caddr_t arg, int waitf));
	caddr_t so_upcallarg;		/* Arg for above */
};
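Putting the pieces so far together, the path from a descriptor number to a socket is proc, then filedesc, then file, then f_data. The sketch below walks that chain in user space. All sim_ types, the SIM_SOCK_ constants and the function sim_fd_to_socket are hypothetical reductions made up for this illustration; returning EBADF and ENOTSOCK is my choice of error codes for the two failure cases, not a quote from the kernel source.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DTYPE_VNODE  1 /* file */
#define DTYPE_SOCKET 2 /* communications endpoint */

/* Illustrative socket types; the real constants live in socket.h. */
#define SIM_SOCK_STREAM 1
#define SIM_SOCK_DGRAM  2

/* Hypothetical reduced versions of the structures discussed above. */
struct sim_socket {
	short so_type;
};

struct sim_file {
	short f_type;
	char *f_data; /* caddr_t in the kernel: vnode or socket */
};

struct sim_filedesc {
	struct sim_file **fd_ofiles;
	int fd_nfiles;
};

struct sim_proc {
	struct sim_filedesc *p_fd;
};

/*
 * Walks proc -> filedesc -> file -> socket for descriptor fd.
 * Returns 0 and stores the socket in *sop on success,
 * EBADF if fd is not an open descriptor,
 * ENOTSOCK if the descriptor does not refer to a socket.
 */
static int sim_fd_to_socket(const struct sim_proc *p, int fd,
    struct sim_socket **sop)
{
	struct sim_file *fp;

	assert(p != NULL && p->p_fd != NULL && sop != NULL);
	if (fd < 0 || fd >= p->p_fd->fd_nfiles ||
	    (fp = p->p_fd->fd_ofiles[fd]) == NULL)
		return EBADF;
	if (fp->f_type != DTYPE_SOCKET)
		return ENOTSOCK;
	/* f_data is an untyped pointer; f_type tells us how to interpret it. */
	*sop = (struct sim_socket *)(void *)fp->f_data;
	return 0;
}
```

The cast in the last step is only safe because f_type was checked first, which is why the type tag and the data pointer are kept side by side in the file structure.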
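(5) The protocol control block, caddr_t so_pcb. The protocol control block is a central data structure behind the socket structure. Protocol control blocks are kept on a doubly linked list, and each one contains the following members: the previous inpcb, the next inpcb, the head node of the inpcb list, the local IP address, local port, remote IP address and remote port of the socket this inpcb belongs to, the address of that socket structure, and so on;
Every socket has a corresponding protocol control block, an inpcb, which is reached through the socket's so_pcb member. The protocol control block in turn has a member struct socket *inp_socket that points back to the socket that owns it.
In the OS, each socket type has exactly one inpcb list: the inpcbs of all TCP sockets are on one TCP inpcb doubly linked list, and the inpcbs of all UDP sockets are on one UDP inpcb doubly linked list.
When data is received through a socket, the OS takes the data from the network interface driver, searches the inpcb list for the entry whose local IP address, local port, remote IP address and remote port match, follows that inpcb to the corresponding socket, and stores the data in the socket's receive buffer.
The protocol control block struct inpcb is defined in 4.4BSD-Lite\sys\netinet\in_pcb.h as follows: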
struct inpcb {
	struct inpcb *inp_next, *inp_prev;	/* pointers to other pcb's */
	struct inpcb *inp_head;			/* pointer back to chain of inpcb's for this protocol */
	struct in_addr inp_faddr;		/* foreign host table entry */
	u_short inp_fport;			/* foreign port */
	struct in_addr inp_laddr;		/* local host table entry */
	u_short inp_lport;			/* local port */
	struct socket *inp_socket;		/* back pointer to socket */
	caddr_t inp_ppcb;			/* pointer to per-protocol pcb */
	struct route inp_route;			/* placeholder for routing entry */
	int inp_flags;				/* generic IP/datagram flags */
	struct ip inp_ip;			/* header prototype; should have more */
	struct mbuf *inp_options;		/* IP options */
	struct ip_moptions *inp_moptions;	/* IP multicast options */
};
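The receive-side search described above can be sketched as a walk over a circular doubly linked list. This is a simplified user-space model, not the kernel's lookup routine: the sim_ names are hypothetical, addresses are plain 32-bit integers instead of struct in_addr, and only an exact four-tuple match is implemented, whereas a real lookup also has to cope with wildcard addresses on listening sockets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sim_inpcb;

/* Hypothetical reduced socket: only the pointer to its control block. */
struct sim_socket {
	struct sim_inpcb *so_pcb; /* protocol control block */
};

/* Hypothetical reduced inpcb with the list links and the four-tuple. */
struct sim_inpcb {
	struct sim_inpcb *inp_next, *inp_prev; /* pointers to other pcb's */
	struct sim_inpcb *inp_head;            /* head of this protocol's list */
	uint32_t inp_faddr;                    /* foreign address */
	uint16_t inp_fport;                    /* foreign port */
	uint32_t inp_laddr;                    /* local address */
	uint16_t inp_lport;                    /* local port */
	struct sim_socket *inp_socket;         /* back pointer to socket */
};

/* Makes head an empty circular list: it points to itself in both directions. */
static void sim_pcb_list_init(struct sim_inpcb *head)
{
	assert(head != NULL);
	head->inp_next = head->inp_prev = head->inp_head = head;
	head->inp_socket = NULL;
}

/* Links inp right after head and ties inp and so to each other. */
static void sim_pcb_attach(struct sim_inpcb *head, struct sim_inpcb *inp,
    struct sim_socket *so)
{
	assert(head != NULL && inp != NULL && so != NULL);
	inp->inp_head = head;
	inp->inp_socket = so;
	so->so_pcb = inp;
	inp->inp_next = head->inp_next;
	inp->inp_prev = head;
	head->inp_next->inp_prev = inp;
	head->inp_next = inp;
}

/* Exact four-tuple match; returns the owning socket, or NULL if none matches. */
static struct sim_socket *sim_pcb_lookup(const struct sim_inpcb *head,
    uint32_t faddr, uint16_t fport, uint32_t laddr, uint16_t lport)
{
	const struct sim_inpcb *inp;

	assert(head != NULL);
	for (inp = head->inp_next; inp != head; inp = inp->inp_next) {
		if (inp->inp_faddr == faddr && inp->inp_fport == fport &&
		    inp->inp_laddr == laddr && inp->inp_lport == lport)
			return inp->inp_socket;
	}
	return NULL;
}
```

One head node per protocol, for example one for TCP and one for UDP, gives the "exactly one list per socket type" arrangement described in (5). Once sim_pcb_lookup() returns a socket, the incoming data would be appended to that socket's receive buffer (so_rcv in the real structure).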