Linux System Programming Study Notes (2): File I/O


A file descriptor can refer not only to a regular file but also to a socket, a directory, or a pipe (everything is a file).



The flags passed to the open() system call must include exactly one of the three access modes: O_RDONLY, O_WRONLY, or O_RDWR.


/* the two calls below are equivalent */
int fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0644);
int fd = creat(filename, 0644);
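A minimal sketch of how the call above might be wrapped with error checking. The helper name `create_truncated` is hypothetical and only for illustration; it is not from the original notes.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: has the same effect as creat(path, 0644).
 * Returns a write-only descriptor, or -1 with errno set by open(). */
int create_truncated(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        perror("open"); /* report why the open failed */
    return fd;
}
```

Because O_TRUNC is set, any existing content of the file is discarded, so a freshly opened descriptor always refers to an empty file.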












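(6) The call returns -1 and errno is set to EAGAIN: the read would block because no bytes are currently available. This happens only in nonblocking mode.

(7) The call returns -1 and errno is set to a value other than EINTR or EAGAIN: a more serious error has occurred.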
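The return-value cases above can be sketched as a small classifier. The names `set_nonblocking`, `try_read`, and `enum read_status` are hypothetical helpers introduced for this example. It assumes `len` is greater than zero, since a zero-length read also returns 0.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: put fd into nonblocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

enum read_status { READ_OK, READ_EOF, READ_WOULD_BLOCK, READ_ERROR };

/* Hypothetical helper: issue one read() and classify the outcome.
 * EINTR is retried; *nread receives the raw return value of read(). */
enum read_status try_read(int fd, void *buf, size_t len, ssize_t *nread)
{
    ssize_t ret;

    do {
        ret = read(fd, buf, len);
    } while (ret == -1 && errno == EINTR);

    *nread = ret;
    if (ret > 0)
        return READ_OK;          /* some bytes were read */
    if (ret == 0)
        return READ_EOF;         /* end of file (len assumed > 0) */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return READ_WOULD_BLOCK; /* nonblocking mode, nothing to read yet */
    return READ_ERROR;           /* anything else is a real error */
}
```

On an empty pipe whose read end is nonblocking, `try_read()` reports `READ_WOULD_BLOCK`; once the write end is closed it reports `READ_EOF`.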


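size_t readn(int fd, void* buf, size_t len)
{
    size_t tmp = len;
    ssize_t ret = 0;
    char* p = buf; /* byte pointer: arithmetic on void* is not standard C */
    while (len != 0 && (ret = read(fd, p, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR) {
                continue; /* interrupted before any data was read: retry */
            }
            fprintf(stderr, "read error\n");
            break;
        }
        len -= ret;
        p += ret;
    }
    return tmp - len; /* number of bytes actually read */
}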






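size_t write_all(int fd, const void* buf, size_t len)
{
    ssize_t ret = 0;
    size_t tmp = len;
    const char* p = buf; /* byte pointer: arithmetic on void* is not standard C */
    while (len != 0 && (ret = write(fd, p, len)) != 0) {
        if (ret == -1) {
            if (errno == EINTR) {
                continue; /* interrupted before any data was written: retry */
            }
            fprintf(stderr, "write error\n");
            break;
        }
        len -= ret;
        p += ret;
    }
    return tmp - len; /* number of bytes actually written */
}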



buffer; afterwards the kernel gathers all of these dirty buffers (buffers that contain data newer than what is on


If a read is issued for just-written data that lives in a dirty buffer and
is not yet on disk, the request will be satisfied from the buffer and will not cause
a read of the "stale" data on disk. The read is thus satisfied from an in-memory
cache without having to go to disk.
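A minimal sketch of the behaviour described above: data is written and then read back at once through a second descriptor, with no fsync() in between. The helper name `write_then_read_back` is hypothetical, and for brevity a partial write is simply treated as a failure.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write msg to path, then read it back through a
 * separate descriptor. Returns the number of bytes read, or -1 on error. */
ssize_t write_then_read_back(const char *path, const char *msg,
                             char *out, size_t outlen)
{
    size_t len = strlen(msg);
    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (wfd == -1)
        return -1;

    if (write(wfd, msg, len) != (ssize_t)len) { /* simplification: no retry */
        close(wfd);
        return -1;
    }

    /* No fsync() here: the data may still be only in a dirty buffer,
     * yet the read below must already observe it. */
    int rfd = open(path, O_RDONLY);
    if (rfd == -1) {
        close(wfd);
        return -1;
    }
    ssize_t n = read(rfd, out, outlen);
    close(rfd);
    close(wfd);
    return n;
}
```

The program cannot tell whether the bytes came from memory or from disk; the point is only that the read sees the new data even though nothing forced it to disk.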



int ret = fsync(fd);

Passing the O_SYNC flag to open() indicates that all I/O on the file must be synchronized.

int fd = open(file, O_WRONLY | O_SYNC);

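O_SYNC makes every write wait for the I/O to complete, which is very costly. In general, when we need to ensure that a file has been written back to disk, we call fsync() instead.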
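As a sketch of the fsync() approach, the hypothetical helper `write_durably` below writes a whole buffer with ordinary buffered writes and pays the synchronization cost only once, at the end. The name and error convention are assumptions for this example.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes, then flush them to disk.
 * Returns 0 on success, -1 on error with errno set. */
int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue; /* interrupted before writing anything: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    /* One fsync() for the whole buffer, instead of O_SYNC on every write. */
    return fsync(fd);
}
```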



a. Set the file offset to 1825

off_t ret = lseek(fd, (off_t)1825, SEEK_SET);

b. Set the file offset to the end of the file

off_t ret = lseek(fd, 0, SEEK_END);

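c. Query the current file offset (a zero offset from SEEK_CUR does not move it; to rewind to the beginning of the file, use lseek(fd, 0, SEEK_SET))

off_t ret = lseek(fd, 0, SEEK_CUR);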
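Since lseek() returns the resulting offset, the calls above can be combined. The following sketch uses a hypothetical helper `file_size` that finds a file's length and then restores the original position.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the size of the file behind fd without
 * disturbing its current offset. Returns -1 on error. */
off_t file_size(int fd)
{
    off_t cur = lseek(fd, 0, SEEK_CUR); /* remember where we are */
    if (cur == (off_t)-1)
        return -1;

    off_t end = lseek(fd, 0, SEEK_END); /* offset of the end == file size */
    if (end == (off_t)-1)
        return -1;

    if (lseek(fd, cur, SEEK_SET) == (off_t)-1) /* go back */
        return -1;
    return end;
}
```

This works only on seekable descriptors; on a pipe or socket lseek() fails with ESPIPE.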


This implies that the total size of all files on a filesystem can add up to
more than the physical size of the disk
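Assuming this sentence refers to sparse files (holes left by seeking past the end of a file before writing), a minimal sketch of creating one follows. The helper name `make_sparse` is hypothetical. Whether the hole really occupies no disk blocks depends on the filesystem, so the code only reports `st_size` and `st_blocks` and leaves the comparison to the caller.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a file with a hole of `hole` bytes followed
 * by a single data byte, and return its metadata in *st.
 * Returns 0 on success, -1 on error. */
int make_sparse(const char *path, off_t hole, struct stat *st)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* Seek past the end, then write: the skipped range becomes a hole. */
    if (lseek(fd, hole, SEEK_SET) == (off_t)-1 ||
        write(fd, "x", 1) != 1 ||
        fstat(fd, st) == -1) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

After the call, `st->st_size` is the logical size (`hole + 1`), while `st->st_blocks` counts the 512-byte units actually allocated; on filesystems that support holes the latter is typically much smaller.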


int ftruncate(int fd, off_t len);  
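A short sketch of ftruncate() in both directions, using a hypothetical helper `resize_file` that reports the resulting size. Shrinking discards the tail of the file; growing extends it with bytes that read back as zero.

```c
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: set the file length to len and return the size
 * reported by fstat() afterwards, or -1 on error. */
off_t resize_file(int fd, off_t len)
{
    struct stat st;

    if (ftruncate(fd, len) == -1) /* fd must be open for writing */
        return -1;
    if (fstat(fd, &st) == -1)
        return -1;
    return st.st_size;
}
```

ftruncate() does not move the file offset, so after shrinking a file the offset may lie beyond the new end.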





Multiplexed I/O becomes the pivot point for the application, designed similarly to the following activity:
a. Multiplexed I/O: tell me when any of these file descriptors becomes ready for I/O.
b. Nothing ready? Sleep until one or more file descriptors are ready.
c. Woken up! What is ready?
d. Handle all file descriptors ready for I/O, without blocking.
e. Go back to step a.
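Steps a through d can be sketched as one pass over a set of descriptors; the caller supplies step e by calling the function in a loop. This is an illustration built on select(), and the helper name `multiplex_once` and its return convention are assumptions. All descriptors are assumed to be smaller than FD_SETSIZE.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: one pass of steps a-d over `count` descriptors.
 * Returns the number of ready descriptors, 0 on timeout, -1 on error.
 * If total is not NULL and something was ready, *total receives the
 * number of bytes read in this pass. */
int multiplex_once(const int *fds, int count, int timeout_sec, size_t *total)
{
    fd_set readfds;
    int maxfd = -1;

    /* a. register every descriptor we care about */
    FD_ZERO(&readfds);
    for (int i = 0; i < count; i++) {
        FD_SET(fds[i], &readfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    /* b. sleep until something is ready or the timeout expires */
    int ready = select(maxfd + 1, &readfds, NULL, NULL, &tv);
    if (ready <= 0)
        return ready;

    /* c + d. find out what is ready and handle it; a single read() on a
     * descriptor that select() reported as readable does not block */
    size_t sum = 0;
    for (int i = 0; i < count; i++) {
        if (!FD_ISSET(fds[i], &readfds))
            continue;
        char buf[256];
        ssize_t n = read(fds[i], buf, sizeof buf);
        if (n == -1)
            return -1;
        sum += (size_t)n;
    }
    if (total)
        *total = sum;
    return ready;
}
```

A real application would dispatch each ready descriptor to its own handler rather than just counting bytes.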

int select(int nfds, fd_set* readfds, fd_set* writefds, fd_set* exceptfds, struct timeval* timeout);
FD_CLR(int fd, fd_set* set); // removes a fd from a given set
FD_ISSET(int fd, fd_set* set); // tests whether a fd is part of a given set
FD_SET(int fd, fd_set* set); // adds a fd to a given set
FD_ZERO(fd_set* set); // removes all fds from the specified set; should be called before every invocation of select()

Because fd_set is statically sized, there is a limit, FD_SETSIZE, on the file descriptors it can hold; on Linux this value is 1024.
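One practical consequence is that a descriptor numbered FD_SETSIZE or higher must never be passed to FD_SET(). A defensive wrapper might look like the sketch below; the name `safe_fd_set` is hypothetical.

```c
#include <sys/select.h>

/* Hypothetical helper: add fd to set only if it fits into an fd_set.
 * Returns 0 on success, -1 if fd is out of range. */
int safe_fd_set(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1; /* FD_SET() on such a value would be undefined behaviour */
    FD_SET(fd, set);
    return 0;
}
```

Programs that may hold more descriptors than this limit are usually better served by poll(), which has no such fixed-size set.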

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define TIMEOUT 5 /* select timeout in seconds */
#define BUFLEN 1024 /* read buffer in bytes */

int main(int argc, char* argv[])
{
    struct timeval tv;
    tv.tv_sec = TIMEOUT;
    tv.tv_usec = 0;

    /* wait on stdin for input */
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(STDIN_FILENO, &readfds);

    int ret = select(STDIN_FILENO + 1, &readfds, NULL, NULL, &tv);
    if (ret == -1) {
        fprintf(stderr, "select error\n");
        return 1;
    } else if (!ret) {
        fprintf(stderr, "%d seconds elapsed.\n", TIMEOUT);
        return 0;
    }

    if (FD_ISSET(STDIN_FILENO, &readfds)) {
        char buf[BUFLEN + 1];
        ssize_t len = read(STDIN_FILENO, buf, BUFLEN);
        if (len == -1) {
            fprintf(stderr, "read error\n");
            return 1;
        }
        if (len != 0) {
            buf[len] = '\0'; /* terminate after the bytes actually read */
            fprintf(stdout, "read:%s\n", buf);
        }
        return 0;
    } else {
        fprintf(stderr, "This should not happen\n");
        return 1;
    }
}


10. poll

int poll(struct pollfd* fds, nfds_t  nfds, int timeout);

This is a program that uses poll() to check whether a read from stdin
and a write to stdout will block

#include <stdio.h>
#include <unistd.h>
#include <poll.h>

#define TIMEOUT 5 /* poll timeout in seconds */

int main(int argc, char* argv[])
{
    struct pollfd fds[2];

    /* watch stdin for input */
    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;

    /* watch stdout for ability to write */
    fds[1].fd = STDOUT_FILENO;
    fds[1].events = POLLOUT;

    int ret = poll(fds, 2, TIMEOUT * 1000);
    if (ret == -1) {
        fprintf(stderr, "poll error\n");
        return 1;
    }

    if (!ret) {
        fprintf(stdout, "%d seconds elapsed.\n", TIMEOUT);
        return 0;
    }

    if (fds[0].revents & POLLIN) {
        fprintf(stdout, "stdin is readable\n");
    }
    if (fds[1].revents & POLLOUT) {
        fprintf(stdout, "stdout is writable\n");
    }
    return 0;
}

poll vs select

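a. poll() does not require the caller to compute and pass the value of the highest-numbered file descriptor plus one (select() requires this as its nfds argument).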


c. select() is more portable: more Unix systems support select() than poll().

d. select() supports a finer timeout resolution (microseconds, via struct timeval); poll() supports only milliseconds.
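To illustrate the difference in timeout units, here is a sketch of converting between the two representations. Both helper names are hypothetical, and inputs are assumed to be non-negative.

```c
#include <limits.h>
#include <sys/time.h>

/* Hypothetical helper: express a millisecond timeout (as used by poll())
 * as the struct timeval that select() expects. */
struct timeval ms_to_timeval(long ms)
{
    struct timeval tv;
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return tv;
}

/* Hypothetical helper: express a select()-style timeout in milliseconds
 * for poll(). Rounds up, so a short non-zero timeout does not turn into
 * 0, which poll() treats as "return immediately". */
int timeval_to_poll_ms(struct timeval tv)
{
    long long ms = (long long)tv.tv_sec * 1000 + (tv.tv_usec + 999) / 1000;
    return ms > INT_MAX ? INT_MAX : (int)ms;
}
```

The second conversion is lossy: any sub-millisecond precision that select() could express is rounded away.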


Inside the Linux kernel, I/O is implemented mainly by three subsystems: the virtual filesystem, the page cache, and page write-back.

(1) virtual filesystem

The virtual filesystem (also called a virtual file switch) is a mechanism
of abstraction that allows the Linux kernel to call filesystem functions and
manipulate filesystem data without knowing the specific type of filesystem being

So a single system call can read any filesystem on any medium. All
filesystems support the same concepts, the same interfaces, and the same

(2) page cache

The page cache is an in-memory store of recently accessed data from an
on-disk filesystem.

Storing requested data in memory allows the kernel to fulfill subsequent
requests for the same data  from memory, avoiding repeated disk

The page cache exploits the concept of temporal locality, which says that a
resource accessed at one point has a high probability of being accessed again in
the near future


The page cache is the first place that the kernel looks for filesystem data.
The first time any item of data is read, it is transferred from the disk into
the page cache, and is returned to the application from the cache.


The data is often referenced sequentially. The kernel implements page cache
readahead. Readahead is the act of reading extra data off the disk and
into the page cache following each read request. In effect, reading a little bit

(3) page write-back

When a process issues a write request, the data is copied into a buffer,
and the buffer is marked dirty, denoting that the in-memory copy is newer than
the on-disk copy.

Eventually, the dirty buffers need to be committed to disk, synchronizing the
on-disk files with the data in memory.

