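Performance Tuning Guide
Preface
sysctl network tuning is a pain. If you don't know what a setting does, don't touch it.
Tuning and calculating sysctl/sysfs parameters
Network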
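1 Setting the maximum receive/send socket buffer size
net.core.rmem_max and net.core.wmem_max should be sized from the bandwidth-delay product (BDP).
The BDP of a network path is calculated as follows:
BDP = Bandwidth/8*RTT
For a server with 1 Gbit/s of bandwidth, an RTT of 10 ms, and tcp_adv_win_scale=2, the BDP is 1G/8*0.01 = 1.25MB, so the maximum receive buffer should be 4/3*1.25MB ≈ 1.67MB. The full formula is derived in the "TCP performance calculation" section.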
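As a rough check of the BDP arithmetic, here is a minimal Python sketch. The function names are illustrative, and it assumes tcp_adv_win_scale > 0, so that a fraction 1/2^scale of the buffer is overhead, with a 1 Gbit/s link and a 10 ms RTT.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth-delay product in bytes: Bandwidth/8*RTT."""
    return bandwidth_bits_per_s / 8 * rtt_s


def rmem_for_bdp(bdp: float, tcp_adv_win_scale: int = 2) -> float:
    """Receive buffer size whose usable window equals the BDP.

    Assumes tcp_adv_win_scale > 0: 1/2^scale of the buffer is overhead,
    so buffer = BDP * 2^scale / (2^scale - 1).
    """
    if tcp_adv_win_scale <= 0:
        raise ValueError("this sketch only covers tcp_adv_win_scale > 0")
    factor = 2 ** tcp_adv_win_scale
    return bdp * factor / (factor - 1)


bdp = bdp_bytes(1e9, 0.010)  # assumed: 1 Gbit/s link, 10 ms RTT
print(f"BDP  = {bdp / 1e6:.2f} MB")
print(f"rmem = {rmem_for_bdp(bdp) / 1e6:.2f} MB")
```

With tcp_adv_win_scale=2 the factor is 4/3, which is where the 4/3 multiplier in the text comes from.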
Common settings:
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
net.ipv4.tcp_mem = 8388608 12582912 16777216
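tcp_mem is expressed in pages, not bytes; a page is usually 4 KB or 8 KB. Unless you know exactly what you are doing, leave it alone.
tcp_mem can be calculated with the following formula: default_window*connection_count/page_size
The first value is the level below which TCP is under no memory pressure, the second is the level at which TCP enters memory pressure, and the third is the maximum. In general, once TCP memory use exceeds the second value, new TCP windows stop growing. Once it exceeds the third value, new TCP connections are rejected.
For example, on a production Nginx server the default value of tcp_mem should be set to:
nginx_active_connection*default_window/page_size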
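The tcp_mem formula can be sketched in Python as follows. The connection count (100,000) is a made-up figure for illustration, the default window is taken from the tcp_rmem default of 87380, and a 4 KB page is assumed. Rounding up to whole pages is a choice made here, not something the formula specifies.

```python
def tcp_mem_pages(default_window_bytes: int, connections: int,
                  page_size: int = 4096) -> int:
    """default_window*connection_count/page_size, rounded up to whole pages."""
    total_bytes = default_window_bytes * connections
    return -(-total_bytes // page_size)  # ceiling division


# Hypothetical Nginx server with 100,000 active connections.
pages = tcp_mem_pages(87380, 100_000)
print(f"tcp_mem (pages) = {pages}")
```

On a live system the page size can be read with `os.sysconf("SC_PAGE_SIZE")` rather than assumed.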
2 nf_conntrack settings
The relationship between net.ipv4.netfilter.ip_conntrack_max and /sys/module/nf_conntrack/parameters/hashsize:
ARCH = [32|64]
HASHSIZE = CONNTRACK_MAX / 8 = RAMSIZE (in bytes) / 131072 / (ARCH / 32)
For example, on a 64-bit server with 32 GB of RAM, conntrack_max = 1048576 and hashsize = 131072.
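A small Python sketch of the sizing rule, assuming integer division and that 32 GB means 32 * 2^30 bytes. The function name is illustrative.

```python
from typing import Tuple


def conntrack_sizes(ram_bytes: int, arch_bits: int = 64) -> Tuple[int, int]:
    """Return (conntrack_max, hashsize) from RAM size and architecture.

    HASHSIZE = RAMSIZE / 131072 / (ARCH / 32), CONNTRACK_MAX = HASHSIZE * 8.
    """
    if arch_bits not in (32, 64):
        raise ValueError("ARCH must be 32 or 64")
    hashsize = ram_bytes // 131072 // (arch_bits // 32)
    return hashsize * 8, hashsize


conntrack_max, hashsize = conntrack_sizes(32 * 2**30, arch_bits=64)
print(f"conntrack_max = {conntrack_max}, hashsize = {hashsize}")
```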
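3 TCP congestion control
net.ipv4.tcp_congestion_control defaults to cubic.
You can switch to htcp if your workload calls for it.
See the TCP congestion control algorithms for details.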
4 TCP port range
net.ipv4.ip_local_port_range
This mostly matters on the sending side of TCP, that is, the host that opens outbound connections.
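5 TCP port reuse
net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse
Leave these unset unless you need a temporary workaround, and do not enable both at the same time. (tcp_tw_recycle was removed in Linux 4.12.)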
6 TCP timeouts
`net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent2 = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_thin_linear_timeouts = 0
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300`
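The defaults are sufficient; don't adjust them without a reason. If you find you need to, it usually means the application or the network has a problem.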
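7 TCP processing queues
net.core.netdev_max_backlog
There is no formula for this parameter. As a rule, the larger the RTT, the larger this value should be. For example:
10G NIC, RTT=100ms: net.core.netdev_max_backlog = 30000
10G NIC with RTT=200ms, or 40G NIC with RTT=50ms: net.core.netdev_max_backlog = 250000
Only adjust tcp_max_syn_backlog when you see "TCP: drop open request" in the kernel log.
net.ipv4.tcp_max_orphans
This parameter limits the number of orphaned sockets, that is, sockets no longer attached to any file descriptor. It can be raised; do not lower it. Keep one thing in mind: each orphan can consume up to 64 KB of unswappable memory.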
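To see what raising tcp_max_orphans can cost, here is a back-of-the-envelope Python sketch using the 64 KB per-orphan upper bound stated above. The orphan limit of 65536 is an arbitrary example value, not a recommendation.

```python
def orphan_memory_worst_case(max_orphans: int,
                             per_orphan_bytes: int = 64 * 1024) -> int:
    """Upper bound on unswappable memory held by orphaned sockets, in bytes."""
    return max_orphans * per_orphan_bytes


worst = orphan_memory_worst_case(65536)  # hypothetical tcp_max_orphans value
print(f"worst case: {worst / 2**30:.1f} GiB of unswappable memory")
```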
8 TCP options
Some people recommend setting net.ipv4.tcp_timestamps and net.ipv4.tcp_sack to 0 to reduce CPU overhead, but in real environments the defaults are more useful.
TCP performance calculation
For example:
net.ipv4.tcp_rmem = 4096 87380 2067968
The receive window (tcp_rmem) defaults to 87380.
Buffer overhead formula:
if tcp_adv_win_scale > 0 {
buf = window/2^tcp_adv_win_scale
} else {
buf = window - window/2^(-tcp_adv_win_scale)
}
The part of tcp_rmem actually available for network transfer is window - buf.
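The overhead rule can be written as a runnable Python sketch. Using integer shifts for the division by a power of two is an assumption made here, and the function names are illustrative.

```python
def buffer_overhead(window: int, tcp_adv_win_scale: int) -> int:
    """Bytes of the buffer reserved as overhead (not usable as window)."""
    if tcp_adv_win_scale > 0:
        return window >> tcp_adv_win_scale
    return window - (window >> -tcp_adv_win_scale)


def usable_window(window: int, tcp_adv_win_scale: int) -> int:
    """window - buf: the part of the buffer available for data in flight."""
    return window - buffer_overhead(window, tcp_adv_win_scale)


print(usable_window(87380, 2))   # default tcp_rmem, scale 2
print(usable_window(87380, -1))  # negative scale: only window/2 is usable
```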
The TCP throughput formula is:
if tcp_adv_win_scale > 0 {
speed = (window - window/2^tcp_adv_win_scale)/RTT
} else {
speed = (window/2^-tcp_adv_win_scale)/RTT
}
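With RTT=150ms and tcp_adv_win_scale set to 2, the default tcp_rmem value gives a maximum throughput of
(87380 - (87380 / 2^2))/0.150 = 436900 bytes/s
Throughput cannot exceed the link rate: speed <= Bandwidth/8, or equivalently window - buf <= BDP (Bandwidth/8*RTT).
So, working backwards from the BDP, the buffer size is given by:
(window - (window/2^tcp_adv_win_scale))/RTT = Bandwidth/8
window = Bandwidth/8*RTT*(2^tcp_adv_win_scale)/(2^tcp_adv_win_scale - 1)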
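A Python sketch of the throughput formula and its inverse. It assumes the bound is the link rate in bytes per second, that is, it solves (window - window/2^scale)/RTT = Bandwidth/8 for window. The inverse only covers tcp_adv_win_scale > 0, and the function names are illustrative.

```python
def max_throughput(window: float, rtt_s: float, tcp_adv_win_scale: int) -> float:
    """Upper bound on throughput in bytes/s for a given buffer size and RTT."""
    if tcp_adv_win_scale > 0:
        usable = window - window / 2 ** tcp_adv_win_scale
    else:
        usable = window / 2 ** (-tcp_adv_win_scale)
    return usable / rtt_s


def window_for_bandwidth(bandwidth_bits_per_s: float, rtt_s: float,
                         tcp_adv_win_scale: int) -> float:
    """Buffer size needed to fill the link, for tcp_adv_win_scale > 0."""
    if tcp_adv_win_scale <= 0:
        raise ValueError("this sketch only covers tcp_adv_win_scale > 0")
    factor = 2 ** tcp_adv_win_scale
    return bandwidth_bits_per_s / 8 * rtt_s * factor / (factor - 1)


print(f"{max_throughput(87380, 0.150, 2):.0f} bytes/s")
print(f"{window_for_bandwidth(1e9, 0.010, 2):.0f} bytes")
```

Feeding the result of `window_for_bandwidth` back into `max_throughput` returns the link rate, which is a quick way to check that the two formulas agree.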
In practice, the initial window is not taken from tcp_rmem; it depends on the MSS.
The role of the TCP initial window
int init_cwnd = 4;
if (mss > 1460*3)
init_cwnd = 2;
else if (mss > 1460)
init_cwnd = 3;
if (*rcv_wnd > init_cwnd*mss)
*rcv_wnd = init_cwnd*mss;
In kernel 3.x and later, the initial window was raised to 10 MSS.
Many tuning articles recommend changing the default window with the ip route command.
# Send window (initial congestion window)
ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 10
# Receive window (initial receive window)
ip route change default via 192.168.1.1 dev eth0 proto static initrwnd 10
A larger window means a larger TCP buffer, which in turn means more packets are sent each time the buffer is flushed.
NIC ring buffers
ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 2040
RX Mini: 0
RX Jumbo: 8160
TX: 255
Current hardware settings:
RX: 255
RX Mini: 0
RX Jumbo: 0
TX: 255
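You can resize the ring buffer as needed, but don't change it without a reason. A large ring buffer adds network latency.
The ring buffer holds descriptors that point to SKBs (socket kernel buffers). For example, suppose the link speed is 5 Mbit/s, the NIC MTU is 1500, and each SKB is 1500 bytes (12000 bits). The ring buffer can be treated as a FIFO queue, so a packet queued behind 254 others waits (254*12000)/5000000 = 0.6096s.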
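The queueing delay of a full ring buffer can be estimated with a short Python sketch. It assumes every queued packet is MTU-sized and the link drains at a constant rate. The function name is illustrative.

```python
def ring_buffer_delay_s(packets_ahead: int, packet_bytes: int,
                        link_bits_per_s: float) -> float:
    """Time a packet waits behind `packets_ahead` packets in a FIFO ring."""
    return packets_ahead * packet_bytes * 8 / link_bits_per_s


# 254 packets of 1500 bytes ahead of us on a 5 Mbit/s link.
print(f"{ring_buffer_delay_s(254, 1500, 5e6):.4f} s")
```

The same function shows why the delay matters less on fast links: at 10 Gbit/s the identical queue drains in about 0.3 ms.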
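When resizing the ring buffer, you have to weigh network latency against bandwidth yourself. A large ring buffer reduces packet loss, but the latency it adds can be substantial.
In most cases the factory defaults are a good fit.
The driver queue is simply the NIC's ring buffer.
Queuing disciplines are the qdiscs in tc. Different qdiscs serve different purposes (http://www.tldp.org/HOWTO/Traffic-Control-HOWTO/classless-qdiscs.html). The length of this queue is set by the interface's txqueuelen parameter.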
$ ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
/sbin/ifconfig ethN txqueuelen 10000
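This changes the length of the interface's transmit queue; with this setting the queue can hold up to 10000 packets.
/sbin/ifconfig ethN txqueuelen 10000 is a tuning value often quoted for 10G NICs; it is a rule of thumb taken from online sources. In practice, only change it if you see many overruns or drops.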
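One way to check for overruns or drops before touching txqueuelen is to read the counters in /proc/net/dev. The Python sketch below parses a sample in that file's format. The counter values are invented, and treating the "fifo" column as the counterpart of ifconfig's "overruns" is an assumption.

```python
RX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo",
             "frame", "compressed", "multicast")
TX_FIELDS = ("bytes", "packets", "errs", "drop", "fifo",
             "colls", "carrier", "compressed")

# Invented sample in /proc/net/dev layout.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo: 1000 10 0 0 0 0 0 0 1000 10 0 0 0 0 0 0
  eth0: 123456 789 0 12 3 0 0 0 654321 987 0 4 5 0 0 0
"""


def parse_net_dev(text):
    """Map interface name -> {"rx": {...}, "tx": {...}} counter dicts."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:  # skip the two header lines
            continue
        name, data = line.split(":", 1)
        values = [int(v) for v in data.split()]
        stats[name.strip()] = {
            "rx": dict(zip(RX_FIELDS, values[:8])),
            "tx": dict(zip(TX_FIELDS, values[8:16])),
        }
    return stats


def interfaces_with_drops(stats):
    """Names of interfaces with any non-zero drop or fifo counter."""
    return [name for name, s in stats.items()
            if any(s[d][k] > 0 for d in ("rx", "tx") for k in ("drop", "fifo"))]


print(interfaces_with_drops(parse_net_dev(SAMPLE)))
```

On a real host, pass the contents of /proc/net/dev instead of `SAMPLE`, and compare two readings taken a few seconds apart, since the counters are cumulative.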
Linux Performance Checklist
From: http://www.brendangregg.com/USEmethod/use-linux.html
Physical Resources
component|type|metric
---------|----|------
CPU|utilization|system-wide: vmstat 1, "us" + "sy" + "st"; sar -u, sum fields except "%idle" and "%iowait"; dstat -c, sum fields except "idl" and "wai"; per-cpu: mpstat -P ALL 1, sum fields except "%idle" and "%iowait"; sar -P ALL, same as mpstat; per-process: top, "%CPU"; htop, "CPU%"; ps -o pcpu; pidstat 1, "%CPU"; per-kernel-thread: top/htop ("K" to toggle), where VIRT == 0 (heuristic). [1]
CPU|saturation|system-wide: vmstat 1, "r" > CPU count [2]; sar -q, "runq-sz" > CPU count; dstat -p, "run" > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows "Average" and "Maximum" delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp "queued(us)" [3]
CPU|errors|perf (LPE) if processor specific error events (CPC) are available; eg, AMD64's "04Ah Single-bit ECC Errors Recorded by Scrubber" [4]
Memory capacity|utilization|system-wide: free -m, "Mem:" (main memory), "Swap:" (virtual memory); vmstat 1, "free" (main memory), "swap" (virtual memory); sar -r, "%memused"; dstat -m, "free"; slabtop -s c for kmem slab usage; per-process: top/htop, "RES" (resident main memory), "VIRT" (virtual memory), "Mem" for system-wide summary
Memory capacity|saturation|system-wide: vmstat 1, "si"/"so" (swapping); sar -B, "pgscank" + "pgscand" (scanning); sar -W; per-process: 10th field (min_flt) from /proc/PID/stat for minor-fault rate, or dynamic tracing [5]; OOM killer: dmesg \| grep killed
Memory capacity|errors|dmesg for physical failures; dynamic tracing, eg, SystemTap uprobes for failed malloc()s
Network Interfaces|utilization|sar -n DEV 1, "rxKB/s"/max "txKB/s"/max; ip -s link, RX/TX tput / max bandwidth; /proc/net/dev, "bytes" RX/TX tput/max; nicstat "%Util" [6]
Network Interfaces|saturation|ifconfig, "overruns", "dropped"; netstat -s, "segments retransmited"; sar -n EDEV, drop and fifo metrics; /proc/net/dev, RX/TX "drop"; nicstat "Sat" [6]; dynamic tracing for other TCP/IP stack queueing [7]
Network Interfaces|errors|ifconfig, "errors", "dropped"; netstat -i, "RX-ERR"/"TX-ERR"; ip -s link, "errors"; sar -n EDEV, "rxerr/s" "txerr/s"; /proc/net/dev, "errs", "drop"; extra counters may be under /sys/class/net/...; dynamic tracing of driver function returns [7]
Storage device I/O|utilization|system-wide: iostat -xz 1, "%util"; sar -d, "%util"; per-process: iotop; pidstat -d; /proc/PID/sched "se.statistics.iowait_sum"
Storage device I/O|saturation|iostat -xnz 1, "avgqu-sz" > 1, or high "await"; sar -d same; LPE block probes for queue length/latency; dynamic/static tracing of I/O subsystem (incl. LPE block probes)
Storage device I/O|errors|/sys/devices/.../ioerr_cnt; smartctl; dynamic/static tracing of I/O subsystem response codes [8]
Storage capacity|utilization|swap: swapon -s; free; /proc/meminfo "SwapFree"/"SwapTotal"; file systems: "df -h"
Storage capacity|saturation|not sure this one makes sense - once it's full, ENOSPC
Storage capacity|errors|strace for ENOSPC; dynamic tracing for ENOSPC; /var/log/messages errs, depending on FS
Storage controller|utilization|iostat -xz 1, sum devices and compare to known IOPS/tput limits per-card
Storage controller|saturation|see storage device saturation, ...
Storage controller|errors|see storage device errors, ...
Network controller|utilization|infer from ip -s link (or /proc/net/dev) and known controller max tput for its interfaces
Network controller|saturation|see network interface saturation, ...
Network controller|errors|see network interface errors, ...
CPU interconnect|utilization|LPE (CPC) for CPU interconnect ports, tput / max
CPU interconnect|saturation|LPE (CPC) for stall cycles
CPU interconnect|errors|LPE (CPC) for whatever is available
Memory interconnect|utilization|LPE (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters
Memory interconnect|saturation|LPE (CPC) for stall cycles
Memory interconnect|errors|LPE (CPC) for whatever is available
I/O interconnect|utilization|LPE (CPC) for tput / max if available; inference via known tput from iostat/ip/...
I/O interconnect|saturation|LPE (CPC) for stall cycles
I/O interconnect|errors|LPE (CPC) for whatever is available
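As an illustration of the first two CPU rows, the Python sketch below applies the checklist's rules (utilization = "us" + "sy" + "st", saturation when "r" exceeds the CPU count) to one line of vmstat output. The sample numbers are invented and the function names are illustrative.

```python
# Invented sample: header lines plus one data line, as printed by `vmstat 1`.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 9  0      0 812344  10240 402112    0    0     1     2  300  500 70 20  8  2  0
"""


def parse_vmstat(text):
    """Map vmstat column names to the integer values of the last data line."""
    lines = text.strip().splitlines()
    header = lines[1].split()
    values = [int(v) for v in lines[-1].split()]
    return dict(zip(header, values))


def cpu_check(sample, cpu_count):
    """Return (utilization percent, saturated?) per the USE checklist."""
    utilization = sample["us"] + sample["sy"] + sample["st"]
    saturated = sample["r"] > cpu_count
    return utilization, saturated


utilization, saturated = cpu_check(parse_vmstat(SAMPLE), cpu_count=8)
print(f"CPU utilization {utilization}%, saturated: {saturated}")
```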
refer
[1] There can be some oddities with the %CPU from top/htop in virtualized environments; I'll update with details later when I can.
CPU utilization: a single hot CPU can be caused by a single hot thread, or mapped hardware interrupt. Relief of the bottleneck usually involves tuning to use more CPUs in parallel.
uptime "load average" (or /proc/loadavg) wasn't included for CPU metrics since Linux load averages include tasks in the uninterruptible state (usually I/O).
[2] The man page for vmstat describes "r" as "The number of processes waiting for run time", which is either incorrect or misleading (on recent Linux distributions it's reporting those threads that are waiting, and threads that are running on-CPU; it's just the wait threads in other OSes).
[3] There may be a way to measure per-process scheduling latency with perf's sched:sched_process_wait event, otherwise perf probe to dynamically trace the scheduler functions, although, the overhead under high load to gather and post-process many (100s of) thousands of events per second may make this prohibitive. SystemTap can aggregate per-thread latency in-kernel to reduce overhead, although, last I tried schedtimes.stp (on FC16) it produced thousands of "unknown transition:" warnings.
LPE == Linux Performance Events, aka perf_events. This is a powerful observability toolkit that reads CPC and can also use static and dynamic tracing. Its interface is the perf command.
CPC == CPU Performance Counters (aka "Performance Instrumentation Counters" (PICs) or "Performance Monitoring Events" (PMUs) or "Hardware Events"), read via programmable registers on each CPU by perf (which it was originally designed to do). These have traditionally been hard to work with due to differences between CPUs. LPE perf makes life easier by providing aliases for commonly used counters. Be aware that there are usually many more made available by the processor, accessible by providing their hex values to perf stat -e. Expect to spend some quality time (days) with the processor vendor manuals when trying to use these. (My short video about CPC may be useful, despite not being on Linux).
[4] There aren't many error-related events in the recent Intel and AMD processor manuals; be aware that the public manuals may not show a complete list of events.
[5] The goal is a measure of memory capacity saturation - the degree to which a process is driving the system beyond its ability (and causing paging/swapping). High fault latency works well, but there isn't a standard LPE probe or existing SystemTap example of this (roll your own using dynamic tracing). Another metric that may serve a similar goal is minor-fault rate by process, which could be watched from /proc/PID/stat. This should be available in htop as MINFLT.
[6] Tim Cook ported nicstat to Linux; it can be found on sourceforge or his blog.
[7] Dropped packets are included as both saturation and error indicators, since they can occur due to both types of events.
[8] This includes tracing functions from different layers of the I/O subsystem: block device, SCSI, SATA, IDE, ... Some static probes are available (LPE "scsi" and "block" tracepoint events), else use dynamic tracing.
CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).
I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).
Dynamic Tracing: Allows custom metrics to be developed, live in production. Options on Linux include: LPE's "perf probe", which has some basic functionality (function entry and variable tracing), although in a trace-n-dump style that can cost performance; SystemTap (in my experience, almost unusable on CentOS/Ubuntu, but much more stable on Fedora); DTrace-for-Linux, either the Paul Fox port (which I've tried) or the OEL port (which Adam has tried), both projects very much in beta.
Software Resources
component|type|metric
---------|----|------
Kernel mutex|utilization|With CONFIG_LOCK_STAT=y, /proc/lock_stat "holdtime-total" / "acquisitions" (also see "holdtime-min", "holdtime-max") [8]; dynamic tracing of lock functions or instructions (maybe)
Kernel mutex|saturation|With CONFIG_LOCK_STAT=y, /proc/lock_stat "waittime-total" / "contentions" (also see "waittime-min", "waittime-max"); dynamic tracing of lock functions or instructions (maybe); spinning shows up with profiling (perf record -a -g -F 997 ..., oprofile, dynamic tracing)
Kernel mutex|errors|dynamic tracing (eg, recursive mutex enter); other errors can cause kernel lockup/panic, debug with kdump/crash
User mutex|utilization|valgrind --tool=drd --exclusive-threshold=... (held time); dynamic tracing of lock to unlock function time
User mutex|saturation|valgrind --tool=drd to infer contention from held time; dynamic tracing of synchronization functions for wait time; profiling (oprofile, LPE, ...) user stacks for spins
User mutex|errors|valgrind --tool=drd various errors; dynamic tracing of pthread_mutex_lock() for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ...
Task capacity|utilization|top/htop, "Tasks" (current); sysctl kernel.threads-max, /proc/sys/kernel/threads-max (max)
Task capacity|saturation|threads blocking on memory allocation; at this point the page scanner should be running (sar -B "pgscan*"), else examine using dynamic tracing
Task capacity|errors|"can't fork()" errors; user-level threads: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: dynamic tracing of kernel_thread() ENOMEM
File descriptors|utilization|system-wide: sar -v, "file-nr" vs /proc/sys/fs/file-max; dstat --fs, "files"; or just /proc/sys/fs/file-nr; per-process: ls /proc/PID/fd \| wc -l vs ulimit -n
File descriptors|saturation|does this make sense? I don't think there is any queueing or blocking, other than on memory allocation.
File descriptors|errors|strace errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...).
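The per-process file descriptor check (open fds under /proc/PID/fd versus ulimit -n) can be sketched in Python for the current process. This assumes a Linux host with /proc mounted; the function names are illustrative.

```python
import os
import resource


def fd_utilization(open_fds: int, soft_limit: int) -> float:
    """Fraction of the per-process descriptor limit currently in use."""
    return open_fds / soft_limit


def current_process_fd_utilization() -> float:
    """Same check as `ls /proc/PID/fd | wc -l` vs `ulimit -n`, for this process."""
    open_fds = len(os.listdir("/proc/self/fd"))  # assumes Linux with /proc
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return 0.0  # no limit configured, so nothing to saturate
    return fd_utilization(open_fds, soft_limit)


print(f"{fd_utilization(512, 1024):.0%} of the descriptor limit in use")
```

A process approaching 100% here is the one likely to start failing with EMFILE, which is the error case listed in the table.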
Refer
[8] Kernel lock analysis used to be via lockmeter, which had an interface called "lockstat".
What's Next
See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move onto other strategies: drill-down analysis and latency analysis.