



Tuning and calculating sysctl/sysfs parameters


1 Setting the maximum receive/send socket buffer size

net.core.rmem_max and net.core.wmem_max are related to the Bandwidth Delay Product (BDP).


BDP = Bandwidth/8*RTT

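If the server's bandwidth is 2 Gbit/s, the RTT is 10 ms and tcp_adv_win_scale=2, then the BDP is 2G/8*0.01 = 2.5 MB, and the maximum read buffer should be set to 4/3*2.5 MB = 3.3 MB. For the detailed formula, see the TCP performance calculation section below.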
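The arithmetic above (2G/8*0.01, then the 4/3 factor) can be reproduced with a minimal Python sketch. The function names `bdp_bytes` and `read_buffer_for_bdp` are illustrative and not part of any kernel interface. The buffer formula assumes that, for tcp_adv_win_scale > 0, a fraction 1/2^tcp_adv_win_scale of the buffer is reserved as overhead, as described in the TCP performance calculation section.

```python
def bdp_bytes(bandwidth_bits_per_s: float, rtt_s: float) -> float:
    """Bandwidth Delay Product in bytes: Bandwidth / 8 * RTT."""
    return bandwidth_bits_per_s / 8 * rtt_s


def read_buffer_for_bdp(bdp: float, tcp_adv_win_scale: int) -> float:
    """Buffer size whose usable window equals the given BDP."""
    if tcp_adv_win_scale > 0:
        scale = 2 ** tcp_adv_win_scale
        # usable window = buffer * (scale - 1) / scale
        return bdp * scale / (scale - 1)
    # scale <= 0: usable window = buffer / 2^(-scale)
    return bdp * 2 ** (-tcp_adv_win_scale)


if __name__ == "__main__":
    bdp = bdp_bytes(2e9, 0.01)                          # 2 Gbit/s, 10 ms RTT
    print(round(bdp / 1e6, 2))                          # 2.5 (MB)
    print(round(read_buffer_for_bdp(bdp, 2) / 1e6, 1))  # 3.3 (MB)
```

The result is a lower bound for the third field of net.ipv4.tcp_rmem, not a value to copy blindly.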


 net.core.rmem_max = 134217728
 net.core.wmem_max = 134217728
 net.ipv4.tcp_rmem = 4096 87380 134217728
 net.ipv4.tcp_wmem = 4096 65536 134217728

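net.ipv4.tcp_mem = 8388608 12582912 16777216 defines the TCP memory limits in pages; a page is usually 4k or 8k. Unless you know exactly what you are doing, leave it alone.

tcp_mem can be calculated with the following formula: default_window*connection_count/page_size

The first value is the level below which TCP is under no memory pressure, the second is the level at which TCP memory pressure begins, and the third is the maximum. In general, once TCP memory exceeds the second value, new TCP windows stop growing. Above the third value, new TCP connections are refused.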
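As a rough illustration of the formula above, the sketch below turns a per-connection window and an expected connection count into a page count, and converts a tcp_mem page value back into bytes. The function names and the 100000-connection figure are assumptions for illustration. The 4096-byte page size should be checked on the target machine, for example with `getconf PAGESIZE`.

```python
def tcp_mem_pages(default_window_bytes: int, connection_count: int,
                  page_size_bytes: int = 4096) -> int:
    """default_window * connection_count / page_size, rounded down to whole pages."""
    return default_window_bytes * connection_count // page_size_bytes


def pages_to_bytes(pages: int, page_size_bytes: int = 4096) -> int:
    """Convert a tcp_mem value (in pages) to bytes."""
    return pages * page_size_bytes


if __name__ == "__main__":
    # Hypothetical load: 100000 connections at the default 87380-byte window.
    print(tcp_mem_pages(87380, 100000))
    # The third tcp_mem value shown above, expressed in GiB with 4k pages.
    print(pages_to_bytes(16777216) / 1024 ** 3)
```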




2 nf_conntrack settings

Relationship between net.ipv4.netfilter.ip_conntrack_max and /sys/module/nf_conntrack/parameters/hashsize

ARCH = [32|64]

HASHSIZE = CONNTRACK_MAX / 8 = RAMSIZE (in bytes) / 131072 / (ARCH / 32)

For example, on a 64-bit server with 32G of RAM: conntrack_max = 1048576, hashsize = 131072
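The relationship above is easy to check numerically. The following is a minimal sketch of the stated formula only; the function name is hypothetical, and the values a given kernel actually picks by default may differ.

```python
def conntrack_sizes(ram_bytes: int, arch_bits: int = 64) -> tuple[int, int]:
    """Return (conntrack_max, hashsize) using
    HASHSIZE = RAMSIZE / 131072 / (ARCH / 32) and CONNTRACK_MAX = HASHSIZE * 8."""
    hashsize = ram_bytes // 131072 // (arch_bits // 32)
    conntrack_max = hashsize * 8
    return conntrack_max, hashsize


if __name__ == "__main__":
    # 32G of RAM on a 64-bit machine, as in the example above.
    print(conntrack_sizes(32 * 1024 ** 3, 64))  # (1048576, 131072)
```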

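3 TCP congestion control


You can switch to htcp if your workload calls for it. See the TCP algorithms article for details.

4 TCP port range


5 TCP port reuse

Leave net.ipv4.tcp_tw_recycle and net.ipv4.tcp_tw_reuse unset unless you need them as a temporary workaround. Do not enable both at the same time.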

6 TCP timeouts

net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent2 = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_thin_linear_timeouts = 0
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 432000
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 120
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300


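7 TCP processing queues

There is no formula for net.core.netdev_max_backlog. In general, the larger the RTT, the larger this value should be.
 10G NIC, RTT=100ms: net.core.netdev_max_backlog = 30000
 10G NIC, RTT=200ms, or 40G NIC, RTT=50ms: net.core.netdev_max_backlog = 250000

Adjust tcp_max_syn_backlog only when you see "TCP: drop open request" in the kernel log.

net.ipv4.tcp_max_orphans limits the number of sockets that are not attached to any fd. It can be raised, but should not be lowered. Keep one thing in mind: each orphan can consume up to 64k of unswappable memory.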

8 TCP attributes


TCP performance calculation


net.ipv4.tcp_rmem = 4096        87380   2067968

The receive window (tcp_rmem) defaults to 87380.


if tcp_adv_win_scale > 0 {
    buf = window/2^tcp_adv_win_scale
} else {
    buf = window - window/2^(-tcp_adv_win_scale)
}

The part of tcp_rmem actually used for network transfer is window - buf.


if tcp_adv_win_scale > 0 {
    speed = (window - window/2^tcp_adv_win_scale)/RTT
} else {
    speed = (window/2^-tcp_adv_win_scale)/RTT
}


With the default window, tcp_adv_win_scale=2 and an RTT of 150 ms:

(87380 - (87380 / 2^2))/0.150 = 436900 bytes/s
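The two pseudocode branches above translate directly into Python, which makes it easy to try other windows and RTTs. This is a sketch of the formulas as written in this article; the function names are illustrative.

```python
def usable_window(window: float, tcp_adv_win_scale: int) -> float:
    """Part of the receive buffer available as TCP window (window - buf)."""
    if tcp_adv_win_scale > 0:
        return window - window / 2 ** tcp_adv_win_scale
    return window / 2 ** (-tcp_adv_win_scale)


def max_speed_bytes_per_s(window: float, tcp_adv_win_scale: int, rtt_s: float) -> float:
    """Upper bound on throughput: one usable window per round trip."""
    return usable_window(window, tcp_adv_win_scale) / rtt_s


if __name__ == "__main__":
    # Default tcp_rmem window, tcp_adv_win_scale=2, RTT of 150 ms.
    print(usable_window(87380, 2))  # 65535.0
    print(round(max_speed_bytes_per_s(87380, 2, 0.150)))
```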

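The speed cannot exceed the link bandwidth in bytes per second:

speed <= Bandwidth/8


Setting the two equal and solving for the window, with BDP = Bandwidth/8*RTT:

(window - (window/2^tcp_adv_win_scale))/RTT = Bandwidth/8
window = BDP*(2^tcp_adv_win_scale)/(2^tcp_adv_win_scale - 1)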


The role of the TCP initial window

int init_cwnd = 4;
if (mss > 1460*3)
    init_cwnd = 2;
else if (mss > 1460)
    init_cwnd = 3;
if (*rcv_wnd > init_cwnd*mss)
    *rcv_wnd = init_cwnd*mss;
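To try the clamp with different MSS values, the C fragment above can be restated in Python. This is a direct port of the fragment as shown here, not of any particular kernel version, and the function name is hypothetical.

```python
def initial_rcv_wnd(rcv_wnd: int, mss: int) -> int:
    """Clamp the initial receive window to init_cwnd * mss, as in the fragment above."""
    init_cwnd = 4
    if mss > 1460 * 3:
        init_cwnd = 2
    elif mss > 1460:
        init_cwnd = 3
    return min(rcv_wnd, init_cwnd * mss)


if __name__ == "__main__":
    print(initial_rcv_wnd(65535, 1460))  # 5840: four segments of 1460 bytes
    print(initial_rcv_wnd(65535, 9000))  # 18000: two jumbo segments
```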

In kernel 3.x and later, the initial window was raised to 10 MSS.

Many tuning articles recommend changing the default window with the ip route command.

# send window
ip route change default via dev eth0  proto static initcwnd 10
# receive window
ip route change default via dev eth0  proto static initrwnd 10


About the NIC ring buffer

ethtool -g eth1

Ring parameters for eth1:
Pre-set maximums:
RX:             2040
RX Mini:        0
RX Jumbo:       8160
TX:             255
Current hardware settings:
RX:             255
RX Mini:        0
RX Jumbo:       0
TX:             255

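Adjust the ring buffer if you actually need to, but do not change it casually. A large ring buffer adds network latency.

The ring buffer stores descriptors that point to SKBs (socket kernel buffers). For example, take a link speed of 5 Mbit/s, a NIC MTU of 1500 and an SKB size of 1500 bytes (12000 bits). The ring buffer behaves like a FIFO queue, so with 254 packets queued ahead the latency is (254*12000)/5000000 = 0.6096s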
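The latency estimate above is a one-line calculation, sketched here so that other ring sizes and link speeds can be tried. The function name is illustrative, and the model assumes every queued packet is a full 1500-byte SKB.

```python
def ring_buffer_delay_s(queued_packets: int, packet_bits: int,
                        link_bits_per_s: float) -> float:
    """Time for a FIFO of queued_packets to drain onto the link, in seconds."""
    return queued_packets * packet_bits / link_bits_per_s


if __name__ == "__main__":
    # 254 packets of 1500 bytes (12000 bits) ahead in the queue, 5 Mbit/s link.
    print(ring_buffer_delay_s(254, 1500 * 8, 5_000_000))
```

The same function shows why the effect fades on fast links: at 10 Gbit/s the same queue drains in a fraction of a millisecond.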

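Changing the ring buffer means weighing network latency against bandwidth yourself. A large ring buffer reduces packet loss, but the latency it introduces can be significant.


The driver queue is simply the NIC's ring buffer.

Queuing Disciplines are the qdiscs in tc. Different qdisc settings behave differently. The length of this queue is determined by the NIC's txqueuelen parameter.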

$ ip add

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet scope host lo
    valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
    valid_lft forever preferred_lft forever

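/sbin/ifconfig ethN txqueuelen 10000 changes the length of the NIC's transmit queue. With this setting the queue can hold 10000 packets.

The setting /sbin/ifconfig ethN txqueuelen 10000 for 10G NICs is a rule of thumb taken from online references. In a real environment, change it only if you see many overruns or drops.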
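One way to check for drops and overruns before touching txqueuelen is to read the per-interface counters in /proc/net/dev. The sketch below assumes the usual layout of that file: two header lines, then sixteen counters per interface, with drop and fifo as the fourth and fifth column in each direction. The function name is illustrative, and treating the fifo column as the counterpart of ifconfig's "overruns" is an assumption.

```python
def parse_net_dev(text: str) -> dict[str, dict[str, int]]:
    """Extract drop and fifo counters per interface from /proc/net/dev content."""
    stats = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(value) for value in data.split()]
        if len(fields) < 16:
            continue  # unexpected layout, ignore this line
        stats[name.strip()] = {
            "rx_drop": fields[3],
            "rx_fifo": fields[4],
            "tx_drop": fields[11],
            "tx_fifo": fields[12],
        }
    return stats


if __name__ == "__main__":
    with open("/proc/net/dev") as handle:
        for iface, counters in parse_net_dev(handle.read()).items():
            print(iface, counters)
```

Counters are cumulative since boot, so what matters is whether they grow between two readings, not their absolute value.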

Linux Performance Checklist


Physical Resources

component|type|metric
---------|-----|-------
CPU|utilization|system-wide: vmstat 1, "us" + "sy" + "st"; sar -u, sum fields except "%idle" and "%iowait"; dstat -c, sum fields except "idl" and "wai"; per-cpu: mpstat -P ALL 1, sum fields except "%idle" and "%iowait"; sar -P ALL, same as mpstat; per-process: top, "%CPU"; htop, "CPU%"; ps -o pcpu; pidstat 1, "%CPU"; per-kernel-thread: top/htop ("K" to toggle), where VIRT == 0 (heuristic). [1]
CPU|saturation|system-wide: vmstat 1, "r" > CPU count [2]; sar -q, "runq-sz" > CPU count; dstat -p, "run" > CPU count; per-process: /proc/PID/schedstat 2nd field (sched_info.run_delay); perf sched latency (shows "Average" and "Maximum" delay per-schedule); dynamic tracing, eg, SystemTap schedtimes.stp "queued(us)" [3]
CPU|errors|perf (LPE) if processor specific error events (CPC) are available; eg, AMD64's "04Ah Single-bit ECC Errors Recorded by Scrubber" [4]
Memory capacity|utilization|system-wide: free -m, "Mem:" (main memory), "Swap:" (virtual memory); vmstat 1, "free" (main memory), "swap" (virtual memory); sar -r, "%memused"; dstat -m, "free"; slabtop -s c for kmem slab usage; per-process: top/htop, "RES" (resident main memory), "VIRT" (virtual memory), "Mem" for system-wide summary
Memory capacity|saturation|system-wide: vmstat 1, "si"/"so" (swapping); sar -B, "pgscank" + "pgscand" (scanning); sar -W; per-process: 10th field (min_flt) from /proc/PID/stat for minor-fault rate, or dynamic tracing [5]; OOM killer: dmesg \| grep killed
Memory capacity|errors|dmesg for physical failures; dynamic tracing, eg, SystemTap uprobes for failed malloc()s
Network Interfaces|utilization|sar -n DEV 1, "rxKB/s"/max "txKB/s"/max; ip -s link, RX/TX tput / max bandwidth; /proc/net/dev, "bytes" RX/TX tput/max; nicstat "%Util" [6]
Network Interfaces|saturation|ifconfig, "overruns", "dropped"; netstat -s, "segments retransmited"; sar -n EDEV, drop and fifo metrics; /proc/net/dev, RX/TX "drop"; nicstat "Sat" [6]; dynamic tracing for other TCP/IP stack queueing [7]
Network Interfaces|errors|ifconfig, "errors", "dropped"; netstat -i, "RX-ERR"/"TX-ERR"; ip -s link, "errors"; sar -n EDEV, "rxerr/s" "txerr/s"; /proc/net/dev, "errs", "drop"; extra counters may be under /sys/class/net/...; dynamic tracing of driver function returns [7]
Storage device I/O|utilization|system-wide: iostat -xz 1, "%util"; sar -d, "%util"; per-process: iotop; pidstat -d; /proc/PID/sched "se.statistics.iowait_sum"
Storage device I/O|saturation|iostat -xnz 1, "avgqu-sz" > 1, or high "await"; sar -d same; LPE block probes for queue length/latency; dynamic/static tracing of I/O subsystem (incl. LPE block probes)
Storage device I/O|errors|/sys/devices/.../ioerr_cnt; smartctl; dynamic/static tracing of I/O subsystem response codes [8]
Storage capacity|utilization|swap: swapon -s; free; /proc/meminfo "SwapFree"/"SwapTotal"; file systems: "df -h"
Storage capacity|saturation|not sure this one makes sense - once it's full, ENOSPC
Storage capacity|errors|strace for ENOSPC; dynamic tracing for ENOSPC; /var/log/messages errs, depending on FS
Storage controller|utilization|iostat -xz 1, sum devices and compare to known IOPS/tput limits per-card
Storage controller|saturation|see storage device saturation, ...
Storage controller|errors|see storage device errors, ...
Network controller|utilization|infer from ip -s link (or /proc/net/dev) and known controller max tput for its interfaces
Network controller|saturation|see network interface saturation, ...
Network controller|errors|see network interface errors, ...
CPU interconnect|utilization|LPE (CPC) for CPU interconnect ports, tput / max
CPU interconnect|saturation|LPE (CPC) for stall cycles
CPU interconnect|errors|LPE (CPC) for whatever is available
Memory interconnect|utilization|LPE (CPC) for memory busses, tput / max; or CPI greater than, say, 5; CPC may also have local vs remote counters
Memory interconnect|saturation|LPE (CPC) for stall cycles
Memory interconnect|errors|LPE (CPC) for whatever is available
I/O interconnect|utilization|LPE (CPC) for tput / max if available; inference via known tput from iostat/ip/...
I/O interconnect|saturation|LPE (CPC) for stall cycles
I/O interconnect|errors|LPE (CPC) for whatever is available


[1] There can be some oddities with the %CPU from top/htop in virtualized environments; I'll update with details later when I can.

CPU utilization: a single hot CPU can be caused by a single hot thread, or mapped hardware interrupt. Relief of the bottleneck usually involves tuning to use more CPUs in parallel.

uptime "load average" (or /proc/loadavg) wasn't included for CPU metrics since Linux load averages include tasks in the uninterruptible state (usually I/O).

[2] The man page for vmstat describes "r" as "The number of processes waiting for run time", which is either incorrect or misleading (on recent Linux distributions it's reporting those threads that are waiting, and threads that are running on-CPU; it's just the wait threads in other OSes).

[3] There may be a way to measure per-process scheduling latency with perf's sched:sched_process_wait event, otherwise perf probe to dynamically trace the scheduler functions, although the overhead under high load to gather and post-process many (100s of) thousands of events per second may make this prohibitive. SystemTap can aggregate per-thread latency in-kernel to reduce overhead, although last I tried schedtimes.stp (on FC16) it produced thousands of "unknown transition:" warnings.

LPE == Linux Performance Events, aka perf_events. This is a powerful observability toolkit that reads CPC and can also use static and dynamic tracing. Its interface is the perf command.

CPC == CPU Performance Counters (aka "Performance Instrumentation Counters" (PICs) or "Performance Monitoring Events" (PMUs) or "Hardware Events"), read via programmable registers on each CPU by perf (which it was originally designed to do). These have traditionally been hard to work with due to differences between CPUs. LPE perf makes life easier by providing aliases for commonly used counters. Be aware that there are usually many more made available by the processor, accessible by providing their hex values to perf stat -e. Expect to spend some quality time (days) with the processor vendor manuals when trying to use these. (My short video about CPC may be useful, despite not being on Linux).

[4] There aren't many error-related events in the recent Intel and AMD processor manuals; be aware that the public manuals may not show a complete list of events.

[5] The goal is a measure of memory capacity saturation - the degree to which a process is driving the system beyond its ability (and causing paging/swapping). High fault latency works well, but there isn't a standard LPE probe or existing SystemTap example of this (roll your own using dynamic tracing). Another metric that may serve a similar goal is minor-fault rate by process, which could be watched from /proc/PID/stat. This should be available in htop as MINFLT.

[6] Tim Cook ported nicstat to Linux; it can be found on sourceforge or his blog.

[7] Dropped packets are included as both saturation and error indicators, since they can occur due to both types of events.

[8] This includes tracing functions from different layers of the I/O subsystem: block device, SCSI, SATA, IDE, ... Some static probes are available (LPE "scsi" and "block" tracepoint events), else use dynamic tracing.

CPI == Cycles Per Instruction (others use IPC == Instructions Per Cycle).

I/O interconnect: this includes the CPU to I/O controller busses, the I/O controller(s), and device busses (eg, PCIe).

Dynamic Tracing: Allows custom metrics to be developed, live in production. Options on Linux include: LPE's "perf probe", which has some basic functionality (function entry and variable tracing), although in a trace-n-dump style that can cost performance; SystemTap (in my experience, almost unusable on CentOS/Ubuntu, but much more stable on Fedora); DTrace-for-Linux, either the Paul Fox port (which I've tried) or the OEL port (which Adam has tried), both projects very much in beta.

Software Resources

component|type|metric
---------|------|---------
Kernel mutex|utilization|With CONFIG_LOCK_STAT=y, /proc/lock_stat "holdtime-total" / "acquisitions" (also see "holdtime-min", "holdtime-max") [8]; dynamic tracing of lock functions or instructions (maybe)
Kernel mutex|saturation|With CONFIG_LOCK_STAT=y, /proc/lock_stat "waittime-total" / "contentions" (also see "waittime-min", "waittime-max"); dynamic tracing of lock functions or instructions (maybe); spinning shows up with profiling (perf record -a -g -F 997 ..., oprofile, dynamic tracing)
Kernel mutex|errors|dynamic tracing (eg, recursive mutex enter); other errors can cause kernel lockup/panic, debug with kdump/crash
User mutex|utilization|valgrind --tool=drd --exclusive-threshold=... (held time); dynamic tracing of lock to unlock function time
User mutex|saturation|valgrind --tool=drd to infer contention from held time; dynamic tracing of synchronization functions for wait time; profiling (oprofile, LPE, ...) user stacks for spins
User mutex|errors|valgrind --tool=drd various errors; dynamic tracing of pthread_mutex_lock() for EAGAIN, EINVAL, EPERM, EDEADLK, ENOMEM, EOWNERDEAD, ...
Task capacity|utilization|top/htop, "Tasks" (current); sysctl kernel.threads-max, /proc/sys/kernel/threads-max (max)
Task capacity|saturation|threads blocking on memory allocation; at this point the page scanner should be running (sar -B "pgscan*"), else examine using dynamic tracing
Task capacity|errors|"can't fork()" errors; user-level threads: pthread_create() failures with EAGAIN, EINVAL, ...; kernel: dynamic tracing of kernel_thread() ENOMEM
File descriptors|utilization|system-wide: sar -v, "file-nr" vs /proc/sys/fs/file-max; dstat --fs, "files"; or just /proc/sys/fs/file-nr; per-process: ls /proc/PID/fd \| wc -l vs ulimit -n
File descriptors|saturation|does this make sense? I don't think there is any queueing or blocking, other than on memory allocation.
File descriptors|errors|strace errno == EMFILE on syscalls returning fds (eg, open(), accept(), ...).
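The per-process file descriptor check (ls /proc/PID/fd | wc -l vs ulimit -n) can also be sketched in Python. The function name and default path are illustrative, /proc is Linux-specific, and the resource module reports the soft limit of the calling process only, so the comparison is meaningful when the script inspects itself.

```python
import os
import resource


def fd_usage(fd_dir: str = "/proc/self/fd") -> tuple[int, int]:
    """Return (open descriptor count, soft RLIMIT_NOFILE of this process)."""
    # Like ls, listing the directory briefly opens one extra descriptor.
    open_fds = len(os.listdir(fd_dir))
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft_limit


if __name__ == "__main__":
    used, limit = fd_usage()
    print(f"{used} descriptors open, soft limit {limit}")
```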


[8] Kernel lock analysis used to be via lockmeter, which had an interface called "lockstat".

What's Next

See the USE Method for the follow-up strategies after identifying a possible bottleneck. If you complete this checklist but still have a performance issue, move onto other strategies: drill-down analysis and latency analysis.

Time: 2024-10-12 17:25:00


