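I. Problem scenario and environment
System environment:
Red Hat 6.4, kernel 2.6.32-358
Problem:
A rule was added to the iptables mangle table with NFQUEUE as its target. When an HTTP request matched the rule, the machine rebooted immediately. This happened twice, sporadically, and could not at first be reproduced on the machines that had rebooted.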
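II. Investigation
1. Checked the messages, kernel and dmesg logs; nothing abnormal was found.
2. Checked the machine's load before the reboot; CPU, memory, disk I/O and network I/O were all normal.
3. The reboot only happened once NFQUEUE was used as the target, so a system-level problem was suspected, most likely in the NFQUEUE path of iptables. NFQUEUE hands packets from the kernel to user space for processing, so the search was narrowed to the kernel or libnetfilter_queue.
4. The server console might have shown useful output at the moment of the reboot, but the server sits in the customer's data center, which made checking it impractical.
5. Running last to list the reboot history turned up something unexpected: the session that ended with the NFQUEUE-related reboot was marked "crash". In other words the system had crashed and the reboot followed from that, so it was safe to conclude that the system or the kernel had crashed.
6. Linux systems usually have kdump installed and configured by default. When the kernel crashes, kdump captures the memory as it was before the crash and writes a dump file, vmcore, under /var/crash/<date>. The crash tool can then analyze the vmcore file to recover key information from just before the kernel crash. A search on the machine did indeed find a vmcore file from the crash.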
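III. Analyzing the vmcore file
1. Install the debuginfo package for the exact kernel version:
# yum install kernel-debuginfo-2.6.32-358.el6.x86_64
2. Analyze the vmcore with the crash command shipped with the system: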
# crash /usr/lib/debug/lib/modules/2.6.32-358.el6.x86_64/vmlinux vmcore

crash 7.1.0-6.el6
Copyright (C) 2002-2014 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel version inconsistency between vmlinux and dumpfile

      KERNEL: vmlinux
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 40
        DATE: Tue Oct 31 11:53:41 2017
      UPTIME: 342 days, 12:15:26
LOAD AVERAGE: 0.00, 0.02, 0.00
       TASKS: 1050
    NODENAME: web_yp_49_202.mobileztgame
     RELEASE: 2.6.32-358.el6.x86_64
     VERSION: #1 SMP Tue Jan 29 11:47:41 EST 2013
     MACHINE: x86_64 (2499 Mhz)
      MEMORY: 128 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
         PID: 0
     COMMAND: "swapper"
        TASK: ffff882069324080 (1 of 40) [THREAD_INFO: ffff881068896000]
         CPU: 5
       STATE: TASK_RUNNING (PANIC)
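The crash output shows that the kernel panicked because it dereferenced a NULL pointer.
The bt command prints the stack backtrace from the moment of the crash.
Its output: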
crash> bt
PID: 0 TASK: ffff882069324080 CPU: 5 COMMAND: "swapper"
 #0 [ffff8800618a3750] machine_kexec at ffffffff81035b7b
 #1 [ffff8800618a37b0] crash_kexec at ffffffff810c0db2
 #2 [ffff8800618a3880] oops_end at ffffffff815111d0
 #3 [ffff8800618a38b0] no_context at ffffffff81046bfb
 #4 [ffff8800618a3900] __bad_area_nosemaphore at ffffffff81046e85
 #5 [ffff8800618a3950] bad_area_nosemaphore at ffffffff81046f53
 #6 [ffff8800618a3960] __do_page_fault at ffffffff810476b1
 #7 [ffff8800618a3a80] do_page_fault at ffffffff8151311e
 #8 [ffff8800618a3ab0] page_fault at ffffffff815104d5
    [exception RIP: nf_queue+152]
    RIP: ffffffff81475718 RSP: ffff8800618a3b60 RFLAGS: 00010207
    RAX: 0000000000000020 RBX: 0000000000000000 RCX: ffff8810638a3c00
    RDX: 0000000000000002 RSI: ffff880959189980 RDI: 0000000000000000
    RBP: ffff8800618a3bd0 R8: 0000000000021773 R9: 0000000000000001
    R10: 000000000000000e R11: 0000000000000006 R12: ffff880959189980
    R13: 0000000000000000 R14: ffffffff8147e8b0 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #9 [ffff8800618a3bd8] nf_hook_slow at ffffffff81474800
#10 [ffff8800618a3c58] ip_rcv at ffffffff8147ef54
#11 [ffff8800618a3c98] __netif_receive_skb at ffffffff8144819b
#12 [ffff8800618a3cf8] netif_receive_skb at ffffffff8144a578
#13 [ffff8800618a3d38] napi_skb_finish at ffffffff8144a680
#14 [ffff8800618a3d58] napi_gro_receive at ffffffff8144cc29
#15 [ffff8800618a3d78] ixgbe_poll at ffffffffa015e44c [ixgbe]
#16 [ffff8800618a3e68] net_rx_action at ffffffff8144cd43
#17 [ffff8800618a3ec8] __do_softirq at ffffffff81076fb1
#18 [ffff8800618a3f38] call_softirq at ffffffff8100c1cc
#19 [ffff8800618a3f50] do_softirq at ffffffff8100de05
#20 [ffff8800618a3f70] irq_exit at ffffffff81076d95
#21 [ffff8800618a3f80] do_IRQ at ffffffff81516c95
--- <IRQ stack> ---
#22 [ffff881068897db8] ret_from_intr at ffffffff8100b9d3
    [exception RIP: intel_idle+222]
    RIP: ffffffff812d37ae RSP: ffff881068897e68 RFLAGS: 00000206
    RAX: 0000000000000000 RBX: ffff881068897ed8 RCX: 0000000000000000
    RDX: 00000000000e3cb1 RSI: 0000000000000000 RDI: 00000000379d13ba
    RBP: ffffffff8100b9ce R8: 0000000000000004 R9: 0000000000000050
    R10: 0069229e5ea9dbfa R11: 0000000000000000 R12: ffff8800618b15a0
    R13: 0000000000000000 R14: 0069229c2b297a40 R15: ffff8800618b16a0
    ORIG_RAX: ffffffffffffff62 CS: 0010 SS: 0018
#23 [ffff881068897ee0] cpuidle_idle_call at ffffffff81414ef7
#24 [ffff881068897f00] cpu_idle at ffffffff81009fc6
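Reading the backtrace from the bottom up gives the call chain that led to the crash. The CPU was idle, took an interrupt from the ixgbe NIC, and was processing the received packet in softirq context (ixgbe_poll, netif_receive_skb, ip_rcv, nf_hook_slow) when a page fault occurred inside nf_queue. The faulting instruction is the one at the exception RIP, nf_queue+152 (ffffffff81475718). The dis command disassembles that address and maps it back to a source line: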
crash> dis -l ffffffff81475718
/usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/netfilter/nf_queue.c: 221
0xffffffff81475718 <nf_queue+152>: mov (%rbx),%r12
This pinpoints the code where the fault occurred:
# vim /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/netfilter/nf_queue.c +221
215         segs = skb_gso_segment(skb, 0);
216         kfree_skb(skb);
217         if (IS_ERR(segs))
218                 return 1;
219
220         do {
221                 struct sk_buff *nskb = segs->next;
222
223                 segs->next = NULL;
224                 if (!__nf_queue(segs, elem, pf, hook, indev, outdev, okfn,
225                                 queuenum))
226                         kfree_skb(segs);
227                 segs = nskb;
228         } while (segs);
229         return 1;
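To tie the register dump to source line 221: `mov (%rbx),%r12` loads a pointer from the address held in RBX with no displacement, and the backtrace shows RBX as 0. That is what `segs->next` compiles to when `segs` is NULL and `next` sits at offset 0 of the structure, which is the case for struct sk_buff, where `next` and `prev` are the first two members. The following is a minimal user-space sketch of that address arithmetic; `model_skb` and both helper functions are hypothetical names made up for this illustration and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff: only the two list pointers
 * that the real structure keeps at its very start. */
struct model_skb {
    struct model_skb *next;
    struct model_skb *prev;
};

/* Address that a read of `segs->next` would load from: the base pointer
 * plus the offset of `next`. Nothing is dereferenced here, so calling it
 * with NULL is safe. */
static uintptr_t address_of_next_load(const struct model_skb *segs)
{
    return (uintptr_t)segs + offsetof(struct model_skb, next);
}

/* Sanity check of the layout assumption the argument rests on. */
static void check_next_is_first_member(void)
{
    assert(offsetof(struct model_skb, next) == 0);
}
```

With a NULL `segs`, `address_of_next_load()` yields address 0, which agrees with the "NULL pointer dereference at (null)" message in the PANIC line.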
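skb_gso_segment is a function, and its documentation comment states that it may return NULL when the skb requires no segmentation. The IS_ERR(segs) check on line 217 only catches error pointers, not NULL, so in that case segs->next on line 221 dereferences a NULL pointer and the kernel panics. Since the problem comes from GSO, it should be possible to work around it by adjusting the GSO-related offload settings of the system. The relevant source of skb_gso_segment: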
# vim /usr/src/debug/kernel-2.6.32-358.el6/linux-2.6.32-358.el6.x86_64/net/core/dev.c +1728
1728 /**
1729  *      skb_gso_segment - Perform segmentation on skb.
1730  *      @skb: buffer to segment
1731  *      @features: features for the output path (see dev->features)
1732  *
1733  *      This function segments the given skb and returns a list of segments.
1734  *
1735  *      It may return NULL if the skb requires no segmentation. This is
1736  *      only possible when GSO is used for verifying header integrity.
1737  */
1738 struct sk_buff *skb_gso_segment(struct sk_buff *skb, int features)
1739 {
1740         struct sk_buff *segs = ERR_PTR(-EPROTONOSUPPORT);
1741         struct packet_type *ptype;
1742         __be16 type = skb->protocol;
1743         int err;
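Why a NULL return slips past the IS_ERR() check can be shown in user space. The sketch below re-creates the error-pointer convention of the kernel's include/linux/err.h (negative errno values stored in the top 4095 addresses) under hypothetical `model_` names. It is a model for illustration, not the kernel header itself.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer, as ERR_PTR() does. */
static void *model_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* True only for pointers in the top MODEL_MAX_ERRNO addresses, which is
 * where encoded errors live. NULL (address 0) is not in that range. */
static int model_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

/* The stricter check, modelled on IS_ERR_OR_NULL(): also rejects NULL. */
static int model_is_err_or_null(const void *ptr)
{
    return ptr == NULL || model_is_err(ptr);
}
```

In this model, `model_is_err(model_err_ptr(-EPROTONOSUPPORT))` is true while `model_is_err(NULL)` is false, so code guarded only by the first check still proceeds with a NULL pointer.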
The corresponding patch, found online:
https://patchwork.kernel.org/patch/6615071/
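The patch itself is not reproduced here. Purely as an illustration of the kind of guard that would avoid the crash, the sketch below walks a segment list the way the loop at lines 220-228 of nf_queue.c does, but returns early when there is no list. The structure, the function names and the choice to return a count are all assumptions made for this example; this is not a quote from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a segment; only the list link is modelled. */
struct model_seg {
    struct model_seg *next;
};

/* Modelled on the kernel's IS_ERR_OR_NULL(): NULL or an encoded error. */
static int model_is_err_or_null(const void *ptr)
{
    return ptr == NULL || (uintptr_t)ptr >= (uintptr_t)-4095;
}

/* Detach and visit each segment in turn. Returns the number of segments
 * visited, or 0 when there is nothing to walk. */
static int walk_segments_guarded(struct model_seg *segs)
{
    int visited = 0;

    if (model_is_err_or_null(segs))
        return 0;             /* never reaches segs->next with NULL */

    do {
        struct model_seg *nseg = segs->next;

        segs->next = NULL;    /* detach, as the original loop does */
        visited++;            /* stand-in for the __nf_queue() call */
        segs = nseg;
    } while (segs);

    return visited;
}
```

Calling `walk_segments_guarded(NULL)` returns 0 instead of faulting, and a three-element list is still walked in full.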
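IV. Reproducing the problem
1. The first attempt to reproduce the problem was a plain request: curl "t.test.com". It did not trigger the crash.
2. After reading up on TSO/GSO/LRO/GRO, the likely explanation was that the request was too small to trigger any segmentation or reassembly of packets, so the faulty code path was never reached. Enlarging the request reproduced the problem with the following URL: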
# curl "t.test.com/v2/user-manage/css/bootstrap.min.css?test1=sdfsfsdfsdfa&test2_id=2234234234234234234&test_id=50129009890098&test_token=1670056402|_80_m_lxxj1298|1493196793|c726299f2d03b8462764bacf20e2395f|sdfsdfdsfsdffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffsdfsdfsdfdsfsdfhgjgjghjghjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjfhjgjfghjfjfhjjjjjjjjjjjjjjjjjjjjjfffffadfsfsdfsdfsdfsdfsdfdsfdssdfsdfsdfsdfsdfsdf"
The iptables rules involved:
# ipset create lee hash:ip hashsize 819200 maxelem 100000 timeout 300
# ipset add lee 1.1.1.1 timeout 300
# iptables -t mangle -I PREROUTING -p tcp -m multiport --dports 80,443 -m set --match-set lee src -m string --string t.test.com --algo kmp --from 0 --to 1480 -j NFQUEUE
V. Conclusion
This is a Linux kernel bug.
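VI. Solutions
1. Upgrade the kernel. Judging from the patch and the source code, kernels after 3.0 should contain the fix. The 3.10 kernel source was checked and does contain it.
2. Use DROP instead of NFQUEUE as the target of the iptables rule (this is the recommended option).
3. Adjust the GSO-related offload settings of the NIC. Disabling LRO was found to stop the reboots. The command:
# ethtool -K eth0 lro off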
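About LRO:
Linux 2.6.24 added LRO (Large Receive Offload) for TCP over IPv4. LRO aggregates several received TCP packets into a single skb and later hands that skb to the upper layers of the network stack as one large packet. This reduces the per-skb processing cost in the upper layers and improves the system's capacity to receive TCP traffic. All of this requires support from the NIC driver.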
VII. References
https://patchwork.kernel.org/patch/6615071/
https://www.ibm.com/developerworks/cn/linux/l-cn-network-pt/index.html