During testing over the past few days I received another coredump report. The call stack is as follows:
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x0000000000432bb4 in ChargingNode::canProcessed (this=0x7f87b40118e0, maxTimestamp=9000000000) at src/sl/ChargingFile.C:406
#2 0x0000000000445de4 in BucketFileAdapter::checkin (this=0x2192b98, startTime=<value optimized out>) at src/sl/BucketFileAdapter.C:118
#3 0x0000000000446114 in BucketFileAdapter::start (this=0x2192b98) at src/sl/BucketFileAdapter.C:87
#4 0x000000000043560e in file_reader_run (arg=0x2192b98) at src/sl/ChargingFileAdapter.C:234
#5 0x0000003657607851 in start_thread () from /lib64/libpthread.so.0
#6 0x0000003656ee890d in clone () from /lib64/libc.so.6
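The function address at the top of the stack is 0x0.
This is an interesting symptom; I had not run into this case before.
Let's look at it alongside the code. The C++ source at the failing line (frame 1, src/sl/ChargingFile.C:406):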
if(chargingFile && (chargingFile ->canDecoded() == false))
Inspecting chargingFile in gdb shows that the value has been optimized out (the build uses -O2):
(gdb) p chargingFile
$3 = <value optimized out>
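Now consider the C++ code for frame 1. The "chargingFile &&" check short-circuits, so a NULL chargingFile cannot cause the core dump. The only remaining possibility is that chargingFile is non-NULL and the call to canDecoded() itself went wrong.
The corresponding disassembly: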
(gdb) disas
Dump of assembler code for function ChargingNode::canProcessed(long):
0x0000000000432b20 <+0>: mov %rbx,-0x18(%rsp)
0x0000000000432b25 <+5>: lea 0x78(%rdi),%rbx
0x0000000000432b29 <+9>: mov %rbp,-0x10(%rsp)
0x0000000000432b2e <+14>: mov %r12,-0x8(%rsp)
0x0000000000432b33 <+19>: mov %rdi,%rbp
0x0000000000432b36 <+22>: sub $0x18,%rsp
0x0000000000432b3a <+26>: mov %rbx,%rdi
0x0000000000432b3d <+29>: mov %rsi,%r12
0x0000000000432b40 <+32>: callq 0x407820 <[email protected]>
0x0000000000432b45 <+37>: mov 0x18(%rbp),%rcx
0x0000000000432b49 <+41>: mov 0x28(%rbp),%rdx
0x0000000000432b4d <+45>: mov 0x38(%rbp),%rax
0x0000000000432b51 <+49>: sub 0x40(%rbp),%rax
0x0000000000432b55 <+53>: sub %rcx,%rdx
0x0000000000432b58 <+56>: sar $0x3,%rdx
0x0000000000432b5c <+60>: sar $0x3,%rax
0x0000000000432b60 <+64>: add %rax,%rdx
0x0000000000432b63 <+67>: mov 0x50(%rbp),%rax
0x0000000000432b67 <+71>: sub 0x30(%rbp),%rax
0x0000000000432b6b <+75>: sar $0x3,%rax
0x0000000000432b6f <+79>: shl $0x6,%rax
0x0000000000432b73 <+83>: lea -0x40(%rax,%rdx,1),%rax
0x0000000000432b78 <+88>: test %rax,%rax
0x0000000000432b7b <+91>: je 0x432b83 <ChargingNode::canProcessed(long)+99>
0x0000000000432b7d <+93>: cmpb $0x0,0x70(%rbp)
0x0000000000432b81 <+97>: je 0x432ba0 <ChargingNode::canProcessed(long)+128>
0x0000000000432b83 <+99>: mov %rbx,%rdi
0x0000000000432b86 <+102>: callq 0x407250 <[email protected]>
0x0000000000432b8b <+107>: xor %eax,%eax
0x0000000000432b8d <+109>: mov (%rsp),%rbx
0x0000000000432b91 <+113>: mov 0x8(%rsp),%rbp
0x0000000000432b96 <+118>: mov 0x10(%rsp),%r12
0x0000000000432b9b <+123>: add $0x18,%rsp
0x0000000000432b9f <+127>: retq
0x0000000000432ba0 <+128>: cmp %r12,0x68(%rbp)
0x0000000000432ba4 <+132>: jg 0x432bd0 <ChargingNode::canProcessed(long)+176>
0x0000000000432ba6 <+134>: mov (%rcx),%rdi
0x0000000000432ba9 <+137>: test %rdi,%rdi
0x0000000000432bac <+140>: je 0x432bb8 <ChargingNode::canProcessed(long)+152>
0x0000000000432bae <+142>: mov (%rdi),%rax
0x0000000000432bb1 <+145>: callq *0x20(%rax)
---Type <return> to continue, or q <return> to quit---
=> 0x0000000000432bb4 <+148>: test %al,%al
......
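Look at the instruction at <+145>, callq *0x20(%rax) (highlighted in red in the original post). callq is the x86 function-call instruction, and here its operand is an indirect address read from memory, not a direct function address.
I have only seen this form of call in two situations: (1) during kernel boot, where it is written that way to keep the compiler from optimizing the call, and (2) virtual function calls.
So is canDecoded a virtual function? A check of the source confirms it: virtual bool canDecoded();
In that case %rax holds the object's vptr, loaded by mov (%rdi),%rax at <+142>.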
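To make the dispatch concrete, here is a minimal sketch that spells out by hand what mov (%rdi),%rax followed by callq *0x20(%rax) does. FakeObject, Slot, kGoodTable, kZeroedTable and callSlot4 are hypothetical names invented for this illustration; they are not taken from the real ChargingFile code, and a real compiler-generated vtable has more structure than a plain array. The sketch assumes 8-byte pointers, so slot 4 sits at byte offset 0x20.

```cpp
#include <cassert>

struct FakeObject;                       // hypothetical stand-in for ChargingFile
using Slot = bool (*)(FakeObject*);      // each slot receives the object as its first argument

struct FakeObject {
    const Slot* vptr;                    // plays the role of _vptr.ChargingFile
    bool decoded;
};

bool canDecodedImpl(FakeObject* self) { return self->decoded; }

// With 8-byte pointers, slot 4 lives at byte offset 4 * 8 = 0x20 from the table start.
const Slot kGoodTable[5]   = {nullptr, nullptr, nullptr, nullptr, canDecodedImpl};
// A table whose slot 4 is zero, like the memory seen in the core dump.
const Slot kZeroedTable[5] = {nullptr, nullptr, nullptr, nullptr, nullptr};

// Hand-written equivalent of the two instructions, plus a null check
// that the compiler-generated code does not perform.
bool callSlot4(FakeObject* obj, bool* slotWasNull) {
    assert(obj != nullptr && slotWasNull != nullptr);
    const Slot* table = obj->vptr;       // mov (%rdi),%rax
    Slot target = table[4];              // read the pointer stored at 0x20(%rax)
    *slotWasNull = (target == nullptr);
    if (target == nullptr) {
        return false;                    // the real code would jump to address 0x0 here
    }
    return target(obj);                  // callq *0x20(%rax), with obj passed in rdi
}
```

Calling callSlot4 on an object whose vptr is kGoodTable reaches canDecodedImpl; with kZeroedTable the sketch reports a null slot instead of crashing, which is the situation frame 0 of the backtrace corresponds to.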
Let's verify:
(gdb) i r
rax 0x7f87b40c4670 140220818015856
rbx 0x7f87b4011958 140220817283416
rcx 0x7f87b4054478 140220817556600
rdx 0x4c 76
rsi 0x0 0
rdi 0x7f87b40cd480 140220818052224
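Under the x86_64 calling convention the first argument is passed in rdi, and for a member function that argument is the this pointer, so rdi holds the value of chargingFile.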
(gdb) p *(ChargingFile*)0x7f87b40cd480
$7 = {_vptr.ChargingFile = 0x7f87b40c4670, nodeName = "", hostName = "bucket", fileName =
"/incoming4cdrsch/reported/acr/bucket/bucket12/MAS2_-_0000001709.20130704_-_2126+0800.INC", fileType = CDR_FILE_TYPE_ACR, qid = {id =
-1}, timestamp = 1372944386, bufferedChargingRec = false, chargingNode = 0x7f87b40118e0, static fileStatusDir =
"/incoming4cdrsch//status", static processedRecNumLimit = 300000, static acrFilesTotalSize = 0, decodeFlag = false, localFileFlag =
false, fpFile = 0x0, fpStatusFile = 0x0, stopFlag =false, statusFileName = KeyboardInterrupt: Quit
, offset = 4184212, recordNum = 5695, totalRecordNum = 5695, accumNum = 0, static batchSize = 1000}
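Compare the _vptr.ChargingFile field above (highlighted in red in the original post) with rax: both are 0x7f87b40c4670, so the analysis so far is correct.
Next, following 0x0000000000432bb1 <+145>: callq *0x20(%rax), let's see what is actually stored at 0x20(%rax):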
%rax=0x7f87b40c4670
0x20(%rax) = 0x7f87b40c4690
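The address arithmetic can be double-checked at compile time. This is only a small sketch: slotAddress and slotIndex are hypothetical helper names, and the slot index assumes 8-byte pointers as on x86-64.

```cpp
#include <cstdint>

constexpr std::uint64_t kVptr = 0x7f87b40c4670ULL;  // rax as printed by "i r"
constexpr std::uint64_t kDisp = 0x20;               // displacement in callq *0x20(%rax)

// Address the call instruction reads its target from.
constexpr std::uint64_t slotAddress(std::uint64_t vptr, std::uint64_t disp) {
    return vptr + disp;
}

// Which pointer-sized slot that displacement selects (8-byte pointers assumed).
constexpr std::uint64_t slotIndex(std::uint64_t disp) {
    return disp / 8;
}

static_assert(slotAddress(kVptr, kDisp) == 0x7f87b40c4690ULL, "matches the address above");
static_assert(slotIndex(kDisp) == 4, "0x20 bytes is the fifth slot, index 4");
```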
Dump the memory:
(gdb) x/40x 0x7f87b40c4670
0x7f87b40c4670: 0x4100434e 0x2d5f3253 0x00000211 0x00000000
0x7f87b40c4680: 0xb4024f10 0x00007f87 0xb40cd470 0x00007f87
0x7f87b40c4690: 0x00000000 0x00000000 0x00000000 0x00000000
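Look at the 8 bytes at 0x7f87b40c4690 (highlighted in blue in the original post): the stored address is 0x0, which is why the call jumped to address 0. What does that tell us? The object has been corrupted, and the memory its vptr points at no longer holds a valid virtual function table.
Going back to the code, the root cause turned out to be a mutex whose scope was wrong in a multithreaded path, which allowed the object to be freed too early.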
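The original fix is not shown, so the following is only a sketch of one plausible shape of the repair, under the assumption that one thread inspects the front file while another thread may remove and free it. File, Node, add, removeFront and frontCanBeDecoded are hypothetical names, not the real ChargingFile/ChargingNode code. The point is that the lock covers both the lookup and the virtual call; releasing it after fetching a raw pointer and calling canDecoded() afterwards would leave a window in which the object can be freed.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <mutex>

struct File {                              // hypothetical stand-in, not the real ChargingFile
    explicit File(bool d) : decoded(d) {}
    virtual ~File() {}
    virtual bool canDecoded() { return decoded; }
    bool decoded;
};

class Node {                               // hypothetical stand-in, not the real ChargingNode
public:
    void add(bool decoded) {
        std::lock_guard<std::mutex> guard(mutex_);
        files_.push_back(std::unique_ptr<File>(new File(decoded)));
    }

    // Frees the front object; takes the same mutex as the reader below.
    void removeFront() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (!files_.empty()) {
            files_.pop_front();
        }
    }

    // The guard stays alive across the lookup AND the virtual call, so
    // removeFront() cannot destroy the object between the two steps.
    bool frontCanBeDecoded() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (files_.empty()) {
            return false;
        }
        assert(files_.front() != nullptr);
        return files_.front()->canDecoded();
    }

private:
    std::mutex mutex_;
    std::deque<std::unique_ptr<File>> files_;
};
```

If the call under the lock is expensive, an alternative design is to store std::shared_ptr and copy it out under the lock, so the object stays alive after the lock is released. Which trade-off fits depends on the real code.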
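Summary:
1. When the function address at the top of a call stack is 0, consider a corrupted virtual function table, that is, a corrupted (or already freed) object.
2. Know the common x86_64 calling conventions: rdi holds the first function argument, which is the this pointer for member functions.
3. In multithreaded code, pay close attention to lock scope: too narrow a scope may not protect enough, while too wide a scope can hurt performance.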