【NE Crash Site】
pid: 6715, tid: 6719, name: Signal Catcher  >>> com.android.contacts <<<
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x55a5931480
    x0   00000000000d4000  x1   0000007f98f9d868  x2   0000000000004001  x3   0000007f951f4450
    x4   00000000000000ca  x5   0000000000004001  x6   0000000000000000  x7   0000007f98f9d86c
    x8   0000000000001a3f  x9   0000000000001a3f  x10  0000007f98f9d86c  x11  0000000000004000
    x12  0000000000004001  x13  0000000000000000  x14  0000000000000001  x15  0000000000000fc0
    x16  0000007f98f90a58  x17  0000000000000000  x18  0000000000000fc0  x19  0000007f97457000
    x20  0000007f98f9df18  x21  00000055a5931460  x22  0000007f951f3420  x23  00000055a585d460
    x24  0000007f951f3428  x25  00000000000000ca  x26  0000000000000001  x27  0000007f951f3338
    x28  0000007f8abba350  x29  0000007f951f3290  x30  0000007f97434788
    sp   0000007f951f3290  pc   0000007f974347b0  pstate 0000000060000000

backtrace:
    #00 pc 00000000000077b0  /system/lib64/libunwind.so
    #01 pc 0000000000007cf4  /system/lib64/libunwind.so (_ULaarch64_dwarf_find_debug_frame+320)
    #02 pc 0000000000008244  /system/lib64/libunwind.so
    #03 pc 0000000000003868  /system/bin/linker64 (__dl__Z18do_dl_iterate_phdrPFiP12dl_phdr_infomPvES1_+96)
    #04 pc 0000000000003324  /system/bin/linker64 (__dl_dl_iterate_phdr+44)
    #05 pc 00000000000085b8  /system/lib64/libunwind.so
    #06 pc 00000000000062c0  /system/lib64/libunwind.so
    #07 pc 00000000000071f4  /system/lib64/libunwind.so
    #08 pc 0000000000007618  /system/lib64/libunwind.so
    #09 pc 0000000000015ca8  /system/lib64/libunwind.so (_ULaarch64_step+40)
    #10 pc 0000000000006f80  /system/lib64/libbacktrace.so (_ZN13UnwindCurrent17UnwindFromContextEmP8ucontext+360)
    #11 pc 0000000000004eec  /system/lib64/libbacktrace.so (_ZN16BacktraceCurrent12UnwindThreadEm+556)
    #12 pc 000000000048a368  /system/lib64/libart.so (_ZN3art15DumpNativeStackERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEEiP12BacktraceMapPKcPNS_9ArtMethodEPv+200)
    #13 pc 0000000000459204  /system/lib64/libart.so (_ZNK3art6Thread4DumpERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEEP12BacktraceMap+224)
    #14 pc 00000000004660a4  /system/lib64/libart.so (_ZN3art14DumpCheckpoint3RunEPNS_6ThreadE+692)
    #15 pc 00000000004670a0  /system/lib64/libart.so (_ZN3art10ThreadList13RunCheckpointEPNS_7ClosureE+500)
    #16 pc 000000000046770c  /system/lib64/libart.so (_ZN3art10ThreadList4DumpERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+204)
    #17 pc 0000000000468014  /system/lib64/libart.so (_ZN3art10ThreadList14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+492)
    #18 pc 00000000004313a4  /system/lib64/libart.so (_ZN3art7Runtime14DumpForSigQuitERNSt3__113basic_ostreamIcNS1_11char_traitsIcEEEE+96)
    #19 pc 000000000043e814  /system/lib64/libart.so (_ZN3art13SignalCatcher13HandleSigQuitEv+1256)
    #20 pc 000000000043f424  /system/lib64/libart.so (_ZN3art13SignalCatcher3RunEPv+452)
    #21 pc 0000000000067754  /system/lib64/libc.so (_ZL15__pthread_startPv+52)
    #22 pc 000000000001c644  /system/lib64/libc.so (__start_thread+16)
The call stack shows that the Signal Catcher thread crashed while dumping the call stacks of other threads.
【Problem Analysis】
First, use addr2line to locate the faulting code:
static int
load_debug_frame (const char *file, char **buf, size_t *bufsize, int is_local)
{
  ...
  f = fopen (file, "r");

  if (!f)
    return 1;

  if (fread (&ehdr, sizeof (Elf_W (Ehdr)), 1, f) != 1)
    goto file_error;

  shstrndx = ehdr.e_shstrndx;

  Debug (4, "opened file '%s'. Section header at offset %d\n",
         file, (int) ehdr.e_shoff);

  fseek (f, ehdr.e_shoff, SEEK_SET);
  sec_hdrs = calloc (ehdr.e_shnum, sizeof (Elf_W (Shdr)));
  if (sec_hdrs == NULL
      || fread (sec_hdrs, sizeof (Elf_W (Shdr)), ehdr.e_shnum, f) != ehdr.e_shnum)
    goto file_error;

  Debug (4, "loading string table of size %ld\n",
         (long) sec_hdrs[shstrndx].sh_size);
  size_t sec_size = sec_hdrs[shstrndx].sh_size;   <<<
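It looks like shstrndx is too large, so indexing sec_hdrs with it reads past the end of the array. The value comes from
shstrndx = ehdr.e_shstrndx
that is, straight from the ELF header, so the file being parsed probably does not have a valid ELF format.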
Disassembling with objdump shows that ehdr is located at the address held in x27, i.e. 0000007f951f3338.
The tombstone then gives the contents of ehdr:
memory near x27:
    0000007f951f3318 0000007f8371b214 0000007f951f3bb0  ..q......;......
    0000007f951f3328 0000007f95c31000 0000007f95c31000  ................
    0000007f951f3338 0800000a04034b50 c1963a2100000000  PK..........!:..
    0000007f951f3348 3b4e00013b4e66df 7361000d001d0001  .fN;..N;......as
    0000007f951f3358 7268632f73746573 5f3030315f656d6f  sets/chrome_100_
    0000007f951f3368 2e746e6563726570 350000cafe6b6170  percent.pak....5
    0000007f951f3378 2190b2bd990eaa19 0000007f951f3470  .......!p4......
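As a cross-check, the bogus header fields can be decoded by hand from the dump above. The sketch below is only an illustration: the field offsets are those of the standard 64-bit ELF layout (e_shnum at byte 60 and e_shstrndx at byte 62 of Elf64_Ehdr, sh_size at byte 32 of a 64-byte Elf64_Shdr), and all function names are hypothetical. Treating x23 (00000055a585d460) as the sec_hdrs pointer is an assumption inferred from the register dump, not something the disassembly in this article shows.

```c
#include <stddef.h>
#include <stdint.h>

/* The 64 bytes starting at x27 (0x7f951f3338), copied from the tombstone
 * as little-endian 64-bit words. */
static const uint64_t kDump[8] = {
    0x0800000a04034b50ULL, 0xc1963a2100000000ULL,
    0x3b4e00013b4e66dfULL, 0x7361000d001d0001ULL,
    0x7268632f73746573ULL, 0x5f3030315f656d6fULL,
    0x2e746e6563726570ULL, 0x350000cafe6b6170ULL,
};

/* Standard ELF64 layout constants. */
enum { E_SHNUM_OFF = 60, E_SHSTRNDX_OFF = 62 };  /* within Elf64_Ehdr */
enum { SHDR_SIZE = 64, SH_SIZE_OFF = 32 };       /* within Elf64_Shdr */

/* Byte i of the dumped memory, independent of host endianness. */
uint8_t dump_byte(size_t i) {
    return (uint8_t)(kDump[i / 8] >> (8 * (i % 8)));
}

/* Little-endian 16-bit field at byte offset off. */
static uint16_t dump_u16(size_t off) {
    return (uint16_t)(dump_byte(off) | (dump_byte(off + 1) << 8));
}

/* What load_debug_frame would read as ehdr.e_shnum. */
uint16_t bogus_shnum(void) { return dump_u16(E_SHNUM_OFF); }

/* What load_debug_frame would read as ehdr.e_shstrndx. */
uint16_t bogus_shstrndx(void) { return dump_u16(E_SHSTRNDX_OFF); }

/* Address touched by sec_hdrs[shstrndx].sh_size for a given array base. */
uint64_t faulting_address(uint64_t sec_hdrs) {
    return sec_hdrs + (uint64_t)bogus_shstrndx() * SHDR_SIZE + SH_SIZE_OFF;
}
```

With these bytes, e_shnum decodes to 0xca (202, the value seen in x4 and x25) and e_shstrndx to 0x3500 (13568), so the index is far beyond the 202 entries that were allocated. 13568 * 64 is 0xd4000, the value in x0, and faulting_address(0x55a585d460) gives 0x55a5931480, which is exactly the reported fault address.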
A standard ELF file begins with the magic number 7f 45 4c 46, for example:
$ xxd symbols/system/bin/linker
0000000: 7f45 4c46 0101 0100 0000 0000 0000 0000  .ELF............
0000010: 0300 2800 0100 0000 5816 0000 3400 0000  ..(.....X...4...
0000020: a430 1800 0000 0005 3400 2000 0a00 2800  .0......4. ...(.
0000030: 2300 2000 0600 0000 3400 0000 3400 0000  #. .....4...4...
0000040: 3400 0000 4001 0000 4001 0000 0400 0000  4...@...@.......
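The magic number in the dump above, however, is PK, which is the magic number of an APK package (a ZIP archive).
So the file being parsed is an APK, not an ELF file, and that is why unwind goes wrong while parsing it.
The next step is to find out which file this is. To do that, we first have to identify the target thread whose stack was being dumped.
Searching the tombstone for [vdso] finds the thread that was being dumped: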
pid: 6715, tid: 6762, name: Chrome_FileUser  >>> com.android.contacts <<<
backtrace:
    #00 pc 00000000000199c0  /system/lib64/libc.so (syscall+28)
    #01 pc 0000000000067474  /system/lib64/libc.so (_ZL33__pthread_cond_timedwait_relativeP23pthread_cond_internal_tP15pthread_mutex_tPK8timespec+96)
    #02 pc 00000000000675f0  /system/lib64/libc.so (pthread_cond_timedwait+72)
    #03 pc 00000000000068bc  /system/lib64/libbacktrace.so (_ZN11ThreadEntry4WaitEi+76)
    #04 pc 0000000000004c3c  /system/lib64/libbacktrace.so
    #05 pc 00000000000004dc  [vdso]
    #06 pc 00000000000199bc  /system/lib64/libc.so (syscall+24)
    #07 pc 0000000000067474  /system/lib64/libc.so (_ZL33__pthread_cond_timedwait_relativeP23pthread_cond_internal_tP15pthread_mutex_tPK8timespec+96)
    #08 pc 0000000000793214  /system/app/WebViewGoogle/WebViewGoogle.apk (offset 0xa0d000)
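The [vdso] frame marks the point where a signal handler was entered: on arm64 this mapping holds the kernel-provided trampoline through which a signal handler returns. Frames #00 to #04 are therefore the handler itself (waiting inside libbacktrace), and frames #06 onward are the stack of the interrupted thread. The mapping appears in the memory map as: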
...
0000007f'993ae000-0000007f'993aefff r-x         0      1000  [vdso]
0000007f'993af000-0000007f'993affff r--     32000      1000  /system/bin/linker64
0000007f'993b0000-0000007f'993b1fff rw-     33000      2000  /system/bin/linker64
...
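【How the Target Thread's Call Stack Is Dumped】
1. A special signal is sent to the target thread, which triggers the signal handler on that thread.
2. When the kernel invokes the signal handler, it passes the target thread's current CPU context to the handler as an argument, so the handler has access to that context.
3. Having obtained the context, the target thread hands it to the Signal Catcher thread and then enters a wait state.
4. Once the Signal Catcher thread has the context, it calls the unwind library to print the target thread's call stack, and wakes the target thread when it is done.
The NE here occurred in step 4: the Signal Catcher thread had received the target thread's context and crashed while the unwind library was walking the target thread's call stack.
The target thread's call stack clearly contains a frame in WebViewGoogle.apk, which matches the earlier reasoning. That frame is mapped at offset 0xa0d000 within the APK, but load_debug_frame opens the mapped file by name and reads the header from the start of the file, so it sees the ZIP header instead of an ELF header.
Inspecting the contents of WebViewGoogle.apk confirms that they match the header bytes decoded above.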
【Solution】
In unwind, after reading the file header, check for the ELF magic number before using any other field, for example:
  if (fread (&ehdr, sizeof (Elf_W (Ehdr)), 1, f) != 1)
    goto file_error;

  /* Verify this is actually an elf file. */
+ if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
+   goto file_error;

  shstrndx = ehdr.e_shstrndx;
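The same check can be tried outside libunwind. The following is a small standalone sketch, assuming nothing beyond the C standard library; the function names are hypothetical and the magic bytes are spelled out locally instead of taking ELFMAG and SELFMAG from <elf.h>.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* The four ELF identification bytes: 7f 45 4c 46. */
static const unsigned char kElfMagic[4] = { 0x7f, 'E', 'L', 'F' };

/* Returns 1 if buf starts with the ELF magic, 0 otherwise
 * (including when the buffer is too short to hold the magic). */
int buffer_has_elf_magic(const unsigned char *buf, size_t len) {
    if (len < sizeof(kElfMagic))
        return 0;
    return memcmp(buf, kElfMagic, sizeof(kElfMagic)) == 0;
}

/* Returns 1 if the file starts with the ELF magic, 0 if it does not,
 * and -1 if the file cannot be opened. */
int file_has_elf_magic(const char *path) {
    unsigned char ident[sizeof(kElfMagic)];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(ident, 1, sizeof(ident), f);
    fclose(f);
    return buffer_has_elf_magic(ident, n);
}
```

Run against a file like WebViewGoogle.apk, whose first bytes are 50 4b 03 04, file_has_elf_magic would return 0, which corresponds to the goto file_error path in the patch.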
AOSP contains the same change:
https://android-review.googlesource.com/#/c/194457/