【NE现场】
打开12306应用后做一些操作,和容易出现系统重启。dropbox中有好多system_server的tombstone文件:
./[email protected]1449222028760.txt:12:pid: 10466, tid: 10493, name: android.bg >>> system_server <<< ./[email protected]1449455808867.txt:12:pid: 5992, tid: 6053, name: AlarmManager >>> system_server <<< ./[email protected]1449222028730.txt:12:pid: 10466, tid: 10494, name: ActivityManager >>> system_server <<< ./[email protected]1449455808843.txt:12:pid: 5992, tid: 6014, name: SensorService >>> system_server <<< ./[email protected]1449457509508.txt:12:pid: 11012, tid: 11887, name: Binder_E >>> system_server <<< ./[email protected]1449229865122.txt:12:pid: 18238, tid: 18260, name: SensorService >>> system_server <<<
可以看到每次crash的线程都不一样!甚至backtrace也不一样:
@[email protected] pid: 10466, tid: 10493, name: android.bg >>> system_server <<< signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xa00000070 ... backtrace: #00 pc 0000000000029f70 /system/lib64/libbinder.so (android::IPCThreadState::flushCommands()+4) #01 pc 0000000000009c60 /data/dalvik-cache/arm64/[email protected]@boot.oat
@[email protected] pid: 5992, tid: 6053, name: AlarmManager >>> system_server <<< signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x553d89e3c0 ... backtrace: #00 pc 0000000000030bc4 /system/lib64/libbinder.so (int android::Parcel::writeAligned<int>(int)+80) #01 pc 00000000000d3e0c /system/lib64/libandroid_runtime.so #02 pc 0000000000109630 /data/dalvik-cache/arm64/[email protected]@boot.oat
@[email protected] pid: 10466, tid: 10494, name: ActivityManager >>> system_server <<< signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0x7f0000000a ... backtrace: #00 pc 0000007f0000000a <unknown>
@[email protected] pid: 5992, tid: 6014, name: SensorService >>> system_server <<< signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xa00000068 backtrace: #00 pc 000000000000f508 /system/lib64/libsensorservice.so #01 pc 0000000000010e90 /system/lib64/libsensorservice.so #02 pc 00000000000179c0 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+188) #03 pc 000000000009277c /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+96) #04 pc 0000000000017224 /system/lib64/libutils.so #05 pc 000000000001cbb0 /system/lib64/libc.so (__pthread_start(void*)+52) #06 pc 0000000000019044 /system/lib64/libc.so (__start_thread+16)
@[email protected] pid: 11012, tid: 11887, name: Binder_E >>> system_server <<< signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0 backtrace: #00 pc 0000000000014070 /system/lib64/libutils.so (android::RefBase::decStrong(void const*) const+236) #01 pc 000000000002f694 /system/lib64/libbinder.so (android::Parcel::releaseObjects()+84) #02 pc 000000000002f6e8 /system/lib64/libbinder.so (android::Parcel::freeDataNoInit()+60) #03 pc 000000000002f744 /system/lib64/libbinder.so (android::Parcel::ipcSetDataReference(unsigned char const*, unsigned long, unsigned long long const*, unsigned long, void (*)(android::Parcel*, unsigned char const*, unsigned long, unsigned long long const*, unsigned long, void*), void*)+40) #04 pc 000000000002a44c /system/lib64/libbinder.so (android::IPCThreadState::executeCommand(int)+700) #05 pc 000000000002a6c8 /system/lib64/libbinder.so (android::IPCThreadState::getAndExecuteCommand()+92) #06 pc 000000000002a73c /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+76) #07 pc 0000000000031d68 /system/lib64/libbinder.so #08 pc 00000000000179c0 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+188) #09 pc 000000000009277c /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+96) #10 pc 0000000000017224 /system/lib64/libutils.so #11 pc 000000000001cbb0 /system/lib64/libc.so (__pthread_start(void*)+52) #12 pc 0000000000019044 /system/lib64/libc.so (__start_thread+16)
@[email protected] pid: 18238, tid: 18260, name: SensorService >>> system_server <<< signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x7f3d798d38 backtrace: #00 pc 000000000000dbc0 /system/lib64/libsensorservice.so #01 pc 000000000000ed44 /system/lib64/libsensorservice.so #02 pc 000000000000f3a8 /system/lib64/libsensorservice.so #03 pc 0000000000011078 /system/lib64/libsensorservice.so #04 pc 00000000000179c0 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+188) #05 pc 000000000009277c /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+96) #06 pc 0000000000017224 /system/lib64/libutils.so #07 pc 000000000001cbb0 /system/lib64/libc.so (__pthread_start(void*)+52) #08 pc 0000000000019044 /system/lib64/libc.so (__start_thread+16)
这种backtrace都不一样的问题很可能就是内存问题了,所谓内存问题指的就是野指针或内存越界。
【问题分析】
分析内存问题的第一步就是排查NE现场寄存器指向的内存值附近有没有规律。比如:
@[email protected] pid: 10466, tid: 10493, name: android.bg >>> system_server <<< signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xa00000070 x0 000000557e996710 x1 0000000a00000068 x2 0000007f79c31ba0 x3 0000000000000000 x4 000000006fc46a80 x5 0000000000000001 x6 0000000000000000 x7 000000557e4f527c x8 0000000000000000 x9 000000557e4f5278 x10 0000000000000000 x11 0000000000000000 x12 0000000000000000 x13 0000000000430000 x14 0000000000550000 x15 0000000000430000 x16 0000007f8f640320 x17 0000007f8eff4f6c x18 0000007f8c1d0470 x19 000000000000000a x20 0000007f8f590e9c x21 000000557e9b14d0 x22 000000001354bf40 x23 000000006fdf61f0 x24 0000007f79c31b20 x25 0000000012da2040 x26 0000000000000000 x27 00000000000f52be x28 0000000000000000 x29 0000000000358c82 x30 0000000072639c64 sp 0000007f79c31680 pc 0000007f8eff4f70 pstate 0000000080000000 backtrace: #00 pc 0000000000029f70 /system/lib64/libbinder.so (android::IPCThreadState::flushCommands()+4) #01 pc 0000000000009c60 /data/dalvik-cache/arm64/[email protected]@boot.oat
查看#00层代码:
$ aarch64-linux-android-objdump -D symbols/system/lib64/libbinder.so 0000000000029f6c <_ZN7android14IPCThreadState13flushCommandsEv>: 29f6c: f9400001 ldr x1, [x0] => 29f70: b9400822 ldr w2, [x1,#8] 29f74: 6b1f005f cmp w2, wzr 29f78: 5400006d b.le 29f84 <_ZN7android14IPCThreadState13flushCommandsEv+0x18> 29f7c: 52800001 mov w1, #0x0 // #0 29f80: 17ffd9ec b 20730 <[email protected]> 29f84: d65f03c0 ret
x1值是x0地址中取来的,0x0000000a00000068显然不是一个合法地址。可能是x0地址被覆盖了。
查看x0附近的内存值:
memory near x0: 000000557e9966f0 0000007f8f01e748 0000000000000000 000000557e996700 0000000000000000 0000000000000007 000000557e996710 0000000a00000068 0000007f0000000a 000000557e996720 00000438234cda2a 3de6de183dc20f78 000000557e996730 000000003d9e2680 0000000000000008 000000557e996740 0000000000000000 0000000000000000 000000557e996750 0000000000000000 0000000000000000 000000557e996760 0000000000010001 0000000000000000 000000557e996770 000000557e996840 0000000a00000068 000000557e996780 000000550000000a 000004382647caaa 000000557e996790 3de6de183dc20f78 000000003d9e2680 000000557e9967a0 0000000000000000 0000000000000000 000000557e9967b0 0000000000000000 0000000000000000 000000557e9967c0 0000007f8e010001 0000000000000000 000000557e9967d0 0000000000000000 000028e200000040 000000557e9967e0 0000000a00000068 000000550000000a
发现一个明显的规律:
0x0000000a00000068出现了多次,而且间隔都是13*8=104字节,每个块的结构都很相似,很可能这个数据是个结构体数组。
继续分析下一个tombstone:
@[email protected] pid: 5992, tid: 6053, name: AlarmManager >>> system_server <<< signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x553d89e3c0 x0 00000055837f4dc0 x1 0000000000000004 x2 0000000000000440 x3 0000000000020000 x4 0000000000000444 x5 000000553d89df80 x6 0000000000000000 x7 000000558336b27c x8 0000000000000000 x9 000000558336b278 x10 0000000000000000 x11 0000000000000000 x12 0000000000000000 x13 0000000000430000 x14 0000000000550000 x15 0000000000430000 x16 0000007f7e502478 x17 0000007f7e4dbb74 x18 0000007f7b6b0470 x19 00000055837f4dc0 x20 000000008080005c x21 0000005583838590 x22 000000008080005c x23 000000006fd15b08 x24 000000008080005c x25 000000003000003a x26 0000000012d5fe80 x27 0000000012c47400 x28 000000000000005c x29 0000007f68090bd0 x30 0000007f7ea66e10 sp 0000007f68090bd0 pc 0000007f7e4dbbc4 pstate 0000000080000000 backtrace: #00 pc 0000000000030bc4 /system/lib64/libbinder.so (int android::Parcel::writeAligned<int>(int)+80) #01 pc 00000000000d3e0c /system/lib64/libandroid_runtime.so #02 pc 0000000000109630 /data/dalvik-cache/arm64/[email protected]@boot.oat
查看#00层代码:
$ aarch64-linux-android-objdump -D symbols/system/lib64/libbinder.so 0000000000030b74 <_ZN7android6Parcel12writeAlignedIiEEiT_>: 30b74: a9be7bfd stp x29, x30, [sp,#-32]! 30b78: 910003fd mov x29, sp 30b7c: a90153f3 stp x19, x20, [sp,#16] 30b80: aa0003f3 mov x19, x0 30b84: 2a0103f4 mov w20, w1 30b88: f9401002 ldr x2, [x0,#32] 30b8c: f9400c03 ldr x3, [x0,#24] 30b90: 91001044 add x4, x2, #0x4 30b94: eb03009f cmp x4, x3 30b98: 54000109 b.ls 30bb8 <_ZN7android6Parcel12writeAlignedIiEEiT_+0x44> 30b9c: d2800081 mov x1, #0x4 // #4 30ba0: 97ffbbe4 bl 1fb30 <[email protected]> 30ba4: 34000080 cbz w0, 30bb4 <_ZN7android6Parcel12writeAlignedIiEEiT_+0x40> 30ba8: a94153f3 ldp x19, x20, [sp,#16] 30bac: a8c27bfd ldp x29, x30, [sp],#32 30bb0: d65f03c0 ret 30bb4: f9401262 ldr x2, [x19,#32] 30bb8: f9400665 ldr x5, [x19,#8] 30bbc: aa1303e0 mov x0, x19 30bc0: d2800081 mov x1, #0x4 // #4 => 30bc4: b82268b4 str w20, [x5,x2] 30bc8: a94153f3 ldp x19, x20, [sp,#16] 30bcc: a8c27bfd ldp x29, x30, [sp],#32 30bd0: 17ffbf3c b 208c0 <[email protected]>
NE的原因是x5值非法,而x5是从x19中取出来的,查看x19值:
memory near x19: 00000055837f4da0 0000000200000004 0000000a00000068 00000055837f4db0 000000550000000a 0000015263c04a64 00000055837f4dc0 3d5762703e10ca1c 000000553d89df80 00000055837f4dd0 0000000000000440 0000000000020000 00000055837f4de0 0000000000000440 0000000000000000 00000055837f4df0 0000000000000000 0000000000000000 00000055837f4e00 0000000000000000 0000000000010001 00000055837f4e10 0000000a00000068 000000550000000a 00000055837f4e20 0000015266bb3ae4 3d5762703e10ca1c 00000055837f4e30 0000007f3d89df80 000000558379c020 00000055837f4e40 000000558382ea70 0000656c62617300 00000055837f4e50 0000000000000000 00000000000000d3 00000055837f4e60 0000007f7eb1dc28 0000000000000000 00000055837f4e70 0000007f7c8f2348 0000000a00000068 00000055837f4e80 000000020000000a 0000015269b62b64 00000055837f4e90 3d5762703e10ca1c 000000003d89df80
被破坏的现场合前面一个现场几乎相同!
后面几个就直接在tombstone里搜索0000000a00000068,发现每一个都是有0000000a00000068, 且间隔都是104字节。
@[email protected] memory near x1: 000000557e996a50 0000000a00000068 000000000000000a 000000557e996a60 000004383b245e2a 3de6de183dc20f78 000000557e996a70 000000003d9e2680 0000000000000000 000000557e996a80 0000000000000007 00000000000f8327 000000557e996a90 0000007f89b05070 0000000000000000 000000557e996aa0 0000007f79a26000 0000007f00000000 000000557e996ab0 0000000000000000 0000000a00000068 000000557e996ac0 000000000000000a 000004383e1f4eaa 000000557e996ad0 3de6de183dc20f78 000000003d9e2680 000000557e996ae0 0000000012d94a90 0000000000000000 000000557e996af0 0000007f79a24000 0000000000103000 000000557e996b00 0000000000000000 0000000000000000 000000557e996b10 0000000000000000 0000000000000000 000000557e996b20 0000000a00000068 000000000000000a 000000557e996b30 00000438411a3f2a 3de6de183dc20f78 000000557e996b40 000000553d9e2680 000000557e915720
@[email protected] memory near x9: 000000558380c828 000000007614b620 7461003b72656e65 000000558380c838 0000000000000000 000000558380d530 000000558380c848 0000000a00000068 000000550000000a 000000558380c858 0000015d3d53dc64 3d5762703e10ca1c 000000558380c868 000000003d89df80 0000007f7c8f1fd8 000000558380c878 000000020000008d 0000000200000004 000000558380c888 00000000753a770c 0000000000000000 000000558380c898 0000000000000234 0000000000000000 000000558380c8a8 0000000000000000 0000000000000000 000000558380c8b8 0000000000000000 0000007f7c7f9860 000000558380c8c8 0000000000000101 00000000753a770c 000000558380c8d8 0000000000000000 0000000000000234 000000558380c8e8 0000000000000000 0000000000000000 000000558380c8f8 0000000000000000 000000558337a320 000000558380c908 0000000000000001 00000000d6000025 000000558380c918 0000000000000000 0000000000000000
@[email protected] memory near x19: 0000005594c52190 3d7e9cb83e33b9be 000000003d567500 0000005594c521a0 0000000000000000 0000000000000000 0000005594c521b0 0000000100000000 0000000000000000 0000005594c521c0 0000000000000000 0000000000000160 0000005594c521d0 0000000000000000 0000007f9d76b470 0000005594c521e0 0000000a00000068 000000110000000a 0000005594c521f0 000001d7e54f0ade 3d7e9cb83e33b9be 0000005594c52200 000000003d567500 0000000000000000 0000005594c52210 0000000000000000 0000005594bc6700 0000005594c52220 0000000000000000 0000000000000000 0000005594c52230 0000000000000081 0000005500000000 0000005594c52240 0000007f9e6baf60 0000000a00000068 0000005594c52250 0000007f0000000a 000001d7e849fb5e 0000005594c52260 3d7e9cb83e33b9be 0000007f3d567500 0000005594c52270 0000007f9e6baf00 0000000000110000 0000005594c52280 0000000000000000 0000000000000000
@[email protected] memory near x0: 0000005599644de0 0000000a00000068 000000550000000a 0000005599644df0 0000066c78c6a111 3d8b65883e0ed844 0000005599644e00 0000007f3d798d00 0000005599648310 0000005599644e10 0000005599644ca8 0000005599644cd8 0000005599644e20 0000000400000004 420ba3d700000000 0000005599644e30 40c333333c2f4f0e 0000000300000000 0000005599644e40 0000000000000000 0000000a00000068 0000005599644e50 000000550000000a 0000066c7bc19191 0000005599644e60 3d8b65883e0ed844 0000007f3d798d00 0000005599644e70 0000000000000080 0000000000000053 0000005599644e80 0000000000000051 0000000000000046 0000005599644e90 0000005599644ed0 0000007f97c18000 0000005599644ea0 0000000000000000 0000007f97c18000 0000005599644eb0 0000000a00000068 000000010000000a 0000005599644ec0 0000066c7ebc8211 3d8b65883e0ed844 0000005599644ed0 61642f613d798d00 6361632d6b69766c
至此基本上能确定就是这个大小为104字节,带有0x0000000a00000068这个pattern的结构体数组覆盖正常内存导致的。
下一步就是要确定这个0x0000000a00000068属于哪个结构体。
从tombstone的maps数据中可以知道每次出现问题的地址都是堆内存,因此覆盖和被覆盖的内存也应该都是malloc出来的。
@[email protected] ... 00000055996ba000-0000005599d24fff rw- 6729728 [heap] ...
现在我们可以用hook工具hook free函数,然后再free时判断内存里是否有0000000a00000068,如果有这个pattern就打印调用栈。
代码如下:
#if defined(__aarch64__) extern "C" void free(void* p); extern "C" void inject_free(void* p) { if(p != NULL) { size_t* head = (size_t*)((char*)p-sizeof(size_t)); //这里通过heap chunk的head信息得出指针的大小 size_t size = *head & ~7,i=0; int* p_int = (int*)p; while(i < size/sizeof(int)) { if(*(p_int+i)== 0x00000068 && *(p_int+i+1)==0x0000000a ) { //这里判断内存中是否有pattern LOGD("size=%llu",size); size_t j = 0; while(j < size/sizeof(int)) { LOGD("p_int[%d]=%08x",j,*(p_int+j)); j++; } dump_java_stack(); dump_native_stack(); } i++; } } free(p); //这里再调用真正的free } #endif
打开12306复现问题,发现退出应用时,有如下日志:
# logcat |grep INJECT D/INJECT ( 912): size=102352 D/INJECT ( 912): p_int[0]=00000068 D/INJECT ( 912): p_int[1]=0000000a D/INJECT ( 912): p_int[2]=0000000a D/INJECT ( 912): p_int[3]=00000055 D/INJECT ( 912): p_int[4]=0eaa916c D/INJECT ( 912): p_int[5]=0000493a D/INJECT ( 912): p_int[6]=bdaadb00 D/INJECT ( 912): p_int[7]=bcd10f00 D/INJECT ( 912): p_int[8]=be057140 D/INJECT ( 912): p_int[9]=00000000 ... D/INJECT ( 912): p_int[25586]=00007041 D/INJECT ( 912): p_int[25587]=00000000 D/INJECT ( 912): #00 pc 0000000000000a24 /system/lib64/libinjectapis.so (inject_free+236) D/INJECT ( 912): #01 pc 000000000000fc28 /system/lib64/libsensorservice.so D/INJECT ( 912): #02 pc 000000000000fcf4 /system/lib64/libsensorservice.so D/INJECT ( 912): #03 pc 00000000000141b4 /system/lib64/libutils.so (android::RefBase::decStrong(void const*) const+560) D/INJECT ( 912): #04 pc 0000000000029858 /system/lib64/libbinder.so (android::IPCThreadState::processPendingDerefs()+140) D/INJECT ( 912): #05 pc 000000000002a734 /system/lib64/libbinder.so (android::IPCThreadState::joinThreadPool(bool)+68) D/INJECT ( 912): #06 pc 0000000000031d68 /system/lib64/libbinder.so D/INJECT ( 912): #07 pc 00000000000179c0 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+188) D/INJECT ( 912): #08 pc 000000000009277c /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+96) D/INJECT ( 912): #09 pc 0000000000017224 /system/lib64/libutils.so D/INJECT ( 912): #10 pc 000000000001cbb0 /system/lib64/libc.so (__pthread_start(void*)+52) D/INJECT ( 912): #11 pc 0000000000019044 /system/lib64/libc.so (__start_thread+16)
看来是抓到这个罪魁祸首了!
用addr2line看看是哪个文件哪一行:
$ aarch64-linux-android-addr2line -f -e symbols/system/lib64/libsensorservice.so fc28 _ZN7android13SensorService21SensorEventConnectionD1Ev /home/mi/disk/2-v6-l-hermes-dev/frameworks/native/services/sensorservice/SensorService.cpp:1021
@frameworks/native/services/sensorservice/SensorService.cpp SensorService::SensorEventConnection::~SensorEventConnection() { ALOGD_IF(DEBUG_CONNECTIONS, "~SensorEventConnection(%p)", this); mService->cleanupConnection(this); if (mEventCache != NULL) { delete mEventCache; } }
就是delete mEventCache语句释放带有0000000a00000068的内存,看看它是被创建的代码:
@frameworks/native/services/sensorservice/SensorService.cpp status_t SensorService::SensorEventConnection::sendEvents( sensors_event_t const* buffer, size_t numEvents, sensors_event_t* scratch, SensorEventConnection const * const * mapFlushEventsToConnections) { ... mEventCache = new sensors_event_t[mMaxCacheSize];
这个mEventCache确实是个结构体数组指针。这个new sensors_event_t的定义如下:
@hardware/libhardware/include/hardware/sensors.h typedef struct sensors_event_t { /* must be sizeof(struct sensors_event_t) */ int32_t version; /* sensor identifier */ int32_t sensor; /* sensor type */ int32_t type; /* reserved */ int32_t reserved0; /* time is in nanosecond */ int64_t timestamp; union { ... } /* Reserved flags for internal use. Set to zero. */ uint32_t flags; uint32_t reserved1[3]; } sensors_event_t;
0x0000000a00000068中的68就是version,也就是这个结构体的大小,0x68=104字节,完全符合pattern。
接下来就是排查mEventCache,首先排除野指针,因为复现问题前并没有打印free的log,那只能是内存越界了。
再SensorService.cpp中所有能写mEventCache的地方加了log,发现复现问题时也没有相关log打印出来。
又在原生机器试了一下,发现原生没有这个问题,所以可能不是framework的SensorService代码问题。
sensors_event_t是hardware的头文件里定义的,由次可以判断sensors_event_t这个数据应该是hardware层给framework的buffer里写的。
有是只有MTk机器上出现的问题,所以问题很可能是出在hardware层。
在vendor下搜索sensors_event_t,会发现有很多sensor的实现,不太好定位问题:
vendor$ cgrep sensors_event_t ./mediatek/proprietary/hardware/sensor/InPocket.h:45: sensors_event_t mPendingEvent; ./mediatek/proprietary/hardware/sensor/InPocket.h:50: virtual int readEvents(sensors_event_t* data, int count); ./mediatek/proprietary/hardware/sensor/Activity.cpp:45: mPendingEvent.version = sizeof(sensors_event_t); ./mediatek/proprietary/hardware/sensor/Activity.cpp:225:int ActivitySensor::readEvents(sensors_event_t* data, int count) ./mediatek/proprietary/hardware/sensor/Shake.cpp:46: mPendingEvent.version = sizeof(sensors_event_t); ./mediatek/proprietary/hardware/sensor/Shake.cpp:216:int ShakeSensor::readEvents(sensors_event_t* data, int count) ./mediatek/proprietary/hardware/sensor/PickUp.h:45: sensors_event_t mPendingEvent; ./mediatek/proprietary/hardware/sensor/PickUp.h:50: virtual int readEvents(sensors_event_t* data, int count); ./mediatek/proprietary/hardware/sensor/Hwmsen.h:96: sensors_event_t mPendingEvents[numSensors]; ... ./mediatek/proprietary/hardware/sensor/PickUp.cpp:215:int PickUpSensor::readEvents(sensors_event_t* data, int count) ./mediatek/proprietary/hardware/sensor/GlanceGesture.cpp:47: mPendingEvent.version = sizeof(sensors_event_t); ./mediatek/proprietary/hardware/sensor/GlanceGesture.cpp:208:int GlanceGestureSensor::readEvents(sensors_event_t* data, int count) ./mediatek/proprietary/hardware/sensor/FaceDown.cpp:44: mPendingEvent.version = sizeof(sensors_event_t); ./mediatek/proprietary/hardware/sensor/FaceDown.cpp:213:int FaceDownSensor::readEvents(sensors_event_t* data, int count) ./mediatek/proprietary/hardware/sensor/AmbienteLight.cpp:62: mPendingEvent.version = sizeof(sensors_event_t); ./mediatek/proprietary/hardware/sensor/AmbienteLight.cpp:250:int AmbiLightSensor::readEvents(sensors_event_t* data, int count) ./mediatek/proprietary/hardware/sensor/HeartRate.h:44: sensors_event_t mPendingEvent; ./mediatek/proprietary/hardware/sensor/HeartRate.h:49: virtual int readEvents(sensors_event_t* data, int count); ./mediatek/proprietary/hardware/sensor/Proximity.h:64: sensors_event_t mPendingEvent; ./mediatek/proprietary/hardware/sensor/Proximity.h:69: virtual int readEvents(sensors_event_t* data, int count); ./mediatek/proprietary/hardware/sensor/GlanceGesture.h:45: sensors_event_t mPendingEvent; ./mediatek/proprietary/hardware/sensor/GlanceGesture.h:50: virtual int readEvents(sensors_event_t* data, int count); ./mediatek/proprietary/hardware/sensor/Proximity.cpp:64: mPendingEvent.version = sizeof(sensors_event_t); ./mediatek/proprietary/hardware/sensor/Proximity.cpp:253:int ProximitySensor::readEvents(sensors_event_t* data, int count)
0x0000000a00000068中的0xa是sensors_event_t结构中的sensor的值,这个看起来是sensor ID,这些ID是在device下定义的。
@device/mediatek/common/kernel-headers/linux/hwmsensor.h #define SENSOR_TYPE_LINEAR_ACCELERATION 10 #define SENSOR_TYPE_ROTATION_VECTOR 11 ... #define SENSOR_TYPE_SIGNIFICANT_MOTION 17 #define SENSOR_TYPE_STEP_DETECTOR 18 ... #define ID_BASE 0 ... #define ID_ACCELEROMETER (ID_BASE+SENSOR_TYPE_ACCELEROMETER-1) #define ID_LINEAR_ACCELERATION (ID_BASE+SENSOR_TYPE_LINEAR_ACCELERATION-1) #define ID_ROTATION_VECTOR (ID_BASE+SENSOR_TYPE_ROTATION_VECTOR-1) #define ID_GRAVITY (ID_BASE+SENSOR_TYPE_GRAVITY-1) #define ID_GYROSCOPE (ID_BASE+SENSOR_TYPE_GYROSCOPE-1)
这个ID是ID_ROTATION_VECTOR,这样可以进一步缩小范围
vendor$ cgrep ID_ROTATION_VECTOR ./mediatek/proprietary/hardware/sensor/nusensors.cpp:145: case ID_ROTATION_VECTOR: ./mediatek/proprietary/hardware/sensor/hwmsen_chip_info.c:766: .handle = ID_ROTATION_VECTOR+ID_OFFSET, ./mediatek/proprietary/hardware/sensor/BstSensor.cpp:207: case ID_ROTATION_VECTOR: ./mediatek/proprietary/hardware/sensor/BstSensor.cpp:265: case ID_ROTATION_VECTOR: ./mediatek/proprietary/hardware/sensor/BstSensor.cpp:352: case ID_ROTATION_VECTOR: ./mediatek/proprietary/hardware/sensor/BstSensor.cpp:411: case ID_ROTATION_VECTOR:
这里由于对sensors_event_t不熟悉,所以后面对ID的判断其实是很虚的,所以没有继续往下跟。决定直接上coredump。
由于此时已经找到了稳定的复现路径,所以抓取coredump也容易了,一共抓了5组coredump。
对每一份做了内存分析,发现都是大量的数据被覆盖,至少是几百K,
而且看起来出问题的时候还再写(出问题时结束位置——104字节对齐不明显),起始位置也没有堆chunk的head。
其中有一个对应的tombstone很可疑:
pid: 15788, tid: 15810, name: SensorService >>> system_server <<< signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xa00000080 backtrace: #00 pc 00000000000096a0 /system/lib64/hw/sensors.mt6795.so (sensors_poll_context_t::pollEvents(sensors_event_t*, int)+400) #01 pc 000000000000a474 /system/lib64/libsensorservice.so #02 pc 0000000000010eb0 /system/lib64/libsensorservice.so #03 pc 00000000000179c0 /system/lib64/libutils.so (android::Thread::_threadLoop(void*)+188) #04 pc 000000000009277c /system/lib64/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+96) #05 pc 0000000000017224 /system/lib64/libutils.so #06 pc 000000000001cbb0 /system/lib64/libc.so (__pthread_start(void*)+52) #07 pc 0000000000019044 /system/lib64/libc.so (__start_thread+16)
这个backtrace #00的sensors.mt6795.so库就是sensor hardware的代码,而当前还是确实也在操作sensors_event_t结构的指针。
加载该tombstone对应的coredump:
(gdb) core core-SensorService-15788 (gdb) bt #0 0x0000007f6a79d6a0 in sensors_poll_context_t::pollEvents (this=0x5575a43b20, data=0x5575a60c78, count=-653) at vendor/mediatek/proprietary/hardware/sensor/nusensors.cpp:470 #1 0x0000007f6a7e6478 in android::SensorDevice::poll ([email protected]=0x5575a863a0, buffer=0x5575a49b30, [email protected]=256) at frameworks/native/services/sensorservice/SensorDevice.cpp:123 #2 0x0000007f6a7eceb4 in android::SensorService::threadLoop (this=0x5575a43880) at frameworks/native/services/sensorservice/SensorService.cpp:409 #3 0x0000007f7f99c9c4 in android::Thread::_threadLoop ([email protected]=0x5575a438a0) at system/core/libutils/Threads.cpp:776 #4 0x0000007f7fc9f780 in android::AndroidRuntime::javaThreadShell (args=<optimized out>) at frameworks/base/core/jni/AndroidRuntime.cpp:1258 #5 0x0000007f7f99c228 in thread_data_t::trampoline (t=<optimized out>) at system/core/libutils/Threads.cpp:101 #6 0x0000007f7fe56bb4 in __pthread_start (arg=0x7f7ba4e000, [email protected]=<error reading variable: value has been optimized out>) at bionic/libc/bionic/pthread_create.cpp:141 #7 0x0000007f7fe53048 in __start_thread (fn=<optimized out>, arg=<optimized out>) at bionic/libc/bionic/clone.cpp:41 #8 0x0000000000000000 in ?? ()
backtrace中,#0里的count是-653,而#1的count值是256。
也 就是说android::SensorDevice::poll()调用sensors_poll_context_t::pollEvents() 时,count还是256,但在 sensors_poll_context_t::pollEvents()做处理时变成了负数。
一般count不应该是负数,所以很可能 sensors_poll_context_t::pollEvents()的逻辑有问题,
加上这个函数又是上面列出的重点怀疑的文件nusensors.cpp中定义的。所以这里出问题的可能性很大。
@vendor/mediatek/proprietary/hardware/sensor/nusensors.cpp int sensors_poll_context_t::pollEvents(sensors_event_t* data, int count) { int nbEvents = 0; int n = 0; //ALOGE("pollEvents count =%d",count ); do { // see if we have some leftover from the last poll() for (int i=0 ; count && i<numSensorDrivers ; i++) { SensorBase* const sensor(mSensors[i]); if ((mPollFds[i].revents & POLLIN) || (sensor->hasPendingEvents())) { int nb = sensor->readEvents(data, count); ... //if(nb < 0||nb > count) //ALOGE("pollEvents count error nb:%d, count:%d, nbEvents:%d", nb, count, nbEvents);//for sensor NE debug count -= nb; nbEvents += nb; data += nb; //if(nb < 0||nb > count) // ALOGE("pollEvents count error nb:%d, count:%d, nbEvents:%d", nb, count, nbEvents);//for sensor NE debug } } ... } while (n && count); return nbEvents; }
关键是下面这一句:
nb = sensor->readEvents(data, count);
这句话的意思应该是往data指向的地址中填写event,count是要读取的event个数,nb返回的是已读取的event个数。
这种方法类似于read()函数的,while循环能保证读取到count个数的event。
而这时如果nb返回异常呢?当然,从gdb中可以确定nb确实异常了。
然而我们sensors_poll_context_t::pollEvents()中没有对这个异常值做任何的保护,所以就会出问题。
当count为负数时sensors_poll_context_t::pollEvents()的while循环很难会退出,所以越界覆盖的buffer也应该很长。
当覆盖的内存刚好被别的线程访问,那个线程就会crash,而此时应该有线程在执行sensors_poll_context_t::pollEvents()函数。
再看看其他core:
(gdb) core core-InputReader-32381 (gdb) bt #0 art::Mutex::ExclusiveLock ([email protected]=0xa00000068, [email protected]=0x5597cc1f80) at art/runtime/base/mutex.cc:311 #1 0x0000007fafb410e0 in MutexLock (mu=..., self=0x5597cc1f80, this=<synthetic pointer>) at art/runtime/base/mutex.h:423 #2 art::gc::allocator::RosAlloc::AllocFromRun (this=0x5597c1efe0, [email protected]=0x5597cc1f80, [email protected]=20, bytes_allocated=0x75f8c7f8, [email protected]=0x7f9c42f088) at art/runtime/gc/allocator/rosalloc.cc:672 #3 0x0000007fafd57650 in Alloc<true> (bytes_allocated=0x7f9c42f088, size=20, self=0x5597cc1f80, this=<optimized out>) at art/runtime/gc/allocator/rosalloc-inl.h:33 #4 AllocCommon<true> (usable_size=0x7f9c42f080, bytes_allocated=0x7f9c42f078, num_bytes=20, self=0x5597cc1f80, this=<optimized out>) at art/runtime/gc/space/rosalloc_space-inl.h:57 #5 AllocNonvirtual (usable_size=0x7f9c42f080, bytes_allocated=0x7f9c42f078, num_bytes=20, self=0x5597cc1f80, this=<optimized out>) at art/runtime/gc/space/rosalloc_space.h:71 #6 TryToAllocate<false, false> (usable_size=0x7f9c42f080, bytes_allocated=0x7f9c42f078, alloc_size=20, allocator_type=art::gc::kAllocatorTypeRosAlloc, self=0x5597cc1f80, this=<optimized out>) at art/runtime/gc/heap-inl.h:208 #7 AllocObjectWithAllocator<false, false, art::VoidFunctor> (pre_fence_visitor=..., allocator=art::gc::kAllocatorTypeRosAlloc, byte_count=20, klass=0x6fce4650, self=0x5597cc1f80, this=<optimized out>) at art/runtime/gc/heap-inl.h:80 #8 Alloc<false, false> (allocator_type=art::gc::kAllocatorTypeRosAlloc, self=0x5597cc1f80, this=<optimized out>) at art/runtime/mirror/class-inl.h:561 #9 AllocObjectFromCodeInitialized<false> (method=<optimized out>, allocator_type=art::gc::kAllocatorTypeRosAlloc, self=0x5597cc1f80, klass=<optimized out>) at art/runtime/entrypoints/entrypoint_utils-inl.h:170
上面这个是被覆盖的线程,可以用info thread查看同一时刻其他线程的状态:
(gdb) info thread Id Target Id Frame 100 LWP 1525 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 99 LWP 32456 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 98 LWP 1586 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 97 LWP 581 nanosleep () at bionic/libc/arch-arm64/syscalls/nanosleep.S:7 96 LWP 2156 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 95 LWP 32465 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 94 LWP 32464 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 ... 21 LWP 32483 __accept4 () at bionic/libc/arch-arm64/syscalls/__accept4.S:7 20 LWP 32444 recvmsg () at bionic/libc/arch-arm64/syscalls/recvmsg.S:7 19 LWP 32403 sensors_poll_context_t::pollEvents (this=0x5597c20eb0, data=0x5597c21ca0, count=-402) at vendor/mediatek/proprietary/hardware/sensor/nusensors.cpp:470 18 LWP 32413 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 ... 5 LWP 1106 __ppoll () at bionic/libc/arch-arm64/syscalls/__ppoll.S:7 4 LWP 923 __ioctl () at bionic/libc/arch-arm64/syscalls/__ioctl.S:7 3 LWP 32441 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 2 LWP 32467 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 * 1 LWP 32442 art::Mutex::ExclusiveLock ([email protected]=0xa00000068, [email protected]=0x5597cc1f80) at art/runtime/base/mutex.cc:311
上面19号线程可以看出,它正在执行sensors_poll_context_t::pollEvents(),同时count值也是负数。
再看一个:
(gdb) core core-SensorEventAckR-15999 (gdb) bt #0 android::Looper::pollInner (this=0x558c9e0bd0, timeoutMillis=<optimized out>) at system/core/libutils/Looper.cpp:286 #1 0x0000007f9e1cc884 in android::Looper::pollOnce (this=0x558c9e0bd0, [email protected]=-1, [email protected]=0x0, [email protected]=0x0, [email protected]=0x0) at system/core/libutils/Looper.cpp:191 #2 0x0000007f89014708 in pollOnce (timeoutMillis=-1, this=<optimized out>) at system/core/include/utils/Looper.h:265 #3 android::SensorService::SensorEventAckReceiver::threadLoop (this=0x7f88fbf9e8) at frameworks/native/services/sensorservice/SensorService.cpp:559 #4 0x0000007f9e1c89c4 in android::Thread::_threadLoop ([email protected]=0x7f88fbf9e8) at system/core/libutils/Threads.cpp:776 #5 0x0000007f9e4cb780 in android::AndroidRuntime::javaThreadShell (args=<optimized out>) at frameworks/base/core/jni/AndroidRuntime.cpp:1258 #6 0x0000007f9e1c8228 in thread_data_t::trampoline (t=<optimized out>) at system/core/libutils/Threads.cpp:101 #7 0x0000007f9e682bb4 in __pthread_start (arg=0x7f95f99000, [email protected]=<error reading variable: value has been optimized out>) at bionic/libc/bionic/pthread_create.cpp:141 #8 0x0000007f9e67f048 in __start_thread (fn=<optimized out>, arg=<optimized out>) at bionic/libc/bionic/clone.cpp:41 #9 0x0000000000000000 in ?? ()
(gdb) info thread Id Target Id Frame 101 LWP 17316 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 100 LWP 17588 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 99 LWP 16075 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 98 LWP 16080 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 97 LWP 20823 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 96 LWP 16082 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 95 LWP 19034 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 ... 11 LWP 16036 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 10 LWP 16392 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 9 LWP 16021 0x0000007f88fc96a0 in sensors_poll_context_t::pollEvents (this=0x558c5cbe40, data=0x558ca1fcc0, count=-3472) at vendor/mediatek/proprietary/hardware/sensor/nusensors.cpp:470 8 LWP 16107 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 7 LWP 16031 __epoll_pwait () at bionic/libc/arch-arm64/syscalls/__epoll_pwait.S:9 6 LWP 16034 0x0000007f9e738104 in ?? () 5 LWP 16104 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 4 LWP 16061 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 3 LWP 16060 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 2 LWP 16085 syscall () at bionic/libc/arch-arm64/bionic/syscall.S:41 * 1 LWP 16024 android::Looper::pollInner (this=0x558c9e0bd0, timeoutMillis=<optimized out>) at system/core/libutils/Looper.cpp:286
加入debug code:
int sensors_poll_context_t::pollEvents(sensors_event_t* data, int count) { int nbEvents = 0; int n = 0; //ALOGE("pollEvents count =%d",count );do { // see if we have some leftover from the last poll() for (int i=0 ; count && i<numSensorDrivers ; i++) { SensorBase* const sensor(mSensors[i]); if ((mPollFds[i].revents & POLLIN) || (sensor->hasPendingEvents())) { int nb = sensor->readEvents(data, count); if(nb < 0 || nb > count) { nb += *(int*)0; //运行到这里时,触发native crash } ...
重新抓取coredump:
(gdb) bt #0 sensors_poll_context_t::pollEvents (this=0x55827f7610, data=0x5582c7b170, count=256) at vendor/mediatek/proprietary/hardware/sensor/nusensors.cpp:474 #1 0x0000007f92352478 in android::SensorDevice::poll (this=this@entry=0x5582be0ee0, buffer=0x5582c7b170, [email protected]=256) at frameworks/native/services/sensorservice/SensorDevice.cpp:123 #2 0x0000007f92358eb4 in android::SensorService::threadLoop (this=0x55827f7370) at frameworks/native/services/sensorservice/SensorService.cpp:409 #3 0x0000007fa75089c4 in android::Thread::_threadLoop ([email protected]=0x55827f7390) at system/core/libutils/Threads.cpp:776 #4 0x0000007fa780b780 in android::AndroidRuntime::javaThreadShell (args=<optimized out>) at frameworks/base/core/jni/AndroidRuntime.cpp:1258 #5 0x0000007fa7508228 in thread_data_t::trampoline (t=<optimized out>) at system/core/libutils/Threads.cpp:101 #6 0x0000007fa79c2bb4 in __pthread_start (arg=0x7fa35b9000, [email protected]=<error reading variable: value has been optimized out>) at bionic/libc/bionic/pthread_create.cpp:141 #7 0x0000007fa79bf048 in __start_thread (fn=<optimized out>, arg=<optimized out>) at bionic/libc/bionic/clone.cpp:41 #8 0x0000000000000000 in ?? ()
查看变量:
(gdb) info locals nb = 2235 sensor = <optimized out> i = 5 nbEvents = 0 n = 1
nb值(2235)确实比count
值(256)大!
再看看是哪个sensor:
(gdb) p mSensors $2 = {0x5582c2db80, 0x55827f80c0, 0x5582c22dd0, 0x5582c23ec0, 0x5582c27fa0, 0x5582be02b0, 0x5582c268b0, 0x5582c251c0, 0x5582c296a0, 0x5582c2ada0, 0x5582c2c490, 0x5582c30560, 0x5582c33340, 0x5582c31c50, 0x5582c70a60, 0x5582c71b40, 0x5582c73230, 0x5582c74920, 0x5582c76010, 0x5582c77700, 0x5582c78df0}
i值是5,那sensor就是0x5582be02b0了。
(gdb) p *(SensorBase*) 0x5582be02b0 $3 = {_vptr.SensorBase = 0x7f923317c0 <vtable for BstSensor+16>, dev_name = 0x0, data_name = 0x7f9231c698 "/data/misc/sensor/fifo_dat", dev_fd = -1, data_fd = 40}
找到虚函数表指针为0x7f923317c0,看下虚函数表中的内容:
(gdb) x /16gx 0x7f923317c0 0x7f923317c0 <_ZTV9BstSensor+16>: 0x0000007f92317e58 0x0000007f92317e94 0x7f923317d0 <_ZTV9BstSensor+32>: 0x0000007f923181a4 0x0000007f92309ab8 0x7f923317e0 <_ZTV9BstSensor+48>: 0x0000007f92309ab0 0x0000007f92317eb8 0x7f923317f0 <_ZTV9BstSensor+64>: 0x0000007f92318020 0x0000007f92317e18 0x7f92331800 <_ZTV9BstSensor+80>: 0x0000007f92317e20 0x0000000000000001 0x7f92331810: 0x000000000000283c 0x0000000000000001 0x7f92331820: 0x0000000000002844 0x0000000000000001 0x7f92331830: 0x0000000000002851 0x0000000000000001
可以知道,这个sensor类是BstSensor。又挨个试得知对应的readEvents
为0x0000007f923181a4。
(gdb) disassemble 0x0000007f923181a4 Dump of assembler code for function BstSensor::readEvents(sensors_event_t*, int): 0x0000007f923181a4 <+0>: stp x29, x30, [sp,#-176]! 0x0000007f923181a8 <+4>: cmp w2, wzr 0x0000007f923181ac <+8>: mov x29, sp 0x0000007f923181b0 <+12>: stp x19, x20, [sp,#16]
看源码:
@vendor/mediatek/proprietary/hardware/sensor/BstSensor.cpp int BstSensor::readEvents(sensors_event_t *pdata, int count) { ...while (rslt < count) { err = read(data_fd, &sensor_data, sizeof(sensor_data)); ... delta_mod= (float)(time-bst_last_ts[index])/(float)(bst_delay[index]); /* if delta_mode > x.5, loopcount = x+1 */ loopcount = (int)(delta_mod+0.5); ...for(int i=1; i<=loopcount; i++) { pdata_cur->version = sizeof(*pdata_cur); pdata_cur->sensor = sensor; pdata_cur->timestamp = time - (loopcount-i)*bst_delay[index]; switch (sensor) { ...case ID_ROTATION_VECTOR: pdata_cur->data[0] = sensor_data.data.data[0]; pdata_cur->data[1] = sensor_data.data.data[1]; pdata_cur->data[2] = sensor_data.data.data[2]; pdata_cur->data[3] = sensor_data.data.data[3]; pdata_cur->data[4] = sensor_data.data.data[4]; pdata_cur->type = SENSOR_TYPE_ROTATION_VECTOR; break; default: ALOGE("<BST> " "Invalid data pkt"); return rslt; } rslt++; pdata_cur++; } bst_last_ts[index] = time; } return rslt; }
关键是这个loopcount大于传入buffer的大小count的时候,返回值有可能大于count。这里可能需要做保护!
【解决方案】
diff --git a/proprietary/hardware/sensor/BstSensor.cpp b/proprietary/hardware/sensor/BstSensor.cpp index cc21b37..ea84bea 100755 --- a/proprietary/hardware/sensor/BstSensor.cpp +++ b/proprietary/hardware/sensor/BstSensor.cpp @@ -320,7 +320,9 @@ int BstSensor::readEvents(sensors_event_t *pdata, int count) sensors_event_t *pdata_cur; int loopcount; float delta_mod; + int count_bk; + count_bk = count; if (count > 1) { count = 1; } @@ -388,7 +390,10 @@ int BstSensor::readEvents(sensors_event_t *pdata, int count) bst_last_ts[index] = time; loopcount = 1; } - + if(loopcount >=count_bk) + { + loopcount = count_bk; + } for(int i=1; i<=loopcount; i++) {