timer bug导致系统hard lockup分析

　　问题现象：

　　系统（android）正常运行一段时间之后就会panic，查看kernel的log发现是发生了hard lockup，相关的log如下：

<4>[ 1815.827575] c0 cpu0 detected cpu1 has HARDLOCKUP!
<4>[ 1815.828063] c0 Modules linked in: sd8xxx(O) citty(O) mlan(O) exfat_fs(P) exfat_core(P) geu(O) galcore(O)

　　什么是HARDLOCKUP，简单来说，就是在多核（4核）系统中，如果其中一个核（core1）长时间没有响应timer中断，会被逻辑上的上一个核（core0）检测到，然后系统就会panic。而每次timer中断中watchdog thread都会调度到，所以我们从watchdog thread出发进行调试，首先利用crash工具来查看watchdog/1 上一次被调度到是什么时候。

crash> ps
3008 2 1 e0e29e00 IN 0.0 0 0 [watchdog/1]
crash> struct task_struct e0e29e00
struct task_struct {
exec_start = 1801178161618, // 1801.1178161618

　　所以上次watchdog thead被wake up大概是在1801s，大概是在hardlockup发生（1815s）前的14秒。为什么这么长时间了watchdog一直没有被再次调度呢，怀疑是timer除了问题，所以查看core1的next time event 还有timerkeeper。

crash> struct clock_event_device 0xc1a11200
struct clock_event_device {
event_handler = 0xc01596a4 <hrtimer_interrupt>,
set_next_event = 0xc0118470 <twd_set_next_event>,
set_next_ktime = 0,
next_event = {
tv64 = 665477081301 //665.477081301,
},

crash> p timekeeper
timekeeper = $11 = {
total_sleep_time = {
tv_sec = 1139,
tv_nsec = 350860596

　　根据上面的数据计算，next time event应该发生在665.47 + 1139 + 0.35 ~ 1805s左右，这个时间点watchdog thread理论上应该被再次wakeup。但是实际上过了将近14s还是没有收到新的timer event，感觉上就是timer变慢了，查看code发现local timer是根据启动的core frequency是配置的，而启动的core frequency是1.2G，而panic的时候core的frequency却只有312M了，差了4倍左右，导致原本只需要等4s现在却需要16s（next time event发生的时间变成1801+16 ~ 1807s），时间大大超出预期，core0就误判core1已经无法响应中断，一个hardlockup就这样被trigger。

时间： 2024-10-10 14:52:16

timer bug导致系统hard lockup分析

timer bug导致系统hard lockup分析的相关文章

高并发下Netty4底层bug导致直接内存溢出分析

MySQL Bug导致异常宕机的分析流程

【原创】访问Linux进程文件表导致系统异常复位的排查记录

View的post方法导致的内存泄漏分析

程序Bug导致了天大的损失，要枪毙程序员吗？

Linux中进行用户UID测试导致系统报错

Redis 未授权访问缺陷可轻易导致系统被黑

程序bug导致了天大的损失，要枪毙程序猿吗？

谈谈运行稳定性好效率高的千万级大型网站系统架构性分析