localtime deadlock: forking a child process from a multithreaded program

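While testing our own modified version of Redis recently, we found that the child process would hang forever during an RDB save. Attaching gdb showed the following stack: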

(gdb) bt
#0  0x0000003f6d4f805e in __lll_lock_wait_private () from /lib64/libc.so.6
#1  0x0000003f6d49dcad in _L_lock_2164 () from /lib64/libc.so.6
#2  0x0000003f6d49da67 in __tz_convert () from /lib64/libc.so.6
#3  0x0000000000421004 in redisLogRaw (level=2, msg=0x7fff9f412b50 "[INFQ_INFO]: [infq.c:1483] infq persistent dump, suffix: 405665, start_index: 35637626, ele_count: 4") at redis.c:332
#4  0x0000000000421256 in redisLog (level=2, fmt=0x4eedcf "[INFQ_INFO]: %s") at redis.c:363
#5  0x000000000043926b in infq_info_log (msg=0x7fff9f413090 "[infq.c:1483] infq persistent dump, suffix: 405665, start_index: 35637626, ele_count: 4") at object.c:816
#6  0x00000000004b5677 in infq_log (level=1, file=0x501465 "infq.c", lineno=1483, fmt=0x5024e0 "infq persistent dump, suffix: %d, start_index: %lld, ele_count: %d") at logging.c:81
#7  0x00000000004b37f8 in dump_push_queue (infq=0x17d07f0) at infq.c:1480
#8  0x00000000004b1566 in infq_dump (infq=0x17d07f0, buf=0x7fff9f413650 "", buf_size=1024, data_size=0x7fff9f413a5c) at infq.c:720
#9  0x00000000004440e6 in rdbSaveObject (rdb=0x7fff9f413c50, o=0x7f8c0b4d0470) at rdb.c:600
#10 0x000000000044429c in rdbSaveKeyValuePair (rdb=0x7fff9f413c50, key=0x7fff9f413b90, val=0x7f8c0b4d0470, expiretime=-1, now=1434687031023) at rdb.c:642
#11 0x0000000000444471 in rdbSaveRio (rdb=0x7fff9f413c50, error=0x7fff9f413c4c) at rdb.c:686
#12 0x0000000000444704 in rdbSave (filename=0x7f8c0b410040 "dump.rdb") at rdb.c:750
#13 0x00000000004449cd in rdbSaveBackground (filename=0x7f8c0b410040 "dump.rdb") at rdb.c:831
#14 0x0000000000422b0e in serverCron (eventLoop=0x7f8c0b45a150, id=0, clientData=0x0) at redis.c:1240
#15 0x000000000041d47e in processTimeEvents (eventLoop=0x7f8c0b45a150) at ae.c:311
#16 0x000000000041d7c0 in aeProcessEvents (eventLoop=0x7f8c0b45a150, flags=3) at ae.c:423
#17 0x000000000041d8de in aeMain (eventLoop=0x7f8c0b45a150) at ae.c:455
#18 0x0000000000429ae3 in main (argc=2, argv=0x7fff9f414168) at redis.c:3843

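Every hang was blocked inside redisLog, which writes log messages. Writing a log line requires a call to localtime to produce the timestamp. Looking at the glibc source, glibc-2.9/time/localtime.c: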

/* Return the `struct tm' representation of *T in local time,
   using *TP to store the result.  */
struct tm *
__localtime_r (t, tp)
     const time_t *t;
     struct tm *tp;
{
  return __tz_convert (t, 1, tp);
}
weak_alias (__localtime_r, localtime_r)

/* Return the `struct tm' representation of *T in local time.  */
struct tm *
localtime (t)
     const time_t *t;
{
  return __tz_convert (t, 1, &_tmbuf);
}
libc_hidden_def (localtime)

Both localtime and localtime_r delegate the real work to __tz_convert. That function lives in glibc-2.9/time/tzset.c:

/* This locks all the state variables in tzfile.c and this file.  */
__libc_lock_define_initialized (static, tzset_lock)

/* Return the `struct tm' representation of *TIMER in the local timezone.
   Use local time if USE_LOCALTIME is nonzero, UTC otherwise.  */
struct tm *
__tz_convert (const time_t *timer, int use_localtime, struct tm *tp)
{
  long int leap_correction;
  int leap_extra_secs;

  if (timer == NULL)
    {
      __set_errno (EINVAL);
      return NULL;
    }

  // acquire the lock
  __libc_lock_lock (tzset_lock);

  // ... conversion logic (elided) ...

  // release the lock
  __libc_lock_unlock (tzset_lock);

  return tp;
}

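This function takes tzset_lock, a file-scope static lock shared by the whole process. Because the shared time zone state is accessed under that lock and the result goes into a caller-supplied buffer, localtime_r is thread-safe. localtime writes its result into the shared static buffer _tmbuf, so it is not thread-safe. Neither function is async-signal-safe, so using them in a signal handler risks deadlock. For example, if the program is inside localtime_r and has taken the lock when a signal arrives, and the signal handler also calls localtime_r, the handler blocks forever because it cannot acquire the lock.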

Why does this localtime deadlock not occur in stock Redis?

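Because stock Redis never calls localtime from more than one thread. At the moment it forks, any localtime call has already run to completion, that is, the lock has already been released.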

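Our modified Redis uses multiple threads, and those threads call redisLog to write logs. So at the moment of the fork, some thread may be in the middle of a localtime call, with the lock taken but not yet released. The child starts with a copy-on-write copy of the parent's memory, so its copy of the localtime lock is in the held state. Only the thread that called fork exists in the child, so the thread that held the lock is not there to release it, and the child blocks forever.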
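
The same effect can be reproduced with a lock we own. This is a sketch under assumptions, not the Redis code: demo_lock stands in for tzset_lock, the two pipes only exist to make the timing deterministic, and all names are hypothetical. The child uses pthread_mutex_trylock instead of pthread_mutex_lock so that the demo terminates instead of hanging. Calling it in the child of a multithreaded process is outside what POSIX promises, although it behaves as expected with glibc.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for tzset_lock */
static int ready_pipe[2];   /* worker -> main: "I hold the lock" */
static int release_pipe[2]; /* main -> worker: "you may unlock now" */

/* Plays the thread that is in the middle of localtime() during the fork. */
static void *worker(void *arg)
{
    char c = 'x';
    ssize_t n;

    (void)arg;
    pthread_mutex_lock(&demo_lock);
    n = write(ready_pipe[1], &c, 1);       /* announce that the lock is held */
    if (n == 1)
        n = read(release_pipe[0], &c, 1);  /* wait until main has forked */
    (void)n;
    pthread_mutex_unlock(&demo_lock);
    return NULL;
}

/* Returns 1 if the child found the lock busy, 0 if it was free, -1 on error. */
int child_inherits_locked_mutex(void)
{
    pthread_t tid;
    pid_t pid;
    int status;
    char c = 'x';

    if (pipe(ready_pipe) != 0 || pipe(release_pipe) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    if (read(ready_pipe[0], &c, 1) != 1)   /* worker now holds the lock */
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Only the forking thread exists here. The worker was not copied,
         * so nobody will ever unlock the child's copy of demo_lock. */
        _exit(pthread_mutex_trylock(&demo_lock) == EBUSY ? 1 : 0);
    }

    if (write(release_pipe[1], &c, 1) != 1) /* let the parent's worker finish */
        return -1;
    pthread_join(tid, NULL);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the parent the worker unlocks normally and everything proceeds. In the child a blocking pthread_mutex_lock at the same spot would wait forever, which matches the __lll_lock_wait_private frame at the top of the backtrace.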

So what is the fix?

If we control the lock ourselves, we can use the library function pthread_atfork to lock and unlock around fork so that the lock is in a consistent state.

     #include <pthread.h>

     int
     pthread_atfork(void (*prepare)(void), void (*parent)(void), void (*child)(void));

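The prepare handler is called before fork; parent and child are called after fork returns, in the parent and the child process respectively. The usual pattern is to acquire every lock in prepare, so that no other thread holds one mid-operation at the moment of the fork, and then release them in both parent and child.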
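
A minimal sketch of the commonly used handler pattern for a lock the application owns, assuming a single hypothetical lock named app_lock: the prepare handler takes the lock so the fork happens while the forking thread holds it, and both the parent and child handlers release it. All names here are illustrative. Unlocking in the child relies on the child's only thread being a copy of the thread that locked, which works for default mutexes with glibc.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER; /* hypothetical lock we own */

static void prepare(void)   { pthread_mutex_lock(&app_lock); }   /* before fork */
static void in_parent(void) { pthread_mutex_unlock(&app_lock); } /* after fork, parent */
static void in_child(void)  { pthread_mutex_unlock(&app_lock); } /* after fork, child */

/* Background thread that keeps taking and releasing the lock. */
static void *worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < 200; i++) {
        pthread_mutex_lock(&app_lock);
        usleep(100);                 /* pretend to work under the lock */
        pthread_mutex_unlock(&app_lock);
        usleep(100);                 /* give other threads a chance */
    }
    return NULL;
}

/* Returns 1 if the child could take app_lock, 0 if not, -1 on error. */
int child_can_take_lock_after_atfork(void)
{
    static int registered;           /* handlers must be registered only once */
    pthread_t tid;
    pid_t pid;
    int status;

    if (!registered) {
        if (pthread_atfork(prepare, in_parent, in_child) != 0)
            return -1;
        registered = 1;
    }
    if (pthread_create(&tid, NULL, worker, NULL) != 0)
        return -1;
    usleep(1000);                    /* let the worker start cycling */

    pid = fork();                    /* prepare() waits until the worker lets go */
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(pthread_mutex_trylock(&app_lock) == 0 ? 1 : 0);

    pthread_join(tid, NULL);
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Registering the handlers twice would make prepare lock the same mutex twice and hang, hence the registered flag.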

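We have no way to operate on the lock that localtime uses, so that approach does not work here. Instead we settled for a compromise: the serverCron timer in Redis refreshes the local time and stores it in a global variable, and when the component's threads write logs they only read the cached global, so localtime is never called from multiple threads. serverCron runs at intervals of at most 10 ms, so the error stays small and is perfectly acceptable for logging.

To sum up, functions that rely on a global lock, such as localtime, free, and malloc, are not async-signal-safe. Calling them from multiple threads can also deadlock a child created by fork. The way to avoid this is to make sure no such lock is held at the time of the fork, either by not calling these functions from multiple threads or by guarding them with locks of your own that you can control.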
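
The cached-time workaround described above might look roughly like the following. This is a sketch under assumptions, not our actual patch: cached_stamp, update_cached_time, and format_cached_time are hypothetical names, and packing the local time into one atomic integer is just one way to let worker threads read the cache without any lock.

```c
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

/* Local time packed as the decimal number YYYYMMDDhhmmss. */
static _Atomic long long cached_stamp;

/* Call only from the main thread's timer (serverCron in the article), so
 * that localtime_r and its lock are never entered by a worker thread. */
void update_cached_time(time_t now)
{
    struct tm tm;
    long long packed;

    if (localtime_r(&now, &tm) == NULL)
        return;
    packed = tm.tm_year + 1900LL;
    packed = packed * 100 + (tm.tm_mon + 1);
    packed = packed * 100 + tm.tm_mday;
    packed = packed * 100 + tm.tm_hour;
    packed = packed * 100 + tm.tm_min;
    packed = packed * 100 + tm.tm_sec;
    atomic_store(&cached_stamp, packed);
}

/* Callable from any thread: decodes the cached value and formats it as
 * "YYYY-MM-DD hh:mm:ss" without calling localtime. Returns what snprintf
 * returns. */
int format_cached_time(char *buf, size_t len)
{
    long long s = atomic_load(&cached_stamp);
    int sec  = (int)(s % 100); s /= 100;
    int min  = (int)(s % 100); s /= 100;
    int hour = (int)(s % 100); s /= 100;
    int day  = (int)(s % 100); s /= 100;
    int mon  = (int)(s % 100); s /= 100;
    int year = (int)s;

    return snprintf(buf, len, "%04d-%02d-%02d %02d:%02d:%02d",
                    year, mon, day, hour, min, sec);
}
```

The main thread would call update_cached_time(time(NULL)) on every timer tick, and the logging path in worker threads would call format_cached_time instead of localtime. The timestamp can lag by up to one timer interval, which is the trade-off accepted above.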

Time: 2024-11-05 12:34:51
