memcached(v1.4.15) dead loop bug. 求大神前来指点

(由于这是发在memcached邮件列表的,所以只能用一下蹩脚的英文了)

I have got a deaploop on memcached(v1.4.15, on centos 6.3 x86_64), it can‘t be reproduced. In our product environment, there are hundreds of memcached instances running, and this bug happend 3 times this years.  when the bug occurred, thouands of tcp connections keep in CLOSE_WAIT status and reached the maximum connection number, then clients cann‘t connect to the cache servers.  The SA have to restart the memcached instance to recover our business,  but the recent time i got the chance to create a core file.

FYI:     before start, you can get the the memcached package(with a debug-info package)at: http://mirrors.htbindustries.org/CentOS/6/x86_64/,and you can get the core file at: http://pan.baidu.com/s/1kTFTRQf

backtrack in gdb.

Thread 8 (Thread 0x7fa8a1ee1700 (LWP 26053)):
#0  0x0000003c9e2e7c73 in epoll_wait () from /lib64/libc.so.6
#1  0x000000323cc12e4b in ?? () from /usr/lib64/libevent-1.4.so.2
#2  0x000000323cc068c3 in event_base_loop () from /usr/lib64/libevent-1.4.so.2
#3  0x0000000000406447 in main (argc=<value optimized out>,
    argv=<value optimized out>) at memcached.c:5228

Thread 7 (Thread 0x7fa89e021700 (LWP 26061)):
#0  0x0000003c9e60b43c in [email protected]@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x000000000040cb33 in slab_rebalance_thread (arg=<value optimized out>)
    at slabs.c:764
#2  0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c9e2e767d in clone () from /lib64/libc.so.6

Thread 6 (Thread 0x7fa89ea22700 (LWP 26060)):
#0  0x0000003c9e2ab91d in nanosleep () from /lib64/libc.so.6
#1  0x0000003c9e2ab790 in sleep () from /lib64/libc.so.6
#2  0x000000000040d0ad in slab_maintenance_thread (arg=<value optimized out>)
    at slabs.c:728
#3  0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.0
#4  0x0000003c9e2e767d in clone () from /lib64/libc.so.6
---Type <return> to continue, or q <return> to quit---

Thread 5 (Thread 0x7fa89f423700 (LWP 26059)):
#0  0x0000003c9e60b43c in [email protected]@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x000000000040fa8d in assoc_maintenance_thread (arg=<value optimized out>)
    at assoc.c:251
#2  0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.0
#3  0x0000003c9e2e767d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fa89fe24700 (LWP 26058)):
#0  0x000000000040fd34 in assoc_find (key=<value optimized out>,
    nkey=<value optimized out>, hv=<value optimized out>) at assoc.c:92
#1  0x000000000040ef2e in do_item_get (key=0x7fa88008bd34 "B920818_0",
    nkey=<value optimized out>, hv=3789230535) at items.c:523
#2  0x0000000000411076 in item_get (key=0x7fa88008bd34 "B920818_0", nkey=9)
    at thread.c:499
#3  0x000000000040731e in process_get_command (c=0x7fa88008bb30,
    tokens=0x7fa89fe23bf0, ntokens=<value optimized out>, return_cas=false)
    at memcached.c:2725
#4  0x0000000000409b63 in process_command (c=0x7fa88008bb30,
    command=<value optimized out>) at memcached.c:3249
#5  0x000000000040a7e2 in try_read_command (c=0x7fa88008bb30)
    at memcached.c:3504
---Type <return> to continue, or q <return> to quit---
#6  0x000000000040b478 in drive_machine (fd=<value optimized out>,
    which=<value optimized out>, arg=0x7fa88008bb30) at memcached.c:3824
#7  event_handler (fd=<value optimized out>, which=<value optimized out>,
    arg=0x7fa88008bb30) at memcached.c:4065
#8  0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.2
#9  0x000000000041070d in worker_libevent (arg=0x1ad5d88) at thread.c:384
#10 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003c9e2e767d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fa8a0825700 (LWP 26057)):
#0  0x0000003c9e6094b3 in pthread_mutex_trylock () from /lib64/libpthread.so.0
#1  0x0000000000410d10 in mutex_lock (hv=<value optimized out>)
    at memcached.h:493
#2  item_lock (hv=<value optimized out>) at thread.c:127
#3  0x0000000000411069 in item_get (
    key=0x7fa8985d8944 "1368_yhd.orders.get_1.0_visitDateList_0", nkey=39)
    at thread.c:498
#4  0x000000000040731e in process_get_command (c=0x7fa89829b7e0,
    tokens=0x7fa8a0824bf0, ntokens=<value optimized out>, return_cas=false)
    at memcached.c:2725
#5  0x0000000000409b63 in process_command (c=0x7fa89829b7e0,
    command=<value optimized out>) at memcached.c:3249
#6  0x000000000040a7e2 in try_read_command (c=0x7fa89829b7e0)
---Type <return> to continue, or q <return> to quit---
    at memcached.c:3504
#7  0x000000000040b478 in drive_machine (fd=<value optimized out>,
    which=<value optimized out>, arg=0x7fa89829b7e0) at memcached.c:3824
#8  event_handler (fd=<value optimized out>, which=<value optimized out>,
    arg=0x7fa89829b7e0) at memcached.c:4065
#9  0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.2
#10 0x000000000041070d in worker_libevent (arg=0x1ad2a00) at thread.c:384
#11 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003c9e2e767d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fa8a1226700 (LWP 26056)):
#0  0x0000003c9e6094b3 in pthread_mutex_trylock () from /lib64/libpthread.so.0
#1  0x0000000000410d10 in mutex_lock (hv=<value optimized out>)
    at memcached.h:493
#2  item_lock (hv=<value optimized out>) at thread.c:127
#3  0x0000000000410d7d in store_item (item=0x7fa89d0dddb8, comm=2,
    c=0x7fa880118a00) at thread.c:598
#4  0x000000000040bed4 in complete_nread_ascii (fd=<value optimized out>,
    which=<value optimized out>, arg=0x7fa880118a00) at memcached.c:843
#5  complete_nread (fd=<value optimized out>, which=<value optimized out>,
    arg=0x7fa880118a00) at memcached.c:2248
#6  drive_machine (fd=<value optimized out>, which=<value optimized out>,
    arg=0x7fa880118a00) at memcached.c:3861
---Type <return> to continue, or q <return> to quit---
#7  event_handler (fd=<value optimized out>, which=<value optimized out>,
    arg=0x7fa880118a00) at memcached.c:4065
#8  0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.2
#9  0x000000000041070d in worker_libevent (arg=0x1acf678) at thread.c:384
#10 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003c9e2e767d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fa8a1c27700 (LWP 26055)):
#0  0x0000003c9e6094b3 in pthread_mutex_trylock () from /lib64/libpthread.so.0
#1  0x0000000000410d10 in mutex_lock (hv=<value optimized out>)
    at memcached.h:493
#2  item_lock (hv=<value optimized out>) at thread.c:127
#3  0x0000000000411069 in item_get (
    key=0x7fa881755f64 "4872_yhd.orders.get_1.0_0", nkey=25) at thread.c:498
#4  0x000000000040731e in process_get_command (c=0x7fa88049e180,
    tokens=0x7fa8a1c26bf0, ntokens=<value optimized out>, return_cas=false)
    at memcached.c:2725
#5  0x0000000000409b63 in process_command (c=0x7fa88049e180,
    command=<value optimized out>) at memcached.c:3249
#6  0x000000000040a7e2 in try_read_command (c=0x7fa88049e180)
    at memcached.c:3504
#7  0x000000000040b478 in drive_machine (fd=<value optimized out>,
    which=<value optimized out>, arg=0x7fa88049e180) at memcached.c:3824
---Type <return> to continue, or q <return> to quit---
#8  event_handler (fd=<value optimized out>, which=<value optimized out>,
    arg=0x7fa88049e180) at memcached.c:4065
#9  0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.2
#10 0x000000000041070d in worker_libevent (arg=0x1acc2f0) at thread.c:384
#11 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003c9e2e767d in clone () from /lib64/libc.so.6
(gdb)
(gdb) 

It‘s easy to know that there is something wrong in thread 4, also know that the thread 4 got a lock of an item then other thread is block by the lock.

(gdb) thread 4
[Switching to thread 4 (Thread 0x7fa89fe24700 (LWP 26058))]#0  0x000000000040fd34 in assoc_find (key=<value optimized out>, nkey=<value optimized out>,
    hv=<value optimized out>) at assoc.c:92
92              if ((nkey == it->nkey) && (memcmp(key, ITEM_key(it), nkey) == 0)) {
(gdb) bt
#0  0x000000000040fd34 in assoc_find (key=<value optimized out>,
    nkey=<value optimized out>, hv=<value optimized out>) at assoc.c:92
#1  0x000000000040ef2e in do_item_get (key=0x7fa88008bd34 "B920818_0",
    nkey=<value optimized out>, hv=3789230535) at items.c:523
#2  0x0000000000411076 in item_get (key=0x7fa88008bd34 "B920818_0", nkey=9)
    at thread.c:499
#3  0x000000000040731e in process_get_command (c=0x7fa88008bb30,
    tokens=0x7fa89fe23bf0, ntokens=<value optimized out>, return_cas=false)
    at memcached.c:2725
#4  0x0000000000409b63 in process_command (c=0x7fa88008bb30,
    command=<value optimized out>) at memcached.c:3249
#5  0x000000000040a7e2 in try_read_command (c=0x7fa88008bb30)
    at memcached.c:3504
#6  0x000000000040b478 in drive_machine (fd=<value optimized out>,
    which=<value optimized out>, arg=0x7fa88008bb30) at memcached.c:3824
#7  event_handler (fd=<value optimized out>, which=<value optimized out>,
    arg=0x7fa88008bb30) at memcached.c:4065
#8  0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.2
(gdb) p it
$1 = (item *) 0x7fa89799e5b0
(gdb) p *it
$2 = {next = 0x7fa89d0646c0, prev = 0x7fa8979b0760, h_next = 0x7fa89799e5b0, time = 31672441, exptime = 31758841, nbytes = 10, refcount = 1, nsuffix = 10 ‘\n‘,
  it_flags = 3 ‘\003‘, slabs_clsid = 2 ‘\002‘, nkey = 34 ‘"‘, data = 0x7fa89799e5b0}
(gdb) 

so the it->h_next is point to itself, then the deadloop happend.

i have check all the code that modify the value it->h_next, (the assoc_insert/assoc_deleteassoc_maintenance_thread function), whenever it is modified, it  have to get the lock "cache_lock". so i can‘t find the reason why an item pointed to itself.

 more infomation:  

In thread 4:

si r8 --> nkey=9

di r9 --> key ="B920818_0"
hv = 0xe1db11c7,  so offset in primary_hashtable is=0x111c7

(gdb) p primary_hashtable
$22 = (item **) 0x7fa8840008c

so i get the first item in the hash bucket.

(gdb) x /10g 0x00007fa8840008c0+0x8*0x111c7
0x7fa8840896f8: 0x00007fa890b46ef0

and then the deadloop item‘s address is the second item in the hash bucket list. yes, there is hash conflict.

(gdb) x /10xg 0x00007fa890b46ef0
0x7fa890b46ef0: 0x00007fa89d1884f0 0x00007fa89d1ee2b0
0x7fa890b46f00: 0x00007fa89799e5b0 0x01f3710601f362f6
0x7fa890b46f10: 0x0307000100000003 0x0000000000001501
0x7fa890b46f20: 0x000000030ed02eb6 0x4544524f5f594247
0x7fa890b46f30: 0x3632343532325f52 0x332032305f383533
(gdb)

the item in the hash bucket list is:

(gdb) p it
$1 = (item *) 0x7fa89799e5b0
(gdb) p *it
$2 = {next = 0x7fa89d0646c0, prev = 0x7fa8979b0760, h_next = 0x7fa89799e5b0, time = 31672441, exptime = 31758841, nbytes = 10, refcount = 1, nsuffix = 10 ‘\n‘,
  it_flags = 3 ‘\003‘, slabs_clsid = 2 ‘\002‘, nkey = 34 ‘"‘, data = 0x7fa89799e5b0}
(gdb)
(gdb) p *(item *)0x00007fa89d1884f0
$12 = {next = 0x7fa890b4f110, prev = 0x7fa890b46ef0, h_next = 0x0, time = 32727785, exptime = 32731385, nbytes = 3, refcount = 1, nsuffix = 7 ‘\a‘,
  it_flags = 3 ‘\003‘, slabs_clsid = 1 ‘\001‘, nkey = 21 ‘\025‘, data = 0x7fa89d1884f0}
(gdb)

item[1]:"71912_yhd.serial.product.get_1.0_0 16384 8\r\n" nsuffix=10; slab 2; nkey=34; nbytes=10

(little endian)
(gdb) x /100xb 0x00007fa89799e5b0
0x7fa89799e5b0: 0xc0 0x46 0x06 0x9d 0xa8 0x7f 0x00 0x00
0x7fa89799e5b8: 0x60 0x07 0x9b 0x97 0xa8 0x7f 0x00 0x00
0x7fa89799e5c0: 0xb0 0xe5 0x99 0x97 0xa8 0x7f 0x00 0x00
0x7fa89799e5c8: 0x79 0x48 0xe3 0x01 0xf9 0x99 0xe4 0x01
0x7fa89799e5d0: 0x0a 0x00 0x00 0x00 0x01 0x00 0x0a 0x03
0x7fa89799e5d8: 0x02 0x22 0x00 0x00 0x00 0x00 0x00 0x00
0x7fa89799e5e0: 0x6d 0x33 0x4f 0xd6 0x02 0x00 0x00 0x00
0x7fa89799e5e8: 0x37 0x31 0x39 0x31 0x32 0x5f 0x79 0x68
0x7fa89799e5f0: 0x64 0x2e 0x73 0x65 0x72 0x69 0x61 0x6c
0x7fa89799e5f8: 0x2e 0x70 0x72 0x6f 0x64 0x75 0x63 0x74
0x7fa89799e600: 0x2e 0x67 0x65 0x74 0x5f 0x31 0x2e 0x30
0x7fa89799e608: 0x5f 0x30 0x20 0x20 0x31 0x36 0x33 0x38
0x7fa89799e610: 0x34 0x20 0x38 0x0d 0x0a 0x00 [0x00 0x00   8\r\n"
0x7fa89799e618: 0x00 0x00 0x00 0x00 0x2d 0x0d] 0x0a 0x0a
0x7fa89799e620: 0x08 0xe1 0x0d 0x0a 0x0d 0x70 0x0d 0x0a
(key

 

item[0]:"Y_ORDER_225426358_02 32 1\r\n1\r\n513\r\n" nssuffix=7,slab=1,nkey=21,nbytes=3
item[1]:"71912_yhd.serial.product.get_1.0_0  16384 8\r\n"   nsuffix=10;  slab 2; nkey=34; nbytes=10
when the deadloop happened, it is looking for key "B920818_0". and the key is not in the list(may be it is).

it is odd that the item[1] have no data, although the nbyte is 10!

----------------

i guess that the bug occurred when hash expanding.  the hashpower=17(the default is 16), and   hash_items=101025, the hash expand when  hash_items>98304. It is very possible that the dealoop hapened after expanding, and then all the thread is hanged.

(gdb) p expand_bucket
$16 = 65536
(gdb) p stats.hash_is_expanding
$17 = false
(gdb) p hash_items
$18 = 101025
(gdb) p 256*256*3/2
$19 = 98304

(gdb) p expanding

$20 = false
(gdb)

(gdb) p hashpower
$21 = 17
(gdb)


i can‘t not find the bug in the code, so if any guys have any suggestion, please tell me.

thanks a lot.

memcached(v1.4.15) dead loop bug. 求大神前来指点

时间: 2024-12-19 20:32:02

memcached(v1.4.15) dead loop bug. 求大神前来指点的相关文章

ie6/7下给table的行tr 添加position:relative;的奇特bug,求大神告知

代码: <!DOCTYPE html> <html> <head> <title>ie6/7下给table的行tr 添加position:relative;的奇特bug,求大神告知</title> <style> * { padding: 0; margin: 0; } table tr { position: relative;float:left;overflow:hidden;z-index:1;height:100%;widt

第一个简单的flask程序(有问题,求大神帮忙解决)

1.初始化 1 2 3 4 5 6 7 8 9 10 11 12 13 from flask import Flask app = Flask(__name__) @app.route('/') def index():  return '<h1>Hello World!</h1>' if __name__ == '__main__':  app.run(debug=True) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 2

[LeetCode]LRU Cache有个问题,求大神解答

题目: Design and implement a data structure for Least Recently Used (LRU) cache. It should support the following operations: get and set. get(key) - Get the value (will always be positive) of the key if the key exists in the cache, otherwise return -1.

python 抓取搜狗微信出现的问题,求大神解决

爬取到的data不是想要获取文章页面的源码,而是跳转到验证码的页面的源码.我网上查了一下是由于访问过于频繁导致的,我也加了time.sleep和改了请求头但还是显示不了,求大神支招,除了识别验证码的方式还能怎么做?? import re import urllib.request import time import urllib.error headers = {'User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/2

开发的一个android程序,总是显示无法运行,就退出了。但是没有报错。求大神解救啊!

============问题描述============ 代码如下: 在网上查有人说是intent的问题 原来没有加入监听器的时候没有问题 我是个新手,问题有点没水平啦~ 求大神啊.谢谢谢谢 ============解决方案1============ 把log贴上来看看,是不是menu这个activity没有在manifest 声明 ============解决方案2============ setonclicklistener那句话,把new helplistener前面的那个类型转换删除 =

用java制作简单登陆窗口,求大神教导

import java.awt.Container; import java.awt.event.ActionEvent; import java.awt.event.ActionListener; import java.util.Random; import javax.swing.JButton; import javax.swing.JFrame; import javax.swing.JLabel; import javax.swing.JOptionPane; import java

hdu 1008为何不对?求大神指导!

/*#include<iostream> using namespace std; int main(){ int n; while (cin >> n){ if (n == 0) break; else if (n > 0 && n < 100){ int sum=0, max,min; int *f = new int[n]; for (int i = 0; i < n; i++) cin >> f[i]; max = min =

跪求大神指教一个ImageButton与surfaceView问题

============问题描述============ 用xml文件写好了imagebutton按钮,然后让surfaceview连接到写好了的imagebutton按钮的ml文件当中,怎么实现点击Imagebutton消除用surfaceview中代码写得运动的物体.跪求大神指教... ============解决方案1============ 你可以试试自己定义接口回调,或者通过广播来处理.

安卓摄像头对焦问题,求大神支招。

============问题描述============ 查了一些资料,都没有找到,手机摄像头对选定的区域对焦的方法,就像小米照相机一样,点击哪.哪里显示就特别清晰的那个功能,求大神 支招 怎么做到这个功能.... ============解决方案1============ 对焦这个问题,早期版本未能解决.安卓4.0新增的Camera类,有一个setFocusAreas方法可以设置焦距. 早起SDK因为手机品种不齐,有些缺乏硬件支持,所以没有一个统一的对焦接口,新的SDK里已经出现了. ====