libvirtd service unresponsive under CloudStack

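On CloudStack 4.5.2, the libvirtd service occasionally stops responding, which makes virsh commands unusable; at the same time the CloudStack master loses its connection to that slave host. I first suspected the libvirtd service or its version, but after analysis and troubleshooting I concluded that the problem lies with cloudstack-agent. I could not find a similar bug report on the official site, so the issue may still exist in later versions and needs more time for a root-cause analysis. Below is a record of how the problem was handled, for anyone who follows or uses CloudStack.

As everyone knows, the CloudStack community is far less active than OpenStack's, so why choose CloudStack anyway? I'll save that discussion for another time. Back to the topic.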

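Environment:

Host operating system: CentOS 6.5 x64 (2.6.32-431.el6.x86_64)
CloudStack version: 4.5.2
libvirt version: libvirt-0.10.2-54.el6_7.2.x86_64

Problem description and basic maintenance:

Alert reported through the CloudStack API call listHosts:
node5.cloud.rtmap:192.168.14.20 state is Down at 2016-05-13T07:19:04+0800
# How to use the CloudStack API is covered in other articles and is not explained here.

Log in to the affected host and check: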
[[email protected] log]#virsh list --all
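No response; exit with Ctrl+C.
At this point the VMs still work normally, but they are out of control.

Try to restart the libvirtd service:
[[email protected] log]# service libvirtd stop
Stopping libvirtd daemon:                               [FAILED]  # the libvirtd service cannot be stopped

Try to restart the cloudstack-agent service: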
[[email protected] libvirt]# service cloudstack-agent restart
Stopping Cloud Agent:
Starting Cloud Agent:
The libvirtd fault persists.

[[email protected] ping]# libvirtd -d -l --config /etc/libvirt/libvirtd.conf
libvirtd: error: Unable to initialize network sockets. Check /var/log/messages or run without --daemon for more info.

[[email protected] log]# libvirtd -d
This succeeds; virsh list --all can now list and operate the VMs.

[[email protected] log]#virsh list --all
 Id    Name                           State
----------------------------------------------------
 2     i-4-185-VM                     running

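Although the VMs run normally and can now be managed from the command line, from the CloudStack platform's point of view the host is Down and the VMs are out of its control.

The temporary workaround is to reboot the server during some other major upgrade or maintenance window; a real fix requires analysing the specific problem.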
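
Since a hung libvirtd makes virsh block until Ctrl+C, a scripted check needs a time limit. A minimal sketch, assuming the coreutils timeout command is available; probe_cmd is a hypothetical helper name, not part of libvirt or CloudStack:

```shell
# probe_cmd SECONDS CMD [ARGS...]
# Runs CMD under coreutils `timeout` and prints a one-word verdict.
# (Hypothetical helper for illustration only.)
probe_cmd() {
    local limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    local rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "responsive"
    elif [ "$rc" -eq 124 ]; then
        echo "hung"      # 124 is timeout's exit status when the limit expired
    else
        echo "failed"    # the command itself returned an error
    fi
    return "$rc"
}

# Example (not run here): probe_cmd 10 virsh list --all
```

On a host in the state described above, probe_cmd 10 virsh list --all should print hung if virsh blocks as it did here. The 10-second limit is an arbitrary choice.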

Analysis, troubleshooting and resolution:

1: Check the process

[[email protected] log]# ps ax |grep libvirtd
  6485 ?        R    863:37 libvirtd --daemon -l  # the service stays in the running (R) state

[[email protected] log]# top -p 6485
top - 09:19:41 up 12 days, 22:27,  1 user,  load average: 3.05, 5.07, 6.64
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.8%us,  1.4%sy,  0.0%ni, 93.1%id,  0.6%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  264420148k total, 182040780k used, 82379368k free,   834232k buffers
Swap:  8388600k total,       92k used,  8388508k free, 100453708k cached

   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  6485 root      20   0  984m  12m 4440 R  100  0.0 844:22.68 libvirtd        # CPU usage stuck at 100% and never released, affecting system stability

2: Kill the process

[[email protected] log]# kill -9 6485
[[email protected] log]# kill -9 6485
[[email protected] log]# ps ax |grep libvirtd  # check: the process still exists
  6485 ?        R    863:37 libvirtd --daemon -l
[[email protected] ~]# libvirtd -d -l --config /etc/libvirt/libvirtd.conf
libvirtd: error: Unable to initialize network sockets. Check /var/log/messages or run without --daemon for more info.
[[email protected] ~]# netstat -antp |grep 16509
tcp        0      0 0.0.0.0:16509               0.0.0.0:*                   LISTEN      3658/libvirtd
tcp        1      0 192.168.14.25:16509         192.168.14.22:8717          CLOSE_WAIT  -
tcp        1      0 192.168.14.25:16509         192.168.14.20:5152          CLOSE_WAIT  -
tcp        1      0 192.168.14.25:16509         192.168.14.10:39359         CLOSE_WAIT  -
tcp        0      0 :::16509                    :::*                        LISTEN      3658/libvirtd
tcp       39      0 ::1:16509                   ::1:19715                   CLOSE_WAIT  -  

After the steps above, my initial conclusion was that libvirtd had hung.
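
CLOSE_WAIT means the remote side has closed the connection but the local process has not yet closed its socket, which fits a daemon that has stopped servicing its connections. A small sketch for counting such sockets from netstat output; count_state is a hypothetical helper, and the field positions assume the netstat -ant layout shown above:

```shell
# count_state PORT STATE
# Reads `netstat -ant` (or -antp) output on stdin and prints how many TCP
# sockets bound to local PORT are in STATE. (Hypothetical helper.)
count_state() {
    awk -v port="$1" -v state="$2" '
        # $4 is the local address, $6 is the connection state
        $1 ~ /^tcp/ && $4 ~ (":" port "$") && $6 == state { n++ }
        END { print n + 0 }'
}

# Example (not run here): netstat -ant | count_state 16509 CLOSE_WAIT
```

A count that keeps growing between runs would suggest that the daemon is no longer closing connections its clients have dropped.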

3: Trace the process

[[email protected] log]#strace -f libvirtd
[pid 107570] close(23058)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23059)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23060)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23061)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23062)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23063)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23064)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23065)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23066)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23067)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23068)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23069)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23070)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23071)               = -1 EBADF (Bad file descriptor)
^C[pid 107570] close(23072 <unfinished ...>
Process 107559 detached
Process 107560 detached
Process 107561 detached
Process 107562 detached
Process 107563 detached
Process 107564 detached
Process 107565 detached
Process 107566 detached
Process 107567 detached
Process 107568 detached
Process 107569 detached
Process 107570 detached

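The trace shows pid 107570 calling close() on one file descriptor number after another, each call failing with EBADF (Bad file descriptor); I read this as the parent process 6485 continuously spawning and closing child processes that return errors. (Note that strace -f libvirtd starts a new libvirtd under strace; attaching to the hung process itself would be strace -f -p 6485.) What causes the Bad file descriptor errors (how are they triggered, and by whom)? Why does the loop never exit? How can the problem be reproduced?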

4: Gather more clues
Official documentation (very detailed records of libvirtd failure diagnoses and fixes):
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-Troubleshooting-Common_libvirt_errors_and_troubleshooting.html#sect-libvirtd_failed_to_start

Enable system logging:

Change libvirt's logging in /etc/libvirt/libvirtd.conf by enabling the line below. To enable the setting, open the /etc/libvirt/libvirtd.conf file in a text editor, remove the hash (or #) symbol from the beginning of the following line, and save the change:

log_outputs="3:syslog:libvirtd"

Apply the configuration as described, reboot the server, and wait for the next failure to examine the log.
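
The edit described above can also be scripted. A minimal sketch, assuming the configuration file carries the line exactly as #log_outputs="3:syslog:libvirtd"; enable_log_outputs is a hypothetical helper, and libvirtd has to be restarted afterwards before the setting takes effect:

```shell
# enable_log_outputs FILE
# Uncomments the syslog log_outputs line in FILE and verifies the result.
# (Hypothetical helper; it leaves every other line untouched.)
enable_log_outputs() {
    local conf=$1 tmp
    tmp=$(mktemp) || return 1
    # Strip the leading "#" (and any spaces after it) from that one line only.
    sed 's/^#[[:space:]]*\(log_outputs="3:syslog:libvirtd"\)/\1/' "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"    # overwrite in place, keeping owner and mode
    rm -f "$tmp"
    # Succeed only if the active line is now present.
    grep -q '^log_outputs="3:syslog:libvirtd"' "$conf"
}

# Example (not run here): enable_log_outputs /etc/libvirt/libvirtd.conf
```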

......

Jun  1 12:42:26 node5 abrtd: New client connected
Jun  1 12:42:26 node5 abrtd: Directory 'pyhook-2016-06-01-12:42:26-70065' creation detected
Jun  1 12:42:26 node5 abrt-server[70066]: Saved Python crash dump of pid 70065 to /var/spool/abrt/pyhook-2016-06-01-12:42:26-70065
Jun  1 12:42:26 node5 abrtd: Package 'cloudstack-common' isn't signed with proper key
Jun  1 12:42:26 node5 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-06-01-12:42:26-70065' exited with 1
Jun  1 12:42:26 node5 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-06-01-12:42:26-70065'
Jun  1 12:43:26 node5 abrt: detected unhandled Python exception in '/usr/share/cloudstack-common/scripts/vm/network/security_group.py'
......

Jun  6 10:36:21 node5 libvirtd: 102840: warning : qemuDomainObjBeginJobInternal:878 : Cannot start job (modify, none) for domain i-4-30-VM; current job is (modify, none) owned by (102925, 0)
Jun  6 10:36:21 node5 libvirtd: 102840: error : qemuDomainObjBeginJobInternal:883 : Timed out during operation: cannot acquire state change lock
Jun  6 10:39:59 node5 libvirtd: 114071: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:39:59 node5 libvirtd: 114071: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
Jun  6 10:40:46 node5 libvirtd: 114147: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:40:46 node5 libvirtd: 114147: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
Jun  6 10:42:15 node5 libvirtd: 114204: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:42:15 node5 libvirtd: 114204: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
Jun  6 10:47:05 node5 libvirtd: 114375: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:47:05 node5 libvirtd: 114375: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
Jun  6 10:47:23 node5 libvirtd: 114412: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:47:23 node5 libvirtd: 114412: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
......

Jun 12 03:08:02 node5 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="3111" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jun 12 09:20:40 node5 libvirtd: 72575: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun 12 09:20:40 node5 libvirtd: 72575: error : virPidFileAcquirePath:410 : Failed to acquire pid file '/var/run/libvirtd.pid': Resource temporarily unavailable

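This did not reveal a fatal error or any further clues. (Still, this logging option is well worth enabling; many problems can be located through it.)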
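
With logging enabled, the recurring errors are easier to see when grouped by the libvirt function that reported them. A sketch, assuming the syslog line layout shown above (libvirtd: PID: error : function:line : message); summarize_libvirtd_errors is a hypothetical helper:

```shell
# summarize_libvirtd_errors
# Reads syslog lines on stdin and prints "COUNT FUNCTION" for every libvirtd
# error line, most frequent first. (Hypothetical helper.)
summarize_libvirtd_errors() {
    awk '/ libvirtd: [0-9]+: error : / {
             # The field after "error :" looks like "functionName:lineNo".
             for (i = 1; i <= NF; i++)
                 if ($i == "error" && $(i + 1) == ":") {
                     split($(i + 2), part, ":")
                     count[part[1]]++
                     break
                 }
         }
         END { for (f in count) print count[f], f }' | sort -k1,1nr -k2,2
}

# Example (not run here): summarize_libvirtd_errors < /var/log/messages
```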

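5: Approach to a solution:

  1. Try to find a way to terminate the process and restart the service
  2. File a bug and wait for a patch or upgrade
  3. Analyse the source code, reproduce the problem and fix it (requires development effort and time)

Since the problem could not be reproduced, I worked from simple to complex. Who is the culprit triggering these child processes? cloudstack-agent is still the prime suspect, but restarting that service earlier did not fix anything, so what is going on with the agent service?

The startup script gives a basic picture: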

[[email protected] libvirt]# cat /etc/rc.d/init.d/cloudstack-agent

#!/bin/bash

# chkconfig: 35 99 10
# description: Cloud Agent

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# WARNING: if this script is changed, then all other initscripts MUST BE changed to match it as well

. /etc/rc.d/init.d/functions

# set environment variables

SHORTNAME=$(basename $0 | sed -e 's/^[SK][0-9][0-9]//')
PIDFILE=/var/run/"$SHORTNAME".pid
LOCKFILE=/var/lock/subsys/"$SHORTNAME"
LOGDIR=/var/log/cloudstack/agent
LOGFILE=${LOGDIR}/agent.log
PROGNAME="Cloud Agent"
CLASS="com.cloud.agent.AgentShell"
JSVC=`which jsvc 2>/dev/null`;

# exit if we don't find jsvc
if [ -z "$JSVC" ]; then
    echo no jsvc found in path;
    exit 1;
fi

unset OPTIONS
[ -r /etc/sysconfig/"$SHORTNAME" ] && source /etc/sysconfig/"$SHORTNAME"

# The first existing directory is used for JAVA_HOME (if JAVA_HOME is not defined in $DEFAULT)
JDK_DIRS="/usr/lib/jvm/jre /usr/lib/jvm/java-7-openjdk /usr/lib/jvm/java-7-openjdk-i386 /usr/lib/jvm/java-7-openjdk-amd64 /usr/lib/jvm/java-6-openjdk /usr/lib/jvm/java-6-openjdk-i386 /usr/lib/jvm/java-6-openjdk-amd64 /usr/lib/jvm/java-6-sun"

for jdir in $JDK_DIRS; do
    if [ -r "$jdir/bin/java" -a -z "${JAVA_HOME}" ]; then
        JAVA_HOME="$jdir"
    fi
done
export JAVA_HOME

ACP=`ls /usr/share/cloudstack-agent/lib/*.jar | tr '\n' ':' | sed s'/.$//'`
PCP=`ls /usr/share/cloudstack-agent/plugins/*.jar 2>/dev/null | tr '\n' ':' | sed s'/.$//'`

# We need to append the JSVC daemon JAR to the classpath
# AgentShell implements the JSVC daemon methods
export CLASSPATH="/usr/share/java/commons-daemon.jar:$ACP:$PCP:/etc/cloudstack/agent:/usr/share/cloudstack-common/scripts"

start() {
    echo -n $"Starting $PROGNAME: "
    if hostname --fqdn >/dev/null 2>&1 ; then
        $JSVC -Xms256m -Xmx2048m -cp "$CLASSPATH" -pidfile "$PIDFILE"             -errfile $LOGDIR/cloudstack-agent.err -outfile $LOGDIR/cloudstack-agent.out $CLASS
        RETVAL=$?
        echo
    else
        failure
        echo
        echo The host name does not resolve properly to an IP address.  Cannot start "$PROGNAME". > /dev/stderr
        RETVAL=9
    fi
    [ $RETVAL = 0 ] && touch ${LOCKFILE}
    return $RETVAL
}

stop() {
    echo -n $"Stopping $PROGNAME: "
    $JSVC -pidfile "$PIDFILE" -stop $CLASS
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && rm -f ${LOCKFILE} ${PIDFILE}
}

case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status -p ${PIDFILE} $SHORTNAME
        RETVAL=$?
        ;;
    restart)
        stop
        sleep 3
        start
        ;;
    condrestart)
        if status -p ${PIDFILE} $SHORTNAME >&/dev/null; then
            stop
            sleep 3
            start
        fi
        ;;
    *)
    echo $"Usage: $SHORTNAME {start|stop|restart|condrestart|status|help}"
    RETVAL=3
esac

exit $RETVAL

[[email protected] libvirt]# ps ax |grep jsvc.exec
 6655 ?        Ss     0:00 jsvc.exec -Xms256m -Xmx2048m -cp /usr/share/java/commons-daemon.jar:/usr/share/cloudstack-agent/lib/activation-1.1.jar:/usr/share/cloudstack-agent/lib/antisamy-1.4.3.jar:/usr/share/cloudstack-agent/lib/aopalliance-1.0.jar:/usr/share/cloudstack-agent/lib/apache-log4j-extras-1.1.jar:/usr/share/cloudstack-agent/lib/aspectjweaver-1.7.0.jar:/usr/share/cloudstack-agent/lib/aws-java-sdk-1.3.22.jar:/usr/share/cloudstack-agent/lib/batik-css-1.7.jar:/usr/share/cloudstack-agent/lib/batik-ext-1.7.jar:/usr/share/cloudstack-agent/lib/batik-util-1.7.jar:/usr/share/cloudstack-agent/lib/bcprov-jdk15-1.46.jar:/usr/share/cloudstack-agent/lib/bcprov-jdk16-1.46.jar:/usr/share/cloudstack-agent/lib/bsh-core-2.0b4.jar:/usr/share/cloudstack-agent/lib/cglib-nodep-2.2.2.jar:/usr/share/cloudstack-agent/lib/cloud-agent-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-core-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-components-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-schema-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-cluster-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-config-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-db-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-events-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-ipc-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-jobs-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-managed-context-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-rest-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-security-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-plugin-hypervisor-kvm-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-plugin-network-ovs-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-server-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-utils-4.5.2.jar:/usr/share/cloudstack-agent/lib/c
ommons-beanutils-core-1.7.0.jar:/usr/share/cloudstack-agent/lib/commons-codec-1.6.jar:/usr/share/cloudstack-agent/lib/commons-collections-3.2.1.jar:/usr/share/cloudstack-agent/lib/commons-configuration-1.8.jar:/usr/share/cloudstack-agent/lib/commons-daemon-1.0.10.jar:/usr/share/cloudstack-agent/lib/commons-dbcp-1.4.jar:/usr/share/cloudstack-agent/lib/commons-fileupload-1.2.jar:/usr/share/cloudstack-agent/lib/commons-httpclient-3.1.jar:/usr/share/cloudstack-agent/lib/commons-io-1.4.jar:/usr/share/cloudstack-agent/lib/commons-lang-2.6.jar:/usr/share/cloudstack-agent/lib/commons-logging-1.1.3.jar:/usr/share/cloudstack-agent/lib/commons-net-3.3.jar:/usr/share/cloudstack-agent/lib/commons-pool-1.6.jar:/usr/share/cloudstack-agent/lib/cxf-bundle-jaxrs-2.7.0.jar:/usr/share/cloudstack-agent/lib/dom4j-1.6.1.jar:/usr/share/cloudstack-agent/lib/ehcache-core-2.6.6.jar:/usr/share/cloudstack-agent/lib/ejb-api-3.0.jar:/usr/share/cloudstack-agent/lib/esapi-2.0.1.jar:/usr/share/cloudstack-agent/lib/geronimo-javamail_1.4_spec-1.7.1.jar:/usr/share/cloudstack-agent/lib/geronimo-servlet_3.0_spec-1.0.jar:/usr/share/cloudstack-agent/lib/gson-1.7.2.jar:/usr/share/cloudstack-agent/lib/guava-14.0-rc1.jar:/usr/share/cloudstack-agent/lib/httpclient-4.3.6.jar:/usr/share/cloudstack-agent/lib/httpcore-4.3.3.jar:/usr/share/cloudstack-agent/lib/jackson-annotations-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-core-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-core-asl-1.8.9.jar:/usr/share/cloudstack-agent/lib/jackson-databind-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-jaxrs-json-provider-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-mapper-asl-1.8.9.jar:/usr/share/cloudstack-agent/lib/jackson-module-jaxb-annotations-2.1.1.jar:/usr/share/cloudstack-agent/lib/jasypt-1.9.0.jar:/usr/share/cloudstack-agent/lib/java-ipv6-0.10.jar:/usr/share/cloudstack-agent/lib/javassist-3.12.1.GA.jar:/usr/share/cloudstack-agent/lib/javassist-3.18.1-GA.jar:/usr/share/cloudstack-agent/lib/javax.inject-1.
jar:/usr/share/cloudstack-agent/lib/javax.persistence-2.0.0.jar:/usr/share/cloudstack-agent/lib/javax.ws.rs-api-2.0-m10.jar
 6657 ?        Sl     0:05 jsvc.exec -Xms256m -Xmx2048m -cp /usr/share/java/commons-daemon.jar:/usr/share/cloudstack-agent/lib/activation-1.1.jar:/usr/share/cloudstack-agent/lib/antisamy-1.4.3.jar:/usr/share/cloudstack-agent/lib/aopalliance-1.0.jar:/usr/share/cloudstack-agent/lib/apache-log4j-extras-1.1.jar:/usr/share/cloudstack-agent/lib/aspectjweaver-1.7.0.jar:/usr/share/cloudstack-agent/lib/aws-java-sdk-1.3.22.jar:/usr/share/cloudstack-agent/lib/batik-css-1.7.jar:/usr/share/cloudstack-agent/lib/batik-ext-1.7.jar:/usr/share/cloudstack-agent/lib/batik-util-1.7.jar:/usr/share/cloudstack-agent/lib/bcprov-jdk15-1.46.jar:/usr/share/cloudstack-agent/lib/bcprov-jdk16-1.46.jar:/usr/share/cloudstack-agent/lib/bsh-core-2.0b4.jar:/usr/share/cloudstack-agent/lib/cglib-nodep-2.2.2.jar:/usr/share/cloudstack-agent/lib/cloud-agent-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-core-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-components-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-schema-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-cluster-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-config-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-db-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-events-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-ipc-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-jobs-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-managed-context-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-rest-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-security-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-plugin-hypervisor-kvm-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-plugin-network-ovs-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-server-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-utils-4.5.2.jar:/usr/share/cloudstack-agent/lib/c
ommons-beanutils-core-1.7.0.jar:/usr/share/cloudstack-agent/lib/commons-codec-1.6.jar:/usr/share/cloudstack-agent/lib/commons-collections-3.2.1.jar:/usr/share/cloudstack-agent/lib/commons-configuration-1.8.jar:/usr/share/cloudstack-agent/lib/commons-daemon-1.0.10.jar:/usr/share/cloudstack-agent/lib/commons-dbcp-1.4.jar:/usr/share/cloudstack-agent/lib/commons-fileupload-1.2.jar:/usr/share/cloudstack-agent/lib/commons-httpclient-3.1.jar:/usr/share/cloudstack-agent/lib/commons-io-1.4.jar:/usr/share/cloudstack-agent/lib/commons-lang-2.6.jar:/usr/share/cloudstack-agent/lib/commons-logging-1.1.3.jar:/usr/share/cloudstack-agent/lib/commons-net-3.3.jar:/usr/share/cloudstack-agent/lib/commons-pool-1.6.jar:/usr/share/cloudstack-agent/lib/cxf-bundle-jaxrs-2.7.0.jar:/usr/share/cloudstack-agent/lib/dom4j-1.6.1.jar:/usr/share/cloudstack-agent/lib/ehcache-core-2.6.6.jar:/usr/share/cloudstack-agent/lib/ejb-api-3.0.jar:/usr/share/cloudstack-agent/lib/esapi-2.0.1.jar:/usr/share/cloudstack-agent/lib/geronimo-javamail_1.4_spec-1.7.1.jar:/usr/share/cloudstack-agent/lib/geronimo-servlet_3.0_spec-1.0.jar:/usr/share/cloudstack-agent/lib/gson-1.7.2.jar:/usr/share/cloudstack-agent/lib/guava-14.0-rc1.jar:/usr/share/cloudstack-agent/lib/httpclient-4.3.6.jar:/usr/share/cloudstack-agent/lib/httpcore-4.3.3.jar:/usr/share/cloudstack-agent/lib/jackson-annotations-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-core-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-core-asl-1.8.9.jar:/usr/share/cloudstack-agent/lib/jackson-databind-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-jaxrs-json-provider-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-mapper-asl-1.8.9.jar:/usr/share/cloudstack-agent/lib/jackson-module-jaxb-annotations-2.1.1.jar:/usr/share/cloudstack-agent/lib/jasypt-1.9.0.jar:/usr/share/cloudstack-agent/lib/java-ipv6-0.10.jar:/usr/share/cloudstack-agent/lib/javassist-3.12.1.GA.jar:/usr/share/cloudstack-agent/lib/javassist-3.18.1-GA.jar:/usr/share/cloudstack-agent/lib/javax.inject-1.
jar:/usr/share/cloudstack-agent/lib/javax.persistence-2.0.0.jar:/usr/share/cloudstack-agent/lib/javax.ws.rs-api-2.0-m10.jar

[[email protected] bin]# service cloudstack-agent status
cloudstack-agent (pid  6657) is running...
[[email protected] bin]# service cloudstack-agent stop
Stopping Cloud Agent:

[[email protected] bin]# service cloudstack-agent status
cloudstack-agent (pid  6657) is running...

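ps ax |grep jsvc.exec also confirms that the process still exists.

That was an eye-opener, and it also exposed the problem with having used restart earlier: the failed stop had been masked. Frustrating? There was no time to dwell on it, though, because what followed was far from that simple...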
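
The lesson here can be turned into a habit: stop the service, then verify that the process is really gone before starting it again. A minimal sketch; wait_for_exit is a hypothetical helper, and the 15-second limit in the usage comment is arbitrary:

```shell
# wait_for_exit PID [SECONDS]
# Returns 0 once PID no longer exists, or 1 if it is still alive after
# SECONDS (default 10). (Hypothetical helper.)
wait_for_exit() {
    local pid=$1 limit=${2:-10} waited=0
    # kill -0 sends no signal; it only tests whether the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$limit" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Example (not run here):
#   pid=$(cat /var/run/cloudstack-agent.pid)
#   service cloudstack-agent stop
#   wait_for_exit "$pid" 15 || echo "stop failed: $pid is still running"
```

Reading the pid before calling stop matters, because the init script removes the pid file when jsvc reports success.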

[[email protected] bin]# kill -9 6655 6657
[[email protected] bin]# kill -9 6655 6657
-bash: kill: (6655) - No such process
-bash: kill: (6657) - No such process
[[email protected] bin]# service cloudstack-agent status
cloudstack-agent dead but pid file exists
[[email protected] bin]# rm /var/run/cloudstack-agent.pid
rm: remove regular file "/var/run/cloudstack-agent.pid"? y
[[email protected] bin]# service cloudstack-agent status
cloudstack-agent dead but subsys locked
[[email protected] bin]# service cloudstack-agent start
[[email protected] bin]# service cloudstack-agent status
cloudstack-agent (pid  109382) is running...
[[email protected] bin]# netstat -antp |grep 8250
tcp        0      0 192.168.14.20:22220         192.168.14.10:8250          ESTABLISHED 109382/jsvc.exec 
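
The manual cleanup above can be wrapped in a small guard that removes a pid file only when the process it names is gone. A sketch; remove_stale_file is a hypothetical helper, it assumes the caller is allowed to signal the process (for example root), and the subsys lock path in the usage comment comes from LOCKFILE in the init script:

```shell
# remove_stale_file PIDFILE
# Deletes PIDFILE if the pid stored in it no longer exists. Returns 1 and
# leaves the file alone if that process is still alive. (Hypothetical helper.)
remove_stale_file() {
    local pidfile=$1 pid
    [ -f "$pidfile" ] || return 0          # nothing to clean up
    pid=$(cat "$pidfile")
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        echo "process $pid is still alive; keeping $pidfile" >&2
        return 1
    fi
    rm -f "$pidfile"
}

# Example (not run here):
#   remove_stale_file /var/run/cloudstack-agent.pid &&
#       rm -f /var/lock/subsys/cloudstack-agent
```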

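After this the agent status was back to normal, but libvirtd still could not be killed. Soon the connection shown by netstat -antp |grep 8250 disappeared again, and on the CloudStack master the host's monitored state changed from Up to Disconnected. Still, that is not Down, so it was progress.

Start a libvirtd -d and take a look: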

[[email protected] bin]# libvirtd -d
[[email protected] bin]# ps ax |grep libvirtd
    6485 ?        R    863:37 libvirtd --daemon -l
  130057 ?        Sl     0:38 libvirtd -d
  28904 pts/0     S+     0:00 grep libvirtd

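Then I manually clicked force reconnect for this host on the CloudStack master platform, and it succeeded: the host's monitored state changed from Disconnected to Up. Trying to kill 6485 again at this point still failed, so I tried issuing a pause VM command from the CloudStack master management platform, and the VM was paused successfully. Back on the server, the originally hung libvirtd process had disappeared.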

[[email protected] bin]# libvirtd -d
[[email protected] bin]# ps ax |grep libvirtd
  130057 ?        Sl     0:38 libvirtd -d
  28904 pts/0     S+     0:00 grep libvirtd

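Finally, to avoid the error [libvirtd -d -l / libvirtd: error: Unable to obtain pidfile. Check /var/log/messages or run without --daemon for more info.], I simply commented out LIBVIRTD_ARGS=-l in /etc/sysconfig/libvirtd (so that it reads #LIBVIRTD_ARGS=-l). In testing, CloudStack operations worked normally.

At this point the platform had regained control of the host and the abnormal libvirtd process was gone. The problem is tentatively attributed to a minor issue in how cloudstack-agent handles the signals it sends to libvirtd. I will analyse the jsvc process separately later, to reproduce the problem and fix it at the root.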

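Reflections:

When dealing with a misbehaving service, don't use restart on the command line; debug with stop and kill instead. A lesson learned the hard way!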

---火罐儿

This article is the author's original work; please credit the source when reposting.

Time: 2024-11-07 15:02:23
