A full record of troubleshooting frequent, irregular server reboots caused by a motherboard fault

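Server: HP DL385 G7

Operating system: SUSE 10 SP3

Database: Oracle 11g R2

Cluster software: VCS, two-node active/standby

Environment: two servers running an Oracle database with active/standby failover managed by VCS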

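Symptoms:

1. Both database hosts rebooted frequently at irregular intervals, and none of the reboots left any record in the operating system's message log.

2. At system startup, hardware-related error messages appeared in the message log.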
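Since the reboots themselves leave no log entry, one way to quantify how often they happen is to count boot markers in the log after the fact. The following is a minimal sketch, assuming the log lives at /var/log/messages and that the klogd startup line seen in this system's log ("klogd 1.4.1, log source = /proc/kmsg started.") appears once per boot. The function name boots_per_day is hypothetical.

```shell
#!/bin/sh
# boots_per_day FILE: count boot markers per calendar day in a
# syslog-style file. Assumption: klogd logs one "... started." line per boot.
boots_per_day() {
    # $1 and $2 of each syslog line are the month and the day of month.
    awk '/kernel: klogd .* started/ { n[$1 " " $2]++ }
         END { for (d in n) print d, n[d] }' "$1" | sort
}

# Run against the live log only if it is readable on this machine.
if [ -r /var/log/messages ]; then
    boots_per_day /var/log/messages
fi
```

Each output line has the form "month day count". The sort is plain text order, so days are ordered correctly only within a single month.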

Contents of the message log:

-------------------------------------------------------------------------------------------------------------

Oct 27 17:51:01 linux10 /usr/sbin/cron[5968]: (CRON) STARTUP (V5.0)
Oct 27 17:51:02 linux10 sshd[6047]: Server listening on 0.0.0.0 port 22.
Oct 27 17:51:02 linux10 rcpowersaved: CPU frequency scaling is not supported by your processor.
Oct 27 17:51:02 linux10 rcpowersaved: enter 'CPUFREQ_ENABLED=no' in /etc/powersave/cpufreq to avoid this warning.
Oct 27 17:51:02 linux10 rcpowersaved: Cannot load cpufreq governors - No cpufreq driver available
Oct 27 17:51:03 linux10 rcpowersaved: s2ram does not know your machine. See 's2ram -i' for details. (127)
Oct 27 17:51:03 linux10 rcpowersaved: Use SUSPEND2RAM_FORCE=yes to override this detection.
Oct 27 17:51:03 linux10 modprobe: FATAL: Error running install command for binfmt_misc
Oct 27 17:51:06 linux10 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Oct 27 17:51:06 linux10 kernel: Floppy drive(s): fd0 is 1.44M
Oct 27 17:51:06 linux10 syslog-ng[5762]: Changing permissions on special file /dev/xconsole
Oct 27 17:51:06 linux10 syslog-ng[5762]: Changing permissions on special file /dev/tty10
Oct 27 17:51:06 linux10 kernel: JBD: barrier-based sync failed on dm-10 - disabling barriers
Oct 27 17:51:06 linux10 kernel: JBD: barrier-based sync failed on dm-11 - disabling barriers
Oct 27 17:51:06 linux10 kernel: JBD: barrier-based sync failed on dm-12 - disabling barriers
Oct 27 17:51:06 linux10 kernel: AppArmor: AppArmor initialized
Oct 27 17:51:06 linux10 kernel: audit(1414403451.182:2): info="AppArmor initialized" pid=4403
Oct 27 17:51:06 linux10 kernel: floppy0: no floppy controllers found
Oct 27 17:51:06 linux10 kernel: ACPI: Power Button (FF) [PWRF]
Oct 27 17:51:06 linux10 kernel: rdac: device handler unregistered
Oct 27 17:51:06 linux10 kernel: No dock devices found.
Oct 27 17:51:06 linux10 kernel: bnx2: eth0: using MSI
Oct 27 17:51:06 linux10 kernel: bnx2: eth1: using MSI
Oct 27 17:51:06 linux10 kernel: Ethernet Channel Bonding Driver: v3.2.5 (March 21, 2008)
Oct 27 17:51:06 linux10 kernel: bonding: Warning: either miimon or arp_interval and arp_ip_target module parameters must be specified, otherwise bonding will not detect link failures! see bonding.txt for details.
Oct 27 17:51:06 linux10 kernel: JBD: barrier-based sync failed on dm-8 - disabling barriers
Oct 27 17:51:06 linux10 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Oct 27 17:51:06 linux10 kernel: bonding: bond0: setting mode to active-backup (1).
Oct 27 17:51:06 linux10 kernel: bonding: bond0: Setting MII monitoring interval to 100.
Oct 27 17:51:06 linux10 kernel: bonding: bond0: Setting use_carrier to 0.
Oct 27 17:51:06 linux10 kernel: bnx2: eth0: using MSI
Oct 27 17:51:06 linux10 kernel: bonding: bond0: enslaving eth0 as a backup interface with a down link.
Oct 27 17:51:06 linux10 kernel: bnx2: eth1: using MSI
Oct 27 17:51:06 linux10 kernel: bonding: bond0: enslaving eth1 as a backup interface with a down link.
Oct 27 17:51:06 linux10 kernel: audit(1414403461.814:3): audit_pid=5906 old=0 by auid=4294967295
Oct 27 17:51:06 linux10 kernel: llt: module not supported by Novell, setting U taint flag.
Oct 27 17:51:06 linux10 kernel: LLT INFO V-14-1-10009 LLT 5.1.100.000-SP1GA Protocol available
Oct 27 17:51:06 linux10 kernel: bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
Oct 27 17:51:06 linux10 kernel: bonding: bond0: link status definitely up for interface eth1.
Oct 27 17:51:06 linux10 kernel: bonding: bond0: making interface eth1 the new active one.
Oct 27 17:51:06 linux10 kernel: bonding: bond0: first active interface up!
Oct 27 17:51:06 linux10 kernel: bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
Oct 27 17:51:06 linux10 kernel: powernow-k8: Found 4 AMD Opteron(tm) Processor 6134 processors (16 cpu cores) (version 2.20.00)
Oct 27 17:51:06 linux10 kernel: powernow-k8: MP systems not supported by PSB BIOS structure

…………

Oct 28 18:10:01 linux10 /usr/sbin/cron[17099]: (root) CMD (/usr/sbin/ntpdate 172.29.141.162)
Oct 28 18:11:14 linux10 zmd: ShutdownManager (WARN): Preparing to sleep...
Oct 28 18:11:15 linux10 zmd: ShutdownManager (WARN): Going to sleep, waking up at 10/29/2014 17:41:08
Oct 28 18:11:49 linux10 syslog-ng[5762]: Error connecting to remote host AF_INET(172.29.141.162:5140), reattempting in 60 seconds

…………

-----------------------------------------------------------------------

The log above shows two problems:

1. zmd: ShutdownManager (WARN): Preparing to sleep...

2. The following messages:

Oct 27 17:51:02 linux10 rcpowersaved: CPU frequency scaling is not supported by your processor.
Oct 27 17:51:02 linux10 rcpowersaved: enter 'CPUFREQ_ENABLED=no' in /etc/powersave/cpufreq to avoid this warning.
Oct 27 17:51:02 linux10 rcpowersaved: Cannot load cpufreq governors - No cpufreq driver available

…………

Oct 27 17:51:06 linux10 kernel: JBD: barrier-based sync failed on dm-10 - disabling barriers
Oct 27 17:51:06 linux10 kernel: JBD: barrier-based sync failed on dm-11 - disabling barriers
Oct 27 17:51:06 linux10 kernel: JBD: barrier-based sync failed on dm-12 - disabling barriers
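The two groups of messages singled out above can be pulled out of a full log with a single filter, which helps when checking whether they recur after a change. This is a rough sketch. The function name flagged_lines is hypothetical, and the patterns are taken directly from the log lines quoted in this article.

```shell
#!/bin/sh
# flagged_lines FILE: print only the message types this article singles out
# (zmd ShutdownManager warnings, rcpowersaved messages, JBD barrier failures).
flagged_lines() {
    grep -E 'zmd: ShutdownManager \(WARN\)|rcpowersaved: |JBD: barrier-based sync failed' "$1"
}

# Run against the live log only if it is readable on this machine.
if [ -r /var/log/messages ]; then
    flagged_lines /var/log/messages
fi
```

If the function prints nothing, the file contains none of these three message types.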

Problem 1 is caused by the ZMD service. Novell describes the zmd service as follows:

The zmd daemon performs software management functions on the ZENworks managed device, including updating, installing, and removing software, and performing basic queries of the device's package management database. Typically, these management tasks are initiated through the ZENworks Control Center or the rug, zen-installer, zen-updater, or zen-remover utilities, which means you should not need to interact directly with zmd.

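The ZMD service mainly handles software updates and installation management, and it starts automatically at boot. Once running, it goes online to check for updates every six hours by default. While updating it occupies port 80, so it often causes port conflicts with servers such as Tomcat. After software has been installed or updated, the service can therefore be shut down: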

# /etc/init.d/novell-zmd status
Checking for ZENworks Management Daemon:                             running

# /etc/init.d/novell-zmd stop
Shutting down ZENworks Management Daemon                              done

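Note: with this service stopped, installing software becomes more cumbersome, so it can be started again when needed. Also be aware that the service may keep /etc/mtab locked for a long time while it is updating.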

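Solution:

After the novell-zmd service was stopped, this log message no longer appeared.

To speed up booting, the novell-zmd service is sometimes also removed from the boot-time services:

chkconfig --del novell-zmd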

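Problem 2:

Taken on its own, the log only says that the CPU does not support frequency scaling. Since no other anomalies were found in either the operating system logs or the VCS logs, I suspected a hardware fault in the server. A look in the machine room showed that the fault LED with the power symbol on the server's front panel was lit orange-red. At that point I could more or less relax: the hardware was definitely at fault. I collected the hardware logs and contacted HP, who confirmed a motherboard fault. After the motherboard was replaced, the server stopped rebooting.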
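The article does not say which tools were used to collect the hardware logs. As one possible approach, the sketch below saves the IPMI System Event Log with ipmitool and the HP Integrated Management Log with hpasmcli, if those tools happen to be installed. Both tool choices, the function name collect_hw_logs, and the output file names are assumptions, not the author's procedure.

```shell
#!/bin/sh
# collect_hw_logs DIR: save whatever hardware event logs are reachable into DIR.
# Assumption: ipmitool and/or the HP health tools (hpasmcli) may be installed;
# each one is skipped silently when it is not present.
collect_hw_logs() {
    outdir=$1
    mkdir -p "$outdir" || return 1

    # IPMI System Event Log (usually needs root and the IPMI kernel drivers).
    if command -v ipmitool >/dev/null 2>&1; then
        ipmitool sel elist > "$outdir/ipmi_sel.txt" 2>&1 ||
            echo "ipmitool failed, see $outdir/ipmi_sel.txt" >&2
    fi

    # HP Integrated Management Log.
    if command -v hpasmcli >/dev/null 2>&1; then
        hpasmcli -s "show iml" > "$outdir/hp_iml.txt" 2>&1 ||
            echo "hpasmcli failed, see $outdir/hp_iml.txt" >&2
    fi

    # Keep a copy of the OS log alongside the hardware logs.
    if [ -r /var/log/messages ]; then
        cp /var/log/messages "$outdir/messages.copy"
    fi
}
```

A call such as collect_hw_logs /tmp/hwlogs leaves one file per available source in that directory, ready to send to the vendor.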

Time: 2024-10-17 00:25:28
