[Translated from MOS article] Why Does GI Rebootless Fencing Fail?

Reference:

Why Grid Infrastructure Rebootless Node Fencing Fails (Doc ID 1502282.1)

Applies to:

Oracle Server - Enterprise Edition - Version 11.2.0.2 and later

Information in this document applies to any platform.

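Purpose:

Rebootless Fencing is a new feature introduced in 11.2.0.2. Before 11.2.0.2, an eviction rebooted the node. With Rebootless Fencing, GI instead attempts to gracefully stop the Grid Infrastructure stack on the evicted node, so that a node reboot can be avoided.

If Rebootless Fencing fails, the evicted node is rebooted anyway. This article lists the common causes of Rebootless Fencing failure.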

Details:

1. Resources fail to stop.

If one or more resources fail to stop, rebootless fencing fails and the node is rebooted.

In this case, rebootless fencing fails on node2 after a network split brain, and node2 is rebooted as expected:

<GI_HOME>/log/<node>/alert<node>.log from evicted node:

..
2012-09-11 12:04:34.363
[cssd(18834)]CRS-1610:Network communication with node racnode1 (1) missing for 90% of timeout interval.  Removal of this node from cluster in 2.020 seconds
2012-09-11 12:04:36.379
[cssd(18834)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-09-11 12:04:36.379
[cssd(18834)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/racnode2/cssd/ocssd.log
2012-09-11 12:04:36.399
[cssd(18834)]CRS-1652:Starting clean up of CRSD resources.
2012-09-11 12:04:36.586
[crsd(26115)]CRS-5833:Cleaning resource 'zDRMON.sh.racnode2 1 1' failed as part of reboot-less node fencing
2012-09-11 12:04:36.588
[cssd(18834)]CRS-1653:The clean up of the CRSD resources failed.                    ##>> user resource fails to be cleaned
2012-09-11 12:04:37.042
[ohasd(16821)]CRS-2765:Resource 'ora.evmd' has failed on server 'racnode2'.
2012-09-11 12:04:37.052
[/ocw/grid/bin/scriptagent.bin(27696)]CRS-5822:Agent '/ocw/grid/bin/scriptagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:4:10} in /ocw/grid/log/racnode2/agent/crsd/scriptagent_oracle/scriptagent_oracle.log.
2012-09-11 12:04:37.062
[ohasd(16821)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode2'.                 ##>> node rebooted after this message, in some cases, this message won't be there
2012-09-11 12:10:47.356
[ohasd(16677)]CRS-2112:The OLR service started on node racnode2.
2012-09-11 12:10:47.521
[ohasd(16677)]CRS-1301:Oracle High Availability Service started on node racnode2.
2012-09-11 12:10:47.539
[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-09-11-12:04:37.140      ##>> reboot advisory shows both cssdagent and cssdmonitor took the action to reboot
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.594
[ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-09-11-12:04:37.139
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.605
[ohasd(16677)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
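
The alert log above can be scanned mechanically for the two messages that mark a failed rebootless fencing attempt: CRS-5833 (a resource could not be cleaned) and CRS-1653 (the clean up of CRSD resources failed). The following is a minimal sketch, not an Oracle-supplied tool. It assumes the layout seen in the excerpt, where a timestamp line is followed by one or more message lines, and the names parse_alert_log, failed_cleanups and fencing_failed are hypothetical helpers introduced here for illustration.

```python
import re
from typing import NamedTuple


class AlertEntry(NamedTuple):
    timestamp: str
    process: str
    pid: int
    code: str
    text: str


# A line that holds only a timestamp, e.g. "2012-09-11 12:04:36.586".
TS_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}$")
# A message line, e.g. "[crsd(26115)]CRS-5833:Cleaning resource ...".
MSG_RE = re.compile(
    r"^\[(?P<proc>[^\]()]+)\((?P<pid>\d+)\)\](?P<code>CRS-\d+):(?P<text>.*)$"
)
# Resource name quoted inside a CRS-5833 message.
RESOURCE_RE = re.compile(r"Cleaning resource '([^']+)' failed")


def parse_alert_log(lines):
    """Pair each message line with the most recent timestamp line."""
    entries = []
    current_ts = None
    for raw in lines:
        line = raw.strip()
        if TS_RE.match(line):
            current_ts = line
            continue
        m = MSG_RE.match(line)
        if m and current_ts is not None:
            entries.append(
                AlertEntry(current_ts, m["proc"], int(m["pid"]),
                           m["code"], m["text"].strip())
            )
    return entries


def failed_cleanups(entries):
    """Return (timestamp, resource) for every CRS-5833 message."""
    found = []
    for e in entries:
        if e.code == "CRS-5833":
            m = RESOURCE_RE.search(e.text)
            if m:
                found.append((e.timestamp, m.group(1)))
    return found


def fencing_failed(entries):
    """True if CSSD reported that the CRSD resource clean up failed."""
    return any(e.code == "CRS-1653" for e in entries)


# Lines copied from the alert log excerpt in this article.
SAMPLE = """\
2012-09-11 12:04:36.399
[cssd(18834)]CRS-1652:Starting clean up of CRSD resources.
2012-09-11 12:04:36.586
[crsd(26115)]CRS-5833:Cleaning resource 'zDRMON.sh.racnode2 1 1' failed as part of reboot-less node fencing
2012-09-11 12:04:36.588
[cssd(18834)]CRS-1653:The clean up of the CRSD resources failed.
""".splitlines()

if __name__ == "__main__":
    parsed = parse_alert_log(SAMPLE)
    print("fencing failed:", fencing_failed(parsed))
    for ts, resource in failed_cleanups(parsed):
        print(ts, resource)
```

In practice the lines would come from reading alert<node>.log on the evicted node. The resource names returned by failed_cleanups are the ones to investigate first.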

When a resource fails to stop, cssdagent, cssdmonitor, or both will try to reboot the node. Sample logs follow.

<GI_HOME>/agent/ohasd/oracssdmonitor_root/oracssdmonitor_root.log

2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got posted

2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync

2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread

2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop

2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got HB signal

2012-09-11 12:04:36.400: [ USRTHRD][1097382208] clsnwork_process_work: calling sync

2012-09-11 12:04:36.413: [ USRTHRD][1097382208] clsnwork_process_work: sync completed

2012-09-11 12:04:37.035: [ CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22

2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27

2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread

2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop

2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x8e3e60 [00000000000001b7] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=3165a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=7e7139a5-3165a05b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)

2012-09-11 12:04:37.035: [ USRTHRD][1097382208] clsnwork_process_work: calling sync

2012-09-11 12:04:37.035: [ CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1

2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus:  communications failed (0/3/-1)

2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8

2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...

2012-09-11 12:04:37.036: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRPC : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost (12) ]  failed to send on endp 0x8e3e60 [00000000000001b7] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=3165a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=7e7139a5-3165a05b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len 80, flags 0x8000000

2012-09-11 12:04:37.036: [ CSSCLNT][1098959168]clsssServerRPC: send failed with err 12, msg type 7

2012-09-11 12:04:37.036: [ CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3

2012-09-11 12:04:37.139: [ USRTHRD][1097382208] clsnwork_process_work: sync completed

2012-09-11 12:04:37.139: [ USRTHRD][1097382208] clsnSyncComplete: posting omon

<GI_HOME>/agent/ohasd/oracssdagent_root/oracssdagent_root.log

2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got posted
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got HB signal
2012-09-11 12:04:36.400: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:36.413: [ USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-09-11 12:04:37.035: [ CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRPC : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost (12) ]  failed to send on endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len 80, flags 0x8000000
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssServerRPC: send failed with err 12, msg type 7
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3

2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus:  communications failed (0/3/-1)

2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8

2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
2012-09-11 12:04:37.036: [ USRTHRD][1097382208] clsnwork_process_work: calling sync

As CRSD resources (user resources) failed to stop, crsd.log can be a starting point for further debugging.
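
When digging through a large log for the moments around the failed clean up, it can help to keep only the lines close to the timestamp reported in the alert log. The sketch below is a hypothetical helper (lines_near is a name introduced here, not part of any Oracle tooling). It relies on the "YYYY-MM-DD HH:MM:SS.mmm:" line prefix visible in the agent logs above. Whether crsd.log uses exactly the same prefix is an assumption that should be checked against the actual file.

```python
import re
from datetime import datetime, timedelta

# Timestamp prefix as seen in the agent log excerpts in this article.
LINE_TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):")
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def lines_near(lines, center, window_seconds=5.0, keyword=None):
    """Keep lines whose timestamp is within window_seconds of center.

    center is a string such as "2012-09-11 12:04:36.588".
    If keyword is given, only lines containing it are kept.
    Lines without a recognizable timestamp prefix are skipped.
    """
    center_dt = datetime.strptime(center, TS_FORMAT)
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in lines:
        m = LINE_TS_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), TS_FORMAT)
        if abs(ts - center_dt) > window:
            continue
        if keyword is not None and keyword not in line:
            continue
        selected.append(line.rstrip("\n"))
    return selected


# Lines copied from the cssdmonitor log excerpt in this article.
AGENT_SAMPLE = [
    "2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync",
    "2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...",
    "2012-09-11 12:04:37.139: [ USRTHRD][1097382208] clsnSyncComplete: posting omon",
]

if __name__ == "__main__":
    # CRS-1653 was logged at 12:04:36.588 in the alert log above.
    for hit in lines_near(AGENT_SAMPLE, "2012-09-11 12:04:36.588", 1.0):
        print(hit)
```

For crsd.log, passing the failed resource name from the CRS-5833 message as keyword narrows the output to the lines that mention that resource.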

Time: 2024-10-24 16:39:46
