Why Does GI Rebootless Fencing Fail?
Reference:
Why Grid Infrastructure Rebootless Node Fencing Fails (Doc ID 1502282.1)
Applies to:
Oracle Server - Enterprise Edition - Version 11.2.0.2 and later
Information in this document applies to any platform.
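Purpose:
Rebootless Fencing is a new feature introduced in 11.2.0.2. When an eviction occurs, Rebootless Fencing replaces the node reboot that was performed in versions before 11.2.0.2: it tries to stop GI gracefully on the evicted node so that the node does not have to be rebooted.
If Rebootless Fencing fails, the evicted node is rebooted. This article lists the common reasons why Rebootless Fencing fails.
Details: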
1. Resources fail to stop.
If one or more resources fail to stop, rebootless fencing fails and the node is rebooted.
In this example, rebootless fencing fails on node 2 after a network split brain, and node 2 is rebooted as expected:
<GI_HOME>/log/<node>/alert<node>.log from evicted node:
..
2012-09-11 12:04:34.363 [cssd(18834)]CRS-1610:Network communication with node racnode1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 2.020 seconds
2012-09-11 12:04:36.379 [cssd(18834)]CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /ocw/grid/log/racnode2/cssd/ocssd.log.
2012-09-11 12:04:36.379 [cssd(18834)]CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /ocw/grid/log/racnode2/cssd/ocssd.log
2012-09-11 12:04:36.399 [cssd(18834)]CRS-1652:Starting clean up of CRSD resources.
2012-09-11 12:04:36.586 [crsd(26115)]CRS-5833:Cleaning resource 'zDRMON.sh.racnode2 1 1' failed as part of reboot-less node fencing
2012-09-11 12:04:36.588 [cssd(18834)]CRS-1653:The clean up of the CRSD resources failed. ##>> user resource fails to be cleaned
2012-09-11 12:04:37.042 [ohasd(16821)]CRS-2765:Resource 'ora.evmd' has failed on server 'racnode2'.
2012-09-11 12:04:37.052 [/ocw/grid/bin/scriptagent.bin(27696)]CRS-5822:Agent '/ocw/grid/bin/scriptagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:4:10} in /ocw/grid/log/racnode2/agent/crsd/scriptagent_oracle/scriptagent_oracle.log.
2012-09-11 12:04:37.062 [ohasd(16821)]CRS-2765:Resource 'ora.crsd' has failed on server 'racnode2'. ##>> node rebooted after this message, in some cases, this message won't be there
2012-09-11 12:10:47.356 [ohasd(16677)]CRS-2112:The OLR service started on node racnode2.
2012-09-11 12:10:47.521 [ohasd(16677)]CRS-1301:Oracle High Availability Service started on node racnode2.
2012-09-11 12:10:47.539 [ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssagent, with time stamp: L-2012-09-11-12:04:37.140 ##>> reboot advisory shows both cssdagent and cssdmonitor took the action to reboot
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.594 [ohasd(16677)]CRS-8011:reboot advisory message from host: racnode2, component: cssmonit, with time stamp: L-2012-09-11-12:04:37.139
[ohasd(16677)]CRS-8013:reboot advisory message text: clsnomon_status: need to reboot, unexpected failure 8 received from CSS
2012-09-11 12:10:47.605 [ohasd(16677)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 2 were announced and 0 errors occurred
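The two facts worth pulling out of the alert log above are which resource failed to be cleaned (CRS-5833) and which component issued a reboot advisory (CRS-8011). The following is a minimal Python sketch of that extraction, not an Oracle tool. The function name summarize_alert_log is hypothetical, and the regular expressions assume the message wording matches the sample shown here, which may differ in other versions.

```python
import re

# Patterns are modeled on the sample alert log in this note (an assumption:
# other GI versions may word these messages differently).
CLEAN_FAILED = re.compile(r"CRS-5833:Cleaning resource '([^']+)' failed")
REBOOT_ADVISORY = re.compile(
    r"CRS-8011:reboot advisory message from host: ([^,\s]+), component: ([^,\s]+),"
)


def summarize_alert_log(lines):
    """Return (failed_resources, advisories) found in alert log lines.

    failed_resources: resource names reported by CRS-5833.
    advisories: (host, component) pairs reported by CRS-8011.
    """
    failed_resources = []
    advisories = []
    for line in lines:
        match = CLEAN_FAILED.search(line)
        if match:
            failed_resources.append(match.group(1))
        match = REBOOT_ADVISORY.search(line)
        if match:
            advisories.append((match.group(1), match.group(2)))
    return failed_resources, advisories
```

Passing the lines of the alert log (for example, from open(path) on the evicted node's alert<node>.log) returns the resource names to investigate and the components that requested the reboot.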
When a resource fails to stop, cssdagent, cssdmonitor, or both will try to reboot the node. Sample logs follow.
<GI_HOME>/agent/ohasd/oracssdmonitor_root/oracssdmonitor_root.log
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got posted
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got HB signal
2012-09-11 12:04:36.400: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:36.413: [ USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.035: [ CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x8e3e60 [00000000000001b7] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=3165a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=7e7139a5-3165a05b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-09-11 12:04:37.035: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:37.035: [ CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: communications failed (0/3/-1)
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8
2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
2012-09-11 12:04:37.036: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRPC : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0x8e3e60 [00000000000001b7] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=3165a05b-7e7139a5-18801))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=7e7139a5-3165a05b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len 80, flags 0x8000000
2012-09-11 12:04:37.036: [ CSSCLNT][1098959168]clsssServerRPC: send failed with err 12, msg type 7
2012-09-11 12:04:37.036: [ CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3
2012-09-11 12:04:37.139: [ USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.139: [ USRTHRD][1097382208] clsnSyncComplete: posting omon
<GI_HOME>/agent/ohasd/oracssdagent_root/oracssdagent_root.log
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got posted
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: shutdown initiated by CSS, requested to sync
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:36.400: [ USRTHRD][1095805248] clsnpollmsg_main: got HB signal
2012-09-11 12:04:36.400: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
2012-09-11 12:04:36.413: [ USRTHRD][1097382208] clsnwork_process_work: sync completed
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssRecvMsg: got a disconnect from the server while waiting for message type 27
2012-09-11 12:04:37.035: [ CSSCLNT][1095805248]clsssRecvMsg: got a disconnect from the server while waiting for message type 22
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcInternalSend: connection not valid for send operation endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, ret gipcretConnectionLost (12)
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnwork_queue: posting worker thread
2012-09-11 12:04:37.035: [ USRTHRD][1095805248] clsnpollmsg_main: exiting check loop
2012-09-11 12:04:37.035: [GIPCXCPT][1098959168]gipcSendSyncF [clsssServerRPC : clsss.c : 6272]: EXCEPTION[ ret gipcretConnectionLost (12) ] failed to send on endp 0x2aaab4014900 [00000000000001c0] { gipcEndpoint : localAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=)(GIPCID=561e3f6b-a0a3602e-18817))', remoteAddr 'clsc://(ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_racnode2_)(GIPCID=a0a3602e-561e3f6b-18834))', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 18834, flags 0x3861e, usrFlags 0x20010 }, addr 0000000000000000, buf 0x4180bd80, len 80, flags 0x8000000
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssServerRPC: send failed with err 12, msg type 7
2012-09-11 12:04:37.035: [ CSSCLNT][1098959168]clsssCommonClientExit: RPC failure, rc 3
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clsssRecvMsg: got a disconnect from the server while waiting for message type 1
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: communications failed (0/3/-1)
2012-09-11 12:04:37.036: [ CSSCLNT][1077418304]clssgsGroupGetStatus: returning 8
2012-09-11 12:04:37.036: [ USRTHRD][1077418304] clsnomon_status: Communications failure with CSS detected. Waiting for sync to complete...
2012-09-11 12:04:37.036: [ USRTHRD][1097382208] clsnwork_process_work: calling sync
Because the CRSD resources (user resources) failed to stop, crsd.log is a good starting point for further debugging.
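One way to narrow crsd.log down is to keep only the lines that mention the failed resource and fall close to the time cleanup started (CRS-1652 in the alert log). The Python sketch below illustrates that filter. It assumes each crsd.log line begins with a timestamp in the same "YYYY-MM-DD HH:MM:SS.mmm" form as the agent logs above; this note does not show crsd.log, so treat that format as an assumption. The function name lines_near and the 60-second default window are hypothetical choices.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, copied from the agent log samples in this note.
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
TS_LENGTH = 23  # length of "2012-09-11 12:04:36.400"


def lines_near(lines, resource, center, window_seconds=60):
    """Yield lines that mention `resource` within the window around `center`.

    `center` is a timestamp string in TS_FORMAT, e.g. the CRS-1652 time.
    Lines without a parsable leading timestamp are skipped.
    """
    center_ts = datetime.strptime(center, TS_FORMAT)
    window = timedelta(seconds=window_seconds)
    for line in lines:
        if resource not in line:
            continue
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # continuation line or a different layout
        if abs(ts - center_ts) <= window:
            yield line
```

For the example in this note, the call would use the resource name from the CRS-5833 message and "2012-09-11 12:04:36.399" as the center time.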