[Translated from a MOS article] What is Split Brain in Oracle Clusterware and RAC

Source:

What is Split Brain in Oracle Clusterware and Real Application Cluster (文档 ID 1425586.1)

Applies to:

Oracle Database - Enterprise Edition - Version 10.1.0.2 and later

Information in this document applies to any platform.

Purpose:

This note explains split brain in Oracle Clusterware and RAC, along with the errors and consequences associated with it.

Details:

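In generic terms, split brain refers to data inconsistency that originates from maintaining two separate data sets that overlap in scope.

It arises either from the network design between servers or from a failure condition in which servers that are supposed to communicate and synchronize their data with each other no longer do so.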

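There are two components that can experience split brain:

1. Clusterware layer:

Cluster nodes maintain their heartbeats through the private network and the voting disk.

When the private network fails and the cluster nodes cannot communicate with each other over it for the period defined by the misscount setting, split brain occurs.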
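The misscount check can be pictured as a simple timeout comparison. The following is only a minimal sketch, not Oracle's implementation: the function name, the node names and the misscount value used in the example are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List


def nodes_past_misscount(last_heartbeat: Dict[str, float],
                         now: float,
                         misscount: float) -> List[str]:
    """Return the nodes whose last network heartbeat is older than misscount.

    last_heartbeat maps a (hypothetical) node name to the time, in seconds,
    at which its most recent heartbeat was received.
    """
    return sorted(node for node, seen in last_heartbeat.items()
                  if now - seen > misscount)


if __name__ == "__main__":
    # Illustrative values only; 30.0 is an assumed misscount, not a stated default.
    heartbeats = {"node1": 100.0, "node2": 65.0}
    print(nodes_past_misscount(heartbeats, now=100.0, misscount=30.0))
```

A node that appears in the returned list has been silent on the private network for longer than the configured period, which is the condition under which the voting disk comes into play.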

When split brain occurs, the voting disk is used to decide which node survives and which node is evicted from the cluster. The usual voting results are:

 a. The group with more cluster nodes survives.
 b. If each group has the same number of nodes, the group containing the lower node number survives.
 c. Some improvements have been made to ensure that the node(s) with lower load survive when the eviction is caused by high system load.
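Rules (a) and (b) above can be sketched as a small selection function. This is a simplified illustration under stated assumptions, not the actual CSSD algorithm: it ignores the load-based improvement in rule (c), and the function name and the representation of a group as a set of node numbers are hypothetical.

```python
from typing import List, Set


def surviving_group(groups: List[Set[int]]) -> Set[int]:
    """Pick the sub-cluster that survives, using rules (a) and (b) only.

    Each group is a non-empty set of node numbers that can still see
    each other over the private network.
    """
    # Rule (a): the larger group wins, hence the negated size.
    # Rule (b): on a tie, the group holding the lowest node number wins.
    return min(groups, key=lambda g: (-len(g), min(g)))


if __name__ == "__main__":
    print(surviving_group([{1, 3}, {2}]))  # the larger group survives
    print(surviving_group([{2}, {1}]))     # tie: the group with node 1 survives
```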

Typically, when split brain occurs, messages similar to the following appear in ocssd.log:

[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR:
###################################
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################

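The messages above show that communication from node 2 to node 1 is not working, so node 2 can see only one node (itself). Node 1, however, is working normally and can see 2 nodes in the cluster. To avoid split brain, node 2 aborted itself.

Solution: Please contact the network administrator to check the private network layer and eliminate any network problems.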
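When reviewing many ocssd.log files, the comparison line can be pulled apart programmatically. The sketch below assumes the line has exactly the shape shown in the excerpt above; the regular expression, the function name and the dictionary keys are hypothetical and would need adjusting if the log format differs between versions.

```python
import re
from typing import Dict, Optional

# Matches "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)".
PATTERN = re.compile(
    r"my node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)\s+VS\s+"
    r"Node\((\d+)\), Leader\((\d+)\), Size\((\d+)\)"
)

KEYS = ("my_node", "my_leader", "my_size",
        "other_node", "other_leader", "other_size")


def parse_splitbrain_line(line: str) -> Optional[Dict[str, int]]:
    """Extract both sides of the split-brain comparison, or None if absent."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return dict(zip(KEYS, (int(value) for value in match.groups())))


if __name__ == "__main__":
    sample = ("[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : "
              "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)")
    print(parse_splitbrain_line(sample))
```

In the parsed result, a local size smaller than the other side's size corresponds to the situation described above, where the local node aborts itself.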

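2. RAC (database) layer:

To ensure data consistency, each instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by the background processes LMON, LMD, LMS and LCK.

If any of these processes experiences an IPC Send timeout, a communication reconfiguration and an instance eviction follow in order to avoid split brain.

Similar to the voting disk at the clusterware layer, the control file is used to determine which instance survives and which instance is evicted.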

The voting result is similar to the clusterware voting result. As a result, one or more instances will be evicted.

Common messages in the instance alert log are similar to:

alert log of instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave:
2
...

alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file
/u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file
/u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc
(incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in:
/u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc

In the example above, instance 2 LMD0 (pid 29940) is the receiver in the IPC Send timeout. Various causes can lead to an IPC Send timeout, for example:

a. Network problem

b. Process hang

c. Software bug, etc.

Please see Top 5 Issues for Instance Eviction (Document 1374110.1) for more information.

In an instance eviction case, the alert log and all background traces need to be reviewed to determine the root cause.
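As a first pass over an alert log, the eviction-related messages shown in the excerpts above can be located with a short script. This is only a rough sketch: the marker list is taken from the sample logs in this note and is not exhaustive, and the function name and labels are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Message fragments taken from the sample alert logs in this note.
MARKERS = {
    "ipc_send_timeout": re.compile(r"IPC Send timeout"),
    "reconfiguration": re.compile(r"Communications reconfiguration"),
    "split_brain_wait": re.compile(r"Waiting for clusterware split-brain resolution"),
    "eviction": re.compile(r"Evicting instance \d+ from cluster"),
    "ora_29740": re.compile(r"ORA-29740"),
}


def scan_alert_log(lines: Iterable[str]) -> List[Tuple[int, str, str]]:
    """Return (line number, label, text) for every line matching a marker."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in MARKERS.items():
            if pattern.search(line):
                hits.append((lineno, label, line.strip()))
    return hits


if __name__ == "__main__":
    sample = [
        "Mon Dec 07 19:43:05 2011",
        "IPC Send timeout detected.Sender: ospid 26318",
        "Evicting instance 2 from cluster",
        "ORA-29740: evicted by member 0, group incarnation 10",
    ]
    for hit in scan_alert_log(sample):
        print(hit)
```

The hits only point to where to start reading; the surrounding alert log entries and the referenced trace files still have to be examined to find the root cause.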

Known Issues

1. Bug 7653579 - IPC send timeout in RAC after only short period Document 7653579.8
    Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
    Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows

2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
    Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
    Fixed in: 11.2.0.1

3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
    Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1

4. Bug 7587008 - Hung RAC instance not evicted from cluster Document 7587008.8
    Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1; one-off patches available for various 11.1.0.7 releases

5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
    Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows

6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2
    BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
    BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
    Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
    Fixed in: 11.2.0.4; merge patches are available for 11.2.0.2 and 11.2.0.3

Time: 2024-12-05 08:44:12
