[Translated from a MOS article] What Is Split Brain in Oracle Clusterware and RAC

Source:

What is Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)

Applies to:

Oracle Database - Enterprise Edition - Version 10.1.0.2 and later

Information in this document applies to any platform.

Purpose:

This note explains split brain in Oracle Clusterware and RAC, along with the errors and consequences associated with it.

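Details:

In generic terms, split brain indicates data inconsistencies that originate from maintaining two separate data sets that overlap in scope, either because of the network design between servers or because of a failure condition in which servers stop communicating and synchronizing their data with each other.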

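Two components can experience split brain:

1. Clusterware layer

Cluster nodes maintain their heartbeats through the private network and the voting disk.

Split brain occurs when the private network is disrupted and the cluster nodes cannot communicate with each other over it for the period defined by the misscount setting.

In this case, the voting disk is used to determine which node(s) survive and which node(s) are evicted from the cluster. The usual voting results are: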

 a. The group with more cluster nodes survives.
 b. If each group has the same number of nodes, the group containing the lowest-numbered node survives.
 c. Some improvements have been made to ensure that the node(s) with lower load survive when the eviction is caused by high system load.
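
Rules (a) and (b) above can be sketched as a small selection function. This is a simplified model for illustration only, not Oracle's implementation: the function name `pick_surviving_group` is hypothetical, rule (b) is read as "the group that contains the lowest node number wins a tie", and the load-based rule (c) is ignored.

```python
from typing import List, Set


def pick_surviving_group(groups: List[Set[int]]) -> Set[int]:
    """Return the sub-cluster that survives under rules (a) and (b).

    Simplified model: rule (c), the load-based improvement, is not modeled.
    """
    if not groups or any(not g for g in groups):
        raise ValueError("every group must contain at least one node")
    # Rule (a): prefer the larger group (hence the negated length).
    # Rule (b): on a tie, prefer the group holding the lowest node number.
    return min(groups, key=lambda g: (-len(g), min(g)))


# A 3-node cluster split into {1, 2} and {3}: the larger group survives.
print(pick_surviving_group([{1, 2}, {3}]))
# A 2-node cluster split into {2} and {1}: the tie goes to node 1.
print(pick_surviving_group([{2}, {1}]))
```

In the two-node case this matches the log excerpt later in this section, where node 2 aborts and node 1 stays up.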

Typically, when split brain occurs, messages similar to the following appear in ocssd.log:

[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR:
###################################
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################

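The messages above show that communication from node 2 to node 1 is not working, so node 2 can see only one node (itself), while node 1 is working fine and can see two nodes in the cluster. To avoid split brain, node 2 aborted itself.

Solution: contact the network administrator to check the private network layer and eliminate any network issues.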
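
When triaging such an abort, it can help to pull the cohort comparison out of ocssd.log programmatically. The sketch below assumes the line looks exactly like the `my node(...), Leader(...), Size(...) VS Node(...)` line in the excerpt above; the format may differ between versions, and the names `COHORT_RE` and `parse_cohort_line` are hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the cohort comparison line as it appears in the excerpt above.
COHORT_RE = re.compile(
    r"my node\((?P<my_node>\d+)\), Leader\((?P<my_leader>\d+)\), "
    r"Size\((?P<my_size>\d+)\) VS "
    r"Node\((?P<other_node>\d+)\), Leader\((?P<other_leader>\d+)\), "
    r"Size\((?P<other_size>\d+)\)"
)


def parse_cohort_line(line: str) -> Optional[Dict[str, int]]:
    """Return the six numbers from a cohort line, or None if it does not match."""
    match = COHORT_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


sample = (
    "[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : "
    "my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)"
)
info = parse_cohort_line(sample)
# The local cohort (size 1) is smaller than the remote cohort (size 2),
# which is consistent with the local node aborting.
print(info["my_size"] < info["other_size"])
```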

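2. RAC (database) layer

To ensure data consistency, every instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by the background processes LMON, LMD, LMS and LCK.

If any of these processes experiences an IPC Send timeout, it triggers a communication reconfiguration and an instance eviction to avoid split brain.

Similar to the voting disk at the clusterware layer, the control file is used to determine which instance survives and which instance is evicted.

The voting result is similar to the clusterware voting result. As a result, one or more instances will be evicted.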

Common messages in instance alert log are similar to:

alert log of instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave:
2
...

alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file
/u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file
/u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc
(incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in:
/u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc

In the example above, instance 2 LMD0 (pid 29940) is the receiver in the IPC Send timeout. Various things can cause an IPC Send timeout, for example:

a. Network problem

b. Process hang

c. Bug etc

Please see Top 5 issues for Instance Eviction Document 1374110.1 for more information.

In an instance eviction case, the alert log and all background process traces need to be checked to determine the root cause.
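
As a first pass over an alert log, the eviction-related messages can be collected together with the timestamp line that precedes them. This is a minimal sketch that assumes the alert log layout shown in the excerpts above (a timestamp on its own line, followed by message lines); the keyword list is taken from those excerpts and is not exhaustive, and the names `KEYWORDS` and `eviction_events` are hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Timestamp lines such as "Mon Dec 07 19:43:05 2011".
TIMESTAMP_RE = re.compile(
    r"^[A-Z][a-z]{2} [A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{4}$"
)

# Messages seen in the alert log excerpts above (not an exhaustive list).
KEYWORDS = (
    "IPC Send timeout",
    "Communications reconfiguration",
    "Waiting for clusterware split-brain resolution",
    "Evicting instance",
    "ORA-29740",
)


def eviction_events(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, message) pairs for eviction-related lines."""
    events = []
    last_timestamp = ""  # empty until the first timestamp line is seen
    for raw in lines:
        line = raw.strip()
        if TIMESTAMP_RE.match(line):
            last_timestamp = line
        elif any(keyword in line for keyword in KEYWORDS):
            events.append((last_timestamp, line))
    return events


sample_log = """Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
""".splitlines()

for timestamp, message in eviction_events(sample_log):
    print(timestamp, "|", message)
```

Lining up the output from each instance's alert log side by side makes it easier to see which instance reported the timeout first; the background traces still have to be read to find the actual cause.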

Known Issues

1. Bug 7653579 - IPC send timeout in RAC after only short period Document 7653579.8
    Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
    Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows

2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
    Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
    Fixed in: 11.2.0.1

3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
    Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1

4. Bug 7587008 - Hung RAC instance not evicted from cluster Document 7587008.8
    Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1, one-off patches available for various 11.1.0.7 releases

5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
    Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows

6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2
    BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
    BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
    Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
    Fixed in: 11.2.0.4, with merge patches available for 11.2.0.2 and 11.2.0.3
Time: 2024-10-06 00:19:46
