What is Split Brain in Oracle Clusterware and Real Application Cluster (文档 ID 1425586.1)

In this Document

  Purpose
  Scope
  Details
  1. Clusterware layer
  2. Real Application Cluster (database) layer
  Known Issues
  References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
Information in this document applies to any platform.

PURPOSE

This note is to explain what is split brain in an Oracle Real Application cluster and what errors/consequences are associated with it.

SCOPE

For DBA and Support engineer.

DETAILS

In generic term, split-brain indicates data inconsistencies originating from the maintenance of two separate data sets with overlap in scope, either because of servers in a network design, or a failure condition based on servers not communicating and unifying their data to each other.

There are two components in Oracle Real Application Cluster implementation could experience split brain.

1. Clusterware layer

Cluster nodes maintain their heartbeat via private network and voting disk. When there is a private network disruption, cluster nodes can not communicate to each other via private network for the time period of misscount setting, split brain will happen. In such case, voting disk will be used to determine which node(s) survive and which node(s) will be evicted. The common voting result will be:

a. The group with more cluster nodes survive
b. The group with lower node member in case of same number of node(s) available in each group
c. Some improvement has been made to ensure node(s) with lower load survive in case the eviction is caused by high system load.

Commonly, one will see messages similar to the followings in ocssd.log when split brain happens:

[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR:
###################################
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################

Above messages indicate the communication from node 2 to node 1 is not working, hence node 2 only sees 1 node, but node 1 is working fine and it can see two nodes in the cluster. To avoid splitbrain, node 2 aborted itself.

Solution: Please engage network administrator to check private network layer to eliminate any network fault.

2. Real Application Cluster (database) layer

To ensure data consistency, each instance of a RAC database needs to keep heartbeat with the other instances. The heartbeat is maintained by background processes like LMON, LMD, LMS and LCK. Any of these processes experience IPC Send time out will incur communication reconfiguration and instance eviction to avoid split brain. Controlfile is used similarly to voting disk in clusterware layer to determine which instance(s) survive and which instance(s) evict. The voting result is similar to clusterware voting result. As the result, 1 or more instance(s) will be evicted.

Common messages in instance alert log are similar to:

alert log of instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave: 

...

alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file 
/u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file 
/u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc  
(incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in: 
/u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc

In above example, instance 2 LMD0 (pid 29940) is the receiver in IPC Send timeout. There could be various reasons causing IPC Send timeout. For example:

a. Network problem
b. Process hang
c. Bug etc

Please see Top 5 issues for Instance Eviction Document 1374110.1 for more information.

In case of instance eviction, alert log and all background traces need to be checked to determine the root cause.

Known Issues

1. Bug 7653579 - IPC send timeout in RAC after only short period Document 7653579.8
    Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
    Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows

2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
    Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
    Fixed in: 11.2.0.1

3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
    Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1

4. Bug 7587008 - Hung RAC instance not evicted from cluster Document  7587008.8
    Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1, one-off patch available for various 11.1.0.7 release

5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
    Fixed in 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows

6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2
    BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
    BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
    Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
    Fixed in: 11.2.0.4 and some merge patch available for 11.2.0.2 and 11.2.0.3

时间: 2024-11-06 07:38:01

What is Split Brain in Oracle Clusterware and Real Application Cluster (文档 ID 1425586.1)的相关文章

Oracle Grid Infrastructure: Understanding Split-Brain Node Eviction (文档 ID 1546004.1)

In this Document   Purpose   Scope   Details   What does "split brain" mean?   Why is this a problem?   How does the clusterware resolve a "split brain" situation?   Identifying a split-brain eviction   Finding the cohort   Understandi

Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes (文档 ID 459694.1)

Applies to: Oracle Database - Enterprise Edition - Version 10.2.0.2 to 12.1.0.1 [Release 10.2 to 12.1] Linux x86 HP-UX PA-RISC (64-bit) IBM AIX on POWER Systems (64-bit) Oracle Solaris on SPARC (64-bit) HP-UX Itanium Linux x86-64 Oracle Server Enterp

How to Analyze Problems Related to Internal Errors (ORA-600) and Core Dumps (ORA-7445) using My Oracle Support (文档 ID 260459.1)

Oracle Database - Enterprise Edition - Version 8.1.7.4 and later Information in this document applies to any platform. **Checked for relevance 06-Apr-2010 **Checked for relevance 17-Apr-2013 *** Checked for relevance on 16-Nov-2011 *** Purpose 1.1 Ab

How to Modify Public Network Information including VIP in Oracle Clusterware (文档 ID 276434.1)

APPLIES TO: Oracle Database - Enterprise Edition - Version 11.2.0.3 to 12.1.0.2 [Release 11.2 to 12.1]Information in this document applies to any platform. PURPOSE The purpose of this note is to illustrate how to change a public hostname, public IP,

Oracle Multitenant Option - 12c Frequently Asked Questions (文档 ID 1511619.1)译文

适用于: 企业版数据库--版本12.1.0.1(12.1) 本文档中的知识对所有平台均适用. 文档目的 文档描写了插接式数据库的许多方面和用法,以更好的理解该产品,同时,该文档也可做为一个快速参考手册. 问答 12c多租户架构中的CDB/PDB概念知识. 多租户架构中的可插接数据库(PDB)是什么意思? 可插接数据库(PDB)是Oracle数据库12c(12.1)中的新特性.可以在一个数据库内部拥有多个可插接数据库.可插接数据库是完全向后兼容的. 为什么要使用多租户选件? 是为了实现以下数据库整

在Oracle电子商务套件版本12.2中创建自定义应用程序(文档ID 1577707.1)

在本文档中 本笔记介绍了在Oracle电子商务套件版本12.2中创建自定义应用程序所需的基本步骤.如果您要创建新表单,报告等,则需要自定义应用程序.它们允许您将自定义编写的文件与Oracle电子商务套件提供的标准种子功能分离.在向您的环境应用修补程序或执行升级时可以保留自定义设置. 自定义数据和索引表空间默认为APPS_TS_TX_DATA和APPS_TS_TX_IDX. 注意:当没有活动的修补程序周期时,应在运行文件系统上执行本文档中描述的过程. 也可以按照此过程更正先前创建的不使用AD Sp

Mysql、Oracle、SQLServer等数据库参考文档免费分享下载

场景 MySQL 是最流行的关系型数据库管理系统,在 WEB 应用方面 MySQL 是最好的 RDBMS(Relational Database Management System:关系数据库管理系统)应用软件之一. SQL Server是由Microsoft开发和推广的关系数据库管理系统(DBMS),它最初是由Microsoft.Sybase和Ashton-Tate三家公司共同开发的,并于1988年推出了第一个OS/2版本. Oracle Database,又名Oracle RDBMS,或简称

Oracle Recommended Patches -- "Oracle JavaVM Component Database PSU" (OJVM PSU) Patches (文档 ID 1929745.1)

From: https://support.oracle.com What is "Oracle JavaVM Component Database PSU" ? Oracle JavaVM Component Database PSU is released as part of the Critical Patch Update program from October 2014 onwards.It consists of two separate patches: One fo

Deploying JRE (Native Plug-in) for Windows Clients in Oracle E-Business Suite Release 12 (文档 ID 393931.1)

In This Document Section 1: Overview Section 2: Pre-Upgrade Steps Section 3: Upgrade and Configuration Section 4: Post-installation Steps Section 5: Known Issues Section 6: Appendices This document covers the procedure to upgrade the version of the J