ORA-00600 [kjctr_pbmsg:badbmsg2]

I recently hit the error ORA-00600 [kjctr_pbmsg:badbmsg2], which caused an instance on a RAC node to restart.

ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 12379): terminating the instance due to error 484

1. The logs show the following
alert log

Mon Aug 11 23:53:10 2014
Errors in file /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc (incident=1104178):
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], [] 
Incident details in: /oracle/app/oracle/diag/rdbms/cdrdb/orcl/incident/incdir_1104178/orcl_lms1_12379_i1104178.trc
Mon Aug 11 23:53:12 2014
Dumping diagnostic data in directory=[cdmp_20140811235312], requested by (instance=1, osid=12379 (LMS1)), summary=[incident=1104178].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Mon Aug 11 23:53:13 2014
Sweep [inc][1104178]: completed
Sweep [inc2][1104178]: completed
Errors in file /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc:
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
LMS1 (ospid: 12379): terminating the instance due to error 484
Mon Aug 11 23:53:22 2014
ORA-1092 : opitsk aborting process
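
When an alert log is long, it can help to pull out just the ORA-00600 first argument and the process that terminated the instance. The following is a minimal sketch, not an Oracle tool: the regular expressions are modelled only on the lines quoted above, and the function name summarize_alert_log is made up for illustration.

```python
import re

# Patterns modelled on the alert log lines quoted above (an assumption:
# other releases or platforms may format these lines differently).
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: \[([^\]]*)\]")
TERMINATE_RE = re.compile(
    r"^(\w+) \(ospid: (\d+)\): terminating the instance due to error (\d+)"
)

def summarize_alert_log(lines):
    """Return distinct ORA-00600 first arguments and any instance terminations."""
    ora600_args = []
    terminations = []
    for line in lines:
        m = ORA600_RE.search(line)
        if m and m.group(1) not in ora600_args:
            ora600_args.append(m.group(1))
        m = TERMINATE_RE.match(line.strip())
        if m:
            terminations.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return ora600_args, terminations

# Two lines copied from the alert log excerpt above.
sample = [
    "ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], "
    "[0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []",
    "LMS1 (ospid: 12379): terminating the instance due to error 484",
]
args, terms = summarize_alert_log(sample)
print(args)   # first ORA-00600 argument(s) seen
print(terms)  # (process, ospid, error) for each termination
```

In practice the lines would come from reading the alert log file itself; only the two sample lines here are taken from this incident.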

orcl_lms1_12379_i1104178.trc

Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/11.2.0/dbhome_1
System name: HP-UX
Node name: h7sd05da
Release: B.11.31
Version: U
Machine: ia64
Instance name: orcl
Redo thread mounted by this instance: 1
Oracle process number: 14
Unix process pid: 12379, image: oracleh7sd05da (LMS1)
Dump continued from file: /oracle/app/oracle/diag/rdbms/cdrdb/orcl/trace/orcl_lms1_12379.trc
ORA-00600: internal error code, arguments: [kjctr_pbmsg:badbmsg2], [0x9FFFFFFFFC996B58], [0x9FFFFFFFFC9976B8], [], [], [], [], [], [], [], [], []
========= Dump for incident 1104178 (ORA 600 [kjctr_pbmsg:badbmsg2]) ========
*** 2014-08-11 23:53:10.339
dbkedDefDump(): Starting incident default dumps (flags=0x2, level=3, mask=0x0)
----- SQL Statement (None) -----
Current SQL information unavailable - no cursor.
----- Call Stack Trace -----
    skdstdst <- ksedst <- dbkedDefDump <- ksedmp <- ksfdmp
       <- $cold_dbgexPhaseII <- dbgexProcessError <- dbgeExecuteForError <- dbgePostErrorKGE <- 2352
        <- dbkePostKGE_kgsf <- 128 <- kgeadse <- kgerinv_internal <- kgerinv
         <- kgeasnmierr <- kjctr_pbmsg <- kjctr_rksxp <- kjctrcv <- kjcsrmg
          <- kjmsm <- ksbrdp <- opirip <- opidrv <- sou2o
           <- opimai_real <- ssthrdmain <- main <- main_opd_entry
--------------------- Binary Stack Dump ---------------------

2. Check the patch information. The current version is 11.2.0.2.1

$ opatch lsinventory 
Installed Top-level Products (1): 
Oracle Database 11g 11.2.0.2.0 
Patch 10248523 : applied on Fri Mar 25 09:33:02 GMT+08:00 2011

3. Searching for documents and bugs related to this error turns up the following bugs and descriptions

Bug 18015296 : ORA-600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
 The assert is triggered because the batch message is invalid/corrupt. This looks like some form of underlying infrastructure/network issue. Please work with the customer to have this checked and tested.
Bug 18771858 : LMS0 TERMINATING THE INSTANCE DUE TO ERROR 484 (ORA-00600 [KJCTR_PBMSG:BADBMSG2]) in 11.2.0.3
 From the past bug 16240464 & bug 18015296 , both were closed by dev as not a product defect.
 It was suggested that the problem was outside the Oracle stack, at the network level. So please check with CT along the same lines to identify network problems (if any) with help from their OS/Net support. Refer to Doc ID 563566.1, Troubleshooting gc block lost and Poor Network Performance in a RAC Environment.
Bug 16240464 : INSTANCE CRASH WITH ORA-00600 [KJCTR_PBMSG:BADBMSG2] in 11.2.0.3
 This looks like some form of underlying infrastructure/network issue, please work with customer to have this checked and tested.
Bug 17452853 : LNX64-12.1-EF,DB INST CRASH WITH LMS4 HIT ORA-600 [KJCTR_PBMSG:BADBMSG2] in 12.1.0.2
Bug 17049773 : Diagnostic enhancement to give additional parameter in error ORA-600 [kjctr_pbmsg:badbmsg2] in 12.1.0.1
Note: This fix will not address the root cause of the error but the additional information may help with diagnosis of the cause.
Bug 13917456 : LNX64-12.1-UD: ASM LMD HIT ORA-00600 KJCTR_PBMSG:BADBMSG2 IN NON-UPGRADED NODES in 12.1.0.0.2
 It may have occurred during the upgrade stage from 11.2.0.3 to 12.1. Not related to this SR.

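4. At this point I need to examine the AWR reports, the OSWatcher data, and all of the LMS, LMD, LMON, LMHB and DIAG trace files from the time of the problem, to see whether they record any further information.
The overall RAC environment should also be checked with cluvfy and ORAchk.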

--. AWR reports for 22:00~23:00 on Aug 11 from both nodes.
--. Deploy OSWatcher, then collect the current OS information while the database workload is high.
--. All the LMS, LMD, LMON, LMHB and DIAG trace files from both nodes.
--. CVU output:
      cluvfy stage -pre crsinst -n <node1,node2> -verbose 
--. Please run ORAchk as root.
ORAchk - Health Checks for the Oracle Stack (Doc ID 1268927.2)
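
Picking the LMS, LMD, LMON, LMHB and DIAG trace files out of a crowded trace directory can be scripted. The sketch below assumes the file-name convention <instance>_<process>_<ospid>.trc, inferred only from orcl_lms1_12379.trc in the alert log; every other file name in the sample list is hypothetical, as is the helper name select_rac_traces.

```python
import re

# Assumed naming convention <instance>_<process>_<ospid>.trc, inferred from
# orcl_lms1_12379.trc above; verify it against your own trace directory.
TRACE_RE = re.compile(
    r"^(?P<inst>\w+?)_(?P<proc>lms\d+|lmd\d+|lmon|lmhb|diag)_(?P<pid>\d+)\.trc$"
)

def select_rac_traces(filenames):
    """Return sorted (process, filename) pairs for LMS/LMD/LMON/LMHB/DIAG traces."""
    selected = []
    for name in filenames:
        m = TRACE_RE.match(name)
        if m:
            selected.append((m.group("proc"), name))
    return sorted(selected)

# Only the first name comes from the alert log; the rest are hypothetical.
names = [
    "orcl_lms1_12379.trc",
    "orcl_lmd0_12371.trc",            # hypothetical
    "orcl_lmon_12369.trc",            # hypothetical
    "orcl_ora_20001.trc",             # hypothetical foreground trace, skipped
    "orcl_lms1_12379_i1104178.trc",   # incident dump, lives under incident/, skipped
]
rac_traces = select_rac_traces(names)
for proc, name in rac_traces:
    print(proc, name)
```

On a real system the list of names would come from listing the trace directory on each node, for example with os.listdir.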

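5. While checking the AWR reports, I found "gc blocks lost". In theory this statistic should not appear if the private network is healthy, so its presence essentially indicates that the private network is unstable.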

awrrpt_2_29557_29558.html

              Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:   29557    11-Aug-14 22:00:45  563       1.3
End Snap:     29558    11-Aug-14 23:01:00  551       1.3
Elapsed:      60.24 (mins)
DB Time:      4,835.90 (mins)

Top 5 Timed Foreground Events
Event                      Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
db file sequential read    6,269,185  185,621  30             63.97      User I/O
DB CPU                                42,433                  14.62
gc current grant 2-way     3,251,636  25,671   8              8.85       Cluster
db file scattered read     550,524    9,873    18             3.40       User I/O
gc cr multi block request  637,442    6,790    11             2.34       Cluster

Instance Activity Stats
Statistic       Total  per Second  per Trans
gc blocks lost  269    0.07        0.01       <<<<<<<<<<<<

awrrpt_1_29557_29558.html

              Snap Id  Snap Time           Sessions  Cursors/Session
Begin Snap:   29557    11-Aug-14 22:00:44  2470      1.0
End Snap:     29558    11-Aug-14 23:00:59  2500      1.0
Elapsed:      60.25 (mins)
DB Time:      4,549.47 (mins)

Top 5 Timed Foreground Events
Event                      Waits      Time(s)  Avg wait (ms)  % DB time  Wait Class
db file sequential read    8,180,795  154,504  19             56.60      User I/O
DB CPU                                44,994                  16.48
gc current grant 2-way     3,699,003  29,357   8              10.75      Cluster
db file scattered read     677,065    10,190   15             3.73       User I/O
gc cr multi block request  718,327    7,856    11             2.88       Cluster

Instance Activity Stats
Statistic       Total  per Second  per Trans
gc blocks lost  410    0.11        0.01       <<<<<<<<<<<<
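
As a quick sanity check, the "per Second" figures can be recomputed from the totals and elapsed times in the two reports. This is only arithmetic on the numbers shown above; the helper name per_second is made up for illustration.

```python
def per_second(total, elapsed_minutes):
    """Average events per second over a snapshot interval given in minutes."""
    return total / (elapsed_minutes * 60.0)

# Totals and elapsed times copied from the two AWR excerpts above.
rate_awrrpt_2 = per_second(269, 60.24)   # awrrpt_2_29557_29558.html
rate_awrrpt_1 = per_second(410, 60.25)   # awrrpt_1_29557_29558.html
total_lost = 269 + 410                   # lost blocks across both reports

print(round(rate_awrrpt_2, 2), round(rate_awrrpt_1, 2), total_lost)
```

The rounded rates agree with the 0.07 and 0.11 that AWR reports, and the two reports together account for 679 lost blocks in the hour before the crash.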

6. This finding makes a private network problem even more likely. The final conclusion is as follows

Bugs 16240464 and 18015296 were raised for similar issues, and both were closed as "Vendor OS Problem".
The bugs confirmed that this issue is caused by logical block corruption during network transfer over the interconnect, or by an infrastructure issue.

The ORA-00600 [kjctr_pbmsg:badbmsg2] error is purely a result of an unstable network.
The AWR reports confirm that blocks were being lost during the problematic time frame. This is one piece of evidence that the network is either saturated or corrupting packets.

The AWR reports from both nodes show "gc blocks lost".
Please involve the OS team and the network team to identify the root cause of the issue. The note below will be helpful for the network issue.
Troubleshooting gc block lost and Poor Network Performance in a RAC Environment (Doc ID 563566.1)

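7. What this analysis still lacks is stronger evidence, namely the OSWatcher logs. OSWatcher data from the time of the problem would expose the private network issue much more clearly. After all, both "gc blocks lost" and ORA-00600 [kjctr_pbmsg:badbmsg2] are reported from the Oracle database side, and on their own they will not necessarily convince the OS engineers. If the OSWatcher logs had recorded TCP and UDP packet loss at that time, the problem would be clearer and the responsibility easier to assign.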

For installing and using OSWatcher, see: OSWatcher (Doc ID 301137.1)
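
Once periodic OS network statistics are available, the useful signal is usually the change in a cumulative error or drop counter between samples, not its absolute value. The sketch below shows only that delta step, under clearly stated assumptions: the sample timestamps and counter values are invented for illustration, and extracting the counters from the platform's netstat output is not attempted here because the format differs between operating systems.

```python
def counter_increases(samples):
    """Given (timestamp, cumulative_value) samples in time order,
    return (start, end, delta) for each interval where the counter grew."""
    increases = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        delta = v1 - v0
        if delta > 0:
            increases.append((t0, t1, delta))
    return increases

# Hypothetical cumulative UDP drop counter, one sample per minute.
# These values are NOT from the incident; they only illustrate the method.
samples = [("23:50", 100), ("23:51", 100), ("23:52", 135), ("23:53", 190)]
bursts = counter_increases(samples)
for start, end, delta in bursts:
    print(start, end, delta)
```

Intervals where the counter grows can then be lined up against the alert log timestamps and the AWR snapshot window.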

Time: 2024-08-14 17:43:37
