数据库节点二VIP异常 故障分析

系统环境    硬件平台 &  操作

IBM 570

操作系统版本  AIX 5.3

物理内存  32G

Oracle 产品及版本  10.2.0.5 RAC

业务类型  OLTP

背景概述

交易系统在xx月xx 日,节点二VIP异常下线导致节点二数据库服务失

效。接到请求后,第一时间进行连线处理。故障发生在凌晨 3点,而且

AIX(errpt)、Oracle DB(alert.log )、CRS (crsd.log 、ocssd.log、vip.log、

coredump )等均没有留下太多有效信息,情况非常复杂。

问题详细诊断及分析

一、检查errpt日志:

节点二VIP异常下线时,节点二上无报错产生。

二、检查CRS日志:

节点一:

2012-04-29 03:42:33.180: [  CRSRES][11376]32startRunnable: setting CLIvalues

—  说明:节点一上仅有一行信息反映出故障期间节点一上CRS曾经执行过命令。

节点二:

2012-04-29 03:41:12.308: [  CRSAPP][11263]32CheckResource error forora.xxxxdb02.vip error code =

1

2012-04-29 03:41:12.335: [  CRSRES][11263]32In stateChanged,ora.xxxxdb02.vip target is ONLINE

2012-04-29 03:41:12.335: [  CRSRES][11263]32ora.xxxxdb02.vip on xxxxdb02went OFFLINE unexpectedly

2012-04-29 03:41:12.335: [  CRSRES][11263]32StopResource: setting CLIvalues

2012-04-29 03:41:12.340: [  CRSRES][11263]32Attempting to stop`ora.xxxxdb02.vip` on member

`xxxxdb02`

2012-04-29 03:41:12.893: [  CRSRES][11269]32In stateChanged,ora.xxxxdb.xxxxdb1.xxxxdb2.srv target

is ONLINE

2012-04-29 03:41:12.894: [  CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srvon xxxxdb02 went OFFLINE

unexpectedly

2012-04-29 03:41:12.894: [  CRSRES][11269]32StopResource: setting CLIvalues

2012-04-29 03:41:12.899: [  CRSRES][11269]32Attempting to stop`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on

member `xxxxdb02`

2012-04-29 03:41:12.958: [  CRSRES][11263]32Stop of `ora.xxxxdb02.vip` onmember `xxxxdb02`

succeeded.

2012-04-29 03:41:12.971: [  CRSRES][11263]32ora.xxxxdb02.vipRESTART_COUNT=0 RESTART_ATTEMPTS=0

2012-04-29 03:41:12.976: [  CRSRES][11263]32ora.xxxxdb02.vip failed onxxxxdb02 relocating.

2012-04-29 03:41:13.025: [  CRSRES][11263]32StopResource: setting CLIvalues

2012-04-29 03:41:13.029: [  CRSRES][11263]32Attempting to stop

`ora.xxxxdb02.LISTENER_XXXXDB02.lsnr` onmember `xxxxdb02`

2012-04-29 03:41:13.146: [  CRSRES][11269]32Stop of`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on member

`xxxxdb02` succeeded.

2012-04-29 03:41:13.146: [ CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srv RESTART_COUNT=0

RESTART_ATTEMPTS=1

2012-04-29 03:41:13.159: [  CRSRES][11269]32Restartingora.xxxxdb.xxxxdb1.xxxxdb2.srv on xxxxdb02

2012-04-29 03:41:13.164: [  CRSRES][11269]32startRunnable: setting CLIvalues

2012-04-29 03:41:13.164: [  CRSRES][11269]32Attempting to start  `ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on

member `xxxxdb02`

2012-04-29 03:41:45.618: [  CRSAPP][11269]32StartResource error forora.xxxxdb.xxxxdb1.xxxxdb2.srv

error code = 1

2012-04-29 03:41:45.799: [  CRSRES][11269]32Start of`ora.xxxxdb.xxxxdb1.xxxxdb2.srv` on member

`xxxxdb02` failed.

2012-04-29 03:41:45.820: [ CRSRES][11269]32ora.xxxxdb.xxxxdb1.xxxxdb2.srv failed on xxxxdb02

relocating.

2012-04-29 03:41:45.885: [  CRSRES][11269]32Cannot relocate ora.xxxxdb.xxxxdb1.xxxxdb2.srvStopping

dependents

2012-04-29 03:41:45.897: [  CRSRES][11269]32StopResource: setting CLIvalues

2012-04-29 03:42:29.483: [  CRSRES][11263]32Stop of `ora.xxxxdb02.LISTENER_XXXXDB02.lsnr` on member

`xxxxdb02` succeeded.

2012-04-29 03:42:29.496: [  CRSRES][11263]32Attempting to start`ora.xxxxdb02.vip` on member

`xxdb01np5`

2012-04-29 03:42:32.036: [  CRSRES][11263]32Start of `ora.xxxxdb02.vip`on member `xxdb01np5`

succeeded.

—  说明:节点二是故障节点,上面信息反映出,由于VIP检测异常导致节点二上VIP被强

制OFFLINE,同时CRS将基于VIP 资源的监听和数据库服务也进行了OFFLINE操作。

三、分析CRS状态:

由于故障期间客户很好的保留了现场环境,因此我们得以现场分析到CRS

状态。

Name           Type           Target    State    Host

------------------------------------------------------------

ora.xxxxdb.db  application   ONLINE    ONLINE    xxdb01np5

ora....sdb1.cs application    ONLINE   OFFLINE

ora....db1.srv application    ONLINE   ONLINE    xxdb01np5

ora....db2.srv application    ONLINE   ONLINE    xxxxdb02

ora....b1.inst application    ONLINE   ONLINE    xxdb01np5

ora....b2.inst application    ONLINE   ONLINE    xxxxdb02

ora....P5.lsnr application    ONLINE   ONLINE    xxdb01np5

ora....np5.gsd application    ONLINE   ONLINE    xxdb01np5

ora....np5.ons application    ONLINE   ONLINE    xxdb01np5

ora....np5.vip application    ONLINE   ONLINE    xxdb01np5

ora....P5.lsnr application    ONLINE   OFFLINE

ora....np5.gsd application    ONLINE   ONLINE    xxxxdb02

ora....np5.ons application    ONLINE   ONLINE    xxxxdb02

ora....np5.vip application    ONLINE   ONLINE    xxdb01np5

—  说明:故障期间,节点二上的数据库(实例)资源仍然在运行,Nodeapp资源正常,

唯一被OFFLINE的是一个Service和listener 资源。但是我们注意到节点二上VIP 资源

已经被节点一接管过去。这说明,节点二的VIP(xxx.xxx.xxx.4 )在排查问题的时候是

没问题的。

四、分析数据库告警日志

Sun Apr 29 03:41:13 BEIST 2012

ALTER SYSTEM SET service_names=‘xxxxdb‘SCOPE=MEMORY SID=‘xxxxdb2‘;

Sun Apr 29 03:41:55 BEIST 2012

ALTER SYSTEM SET service_names=‘‘SCOPE=MEMORY SID=‘xxxxdb2‘;

Sun Apr 29 03:41:55 BEIST 2012

Immediate Kill Session#: 1659, Serial#: 2

Immediate Kill Session: sess:70000038f773bb8  OS pid: 1180972

Sun Apr 29 03:41:55 BEIST 2012

Process OS id : 1180972 alive after kill

Errors in file

Immediate Kill Session#: 1660, Serial#: 2

Immediate Kill Session: sess:7000003847623c0  OS pid: 1179690

Sun Apr 29 03:41:56 BEIST 2012

Process OS id : 1179690 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1661, Serial#: 2

Immediate Kill Session: sess:7000003837620c0  OS pid: 1300264

Sun Apr 29 03:41:56 BEIST 2012

Process OS id : 1300264 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1662, Serial#: 2

Immediate Kill Session: sess:70000038275bc80  OS pid: 1155624

Sun Apr 29 03:41:57 BEIST 2012

Process OS id : 1155624 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1663, Serial#: 2

Immediate Kill Session: sess:700000387762ed0  OS pid: 1000484

Sun Apr 29 03:41:57 BEIST 2012

Process OS id : 1000484 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1664, Serial#: 2

Immediate Kill Session: sess:7000003867923d8  OS pid: 1175586

Sun Apr 29 03:41:58 BEIST 2012

Process OS id : 1175586 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1665, Serial#: 2

Immediate Kill Session: sess:7000003857a9d30  OS pid: 1296160

Sun Apr 29 03:41:58 BEIST 2012

Process OS id : 1296160 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1666, Serial#: 3

Immediate Kill Session: sess:70000038f775130  OS pid: 1176862

Sun Apr 29 03:41:59 BEIST 2012

Process OS id : 1176862 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1667, Serial#: 2

Immediate Kill Session: sess:700000384763938  OS pid: 1151516

Sun Apr 29 03:42:00 BEIST 2012

Process OS id : 1151516 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1668, Serial#: 2

Immediate Kill Session: sess:700000383763638  OS pid: 693786

Sun Apr 29 03:42:01 BEIST 2012

Process OS id : 693786 alive after kill

Errors in file/oracle/product/10.2.0/admin/xxxxdb/udump/xxxxdb2_ora_110596.trc

Immediate Kill Session#: 1669, Serial#: 2

Immediate Kill Session: sess:70000038275d1f8  OS pid: 1292056

6

—  说明:故障期间,节点二上数据库的Service被OFFLINE,这一点正常。但随后CRS

将连接到节点二上的所有客户端连接都强行Kill掉了。该问题和Oracle Bug 6955040有

关。

五、分析RACGVIP

由于问题原因定位到 VIP资源异常上,而VIP切换到节点一上是正常的,

因此极有可能是由于VIP检查的时候出现异常导致CRS 将VIP OFFLINE。由于

Oracle CRS在进行VIP检测的时候是通过RACGVIP这个 SHELL脚本实现功能

的,因此我们分析和测试了RACGVIP脚本。

我们注意到其中一段脚本,如下:

#Check the status of the interface thro‘ pinging gateway

if[ -n "$DEFAULTGW" ]

then

_RET=1

#get base IP address of the interface

tmpIP=`$LSATTR -El ${_IF} -a netaddr | $AWK ‘{print $2}‘`

#get RX packets numbers

_O1=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5;exit}}"`

x=$CHECK_TIMES

while [ $x -gt 0 ]

do

if [ -n "$tmpIP" ]

then

$PING -S $tmpIP $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1

else

$PING $PING_TIMEOUT $DEFAULTGW > /dev/null 2>&1

fi

_O2=`$NETSTAT -n -I $_IF | $AWK "{ if (/^$_IF/) {print \\$5;exit}}"`

if [ "$_O1" != "$_O2" ]

then

# RX packets numbers changed

_RET=0

break

fi

$SLEEP 1

x=`$EXPR $x - 1`

done

if [ $_RET -ne 0 ]

then

logx "checkIf: RX packets checked if=$_IF failed"

else

logx "checkIf: RX packets checked if=$_IF OK"

fi

else

logx "checkIf: Default gateway is not defined(host=$HOSTNAME)"

if [ $FAIL_WHEN_DEFAULTGW_NO_FOUND -eq 1 ]

then

_RET=1

else

_RET=0

fi

fi

—  说明:这段脚本说明,CRS在做VIP check的时候机制如下,定期向缺省网关(Default

Gateway)发包,如果发包前后检测到网卡上有包流量(output值前后不一样)则说

明检测网络正常,VIP check通过。发包的操作通过AIX 命令执行,还原所有变量后的

命令如下

—  ping –S xxx.xxx.xxx.2 –c 1–w 1 xxx.xxx.xxx.254

—  参数含义如下:

-S hostname/IP addr

Uses the IP address as the source addressin outgoing ping packets.

-c Count

Specifies the number of echo requests, asindicated by the Count

variable, to be sent (and received).

-w timeout

This option works only with the -c option.It causes ping to wait

for a maximum of ‘timeout‘ seconds for areply (after sending the

last packet).

—  PING_TIMEOUT参数控制了-c 和-w的值,RACGVIP 脚本中设置如下:

# timeout of ping in number of loops

PING_TIMEOUT=" -c 1 -w 1"

—  我们检查了10.2.0.3版本下的设置,发现在10.2.0.3版本下,PING_TIMEOUT参数设置

为如下值:

# timeout of ping in number of loops

PING_TIMEOUT=" -c 1 -w 3"

—  当前PING_TIMEOUT的设置意味着如果在pingdefault gateway 的1秒内没有成功向

外发送出一个包,那么CRS会经过一次SLEEP后再次尝试ping操作,如果还是没有成

功,CRS即宣布网络异常,VIP OFFLINE。

结论

Oracle 10.2.0.5在RACGVIP脚本上的细微改变说明,Oracle认为10.2.0.5版本

下对网络性能及稳定性更高了。但是将PING_TIMEOUT 由1秒更改为3秒,会大

大增加CRS 的敏感度,使得细微的网络异常即会造成CRS 中的 VIP检测失败。

根据上述分析,我们认为节点二故障原因如下:

1、 升级到10.2.0.5后,CRS 对网络延时更加敏感

2、 在4月29 日凌晨3点故障期间,网络出现了及其短暂的延时导致VIP检测

失败,CRS 将VIP以及依赖于VIP的资源OFFLINE。

3、 CRS 在OFFLINE数据库服务的时候命中Bug6955040造成客户端连接被

全部杀光。

故障重现

我方在生产环境模拟了这一段异常情况:

步骤流程

1  打开CRS VIPdebug

2

节点二上找出VIP(xxx.xxx.xxx.4 )使用的网络设备(en10)

以及对应的ServiceIP(xxx.xxx.xxx.2 )

3  交换机上找出对应的端口

4 Block该端口一小段时间

5  观察CRS 和数据库日志

我方我们观察到其中 CRS 日志和数据库日志信息和故障期间完全一致,这

充分证明了前面的判断。

同时,VIP的debug 信息显示如下内容,我们可以从中清晰的看到,由于

网络问题导致发包失败后,CRS 立即将VIP置于 OFFLINE状态:

2012-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:18 BEIST 2012 [ 1139710 ] Checkinginterface existance

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Calling getifbyip

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip:  started for xxx.xxx.xxx.4

2012-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:18 BEIST 2012 [ 1139710 ] getifbyip:checking if failover is happening (en10)

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip: failover is not happening

(en10)

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] getifbyip:  returning IP en10

2012-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:18 BEIST 2012 [ 1139710 ] Completedgetifbyip en10

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Completed with initial interface test

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] Broadcast = 130.0.255.255

2012-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:18 BEIST 2012 [ 1139710 ] checkIf:start for if=en10

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] IsIfAlive: start for if=en10

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] defaultgw:  started

2012-05-01 20:36:22.560: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:18 BEIST 2012 [ 1139710 ]defaultgw:  completed withxxx.xxx.xxx.254

Tue May 1 20:36:18 BEIST 2012 [ 1139710 ] About to execute com mand:/usr/sbin/ping

-S xxx.xxx.xxx.2  -c 1 -w 1 xxx.xxx.xxx.254

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:20 BEIST 2012 [ 1139710 ] About toexecute command: /usr/sbin/ping -S

xxx.xxx.xxx.2  -c 1 -w 1 xxx.xxx.xxx.254

Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] IsIfAlive: RX packets checked if=en10

failed

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:22 BEIST 2012 [ 1139710 ] Interfaceen10 checked failed (host=xxxxdb02)

Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] IsIfAlive: end for if=en10

Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] checkIf: end for if=en10

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:22 BEIST 2012 [ 1139710 ] PerformingCRS_STAT testing

Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] Completed CRS_STAT testing

Tue May 1 20:36:22 BEIST 2012 [ 1139710 ] Completed second gateway test

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:22 BEIST 2012 [ 1139710 ] Interfacetests

Invalid parameters, or failed to bring upVIP (host=xxxxdb02)

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

clsrcexecut: envORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

clsrcexecut: cmd =/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=5 54

/oracle/product/10.2.0/crs_1/bin/racgvipcheck xxxxdb02

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

clsrcexecut: rc = 1, time = 4.304s

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

clsrcaction: restyp=0 act_typ=2 stat=1

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

clsrcaction:  init 0.000s

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb0 2.vip]:

clsrcaction:  action failed, 4.438s

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

clsrcaction:  post 0.000s

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

clsrcaction: all 4.438s

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:end for

resource = ora.xxxxdb02.vip, action =check, status = 1, time = 4.442s

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:clsrccln:

exiting ora.xxxxdb02.vip refcount=1

2012-05-01 20:36:22.561: [    RACG][1] [1291828][1][ora.xxxxdb02.vip]:

clsrcprsrgter:gctx->prsrcfgref_clsrcgctx = 0

2012-05-01 20:36:22.664: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcprsrgini:gctx->prsrcfgref_clsrcgctx = 0

2012-05-01 20:36:22.664: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcini_ext: starting ora.xxxxdb02.viprefcount=1 global

2012-05-01 20:36:22.665: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:begin for

resource = ora.xxxxdb02.vip, action = stop

2012-05-01 20:36:22.902: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrccssgetnodename: all 0.238s

2012-05-01 20:36:22.902: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcnodeapps: calling FAILSRVSA

2012-05-01 20:36:22.907: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:

clsrcrundetach: cmd =/oracle/product/10.2.0/crs_1/bin/racgmain ora.xxxxdb02.vip

rundetach 1 failsrvsa xxxxdb02, rc = 0, time = 0.005s

2012-05-01 20:36:22.908: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcnodeapps: Posting PNWDOWN_EVENT

2012-05-01 20:36:22.908: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrccssgetnodename: all 0.000s

2012-05-01 20:36:22.908: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrccssgetnodename: all 0.000s

2012-05-01 20:36:22.908: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcssgetrhost: using cached local hostname

2012-05-01 20:36:22.909: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrccssgetlhost: all 0.001s

2012-05-01 20:36:22.935: [    RACG][1] [889364][1][ora.xxxxdb02.vip]:

clsrcprsrgini:gctx->prsrcfgref_clsrcgctx = 0

2012-05-01 20:36:22.935: [    RACG][1] [889364][1][ora.xxxxdb02.vip]:

clsrcini_ext: starting ora.xxxxdb02.viprefcount=1 global

2012-05-01 20:36:22.935: [    RACG][1] [889364][1][ora.xxxxdb02.vip]:begin for

resource = ora.xxxxdb02.vip, action =rundetach

2012-05-01 20:36:22.964: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_COUNT=0

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = STATE=ONLINE on xxxxdb02

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf =

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = NAME=ora.xxxxdb.xxxxdb1.xxxxdb2.srv

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = FAILURE_THRESHOLD=0

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = TARGET=ONLINE

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = STATE=ONLINE on xxxxdb02

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf =

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_COUNT=0

2012-05-01 20:36:22.982: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = FAILURE_THRESHOLD=0

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = TARGET=ONLINE

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = STATE=ONLINE on xxxxdb02

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_ATTEMPTS=5

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = NAME=ora.xxxxdb02.ons

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = TYPE=application

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_ATTEMPTS=3

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_COUNT=0

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = FAILURE_THRESHOLD=0

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = TARGET=ONLINE

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = STATE=ONLINE on xxxxdb02

15

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf =

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = NAME=ora.xxxxdb02.vip

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = TYPE=application

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_ATTEMPTS=0

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_COUNT=0

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = FAILURE_THRESHOLD=0

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = TARGET=ONLINE

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = STATE=ONLINE on xxxxdb02

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf =

2012-05-01 20:36:22.983: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcqryapi:

resname = NULL, host = xxxxdb02, time =0.007s

2012-05-01 20:36:23.075: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcpostevt: EvmEventPost 1 0.000s

2012-05-01 20:36:23.075: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcposthaevt: forward to EVM

0.166s

2012-05-01 20:36:23.080: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcnodeapp: prsr num_env = 0

2012-05-01 20:36:23.080: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcnodeapp: settingORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1

2012-05-01 20:36:23.111: [    RACG][1] [889364][1][ora.xxxxdb02.vip]:

clsrcpostevt: EvmEventPost 1 0.008s

2012-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]: clsrcstatcb:

buf = NAME=ora.xxxxdb02.vip

2012-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = TYPE=application

2012-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_ATTEMPTS=0

2012-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = RESTART_COUNT=0

2012-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = FAILURE_THRESHOLD=0

2012-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = TARGET=ONLINE

2012-05-01 20:36:23.114: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf = STATE=ONLINE on xxxxdb02

2012-05-01 20:36:23.115: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcstatcb:

buf =

2012-05-01 20:36:23.115: [    RACG][1] [889364][1][ora.xxxxdb02 .vip]:clsrcqryapi:

resname = ora.xxxxdb02.vip, host = NULL,time = 0.004s

2012-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:23 BEIST 2012 [ 921812 ] Checkinginterface existance

Tue May 1 20:36:23 BEIST 2012 [ 921812 ] Calling getifbyip

Tue May 1 20:36:23 BEIST 2012 [ 921812 ] getifbyip:  started for xxx.xxx.xxx.4

2012-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:23 BEIST 2012 [ 921812 ] getifbyip:checking if failover is happening ()

Tue May 1 20:36:23 BEIST 2012 [ 921812 ] getifbyip: failover i s not happening()

Tue May 1 20:36:23 BEIST 2012 [ 921812 ] Completed getifbyip

2012-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:Tue May  1

20:36:23 BEIST 2012 [ 921812 ] Completedwith initial interface test

2012-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcexecut: envORACLE_CONFIG_HOME=/oracle/product/10.2.0/crs_1

17

2012-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb0 2.vip]:

clsrcexecut: cmd =/oracle/product/10.2.0/crs_1/bin/racgeut -e _USR_ORA_DEBUG=5 54

/oracle/product/10.2.0/crs_1/bin/racgvipstop xxxxdb02

2012-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcexecut: rc = 0, time = 0.204s

2012-05-01 20:36:23.284: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcposthaevt: reason = failure

2012-05-01 20:36:23.285: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:clsrccln:

exiting ora.xxxxdb02.vip refcount=1

2012-05-01 20:36:23.286: [    RACG][1] [1115450][1][ora.xxxxdb02.vip]:

clsrcprsrgter:gctx->prsrcfgref_clsrcgctx = 0

解决方案

根据分析结果,我们认为,10.2.0.5中CRS 对网络过于敏感,出现网络延时

的时候会对数据库集群产生较大影响,针对目前的情况,我们建议如下:

一、  详查网络问题,极偶然的丢包或延时在网络层面也属于常见情况。

有可能是线缆问题,也可能是交换机、服务器网卡、网络配置等

问题,需要详细检查网络情况。

二、  修改过于敏感的CRS 配置,将发包超时设置为3秒( 10.2.0.5之前的

值):

修改$ORA_CRS_HOME/bin/racgvip 脚本如下部分

# timeout of ping in number of loops

PING_TIMEOUT=" -c 1 -w 1"

修改成如下内容:

# timeout of ping in number of loops

PING_TIMEOUT=" -c 1 -w 3"

三、  由于Bug 6955040是VIP异常后被触发,目前优先解决VIP异常问

题,该Bug 可以忽略。

时间: 2024-10-24 15:38:25

数据库节点二VIP异常 故障分析的相关文章

11g RAC 节点二MMON进程异常

一早发现核心系统的DBtime监控阈值一直在某一个点平移,感觉有点不对劲.因为我们的脚本依托dba_hist_snapshot试图的SNIP来做的.遂进行AWR报告的生成查看其SNAP_ID是否有异常: 21220 19 Sep 2018 09:00 1 21221 19 Sep 2018 10:00 1 21222 19 Sep 2018 11:00 1 21223 19 Sep 2018 12:00 1 21224 19 Sep 2018 13:00 1 21225 19 Sep 2018

探索Oracle之数据库升级二 11.2.0.3升级到11.2.0.4完整步骤

探索Oracle之数据库升级二  11.2.0.3升级到11.2.0.4完整步骤 说明:         这篇文章主要是记录下单实例环境下Oracle 11.2.0.1升级到11.2.0.3的过程,当然RAC的升级是会有所不同.但是他们每个版本之间升级步骤都是差不多的,先升级Database Software,再升级Oracle Instance. Oracle 11.2.0.4的Patchset No:19852360下载需要有Oracle Support才可以.  Patchset包含有7个

学习oracle数据库引航二——后台进程

from 谭怀远--数据库领航 后台进程 一.system monitor(SMON) (非正常关闭后的)实例恢复 二.process monitor(PMON) 回滚事务 释放锁及其它资源 重启死掉的调度器.(在共享服务器中用).分配用户连到哪个服务器上. 在监听器中注册服务信息 三.DBWn和LGWR DBWRn下写入磁盘文件 四.check point(CKPT) 喊DBWn写脏数据 完后会更新DATAFILE的HEADER和控制文件的HEADER.而HEADER中有同步所需要的信息,即C

Android中使用Sqlite数据库 (二) 增删改查

定义接口 package com.example.android_db.service; import java.util.List; import java.util.Map; public interface PersonService { public boolean addPerson(Object[] params); public boolean deletePerson(Object[] params); public boolean updatePerson(Object[] p

mysql数据库 (二) mysql数据库的基本操作

上一次安装好mysql数据库以后并没有登录操作,这次来简单介绍下mysql的基本操作 一.访问mysql数据库(经过安装后的初始化过程,mysql数据库的默认管理员用户名为:"root",密码为空:注意此root并非是linux系统管理员的root账号) 1.登录mysql 未设置密码时,mysql后加"-u"指定用户就可以了 如果root设置了密码,还需要在用户名后面加"-p"选项指定密码 2.执行mysql操作语句 注意:每一条mysql语句

浅尝key-value数据库(二)——MongoDB的优与劣

浅尝key-value数据库(二)——MongoDB的优与劣 MongoDB的名字取自英文单词"humongous"的中间五个字母,是一个C++开发的基于分布式文件存储的数据库开源项目.他的文件存储格式是BSON(Binary JSON),因此可以高效存储二进制数据,例如图像.视频等大对象. 由于我是CentOS x86_64的系统,于是安装MongoDB非常简单: vi /etc/yum.repos.d/mongo.repo [10gen] name=10gen Repository

Windows中使用MySQL 创建数据库(二)

一.数据库操作 1.查看当前存在的所有数据库 mysql>show databases; 2.切换当前使用数据库 mysql>use test 3.授权 mysql>grant all on 数据库名.* to 用户名@主机名 4.创建数据库 mysql>create database company; 其中:company 为要创建的数据库名 二.创建表 1.查看当前数据库中的所有表 mysql>show tables; 2.创建表 create table employe

黑马程序员——面向对象(二)异常机制、包(Package)

一.异常机制 (一)异常概述 1.异常:就是程序在运行时出现不正常情况. 2.异常类:程序在运行时,出现的不正常情况,将情况用java的类的形式进行描述.并封装成对象. (二)异常分类 对异常问题的分类:“严重的的Error”和“不严重的Exception” (1)一种是严重的问题:对于严重的问题,java通过Error类进行描述. (2)一种是不严重的问题:对于非严重的,java通过Exception类进行描述.对于Exception可以使用针对性的处理方式进行处理. (三)异常体系 1.Th

【MS SQL】数据库维护计划之数据库备份(二)

原文:[MS SQL]数据库维护计划之数据库备份(二) 上篇[MS SQL]数据库维护计划之数据库备份(一) 说了数据库备份的一些概念后,这篇以HRP_KQYY数据库备份为例,进行备份计划设置. 考虑备份的时间.性能和硬盘空间等因素,备份方案做下面设置: 1.每周日凌晨1:00执行数据库完成备份: 2.每周一至周六凌晨1:00执行数据库差异备份: 3.每天在00:00:00和23:59:59之间.每4小时执行数据库的日志备份. 建立3个文件夹: HRP_KQYY_diff:存放差异备份文件: H