RMAN duplicate hangs when DCD and TCPS are used together.
Source:
RMAN Duplicate hangs when using DCD and TCPS (Doc ID 1676197.1)
Applies to:
Oracle Database - Enterprise Edition - Version 11.2.0.1 and later
Information in this document applies to any platform.
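Symptoms:
An RMAN active duplicate for standby hangs during the datafile copy phase. SSL Oracle Net and Dead Connection Detection (DCD) are in use.
The hang is intermittent: sometimes the duplicate completes, and at other times it hangs for days until the processes are killed at both the operating system and database level.
RMAN debug output shows the following message repeating: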
RMAN-06731: command backup:x% complete, time left HH:MM:SS
Sample RMAN debug output:
RMAN-12016: using channel ORA_DISK_8
RMAN-08580: channel ORA_DISK_1: starting datafile copy
RMAN-08522: input datafile file number=00012 name=+OFD_DATA/ofmim01q/datafile/ofm_tbs_oaam_indx.272.810048785
...
RMAN-08581: channel ORA_DISK_4: datafile copy complete, elapsed time: 00:00:16
RMAN-08592: output file name=+OFN_DAT/ofmiy01q/datafile/ofm_ias_iau.373.842790419 tag=TAG20140321T065222
RMAN-08581: channel ORA_DISK_7: datafile copy complete, elapsed time: 00:00:16
RMAN-06731: command backup:94.1% complete, time left 00:21:05
//
// RMAN-06731 and % complete repeats here
// Process is completely stalled
RMAN-06731: command backup:94.1% complete, time left 00:21:05
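The stall signature in this output, the same RMAN-06731 line repeating with an unchanged percentage and time left, can be checked for mechanically. The following is only a minimal sketch: the function name `find_stall` and the `min_repeats` threshold are hypothetical and not part of RMAN, and it assumes the debug log has already been read into a string.

```python
import re

# Matches progress lines such as:
#   RMAN-06731: command backup:94.1% complete, time left 00:21:05
PROGRESS_RE = re.compile(
    r"RMAN-06731: command \w+:\s*([\d.]+)% complete, time left (\d{2}:\d{2}:\d{2})"
)


def find_stall(log_text, min_repeats=2):
    """Return the (percent, time_left) pair the log ends on if that exact
    pair was reported at least min_repeats times in a row, else None."""
    reports = PROGRESS_RE.findall(log_text)
    if not reports:
        return None
    last = reports[-1]
    repeats = 0
    # Count how many of the most recent reports are identical to the last one.
    for report in reversed(reports):
        if report != last:
            break
        repeats += 1
    return last if repeats >= min_repeats else None


if __name__ == "__main__":
    sample = (
        "RMAN-06731: command backup:94.1% complete, time left 00:21:05\n"
        "RMAN-06731: command backup:94.1% complete, time left 00:21:05\n"
    )
    print(find_stall(sample))
```

A repeated progress line is only a hint. Whether the channels are really stuck should be confirmed from the session wait events on the primary.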
On the primary database, 8 sessions are hung, and their wait time on the "remote db file write" event simply keeps increasing:
SQL> select SID ,SERIAL# , INST_ID , USERNAME, OSUSER || '@' || MACHINE OSINFO, SUBSTR(PROGRAM,0,20) PROGRAM,
  2  TO_CHAR(LOGON_TIME,'yyyy-mm-dd hh24:mi:ss') LOGON_TIME, EVENT, SECONDS_IN_WAIT SIW from gv$session where type <> 'BACKGROUND' and PROGRAM like 'rman%'
  3  ORDER BY USERNAME, INST_ID, SID;

 SID SERIAL# INST_ID USERNAME OSINFO            PROGRAM                  LOGON_TIME           EVENT                                 SIW
---- ------- ------- -------- ----------------- ------------------------ -------------------- ------------------------------ ----------
 632    5635       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:27  SQL*Net message from client            34
 758    2535       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:30  SQL*Net message from client             4
 948     441       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:35  remote db file write                52036
1010     369       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:36  remote db file write                35532
1073     215       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:37  remote db file write                52935
1136     291       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:37  remote db file write                54014
1199     753       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:38  remote db file write                41651
1325    1595       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:39  remote db file write                42730
1388    2121       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:39  remote db file write                50771
1451    1351       2 SYS      [email protected] [email protected] (TNS V 2014-04-08 17:29:40  remote db file write                47650
The hung processes must be killed at both the database and OS level.
Cause:
Unfortunately, the combination of expire_time and TCPS is not supported by Oracle, because the NTZ layer (used for TCPS communication) uses routines that are not async-signal-safe.
Using routines that are not async-signal-safe can cause unpredictable results such as hangs and crashes.
Solution:
Do not use DCD with SSL Oracle Net. Remove sqlnet.expire_time from the sqlnet.ora file or set it to 0 (zero).
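To confirm whether DCD is enabled, the sqlnet.ora contents can be scanned for the expire_time setting. The sketch below is illustrative only: the helper name `dcd_expire_time` is hypothetical, it handles just the simple `name = value` form with `#` comments, and it returns the first matching setting it finds.

```python
import re


def dcd_expire_time(sqlnet_ora_text):
    """Return the sqlnet.expire_time value (in minutes) found in the given
    sqlnet.ora text, or 0 if the parameter is absent or commented out."""
    for line in sqlnet_ora_text.splitlines():
        line = line.split("#", 1)[0]  # ignore anything after a '#' comment marker
        match = re.match(
            r"\s*sqlnet\.expire_time\s*=\s*(\d+)\s*$", line, re.IGNORECASE
        )
        if match:
            return int(match.group(1))
    return 0


if __name__ == "__main__":
    text = "SQLNET.EXPIRE_TIME = 10\n"
    minutes = dcd_expire_time(text)
    if minutes > 0:
        print("DCD is enabled; remove sqlnet.expire_time or set it to 0 when using TCPS")
```

A result of 0 means DCD is off, which is the state recommended above for SSL connections.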
If you need to keep the connection alive due to firewall issues, consider using the operating system's TCP KEEPALIVE parameters instead, for example:
TCP_KEEPIDLE (the amount of time until the first keepalive packet is sent)
TCP_KEEPCNT (the number of probes to send)
TCP_KEEPINTVL (the interval between keepalive packets)
Otherwise, if you need to use DCD, you must use non-SSL Oracle Net.
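To make the three keepalive parameters concrete, here is a minimal sketch of how they appear as per-socket options in Python. It is not how Oracle Net is configured. For database connections the system-wide kernel settings would normally apply (on Linux, `net.ipv4.tcp_keepalive_time`, `net.ipv4.tcp_keepalive_intvl` and `net.ipv4.tcp_keepalive_probes`). The function name `enable_keepalive` and the default values are hypothetical, chosen only for illustration.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keepalive for sock and, where the platform exposes them,
    set the three tuning options. Returns the options that were applied."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    options = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before the connection is dropped
    )
    applied = {}
    for name, value in options:
        opt = getattr(socket, name, None)  # not every OS defines all three constants
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
            applied[name] = value
    return applied


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(enable_keepalive(s))
    s.close()
```

With these illustrative values, an idle connection would be probed after 60 seconds and declared dead after five unanswered probes sent 10 seconds apart. In practice, the idle time should be shorter than the firewall's idle timeout.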