[Translated from a MOS article] When the NFS server goes down, the Oracle database freezes and the alert log shows no errors

Translated from the MOS article: When NFS Server Is Down, Oracle Server Freezes With No Errors In Alert Log File (Doc ID 1316251.1)

Applies to:

Oracle Server - Enterprise Edition - Version: 10.2.0.4 and later [Release: 10.2 and later]

IBM AIX on POWER Systems (64-bit)

Symptoms:

An Oracle instance on AIX has an NFS mount point that is used for backups. The mount point is mounted with the following options:

bg,hard,intr,rsize=32768,wsize=32768,sec=sys,noac,rw

When the NFS server goes down, the Oracle RDBMS freezes and no errors appear in the alert log. Once the NFS server comes back, the database works normally again without any problems.

Changes:

Nothing changed in the environment. Only NAS connectivity (to the NFS server) was lost, so the remote directory became inaccessible.

Cause:

In the uploaded truss traces of sqlplus and df, we can see the statx call hanging on /backup.

462940: statx("./../../../../backup", 0x0FFFFFFFFFFF5980, 176, 021) (sleeping...)
561338: kread(14, " ÿ ÿ J ø\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000360, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ÿ ÿ J ø\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000320, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ÿ ÿ J ø\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000310, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ÿ ÿ J ø\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000310, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ÿ ÿ J ø\0\0\0\0\0\0\010".., 64) Err#82 ERESTART
561338: Received signal #2, SIGINT [caught]
561338: sigprocmask(0, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: sigprocmask(1, 0x0FFFFFFFFFFF3620, 0x0000000000000000) = 0
561338: ksetcontext_sigreturn(0x0FFFFFFFFFFF37A0, 0x0000000000000000, 0x00000001100F04F0,
0x800000000000D032, 0x3000000000000000, 0x0000000000000320, 0x0000000000000000, 0x0000000000000000)
561338: kread(14, " ÿ ÿ J ø\0\0\0\0\0\0\010".., 64) (sleeping...)
462940: statx("./../../../../backup", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../usr", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../lib", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../audit", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../dev", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../etc", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../u", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../lpp", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../mnt", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../proc", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../sbin", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../bin", 0x0FFFFFFFFFFF5980, 176, 021) = 0
462940: statx("./../../../../oracle", 0x0FFFFFFFFFFF5980, 176, 021) = 0

The problem is here:

statx("./../../../../backup", 0x0FFFFFFFFFFF5980, 176, 021) (sleeping...)

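The Oracle code calls a Unix system call, 'getcwd', to obtain the current working directory. After that, all control is handed over to the operating system.

From what we can see, the function 'getcwd' calls 'getwd', which in turn calls 'statx'. Once 'statx' runs, it starts processing directory entries by executing 'statx' in the following order: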

./
./..
./../..
./../../.. (this goes on until the root directory is reached)

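Once the root directory (/) is reached, 'lstat' calls 'statx' for each entry in that directory. Oracle has no control over this process, so there is nothing we can do on the Oracle side to prevent it (this happens entirely at the OS level).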
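To see why a process whose working directory is nowhere near /backup still ends up touching it, the walk described above can be imitated. This is only a minimal sketch and not the actual AIX library code: root_level_probes is a hypothetical helper that takes an absolute working directory (assumed to have no trailing slash) plus a list of root directory entries, and prints the relative paths such a walk would pass to statx once it reaches the root level.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.
# $1      = absolute working directory, without a trailing slash
# $2...   = names of entries in the root directory
root_level_probes() {
    dir=$1
    shift
    prefix="."
    # Add one "/.." per path component until "/" is reached.
    while [ "$dir" != "/" ]; do
        prefix="$prefix/.."
        dir=${dir%/*}              # strip the last component
        [ -n "$dir" ] || dir="/"   # "/a" becomes "" after stripping, so restore "/"
    done
    # At the root level every entry is probed, whatever it is.
    for entry in "$@"; do
        printf '%s/%s\n' "$prefix" "$entry"
    done
}
```

For example, with a made-up working directory four levels deep, `root_level_probes /oracle/app/product/bin backup usr` prints `./../../../../backup` and `./../../../../usr`, the same shape as the paths in the trace. Every root entry is probed, so an unreachable NFS mount directly under / is enough to block the call.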

Solution:

For a similar problem, IBM has suggested the following action plan to avoid this issue. IBM's answer was:

Here's a solution to avoid the problem described by Oracle:
DO NOT have the NFS mounts directly under /, but put them one level lower. Then, we can use symbolic links to them.

Put the NFS mount point on the node at /nfs/backup (/nfs is a directory we'll create; it can have any name) and create a softlink /backup -> /nfs/backup.

$ ln -s /nfs/backup /backup

This will avoid the statx problem without having to make changes in the setup (because /backup is still there).
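The full sequence around that `ln -s` might look like the sketch below. It only prints the commands so they can be reviewed first; nothing is executed. relocation_plan is a hypothetical helper, the NFS export (host:/path) is a placeholder you must supply, and the mount options are simply the ones quoted in the symptoms. Any persistent mount definition on the host would need the same path change, and the exact mount syntax should be checked against your AIX level.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the steps for moving an NFS mount
# one level below / and leaving a symlink at the old path.
# $1 = NFS export in host:/path form (placeholder, supply your own)
# $2 = current mount point (default /backup)
# $3 = new mount point     (default /nfs/backup)
relocation_plan() {
    export_spec=$1
    old=${2:-/backup}
    new=${3:-/nfs/backup}
    # Options copied from the article; adjust to your environment.
    opts="bg,hard,intr,rsize=32768,wsize=32768,sec=sys,noac,rw"
    printf 'umount %s\n' "$old"
    printf 'rmdir %s\n' "$old"          # the old mount point must be an empty directory
    printf 'mkdir -p %s\n' "$new"
    printf 'mount -o %s %s %s\n' "$opts" "$export_spec" "$new"
    printf 'ln -s %s %s\n' "$new" "$old"
}
```

Calling `relocation_plan nfshost:/export/backup` (nfshost is a made-up name) prints five commands ending with `ln -s /nfs/backup /backup`, so existing scripts that reference /backup keep working.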

Additionally you can ask IBM about APAR # IZ85027, IZ85029, IZ85032, IZ86102, IZ87374, IZ90533.

Check with IBM which one applies to your configuration.
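
To check whether a host is exposed to this, one could list its NFS mount points and flag those that sit directly under /. The sketch below covers only the flagging step: root_level_mounts is a hypothetical helper that takes mount point paths as arguments and prints the ones with a single path component. How the list of NFS mount points is collected is left to you, since the layout of the mount output varies by platform.

```shell
#!/bin/sh
# Hypothetical helper: print each given mount point that sits directly under "/".
root_level_mounts() {
    for mp in "$@"; do
        mp=${mp%/}                          # drop one trailing slash, if any
        case $mp in
            /?*/*) ;;                       # two or more components: not at root level
            /?*)   printf '%s\n' "$mp" ;;   # exactly one component under /
        esac
    done
}
```

For instance, `root_level_mounts /backup /nfs/backup` prints only `/backup`, which is the mount that IBM's advice would move one level down.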
Date: 2024-10-19 16:29:13
