PMON failed to acquire latch, see PMON dump

前几天,一台Oracle数据库(Oracle Database 10g Release 10.2.0.4.0 - 64bit Production)监控出现"PMON failed to acquire latch, see PMON dump"错误,连接数据库出现短暂异常,告警日志中具体错误如下所示:

Tue Dec 20 09:13:16 2016
PMON failed to acquire latch, see PMON dump

Tue Dec 20 09:14:16 2016

PMON failed to acquire latch, see PMON dump

Tue Dec 20 09:15:55 2016

PMON failed to acquire latch, see PMON dump

Tue Dec 20 09:17:15 2016

PMON failed to acquire latch, see PMON dump

Tue Dec 20 09:17:24 2016

WARNING: inbound connection timed out (ORA-3136)

Tue Dec 20 09:18:23 2016

PMON failed to acquire latch, see PMON dump

Tue Dec 20 09:19:24 2016

PMON failed to acquire latch, see PMON dump

在生成的epps_pmon_4988.trc 跟踪文件里面,发现有些详细信息,你会发现PMON进程不能获取‘Child shared pool‘ latch,它被一个pid = 19 ospid=5022的进程给阻塞了。而ospid这个进程是一个Dispatcher的进程。

*** 2016-12-20 09:14:16.575
PMON unable to acquire latch  600edfa8 Child shared pool level=7 child#=1 

        Location from where latch is held: kghfrunp: alloc: session dur: 

        Context saved from call: 0

        state=busy, wlstate=free

    waiters [orapid (seconds since: put on list, posted, alive check)]:

     33 (3, 1482196555, 3)

     10 (3, 1482196555, 3)

     25 (3, 1482196555, 3)

     13 (3, 1482196555, 3)

     waiter count=4

    gotten 861091119 times wait, failed first 7114074 sleeps 1392223

    gotten 0 times nowait, failed: 0

  possible holder pid = 19 ospid=5022

----------------------------------------

SO: 0x40979aec8, type: 2, owner: (nil), flag: INIT/-/-/0x00

  (process) Oracle pid=19, calls cur/top: (nil)/0x409c92608, flag: (80) DISPATCHER

            int error: 0, call error: 0, sess error: 0, txn error 0

  (post info) last post received: 0 0 236

              last post received-location: kmcpdp

              last process to post me: 4097a64a0 106 64

              last post sent: 0 0 229

              last post sent-location: kmcmbf: not KMCVCFTOS

              last process posted by me: 4097a64a0 106 64

  (latch info) wait_event=0 bits=a0

    holding    (efd=4) 600edfa8 Child shared pool level=7 child#=1 

        Location from where latch is held: kghfrunp: alloc: session dur: 

        Context saved from call: 0

        state=busy, wlstate=free

        waiters [orapid (seconds since: put on list, posted, alive check)]:

         33 (3, 1482196555, 3)

         10 (3, 1482196555, 3)

         25 (3, 1482196555, 3)

         13 (3, 1482196555, 3)

         waiter count=4

    holding    (efd=4) 3fff78210 Child library cache level=5 child#=2 

        Location from where latch is held: kghfrunp: clatch: wait: 

        Context saved from call: 0

        state=busy, wlstate=free

        waiters [orapid (seconds since: put on list, posted, alive check)]:

         15 (3, 1482196555, 3)

         17 (3, 1482196555, 3)

         12 (3, 1482196555, 3)

         waiter count=3

    Process Group: DEFAULT, pseudo proc: 0x4098bc190

    O/S info: user: oracle, term: UNKNOWN, ospid: 5022 

    OSD pid info: Unix process pid: 5022, image: [email protected] (D007)

    Short stack dump: 

ksdxfstk()+32<-ksdxcb()+1547<-sspuser()+111<-__restore_rt()+0<-kghfrunp()+1506<-kghfnd()+1389<-kghalo()+587<-kmnsbm()+578<-nsb

al()+428<-nsbalc()+123<-nsdo()+17278<-nsopen()+2315<-nsanswer()+512<-kmnans()+37<-kmdahd()+385<-kmdmai()+5220<-kmmrdp()+564<-o

pirip()+1193<-opidrv()+582<-sou2o()+114<-opimai_real()+317<-main()+116<-__libc_start_main()+244<-_start()+41

Dump of memory from 0x0000000409747C68 to 0x0000000409747E70

409747C60                   00000001 00000000          [........]

409747C70 FE9BEE10 00000003 0000003A 0003129B  [........:.......]

409747C80 FEA7D5D0 00000003 0000003A 0003129B  [........:.......]

409747C90 FE9DAD30 00000003 0000003A 0003129B  [0.......:.......]

        Repeat 2 times

409747CC0 FEAB01F0 00000003 0000003A 0003129B  [........:.......]

409747CD0 FE9DAD30 00000003 0000003A 0003129B  [0.......:.......]

409747CE0 FEA44E70 00000003 0000003A 0003129B  [pN......:.......]

409747CF0 FEAA6FF0 00000003 0000003A 0003129B  [.o......:.......]

409747D00 FEAB8AD0 00000003 0000003A 0003129B  [........:.......]

409747D10 FEA14FF0 00000003 0000003A 0003129B  [.O......:.......]

409747D20 FE9A77F0 00000003 0000003A 0003129B  [.w......:.......]

        Repeat 1 times

409747D40 FEA3CEB0 00000003 0000003A 0003129B  [........:.......]

        Repeat 1 times

409747D60 FE9C64B0 00000003 0000003A 0003129B  [.d......:.......]

        Repeat 1 times

409747D80 FEA062B0 00000003 0000003A 0003129B  [.b......:.......]

        Repeat 3 times

409747DC0 FEAA6FF0 00000003 0000003A 0003129B  [.o......:.......]

409747DD0 FEA8F9D0 00000003 0000003A 0003129B  [........:.......]

409747DE0 FE9F7570 00000003 0000003A 0003129B  [pu......:.......]

409747DF0 FEA91530 00000003 0000003A 0003129B  [0.......:.......]

409747E00 FE9BEE10 00000003 0000003A 0003129B  [........:.......]

409747E10 FE9BB750 00000003 0000003A 0003129B  [P.......:.......]

409747E20 FEA90C10 00000003 0000003A 0003129B  [........:.......]

409747E30 FEA8B9F0 00000003 0000003A 0003129B  [........:.......]

409747E40 FE9C5270 00000003 0000003A 0003129B  [pR......:.......]

409747E50 FEAE12B0 00000003 0000003A 0003129B  [........:.......]

409747E60 FE9C5270 00000003 0000003A 0003129B  [pR......:.......]

由于当时没有出现问题时,并没有及时发现,没有Collect HangAnalyze traces,所以再深入一点的挖掘root case已经很难了。当时手工生成了一个快照(9:26),也就是说9:00 ~ 9:26这段时间生成的快照刚刚覆盖了出现问题的时间段。生成了这个时段的AWR报告,在这个时间段latch:library cache 和latch:shared pool等待事件是主要等待事件。

出现问题的时间段,数据库服务器是比较空闲的。

生成了20-Dec-16 09:11:16到20-Dec-16 09:21:16时段的ASH报告。如下所示,latch:library cache 和latch:shared pool为主要等待事件,但是Avg Active Sessions很小。

所以觉得很有可能是跟Bug有关系,后面在Oracle MetaLink查了一下是否有相关Bug,如下一些相关资料:

Bug 7039896 Spin under kghquiesce_regular_extent holding shared pool latch with AMM

Bug 6488694 - DATABASE HUNG WITH PMON FAILED TO ACQUIRE LATCH MESSAGE

Note 7039896.8 - Bug 7039896 - Spin under kghquiesce_regular_extent holding shared pool latch with AMM

Pmon Failed To Acquire Latch" Messages in Alert Log -Database Hung (文档 ID 468740.1)

Hang (Involving Shared Resource)

A process may hold a shared resource a lot longer than normally expected leading to many other processes having to wait for that resource. Such a resource could be a lock, a library cache pin, a latch etc.. The overall symptom is that typically numerous processes all appear to be stuck, although some processes may continue unhindered if they do not need the blocked resource.

Hang (Process Spins)

A process enters a tight CPU loop so appears to hang but is actually consuming CPU.

Latch Contention

This issue can result in latch contention within the database.

Waits for "latch: shared pool"

我们数据库版本为Oracle Database 10g Release 10.2.0.4.0 - 64bit Production, 所以Bug 7039896是会影响的这个数据库的, 而出现的现象也很符合,但是有一点就是并没有涉及MMAN进程。而且查过V$SGA_RESIZE_OPS,那个时间段并没有相关组件的增长、收缩。另外跟Bug 也非常类似,但是trc文件并没有发现跟MMAN进程有关系。 这个问题还是第一次出现,而且出现过一次后,最近几天都没有出现,所以更加确信是Bug引起的。当然是要找个时间应用Bug 7039896的相关补丁。

另外,在查找这个问题的时候,在官方文档看到一个如何处理、诊断‘PMON failed to acquire latch, see PMON dump‘的详细文档,本想收录于此,不过还是保持为PDF文件较好,需要可从下面链接下载。

SRDC - How to Collect Standard Information for Issues Where ‘PMON failed to acquire latch, see PMON dump‘ Warnings are Seen in the Alert Log (文档 ID 1951971.1)

时间: 2024-09-29 20:46:35

PMON failed to acquire latch, see PMON dump的相关文章

Failed to acquire connection &quot;SAP_PRD_NEW.SAPSR3&quot;. Connection may not be configured correctly or you may not have the right permissions

SQLSERVER JOB无法执行 错误提示: Message Executed as user: WORKGROUP\NSDZHSCMFP01$. Microsoft (R) SQL Server Execute Package Utility  Version 10.50.1600.1 for 32-bit  Copyright (C) Microsoft Corporation 2010. All rights reserved.    Started:  22:37:48  Error:

高并发压力下导致数据库bug

环境信息:  linux 6.1 + oracle11.2.0.3 RAC 问题现象: 学校晚上6点选课,人数大概有3000,7点时,数据库报错如下(数据库到6点多还是可以连接的),数据库hung住了. Tue Dec 16 18:00:33 2014Dumping diagnostic data in directory=[cdmp_20141216180033], requested by (instance=2, osid=24917 (M001)), summary=[incident=

Oracle Latch的学习【原创】

Latch详解 - MaxChou 本文以学习为目的,大部分内容来自网络转载. 什么是Latch 串行化 数据库系统本身是一个多用户并发处理系统,在同一个时间点上,可能会有多个用户同时操作数据库.多个用户同时在相同的物理位置上写数据时,不能发生互相覆盖的情况,这叫做串行化.串行化会降低系统的并发性,但这对于保护数据结构不被破坏来说则是必需的.在Oracle数据库中,通过闩锁(latch).锁定(lock).互斥(mutex)来实行串行化,保护数据结构一致性的. Latch的定义和作用 Oracl

深入理解Oracle中的latch

深入理解Oracle中的latch 串行化 概述 串行化 - 数据库系统本身是一个多用户并发处理系统,在同一个时间点上,可能会有多个用户同时操作数据库, 多个用户同时在相同的物理位置上写数据时,不能发生互相覆盖的情况,这叫做串行化,串行化会降低系统的并发性,但这对于保护数据结构不被破坏来说则是必需的.在Oracle数据库中,通过闩锁(latch)和锁定(lock)来解决这两个问题.闩锁和锁定既有相同点又有不同点.相同点在于它们都是用于实现串行化的资源.而不同点则在于闩锁(Latch)是一个低级别

oracle中lock和latch的用途

本文向各位阐述Oracle的Latch机制,Latch,用金山词霸翻译是门插栓,闭锁,专业术语叫锁存器,我开始接触时就不大明白为什么不写Lock,不都是锁吗?只是翻译不同而以?研究过后才知道两者有很大的区别. Oracle中锁的信息是数据块的一部分,是物理的,并不是逻辑上属于某个表或者某个行.Latch是Oracle提供的轻量级锁资源,他用于快速,短时间的锁定资源,防止多个并发进程同时修改访问某个共享资源,他只工作在内存中, 我们可以不大准确的说,内存中资源的锁叫latch,数据库对象(表,索引

[转帖]Latch

Latch (转) http://blog.csdn.net/tianlesoftware/article/details/5263238 2013-05-24 15:33:09 huashanlunjian 阅读数 13更多 分类专栏: 数据库 版权声明:本文为博主原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明. 本文链接:https://blog.csdn.net/huashanlunjian/article/details/84435068 一.  Latc

ORA_ERROR大全

转自:http://blog.csdn.net/haiross/article/details/12839229 常见错误:-60 ORA00060: deadlock detected while waiting for resource 一般错误:  - 1 ORA00001: unique constraint (.) violated  -17 ORA00017: session requested to set trace event  -18 ORA00018: maximum nu

RAC集群节点故障模拟测试

RAC节点故障模拟测试 重启单个RAC 节点模拟测试模拟操作步骤使用shutdown –Fr的方式重启节点,查看系统反应和数据库重新启动的时间.预期测试结果重启单个节点,vip将会切换到另外一个节点.系统重新启动之后,节点上的集群服务和数据库将会自动启动,重新加入集群.Vip也将切换回原始节点.测试过程记录使用shutdown 命令重启第三节点第三节点关闭之后查看crs服务状态RAC02:oracle:db2 > crs_stat -tName           Type           

AbstractQueuedSynchronizer源码分析

/* * ORACLE PROPRIETARY/CONFIDENTIAL. Use is subject to license terms. * * * * * * * * * * * * * * * * * * * * */ /* * * * * * * Written by Doug Lea with assistance from members of JCP JSR-166 * Expert Group and released to the public domain, as expl