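Recently, to fix a critical database crash for a customer, we upgraded from v97fp9 to v97fp10.
A few days later the active log filled up. The customer called me first thing in the morning and I rushed on site.
It turned out that one application had been holding the oldest transaction log for a long time. It could not be forced, so it was most likely hung.
I collected the data and found the RCA within 10 minutes: we had hit APAR IT08059.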
IT08059:
Interrupting a lock escalation may result in a latch not being
released, which in turn may cause subsequent latch contention,
resulting in performance degradation.
The problem can occur on version 9.7 fix pack 10, version 10.1
fix pack 5, or version 10.5 fix pack 5 only.
Analysis of the performance problem will show a latch wait on
the SQLP_LTRN_CHAIN__entry_latch. The latch may or may not have
an owner, and the owner may or may not be the same transaction
that is requesting the latch.
Waiting on latch type: (SQLO_LT_SQLP_LTRN_CHAIN__entry_latch) -
Address: (7000001c2638260), Line: 106, File: sqlplsc.C
Local Fix:
Try to avoid lock escalation or interrupting a lock escalation.
So where did this APAR come from? It turns out it was introduced by the fix for APAR IT03126. Talk about getting burned.
IT03126: APPLICATION PERFORMING LOCK ESCALATION CANNOT BE FORCED
Error description:
An application that is holding a large number of locks may take
a long time to complete lock escalation. As a result, other
applications may experience lock waits, timeouts, deadlocks.
Lock escalation is not interruptable, and as a result, the
system may appear to be hanging. Currently, the only options
are to forcibly bring down DB2, or wait until the lock
escalation completes.
This APAR will add an enhancement to allow the lock escalation
process to be interrupted, so that an agent that is in lock
escalation can be forced.
Diagnostic data:
2015-04-01-09.21.53.295718+480 E565684697A825 LEVEL: Warning
PID : 49152100 TID : 37376 PROC : db2sysc 1
INSTANCE: db2inst1 NODE : 001 DB : ******
APPHDL : 0-9316 APPID: 10.1.72.116.9563.150401013601
AUTHID : ******
EDUID : 37376 EDUNAME: db2agntp (*****) 1
FUNCTION: DB2 UDB, data management, sqldEscalateLocks, probe:2
MESSAGE : ADM5500W The database manager is performing lock escalation. The
affected application is named "****.exe", and is associated with
the workload name "****" and application ID
"*****.9563.150401013601" at member "1". The total number of
locks currently held is "101489", and the target number of locks to
hold is "50744". Reason code: "1"
-- The lock escalation ran between 09:21:53 and 09:21:55, and it was interrupted during that window.
2015-04-01-09.21.54.915969+480 I565686072A537 LEVEL: Error
PID : 21430854 TID : 74023 PROC : db2sysc 0
INSTANCE: db2inst1 NODE : 000 DB : ******
APPHDL : 0-9316 APPID: 10.1.72.116.9563.150401013601
AUTHID : ******
EDUID : 74023 EDUNAME: db2agent (******) 0
FUNCTION: DB2 UDB, common communication, sqlcctcptest, probe:11
MESSAGE : Detected client termination
DATA #1 : Hexdump, 2 bytes
0x0A000000473E78B2 : 0036
<StackTrace>
-------Frame------ ------Function + Offset------
0x0900000024DF105C sqloXlatchConflict + 0x23C
0x0900000024DF1320 [email protected]@clone0 + 0x78
0x0900000024E4A980 sqloxltc_track[email protected]glueBED + 0xE0
0x0900000024323B80 sqlplrm__FP8sqeAgent + 0x140
0x0900000024FAF494 sqlpxrbk__FP8sqeAgentP15SQLXA_CALL_INFOPiP9SQLP_GXIDPP11sqlo_xlatch + 0x24
0x09000000247E53F4 sqlrrbck_dps__FP8sqlrr_cbiN22P15SQLXA_CALL_INFOP9SQLP_GXID + 0x7F0
0x0900000025480E90 sqlrr_tran_router__FP8sqlrr_cb + 0x558
0x09000000253F8210 sqlrr_subagent_router__FP8sqeAgentP12SQLE_DB2RA_T + 0x32C
0x0900000023EF1BA8 sqleSubRequestRouter__FP8sqeAgentPUiT2 + 0x5F4
0x0900000023EEEA4C sqleProcessSubRequest__FP8sqeAgent + 0x764
0x09000000245B5D4C RunEDU__8sqeAgentFv + 0x2EC
0x0900000024D49AB8 EDUDriver__9sqzEDUObjFv + 0xDC
0x0900000024D3E18C sqloEDUEntry + 0x254
</StackTrace>
<LatchInformation>
Waiting on latch type: (SQLO_LT_SQLP_LTRN_CHAIN__entry_latch) - Address: (a000200000869e0), Line: 441, File: sqlplrm.C
Holding Latch type: (SQLO_LT_SQLP_TENTRY__tranEntryLatch) - Address: (a00020000086900), Line: 748, File: /view/db2_v97fp10_aix64_s141015/vbs/engn/include/sqlpi_inlines.h HoldCount: 1
Holding Latch type: (SQLO_LT_SQLP_LTRN_CHAIN__entry_latch) - Address: (a000200000869e0), Line: 458, File: sqlplcl.C HoldCount: 1
Holding Latch type: (SQLO_LT_SQLP_LTRN__cursor_latch) - Address: (a00020000086fbc), Line: 430, File: sqlplrm.C HoldCount: 1
</LatchInformation>
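The latch section above shows the agent waiting on the SQLO_LT_SQLP_LTRN_CHAIN__entry_latch at the same address it already holds, which matches the APAR's note that the latch owner may be the requesting transaction itself. A minimal Python sketch of how such a self-wait could be spotted automatically, assuming the latch lines follow the format seen in this db2diag.log record; `find_self_waits` and the regular expression are illustrative helpers, not part of any DB2 tooling:

```python
import re

# Hypothetical pattern, derived only from the latch lines shown in this post.
LATCH_RE = re.compile(
    r"(Waiting on latch|Holding Latch) type: \((\w+)\) - Address: \(([0-9a-fA-F]+)\)"
    r", Line: (\d+), File: (\S+)"
)

def find_self_waits(text):
    """Return waited-on latches that the same record also lists as held."""
    waiting, holding = [], []
    for m in LATCH_RE.finditer(text):
        kind, name, addr, line, path = m.groups()
        entry = (name, addr.lower(), int(line), path)
        (waiting if kind.startswith("Waiting") else holding).append(entry)
    # A latch is identified by its type name plus its address.
    held = {(name, addr) for name, addr, _, _ in holding}
    return [w for w in waiting if (w[0], w[1]) in held]

# Latch lines copied from the record above.
sample = """
Waiting on latch type: (SQLO_LT_SQLP_LTRN_CHAIN__entry_latch) - Address: (a000200000869e0), Line: 441, File: sqlplrm.C
Holding Latch type: (SQLO_LT_SQLP_TENTRY__tranEntryLatch) - Address: (a00020000086900), Line: 748, File: /view/db2_v97fp10_aix64_s141015/vbs/engn/include/sqlpi_inlines.h HoldCount: 1
Holding Latch type: (SQLO_LT_SQLP_LTRN_CHAIN__entry_latch) - Address: (a000200000869e0), Line: 458, File: sqlplcl.C HoldCount: 1
Holding Latch type: (SQLO_LT_SQLP_LTRN__cursor_latch) - Address: (a00020000086fbc), Line: 430, File: sqlplrm.C HoldCount: 1
"""

for name, addr, line, path in find_self_waits(sample):
    print(f"self-wait on {name} at {addr} (requested at {path}:{line})")
```

Run against the record above, the sketch flags the entry latch requested at sqlplrm.C line 441, which the same agent acquired earlier at sqlplcl.C line 458 and never released.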
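Note that APAR IT08059 exists only in v97fp10, v10.1fp5, and v10.5fp5, which are all the latest fix packs right now. The Local Fix does not look practical, so our only option was to roll back to fp9 and request an fp9 special build from the lab.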