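When installing a GI PSU on Oracle with opatch auto, the nodes must be patched one at a time. During last night's upgrade, I mistakenly ran the GI PSU installation on both nodes at the same time, and once the patch had finished installing, CRS could not be brought up.
I chose to roll the patch back, only to discover that rollback requires CRS to be running, so the only option left was to restore from the backup files.
However, through carelessness I had not used the root user when compressing the $ORACLE_HOME and $GRID_HOME directories, so some files were never backed up.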
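For similar PSUs in the future, there are two things to watch:
First, always patch one node at a time;
Second, always archive $GRID_HOME, $ORACLE_HOME and the oraInventory directory as root. These backups are rarely needed, but when they are, they are the last resort.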
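The two precautions above could be scripted roughly as follows. This is only a sketch: the function names backup_homes and run_serially are made up for illustration, and the per-node patch command is left to a caller-supplied wrapper rather than guessed at here.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of any Oracle tooling.

# backup_homes DEST DIR...
# Archive each DIR into DEST/<basename>.tar.gz.
backup_homes() {
    dest=$1
    shift
    if [ "$(id -u)" -ne 0 ]; then
        # tar cannot read files the current user has no permission for;
        # it reports them and exits non-zero, which fails this function.
        echo "warning: not running as root; root-owned files may be missed" >&2
    fi
    mkdir -p "$dest" || return 1
    for dir in "$@"; do
        name=$(basename "$dir")
        parent=$(dirname "$dir")
        # File modes (including setuid/setgid bits) are recorded in the archive.
        tar -czf "$dest/$name.tar.gz" -C "$parent" "$name" || return 1
    done
}

# run_serially CMD NODE...
# Run CMD once per node, strictly one after another; stop at the first failure.
run_serially() {
    cmd=$1
    shift
    for node in "$@"; do
        echo "patching $node"
        "$cmd" "$node" || { echo "failed on $node; stopping" >&2; return 1; }
    done
}
```

As root, something like backup_homes /backup/pre_psu "$GRID_HOME" "$ORACLE_HOME" /path/to/oraInventory would produce one archive per directory (both paths here are placeholders). Restoring with tar -xpzf as root keeps the recorded modes. run_serially expects a command or shell function, for example a hypothetical patch_node wrapper, that patches exactly one node and returns non-zero on failure.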
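Below is part of the record.
At 23:20 I found that node 2 had started installing the GI PSU before node 1 had finished, most likely because of my own mistake.
At around 23:40, node 1 and node 2 finished patching in turn, and the following message appeared: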
Oracle Grid Infrastructure stack start initiated but failed to complete at /tmp/p17735354_112030_Linux-x86-64/17592127/files/crs/install/crsconfig_lib.pm line 11645.
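opatch failed while running this script: rootcrs.pl -patch
CRS failed to start. I checked the corresponding ohasd log and searched the documentation for the error I found there: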
Unable to start OHASD after apply of PSU patch. CLSU-00103: error location: usrgetgrp12 [1562797.1]
Unable to start CRS after applying Grid Infrastructure Patch [1200582.1]
The problems are similar, and the recommendation is to roll back using a software backup.
Error log:
ESOURCES] to : []
2014-06-18 23:43:59.284: [ CRSOCR][1143286080] {0:0:2} Multi Write Batch processing...
2014-06-18 23:43:59.284: [ AGFW][1141184832] {0:0:2} Agfw Proxy Server received the message: CMD_COMPLETED[Proxy] ID 20482:71
2014-06-18 23:43:59.284: [ AGFW][1141184832] {0:0:2} Agfw Proxy Server replying to the message: CMD_COMPLETED[Proxy] ID 20482:71
2014-06-18 23:43:59.286: [UiServer][1153792320] {0:0:4} processMessage called
2014-06-18 23:43:59.286: [UiServer][1153792320] {0:0:4} Sending message to PE. ctx= 0x2aaab00297c0, Client PID: 27058
2014-06-18 23:43:59.286: [UiServer][1153792320] {0:0:4} Sending command to PE: 2
2014-06-18 23:43:59.287: [ CRSPE][1151691072] {0:0:4} Processing PE command id=103. Description: [Stat Resource : 0x2aaaac0ec830]
2014-06-18 23:43:59.287: [ CRSPE][1151691072] {0:0:4} Expression Filter : (((NAME == ora.crsd) OR (NAME == ora.cssd)) OR (NAME == ora.evmd))
2014-06-18 23:43:59.289: [UiServer][1153792320] {0:0:4} Done for ctx=0x2aaab00297c0
2014-06-18 23:43:59.297: [UiServer][1155893568] Closed: remote end failed/disc.
2014-06-18 23:43:59.400: [ CRSOCR][1143286080] {0:0:2} Multi Write Batch done.
2014-06-18 23:43:59.400: [ CRSPE][1151691072] {0:0:2} Resource Autostart completed for gdgz-ps-tszc-db04-x3950
2014-06-18 23:43:59.455: [UiServer][1155893568] CS(0x2aaab002ba70)set Properties ( root,0x7f43170)
2014-06-18 23:43:59.455: [UiServer][1155893568] SS(0x7ebdc60)Accepted client connection: saddr =(ADDRESS=(PROTOCOL=ipc)(DEV=639)(KEY=OHASD_UI_SOCKET))daddr = (ADDRESS=(PROTOCOL=ipc)(KEY=OHASD_UI_SOCKET))
2014-06-18 23:43:59.466: [UiServer][1153792320] {0:0:5} processMessage called
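The restore from the software backup began at 1:10. By 1:40 the software on node 1 was back to normal, while node 2 was responding slowly.
Node 1 hit the following problem on startup: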
startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DATA/oltp/spfileoltp.ora'
ORA-17503: ksfdopn:2 failed to open file +DATA/oltp/spfileoltp.ora
ORA-12547: TNS: lost contact
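ORA-12537 / ORA-12547 or TNS-12518 if Listener (including SCAN Listener) and Database are Owned by Different OS User (Doc ID 1069517.1)
Using the document above, I traced the problem to incorrect permissions on $GRID_HOME/bin/oracle.
The correct mode is -rwsr-s--x, but on this system it was -rwxr-x--x.
After fixing the permissions with chmod 6751 oracle, the instance started successfully.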
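A check like this can be run right after a restore, before trying to start anything. The sketch below assumes GNU stat (stat -c, as on Linux); the function name check_oracle_mode is hypothetical.

```shell
#!/bin/sh
# check_oracle_mode FILE [EXPECTED_MODE]
# Print "ok" if FILE has the expected octal mode (default 6751, i.e. -rwsr-s--x);
# otherwise print the current mode and the chmod that would fix it.
# Assumes GNU stat; the function name is made up for illustration.
check_oracle_mode() {
    file=$1
    expected=${2:-6751}
    mode=$(stat -c '%a' "$file") || return 2
    if [ "$mode" = "$expected" ]; then
        echo "ok"
    else
        echo "mode is $mode, expected $expected; fix with: chmod $expected $file"
        return 1
    fi
}
```

For example, check_oracle_mode "$GRID_HOME/bin/oracle" on the broken node would have reported mode 751 instead of 6751.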
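At 3:30, the application side tried to connect and ran into two problems:
ORA-03113 and ORA-12516.
I then restarted node 1, and it would not come back up.
The corresponding ohasd log: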
2014-06-19 04:41:40.708: [ default][616989264] OHASD Daemon Starting. Command string :restart
2014-06-19 04:41:40.713: [ default][616989264] Initializing OLR
2014-06-19 04:41:40.714: [ OCROSD][616989264]utopen:6m': failed in stat OCR file/disk /oracle/app/11.2.0.3/grid/cdata/gdgz-ps-tszc-db03-x3950.olr, errno=2, os err string=No such file or directory
2014-06-19 04:41:40.714: [ OCROSD][616989264]utopen:7: failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2014-06-19 04:41:40.714: [ OCRRAW][616989264]proprinit: Could not open raw device
2014-06-19 04:41:40.714: [ OCRAPI][616989264]a_init:16!: Backend init unsuccessful : [26]
2014-06-19 04:41:40.714: [ CRSOCR][616989264] OCR context init failure. Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2014-06-19 04:41:40.715: [ default][616989264] Created alert : (:OHAS00106:) : OLR initialization failed, error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2014-06-19 04:41:40.715: [ default][616989264][PANIC] OHASD exiting; Could not init OLR
2014-06-19 04:41:40.715: [ default][616989264] Done
2014-06-19 04:41:40.715: [ default][616989264][PANIC] OHASD exiting; Could not init OLR
2014-06-19 04:41:40.715: [ default][616989264] Done
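In a log this noisy, the one useful detail is which file ohasd could not find. As a rough sketch, the path can be pulled out with sed. It assumes each log entry sits on a single line in the real ohasd.log and that the message wording matches the "failed in stat OCR file/disk" line above; the function name is hypothetical.

```shell
#!/bin/sh
# missing_registry_files LOGFILE
# Print each distinct path that ohasd reported as "No such file" (errno=2)
# while trying to stat its registry file. Function name is illustrative only.
missing_registry_files() {
    sed -n 's|.*failed in stat OCR file/disk \([^,]*\), errno=2,.*|\1|p' "$1" | sort -u
}
```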
At 4:30 the software on node 2 was fully restored, but the same error appeared:
2014-06-19 04:00:22.064
[ohasd(28201)]CRS-0704:Oracle High Availability Service aborted due to Oracle Local Registry error [PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]]. Details at (:OHAS00106:) in /oracle/app/11.2.0.3/grid/log/gdgz-ps-tszc-db04-x3950/ohasd/ohasd.log.
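The cause turned out to be a missing OLR file.
Because that file is owned by root, it had not been backed up in the first place.
Fortunately, node 2's old directory had been renamed but still existed. I took the file from node 2's old software directory, put it in the corresponding directory, and node 2 started successfully.
I also copied the file to the corresponding directory on node 1 and renamed it, after which node 1 started successfully as well.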
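Since the missing OLR only showed up at restart, a quick check before restarting might have caught it earlier. The sketch below assumes the Linux convention that /etc/oracle/olr.loc contains an olrconfig_loc=<path> line; the function names are made up for illustration.

```shell
#!/bin/sh
# olr_path_from_loc LOCFILE
# Print the value of the olrconfig_loc entry in an olr.loc-style file.
olr_path_from_loc() {
    sed -n 's/^olrconfig_loc=//p' "$1"
}

# check_olr LOCFILE
# Report whether the OLR file named in LOCFILE actually exists on disk.
check_olr() {
    path=$(olr_path_from_loc "$1")
    if [ -z "$path" ]; then
        echo "no olrconfig_loc entry in $1" >&2
        return 2
    fi
    if [ -f "$path" ]; then
        echo "OLR present: $path"
    else
        echo "OLR missing: $path"
        return 1
    fi
}
```

Running check_olr /etc/oracle/olr.loc after a restore shows at once whether the file is there. In 11.2, ocrconfig -local -manualbackup (run as root) can also take an OLR backup before patching, which avoids depending on a file-level archive alone.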
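After that, the application still could not connect.
As above, I fixed the permissions on $ORACLE_HOME/bin/oracle and restarted the database, after which everything was normal.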
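This incident left a deep impression on me. Anyone working with databases needs the composure to stay calm even if Mount Tai collapses in front of them. Between 2:00 and 4:00 my mind was mostly blank and I did not know what I was doing. It is quite possible that the registry file (OLR) that originally existed on node 1 was deleted by me during that window, which is what caused the error when node 1 was restarted.
The road ahead is long. Compared with mid-level and senior DBAs, I am only a junior, and it is not without reason that I earn only a junior's pay.
Keep going.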
A GI PSU upgrade failure caused by an operator error