[Ceph Troubleshooting] An OSD added to the Ceph cluster stays in the down state

Background

After a new OSD was added to the Ceph cluster, that OSD stayed in the down state.

Error messages

The relevant status output is shown below.

1. Check the OSD tree
[root@node1 Asia]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05388 root default
-2 0.01469     host node1
 0 0.00490         osd.0       up  1.00000          1.00000
 1 0.00490         osd.1       up  1.00000          1.00000
 2 0.00490         osd.2       up  1.00000          1.00000
-3 0.01959     host node2
 4 0.00490         osd.4       up  1.00000          1.00000
 5 0.00490         osd.5       up  1.00000          1.00000
 6 0.00490         osd.6       up  1.00000          1.00000
 7 0.00490         osd.7       up  1.00000          1.00000
-4 0.01959     host node3
 8 0.00490         osd.8       up  1.00000          1.00000
 9 0.00490         osd.9       up  1.00000          1.00000
 3 0.00490         osd.3       up  1.00000          1.00000
10 0.00490         osd.10      up  1.00000          1.00000
11       0 osd.11            down        0          1.00000
[root@node1 Asia]#
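
Notice that osd.11 is listed with weight 0 and is not under any host bucket, which already suggests it never fully registered with the cluster. For more detail on what the cluster knows about it, the following are worth a try (my own suggestion, Jewel-era commands):

ceph osd metadata 11     # empty if the osd.11 daemon never booted and reported its metadata
ceph osd find 11         # shows the CRUSH location (if any) recorded for osd.11
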
2. Check the OSD service status
[root@node1 /]# systemctl status ceph-osd@11.service
● ceph-osd@11.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: failed (Result: start-limit) since Sun 2018-09-09 22:15:25 EDT; 4h 57min ago
  Process: 10331 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=1/FAILURE)

Sep 09 22:15:05 node1 systemd[1]: ceph-osd@11.service: control process exited, code=exited status=1
Sep 09 22:15:05 node1 systemd[1]: Failed to start Ceph object storage daemon.
Sep 09 22:15:05 node1 systemd[1]: Unit ceph-osd@11.service entered failed state.
Sep 09 22:15:05 node1 systemd[1]: ceph-osd@11.service failed.
Sep 09 22:15:25 node1 systemd[1]: ceph-osd@11.service holdoff time over, scheduling restart.
Sep 09 22:15:25 node1 systemd[1]: start request repeated too quickly for ceph-osd@11.service
Sep 09 22:15:25 node1 systemd[1]: Failed to start Ceph object storage daemon.
Sep 09 22:15:25 node1 systemd[1]: Unit ceph-osd@11.service entered failed state.
Sep 09 22:15:25 node1 systemd[1]: ceph-osd@11.service failed.

3. Try to start the OSD
[root@node1 /]# systemctl start ceph-osd@11.service
Job for ceph-osd@11.service failed because the control process exited with error code. See "systemctl status ceph-osd@11.service" and "journalctl -xe" for details.
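
A side note on the start-limit state shown earlier: after repeated failures systemd rate-limits further start requests ("start request repeated too quickly"), so it can be worth clearing the failure counter before retrying (a general systemd tip, not something the original post does):

systemctl reset-failed ceph-osd@11.service
systemctl start ceph-osd@11.service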

4. Inspect the error
[root@node1 /]# journalctl -xe
Sep 10 03:12:52 node1 polkitd[723]: Unregistered Authentication Agent for unix-process:10473:4129481 (system bus name :1.52, object p
Sep 10 03:13:12 node1 systemd[1]: ceph-osd@11.service holdoff time over, scheduling restart.
Sep 10 03:13:12 node1 systemd[1]: Starting Ceph object storage daemon...
-- Subject: Unit ceph-osd@11.service has begun start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-osd@11.service has begun starting up.
Sep 10 03:13:12 node1 ceph-osd-prestart.sh[10483]: OSD data directory /var/lib/ceph/osd/ceph-11 does not exist; bailing out.
Sep 10 03:13:12 node1 systemd[1]: ceph-osd@11.service: control process exited, code=exited status=1
Sep 10 03:13:12 node1 systemd[1]: Failed to start Ceph object storage daemon.
-- Subject: Unit ceph-osd@11.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
--
-- Unit ceph-osd@11.service has failed.
--
-- The result is failed.
Sep 10 03:13:12 node1 systemd[1]: Unit ceph-osd@11.service entered failed state.
Sep 10 03:13:12 node1 systemd[1]: ceph-osd@11.service failed.

To be honest, I do not know exactly what caused the error above (the prestart script complains that the OSD data directory /var/lib/ceph/osd/ceph-11 does not exist), but according to my notes the whole cluster was in an ERR state when this OSD was added.
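
Before re-adding anything, it is worth confirming on the node what the prestart script is complaining about: whether a data directory or prepared partition for osd.11 exists at all. A minimal check (my own sketch for the Jewel/ceph-disk era; /dev/sde is taken from the re-add step later in this post):

ls /var/lib/ceph/osd/        # ceph-11 should be listed here, but is not
ceph-disk list               # shows which disks/partitions ceph-disk considers prepared or active
lsblk /dev/sde               # the disk that was meant to back osd.11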

Fix

The cluster status at the time the OSD was originally added was as follows:

[root@node1 ceph]# ceph -s
    cluster 8eaa3f15-0946-4500-b018-6d31d1cc69f6
     health HEALTH_ERR
            clock skew detected on mon.node2, mon.node3
            121 pgs are stuck inactive for more than 300 seconds
            121 pgs peering
            121 pgs stuck inactive
            121 pgs stuck unclean
            Monitor clock skew detected
     monmap e1: 3 mons at {node1=192.168.209.100:6789/0,node2=192.168.209.101:6789/0,node3=192.168.209.102:6789/0}
            election epoch 266, quorum 0,1,2 node1,node2,node3
     osdmap e5602: 12 osds: 11 up, 11 in; 120 remapped pgs
            flags sortbitwise,require_jewel_osds
      pgmap v16259: 128 pgs, 1 pools, 0 bytes data, 0 objects
            1421 MB used, 54777 MB / 56198 MB avail
                 120 remapped+peering
                   7 active+clean
                   1 peering

Adding an OSD while the cluster is in an ERR state can run into problems, so I decided to remove osd.11 and add it again (the cluster status is now OK).

The removal steps are as follows:

1. The cluster status is as follows
[root@node1 ceph]# ceph -s
    cluster 8eaa3f15-0946-4500-b018-6d31d1cc69f6
     health HEALTH_OK
     monmap e1: 3 mons at {node1=192.168.209.100:6789/0,node2=192.168.209.101:6789/0,node3=192.168.209.102:6789/0}
            election epoch 292, quorum 0,1,2 node1,node2,node3
     osdmap e5664: 12 osds: 12 up, 12 in
            flags sortbitwise,require_jewel_osds
      pgmap v16508: 128 pgs, 1 pools, 0 bytes data, 0 objects
            1500 MB used, 59806 MB / 61307 MB avail
                 128 active+clean
2. Mark the down OSD out of the cluster
[root@node1 /]# ceph osd out osd.11
osd.11 is already out.
3. Remove the down OSD from the OSD map
[root@node1 /]# ceph osd rm osd.11
removed osd.11
4. Remove the down OSD from the CRUSH map
[root@node1 /]# ceph osd crush rm osd.11
device 'osd.11' does not appear in the crush map
5. Delete the OSD's authentication key
[root@node1 /]# ceph auth del osd.11
updated
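
Note that "does not appear in the crush map" in step 4 is expected here: osd.11 never registered in CRUSH (it shows weight 0 and no host bucket in the osd tree above). For a fully registered OSD, the sequence usually recommended in the Ceph docs is slightly different, roughly:

ceph osd out osd.11
systemctl stop ceph-osd@11.service     # on the OSD's host, if the daemon is still running
ceph osd crush remove osd.11
ceph auth del osd.11
ceph osd rm osd.11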

The re-add steps are as follows:

1. Zap the disk
[root@node1 /]# ceph-disk zap /dev/sde
Caution: invalid backup GPT header, but valid main header; regenerating
backup header from main header.

Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
on the recovery & transformation menu to examine the two tables.

Warning! One or more CRCs don't match. You should repair the disk!

****************************************************************************
Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
verification and recovery are STRONGLY recommended.
****************************************************************************
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
Creating new GPT entries.
The operation has completed successfully.
[root@node1 /]#
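
If you want to be sure the zap left a clean, empty GPT behind before handing the disk to ceph-deploy, a quick look at the partition table is enough (my own sketch, not from the original post):

lsblk /dev/sde            # should show the bare disk with no partitions
sgdisk --print /dev/sde   # should report a GPT with no partition entries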

2. Create the OSD
[root@node1 ceph]# ceph-deploy --overwrite-conf osd create node1:/dev/sde
[ceph_deploy.conf][DEBUG ] found configuration file at: /root/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.39): /usr/bin/ceph-deploy --overwrite-conf osd create node1:/dev/sde
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username                      : None
[ceph_deploy.cli][INFO  ]  block_db                      : None
[ceph_deploy.cli][INFO  ]  disk                          : [('node1', '/dev/sde', None)]
[ceph_deploy.cli][INFO  ]  dmcrypt                       : False
[ceph_deploy.cli][INFO  ]  verbose                       : False
[ceph_deploy.cli][INFO  ]  bluestore                     : None
[ceph_deploy.cli][INFO  ]  block_wal                     : None
[ceph_deploy.cli][INFO  ]  overwrite_conf                : True
[ceph_deploy.cli][INFO  ]  subcommand                    : create
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir               : /etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet                         : False
[ceph_deploy.cli][INFO  ]  cd_conf                       : <ceph_deploy.conf.cephdeploy.Conf instance at 0x7f06a8343248>
[ceph_deploy.cli][INFO  ]  cluster                       : ceph
[ceph_deploy.cli][INFO  ]  fs_type                       : xfs
[ceph_deploy.cli][INFO  ]  filestore                     : None
[ceph_deploy.cli][INFO  ]  func                          : <function osd at 0x7f06a8394050>
[ceph_deploy.cli][INFO  ]  ceph_conf                     : None
[ceph_deploy.cli][INFO  ]  default_release               : False
[ceph_deploy.cli][INFO  ]  zap_disk                      : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks node1:/dev/sde:
[node1][DEBUG ] connected to host: node1
[node1][DEBUG ] detect platform information from remote host
[node1][DEBUG ] detect machine type
[node1][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: CentOS Linux 7.5.1804 Core
[ceph_deploy.osd][DEBUG ] Deploying osd to node1
[node1][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[ceph_deploy.osd][DEBUG ] Preparing host node1 disk /dev/sde journal None activate True
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: /usr/sbin/ceph-disk -v prepare --cluster ceph --fs-type xfs -- /dev/sde
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=fsid
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-allows-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-wants-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --check-needs-journal -i 0 --log-file $run_dir/$cluster-osd-check.log --cluster ceph --setuser ceph --setgroup ceph
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] set_type: Will colocate journal with data on /dev/sde
[node1][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph --show-config-value=osd_journal_size
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
[node1][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
[node1][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
[node1][WARNIN] command: Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] ptype_tobe_for_name: name = journal
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] create_partition: Creating journal partition num 2 size 5120 on /dev/sde
[node1][WARNIN] command_check_call: Running command: /usr/sbin/sgdisk --new=2:0:+5120M --change-name=2:ceph journal --partition-guid=2:22ea9667-570d-4697-b9dc-21968d31c445 --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sde
[node1][DEBUG ] The operation has completed successfully.
[node1][WARNIN] update_partition: Calling partprobe on created device /dev/sde
[node1][WARNIN] command_check_call: Running command: /usr/bin/udevadm settle --timeout=600
[node1][WARNIN] command: Running command: /usr/bin/flock -s /dev/sde /usr/sbin/partprobe /dev/sde
[node1][WARNIN] command_check_call: Running command: /usr/bin/udevadm settle --timeout=600
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde2 uuid path is /sys/dev/block/8:66/dm/uuid
[node1][WARNIN] prepare_device: Journal is GPT partition /dev/disk/by-partuuid/22ea9667-570d-4697-b9dc-21968d31c445
[node1][WARNIN] prepare_device: Journal is GPT partition /dev/disk/by-partuuid/22ea9667-570d-4697-b9dc-21968d31c445
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] set_data_partition: Creating osd partition on /dev/sde
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] ptype_tobe_for_name: name = data
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] create_partition: Creating data partition num 1 size 0 on /dev/sde
[node1][WARNIN] command_check_call: Running command: /usr/sbin/sgdisk --largest-new=1 --change-name=1:ceph data --partition-guid=1:e9aecd36-93a6-456a-b05f-b8097d16d88d --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/sde
[node1][DEBUG ] Warning: The kernel is still using the old partition table.
[node1][DEBUG ] The new table will be used at the next reboot.
[node1][DEBUG ] The operation has completed successfully.
[node1][WARNIN] update_partition: Calling partprobe on created device /dev/sde
[node1][WARNIN] command_check_call: Running command: /usr/bin/udevadm settle --timeout=600
[node1][WARNIN] command: Running command: /usr/bin/flock -s /dev/sde /usr/sbin/partprobe /dev/sde
[node1][WARNIN] command_check_call: Running command: /usr/bin/udevadm settle --timeout=600
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde1 uuid path is /sys/dev/block/8:65/dm/uuid
[node1][WARNIN] populate_data_path_device: Creating xfs fs on /dev/sde1
[node1][WARNIN] command_check_call: Running command: /usr/sbin/mkfs -t xfs -f -i size=2048 -- /dev/sde1
[node1][DEBUG ] meta-data=/dev/sde1              isize=2048   agcount=4, agsize=327615 blks
[node1][DEBUG ]          =                       sectsz=512   attr=2, projid32bit=1
[node1][DEBUG ]          =                       crc=1        finobt=0, sparse=0
[node1][DEBUG ] data     =                       bsize=4096   blocks=1310459, imaxpct=25
[node1][DEBUG ]          =                       sunit=0      swidth=0 blks
[node1][DEBUG ] naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
[node1][DEBUG ] log      =internal log           bsize=4096   blocks=2560, version=2
[node1][DEBUG ]          =                       sectsz=512   sunit=0 blks, lazy-count=1
[node1][DEBUG ] realtime =none                   extsz=4096   blocks=0, rtextents=0
[node1][WARNIN] mount: Mounting /dev/sde1 on /var/lib/ceph/tmp/mnt.5St2Fg with options noatime,inode64
[node1][WARNIN] command_check_call: Running command: /usr/bin/mount -t xfs -o noatime,inode64 -- /dev/sde1 /var/lib/ceph/tmp/mnt.5St2Fg
[node1][WARNIN] command: Running command: /usr/sbin/restorecon /var/lib/ceph/tmp/mnt.5St2Fg
[node1][WARNIN] populate_data_path: Preparing osd data dir /var/lib/ceph/tmp/mnt.5St2Fg
[node1][WARNIN] command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.5St2Fg/ceph_fsid.10803.tmp
[node1][WARNIN] command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.5St2Fg/ceph_fsid.10803.tmp
[node1][WARNIN] command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.5St2Fg/fsid.10803.tmp
[node1][WARNIN] command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.5St2Fg/fsid.10803.tmp
[node1][WARNIN] command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.5St2Fg/magic.10803.tmp
[node1][WARNIN] command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.5St2Fg/magic.10803.tmp
[node1][WARNIN] command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.5St2Fg/journal_uuid.10803.tmp
[node1][WARNIN] command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.5St2Fg/journal_uuid.10803.tmp
[node1][WARNIN] adjust_symlink: Creating symlink /var/lib/ceph/tmp/mnt.5St2Fg/journal -> /dev/disk/by-partuuid/22ea9667-570d-4697-b9dc-21968d31c445
[node1][WARNIN] command: Running command: /usr/sbin/restorecon -R /var/lib/ceph/tmp/mnt.5St2Fg
[node1][WARNIN] command: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/tmp/mnt.5St2Fg
[node1][WARNIN] unmount: Unmounting /var/lib/ceph/tmp/mnt.5St2Fg
[node1][WARNIN] command_check_call: Running command: /bin/umount -- /var/lib/ceph/tmp/mnt.5St2Fg
[node1][WARNIN] get_dm_uuid: get_dm_uuid /dev/sde uuid path is /sys/dev/block/8:64/dm/uuid
[node1][WARNIN] command_check_call: Running command: /usr/sbin/sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- /dev/sde
[node1][DEBUG ] The operation has completed successfully.
[node1][WARNIN] update_partition: Calling partprobe on prepared device /dev/sde
[node1][WARNIN] command_check_call: Running command: /usr/bin/udevadm settle --timeout=600
[node1][WARNIN] command: Running command: /usr/bin/flock -s /dev/sde /usr/sbin/partprobe /dev/sde
[node1][WARNIN] command_check_call: Running command: /usr/bin/udevadm settle --timeout=600
[node1][WARNIN] command_check_call: Running command: /usr/bin/udevadm trigger --action=add --sysname-match sde1
[node1][INFO  ] Running command: systemctl enable ceph.target
[node1][INFO  ] checking OSD status...
[node1][DEBUG ] find the location of an executable
[node1][INFO  ] Running command: /bin/ceph --cluster=ceph osd stat --format=json
[node1][WARNIN] there is 1 OSD down
[node1][WARNIN] there is 1 OSD out
[ceph_deploy.osd][DEBUG ] Host node1 is now ready for osd use.
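
The "1 OSD down / 1 OSD out" warnings at the end of the ceph-deploy run are printed while udev is still activating the newly prepared partition, so on their own they are not necessarily a problem. If the OSD does not come up by itself after a short wait, activation can be kicked off manually (a hedged example from the ceph-disk era):

ceph-disk activate /dev/sde1
systemctl status ceph-osd@11.service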

3. Check the OSD tree
[root@node1 ceph]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 0.05878 root default
-2 0.01959     host node1
 0 0.00490         osd.0       up  1.00000          1.00000
 1 0.00490         osd.1       up  1.00000          1.00000
 2 0.00490         osd.2       up  1.00000          1.00000
11 0.00490         osd.11      up  1.00000          1.00000
-3 0.01959     host node2
 4 0.00490         osd.4       up  1.00000          1.00000
 5 0.00490         osd.5       up  1.00000          1.00000
 6 0.00490         osd.6       up  1.00000          1.00000
 7 0.00490         osd.7       up  1.00000          1.00000
-4 0.01959     host node3
 8 0.00490         osd.8       up  1.00000          1.00000
 9 0.00490         osd.9       up  1.00000          1.00000
 3 0.00490         osd.3       up  1.00000          1.00000
10 0.00490         osd.10      up  1.00000          1.00000 

4. Check the OSD service status
[root@node1 ceph]# systemctl status ceph-osd@11.service
● ceph-osd@11.service - Ceph object storage daemon
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
   Active: active (running) since Mon 2018-09-10 03:20:37 EDT; 20min ago
 Main PID: 11379 (ceph-osd)
   CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@11.service
           └─11379 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser ceph --setgroup ceph

Sep 10 03:20:36 node1 systemd[1]: ceph-osd@11.service holdoff time over, scheduling restart.
Sep 10 03:20:36 node1 systemd[1]: Starting Ceph object storage daemon...
Sep 10 03:20:37 node1 ceph-osd-prestart.sh[11325]: create-or-move updating item name 'osd.11' weight 0.0049 at location {hos...sh map
Sep 10 03:20:37 node1 systemd[1]: Started Ceph object storage daemon.
Sep 10 03:20:38 node1 ceph-osd[11379]: starting osd.11 at :/0 osd_data /var/lib/ceph/osd/ceph-11 /var/lib/ceph/osd/ceph-11/journal
Sep 10 03:21:13 node1 ceph-osd[11379]: 2018-09-10 03:21:13.399072 7f09b5797ac0 -1 osd.11 0 log_to_monitors {default=true}
Hint: Some lines were ellipsized, use -l to show in full.
[root@node1 ceph]#
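
This time the prestart check passes because the data directory that was missing in the first attempt now exists and is a mounted partition, which can be confirmed directly (a small sketch of my own):

ls /var/lib/ceph/osd/ceph-11          # fsid, keyring, journal symlink, etc. should be present
df -h /var/lib/ceph/osd/ceph-11       # should show /dev/sde1 mounted on this directory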

Note that partway through adding the OSD, the cluster briefly shows an ERR state:

[root@node1 ceph]# ceph -s
    cluster 8eaa3f15-0946-4500-b018-6d31d1cc69f6
     health HEALTH_ERR
            11 pgs are stuck inactive for more than 300 seconds
            13 pgs peering
            11 pgs stuck inactive
            11 pgs stuck unclean
     monmap e1: 3 mons at {node1=192.168.209.100:6789/0,node2=192.168.209.101:6789/0,node3=192.168.209.102:6789/0}
            election epoch 292, quorum 0,1,2 node1,node2,node3
     osdmap e5664: 12 osds: 12 up, 12 in
            flags sortbitwise,require_jewel_osds
      pgmap v16499: 128 pgs, 1 pools, 0 bytes data, 0 objects
            1519 MB used, 59788 MB / 61307 MB avail
                  98 active+clean
                  17 activating
                  13 peering
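
The activating and peering PGs settle on their own; once they finish, the cluster returns to HEALTH_OK without further intervention. To watch it converge, either of these will do (hedged suggestion):

ceph -w               # streams health and PG state changes as they happen
watch -n 2 ceph -s    # or poll the cluster summary every couple of seconds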

Original post: http://blog.51cto.com/xiaowangzai/2173309
