Ceph cluster error: HEALTH_ERR 1 pgs inconsistent; 1 scrub errors

The error output is as follows:

# ceph health detail

HEALTH_ERR 1 pgs inconsistent; 1 scrub errors;

pg 2.37c is active+clean+inconsistent, acting [75,6,35]

1 scrub errors

Summary of the error:

Problem PG: 2.37c

Acting OSDs: 75, 6, 35
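Before repairing blindly, it helps to see exactly which object and which copy the scrub flagged. A minimal sketch, assuming a Jewel-or-newer cluster (where these rados subcommands exist); <pool-name> is a placeholder for the pool that PG 2.37c belongs to:

rados list-inconsistent-pg <pool-name>

rados list-inconsistent-obj 2.37c --format=json-pretty

The second command prints the inconsistent object and, for each OSD in the acting set, the errors recorded against it (read error, digest mismatch, missing shard, and so on).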

Run the standard repair:

ceph pg repair 2.37c

Check the result of the repair:

# ceph health detail

HEALTH_ERR 1 pgs inconsistent; 1 scrub errors

pg 2.37c is active+clean+inconsistent, acting [75,6,35]

1 scrub errors

The problem persists; the inconsistent PG has not been repaired.
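One thing worth checking at this point: ceph pg repair only queues the operation on the primary OSD, so the outcome appears asynchronously in the cluster log rather than on the command line. A simple way to watch for it:

ceph -w | grep 2.37c

If the repair fails, the cluster log shows a "repair ... errors, 0 fixed" line like the one found in the OSD log further below.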

Next, scrub and deep-scrub the placement group, then try the repair again:

ceph pg scrub 2.37c

ceph pg deep-scrub 2.37c

ceph pg repair 2.37c
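Scrub, deep-scrub and repair are queued rather than executed instantly, so it is worth confirming that the deep scrub actually completed before concluding it did not help. A hedged way to check is to compare the scrub timestamps in the PG query output before and after:

ceph pg 2.37c query | grep -E 'last_scrub_stamp|last_deep_scrub_stamp'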

None of the above commands fixed it; the same error remained. The relevant OSD log shows the following:

2017-07-24 17:31:10.585305 7f72893c4700  0 log_channel(cluster) log [INF] : 2.37c repair starts

2017-07-24 17:31:10.710517 7f72893c4700 -1 log_channel(cluster) log [ERR] : 2.37c repair 1 errors, 0 fixed
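These two lines only say that one error was found and nothing was fixed; the concrete reason (read error, digest mismatch, missing object, etc.) is usually logged a few lines earlier by the same scrub/repair run. Assuming the default log location on the node hosting osd.75, something like this pulls it out:

grep '2.37c' /var/log/ceph/ceph-osd.75.log | grep -i err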

A lot of time had already been sunk into this, so the next step was to repair the three OSDs backing the PG:

ceph osd repair 75

ceph osd repair 6

ceph osd repair 35

After the repair commands had run and the OSD repairs finished, the error was still there! At this point two options came to mind:

1. Find the PG's object info, delete the bad data on the primary OSD, and let the cluster repair itself;

2. Change the primary OSD the PG currently uses (osd.75) to another disk; no way to do this was found at the time (but see the sketch just below this list).
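For option 2 there is actually a knob that might have helped, although it was not tried in this incident: lowering the primary affinity of osd.75 so CRUSH stops picking it as the primary for its PGs. A sketch, assuming primary affinity is enabled on this cluster (older releases need mon_osd_allow_primary_affinity = true):

ceph osd primary-affinity osd.75 0

A value of 0 means "avoid using this OSD as primary whenever possible"; setting it back to 1 restores the default behaviour.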

Around this time I came across a bug report in the Ceph tracker:

http://tracker.ceph.com/issues/12577

It turned out that others had already tried the same things, and this was a known bug!

In the end I went with the most brute-force solution: stop osd.75, the primary OSD of the problem PG.

Query which OSD is the PG's primary:

ceph pg 2.37c query | grep primary

"blocked_by": [],

"up_primary": 75,

"acting_primary": 75

Then stop that OSD:

systemctl stop ceph-osd@75.service
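Once the daemon is down (and marked out after the down-out interval), the re-replication of its data can be followed with either of:

ceph -w

ceph -s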

Ceph recovered the data that had lived on osd.75 onto other nodes; once the recovery had finished, I checked the cluster status again:

# ceph health detail

HEALTH_ERR 1 pgs inconsistent; 1 scrub errors

pg 2.37c is active+clean+inconsistent, acting [8,38,17]

1 scrub errors

Seeing that output was heartbreaking. Why was it still inconsistent? Without much hope, I ran the standard repair one more time:

# ceph pg repair 2.37c

instructing pg 2.37c on osd.8 to repair

Then check the cluster status:

# ceph health detail

HEALTH_OK

Well, what do you know! It's fixed. Nothing more to say. Time to go home!

