Broken pipe when accessing HBase through happybase --- two "earth-shattering" bugs

Background
When reading from and writing to HBase with happybase over the Thrift interface, we kept hitting Broken pipe errors. A minimal sketch of the failing access pattern is shown below, followed by the troubleshooting steps.
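
For context, a minimal sketch of the happybase access pattern that triggered the error. The host is the one from our cluster; the table name test_table and column family cf are placeholders, not our real schema:

import time
import happybase

conn = happybase.Connection('cdh-master-slave1', port=9090)  # Thrift server, default port 9090
table = conn.table('test_table')

table.put(b'row-1', {b'cf:col': b'value'})  # works fine right after connecting

time.sleep(600)  # connection sits idle for a while

# The next call then fails with "socket.error: [Errno 32] Broken pipe"
# (BrokenPipeError on Python 3).
print(table.row(b'row-1'))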

1. Check the HBase logs:

Java HotSpot(TM) 64-Bit Server VM warning: Using incremental CMS is deprecated and will likely be removed in a future release
17/05/12 18:08:41 INFO util.VersionInfo: HBase 1.2.0-cdh5.10.1
17/05/12 18:08:41 INFO util.VersionInfo: Source code repository file:///data/jenkins/workspace/generic-package-centos64-7-0/topdir/BUILD/hbase-1.2.0-cdh5.10.1 revision=Unknown
17/05/12 18:08:41 INFO util.VersionInfo: Compiled by jenkins on Mon Mar 20 02:46:09 PDT 2017
17/05/12 18:08:41 INFO util.VersionInfo: From source with checksum c6d9864e1358df7e7f39d39a40338b4e
17/05/12 18:08:41 INFO thrift.ThriftServerRunner: Using default thrift server type
17/05/12 18:08:41 INFO thrift.ThriftServerRunner: Using thrift server type threadpool
17/05/12 18:08:42 WARN impl.MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
17/05/12 18:08:42 INFO impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
17/05/12 18:08:42 INFO impl.MetricsSystemImpl: HBase metrics system started
17/05/12 18:08:42 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
17/05/12 18:08:42 INFO http.HttpRequestLog: Http request log for http.requests.thrift is not defined
17/05/12 18:08:42 INFO http.HttpServer: Added global filter 'safety' (class=org.apache.hadoop.hbase.http.HttpServer$QuotingInputFilter)
17/05/12 18:08:42 INFO http.HttpServer: Added global filter 'clickjackingprevention' (class=org.apache.hadoop.hbase.http.ClickjackingPreventionFilter)
17/05/12 18:08:42 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context thrift
17/05/12 18:08:42 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
17/05/12 18:08:42 INFO http.HttpServer: Added filter static_user_filter (class=org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
17/05/12 18:08:42 INFO http.HttpServer: Jetty bound to port 9095
17/05/12 18:08:42 INFO mortbay.log: jetty-6.1.26.cloudera.4
17/05/12 18:08:42 WARN mortbay.log: Can't reuse /tmp/Jetty_0_0_0_0_9095_thrift____.vqpz9l, using /tmp/Jetty_0_0_0_0_9095_thrift____.vqpz9l_5120175032480185058
17/05/12 18:08:43 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:9095
17/05/12 18:08:43 INFO thrift.ThriftServerRunner: starting TBoundedThreadPoolServer on /0.0.0.0:9090 with readTimeout 300000ms; min worker threads=128, max worker threads=1000, max queued requests=1000
...
17/05/08 15:05:51 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x645132bf connecting to ZooKeeper ensemble=cdh-master-slave1:2181,cdh-slave2:2181,cdh-slave3:2181
17/05/08 15:05:51 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cdh-master-slave1:2181,cdh-slave2:2181,cdh-slave3:2181 sessionTimeout=60000 watcher=hconnection-0x645132bf0x0, quorum=cdh-master-slave1:2181,cdh-slave2:2181,cdh-slave3:2181, baseZNode=/hbase
17/05/08 15:05:51 INFO zookeeper.ClientCnxn: Opening socket connection to server cdh-slave3/192.168.10.219:2181. Will not attempt to authenticate using SASL (unknown error)
17/05/08 15:05:51 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.10.23:43170, server: cdh-slave3/192.168.10.219:2181
17/05/08 15:05:51 INFO zookeeper.ClientCnxn: Session establishment complete on server cdh-slave3/192.168.10.219:2181, sessionid = 0x35bd74a77802148, negotiated timeout = 60000
17/05/08 15:32:50 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x35bd74a77802148
17/05/08 15:32:51 INFO zookeeper.ZooKeeper: Session: 0x35bd74a77802148 closed
17/05/08 15:32:51 INFO zookeeper.ClientCnxn: EventThread shut down
17/05/08 15:38:53 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xb876351 connecting to ZooKeeper ensemble=cdh-master-slave1:2181,cdh-slave2:2181,cdh-slave3:2181
17/05/08 15:38:53 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=cdh-master-slave1:2181,cdh-slave2:2181,cdh-slave3:2181 sessionTimeout=60000 watcher=hconnection-0xb8763510x0, quorum=cdh-master-slave1:2181,cdh-slave2:2181,cdh-slave3:2181, baseZNode=/hbase
17/05/08 15:38:53 INFO zookeeper.ClientCnxn: Opening socket connection to server cdh-master-slave1/192.168.10.23:2181. Will not attempt to authenticate using SASL (unknown error)
17/05/08 15:38:53 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.10.23:35526, server: cdh-master-slave1/192.168.10.23:2181
17/05/08 15:38:53 INFO zookeeper.ClientCnxn: Session establishment complete on server cdh-master-slave1/192.168.10.23:2181, sessionid = 0x15ba3ddc6cc90d4, negotiated timeout = 60000

Initial inference: some timeout configured on the HBase side is dropping the connection. One way to test this from the client side is sketched below.
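
A minimal probe sketch, assuming a reachable Thrift server and an existing table named test_table with column family cf (both placeholders): vary the idle interval between two calls and see where the failures start.

import time
import happybase

# Probe increasing idle intervals to find where the server starts
# dropping connections (the intervals, in seconds, are illustrative).
for idle in (30, 60, 120, 300, 600):
    conn = happybase.Connection('cdh-master-slave1', port=9090)
    table = conn.table('test_table')
    table.put(b'probe', {b'cf:col': b'x'})
    time.sleep(idle)
    try:
        table.row(b'probe')
        print('idle %4ds: OK' % idle)
    except Exception as exc:  # Broken pipe / Thrift transport errors land here
        print('idle %4ds: failed: %s' % (idle, exc))
    finally:
        conn.close()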

2. Check the official documentation. No particularly relevant timeout parameter turned up.

3. Google for similar problems

A closely related issue turned up:

HBASE-14926
Hung ThriftServer; no timeout on read from client; if client crashes, worker thread gets stuck reading
Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0, 1.2.0, 1.1.2, 1.3.0, 1.0.3, 0.98.16
Fix Version/s: 2.0.0, 1.2.0, 1.3.0, 0.98.17
Component/s: Thrift
Labels: None
Hadoop Flags: Reviewed
Release Note: Adds a timeout to server read from clients. Adds new configs hbase.thrift.server.socket.read.timeout for setting read timeout on server socket in milliseconds. Default is 60000;
Description
Thrift server is hung. All worker threads are doing this:
"thrift-worker-0" daemon prio=10 tid=0x00007f0bb95c2800 nid=0xf6a7 runnable [0x00007f0b956e0000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
        - locked <0x000000066d859490> (a java.io.BufferedInputStream)
        at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
        at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
        at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
        at org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:601)
        at org.apache.thrift.protocol.TCompactProtocol.readMessageBegin(TCompactProtocol.java:470)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
        at org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)
        at org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:64)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
They never recover.
I don't have client side logs.
We've been here before: HBASE-4967 "connected client thrift sockets should have a server side read timeout" but this patch only got applied to fb branch (and thrift has changed since then)
PS: source: https://issues.apache.org/jira/browse/HBASE-14926

4. Google "hbase.thrift.server.socket.read.timeout"

One of the results contains the following (translated here):

Problem background

The test environment is a Hadoop distributed setup on three servers: hadoop-2.7.3, HBase-1.2.4, zookeeper-3.4.9.
Data is written into HBase through the Thrift C++ interface. Writes always start out fine, then begin to fail after a while.
The previously used hbase-0.94.27 never hit this problem with the same configuration; it had always worked fine.

[Screenshot: Thrift interface error]

Solution

A packet capture shows the HBase server answering with a RST packet, which breaks the connection.
The startup output of the bin/hbase thrift start -threadpool command shows readTimeout set to 60 s.

[Screenshot: thrift threadpool startup output]

Verification confirmed it is indeed tied to this setting. The key was not present in our configuration, and reading the code shows 60 s is the default that applies when it is unset.

So it is enough to add the property to conf/hbase-site.xml:

<property>
         <name>hbase.thrift.server.socket.read.timeout</name>
         <value>6000000</value>
         <description>Thrift server socket read timeout, in milliseconds</description>
</property>

PS: source: http://blog.csdn.net/wwlhz/article/details/56012053

After adding the parameter and restarting the HBase Thrift server, the problem seemed to be solved. A quick client-side check is sketched below.
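
A simple way to confirm the fix from the client, again with placeholder table/column names: hold a connection idle for longer than the quoted 60 s default and check that the next call still succeeds.

import time
import happybase

conn = happybase.Connection('cdh-master-slave1', port=9090)
table = conn.table('test_table')

table.put(b'row-1', {b'cf:col': b'value'})
time.sleep(90)              # idle past the old 60 s read timeout
print(table.row(b'row-1'))  # previously raised Broken pipe; now succeeds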

5. The source code confirms this:

// https://github.com/apache/hbase/blob/master/hbase-thrift/src/main/java/org/apache/hadoop/hbase/thrift/ThriftServerRunner.java

...
  public static final String THRIFT_SERVER_SOCKET_READ_TIMEOUT_KEY =
    "hbase.thrift.server.socket.read.timeout";
  public static final int THRIFT_SERVER_SOCKET_READ_TIMEOUT_DEFAULT = 60000;
...
      // The configured value (default 60000 ms) is applied as the
      // per-client socket timeout on the server transport.
      int readTimeout = conf.getInt(THRIFT_SERVER_SOCKET_READ_TIMEOUT_KEY,
          THRIFT_SERVER_SOCKET_READ_TIMEOUT_DEFAULT);
      TServerTransport serverTransport = new TServerSocket(
          new TServerSocket.ServerSocketTransportArgs().
              bindAddr(new InetSocketAddress(listenAddress, listenPort)).
              backlog(backlog).
              clientTimeout(readTimeout));

Problem solved~~~

6. But was the problem really solved?

Actually, no. After a while we noticed that after roughly 20 minutes of continuous scanning the connection was dropped again. Another round of painful searching showed this to be an issue shipped with this HBase version: it treats every connection as idle, whether it is in use or not, and there is an hbase.thrift.connection.max-idletime setting governing when idle connections are closed. I therefore set that option to 31104000 (one year). In CDH this should be configured on the management page, as in the screenshot: [Screenshot: CDH configuration page]. Independently of the server settings, the client can also be hardened, as sketched below.
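
A sketch of client-side hardening with happybase's ConnectionPool, which discards a connection when a Thrift error escapes the with block, plus a simple retry. Host is our cluster's; table and row names are placeholders; the TException import matches recent happybase builds based on thriftpy2 (older ones use thriftpy instead).

import socket
import happybase
from thriftpy2.thrift import TException  # thriftpy on older happybase builds

pool = happybase.ConnectionPool(size=3, host='cdh-master-slave1', port=9090)

def get_row(row_key, retries=2):
    """Fetch a row, retrying once if the pooled connection was dropped."""
    for attempt in range(retries):
        try:
            # A Thrift error escaping the `with` block makes the pool
            # discard this connection, so a retry gets a fresh socket.
            with pool.connection() as conn:
                return conn.table('test_table').row(row_key)
        except (TException, socket.error):  # Broken pipe lands here
            if attempt == retries - 1:
                raise

print(get_row(b'row-1'))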

General steps when hitting a problem:

For learning the most:
1. Read the logs around the error to get a first fix on the problem
2. Read the official documentation
3. Google for similar problems, or read the source code to pin the problem down

For fixing it fast:
1. Read the logs around the error to get a first fix on the problem
2. Google for similar problems
3. Read the official documentation, or read the source code



Reference:

Original article: http://blog.51cto.com/13103353/2107257

