使用httpclient抓取时,netstat 发现很多time_wait连接

http://wiki.apache.org/HttpComponents/FrequentlyAskedConnectionManagementQuestions

1. Connections in TIME_WAIT State

After running your HTTP application, you use the netstat command and detect a lot of connections in stateTIME_WAIT. Now you wonder why these connections are not cleaned up.

1.1. What is the TIME_WAIT State?

The TIME_WAIT state is a protection mechanism in TCP. The side thatcloses a socket connection orderly will keep the connection in stateTIME_WAIT for some time, typically between 1 and 4 minutes. Thishappens afterthe connection is closed. It does notindicate a cleanup problem. The TIME_WAIT state protects against lossof data and data corruption. It is there to help you. For technicaldetails, have a look at the Unix Socket FAQ, section 2.7.

1.2. Some Connections Go To TIME_WAIT, Others Not

If a connection is orderly closed by your application, it will go tothe TIME_WAIT state. If a connection is orderly closed by the server,the server keeps it in TIME_WAIT and your client doesn‘t. If aconnection is reset or otherwise dropped by your application in anon-orderly fashion, it will not go to TIME_WAIT.

Unfortunately, it will not always be obvious to you whether aconnection is closed orderly or not. This is because connections arepooled and kept open for re-use by default. HttpClient 3.x, HttpClient4, and also the standard Java HttpURLConnection do that for you. Mostapplications will simply execute requests, then read from the responsestream, and finally close that stream. 
Closing the response stream is notthe same thing as closing the connection! Closing the response streamreturns the connection to the pool, but it will be kept open ifpossible. This saves a lot of time if you send another request to thesame host within a few seconds, or even minutes.

Connection pools have a limited number of connections. A pool mayhave 5 connections, or 100, or maybe only 1. When you send a request toa host, and there is no open connection to that host in the pool, a newconnection needs to be opened. But if the pool is already full, an openconnection has to be closed before a new one can be opened. In thiscase, the old connection will be closed orderly and go to the TIME_WAITstate. 
When your application exits and the JVM terminates, the open connections in the pools will not be closed orderly. They are reset or cancelled, without going to TIME_WAIT. To avoid this, you should call theshutdownmethod of the connection pools your application is using beforeexiting. The standard Java HttpURLConnection has no public method toshutdown it‘s connection pool.

1.3. Running Out Of Ports

Some applications open and orderly close a lot of connections withina short time, for example when load-testing a server. A connection instate TIME_WAIT will prevent that port number from being re-used foranother connection. That is not an error, it is the purpose ofTIME_WAIT.

TCP is configured at the operating system level, not through Java.Your first action should be to increase the number of ephemeral portson the machine. Windows in particular has a rather low default for theephemeral ports. The PerformanceWiki has tuning tips for the common operating systems, have a look at the respective Network section. 
Only if increasing the number of ephemeral ports does not solve yourproblem, you should consider decreasing the duration of the TIME_WAITstate. You probably have to reduce the maximum lifetime of IP packets,as the duration of TIME_WAIT is typically twice that timespan to allowfor a round-trip delay. Be aware that this will affect all applications running on the machine. Don‘t ask us how to do it, we‘re not the experts for network tuning.

There are some ways to deal with the problem at the applicationlevel. One way is to send a "Connection: close" header with eachrequest. That will tell the server to close the connection, so it goesto TIME_WAIT on the other side. Of course this also disables thekeep-alive feature of connection pooling and thereby degradesperformance. If you are running load tests against a server, theuntypical behavior of your application may distort the test results.[[BR] Another way is to not orderly close connections. There is a trickto set SO_LINGER to a special value, which will cause the connection tobe reset instead of orderly closed. Note that the HttpClient API willnot support that directly, you‘ll have to extend or modify some classesto implement this hack. 
Yet another way is to re-use ports thatare still blocked by a connection in TIME_WAIT. You can do that byspecifying the SO_REUSEADDR option when opening a socket. Java 1.4introduced the methodSocket.setReuseAddress for this purpose. You will have to extend or modify some classes of HttpClient for this too, but at least it‘s not a hack.

1.4. Further Reading

Unix Socket FAQ

java.net.Socket.setReuseAddress

Discussion on the HttpClient mailing list in December 2007

PerformanceWiki

netstat command line tool

http://www.softlab.ntua.gr/facilities/documentation/unix/unix-socket-faq/unix-socket-faq-2.html#ss2.7

2.7 Please explain the TIME_WAIT state.

Remember that TCP guarantees all data transmitted will be delivered, if atall possible. When you close a socket, the server goes into a TIME_WAITstate, just to be really really sure that all the data has gone through. When a socket is closed, both sides agree by sending messages to eachother that they will send no more data. This, it seemed to me was goodenough, and after the handshaking is done, the socket should be closed. The problem is two-fold. First, there is no way to be sure that the last ack was communicated successfully. Second, there may be "wandering duplicates" left on the net that must be dealt with if they are delivered.

Andrew Gierth (an[email protected]) helped to explain the closing sequence in the following usenet posting:

Assume that a connection is in ESTABLISHED state, and the client is aboutto do an orderly release. The client‘s sequence no. is Sc, and the server‘sis Ss. The pipe is empty in both directions.

   Client                                                   Server
   ======                                                   ======
   ESTABLISHED                                              ESTABLISHED
   (client closes)
   ESTABLISHED                                              ESTABLISHED
                <CTL=FIN+ACK><SEQ=Sc><ACK=Ss> ------->>
   FIN_WAIT_1
                <<-------- <CTL=ACK><SEQ=Ss><ACK=Sc+1>
   FIN_WAIT_2                                               CLOSE_WAIT
                <<-------- <CTL=FIN+ACK><SEQ=Ss><ACK=Sc+1>  (server closes)
                                                            LAST_ACK
                <CTL=ACK>,<SEQ=Sc+1><ACK=Ss+1> ------->>
   TIME_WAIT                                                CLOSED
   (2*msl elapses...)
   CLOSED

Note: the +1 on the sequence numbers is because the FIN counts as one byte of data. (The above diagram is equivalent to fig. 13 from RFC 793).

Now consider what happens if the last of those packets is dropped in thenetwork. The client has done with the connection; it has no more data orcontrol info to send, and never will have. But the server does not knowwhether the client received all the data correctly; that‘s what the lastACK segment is for. Now the server may or may not care whether theclient got the data, but that is not an issue for TCP; TCP is a reliablerotocol, and must distinguish between an orderly connection closewhere all data is transferred, and a connection abortwhere data mayor may not have been lost.

So, if that last packet is dropped, the server will retransmit it (it is,after all, an unacknowledged segment) and will expect to see a suitableACK segment in reply. If the client went straight to CLOSED, the onlypossible response to that retransmit would be a RST, which would indicateto the server that data had been lost, when in fact it had not been.

(Bear in mind that the server‘s FIN segment may, additionally, containdata.)

DISCLAIMER: This is my interpretation of the RFCs (I have read all theTCP-related ones I could find), but I have not attempted to examineimplementation source code or trace actual connections in order toverify it. I am satisfied that the logic is correct, though.

More commentarty from Vic:

The second issue was addressed by Richard Stevens ([email protected],author of "Unix Network Programming", see1.5 Where can I get source code for the book [book title]?).I have put together quotes from someof his postings and email which explain this. I have brought togetherparagraphs from different postings, and have made as few changes as possible.

From Richard Stevens ([email protected]):

If the duration of the TIME_WAIT state were just to handle TCP‘s full-duplex close, then the time would be much smaller, and it would be some function of the current RTO (retransmission timeout), not the MSL (the packet lifetime).

A couple of points about the TIME_WAIT state.

  • The end that sends the first FIN goes into the TIME_WAIT state, because thatis the end that sends the final ACK. If the other end‘s FIN is lost, orif the final ACK is lost, having the end that sends the first FIN maintain state about the connection guarantees that it has enough information to retransmit the final ACK.
  • Realize that TCP sequence numbers wrap around after 2**32 bytes have been transferred. Assume a connection between A.1500 (host A, port 1500) and B.2000. During the connection one segment is lost and retransmitted. But the segment is not really lost, it is held by some intermediate router and then re-injected into the network. (This is called a "wandering duplicate".) But in the time between the packet being lost & retransmitted, and then reappearing, the connection is closed (without any problems) and then another connection is established between the same host, same port (that is, A.1500 and B.2000; this is called another "incarnation" of the connection). But the sequence numbers chosen for the new incarnation just happen to overlap with the sequence number of the wandering duplicate that is about to reappear. (This is indeed possible, given the way sequence numbers are chosen for TCP connections.) Bingo, you are about to deliver the data from the wandering duplicate (the previous incarnation of the connection) to the new incarnation of the connection. To avoid this, you do not allow the same incarnation of the connection to be reestablished until the TIME_WAIT state terminates.Even the TIME_WAIT state doesn‘t complete solve the second problem, given what is called TIME_WAIT assassination. RFC 1337 has more details.
  • The reason that the duration of the TIME_WAIT state is 2*MSL is that the maximum amount of time a packet can wander around a network is assumed to be MSL seconds. The factor of 2 is for the round-trip. The recommended value for MSL is 120 seconds, but Berkeley-derived implementations normally use 30 seconds instead. This means a TIME_WAIT delay between 1 and 4 minutes. Solaris 2.x does indeed use the recommended MSL of 120 seconds.

A wandering duplicate is a packet that appeared to be lost and wasretransmitted. But it wasn‘t really lost ... some router had problems,held on to the packet for a while (order of seconds, could be a minuteif the TTL is large enough) and then re-injects the packet back intothe network. But by the time it reappears, the application that sentit originally has already retransmitted the data contained in that packet.

Because of these potential problems with TIME_WAIT assassinations, one should not avoid the TIME_WAIT state by setting the SO_LINGER option to send an RST instead of the normal TCP connection termination (FIN/ACK/FIN/ACK). The TIME_WAIT state is there for a reason; it‘s your friend and it‘s there to help you :-)

I have a long discussion of just this topic in my just-released "TCP/IPIllustrated, Volume 3". The TIME_WAIT state is indeed, one of the mostmisunderstood features of TCP.

I‘m currently rewriting "Unix Network Programming" (see1.5 Where can I get source code for the book [book title]?). and will include lots more on this topic, as it is often confusing and misunderstood.

An additional note from Andrew:

Closing a socket: if SO_LINGER has not been called on a socket, thenclose() is not supposed to discard data. This is true on SVR4.2 (and,apparently, on all non-SVR4 systems) but apparently not on SVR4; theuse of eithershutdown() or SO_LINGER seems to be required toguarantee delivery of all data.

http://blog.csdn.net/liuxuejin/article/details/8552677

时间: 2024-10-30 06:34:32

使用httpclient抓取时,netstat 发现很多time_wait连接的相关文章

记录一个简单的HttpClient抓取页面内容

现如今的网络时代,HTTP协议如此重要,随着java的发展,也越来越多的人采用java直接通过HTTP协议访问网络资源,虽然java.net提供了基本的访问HTTP协议的基本功能,但是对于大部分应用程序来说,仍旧还有许多功能不能够灵活使用:HttpClient是Apache Jakarta Common 下的子项目,一个提供访问HTTP协议的java工具包,提供了更多.更快捷.丰富的方法,HttpClient主要常用的功能有:实现了所有 HTTP 的方法(GET,POST,PUT,HEAD,DE

利用HttpClient抓取话费详单等信息

由于项目需要,需要获取授权用户的在运营商(中国移动.中国联通.中国电信)那里的个人信息.话费详单.月汇总账单信息(需要指出的是电信用户的个人信息无法从网上营业厅获取).抓取用户信息肯定是要模仿用户登录授权,然后爬取自己需要的东西.自然想到了利用HttpClient. 关于HttpClient的介绍可以到官网上面查看.不过需要指出的是HttpClient 项目从3.1的版本的时候就停止了更新,而是被含有HttpClient和HttpCore两个核心模块的HttpComponents 项目所取代,后

zabbix proxy 服务器 netstat 出现大量Time_Wait连接问题

问题描述: 监控系统云网关监控几万个TCP port的存活情况, 最近发现有几个端口出现告警闪断情况,怀疑因为运行TCP检查的 zabbix proxy 服务器 tcp参数配置不合理. netstat 发现有大量TIME_WAIT t连接. # netstat -n | awk '/^tcp/ {++y[$NF]} END {for(w in y) print w, y[w]}' TIME_WAIT 9584SYN_SENT 2FIN_WAIT1 2FIN_WAIT2 3ESTABLISHED

使用HttpClient抓取网站首页

HttpClient是Apache开发的第三方Java库,可以用来进行网络爬虫的开发,相关API的可以在http://hc.apache.org/httpcomponents-client-ga/httpclient/apidocs/查看. import java.io.BufferedReader; import java.io.InputStreamReader; import org.apache.http.client.methods.*; import org.apache.http.

用Python进行网页抓取

引言 从网页中提取信息的需求日益剧增,其重要性也越来越明显.每隔几周,我自己就想要到网页上提取一些信息.比如上周我们考虑建立一个有关各种数据科学在线课程的欢迎程度和意见的索引.我们不仅需要找出新的课程,还要抓取对课程的评论,对它们进行总结后建立一些衡量指标.这是一个问题或产品,其功效更多地取决于网页抓取和信息提取(数据集)的技术,而非以往我们使用的数据汇总技术. 网页信息提取的方式 从网页中提取信息有一些方法.使用API可能被认为是从网站提取信息的最佳方法.几乎所有的大型网站,像Twitter.

数据抓取的艺术(三):抓取Google数据之心得

本来是想把这部分内容放到前一篇<数据抓取的艺术(二):数据抓取程序优化>之中.但是随着任务的完成,我越来越感觉到其中深深的趣味,现总结如下: (1)时间     时间是一个与抓取规模相形而生的因素,数据规模越大,时间消耗往往越长.所以程序优化变得相当重要,要知道抓取时间越长,出错的可能性就越大,这还不说程序需要人工干预的情境.一旦运行中需要人工干预,时间越长,干预次数越多,出错的几率就更大了.在数据太多,工期太短的情况下,使用多线程抓取,也是一个好办法,但这会增加程序复杂度,对最终数据准确性产

数据抓取的艺术(三)

原文地址:http://blog.chinaunix.net/uid-22414998-id-3696649.html 本来是想把这部分内容放到前一篇<数据抓取的艺术(二):数据抓取程序优化>之中.但是随着任务的完成,我越来越感觉到其中深深的趣味,现总结如下: (1)时间     时间是一个与抓取规模相形而生的因素,数据规模越大,时间消耗往往越长.所以程序优化变得相当重要,要知道抓取时间越长,出错的可能性就越大,这还不说程序需要人工干预的情境.一旦运行中需要人工干预,时间越长,干预次数越多,出

Nutch学习笔记——抓取过程简析

Nutch学习笔记二--抓取过程简析 学习环境: ubuntu 概要: Nutch 是一个开源Java 实现的搜索引擎.它提供了我们运行自己的搜索引擎所需的全部工具.包括全文搜索和Web爬虫. 通过nutch,诞生了hadoop.tika.gora. 先安装SVN和Ant环境.(通过编译源码方式来使用nutch) apt-get install ant apt-get install subversion [email protected]:~/data/nutch$ svn co https:

建站指南:百度认为什么样的网站更有抓取和收录价值2012-06-20

建站指南:百度认为什么样的网站更有抓取和收录价值2012-06-20 百度认为什么样的网站更有抓取和收录价值呢?我们从下面几个方面简单介绍.鉴于技术保密以及网站运营的差异等其他原因,以下内容仅供站长参考,具体的收录策略包括但不仅限于所述内容. 第一方面:网站创造高品质的内容,能为用户提供独特的价值. 百度作为搜索引擎,最终的目的是满足用户的搜索需求,所以要求网站内容首先能满足用户的需求,现今互联网上充斥了大量同质的内容,在同样能满足用户需求的前提下,如果您网站提供的内容是独一无二的或者是具有一定