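HttpClient: Apache's open-source HTTP client project
This article is a partial translation of the Tutorial shipped with HttpClient 4.3.6. It covers only what is needed to fetch web pages and measure their size, including second- and third-level pages.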
Preface:
The Hyper-Text Transfer Protocol (HTTP) is perhaps the most significant protocol used on the Internet today. Web services, network-enabled appliances and the growth of network computing continue to expand the role of the HTTP protocol beyond user-driven web browsers, while increasing the number of applications that require HTTP support.
Although the java.net package provides basic functionality for accessing resources via HTTP, it doesn't provide the full flexibility or functionality needed by many applications. HttpClient seeks to fill this void by providing an efficient, up-to-date, and feature-rich package implementing the client side of the most recent HTTP standards and recommendations.
Designed for extension while providing robust support for the base HTTP protocol, HttpClient may be of interest to anyone building HTTP-aware client applications such as web browsers, web service clients, or systems that leverage or extend the HTTP protocol for distributed communication.
HttpClient scope:
- Client-side HTTP transport library based on HttpCore [http://hc.apache.org/httpcomponents-core/index.html]
- Based on classic (blocking) I/O
- Content agnostic
What HttpClient is NOT:
HttpClient is NOT a browser. It is a client-side HTTP transport library. HttpClient's purpose is to transmit and receive HTTP messages. HttpClient will not attempt to process content, execute JavaScript embedded in HTML pages, guess the content type when it is not explicitly set, reformat request or redirect location URIs, or perform any other function unrelated to HTTP transport.
Chapter 1. Fundamentals
1.1. Request execution
The most essential function of HttpClient is to execute HTTP methods. Execution of an HTTP method involves one or several HTTP request / HTTP response exchanges, usually handled internally by HttpClient. The user is expected to provide a request object to execute, and HttpClient is expected to transmit the request to the target server and return a corresponding response object, or throw an exception if execution was unsuccessful.
Quite naturally, the main entry point of the HttpClient API is the HttpClient interface that defines the contract described above.
Here is an example of the request execution process in its simplest form:
    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpGet httpget = new HttpGet("http://localhost/");
    CloseableHttpResponse response = httpclient.execute(httpget);
    try {
        <...>
    } finally {
        response.close();
    }
1.1.1. HTTP request
All HTTP requests have a request line consisting of a method name, a request URI and an HTTP protocol version.
HttpClient supports out of the box all HTTP methods defined in the HTTP/1.1 specification: GET, HEAD, POST, PUT, DELETE, TRACE and OPTIONS. There is a specific class for each method type: HttpGet, HttpHead, HttpPost, HttpPut, HttpDelete, HttpTrace, and HttpOptions.
The Request-URI is a Uniform Resource Identifier that identifies the resource upon which to apply the request. HTTP request URIs consist of a protocol scheme, host name, optional port, resource path, optional query, and optional fragment.
    HttpGet httpget = new HttpGet(
        "http://www.google.com/search?hl=en&q=httpclient&btnG=Google+Search&aq=f&oq=");
HttpClient provides the URIBuilder utility class to simplify the creation and modification of request URIs.
    URI uri = new URIBuilder()
            .setScheme("http")
            .setHost("www.google.com")
            .setPath("/search")
            .setParameter("q", "httpclient")
            .setParameter("btnG", "Google Search")
            .setParameter("aq", "f")
            .setParameter("oq", "")
            .build();
    HttpGet httpget = new HttpGet(uri);
    System.out.println(httpget.getURI());
stdout >
http://www.google.com/search?q=httpclient&btnG=Google+Search&aq=f&oq=
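To see how the URI components listed above map onto a concrete URI, the following minimal sketch uses only the JDK's java.net.URI rather than HttpClient itself. The class name UriComponentsDemo and the sample URI are illustrative assumptions, not part of the tutorial.

```java
import java.net.URI;

final class UriComponentsDemo {

    /** Returns the components of a URI in the order described in the text. */
    static String describe(String text) {
        URI uri = URI.create(text);
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " port=" + uri.getPort()          // -1 when no explicit port is given
                + " path=" + uri.getPath()
                + " query=" + uri.getQuery()        // null when absent
                + " fragment=" + uri.getFragment(); // null when absent
    }

    public static void main(String[] args) {
        // Illustrative URI only; any absolute http URI works here.
        System.out.println(describe("http://www.google.com/search?q=httpclient&aq=f#top"));
    }
}
```

The port and fragment are optional, which is why java.net.URI reports -1 and null for them when they are missing.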
1.1.2. HTTP response
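An HTTP response is the message sent by the server back to the client after it has received and interpreted a request. The first line of that message consists of the protocol version followed by a numeric status code and its associated textual phrase. For example:
    HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
            HttpStatus.SC_OK, "OK");
    System.out.println(response.getProtocolVersion());
    System.out.println(response.getStatusLine().getStatusCode());
    System.out.println(response.getStatusLine().getReasonPhrase());
    System.out.println(response.getStatusLine().toString());
stdout >
    HTTP/1.1
    200
    OK
    HTTP/1.1 200 OK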
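The structure of the status line can also be illustrated without HttpClient. The sketch below is a simplified, JDK-only parser written for this article; StatusLineDemo and its parse method are hypothetical names, and the parsing is deliberately less strict than what a real HTTP library does.

```java
final class StatusLineDemo {

    /** Splits a status line such as "HTTP/1.1 200 OK" into { version, code, reason }. */
    static String[] parse(String statusLine) {
        // Limit of 3 keeps reason phrases that contain spaces ("Not Found") intact.
        String[] parts = statusLine.split(" ", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Not a status line: " + statusLine);
        }
        Integer.parseInt(parts[1]); // throws NumberFormatException if the code is not numeric
        String reason = parts.length == 3 ? parts[2] : "";
        return new String[] { parts[0], parts[1], reason };
    }

    public static void main(String[] args) {
        String[] p = parse("HTTP/1.1 404 Not Found");
        System.out.println(p[0]);
        System.out.println(p[1]);
        System.out.println(p[2]);
    }
}
```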
1.1.3. Working with message headers
An HTTP message can contain a number of headers describing properties of the message such as the content length, content type and so on. HttpClient provides methods to retrieve, add, remove and enumerate headers.
    HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
            HttpStatus.SC_OK, "OK");
    response.addHeader("Set-Cookie", "c1=a; path=/; domain=localhost");
    response.addHeader("Set-Cookie", "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
    Header h1 = response.getFirstHeader("Set-Cookie");
    System.out.println(h1);
    Header h2 = response.getLastHeader("Set-Cookie");
    System.out.println(h2);
    Header[] hs = response.getHeaders("Set-Cookie");
    System.out.println(hs.length);
stdout >
    Set-Cookie: c1=a; path=/; domain=localhost
    Set-Cookie: c2=b; path="/", c3=c; domain="localhost"
    2
The most efficient way to obtain all headers of a given type is to use the HeaderIterator interface.
    HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
            HttpStatus.SC_OK, "OK");
    response.addHeader("Set-Cookie", "c1=a; path=/; domain=localhost");
    response.addHeader("Set-Cookie", "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
    HeaderIterator it = response.headerIterator("Set-Cookie");
    while (it.hasNext()) {
        System.out.println(it.next());
    }
stdout >
    Set-Cookie: c1=a; path=/; domain=localhost
    Set-Cookie: c2=b; path="/", c3=c; domain="localhost"
HttpClient also provides convenience methods to parse HTTP messages into individual header elements.
    HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
            HttpStatus.SC_OK, "OK");
    response.addHeader("Set-Cookie", "c1=a; path=/; domain=localhost");
    response.addHeader("Set-Cookie", "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
    HeaderElementIterator it = new BasicHeaderElementIterator(
            response.headerIterator("Set-Cookie"));
    while (it.hasNext()) {
        HeaderElement elem = it.nextElement();
        System.out.println(elem.getName() + " = " + elem.getValue());
        NameValuePair[] params = elem.getParameters();
        for (int i = 0; i < params.length; i++) {
            System.out.println(" " + params[i]);
        }
    }
stdout >
    c1 = a
     path=/
     domain=localhost
    c2 = b
     path=/
    c3 = c
     domain=localhost
HTTP entity
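HTTP messages can carry a content entity associated with the request or response. Entities are optional, so they appear in some requests and some responses only. Requests that carry an entity are referred to as entity enclosing requests. The HTTP specification defines two entity enclosing request methods: POST and PUT. Responses are usually expected to enclose a content entity. There are exceptions to this rule, such as responses to the HEAD method and 204 No Content, 304 Not Modified, and 205 Reset Content responses.
HttpClient distinguishes three kinds of entities, depending on where their content originates:
- streamed: The content is received from a stream, or generated on the fly. In particular, this category includes entities received in HTTP responses. Streamed entities are generally not repeatable.
- self-contained: The content is in memory or obtained by means that are independent of a connection or other entity. Self-contained entities are generally repeatable. This type of entity is mostly used for entity enclosing HTTP requests.
- wrapping: The content is obtained from another entity.
This distinction is important for connection management when streaming out content from an HTTP response. For request entities that are created by an application and only sent using HttpClient, the difference between streamed and self-contained is of little importance. In that case, it is suggested to treat non-repeatable entities as streamed and repeatable ones as self-contained.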
Repeatable entities
An entity is repeatable when its content can be read more than once. This is only possible with self-contained entities (such as ByteArrayEntity or StringEntity).
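The difference between repeatable and non-repeatable content can be demonstrated with plain JDK streams. This is only an analogy, not HttpClient code: the byte array stands in for a self-contained entity, the single InputStream stands in for a streamed one, and RepeatableDemo and drain are names invented for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class RepeatableDemo {

    /** Reads the stream to its end and returns the content as a UTF-8 string. */
    static String drain(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "important message".getBytes(StandardCharsets.UTF_8);

        // "Self-contained": the bytes are held in memory, so a fresh stream
        // can be opened for every read.
        System.out.println(drain(new ByteArrayInputStream(data)));
        System.out.println(drain(new ByteArrayInputStream(data)));

        // "Streamed": one stream, one pass. The second read finds nothing left.
        InputStream once = new ByteArrayInputStream(data);
        System.out.println(drain(once));
        System.out.println(drain(once).isEmpty()); // prints "true"
    }
}
```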
Using HTTP entities
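Since an entity can represent both binary and character content, it has support for character encodings.
An entity is created when a request with enclosed content is executed, or when a request succeeds and the response body is used to send the result back to the client.
To read the content of an entity, one can either retrieve the input stream via the HttpEntity#getContent() method, which returns a java.io.InputStream, or supply an output stream to the HttpEntity#writeTo(OutputStream) method, which returns once all content has been written to the given stream.
When an entity has been received with an incoming message, the methods HttpEntity#getContentType() and HttpEntity#getContentLength() can be used to read common metadata such as the Content-Type and Content-Length headers (if they are available). For text types such as text/plain or text/html, the character encoding is carried in the Content-Type header itself as a charset parameter; HttpEntity#getContentEncoding() returns the separate Content-Encoding header, if present. If the headers are not available, a length of -1 is returned, and null for the content type. If the Content-Type header is available, a Header object is returned.
When creating an entity for an outgoing message, this metadata has to be supplied by the creator of the entity.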
    StringEntity myEntity = new StringEntity("important message",
            ContentType.create("text/plain", "UTF-8"));
    System.out.println(myEntity.getContentType());
    System.out.println(myEntity.getContentLength());
    System.out.println(EntityUtils.toString(myEntity));
    System.out.println(EntityUtils.toByteArray(myEntity).length);
stdout >
    Content-Type: text/plain; charset=utf-8
    17
    important message
    17
Ensuring release of low level resources
In order to ensure proper release of system resources, one must close either the content stream associated with the entity or the response itself.
    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpGet httpget = new HttpGet("http://localhost/");
    CloseableHttpResponse response = httpclient.execute(httpget);
    try {
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            InputStream instream = entity.getContent();
            try {
                // do something useful
            } finally {
                instream.close();
            }
        }
    } finally {
        response.close();
    }
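The difference between closing the content stream and closing the response is that the former attempts to keep the underlying connection alive by consuming the entity content, while the latter immediately shuts down and discards the connection.
Please note that the HttpEntity#writeTo(OutputStream) method is also required to ensure proper release of system resources once the entity has been fully written out. If that method obtains a java.io.InputStream by calling HttpEntity#getContent(), it is also expected to close the stream in a finally clause.
When working with streaming entities, one can use the EntityUtils#consume(HttpEntity) method to ensure that the entity content has been fully consumed and the underlying stream has been closed.
There can be situations, however, when only a small portion of the response content needs to be retrieved, and the cost of consuming the remaining content to make the connection reusable is too high. In that case one can terminate the content stream by closing the response.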
    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpGet httpget = new HttpGet("http://localhost/");
    CloseableHttpResponse response = httpclient.execute(httpget);
    try {
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            InputStream instream = entity.getContent();
            int byteOne = instream.read();
            int byteTwo = instream.read();
            // Do not need the rest
        }
    } finally {
        response.close();
    }
The connection will not be reused, but all resources held by it will be correctly released.
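Because CloseableHttpResponse implements java.io.Closeable, the nested try/finally pattern shown in this section can also be written with try-with-resources on Java 7 and later. The sketch below shows the closing order using stand-in resources only; Tracked and ReleaseOrderDemo are hypothetical names invented here, and no HttpClient classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

final class ReleaseOrderDemo {

    /** Stand-in for a response or content stream; records when it is closed. */
    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("closed " + name);
        }
    }

    /** Opens two resources, "reads", and returns the order in which things happened. */
    static List<String> run() {
        List<String> log = new ArrayList<>();
        // Resources are closed in reverse order of declaration:
        // the stream first, then the response that owns it.
        try (Tracked response = new Tracked("response", log);
             Tracked stream = new Tracked("stream", log)) {
            log.add("reading");
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "[reading, closed stream, closed response]"
    }
}
```

The same ordering is what the hand-written version achieves: the inner finally closes the content stream before the outer finally closes the response.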
Consuming entity content
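The recommended way to consume the content of an entity is to use its HttpEntity#getContent() or HttpEntity#writeTo(OutputStream) methods. HttpClient also comes with the EntityUtils class, which exposes several static methods that make it easier to read the content or information from an entity. Instead of reading the java.io.InputStream directly, one can retrieve the whole content body as a String or byte array using the methods of this class. However, the use of EntityUtils is strongly discouraged unless the response entities originate from a trusted HTTP server and are known to be of limited length.
    CloseableHttpClient httpclient = HttpClients.createDefault();
    HttpGet httpget = new HttpGet("http://localhost/");
    CloseableHttpResponse response = httpclient.execute(httpget);
    try {
        HttpEntity entity = response.getEntity();
        if (entity != null) {
            long len = entity.getContentLength();
            if (len != -1 && len < 2048) {
                System.out.println(EntityUtils.toString(entity));
            } else {
                // Stream content out
            }
        }
    } finally {
        response.close();
    }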
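When the content length is unknown (getContentLength() returns -1), the example above falls into the "Stream content out" branch. One possible way to handle that branch while still bounding memory use is to enforce the limit during reading. The following JDK-only sketch assumes a caller-chosen limit; BoundedReadDemo and readAtMost are hypothetical names and are not part of HttpClient.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class BoundedReadDemo {

    /**
     * Reads the whole stream into memory, but fails as soon as more than
     * {@code limit} bytes have arrived.
     */
    static byte[] readAtMost(InputStream in, int limit) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[512];
            int n;
            while ((n = in.read(buf)) != -1) {
                if (out.size() + n > limit) {
                    throw new IllegalStateException("Body exceeds " + limit + " bytes");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // In real use the stream would come from HttpEntity#getContent().
        InputStream body = new ByteArrayInputStream(new byte[100]);
        System.out.println(readAtMost(body, 2048).length); // prints "100"
    }
}
```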
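In some situations it may be necessary to read the entity content more than once. In that case the content must be buffered in some way, either in memory or on disk. The simplest way to accomplish this is to wrap the original entity with the BufferedHttpEntity class, which causes the content of the original entity to be read into an in-memory buffer. In all other respects the wrapper behaves like the original entity.
    CloseableHttpResponse response = <...>
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        entity = new BufferedHttpEntity(entity);
    }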
Producing entity content
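HttpClient provides several classes that can be used to efficiently stream out content through HTTP connections. Instances of those classes can be associated with entity enclosing requests such as POST and PUT in order to enclose entity content in outgoing HTTP requests. HttpClient provides classes for the most common data containers such as string, byte array, input stream, and file: StringEntity, ByteArrayEntity, InputStreamEntity, and FileEntity.
    File file = new File("somefile.txt");
    FileEntity entity = new FileEntity(file,
            ContentType.create("text/plain", "UTF-8"));
    HttpPost httppost = new HttpPost("http://localhost/action.do");
    httppost.setEntity(entity);
Please note that InputStreamEntity is not repeatable, because it can read from the underlying data stream only once. It is generally recommended to implement a custom, self-contained HttpEntity class instead of using the generic InputStreamEntity. FileEntity can be a good starting point.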
Response handlers
The simplest and most convenient