HttpClient Tutorial

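HttpClient: the open-source HTTP client project from Apache

This article is a partial translation of the Tutorial shipped with HttpClient 4.3.6. It covers only what is needed to fetch web pages and measure their size, including second- and third-level pages.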

Preface:

The Hyper-Text Transfer Protocol (HTTP) is perhaps the most significant protocol used on the Internet today. Web services, network-enabled appliances and the growth of network computing continue to expand the role of the HTTP protocol beyond user-driven web browsers, while increasing the number of applications that require HTTP support.

Although the java.net package provides basic functionality for accessing resources via HTTP, it doesn't provide the full flexibility or functionality needed by many applications. HttpClient seeks to fill this void by providing an efficient, up-to-date, and feature-rich package implementing the client side of the most recent HTTP standards and recommendations.

Designed for extension while providing robust support for the base HTTP protocol, HttpClient may be of interest to anyone building HTTP-aware client applications such as web browsers, web service clients, or systems that leverage or extend the HTTP protocol for distributed communication.

HttpClient scope:

  • Client-side HTTP transport library based on HttpCore [http://hc.apache.org/httpcomponents-core/index.html]
  • Based on classic (blocking) I/O
  • Content agnostic

What HttpClient is NOT:

HttpClient is NOT a browser. It is a client-side HTTP transport library. HttpClient's purpose is to transmit and receive HTTP messages. HttpClient will not attempt to process content, execute JavaScript embedded in HTML pages, guess the content type when it is not explicitly set, reformat request / redirect location URIs, or perform any other function unrelated to HTTP transport.

Chapter 1. Fundamentals

1.1. Request execution

The most essential function of HttpClient is to execute HTTP methods. Execution of an HTTP method involves one or several HTTP request / HTTP response exchanges, usually handled internally by HttpClient. The user is expected to provide a request object to execute, and HttpClient is expected to transmit the request to the target server and return a corresponding response object, or throw an exception if execution was unsuccessful.

Quite naturally, the main entry point of the HttpClient API is the HttpClient interface that defines the contract described above.

Here is an example of the request execution process in its simplest form:

CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    <...>
} finally {
    response.close();
}

1.1.1. HTTP request

All HTTP requests have a request line consisting of a method name, a request URI and an HTTP protocol version.

HttpClient supports out of the box all HTTP methods defined in the HTTP/1.1 specification: GET, HEAD, POST, PUT, DELETE, TRACE and OPTIONS. There is a specific class for each method type: HttpGet, HttpHead, HttpPost, HttpPut, HttpDelete, HttpTrace, and HttpOptions.

The Request-URI is a Uniform Resource Identifier that identifies the resource upon which to apply the request. HTTP request URIs consist of a protocol scheme, host name, optional port, resource path, optional query, and optional fragment.

HttpGet httpget = new HttpGet(
    "http://www.google.com/search?hl=en&q=httpclient&btnG=Google+Search&aq=f&oq=");

HttpClient provides the URIBuilder utility class to simplify the creation and modification of request URIs.

URI uri = new URIBuilder()
    .setScheme("http")
    .setHost("www.google.com")
    .setPath("/search")
    .setParameter("q", "httpclient")
    .setParameter("btnG", "Google Search")
    .setParameter("aq", "f")
    .setParameter("oq", "")
    .build();
HttpGet httpget = new HttpGet(uri);
System.out.println(httpget.getURI());

stdout >

http://www.google.com/search?q=httpclient&btnG=Google+Search&aq=f&oq=
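The URI components listed above (scheme, host, optional port, path, optional query, optional fragment) can also be inspected with the JDK's own java.net.URI class. The following is a minimal sketch using only the JDK, not HttpClient; the class name UriPartsDemo and the helper describe are hypothetical names chosen for this illustration.

```java
import java.net.URI;

// Plain-JDK illustration: break a request URI into the parts named in the text.
// UriPartsDemo and describe() are hypothetical names, not part of HttpClient.
final class UriPartsDemo {

    // Returns one "name = value" line per URI component.
    static String describe(URI uri) {
        return String.join("\n",
            "scheme = " + uri.getScheme(),
            "host = " + uri.getHost(),
            "port = " + uri.getPort(),          // -1 when the URI names no port
            "path = " + uri.getPath(),
            "query = " + uri.getRawQuery(),     // raw form keeps '+' and '%xx' escapes as written
            "fragment = " + uri.getFragment()); // null when the URI has no fragment
    }

    public static void main(String[] args) {
        URI uri = URI.create(
            "http://www.google.com/search?hl=en&q=httpclient&btnG=Google+Search&aq=f&oq=");
        System.out.println(describe(uri));
    }
}
```

Because the port and fragment are optional, a caller should expect -1 and null for them rather than an empty value.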

1.1.2. HTTP response

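An HTTP response is a message sent by the server back to the client after having received and interpreted a request message. The first line of that message consists of the protocol version followed by a numeric status code and its associated textual phrase. For example:

HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
    HttpStatus.SC_OK, "OK");
System.out.println(response.getProtocolVersion());
System.out.println(response.getStatusLine().getStatusCode());
System.out.println(response.getStatusLine().getReasonPhrase());
System.out.println(response.getStatusLine().toString());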

stdout >

HTTP/1.1
200
OK
HTTP/1.1 200 OK
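To make the structure of that first line concrete, here is a small sketch that splits a status line into protocol version, status code and reason phrase using only the JDK. It is an illustration of the format, not HttpClient's own parser, and it assumes a well-formed line with single spaces; StatusLineDemo is a hypothetical name.

```java
// Plain-JDK illustration of the status line layout: "<version> <code> <reason phrase>".
// StatusLineDemo is a hypothetical class for this sketch, not an HttpClient type.
final class StatusLineDemo {
    final String protocol;
    final int statusCode;
    final String reasonPhrase;

    StatusLineDemo(String protocol, int statusCode, String reasonPhrase) {
        this.protocol = protocol;
        this.statusCode = statusCode;
        this.reasonPhrase = reasonPhrase;
    }

    static StatusLineDemo parse(String line) {
        // Limit the split to 3 pieces so a reason phrase containing spaces stays intact.
        String[] parts = line.trim().split(" ", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Malformed status line: " + line);
        }
        int code = Integer.parseInt(parts[1]);
        String reason = parts.length == 3 ? parts[2] : ""; // the reason phrase may be absent
        return new StatusLineDemo(parts[0], code, reason);
    }

    public static void main(String[] args) {
        StatusLineDemo s = parse("HTTP/1.1 404 Not Found");
        System.out.println(s.protocol);     // HTTP/1.1
        System.out.println(s.statusCode);   // 404
        System.out.println(s.reasonPhrase); // Not Found
    }
}
```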

1.1.3. Working with message headers

An HTTP message can contain a number of headers describing properties of the message such as the content length, content type and so on. HttpClient provides methods to retrieve, add, remove and enumerate headers.

HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
    HttpStatus.SC_OK, "OK");
response.addHeader("Set-Cookie",
    "c1=a; path=/; domain=localhost");
response.addHeader("Set-Cookie",
    "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
Header h1 = response.getFirstHeader("Set-Cookie");
System.out.println(h1);
Header h2 = response.getLastHeader("Set-Cookie");
System.out.println(h2);
Header[] hs = response.getHeaders("Set-Cookie");
System.out.println(hs.length);

stdout >

Set-Cookie: c1=a; path=/; domain=localhost
Set-Cookie: c2=b; path="/", c3=c; domain="localhost"
2

The most efficient way to obtain all headers of a given type is by using the HeaderIterator interface.

HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
    HttpStatus.SC_OK, "OK");
response.addHeader("Set-Cookie",
    "c1=a; path=/; domain=localhost");
response.addHeader("Set-Cookie",
    "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
HeaderIterator it = response.headerIterator("Set-Cookie");
while (it.hasNext()) {
    System.out.println(it.next());
}

stdout >

Set-Cookie: c1=a; path=/; domain=localhost
Set-Cookie: c2=b; path="/", c3=c; domain="localhost"

It also provides convenience methods to parse HTTP messages into individual header elements.

HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1,
    HttpStatus.SC_OK, "OK");
response.addHeader("Set-Cookie",
    "c1=a; path=/; domain=localhost");
response.addHeader("Set-Cookie",
    "c2=b; path=\"/\", c3=c; domain=\"localhost\"");
HeaderElementIterator it = new BasicHeaderElementIterator(
    response.headerIterator("Set-Cookie"));
while (it.hasNext()) {
    HeaderElement elem = it.nextElement();
    System.out.println(elem.getName() + " = " + elem.getValue());
    NameValuePair[] params = elem.getParameters();
    for (int i = 0; i < params.length; i++) {
        System.out.println(" " + params[i]);
    }
}

stdout >

c1 = a
 path=/
 domain=localhost
c2 = b
 path=/
c3 = c
 domain=localhost
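The output above follows from how a header value is structured: commas separate elements, and within an element semicolons separate the leading name=value pair from its parameters. The sketch below reproduces that decomposition with plain JDK string handling. It is a deliberately simplified illustration, not HttpClient's parser: it assumes no commas or semicolons appear inside quoted values, and HeaderElementsDemo is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, plain-JDK illustration of header elements and parameters.
// Assumption: quoted values never contain ',' or ';' (a real parser must handle that).
final class HeaderElementsDemo {

    // Returns the lines the tutorial example would print for one header value.
    static List<String> describe(String headerValue) {
        List<String> lines = new ArrayList<>();
        for (String element : headerValue.split(",")) {
            String[] pieces = element.split(";");
            lines.add(pair(pieces[0], " = "));        // element name and value
            for (int i = 1; i < pieces.length; i++) {
                lines.add(" " + pair(pieces[i], "=")); // element parameters
            }
        }
        return lines;
    }

    // Formats "name=value", trimming whitespace and stripping surrounding double quotes.
    private static String pair(String token, String separator) {
        String t = token.trim();
        int eq = t.indexOf('=');
        if (eq < 0) {
            return t;
        }
        String name = t.substring(0, eq).trim();
        String value = t.substring(eq + 1).trim();
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return name + separator + value;
    }

    public static void main(String[] args) {
        for (String line : describe("c2=b; path=\"/\", c3=c; domain=\"localhost\"")) {
            System.out.println(line);
        }
    }
}
```

For production code the library's own HeaderElementIterator remains the safer choice, since it handles quoting rules this sketch ignores.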

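1.1.4. HTTP entity

HTTP messages can carry a content entity associated with the request or response. Entities are optional: they appear in some requests and in some responses. Requests that carry an entity are referred to as entity enclosing requests. The HTTP specification defines two entity enclosing request methods: POST and PUT. Responses are usually expected to enclose a content entity. There are exceptions to this rule, such as responses to the HEAD method and 204 No Content, 304 Not Modified, and 205 Reset Content responses.

HttpClient distinguishes three kinds of entities, depending on where their content originates:

  • streamed:    The content is received from a stream, or generated on the fly. In particular, this category includes entities being received from HTTP responses. Streamed entities are generally not repeatable.
  • self-contained:    The content is in memory or obtained by means that are independent from a connection or other entity. Self-contained entities are generally repeatable. This type of entity is mostly used for entity enclosing HTTP requests.
  • wrapping:    The content is obtained from another entity.

This distinction is important for connection management when streaming content out of an HTTP response. For request entities that are created by an application and only sent using HttpClient, the difference between streamed and self-contained is of little importance. In that case, it is suggested to consider non-repeatable entities as streamed, and those that are repeatable as self-contained.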

1.1.4.1. Repeatable entities

An entity can be repeatable, meaning its content can be read more than once. This is only possible with self-contained entities (such as ByteArrayEntity or StringEntity).
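The difference between repeatable and non-repeatable content can be seen without HttpClient at all. The following sketch uses only JDK streams as stand-ins: a byte array plays the role of a self-contained entity, and a single InputStream plays the role of a streamed one. RepeatableDemo and drain are hypothetical names for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Plain-JDK stand-in for the streamed / self-contained distinction.
final class RepeatableDemo {

    // Reads a stream to its end and returns the number of bytes seen.
    static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "important message".getBytes(StandardCharsets.UTF_8);

        // Self-contained: the bytes live in memory, so a fresh stream can be opened for each read.
        System.out.println(drain(new ByteArrayInputStream(data))); // 17
        System.out.println(drain(new ByteArrayInputStream(data))); // 17

        // Streamed: one stream, consumed once; a second pass finds nothing left.
        InputStream once = new ByteArrayInputStream(data);
        System.out.println(drain(once)); // 17
        System.out.println(drain(once)); // 0
    }
}
```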

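1.1.4.2. Using HTTP entities

Since an entity can represent both binary and character content, it has support for character encodings.

The entity is created when executing a request with enclosed content, or when the request was successful and the response body is used to send the result back to the client.

To read the content from the entity, one can either retrieve the input stream via the HttpEntity#getContent() method, which returns a java.io.InputStream, or supply an output stream to the HttpEntity#writeTo(OutputStream) method, which returns once all content has been written to the given stream.

When the entity has been received with an incoming message, the methods HttpEntity#getContentType() and HttpEntity#getContentLength() can be used for reading common metadata such as the Content-Type and Content-Length headers (if they are available). For text types such as text/plain or text/html, the character encoding is carried in the Content-Type header value; HttpEntity#getContentEncoding() returns the Content-Encoding header (for example gzip), if present. If the headers are not available, a length of -1 is returned, and NULL for the content type. If the Content-Type header is available, a Header object is returned.

When creating an entity for an outgoing message, this metadata has to be supplied by the creator of the entity.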

StringEntity myEntity = new StringEntity("important message",
    ContentType.create("text/plain", "UTF-8"));
System.out.println(myEntity.getContentType());
System.out.println(myEntity.getContentLength());
System.out.println(EntityUtils.toString(myEntity));
System.out.println(EntityUtils.toByteArray(myEntity).length);

stdout >

Content-Type: text/plain; charset=utf-8
17
important message
17

1.1.5. Ensuring release of low level resources

In order to ensure proper release of system resources one must close either the content stream associated with the entity or the response itself.

CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        InputStream instream = entity.getContent();
        try {
            // do something useful
        } finally {
            instream.close();
        }
    }
} finally {
    response.close();
}
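On Java 7 and later, the nested try/finally blocks above can be expressed with try-with-resources, since both java.io.InputStream and CloseableHttpResponse implement java.io.Closeable. The sketch below does not use HttpClient; it uses a hypothetical Tracked resource as a stand-in to show the guarantee the pattern relies on: resources are closed in reverse order of declaration (stream first, then response), even when reading fails.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-JDK illustration of close ordering with try-with-resources.
// Tracked is a hypothetical stand-in for a response or its content stream.
final class CloseOrderDemo {

    static final class Tracked implements AutoCloseable {
        private final String name;
        private final List<String> log;

        Tracked(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        @Override
        public void close() {
            log.add("closed " + name);
        }
    }

    // Returns the sequence of events, optionally simulating a failure while reading.
    static List<String> run(boolean failWhileReading) {
        List<String> log = new ArrayList<>();
        try (Tracked response = new Tracked("response", log);
             Tracked stream = new Tracked("stream", log)) {
            log.add("reading");
            if (failWhileReading) {
                throw new IllegalStateException("read failed");
            }
        } catch (IllegalStateException e) {
            // Both resources are already closed by the time this handler runs.
            log.add("caught " + e.getMessage());
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [reading, closed stream, closed response]
        System.out.println(run(true));  // [reading, closed stream, closed response, caught read failed]
    }
}
```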

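The difference between closing the content stream and closing the response is that the former will attempt to keep the underlying connection alive by consuming the entity content, while the latter immediately shuts down and discards the connection.

Please note that the HttpEntity#writeTo(OutputStream) method is also required to ensure proper release of system resources once the entity has been fully written out. If this method obtains an instance of java.io.InputStream by calling HttpEntity#getContent(), it is also expected to close the stream in a finally clause.

When working with streaming entities, one can use the EntityUtils#consume(HttpEntity) method to ensure that the entity content has been fully consumed and the underlying stream has been closed.

There can be situations, however, when only a small portion of the entire response content needs to be retrieved, and the performance penalty for consuming the remaining content and making the connection reusable is too high. In that case one can terminate the content stream by closing the response.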

CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        InputStream instream = entity.getContent();
        int byteOne = instream.read();
        int byteTwo = instream.read();
        // Do not need the rest
    }
} finally {
    response.close();
}

The connection will not be reused, but all low-level resources held by it will be correctly deallocated.

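1.1.6. Consuming entity content

The recommended way to consume the content of an entity is by using its HttpEntity#getContent() or HttpEntity#writeTo(OutputStream) methods. HttpClient also comes with the EntityUtils class, which exposes several static methods to more easily read the content or information from an entity. Instead of reading the java.io.InputStream directly, one can retrieve the whole content body as a string or byte array by using the methods of this class. However, the use of EntityUtils is strongly discouraged unless the response entities originate from a trusted HTTP server and are known to be of limited length.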

CloseableHttpClient httpclient = HttpClients.createDefault();
HttpGet httpget = new HttpGet("http://localhost/");
CloseableHttpResponse response = httpclient.execute(httpget);
try {
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        long len = entity.getContentLength();
        if (len != -1 && len < 2048) {
            System.out.println(EntityUtils.toString(entity));
        } else {
            // Stream content out
        }
    }
} finally {
    response.close();
}
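The example above relies on the declared content length, which is -1 when the server does not send one. For that "stream content out" branch, one possible approach is to cap the number of bytes while reading, so an oversized or unbounded body is rejected instead of being buffered. The sketch below shows the idea against any java.io.InputStream using only the JDK; BoundedReadDemo and readAtMost are hypothetical names, and the 2048-byte limit simply mirrors the example above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Plain-JDK sketch: read a body of unknown length without letting it grow past a limit.
final class BoundedReadDemo {

    // Reads at most maxBytes from the stream; fails instead of buffering an oversized body.
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            if (out.size() + n > maxBytes) {
                throw new IOException("Body exceeds " + maxBytes + " bytes");
            }
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // In real use the stream would come from HttpEntity#getContent().
        InputStream body = new ByteArrayInputStream(
            "important message".getBytes(StandardCharsets.UTF_8));
        byte[] content = readAtMost(body, 2048);
        System.out.println(new String(content, StandardCharsets.UTF_8));
    }
}
```

When the limit is exceeded, closing the response (rather than draining the stream) discards the connection, as described in the previous section.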

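In some situations it may be necessary to read entity content more than once. In this case the entity content must be buffered in some way, either in memory or on disk. The simplest way to accomplish that is by wrapping the original entity with the BufferedHttpEntity class. This causes the content of the original entity to be read into an in-memory buffer. In all other ways the entity wrapper behaves like the original one.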

CloseableHttpResponse response = <...>
HttpEntity entity = response.getEntity();
if (entity != null) {
    entity = new BufferedHttpEntity(entity);
}

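1.1.7. Producing entity content

HttpClient provides several classes that can be used to efficiently stream out content through HTTP connections. Instances of those classes can be associated with entity enclosing requests such as POST and PUT in order to enclose entity content into outgoing HTTP requests. HttpClient provides classes for the most common data containers such as string, byte array, input stream, and file: StringEntity, ByteArrayEntity, InputStreamEntity, and FileEntity.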

File file = new File("somefile.txt");
FileEntity entity = new FileEntity(file,
    ContentType.create("text/plain", "UTF-8"));
HttpPost httppost = new HttpPost("http://localhost/action.do");
httppost.setEntity(entity);

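Please note that InputStreamEntity is not repeatable, because it can only read from the underlying data stream once. Generally it is recommended to implement a custom HttpEntity class that is self-contained instead of using the generic InputStreamEntity. FileEntity can be a good starting point.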

1.1.8. Response handlers

The simplest and most convenient

Time: 2024-10-17 10:09:10
