Crawler Study Diary 1

Understanding URLs

1. URI

What is a URI? Every resource available on the web, such as an HTML document, image, video, or program, is identified by a Uniform Resource Identifier (URI).

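A URI usually consists of three parts:

the naming scheme used to access the resource;
the name of the host that stores the resource;
the name of the resource itself, expressed as a path.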

Take the following URI as an example:

http://www.webmonkey.com.cn/html/html40/

We can read it as follows: this is a resource accessed over the HTTP protocol, located on the host www.webmonkey.com.cn, and reached through the path "/html/html40/".
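The same three parts can be read programmatically. The following is a minimal sketch using java.net.URI from the standard library; the class name UriParts and the helper describe are illustrative names chosen for this example.

```java
import java.net.URI;

class UriParts {
    // Returns "scheme | host | path" for the given URI string.
    static String describe(String text) {
        URI uri = URI.create(text);
        return uri.getScheme() + " | " + uri.getHost() + " | " + uri.getPath();
    }

    public static void main(String[] args) {
        // Prints: http | www.webmonkey.com.cn | /html/html40/
        System.out.println(describe("http://www.webmonkey.com.cn/html/html40/"));
    }
}
```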

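2. URL

A URL is a subset of URI. The name is short for Uniform Resource Locator. A URL is a string that describes an information resource on the Internet, and it is used mainly by WWW client and server programs.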

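A URL consists of three parts:

the protocol (also called the service type);
the address of the host that holds the resource, given as a domain name or IP address (sometimes including a port number);
the specific location of the resource on the host, such as a directory and file name.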

Example of an HTTP URL

Example: http://www.baidu.com/talk/talk.htm

The host's domain name is www.baidu.com, and the hypertext file (file type ".htm") is "talk.htm" in the directory "/talk".

File URL

Example: file://ftp.youku.com/pub/files/foobar.txt

This URL refers to a file stored on the host ftp.youku.com in the directory "pub/files/"; the file name is "foobar.txt".
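To check this kind of breakdown in code, here is a small sketch built on java.net.URL. The class name UrlParts and the method split are hypothetical names for this example. Because the URL examples above do not name a port, the sketch falls back to the protocol's default port.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlParts {
    // Splits a URL string into protocol, host, port, and path.
    static String[] split(String text) {
        try {
            URL url = URI.create(text).toURL();
            // getPort() returns -1 when the URL does not name a port explicitly
            int port = url.getPort() != -1 ? url.getPort() : url.getDefaultPort();
            return new String[] {
                url.getProtocol(), url.getHost(), String.valueOf(port), url.getPath()
            };
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable URL: " + text, e);
        }
    }

    public static void main(String[] args) {
        String[] examples = {
            "http://www.baidu.com/talk/talk.htm",
            "file://ftp.youku.com/pub/files/foobar.txt"
        };
        for (String text : examples) {
            System.out.println(String.join(" | ", split(text)));
        }
    }
}
```

For the HTTP example the parts are http, www.baidu.com, 80, and /talk/talk.htm.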

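Fetching web page content by URL

The sections above covered the structure of a URL; this section explains how to fetch a web page from a URL. Fetching a page means reading the network resource specified by the URL from the network stream and saving it locally. It is similar to using a program to imitate a browser: the URL is sent to the server as part of an HTTP request, and the program then reads the resource in the server's response.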

GET method:

Create a URL object from the URL address

java.net.URL url=new URL(path);
Get the network stream from the URL object
InputStream stream=url.openStream();
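Put together, the two steps might look like the sketch below, which uses only the standard library. The class name SimpleFetch and the method fetch are illustrative. URI.create(path).toURL() is used in place of new URL(path) because that constructor is deprecated in recent Java releases, and readAllBytes() assumes Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

class SimpleFetch {
    // Opens the URL and returns the whole response body as bytes.
    static byte[] fetch(String path) throws IOException {
        // try-with-resources closes the stream even if reading fails
        try (InputStream stream = URI.create(path).toURL().openStream()) {
            return stream.readAllBytes(); // available since Java 9
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the address comes from the first argument, with a fallback
        String path = args.length > 0 ? args[0] : "https://www.baidu.com";
        System.out.println("Read " + fetch(path).length + " bytes from " + path);
    }
}
```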

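In real projects the network environment is more complicated. Imitating a browser client with only the API in the java.net package takes a lot of code: you have to handle the HTTP status codes returned, configure an HTTP proxy, handle HTTPS, and so on. To simplify development, applications often use HttpClient, the open-source HTTP client project from Apache. For example: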

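Create a client, which is like opening a browser

HttpClient httpClient=new org.apache.commons.httpclient.HttpClient();
Create a GET method, which is like typing an address into the browser's address bar
GetMethod getMethod=new org.apache.commons.httpclient.methods.GetMethod(path);//path is the URL string
Execute it and get the response status code
int statusCode = httpClient.executeMethod(getMethod);
Handle only requests whose status code is 200 (request succeeded)
statusCode == HttpStatus.SC_OK
Get the content stream returned by the request
InputStream input = getMethod.getResponseBodyAsStream();
Get a file output stream
String filename = "output/path/" + fileName;//fileName stands for the name of the output file

OutputStream output = new FileOutputStream(filename);
Write to the file; read() returns -1 at the end of the stream
int tempByte = -1;

while ((tempByte = input.read()) != -1) {

output.write(tempByte);

}
Close the input and output streams
input.close();

output.close();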
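The copy loop deserves care: read() signals the end of the stream with -1, and a byte whose value is 0 is ordinary data, so a loop condition of "> 0" would stop at the first zero byte. The sketch below shows a buffered variant of the same loop that also closes both streams automatically. StreamCopy and copy are hypothetical names, and in-memory streams stand in for the network and file streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

class StreamCopy {
    // Copies everything from input to output and returns the number of bytes copied.
    static long copy(InputStream input, OutputStream output) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read(byte[]) returns -1 only when the end of the stream is reached
        while ((n = input.read(buffer)) != -1) {
            output.write(buffer, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {72, 0, 105}; // contains a zero byte in the middle
        // try-with-resources closes both streams, even if copy() throws
        try (InputStream input = new ByteArrayInputStream(data);
             ByteArrayOutputStream output = new ByteArrayOutputStream()) {
            System.out.println(copy(input, output) + " bytes copied"); // 3 bytes copied
        }
    }
}
```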

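The complete program below can be run as-is (it needs the Apache Commons HttpClient 3.x library on the classpath):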

package spider;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

/**
 * @author CallMeWhy
 */
public class Spider {

    private static HttpClient httpClient = new HttpClient();

    /**
     * @param path
     *            link of the target page
     * @return a boolean indicating whether the target page was downloaded successfully
     * @throws Exception
     *             IO exception while reading the page stream or writing the local file stream
     */
    public static boolean downloadPage(String path) throws Exception {
        // Declare the input and output streams
        InputStream input = null;
        OutputStream output = null;
        // Create the GET method
        GetMethod getMethod = new GetMethod(path);
        // Execute it and get the status code
        int statusCode = httpClient.executeMethod(getMethod);
        // Handle the status code
        // For simplicity, handle only status code 200
        if (statusCode == HttpStatus.SC_OK) {
            input = getMethod.getResponseBodyAsStream();
            // Derive the file name from the URL
            String filename = path.substring(path.lastIndexOf('/') + 1)
                    + ".html";
            // Get the file output stream
            output = new FileOutputStream(filename);
            // Write to the file; read() returns -1 at the end of the stream
            int tempByte = -1;
            while ((tempByte = input.read()) != -1) {
                output.write(tempByte);
            }
            // Close the input stream
            if (input != null) {
                input.close();
            }
            // Close the output stream
            if (output != null) {
                output.close();
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        try {
            // Fetch the Baidu home page and write it to a file
            Spider.downloadPage("https://www.baidu.com");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
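The program above takes the file name from everything after the last "/" in the URL. That works for "https://www.baidu.com", but a URL ending in a slash, such as "http://www.webmonkey.com.cn/html/html40/", would produce just ".html". One possible way to make the naming more forgiving is sketched below. FileNames and fromUrl are hypothetical names, and the fallback name "index" is an assumption of this sketch.

```java
import java.net.URI;

class FileNames {
    // Builds a local file name from a URL: the last path segment,
    // or the host name when the path has no usable segment.
    static String fromUrl(String path) {
        URI uri = URI.create(path);
        String p = uri.getPath() == null ? "" : uri.getPath();
        // Drop a trailing slash so "/html/html40/" yields "html40"
        if (p.endsWith("/")) {
            p = p.substring(0, p.length() - 1);
        }
        String name = p.substring(p.lastIndexOf('/') + 1);
        if (name.isEmpty()) {
            // Assumption: fall back to the host, or to "index" if there is none
            name = uri.getHost() != null ? uri.getHost() : "index";
        }
        // Keep an existing .htm/.html extension instead of appending a second one
        boolean hasExtension = name.endsWith(".html") || name.endsWith(".htm");
        return hasExtension ? name : name + ".html";
    }

    public static void main(String[] args) {
        System.out.println(fromUrl("https://www.baidu.com"));                    // www.baidu.com.html
        System.out.println(fromUrl("http://www.webmonkey.com.cn/html/html40/")); // html40.html
        System.out.println(fromUrl("http://www.baidu.com/talk/talk.htm"));       // talk.htm
    }
}
```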

POST method:

package spider;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.methods.PostMethod;

public class PostSpider {

    private static HttpClient httpClient = new HttpClient();

    // Configure the proxy server
    static {
        // IP address and port of the proxy server
        httpClient.getHostConfiguration().setProxy("127.0.0.1", 8080);
    }

    public static boolean downloadPage(String path) throws HttpException, IOException {
        boolean flag = false;
        InputStream input = null;
        OutputStream output = null;
        PostMethod postMethod = new PostMethod(path);
        // Set the parameters of the POST method
        NameValuePair[] postData = new NameValuePair[2];
        postData[0] = new NameValuePair("name", "xxxxxx");
        postData[1] = new NameValuePair("password", "xxxxxx");
        postMethod.addParameters(postData);
        // Execute it and get the status code
        int statusCode = httpClient.executeMethod(postMethod);
        // Handle the status code (other codes could be handled too; only 200 is handled here)
        if (statusCode == HttpStatus.SC_OK) {
            input = postMethod.getResponseBodyAsStream();
            // File name
            String filename = path.substring(path.lastIndexOf('/') + 1)
                    + ".html";
            // Get the file output stream
            output = new FileOutputStream(filename);
            // Write to the file; read() returns -1 at the end of the stream
            int tempByte = -1;
            while ((tempByte = input.read()) != -1) {
                output.write(tempByte);
            }
            // Close the input and output streams
            if (input != null) {
                input.close();
            }
            if (output != null) {
                output.close();
            }
            flag = true;
        }
        return flag;
    }

    public static void main(String[] args) {
        try {
            PostSpider.downloadPage("https://www.baidu.com");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

In the code above, the parts you need to change are the proxy server settings and the parameters.
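It can help to see what the name/value parameters become on the wire. A POST with form parameters is normally sent as an application/x-www-form-urlencoded body, where each name and value is percent-encoded and pairs are joined with "&". The sketch below builds such a body with the standard library only; FormBody and encode are hypothetical names, and the sample values are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormBody {
    // Joins the parameters as name=value pairs separated by '&', URL-encoding each part.
    static String encode(Map<String, String> params) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(entry.getKey())).append('=').append(enc(entry.getValue()));
        }
        return body.toString();
    }

    private static String enc(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("name", "xxxxxx");
        params.put("password", "a b&c"); // illustrative value with characters that need encoding
        System.out.println(encode(params)); // name=xxxxxx&password=a+b%26c
    }
}
```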

Time: 2024-11-03 21:52:29
