Configuring Nutch to Simulate a Browser and Bypass Anti-Crawler Restrictions

Original link: http://yangshangchuan.iteye.com/blog/2030741

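When we configure Nutch to crawl http://yangshangchuan.iteye.com, every fetched page contains only the message "Your access request has been denied ......". This is the simplest anti-crawler strategy: it just reads the value of the User-Agent HTTP request header to decide whether the client is a person (a browser) or a crawler. To get past this restriction, we only need to configure Nutch to simulate a web browser.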

nutch-default.xml contains five properties related to the User-Agent:

XML code

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>
<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>
<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.
  NOTE: You should also check other related properties:
	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version
  and set their values appropriately.
  </description>
</property>
<property>
  <name>http.agent.version</name>
  <value>Nutch-1.7</value>
  <description>A version string to advertise in the User-Agent
   header.</description>
</property>

The class nutch1.7/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java shows how these five properties are combined into the User-Agent:

Java code

this.userAgent = getAgentString( conf.get("http.agent.name"),
        conf.get("http.agent.version"),
        conf.get("http.agent.description"),
        conf.get("http.agent.url"),
        conf.get("http.agent.email") );

Java code

  private static String getAgentString(String agentName,
                                       String agentVersion,
                                       String agentDesc,
                                       String agentURL,
                                       String agentEmail) {

    if ( (agentName == null) || (agentName.trim().length() == 0) ) {
      // TODO : NUTCH-258
      if (LOGGER.isErrorEnabled()) {
        LOGGER.error("No User-Agent string set (http.agent.name)!");
      }
    }

    StringBuffer buf= new StringBuffer();

    buf.append(agentName);
    if (agentVersion != null) {
      buf.append("/");
      buf.append(agentVersion);
    }
    if ( ((agentDesc != null) && (agentDesc.length() != 0))
    || ((agentEmail != null) && (agentEmail.length() != 0))
    || ((agentURL != null) && (agentURL.length() != 0)) ) {
      buf.append(" (");

      if ((agentDesc != null) && (agentDesc.length() != 0)) {
        buf.append(agentDesc);
        if ( (agentURL != null) || (agentEmail != null) )
          buf.append("; ");
      }

      if ((agentURL != null) && (agentURL.length() != 0)) {
        buf.append(agentURL);
        if (agentEmail != null)
          buf.append("; ");
      }

      if ((agentEmail != null) && (agentEmail.length() != 0))
        buf.append(agentEmail);

      buf.append(")");
    }
    return buf.toString();
  }
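
To see what this method produces, its logic can be copied into a small standalone class and run outside Nutch. The following is a minimal sketch under that assumption: the class name `AgentStringDemo` and the sample values (`MyCrawler`, the example.com URL, the mangled email) are made up for illustration, `StringBuilder` stands in for `StringBuffer`, and the logging branch is left out.

```java
// Standalone copy of the string-building logic quoted above, for experimenting
// outside Nutch. AgentStringDemo and all sample values are hypothetical.
class AgentStringDemo {

    static String getAgentString(String agentName, String agentVersion,
                                 String agentDesc, String agentURL,
                                 String agentEmail) {
        StringBuilder buf = new StringBuilder();
        buf.append(agentName);
        if (agentVersion != null) {
            // Name and version are always joined by a single "/".
            buf.append("/").append(agentVersion);
        }
        boolean hasDesc = agentDesc != null && !agentDesc.isEmpty();
        boolean hasUrl = agentURL != null && !agentURL.isEmpty();
        boolean hasEmail = agentEmail != null && !agentEmail.isEmpty();
        if (hasDesc || hasUrl || hasEmail) {
            buf.append(" (");
            if (hasDesc) {
                buf.append(agentDesc);
                // Same null-only checks as in the quoted source.
                if (agentURL != null || agentEmail != null) buf.append("; ");
            }
            if (hasUrl) {
                buf.append(agentURL);
                if (agentEmail != null) buf.append("; ");
            }
            if (hasEmail) buf.append(agentEmail);
            buf.append(")");
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        // Only name and version set: no parenthesised suffix is added.
        System.out.println(getAgentString("MyCrawler", "Nutch-1.7", "", "", ""));
        // All five values set.
        System.out.println(getAgentString("MyCrawler", "Nutch-1.7", "test crawler",
                "http://example.com/bot", "bot at example dot com"));
    }
}
```

With only the name and version set, the result is simply the name, a "/", and the version. The description, URL and email only ever add a parenthesised suffix after that.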

The class nutch1.7/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java sends the User-Agent request header. The userAgent returned by http.getUserAgent() here is the userAgent field set in HttpBase.java:

Java code

String userAgent = http.getUserAgent();
if ((userAgent == null) || (userAgent.length() == 0)) {
	if (Http.LOG.isErrorEnabled()) { Http.LOG.error("User-agent is not set!"); }
} else {
	reqStr.append("User-Agent: ");
	reqStr.append(userAgent);
	reqStr.append("\r\n");
}

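The analysis above shows that adding any one of the following configurations to nutch-site.xml is enough to imitate a specific browser. Because getAgentString joins http.agent.name and http.agent.version with a "/", each browser User-Agent string below is split at a "/" across those two properties, while the description, URL and email properties keep their empty defaults so that no parenthesised suffix is appended: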

1. Imitating Firefox:

XML code

<property>
	<name>http.agent.name</name>
	<value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
	<name>http.agent.version</name>
	<value>20100101 Firefox/27.0</value>
</property>

2. Imitating IE:

XML code

<property>
	<name>http.agent.name</name>
	<value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
	<name>http.agent.version</name>
	<value>6.0)</value>
</property>

3. Imitating Chrome:

XML code

<property>
	<name>http.agent.name</name>
	<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
	<name>http.agent.version</name>
	<value>537.36</value>
</property>

4. Imitating Safari:

XML code

<property>
	<name>http.agent.name</name>
	<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari</value>
</property>
<property>
	<name>http.agent.version</name>
	<value>534.57.2</value>
</property>

5. Imitating Opera:

XML code

<property>
	<name>http.agent.name</name>
	<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 OPR</value>
</property>
<property>
	<name>http.agent.version</name>
	<value>19.0.1326.59</value>
</property>
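
The five configurations above all follow the same pattern: a full browser User-Agent string is cut at a "/" so that the "/" Nutch inserts between name and version restores it. As a rough sketch of how such a pair could be derived for another browser, the helper below splits at the last "/" and prints the two properties. The class name `AgentPropertySplitter` and its methods are hypothetical and not part of Nutch, and the sketch assumes http.agent.description, http.agent.url and http.agent.email stay empty.

```java
// Hypothetical helper (not part of Nutch): derives http.agent.name and
// http.agent.version from a full browser User-Agent string.
class AgentPropertySplitter {

    /** Splits at the last "/" so that name + "/" + version equals the input. */
    static String[] split(String userAgent) {
        int slash = userAgent.lastIndexOf('/');
        if (slash <= 0 || slash == userAgent.length() - 1) {
            throw new IllegalArgumentException("No usable '/' in: " + userAgent);
        }
        return new String[] {
            userAgent.substring(0, slash),
            userAgent.substring(slash + 1)
        };
    }

    /** Minimal escaping of XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Renders the two property elements in the layout used by nutch-site.xml. */
    static String toProperties(String userAgent) {
        String[] parts = split(userAgent);
        return "<property>\n"
             + "  <name>http.agent.name</name>\n"
             + "  <value>" + escape(parts[0]) + "</value>\n"
             + "</property>\n"
             + "<property>\n"
             + "  <name>http.agent.version</name>\n"
             + "  <value>" + escape(parts[1]) + "</value>\n"
             + "</property>\n";
    }

    public static void main(String[] args) {
        String chrome = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36";
        System.out.print(toProperties(chrome));
    }
}
```

For the Chrome string this yields the same split as configuration 3. For Firefox the last "/" falls at a different position than in configuration 1, but the joined header value is identical either way.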

Postscript: ways to check a User-Agent string:

1. http://www.useragentstring.com

2. http://whatsmyuseragent.com

3. http://www.enhanceie.com/ua.aspx
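
The sites above report the User-Agent of whatever browser visits them. To check the value a program sends, one option is a small local echo server. The sketch below uses the JDK's built-in `com.sun.net.httpserver.HttpServer` and `HttpURLConnection` rather than Nutch itself, so it only confirms how a given header string arrives at a server. The class name `UserAgentEcho` and its `roundTrip` method are illustrative.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative check (not Nutch code): a local server echoes back the
// User-Agent header it receives, so the value a client sends can be inspected.
class UserAgentEcho {

    /** Sends one request with the given User-Agent and returns what the server saw. */
    static String roundTrip(String userAgent) {
        HttpServer server = null;
        try {
            // Port 0 lets the operating system pick a free port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                String ua = exchange.getRequestHeaders().getFirst("User-Agent");
                byte[] body = (ua == null ? "(none)" : ua).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();

            URL url = URI.create(
                    "http://127.0.0.1:" + server.getAddress().getPort() + "/").toURL();
            HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
            conn.setRequestProperty("User-Agent", userAgent);
            try (InputStream in = conn.getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0"));
    }
}
```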


Time: 2024-11-08 04:36:11
