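A Douban Movie Crawler in Java: The Growth of a Little Crawler (Source Code Included)

  I have used crawlers before, for example using Nutch to crawl specified seeds and building search on top of the crawled data, and I have skimmed some of its source code. Nutch, of course, handles crawling very thoroughly and carefully. Whenever I watched the fetched pages and processing messages scroll past on the screen, it felt like black magic. While reviewing Spring MVC, I took the chance to build a small crawler of my own. It did not matter if it was simple or had a few small bugs; all I needed was something that could crawl the information I wanted from a given seed site. Whenever an Exception appeared, I fixed it: sometimes an API was used incorrectly, sometimes an HTTP request returned an abnormal status, sometimes a database read or write went wrong. Through this cycle of hitting exceptions and resolving them, JewelCrawler (named after my son's nickname) became able to crawl data on its own, and it also picked up a small extra skill: sentiment analysis based on the Word2Vec algorithm.

  There are probably more unknown Exceptions waiting to be solved, and some performance work remains, such as the interaction with the database and data reads and writes. But I do not expect to have much time for this before the end of the year, so today I am writing a brief summary. The previous two posts focused on features and results; this one describes how JewelCrawler came to be. The code is on Github (the link is at the end of the post) for anyone interested. (It is for learning and discussion only; please do not use it for anything else, and be considerate of Douban. A little more sincerity, a little less harm.)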

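Environment

  IDE: IntelliJ IDEA 14

  Database: MySQL 5.5, with Navicat as the database management tool (for connecting to and querying the database)

  Language: Java

  Dependency management: Maven

  Version control: Git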

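Directory structure

  In the project:

  com.ansj.vec is a Java implementation of the Word2Vec algorithm

  com.jackie.crawler.doubanmovie is the crawler module, which in turn contains several packages

  Some of these packages are empty because the corresponding modules are not used yet. Of the others:

    the constants package holds constant classes

    the crawl package holds the crawler entry point

    the entity package holds the entity classes mapped to database tables

    the test package holds test classes

    the utils package holds utility classes

  The resource module holds configuration and resource files, for example

    beans.xml: the Spring context configuration file

    seed.properties: the seed file

    stopwords.dic: the stop word dictionary

    comment12031715.txt: the crawled short comments

    tokenizerResult.txt: the output of tokenizing with IKAnalyzer

    vector.mod: the model data trained with the Word2Vec algorithm

  The test module is for writing unit tests.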

Database configuration

  1. Add the dependencies

  JewelCrawler is managed with Maven, so you only need to add the corresponding dependencies to pom.xml

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-jdbc</artifactId>
    <version>4.1.1.RELEASE</version>
</dependency>
<dependency>
    <groupId>commons-pool</groupId>
    <artifactId>commons-pool</artifactId>
    <version>1.6</version>
</dependency>
<dependency>
    <groupId>commons-dbcp</groupId>
    <artifactId>commons-dbcp</artifactId>
    <version>1.4</version>
</dependency>
<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.38</version>
</dependency>

  

  2. Declare the data source bean

  The data source bean needs to be declared in beans.xml

<context:property-placeholder location="classpath*:*.properties"/>
<bean id="dataSource" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
    <property name="driverClassName" value="${jdbc.driver}"/>
    <property name="url" value="${jdbc.url}"/>
    <property name="username" value="${jdbc.username}"/>
    <property name="password" value="${jdbc.password}"/>
</bean>

Note: the external configuration file jdbc.properties is bound here, and the actual data source parameters are read from that file.
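To make the placeholder idea more concrete, here is a minimal sketch of how a ${...} placeholder could be resolved against a set of properties. It is a simplified illustration, not Spring's actual implementation. The class name PlaceholderDemo and all sample values (driver class, URL, user name, password) are assumptions; only the four jdbc.* keys come from the bean definition above.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class PlaceholderDemo {
    // Matches ${key} and captures the key
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${key} in the template with the matching property value. */
    static String resolve(String template, Properties props) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String key = m.group(1);
            String value = props.getProperty(key);
            if (value == null) {
                throw new IllegalArgumentException("Missing property: " + key);
            }
            // quoteReplacement keeps '$' and '\' in the value literal
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    /** Sample content for jdbc.properties; every value here is an assumption. */
    static Properties sampleJdbcProperties() {
        Properties props = new Properties();
        props.setProperty("jdbc.driver", "com.mysql.jdbc.Driver");
        props.setProperty("jdbc.url", "jdbc:mysql://localhost:3306/test");
        props.setProperty("jdbc.username", "root");
        props.setProperty("jdbc.password", "secret");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(resolve("${jdbc.url}", sampleJdbcProperties()));
    }
}
```

If any of the four keys is missing from jdbc.properties, the sketch fails fast with an exception naming the key, which makes a misconfigured data source easier to spot.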

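  If you run into the error "SQL [insert into user(id) values(?)]; Field 'name' doesn't have a default value;", it means the INSERT omits a NOT NULL column that has no default value. In my case the fix was to set the corresponding table field to auto-increment. More generally, a column left out of the INSERT needs a default value or must allow NULL; auto-increment only applies to a numeric key column such as id.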

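Problems encountered while parsing pages

  The crawled pages have to be parsed as a DOM to extract the data I want. Along the way I ran into the following errors

  org.htmlparser.Node cannot be resolved

  Fix: add the jar dependency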

<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>1.6</version>
</dependency>

  

  org.apache.http.HttpEntity cannot be resolved

  Fix: add the jar dependency

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency> 

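  These were just problems I hit along the way; in the end I used Jsoup for page parsing.

Slow downloads from the Maven repository

  I had been using the default Maven Central repository, and jar downloads were very slow, whether because of my network or for some other reason. I later found the Aliyun Maven mirror online. After switching, downloads were practically instant by comparison. Highly recommended.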

<mirrors>
    <mirror>
      <id>alimaven</id>
      <name>aliyun maven</name>
      <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
      <mirrorOf>central</mirrorOf>
    </mirror>
</mirrors>

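  Locate Maven's settings.xml file and add this mirror.

One way to read files under the resource module

  For example, reading the seed.properties file

@Test
public void testFile() {
    // Resolve the classpath resource to a file on disk
    // (this only works while the resource is not packed inside a jar)
    File seedFile = new File(this.getClass().getResource("/seed.properties").getPath());
    System.out.print("===========" + seedFile.length() + "===========");
}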
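As an alternative, here is a minimal sketch that reads the resource as a stream instead of a File. It assumes the file is a standard Java properties file encoded in UTF-8; the class name ResourceLoaderDemo and the sample key seed are hypothetical. Reading from a stream also works when the resource is packed inside a jar, where it cannot be opened as a File.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper, for illustration only.
class ResourceLoaderDemo {
    /** Loads properties from a stream and closes it afterwards. */
    static Properties load(InputStream in) {
        if (in == null) {
            throw new IllegalArgumentException("Resource not found on the classpath");
        }
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Loads a classpath resource such as "/seed.properties". */
    static Properties loadFromClasspath(String name) {
        return load(ResourceLoaderDemo.class.getResourceAsStream(name));
    }

    public static void main(String[] args) {
        // In-memory stand-in for seed.properties so the sketch runs anywhere
        byte[] content = "seed=https://example.com/\n".getBytes(StandardCharsets.UTF_8);
        Properties props = load(new ByteArrayInputStream(content));
        System.out.println(props.getProperty("seed"));
    }
}
```

In the project itself the call would look like ResourceLoaderDemo.loadFromClasspath("/seed.properties"), assuming the file sits at the root of the resource module.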

  

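About regular expressions

  When using regular expressions, even if the input matches the Pattern you defined, you must first call the matcher's find method before you can use group to get the matched substring. Calling group directly will not give you the result you want.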
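The behavior can be reproduced with a minimal sketch. The pattern and the sample URL are hypothetical and chosen only for illustration; they are not taken from the JewelCrawler source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class GroupBeforeFindDemo {
    // Assumed pattern: captures the numeric id in a path such as /subject/12345/
    private static final Pattern SUBJECT_ID = Pattern.compile("/subject/(\\d+)/");

    /** Returns the captured id, or null when the URL does not contain one. */
    static String extractId(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        // find() has to succeed before group() may be called
        return m.find() ? m.group(1) : null;
    }

    /** Returns true if group() without a prior find() throws IllegalStateException. */
    static boolean groupWithoutFindThrows(String url) {
        Matcher m = SUBJECT_ID.matcher(url);
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String url = "https://movie.douban.com/subject/12345/";
        System.out.println(groupWithoutFindThrows(url)); // prints "true"
        System.out.println(extractId(url));              // prints "12345"
    }
}
```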

  I took a look at the source code of the Matcher class

package java.util.regex;

import java.util.Objects;

public final class Matcher implements MatchResult {

    /**
     * The Pattern object that created this Matcher.
     */
    Pattern parentPattern;

    /**
     * The storage used by groups. They may contain invalid values if
     * a group was skipped during the matching.
     */
    int[] groups;

    /**
     * The range within the sequence that is to be matched. Anchors
     * will match at these "hard" boundaries. Changing the region
     * changes these values.
     */
    int from, to;

    /**
     * Lookbehind uses this value to ensure that the subexpression
     * match ends at the point where the lookbehind was encountered.
     */
    int lookbehindTo;

    /**
     * The original string being matched.
     */
    CharSequence text;

    /**
     * Matcher state used by the last node. NOANCHOR is used when a
     * match does not have to consume all of the input. ENDANCHOR is
     * the mode used for matching all the input.
     */
    static final int ENDANCHOR = 1;
    static final int NOANCHOR = 0;
    int acceptMode = NOANCHOR;

    /**
     * The range of string that last matched the pattern. If the last
     * match failed then first is -1; last initially holds 0 then it
     * holds the index of the end of the last match (which is where the
     * next search starts).
     */
    int first = -1, last = 0;

    /**
     * The end index of what matched in the last match operation.
     */
    int oldLast = -1;

    /**
     * The index of the last position appended in a substitution.
     */
    int lastAppendPosition = 0;

    /**
     * Storage used by nodes to tell what repetition they are on in
     * a pattern, and where groups begin. The nodes themselves are stateless,
     * so they rely on this field to hold state during a match.
     */
    int[] locals;

    /**
     * Boolean indicating whether or not more input could change
     * the results of the last match.
     *
     * If hitEnd is true, and a match was found, then more input
     * might cause a different match to be found.
     * If hitEnd is true and a match was not found, then more
     * input could cause a match to be found.
     * If hitEnd is false and a match was found, then more input
     * will not change the match.
     * If hitEnd is false and a match was not found, then more
     * input will not cause a match to be found.
     */
    boolean hitEnd;

    /**
     * Boolean indicating whether or not more input could change
     * a positive match into a negative one.
     *
     * If requireEnd is true, and a match was found, then more
     * input could cause the match to be lost.
     * If requireEnd is false and a match was found, then more
     * input might change the match but the match won't be lost.
     * If a match was not found, then requireEnd has no meaning.
     */
    boolean requireEnd;

    /**
     * If transparentBounds is true then the boundaries of this
     * matcher's region are transparent to lookahead, lookbehind,
     * and boundary matching constructs that try to see beyond them.
     */
    boolean transparentBounds = false;

    /**
     * If anchoringBounds is true then the boundaries of this
     * matcher's region match anchors such as ^ and $.
     */
    boolean anchoringBounds = true;

    /**
     * No default constructor.
     */
    Matcher() {
    }

/**
 * All matchers have the state used by Pattern during a match.
 */
Matcher(Pattern parent, CharSequence text) {
    this.parentPattern = parent;
    this.text = text;

    // Allocate state storage
    int parentGroupCount = Math.max(parent.capturingGroupCount, 10);
    groups = new int[parentGroupCount * 2];
    locals = new int[parent.localCount];

    // Put fields into initial states
    reset();
}
....
/**
 * Returns the input subsequence matched by the previous match.
 *
 * <p> For a matcher <i>m</i> with input sequence <i>s</i>,
 * the expressions <i>m.</i><tt>group()</tt> and
 * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(),</tt> <i>m.</i><tt>end())</tt>
 * are equivalent.  </p>
 *
 * <p> Note that some patterns, for example <tt>a*</tt>, match the empty
 * string.  This method will return the empty string when the pattern
 * successfully matches the empty string in the input.  </p>
 *
 * @return The (possibly empty) subsequence matched by the previous match,
 *         in string form
 *
 * @throws  IllegalStateException
 *          If no match has yet been attempted,
 *          or if the previous match operation failed
 */
public String group() {
    return group(0);
}
/**
 * Returns the input subsequence captured by the given group during the
 * previous match operation.
 *
 * <p> For a matcher <i>m</i>, input sequence <i>s</i>, and group index
 * <i>g</i>, the expressions <i>m.</i><tt>group(</tt><i>g</i><tt>)</tt> and
 * <i>s.</i><tt>substring(</tt><i>m.</i><tt>start(</tt><i>g</i><tt>),</tt> <i>m.</i><tt>end(</tt><i>g</i><tt>))</tt>
 * are equivalent.  </p>
 *
 * <p> <a href="Pattern.html#cg">Capturing groups</a> are indexed from left
 * to right, starting at one.  Group zero denotes the entire pattern, so
 * the expression <tt>m.group(0)</tt> is equivalent to <tt>m.group()</tt>.
 * </p>
 *
 * <p> If the match was successful but the group specified failed to match
 * any part of the input sequence, then <tt>null</tt> is returned. Note
 * that some groups, for example <tt>(a*)</tt>, match the empty string.
 * This method will return the empty string when such a group successfully
 * matches the empty string in the input.  </p>
 *
 * @param  group
 *         The index of a capturing group in this matcher's pattern
 *
 * @return  The (possibly empty) subsequence captured by the group
 *          during the previous match, or <tt>null</tt> if the group
 *          failed to match part of the input
 *
 * @throws  IllegalStateException
 *          If no match has yet been attempted,
 *          or if the previous match operation failed
 *
 * @throws  IndexOutOfBoundsException
 *          If there is no capturing group in the pattern
 *          with the given index
 */
public String group(int group) {
    if (first < 0)
        throw new IllegalStateException("No match found");
    if (group < 0 || group > groupCount())
        throw new IndexOutOfBoundsException("No group " + group);
    if ((groups[group*2] == -1) || (groups[group*2+1] == -1))
        return null;
    return getSubSequence(groups[group * 2], groups[group * 2 + 1]).toString();
}
/**
 * Attempts to find the next subsequence of the input sequence that matches
 * the pattern.
 *
 * <p> This method starts at the beginning of this matcher's region, or, if
 * a previous invocation of the method was successful and the matcher has
 * not since been reset, at the first character not matched by the previous
 * match.
 *
 * <p> If the match succeeds then more information can be obtained via the
 * <tt>start</tt>, <tt>end</tt>, and <tt>group</tt> methods.  </p>
 *
 * @return  <tt>true</tt> if, and only if, a subsequence of the input
 *          sequence matches this matcher's pattern
 */
public boolean find() {
    int nextSearchIndex = last;
    if (nextSearchIndex == first)
        nextSearchIndex++;

    // If next search starts before region, start it at region
    if (nextSearchIndex < from)
        nextSearchIndex = from;

    // If next search starts beyond region then it fails
    if (nextSearchIndex > to) {
        for (int i = 0; i < groups.length; i++)
            groups[i] = -1;
        return false;
    }
    return search(nextSearchIndex);
}

/**
 * Initiates a search to find a Pattern within the given bounds.
 * The groups are filled with default values and the match of the root
 * of the state machine is called. The state machine will hold the state
 * of the match as it proceeds in this matcher.
 *
 * Matcher.from is not set here, because it is the "hard" boundary
 * of the start of the search which anchors will set to. The from param
 * is the "soft" boundary of the start of the search, meaning that the
 * regex tries to match at that index but ^ won't match there. Subsequent
 * calls to the search methods start at a new "soft" boundary which is
 * the end of the previous match.
 */
boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from        = from < 0 ? 0 : from;
    this.first  = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}
...
}

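  Here is the reason. If you call group without calling find first, group() delegates to group(int group), whose body checks if (first < 0). That condition clearly holds, because the initial value of first is -1, so an IllegalStateException is thrown. If you call find first, it eventually calls search(nextSearchIndex). Note that nextSearchIndex has been assigned from last, whose initial value is 0. Execution then moves into the search method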

boolean search(int from) {
    this.hitEnd = false;
    this.requireEnd = false;
    from        = from < 0 ? 0 : from;
    this.first  = from;
    this.oldLast = oldLast < 0 ? from : oldLast;
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    acceptMode = NOANCHOR;
    boolean result = parentPattern.root.match(this, from, text);
    if (!result)
        this.first = -1;
    this.oldLast = this.last;
    return result;
}

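  nextSearchIndex is passed in as from, and from is assigned to first inside the method body. So after a successful call to find, first is no longer -1 and group does not throw. If the match fails, search sets first back to -1, so group still throws in that case.

  The source code has been uploaded to Github: https://github.com/DMinerJackie/JewelCrawler

  The issues above are fairly scattered; they are notes taken while running into and solving problems. You will hit other problems in practice, and questions or suggestions are welcome ^^.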
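The two outcomes of find can be checked with a minimal sketch: a loop that collects every match, and a helper confirming that group still throws after a failed find. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, for illustration only.
class FindLoopDemo {
    /** Collects every match; each successful find() makes group() valid. */
    static List<String> findAll(Pattern pattern, CharSequence input) {
        List<String> matches = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    /** Returns true if find() fails and the following group() call throws. */
    static boolean groupAfterFailedFindThrows(Pattern pattern, CharSequence input) {
        Matcher m = pattern.matcher(input);
        if (m.find()) {
            return false; // only the no-match case is of interest here
        }
        try {
            m.group();
            return false;
        } catch (IllegalStateException e) {
            return true; // first was reset to -1 by the failed search
        }
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(findAll(digits, "a1b22c333"));              // prints "[1, 22, 333]"
        System.out.println(groupAfterFailedFindThrows(digits, "abc")); // prints "true"
    }
}
```

Guarding every group call with the boolean result of find, as the loop does, avoids the exception in both cases.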

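  Finally, a few snapshots of the data crawled so far

  Record table

  It stores 79032 records, of which 48471 pages have already been crawled

  movie table

  2964 films and TV works have been crawled so far

  comments table

  29711 records have been crawled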


Time: 2024-10-11 20:58:57
