Java crawler: scraping data from Dangdang.com

Background: my girlfriend (yes, I really do have a girlfriend!) is about to graduate and is writing her thesis on children's sex education. She could not find any data on children's sex-education picture books, so the plan was to pull the data from Dangdang.com. But how do you get it off the site? My first thought was Python, but I don't know Python! After a round of searching on Baidu I decided to write the crawler in Java, since that is what I know best. Enough talk, let's get to work.

Implementation:

First, set up the skeleton: create a Maven project using Spring Boot and MyBatis, with IDEA as the development tool. The pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.1.4.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>
    <groupId>cn.com.boco</groupId>
    <artifactId>demo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <name>demo</name>
    <description>Demo project for Spring Boot</description>

    <properties>
        <java.version>1.8</java.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-data-jpa</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-jdbc</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>2.0.1</version>
        </dependency>

        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <scope>runtime</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>com.oracle</groupId>
            <artifactId>ojdbc6</artifactId>
            <version>11.2.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.5</version>
        </dependency>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.45</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>

</project>

The directory structure is a standard Spring Boot Maven layout: the mapper interfaces sit in cn.com.boco.demo.mapper and the entity classes in cn.com.boco.demo.entity, alongside the controller, service, and utility classes shown below.

The project connects to a local Oracle database. The configuration files are as follows.

Note: in application.yml, spring.profiles.active: dev tells Spring Boot to load application-dev.yml as the active configuration. In real development you can keep several configuration profiles this way; at release time you only switch the active property instead of editing the configuration files themselves.

The application-dev.yml configuration file:

server:
  port: 8084

spring:
  datasource:
    username: system
    password: 123456
    url: jdbc:oracle:thin:@localhost
    driver-class-name: oracle.jdbc.driver.OracleDriver

mybatis:
  mapper-locations: classpath*:mapping/*.xml
  type-aliases-package: cn.com.boco.demo.entity

# show the SQL executed by the mappers
logging:
  level:
    cn.com.boco.demo.mapper: debug

The application.yml file:

spring:
  profiles:
    active: dev

The startup class is shown below; the @MapperScan annotation scans the DAO-layer mapper interfaces:

@MapperScan("cn.com.boco.demo.mapper")
@SpringBootApplication
public class DemoApplication {

    public static void main(String[] args) {
        SpringApplication.run(DemoApplication.class, args);
    }

}

The DAO-layer interface:

@Repository
public interface BookMapper {

    void insertBatch(List<DangBook> list);

}

The mapper XML file:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">

<mapper namespace="cn.com.boco.demo.mapper.BookMapper">

    <insert id="insertBatch" parameterType="java.util.List">
        INSERT ALL
        <foreach collection="list" item="item" index="index" separator=" ">
            into dangdang_message (title,img,author,publish,detail,price,parentUrl,inputTime)  values
            (#{item.title,jdbcType=VARCHAR},
            #{item.img,jdbcType=VARCHAR},
            #{item.author,jdbcType=VARCHAR},
            #{item.publish,jdbcType=VARCHAR},
            #{item.detail,jdbcType=VARCHAR},
            #{item.price,jdbcType=DOUBLE},
            #{item.parentUrl,jdbcType=VARCHAR},
            #{item.inputTime,jdbcType=DATE})

        </foreach>
        select 1 from dual
    </insert>

</mapper>

The two entity classes:

public class BaseModel {

    private int id;
    private Date inputTime;

    public Date getInputTime() {
        return inputTime;
    }

    public void setInputTime(Date inputTime) {
        this.inputTime = inputTime;
    }

    public int getId() {
        return id;
    }

    public void setId(int id) {
        this.id = id;
    }
}
@Alias("dangBook")
public class DangBook extends BaseModel {

    // title
    private String title;
    // image URL
    private String img;
    // author
    private String author;
    // publisher
    private String publish;
    // description
    private String detail;
    // price
    private float price;
    // parent link, i.e. the request URL this record was scraped from
    private String parentUrl;

    public String getParentUrl() {
        return parentUrl;
    }

    public void setParentUrl(String parentUrl) {
        this.parentUrl = parentUrl;
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getPublish() {
        return publish;
    }

    public void setPublish(String publish) {
        this.publish = publish;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getImg() {
        return img;
    }

    public void setImg(String img) {
        this.img = img;
    }

    public String getDetail() {
        return detail;
    }

    public void setDetail(String detail) {
        this.detail = detail;
    }

    public float getPrice() {
        return price;
    }

    public void setPrice(float price) {
        this.price = price;
    }

}

The service layer:

@Service
public class BookService {

    @Autowired
    private BookMapper bookMapper;

    public void insertBatch(List<DangBook> list){
        bookMapper.insertBatch(list);
    }

}

The controller-layer code:

@RestController
@RequestMapping("/book")
public class DangdangBookController {

    @Autowired
    private BookService bookService;

    private static Logger logger = LoggerFactory.getLogger(DangdangBookController.class);
    // the search URL with the keyword already decoded (human-readable form)
    private static final String URL = "http://search.dangdang.com/?key=性教育绘本&act=input&att=1000006:226&page_index=";
    // the same URL before decoding (percent-encoded, as it appears in the browser)
    private static final String URL2 = "http://search.dangdang.com/?key=%D0%D4%BD%CC%D3%FD%BB%E6%B1%BE&act=input&att=1000006%3A226&page_index=";
    @RequestMapping("/parse")
    public JSONObject parse(){
        JSONObject jsonObject = new JSONObject();
        for(int i =1;i<=10;i++){
            List<DangBook> dangBooks = ParseUtils.dingParse(URL+i);
            if(dangBooks != null && dangBooks.size() >0){

                logger.info("解析完数据,准备入库");
                bookService.insertBatch(dangBooks);
                logger.info("入库完成,入库数据条数"+ dangBooks.size());
                jsonObject.put("code",1);
                jsonObject.put("result","success");
            }else{
                jsonObject.put("code",0);
                jsonObject.put("result","fail");
            }

        }
        return jsonObject;
    }

}

Originally the URL to parse was supposed to be passed in from the front end, but the query parameters kept getting lost, and URL-encoding them did not help either, so in the end the URL is hard-coded on the back end.
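A likely reason, though this is a guess and not something confirmed in the project above, is that Dangdang percent-encodes the key parameter in GBK rather than UTF-8 (the URL2 constant above decodes to 性教育绘本 only when read as GBK), so a keyword URL-encoded as UTF-8 by the front end would not match. Below is a minimal sketch of building the search URL from a plain keyword on the back end, assuming GBK is what the site expects (SearchUrlBuilder and buildSearchUrl are hypothetical names, not part of the project above):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {

    // Builds the Dangdang search URL for a keyword and a page index.
    // Assumption: the site expects the key parameter to be GBK percent-encoded.
    public static String buildSearchUrl(String keyword, int pageIndex) {
        try {
            String encodedKey = URLEncoder.encode(keyword, "GBK");
            return "http://search.dangdang.com/?key=" + encodedKey
                    + "&act=input&att=1000006%3A226&page_index=" + pageIndex;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("GBK encoding not supported", e);
        }
    }

    public static void main(String[] args) {
        // Prints a URL whose key parameter matches the URL2 constant above
        System.out.println(buildSearchUrl("性教育绘本", 1));
    }
}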

The ParseUtils and HttpGetUtils utility classes:
public class HttpGetUtils {

    private static Logger logger = LoggerFactory.getLogger(HttpGetUtils.class);

    public static String getUrlContent(String url) {
        if (url == null) {
            logger.info("url地址为空");
            return null;
        }
        logger.info("url为:" + url);
        logger.info("开始解析");
        String contentLine = null;
        //最新版httpclient.jar已经舍弃new DefaultHttpClient()
        //但是还是可以用的
        HttpClient httpClient = new DefaultHttpClient();
        HttpResponse httpResponse = getResp(httpClient, url);
        if (httpResponse.getStatusLine().getStatusCode() == 200) {
            try {
                contentLine = EntityUtils.toString(httpResponse.getEntity(), "utf-8");
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        logger.info("解析结束");
        return contentLine;
    }

    /**
     * Get the HttpResponse object for the given url.
     */
    public static HttpResponse getResp(HttpClient httpClient, String url) {
        logger.info("Start fetching the response object");
        HttpGet httpGet = new HttpGet(url);
        // default 200 OK response, returned only if the request below throws an exception
        HttpResponse httpResponse = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_OK, "OK");
        try {
            httpResponse = httpClient.execute(httpGet);
        } catch (IOException e) {
            e.printStackTrace();
        }
        logger.info("获取对象结束");
        return httpResponse;
    }

}
public class ParseUtils {

    private static Logger logger = LoggerFactory.getLogger(ParseUtils.class);

    public static List<DangBook> dingParse(String url) {
        List<DangBook> list = new ArrayList<>();
        Date date = new Date();
        if (url == null) {
            logger.info("url为空,数据获取结束");
            return null;
        }

        logger.info("开始获取数据");
        String content = HttpGetUtils.getUrlContent(url);
        if (content != null)
            logger.info("得到解析数据");
        else {
            logger.info("解析数据为空,数据获取结束");
            return null;
        }

        Document document = Jsoup.parse(content);
        // iterate over the Dangdang book list: each result is an <li class="lineN"> inside <ul class="bigimg">
        for(int i =1;i<=60;i++){
            Elements elements = document.select("ul[class=bigimg]").select("li[class=line"+i+"]");
            for (Element e : elements) {
                String title = e.select("p[class=name]").select("a").text();
                logger.info("Title: " + title);
                String img = e.select("a[class=pic]").select("img").attr("data-original");
                logger.info("Image URL: " + img);
                String authorAndPublish = e.select("p[class=search_book_author]").select("span").select("a").text();
                String[] a = authorAndPublish.split(" ");
                String author = a[0];
                logger.info("Author: " + author);
                String publish = a[a.length - 1];
                logger.info("Publisher: " + publish);
                String detail = e.select("p[class=detail]").text();
                logger.info("Description: " + detail);
                String priceS = e.select("p[class=price]").select("span[class=search_now_price]").text();
                float price = 0.0f;
                // the null check must come before calling length()
                if (priceS != null && priceS.length() > 1) {
                    price = Float.parseFloat(priceS.substring(1, priceS.length() - 1));
                }
                logger.info("Price: " + price);
                logger.info("-------------------------------------------------------------------------");
                DangBook dangBook = new DangBook();
                dangBook.setTitle(title);
                dangBook.setImg(img);
                dangBook.setAuthor(author);
                dangBook.setPublish(publish);
                dangBook.setDetail(detail);
                dangBook.setPrice(price);
                dangBook.setParentUrl(url);
                dangBook.setInputTime(date);
                list.add(dangBook);
            }
        }
        return list;
    }

}
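As the comment in HttpGetUtils notes, new DefaultHttpClient() is deprecated in recent HttpClient releases. For reference, here is a minimal sketch of the same GET request written against the non-deprecated CloseableHttpClient API that ships with httpclient 4.5; this is an alternative, not what the code above actually uses (HttpGetUtils2 is just a placeholder name):

import java.io.IOException;

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpGetUtils2 {

    // Fetches the page body as a UTF-8 string, or returns null on a non-200 status or an error.
    public static String getUrlContent(String url) {
        try (CloseableHttpClient httpClient = HttpClients.createDefault();
             CloseableHttpResponse response = httpClient.execute(new HttpGet(url))) {
            if (response.getStatusLine().getStatusCode() == 200) {
                return EntityUtils.toString(response.getEntity(), "utf-8");
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}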

After calling the /book/parse endpoint, the scraped records end up in the dangdang_message table.

Note: pay attention to the column types when creating the table. Oracle's VARCHAR2(255) was not big enough for some of the titles in this data, so the inserts failed at first and I had to change the column type. Also take care of the auto-increment of the ID column and the automatic filling of the insert time; my database skills are fairly weak, and it took a round of searching on Baidu to get it working.
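For reference, here is one possible Oracle DDL for the table, matching the columns used in the mapper XML above. It is only a sketch (the column sizes and object names are examples, not the exact ones used): the title and detail columns are sized well beyond VARCHAR2(255), the id is filled from a sequence by a trigger (Oracle 11g has no identity columns), and inputTime defaults to SYSDATE so it gets filled even if an insert omits it.

-- Sketch of the table; column sizes are examples, not the exact ones used
CREATE TABLE dangdang_message (
    id        NUMBER(10)      NOT NULL,
    title     VARCHAR2(1000),
    img       VARCHAR2(1000),
    author    VARCHAR2(500),
    publish   VARCHAR2(500),
    detail    VARCHAR2(4000),
    price     NUMBER(10, 2),
    parentUrl VARCHAR2(1000),
    inputTime DATE DEFAULT SYSDATE,
    CONSTRAINT pk_dangdang_message PRIMARY KEY (id)
);

-- Sequence plus trigger to auto-increment the id on insert
CREATE SEQUENCE seq_dangdang_message START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE TRIGGER trg_dangdang_message_id
BEFORE INSERT ON dangdang_message
FOR EACH ROW
WHEN (NEW.id IS NULL)
BEGIN
    SELECT seq_dangdang_message.NEXTVAL INTO :NEW.id FROM dual;
END;
/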

Original article: https://www.cnblogs.com/grasslucky/p/10785641.html

