A First Look at Scrapy: Crawling Novels

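I. Introduction

The previous post covered the basics of the Scrapy framework. This post puts them to use by crawling the free novels on book9.net (第九中文网).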

II. Building the Scrapy example

1. Create the project

C:\Users\LENOVO\PycharmProjects\fullstack\book9>scrapy startproject book9

2. Define the fields to scrape (items.py)

import scrapy

class Book9Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    book_name = scrapy.Field()        # novel title
    chapter_name = scrapy.Field()     # chapter title
    chapter_content = scrapy.Field()  # chapter text

3. Write the spider (spiders/book.py)

Create a book.py file in the spiders directory:
import scrapy
from book9.items import Book9Item
from scrapy.http import Request

class Book9Spider(scrapy.Spider):
    name = "book9"
    allowed_domains = ['book9.net']
    start_urls = ['https://www.book9.net/xuanhuanxiaoshuo/']

    # Collect the URL of every book listed on the category page
    def parse(self, response):
        book_urls = response.xpath('//div[@class="r"]/ul/li/span[@class="s2"]/a/@href').extract()

        for book_url in book_urls:
            yield Request(book_url, callback=self.parse_read)

    # Open each book's table of contents and follow every chapter link
    def parse_read(self, response):

        read_url = response.xpath('//div[@class="box_con"]/div/dl/dd/a/@href').extract()

        for i in read_url:
            # Chapter hrefs are site-relative. Resolve them against the page URL
            # with urljoin; os.path.join is meant for filesystem paths, not URLs.
            read_url_path = response.urljoin(i)

            yield Request(read_url_path, callback=self.parse_content)

    # Extract the novel title, chapter title, and chapter text
    def parse_content(self, response):

        # Novel title: the third link inside div.con_top
        book_name = response.xpath('//div[@class="con_top"]/a/text()').extract()[2]

        # Chapter title
        chapter_name = response.xpath('//div[@class="bookname"]/h1/text()').extract_first()

        # Chapter text: join the text nodes, then strip the indentation
        chapter_content_2 = response.xpath('//div[@class="box_con"]/div/text()').extract()
        chapter_content_1 = ''.join(chapter_content_2)
        chapter_content = chapter_content_1.replace('    ', '')

        item = Book9Item()
        item['book_name'] = book_name
        item['chapter_name'] = chapter_name
        item['chapter_content'] = chapter_content

        yield item
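
In parse_read, every chapter href has to become an absolute URL before it can be requested. The standard library's urllib.parse.urljoin does this resolution, and Scrapy's response.urljoin resolves links against the page URL in the same way. A minimal sketch, assuming a made-up table-of-contents URL and made-up hrefs (neither is taken from the real site):

```python
from urllib.parse import urljoin


def absolute_chapter_urls(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical table-of-contents page and links, for illustration only.
page_url = "https://www.book9.net/0_1/"
hrefs = [
    "/0_1/100.html",                        # site-relative
    "101.html",                             # relative to the current page
    "https://www.book9.net/0_1/102.html",   # already absolute
]

urls = absolute_chapter_urls(page_url, hrefs)
for url in urls:
    print(url)
```

All three forms resolve to a full URL under https://www.book9.net/0_1/, so the spider does not need to know in advance which form the site uses.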

4. Process the items returned by the spider (pipelines.py)

import os

class Book9Pipeline(object):
    def process_item(self, item, spider):
        # Create one directory per novel

        file_path = os.path.join("D:\\Temp", item['book_name'])
        print(file_path)
        if not os.path.exists(file_path):
            os.makedirs(file_path)

        # Write each chapter to its own text file
        chapter_path = os.path.join(file_path, item['chapter_name'] + '.txt')
        print(chapter_path)
        with open(chapter_path, 'w', encoding='utf-8') as f:
            f.write(item['chapter_content'])

        return item
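
The pipeline uses the novel and chapter titles directly as directory and file names. On Windows, open() fails when a title contains characters such as ? or :, so it may help to clean the titles first. A minimal sketch, assuming a hypothetical helper named safe_filename that is not part of the original pipeline:

```python
import re

# Characters that Windows does not allow in file names.
_ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(name, replacement="_"):
    """Return a copy of name that is safe to use as a Windows file name.

    Hypothetical helper for illustration; it only handles illegal
    characters and trailing dots/spaces, not reserved names like CON.
    """
    cleaned = _ILLEGAL_CHARS.sub(replacement, name)
    # Windows silently drops trailing dots and spaces, so remove them up front.
    cleaned = cleaned.lstrip().rstrip(" .")
    return cleaned or "untitled"


print(safe_filename("Chapter 1: Who are you?"))
```

In process_item, the helper could wrap both item['book_name'] and item['chapter_name'] before they are passed to os.path.join.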

5. Configuration (settings.py)

BOT_NAME = 'book9'

SPIDER_MODULES = ['book9.spiders']
NEWSPIDER_MODULE = 'book9.spiders'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Set the default request headers
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'book9.pipelines.Book9Pipeline': 300,
}
}
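
Because this spider requests every chapter of every book on the category page, it can put noticeable load on the site. Scrapy's built-in throttling settings can slow it down. The values below are illustrative assumptions, not settings from the original project:

```python
# Optional additions to settings.py (illustrative values, tune as needed).

# Wait roughly this many seconds between requests to the same site.
DOWNLOAD_DELAY = 1.0

# Cap the number of simultaneous requests per domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
```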

6. Run the spider
C:\Users\LENOVO\PycharmProjects\fullstack\book9>scrapy crawl book9 --nolog

7. Results

Each novel is saved to its own directory under D:\Temp, with one .txt file per chapter.

Original post: http://blog.51cto.com/jiayimeng/2124678

Time: 2024-11-09 19:15:46
