Scraping Novels from Qidian (起点中文网) with etree, XPath, and os

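This article demonstrates parsing and extraction with the lxml library's etree module and XPath expressions, and uses the os library to create folders for the files it writes.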
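Before the full spider, here is a minimal sketch of the etree plus XPath pattern it relies on. The HTML fragment, the `book.example.com` links, and the book titles are made up for illustration; only the XPath expressions are taken from the article's code, and the real page structure may differ.

```python
from lxml import etree

# Hypothetical HTML fragment mimicking the structure the XPath expressions expect.
SAMPLE_HTML = """
<div class="book-mid-info">
  <h4><a href="//book.example.com/info/1">Book One</a></h4>
</div>
<div class="book-mid-info">
  <h4><a href="//book.example.com/info/2">Book Two</a></h4>
</div>
"""

# etree.HTML parses an HTML string and returns the root element of the tree.
html = etree.HTML(SAMPLE_HTML)

# "//" matches elements anywhere in the document; "@href" selects an attribute,
# while "text()" selects the text node inside the element.
hrefs = html.xpath('//div[@class="book-mid-info"]/h4/a/@href')
titles = html.xpath('//div[@class="book-mid-info"]/h4/a/text()')

# Pair each title with its link, as the spider does with zip().
pairs = list(zip(titles, hrefs))
for title, href in pairs:
    print(title, href)
```

Both `xpath()` calls return plain lists in document order, which is why zipping the two lists lines each title up with its own link.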

import os
import requests
from lxml import etree  # lxml parses HTML/XML so the desired data can be extracted
# Design approach: object-oriented
class Spider(object):
    def start_request(self):
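        # 1. Request the listing page, extract each novel's title (used to create a folder) and its link
        response = requests.get('https://www.qidian.com/all')
        # print(response.text)
        html = etree.HTML(response.text)  # parse the HTML text into an element tree
        Bigsrc_list = html.xpath('//div[@class="book-mid-info"]/h4/a/@href')
        # //: selects matching elements anywhere in the document, regardless of position;  @: selects an attribute
        Bigtit_list = html.xpath('//div[@class="book-mid-info"]/h4/a/text()')
        # print(Bigsrc_list, Bigtit_list)
        for Bigsrc, Bigtit in zip(Bigsrc_list, Bigtit_list):
            if not os.path.exists(Bigtit):  # create a folder named after the novel if it does not exist yet
                os.mkdir(Bigtit)
            # print(Bigsrc, Bigtit)
            self.file_data(Bigsrc, Bigtit)  # hand off to the next step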

    def file_data(self, Bigsrc, Bigtit):
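        # 2. Request the novel's page, then extract the chapter titles and chapter links
        response = requests.get("https:" + Bigsrc)  # the href is protocol-relative, so prepend the missing "https:" scheme
        html = etree.HTML(response.text)  # etree.HTML parses an HTML string into an element tree
        listsrc_list = html.xpath('//ul[@class="cf"]/li/a/@href')
        listtit_list = html.xpath('//ul[@class="cf"]/li/a/text()')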
        for Listsrc, Listtit in zip(listsrc_list, listtit_list):
            # print(Listsrc, Listtit)
            self.finally_file(Listsrc, Listtit,Bigtit)

    def finally_file(self,Listsrc, Listtit,Bigtit):
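        # 3. Request the chapter page, extract its text, and save it to a file in the novel's folder
        response = requests.get("https:" + Listsrc)
        html = etree.HTML(response.text)  # parse the HTML text into an element tree
        content = "\n".join(html.xpath('//div[@class="read-content j_readContent"]/p/text()'))
        # S.join() returns a single string with S as the separator between elements
        file_name = os.path.join(Bigtit, Listtit + ".txt")
        # i.e. the file Listtit.txt inside the Bigtit folder (os.path.join keeps the path portable)
        print("Saving file " + file_name)
        with open(file_name, "a", encoding="utf-8") as f:
            f.write(content)
if __name__ == '__main__':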
    spider = Spider()
    spider.start_request()
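Two steps in the listing are fragile: the links are completed by string concatenation with "https:", and titles are used directly as folder and file names. The sketch below shows one possible way to harden both using only the standard library. The helper names `absolute_url`, `safe_name`, and `chapter_path` are hypothetical and not part of the original code, and the set of replaced characters is an assumption based on what Windows file names disallow.

```python
import os
import re
from urllib.parse import urljoin

# Listing page used by the spider; links found on it are resolved against it.
BASE_URL = "https://www.qidian.com/all"


def absolute_url(href, base=BASE_URL):
    """Resolve a relative or protocol-relative href against the base URL."""
    # absolute_url("//book.example.com/info/1") -> "https://book.example.com/info/1"
    return urljoin(base, href)


def safe_name(title):
    """Replace characters that are not allowed in Windows file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return cleaned or "untitled"  # fall back if nothing usable is left


def chapter_path(book_title, chapter_title, root="."):
    """Create the book's folder if needed and return the chapter's file path."""
    folder = os.path.join(root, safe_name(book_title))
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return os.path.join(folder, safe_name(chapter_title) + ".txt")
```

In the spider, `absolute_url(Bigsrc)` could stand in for `"https:" + Bigsrc`, and `chapter_path(Bigtit, Listtit)` for the manual folder check plus path construction.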

Original article: https://www.cnblogs.com/fodalaoyao/p/10409736.html

Time: 2024-11-13 08:22:27
