Web Crawler Series: Douban Book Rankings

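Douban has ranking lists for books, so this time I wrote a crawler for Douban.

The first step is to analyze the URL of the ranking list.

The URL shows that the ranking list for each kind of book is simply the site address followed by /tag/[category], so we first need to collect the book categories.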
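As a rough illustration of that URL pattern, the sketch below builds a ranking-page URL from a category name. The helper name `build_tag_url` and its parameters are hypothetical; the `start` and `type=T` query parameters and the page size of 20 are taken from the crawler code later in this post.

```python
from urllib.parse import quote

BASE_URL = "https://book.douban.com"


def build_tag_url(tag, page=0, page_size=20):
    """Return the ranking-page URL for one category and a zero-based page index.

    Hypothetical helper: it only mirrors the URL layout described in the text.
    """
    # Percent-encode the category so that non-ASCII names are safe in a URL.
    return f"{BASE_URL}/tag/{quote(tag)}?start={page * page_size}&type=T"


print(build_tag_url("小说"))
print(build_tag_url("python", page=1))
```

The second call asks for the second page, so `start` becomes 20.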

The popular tags on the Douban Books home page can be scraped for this.

Scraping the tags is not hard. The code is as follows:

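def getLabel(url):
    """Collect the popular-tag links (such as '/tag/小说') from the page at url."""
    html = getHTMLText(url)    # getHTMLText is defined in the complete code
    soup = BeautifulSoup(html, 'html.parser')
    label_list = []
    for a in soup.find_all('a'):
        href = a.get('href', '')    # some <a> tags have no href
        match = re.search(r'/tag/.*', href)
        # Skip a bare '/tag/' and links like '/tag/?...', which are not categories
        if match and len(match[0]) > 5 and match[0][5] != '?':
            label_list.append(match[0])
    return label_list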
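To see what the filter keeps without touching the network, the same rule can be tried on a handful of links. This is only a sketch: `filter_tag_links` is a hypothetical helper and the sample hrefs are made up for illustration, not copied from the real page.

```python
import re


def filter_tag_links(hrefs):
    """Keep only the '/tag/<category>' part of links that name a category."""
    kept = []
    for href in hrefs:
        match = re.search(r"/tag/(.*)", href)
        # match[1] is whatever follows '/tag/'; it must exist and not be a query string.
        if match and match[1] and not match[1].startswith("?"):
            kept.append(match[0])
    return kept


# Made-up sample links (assumption): two categories, one query link, one bare link, one unrelated link.
sample = [
    "/tag/小说",
    "/tag/?view=type",
    "https://book.douban.com/tag/历史",
    "/tag/",
    "/about",
]
print(filter_tag_links(sample))
```

Only the two category links survive, and the absolute link is reduced to its `/tag/...` part, which is the form the crawler later appends to the site address.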

The next step is to open the ranking page and scrape the book information.

The code is as follows:

def getBookInfo():
    label_list = getLabel('https://book.douban.com/')
    if not label_list:    # the request failed or no tags were found
        print('No categories found.')
        return
    label = get_label(label_list)    # 1-based category number chosen by the user
    page = label_list[label - 1]
    books = []    # one (name, author, price, score, number) tuple per book
    for i in range(2):    # only the first two pages, 20 books per page
        html = getHTMLText('https://book.douban.com' + page + '?start=' + str(i*20) + '&type=T')
        soup = BeautifulSoup(html, 'html.parser')
        for book in soup.find_all('div', attrs={'class': 'info'}):    # one block per book
            a = book.find('a', attrs={'title': True})    # the <a> tag whose title is the book name
            pub = book.find('div', attrs={'class': 'pub'})
            if a is None or pub is None:    # skip entries missing basic information
                continue
            pub = pub.get_text().strip().replace('\n', '')
            author = pub.split('/')[0].strip()
            # Tokens shaped like a price (digit ... dot ...); the price normally comes last
            prices = [j for j in pub.split() if re.search(r'\d.*\..*', j)]
            price = prices[-1] if prices else '-'

            span = book.find('span', attrs={'class': 'pl'})    # tag holding the number of ratings
            digits = re.findall(r'\d+', span.get_text()) if span else []
            number = digits[0] if digits else '0'

            span = book.find('span', attrs={'class': 'rating_nums'})
            score = span.get_text().strip() if span else '-'
            books.append((a.get('title'), author, price, score, number))

    tplt = "{:3}\t{:15}\t{:15}\t{:10}\t{:4}\t{:7}"    # output format
    print(tplt.format("No.", "Title", "Author", "Price", "Score", "Ratings"))
    for count, book in enumerate(books, 1):
        print(tplt.format(count, *book))
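The rating count is read from the text of the `pl` span by looking for a run of digits. A minimal sketch of that step on its own is shown below, with a fallback for text that contains no digits. The helper name `parse_rating_count` is hypothetical, and the sample strings are assumed examples of what the span may contain.

```python
import re


def parse_rating_count(text):
    """Return the first run of digits in text as an int, or 0 if there is none."""
    match = re.search(r"\d+", text)
    return int(match[0]) if match else 0


# Assumed sample texts: a normal count, a "fewer than 10" notice, and a "no ratings yet" notice.
for text in ["(328041人评价)", "(少于10人评价)", "(目前无人评价)"]:
    print(text, parse_rating_count(text))
```

For a "fewer than 10" notice the function returns 10, which is an upper bound rather than an exact count.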

The complete code is:

import re

import requests
from bs4 import BeautifulSoup


def getHTMLText(url):
    """Return the page text, or an empty string if the request fails."""
    try:
        # A browser-like User-Agent is sent as a precaution; the site may refuse the requests default
        r = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def getLabel(url):
    """Collect the popular-tag links (such as '/tag/小说') from the page at url."""
    html = getHTMLText(url)
    soup = BeautifulSoup(html, 'html.parser')
    label_list = []
    for a in soup.find_all('a'):
        href = a.get('href', '')    # some <a> tags have no href
        match = re.search(r'/tag/.*', href)
        # Skip a bare '/tag/' and links like '/tag/?...', which are not categories
        if match and len(match[0]) > 5 and match[0][5] != '?':
            label_list.append(match[0])
    return label_list


def get_label(label_list):
    """Print the numbered categories and return the number the user picks (1-based)."""
    for count, label in enumerate(label_list, 1):
        print(str(count) + ': ' + label[5:] + '\t', end='')
    choose = input('\n\nEnter the number of the book category to query: ')
    # Ask again until the input is a number within range
    while not (choose.isdigit() and 1 <= int(choose) <= len(label_list)):
        choose = input('\nPlease enter a valid category number: ')
    return int(choose)


def getBookInfo():
    label_list = getLabel('https://book.douban.com/')
    if not label_list:    # the request failed or no tags were found
        print('No categories found.')
        return
    label = get_label(label_list)    # 1-based category number chosen by the user
    page = label_list[label - 1]
    books = []    # one (name, author, price, score, number) tuple per book
    for i in range(2):    # only the first two pages, 20 books per page
        html = getHTMLText('https://book.douban.com' + page + '?start=' + str(i*20) + '&type=T')
        soup = BeautifulSoup(html, 'html.parser')
        for book in soup.find_all('div', attrs={'class': 'info'}):    # one block per book
            a = book.find('a', attrs={'title': True})    # the <a> tag whose title is the book name
            pub = book.find('div', attrs={'class': 'pub'})
            if a is None or pub is None:    # skip entries missing basic information
                continue
            pub = pub.get_text().strip().replace('\n', '')
            author = pub.split('/')[0].strip()
            # Tokens shaped like a price (digit ... dot ...); the price normally comes last
            prices = [j for j in pub.split() if re.search(r'\d.*\..*', j)]
            price = prices[-1] if prices else '-'

            span = book.find('span', attrs={'class': 'pl'})    # tag holding the number of ratings
            digits = re.findall(r'\d+', span.get_text()) if span else []
            number = digits[0] if digits else '0'

            span = book.find('span', attrs={'class': 'rating_nums'})
            score = span.get_text().strip() if span else '-'
            books.append((a.get('title'), author, price, score, number))

    tplt = "{:3}\t{:15}\t{:15}\t{:10}\t{:4}\t{:7}"    # output format
    print(tplt.format("No.", "Title", "Author", "Price", "Score", "Ratings"))
    for count, book in enumerate(books, 1):
        print(tplt.format(count, *book))


if __name__ == '__main__':
    print('Douban books by overall ranking\n')
    getBookInfo()

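The final result when the program runs:

First comes the list of categories:

After a category number is entered, the book information is displayed:

I only scraped two pages of book information here.

Some books have incomplete information, so errors can occur while scraping. My regular expressions are also not very well written and can fail in many places, the price being one example.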
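One way to make the price step less fragile is to split the publication line on "/" and accept the last field as a price only if it looks like one. The sketch below is an assumption-laden illustration, not the method used above: `parse_pub` and `PRICE_RE` are hypothetical names, the sample lines are made up, and the list of currency prefixes is far from complete.

```python
import re

# A price is a decimal number, or an integer followed by 元, with an optional
# currency prefix. Dates such as "2012-8" do not match this pattern.
PRICE_RE = re.compile(r"(?:CNY|USD|HKD|NT\$|\$|¥)?\s*(\d+\.\d{1,2}|\d+(?=\s*元))\s*元?")


def parse_pub(pub):
    """Split a publication line into (author, price); price is None if missing."""
    fields = [field.strip() for field in pub.split("/")]
    author = fields[0]
    price = None
    if len(fields) > 1:
        # Only the last field is considered, since the price normally comes last.
        match = PRICE_RE.fullmatch(fields[-1])
        if match:
            price = float(match[1])
    return author, price


# Made-up sample lines (assumption): a full record, one without a price, one with a prefix.
print(parse_pub("[美] 卡勒德·胡赛尼 / 李继宏 / 上海人民出版社 / 2006-5 / 29.00元"))
print(parse_pub("余华 / 作家出版社 / 2012-8"))
print(parse_pub("某作者 / 某出版社 / 2018-1 / CNY 45.00"))
```

Returning `None` for a missing price keeps one value per book, so the author, price, and score columns cannot drift out of step when a record is incomplete.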

Original article: https://www.cnblogs.com/zyb993963526/p/9188881.html

Time: 2024-10-28 15:47:59
