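Scraping Baidu's Real-Time Hot Topics Ranking with Python

Today I'm scraping Baidu's real-time hot topics ranking.

As usual, the first step is to download the page to a local file: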

import requests

def downhtml():
    # Fetch the ranking page and save the raw response bytes locally.
    url = 'http://top.baidu.com/buzz?b=1&fr=20811'
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers)
    with open('C:/Code/info_baidu.html', 'wb') as f:
        f.write(r.content)

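I'm in the habit of saving the whole page locally before analyzing the data, which is why this step exists; the code that scrapes and parses directly is posted further down.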
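A slightly more defensive variant of the download step might look like the sketch below. The function name download_html, the timeout value, and the status check are my own additions for illustration, not part of the original post; the URL and User-Agent are the ones used above, and the page may have moved or changed since the post was written.

```python
import requests

# Same URL and User-Agent as in the article (the page may no longer exist).
URL = 'http://top.baidu.com/buzz?b=1&fr=20811'
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def download_html(path, url=URL, headers=HEADERS, timeout=10):
    """Fetch url and write the raw response bytes to path.

    Hypothetical helper: the timeout and the status check are assumptions
    added here, the original downhtml() has neither.
    """
    r = requests.get(url, headers=headers, timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError on a 4xx/5xx response
    with open(path, 'wb') as f:
        f.write(r.content)
    return len(r.content)  # number of bytes written
```

It would be called as download_html('C:/Code/info_baidu.html'). Failing loudly on a bad status avoids silently saving an error page and then parsing it.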

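Now to analyze the data:

I want three values: the rank, the keyword, and the search index.

Open the page source:

Each entry is a tr tag, and its individual fields are td cells wrapped inside that tr (note that the tr tags of the first three entries carry class='hideline', while the later ones do not).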

Rank: first td, class='first'

Keyword: second td, class='keyword'

Search index: last td, class='last'
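As a rough illustration of how those three class names map to BeautifulSoup lookups, here is a minimal sketch. The HTML in SAMPLE is made up to match the structure described above; it is not copied from Baidu's page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample row, shaped like the structure described above.
SAMPLE = (
    '<table><tr>'
    '<td class="first"><span>1</span></td>'
    '<td class="keyword"><a href="#">example topic</a></td>'
    '<td class="tc"><a href="#">news</a></td>'
    '<td class="last"><span>12345</span></td>'
    '</tr></table>'
)

soup = BeautifulSoup(SAMPLE, 'html.parser')
row = soup.find('tr')

# class_ (with a trailing underscore) filters on the HTML class attribute.
rank = row.find('td', class_='first').get_text()
keyword = row.find('td', class_='keyword').get_text()
index = row.find('td', class_='last').get_text()

print(rank, keyword, index)
```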

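With the locations of the data I need pinned down, I can start writing code.

First, a function that opens the local HTML file and returns its contents for BeautifulSoup to use: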

def send_html():  # hand the local HTML file to BeautifulSoup in get_pages
    path = 'C:/Code/info_baidu.html'
    # Text mode with the platform's default encoding; pass encoding=... if decoding fails.
    with open(path, 'r') as htmlfile:
        return htmlfile.read()

This way I can test against the local HTML file below instead of hitting Baidu's servers every time.

def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    all_topics = soup.find_all('tr')[1:]  # slice off the first tr

The first tr contains the table header (rank, keyword, related links, search index):

<tr>
        <th width="50" class="first">排名</th>
        <th>关键词</th>
        <th width="30%" class="tc">相关链接</th>
        <th width="20%" class="last">搜索指数</th>
    </tr>

That is not the top-ranked entry, so I used a slice to filter it out.
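The effect of the slice can be checked on a small example. The sketch below uses a made-up table (one header row, two data rows) rather than the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one header row followed by two data rows.
SAMPLE = """
<table>
  <tr><th class="first">rank</th><th>keyword</th><th class="last">index</th></tr>
  <tr><td class="first">1</td><td class="keyword">topic A</td><td class="last">900</td></tr>
  <tr><td class="first">2</td><td class="keyword">topic B</td><td class="last">800</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')
all_rows = soup.find_all('tr')
data_rows = all_rows[1:]  # drop the header row, as in the article

# The header row is built from th cells, so it contains no td at all.
header_has_td = all_rows[0].find('td') is not None

print(len(all_rows), len(data_rows), header_has_td)
```

Because the header row has no td cells, looking up td.first in it would return None, which is one more reason to skip it before extracting values.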

Then pull out the values one by one:

def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    all_topics = soup.find_all('tr')[1:]
    for each_topic in all_topics:
        # print(each_topic)
        topic_times = each_topic.find('td', class_='last').get_text()    # search index
        topic_rank = each_topic.find('td', class_='first').get_text()    # rank
        topic_name = each_topic.find('td', class_='keyword').get_text()  # title
        print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank, topic_name, topic_times))

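In theory this should print everything, but Baidu wanted to make things a little harder for me.

A few problems showed up:

1: AttributeError: 'NoneType' object has no attribute 'get_text'

2: The output format

3: Only one value was printed

As usual, the first problem suggests that some of the results are not Tag objects, so let's test that: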

def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    all_topics = soup.find_all('tr')[1:]
    for each_topic in all_topics:
        # print(each_topic)
        topic_times = each_topic.find('td', class_='last')  # search index
        print(type(topic_times))

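The output is as follows:

We can see that NoneType is mixed in among the first few values, which means find() found no matching td in those rows (I looked at the page source and couldn't tell what causes it; I'll come back once I figure it out!)

So we just need to filter out the NoneType values.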

def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    all_topics = soup.find_all('tr')[1:]
    for each_topic in all_topics:
        # print(each_topic)
        topic_times = each_topic.find('td', class_='last')    # search index
        topic_rank = each_topic.find('td', class_='first')    # rank
        topic_name = each_topic.find('td', class_='keyword')  # title
        # Only handle rows where all three cells were found.
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text()
            topic_name = topic_name.get_text()
            topic_times = topic_times.get_text()
            print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank, topic_name, topic_times))

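The output is as follows:

That solves the first problem: the output works now, and the third problem is fixed along with it (presumably the exception had been stopping the loop right after the first row).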
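The None filtering can be reproduced on a small made-up table. In the sketch below, the row with a single colspan cell is a hypothetical stand-in for whatever rows on the real page lack the three cells; parse_rows is an illustrative helper, not a function from the post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the third tr has none of the three expected cells.
SAMPLE = """
<table>
  <tr><th class="first">rank</th><th>keyword</th><th class="last">index</th></tr>
  <tr><td class="first">1</td><td class="keyword">topic A</td><td class="last">900</td></tr>
  <tr><td colspan="3">extra row without the three cells</td></tr>
  <tr><td class="first">2</td><td class="keyword">topic B</td><td class="last">800</td></tr>
</table>
"""


def parse_rows(html):
    """Return (rank, keyword, index) tuples, skipping incomplete rows."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for tr in soup.find_all('tr')[1:]:
        cells = [tr.find('td', class_=name) for name in ('first', 'keyword', 'last')]
        if any(cell is None for cell in cells):
            continue  # find() returned None: this row is not a ranking entry
        results.append(tuple(cell.get_text() for cell in cells))
    return results


rows = parse_rows(SAMPLE)
print(rows)
```

Without the None check, the loop would raise AttributeError on the extra row and never reach the second entry.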

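The second problem remains, though. The messy formatting really bugs me; my guess is that get_text() also returns the surrounding spaces and newlines.

So replace() should take care of it.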

if topic_rank is not None and topic_name is not None and topic_times is not None:
    topic_rank = topic_rank.get_text().replace(' ', '').replace('\n', '')
    topic_name = topic_name.get_text().replace(' ', '').replace('\n', '')
    topic_times = topic_times.get_text().replace(' ', '').replace('\n', '')
    print('Rank: {}, Title: {}, Heat: {}'.format(topic_rank, topic_name, topic_times))

The output is as follows:

Oh nice, that looks much better.

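But the perfectionist in me still isn't happy: the heat (search index) column is too messy.

After some searching (the internet is a powerful thing, haha), I found a fix right away.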

if topic_rank is not None and topic_name is not None and topic_times is not None:
    topic_rank = topic_rank.get_text().replace(' ', '').replace('\n', '')
    topic_name = topic_name.get_text().replace(' ', '').replace('\n', '')
    topic_times = topic_times.get_text().replace(' ', '').replace('\n', '')
    # Center each field; the title is padded with chr(12288), the full-width space.
    tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^7}"
    print(tplt.format(topic_rank, topic_name, topic_times, chr(12288)))

The output is as follows:

The perfectionist in me is finally satisfied, haha.
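The trick in the template is the nested field {1:{3}^15}: argument 3 is used as the fill character when centering argument 1 to a width of 15. Passing chr(12288) (U+3000, the full-width ideographic space) pads Chinese titles with a character that is as wide as the characters themselves, so the columns line up in a monospaced CJK font. A small sketch, with format_row as an illustrative helper name:

```python
FULLWIDTH_SPACE = chr(12288)  # U+3000 IDEOGRAPHIC SPACE


def format_row(rank, name, times):
    """Center each field; pad the (Chinese) name with full-width spaces."""
    tplt = '{0:^4}\t{1:{3}^15}\t{2:^7}'
    return tplt.format(rank, name, times, FULLWIDTH_SPACE)


line = format_row('1', '热点', '12345')
name_field = line.split('\t')[1]

# The name field is always 15 characters: the title plus full-width padding.
print(len(name_field), name_field.count(FULLWIDTH_SPACE))
```

With an ordinary ASCII space as the fill, each padding character would be about half as wide as a Chinese character, and rows with titles of different lengths would drift out of alignment.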

Here is the full code:

from bs4 import BeautifulSoup


def send_html():  # hand the local HTML file to BeautifulSoup in get_pages
    path = 'C:/Code/info_baidu.html'
    with open(path, 'r') as htmlfile:
        return htmlfile.read()


def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    all_topics = soup.find_all('tr')[1:]  # skip the header row
    for each_topic in all_topics:
        topic_times = each_topic.find('td', class_='last')    # search index
        topic_rank = each_topic.find('td', class_='first')    # rank
        topic_name = each_topic.find('td', class_='keyword')  # title
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ', '').replace('\n', '')
            topic_name = topic_name.get_text().replace(' ', '').replace('\n', '')
            topic_times = topic_times.get_text().replace(' ', '').replace('\n', '')
            tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^7}"
            print(tplt.format(topic_rank, topic_name, topic_times, chr(12288)))


if __name__ == '__main__':
    get_pages(send_html())


And here is the full code that scrapes the page directly, without downloading it first:

import requests
from bs4 import BeautifulSoup


def get_html(url, headers):
    r = requests.get(url, headers=headers)
    r.encoding = r.apparent_encoding  # use the encoding detected from the content
    return r.text


def get_pages(html):
    soup = BeautifulSoup(html, 'html.parser')
    all_topics = soup.find_all('tr')[1:]  # skip the header row
    for each_topic in all_topics:
        topic_times = each_topic.find('td', class_='last')    # search index
        topic_rank = each_topic.find('td', class_='first')    # rank
        topic_name = each_topic.find('td', class_='keyword')  # title
        if topic_rank is not None and topic_name is not None and topic_times is not None:
            topic_rank = topic_rank.get_text().replace(' ', '').replace('\n', '')
            topic_name = topic_name.get_text().replace(' ', '').replace('\n', '')
            topic_times = topic_times.get_text().replace(' ', '').replace('\n', '')
            tplt = "Rank: {0:^4}\tTitle: {1:{3}^15}\tHeat: {2:^8}"
            print(tplt.format(topic_rank, topic_name, topic_times, chr(12288)))


def main():
    url = 'http://top.baidu.com/buzz?b=1&fr=20811'
    headers = {'User-Agent': 'Mozilla/5.0'}
    html = get_html(url, headers)
    get_pages(html)


if __name__ == '__main__':
    main()

That's it. Mission accomplished, have a nice day!

Original post: https://www.cnblogs.com/xunhuajun/p/10008867.html

Time: 2024-07-30 08:40:44
