Build a Distributed Crawler in 21 Days: China Weather Network in Practice (Part 4)

4.1. China Weather Network

URL: http://www.weather.com.cn/textFC/hb.shtml

Parser: BeautifulSoup4

Scrape the lowest temperature for every city

import requests
from bs4 import BeautifulSoup
import html5lib  # not called directly; BeautifulSoup needs it installed for the 'html5lib' parser

def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
    }
    # Send the browser-like User-Agent along with the request
    response = requests.get(url, headers=headers)
    text = response.content.decode('utf-8')
    # The html5lib parser is needed here because it fills in missing HTML tags
    soup = BeautifulSoup(text, 'html5lib')
    conMidtab = soup.find('div', class_='conMidtab')
    tables = conMidtab.find_all('table')
    for table in tables:
        # Skip the first two rows of each table (headers)
        trs = table.find_all('tr')[2:]
        for index, tr in enumerate(trs):
            tds = tr.find_all('td')
            city_td = tds[0]
            # In the first data row the first cell is not the city, so use the second cell
            if index == 0:
                city_td = tds[1]
            city = list(city_td.stripped_strings)[0]
            # The second-to-last cell holds the lowest temperature
            temp_td = tds[-2]
            temp = list(temp_td.stripped_strings)[0]
            print({'city': city, 'temp': temp})

def main():
    # One page per region
    url_list = [
        'http://www.weather.com.cn/textFC/hb.shtml',
        'http://www.weather.com.cn/textFC/db.shtml',
        'http://www.weather.com.cn/textFC/hd.shtml',
        'http://www.weather.com.cn/textFC/hz.shtml',
        'http://www.weather.com.cn/textFC/hn.shtml',
        'http://www.weather.com.cn/textFC/xb.shtml',
        'http://www.weather.com.cn/textFC/xn.shtml',
        'http://www.weather.com.cn/textFC/gat.shtml',
    ]
    for url in url_list:
        parse_page(url)

if __name__ == '__main__':
    main()
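
The row handling in parse_page can be tried offline on a small inline snippet. The following is a minimal sketch, not the page's real markup: SAMPLE_HTML and extract_rows are hypothetical names, the sample values are made up, and the layout (two header rows, then a first data row that starts with a province cell spanning several rows) is an assumption inferred from the index checks in the script above. Because the snippet is well formed, it uses the built-in 'html.parser' instead of html5lib.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: two header rows, then one province with two cities.
SAMPLE_HTML = """
<div class="conMidtab">
  <table>
    <tr><td>header row 1</td></tr>
    <tr><td>header row 2</td></tr>
    <tr>
      <td rowspan="2">Hebei</td>
      <td>Shijiazhuang</td><td>Sunny</td><td>30</td><td>18</td><td>details</td>
    </tr>
    <tr>
      <td>Baoding</td><td>Cloudy</td><td>29</td><td>17</td><td>details</td>
    </tr>
  </table>
</div>
"""

def extract_rows(html):
    """Return one {'city', 'temp'} dict per data row, mirroring parse_page."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for table in soup.find("div", class_="conMidtab").find_all("table"):
        # Skip the two header rows, as the original script does.
        for index, tr in enumerate(table.find_all("tr")[2:]):
            tds = tr.find_all("td")
            # Only the first data row carries the extra leading province cell.
            city_td = tds[1] if index == 0 else tds[0]
            city = next(city_td.stripped_strings)
            # The second-to-last cell is taken as the lowest temperature.
            temp = next(tds[-2].stripped_strings)
            results.append({"city": city, "temp": int(temp)})
    return results

rows = extract_rows(SAMPLE_HTML)
print(rows)
```

Indexing the temperature from the end of the row (tds[-2]) keeps the lookup the same whether or not the row has the extra leading cell.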

Visualizing the scraped data

  • Rank the cities by temperature
  • Take the first 10
  • Generate a bar chart

Code:
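
The first two steps can be sketched on their own, without any network access. This is only an illustration: sample_data and coldest are hypothetical names, and the temperatures are made-up values in the same {'city', 'temp'} shape that the script stores in ALL_DATA.

```python
# Hypothetical sample records; the values are invented for illustration.
sample_data = [
    {"city": "Harbin", "temp": -20},
    {"city": "Beijing", "temp": -5},
    {"city": "Guangzhou", "temp": 15},
    {"city": "Lhasa", "temp": -8},
    {"city": "Shanghai", "temp": 3},
]

def coldest(records, n=10):
    """Return the n records with the lowest 'temp', coldest first."""
    return sorted(records, key=lambda r: r["temp"])[:n]

top = coldest(sample_data, n=3)
# Split into two parallel lists, the shape a bar chart expects for its axes.
cities = [r["city"] for r in top]
temps = [r["temp"] for r in top]
print(cities, temps)
```

Unlike list.sort, sorted returns a new list, so the collected data keeps its original order.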

import requests
from bs4 import BeautifulSoup
import html5lib  # not called directly; BeautifulSoup needs it installed for the 'html5lib' parser
# pyecharts 0.5.x API; in pyecharts 1.0 and later, Bar is imported from pyecharts.charts
from pyecharts import Bar

# Collected records, one {'city', 'temp'} dict per city
ALL_DATA = []

def parse_page(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
    }
    # Send the browser-like User-Agent along with the request
    response = requests.get(url, headers=headers)
    text = response.content.decode('utf-8')
    # The html5lib parser is needed here because it fills in missing HTML tags
    soup = BeautifulSoup(text, 'html5lib')
    conMidtab = soup.find('div', class_='conMidtab')
    tables = conMidtab.find_all('table')
    for table in tables:
        # Skip the first two rows of each table (headers)
        trs = table.find_all('tr')[2:]
        for index, tr in enumerate(trs):
            tds = tr.find_all('td')
            city_td = tds[0]
            # In the first data row the first cell is not the city, so use the second cell
            if index == 0:
                city_td = tds[1]
            city = list(city_td.stripped_strings)[0]
            # The second-to-last cell holds the lowest temperature
            temp_td = tds[-2]
            temp = list(temp_td.stripped_strings)[0]
            # Store the temperature as an int so the records sort numerically
            ALL_DATA.append({'city': city, 'temp': int(temp)})

def main():
    # One page per region
    url_list = [
        'http://www.weather.com.cn/textFC/hb.shtml',
        'http://www.weather.com.cn/textFC/db.shtml',
        'http://www.weather.com.cn/textFC/hd.shtml',
        'http://www.weather.com.cn/textFC/hz.shtml',
        'http://www.weather.com.cn/textFC/hn.shtml',
        'http://www.weather.com.cn/textFC/xb.shtml',
        'http://www.weather.com.cn/textFC/xn.shtml',
        'http://www.weather.com.cn/textFC/gat.shtml',
    ]
    for url in url_list:
        parse_page(url)
    # Sort by lowest temperature, ascending, and keep only the first 10
    ALL_DATA.sort(key=lambda data: data['temp'])
    data = ALL_DATA[0:10]
    # Extract the city names and the temperatures as two parallel lists
    cities = [item['city'] for item in data]
    temps = [item['temp'] for item in data]

    chart = Bar('China Lowest-Temperature Ranking')
    chart.add('', cities, temps)
    chart.render('temperature.html')

if __name__ == '__main__':
    main()
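
The script converts each temperature with int(temp), which raises ValueError if a cell does not contain a plain number. Whether the live page ever has such cells is an assumption, not something the original post reports. A minimal sketch of a more forgiving conversion follows; parse_temp is a hypothetical helper and the raw strings are made-up inputs.

```python
def parse_temp(text):
    """Convert a scraped temperature string to int, or None if it is not a number."""
    try:
        return int(text.strip())
    except (ValueError, AttributeError):
        # ValueError: not an integer string; AttributeError: text is not a string.
        return None

# Made-up inputs: two valid values, a placeholder dash, and an empty cell.
raw = ["18", " -3 ", "-", ""]
print([parse_temp(s) for s in raw])
```

In parse_page, rows for which such a helper returns None could be skipped before appending to ALL_DATA, so a single bad cell would not stop the whole run.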

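Result: [Figure: bar chart of the 10 cities with the lowest temperatures, rendered to temperature.html]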

Original post: https://www.cnblogs.com/derek1184405959/p/9403808.html

Time: 2024-08-27 07:50:55
