(1) Scraping Sina Weather with Selenium

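System environment:

Operating system: macOS 10.13.6; Python: 2.7.10

Everything runs inside a virtual environment (the mkvirtualenv and workon commands below come from virtualenvwrapper).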

1. Create the virtual environment:

mkvirtualenv --python=/usr/bin/python python_2

2. Activate the virtual environment:

workon python_2

3. Install Selenium:

pip install Selenium

4. Install geckodriver, the driver Selenium needs to control Firefox:

brew install geckodriver

5. Add the following line to ~/.bash_profile so that geckodriver is on the PATH:

export PATH=$PATH:/usr/local/Cellar/geckodriver/0.22.0/bin

6. Install beautifulsoup4, lxml, and html5lib:

pip install beautifulsoup4

pip install lxml

pip install html5lib
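
Before running the scraper, it can help to confirm that the steps above took effect. The following is a minimal sketch, not part of the original article: the helper names missing_modules and find_on_path are hypothetical, and the code is written to run under both Python 2.7 and Python 3. It checks that the installed packages can be imported and that a geckodriver executable is reachable through PATH.

```python
# Sanity check for the setup steps (helper names are hypothetical).
import importlib
import os


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def find_on_path(program, path=None):
    """Return the full path of `program` if an executable file with that
    name exists in one of the PATH directories, else None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    # Import names of the packages installed in steps 3 and 6.
    wanted = ["selenium", "bs4", "lxml", "html5lib"]
    print("missing modules: %s" % missing_modules(wanted))
    print("geckodriver: %s" % find_on_path("geckodriver"))
```

An empty "missing modules" list together with a geckodriver path suggests the environment is ready. If the path comes back as None, open a new shell (or source ~/.bash_profile) so the change from step 5 is picked up.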

Python code:

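# coding: utf-8
# Python 2.7 script: search Sina Weather for a city and print its forecast.
import sys

reload(sys)                     # Python 2 only: makes setdefaultencoding available again
sys.setdefaultencoding('utf8')  # lets byte-string labels mix with unicode page text

import datetime
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://weather.sina.com.cn")
assert u"新浪" in driver.title  # the page title should contain "Sina"

# Type the city name (长春, Changchun) into the search box and submit it.
elem = driver.find_element_by_id("hd_sh_input")
elem.clear()
elem.send_keys(u"长春")
time.sleep(2)
elem.send_keys(Keys.RETURN)
time.sleep(2)

# The result opens in a second window: close the first one and switch over.
original = driver.current_window_handle
for handle in driver.window_handles:
    if handle != original:
        driver.close()                   # close the first window
        driver.switch_to.window(handle)  # switch to the second window
        break

html_const = driver.page_source
soup = BeautifulSoup(html_const, 'html.parser')

# Each div.blk_fc_c0_i block holds the forecast for one day.
div_tag = soup.find_all("div", class_="blk_fc_c0_i")
for i in div_tag:
    for tag in i.find_all(True):
        # Tags without a class attribute are skipped instead of raising KeyError.
        first_class = (tag.get('class') or [None])[0]
        if first_class == 'wt_fc_c0_i_date':
            # The page gives only month and day, so prepend the current year.
            print "日期：", datetime.date.today().strftime('%Y') + "-" + tag.string  # date
        elif first_class == 'wt_fc_c0_i_temp':
            print "温度：", tag.string      # temperature
        elif first_class == 'wt_fc_c0_i_tip':
            print "风力：", tag.string      # wind
        elif first_class == 'l':
            print "PM5：", tag.string
        elif first_class == 'r':
            print "空气质量：", tag.string  # air quality
    print "________________"

driver.quit()  # close the browser and end the geckodriver session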
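
The script pauses with fixed time.sleep(2) calls, which can be too short on a slow connection and wasteful on a fast one. A common alternative is to poll for a condition with a timeout. The sketch below shows the idea with a standard-library-only helper; wait_until is a hypothetical name, not something from the original article, and the code runs under both Python 2.7 and Python 3. Selenium also ships its own WebDriverWait class in selenium.webdriver.support.ui, which may be preferable in a real script.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Call `predicate` repeatedly until it returns a truthy value.

    Returns that value, or raises RuntimeError once `timeout` seconds
    have passed without success.
    """
    deadline = time.time() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.time() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Hypothetical usage with the script's `driver`, in place of the second
# time.sleep(2): wait until the search result has opened its new window.
#     wait_until(lambda: len(driver.window_handles) > 1)
```

The predicate is always called at least once, so a condition that already holds returns immediately without sleeping.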

Output of the weather script (labels: 日期 = date, 温度 = temperature, 风力 = wind, 空气质量 = air quality):

日期： 2018-09-30
温度： 15°C / 7°C
风力：北风 3～4级
PM5： 21
空气质量：优
________________
日期： 2018-10-01
温度： 15°C / 4°C
风力：西北风 3～4级
PM5： 21
空气质量：优
________________
日期： 2018-10-02
温度： 19°C / 7°C
风力：西风小于3级
PM5： 40
空气质量：优
________________
日期： 2018-10-03
温度： 20°C / 8°C
风力：西南风小于3级
PM5： 58
空气质量：良
________________
日期： 2018-10-04
温度： 21°C / 9°C
风力：西南风小于3级
PM5： 57
空气质量：良
________________
日期： 2018-10-05
温度： 22°C / 9°C
风力：西南风小于3级
PM5： 40
空气质量：优
________________
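
The script builds each date by prepending the current year to the month-day text from the page, so a forecast that crosses New Year would get the wrong year, and the temperatures stay as display strings. The sketch below shows one possible way to turn these fields into structured values. The function names resolve_date and parse_temperatures are hypothetical, and the code assumes the formats seen in the output above: dates as MM-DD and temperatures as two numbers, with the first taken to be the daily high.

```python
# -*- coding: utf-8 -*-
import datetime
import re


def resolve_date(month_day, today):
    """Turn an 'MM-DD' string into a date on or shortly after `today`.

    If the month is earlier than today's month, the forecast is assumed
    to have rolled over into the next year.
    """
    month, day = [int(part) for part in month_day.split("-")]
    year = today.year + 1 if month < today.month else today.year
    return datetime.date(year, month, day)


def parse_temperatures(text):
    """Extract (high, low) as integers from text such as '15°C / 7°C'."""
    numbers = [int(n) for n in re.findall(r"-?\d+", text)]
    if len(numbers) != 2:
        raise ValueError("expected two temperatures in %r" % text)
    return numbers[0], numbers[1]
```

For the first day in the output above, resolve_date("09-30", datetime.date(2018, 9, 30)) gives datetime.date(2018, 9, 30) and parse_temperatures("15°C / 7°C") gives (15, 7). A forecast for "01-02" read on 30 December 2018 resolves to 2 January 2019.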

Original article: https://www.cnblogs.com/herosoft/p/9733002.html

Time: 2024-10-26 00:25:52
