Crawling IT Home (IT之家) Industry News

Crawl the site https://it.ithome.com/ityejie/ and follow each article link to its detail page to extract the content.

import json

import requests
from lxml import etree
from pymongo import MongoClient

url = 'https://it.ithome.com/ithome/getajaxdata.aspx'
headers = {
    'authority': 'it.ithome.com',
    'method': 'POST',
    'path': '/ithome/getajaxdata.aspx',
    'scheme': 'https',
    'accept': 'text/html, */*; q=0.01',
    # 'br' is left out: requests can decode Brotli only if a brotli package is installed
    'accept-encoding': 'gzip, deflate',
    'accept-language': 'zh-CN,zh;q=0.9',
    # content-length is not set by hand; requests computes it from the form data
    'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'cookie': 'BAIDU_SSP_lcr=https://www.hao123.com/link/https/?key=http%3A%2F%2Fwww.ithome.com%2F&&monkey=m-kuzhan-group1&c=B329C2F33C91DEACCFAEB1680305F198; Hm_lvt_f2d5cbe611513efcf95b7f62b934c619=1530106766; ASP.NET_SessionId=tyxenfioljanx4xwsvz3s4t4; Hm_lvt_cfebe79b2c367c4b89b285f412bf9867=1530106547,1530115669; BEC=228f7aa5e3abfee5d059195ad34b4137|1530117889|1530109082; Hm_lpvt_f2d5cbe611513efcf95b7f62b934c619=1530273209; Hm_lpvt_cfebe79b2c367c4b89b285f412bf9867=1530273261',
    'origin': 'https://it.ithome.com',
    'referer': 'https://it.ithome.com/ityejie/',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3472.3 Safari/537.36',
    'x-requested-with': 'XMLHttpRequest'
}

client = MongoClient()
db = client['ithome']
collection = db['ithome']
max_page = 1000


# Request one page of the article list and parse every article it links to
def get_page(page):
    formData = {
        'categoryid': '31',
        'type': 'pccategorypage',
        'page': page,
    }
    try:
        r = requests.post(url, data=formData, headers=headers)
        if r.status_code == 200:
            html = r.text
            # The response is a string; parse it into an HTML DOM tree
            text = etree.HTML(html)
            link_list = text.xpath('//h2/a/@href')

            print("Extracting articles from page " + str(page))
            for index, link in enumerate(link_list, start=1):
                print("Parsing article " + str(index) + " on page " + str(page))
                print("Link: " + link)
                loadpage(link)

    except requests.ConnectionError as e:
        print('Error', e.args)


# Fetch one article by its link and extract its fields
def loadpage(link):
    headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3472.3 Safari/537.36'}

    try:
        response = requests.get(link, headers=headers)
        if response.status_code == 200:
            html = response.text
            # Parse the page
            node = etree.HTML(html)

            # Collect the title, body text and other fields
            ithome = {}

            # xpath() returns a list; it holds a single element here, so take it by index. Title
            ithome['title'] = node.xpath('//*[@id="wrapper"]/div[1]/div[2]/h1')[0].text
            # Publication time (stored under the key 'data')
            ithome['data'] = node.xpath('//*[@id="pubtime_baidu"]')[0].text
            # Text content under the paragraph tags
            # content = node.xpath('//*[@id="paragraph"]/p/text()')
            ithome['content'] = "".join(node.xpath('//*[@id="paragraph"]/p/text()')).strip()
            # content = node.xpath('//*[@id="paragraph"]/p')[1].text
            # Text inside the tag: author
            ithome['author'] = node.xpath('//*[@id="author_baidu"]/strong')[0].text
            # Comment count
            ithome['commentcount'] = node.xpath('//span[@id="commentcount"]')[0].text
            # Note: the comment count does not actually come through
            write_to_file(ithome)
            save_to_mongo(ithome)

    except requests.ConnectionError as e:
        print('Error', e.args)


# Append the article to ithome.json, one JSON object per line
def write_to_file(content):
    with open('ithome.json', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')


# Insert the article into MongoDB (insert_one replaces the removed Collection.insert)
def save_to_mongo(result):
    if collection.insert_one(result).inserted_id:
        print('Saved to Mongo')


if __name__ == '__main__':
    for page in range(1, max_page + 1):
        get_page(page)
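
The detail-page parser indexes every xpath result with [0], so a single missing element raises IndexError and the whole article is lost. One possible guard is sketched below. The helper name first_text is hypothetical and not part of the original script, and the stand-in objects only imitate the .text attribute that lxml elements expose, so that the sketch runs without a network call.

```python
from types import SimpleNamespace


def first_text(nodes, default=None):
    """Return the stripped .text of the first element in nodes, or default.

    first_text is a hypothetical helper, not part of the original script.
    nodes is expected to be the list returned by an xpath() call.
    """
    if not nodes:
        return default
    text = nodes[0].text
    if text is None:
        return default
    return text.strip()


# Stand-ins for lxml elements (assumption: only the .text attribute matters here).
found = [SimpleNamespace(text='  Some title  ')]
missing = []
empty = [SimpleNamespace(text=None)]

print(first_text(found))
print(first_text(missing, default=''))
print(first_text(empty, default='0'))
```

In loadpage, a call such as first_text(node.xpath('//span[@id="commentcount"]'), default='') would then store an empty string instead of failing when the element is absent or has no text.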

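write_to_file appends one JSON object per line to ithome.json, so the file as a whole is not a single JSON document and has to be read back line by line. A minimal sketch of that, assuming a hypothetical helper named read_articles and using made-up sample records in a temporary file rather than the real ithome.json:

```python
import json
import os
import tempfile


def read_articles(path):
    """Yield one dict per non-empty line of a JSON Lines file.

    read_articles is a hypothetical helper, not part of the original script.
    """
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Made-up sample records, written the same way write_to_file writes ithome.json.
sample = [
    {'title': '示例标题', 'author': 'someone'},
    {'title': 'Second article', 'author': 'someone else'},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.json')
    with open(path, 'a', encoding='utf-8') as f:
        for item in sample:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
    articles = list(read_articles(path))

print(len(articles), articles[0]['title'])
```

Because the file is opened in append mode, running the crawler twice writes every article twice; a reader like this is also a convenient place to drop duplicates by title or link.
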
Original article: https://www.cnblogs.com/wanglinjie/p/9246369.html

Time: 2024-10-08 10:01:55
