Impressive! Quietly crawling the company's internal PPT files with Python (do not use for commercial purposes!)

While writing the crawler, I ran into the following error:

WinError 10061 - No connection could be made

Fix (a code-level alternative is sketched after the list):

 1. Open IE Internet Options
 2. Connections -> LAN Settings
 3. Check "Automatically detect settings"
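If you would rather not touch the IE settings, roughly the same effect can be obtained in code by telling urllib explicitly which proxy to use, or to use none at all. A minimal sketch, assuming no proxy is actually needed for the target host (the example proxy address in the comment is a placeholder):

import urllib.request

# An empty ProxyHandler dict disables any system-configured proxy;
# supply e.g. {'https': 'http://proxy.example.com:8080'} if a proxy is required.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
urllib.request.install_opener(opener)

# Every later urllib.request.urlopen() call now goes through this opener.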

The wrapped-up DB operations:

# -*- coding:utf-8 -*-
#__author__ = 'ecaoyng'

import pymysql
import time

class DBOperation:

    def __init__(self, tb_name):
        self.db_host = 'x'
        self.db_port = 3306
        self.db_user = 'x'
        self.db_pwd = 'x'
        self.db_name = 'x'
        self.tb_name = tb_name

    def get_time(self):
        now_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        return now_time
    '''
    set up connection with db
    '''
    def db_conn(self):
        exec_time = self.get_time()
        try:
            conn = pymysql.connect(host=self.db_host, port=self.db_port,
                                   user=self.db_user, passwd=self.db_pwd, db=self.db_name)
            return conn
        except Exception as e:
            print(u'[%s]: Errors during db connection:%s' % (exec_time, e))
            return None
    '''
    set up cursor
    '''
    def db_cursor(self, conn):
        try:
            cur = conn.cursor()
            return cur
        except Exception as e:
            print(e)
            return None

    '''
    db close
    '''
    def db_close(self, cur, conn):
        exec_time = self.get_time()
        cur.close()
        conn.close()
        print(u'[%s]: db closed' % exec_time)

    '''
    db operations
    '''
    def tb_insert_url(self, cur, conn, urls):
        exec_time = self.get_time()
        tb_exist_sql = """CREATE TABLE IF NOT EXISTS """ + self.tb_name + """ (
             URL  VARCHAR(200) NOT NULL
             )"""
        try:
            cur.execute(tb_exist_sql)
            print(u'[%s]: try to create table %s if not exists.' % (exec_time, self.tb_name))
            conn.commit()

            sql_insert_url = 'INSERT INTO ' + self.tb_name + ' VALUES (%s)'
            cur.executemany(sql_insert_url, urls)
            conn.commit()
        except Exception as e:
            print(u'[%s]: Errors during insert into %s:%s' % (exec_time, self.tb_name, e))


if __name__ == '__main__':

    db = DBOperation('ECNSlides')
    db_conn = db.db_conn()
    db_cur = db.db_cursor(db_conn)
    db.db_close(db_cur, db_conn)
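For reference, this is roughly how the class is meant to be driven end to end (the two URLs below are placeholders, not real slide links):

db = DBOperation('ECNSlides')
conn = db.db_conn()
cur = db.db_cursor(conn)
try:
    # tb_insert_url creates the table if needed and bulk-inserts one row per URL
    db.tb_insert_url(cur, conn, ['https://example.com/slide/1', 'https://example.com/slide/2'])
finally:
    db.db_close(cur, conn)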

Below is the crawler itself:

# -*- coding:utf-8 -*-
#__author__ = 'ecaoyng'

from ESlides.src.DBOperation import *
import urllib.request
import re
import time


class ESlidesCrawler:
    def __init__(self):
        self.target_link = 'https://mediabank.ericsson.net/search/slides/group%20function%20%28gf%29'
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
        self.user_headers = {
            'User-Agent': self.user_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            # Note: no 'Accept-Encoding: gzip, deflate, br' here; urllib does not decompress
            # responses, so a gzipped page would break the utf-8 decode in get_page().
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cookie': 'PHPSESSID=57i0onm69eei46g6g23ek05tj2',
            'Host': 'mediabank.ericsson.net',
            'Referer': 'https://mediabank.ericsson.net/'
        }
        self.save_dir = 'C:/Users/ecaoyng/Desktop/PPT/'

    '''
    get local time
    '''
    def get_time(self):
        now_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        return now_time
    '''
    get page links
    '''
    def get_page(self):
        now_time = self.get_time()
        try:
            request = urllib.request.Request(self.target_link, headers=self.user_headers)
            response = urllib.request.urlopen(request)
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib.request.URLError as e:
            print(u'%s Errors during connect to target link:%s' % (now_time, e))
            return None
    '''
    get initial target links
    '''
    def get_links(self):
        now_time = self.get_time()
        page_code = self.get_page()
        if page_code is not None:
            page_links = []
            try:
                pattern = re.compile(
                    r'<li id=.*?>.*?<a href="/media/(.*?)" class="thumb" draggable="true">', re.S)
                items = re.findall(pattern, page_code)
                for item in items:
                    item = '%s%s%s' % ('https://mediabank.ericsson.net/details/', item, '/download/original')
                    page_links.append(item)
                return page_links
            except Exception as e:
                print(u'[%s]: Errors during parser target link:%s' % (now_time, e))
                return None
        else:
            print('page code returns none')
            return None
    '''
    save links into database
    '''
    def save_links(self):
        now_time = self.get_time()
        links = self.get_links()
        print(links)
        if links is None:
            # nothing to insert, and no connection has been opened yet
            print(u'[%s]: URL is None when insert to db' % now_time)
            return
        db = DBOperation('ECNSlides')
        db_conn = db.db_conn()
        db_cur = db.db_cursor(db_conn)
        try:
            print(u'[%s]: start to urls insert to db' % now_time)
            db.tb_insert_url(db_cur, db_conn, links)
            print(u'[%s]: write urls insert to db successfully' % now_time)
        finally:
            db.db_close(db_cur, db_conn)

    '''
    download ECN slides with params by http
    '''
    def slides_download_params(self):
        now_time = self.get_time()
        links = self.get_links()
        try:
            for url in links:
                now_time = self.get_time()
                file_pattern = re.compile(
                    r'.*?/(\d+)/download/original$', re.S)
                file_name = re.findall(file_pattern, url)
                file_path = self.save_dir + ''.join(file_name) + '.pptx'

                print('Downloading to %s ...' % file_path)

                save_file = open(file_path, 'wb')
                save_file.write(urllib.request.urlopen(url).read())
                save_file.close()

                # with urllib.request.urlopen(url) as slide:
                #     with open(file_path, 'wb') as outfile:
                #         outfile.write(slide.read())
                #
                #     break
        except Exception as e:
            print(u'[%s]: Errors during download slides: %s.' % (now_time, e))

    '''
    download ECN slides with remote db
    '''
    def slides_download_db(self):
        pass


if __name__ == '__main__':
    crawler = ESlidesCrawler()
    # crawler.save_links()
    crawler.slides_download_params()
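Note that slides_download_db is still just a stub. One possible way to fill it in later, reading the URLs back out of the ECNSlides table and reusing the same file-naming logic, might look like the sketch below (the SELECT statement and the error handling are my assumptions, not part of the original code):

    def slides_download_db(self):
        '''download slides whose URLs were previously saved into the database'''
        db = DBOperation('ECNSlides')
        conn = db.db_conn()
        cur = db.db_cursor(conn)
        try:
            cur.execute('SELECT URL FROM ECNSlides')
            for (url,) in cur.fetchall():
                file_name = re.findall(r'.*?/(\d+)/download/original$', url)
                file_path = self.save_dir + ''.join(file_name) + '.pptx'
                print('Downloading to %s ...' % file_path)
                with urllib.request.urlopen(url) as slide, open(file_path, 'wb') as outfile:
                    outfile.write(slide.read())
        except Exception as e:
            print(u'[%s]: Errors during download from db: %s' % (self.get_time(), e))
        finally:
            db.db_close(cur, conn)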

Then a problem appeared. If you type the download address into the browser, something like

https://mediabank.ericsson.net/details/Organization%20simple/83138/download/original

it works, but when the Python code requests the same address, what comes back is not a pptx file but an HTML file.

To find out exactly what kind of file is being returned, you can do the following:

# reobj = urllib.request.urlopen(url)
# print(type(reobj))
# print(reobj.info())
# print(reobj.getcode())
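Building on that, the downloader could refuse to write anything that is not a real attachment. A small sketch, under the assumption that a genuine slide download never comes back as text/html (fetch_if_not_html is a hypothetical helper, not part of the original script):

import urllib.request

def fetch_if_not_html(url, file_path, headers):
    '''Save url to file_path, but skip it if the server answers with an HTML page.'''
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as resp:
        content_type = resp.info().get('Content-Type', '')
        if 'text/html' in content_type:
            print('Got %s instead of a slide (probably the login page): %s' % (content_type, url))
            return False
        with open(file_path, 'wb') as outfile:
            outfile.write(resp.read())
    return True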

When the download works normally (a zip file in this case), the returned headers look like this:

Content-Type: application/x-zip-compressed
Last-Modified: Mon, 23 May 2016 07:50:56 GMT
Accept-Ranges: bytes
ETag: "0f075d6c7b4d11:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Wed, 29 Nov 2017 07:07:27 GMT
Connection: close
Content-Length: 55712699

But for what should have been a ppt file, the download instead returned:

Cache-Control: no-cache
Pragma: no-cache
Content-Length: 11743
Content-Type: text/html
Expires: Wed, 29 Nov 2017 07:04:04 GMT
Server: Microsoft-IIS/8.0
Set-Cookie: SMTargetSession=HTTPS%3A%2F%2Ffss%2Eericsson%2Ecom%2Fsiteminderagent%2Fredirectjsp%2Fredirect%2Dinternal%2Ejsp%3FSPID%3DMediabankIntern%26RelayState%3Dhttps%253A%252F%252Fmediabank%2Eericsson%2Enet%252Fdetails%252FOrganization%252520simple%252F83138%252Fdownload%252Foriginal%26SMPORTALURL%3Dhttps%253A%252F%252Ffss%2Eericsson%2Ecom%252Faffwebservices%252Fpublic%252Fsaml2sso%26SAMLTRANSACTIONID%3D176beb36%2Dfeb953b6%2D9a53d42e%2D58810506%2D087b72ac%2Da4e3; path=/
Set-Cookie: ASPSESSIONIDACATSTTS=FOLBNEGCIBMFCPILNEMHOHFN; path=/
X-Powered-By: ASP.NET
X-WAM-LOC: LP2-2
Date: Wed, 29 Nov 2017 07:05:04 GMT
Connection: close
Set-Cookie: BIGipServerWAM_PRD_Login=rd423o00000000000000000000ffff9958f466o50001; path=/

Content-Type: text/html means an HTML file was returned. Opening it shows the company's security authentication (SSO login) page.

So the next thing to explore is whether the crawl can be done by carrying the session cookie.
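One direction worth trying (not yet verified against the SSO flow) is to carry an authenticated session: either let urllib keep whatever cookies the server sets via http.cookiejar, or copy the session cookie out of a browser that has already passed the login page. A rough sketch; the PHPSESSID value is just the placeholder used in the crawler headers above:

import http.cookiejar
import urllib.request

# Option 1: let urllib collect and resend cookies automatically during one session.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
urllib.request.install_opener(opener)

# Option 2: reuse a cookie copied from an already-authenticated browser session.
request = urllib.request.Request(
    'https://mediabank.ericsson.net/details/Organization%20simple/83138/download/original',
    headers={'Cookie': 'PHPSESSID=57i0onm69eei46g6g23ek05tj2'})
response = urllib.request.urlopen(request)
print(response.info().get('Content-Type'))  # should no longer be text/html once authenticated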

Original article: https://www.cnblogs.com/huohuohuo1/p/9060458.html
