Impressive! Quietly crawling the company's internal PPT files with Python (do not use for commercial purposes!)

While writing the crawler, I ran into the following error:

WinError 10061 - No connection could be made

Fix (a code-level alternative is sketched after the list):

 1. Open IE Internet Options
 2. Connections -> LAN Settings
 3. Check "Automatically detect settings"
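If you would rather not touch the IE settings, roughly the same effect can be obtained in code by telling urllib explicitly which proxy to use, or to use none at all. A minimal sketch, assuming no proxy is actually needed for the target host (the example proxy address in the comment is a placeholder):

import urllib.request

# An empty ProxyHandler dict disables any system-configured proxy;
# supply e.g. {'https': 'http://proxy.example.com:8080'} if a proxy is required.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
urllib.request.install_opener(opener)

# Every later urllib.request.urlopen() call now goes through this opener.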

The wrapped-up DB operations:

# -*- coding:utf-8 -*-
#__author__ = 'ecaoyng'

import pymysql
import time

class DBOperation:

    def __init__(self, tb_name):
        self.db_host = 'x'
        self.db_port = 3306
        self.db_user = 'x'
        self.db_pwd = 'x'
        self.db_name = 'x'
        self.tb_name = tb_name

    def get_time(self):
        now_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        return now_time
    '''
    set up connection with db
    '''
    def db_conn(self):
        exec_time = self.get_time()
        try:
            conn = pymysql.connect(host=self.db_host, port=self.db_port,
                                   user=self.db_user, passwd=self.db_pwd, db=self.db_name)
            return conn
        except Exception as e:
            print(u'[%s]: Errors during db connection:%s' % (exec_time, e))
            return None
    '''
    set up cursor
    '''
    def db_cursor(self, conn):
        try:
            cur = conn.cursor()
            return cur
        except Exception as e:
            print(e)
            return None

    '''
    db close
    '''
    def db_close(self, cur, conn):
        exec_time = self.get_time()
        cur.close()
        conn.close()
        print(u'[%s]: db closed' % exec_time)

    '''
    db operations
    '''
    def tb_insert_url(self, cur, conn, urls):
        exec_time = self.get_time()
        tb_exist_sql = """CREATE TABLE IF NOT EXISTS """ + self.tb_name + """ (
             URL  VARCHAR(200) NOT NULL
             )"""
        try:
            cur.execute(tb_exist_sql)
            print(u'[%s]: try to create table %s if not exists.' % (exec_time, self.tb_name))
            conn.commit()

            sql_insert_url = 'INSERT INTO ' + self.tb_name + ' VALUES (%s)'
            cur.executemany(sql_insert_url, urls)
            conn.commit()
        except Exception as e:
            print(u'[%s]: Errors during insert into %s:%s' % (exec_time, self.tb_name, e))


if __name__ == '__main__':

    db = DBOperation('ECNSlides')
    db_conn = db.db_conn()
    db_cur = db.db_cursor(db_conn)
    db.db_close(db_cur, db_conn)
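For reference, this is roughly how the class is meant to be driven end to end (the two URLs below are placeholders, not real slide links):

db = DBOperation('ECNSlides')
conn = db.db_conn()
cur = db.db_cursor(conn)
try:
    # tb_insert_url creates the table if needed and bulk-inserts one row per URL
    db.tb_insert_url(cur, conn, ['https://example.com/slide/1', 'https://example.com/slide/2'])
finally:
    db.db_close(cur, conn)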

Below is the crawler itself:

# -*- coding:utf-8 -*-
#__author__ = 'ecaoyng'

from ESlides.src.DBOperation import *
import urllib.request
import re
import time


class ESlidesCrawler:
    def __init__(self):
        self.target_link = 'https://mediabank.ericsson.net/search/slides/group%20function%20%28gf%29'
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
        self.user_headers = {
            'User-Agent': self.user_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            # Note: no 'Accept-Encoding: gzip, deflate, br' here; urllib does not decompress
            # responses, so a gzipped page would break the utf-8 decode in get_page().
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cookie': 'PHPSESSID=57i0onm69eei46g6g23ek05tj2',
            'Host': 'mediabank.ericsson.net',
            'Referer': 'https://mediabank.ericsson.net/'
        }
        self.save_dir = 'C:/Users/ecaoyng/Desktop/PPT/'

    '''
    get local time
    '''
    def get_time(self):
        now_time = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        return now_time
    '''
    get page links
    '''
    def get_page(self):
        now_time = self.get_time()
        try:
            request = urllib.request.Request(self.target_link, headers=self.user_headers)
            response = urllib.request.urlopen(request)
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib.request.URLError as e:
            print(u'%s Errors during connect to target link:%s' % (now_time, e))
            return None
    '''
    get initial target links
    '''
    def get_links(self):
        now_time = self.get_time()
        page_code = self.get_page()
        if page_code is not None:
            page_links = []
            try:
                pattern = re.compile(
                    r'<li id=.*?>.*?<a href="/media/(.*?)" class="thumb" draggable="true">', re.S)
                items = re.findall(pattern, page_code)
                for item in items:
                    item = '%s%s%s' % ('https://mediabank.ericsson.net/details/', item, '/download/original')
                    page_links.append(item)
                return page_links
            except Exception as e:
                print(u'[%s]: Errors during parser target link:%s' % (now_time, e))
                return None
        else:
            print('page code returns none')
            return None
    '''
    save links into database
    '''
    def save_links(self):
        now_time = self.get_time()
        links = self.get_links()
        print(links)
        if links is None:
            # nothing to insert, and no connection has been opened yet
            print(u'[%s]: URL is None when insert to db' % now_time)
            return
        db = DBOperation('ECNSlides')
        db_conn = db.db_conn()
        db_cur = db.db_cursor(db_conn)
        try:
            print(u'[%s]: start to urls insert to db' % now_time)
            db.tb_insert_url(db_cur, db_conn, links)
            print(u'[%s]: write urls insert to db successfully' % now_time)
        finally:
            db.db_close(db_cur, db_conn)

    '''
    download ECN slides with params by http
    '''
    def slides_download_params(self):
        now_time = self.get_time()
        links = self.get_links()
        try:
            for url in links:
                now_time = self.get_time()
                file_pattern = re.compile(
                    r'.*?/(\d+)/download/original$', re.S)
                file_name = re.findall(file_pattern, url)
                file_path = self.save_dir + ''.join(file_name) + '.pptx'

                print('Downloading to %s ...' % file_path)

                save_file = open(file_path, 'wb')
                save_file.write(urllib.request.urlopen(url).read())
                save_file.close()

                # with urllib.request.urlopen(url) as slide:
                #     with open(file_path, 'wb') as outfile:
                #         outfile.write(slide.read())
                #
                #     break
        except Exception as e:
            print(u'[%s]: Errors during download slides: %s.' % (now_time, e))

    '''
    download ECN slides with remote db
    '''
    def slides_download_db(self):
        pass


if __name__ == '__main__':
    crawler = ESlidesCrawler()
    # crawler.save_links()
    crawler.slides_download_params()
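Note that slides_download_db is still just a stub. One possible way to fill it in later, reading the URLs back out of the ECNSlides table and reusing the same file-naming logic, might look like the sketch below (the SELECT statement and the error handling are my assumptions, not part of the original code):

    def slides_download_db(self):
        '''download slides whose URLs were previously saved into the database'''
        db = DBOperation('ECNSlides')
        conn = db.db_conn()
        cur = db.db_cursor(conn)
        try:
            cur.execute('SELECT URL FROM ECNSlides')
            for (url,) in cur.fetchall():
                file_name = re.findall(r'.*?/(\d+)/download/original$', url)
                file_path = self.save_dir + ''.join(file_name) + '.pptx'
                print('Downloading to %s ...' % file_path)
                with urllib.request.urlopen(url) as slide, open(file_path, 'wb') as outfile:
                    outfile.write(slide.read())
        except Exception as e:
            print(u'[%s]: Errors during download from db: %s' % (self.get_time(), e))
        finally:
            db.db_close(cur, conn)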

Then a problem appeared. If you type the download address into the browser, something like

https://mediabank.ericsson.net/details/Organization%20simple/83138/download/original

it works, but when the Python code requests the same address, what comes back is not a pptx file but an HTML file.

To find out exactly what kind of file is being returned, you can do the following:

# reobj = urllib.request.urlopen(url)
# print(type(reobj))
# print(reobj.info())
# print(reobj.getcode())
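Building on that, the downloader could refuse to write anything that is not a real attachment. A small sketch, under the assumption that a genuine slide download never comes back as text/html (fetch_if_not_html is a hypothetical helper, not part of the original script):

import urllib.request

def fetch_if_not_html(url, file_path, headers):
    '''Save url to file_path, but skip it if the server answers with an HTML page.'''
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as resp:
        content_type = resp.info().get('Content-Type', '')
        if 'text/html' in content_type:
            print('Got %s instead of a slide (probably the login page): %s' % (content_type, url))
            return False
        with open(file_path, 'wb') as outfile:
            outfile.write(resp.read())
    return True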

When the download works normally (a zip file in this case), the returned headers look like this:

Content-Type: application/x-zip-compressed
Last-Modified: Mon, 23 May 2016 07:50:56 GMT
Accept-Ranges: bytes
ETag: "0f075d6c7b4d11:0"
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
Date: Wed, 29 Nov 2017 07:07:27 GMT
Connection: close
Content-Length: 55712699

But for what should have been a ppt file, the download instead returned:

Cache-Control: no-cache
Pragma: no-cache
Content-Length: 11743
Content-Type: text/html
Expires: Wed, 29 Nov 2017 07:04:04 GMT
Server: Microsoft-IIS/8.0
Set-Cookie: SMTargetSession=HTTPS%3A%2F%2Ffss%2Eericsson%2Ecom%2Fsiteminderagent%2Fredirectjsp%2Fredirect%2Dinternal%2Ejsp%3FSPID%3DMediabankIntern%26RelayState%3Dhttps%253A%252F%252Fmediabank%2Eericsson%2Enet%252Fdetails%252FOrganization%252520simple%252F83138%252Fdownload%252Foriginal%26SMPORTALURL%3Dhttps%253A%252F%252Ffss%2Eericsson%2Ecom%252Faffwebservices%252Fpublic%252Fsaml2sso%26SAMLTRANSACTIONID%3D176beb36%2Dfeb953b6%2D9a53d42e%2D58810506%2D087b72ac%2Da4e3; path=/
Set-Cookie: ASPSESSIONIDACATSTTS=FOLBNEGCIBMFCPILNEMHOHFN; path=/
X-Powered-By: ASP.NET
X-WAM-LOC: LP2-2
Date: Wed, 29 Nov 2017 07:05:04 GMT
Connection: close
Set-Cookie: BIGipServerWAM_PRD_Login=rd423o00000000000000000000ffff9958f466o50001; path=/

Content-Type: text/html means an HTML file was returned. Opening it shows the company's security authentication (SSO login) page.

So the next thing to explore is whether the crawl can be done by carrying the session cookie.
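One direction worth trying (not yet verified against the SSO flow) is to carry an authenticated session: either let urllib keep whatever cookies the server sets via http.cookiejar, or copy the session cookie out of a browser that has already passed the login page. A rough sketch; the PHPSESSID value is just the placeholder used in the crawler headers above:

import http.cookiejar
import urllib.request

# Option 1: let urllib collect and resend cookies automatically during one session.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
urllib.request.install_opener(opener)

# Option 2: reuse a cookie copied from an already-authenticated browser session.
request = urllib.request.Request(
    'https://mediabank.ericsson.net/details/Organization%20simple/83138/download/original',
    headers={'Cookie': 'PHPSESSID=57i0onm69eei46g6g23ek05tj2'})
response = urllib.request.urlopen(request)
print(response.info().get('Content-Type'))  # should no longer be text/html once authenticated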

Original article: https://www.cnblogs.com/huohuohuo1/p/9060458.html
