Scrapy: crawling Zhihu questions and answers, with asynchronous writes to a MySQL database

 

Python version: Python 2.7

Workflow for crawling Zhihu:

I. Analysis

When you visit the Zhihu homepage (https://www.zhihu.com) without being logged in, you are redirected to https://www.zhihu.com/signup?next=%2F. To crawl Zhihu you therefore have to log in first, and to do that you need to watch which page the login POST (or GET) request is sent to. A packet-capture tool reveals the address that the username/password form data is submitted to.

1. Using a capture tool to inspect where the username and password are POSTed shows that the form data is submitted to 'https://www.zhihu.com/api/v3/oauth/sign_in'.
2. Looking at the content of that request, the POST body contains not only the username and password but also timestamp, lang, client_id, signature and other form fields. We need to understand each field, and the way to do that is to observe how the values change across repeated logins.
3. After logging in several times, only timestamp and signature change; the other values stay constant.
4. Reading the site's JavaScript shows that the signature field is built by combining and encrypting several fields, with the timestamp at its core: each new timestamp produces a different signature.
5. Because generating signature is fairly involved, the shortcut taken here is to copy the timestamp and signature from a successful browser login into the request data and log in with those values (a hedged sketch of computing them at request time is given at the end of this section).
6. With the form filled in, the login POST fails with a "missing captcha ticket" (capsion_ticket) error. The captcha ticket turns out to be a credential required to request the captcha itself, and the capture shows two captcha-related requests: the first returns {'show_captcha': true} and the second returns {'img_base_64': ...}.
7. The {'show_captcha': true} response is the key to obtaining the captcha: the Set-Cookie header of that first response carries the capsion_ticket.
8. Retrying the login then produced an 'ERR_xxx_AUTH_TOKEN' error, raised while fetching the captcha image with the ticket. The capture shows a header of the form Authorization: oauth ce30dasjfsdjhfkiswdnf, so an Authorization header is added to the request headers.

Captcha handling:

- Zhihu serves two kinds of captcha: an English image captcha and a "click the upside-down characters" captcha. When a captcha is required during login, requests go to one of these endpoints:
  - upside-down-character captcha: https://www.zhihu.com/api/v3/oauth/captcha?lang=cn
  - English image captcha: https://www.zhihu.com/api/v3/oauth/captcha?lang=en
- The English captcha is four letters and can be solved with an online recognition service such as YunDaMa.
- For the upside-down-character captcha, each character occupies a known pixel range. When the captcha is clicked during login, a pixel point (x, y) is submitted to https://www.zhihu.com/api/v3/oauth/captcha?lang=cn; if, say, the upside-down characters are the third and fifth ones, there is an acceptable range for each, and any point inside that range lets the login succeed.
- Only the upside-down-character captcha is handled here.
- Only the first page of questions and their answers is crawled.
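As mentioned in step 5, this project simply reuses a timestamp/signature pair copied from the browser. Below is a hedged sketch of generating them at request time instead. It assumes (this is not verified by the capture described above) that signature is an HMAC-SHA1 over the concatenation of grant_type, client_id, source and the millisecond timestamp, keyed with a constant taken from Zhihu's JavaScript; the key value and the helper name make_signature are placeholders for illustration only.

# -*- coding: utf-8 -*-
# Hedged sketch: build timestamp/signature dynamically instead of pasting the
# browser's values into post_data. The HMAC-SHA1 construction and the key are
# assumptions -- check Zhihu's JS before relying on them.
import hmac
import time
from hashlib import sha1

CLIENT_ID = 'c3cef7c66a1843f8b3a9e6a1e3160e20'  # same client_id used by the spider
HMAC_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'   # placeholder: the real key lives in Zhihu's JS


def make_signature(timestamp, grant_type='password', source='com.zhihu.web'):
    """Return the signature form field for a given millisecond timestamp string."""
    message = grant_type + CLIENT_ID + source + timestamp
    return hmac.new(HMAC_KEY, message, sha1).hexdigest()


timestamp = str(int(time.time() * 1000))  # e.g. '1515391742289'
signature = make_signature(timestamp)
# These two values would then replace the hard-coded 'timestamp' and 'signature'
# entries in post_data.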
II. Creating the Scrapy project

scrapy startproject ZhiHuSpider
scrapy genspider zhihu zhihu.com

III. Code

The code in zhihu.py:
# -*- coding: utf-8 -*-
import base64
import json
import urlparse
import re
from datetime import datetime

import scrapy
from scrapy.loader import ItemLoader

from ..items import ZhiHuQuestionItem, ZhiHuAnswerItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com']
    start_answer_url = "https://www.zhihu.com/api/v4/questions/{}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset={}&sort_by=default"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
        'Referer': 'https://www.zhihu.com',
        'HOST': 'www.zhihu.com',
        'Authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }
    points_list = [[20, 27], [42, 25], [65, 20], [90, 25], [115, 32], [140, 25], [160, 25]]

    def start_requests(self):
        """
        Override the parent class's start_requests() so that the spider starts
        from the login-related captcha URL instead of start_urls.
        :return:
        """
        yield scrapy.Request(
            url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
            callback=self.captcha,
            headers=self.headers,
        )

    def captcha(self, response):
        show_captcha = json.loads(response.body)['show_captcha']
        if show_captcha:
            print 'Captcha required'
            yield scrapy.Request(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                method='PUT',
                headers=self.headers,
                callback=self.shi_bie
            )
        else:
            print 'No captcha required'
            # Log in directly.
            post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
            post_data = {
                'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                'grant_type': 'password',
                'timestamp': '1515391742289',
                'source': 'com.zhihu.web',
                'signature': '6d1d179e50a06d1c17d6e8b5c89f77db34f406ac',
                'username': '',  # account
                'password': '',  # password
                'captcha': '',
                'lang': 'cn',
                'ref_source': 'homepage',
                'utm_source': ''
            }

            yield scrapy.FormRequest(
                url=post_url,
                headers=self.headers,
                formdata=post_data,
                callback=self.index_page
            )

    def shi_bie(self, response):
        try:
            img = json.loads(response.body)['img_base64']
        except Exception, e:
            print 'Failed to read img_base64, reason: %s' % e
        else:
            print 'Got the base64-encoded captcha image'
            # Decode the image and save it locally.
            img = img.encode('utf-8')
            img_data = base64.b64decode(img)
            with open('zhihu_captcha.GIF', 'wb') as f:
                f.write(img_data)

            captcha = raw_input('Enter the positions of the upside-down characters: ')
            if len(captcha) == 2:
                # Two upside-down characters.
                first_char = int(captcha[0]) - 1   # index of the first character in points_list
                second_char = int(captcha[1]) - 1  # index of the second character in points_list
                captcha = '{"img_size":[200,44],"input_points":[%s,%s]}' % (
                    self.points_list[first_char], self.points_list[second_char])
            else:
                # Only one upside-down character.
                first_char = int(captcha[0]) - 1
                captcha = '{"img_size":[200,44],"input_points":[%s]}' % (
                    self.points_list[first_char])

            data = {
                'input_text': captcha
            }
            yield scrapy.FormRequest(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                headers=self.headers,
                formdata=data,
                callback=self.get_result
            )

    def get_result(self, response):
        try:
            yan_zheng_result = json.loads(response.body)['success']
        except Exception, e:
            print 'The captcha POST request failed, reason: {}'.format(e)
        else:
            if yan_zheng_result:
                print 'Captcha verified'
                post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
                post_data = {
                    'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                    'grant_type': 'password',
                    'timestamp': '1515391742289',
                    'source': 'com.zhihu.web',
                    'signature': '6d1d179e50a06d1c17d6e8b5c89f77db34f406ac',
                    'username': '',  # account
                    'password': '',  # password
                    'captcha': '',
                    'lang': 'cn',
                    'ref_source': 'homepage',
                    'utm_source': ''
                }  # the values above are obtained from the packet capture

                yield scrapy.FormRequest(
                    url=post_url,
                    headers=self.headers,
                    formdata=post_data,
                    callback=self.index_page
                )
            else:
                print 'Wrong captcha!'

    def index_page(self, response):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=self.headers
            )

    def parse(self, response):
        """
        Extract every question URL on the homepage and follow it to crawl the
        data on the question detail page.
        :param response:
        :return:
        """
        # /question/19618276/answer/267334062
        all_urls = response.xpath('//a[@data-za-detail-view-element_name="Title"]/@href').extract()
        all_urls = [urlparse.urljoin(response.url, url) for url in all_urls]
        for url in all_urls:
            # https://www.zhihu.com/question/19618276/answer/267334062
            # Extract both the detail URL and the question ID.
            result = re.search('(.*zhihu.com/question/(\d+))', url)
            if result:
                detail_url = result.group(1)
                question_id = result.group(2)
                # Hand the detail URL to the downloader.
                yield scrapy.Request(
                    url=detail_url,
                    headers=self.headers,
                    callback=self.parse_detail_question,
                    meta={
                        'question_id': question_id,
                    }
                )

                # While requesting the detail page, also request the answers for this
                # question ID. Questions and answers live at independent URLs: the
                # answers come from a JSON API that can be requested directly, with no
                # dependency on the question page.
                yield scrapy.Request(
                    # Parameters: question ID and offset. The default offset of 0 starts
                    # from the first answer.
                    url=self.start_answer_url.format(question_id, 0),
                    headers=self.headers,
                    callback=self.parse_detail_answer,
                    meta={
                        'question_id': question_id
                    }
                )

                break

    def parse_detail_question(self, response):
        """
        Parse question data from the detail page: title, description, view count,
        follower count, and so on.
        :param response:
        :return:
        """
        item_loader = ItemLoader(item=ZhiHuQuestionItem(), response=response)
        item_loader.add_value('question_id', response.meta['question_id'])
        item_loader.add_xpath('question_title', '//div[@class="QuestionHeader"]//h1/text()')
        item_loader.add_xpath('question_topic', '//div[@class="QuestionHeader-topics"]//div[@class="Popover"]/div/text()')
        # Some questions have no description.
        item_loader.add_xpath('question_content', '//span[@class="RichText"]/text()')
        item_loader.add_xpath('question_watch_num', '//button[contains(@class, "NumberBoard-item")]//strong/text()')
        item_loader.add_xpath('question_click_num', '//div[@class="NumberBoard-item"]//strong/text()')
        item_loader.add_xpath('question_answer_num', '//h4[@class="List-headerText"]/span/text()')
        item_loader.add_xpath('question_comment_num', '//div[@class="QuestionHeader-Comment"]/button/text()')
        item_loader.add_value('question_url', response.url)
        item_loader.add_value('question_crawl_time', datetime.now())

        question_item = item_loader.load_item()
        yield question_item

    def parse_detail_answer(self, response):
        """
        Parse all answers for a given question ID.
        :param response:
        :return:
        """
        answer_dict = json.loads(response.body)
        is_end = answer_dict['paging']['is_end']
        next_url = answer_dict['paging']['next']

        for answer in answer_dict['data']:
            answer_item = ZhiHuAnswerItem()
            answer_item['answer_id'] = answer['id']
            answer_item['answer_question_id'] = answer['question']['id']
            answer_item['answer_author_id'] = answer['author']['id']
            answer_item['answer_url'] = answer['url']
            answer_item['answer_comment_num'] = answer['comment_count']
            answer_item['answer_praise_num'] = answer['voteup_count']
            answer_item['answer_create_time'] = answer['created_time']
            answer_item['answer_content'] = answer['content']
            answer_item['answer_crawl_time'] = datetime.now()
            answer_item['answer_update_time'] = answer['updated_time']

            yield answer_item

        # If is_end is False, there is another page of answers.
        if not is_end:
            yield scrapy.Request(
                url=next_url,
                headers=self.headers,
                callback=self.parse_detail_answer
            )

The code in items.py:

    

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from datetime import datetime

import scrapy

from utils.common import extract_num


class ZhihuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class ZhiHuQuestionItem(scrapy.Item):
    question_id = scrapy.Field()            # question ID
    question_title = scrapy.Field()         # question title
    question_topic = scrapy.Field()         # question topics
    question_content = scrapy.Field()       # question description
    question_watch_num = scrapy.Field()     # number of followers
    question_click_num = scrapy.Field()     # number of views
    question_answer_num = scrapy.Field()    # total number of answers
    question_comment_num = scrapy.Field()   # number of comments
    question_crawl_time = scrapy.Field()    # crawl time
    question_url = scrapy.Field()           # question detail URL

    def get_insert_sql(self):
        insert_sql = "insert into zhihu_question(question_id, question_title, question_topic, question_content, question_watch_num, question_click_num, question_answer_num, question_comment_num, question_crawl_time, question_url) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE question_id=VALUES(question_id),question_title=VALUES(question_title),question_topic=VALUES(question_topic),question_content=VALUES(question_content),question_watch_num=VALUES(question_watch_num),question_click_num=VALUES(question_click_num),question_answer_num=VALUES(question_answer_num),question_comment_num=VALUES(question_comment_num),question_crawl_time=VALUES(question_crawl_time),question_url=VALUES(question_url)"

        # Normalize the field values for the SQL arguments.
        question_id = str(self['question_id'][0])
        question_title = ''.join(self['question_title'])
        question_topic = ",".join(self['question_topic'])

        try:
            question_content = ''.join(self['question_content'])
        except Exception, e:
            question_content = 'question_content is empty'

        question_watch_num = ''.join(self['question_watch_num']).replace(',', '')
        question_watch_num = extract_num(question_watch_num)

        question_click_num = ''.join(self['question_click_num']).replace(',', '')
        question_click_num = extract_num(question_click_num)
        # e.g. '86 回答'
        question_answer_num = ''.join(self['question_answer_num'])
        question_answer_num = extract_num(question_answer_num)
        # e.g. '100 条评论'
        question_comment_num = ''.join(self['question_comment_num'])
        question_comment_num = extract_num(question_comment_num)

        question_crawl_time = self['question_crawl_time'][0]
        question_url = self['question_url'][0]

        args_tuple = (question_id, question_title, question_topic, question_content, question_watch_num, question_click_num, question_answer_num, question_comment_num, question_crawl_time, question_url)

        return insert_sql, args_tuple


class ZhiHuAnswerItem(scrapy.Item):
    answer_id = scrapy.Field()                  # answer ID (primary key of the zhihu_answer table)
    answer_question_id = scrapy.Field()         # question ID (primary key of the zhihu_question table)
    answer_author_id = scrapy.Field()           # ID of the answering user
    answer_url = scrapy.Field()                 # answer URL
    answer_comment_num = scrapy.Field()         # total number of comments on the answer
    answer_praise_num = scrapy.Field()          # total number of upvotes on the answer
    answer_create_time = scrapy.Field()         # answer creation time
    answer_content = scrapy.Field()             # answer content
    answer_update_time = scrapy.Field()         # answer update time

    answer_crawl_time = scrapy.Field()          # crawl time

    def get_insert_sql(self):
        insert_sql = "insert into zhihu_answer(answer_id, answer_question_id, answer_author_id, answer_url, answer_comment_num, answer_praise_num, answer_create_time, answer_content, answer_update_time, answer_crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE answer_id=VALUES(answer_id),answer_question_id=VALUES(answer_question_id),answer_author_id=VALUES(answer_author_id),answer_url=VALUES(answer_url),answer_comment_num=VALUES(answer_comment_num),answer_praise_num=VALUES(answer_praise_num),answer_create_time=VALUES(answer_create_time),answer_content=VALUES(answer_content),answer_update_time=VALUES(answer_update_time),answer_crawl_time=VALUES(answer_crawl_time)"

        # Prepare the data held in the answer item.
        # fromtimestamp(timestamp) converts a Unix timestamp into a datetime object.
        answer_id = self['answer_id']
        answer_question_id = self['answer_question_id']
        answer_author_id = self['answer_author_id']
        answer_url = self['answer_url']
        answer_comment_num = self['answer_comment_num']
        answer_praise_num = self['answer_praise_num']
        answer_content = self['answer_content']
        answer_create_time = datetime.fromtimestamp(self['answer_create_time'])
        answer_update_time = datetime.fromtimestamp(self['answer_update_time'])
        answer_crawl_time = self['answer_crawl_time']

        args_tuple = (answer_id, answer_question_id, answer_author_id, answer_url, answer_comment_num, answer_praise_num, answer_create_time, answer_content, answer_update_time, answer_crawl_time)

        return insert_sql, args_tuple
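A note on the indexing above: because the question item is filled through an ItemLoader with no output processors, every field arrives as a list of values, which is why get_insert_sql() takes self['question_id'][0] and joins the other fields. A minimal illustration follows; the values are made up, and the import path assumes the ZhiHuSpider project layout created earlier.

# -*- coding: utf-8 -*-
# Illustration only: ItemLoader.add_value() stores each field as a list.
from scrapy.loader import ItemLoader
from ZhiHuSpider.items import ZhiHuQuestionItem

loader = ItemLoader(item=ZhiHuQuestionItem())
loader.add_value('question_id', '19618276')
loader.add_value('question_title', u'How does Scrapy work?')
item = loader.load_item()

print item['question_id']      # ['19618276']
print item['question_id'][0]   # '19618276'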

  

The code in pipelines.py:

    

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

# Asynchronous database writes. execute() and commit() insert data synchronously,
# while scrapy parses pages asynchronously and far faster than the database can
# write; with a large volume of data the synchronous inserts cannot keep up with
# the items being produced, the writes block, and in the worst case the database
# stalls or data is lost.


class ZhihuspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class MySQLTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        args = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DB'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWD'],
            charset=settings['MYSQL_CHARSET'],
            cursorclass=MySQLdb.cursors.DictCursor
        )
        # Create a connection pool.
        # Argument 1: the driver used to connect to MySQL.
        # Argument 2: the connection parameters (host, port, user, ...).
        dbpool = adbapi.ConnectionPool('MySQLdb', **args)
        return cls(dbpool)

    def process_item(self, item, spider):
        # runInteraction() performs the insert asynchronously: it hands insert_sql to
        # one of the threads in the pool, which executes the actual insert.
        query = self.dbpool.runInteraction(self.insert, item)
        # addErrback(): called when the asynchronous write fails.
        query.addErrback(self.handle_error, item)
        return item

    def handle_error(self, failure, item):
        print 'Insert failed, reason: {}, item: {}'.format(failure, item)

    def insert(self, cursor, item):
        # With several tables, each table's data finishes parsing at an unpredictable
        # time; questions and answers will not reach the pipeline at the same moment,
        # so this method cannot hard-code one execute() call per table.
        # Solution: each Item class builds its own SQL via get_insert_sql().
        insert_sql, args = item.get_insert_sql()
        cursor.execute(insert_sql, args)

 

The code in settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for ZhiHuSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'ZhiHuSpider'

SPIDER_MODULES = ['ZhiHuSpider.spiders']
NEWSPIDER_MODULE = 'ZhiHuSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'ZhiHuSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
# }

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ZhiHuSpider.middlewares.ZhihuspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ZhiHuSpider.middlewares.ZhihuspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'ZhiHuSpider.pipelines.ZhihuspiderPipeline': 300,
    'ZhiHuSpider.pipelines.MySQLTwistedPipeline': 1,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

MYSQL_HOST = 'localhost'  # MySQL host
MYSQL_DB = ''      # database name
MYSQL_USER = ''    # database user
MYSQL_PASSWD = ''  # password
MYSQL_CHARSET = 'utf8'
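The INSERT ... ON DUPLICATE KEY UPDATE statements in items.py assume that a zhihu_question table and a zhihu_answer table already exist, with question_id and answer_id as primary keys (otherwise the ON DUPLICATE KEY clause never triggers). The original post does not include the schema, so the script below is only a hedged sketch of what it might look like: the column types, the database name 'zhihu' and the connection credentials are assumptions to adjust to your own setup.

# -*- coding: utf-8 -*-
# Hedged sketch: create the two tables the pipeline writes to.
# Column names come from the INSERT statements in items.py; the types are guesses.
import MySQLdb

DDL_QUESTION = """
CREATE TABLE IF NOT EXISTS zhihu_question (
    question_id          BIGINT NOT NULL PRIMARY KEY,
    question_title       VARCHAR(255),
    question_topic       VARCHAR(255),
    question_content     TEXT,
    question_watch_num   INT,
    question_click_num   INT,
    question_answer_num  INT,
    question_comment_num INT,
    question_crawl_time  DATETIME,
    question_url         VARCHAR(255)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
"""

DDL_ANSWER = """
CREATE TABLE IF NOT EXISTS zhihu_answer (
    answer_id          BIGINT NOT NULL PRIMARY KEY,
    answer_question_id BIGINT,
    answer_author_id   VARCHAR(64),
    answer_url         VARCHAR(255),
    answer_comment_num INT,
    answer_praise_num  INT,
    answer_create_time DATETIME,
    answer_content     LONGTEXT,
    answer_update_time DATETIME,
    answer_crawl_time  DATETIME
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
"""

# Fill in the same credentials used in settings.py.
conn = MySQLdb.connect(host='localhost', user='root', passwd='', db='zhihu', charset='utf8')
cur = conn.cursor()
cur.execute(DDL_QUESTION)
cur.execute(DDL_ANSWER)
conn.commit()
conn.close()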

In addition, a small utility module (a separate Python package, utils) is used to clean the numeric item fields; it is what items.py imports with "from utils.common import extract_num".

The code:

      

import re


def extract_num(value):
    # Return the first integer found in value, e.g. u'86 回答' -> 86.
    result = re.search(re.compile('(\d+)'), value)
    res = int(result.group(1))
    return res
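A quick usage sketch with made-up inputs, mirroring how items.py normalizes the counts scraped from the page. Note that extract_num assumes at least one digit is present; if not, re.search returns None and the .group() call raises AttributeError.

# -*- coding: utf-8 -*-
from utils.common import extract_num

print extract_num(u'86 回答')                    # 86
print extract_num(u'100 条评论')                 # 100
print extract_num('2,345'.replace(',', ''))      # 2345 -- commas are stripped first, as in items.py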

 

    

  


 

  

Original article: https://www.cnblogs.com/Chai-zz/p/8407322.html
