Scrapy: the settings configuration file

# -*- coding: utf-8 -*-

# Scrapy settings for tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

# Project name
BOT_NAME = 'tencent'

# Where to look for spiders
SPIDER_MODULES = ['tencent.spiders']
NEWSPIDER_MODULE = 'tencent.spiders'

# Log output level
LOG_LEVEL = "INFO"

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
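With ROBOTSTXT_OBEY = True, Scrapy checks each request against the target site's robots.txt before downloading. The standard library's urllib.robotparser illustrates the underlying check; this is a minimal sketch independent of Scrapy, with a made-up robots.txt:

```python
from urllib import robotparser

# Parse a tiny robots.txt and ask which URLs a crawler may fetch.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/jobs"))          # True
```

Scrapy does the equivalent via its RobotsTxtMiddleware, dropping disallowed requests instead of raising an error.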

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs

# Download delay before each request to the same site
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16   # max concurrent requests per domain
#CONCURRENT_REQUESTS_PER_IP = 16       # max concurrent requests per IP
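These values don't have to come from settings.py alone: Scrapy resolves settings by priority, where a spider's custom_settings attribute overrides the project's settings.py, which in turn overrides the built-in defaults. A minimal sketch of that precedence using plain dicts (not Scrapy's actual Settings class; the values are illustrative):

```python
# Priority order, lowest first; later dicts win, mirroring how a
# spider's custom_settings override the project's settings.py values.
defaults = {"CONCURRENT_REQUESTS": 16, "DOWNLOAD_DELAY": 0}
project = {"DOWNLOAD_DELAY": 3}            # from settings.py
custom_settings = {"DOWNLOAD_DELAY": 0.5}  # per-spider override

effective = {**defaults, **project, **custom_settings}
print(effective["DOWNLOAD_DELAY"])       # 0.5
print(effective["CONCURRENT_REQUESTS"])  # 16
```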

# Carry cookies from previous responses on subsequent requests
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers. USER_AGENT is already set above;
# a User-Agent entry here would not take effect.
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}
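The number 300 is the pipeline's order (lower runs first, values 0-1000). The class it names only needs a process_item method. A minimal sketch of what tencent.pipelines.TencentPipeline might look like; the "position" field is an assumption for illustration:

```python
class TencentPipeline:
    """Minimal item pipeline: every item yielded by a spider passes through here."""

    def process_item(self, item, spider):
        # Clean one assumed field; real pipelines often also write to a file or DB.
        if "position" in item:
            item["position"] = item["position"].strip()
        return item  # hand the item on to the next enabled pipeline (if any)
```

Returning the item passes it along the pipeline chain; raising scrapy.exceptions.DropItem would discard it instead.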

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True

# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
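The documented AutoThrottle idea: aim for a delay of latency / AUTOTHROTTLE_TARGET_CONCURRENCY, move the current delay halfway toward that target, and clamp the result between the start and max delays. A simplified sketch of that update rule (not Scrapy's exact implementation, which for example also refuses to decrease the delay after error responses):

```python
def next_delay(current, latency, target_concurrency=1.0,
               min_delay=5.0, max_delay=60.0):
    """One AutoThrottle-style adjustment step (simplified sketch)."""
    target = latency / target_concurrency
    new = (current + target) / 2  # move halfway toward the target delay
    return max(min_delay, min(new, max_delay))

print(next_delay(5.0, 10.0))  # 7.5  (slow responses -> back off)
print(next_delay(60.0, 1.0))  # 30.5 (fast responses -> speed up)
```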

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
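Note the special case: HTTPCACHE_EXPIRATION_SECS = 0 means cached responses never expire. The documented freshness rule can be sketched as (an illustration, not Scrapy's actual code):

```python
def is_cache_fresh(age_secs, expiration_secs):
    """0 disables expiration entirely, per the setting's documented meaning."""
    if expiration_secs == 0:
        return True  # never expires
    return age_secs < expiration_secs

print(is_cache_fresh(10_000, 0))  # True  (expiration disabled)
print(is_cache_fresh(120, 60))    # False (older than 60 s)
```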

Original post: https://www.cnblogs.com/Jery-9527/p/10793521.html

Date: 2024-10-10 20:00:24
