Python.Scrapy.11-scrapy-source-code-analysis-part-1

Scrapy Source Code Analysis Series - 1: spider, spidermanager, crawler, cmdline, command

The source code analyzed is version 0.24.6, url: https://github.com/DiamondStudio/scrapy/blob/0.24.6

As the Scrapy source tree on GitHub shows, the subpackages are:

commands, contracts, contrib, contrib_exp, core, http, selector, settings, templates, tests, utils, xlib

The modules are:

_monkeypatches.py, cmdline.py, conf.py, conftest.py, crawler.py, dupefilter.py, exceptions.py, extension.py, interface.py, item.py, link.py, linkextractor.py, log.py, logformatter.py, mail.py, middleware.py, project.py, resolver.py, responsetypes.py, shell.py, signalmanager.py, signals.py, spider.py, spidermanager.py, squeue.py, stats.py, statscol.py, telnet.py, webservice.py

Let's start with the most important modules.

0. Third-party libraries/frameworks that Scrapy depends on

twisted (Scrapy's engine is built on the Twisted asynchronous networking framework)

1. Modules: spider, spidermanager, crawler, cmdline, command

1.1 spider.py spidermanager.py crawler.py

spider.py defines the spider base class. In 0.24 the base class is named Spider, with BaseSpider kept as a deprecated alias. Each spider instance can be bound to only one crawler (its crawler attribute). So what exactly does a crawler do?

crawler.py defines the classes Crawler and CrawlerProcess.

The Crawler class depends on SignalManager, ExtensionManager, and ExecutionEngine, as well as the settings STATS_CLASS, SPIDER_MANAGER_CLASS, and LOG_FORMATTER.

The CrawlerProcess class runs multiple Crawlers sequentially in a single process and kicks off the crawling; it depends on twisted.internet.reactor and twisted.internet.defer. This class comes up again with cmdline.py in section 1.2.

spidermanager.py defines the SpiderManager class, which is used to create and manage all the website-specific spiders.
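
To make the relationship between these classes concrete, here is a minimal sketch, not taken from this article, of driving a single Crawler from a script; it follows the Scrapy 0.24-era "run from a script" pattern, and DemoSpider with its example.com URL is a made-up placeholder.

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.spider import Spider
from scrapy.utils.project import get_project_settings

class DemoSpider(Spider):
    # a hypothetical website-specific spider, the kind SpiderManager manages
    name = "demo"
    start_urls = ["http://example.com"]

    def parse(self, response):
        self.log("visited %s" % response.url)

settings = get_project_settings()
crawler = Crawler(settings)      # wires up SignalManager, stats, the spider manager, etc.
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()              # creates the ExtensionManager and ExecutionEngine
crawler.crawl(DemoSpider())      # a spider instance is bound to exactly one crawler
crawler.start()
log.start()
reactor.run()                    # the Twisted reactor drives the whole crawl

CrawlerProcess automates essentially this sequence for every crawler it manages, which is why cmdline.py only needs to create a single CrawlerProcess.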

1.2 cmdline.py command.py

cmdline.py defines the public function execute(argv=None, settings=None).

The execute() function is the entry point of the scrapy command-line tool, as the installed script shows:

XiaoKL$ cat `which scrapy`
#!/usr/bin/python

# -*- coding: utf-8 -*-
import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())

So this script is a good starting point for digging into the Scrapy source. Below is the execute() function:

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------

    if settings is None:
        settings = get_project_settings()
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ------------------------------------------------------------------

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)

The execute() function mainly does the following: obtains the settings, resolves and validates the scrapy command name, parses the command-line options, and creates the CrawlerProcess object.

The CrawlerProcess object, the settings, and the parsed command-line options are all handed to the ScrapyCommand (or subclass) instance.
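
As a usage sketch (not part of the original article), execute() can also be invoked programmatically with an explicit argv list, which is convenient when stepping through the cmdline code in a debugger; the spider name "demo" is a placeholder. Note that execute() ends in sys.exit(), so it never returns normally.

from scrapy.cmdline import execute

# Equivalent to running "scrapy crawl demo" inside a project directory:
# execute() pops the command name, builds the option parser, attaches the
# settings and a CrawlerProcess to the command object, then runs it.
execute(['scrapy', 'crawl', 'demo'])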

Naturally, we need to look at the module that defines the ScrapyCommand class: command.py.

Subclasses of ScrapyCommand are defined in the scrapy.commands subpackage.

command.py defines the ScrapyCommand class, which serves as the base class for Scrapy commands. Let's take a quick look at the interface/methods that ScrapyCommand provides:

class ScrapyCommand(object):

    requires_project = False
    crawler_process = None

    # default settings to be used for this command instead of global defaults
    default_settings = {}

    exitcode = 0

    def __init__(self):
        self.settings = None  # set in scrapy.cmdline

    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "crawler already set"
        self._crawler = crawler

    @property
    def crawler(self):
        warnings.warn("Command's default `crawler` is deprecated and will be removed. "
            "Use `create_crawler` method to instatiate crawlers.",
            ScrapyDeprecationWarning)

        if not hasattr(self, '_crawler'):
            crawler = self.crawler_process.create_crawler()

            old_start = crawler.start
            self.crawler_process.started = False

            def wrapped_start():
                if self.crawler_process.started:
                    old_start()
                else:
                    self.crawler_process.started = True
                    self.crawler_process.start()

            crawler.start = wrapped_start

            self.set_crawler(crawler)

        return self._crawler

    def syntax(self):
        ...  # body omitted in this excerpt

    def short_desc(self):
        ...

    def long_desc(self):
        ...

    def help(self):
        ...

    def add_options(self, parser):
        ...

    def process_options(self, args, opts):
        ...

    def run(self, args, opts):
        ...  # subclasses must override this

Class attributes of ScrapyCommand:

requires_project: whether the command must be run inside a Scrapy project
crawler_process: the CrawlerProcess object, set in the execute() function of cmdline.py

Key methods of ScrapyCommand to focus on:

crawler (a property): lazily creates and caches the Crawler object via crawler_process.create_crawler().
run(self, args, opts): must be overridden by subclasses; see the sketch below.
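
For illustration, here is a minimal sketch of a ScrapyCommand subclass; the command name, module layout, and COMMANDS_MODULE registration are assumptions about a typical project setup, not something taken from this article.

# hypothetically placed in myproject/commands/hello.py and enabled with
# COMMANDS_MODULE = 'myproject.commands' in the project settings
from scrapy.command import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = False
    default_settings = {'LOG_ENABLED': False}   # per-command settings override

    def syntax(self):
        return "[options]"

    def short_desc(self):
        return "Print a greeting (demo command)"

    def run(self, args, opts):
        # by the time run() is called, cmdline.execute() has already set
        # self.settings and self.crawler_process on this command object
        print("hello from a custom Scrapy command")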
 

Next we will look at a concrete ScrapyCommand subclass (see Python.Scrapy.14-scrapy-source-code-analysis-part-4).

To Be Continued:

Next up for analysis: the modules signals.py, signalmanager.py, project.py, and conf.py.

