Python.Scrapy.14-scrapy-source-code-analysis-part-4

Scrapy Source Code Analysis Series, Part 4: The scrapy.commands Subpackage

The subpackage scrapy.commands defines the subcommands available through the scrapy command: bench, check, crawl, deploy, edit, fetch, genspider, list, parse, runspider, settings, shell, startproject, version, and view. Every subcommand module defines a class named Command that inherits from ScrapyCommand.
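To make that convention concrete, here is a minimal sketch of what such a module looks like, loosely modeled on version.py (simplified, not the verbatim source; the import path of ScrapyCommand varies across Scrapy versions):

    import scrapy
    from scrapy.commands import ScrapyCommand


    class Command(ScrapyCommand):
        """Each module under scrapy.commands exposes exactly one class named Command."""

        def short_desc(self):
            # the one-line help text listed by 'scrapy -h'
            return "Print Scrapy version"

        def run(self, args, opts):
            # the entry point that cmdline.execute() eventually invokes
            print("Scrapy %s" % scrapy.__version__)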

Let's start with the crawl subcommand, which launches a spider.

1. crawl.py

The method to focus on is run(self, args, opts):

    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        crawler = self.crawler_process.create_crawler()         # A
        spider = crawler.spiders.create(spname, **opts.spargs)  # B
        crawler.crawl(spider)                                   # C
        self.crawler_process.start()                            # D

So where does this run() method get called from? Go back to the discussion of _run_print_help() in section "1.2 cmdline.py command.py" of Python.Scrapy.11-scrapy-source-code-analysis-part-1.

A: Create a Crawler object. While the Crawler is being constructed, its instance attribute spiders (a SpiderManager) is created as well, as shown below:

    class Crawler(object):

        def __init__(self, settings):
            self.configured = False
            self.settings = settings
            self.signals = SignalManager(self)
            self.stats = load_object(settings['STATS_CLASS'])(self)
            self._start_requests = lambda: ()
            self._spider = None
            # TODO: move SpiderManager to CrawlerProcess
            spman_cls = load_object(self.settings['SPIDER_MANAGER_CLASS'])
            self.spiders = spman_cls.from_crawler(self)  # self.spiders is a SpiderManager

A Crawler object holds one SpiderManager object, and that SpiderManager manages multiple Spiders.

B: Obtain the Spider object.
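crawler.spiders.create(spname, **opts.spargs) resolves the name given on the command line to a spider class and instantiates it. A simplified sketch of the relevant part of SpiderManager (not the verbatim source):

    class SpiderManager(object):
        """Maps spider names (each spider's `name` attribute) to spider classes."""

        def __init__(self):
            # in the real class this dict is populated by walking the modules
            # listed in the SPIDER_MODULES setting
            self._spiders = {}

        def create(self, spider_name, **spider_kwargs):
            # look up the class registered under this name, then instantiate it
            try:
                spcls = self._spiders[spider_name]
            except KeyError:
                raise KeyError("Spider not found: %s" % spider_name)
            return spcls(**spider_kwargs)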

C: Attach the Crawler object to the Spider (and register the spider with the crawler).
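Note that crawl() does not start anything yet; it only records the spider and its start requests on the crawler, roughly like this (a simplified sketch, not the verbatim source):

    def crawl(self, spider, requests=None):
        # remember which spider this crawler will run; the real work starts
        # later, when Crawler.start() hands the spider to the engine
        assert self._spider is None, 'Spider already attached'
        self._spider = spider
        spider.set_crawler(self)
        if requests is None:
            self._start_requests = spider.start_requests
        else:
            self._start_requests = lambda: requests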

D: The start() method of the CrawlerProcess class is as follows:

    def start(self):
        if self.start_crawling():
            self.start_reactor()

    def start_crawling(self):
        log.scrapy_info(self.settings)
        return self._start_crawler() is not None

    def start_reactor(self):
        if self.settings.getbool('DNSCACHE_ENABLED'):
            reactor.installResolver(CachingThreadedResolver(reactor))
        reactor.addSystemEventTrigger('before', 'shutdown', self.stop)
        reactor.run(installSignalHandlers=False)  # blocking call

    def _start_crawler(self):
        if not self.crawlers or self.stopping:
            return

        name, crawler = self.crawlers.popitem()
        self._active_crawler = crawler
        sflo = log.start_from_crawler(crawler)
        crawler.configure()
        crawler.install()
        crawler.signals.connect(crawler.uninstall, signals.engine_stopped)
        if sflo:
            crawler.signals.connect(sflo.stop, signals.engine_stopped)
        crawler.signals.connect(self._check_done, signals.engine_stopped)
        crawler.start()  # calls Crawler.start()
        return name, crawler

The start() method of the Crawler class is as follows:

    @defer.inlineCallbacks
    def start(self):
        yield defer.maybeDeferred(self.configure)
        if self._spider:
            # this is where the crawler hands the spider to the ExecutionEngine
            yield self.engine.open_spider(self._spider, self._start_requests())
        yield defer.maybeDeferred(self.engine.start)

The ExecutionEngine class will be covered in the analysis of the scrapy.core subpackage.
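The yield defer.maybeDeferred(...) style above is Twisted's inlineCallbacks pattern: each yielded Deferred suspends start() until its result is ready. A minimal self-contained sketch of the pattern, independent of Scrapy's classes:

    from twisted.internet import defer, reactor


    @defer.inlineCallbacks
    def main():
        # maybeDeferred wraps a plain (possibly synchronous) call in a Deferred,
        # so synchronous and asynchronous steps can be yielded uniformly
        result = yield defer.maybeDeferred(lambda: 42)
        print("got", result)
        reactor.stop()

    reactor.callWhenRunning(main)
    reactor.run()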

2. startproject.py

3. How subcommands are loaded

The execute() method in cmdline.py contains these lines:

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)

_get_commands_dict():

    def _get_commands_dict(settings, inproject):
        cmds = _get_commands_from_module('scrapy.commands', inproject)
        cmds.update(_get_commands_from_entry_points(inproject))
        cmds_module = settings['COMMANDS_MODULE']
        if cmds_module:
            cmds.update(_get_commands_from_module(cmds_module, inproject))
        return cmds
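The COMMANDS_MODULE branch is what lets a project ship its own subcommands: point the setting at a package and every Command class found there is merged into the table. A hypothetical example (myproject is a made-up name):

    # myproject/settings.py
    COMMANDS_MODULE = 'myproject.commands'

    # myproject/commands/hello.py -- becomes available as 'scrapy hello'
    from scrapy.commands import ScrapyCommand


    class Command(ScrapyCommand):
        requires_project = True

        def short_desc(self):
            return "Say hello"

        def run(self, args, opts):
            print("hello from a project-defined command")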

_get_commands_from_module():

    def _get_commands_from_module(module, inproject):
        d = {}
        for cmd in _iter_command_classes(module):
            if inproject or not cmd.requires_project:
                cmdname = cmd.__module__.split('.')[-1]
                d[cmdname] = cmd()
        return d
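_iter_command_classes() is the discovery step: it walks every submodule of the given package and yields each class defined there that derives from ScrapyCommand. Roughly (a simplified sketch, not the verbatim source):

    import inspect

    from scrapy.commands import ScrapyCommand
    from scrapy.utils.misc import walk_modules


    def _iter_command_classes(module_name):
        # walk e.g. 'scrapy.commands' and all of its submodules
        for module in walk_modules(module_name):
            for obj in vars(module).values():
                if (inspect.isclass(obj)
                        and issubclass(obj, ScrapyCommand)
                        and obj.__module__ == module.__name__
                        and obj is not ScrapyCommand):
                    yield obj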

To Be Continued

Next up is the settings-related logic: Python.Scrapy.15-scrapy-source-code-analysis-part-5
