Scrapy crawler examples

1. Crawling movie information

http://www.imdb.cn/nowplaying/{num}    # URL pattern of the listing pages

http://www.imdb.cn/title/tt{num}    # detail page of an individual movie

Get each movie's URL and title.

Create a new project

scrapy startproject imdb
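
This generates the usual Scrapy project skeleton (newer versions also add a middlewares.py); roughly:

imdb/
    scrapy.cfg          # project configuration
    imdb/               # the project's Python package
        __init__.py
        items.py        # item definitions (edited below)
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/        # spider modules live here
            __init__.py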

Edit items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy import Item, Field

class ImdbItem(Item):
    # define the fields for your item here like:
    # name = Field()
    # url = Field()      # url
    # title = Field()    # movie title

    video_title = Field()
    video_rating = Field()
    video_name = Field()
    video_alias = Field()
    video_director = Field()
    video_actor = Field()
    video_length = Field()
    video_language = Field()
    video_year = Field()
    video_type = Field()
    video_color = Field()
    video_area = Field()
    video_voice = Field()
    video_summary = Field()
    video_url = Field()
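
An Item behaves like a dict whose keys are restricted to the declared fields, which catches typos in field names early. A quick illustrative check (the values are made up):

from imdb.items import ImdbItem

item = ImdbItem(video_title=u'Example title', video_rating=u'8.0')
print(item['video_title'])       # Example title
item['video_color'] = u'Color'   # fine: the field is declared above
# item['poster'] = u'...'        # would raise KeyError because the field was never declared
print(dict(item))                # Items convert cleanly to plain dicts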

Create the spider file movie.py under the spiders directory

# -*- coding: utf-8 -*-

from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from imdb.items import ImdbItem

class ImdbSpider(CrawlSpider):
    name = 'imdb'
    allowed_domains = ['www.imdb.cn']
    rules = (
        Rule(LinkExtractor(allow=r"/title/tt\d+$"), callback="parse_imdb", follow=True),
    )

    def start_requests(self):
        for i in range(1, 20):
            url = "http://www.imdb.cn/nowplaying/" + str(i)
            yield Request(url=url, callback=self.parse)

    def parse_imdb(self, response):
        item = ImdbItem()
        try:
            item['video_title'] = "".join(response.xpath('//*[@class="fk-3"]/div[@class="hdd"]/h3/text()').extract())
            item['video_rating'] = "".join(
                response.xpath('//*[@class="fk-3"]/div[@class="hdd"]/span/i/text()').extract())
            content = response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li').extract()
            for i in range(0, len(content)):
                if "片名" in content[i]:
                    if i == 0:
                        item['video_name'] = "".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[1]/a/text()').extract())
                if "别名" in content[i]:
                    if i == 1:
                        item['video_alias'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[2]/a/text()').extract())
                if "导演" in content[i]:
                    if i == 1:
                        item['video_director'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[2]/a/text()').extract())
                    elif i == 2:
                        item['video_director'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[3]/a/text()').extract())
                if "主演" in content[i]:
                    if i == 2:
                        item['video_actor'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[3]/a/text()').extract())
                    if i == 3:
                        item['video_actor'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[4]/a/text()').extract())
                if "上映时间" in content[i]:
                    if i == 4:
                        item['video_year'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[5]/a[1]/text()').extract())
                        a = response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[5]/a').extract()
                        length = len(a) - 1
                        try:
                            item['video_color'] = "".join(
                                response.xpath(
                                    '//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[5]/a/text()').extract()[length])
                        except Exception as e:
                            item['video_color'] = ""
                        try:
                            type = "|".join(
                                response.xpath(
                                    '//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[5]/a/text()').extract()[1:length])
                            maohao = type.split(":")
                            if len(maohao) > 0:
                                item['video_type'] = maohao[0]
                            else:
                                item['video_type'] = ""
                        except Exception as e:
                            item['video_type'] = ""
                    if i == 5:
                        item['video_year'] = "".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[6]/a[1]/text()').extract())
                        a = response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[6]/a').extract()
                        length = len(a) - 1
                        try:
                            item['video_color'] = "".join(
                                response.xpath(
                                    '//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[6]/a/text()').extract()[length])
                        except Exception as e:
                            item['video_color'] = ""
                        try:
                            type = "|".join(
                                response.xpath(
                                    '//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[6]/a/text()').extract()[1:length])
                            maohao = type.split(":")
                            if len(maohao) > 0:
                                item['video_type'] = maohao[0]
                            else:
                                item['video_type'] = ""
                        except Exception as e:
                            item['video_type'] = ""

                if "国家" in content[i]:
                    if i == 5:
                        item['video_area'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[6]/a[1]/text()').extract())
                        item['video_voice'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[6]/a[2]/text()').extract())
                    if i == 6:
                        item['video_area'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[7]/a[1]/text()').extract())
                        item['video_voice'] = "|".join(
                            response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[7]/a[2]/text()').extract())
            item['video_length'] = "".join(
                response.xpath(
                    '//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[@class="nolink"]/text()').extract()).replace(
                "&nbsp", "")
            item['video_language'] = "".join(
                response.xpath('//*[@class="fk-3"]/div[@class="bdd clear"]/ul/li[@class="nolink"]/a/text()').extract())
            item['video_summary'] = "".join(
                response.xpath(
                    '//*[@class="fk-4 clear"]/div[@class="bdd clear"]/i/text()').extract()).lstrip().rstrip().replace(
                "<br>", "")
            item['video_url'] = response.url
            yield item
        except Exception as error:
            self.logger.error(error)
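
The items yielded by parse_imdb can be persisted with an item pipeline. Below is a minimal sketch (file name, class name, and output path are illustrative, not from the original post) that writes one JSON object per line; enable it in settings.py with ITEM_PIPELINES = {'imdb.pipelines.JsonLinesPipeline': 300}.

# imdb/pipelines.py -- minimal sketch, assuming the project layout created above
import io
import json


class JsonLinesPipeline(object):

    def open_spider(self, spider):
        # the output file name is just an example
        self.file = io.open('imdb_items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Item objects convert cleanly to dicts for serialization
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + u'\n')
        return item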

Create a launcher file run.py under the spiders directory

vim run.py

# coding:utf-8

from scrapy import cmdline

cmdline.execute("scrapy crawl imdb".split())

2. Limited-depth crawling

Create a new project

scrapy startproject douban

In Scrapy, crawl depth is capped with the DEPTH_LIMIT setting in settings.py, e.g. DEPTH_LIMIT = 5. The depth is measured relative to the initial request URLs (the sketch after the settings below shows how to inspect the depth of each response).

Edit settings.py

DEPTH_LIMIT = 4

# Douban has anti-crawling measures, so add a delay with DOWNLOAD_DELAY

DOWNLOAD_DELAY = 2

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'    # set the User-Agent
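
To check the depth Scrapy assigns to each page, read response.meta['depth'], which the built-in DepthMiddleware maintains; requests that would exceed DEPTH_LIMIT are silently dropped. A small standalone sketch (spider name and start URL are placeholders):

# depth_demo.py -- illustrative only
import scrapy


class DepthDemoSpider(scrapy.Spider):
    name = 'depth_demo'
    start_urls = ['https://movie.douban.com/']   # placeholder start URL, depth 0

    def parse(self, response):
        # DepthMiddleware stores the current depth in response.meta
        self.logger.info('depth=%s url=%s', response.meta.get('depth', 0), response.url)
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)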

items.py

from scrapy import Item, Field

# music
class MusicItem(Item):
    music_name = Field()
    music_alias = Field()
    music_singer = Field()
    music_time = Field()
    music_rating = Field()
    music_votes = Field()
    music_tags = Field()
    music_url = Field()

class MusicReviewItem(Item):
    review_title = Field()
    review_content = Field()
    review_author = Field()
    review_music = Field()
    review_time = Field()
    review_url = Field()

Spider file music.py

# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from douban.items import MusicItem, MusicReviewItem

class ReviewSpider(CrawlSpider):
    name = 'review'
    allowed_domains = ['music.douban.com']
    start_urls = ['https://music.douban.com/subject/1406522/']
    rules = (
        Rule(LinkExtractor(allow=r"/subject/\d+/reviews$")),
        Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?sort=time$")),
        Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?sort=time\&start=\d+$")),
        Rule(LinkExtractor(allow=r"/review/\d+/$"), callback="parse_review", follow=True),
    )

    def parse_review(self, response):
        try:
            item = MusicReviewItem()
            item['review_title'] = "".join(response.xpath('//*[@property="v:summary"]/text()').extract())
            content = "".join(
                response.xpath('//*[@id="link-report"]/div[@property="v:description"]/text()').extract()
            )
            item['review_content'] = content.lstrip().rstrip().replace('\n'," ")
            item['review_author'] = "".join(response.xpath('//*[@property="v:reviewer"]/text()').extract())
            item['review_music'] = "".join(response.xpath('//*[@class="main-hd"]/a[2]/text()').extract())
            item['review_time'] = "".join(response.xpath('//*[@class="main-hd"]/p/text()').extract())
            item['review_url'] = response.url
            yield item
        except Exception as error:
            self.logger.error(error)

Launcher file run.py

# -*- coding: utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl review -o review.json".split())

The -o option exports the scraped items to review.json.
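
The same export can also be configured in settings.py instead of on the command line (the setting names below are the Scrapy 1.x style; newer releases use a FEEDS dict):

# settings.py -- equivalent of "scrapy crawl review -o review.json"
FEED_FORMAT = 'json'
FEED_URI = 'review.json'
FEED_EXPORT_ENCODING = 'utf-8'   # keep Chinese text readable instead of \uXXXX escapes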

3. Combining multiple spiders

Now suppose we need to crawl both music details and music reviews, and both movie details and movie reviews. Does every requirement really need its own project? That would mean four separate projects for music, music reviews, movies, and movie reviews, with a lot of duplicated code that is hard to maintain. Instead, we can keep several spiders in a single project.

Create a new project

scrapy startproject multi

Edit settings.py

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
DOWNLOAD_DELAY = 2

Edit items.py

from scrapy import Item, Field

# music
class MusicItem(Item):
    music_name = Field()
    music_alias = Field()
    music_singer = Field()
    music_time = Field()
    music_rating = Field()
    music_votes = Field()
    music_tags = Field()
    music_url = Field()

# music reviews
class MusicReviewItem(Item):
    review_title = Field()
    review_content = Field()
    review_author = Field()
    review_music = Field()
    review_time = Field()
    review_url = Field()

# movies
class VideoItem(Item):
    video_name = Field()
    video_alias = Field()
    video_actor = Field()
    video_year = Field()
    video_time = Field()
    video_rating = Field()
    video_votes = Field()
    video_tags = Field()
    video_url = Field()
    video_director = Field()
    video_type = Field()
    video_bigtype = Field()
    video_area = Field()
    video_language = Field()
    video_length = Field()
    video_writer = Field()
    video_desc = Field()
    video_episodes = Field()

# movie reviews
class VideoReviewItem(Item):
    review_title = Field()
    review_content = Field()
    review_author = Field()
    review_video = Field()
    review_time = Field()
    review_url = Field()
    
    
Create two spider files under the spiders directory

videospider.py

# -*- coding: utf-8 -*-

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from multi.items import VideoItem, VideoReviewItem

import re

AREA = re.compile(r"制片国家/地区:</span> (.+?)<br>")      # production country/region
ALIAS = re.compile(r"又名:</span> (.+?)<br>")              # alternative titles
LANGUAGE = re.compile(r"语言:</span> (.+?)<br>")           # language
EPISODES = re.compile(r"集数:</span> (.+?)<br>")           # number of episodes
LENGTH = re.compile(r"单集片长:</span> (.+?)<br>")         # per-episode runtime

class VideoSpider(CrawlSpider):
    name = 'video'
    allowed_domains = ['movie.douban.com']
    start_urls = [
        'https://movie.douban.com/tag/',
        'https://movie.douban.com/tag/?view=cloud'
    ]
    rules = (Rule(LinkExtractor(allow=r"/tag/((\d+)|([\u4e00-\u9fa5]+)|(\w+))$")),
             Rule(LinkExtractor(allow=r"/tag/((\d+)|([\u4e00-\u9fa5]+)|(\w+))\?start=\d+\&type=T$")),
             Rule(LinkExtractor(allow=r"/subject/\d+/reviews$")),
             Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?start=\d+$")),
             Rule(LinkExtractor(allow=r"/subject/\d+/$"), callback="parse_video", follow=True),
             Rule(LinkExtractor(allow=r"/review/\d+/$"), callback="parse_review", follow=True),
             )

    def parse_video(self, response):
        item = VideoItem()
        try:
            item["video_url"] = response.url
            item["video_name"] = ''.join(
                response.xpath('//*[@id="content"]/h1/span[@property="v:itemreviewed"]/text()').extract()
            )
            try:
                item["video_year"] = ''.join(
                    response.xpath('//*[@id="content"]/h1/span[@class="year"]/text()').extract()).replace(
                    "(", "").replace(")", ""
                )
            except Exception as e:
                print('Exception:', e)
                item['video_year'] = ''

            introduction = response.xpath('//*[@id="link-report"]/span[@property="v:summary"]/text()').extract()
            if introduction:
                item["video_desc"] = ''.join(introduction).strip().replace("\r\n", " ")
            else:
                item["video_desc"] = ''.join(
                    response.xpath('//*[@id="link-report"]/span/text()').extract()).strip().replace("\r\n", " ")

            item["video_director"] = "|".join(
                response.xpath('//*[@id="info"]/span/span/a[@rel="v:directedBy"]/text()').extract())
            item["video_writer"] = "|".join(
                response.xpath('//*[@id="info"]/span[2]/span[2]/a/text()').extract())

            item["video_actor"] = "|".join(response.xpath("//a[@rel='v:starring']/text()").extract())

            item["video_type"] = "|".join(response.xpath('//*[@id="info"]/span[@property="v:genre"]/text()').extract())

            S = "".join(response.xpath("//div[@id='info']").extract())
            M = AREA.search(S)
            if M is not None:
                item["video_area"] = "|".join([area.strip() for area in M.group(1).split("/")])
            else:
                item['video_area'] = ''

            A = "".join(response.xpath("//div[@id='info']").extract())
            AL = ALIAS.search(A)
            if AL is not None:
                item["video_alias"] = "|".join([alias.strip() for alias in AL.group(1).split("/")])
            else:
                item["video_alias"] = ""

            video_info = "".join(response.xpath("//div[@id='info']").extract())
            language = LANGUAGE.search(video_info)
            episodes = EPISODES.search(video_info)
            length = LENGTH.search(video_info)

            if language is not None:
                item["video_language"] = "|".join([language.strip() for language in language.group(1).split("/")])
            else:
                item['video_language'] = ''
            if length is not None:
                item["video_length"] = "|".join([runtime.strip() for runtime in length.group(1).split("/")])
            else:
                item["video_length"] = "".join(
                    response.xpath('//*[@id="info"]/span[@property="v:runtime"]/text()').extract())

            item['video_time'] = "/".join(
                response.xpath('//*[@id="info"]/span[@property="v:initialReleaseDate"]/text()').extract())
            if episodes is not None:
                item['video_bigtype'] = "电视剧"  # TV series
                item["video_episodes"] = "|".join([ep.strip() for ep in episodes.group(1).split("/")])
            else:
                item['video_bigtype'] = "电影"  # film
                item['video_episodes'] = ''
            item['video_tags'] = "|".join(
                response.xpath('//*[@class="tags"]/div[@class="tags-body"]/a/text()').extract())

            try:
                item['video_rating'] = "".join(response.xpath(
                    '//*[@class="rating_self clearfix"]/strong/text()').extract())
                item['video_votes'] = "".join(response.xpath(
                    '//*[@class="rating_self clearfix"]/div/div[@class="rating_sum"]/a/span/text()').extract())
            except Exception as error:
                item['video_rating'] = '0'
                item['video_votes'] = '0'
                self.logger.error(error)

            yield item

        except Exception as error:
            self.logger.error(error)

    def parse_review(self, response):
        try:
            item = VideoReviewItem()
            item['review_title'] = "".join(response.xpath('//*[@property="v:summary"]/text()').extract())
            content = "".join(
                response.xpath('//*[@id="link-report"]/div[@property="v:description"]/text()').extract())
            item['review_content'] = content.lstrip().rstrip().replace("\n", " ")
            item['review_author'] = "".join(response.xpath('//*[@property = "v:reviewer"]/text()').extract())
            item['review_video'] = "".join(response.xpath('//*[@class="main-hd"]/a[2]/text()').extract())
            item['review_time'] = "".join(response.xpath('//*[@class="main-hd"]/p/text()').extract())
            item['review_url'] = response.url
            yield item
        except Exception as error:
            self.logger.error(error)
            
Create musicspider.py

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from multi.items import MusicItem, MusicReviewItem
import re

class MusicSpider(CrawlSpider):
    name = "music"
    allowed_domains = ['music.douban.com']
    start_urls = [
        'https://music.douban.com/tag/',
        'https://music.douban.com/tag/?view=cloud'
    ]

    rules = (Rule(LinkExtractor(allow=r"/tag/((\d+)|([\u4e00-\u9fa5]+)|(\w+))$")),
             Rule(LinkExtractor(allow=r"/tag/((\d+)|([\u4e00-\u9fa5]+)|(\w+))\?start=\d+\&type=T$")),
             Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?sort=time$")),
             Rule(LinkExtractor(allow=r"/subject/\d+/reviews\?sort=time\&start=\d+$")),
             Rule(LinkExtractor(allow=r"/subject/\d+/$"), callback="parse_music", follow=True),
             Rule(LinkExtractor(allow=r"/review/\d+/$"), callback="parse_review", follow=True),
             )

    def parse_music(self, response):
        item = MusicItem()
        try:
            item['music_name'] = response.xpath('//*[@id="wrapper"]/h1/span/text()').extract()[0]
            content = "".join(response.xpath('//*[@id="info"]').extract())
            info = response.xpath('//*[@id="info"]/span').extract()
            item['music_alias'] = ""
            item['music_singer'] = ""
            item['music_time'] = ""
            for i in range(0, len(info)):
                if "又名" in info[i]:
                    if i == 0:
                        item['music_alias'] = response.xpath('//*[@id="info"]/text()').extract()[1].replace("\xa0", "").replace("\n", "").rstrip()
                    elif i == 1:
                        item['music_alias'] = response.xpath('//*[@id="info"]/text()').extract()[2].replace("\xa0", "").replace("\n", "").rstrip()
                    elif i == 2:
                        item['music_alias'] = response.xpath('//*[@id="info"]/text()').extract()[3].replace("\xa0", "").replace("\n", "").rstrip()
                    else:
                        item['music_alias'] = ""

                if "表演者" in info[i]:
                    if i == 0:
                        item['music_singer'] = "|".join(response.xpath('//*[@id="info"]/span[1]/span/a/text()').extract())
                    elif i == 1:
                        item['music_singer'] = "|".join(
                            response.xpath('//*[@id="info"]/span[2]/span/a/text()').extract())

                    elif i == 2:
                        item['music_singer'] = "|".join(
                            response.xpath('//*[@id="info"]/span[3]/span/a/text()').extract())

                    else:
                        item['music_singer'] = ""
                if "发行时间" in info[i]:
                    nbsp = re.findall(r"<span class=\"pl\">发行时间:</span>(.*?)<br>", content, re.S)
                    item['music_time'] = "".join(nbsp).replace("\xa0", "").replace("\n", "").replace(" ", "")

            try:
                item['music_rating'] = "".join(response.xpath('//*[@class="rating_self clearfix"]/strong/text()').extract())
                item['music_votes'] = "".join(response.xpath('//*[@class="rating_self clearfix"]/div/div[@class="rating_sum"]/a/span/text()').extract())
            except Exception as error:
                item['music_rating'] = '0'
                item['music_votes'] = '0'
                self.logger.error(error)
            item['music_tags'] = "|".join(response.xpath('//*[@id="db-tags-section"]/div/a/text()').extract())
            item['music_url'] = response.url
            yield item
        except Exception as error:
            self.logger.error(error)

    def parse_review(self, response):
        try:
            item = MusicReviewItem()
            item['review_title'] = "".join(response.xpath('//*[@property="v:summary"]/text()').extract())
            content = "".join(
                response.xpath('//*[@id="link-report"]/div[@property="v:description"]/text()').extract()
            )
            item['review_content'] = content.lstrip().rstrip().replace("\n", " ")
            item['review_author'] = "".join(response.xpath('//*[@property = "v:reviewer"]/text()').extract())
            item['review_music'] = "".join(response.xpath('//*[@class="main-hd"]/a[2]/text()').extract())
            item['review_time'] = "".join(response.xpath('//*[@class="main-hd"]/p/text()').extract())
            item['review_url'] = response.url
            yield item
        except Exception as error:
            self.logger.error(error)
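
Because the project now yields four different item classes, a single pipeline can route each class to its own output file. A rough sketch (pipeline and file names are illustrative, not from the original post); enable it in settings.py with ITEM_PIPELINES = {'multi.pipelines.RoutingPipeline': 300}.

# multi/pipelines.py -- route each item class to its own JSON-lines file (sketch)
import io
import json

from multi.items import MusicItem, MusicReviewItem, VideoItem, VideoReviewItem


class RoutingPipeline(object):

    OUTPUT_FILES = {
        MusicItem: 'music.jl',
        MusicReviewItem: 'music_review.jl',
        VideoItem: 'video.jl',
        VideoReviewItem: 'video_review.jl',
    }

    def open_spider(self, spider):
        # prefix with the spider name so concurrent crawls do not clobber each other
        self.handles = {cls: io.open('%s_%s' % (spider.name, name), 'w', encoding='utf-8')
                        for cls, name in self.OUTPUT_FILES.items()}

    def close_spider(self, spider):
        for handle in self.handles.values():
            handle.close()

    def process_item(self, item, spider):
        handle = self.handles.get(type(item))
        if handle is not None:
            handle.write(json.dumps(dict(item), ensure_ascii=False) + u'\n')
        return item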
            
Create the launcher file run.py

# -*- coding: utf-8 -*-
from scrapy import cmdline
cmdline.execute("scrapy crawl music".split())
cmdline.execute("scrapy crawl video".split())

Original source: http://blog.51cto.com/haoyonghui/2061715
