Scrapy project: scraping second-hand housing listings for all of Wuhan from Lianjia (链家)
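The spider below assumes a standard Scrapy project named lianjia, created the usual way (scrapy startproject lianjia, then scrapy genspider lianjia_spider wh.lianjia.com inside the project directory); the project and spider names are taken from the code itself.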

import scrapy
import re
from collections import Counter
from lianjia.items import LianjiaItem

class LianjiaSpiderSpider(scrapy.Spider):
    name = 'lianjia_spider'
    allowed_domains = ['wh.lianjia.com']
    start_urls = ['https://wh.lianjia.com/ershoufang/baibuting/']

    def parse(self, response):
        rsp = response.body.decode("utf-8")  # decoded page source, kept for debugging only

        # every listing sits in an <li class="clear LOGCLICKDATA"> element
        info_list = response.xpath("//div//ul//li[@class='clear LOGCLICKDATA']")

        for i in info_list:
            item = LianjiaItem()

            # 小区 (neighbourhood) name
            item["xiaoqu_name"] = i.xpath('.//div[@class="houseInfo"]//a[@target="_blank"]/text()').extract()[0]
            # xiaoqu_link = i.xpath('.//div[@class="houseInfo"]//@href').extract()[0]

            # listing title
            item["name"] = i.xpath('.//div[@class="info clear"]//a/text()').extract()[0]

            # district / area
            item["area"] = i.xpath('.//div[@class="info clear"]//div[@class="positionInfo"]//a/text()').extract()[0]

            # link to the listing's detail page
            item["link"] = i.xpath(".//div[@class='title']//@href").extract()[0]

            # summary: orientation, decoration, elevator, etc.
            item["summary"] = i.xpath('.//div[@class="houseInfo"]/text()').extract()[0]

            # floor information
            item["floor"] = i.xpath('.//div[@class="info clear"]//div[@class="positionInfo"]/text()').extract()[0]

            # total price; append "万" here to attach the unit
            item["zongjia"] = i.xpath('.//div[@class="info clear"]//div[@class="totalPrice"]//span/text()').extract()[0]

            # unit price
            item["danjia"] = i.xpath('.//div[@class="info clear"]//div[@class="unitPrice"]//span/text()').extract()[0]

            yield item

        # Analysis shows that searching directly under a large district such as Wuchang or Hankou
        # returns at most 30 pages of results, so to crawl everything we have to walk through
        # the link of every sub-area one by one.
        area_list = ["baibuting", "dazhilu", "dijiao", "erqi2", "houhu", "huangpuyongqing", "qianjinjianghan", "sanyanglu", "tazihu", "yucaihuaqiao",
                     "changqinglu", "changfengchangmatou", "changganglu", "taibeixiangganglu", "tangjiadun", "wuguangwansongyuan", "xinhualuwanda", "yangchahu",
                     "baofengchongren", "changfengchangmatou", "cbdxibeihu", "gutian", "hanzhengjie", "jixian2", "wujiashan", "zongguan",
                     "changqinghuayuan", "dongxihuqita", "jinyinhu", "jiangjunlu", "baishazhou", "chuhehanjie", "donghudongting", "jiedaokou", "jiyuqiao", "shuiguohu", "shouyi", "shahu",
                     "tuanjiedadao", "wuchanghuochezhan", "xudong", "yangyuan", "zhongbeilu", "zhongnandingziqiao", "zhuodaoquan", "hongshanqita", "qingshan1", "huquanyangjiawan", "luoshinanlu",
                     "laonanhu", "nanhuwoerma", "xinnanhu", "qilimiao", "sixin", "wangjiawan", "zhongjiacun", "guanxichangzhi", "guangguguangchang", "guanshandadao", "guanggunan", "guanggudong",
                     "huakeda", "jinronggang", "minzudadao", "sanhuannan", "canglongdao", "jiangxiaqita", "miaoshan", "wenhuadadao", "caidianqita", "dunkou",
                     "hankoubei", "huangbeiqita", "panlongcheng", "qianchuan", "xinzhouqita", "yangluo"]

        # counter = Counter(area_list)  # check whether the list contains duplicate entries
        # print(counter)

        # After iterating over every area, also iterate over pages pg0 through pg29; only then
        # is all the data on the site covered, otherwise a large part of the listings is missed.
        for i in area_list:
            for num in range(0, 30):
                yield scrapy.Request("https://wh.lianjia.com/ershoufang/" + i + "/pg" + str(num), callback=self.parse)
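Because these requests are yielded from inside parse, every parsed page re-enqueues the full area-by-page matrix; Scrapy's default duplicate filter discards the repeats, so the crawl still terminates. A tidier alternative (a sketch only, not the original author's code) is to generate the requests once in start_requests; it assumes the area list has been lifted out of parse into a module-level constant AREA_LIST:

    def start_requests(self):
        # AREA_LIST is assumed to hold the same sub-area slugs as area_list above
        for area in AREA_LIST:
            for num in range(0, 30):
                url = "https://wh.lianjia.com/ershoufang/" + area + "/pg" + str(num)
                yield scrapy.Request(url, callback=self.parse)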

There is nothing special about items.py and pipelines.py; writing them in the usual way is enough.
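For reference, a minimal items.py matching the fields the spider fills in might look like this (a sketch reconstructed from the parse method above, not the author's original file):

# lianjia/items.py
import scrapy

class LianjiaItem(scrapy.Item):
    xiaoqu_name = scrapy.Field()  # neighbourhood (小区) name
    name = scrapy.Field()         # listing title
    area = scrapy.Field()         # district / area
    link = scrapy.Field()         # detail-page URL
    summary = scrapy.Field()      # orientation, decoration, elevator, etc.
    floor = scrapy.Field()        # floor information
    zongjia = scrapy.Field()      # total price (万)
    danjia = scrapy.Field()       # unit price

With the item defined, the crawl can be run and exported directly through Scrapy's feed export, for example scrapy crawl lianjia_spider -o wuhan_ershoufang.csv (the output filename is just an example), without writing a custom pipeline at all.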

Original article: https://www.cnblogs.com/cwkcwk/p/9710827.html

