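43. Scraping Lianjia second-hand housing listings with Scrapy (part 1)

Analysis first. Goal: collect second-hand housing data from the Lianjia website.

1. Start with the main second-hand housing listing page, which shows the following: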

url = https://gz.lianjia.com/ershoufang/pg1/

The page reports 27,589 listings in total, but the site only returns 100 pages of 30 listings each, so at most 3,000 listings can be retrieved.
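The arithmetic behind that cap can be checked in a few lines. This is only a sketch built from the figures quoted above; the constant names are made up for illustration.

```python
# Figures quoted from the listing page; the names are illustrative only.
TOTAL_LISTINGS = 27589  # total reported by the site
MAX_PAGES = 100         # pages the site is willing to return
PER_PAGE = 30           # listings per page

reachable = MAX_PAGES * PER_PAGE
coverage = reachable / TOTAL_LISTINGS

print(reachable)          # 3000
print(f"{coverage:.1%}")  # 10.9%
```

So an unfiltered crawl can reach roughly one listing in nine.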

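2. Next, look at what the filter conditions return:

Under 1 million yuan (775 listings): https://gz.lianjia.com/ershoufang/pg1p1/ (p1 is the filter parameter, pg1 is the page parameter). The site returns 26 pages.
1 million to 1.2 million yuan (471 listings): https://gz.lianjia.com/ershoufang/pg1p2/. The site returns 16 pages.

And so on. In other words, the site only lets you view at most 100 pages, i.e. 3,000 listings, and being logged in makes no difference.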
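The page counts above are consistent with 30 listings per page, rounded up and capped at 100 pages. A minimal sketch of that calculation, together with a URL builder for one price bucket, might look like this. The helper names are hypothetical, and the assumption that later pages follow the same pg{N}p{M} pattern as pg1p1 is not confirmed by the observations above.

```python
import math

BASE = "https://gz.lianjia.com/ershoufang/"
PER_PAGE = 30    # listings per page, as observed
MAX_PAGES = 100  # the site's page cap, as observed


def pages_for(count):
    """Pages needed for `count` listings, capped at the site's page limit."""
    return min(math.ceil(count / PER_PAGE), MAX_PAGES)


def filter_page_urls(price_filter, count):
    """Yield the page URLs for one price bucket (assumed pattern pg{N}{filter}/)."""
    for page in range(1, pages_for(count) + 1):
        yield "{}pg{}{}/".format(BASE, page, price_filter)


print(pages_for(775))    # 26
print(pages_for(471))    # 16
print(pages_for(27589))  # 100
print(next(filter_page_urls("p1", 775)))  # https://gz.lianjia.com/ershoufang/pg1p1/
```

Because each bucket shown here stays well under 3,000 listings, crawling bucket by bucket should in principle reach more of the site than the unfiltered view.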

3. The scraping code is below. This is the linjia.py file. One thing to note: the project's settings (settings.py) must set

ROBOTSTXT_OBEY = False, otherwise the page returns no data.
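For reference, the relevant part of settings.py could look like the fragment below; it belongs in the project's settings module, not in the spider file. Only ROBOTSTXT_OBEY comes from the text above. DOWNLOAD_DELAY is an added assumption, included as an optional politeness setting.

```python
# settings.py (fragment)

# Required by this spider: with robots.txt enforcement on, the pages return no data.
ROBOTSTXT_OBEY = False

# Assumption, not part of the original setup: wait one second between requests.
DOWNLOAD_DELAY = 1
```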

# -*- coding: utf-8 -*-
import scrapy

class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['gz.lianjia.com']
    start_urls = ['https://gz.lianjia.com/ershoufang/pg1/']

    def parse(self, response):

        # Collect the detail-page URL of every listing on the current page
        link_urls = response.xpath("//div[@class='info clear']/div[@class='title']/a/@href").extract()
        for link_url in link_urls:
            # print(link_url)
            yield scrapy.Request(url=link_url, callback=self.parse_detail)
        print('*' * 100)

        # Pagination: the site serves pages 1 to 100. Every listing page yields
        # these URLs again; Scrapy's default duplicate filter drops the repeats.
        for i in range(1, 101):
            url = 'https://gz.lianjia.com/ershoufang/pg{}/'.format(i)
            # print(url)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse_detail(self, response):
        # extract_first(default='') returns '' instead of None when a node is
        # missing, so the string concatenation and join below cannot raise.
        title = response.xpath("//div[@class='title']/h1[@class='main']/text()").extract_first(default='')
        print('Title: ' + title)
        dist = response.xpath("//div[@class='areaName']/span[@class='info']/a/text()").extract_first(default='')
        print('District: ' + dist)
        contents = response.xpath("//div[@class='introContent']/div[@class='base']")
        # print(contents)
        house_type = contents.xpath("./div[@class='content']/ul/li[1]/text()").extract_first(default='')
        print('Layout: ' + house_type)
        floor = contents.xpath("./div[@class='content']/ul/li[2]/text()").extract_first(default='')
        print('Floor: ' + floor)
        built_area = contents.xpath("./div[@class='content']/ul/li[3]/text()").extract_first(default='')
        print('Built area: ' + built_area)
        family_structure = contents.xpath("./div[@class='content']/ul/li[4]/text()").extract_first(default='')
        print('Layout structure: ' + family_structure)
        inner_area = contents.xpath("./div[@class='content']/ul/li[5]/text()").extract_first(default='')
        print('Interior area: ' + inner_area)
        architectural_type = contents.xpath("./div[@class='content']/ul/li[6]/text()").extract_first(default='')
        print('Building type: ' + architectural_type)
        house_orientation = contents.xpath("./div[@class='content']/ul/li[7]/text()").extract_first(default='')
        print('Orientation: ' + house_orientation)
        building_structure = contents.xpath("./div[@class='content']/ul/li[8]/text()").extract_first(default='')
        print('Building structure: ' + building_structure)
        decoration_condition = contents.xpath("./div[@class='content']/ul/li[9]/text()").extract_first(default='')
        print('Decoration condition: ' + decoration_condition)
        proportion = contents.xpath("./div[@class='content']/ul/li[10]/text()").extract_first(default='')
        print('Elevator-to-unit ratio: ' + proportion)
        elevator = contents.xpath("./div[@class='content']/ul/li[11]/text()").extract_first(default='')
        print('Elevator: ' + elevator)
        age_limit = contents.xpath("./div[@class='content']/ul/li[12]/text()").extract_first(default='')
        print('Property right term: ' + age_limit)
        # A missing tag yields None rather than an exception, so a default
        # replaces the original try/except.
        house_label = response.xpath("//div[@class='content']/a/text()").extract_first(default='')
        print('Listing tag: ' + house_label)
        # decoration_description = response.xpath("//div[@class='baseattribute clear'][1]/div[@class='content']/text()").extract_first()
        # print('Decoration description: ' + decoration_description)
        # community_introduction = response.xpath("//div[@class='baseattribute clear'][2]/div[@class='content']/text()").extract_first()
        # print('Community introduction: ' + community_introduction)
        # huxing_introduce = response.xpath("//div[@class='baseattribute clear'][3]/div[@class='content']/text()").extract_first()
        # print('Layout introduction: ' + huxing_introduce)
        # selling_point = response.xpath("//div[@class='baseattribute clear'][4]/div[@class='content']/text()").extract_first()
        # print('Key selling points: ' + selling_point)
        # Open the file in append mode: every write goes to the end of the file
        with open('text', 'a', encoding='utf-8') as f:
            f.write('\n'.join(
                [title, dist, house_type, floor, built_area, family_structure, inner_area, architectural_type, house_orientation, building_structure, decoration_condition, proportion, elevator, age_limit, house_label]))
            f.write('\n' + '=' * 50 + '\n')
        print('-' * 100)
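The twelve lookups under the base-info block in parse_detail differ only in the li index, so the XPath strings could be generated from a list of field names. The sketch below is one possible refactoring, not the code used for the run described here; BASE_FIELDS and field_xpaths are hypothetical names, and the field order is taken from the spider above.

```python
# Field names in the order of the <li> items read by parse_detail.
BASE_FIELDS = [
    "house_type", "floor", "built_area", "family_structure", "inner_area",
    "architectural_type", "house_orientation", "building_structure",
    "decoration_condition", "proportion", "elevator", "age_limit",
]


def field_xpaths(fields=BASE_FIELDS):
    """Map each field name to the relative XPath of its <li> (1-based index)."""
    return {
        name: "./div[@class='content']/ul/li[{}]/text()".format(index)
        for index, name in enumerate(fields, start=1)
    }


xpaths = field_xpaths()
print(xpaths["floor"])  # ./div[@class='content']/ul/li[2]/text()
```

Inside parse_detail, the values could then be collected with a single dict comprehension over field_xpaths().items(), calling contents.xpath(...).extract_first(default='') for each entry.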

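4. This run scrapes everything with no filter conditions set, so only 100 pages of data come back. The results are as follows: only 15 fields were collected here, and the rest of the data was not. After scraping 100 pages, the count comes to 2,704 records.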
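A record count like this can be obtained by reading the output file back. The spider writes each record as its fields on separate lines, followed by a line of 50 equals signs, so splitting on that separator line recovers the records. The following is a minimal sketch with a hypothetical parse_records helper; it assumes no field value itself contains a newline.

```python
SEPARATOR = "=" * 50  # the line the spider writes after every record


def parse_records(text):
    """Split the spider's output text into records (lists of field lines)."""
    records = []
    current = []
    for line in text.split("\n"):
        if line == SEPARATOR:
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


# Two made-up records with two fields each, in the spider's output layout.
sample = "a\nb\n" + SEPARATOR + "\n" + "c\n\n" + SEPARATOR + "\n"
records = parse_records(sample)
print(len(records))  # 2
print(records[1])    # ['c', '']
```

To count the real output, pass the contents of the 'text' file (opened with encoding='utf-8') to parse_records and take len() of the result. Empty fields are kept as empty strings, so every record from the spider should have 15 entries.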

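5. I wrote this last week and have not revised or polished it. Later I will organize the filter-condition URLs to collect as much of the site's data as possible.

Original article: https://www.cnblogs.com/lvjing/p/9945613.html

Time: 2024-11-16 20:58:22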
