Writing a Python Crawler to Scrape 58.com (58同城) Second-Hand Trading Data

Scraped about 140,000 records, stored them in MongoDB, and visualized the statistics with the Charts library; one sample chart is shown here as an illustration.

Module 1: Collect the category URL list

from bs4 import BeautifulSoup
import requests, pymongo

main_url = 'http://bj.58.com/sale.shtml'
client = pymongo.MongoClient('localhost', 27017)
tc_58 = client['58tc']
tab_link_list = tc_58['link_list']

web_data = requests.get(main_url)
soup = BeautifulSoup(web_data.text, 'lxml')
sub_menu_link = soup.select('ul.ym-submnu > li > b > a')

link_list = []
count = 0
for link in sub_menu_link:
    link = 'http://bj.58.com' + link.get('href')
    # print(link)
    if link == 'http://bj.58.com/shoujihao/':
        pass            # skip mobile-number listings
    elif link == 'http://bj.58.com/tongxunyw/':
        pass            # skip telecom-service listings
    elif link == 'http://bj.58.com/tiaozao/':
        # the flea-market category appears several times in the submenu; keep only the first occurrence
        count += 1
        if count == 1:
            data = {'link': link}
            link_list.append(data)
    else:
        data = {'link': link}
        link_list.append(data)

for i in link_list:
    tab_link_list.insert(i)
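
As a quick sanity check after running module 1, a query like the sketch below (assuming the same local MongoDB instance and the 58tc database used above) prints every stored category link and the total number of categories:

import pymongo

# connect to the same local MongoDB used above (assumes the default port 27017)
client = pymongo.MongoClient('localhost', 27017)
link_list = client['58tc']['link_list']

# print each stored category URL, e.g. http://bj.58.com/shouji/
for doc in link_list.find({}, {'link': 1, '_id': 0}):
    print(doc['link'])
print('categories stored:', link_list.find({}).count())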

Module 2: Fetch the detail information for each item

from bs4 import BeautifulSoup
import requests, re, pymongo, sys
from multiprocessing import Pool

client = pymongo.MongoClient('localhost', 27017)
tc_58 = client['58tc']
# detail_link = tc_58['detail_link']
tab_link_list = tc_58['link_list']
# tc_58_data = client['58tcData']

def getDetailUrl(page_url, tab):
    url_list = []
    web_data = requests.get(page_url)
    soup = BeautifulSoup(web_data.text, 'lxml')
    detail_url = soup.select('div.infocon > table > tbody > tr > td.t > a[onclick]')

    # collect the detail-page URLs on this list page
    for url in detail_url:
        url_list.append(url.get('href').split('?')[0])

    # insert them into MongoDB, one collection per category ('<tab>_list')
    count = 0
    client = pymongo.MongoClient('localhost', 27017)
    tc_58 = client['58tc']
    tab_list = tc_58[tab + '_list']
    for i in url_list:
        count += 1
        tab_list.insert({'link': i})
    return count

original_price_patt = re.compile('原价:(.+)')
def getInfo(detail_url):
    try:
        web_data = requests.get(detail_url)
        soup = BeautifulSoup(web_data.text, 'lxml')
        title = soup.title.text.strip()
        view_count = soup.select('body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.box_left_top > p > span.look_time')[0].text
        want_count = soup.select('body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.box_left_top > p > span.want_person')[0].text
        current_price = soup.select('body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.info_massege.left > div.price_li > span > i')
        current_price = current_price[0].text if current_price else None
        original_price = soup.select('body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.info_massege.left > div.price_li > span > b')
        original_price = original_price[0].text if original_price else None
        original_price = re.findall(original_price_patt, original_price) if original_price else None
        location = soup.select('body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.info_massege.left > div.palce_li > span > i')[0].text
        tag = soup.select('body > div.content > div > div.box_left > div.info_lubotu.clearfix > div.info_massege.left > div.biaoqian_li')
        tag = list(tag[0].stripped_strings) if tag else None
        seller_name = soup.select('body > div.content > div > div.box_right > div.personal.jieshao_div > div.personal_jieshao > p.personal_name')[0].text
        # level = soup.select('body > div.content > div > div.box_right > div.personal.jieshao_div > div.personal_jieshao > span')
        # level = str(level[0]).split('\n')
        #
        # full_count = 0
        # half_count = 0
        # for j in level:
        #     if '<span class="icon_png "></span>' == j:
        #         full_count += 1
        #     elif '<span class="icon_png smallScore"></span>' == j:
        #         half_count += 1
        # seller rating: count full and half stars
        full_count = len(soup.find_all('span', class_='icon_png '))
        half_count = len(soup.find_all('span', class_='icon_png smallScore'))

        level_count = {'full': full_count, 'half': half_count}
        desc = soup.select('body > div.content > div > div.box_left > div:nth-of-type(3) > div > div > p')
        desc = desc[0].text if desc else None
        data = {
            'title': title,
            'view_count': view_count,
            'want_count': want_count,
            'current_price': current_price,
            'original_price': original_price,
            'location': location,
            'tag': tag,
            'seller_name': seller_name,
            # 'level': level,
            'level_count': level_count,
            'desc': desc,
            'link': detail_url
        }
        return data
    except:
        print(sys.exc_info()[0], sys.exc_info()[1])
        return None
# for i in tab_link_list.find({},{'link':1,'_id':0}):
#     print(i['link'])
#     getDetailUrl(i['link'])

# each category's list view shows at most 70 pages
def insertDetailLin(sub_menu_list):
    patt = re.compile('.+?com/([a-z]+)/')
    tab_list = []
    for i in sub_menu_list.find({}, {'link': 1, '_id': 0}):
    # for i in [{'link': 'http://bj.58.com/shouji/'}]:
        i = i['link']
        sub_menu_name = re.findall(patt, i)[0]
        print(sub_menu_name + ': ', end='')
        url_list = []
        for j in range(1, 71):
            link = i + 'pn' + str(j)
            url_list.append(link)

        cnt = 0
        for k in url_list:
            cnt = cnt + getDetailUrl(k, sub_menu_name)
        print(str(cnt) + ' lines inserted')
        if cnt != 0:
            tab_list.append(sub_menu_name + '_list')
    return tab_list

# for i in tab_link_list.find({},{'link':1,'_id':0}):
#     print(i)

# insertDetailLin(tab_link_list)

allMenCollectionName = tc_58.collection_names()
# allMenCollectionName.remove('detail_link')
allMenCollectionName.remove('link_list')
def insertData(tab_name):
    client = pymongo.MongoClient('localhost', 27017)
    tc_58 = client['58tc']
    tc_58_data = client['58tcDataNew']
    fenLei = tab_name[:-5]              # strip the trailing '_list' to get the category name
    fenLei = tc_58_data[fenLei + '_data']
    tab_name = tc_58[tab_name]
    # print(tab_name)
    for i in tab_name.find({}, {'link': 1, '_id': 0}):
        data = getInfo(i['link'])
        if data:                        # skip detail pages that failed to parse
            fenLei.insert(data)

def getContinuingly(fenlei):
    client = pymongo.MongoClient('localhost', 27017)
    tc_58_data = client['58tcDataNew']
    tc_58 = client['58tc']
    fenlei_data = tc_58_data[fenlei + '_data']
    fenlei_list = tc_58[fenlei + '_list']
    db_urls = [item['link'] for item in fenlei_data.find()]
    index_url = [item['link'] for item in fenlei_list.find()]
    x = set(db_urls)
    y = set(index_url)
    rest_of_urls = y - x                # URLs indexed but not yet scraped
    return list(rest_of_urls)

def startgetContinuingly(fenlei):
    client = pymongo.MongoClient('localhost', 27017)
    tc_58_data = client['58tcDataNew']
    fenLei = tc_58_data[fenlei + '_data']
    # rest_of_urls = getContinuingly('chuang')
    rest_of_urls = getContinuingly(fenlei)
    # print(rest_of_urls)
    for i in rest_of_urls:
        data = getInfo(i)
        if data:
            fenLei.insert(data)

# startgetContinuingly('bijiben')
pool = Pool()
pool.map(insertData, allMenCollectionName)
# pool.map(insertData,['chuang_list'])
# insertData(allMenCollectionName)
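
If a long crawl gets interrupted, getContinuingly and startgetContinuingly above implement the resume logic: they diff the URLs already saved in the '<category>_data' collection against those indexed in '<category>_list' and scrape only the remainder. A minimal sketch of resuming one category (the 'shouji' name here is only an example):

# resume an interrupted crawl for a single category; 'shouji' is just an example name,
# any '<category>_list' collection already filled by insertDetailLin would work
if __name__ == '__main__':
    remaining = getContinuingly('shouji')
    print(len(remaining), 'detail pages still to fetch')
    startgetContinuingly('shouji')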

Module 3: Analysis

from collections import Counter
import pymongo, charts

def getTotalCount(database, host=None, port=None):
    client = pymongo.MongoClient(host, port)
    db = client[database]
    tab_list = db.collection_names()
    # print(tab_list)
    count = 0
    for i in tab_list:
        count = count + db[i].find({}).count()
    print(count)
    return count

# getTotalCount('58tcDataNew')
# 14700

def getAreaByClassify(classify, database='58tcDataNew', host=None, port=None):
    client = pymongo.MongoClient(host, port)
    db = client[database]
    classify = classify + '_data'
    # location_list = [i['location'][3:] if i['location'] != '' and i['location'][:2] == '北京' else None for i in db['bijiben_data'].find(filter={}, projection={'location': 1, '_id': 0})]
    # district names for this category, keeping Beijing listings only
    location_list = [i['location'][3:] for i in db[classify].find(filter={}, projection={'location': 1, '_id': 0})
                     if i['location'] != '' and i['location'][:2] == '北京' and i['location'][3:] != '']
    loc_name = list(set(location_list))
    dic_count = {}
    for i in loc_name:
        dic_count[i] = location_list.count(i)
    return dic_count

# bijiben_area_count = getAreaByClassify(classify='yueqi')
# print(bijiben_area_count)
# danche_area_count = getAreaByClassify(classify='danche')
# sum_area_count = Counter(bijiben_area_count) + Counter(danche_area_count)
# print(sum_area_count)

def myCounter(L, database='58tcDataNew', host=None, port=None):
    client = pymongo.MongoClient(host, port)
    db = client[database]
    tab_list = db.collection_names()
    dic_0 = {}
    for i in tab_list:
        loc = i[:-5] + '_area_count'
        dic_0[loc] = 0

    if not L:
        return Counter(dic_0)
    else:
        return Counter(L[0]) + myCounter(L[1:])

def getAllCount(database='58tcDataNew', host=None, port=None):
    client = pymongo.MongoClient(host, port)
    db = client[database]
    tab_list = db.collection_names()
    dic_all_count = {}
    for i in tab_list:
        dic = getAreaByClassify(i[:-5])
        loc = i[:-5] + '_area_count'
        dic_all_count[loc] = dic

    dic_val = [dic_all_count[x] for x in dic_all_count]
    my = myCounter(dic_val)

    dic_all_count['total_area_count'] = dict(my)
    return dic_all_count

dic_all_count = getAllCount()
# print(dic_all_count['bijiben_area_count'])
# print(dic_all_count['total_area_count'])

tmp_list = []
for i in dic_all_count['total_area_count']:
    data = {
        'name': i,
        'data': [dic_all_count['total_area_count'][i]],
        'type': 'column'
    }
    tmp_list.append(data)

options = {
    'chart'   : {'zoomType': 'xy'},
    'title'   : {'text': '北京58同城二手交易信息发布区域分布图'},
    'subtitle': {'text': '数据来源: 58.com'},
    'xAxis'   : {'categories': ['']},
    'yAxis'   : {'title': {'text': '数量'}},
    'plotOptions': {'column': {'dataLabels': {'enabled': True}}}
    }
charts.plot(tmp_list, show='inline', options=options)
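
Counting districts by pulling every document into Python is fine at this data size, but the same per-district totals could also be computed inside MongoDB. The sketch below is one possible aggregation-pipeline version under the same 58tcDataNew layout (one '<category>_data' collection per category, each document carrying a location field); it groups by the raw location string, so the [3:] district slicing used in getAreaByClassify would still be applied to the keys afterwards.

import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['58tcDataNew']

def area_count_aggregated(classify):
    # keep only Beijing listings and count them per location, server-side
    pipeline = [
        {'$match': {'location': {'$regex': '^北京'}}},
        {'$group': {'_id': '$location', 'count': {'$sum': 1}}},
        {'$sort': {'count': -1}},
    ]
    return {doc['_id']: doc['count'] for doc in db[classify + '_data'].aggregate(pipeline)}

# example: per-location counts for the 'yueqi' category
# print(area_count_aggregated('yueqi'))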

