A Simple Scraper for Guahao (微医网)

1. Scraping Guahao with requests and XPath

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# author: tom
import requests
from lxml import etree
import pymongo

# Spider class for scraping Guahao (微医网)
class DoctorSpider():
    # Initialize the attributes the spider needs
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'}
        # Listing URL (region: nationwide, filter: unrestricted); the page number is appended after "p"
        self.base_url = 'https://www.guahao.com/expert/all/%E5%85%A8%E5%9B%BD/all/%E4%B8%8D%E9%99%90/p'
        self.page_num = 1
        self.info_list = []
        self.client = pymongo.MongoClient(host='127.0.0.1', port=27017)
        self.db = self.client['test']

    # Fetch the HTML of the current page
    def crawl(self):
        print('Crawling page {}'.format(self.page_num))
        url = self.base_url + str(self.page_num)
        res = requests.get(url=url, headers=self.headers).text
        # The nationwide listing has 38 pages
        if self.page_num <= 38:
            self.page_num += 1
        return res

    # Parse the page content
    def parse(self, res):
        tree = etree.HTML(res)
        li_list = tree.xpath('//div[@class="g-doctor-items to-margin"]/ul/li')
        for li in li_list:
            name = li.xpath("./div[2]/a/text()")[0]
            skill = li.xpath("./div[2]/div[1]/p/text()")[0]
            # The extracted text contains many spaces, newlines, and tabs
            skill = skill.replace('\n', '').replace('\r', '').replace(' ', '').strip()
            position = li.xpath("./div[1]/dl/dt/text()")[1]
            position = position.replace('\n', '').strip()
            score = li.xpath("./div[1]/dl/dd/p[3]/span/em/text()")[0]
            num = li.xpath("./div[1]/dl/dd/p[3]/span/i/text()")[0]
            office = li.xpath("./div[1]/dl/dd/p[1]/text()")[0]
            hospital = li.xpath("./div[1]/dl/dd/p[2]/span/text()")[0]
            dic = {
                'name': name,
                'skill': skill,
                'position': position,
                'score': score,
                'num': num,
                'office': office,
                'hospital': hospital,
            }
            self.save(dic)

    # Save one record to MongoDB
    def save(self, dic):
        collection = self.db['weiyiwang']
        # Collection.save() was removed in PyMongo 4; insert_one() replaces it
        collection.insert_one(dic)

    # Entry point: crawl and parse every page
    def run(self):
        while self.page_num <= 38:
            response = self.crawl()
            self.parse(response)

if __name__ == '__main__':
    doctor = DoctorSpider()
    doctor.run()
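
The parse method indexes every XPath result with [0], which raises IndexError as soon as one doctor entry lacks a field, and its cleanup step does not remove tabs even though the comment mentions them. The sketch below is one possible way to harden that step. The helper names clean_text and first_text are hypothetical and not part of the original post, and the sample string is made up for illustration.

```python
def clean_text(text):
    """Remove every whitespace character (spaces, tabs, newlines) from text."""
    # str.split() with no arguments splits on any run of whitespace,
    # so joining the pieces drops spaces, tabs, \r and \n in one pass.
    return ''.join(text.split())


def first_text(results, default=''):
    """Return the first XPath text result, cleaned, or default if nothing matched."""
    # An XPath query returns an empty list when the node is missing,
    # so check for that instead of indexing with [0] directly.
    if not results:
        return default
    return clean_text(results[0])


if __name__ == '__main__':
    # Made-up sample values standing in for li.xpath(...) results
    print(first_text(['\n\t heart disease \r\n']))
    print(first_text([], default='N/A'))
```

Inside parse, a line such as skill = first_text(li.xpath("./div[2]/div[1]/p/text()")) would then replace the [0] lookup and the chained replace calls. Note that clean_text also removes spaces inside the text, which matches what the original does for skill but may be unwanted for fields where inner spaces matter.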

Original article: https://www.cnblogs.com/tjp40922/p/10574746.html

Time: 2024-08-29 17:02:02
