Python crawler: scraping company profiles from CNINFO (巨潮资讯) with requests + BeautifulSoup, a code example

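This is the first crawler I've written that counts as reasonably complete, and honestly I think it's terrible: the code is crude, it's slow, it doesn't save anything to a local file or a database, and I forced multithreading into it, which scrambled the order of the data...

I'm posting it here as a cautionary tale.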

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 21:41:34 2018
@author: brave-man
blog: http://www.cnblogs.com/zrmw/
"""

import requests
from bs4 import BeautifulSoup
import json
from threading import Thread

# Fetch a listed company's full name, English name, address and legal representative
# (any other field on the company page can be extracted the same way)
def getDetails(url):
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0"}
    res = requests.get(url, headers=headers)
    res.encoding = "GBK"
    soup = BeautifulSoup(res.text, "html.parser")

    # The .zx_data2 cells hold the profile values in page order
    fields = [cell.text.strip("\r\n ") for cell in soup.select(".zx_data2")]
    # "股票代码：" is the "Stock code:" label that precedes the six-digit code on the page
    details = {"code": soup.select(".table")[0].td.text.lstrip("股票代码：")[:6],
               "Entire_Name": fields[0],
               "English_Name": fields[1],
               "Address": fields[2],
               "Legal_Representative": fields[4]}
    # Serialize details to a JSON string for later storage;
    # ensure_ascii=False keeps the Chinese text readable
    jd = json.dumps(details, ensure_ascii=False)
    print(jd)

# Collect the stock codes of all listed companies
def getCode():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0"}
    res = requests.get("http://www.cninfo.com.cn/cninfo-new/information/companylist", headers=headers)
    res.encoding = "gb2312"
    soup = BeautifulSoup(res.text, "html.parser")
    # One list of codes per .company-list block; the first four blocks are used,
    # and the first six characters of each link text are the stock code
    L = [[a.text[:6] for a in block.find_all("a")]
         for block in soup.select(".company-list")[:4]]
    print(L[0])
    return getAll(L)

def getAll(L):
    # URL prefixes in the same order as the lists in L: Shenzhen main board,
    # SME board, ChiNext board, Shanghai main board
    boards = ["szmb", "szsme", "szcn", "shmb"]

    def crawl(board, codes):
        for code in codes:
            url = "http://www.cninfo.com.cn/information/brief/{}{}.html".format(board, code)
            getDetails(url)

    # Threaded variant (one thread per board), left disabled:
    # threads = [Thread(target=crawl, args=(board, codes)) for board, codes in zip(boards, L)]
    # for t in threads:
    #     t.start()
    # for t in threads:
    #     t.join()
    for board, codes in zip(boards, L):
        crawl(board, codes)

if __name__ == "__main__":
    getCode()
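
The code above only prints each record. A minimal sketch of saving the records to a local file instead, assuming a JSON Lines file (one JSON object per line) is acceptable. The names `save_details`, `load_details` and the file name `companies.jsonl` are hypothetical and not part of the original code:

```python
import json


def save_details(details, path):
    """Append one company record to a JSON Lines file (one JSON object per line)."""
    # ensure_ascii=False keeps Chinese company names readable in the file
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(details, ensure_ascii=False) + "\n")


def load_details(path):
    """Read every record back as a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

With this in place, `getDetails` could call `save_details(details, "companies.jsonl")` where it currently prints. Appending line by line means a crash halfway through the crawl does not lose the records already written.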

It doesn't account for the things that go wrong in real-world use, such as network latency and stalls.
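
One way to handle a flaky network is to retry a failed request a few times with a growing pause. The following is only a sketch under stated assumptions: `fetch_with_retry` is a hypothetical helper, `fetch` is any callable that takes a URL and raises on failure, and the attempt count and delay are arbitrary defaults.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url), retrying on any exception up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:  # with requests, narrow this to requests.RequestException
            last_error = exc
            if attempt < attempts:
                # wait a little longer after each failed attempt
                time.sleep(delay * attempt)
    # every attempt failed: surface the last error to the caller
    raise last_error
```

With requests this could be used as `fetch_with_retry(lambda u: requests.get(u, headers=headers, timeout=10), url)`. The `timeout` argument matters here: without it, a stalled connection can hang indefinitely and the retry logic never gets a chance to run.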

It really is slow. When I have time I'll share a selenium + browser approach to scraping CNINFO (巨潮资讯). Good night~
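
On the speed and ordering problems: a thread pool can run requests in parallel and still hand results back in input order. This is a sketch rather than the original approach. `crawl_in_order` is a hypothetical helper, and it assumes the fetch function returns its result instead of printing it.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_in_order(fetch, urls, workers=4):
    """Run fetch(url) for every URL in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of `urls`,
        # regardless of which request finishes first
        return list(pool.map(fetch, urls))
```

Because the results are collected by the caller rather than printed from inside each thread, the output order no longer depends on thread scheduling.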

Original post: https://www.cnblogs.com/zrmw/p/9333385.html

Time: 2024-12-22 11:44:16
