Crawler 2: HTML pages + the BeautifulSoup module + POST requests + demo

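　　When scraping HTML pages, you sometimes have to send the request as a POST with form parameters. This post walks through that, then turns the parsed result into JSON and saves it to a file.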

1) Import the modules

import re  # used by the find_all filter in step 6

import requests
from bs4 import BeautifulSoup

url = 'http://www.c.....................'

2) Set the parameters

datas = {
    'yyyy': '2014',
    'mm': '-12-31',
    'cwzb': "incomestatements",
    'button2': "%CC%E1%BD%BB",
}
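One detail of the parameters above is worth checking: the button2 value is already percent-encoded, and it looks like the GBK bytes of a submit-button label (that reading is an assumption, the original does not say). A form encoder normally escapes each '%' again, so the server may receive a double-encoded value. The sketch below uses only the standard library's urllib.parse to show the difference. As far as I know, requests form-encodes a dict in much the same way, but treat that as an assumption too.

```python
from urllib.parse import unquote, urlencode

raw = "%CC%E1%BD%BB"  # the 'button2' value from the parameters above

# Encoding the literal string escapes every '%' as '%25'.
double_encoded = urlencode({"button2": raw})

# Assumption: the value is percent-encoded GBK text. Decode it first,
# then encode the decoded text as GBK so the original bytes go out.
label = unquote(raw, encoding="gbk")
single_encoded = urlencode({"button2": label}, encoding="gbk")

print(double_encoded)  # button2=%25CC%25E1%25BD%25BB
print(single_encoded)  # button2=%CC%E1%BD%BB
```

If the site rejects the request, comparing the body the browser sends with these two forms should show which one it expects.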

3) Send the POST request

r = requests.post(url, data=datas)

4) Set the encoding

r.encoding = r.apparent_encoding
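Why this step matters: r.text decodes the response bytes using r.encoding, and when the server does not declare a charset the default guess can be wrong, which turns Chinese text into mojibake. The sketch below reproduces the effect with plain bytes and no network access. That the page is GBK-encoded is an assumption made for illustration.

```python
# Assumption: the page is served as GBK, as many older Chinese sites are.
page_text = "股票代码"
body = page_text.encode("gbk")  # stands in for the raw response bytes

wrong = body.decode("iso-8859-1")  # what a Latin-1 fallback would give
right = body.decode("gbk")         # what a correctly detected encoding gives

print(repr(wrong))  # eight unrelated Latin-1 characters
print(right)        # the original four Chinese characters
```

Setting r.encoding from r.apparent_encoding asks requests to guess the encoding from the bytes themselves, so r.text comes out like the second case rather than the first.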

5) Parse the response with BeautifulSoup

# Name the parser explicitly; 'html.parser' ships with Python.
soup = BeautifulSoup(r.text, 'html.parser')

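6) Filter with find_all

# Find the <strong> whose text contains "股票代码" (stock code),
# then take the node that follows it inside the same parent.
soup.find_all('strong', string=re.compile("股票代码"))[0].parent.contents[1]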
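To make the parent.contents indexing concrete, here is a small sketch on an invented HTML fragment. The markup, the code "000000" and the name "示例简称" are all hypothetical; the real page may be laid out differently.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup, invented for illustration only.
html = (
    "<table><tr><td>"
    "<strong>股票代码：</strong>000000 "
    "<strong>股票简称：</strong>示例简称"
    "</td></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# The first <strong> whose text matches the pattern.
label = soup.find_all("strong", string=re.compile("股票代码"))[0]

# The parent <td> holds four children: label, text, label, text.
siblings = label.parent.contents
stock_code = str(siblings[1]).strip()
short_name = str(siblings[3]).strip()

print(stock_code, short_name)
```

Indexes 1 and 3 work here only because each label is followed directly by a text node. Any extra whitespace or tag in the real markup shifts the positions, so it is worth printing the contents list once before relying on fixed indexes.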

7) Select with a CSS selector

soup.select("option[selected]")[0].contents[0]
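The selector option[selected] matches every option element that carries a selected attribute, in document order. A minimal sketch on an invented form follows. The field names yyyy and mm come from the request parameters, but the option values and labels are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical form markup, invented for illustration only.
html = (
    "<form>"
    '<select name="yyyy">'
    "<option>2013</option><option selected>2014</option>"
    "</select>"
    '<select name="mm">'
    '<option value="-06-30">Interim report</option>'
    '<option value="-12-31" selected>Annual report</option>'
    "</select>"
    "</form>"
)
soup = BeautifulSoup(html, "html.parser")

chosen = soup.select("option[selected]")  # one match per drop-down
year = str(chosen[0].contents[0])
period = str(chosen[1].contents[0])

print(year, period)
```

Because the matches come back in page order, chosen[0] is the year and chosen[1] the reporting period only if the drop-downs appear in that order on the page.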

For the BeautifulSoup API, see https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#beautifulsoup

Demo: read URLs from a file, set the parameters, send a POST request, parse the response with BeautifulSoup, build JSON, and save it to a file

import json
import re
import time

import requests
from bs4 import BeautifulSoup

# Read the page URLs, one per line, skipping blank lines.
with open(r"E:\aa.txt", "r") as fd:
    mylist = [line.strip() for line in fd if line.strip()]

url_pre = 'http://www.c.....................'

# Append one JSON object per line; UTF-8 keeps the Chinese text readable.
with open(r"E:\a---.txt", "a", encoding="utf-8") as code:
    for index, line in enumerate(mylist):
        print(index)
        # The stock code is the last six characters of the query string.
        url_id = line.split('?')[-1][-6:]

        datas = {
            'yyyy': '2014',
            'mm': '-12-31',
            'cwzb': "incomestatements",
            'button2': "%CC%E1%BD%BB",
        }
        url = url_pre + url_id
        print(url)
        print(datas)

        r = requests.post(url, data=datas)
        r.encoding = r.apparent_encoding
        print(r)
        soup = BeautifulSoup(r.text, 'html.parser')

        # Skip pages that have no "营业收入" (operating revenue) row.
        revenue_cells = soup.find_all("td", string=re.compile("营业收入"))
        if len(revenue_cells) == 0:
            continue

        jsonMap = {}

        # Stock code and short name follow the "股票代码" label in the same cell.
        label = soup.find_all('strong', string=re.compile("股票代码"))[0]
        jsonMap['股票代码'] = label.parent.contents[1]
        jsonMap['股票简称'] = label.parent.contents[3]

        # The two selected <option> elements hold the year and the reporting period.
        selected = soup.select("option[selected]")
        jsonMap['年度'] = selected[0].contents[0]
        jsonMap['报告期'] = selected[1].contents[0]

        # Operating revenue: second cell of the matching table row.
        yysr = revenue_cells[0].parent
        jsonMap['营业收入'] = yysr.find_all('td')[1].contents[0]

        # Operating profit ("营业利润"): fourth cell of the matching table row.
        yylr = soup.find_all("td", string=re.compile("营业利润"))[0].parent
        jsonMap['营业利润'] = yylr.find_all('td')[3].contents[0]

        strJson = json.dumps(jsonMap, ensure_ascii=False)
        print(strJson)
        code.write(strJson + '\n')
        time.sleep(0.1)  # short pause between requests
        code.flush()
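The output file written by the demo holds one JSON object per line. The sketch below shows that write-and-read-back pattern in isolation, using a temporary file and invented records. The helper names append_record and read_records are hypothetical and not part of the demo.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one record as a single JSON line, keeping non-ASCII text readable."""
    with open(path, "a", encoding="utf-8") as out:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_records(path):
    """Load every non-empty line of the file back into a dict."""
    with open(path, encoding="utf-8") as src:
        return [json.loads(line) for line in src if line.strip()]


# Invented sample records, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
append_record(path, {"股票代码": "000000", "年度": "2014"})
append_record(path, {"股票代码": "000001", "年度": "2014"})

records = read_records(path)
print(len(records), records[0]["股票代码"])
```

Writing one object per line means a crash halfway through a crawl costs at most the current record, and the file can be processed line by line later without loading it whole.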

Time: 2024-10-02 23:12:28
