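Python Web Crawling and Information Extraction (China University MOOC)

Python Web Crawling and Information Extraction

Taobao price-comparison focused crawler
Stock data focused crawler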

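1. Taobao price-comparison focused crawler

Functional description

Goal: fetch the information shown on Taobao search result pages

Key point: understand how the Taobao search interface handles paging

Technical route: requests-re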
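
The paging scheme can be sketched without touching the network. Judging from the crawler in this section, each results page holds 44 items and the `s` query parameter carries the offset of the first item. The helper name `build_search_urls` and its `page_size` parameter are illustrative and do not come from the course code, and the live site may have changed its interface since these notes were written.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://s.taobao.com/search"


def build_search_urls(keyword, depth, page_size=44):
    """Return one search URL per results page (hypothetical helper)."""
    urls = []
    for page in range(depth):
        # q is the search keyword, s is the offset of the first item on the page.
        query = urlencode({"q": keyword, "s": page * page_size})
        urls.append(SEARCH_URL + "?" + query)
    return urls


for url in build_search_urls("书包", 3):
    print(url)
```

Unlike plain string concatenation, `urlencode` percent-encodes the Chinese keyword, so the resulting URLs are safe to log or paste elsewhere.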

The code is as follows:

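#CrowTaobaoPrice.py
import requests
import re

def getHTMLText(url):
    # Return the page text, or an empty string if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        return ""

def parsePage(ilt, html):
    # Append [price, title] pairs found in the page to ilt.
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for i in range(len(plt)):
            # Take the text after the first colon and strip the quotes
            # (replaces eval, which should not be run on scraped text).
            price = plt[i].split(':', 1)[1].strip('"')
            title = tlt[i].split(':', 1)[1].strip('"')
            ilt.append([price, title])
    except Exception:
        print("")

def printGoodsList(ilt):
    # Print the collected goods as a numbered table.
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = '书包'  # search keyword ("schoolbag")
    depth = 3       # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            # Each page lists 44 items; s is the offset of the first item.
            url = start_url + '&s=' + str(44*i)
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)

main()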

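Flowchart:
Step 1: submit the product search request and fetch the result pages in a loop
Step 2: for each page, extract the product titles and prices
Step 3: print the results to the screen

Start → Step 1 → Step 2 → Step 3 → End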
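
Step 2 can be tried offline on a small sample. The sketch below is a variation on the course's approach rather than a copy of it: it uses capture groups so that `re.findall` returns the values directly, with no splitting or `eval`. The `sample_html` fragment and the name `extract_goods` are made up for illustration; only the `view_price` and `raw_title` keys come from the crawler in this section.

```python
import re

# Hypothetical fragment imitating the JSON embedded in a search results page.
sample_html = (
    '{"raw_title":"双肩书包 学生","view_price":"129.00"},'
    '{"raw_title":"帆布背包","view_price":"59.90"}'
)

PRICE_RE = re.compile(r'"view_price":"([\d.]*)"')
TITLE_RE = re.compile(r'"raw_title":"(.*?)"')


def extract_goods(html):
    """Pair each price with the title found at the same position."""
    prices = PRICE_RE.findall(html)
    titles = TITLE_RE.findall(html)
    # zip stops at the shorter list, so a missing title cannot raise IndexError.
    return [[price, title] for price, title in zip(prices, titles)]


for index, (price, title) in enumerate(extract_goods(sample_html), start=1):
    print(index, price, title)
```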

2. Stock data focused crawler

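1. Functional description

Goal: obtain the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges
Output: saved to a file
Technical route: requests-bs4-re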

Sina Stock: http://finance.sina.com.cn/stock/
Baidu Stock: https://gupiao.baidu.com/stock/

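2. Principles for choosing a site to crawl

Selection criteria: the stock information is present statically in the HTML page, is not generated by JavaScript, and is not restricted by the Robots protocol
How to check: the browser's F12 developer tools, viewing the page source, and so on
Mindset: do not get stuck on a single site; try several information sources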
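
The Robots protocol criterion can be checked in code with the standard library. This is a minimal sketch that parses a made-up robots.txt held in a string, so it runs offline; the rules, the `example.com` URLs, and the helper name `allowed` are assumptions for illustration and do not describe any of the sites named above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would use the site's own file.
sample_robots = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_robots)


def allowed(url, user_agent="*"):
    """Return True if the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/stock/sh600000.html"))
print(allowed("https://example.com/private/data.html"))
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads and parses the real file in place of the `parse` call.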

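Program structure:

Start
Step 1: get the stock list from Eastmoney
Step 2: for each stock in the list, fetch its details from Baidu Stock
Step 3: store the results in a file
End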
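
The link between steps 1 and 2 is the stock code: a prefix `sh` or `sz` followed by six digits, which is then appended to the detail-page base URL. The sketch below shows that hand-off on a few made-up `href` values; `extract_codes`, `detail_url`, and the sample links are illustrative names and data, not part of the course code.

```python
import re

STOCK_CODE_RE = re.compile(r"s[hz]\d{6}")

# Hypothetical href values of the kind found on a stock list page.
sample_hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/",
]


def extract_codes(hrefs):
    """Keep the first stock code found in each href; skip links without one."""
    codes = []
    for href in hrefs:
        match = STOCK_CODE_RE.search(href)
        if match:
            codes.append(match.group(0))
    return codes


def detail_url(code, base="https://gupiao.baidu.com/stock/"):
    """Build the per-stock page URL from a code."""
    return base + code + ".html"


codes = extract_codes(sample_hrefs)
print(codes)
print([detail_url(code) for code in codes])
```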

The code is as follows:

#CrawBaiduStocksA.py
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    # Return the page text, or an empty string if the request fails.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception:
        return ""

def getStockList(lst, stockURL):
    # Collect stock codes such as sh600000 from every link on the list page.
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except Exception:
            # No href attribute, or no stock code in it.
            continue

def getStockInfo(lst, stockURL, fpath):
    # Fetch each stock's page and append its fields to fpath, one dict per line.
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            # The key '股票名称' means "stock name".
            infoDict.update({'股票名称': name.text.split()[0]})

            # <dt> elements hold the field names, <dd> elements the values.
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
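
Because the crawler writes `str(infoDict)` on each line, the output file can be loaded back into dictionaries with `ast.literal_eval`. The following is a sketch under that assumption; `load_stock_records`, the temporary file, and the two sample records are made up for illustration and stand in for real crawler output.

```python
import ast
import os
import tempfile


def load_stock_records(path):
    """Parse a file in which every non-empty line is the str() of a dict."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                # literal_eval only accepts Python literals, unlike eval.
                records.append(ast.literal_eval(line))
    return records


# Write two made-up records the same way the crawler does, then read them back.
sample = [
    {"股票名称": "示例A", "今开": "10.00"},
    {"股票名称": "示例B", "今开": "20.50"},
]
tmp_path = os.path.join(tempfile.mkdtemp(), "stocks.txt")
with open(tmp_path, "a", encoding="utf-8") as f:
    for record in sample:
        f.write(str(record) + "\n")

records = load_stock_records(tmp_path)
print(len(records), records[0]["股票名称"])
```

If the data is meant for other tools, writing each record with `json.dumps(infoDict, ensure_ascii=False)` instead would make the file readable outside Python.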

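Code optimization

1. Encoding detection: r.apparent_encoding analyzes the page content to guess the encoding, which is slow; since the encoding of each site is known in advance, pass it in directly
2. Add a dynamic progress display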
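
The first optimization rests on a simple fact: once the encoding is known, decoding needs no detection step, while decoding with the wrong codec quietly yields garbage. The sketch below shows this on a byte string built locally; the sample text is made up, and no request is sent.

```python
text = "股票代码"
raw = text.encode("gb2312")  # bytes as a GB2312 page would deliver them

# Decoding with the known encoding recovers the text directly.
decoded = raw.decode("gb2312")

# Decoding with the wrong codec does not fail loudly when errors are replaced;
# it just produces unreadable text.
garbled = raw.decode("utf-8", errors="replace")

print(decoded == text)
print(garbled == text)
```

In requests, assigning `r.encoding` plays the role of the explicit codec here: it tells `r.text` how to decode the response body.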

The optimized code is as follows:

import requests
from bs4 import BeautifulSoup
import re

def getHTMLText(url, code="utf-8"):
    # Return the page text decoded with the given encoding, or "" on failure.
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except Exception:
        return ""

def getStockList(lst, stockURL):
    # The list page is served as GB2312, so the encoding is passed explicitly.
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except Exception:
            # No href attribute, or no stock code in it.
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            # The key '股票名称' means "stock name".
            infoDict.update({'股票名称': name.text.split()[0]})

            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            # Skip stocks whose page cannot be parsed.
            continue
        finally:
            # Runs for every stock, including skipped ones, so the
            # progress can reach 100%. \r redraws the same console line.
            count = count + 1
            print("\rProgress: {:.2f}%".format(count*100/len(lst)), end="")

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
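
The progress display relies on the carriage return `\r`, which moves the cursor back to the start of the line so that each update overwrites the previous one. The sketch below isolates that idea; `progress_text` and `run_with_progress` are illustrative names, and the short `time.sleep` merely stands in for the work of fetching one page.

```python
import sys
import time


def progress_text(done, total):
    """Format the progress line; the leading \\r returns to the line start."""
    return "\rProgress: {:.2f}%".format(done * 100 / total)


def run_with_progress(items, work):
    """Call work(item) for each item and redraw the progress line after each."""
    for done, item in enumerate(items, start=1):
        work(item)
        sys.stdout.write(progress_text(done, len(items)))
        sys.stdout.flush()
    sys.stdout.write("\n")


run_with_progress(range(5), lambda _: time.sleep(0.01))
```

Some IDE consoles do not honor `\r`, in which case every update appears as appended text; a regular terminal shows a single line that updates in place.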

Source:
Python Web Crawling and Information Extraction
China University MOOC
http://www.icourse163.org/learn/BIT-1001870001?tid=1001962001#/learn/content?type=detail&id=1002699548&cid=1003101008

Time: 2024-10-12 04:49:37
