Python web scraping: the requests module

1. Login examples

a. Scrape news titles, links, and images from Autohome and save the images locally

import uuid

import requests
from bs4 import BeautifulSoup

response = requests.get(
    'http://www.autohome.com.cn/news/'
)
response.encoding = 'gbk'  # decode the page as GBK
soup = BeautifulSoup(response.text, 'html.parser')  # the HTML is parsed into a tree of objects
tag = soup.find(id='auto-channel-lazyload-article')
li_list = tag.find_all('li')

for li in li_list:
    a = li.find('a')
    if a:
        print(a.attrs.get('href'))  # article link
        txt = a.find('h3').text     # article title
        print(txt)
        img_url = a.find('img').attrs.get('src')  # image URL
        print(img_url)

        # Download the image and save it under a random file name
        img_response = requests.get(url=img_url)
        file_name = str(uuid.uuid4()) + '.jpg'
        with open(file_name, 'wb') as f:
            f.write(img_response.content)

This example uses the BeautifulSoup module to locate tags.
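One detail the listing above depends on is that each img src is a full URL that requests.get can fetch directly. If a page uses relative or protocol-relative src values instead, they would need to be resolved against the page URL first. The following is a minimal sketch of that step using only the standard library; the helper name absolute_url and the sample src values are hypothetical and are not taken from the Autohome page.

```python
from urllib.parse import urljoin

# Page the links were scraped from (the URL used in the example above)
PAGE_URL = 'http://www.autohome.com.cn/news/'


def absolute_url(src, base=PAGE_URL):
    """Resolve a possibly relative src attribute against the page URL."""
    return urljoin(base, src)


# Hypothetical src values, for illustration only
protocol_relative = absolute_url('//img.example.com/a.jpg')
root_relative = absolute_url('/static/b.jpg')
already_absolute = absolute_url('http://img.example.com/c.jpg')

print(protocol_relative)
print(root_relative)
print(already_absolute)
```

The result of absolute_url can then be passed to requests.get in place of the raw src value.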

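b. Upvoting on Chouti: both the initial page request and the login response set a gpsd cookie. The upvote must use the gpsd from the initial page request, not the one returned by the login.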

import requests

# First fetch the page to obtain the initial cookies
r1 = requests.get('http://dig.chouti.com/')
r1_cookies = r1.cookies.get_dict()

# Log in, carrying the cookies from the first request
post_dict = {
    "phone": "8615131255089",
    "password": "woshiniba",
    "oneMonth": "1"
}

r2 = requests.post(
    url="http://dig.chouti.com/login",
    data=post_dict,
    cookies=r1_cookies
)

r2_cookies = r2.cookies.get_dict()

# Visit another page (the upvote) with the gpsd from the first request
r3 = requests.post(
    url="http://dig.chouti.com/link/vote?linksId=13921091",
    cookies={'gpsd': r1_cookies['gpsd']}
)
print(r3.text)

(The gpsd cookie set by the Chouti page)
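To see what the cookies argument actually puts on the wire, a request can be built and prepared without sending it. This is a small sketch, assuming a made-up gpsd value ('abc123'); it only inspects the prepared request and does not contact the site.

```python
import requests

# 'abc123' is a placeholder for the gpsd value returned by the first GET
req = requests.Request(
    'POST',
    'http://dig.chouti.com/link/vote',
    params={'linksId': '13921091'},
    cookies={'gpsd': 'abc123'},
)
prepared = req.prepare()

cookie_header = prepared.headers['Cookie']
final_url = prepared.url

print(cookie_header)  # the Cookie header that would be sent
print(final_url)      # the URL with the query string appended
```

Only the cookies passed explicitly are sent, which is why the choice between the first gpsd and the login gpsd matters.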

c. Logging in to GitHub while carrying cookies

import requests
from bs4 import BeautifulSoup

r1 = requests.get('https://github.com/login')
s1 = BeautifulSoup(r1.text, 'html.parser')

# Get the CSRF token from the login form
token = s1.find(name='input', attrs={'name': "authenticity_token"}).get('value')
r1_cookie_dict = r1.cookies.get_dict()

# POST the username, password, and token to the server
r2 = requests.post(
    'https://github.com/session',
    data={
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': token,
        'login': '[email protected]',
        'password': 'alex3714'
    },
    cookies=r1_cookie_dict
)

# Get the cookies set after login
r2_cookie_dict = r2.cookies.get_dict()

# Merge the pre-login cookies with the post-login cookies
cookie_dict = {}
cookie_dict.update(r1_cookie_dict)
cookie_dict.update(r2_cookie_dict)

r3 = requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dict
)

print(r3.text)
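The order of the two update calls matters: when both responses set a cookie with the same name, the value from the later update wins. Here is a minimal sketch of that merge with made-up cookie names and values (they are illustrative and not the cookies GitHub actually sets).

```python
# Hypothetical cookies from the login page (before login)
before_login = {'_gh_sess': 'anonymous', 'logged_in': 'no'}
# Hypothetical cookies from the POST to /session (after login)
after_login = {'_gh_sess': 'authenticated', 'user_session': 'xyz'}

merged = {}
merged.update(before_login)  # start from the pre-login cookies
merged.update(after_login)   # post-login values override duplicates

print(merged)
```

Cookies that only exist before login (logged_in here) are kept, while the session cookie takes its post-login value.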

2. requests parameters

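- method:  HTTP method
- url:     request URL
- params:  parameters passed in the URL query string (e.g. for GET)
- data:    data sent in the request body
- json:    data sent in the request body, serialized as JSON
- headers: request headers
- cookies: cookies
- files:   files to upload
- auth:    basic authentication (the username and password are Base64-encoded and added to the headers)
- timeout: timeout for the request and the response
- allow_redirects: whether to follow redirects
- proxies: proxies
- verify:  whether to verify the server certificate (False ignores it)
- cert:    client certificate file
- stream:  stream the response, for downloading large files
- session: keeps the client's history across requests, such as cookies (provided by requests.Session rather than passed as a keyword argument)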
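The difference between params, data, and json is easiest to see by preparing requests without sending them. The sketch below assumes placeholder URLs on example.com and made-up field names; it only shows where each argument ends up in the prepared request.

```python
import json

import requests

# params: appended to the URL as a query string
get_req = requests.Request(
    'GET', 'http://example.com/search', params={'q': 'python', 'page': 2}
).prepare()

# data (dict): form-encoded into the request body
form_req = requests.Request(
    'POST', 'http://example.com/login', data={'user': 'alice', 'pwd': 'secret'}
).prepare()

# json: serialized to JSON in the request body
json_req = requests.Request(
    'POST', 'http://example.com/api', json={'user': 'alice'}
).prepare()

print(get_req.url)
print(form_req.headers['Content-Type'], form_req.body)
print(json_req.headers['Content-Type'], json.loads(json_req.body))
```

In each case requests also sets a matching Content-Type header, so there is usually no need to set it by hand.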

a. files: sending files

import requests

requests.post(
    url='xxx',
    files={
        'name1': open('a.txt', 'rb'),               # form field name mapped to a file object
        'name2': ('bbb.txt', open('b.txt', 'rb'))   # the file is uploaded to the server under the name bbb.txt
    }
)
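To check how the files argument is encoded, the request can be prepared without being sent. This sketch uses in-memory io.BytesIO objects in place of real files and a placeholder URL, both of which are assumptions for illustration.

```python
import io

import requests

# In-memory stand-ins for open('a.txt', 'rb') and open('b.txt', 'rb')
files = {
    'name1': io.BytesIO(b'hello'),
    'name2': ('bbb.txt', io.BytesIO(b'world')),  # explicit upload file name
}

prepared = requests.Request(
    'POST', 'http://example.com/upload', files=files
).prepare()

content_type = prepared.headers['Content-Type']
body = prepared.body  # the multipart body as bytes

print(content_type)
print(b'filename="bbb.txt"' in body)
```

The body is multipart/form-data, and the tuple form controls the file name the server sees.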

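b. auth: authentication

# When you configure a router by visiting 192.168.0.1, a small dialog pops up asking for a username and password. Clicking "log in" does not submit a form. This is a basic-auth dialog: the username and password are Base64-encoded (not encrypted) and sent in a request header.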

import requests
from requests.auth import HTTPBasicAuth

# The credentials below are placeholders
ret = requests.get(
    url='http://192.168.0.1',
    auth=HTTPBasicAuth('username', 'password')
)
print(ret.text)
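For reference, the header that HTTP basic authentication sends can be reproduced with the standard library alone. This is a sketch of the encoding step only; the function name basic_auth_header and the credentials are hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of the Authorization header for HTTP basic auth."""
    raw = f'{username}:{password}'.encode('utf-8')
    token = base64.b64encode(raw).decode('ascii')
    return 'Basic ' + token


# Placeholder credentials, for illustration only
header_value = basic_auth_header('admin', 'secret')
print(header_value)

# Base64 is reversible, so anyone who sees the header can read the password
decoded = base64.b64decode(header_value.split(' ', 1)[1]).decode('utf-8')
print(decoded)
```

Because the encoding is reversible, basic auth is only safe over HTTPS.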

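c. stream: streaming downloads

# If the file on the server is too large, download it piece by piece in a loop

import requests

def param_stream():
    ret = requests.get('http://127.0.0.1:8000/test/', stream=True)
    # With stream=True the body is not downloaded until it is read,
    # so iterate over it in chunks instead of reading ret.content at once
    for chunk in ret.iter_content(chunk_size=1024):
        print(chunk)
    ret.close()

    # from contextlib import closing
    # with closing(requests.get('http://httpbin.org/get', stream=True)) as r:
    #     # Process the response here.
    #     for i in r.iter_content():
    #         print(i)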
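A common use of stream=True is saving a large response to disk chunk by chunk so the whole body never sits in memory. The sketch below splits that into two hypothetical helpers: write_chunks, which can be exercised with any iterable of bytes, and download, which feeds it from requests. The chunk size and timeout are assumed values.

```python
import os
import tempfile

import requests


def write_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the bytes written."""
    total = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


def download(url, path, chunk_size=8192):
    """Stream url to path without loading the whole body into memory."""
    with requests.get(url, stream=True, timeout=10) as resp:
        resp.raise_for_status()
        return write_chunks(resp.iter_content(chunk_size=chunk_size), path)


# Offline demonstration of the writing step with fake chunks
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
written = write_chunks([b'ab', b'', b'cde'], demo_path)
print(written)
```

Calling download('http://127.0.0.1:8000/test/', 'test.bin') would apply the same loop to a real response, assuming a server is listening at that address.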

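d. session (not the same thing as a Django session). Example: a simplified Chouti upvote

import requests

session = requests.Session()

### 1. First visit any page to obtain the cookies
i1 = session.get(url="http://dig.chouti.com/help/service")

### 2. Log in. The session carries the cookies from the previous request, and the backend authorizes the gpsd in those cookies
i2 = session.post(
    url="http://dig.chouti.com/login",
    data={
        'phone': "8615131255089",
        'password': "xxxxxx",
        'oneMonth': ""
    }
)

### 3. Upvote. The session sends the stored cookies automatically
i3 = session.post(
    url="http://dig.chouti.com/link/vote?linksId=8589623",
)
print(i3.text)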
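The reason no cookies argument appears in the session version is that a Session keeps a cookie jar and attaches it to every request it prepares. The sketch below shows this offline: the cookie is set by hand with a placeholder value ('abc123') rather than received from the server, and the request is only prepared, not sent.

```python
import requests

session = requests.Session()

# Stand-in for the gpsd cookie the first GET would normally store in the jar
session.cookies.set('gpsd', 'abc123')

# Prepare (but do not send) a later request through the same session
req = requests.Request('POST', 'http://dig.chouti.com/link/vote?linksId=8589623')
prepared = session.prepare_request(req)

session_cookie_header = prepared.headers['Cookie']
print(session_cookie_header)
```

With plain requests.get and requests.post calls, the same effect requires passing the cookies dict by hand on every call.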

Time: 2024-12-25 19:54:08
