Basic Python Crawler Examples with urllib

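Environment: Python 3.5.2 and urllib3-1.22. Note that the examples below only use the standard library modules urllib.request, urllib.parse, and urllib.error.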

Download and install

wget https://www.python.org/ftp/python/3.5.2/Python-3.5.2.tgz

tar -zxf Python-3.5.2.tgz

cd Python-3.5.2/

./configure --prefix=/usr/local/python

make && make install

mv /usr/bin/python /usr/bin/python275

ln -s /usr/local/python/bin/python3 /usr/bin/python

wget https://files.pythonhosted.org/packages/ee/11/7c59620aceedcc1ef65e156cc5ce5a24ef87be4107c2b74458464e437a5d/urllib3-1.22.tar.gz

tar zxf urllib3-1.22.tar.gz

cd urllib3-1.22/

python setup.py install

Simulating a browser

Adding headers, method 1: build_opener()
import urllib.request
url="http://www.baidu.com"
headers=("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
opener=urllib.request.build_opener()
opener.addheaders=[headers]
data=opener.open(url).read()
fl=open("/home/urllib/test/1.html","wb")
fl.write(data)
fl.close()

Adding headers, method 2: add_header()
import urllib.request
url="http://www.baidu.com"
req=urllib.request.Request(url)
req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
data=urllib.request.urlopen(req).read()
fl=open("/home/urllib/test/2.html","wb")
fl.write(data)
fl.close()
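
A third variant, shown here only as a minimal sketch: Request also accepts a headers dictionary in its constructor, so several headers can be set in one call. The helper name build_request and the Accept-Language header are assumptions added for illustration and are not part of the original examples.

```python
import urllib.request

# Same User-Agent string as in the examples above.
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")

def build_request(url, extra_headers=None):
    """Return a Request with a browser-like User-Agent plus any extra headers.

    build_request is a hypothetical helper, not a urllib function.
    """
    headers = {"User-Agent": USER_AGENT}
    if extra_headers:
        headers.update(extra_headers)
    return urllib.request.Request(url, headers=headers)

# No network access happens here; the request is only constructed.
req = build_request("http://www.baidu.com", {"Accept-Language": "zh-CN"})

# Request stores header names capitalized, e.g. "User-agent".
print(req.get_header("User-agent"))
print(req.get_header("Accept-language"))   # zh-CN
```

The request can then be passed to urllib.request.urlopen(req) exactly as in method 2.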

Setting a timeout

timeout
import urllib.request
for i in range(1,100):
	try:
		file=urllib.request.urlopen("http://www.baidu.com",timeout=1)
		data=file.read()
		print(len(data))
	except Exception as e:
		print("Exception ---->"+str(e))
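
The loop above simply repeats the request and prints any exception. One possible refinement, sketched under assumptions, is to retry a limited number of times and stop on the first success. The function name fetch_with_retry and its retries, delay, and opener parameters are hypothetical. The opener parameter exists only so the function can be exercised without a network connection.

```python
import socket
import time
import urllib.error
import urllib.request

def fetch_with_retry(url, timeout=1, retries=3, delay=0.5,
                     opener=urllib.request.urlopen):
    """Download url, retrying on timeouts and URL errors.

    Returns the response body as bytes, or None if every attempt failed.
    A connect timeout is usually wrapped in URLError, while a timeout
    during read() surfaces as socket.timeout, so both are caught.
    """
    for attempt in range(1, retries + 1):
        try:
            response = opener(url, timeout=timeout)
            return response.read()
        except (socket.timeout, urllib.error.URLError) as e:
            print("attempt %d failed: %s" % (attempt, e))
            if attempt < retries:
                time.sleep(delay)  # brief pause before the next attempt
    return None
```

A call such as fetch_with_retry("http://www.baidu.com", timeout=1) would then return the page bytes or None.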

HTTP GET request, part 1

GET request
import urllib.request
keywd="hello"
url="http://www.baidu.com/s?wd="+keywd
req=urllib.request.Request(url)
req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
data=urllib.request.urlopen(req).read()
fl=open("/home/urllib/test/3.html","wb")
fl.write(data)
fl.close()

HTTP GET request, part 2

GET request (with URL encoding)
import urllib.request
import urllib.parse
keywd="中国"
url="http://www.baidu.com/s?wd="
# Percent-encode the non-ASCII keyword before appending it to the URL
key_code=urllib.parse.quote(keywd)
url_all=url+key_code
req=urllib.request.Request(url_all)
req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
data=urllib.request.urlopen(req).read()
fl=open("/home/urllib/test/4.html","wb")
fl.write(data)
fl.close()
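
To make the encoding step concrete, the following sketch shows what the keyword looks like after percent-encoding, and how urllib.parse.urlencode can build the whole query string in one step. It performs no network access. The variable names base and query are assumptions added for illustration.

```python
import urllib.parse

base = "http://www.baidu.com/s"
keywd = "中国"

# quote() percent-encodes the UTF-8 bytes of the keyword.
encoded = urllib.parse.quote(keywd)
print(encoded)    # %E4%B8%AD%E5%9B%BD

# urlencode() builds "key=value" pairs and encodes each value.
query = urllib.parse.urlencode({"wd": keywd})
url_all = base + "?" + query
print(url_all)    # http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD

# unquote() reverses the encoding.
print(urllib.parse.unquote(encoded) == keywd)    # True
```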

HTTP POST request

POST request
import urllib.request
import urllib.parse
url="http://www.baidu.com/mypost/"
postdata=urllib.parse.urlencode({
"user":"testname",
"passwd":"123456"
}).encode('utf-8')
req=urllib.request.Request(url,postdata)
req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36")
data=urllib.request.urlopen(req).read()
fl=open("/home/urllib/test/5.html","wb")
fl.write(data)
fl.close()
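
Before sending a POST it can help to inspect what was actually built. The sketch below constructs the same request without sending it and checks the method and body. The list of pairs is used instead of a dict only to keep the field order stable on every Python version, and the shortened User-Agent value is an assumption for brevity.

```python
import urllib.parse
import urllib.request

url = "http://www.baidu.com/mypost/"   # placeholder URL from the example above
fields = [("user", "testname"), ("passwd", "123456")]

# urlencode() returns a str; Request needs bytes for the body.
postdata = urllib.parse.urlencode(fields).encode("utf-8")

req = urllib.request.Request(url, data=postdata)
req.add_header("User-Agent", "Mozilla/5.0")  # shortened for brevity

# Supplying data switches the request method from GET to POST.
print(req.get_method())   # POST
print(req.data)           # b'user=testname&passwd=123456'
```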

Using a proxy server

def use_proxy(proxy_addr,url):
	import urllib.request
	proxy=urllib.request.ProxyHandler({'http':proxy_addr})
	opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
	urllib.request.install_opener(opener)
	data=urllib.request.urlopen(url).read().decode('utf-8')
	return data
proxy_addr="201.25.210.23:7623"
url="http://www.baidu.com"
data=use_proxy(proxy_addr,url)
# use_proxy() returns decoded text, so open the file in text mode
fl=open("/home/urllib/test/6.html","w",encoding="utf-8")
fl.write(data)
fl.close()
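
install_opener() changes the opener used by every later urlopen() call in the process. If only some requests should go through the proxy, one alternative, sketched here under assumptions, is to keep the opener as a local object and call its open() method directly. The name make_proxy_opener and the address 127.0.0.1:8080 are hypothetical.

```python
import urllib.request

def make_proxy_opener(proxy_addr):
    """Build an opener that sends http requests through proxy_addr ("host:port").

    make_proxy_opener is a hypothetical helper. Nothing is installed
    globally, so plain urllib.request.urlopen() calls are unaffected.
    """
    proxy = urllib.request.ProxyHandler({"http": proxy_addr})
    return urllib.request.build_opener(proxy)

# Building the opener does not contact the proxy.
opener = make_proxy_opener("127.0.0.1:8080")  # hypothetical local proxy

# A request through the proxy would then look like:
#   data = opener.open("http://www.baidu.com", timeout=5).read()
```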

Enabling DebugLog

import urllib.request
url="http://www.baidu.com"
httpd=urllib.request.HTTPHandler(debuglevel=1)
httpsd=urllib.request.HTTPSHandler(debuglevel=1)
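# Pass both debug-enabled handlers to the opener
opener=urllib.request.build_opener(httpd,httpsd)
urllib.request.install_opener(opener)
# read() returns the body as bytes, which is what fl.write() needs
data=urllib.request.urlopen(url).read()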
fl=open("/home/urllib/test/7.html","wb")
fl.write(data)
fl.close()

Handling URLError exceptions

Handling URLError
import urllib.request
import urllib.error
try:
	urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
	print(e.reason)

Handling HTTPError
import urllib.request
import urllib.error
try:
	urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.HTTPError as e:
	print(e.code)
	print(e.reason)

Combining the two (HTTPError is a subclass of URLError, so it must be caught first)
import urllib.request
import urllib.error
try:
	urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.HTTPError as e:
	print(e.code)
	print(e.reason)
except urllib.error.URLError as e:
	print(e.reason)

Recommended approach:
import urllib.request
import urllib.error
try:
	urllib.request.urlopen("http://blog.csdn.net")
except urllib.error.URLError as e:
	if hasattr(e,"code"):
		print(e.code)
	if hasattr(e,"reason"):
		print(e.reason)
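
The hasattr() checks work because HTTPError is a subclass of URLError that additionally carries a code attribute. The sketch below illustrates this without any network access by constructing the two exception objects by hand. The helper describe_error and the example.invalid URL are assumptions added for illustration.

```python
import urllib.error

def describe_error(e):
    """Summarize a URLError or HTTPError using only the attributes it has.

    describe_error is a hypothetical helper, not part of urllib.
    """
    parts = []
    if hasattr(e, "code"):
        parts.append("code=%s" % e.code)
    if hasattr(e, "reason"):
        parts.append("reason=%s" % e.reason)
    return ", ".join(parts)

# Hand-built exceptions standing in for what urlopen() would raise.
http_err = urllib.error.HTTPError("http://example.invalid", 404, "Not Found", {}, None)
url_err = urllib.error.URLError("connection refused")

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
print(describe_error(http_err))  # code=404, reason=Not Found
print(describe_error(url_err))   # reason=connection refused
```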

These examples are for reference only.

Original article: http://blog.51cto.com/superleedo/2121859

Time: 2024-12-06 16:41:35
