Python basic authentication (for web crawlers)

Reposted from http://www.voidspace.org.uk/python/articles/authentication.shtml

Reposting the original first; since it targets Python 2, a translation and a Python 3 port will follow.

Introduction

This tutorial aims to explain and illustrate what basic authentication is, and how to deal with it from Python. You can download the code from this tutorial from the Voidspace Python Recipebook.

The first example, So Let's Do It, shows how to do it manually. This illustrates how authentication works.

The second example, Doing it Properly, shows how to handle it automatically - with a handler.

These examples make use of the Python module urllib2. This provides a simple interface for fetching pages across the internet: the urlopen function. It also provides a more complex interface for specialised situations, in the form of openers and handlers. These often confuse even intermediate-level programmers. For a good introduction to using urllib2, read my urllib2 tutorial.

Basic Authentication

There is a system for requiring a username/password before a client can visit a webpage. This is called authentication and is implemented on the server. It allows a whole set of pages (called a realm) to be protected by authentication.

These schemes are defined by the HTTP spec, so whilst Python supports authentication, it doesn't document it very well. HTTP documentation comes in the form of RFCs [1], which are technical documents and not the most readable.

The two normal [2] authentication schemes are basic and digest authentication. Between these two, basic is overwhelmingly the most common. As you might guess, it is also the simpler of the two.

A summary of basic authentication goes like this:

  • client makes a request for a webpage
  • server responds with an error, requesting authentication
  • client retries request - with authentication details encoded in request
  • server checks details and sends the page requested, or another error

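The client's retry in the third step just adds an Authorization header to the repeated request. As a minimal sketch (written in Python 3, since a port is planned), the header value the client sends back can be built like this:

```python
import base64

def basic_auth_header(username, password):
    # Build the Authorization header the client sends on its retry:
    # "username:password" encoded as base64 under the "Basic" scheme.
    creds = "%s:%s" % (username, password)
    token = base64.b64encode(creds.encode("utf-8")).decode("ascii")
    return ("Authorization", "Basic %s" % token)

name, value = basic_auth_header("johnny", "secret")
print(name + ": " + value)
```

The server decodes the value after "Basic " and compares the username/password against its own records.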
The following sections cover these steps in more detail.

Making a Request

A client is any program that makes requests over the internet. It could be a browser - or it could be a python program. When a client asks for a web page, it is sending a request to a server. The request is made up of headers with information about the request. These are the 'http request headers'.

Getting A Response

When the request reaches the server it sends a response back. The request may still fail (the page may not be found for example), but the response will still contain headers from the server. These are 'http response headers'.

If there is a problem then this response will include an error code that describes the problem. You will already be familiar with some of these codes - 404 : Page not found, 500 : Internal Server Error, etc. If this happens, an exception [3] will be raised by urllib2, and it will have a 'code' attribute. The code attribute is an integer that corresponds to the http error code [4].

Error 401 and realms

If a page requires authentication then the error code is 401. Included in the response headers will be a 'WWW-Authenticate' header. This tells us the authentication scheme the server is using for this page and also something called a realm. It is rarely just a single page that is protected by authentication but a section - a 'realm' of a website. The name of the realm is included in this header line.

The 'WWW-Authenticate' header line looks like WWW-Authenticate: SCHEME realm="REALM".

For example, if you try to access the popular website admin application cPanel, your browser will be sent a header that looks like: WWW-Authenticate: Basic realm="cPanel"

If the client already knows the username/password for this realm then it can encode them into the request headers and try again. If the username/password combination is correct, then the request will succeed as normal. If the client doesn't know the username/password it should ask the user. This means that if you enter a protected 'realm' the client effectively has to request each page twice. The first time it will get an error code and be told what realm it is attempting to access - the client can then get the right username/password for that realm (on that server) and repeat the request.

HTTP is a 'stateless' protocol. This means that a server using basic authentication won't 'remember' you are logged in [5] and will need to be sent the right header for every protected page you attempt to access.

First Example

Suppose we attempt to fetch a webpage protected by basic authentication:

import urllib2

theurl = 'http://www.someserver.com/somepath/someprotectedpage.html'
req = urllib2.Request(theurl)
try:
    handle = urllib2.urlopen(req)
except IOError, e:
    if hasattr(e, 'code'):
        if e.code != 401:
            print 'We got another error'
            print e.code
        else:
            print e.headers
            print e.headers['www-authenticate']

Note

If the exception has a 'code' attribute it also has an attribute called 'headers'. This is a dictionary-like object with all the headers in - but you can also print it to display all the headers. See the last line, which displays the 'www-authenticate' header line that ought to be present whenever you get a 401 error.

A typical output from the above example looks like:

WWW-Authenticate: Basic realm="cPanel"
Connection: close
Set-Cookie: cprelogin=no; path=/
Server: cpsrvd/9.4.2

Content-type: text/html

Basic realm="cPanel"

You can see the authentication scheme and the 'realm' part of the 'www-authenticate' header. Assuming you know the username and password, you can then navigate around that website - whenever you get a 401 error with the same realm you can just encode the username/password into your request headers and your request should succeed.
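For the planned Python 3 port, the same 401 handling uses urllib.request and urllib.error.HTTPError, which carries the same code and headers attributes. A sketch, using a hand-built HTTPError to stand in for a real 401 response (the URL and realm values are hypothetical):

```python
import email.message
import urllib.error

def auth_challenge(err):
    # Return the WWW-Authenticate challenge from a 401 HTTPError,
    # or None if the error was something other than a 401.
    if err.code != 401:
        return None
    return err.headers.get("www-authenticate")

# Build a fake 401 response to demonstrate (hypothetical values):
hdrs = email.message.Message()
hdrs["WWW-Authenticate"] = 'Basic realm="cPanel"'
err = urllib.error.HTTPError(
    "http://www.someserver.com/", 401, "Unauthorized", hdrs, None)
print(auth_challenge(err))  # Basic realm="cPanel"
```

In real use the HTTPError would come from a urllib.request.urlopen call inside a try/except, exactly as in the Python 2 example above.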

The Username/Password

Let's assume you need to access pages which are all in the same realm. Assuming you have got the username and password from the user, you can extract the realm from the header. Then whenever you get a 401 error in the same realm you know the username and password to use. So the only detail left is knowing how to encode the username/password into the request header. This is done by encoding it as a base64 string. It doesn't actually look like clear text - but it is only the vaguest of 'encryption'. This means basic authentication is just that - basic. Anyone sniffing your traffic who sees an authentication request header will be able to extract your username and password from it. Many websites, like Yahoo or eBay, use javascript hashing/encryption and other tricks to authenticate a login. This is much harder to detect and mimic from Python! You may need to use a proxy server and see what information your browser is actually sending to the website [6].

base64

There is a very simple base64 recipe over on the Activestate Python Cookbook (it's actually in the comments of that page). It shows how to encode a username/password into a request header. It goes like this:

import base64
base64string = base64.encodestring('%s:%s' % (username, password))[:-1]
req.add_header("Authorization", "Basic %s" % base64string)

Where req is our request object, as in the first example.
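A note for the planned Python 3 port: base64.encodestring is deprecated and later removed in Python 3. A sketch of the same encoding with base64.b64encode, which works on bytes and appends no trailing newline, so the [:-1] slice is unnecessary:

```python
import base64

username = 'johnny'
password = 'XXXXXX'

# b64encode takes bytes and returns bytes with no trailing newline,
# so no [:-1] slice is needed.
base64string = base64.b64encode(
    ('%s:%s' % (username, password)).encode('ascii')).decode('ascii')
authheader = "Basic %s" % base64string
print(authheader)
```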

So Let's Do It

Let's wrap all this up with an example that shows accessing a page, extracting the realm, then doing the authentication. We'll use a regular expression to pull the scheme and realm out of the response header:

import urllib2
import sys
import re
import base64
from urlparse import urlparse

theurl = 'http://www.someserver.com/somepath/somepage.html'
# if you want to run this example you'll need to supply
# a protected page with your username and password

username = 'johnny'
password = 'XXXXXX'            # a very bad password

req = urllib2.Request(theurl)
try:
    handle = urllib2.urlopen(req)
except IOError, e:
    # here we *want* to fail
    pass
else:
    # If we don't fail then the page isn't protected
    print "This page isn't protected by authentication."
    sys.exit(1)

if not hasattr(e, 'code') or e.code != 401:
    # we got an error - but not a 401 error
    print "This page isn't protected by authentication."
    print 'But we failed for another reason.'
    sys.exit(1)

authline = e.headers['www-authenticate']
# this gets the www-authenticate line from the headers
# which has the authentication scheme and realm in it

authobj = re.compile(
    r'''(?:\s*www-authenticate\s*:)?\s*(\w*)\s+realm=['"]([^'"]+)['"]''',
    re.IGNORECASE)
# this regular expression is used to extract scheme and realm
matchobj = authobj.match(authline)

if not matchobj:
    # if the authline isn't matched by the regular expression
    # then something is wrong
    print 'The authentication header is badly formed.'
    print authline
    sys.exit(1)

scheme = matchobj.group(1)
realm = matchobj.group(2)
# here we've extracted the scheme
# and the realm from the header
if scheme.lower() != 'basic':
    print 'This example only works with BASIC authentication.'
    sys.exit(1)

base64string = base64.encodestring(
                '%s:%s' % (username, password))[:-1]
authheader = "Basic %s" % base64string
req.add_header("Authorization", authheader)
try:
    handle = urllib2.urlopen(req)
except IOError, e:
    # here we shouldn't fail if the username/password is right
    print "It looks like the username or password is wrong."
    sys.exit(1)
thepage = handle.read()

When the code has run, the contents of the page we've fetched is saved as a string in the variable 'thepage'. A simpler version of the regular expression used here would match the realm with (\w+); that doesn't work where there is a space in the realm, which you can fix by replacing \w+ with [^'"]+. This gives us the regular expression used above:

r'''(?:\s*www-authenticate\s*:)?\s*(\w*)\s+realm=['"]([^'"]+)['"]'''
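To check the fixed expression, you can exercise it on a couple of header lines (shown in Python 3 here; re behaves the same in both versions). The realm values are made up:

```python
import re

authobj = re.compile(
    r'''(?:\s*www-authenticate\s*:)?\s*(\w*)\s+realm=['"]([^'"]+)['"]''',
    re.IGNORECASE)

# With the header name present, as urllib2 returns it:
m = authobj.match('WWW-Authenticate: Basic realm="cPanel"')
print(m.group(1), m.group(2))  # Basic cPanel

# A realm containing a space, which the \w+ version would miss:
m2 = authobj.match('Basic realm="My Site"')
print(m2.group(1), m2.group(2))  # Basic My Site
```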

Warning

If you are writing an HTTP client of any sort that has to deal with basic authentication, don't do it this way. The next example, which uses a handler, shows the right way of doing it.

Doing it Properly

In actual fact the proper way to do BASIC authentication with Python is to install an opener that uses an authentication handler. The authentication handler needs a password manager - and then you're away.

Every time you use urlopen you are using handlers to deal with your request - whether you know it or not. The default opener has handlers for all the standard situations installed [7]. What we need to do is create an opener that has a handler that can deal with basic authentication. The right handler for our needs is called urllib2.HTTPBasicAuthHandler. As I mentioned, it also needs a password manager - urllib2.HTTPPasswordMgr.

Unfortunately our friend HTTPPasswordMgr has a slight problem - you must already know the realm you're fetching. Luckily it has a near cousin, HTTPPasswordMgrWithDefaultRealm. Despite the keyboard-busting name, it's a bit more friendly to use. If you don't know the name of the realm, pass in None for the realm, and it will try the username and password you give it - whatever the realm. Seeing as you are going to specify a specific URL, it is likely that this will be sufficient. If you aren't convinced, you can always use HTTPPasswordMgr and extract the realm from the authentication header the first time you meet it.

This example goes through the following steps:

  • Establish the top-level URL, username and password
  • Create our password manager (with default realm)
  • Give the password to the manager
  • Create the handler with the manager
  • Create an opener with the handler installed

At this point we have a choice. We can either use the open method of the opener directly, which leaves urllib2.urlopen using the default opener. Alternatively, we can make our opener the default one, which means all future calls to urlopen will use this opener. As all openers have the default handlers installed as well as the ones you pass in, it shouldn't break urlopen to do this. In the example below we install it, making it the default opener:

import urllib2

theurl = 'http://www.someserver.com/toplevelurl/somepage.htm'
username = 'johnny'
password = 'XXXXXX'
# a great password

passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
# this creates a password manager
passman.add_password(None, theurl, username, password)
# because we have put None at the start it will always
# use this username/password combination for urls
# for which `theurl` is a super-url

authhandler = urllib2.HTTPBasicAuthHandler(passman)
# create the AuthHandler

opener = urllib2.build_opener(authhandler)

urllib2.install_opener(opener)
# All calls to urllib2.urlopen will now use our handler
# Make sure not to include the protocol in with the URL, or
# HTTPPasswordMgrWithDefaultRealm will be very confused.
# You must (of course) use it when fetching the page though.

pagehandle = urllib2.urlopen(theurl)
# authentication is now handled automatically for us
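For the planned Python 3 port, the same names live in urllib.request. This sketch (the URL and credentials are hypothetical, as above) sets up the opener the same way; the commented-out urlopen call is where the authenticated fetch would happen:

```python
import urllib.request

theurl = 'http://www.someserver.com/toplevelurl/somepage.htm'  # hypothetical
username = 'johnny'
password = 'XXXXXX'

passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
# None as the realm means: use this username/password for any realm
# under theurl (including the scheme in the URL is fine in Python 3).
passman.add_password(None, theurl, username, password)

authhandler = urllib.request.HTTPBasicAuthHandler(passman)
opener = urllib.request.build_opener(authhandler)
urllib.request.install_opener(opener)
# All calls to urllib.request.urlopen would now use our handler:
# pagehandle = urllib.request.urlopen(theurl)

# The manager falls back to the default-realm entry for any realm:
print(passman.find_user_password('cPanel', theurl))
```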

Hurrah - not so bad, hey?

A Word About Cookies

Some websites may also use cookies alongside authentication. Luckily there is a library that allows automatic cookie management without having to think about it: ClientCookie. In Python 2.4 it became part of the Python standard library as cookielib. See my article on cookielib for an example of how to use it.


Footnotes

[1] http://www.faqs.org/rfcs/rfc2617.html is the RFC that describes basic and digest authentication
[2] There is also a M$ proprietary authentication scheme called NTLM, but it's usually found on intranets - I've never had to deal with it live on the web.
[3] An HTTPError, which is a subclass of IOError
[4] See http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html for a full list of error codes
[5] Using cookies the server may well have details of your session - but you will still need to authenticate each request; state management is a separate subject.
[6] See this comp.lang.python thread for suggestions of several proxy servers that can do this.
[7] See the urllib2 tutorial for a slightly more detailed discussion of openers and handlers.
