Step by Step of "Web Scraping with Python" ----Richard Lawson ---3/n

When trying the sample code "link_crawler3.py", it always fails with the message below:

/usr/bin/python3 /home/cor/webscrappython/Web_Scraping_with_Python/chapter01/link_crawler3.py
Downloading:http://example.webscraping.com
Downloading--2
Downloading:http://example.webscraping.com
Downloading --- 5
{'User-agent': {'User-agent': 'GoodCrawler'}}
http://example.webscraping.com
Traceback (most recent call last):
  File "/home/cor/webscrappython/Web_Scraping_with_Python/chapter01/link_crawler3.py", line 150, in <module>
    link_crawler('http://example.webscraping.com', '/(index|view)', delay=0, num_retries=1, max_depth=1, user_agent='GoodCrawler')
  File "/home/cor/webscrappython/Web_Scraping_with_Python/chapter01/link_crawler3.py", line 36, in link_crawler
    html = download(url, headers, proxy=proxy, num_retries=num_retries)
  File "/home/cor/webscrappython/Web_Scraping_with_Python/chapter01/common.py", line 75, in download5
    htmlrsp = opener.open(requestnew)
  File "/usr/lib/python3.5/urllib/request.py", line 466, in open
    response = self._open(req, data)
  File "/usr/lib/python3.5/urllib/request.py", line 484, in _open
    '_open', req)
  File "/usr/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.5/urllib/request.py", line 1282, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.5/urllib/request.py", line 1254, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/lib/python3.5/http/client.py", line 1107, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python3.5/http/client.py", line 1147, in _send_request
    self.putheader(hdr, value)
  File "/usr/lib/python3.5/http/client.py", line 1083, in putheader
    if _is_illegal_header_value(values[i]):
TypeError: expected string or bytes-like object

I have searched the internet several times, and I think the code itself is right:

# imports used by download5 in common.py (reconstructed from the calls below)
from urllib import request
from urllib import parse as urlparse


def download5(url, user_agent='wswp', proxy=None, num_retries=2):
    """Download function with support for proxies"""
    print('Downloading:%s' % url)
    print('Downloading --- 5')
    headers = {'User-agent': user_agent}
    print(headers)
    print(url)
    requestnew = request.Request(url, headers=headers)
    opener = request.build_opener()
    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(request.ProxyHandler(proxy_params))
    try:
        # html = opener.open(requestnew).read().decode('utf-8')
        htmlrsp = opener.open(requestnew)
        html = htmlrsp.read().decode('utf-8')
    except request.URLError as e:
        print('Download error:%s' % e.reason)
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # retry 5XX HTTP errors
                html = download5(url, user_agent, proxy, num_retries - 1)
    return html
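Looking back at the traceback, line 36 of link_crawler3.py calls `download(url, headers, proxy=proxy, num_retries=num_retries)`, so the already-built headers dict lands in download5's `user_agent` parameter and gets wrapped a second time -- which matches the nested `{'User-agent': {'User-agent': 'GoodCrawler'}}` printed above. A minimal sketch (my own reproduction, not code from the book) showing that a dict header value triggers the same TypeError:

from urllib import request

# A dict as the header value, matching the nested headers printed by download5.
headers = {'User-agent': {'User-agent': 'GoodCrawler'}}
req = request.Request('http://example.webscraping.com', headers=headers)

# putheader() can neither .encode() a dict nor treat it as an int, so the
# unconverted dict reaches the header-value regex check and urllib raises
# "TypeError: expected string or bytes-like object" before anything is sent.
request.build_opener().open(req)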

Then we check "/usr/lib/python3.5/http/client.py", where the exception is raised:

# the patterns for both name and value are more lenient than RFC
# definitions to allow for backwards compatibility
_is_legal_header_name = re.compile(rb'[^:\s][^:\r\n]*').fullmatch
_is_illegal_header_value = re.compile(rb'\n(?![ \t])|\r(?![ \t\n])').search

    def putheader(self, header, *values):
        """Send a request header line to the server.

        For example: h.putheader('Accept', 'text/html')
        """
        if self.__state != _CS_REQ_STARTED:
            raise CannotSendHeader()

        if hasattr(header, 'encode'):
            header = header.encode('ascii')

        if not _is_legal_header_name(header):
            raise ValueError('Invalid header name %r' % (header,))

        values = list(values)
        for i, one_value in enumerate(values):
            if hasattr(one_value, 'encode'):
                values[i] = one_value.encode('latin-1')
            elif isinstance(one_value, int):
                values[i] = str(one_value).encode('ascii')

            if _is_illegal_header_value(values[i]):
                raise ValueError('Invalid header value %r' % (values[i],))

        value = b'\r\n\t'.join(values)
        header = header + b': ' + value
        self._output(header)
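Reading putheader, the value handling has three cases: a value with an `encode` method (a str) is converted to latin-1 bytes, an int is stringified and encoded, and anything else is handed to `_is_illegal_header_value` unconverted. A small sketch of those three cases against the same bytes pattern (my own illustration, not library code):

import re

_is_illegal_header_value = re.compile(rb'\n(?![ \t])|\r(?![ \t\n])').search

# A str value is encoded to bytes first, so the bytes pattern is satisfied:
print(_is_illegal_header_value('GoodCrawler'.encode('latin-1')))   # None -> legal

# An int value is stringified and encoded, same result:
print(_is_illegal_header_value(str(42).encode('ascii')))           # None -> legal

# A dict (or any other type) is passed through unconverted, and re.search()
# rejects it: TypeError: expected string or bytes-like object
_is_illegal_header_value({'User-agent': 'GoodCrawler'})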

Next, trying the value check with a plain str in a Python 3 shell:

>>> _is_illegal_header_value = re.compile(rb'\n(?![ \t])|\r(?![ \t\n])').search
>>> _is_illegal_header_value('identity')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot use a bytes pattern on a string-like object
>>> vl = 'identity'
>>> type(vl)
<class 'str'>
>>> _is_illegal_header_value(vl)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot use a bytes pattern on a string-like object
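So the regex itself is fine; the telling detail is the wording of the error. A str argument gives "cannot use a bytes pattern on a string-like object", while the traceback above says "expected string or bytes-like object" -- the message re raises when the value is neither str nor bytes, i.e. the nested dict that download5 printed. In other words, download5 looks right, but it was handed a prepared headers dict where a user-agent string was expected. A hedged sketch of a corrected call (assuming the header really only carries the user agent, as in the printed output):

# Hypothetical corrected call: pass the user-agent *string*, not a dict,
# so download5 builds a flat {'User-agent': 'GoodCrawler'} header itself.
html = download5('http://example.webscraping.com',
                 user_agent='GoodCrawler',
                 num_retries=1)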

  

