Notes on reading OReilly.Web.Scraping.with.Python.2015.6 --- Crawl

1. The function calls itself, forming a loop in which each call leads to the next:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org"+pageUrl)
    bsObj = BeautifulSoup(html,"lxml")
    try:
        print(bsObj.h1.get_text())
        # Find the element with id="mw-content-text", then look up its "p" tags; [0] selects the first one
        print(bsObj.find(id="mw-content-text").findAll("p")[0])
        # Get the href value of the <a> tag inside the <span> tag inside the element with id="ca-edit"
        print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href'])
    except AttributeError:
        print("This page is missing something! No worries though!")

    for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks("")
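
Because every new link triggers another nested call, the recursive version can run into Python's default recursion limit (1000 frames) on a site with many pages. One possible alternative is to keep an explicit queue instead of recursing. The sketch below only illustrates that idea and is not the book's code: the `LINKS` dictionary is a made-up in-memory stand-in for real pages, and `crawl`, `get_links`, and `max_pages` are hypothetical names.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real pages,
# so the sketch runs without network access.
LINKS = {
    "/wiki/A": ["/wiki/B", "/wiki/C"],
    "/wiki/B": ["/wiki/A", "/wiki/D"],
    "/wiki/C": ["/wiki/D"],
    "/wiki/D": [],
}

def crawl(start, get_links, max_pages=100):
    """Visit pages breadth-first and return them in visiting order."""
    seen = {start}          # pages already queued, same role as the `pages` set
    queue = deque([start])  # pages waiting to be visited
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("/wiki/A", lambda page: LINKS.get(page, []))
print(visited)
```

In a real crawler, `get_links` would download the page and return the `href` values that match the `/wiki/` pattern, as `getLinks` does above.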

2. Process a URL by splitting its characters on "/":

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("https://hao.360.cn/?a1004")
print(addr)

Output:

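runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['https:', '', 'hao.360.cn', '?a1004']

There is nothing between the two slashes of "//", so the split yields an empty string ''. The "https:" part is still present because the function only strips "http://", not "https://".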

def splitAddress(address):
    addressParts = address.replace("http://", "").split("/")
    return addressParts

addr = splitAddress("http://www.autohome.com.cn/wuhan/#pvareaid=100519")
print(addr)

Output:

runfile('C:/Users/user/Desktop/chensimin.py', wdir='C:/Users/user/Desktop')
['www.autohome.com.cn', 'wuhan', '#pvareaid=100519']
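
The two runs above show that stripping "http://" by hand treats http and https URLs differently. As a rough alternative, the standard library's `urllib.parse.urlsplit` separates scheme, host, path, query, and fragment for either scheme. This is only a sketch: `split_address` is a hypothetical helper and does not come from the book.

```python
from urllib.parse import urlsplit

def split_address(address):
    """Return (host, non-empty path segments) for a URL with any scheme."""
    parts = urlsplit(address)
    segments = [s for s in parts.path.split("/") if s]
    return parts.netloc, segments

for url in ("https://hao.360.cn/?a1004",
            "http://www.autohome.com.cn/wuhan/#pvareaid=100519"):
    parts = urlsplit(url)
    # The query string and fragment are available as separate fields.
    print(split_address(url), repr(parts.query), repr(parts.fragment))
```

Unlike the manual split, the query string (`a1004`) and the fragment (`pvareaid=100519`) no longer end up mixed in with the path segments.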

Time: 2024-11-09 03:34:38
