Implementing a Simple Web Crawler in Python

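When Google's founders wrote their first crude crawler in Python and ran it on equally crude servers,
few people could have imagined how, over the following decades, they would upend the Internet and even the human world.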

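Wherever there is a network, there are crawlers, also called spiders. A crawler is a program that fetches data from websites. For example, a program can periodically scrape data from sites such as Baidu Nuomi and Dianping, store it in a database, and put a display page on top of it, and a group-buying directory site is born. Crawlers are without doubt the initial data source for many websites.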

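1. Implementing the first crawler function

Getting the URL of the first article in the blog's article list

First import the urllib module and download the list page as a string. Then use the string find function three times: once to locate the first '<a title=' tag, once to locate the 'href=' that follows it, and once to locate the '.html' that ends the link. Slicing the string between those two positions gives the URL we need.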

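#!/usr/bin/env python
# Python 2: urllib.urlopen and the print statement do not exist in Python 3
import urllib

# Download the article-list page as one string
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')     # start of the first article link
href = con.find(r'href=', title)   # its href attribute
html = con.find(r'.html', href)    # end of the URL
# Skip 'href="' (6 characters) and keep '.html' (5 characters)
url = con[href + 6:html + 5]
print url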
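
For readers on Python 3, the same find-and-slice idea can be sketched as a small function that works on a string already in memory. This is only an illustration under assumptions: the function name first_article_url and the sample markup are made up here, and the real page's HTML may look different.

```python
def first_article_url(page):
    """Return the first article URL in `page`, or None if no link is found."""
    title = page.find('<a title=')
    if title == -1:
        return None
    href = page.find('href=', title)
    if href == -1:
        return None
    html = page.find('.html', href)
    if html == -1:
        return None
    # len('href="') == 6 skips the attribute name and the opening quote;
    # len('.html') == 5 keeps the extension in the result.
    return page[href + len('href="'):html + len('.html')]


# Hypothetical sample markup, not copied from the real page
sample = ('<a title="Hello" target="_blank" '
          'href="http://blog.example.com/s/blog_01.html">Hello</a>')
print(first_article_url(sample))
```

Checking each find result for -1 before slicing avoids returning a meaningless slice when the page contains no matching link.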

2. Getting the URLs of all articles on the first page of the article list

A:

#!/usr/bin/env python
# Python 2
import urllib

url = [''] * 40   # room for up to 40 article URLs
i = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
# Keep going until a search fails (find returns -1) or the list is full
while title != -1 and href != -1 and html != -1 and i < 40:
    url[i] = con[href + 6:html + 5]
    print url[i]
    # Look for the next link, starting after the URL just extracted
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1

Or B:

#!/usr/bin/env python
# Python 2
import urllib

i = 0
con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_1.html').read()
title = con.find(r'<a title=')
href = con.find(r'href=', title)
html = con.find(r'.html', href)
while title != -1 and href != -1 and html != -1 and i < 50:
    # Print the current link before searching for the next one,
    # so the first link is not skipped and no empty line is printed at the end
    url = con[href + 6:html + 5]
    print url
    title = con.find(r'<a title=', html)
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    i = i + 1
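
The loop in both variants can also be sketched in Python 3 as a function that returns a list instead of filling a fixed-size array. The name article_urls, the limit parameter, and the two-link sample string are assumptions made for this illustration.

```python
def article_urls(page, limit=40):
    """Collect up to `limit` article URLs from `page` with repeated find calls."""
    urls = []
    pos = 0
    while len(urls) < limit:
        title = page.find('<a title=', pos)
        if title == -1:
            break
        href = page.find('href=', title)
        if href == -1:
            break
        html = page.find('.html', href)
        if html == -1:
            break
        urls.append(page[href + 6:html + 5])
        pos = html  # continue searching after the URL just extracted
    return urls


# Hypothetical sample markup with two links
two_links = ('<a title="One" href="http://blog.example.com/s/blog_01.html">One</a>'
             '<a title="Two" href="http://blog.example.com/s/blog_02.html">Two</a>')
found = article_urls(two_links)
for u in found:
    print(u)
```

A growing list removes the need to guess the array size in advance, and the limit still caps how many links are collected.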

3. Downloading all articles on the first page of the article list

A:

#!/usr/bin/env python
# Python 2
import time
import urllib

i = 0
j = 0
url = [''] * 40
con = urllib.urlopen('http://www.zhihu.com/collection/19668036').read()
target = con.find(r'<a target="_blank')
base = con.find(r'href=', target)
end = con.find('>', base)
# Step 1: collect up to 20 links; '<a target="_blank" href="' is 25 characters long
while target != -1 and base != -1 and end != -1 and i < 20:
    url[i] = 'http://www.zhihu.com' + con[target + 25:end - 1]
    print url[i]
    target = con.find(r'<a target="_blank', end)
    base = con.find(r'href=', target)
    end = con.find('>', base)
    i = i + 1
# Step 2: download every collected link; the zhihu/ directory must already exist
while j < i:
    content = urllib.urlopen(url[j]).read()
    # A full URL contains '/', so use only its last path segment as the file name
    open(r'zhihu/' + url[j].split('/')[-1] + '.html', 'w+').write(content)
    print 'downloading', url[j]
    j = j + 1
    time.sleep(15)   # pause between requests

Or B:

#!/usr/bin/env python
# Python 2
import time
import urllib

i = 0
url = [''] * 30
name = [''] * 30
con = urllib.urlopen('http://www.zhihu.com/collection/19668036').read()
target = con.find(r'<a target="_blank')
base = con.find(r'href=', target)
end = con.find('>', base)
# Find each link and download it right away, for at most 30 links
while target != -1 and base != -1 and end != -1 and i < 30:
    url[i] = 'http://www.zhihu.com' + con[target + 25:end - 1]
    # Skip 16 characters after the start of href= (this assumes links of the
    # form href="/question/...") so that only the trailing ID remains
    name[i] = con[base + 16:end - 1]
    target = con.find(r'<a target="_blank', end)
    base = con.find(r'href=', target)
    end = con.find('>', base)
    content = urllib.urlopen(url[i]).read()
    # The zhihu/ directory must already exist
    open(r'zhihu/' + name[i] + '.html', 'w+').write(content)
    print 'downloading', name[i]
    time.sleep(5)
    i = i + 1
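
The download step can be sketched in Python 3 as below. This is a rough sketch rather than a drop-in replacement: the names fetch and save_pages, the fetcher and delay parameters, and the question ID 123 in the demo are all assumptions for illustration. The demo passes a fake fetcher so it runs without touching the network.

```python
import os
import tempfile
import time
from urllib.request import urlopen


def fetch(url):
    """Download one page as bytes (Python 3 counterpart of urllib.urlopen(url).read())."""
    with urlopen(url) as response:
        return response.read()


def save_pages(urls, out_dir, fetcher=fetch, delay=5):
    """Save each URL under out_dir, named after the last path segment of the URL."""
    os.makedirs(out_dir, exist_ok=True)  # create the target directory if missing
    saved = []
    for n, url in enumerate(urls):
        name = url.rstrip('/').split('/')[-1] + '.html'
        path = os.path.join(out_dir, name)
        with open(path, 'wb') as f:       # 'with' closes the file even on errors
            f.write(fetcher(url))
        saved.append(path)
        if n < len(urls) - 1:
            time.sleep(delay)             # pause between requests
    return saved


# Demo with a fake fetcher and a hypothetical URL, so no request is sent
demo_dir = tempfile.mkdtemp()
paths = save_pages(['http://www.zhihu.com/question/123'], demo_dir,
                   fetcher=lambda url: b'<html></html>', delay=0)
print(paths)
```

Passing the fetcher in as a parameter keeps the file-naming and saving logic easy to try out offline; calling save_pages without it falls back to a real download.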

4. Downloading all articles

A:

# Python 2
import time
import urllib

page = 1
url = [''] * 350   # room for up to 350 article URLs
i = 0
link = 1
# Step 1: walk list pages 1 to 7 and collect every article URL
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        link = link + 1
        i = i + 1
    else:
        print 'find end!'
    page = page + 1
else:
    print 'all find end'
# Step 2: download every URL collected above; the tmp/ directory must already exist
j = 0
while j < i:
    content = urllib.urlopen(url[j]).read()
    # The last 26 characters of the URL serve as the file name
    open(r'tmp/' + url[j][-26:], 'w+').write(content)
    j = j + 1
    time.sleep(5)
else:
    print 'Download over!'

B:

#!/usr/bin/env python
# Python 2
import time
import urllib

i = 0
link = 1
page = 1
url = [''] * 350
# Walk list pages 1 to 7 and download each article as soon as its URL is found
while page <= 7:
    con = urllib.urlopen('http://blog.sina.com.cn/s/articlelist_1191258123_0_' + str(page) + '.html').read()
    title = con.find(r'<a title=')
    href = con.find(r'href=', title)
    html = con.find(r'.html', href)
    while title != -1 and href != -1 and html != -1 and i < 350:
        url[i] = con[href + 6:html + 5]
        print link, url[i]
        title = con.find(r'<a title=', html)
        href = con.find(r'href=', title)
        html = con.find(r'.html', href)
        content = urllib.urlopen(url[i]).read()
        # The /tmp/sina/ directory must already exist
        open(r'/tmp/sina/' + url[i][-26:], 'w+').write(content)
        time.sleep(5)
        link = link + 1
        i = i + 1
    page = page + 1
else:
    print 'Download Over!'
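
The paging logic can be sketched in Python 3 with a for loop over page numbers. The names LIST_URL, links_in and crawl, and the fetch_text parameter (a caller-supplied function that returns a page as decoded text) are assumptions for this sketch. The demo feeds it two made-up pages from a dictionary instead of the network.

```python
import time

# Same list-page pattern as in the scripts above, with %d for the page number
LIST_URL = 'http://blog.sina.com.cn/s/articlelist_1191258123_0_%d.html'


def links_in(page):
    """Yield each article URL found by the find-and-slice scan."""
    pos = page.find('<a title=')
    while pos != -1:
        href = page.find('href=', pos)
        html = page.find('.html', href) if href != -1 else -1
        if href == -1 or html == -1:
            return
        yield page[href + 6:html + 5]
        pos = page.find('<a title=', html)


def crawl(fetch_text, first=1, last=7, delay=5):
    """Collect article URLs from list pages first..last, pausing between pages."""
    found = []
    for page_no in range(first, last + 1):
        found.extend(links_in(fetch_text(LIST_URL % page_no)))
        if page_no < last:
            time.sleep(delay)
    return found


# Demo: two hypothetical list pages served from a dictionary
pages = {
    LIST_URL % 1: '<a title="a" href="http://blog.example.com/a.html">a</a>',
    LIST_URL % 2: '<a title="b" href="http://blog.example.com/b.html">b</a>',
}
result = crawl(pages.get, first=1, last=2, delay=0)
print(result)
```

Separating link extraction (links_in) from paging (crawl) means the page range and delay can change without touching the parsing code.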

Time: 2024-10-12 11:57:16
