Python爬虫抓取csdn博客

昨天晚上为了下载保存某位csdn大牛的全部博文，写了一个爬虫来自动抓取文章并保存到txt文本，当然也可以保存到html网页中。这样就可以不用Ctrl+C 和Ctrl+V了，非常方便，抓取别的网站也是大同小异。

为了解析抓取的网页，用到了第三方模块，BeautifulSoup，这个模块对于解析html文件非常有用，当然也可以自己使用正则表达式去解析，但是比较麻烦。

由于csdn网站的robots.txt文件中显示禁止任何爬虫，所以必须把爬虫伪装成浏览器，而且不能频繁抓取，得sleep一会再抓，使用频繁会被封ip的，但可以使用代理ip。

#-*- encoding: utf-8 -*-
'''
Created on 2014-09-18 21:10:39

@author: Mangoer
@email: [email protected]
'''

import urllib2
import re
from bs4 import BeautifulSoup
import random
import time

class CSDN_Blog_Spider:
     def __init__(self,url):

          print '\n'
          print('已启动网络爬虫。。。')
          print  '网页地址： ' + url

          user_agents = [
                    'Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11',
                    'Opera/9.25 (Windows NT 5.1; U; en)',
                    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
                    'Mozilla/5.0 (compatible; Konqueror/3.5; Linux) KHTML/3.5.5 (like Gecko) (Kubuntu)',
                    'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
                    'Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 GNUTLS/1.2.9',
                    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.7 (KHTML, like Gecko) Ubuntu/11.04 Chromium/16.0.912.77 Chrome/16.0.912.77 Safari/535.7",
                    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:10.0) Gecko/20100101 Firefox/10.0 ",
                   ]
          # use proxy ip
          # ips_list = ['60.220.204.2:63000','123.150.92.91:80','121.248.150.107:8080','61.185.21.175:8080','222.216.109.114:3128','118.144.54.190:8118',
          #           '1.50.235.82:80','203.80.144.4:80']

          # ip = random.choice(ips_list)
          # print '使用的代理ip地址： ' + ip

          # proxy_support = urllib2.ProxyHandler({'http':'http://'+ip})
          # opener = urllib2.build_opener(proxy_support)
          # urllib2.install_opener(opener)

          agent = random.choice(user_agents)

          req = urllib2.Request(url)
          req.add_header('User-Agent',agent)
          req.add_header('Host','blog.csdn.net')
          req.add_header('Accept','*/*')
          req.add_header('Referer','http://blog.csdn.net/mangoer_ys?viewmode=list')
          req.add_header('GET',url)
          html = urllib2.urlopen(req)
          page = html.read().decode('gbk','ignore').encode('utf-8')

          self.page = page
          self.title = self.getTitle()
          self.content = self.getContent()
          self.saveFile()

     def printInfo(self):
          print('文章标题是：   '+self.title + '\n')
          print('内容已经存储到out.txt文件中！')

     def getTitle(self):
          rex = re.compile('<title>(.*?)</title>',re.DOTALL)
          match = rex.search(self.page)
          if match:
                return match.group(1)

          return 'NO TITLE'

     def getContent(self):
          bs = BeautifulSoup(self.page)
          html_content_list = bs.findAll('div',{'id':'article_content','class':'article_content'})
          html_content = str(html_content_list[0])

          rex_p = re.compile(r'(?:.*?)>(.*?)<(?:.*?)',re.DOTALL)
          p_list = rex_p.findall(html_content)

          content = ''
          for p in p_list:
               if p.isspace() or p == '':
                    continue
               content = content + p
          return content

     def saveFile(self):

          outfile = open('out.txt','a')
          outfile.write(self.content)

     def getNextArticle(self):
          bs2 = BeautifulSoup(self.page)
          html_nextArticle_list = bs2.findAll('li',{'class':'prev_article'})
          # print str(html_nextArticle_list[0])
          html_nextArticle = str(html_nextArticle_list[0])
          # print html_nextArticle

          rex_link = re.compile(r'<a href=\"(.*?)\"',re.DOTALL)
          link = rex_link.search(html_nextArticle)
          # print link.group(1)

          if link:
               next_url = 'http://blog.csdn.net' + link.group(1)
               return next_url

          return None

class Scheduler:
     def __init__(self,url):
          self.start_url = url

     def start(self):
          spider = CSDN_Blog_Spider(self.start_url)
          spider.printInfo()

          while True:
               if spider.getNextArticle():
                    spider = CSDN_Blog_Spider(spider.getNextArticle())
                    spider.printInfo()
               elif spider.getNextArticle() == None:
                    print 'All article haved been downloaded!'
                    break

               time.sleep(10)

#url = input('请输入CSDN博文地址：')
url = "http://blog.csdn.net/mangoer_ys/article/details/38427979"

Scheduler(url).start()

程序中有个问题一直不能解决：不能使用标题去命名文件，所以所有的文章全部放在一个out.txt中，说的编码的问题，希望大神可以解决这个问题。

时间： 2024-07-30 14:23:14

Python爬虫抓取csdn博客的相关文章

脚本一: #!/usr/bin/env python #coding:utf-8 from bs4 import BeautifulSoup import urllib import re art = {} for page in range(1,5): page = str(page) url = 'http://yujianglei.blog.51cto.com/all/7215578/page/' + page response = urllib.urlopen(url).read

python爬虫抓取51cto博客大牛的文章保存到MySQL数据库

脚本实现:获取51cto网站某大牛文章的url,并存储到数据库中. #!/usr/bin/env python #coding:utf-8 from bs4 import BeautifulSoup import urllib import re import MySQLdb k_art_name = [] v_art_url = [] db = MySQLdb.connect('192.168.115.5','blog','blog','blog') cursor = db.cursor

Hello Python!用python写一个抓取CSDN博客文章的简单爬虫

网络上一提到python,总会有一些不知道是黑还是粉的人大喊着:python是世界上最好的语言.最近利用业余时间体验了下python语言,并写了个爬虫爬取我csdn上关注的几个大神的博客,然后利用leancloud一站式后端云服务器存储数据,再写了一个android app展示数据,也算小试了一下这门语言,给我的感觉就是,像python这类弱类型的动态语言相比于java来说,开发者不需要分太多心去考虑编程问题,能够把精力集中于业务上,思考逻辑的实现.下面分享一下我此次写爬虫的一下小经验,抛砖引玉

python 爬虫抓取心得

quanwei9958 转自 python 爬虫抓取心得分享 urllib.quote('要编码的字符串') 如果你要在url请求里面放入中文,对相应的中文进行编码的话,可以用: urllib.quote('要编码的字符串') query = urllib.quote(singername) url = 'http://music.baidu.com/search?key='+query response = urllib.urlopen(url) text = response.read()

python 爬虫抓取心得分享

/** author: insun title:python 爬虫抓取心得分享 blog:http://yxmhero1989.blog.163.com/blog/static/112157956201311821444664/ **/ 0x1.urllib.quote('要编码的字符串') 如果你要在url请求里面放入中文,对相应的中文进行编码的话,可以用: urllib.quote('要编码的字符串') query = urllib.quote(singername) url = 'h

抓取指定博客的内容

1.指定博客的地址周国平的博客地址:http://blog.sina.com.cn/s/articlelist_1193111400_0_1.html 打开上述链接,然后按F12,找到<a title="" target="_blank" href="http://blog.sina.com.cn/s/blog_471d6f680102x7cu.html">太现实的爱情算不上爱情</a> 2.代码的实现指定的网址为:h

Python爬虫抓取网页图片

本文通过python 来实现这样一个简单的爬虫功能,把我们想要的图片爬取到本地. 下面就看看如何使用python来实现这样一个功能. # -*- coding: utf-8 -*- import urllib import re import time import os #显示下载进度 def schedule(a,b,c): ''''' a:已经下载的数据块 b:数据块的大小 c:远程文件的大小 ''' per = 100.0 * a * b / c if per > 100 : per =

python爬虫抓取站长之家IP库，仅供练习用！

python爬虫抓取站长之家IP库,单线程的,仅供练习,IP库数据有43亿条,如果按此种方法抓取至少得数年,所以谨以此作为练手,新手代码很糙,请大家见谅. #!/usr/bin/python #coding=UTF-8 import urllib2 import re import os import csv import codecs user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' headers = { 'User-