Python - Crawling the posts in a cnblogs category and saving them as docx

# coding: utf-8
# get_blog_details.py: collect the title, link and abstract of every post
# listed on a category page and store them in MySQL.
import requests
from bs4 import BeautifulSoup
import MySQLdb


def get_html(url):
    '''
    Fetch the page at url and return its HTML source (UTF-8 encoded).
    '''
    html = requests.get(url)
    content = html.text.encode('utf-8')
    return content


def get_blog_html_list(html_content):
    '''
    Extract the post entries from the HTML source, insert each one into
    the database, and return the last entry (empty if none was found).
    '''
    bs = BeautifulSoup(html_content, 'lxml')
    # Use class_ with a trailing underscore here; plain class would clash
    # with the reserved word class
    blog_list = bs.find_all('div', class_='entrylistItem')
    sub_dict = {}
    for item in blog_list:
        sub_dict = {}
        sub_dict['title'] = item.a.get_text()
        sub_dict['link'] = item.a.get('href')
        sub_dict['abstract'] = item.div.get_text()
        insert_to_db(sub_dict)
    return sub_dict


def insert_to_db(blog):
    '''
    Insert one entry (a dict with the keys title, link and abstract)
    into the blog_details table.
    '''
    conn = MySQLdb.connect(host='localhost', user='root', passwd='12345a',
                           db='pytest01', charset='utf8')
    cur = conn.cursor()

    try:
        cur.execute('insert into blog_details (title, link, abstract) values (%s, %s, %s)',
                    (blog['title'], blog['link'], blog['abstract']))
    except MySQLdb.Error:
        # Database errors (for example a duplicate primary key) are
        # silently ignored for now
        pass
    conn.commit()
    cur.close()
    conn.close()

# coding: utf-8
# get_file.py: download one post and save its text and pictures as a Word document.
# Written for Python 3; on Python 2 use urllib.urlretrieve instead.
from urllib.request import urlretrieve
import time

import requests
from bs4 import BeautifulSoup
from docx import Document
# The exception raised for hyperlinked pictures lives in docx.image
from docx.image.exceptions import UnrecognizedImageError

index = '0'


def get_blog_pic(pic_url, index):
    '''
    Download the picture at pic_url, save it in the temp_pic folder under
    the name index, and return the path of the saved file.
    '''
    print('download picture')
    path = 'D:\\workspace_Java\\cnBlogCrawler\\temp_pic\\' + index + '.png'
    try:
        urlretrieve(pic_url, path)
    except (AttributeError, TypeError, ValueError, IOError):
        # Fall back to a placeholder picture when the download fails
        path = r'C:\Users\Lisen\Desktop\error.jpg'
        print('download picture error, the link is %s' % pic_url)
    return path


def get_blog_body(url, title):
    # A global variable that is reassigned inside a function must be
    # declared with global
    global index
    print('download blog')
    print('%s %s' % (title, time.strftime('%Y-%m-%d %H:%M:%S')))
    file_path = 'D:\\workspace_Java\\cnBlogCrawler\\temp_doc\\' + title + '.docx'
    header = {'Accept': 'text/html,application/xhtml+xml,application/xml',
              'Accept-Encoding': 'gzip, deflate, sdch',
              'Accept-Language': 'zh-CN,zh;q=0.8',
              'Cache-Control': 'max-age=0',
              'Connection': 'keep-alive',
              'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.75 Safari/537.36'}
    # The headers must be passed by keyword; the second positional
    # argument of requests.get is params, not headers
    html = requests.get(url, headers=header).text
    bs = BeautifulSoup(html, 'lxml')
    print('html ready')
    # Get the body of the post from the HTML
    body = bs.find_all('p')
    # Create the document
    doc = Document()
    for each in body:
        # A p node with a child node a holds either a picture link or a
        # text link, so there are three cases to handle:
        # 1. no child node a: take the text of the paragraph
        # 2. a child node a without text: it is a picture
        # 3. a child node a with text: take the text of the link
        if each.a is not None:
            txt = each.a.get_text()
            if txt == '':
                print('pic')
                # Use a separate name so the post url is not overwritten
                pic_url = each.a.get('href')
                index = str(int(index) + 1)
                pic_name = get_blog_pic(pic_url, index)
                try:
                    doc.add_picture(pic_name)
                except UnrecognizedImageError:
                    pass
            else:
                print('URL')
                doc.add_paragraph(txt)
        else:
            print('txt')
            doc.add_paragraph(each.get_text())
    try:
        print('saving file')
        doc.save(file_path)
    except IOError:
        print('file saving error')
        print('error url is ' + url)

# __init__.py: entry point of the program.
import MySQLdb
import Crawler.get_file
import Crawler.get_blog_details
import time

if __name__ == '__main__':

    conn = MySQLdb.connect(host='localhost', user='root', passwd='12345a',
                           db='pytest01', charset='utf8')
    cur = conn.cursor()
    # Empty the table first, so that a repeated run does not try to
    # write rows whose primary key already exists
    cur.execute('truncate blog_details')
    url = r'http://www.cnblogs.com/omygod/category/460309.html'
    print('next parsing html')
    html = Crawler.get_blog_details.get_html(url)
    print('next parsing blog address and updating MySQL')
    # get_blog_html_list already inserts every entry into MySQL, so no
    # separate insert_to_db call is needed here
    Crawler.get_blog_details.get_blog_html_list(html)

    # Name the columns explicitly instead of relying on their order
    cur.execute('select title, link from blog_details')
    lines = cur.fetchall()
    for title, link in lines:
        Crawler.get_file.get_blog_body(link, title)
        print('done ' + time.strftime('%Y-%m-%d %H:%M:%S'))
        time.sleep(2)
        print('wait over')
    cur.close()
    conn.close()

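Original posts on cnblogs are all stored as essays (随笔), so we can parse one essay category to obtain the title, link and abstract of every post in it and store them in a MySQL database. The main reason for the database is that the information is kept persistently: when new posts appear later, they can be identified by comparison and downloaded directly. Each entry is then parsed on its own, and the text and pictures of the post are saved to a Word document.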
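The "compare with what is already stored and download only the new posts" idea could look roughly like the sketch below. It is not the code used in this post: it uses the standard-library sqlite3 module in place of MySQL so that it runs on its own, the helper name save_new_posts is hypothetical, and it assumes that link is the primary key of blog_details.

```python
import sqlite3


def save_new_posts(conn, posts):
    """Insert posts whose link is not stored yet; return the newly added ones."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blog_details ("
        "link TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
    )
    new_posts = []
    for post in posts:
        cur = conn.execute(
            "INSERT OR IGNORE INTO blog_details (title, link, abstract) "
            "VALUES (?, ?, ?)",
            (post["title"], post["link"], post["abstract"]),
        )
        # rowcount is 1 when the row was inserted, 0 when the link already existed
        if cur.rowcount == 1:
            new_posts.append(post)
    conn.commit()
    return new_posts


# In-memory database and made-up entries, for illustration only
conn = sqlite3.connect(":memory:")
first = [{"title": "A", "link": "http://example.com/a", "abstract": "..."}]
second = first + [{"title": "B", "link": "http://example.com/b", "abstract": "..."}]

added_first = save_new_posts(conn, first)    # first run: everything is new
added_second = save_new_posts(conn, second)  # second run: only the extra post
print([p["title"] for p in added_second])
```

Only the entries returned by save_new_posts would then need to be downloaded. With MySQL, a similar effect should be achievable with INSERT IGNORE, which would also make it unnecessary to empty the table before every run.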

The main packages used are: requests, BeautifulSoup, python-docx, MySQLdb

requests and BeautifulSoup are used to fetch the page source and to parse it;
python-docx is used to generate the Word document, see the manual: https://python-docx.readthedocs.io/en/latest/user/quickstart.html
MySQLdb is used to work with the MySQL database from Python, see this article on SegmentFault: https://segmentfault.com/a/1190000000709735

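The code consists of three parts:

Part 1 is get_blog_details.py, which collects the details of every post in the category and stores them in the MySQL database
Part 2 is get_file.py, which fetches each post by its address and generates the Word document
Part 3 is __init__.py, where the program starts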

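A few things to watch out for when writing the code:

To change a global variable inside a function, declare it in the function: global g_var
When matching the class attribute of a node with BeautifulSoup, class cannot be used directly and raises an error, because class is a reserved keyword; add _ after class, e.g.: blog_list = bs.find_all('div', class_ = 'entrylistItem')
To import another .py file from a .py file, first add an __init__.py file to the folder, and second, prefix the import with the name of the folder, e.g.: import Crawler.get_file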
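The first note can be seen in a few lines. This is a small sketch that mirrors the string counter index from get_file.py; the function names next_index_broken and next_index are made up for the illustration.

```python
index = '0'  # module-level counter, kept as a string as in get_file.py


def next_index_broken():
    # Without global, the assignment makes index a local name, so reading
    # it on the right-hand side raises UnboundLocalError
    index = str(int(index) + 1)
    return index


def next_index():
    global index  # rebind the module-level name instead
    index = str(int(index) + 1)
    return index


try:
    next_index_broken()
except UnboundLocalError as exc:
    print('without global:', type(exc).__name__)

print(next_index(), next_index())
```

Reading a global inside a function needs no declaration; only assigning to it does.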

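　　Current drawbacks:

The program runs slowly; I plan to learn about multithreaded crawlers later
The pictures in the Word document are too large to read comfortably; I plan to resize them with Inches from python-docx
While running the program just now I noticed an oversight in writing to the database table: a row with an existing primary key cannot be written again. This needs to be fixed; for now the table is emptied every time the post list is parsed
Without exception handling the program crashes easily. Exception handling really has to be added, otherwise there is no telling where the program will crash
While crawling I found that some posts contain tables and that some pictures are hyperlinks; tables will be handled in the next version, and hyperlinked pictures are treated as exceptions in this version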
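For the speed problem, one possible direction is a thread pool from the standard library. The sketch below is only an illustration under assumptions: download_post is a hypothetical stand-in for get_blog_body that simulates the network wait with a short sleep, and the links are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def download_post(post):
    """Hypothetical stand-in for get_blog_body(link, title)."""
    link, title = post
    time.sleep(0.2)  # simulate waiting for the network
    return title


posts = [('http://example.com/%d' % i, 'post %d' % i) for i in range(8)]

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields the results in the same order as the input
    finished = list(pool.map(download_post, posts))
elapsed = time.time() - start

print(finished)
print('elapsed: %.2f s' % elapsed)
```

With four workers the eight simulated downloads overlap instead of running one after another. Note that get_blog_body as written updates the global index, so it would need a lock or a per-post counter before being run in several threads, and the 2 second pause between requests would have to be reconsidered.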

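　　Takeaways:

　　1. Exception handling really, really has to be done, otherwise it gets painful

　　2. The analysis was incomplete; posts with tables or other kinds of content were not considered

　　3. Try to build a GUI to make the tool easier to use

　　4. Add a log file to make it easier to see where things went wrong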
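Points 1 and 4 could be combined with the standard logging module, roughly as sketched below. The function save_post, the logger name and the log file location are assumptions made up for this illustration; in the real program the try block would wrap the call that downloads and saves one post.

```python
import logging
import os
import tempfile

# Hypothetical log file location
log_path = os.path.join(tempfile.gettempdir(), 'crawler_demo.log')
logger = logging.getLogger('crawler_demo')
logger.setLevel(logging.INFO)
handler = logging.FileHandler(log_path, mode='w', encoding='utf-8')
handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
logger.addHandler(handler)


def save_post(title, should_fail=False):
    """Hypothetical stand-in for saving one post; fails on demand."""
    if should_fail:
        raise IOError('cannot write %s.docx' % title)
    return title + '.docx'


results = []
for title, fail in [('first', False), ('second', True), ('third', False)]:
    try:
        results.append(save_post(title, fail))
        logger.info('saved %s', title)
    except IOError:
        # logger.exception records the message together with the traceback
        logger.exception('failed to save %s', title)

handler.close()
with open(log_path, encoding='utf-8') as f:
    log_text = f.read()
print(log_text)
```

One failing post no longer stops the loop, and the log file shows which post failed and where.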

Time: 2024-11-03 05:44:08
