A Small Node Crawler

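I had been wanting to play with crawlers anyway, and then I came across http://www.imooc.com/video/7965, a video on writing a crawler in Node, which saved me from working it all out myself. The HTML scraped in the video differs from what the site serves now, but that does not matter much.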

To brush up on the fs module, I also wrote the scraped data out in a format of my own.

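The script scrapes the page, reshapes the parts I need into my own format, and writes the result to a txt file. Later the same data could just as well go into a database; do whatever you like with it. Below is the scraped output file.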

You are probably itching to see the source code.

Here it is:

-------------------------------------------------------------------- divider

var http = require('http');
var fs = require('fs');
var util = require('util');
var cheerio = require('cheerio');

var imooc_url = 'http://www.imooc.com/learn/348';

// Append each chapter title, then its videos (tab-indented), to crawler.txt.
function printCourseInfo(courseData) {
  //console.log(util.inspect(courseData));
  courseData.forEach(function(item) {
    var chapterTitle = item.chapterTitle;
    fs.appendFileSync('crawler.txt', chapterTitle);
    fs.appendFileSync('crawler.txt', '\n');
    item.videos.forEach(function(video) {
      var v = ' \t 【 ' + video.id + ' 】 ' + video.title;
      fs.appendFileSync('crawler.txt', v);
      fs.appendFileSync('crawler.txt', '\n');
    });
  });
}

http.get(imooc_url, function(res) {
  var html = '';
  // Decode as UTF-8 so multi-byte characters split across chunks are not corrupted.
  res.setEncoding('utf8');
  res.on('data', function(data) {
    html += data;
  });
  res.on('end', function() {
    // filterChapters (described below) saves the page and builds the chapter data.
    var courseData = filterChapters(html);
    printCourseInfo(courseData);
  });
}).on('error', function() {
  console.log('Request failed');
});

-------------------------------------------------------------------------- divider


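The script first requests an imooc URL, accumulates the response body in a variable, passes it to the chapter-filtering function, and then prints the result.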

Inside the filter function, I first save the page as an HTML file so that I can inspect it easily.

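// fs.stat reports an error when crawler.html does not exist yet; only then write it.
fs.stat('crawler.html', function(err, stats) {
  if (err) {
    fs.writeFile('crawler.html', html, function(err) {
      if (err) return console.log(err);
      console.log('Write succeeded!');
    });
  } else {
    console.log('File already exists');
  }
});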
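
As a variation on the check above, here is a minimal sketch that lets the write itself detect an existing file through the 'wx' flag, so there is no gap between checking and writing. The helper name saveOnce and the temporary demo path are made up for illustration; this is not the code the script actually uses.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Hypothetical helper: write `contents` to `filePath` only if the file
// does not exist yet. Returns true when the file was created, false when
// it was already there.
function saveOnce(filePath, contents) {
  try {
    // 'wx' opens the file for writing but fails with EEXIST if the path exists.
    fs.writeFileSync(filePath, contents, { flag: 'wx' });
    return true;
  } catch (err) {
    if (err.code === 'EEXIST') return false;
    throw err; // anything else (permissions, missing directory) is a real error
  }
}

// Demo in a fresh temporary directory so nothing in the project is touched.
var demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'crawler-'));
var demoFile = path.join(demoDir, 'crawler.html');
console.log(saveOnce(demoFile, '<html>first</html>'));  // true
console.log(saveOnce(demoFile, '<html>second</html>')); // false
```

The second call leaves the first snapshot untouched, which matches the "file already exists" branch above.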

Next, load the HTML with cheerio and build the data you need. It works much like jQuery, so it needs little explanation.
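
The post does not show the data-building part of filterChapters, so the following is only a rough sketch of what it might look like. The selectors are assumptions read off the chapter markup sample later in this post (one ul.video list per chapter, with the chapter's h3 as a preceding sibling), and the step that saves crawler.html is left out. It returns the chapterTitle / videos shape that printCourseInfo expects.

```javascript
// A sketch only: the selectors are assumptions based on the markup sample
// in this post, not the author's actual implementation.
function filterChapters(html) {
  // Required inside the function so the sketch stands alone; the script
  // in this post already requires cheerio at the top of the file.
  var cheerio = require('cheerio');
  var $ = cheerio.load(html);
  var courseData = [];

  // Assumed: each chapter has exactly one <ul class="video">.
  $('ul.video').each(function () {
    var list = $(this);
    // Assumed: the chapter's <h3> is the nearest preceding sibling of the list.
    var strong = list.prevAll('h3').first().find('strong');
    // Remove the nested icon and description elements from a copy,
    // leaving only the bare chapter title text.
    var chapterTitle = strong.clone().children().remove().end().text().trim();

    var videos = [];
    list.find('li[data-media-id]').each(function () {
      var li = $(this);
      videos.push({
        id: li.attr('data-media-id'),
        // Collapse the line breaks and indentation inside the link text.
        title: li.find('a.J-media-item').text().replace(/\s+/g, ' ').trim()
      });
    });

    courseData.push({ chapterTitle: chapterTitle, videos: videos });
  });

  return courseData;
}
```

With this sketch the video title keeps the duration that sits inside the link, for example "1-1 Node.js基础-前言 (01:20)"; strip it with another replace if you only want the name.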

When printing, the data is appended synchronously to a txt file:

courseData.forEach(function(item) {
  var chapterTitle = item.chapterTitle;
  fs.appendFileSync('crawler.txt', chapterTitle);
  fs.appendFileSync('crawler.txt', '\n');
  item.videos.forEach(function(video) {
    var v = ' \t 【 ' + video.id + ' 】 ' + video.title;
    fs.appendFileSync('crawler.txt', v);
    fs.appendFileSync('crawler.txt', '\n');
  });
});
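
One thing to keep in mind with appending is that running the script twice doubles the contents of crawler.txt. A possible variation, sketched here with a made-up helper name (formatCourseInfo) and sample data taken from the markup in this post, builds the whole text in memory first so it can be written in a single call:

```javascript
var fs = require('fs');

// Hypothetical helper: build the report as one string, using the same
// layout as the loop above (a chapter title line, then one tab-indented
// line per video).
function formatCourseInfo(courseData) {
  var lines = [];
  courseData.forEach(function (item) {
    lines.push(item.chapterTitle);
    item.videos.forEach(function (video) {
      lines.push(' \t 【 ' + video.id + ' 】 ' + video.title);
    });
  });
  return lines.join('\n') + '\n';
}

// Sample data in the shape the loop above expects.
var sample = [{
  chapterTitle: '第1章前言',
  videos: [
    { id: '6687', title: '1-1 Node.js基础-前言' },
    { id: '6688', title: '1-2 为什么学习Nodejs' }
  ]
}];

console.log(formatCourseInfo(sample));
// To save it, replace the file instead of appending to it:
// fs.writeFileSync('crawler.txt', formatCourseInfo(sample));
```

Because fs.writeFileSync replaces the file, re-running the crawler would then produce the same crawler.txt each time.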

// If imooc changes this page again, the part that builds the data will need updating.

The chapter markup currently looks like this:

<span class="icon-drop_down js-close"></span>
<strong>
<i class="icon-chapter"></i>
第1章前言
<div class="icon-info chapter-info">
<i class="icon-drop_up triangle">
<div class="chapter-introubox">
<div class="chapter-content">带你了解为什么要学习 Nodejs。</div>
</div>
</i>
</div>
</strong>

</h3>


<ul class="video">
<li data-media-id="6687">
<a href='/video/6687' class="J-media-item">
<i class="icon-video type"></i>
1-1 Node.js基础-前言
(01:20)

</a>


</li>
<li data-media-id="6688">
<a href='/video/6688' class="J-media-item">
<i class="icon-video type"></i>
1-2 为什么学习Nodejs
(05:43)

Time: 2024-10-13 22:31:41
