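1. Introduction
The project needs a crawler that can also provide personalized information retrieval and push, so I looked at the various crawler frameworks available. The one that appealed to me most was this:
Building a search engine with Nutch + MongoDB + ElasticSearch + Kibana
The original English article: http://www.aossama.com/search-engine-with-apache-nutch-mongodb-and-elasticsearch/
I considered using docker to stand the system up for testing.
The docker sources were: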
https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
https://store.docker.com/community/images/pure/nutch-mongo
However, downloading the docker images was far too slow, so I gave up on docker.
Setting JAVA_HOME on Mac:
vi ~/.bash_profile
export JAVA_HOME=$(/usr/libexec/java_home)
export PATH=$JAVA_HOME/bin:$PATH
export CLASS_PATH=$JAVA_HOME/lib
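After reloading the profile, it may be worth confirming that the variable really points at a JDK. The following is a minimal sketch only: `check_java_home` is a hypothetical helper name, not part of any tool, and it checks nothing beyond the presence of an executable `bin/java`.

```shell
# Hypothetical helper: succeed only if the given directory
# (default: $JAVA_HOME) contains an executable bin/java.
check_java_home() {
    dir="${1:-${JAVA_HOME:-}}"
    if [ -n "$dir" ] && [ -x "$dir/bin/java" ]; then
        echo "ok: $dir"
    else
        echo "JAVA_HOME is not usable: '$dir'" >&2
        return 1
    fi
}
```

Typical use would be `source ~/.bash_profile` followed by `check_java_home`.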
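2. Installing Mongo
On Mac, install it directly with brew; the latest version at the time was 3.4.7.
After installing, create the /data/db directory and start the service with mongod.
To test, connect with the mongo command and enter show dbs to list the databases.
brew install mongodb
sudo mkdir -p /data/db
sudo chown -R <your username> /data
mongod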
3. Installing es + kibana
Download es; the latest version is 5.5.1. URL: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.5.1.tar.gz
Edit the configuration:
$ vim config/elasticsearch.yml
cluster.name: my-application
node.name: "node-1"
node.master: true
node.data: true
path.data: /opt/elasticsearch/data
network.bind_host: 127.0.0.1
network.publish_host: 127.0.0.1
network.host: 127.0.0.1
Run: bin/elasticsearch
Open in a browser: http://localhost:9200
Download kibana (the Mac build); the latest version is 5.5.1.
Run: bin/kibana
Open in a browser: http://localhost:5601
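Because the Elasticsearch version has to match the client library used by the Nutch plugin later on, it can be handy to read the version from the running node instead of from the browser. This is a rough sketch: `es_version` is a hypothetical helper, and it assumes the root URL returns JSON containing a `"number"` field, as Elasticsearch's root endpoint does.

```shell
# Hypothetical helper: read the JSON that Elasticsearch returns at its
# root URL on stdin and print the value of "number" (the server version).
es_version() {
    sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}
```

With the node running, it could be used as `curl -s http://localhost:9200 | es_version`.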
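4. Installing Apache Nutch
Download Apache Nutch 2.3.1 (src.tar.gz): http://nutch.apache.org/downloads.html
Set the environment variable from inside the extracted source directory: export NUTCH_HOME=$(pwd)
Edit the configuration: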
$ cat conf/nutch-site.xml
<configuration>
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.mongodb.store.MongoStore</value>
<description>Default class for storing data</description>
</property>
</configuration>
Uncomment the mongodb-related entries:
$NUTCH_HOME/ivy/ivy.xml:
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.5" conf="*->default" />
$NUTCH_HOME/conf/gora.properties
############################
# MongoDBStore properties #
############################
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
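A forgotten `#` in gora.properties is easy to miss, so a small check can help before building. This is a sketch under assumptions: `check_gora_mongo` is a hypothetical helper, and the three keys it looks for are simply taken from the block above.

```shell
# Hypothetical helper: verify that each MongoDB key is present and
# uncommented in the given gora.properties file.
check_gora_mongo() {
    file="$1"
    status=0
    for key in gora.datastore.default gora.mongodb.servers gora.mongodb.db; do
        if ! grep -q "^[[:space:]]*$key=" "$file"; then
            echo "missing or commented out: $key" >&2
            status=1
        fi
    done
    return $status
}
```

For example: `check_gora_mongo "$NUTCH_HOME/conf/gora.properties"`.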
Important! The elastic plugin needs to be updated! The original plugin targets version 1.4.1, while the latest is now 5.5.1.
Edit:
cd src/plugin/indexer-elastic/
vi ivy.xml
...
<dependencies>
<dependency org="org.elasticsearch" name="elasticsearch"
rev="5.5.1" conf="*->default" />
</dependencies>
...
ant -f ./build-ivy.xml
Run ls lib to check the versions, then update the version numbers in plugin.xml.
<library name="HdrHistogram-2.1.9.jar"/>
<library name="elasticsearch-5.5.1.jar"/>
<library name="hppc-0.7.1.jar"/>
<library name="jackson-core-2.8.6.jar"/>
<library name="jackson-dataformat-cbor-2.8.6.jar"/>
<library name="jackson-dataformat-smile-2.8.6.jar"/>
<library name="jackson-dataformat-yaml-2.8.6.jar"/>
<library name="jna-4.4.0.jar"/>
<library name="joda-time-2.9.5.jar"/>
<library name="jopt-simple-5.0.2.jar"/>
<library name="log4j-api-2.8.2.jar"/>
<library name="lucene-analyzers-common-6.6.0.jar"/>
<library name="lucene-backward-codecs-6.6.0.jar"/>
<library name="lucene-core-6.6.0.jar"/>
<library name="lucene-grouping-6.6.0.jar"/>
<library name="lucene-highlighter-6.6.0.jar"/>
<library name="lucene-join-6.6.0.jar"/>
<library name="lucene-memory-6.6.0.jar"/>
<library name="lucene-misc-6.6.0.jar"/>
<library name="lucene-queries-6.6.0.jar"/>
<library name="lucene-queryparser-6.6.0.jar"/>
<library name="lucene-sandbox-6.6.0.jar"/>
<library name="lucene-spatial-6.6.0.jar"/>
<library name="lucene-spatial-extras-6.6.0.jar"/>
<library name="lucene-spatial3d-6.6.0.jar"/>
<library name="lucene-suggest-6.6.0.jar"/>
<library name="securesm-1.1.jar"/>
<library name="snakeyaml-1.15.jar"/>
<library name="t-digest-3.0.jar"/>
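Typing that list by hand is error-prone, so the entries can be generated from whatever jars ended up in lib. A minimal sketch, assuming it is run from the plugin directory after `ant -f ./build-ivy.xml`; `gen_library_entries` is a hypothetical helper name, and the output is meant to be pasted by hand into plugin.xml in place of the old library entries.

```shell
# Hypothetical helper: print one <library name="..."/> line per jar
# found in the given directory (default: lib), in the shell's glob order.
gen_library_entries() {
    dir="${1:-lib}"
    for jar in "$dir"/*.jar; do
        [ -e "$jar" ] || continue   # skip the literal pattern when no jars exist
        printf '<library name="%s"/>\n' "$(basename "$jar")"
    done
}
```

Running `gen_library_entries lib` prints the lines to the terminal, where they can be reviewed before editing plugin.xml.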
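However, there is an even bigger pitfall: the plugin's own code now produces errors! I stopped tinkering with it and gave up.
Start the build: ant runtime (it ran for 33 minutes!)
Conclusion
1. nutch 2.x and elasticsearch 5.x are not well compatible for now; I did not want to keep tinkering, so I gave up.
2. Next time I will try a new architecture: scrapy + scrapy-redis + mongodb + elasticsearch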