Setting up Nutch 2.1 with MySQL to handle UTF-8

原文地址: http://nlp.solutions.asia/?p=180

These instructions assume Ubuntu 12.04 and Java 6 or 7 installed and JAVA_HOME configured.

Install MySQL Server and MySQL Client using the Ubuntu software center or sudo apt-get install mysql-server mysql-client at the command line.

As MySQL defaults to latin (are we still in the 1990s?) we need to edit sudo vi /etc/mysql/my.cnf and under [mysqld] add

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

The innodb options are to help deal with the small primary key size restriction of MySQL. Restart your machine for the changes to take effect. The max_allowed_packet option is so you don’t run into issues as your database and the pages you store in it get larger.

Check to make sure MySQL is running by typing sudo netstat -tap | grep mysql  and you should see something like

tcp        0      0 localhost:mysql         *:*                     LISTEN

We need to set up the nutch database manually as the current Nutch/Gora/MySQL generated db schema defaults to latin. Log into mysql at the command line using your previously set up MySQL id and password type

mysql -u xxxxx -p

then in the MySQL editor type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

and enter followed by

use nutch;

and enter and then copy and paste the following altogether:

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

Then type enter. You are done setting up the MySQL database for Nutch.

Set up Nutch 2.1 by downloading the latest version from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the contents of the file you just downloaded and going forward we will refer to this folder as ${APACHE_NUTCH_HOME}.

From inside the nutch folder ensure the MySQL dependency for Nutch is available by editing the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml

<!– Uncomment this to use MySQL as database with SQL as Gora store. –>
<dependency org=”mysql” name=”mysql-connector-java” rev=”5.1.18″ conf=”*->default”/>

Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file either deleting or commenting out the Default SqlStore Properties using #. Then add the MySQL properties below replacing xxxxx with the user and password you set up when installing MySQL earlier.

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file changing the length of the primarykey from 512 to 767 in both places.
<primarykey column=”id” length=”767″/>

Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml to put in a name in the value field under http.agent.name. It can be anything but cannot be left blank. Add additional languages if you want (I have added Japanese ja-jp below) and utf-8 as default as well. You must specify Sqlstore.

<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the “Accept-Language” request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

Install ant using the Ubuntu software center or sudo apt-get install ant at the command line.

From the command line cd to your nutch folder type ant runtime
This may take a few minutes to compile.

Start your first crawl by typing the lines below at the terminal (replace ‘http://nutch.apache.org/’ with whatever site you want to crawl):

cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo ‘http://nutch.apache.org/‘ > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5

You can easily add more urls to search by hand in seed.txt if you want. For the crawl, depth is the number of rounds of generate/fetch/parse/update you want to do (not depth of links as you might think at first) and topNis the max number of links you want to actually parse each time. Note however Nutch keeps track of all links it encounters in the webpage table (it just limits the amount it actually parses to TopN so don’t be surprised by seeing many more rows in the webpage table than you expect by limiting with TopN).

Check your crawl results by looking at the webpage table in the nutch database.

mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;

You should see the results of your crawl (around 159 rows). It will be hard to read the columns so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and use that instead for viewing the data. You may also want to run the following SQL command select * from webpage where status = 2; to limit the rows in the webpage table to only urls that were actually parsed.

Set up and index with Solr If you are using Nutch 2.1 at this time you are into the bleeding edge and probably want the latest version of Solr 4.0 as well. Untar it to to $HOME/apache-solr-4.0.0-XXXX. This folder will be now referred to as ${APACHE_SOLR_HOME}.
Download http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml  and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start solr:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

You can check this is running by opening http://localhost:8983/solr in your web browser.

Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

You can now run queries using Solr versus your crawled content. Openhttp://localhost:8983/solr/#/collection1/query and assuming you have crawled nutch.apache.org in the input box titled “q” you can do a search by inputting text:nutch and you should see something like this:

There remains a lot to configure to get a good web search going but you are at least started.

Setting up Nutch 2.1 with MySQL to handle UTF-8

时间: 2024-11-02 13:47:21

Setting up Nutch 2.1 with MySQL to handle UTF-8的相关文章

Mysql之表的操作&amp;索引&amp;explain&amp;profile

创建一个表create table(help create table) =>rename table A to B  更改表名 =>alter table A rename to B 更改表 =>drop table A   删除表 mysql> show create database gtms; #查看建库语句 +----------+---------------------------------------------------------------+ | Data

MySQL的Auto-Failover功能

今天来体验一下MySQL的Auto-Failover功能,这里用到一个工具MySQL Utilities,它的功能很强大.此工具提供如下功能:(1)管理工具 (克隆.复制.比较.差异.导出.导入)(2)复制工具 (安装.配置)(3)一般工具 (磁盘使用情况.冗余索引.搜索元数据)而我们用它来实现Master-Slave的自动Failover,下面开始 Master:192.168.13.194Slave:192.168.13.159此Failover需要建立在GTID的基础上所以MySQL版本必

重装mysql出现无响应解决

问题 在安装mysql数据库时,如果重新安装,很容易遇见apply security setting error,即在配置mysql启动服务时,在启动apply security setting时会出错, 在安装最后一步出现无响应卡死状态 原因 是卸载mysql时并没有完全删除文件,所以有必要手动清除这些,要清除的文件 解决方法 第一步:删除mysql的安装目录,一般为C:\Program Files目录下. 第二步:删除mysql的数据存放目录,一般在C:\Documents and Sett

mysql的两阶段提交协议

http://www.cnblogs.com/hustcat/p/3577584.html 前两天和百度的一个同学聊MySQL两阶段提交,当时自信满满的说了一堆,后来发现还是有些问题的理解还是比较模糊,可能是因为时间太久了,忘记了吧.这里再补一下:) 5.3.1事务提交流程 MySQL的事务提交逻辑主要在函数ha_commit_trans中完成.事务的提交涉及到binlog及具体的存储的引擎的事务提交.所以MySQL用2PC来保证的事务的完整性.MySQL的2PC过程如下: [email pro

使用 sqoop 将mysql数据导入到hive(import)

Sqoop 将mysql 数据导入到hive(import) 1.创建mysql表 CREATE TABLE `sqoop_test` ( `id` int(11) DEFAULT NULL, `name` varchar(255) DEFAULT NULL, `age` int(11) DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=latin1 插入数据 2.hive 建表 hive> create external table sqoop_test

mysql 5.5 安装配置方法图文教程

1.首先进入的是安装引导界面 2.然后进入的是类型选择界面,这里有3个类型:Typical(典型).Complete(完全).Custom(自定义).这里建议 选择"自定义"(Custom)安装,这样可以自定义选择MySQL的安装目录,然后点"Next"下一步,出现自定义安装界面,为了数据安全起见,不建议将MySQL安装系统盘C目录. 3.准备安装 4.安装完成之后会出现MySQL配置的引导界面 5.这里有个引导配置MySQL的选项(Luanch the MySQL

mysql错误【一】[ERROR] Missing system table mysql.proxies_priv

环境:mysql一主一从架构,主库是mysql5.1,从库是mysql5.6:系统均为CentOS6.2 问题: 在主库上面执行的SQL语句 1.创建表 CREATE TABLE `app_versions` (  `date` date NOT NULL,  `app` char(16) NOT NULL,  `ver` char(16) NOT NULL,  `val` int(11) DEFAULT '0',  PRIMARY KEY (`date`,`app`,`ver`)) ENGIN

MySQL复制: Galera

MySQL复制: Galera mysql 主从复制 大纲 MySQL复制: Galera 前言 Galera Replication简介 MariaDB-Galera-Server 环境部署 配置步骤 总结 前言 之前介绍了MySQL复制的各种解决方案, 但是我个人还是感觉Galera最好用也最实用, 什么是Galera, 它强大在哪里, 这篇文章就带你认识这个强大的工具 Galera Replication简介 Galera Repplication Galera复制发生在事务提交时, 通过

How to Install MySQL 5.6 from Official Yum Reposit

Tags: MySQL Distribution: CentOS Submitted by: Morgan TockerMySQL Community Manager @ Oracle Introduction In October 2013, the MySQL development team officially launched support for yum repositories. This means that you can now ensure that you have t