A First Look at Reading and Writing Elasticsearch from Hadoop

1. Reference documentation:

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/configuration.html

http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/mapreduce.html#_emphasis_old_emphasis_literal_org_apache_hadoop_mapred_literal_api

2. MapReduce configuration

// The ES settings below are read by the es-hadoop InputFormat/OutputFormat classes
Configuration conf = new Configuration();
conf.set(ConfigurationOptions.ES_NODES, "127.0.0.1");
conf.set(ConfigurationOptions.ES_PORT, "9200");
conf.set(ConfigurationOptions.ES_INDEX_AUTO_CREATE, "yes");
// Set the resource (index/type) to read from and write to
conf.set(ConfigurationOptions.ES_RESOURCE, "helloes/demo"); // target index/type
// To retrieve only part of the data, set ES_QUERY
//conf.set(ConfigurationOptions.ES_QUERY, "?q=me*");

// Configure the job with the formats that es-hadoop provides for Hadoop
Job job = Job.getInstance(conf, ElasticsearchIndexMapper.class.getSimpleName());
job.setJarByClass(ElasticsearchIndexBuilder.class);
// Disable speculative execution so duplicate task attempts do not write the same documents twice
job.setSpeculativeExecution(false);
job.setInputFormatClass(EsInputFormat.class);

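// Option A: write the output to HDFS, with Text as the map output value format.
// Use either this block or the Elasticsearch output settings that follow, not both.
job.setOutputFormatClass(TextOutputFormat.class);
job.setMapOutputValueClass(Text.class);
job.setMapOutputKeyClass(NullWritable.class);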

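// Option B: write the output to Elasticsearch instead of HDFS.
// These calls override the HDFS output settings above, so keep only one of the two options.
job.setOutputFormatClass(EsOutputFormat.class); // output format
job.setMapOutputValueClass(LinkedMapWritable.class); // map output value class
job.setMapOutputKeyClass(Text.class); // map output key class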

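job.setMapperClass(ElasticsearchIndexMapper.class);
// EsInputFormat reads from Elasticsearch, so this input path only matters for a file-based InputFormat
FileInputFormat.addInputPath(job, new Path("hdfs://localhost:9000/es_input"));
// Output directory, used when the job writes to HDFS
FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/es_output"));
job.setNumReduceTasks(0); // map-only job
job.waitForCompletion(true);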
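
The ConfigurationOptions constants used in the configuration above are, as far as I can tell from the es-hadoop configuration reference, just names for plain string property keys. Here is a minimal sketch of the same settings as key/value pairs. It uses only java.util, so it runs without Hadoop on the classpath. The class EsSettingsSketch and its methods buildSettings and queryKind are hypothetical names for illustration, and the key strings should be checked against the es-hadoop version in use. The prefix check in queryKind is a simplification of the documented es.query forms (URI query, query DSL, external resource), not the library's own logic.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of es-hadoop: it only illustrates which
// property keys the job configuration carries.
class EsSettingsSketch {

    // Collects the settings from section 2 under their string keys.
    // Assumption: these key names match the ConfigurationOptions constants.
    static Map<String, String> buildSettings(String nodes, int port,
                                             String resource, String query) {
        if (resource == null || resource.trim().isEmpty()) {
            throw new IllegalArgumentException("es.resource must not be empty");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("es.nodes", nodes);
        settings.put("es.port", Integer.toString(port));
        settings.put("es.index.auto.create", "yes");
        settings.put("es.resource", resource);
        if (query != null) {
            // Optional: restricts which documents are read
            settings.put("es.query", query);
        }
        return settings;
    }

    // Rough classification of an es.query value by its first character.
    static String queryKind(String query) {
        String q = query.trim();
        if (q.startsWith("?")) {
            return "uri";      // e.g. ?q=me*
        }
        if (q.startsWith("{")) {
            return "dsl";      // e.g. {"query":{"match_all":{}}}
        }
        return "resource";     // anything else is treated as a resource path
    }

    public static void main(String[] args) {
        Map<String, String> settings =
                buildSettings("127.0.0.1", 9200, "helloes/demo", "?q=me*");
        // Print one key=value pair per line
        settings.forEach((k, v) -> System.out.println(k + "=" + v));
        System.out.println("query kind: " + queryKind(settings.get("es.query")));
    }
}
```

In a real job each pair would be passed to conf.set(key, value) on the Hadoop Configuration, exactly as the constants are used in the listing above.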

3. The corresponding Mapper class, ElasticsearchIndexMapper

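import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.elasticsearch.hadoop.mr.LinkedMapWritable;

public class ElasticsearchIndexMapper extends Mapper<Object, Object, NullWritable, Text> {
    @Override
    protected void map(Object key, Object value, Context context)
            throws IOException, InterruptedException {
        // In this example the data is only exported to HDFS.
        // EsInputFormat delivers each document as a LinkedMapWritable.
        LinkedMapWritable doc = (LinkedMapWritable) value;
        Text docVal = new Text();
        docVal.set(doc.toString());
        context.write(NullWritable.get(), docVal);
    }
}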
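
The mapper above writes whatever doc.toString() returns, which may not be a format that downstream tools can parse. If one JSON object per line is wanted in HDFS, a small converter along these lines could be used instead. This is a sketch under assumptions: DocLineSketch, escape and toJsonLine are hypothetical names, a plain java.util.Map stands in for LinkedMapWritable (which, to my knowledge, implements Map), and every non-map value is written as a JSON string, so numbers and arrays lose their type.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of es-hadoop: turns a map-like document
// into a single-line JSON object.
class DocLineSketch {

    // Escapes the characters that are not allowed raw inside a JSON string.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters use the \\uXXXX form
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Serializes the entries in iteration order; nested maps become nested
    // objects, everything else becomes a quoted string.
    static String toJsonLine(Map<?, ?> doc) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<?, ?> e : doc.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(String.valueOf(e.getKey()))).append("\":");
            Object v = e.getValue();
            if (v instanceof Map) {
                sb.append(toJsonLine((Map<?, ?>) v));
            } else {
                sb.append('"').append(escape(String.valueOf(v))).append('"');
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "me");
        doc.put("note", "line1\nline2");
        System.out.println(toJsonLine(doc));
    }
}
```

Inside the mapper this would replace the toString() call, for example docVal.set(DocLineSketch.toJsonLine(doc)), assuming the value really is map-like.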

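4. Summary

The key to reading and writing Elasticsearch from Hadoop is the configuration (Configuration) passed to EsInputFormat and EsOutputFormat.

Other data sources (MySQL and so on) work in a similar way: find the matching InputFormat and OutputFormat, then set their environment parameters.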

Time: 2024-11-10 04:00:11
