HDFS Scribe Integration [Repost]

It is finally here: you can configure Scribe, the open source log aggregator, to log data directly into the Hadoop Distributed File System (HDFS).

Many Web 2.0 companies have to deploy a bunch of costly filers to capture the weblogs generated by their applications. Until now there has been no alternative to a costly filer, because the write rate for this stream is huge. The Hadoop-Scribe integration distributes this write load across a set of commodity machines, reducing the total cost of the infrastructure.

The challenge was to make HDFS behave in a near-real-time fashion. Scribe uses libhdfs, which is the C interface to the HDFS client. There were various bugs in libhdfs that had to be fixed first. Then came the FileSystem API. One of the major issues was that the FileSystem API caches FileSystem handles and always returned the same handle when called from multiple threads. There was no reference counting on the handle, so a close issued by one thread could invalidate the handle that other threads were still using. This caused problems because Scribe is highly multi-threaded. A new API, FileSystem.newInstance(), was introduced to support Scribe.
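The handle-caching problem can be illustrated with a toy model. The sketch below is not the Hadoop FileSystem code: `Handle`, `HandleCache`, `get()` and `new_instance()` are hypothetical names chosen only to mirror the behaviour described above, where a cache without reference counting hands the same object to every caller.

```python
import threading


class Handle:
    """Toy stand-in for a file system client handle (not the Hadoop class)."""

    def __init__(self, uri):
        self.uri = uri
        self.closed = False

    def write(self, data):
        if self.closed:
            raise IOError("handle for %s is closed" % self.uri)
        return len(data)

    def close(self):
        self.closed = True


class HandleCache:
    """Caches one handle per URI, with no reference counting."""

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def get(self, uri):
        # Every caller asking for the same URI receives the same object.
        with self._lock:
            if uri not in self._cache:
                self._cache[uri] = Handle(uri)
            return self._cache[uri]

    def new_instance(self, uri):
        # Bypasses the cache: the caller owns this handle exclusively.
        return Handle(uri)


cache = HandleCache()

# Shared handles: closing one caller's handle breaks the other caller.
a = cache.get("hdfs://n1")
b = cache.get("hdfs://n1")
a.close()
try:
    b.write(b"click")
    shared_write_failed = False
except IOError:
    shared_write_failed = True

# Private handles: each caller can close its own handle safely.
c = cache.new_instance("hdfs://n1")
d = cache.new_instance("hdfs://n1")
c.close()
private_write_ok = d.write(b"click") == 5
```

In this model the second writer fails only in the shared case, which is roughly why a per-caller handle suits a highly multi-threaded client.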

Making the HDFS write code path more real-time was painful. Various timeouts and settings in HDFS were hardcoded and had to be made configurable so that the application can fail fast. At the bottom of this blog post, I am attaching the settings that we currently use to make HDFS writes close to real-time. The last of the JIRAs, HADOOP-2757, is in the pipeline to be committed to Hadoop trunk very soon.

What about the Namenode being a single point of failure? This is acceptable in a warehouse type of application but cannot be tolerated by a real-time application. Scribe typically aggregates click logs from a bunch of webservers, and losing *all* click-log data of a website for 10 minutes or so (the minimum time for a namenode restart) cannot be tolerated. The solution is to configure two overlapping clusters on the same hardware. Run two separate namenodes, N1 and N2, on two different machines. On every slave machine, run one datanode instance that reports to N1 and a second datanode instance that reports to N2. The two datanode instances on a single slave machine share the same data directories. This configuration allows HDFS to be highly available for writes!

The highly-available-for-writes HDFS configuration is also required for software upgrades on the cluster. We can shut down one of the overlapping HDFS clusters, upgrade it to the new Hadoop software, and then put it back online before starting the same process for the second HDFS cluster.
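The upgrade procedure above can be sketched as a small loop. This is only an illustration: `rolling_upgrade` and the dictionary fields are hypothetical, and the real steps (stopping daemons, installing software) are reduced to flag changes.

```python
def rolling_upgrade(clusters, new_version):
    """Upgrade clusters one at a time so that one always stays writable.

    Each cluster is a dict with "name", "version" and "online" keys
    (a made-up representation for this sketch).
    """
    log = []
    for cluster in clusters:
        others_online = [
            c for c in clusters if c is not cluster and c["online"]
        ]
        if not others_online:
            raise RuntimeError(
                "refusing to stop %s: no other cluster online" % cluster["name"]
            )
        cluster["online"] = False          # shut the cluster down
        log.append("stopped " + cluster["name"])
        cluster["version"] = new_version   # upgrade the software
        cluster["online"] = True           # put it back online
        log.append("restarted " + cluster["name"] + " on " + new_version)
    return log


clusters = [
    {"name": "N1", "version": "old", "online": True},
    {"name": "N2", "version": "old", "online": True},
]
upgrade_log = rolling_upgrade(clusters, "new")
```

The guard before each shutdown is the point of the sketch: a cluster is taken offline only while its overlapping twin can still accept writes.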

What are the main changes to Scribe that were needed? Scribe already buffers data when it is unable to write to the configured storage. The default Scribe behaviour is to replay this buffer to the storage once the storage is back online. Scribe can now be configured not to replay the buffer when the primary storage comes back online. Scribe-hdfs is configured to write data to cluster N1, and if N1 fails it writes data to cluster N2. Scribe treats N1 and N2 as two equivalent primary stores.
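A minimal sketch of the "two equivalent primary stores" idea follows. It is not Scribe's store code: `Cluster`, `FailoverWriter` and `StoreUnavailable` are hypothetical names, and staying on N2 after N1 recovers is an assumption of this sketch, since the text only says that nothing is replayed.

```python
class StoreUnavailable(Exception):
    """Raised when a cluster cannot accept writes."""


class Cluster:
    """Toy stand-in for one HDFS cluster (N1 or N2)."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.messages = []

    def write(self, message):
        if not self.online:
            raise StoreUnavailable(self.name)
        self.messages.append(message)


class FailoverWriter:
    """Writes to the active cluster and switches when it fails."""

    def __init__(self, clusters):
        self.clusters = list(clusters)
        self.active = 0

    def write(self, message):
        # Try the active cluster first, then each of the others once.
        for attempt in range(len(self.clusters)):
            index = (self.active + attempt) % len(self.clusters)
            try:
                self.clusters[index].write(message)
            except StoreUnavailable:
                continue
            # Assumption: stay on whichever cluster worked, with no
            # fail-back and no replay of earlier messages.
            self.active = index
            return self.clusters[index].name
        raise StoreUnavailable("all clusters are down")


n1, n2 = Cluster("N1"), Cluster("N2")
writer = FailoverWriter([n1, n2])

first = writer.write("click-1")    # N1 is healthy
n1.online = False
second = writer.write("click-2")   # N1 is down, so N2 takes the write
n1.online = True
third = writer.write("click-3")    # N1 is back, nothing is replayed to it
```

Because both clusters are treated as equal, the data written during the outage simply lives on N2; readers would need to look at both clusters to see the full log.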

Reposted from: http://hadoopblog.blogspot.hk/2009/06/hdfs-scribe-integration.html


Time: 2024-10-13 03:23:37
