Awesome Hadoop

A curated list of amazingly awesome Hadoop and Hadoop ecosystem resources. Inspired by Awesome PHP, Awesome Python, and Awesome Sysadmin.

Hadoop

  • Apache Hadoop - Open-source framework for distributed storage (HDFS) and distributed processing (MapReduce, YARN) of large datasets
  • Apache Tez - A framework for YARN-based data processing applications in Hadoop
  • SpatialHadoop - SpatialHadoop is a MapReduce extension to Apache Hadoop designed specifically to work with spatial data.
  • GIS Tools for Hadoop - Big Data Spatial Analytics for the Hadoop Framework
  • Elasticsearch Hadoop - Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive and Apache Pig.
  • dumbo - Python module that allows you to easily write and run Hadoop programs.
  • hadoopy - Python MapReduce library written in Cython.
  • mrjob - mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming jobs.
  • pydoop - Pydoop is a package that provides a Python API for Hadoop.
  • hdfs-du - HDFS-DU is an interactive visualization of the Hadoop distributed file system.
  • White Elephant - Hadoop log aggregator and dashboard
  • Kiji Project
  • Genie - Genie provides REST-ful APIs to run Hadoop, Hive and Pig jobs, and to manage multiple Hadoop resources and perform job submissions across them.
  • Apache Kylin - An open-source distributed analytics engine from eBay Inc. that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets
  • Crunch - Go-based toolkit for ETL and feature extraction on Hadoop
  • Apache Ignite - Distributed in-memory platform
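
Several of the Python tools above (dumbo, hadoopy, mrjob, pydoop) build on the same map/shuffle/reduce contract that Hadoop Streaming exposes. A minimal local sketch of that contract in plain Python — no cluster or third-party library required, and the function names are illustrative, not any tool's API:

```python
import itertools

def mapper(line):
    # Emit (word, 1) pairs, as a streaming mapper would print "word\t1".
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Sum the values for one key; a streaming reducer sees input grouped by key.
    return word, sum(counts)

def run_local(lines):
    # Simulate map -> shuffle/sort -> reduce on a single machine.
    mapped = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in itertools.groupby(mapped, key=lambda kv: kv[0])
    )
```

On a real cluster, the sort-and-group step is the distributed shuffle; the tools listed above mostly differ in how much of this plumbing they hide.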

YARN

  • Apache Slider - Apache Slider is a project in incubation at the Apache Software Foundation with the goal of making it possible and easy to deploy existing applications onto a YARN cluster.
  • Apache Twill - Apache Twill is an abstraction over Apache Hadoop® YARN that reduces the complexity of developing distributed applications, allowing developers to focus more on their application logic.
  • mpich2-yarn - Running MPICH2 on YARN

NoSQL

Next-generation databases, mostly addressing some combination of being non-relational, distributed, open-source, and horizontally scalable.

  • Apache HBase - Distributed, scalable big data store modeled on Google's Bigtable, built on top of HDFS
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • happybase - A developer-friendly Python library to interact with Apache HBase.
  • Hannibal - Hannibal is a tool to help monitor and maintain HBase clusters that are configured for manual splitting.
  • Haeinsa - Haeinsa is a linearly scalable multi-row, multi-table transaction library for HBase
  • hindex - Secondary Index for HBase
  • Apache Accumulo - The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system.
  • OpenTSDB - The Scalable Time Series Database
  • Apache Cassandra

SQL on Hadoop

  • Apache Hive - The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL
  • Apache Phoenix - A SQL skin over HBase supporting secondary indices
  • Apache HAWQ (incubating) - Apache HAWQ is a Hadoop-native SQL query engine that combines the key technological advantages of an MPP database with the scalability and convenience of Hadoop
  • Lingual - SQL interface for Cascading (MR/Tez job generator)
  • Cloudera Impala
  • Presto - Distributed SQL Query Engine for Big Data. Open sourced by Facebook.
  • Apache Tajo - Data warehouse system for Apache Hadoop
  • Apache Drill - Schema-free SQL Query Engine
  • Apache Trafodion
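
All of the engines above accept standard SQL and distribute its execution across a cluster. The kind of aggregation they parallelize can be illustrated locally with Python's built-in sqlite3 — purely a toy stand-in; the table and columns are made up, and none of this is any engine's API:

```python
import sqlite3

# A GROUP BY aggregation of the sort Hive or Presto would fan out
# across many workers, run here in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (page TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    [("home", 3), ("docs", 5), ("home", 2)],
)
totals = dict(conn.execute(
    "SELECT page, SUM(views) FROM pageviews GROUP BY page"
))
```

The differences between the listed engines lie in how they plan and execute such queries (MapReduce jobs for classic Hive, long-running MPP daemons for Impala and Presto), not in the SQL itself.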

Data Management

  • Apache Calcite - A Dynamic Data Management Framework
  • Apache Atlas - Metadata tagging & lineage capture supporting complex business data taxonomies

Workflow, Lifecycle and Governance

  • Apache Oozie - Workflow scheduler system to manage Apache Hadoop jobs
  • Azkaban
  • Apache Falcon - Data management and processing platform
  • Apache NiFi - A dataflow system
  • Apache Airflow - A workflow automation and scheduling system that can be used to author and manage data pipelines
  • Luigi - Python package that helps you build complex pipelines of batch jobs
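
At their core, the workflow engines above (Oozie, Azkaban, Airflow, Luigi) run tasks in dependency order over a DAG. The scheduling idea can be sketched with Python's standard library — the task names are illustrative, not any engine's API:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def execution_order(dag):
    # dag maps each task to the set of tasks it depends on;
    # a workflow engine runs tasks in a topological order like this.
    return list(TopologicalSorter(dag).static_order())

# A three-step ETL pipeline: extract -> transform -> load.
pipeline = {"load": {"transform"}, "transform": {"extract"}, "extract": set()}
```

Real engines add the parts this sketch omits: retries, scheduling by time, parallel execution of independent tasks, and persistence of task state.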

Data Ingestion and Integration

DSL

  • Apache Pig - A platform for analyzing large data sets, with a high-level language for expressing data analysis programs
  • Apache DataFu - A collection of libraries for working with large-scale data in Hadoop
  • vahara - Machine learning and natural language processing with Apache Pig
  • packetpig - Open Source Big Data Security Analytics
  • akela - Mozilla's utility library for Hadoop, HBase, Pig, etc.
  • seqpig - Simple and scalable scripting for large sequencing data sets (e.g. bioinformatics) in Hadoop
  • Lipstick - Pig workflow visualization tool. Introducing Lipstick on A(pache) Pig
  • PigPen - PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.

Libraries and Tools

Realtime Data Processing

Distributed Computing and Programming

  • Apache Spark

    • Spark Packages - A community index of packages for Apache Spark
    • SparkHub - A community site for Apache Spark
  • Apache Crunch
  • Cascading - An application development platform for building data applications on Hadoop
  • Apache Flink - Apache Flink is a platform for efficient, distributed, general-purpose data processing.
  • Apache Apex (incubating) - Enterprise-grade unified stream and batch processing engine.

Packaging, Provisioning and Monitoring

  • Apache Bigtop - Apache Bigtop: Packaging and tests of the Apache Hadoop ecosystem
  • Apache Ambari - Software for provisioning, managing, and monitoring Apache Hadoop clusters
  • Ganglia Monitoring System
  • ankush - A big data cluster management tool that creates and manages clusters of different technologies.
  • Apache ZooKeeper - A centralized service for maintaining configuration information, naming, and distributed synchronization
  • Apache Curator - ZooKeeper client wrapper and rich ZooKeeper framework
  • Buildoop - Hadoop Ecosystem Builder
  • Deploop - The Hadoop Deploy System
  • Jumbune - An open source MapReduce profiling, MapReduce flow debugging, HDFS data quality validation and Hadoop cluster monitoring tool.
  • inviso - Inviso is a lightweight tool that provides the ability to search for Hadoop jobs, visualize the performance, and view cluster utilization.

Search

Search Engine Framework

  • Apache Nutch - Apache Nutch is a highly extensible and scalable open source web crawler software project.

Security

  • Apache Ranger - Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform.
  • Apache Sentry - An authorization module for Hadoop
  • Apache Knox Gateway - A REST API Gateway for interacting with Hadoop clusters.

Benchmark

  • Big Data Benchmark
  • HiBench
  • Big-Bench
  • hive-benchmarks
  • hive-testbench - Testbench for experimenting with Apache Hive at any data scale.
  • YCSB - The Yahoo! Cloud Serving Benchmark (YCSB) is an open-source specification and program suite for evaluating retrieval and maintenance capabilities of computer programs. It is often used to compare relative performance of NoSQL database management systems.

Machine learning and Big Data analytics

  • Apache Mahout
  • Oryx 2 - Lambda architecture on Apache Spark and Kafka for real-time, large-scale machine learning
  • MLlib - MLlib is Apache Spark's scalable machine learning library.
  • R - R is a free software environment for statistical computing and graphics.
  • RHadoop - R packages including rhdfs, rhbase, rmr2 and plyrmr
  • RHive - For launching Hive queries from R
  • Apache Lens
  • Apache SINGA (incubating) - SINGA is a general distributed deep learning platform for training big deep learning models over large datasets

Misc.

Resources

Various resources, such as books, websites and articles.

Websites

Useful websites and articles

Presentations

Books

Hadoop and Big Data Events
