Notes on some big data tools and terms

I often see these names appear together, so today I am summarizing them to look back on later.

All descriptions are taken from the official Apache docs.

1/apache kafka

what is kafka?

　　　　Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design.

　　Simply put, it is a log-based messaging system. It reminds me of RabbitMQ, which is also a messaging system.

　　So I googled the differences.

　　TL;DR reference: http://www.quora.com/What-are-the-differences-between-Apache-Kafka-and-RabbitMQ

　　Also, Kafka depends on ZooKeeper for cluster coordination (newer Kafka releases can run without it in KRaft mode).
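To make the "partitioned commit log" idea concrete for myself, here is a toy in-memory sketch. It is not Kafka's API: the class name `ToyCommitLog`, its methods, and the CRC-based partitioning are all my own assumptions, and replication is left out entirely. It only shows that records are appended, never removed, and that readers pick their own offset.

```python
import zlib


class ToyCommitLog:
    """In-memory stand-in for a partitioned, append-only log (not Kafka's API)."""

    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Records with the same key always land in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        offset = len(self.partitions[index]) - 1
        return index, offset

    def read(self, partition, offset):
        # Reading never removes records; it returns everything from `offset` on.
        return self.partitions[partition][offset:]


log = ToyCommitLog(num_partitions=2)
p1, o1 = log.append("user-1", "login")
p2, o2 = log.append("user-1", "click")

# Two independent readers can replay the same partition from any offset.
from_start = log.read(p1, 0)
from_latest = log.read(p1, o2)
```

Because reading does not consume anything, `from_start` still contains both records after `from_latest` has been read. As far as I understand, this is the main way a log differs from a classic queue.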

2/apache zookeeper

　　what is zookeeper?

　　　　ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

　　what's its aim?

　　　　ZooKeeper aims at distilling the essence of these different services into a very simple interface to a centralized coordination service. The service itself is distributed and highly reliable. Consensus, group management, and presence protocols will be implemented by the service so that the applications do not need to implement them on their own. Application specific uses of these will consist of a mixture of specific components of ZooKeeper and application specific conventions. ZooKeeper Recipes shows how this simple service can be used to build much more powerful abstractions.
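A rough sketch of what "configuration information" and "group services" could look like behind one small interface. This is a single-process toy, not the ZooKeeper client API: `ToyCoordinator`, its method names, and the paths are hypothetical. The one real idea it borrows is ZooKeeper's tree of path-named nodes, where "ephemeral" nodes vanish when the session that created them ends.

```python
class ToyCoordinator:
    """Single-process stand-in for a coordination service (not ZooKeeper's API)."""

    def __init__(self):
        self.nodes = {}   # path -> data
        self.owners = {}  # path -> session id, for ephemeral nodes only

    def create(self, path, data, session=None):
        if path in self.nodes:
            raise KeyError(f"{path} already exists")
        self.nodes[path] = data
        if session is not None:
            self.owners[path] = session

    def children(self, parent):
        # Direct children only: names with no further "/" after the prefix.
        prefix = parent.rstrip("/") + "/"
        return sorted(
            path[len(prefix):]
            for path in self.nodes
            if path.startswith(prefix) and "/" not in path[len(prefix):]
        )

    def close_session(self, session):
        # Ephemeral nodes disappear with the session that created them.
        for path in [p for p, s in self.owners.items() if s == session]:
            del self.nodes[path]
            del self.owners[path]


zk = ToyCoordinator()
zk.create("/config", "batch_size=100")           # shared configuration
zk.create("/workers", "")                        # parent for group membership
zk.create("/workers/a", "host-a", session="s1")  # ephemeral member
zk.create("/workers/b", "host-b", session="s2")

members_before = zk.children("/workers")
zk.close_session("s1")                           # worker "a" goes away
members_after = zk.children("/workers")
```

The real service adds the hard parts this sketch ignores: replication across servers, consensus, and notifying clients when membership changes.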

3/apache storm

　　what is storm?

　　　　Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!

　　where to use it?

　　　　Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.
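To get a feel for "processing unbounded streams", here is a small single-threaded sketch using Python generators. It is not Storm code and has no distribution or fault tolerance. The names "spout" (a stream source) and "bolt" (a processing step) are Storm's own terms, but the functions below are hypothetical stand-ins for them.

```python
import itertools
from collections import Counter


def sentence_spout():
    """Source of an unbounded stream: cycles through sentences forever."""
    sentences = ["the cow jumped", "the moon", "cow over the moon"]
    for sentence in itertools.cycle(sentences):
        yield sentence


def split_bolt(stream):
    """Turns each sentence tuple into one tuple per word."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Emits a running (word, count) pair after every incoming word."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield word, counts[word]


# Wire the topology; nothing runs until tuples are pulled through it.
topology = count_bolt(split_bolt(sentence_spout()))

# The stream never ends, so take only the first 8 updates for the demo.
updates = list(itertools.islice(topology, 8))
```

The point is that results are updated continuously as tuples arrive. A batch job would instead wait for the whole input to be available first.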

4/apache spark

　　what is spark?

　　　　Apache Spark™ is a fast and general engine for large-scale data processing.

5/apache hive

　　what is hive?

　　　　The Apache Hive™ data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

　　　　So Hive is queried with a SQL-like language (HiveQL). IBM also has a page on it: http://www-01.ibm.com/software/data/infosphere/hadoop/hive/ (their docs are always good).
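To see why a SQL-like layer is convenient, here is a grouped count written as a hand-rolled mapper and reducer in plain Python, with the kind of query HiveQL would let you write instead shown in a comment. The table `page_views` and column `url` are hypothetical, and this runs in one process rather than on a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Roughly what one would write in HiveQL (hypothetical table and column):
#   SELECT url, COUNT(*) FROM page_views GROUP BY url;

page_views = [
    {"url": "/home", "user": "u1"},
    {"url": "/docs", "user": "u2"},
    {"url": "/home", "user": "u3"},
]


def mapper(record):
    # Emit one (key, 1) pair per input record.
    yield record["url"], 1


def reducer(key, values):
    # Collapse all values for one key into a single count.
    return key, sum(values)


# Map phase.
pairs = [pair for record in page_views for pair in mapper(record)]

# Shuffle phase: bring identical keys together.
pairs.sort(key=itemgetter(0))

# Reduce phase.
result = [
    reducer(key, (value for _, value in group))
    for key, group in groupby(pairs, key=itemgetter(0))
]
```

One line of query versus three explicit phases is the trade-off the Hive description is pointing at. The custom mapper/reducer route stays available for logic that does not fit the query language.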

6/apache pig

　　what is pig?

　　　　Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Conclusion:

　　1. Most messaging systems are based on the producer-consumer pattern.

　　2. Pig and Hive both provide high-level query languages: HiveQL is SQL-like, while Pig's language (Pig Latin) is a dataflow language.
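For point 1, a minimal producer-consumer sketch using only the Python standard library. This is the generic pattern, not any particular messaging system: the thread-safe queue plays the role of the broker, and the `STOP` sentinel is just my convention for ending the demo.

```python
import queue
import threading

messages = queue.Queue()
received = []
STOP = object()  # sentinel telling the consumer there is nothing more to read


def producer():
    for i in range(5):
        messages.put(f"event-{i}")
    messages.put(STOP)


def consumer():
    while True:
        item = messages.get()  # blocks until a message is available
        if item is STOP:
            break
        received.append(item)


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The producer and consumer never call each other directly. They only share the queue, which is the decoupling that messaging systems provide across machines.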

Time: 2024-10-10 14:12:56
