Meet Hadoop
1.1 Data!
Most of the data is locked up in the largest web properties (like search engines), or scientific or financial institutions, isn’t it? Does the advent of “Big Data,” as it is being called, affect smaller organizations or individuals?
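Ordinary people have not yet benefited much from this vast amount of data: it is stored on the web or held by large research institutions, which is why big data mining has emerged.
For an individual, the steady growth in data volume means that reading and filtering data takes more and more time.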
1.2 Data Storage and Analysis
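Although the read speed of hard disks and other storage media keeps improving, it has not kept pace with the growth in data volume, so retrieving and filtering data still takes a long time.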
Reading all the data on a single drive takes a long time, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
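Note: the obvious fix for slow single-drive reads is to read from many drives at once, but raising read throughput this way also lowers the utilization of each drive, since each one holds only a fraction of the data.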
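The "under two minutes" figure can be checked with a little arithmetic. The sketch below assumes a 1 TB dataset and a transfer rate of 100 MB/s per drive; neither number is stated in these notes, so treat both as illustrative assumptions.

```python
# Assumed figures (not stated in the notes): 1 TB of data, 100 MB/s per drive.
DATA_MB = 1_000_000          # 1 TB expressed in MB (decimal units)
TRANSFER_MB_PER_S = 100      # assumed sustained transfer rate of one drive


def read_time_seconds(data_mb, rate_mb_per_s, drives=1):
    """Time to read data_mb when it is split evenly across `drives`
    drives that are all read in parallel."""
    return data_mb / drives / rate_mb_per_s


single = read_time_seconds(DATA_MB, TRANSFER_MB_PER_S)
parallel = read_time_seconds(DATA_MB, TRANSFER_MB_PER_S, drives=100)

print(f"one drive:  {single / 3600:.1f} hours")
print(f"100 drives: {parallel / 60:.1f} minutes")
```

Under these assumptions a single drive needs 10,000 seconds (close to three hours), while 100 drives working in parallel need 100 seconds. The model ignores seek time and any cost of combining the results.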
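Reading from multiple drives in parallel also brings its own problems:
1. Hardware failure can cause reads to fail. Redundant copies of the data are kept by the system so that in the event of failure, there is another copy available (data replication).
2. Combining data read from different drives is also a major challenge. This is what leads to MapReduce.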
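The second problem, combining partial results from many drives, is the one the map and reduce steps address. The following is a minimal single-process sketch of that idea as a word count. It is not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(shard):
    """Map: emit a (word, 1) pair for every word in one shard of the input."""
    for line in shard:
        for word in line.split():
            yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, the step a framework
    would perform between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Hypothetical input, split into shards as if each lived on a different drive.
shards = [
    ["big data big clusters"],
    ["data on many drives"],
]

# Each shard is mapped independently; only the grouped pairs are combined.
pairs = [pair for shard in shards for pair in map_phase(shard)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Because each shard is mapped without reference to the others, the map step could in principle run on the machine that holds the shard, and only the grouped intermediate pairs need to be brought together.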
1.3 Comparison with Other Systems
MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative.
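RDBMS (relational database management system)
Grid Computing
Grid computing is a form of distributed computing proposed in recent years. Distributed computing means that two or more pieces of software share information with each other; they may run on the same computer or on several computers connected by a network.
Volunteer Computing
Volunteer computing lets ordinary people around the world donate idle PC time over the Internet to take part in scientific computation or data analysis. It offers an effective way to tackle basic-science problems that are large in scale and demand substantial computing resources. For scientists, volunteer computing means nearly free and virtually unlimited computing resources; for volunteers, it is a chance to learn about and take part in science, which promotes public understanding of science.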
1.4 A Brief History of Hadoop
Apache Lucene
1.5 Apache Hadoop and the Hadoop Ecosystem
Common: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro: A serialization system for efficient, cross-language RPC, and persistent data storage.
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: A distributed filesystem that runs on large clusters of commodity machines.
Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop: A tool for efficiently moving data between relational databases and HDFS.
1.6 Hadoop Releases
Hadoop: The Definitive Guide, Chapter 1: Meet Hadoop