a) The trend is for every individual’s data footprint to grow, but perhaps more significantly,the amount of data generated by machines as a part of the Internet of Things will be even greater than that generated by people.
每个人在互联网上的足迹数据(或者说痕迹)会一直增长,但是更重要的也许是大量由机器所产生的数据,这些数据作为物联网的一部分将比由人所产生的数据大得多。
b) Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data.
组织者不再仅仅只管理自己的数据,在将来,很大程度上,拥有能从其他组织的数据中提取有价值的数据的能力将会决定你是否成功。
c) Mashups between different information sources make for unexpected and hitherto unimaginable applications.
混合不同的信息源将会产生迄今都无法想象的应用
d) It has been said that “more data usually beats better algorithms,”.
有人说过,更多的数据常常会胜过更好的算法
e) This is a long time to read all data on a single drive — and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
在单驱动的设备上读取所有的数据需要很长时间,写入就更慢了。减少时间最显著的方法就是同时从多个磁盘中读取。想象一下,如果我们有100个磁盘驱动,每个磁盘驱动持有百分之一的数据。并发运行的情况下,我们可以在两分钟之内读完数据。
f) We can imagine that the users of such a system would be happy to share access in return for shorter analysis times,and statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much.
我们可以想象的到,用户将会很乐意分享设备,作为更短分析时间的回报,而且,在统计学意义上,用户的分析工作分布在各个时间段,故彼此之间不会干扰太多。
g) The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high.
要解决的第一个问题就是硬件故障问题:只要你使用多零件集成的设备,那么其中某个部分出故障的机会是非常高的
h) The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks.
第二个问题就是在某种程度上大多数分析任务需要集合数据,而从一块装置中读取这些数据又需要结合其他99块装置中的数据。
i) In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, because it runs on commodity hardware and is open source,Hadoop is affordable.
简而言之,hadoop就是一个具有可靠性、可扩展性的存储和分析平台。而且,因为hadoop是运行在普通的机器上,再加上是开源的,因此hadoop具有经济性。
j) Queries typically take minutes or more, so it’s best for offline use, where there isn’t a human sitting in the processing loop waiting for results.
典型的查询过程需要几分钟甚至更多,因此hadoop非常适合离线处理,因为这个时候不会有人在旁边等着循环过程的处理结果
k) The first component to provide online access was HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access of individual rows and batch operations for reading and writing data in bulk, making it a good solution for building applications on.
所提供的第一个在线访问组件是HBase,具有键值对的特性,使用HDFS来构成它的底层存储。HBase不仅提供个别行记录的在线读写访问,而且也可以对大批未整合的数据进行批量的读写操作,这使得它是一个建立应用程序的好方法。
l) The real enabler for new processing models in Hadoop was the introduction of YARN(which stands for Yet Another Resource Negotiator) in Hadoop 2. YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.
对于Hadoop2新处理模型真正的促成者是YARN,YARN是一个集群资源管理系统,允许任何分布式程序(不仅仅是MapReduce)运行在hadoop集群上的数据
m) Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works since it introduces several concepts that apply more generally (like the idea of input formats, or how a dataset is split into pieces).
尽管有各种不同的处理框架在hadoop上出现,但是MapReduce在批处理上任然有一席之地,并且自从引进一些能更普遍应用的思想以来,是有助于理解它如何工作的
n) Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed? The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate.
为什么我们不能用多磁盘驱动的数据库来进行大规模的数据分析?为什么需要hadoop,这么问题的答案来自于磁盘驱动的另一个趋势:磁盘的寻址时间的提升要远慢于传输速率的提升
o) In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS). MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
在许多方面,MapReduce可以看做是对关系型数据库管理系统(RDBMS)的补充,MapReduce非常适合于用在用批处理方式分析整个数据集的问题上。而RDBMS却擅长于在已经添加了索引的数据集上进行低延时的查询和多次更新相对较小的数据。MapReduce适合于一次写入多次读取的应用场景,而一个关系型数据库擅长于频繁地更新应用。
p) Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate.
hadoop与RDBMS系统的另外一个不同之处在于它们所操作的数据集结构的体量。
q) Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes.
关系型数据一般都是标准化的,需要保持其数据完整性以及去除冗余数据。标准化的处理方式对于hadoop来说是个问题,因为这使得读取数据不再是一个本地化操作,而且,hadoop的核心假设之一就是可以进行高速流式读写。
r) Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local.This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.
hadoop利用计算机节点共置存储数据,也就是数据本地化,因此数据访问是很快的。这个数据本地化是hadoop的核心特征,也是使hadoop获得高性能的原因。
s) Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.
我们进行hadoop 的操作仅仅发生在高级阶段:程序员只需根据数据模型(比如MapReduce的键值对)来构思,但是依然隐含着数据流。
t) Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure — when you don’t know whether or not a remote process has failed — and still making progress with the overall computation.
在大规模分布式计算中进行各个进程之间的协调工作是一个挑战,最困难的方面就是当你不知道是否有一个远程进行已经失败的情况下,如何优雅的处理局部失败问题,同时仍然能够使整个计算稳步向前推进。
u) MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects.
MapReduce被设计运行在一个由专用硬件构成,实现了内部高速集成互联的单一数据中心中,服务于那些在不深究的情况下,只需几分钟或者几小时就可以完成的计算任务。
v) Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a generalpurpose storage and analysis platform for big data has been recognized by the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way.
今天,hadoop在主流企业中得到了广泛的应用。hadoop作为一个大数据通用存储和分析平台的角色已经得到了行业的认可,在某种程度上,这个事实也在众多使用或者参考了hadoop的产品中得到了印证。