Hadoop: The Definitive Guide (4th Edition), Key Points Translated (2): Chapter 1. Meet Hadoop

a) The trend is for every individual’s data footprint to grow, but perhaps more significantly, the amount of data generated by machines as a part of the Internet of Things will be even greater than that generated by people.

Every individual's data footprint will keep growing, but perhaps more significantly, the amount of data generated by machines as part of the Internet of Things will be even greater than that generated by people.

b) Organizations no longer have to merely manage their own data; success in the future will be dictated to a large extent by their ability to extract value from other organizations’ data.

Organizations no longer have to merely manage their own data; to a large extent, future success will be determined by their ability to extract value from other organizations' data.

c) Mashups between different information sources make for unexpected and hitherto unimaginable applications.

Mashups between different information sources make for unexpected and hitherto unimaginable applications.

d) It has been said that “more data usually beats better algorithms.”

It has been said that more data usually beats better algorithms.

e) This is a long time to read all data on a single drive — and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.

Reading all the data on a single drive takes a long time, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine we had 100 drives, each holding one hundredth of the data; working in parallel, we could read the data in under two minutes.
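The arithmetic behind "under two minutes" is easy to check. A quick sketch, where the 1 TB dataset size and 100 MB/s per-drive transfer rate are illustrative assumptions (roughly the figures the book uses for drives of that era):

```python
# Back-of-the-envelope: reading 1 TB serially vs. across 100 drives in parallel.
# The 1 TB size and 100 MB/s transfer rate are illustrative assumptions.
TOTAL_BYTES = 1 * 10**12        # 1 TB of data
TRANSFER_RATE = 100 * 10**6     # 100 MB/s per drive
DRIVES = 100

serial_seconds = TOTAL_BYTES / TRANSFER_RATE   # one drive reads everything
parallel_seconds = serial_seconds / DRIVES     # each drive reads 1/100th

print(f"serial:   {serial_seconds / 3600:.1f} hours")    # ~2.8 hours
print(f"parallel: {parallel_seconds / 60:.1f} minutes")  # ~1.7 minutes
```

With these assumed figures, a single drive needs nearly three hours, while 100 drives working in parallel finish in well under two minutes, which is exactly the book's point.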

f) We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and statistically, that their analysis jobs would be likely to be spread over time, so they wouldn’t interfere with each other too much.

We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and that, statistically, their analysis jobs would likely be spread over time, so they would not interfere with each other too much.

g) The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high.

The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one of them will fail is fairly high.
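Why the failure chance is "fairly high" follows from basic probability: if each drive independently fails with probability p over some period, the chance that at least one of n drives fails is 1 - (1 - p)^n. A small sketch, where the 2% per-drive failure rate is an illustrative assumption:

```python
# Probability that at least one of n drives fails, assuming independent
# failures with per-drive probability p. p = 2% is an illustrative assumption.
p = 0.02
for n in (1, 100, 1000):
    at_least_one = 1 - (1 - p) ** n
    print(f"{n:>5} drives: {at_least_one:.1%} chance of at least one failure")
```

Even with a per-drive failure rate of only 2%, a 100-drive cluster sees at least one failure about 87% of the time, which is why replication (as in HDFS) is built in rather than optional.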

h) The second problem is that most analysis tasks need to be able to combine the data in some way, and data read from one disk may need to be combined with data from any of the other 99 disks.

The second problem is that most analysis tasks need to be able to combine the data in some way: data read from one disk may need to be combined with data from any of the other 99 disks.

i) In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What’s more, because it runs on commodity hardware and is open source, Hadoop is affordable.

In a nutshell, this is what Hadoop provides: a reliable, scalable platform for storage and analysis. What's more, because it runs on commodity hardware and is open source, Hadoop is affordable.

j) Queries typically take minutes or more, so it’s best for offline use, where there isn’t a human sitting in the processing loop waiting for results.

Queries typically take minutes or more, so Hadoop is best suited to offline use, where there is no human sitting in the processing loop waiting for results.

k) The first component to provide online access was HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access of individual rows and batch operations for reading and writing data in bulk, making it a good solution for building applications on.

The first component to provide online access was HBase, a key-value store that uses HDFS for its underlying storage. HBase provides both online read/write access to individual rows and batch operations for reading and writing data in bulk, which makes it a good solution to build applications on.

l) The real enabler for new processing models in Hadoop was the introduction of YARN (which stands for Yet Another Resource Negotiator) in Hadoop 2. YARN is a cluster resource management system, which allows any distributed program (not just MapReduce) to run on data in a Hadoop cluster.

The real enabler of new processing models in Hadoop was the introduction of YARN (Yet Another Resource Negotiator) in Hadoop 2. YARN is a cluster resource management system that allows any distributed program, not just MapReduce, to run on the data in a Hadoop cluster.

m) Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works since it introduces several concepts that apply more generally (like the idea of input formats, or how a dataset is split into pieces).

Despite the emergence of different processing frameworks on Hadoop, MapReduce still has a place for batch processing, and it is useful to understand how it works, because it introduces several concepts that apply more generally, such as input formats and how a dataset is split into pieces.

n) Why can’t we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed? The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate.

Why can't we use databases with lots of disks to do large-scale analysis? Why is Hadoop needed? The answer comes from another trend in disk drives: seek time is improving more slowly than transfer rate.
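The consequence of slow seeks can be made concrete: updating a fraction of the records in a large dataset one at a time is seek-bound, while rewriting the whole dataset sequentially is transfer-bound, and the latter can win by a wide margin. A rough sketch, where the 10 ms seek time, 100 MB/s transfer rate, 1 TB dataset, 100-byte records, and 1% update fraction are all illustrative assumptions:

```python
# Seek-bound record updates vs. a streaming rewrite of the whole dataset.
# All figures are illustrative assumptions, not numbers from the book.
SEEK_TIME = 0.010               # seconds per random seek
TRANSFER_RATE = 100 * 10**6     # bytes/second, sequential
DATASET = 1 * 10**12            # 1 TB
RECORD = 100                    # bytes per record

records = DATASET // RECORD
updated = records // 100                 # 1% of records change

seek_bound = updated * SEEK_TIME         # one random seek per updated record
stream_bound = DATASET / TRANSFER_RATE   # read/rewrite everything sequentially

print(f"seek-bound updates: {seek_bound / 86400:.1f} days")
print(f"streaming rewrite:  {stream_bound / 3600:.1f} hours")
```

Under these assumptions, seeking to 1% of the records takes days, while streaming through the entire terabyte takes a few hours, which is why whole-dataset batch processing (Hadoop's model) beats seek-heavy access patterns at this scale.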

o) In many ways, MapReduce can be seen as a complement to a Relational Database Management System (RDBMS). MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

In many ways, MapReduce can be seen as a complement to a relational database management system (RDBMS). MapReduce is a good fit for problems that need to analyze a whole dataset in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries and updates, where the dataset has been indexed to deliver low-latency retrieval and updates of relatively small amounts of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

p) Another difference between Hadoop and an RDBMS is the amount of structure in the datasets on which they operate.

Another difference between Hadoop and an RDBMS is the degree of structure in the datasets they operate on.

q) Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for Hadoop processing because it makes reading a record a nonlocal operation, and one of the central assumptions that Hadoop makes is that it is possible to perform (high-speed) streaming reads and writes.

Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses a problem for Hadoop processing because it makes reading a record a nonlocal operation, and one of Hadoop's central assumptions is that it can perform (high-speed) streaming reads and writes.

r) Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.

Hadoop tries to co-locate the data with the compute nodes, so data access is fast because it is local. This feature, known as data locality, is at the heart of data processing in Hadoop and is the reason for its good performance.

s) Processing in Hadoop operates only at the higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.

Processing in Hadoop operates only at a higher level: the programmer thinks in terms of the data model (such as key-value pairs for MapReduce), while the data flow remains implicit.
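"Thinking in terms of key-value pairs" can be illustrated with a toy in-memory sketch (this is not the Hadoop API; it only mimics the shape of the model): the programmer writes a map function that emits (key, value) pairs and a reduce function that aggregates the values for each key, while the grouping and data movement stay implicit in the framework.

```python
from collections import defaultdict

# Toy in-memory sketch of the MapReduce key-value data model (not the Hadoop
# API): map emits (key, value) pairs, the framework groups values by key, and
# reduce aggregates each group.
def map_fn(line):
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    return (key, sum(values))

def run(lines):
    groups = defaultdict(list)
    for line in lines:                    # "map" phase
        for k, v in map_fn(line):
            groups[k].append(v)
    # "shuffle" (group by key) happened implicitly above; now "reduce"
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(run(["hello hadoop", "hello world"]))  # {'hello': 2, 'hadoop': 1, 'world': 1}
```

Only `map_fn` and `reduce_fn` express the programmer's logic; everything else (grouping, ordering, distribution) is the framework's job, which is exactly the division of labor the quote describes.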

t) Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure — when you don’t know whether or not a remote process has failed — and still making progress with the overall computation.

Coordinating the processes in a large-scale distributed computation is a challenge. The hardest part is gracefully handling partial failure (when you do not know whether a remote process has failed) while still making progress with the overall computation.

u) MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects.

MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate interconnect bandwidth.

v) Today, Hadoop is widely used in mainstream enterprises. Hadoop’s role as a general-purpose storage and analysis platform for big data has been recognized by the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way.

Today, Hadoop is widely used in mainstream enterprises. Hadoop's role as a general-purpose storage and analysis platform for big data has been recognized by the industry, and this fact is reflected in the number of products that use or incorporate Hadoop in some way.

Posted: 2024-10-07 02:39:24
