The Design of HDFS

HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware. Let’s examine this statement in more detail:

Very large files
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time. Each analysis will involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is more important than the latency in reading the first record.
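To make the pattern concrete, here is a minimal sketch in Java against the Hadoop FileSystem API (the path is hypothetical, and a Hadoop client classpath plus cluster configuration are assumed): the file is opened once and scanned sequentially to the end, which is the whole-dataset read HDFS is tuned for.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class StreamWholeFile {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is normally picked up from core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/logs/part-00000"); // hypothetical path

        byte[] buffer = new byte[64 * 1024];
        long bytesRead = 0;

        // Open once, then read sequentially to the end of the file:
        // throughput matters here, not the latency of the first record.
        try (FSDataInputStream in = fs.open(file)) {
            int n;
            while ((n = in.read(buffer)) > 0) {
                bytesRead += n;
            }
        }
        System.out.println("Read " + bytesRead + " bytes sequentially");
    }
}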

Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware (commonly available hardware that can be obtained from multiple vendors) for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.

It is also worth examining the applications for which HDFS does not work so well. Although this may change in the future, these are areas where HDFS is not a good fit today:

Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.

Lots of small files
Because the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, that is two million objects (a file plus a block apiece), so you would need at least 300 MB of memory. Although storing millions of files is feasible, billions is beyond the capability of current hardware.
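The arithmetic behind this rule of thumb is easy to check. The sketch below is a hypothetical back-of-the-envelope estimator (plain Java, not a Hadoop API); it simply multiplies object counts by the 150-byte figure quoted above.

public class NamenodeMemoryEstimate {
    // Rough rule of thumb: ~150 bytes of namenode heap per file,
    // directory, or block object.
    static final long BYTES_PER_OBJECT = 150;

    static long estimateBytes(long files, long directories, long blocks) {
        return (files + directories + blocks) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // One million single-block files (directories ignored), as in the
        // example above: 2,000,000 objects x 150 bytes = ~300 MB.
        long bytes = estimateBytes(1_000_000, 0, 1_000_000);
        System.out.printf("~%d MB of namenode heap%n", bytes / 1_000_000);
    }
}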

Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers or for modifications at arbitrary offsets in the file. (These might be supported in the future, but they are likely to be relatively inefficient.)
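As an illustration of this write model, here is a minimal sketch in Java (the path is hypothetical, and it assumes a Hadoop version and cluster configuration on which append is enabled): a single writer creates the file, and any later writes can only be appended at the end, since the output stream offers no way to write at an arbitrary offset.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleWriterAppend {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events.log"); // hypothetical path

        // A single writer creates the file and writes its initial contents.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeBytes("first record\n");
        }

        // Later writes go at the end of the file via append; there is no
        // call for modifying bytes at an arbitrary offset.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("second record\n");
        }
    }
}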
