Big Data Basics Q&A (Part 1)

What is Hadoop?

===========

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.

What technology inspired the invention of Hadoop?

===========

Google paper:

  • MapReduce

  • Google File System

At the architecture level, what are the key components of Hadoop?

===========

HDFS – Hadoop Distributed File System, which is capable of storing data across thousands of commodity servers to achieve high bandwidth between nodes.

MapReduce - provides the programming model used to tackle large distributed data processing -- mapping data and reducing it to a result.

YARN – Yet Another Resource Negotiator, which provides resource management and job scheduling for user applications.

What is HDFS?

===========

HDFS provides scalable, fault-tolerant, cost-efficient storage for big data. It is a Java-based file system designed to span large clusters of commodity servers, and it has demonstrated production scalability of up to 200 PB of storage in a single cluster of 4,500 servers, supporting close to a billion files and blocks.

Please describe the HDFS architecture.

===========

HDFS has a master/slave architecture.

An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on.

HDFS exposes a file system namespace and allows user data to be stored in files.

Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes.

  • The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

  • The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
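To make the division of labor concrete, here is a minimal sketch of an HDFS client using the standard org.apache.hadoop.fs.FileSystem API. Namespace operations such as create and open go to the NameNode, while the file bytes themselves flow to and from DataNodes; the path and contents here are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Namespace operations (create, open, rename, delete) are handled by the NameNode.
        Path file = new Path("/tmp/example.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            // The actual bytes are pipelined to DataNodes chosen by the NameNode.
            out.writeUTF("hello hdfs");
        }

        try (FSDataInputStream in = fs.open(file)) {
            // open() asks the NameNode for block locations; the reads go to DataNodes.
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}
```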

How is a file stored in HDFS?

===========

A file is split into fixed-size blocks (64 MB by default in early Hadoop versions, 128 MB in Hadoop 2.x; only the final block may be smaller), and these blocks are stored across DataNodes. Block size and replication factor are per-file settings, as sketched after the placement list below.

Block placement strategy:

  • First replica on the local node (the node where the writer runs).

  • Second replica on a node on a different (remote) rack.

  • Third replica on a different node on the same remote rack.

  • Any additional replicas are placed randomly.
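Replication factor and block size are per-file parameters whose defaults come from the cluster configuration (dfs.replication and dfs.blocksize). A minimal sketch using the FileSystem API, with an arbitrary example path and sizes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path p = new Path("/tmp/replicated.dat");

        short replication = 3;                 // one local copy + two copies on a remote rack
        long blockSize = 128L * 1024 * 1024;   // 128 MB blocks
        int bufferSize = 4096;

        try (FSDataOutputStream out =
                 fs.create(p, true, bufferSize, replication, blockSize)) {
            out.write(new byte[] {1, 2, 3});
        }

        // Replication can also be changed later; DataNodes re-replicate
        // or delete replicas on instruction from the NameNode.
        fs.setReplication(p, (short) 2);
        fs.close();
    }
}
```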

A data file of 1 TB requires how much storage and network traffic to store in HDFS?

===========

With the default replication factor of 3, a 1 TB file requires:

  • 3 TB of raw storage

  • roughly 3 TB of network traffic during the write, since each block crosses the replication pipeline once per replica
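A quick sanity check of the arithmetic, assuming the default replication factor of 3 and the historical 64 MB block size:

\[
\text{blocks} = \frac{1\ \text{TB}}{64\ \text{MB}} = 16{,}384,
\qquad
\text{raw storage} = 1\ \text{TB} \times 3 = 3\ \text{TB}
\]

During the write, each replica is transferred once along the pipeline (client to first DataNode, first to second, second to third), which is where the roughly 3 TB of traffic comes from; if the client itself runs on a DataNode, the first copy is a local write and the network figure is closer to 2 TB.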

How is filesystem metadata stored in Hadoop?

===========

This filesystem metadata is stored in two different constructs: the fsimage and the edit log.

The fsimage is a file that represents a point-in-time snapshot of the filesystem’s metadata.

Rather than writing a new fsimage every time the namespace is modified, the NameNode instead records the modifying operation in the edit log for durability.

This way, if the NameNode crashes, it can restore its state by first loading the fsimage then replaying all the operations (also called edits or transactions) in the edit log to catch up to the most recent state of the namesystem. The edit log comprises a series of files, called edit log segments, that together represent all the namesystem modifications made since the creation of the fsimage.
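The fsimage/edit-log design is an instance of the general snapshot-plus-log pattern. The following self-contained Java sketch models that pattern only; all names are invented for illustration, and this is not NameNode code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NamespaceSketch {
    private Map<String, Long> namespace = new HashMap<>();  // in-memory metadata: path -> size
    private final List<String> editLog = new ArrayList<>(); // edits since the last checkpoint
    private Map<String, Long> fsimage = new HashMap<>();    // last point-in-time snapshot

    public void createFile(String path, long size) {
        editLog.add("CREATE " + path + " " + size); // record the edit for durability first...
        namespace.put(path, size);                  // ...then apply it in memory
    }

    // Checkpointing: write a new "fsimage" and start a fresh edit log segment.
    public void checkpoint() {
        fsimage = new HashMap<>(namespace);
        editLog.clear();
    }

    // Crash recovery: load the snapshot, then replay every logged edit in order.
    public void recover() {
        namespace = new HashMap<>(fsimage);
        for (String edit : editLog) {
            String[] parts = edit.split(" ");
            if (parts[0].equals("CREATE")) {
                namespace.put(parts[1], Long.parseLong(parts[2]));
            }
        }
    }
}
```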

What is ZooKeeper, and how does it fit into Hadoop?

===========

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. In the Hadoop ecosystem it is used, for example, to coordinate automatic NameNode failover in HDFS high-availability setups, and components such as HBase depend on it for coordination.
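As a minimal illustration of the coordination primitives ZooKeeper offers, here is a sketch using its standard Java client API; the connection string and znode path are placeholders, and a reachable ensemble with an established session is assumed:

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (address is a placeholder).
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 15000, event -> {});

        // Store a small piece of shared configuration as a znode.
        zk.create("/demo-config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster can now read the same value.
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```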

What is Hadoop MapReduce?

===========

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

In classic MapReduce (MRv1), the framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them and re-executing the failed tasks. The slaves execute the tasks as directed by the master.

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.
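The canonical example is WordCount, essentially as it appears in the Hadoop MapReduce tutorial: the map function emits a (word, 1) pair per token, the framework sorts and groups the pairs by key, and the reduce function sums the counts per word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts for each word (the framework has already sorted/grouped by key).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```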

What is Yarn?

===========

The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons. The idea is to have a global ResourceManager (RM) and per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network) and reporting it to the ResourceManager/Scheduler.

The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
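As a small illustration of this layer from a client's point of view, the sketch below uses the standard org.apache.hadoop.yarn.client.api.YarnClient to ask the ResourceManager for the applications it is currently tracking; a reachable cluster configured via yarn-site.xml is assumed:

```java
import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSketch {
    public static void main(String[] args) throws Exception {
        // Talk to the ResourceManager named in yarn-site.xml.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // The RM tracks every application, each driven by its own ApplicationMaster.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```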

How does everything above fit together?

===========

HDFS provides the distributed storage layer: a NameNode manages the namespace and block locations, while DataNodes store the replicated blocks. YARN provides the resource layer on the same nodes: a global ResourceManager arbitrates cluster resources, NodeManagers run and monitor containers on each machine, and a per-application ApplicationMaster negotiates resources for its job. MapReduce runs as an application on this stack, scheduling map and reduce tasks on the nodes that already hold the data, while ZooKeeper supplies the coordination services (naming, configuration, synchronization) that the other components rely on.