大数据基础问答-之二

What is Spark?

=============

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.

Apache Spark provides programmers with an application programming interface centered on a data structure called the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way.

It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs:

MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark‘s RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.

It was developed in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs

What is KAFKA?

==============

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Its storage layer is essentially a "massively scalable pub/sub message queue architected as a distributed transaction log," making it highly valuable for enterprise infrastructures to process streaming data. Additionally, Kafka connects to external systems (for data import/export) via Kafka Connect and provides Kafka Streams, a Java stream processing library.

Kafka stores messages which come from arbitrarily many processes called "producers". The data can thereby be partitioned in different "partitions" within different "topics". Within a partition the messages are indexed and stored together with a timestamp. Other processes called "consumers" can query messages from partitions. Kafka runs on a cluster of one or more servers and the partitions can be distributed across cluster nodes.

What is Splunk?

==============

What do you do when you need information about the state of all devices in your data center? Looking at all logfiles would be the right answer if it was possible in any practical amount of time. This is where Splunk comes in.

Splunk started out as a kind of “Google for Logfiles”. It does a lot more today but log processing is still at the product’s core. It stores all your logs and provides very fast search capabilities roughly in the same way Google does for the internet.

  • splunkd is a distributed C/C++ server that accesses, processes and indexes streaming IT data and also handles search requests.
  • splunkweb is a Python-based application server providing the Splunk Web user interface.
  • Splunk‘s Data Store manages the original raw data in compressed format as well as the indexes into the data. Data can be deleted or archived based on retention period or maximum data store size.
  • Splunk Servers can communicate with one another via Splunk-2-Splunk, a TCP-based protocol, to forward data from one server to another and to distribute searches across multiple servers.
  • Bundles are files that contain configuration settings including, user accounts, Splunks, Live Splunks, Data Inputs and Processing Properties to easily create specific Splunk environments.
  • Modules are files that add new functionality to Splunk by adding to or modifying existing processors and pipelines.

What is SAS?

============

Big data analytics software.

What is Storm?

============

Apache Storm, in simple terms, is a distributed framework for real time processing of Big Data like Apache Hadoop is a distributed framework for batch processing.

Why Storm was needed :

First of all , we should understand why we needed storm when we already had Hadoop. Hadoop is intelligently designed to process huge amount of data in distributed fashion on commodity hardwares but the way it processes does not suit the use case of Real time processing. MR jobs are disk intensive, every time it runs, it fetches data from disk,processes in memory in mapper/reducer and then write back to disk ; the job is finished then and there. For next batch of data, a new job will be created again and whole cycle will repeat.

However in real time processing, data will be coming continuously and you have to keep getting data,process it and keep writing the output as well. And of course, all these have to be done parallel and fast in distributed fashion with no data loss. This is where Storm pitches in.

时间: 2024-07-28 20:40:06

大数据基础问答-之二的相关文章

大数据基础问答-之一

What is Hadoop? ========== Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apac

大数据基础教程:创建RDD的二种方式

大数据基础教程:创建RDD的二种方式 1.从集合中创建RDD val conf = new SparkConf().setAppName("Test").setMaster("local")      val sc = new SparkContext(conf)      //这两个方法都有第二参数是一个默认值2  分片数量(partition的数量)      //scala集合通过makeRDD创建RDD,底层实现也是parallelize      val 

中小企业的大数据技术路线选择(二)-Cassandra+Presto方案

中小企业的大数据技术路线选择(二)-Cassandra+Presto方案 我前面曾经写过:中小企业的大数据技术路线选择 和 低调.奢华.有内涵的敏捷式大数据方案:Flume+Cassandra+Presto+SpagoBI . 最近用了两个月的时间终于把Cassandra+Presto+SpagoBI方案验证通过了.验证了Presto的JDBC Driver .Prestogres网关.SHIB三种方式. 一.Presto JDBC驱动方案 Presto JDBC驱动方案,Java动用客户端,如

区块链这些技术与h5房卡斗牛平台出售,大数据基础软件干货不容错过

在IT产业发展中,包括CPU.操作系统h5房卡斗牛平台出售 官网:h5.super-mans.com 企娥:2012035031 vx和tel:17061863513 h5房卡斗牛平台出售在内的基础软硬件地位独特,不但让美国赢得了产业发展的先机,成就了产业巨头,而且因为技术.标准和生态形成的壁垒,主宰了整个产业的发展.错失这几十年的发展机遇,对于企业和国家都是痛心的. 当大数据迎面而来,并有望成就一个巨大的应用和产业机会时,企业和国家都虎视眈眈,不想错再失这一难得的机遇.与传统的IT产业一样,大

第二期:关于十大数据相关问答汇总,关注持续更新中哦~

NO.1 学大数据如何零基础入门? 答:学习任何东西都一样,一开始就是一道坎,我很喜欢看书,特别是容易入门的书.对于大数据,我的具体研究方向是大规模数据的机器学习应用,所以首先要掌握以下基本概念.微积分(求导,极值,极限)线性代数(矩阵表示.矩阵计算.特征根.特征向量)概率论+统计(很多数据分析建模基于统计模型).统计推断.随机过程线性规划+凸优化.非线性规划等*数值计算.数值线代等当然一开始只要有微积分.线代以及概率论基本上就可以入门机器学习,我强烈推荐几本书,这几本书不需要看完,只需要对其中

第四期:有关大数据相关问答汇总,持续更新哦~

NO.1 大数据为什么这么"火"?为什么那么多人转型学大数据? 回答一:身为数据极客,在2017年应该能感觉很幸福. 去年,我们曾经问过大家"大数据还是个值得关注的大事吗?",并注意到由于大数据更像是一种"系统化工程",因此在企业的接受速度方面要落后于整个业界的炒作.大数据技术用了多年时间进行演化,才从一种看起来很酷的新技术变成企业在生产环境中实际部署的核心企业级系统. 2017年,我们已经很适应这样的部署阶段."大数据"这个

学完大数据基础,可以按照我写的顺序学下去

首先给大家介绍什么叫大数据,大数据最早是在2006年谷歌提出来的,百度给他的定义为巨量数据集合,辅相成在今天大数据技术任然随着互联网的发展,更加迅速的成长,小到个人,企业,达到国家安全,大数据的作用可见一斑,也就是近几年大数据这个概念,随着云计算的出现才凸显出其价值,云计算与大数据的关系就像硬币的正反面一样,相密不可分.但是大数据的人才缺失少之又少,这就拖延了大数据的发展.所以人才培养真的很重要. 大数据的定义.大数据,又称巨量资料,指的是所涉及的数据资料量规模巨大到无法通过人脑甚至主流软件工具

分分钟理解大数据基础之Spark

一背景 Spark 是 2010 年由 UC Berkeley AMPLab 开源的一款 基于内存的分布式计算框架,2013 年被Apache 基金会接管,是当前大数据领域最为活跃的开源项目之一 Spark 在 MapReduce 计算框架的基础上,支持计算对象数据可以直接缓存到内存中,大大提高了整体计算效率.特别适合于数据挖掘与机器学习等需要反复迭代计算的场景. 二特性 高效:Spark提供 Cache 机制,支持需要反复迭代的计算或者多次数据共享,基于Spark 的内存计算比 Hadoop

大数据基础和hadoop

一.大数据的特点 大数据是什么?其实很简单,大数据其实就是海量资料巨量资料,这些巨量资料来源于世界各地随时产生的数据,在大数据时代,任何微小的数据都可能产生不可思议的价值.大数据有4个特点,为别为:Volume(大量).Variety(多样).Velocity(高速).Value(价值),一般我们称之为4V. 所谓4V,具体指如下4点: 1.大量.大数据的特征首先就体现为“大”,从先Map3时代,一个小小的MB级别的Map3就可以满足很多人的需求,然而随着时间的推移,存储单位从过去的GB到TB,