MapR Hadoop

When it comes to Hadoop distributions, enterprises care about a number of things, among them high performance, high availability, and API compatibility. MapR, a San Jose, Calif.-based start-up, is betting that enterprises care less about whether a distribution is purely open source or includes proprietary components. That’s according to Jack Norris, MapR’s vice president of marketing. He said MapR is the market leader in all three of the top Hadoop priorities – performance, availability, and API compatibility – and has the customers to prove it.

MapR currently has between 40 and 50 paying customers using its enterprise M5 Hadoop distribution, which includes the proprietary NFS storage layer. They include comScore, the online market intelligence firm, which recently dropped Cloudera’s Hadoop distribution in favor of M5. In addition, the company’s free community distribution, M3, has been downloaded thousands of times, according to Norris.

MapR’s performance and availability advantages over competing Hadoop distributions, Norris explained, are due in part to:

  • M5’s distributed namenode architecture, which removes the single point of failure that plagues HDFS;
  • MapR’s Lockless Storage Services layer, which results in higher MapReduce throughput than competing distributions;
  • Its ability to run the equivalent number of jobs on fewer nodes, which results in overall lower TCO.
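
The TCO argument in the last bullet is simple arithmetic, and a rough sketch can make it concrete. The cost model below (one-time hardware cost plus per-node yearly operating and license costs) is an assumption made for illustration, not MapR's pricing or methodology, and every figure in the usage example is a hypothetical placeholder rather than real data.

```python
from dataclasses import dataclass


@dataclass
class ClusterCost:
    """Simplified per-node cost model (an illustrative assumption)."""
    nodes: int
    hardware_per_node: float      # one-time purchase cost per node
    opex_per_node_year: float     # power, space, administration per node per year
    license_per_node_year: float  # 0.0 for a free distribution

    def tco(self, years: int) -> float:
        """Total cost of ownership over the given number of years."""
        capex = self.nodes * self.hardware_per_node
        recurring = self.nodes * (self.opex_per_node_year + self.license_per_node_year) * years
        return capex + recurring


def break_even_nodes(free: ClusterCost, license_per_node_year: float, years: int) -> float:
    """Node count at which a licensed cluster costs the same as the free one.

    Assumes the licensed cluster uses the same hardware and operating cost
    per node and differs only in the license fee.
    """
    per_node = free.hardware_per_node + (free.opex_per_node_year + license_per_node_year) * years
    return free.tco(years) / per_node


if __name__ == "__main__":
    # All numbers below are hypothetical placeholders, not vendor pricing.
    baseline = ClusterCost(nodes=100, hardware_per_node=5000.0,
                           opex_per_node_year=2000.0, license_per_node_year=0.0)
    print("Baseline 3-year TCO:", baseline.tco(3))
    print("Break-even node count:", break_even_nodes(baseline, 4000.0, 3))
```

Under this model, a licensed distribution lowers TCO only if it can run the same workload on fewer nodes than the break-even count, which is the kind of figure worth asking a vendor to substantiate.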

Figure 1 - MapR Hadoop Stack
Source: MapR 2011

But it’s the open source issue where MapR takes a lot of heat. Norris argues that MapR’s approach – improving upon an open source core with proprietary value-add components and services – is a pretty “standard” model in the commercial open source world. While that is a common commercial open source business model, many would argue that the storage layer in a Hadoop distribution is the core, not an add-on.

Norris also said that what’s important is not whether a given Hadoop distribution is purely open source, but whether it is 100% API compatible with the Apache distribution, which M5 is. This, he said, means that while developers can’t fiddle with NFS, they can easily integrate MapR’s distribution with HBase, HDFS, and other Apache Hadoop components, as well as move data in and out of NFS should they choose to tap a different Hadoop distribution. This last point is particularly important. It means, according to MapR, that the risk of vendor lock-in with its Hadoop distribution is no greater than with any other.
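
To test the data-portability claim in practice, an evaluator could script a small export and verify it. The sketch below assumes the cluster's storage is visible as an ordinary mounted directory on the local machine, as the NFS description above suggests. The function names and the mount path in the usage note are hypothetical, and the code uses only the Python standard library, not any MapR-specific API.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_tree(src_root: Path, dst_root: Path) -> List[str]:
    """Copy every file under src_root into dst_root, preserving layout.

    Returns the relative paths whose checksums differ after the copy;
    an empty list means the export verified cleanly.
    """
    mismatches = []
    for src in sorted(p for p in src_root.rglob("*") if p.is_file()):
        rel = src.relative_to(src_root)
        dst = dst_root / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)  # copies file contents and basic metadata
        if sha256_of(src) != sha256_of(dst):
            mismatches.append(str(rel))
    return mismatches
```

A call such as `export_tree(Path("/mnt/cluster/data"), Path("/tmp/export"))`, where the first path is a placeholder for wherever the cluster is mounted, would copy the tree out and report any files that failed verification. Running the same function in the opposite direction exercises the "in" half of the claim.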

MapR’s focus on performance, availability, and API compatibility over open source code also comes through in its go-to-market strategy. MapR is not interested in educating the wider market about the benefits of Hadoop, as Cloudera and Hortonworks seem to be, according to Norris. Rather, MapR is targeting companies that are already using Hadoop or have made the decision to deploy Hadoop and are evaluating their distribution options. MapR also has a relationship with EMC to ship parts of its distribution with EMC Greenplum’s Hadoop offering.

Norris said MapR is targeting customers who already understand what Hadoop can do and want a highly available, enterprise-ready version that they can quickly deploy and easily integrate with other big data tools and technologies through open APIs. MapR’s target customers have already experimented with Cloudera or Apache, Norris explained, and are now ready to move Hadoop into production.

Fact Checking MapR’s Approach

Let’s consider MapR’s claims one-by-one.

  1. API compatibility is more important than open source code. As Hadoop goes mainstream, traditional enterprise users will be more interested in deploying stable, high-performance, enterprise-ready big data stacks than in hacking the Hadoop core. In the meantime, however, big data application developers are adamant that they need access to the source code to integrate their wares seamlessly with Hadoop. In the long term, this claim is probably accurate, but while Hadoop continues its rapid development, open source code is still a critical element for many.
  2. MapR provides better performance and availability than competing Hadoop distributions. It is certainly true that MapR’s distribution has demonstrated significant performance and speed improvements over “vanilla” Hadoop. That said, CIOs are increasingly less interested in “speeds and feeds” and more interested in how Hadoop can deliver real business value.
  3. Enterprises are at no higher risk for vendor lock-in with MapR than with competing Hadoop distributions. It will prove reassuring to potential MapR customers that moving data out of M5, should they choose to move to a different distribution, is no more difficult than with any other distribution thanks to M5’s API compatibility. Still, like Cloudera’s enterprise Hadoop distribution, M5 costs money. How much money an enterprise sinks into an M5 deployment will determine the cost-effectiveness of moving to a competing distribution. So the risk of vendor lock-in with MapR is probably on par with that of Cloudera, but higher than that of Hortonworks’ distribution or the straight Apache Hadoop distribution.

MapR’s strategy carries with it a number of risks. The biggest is that Apache Hadoop catches up to M5 in performance and availability before M5 gains widespread adoption, nullifying MapR’s entire value proposition. Indeed, Apache contributors recently introduced HDFS federation to tackle the single-point-of-failure issue “by adding support for multiple namenodes/namespaces to HDFS file system.”

Norris said that while MapR respects the competition, he doesn’t believe the Apache distribution is even close to reaching performance parity with M5. On the single-point-of-failure issue, for example, he said MapR’s distributed namenode is superior to namenode federation in that M5 “is self-healing, and no user intervention is needed at any point.” In any event, that is a judgment the community will make.

Another risk is that MapR’s message of performance/availability/compatibility over open source code never reaches CIOs because it is drowned out by the fervent open source Hadoop community as well as by marketing from competitors. Hortonworks, like most Benchmark-funded start-ups, is a marketing and PR machine, while Cloudera, with more than 100 paying customers, is double MapR’s size and is on the verge of becoming the de facto Hadoop distribution.

And don’t forget support services. Enterprises that deploy Hadoop want assurance that if there’s an issue with their cluster, the vendor is ready and waiting to put out the fire with fast technical support and intervention.

The $10 billion question, then, is which of the three Hadoop distribution models will enterprises embrace. Cloudera differentiates its core open source Hadoop distribution with its proprietary management console, which the company updated just last week. Hortonworks is going to market as the only 100% open source commercial Apache Hadoop distribution and plans to make money on technical support services. MapR is betting enterprises serious about Hadoop will value its performance and availability advantages over open source code, with its API compatibility assuaging vendor lock-in concerns.

The race is on. For MapR to remain competitive, I believe it must take the following steps:

  • Develop deep and real partnerships with big data application vendors. Enterprises looking to capitalize on big data analytics are increasingly looking to application vendors that promise to deliver real business value from Hadoop. The more application vendors work closely with MapR, the more likely these vendors are to recommend MapR as the underlying Hadoop infrastructure.
  • Continue contributing to the community where it can. MapR recently established MapR Academy, an online resource for Hadoop training and education. MapR should continue efforts like MapR Academy, as well as contribute to the Apache Hadoop project when possible, to engender good-will in the Hadoop community.
  • Aggressively take its message of performance/availability/compatibility over open source code to enterprise CIOs and even CEOs, who are more interested in enterprise stability and performance than in whether a technology is open source. If MapR can convince executives that its Hadoop distribution is more powerful, safer, and more cost-effective than competing distributions, it has a chance to slow Cloudera’s and Hortonworks’ momentum and give itself a fighting chance to win the market.

Action Item: Enterprises evaluating MapR’s Hadoop distribution should demand proof points and customer references from the vendor that illustrate its open API claims, including the ability to easily move data into and out of its cluster. Enterprises looking to navigate the larger Hadoop distribution market should focus on which of the competing Hadoop approaches – Cloudera, Hortonworks, or MapR – brings the greatest business value with the lowest cost and least risk. As we’ve written before, for some enterprises the value of fast business impact on revenue or profit offered by MapR will outweigh the risks of higher capex and the inability to customize the code. For enterprises just beginning to learn about big data and the benefits of Hadoop, it may make more sense to adopt Cloudera’s or Hortonworks’ more open approach, betting that the performance improvements the community develops over time and the flexibility offered by an open source distribution will prove more valuable in the long term. Whatever option enterprises choose, they should stay up to speed with developments in the Hadoop community, as both open and proprietary improvements that can deliver real business value are being made to the technology at a fast clip.
