Spark-Zeppelin: Big Data Visualization and Analysis

Introduction from the Official Website

Multi-purpose Notebook

The Notebook is the place for all your needs

  • Data Ingestion
  • Data Discovery
  • Data Analytics
  • Data Visualization & Collaboration

Multiple language backend

The Zeppelin interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. Currently Zeppelin supports many interpreters such as Scala (with Apache Spark), Python (with Apache Spark), SparkSQL, Hive, Markdown and Shell.
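
For example, the first line of a note paragraph selects which interpreter runs it. A minimal sketch using the Markdown and Shell interpreters mentioned above:

%md
## This paragraph is rendered by the Markdown interpreter

%sh
# ...while this one is executed by the Shell interpreter
echo "hello from Zeppelin"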

Adding a new language backend is really simple. Learn how to write a Zeppelin interpreter.

Apache Spark integration

Zeppelin provides built-in Apache Spark integration. You don't need to build a separate module, plugin or library for it.

Zeppelin's Spark integration provides

  • Automatic SparkContext and SQLContext injection
  • Runtime jar dependency loading from the local filesystem or a Maven repository. Learn more about the dependency loader (see the sketch after this list).
  • Canceling jobs and displaying their progress
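
A minimal sketch of the first two points; the Maven artifact below is only an illustration. The %dep interpreter loads a jar at runtime, and sc / sqlContext are available in a Spark paragraph without any setup:

%dep
// must run before the Spark interpreter starts for the first time
z.load("org.apache.commons:commons-lang3:3.4")

%spark
// SparkContext (sc) and SQLContext (sqlContext) are injected automatically
val nums = sc.parallelize(1 to 100)
println("sum = " + nums.sum())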

Data visualization

Some basic charts are already included in Zeppelin. Visualizations are not limited to SparkSQL queries; any output from any language backend can be recognized and visualized.

Pivot chart

With simple drag and drop, Zeppelin aggregates the values and displays them in a pivot chart. You can easily create charts with multiple aggregated values, including sum, count, average, min and max.

Learn more about Zeppelin's display system (text, html, table, angular).
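
For instance, the table display can be produced from any interpreter by prefixing the output with %table (columns separated by tabs, rows by newlines). A small sketch for a Spark paragraph:

// Zeppelin renders output starting with %table as an interactive table/chart
println("%table name\tvalue\nalpha\t1\nbeta\t2")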

Dynamic forms

Zeppelin can dynamically create input forms in your notebook.
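
The form templates shown later in the tutorial illustrate the idea. A minimal sketch, assuming a hypothetical table named logs, that renders a drop-down list (level) and a text box (rows):

%sql select * from logs where level = "${level=INFO,INFO|WARN|ERROR}" limit ${rows=10}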

Learn more about Dynamic Forms.

Collaboration

The notebook URL can be shared among collaborators. Zeppelin will then broadcast any changes in real time, just like collaboration in Google Docs.

Publish

Zeppelin provides a URL that displays the result only; that page does not include Zeppelin's menu and buttons. This way, you can easily embed it as an iframe inside your website.

100% Opensource

Apache Zeppelin (incubating) is Apache 2.0 licensed software. Please check out the source repository and How to contribute.

Zeppelin has a very active development community. Join the mailing list and report issues on our issue tracker.

Undergoing Incubation

Apache Zeppelin is an effort undergoing incubation at The Apache Software Foundation (ASF), sponsored by the Incubator. Incubation is required of all newly accepted projects until a further review indicates that the infrastructure, communications, and decision making process have stabilized in
a manner consistent with other successful ASF projects. While incubation status is not necessarily a reflection of the completeness or stability of the code, it does indicate that the project has yet to be fully endorsed by the ASF.

Installation

From binary package

Download the latest binary package from the Download page.

Build from source

Check the instructions in the README to build from source.

Configure

Configuration can be done through both environment variables (conf/zeppelin-env.sh) and Java properties (conf/zeppelin-site.xml). If both are defined, the environment variable is used.
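
For example, the server port can be changed either way; the value 8081 below is only an illustration:

# conf/zeppelin-env.sh
export ZEPPELIN_PORT=8081

<!-- conf/zeppelin-site.xml -->
<property>
  <name>zeppelin.server.port</name>
  <value>8081</value>
</property>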

zeppelin-env.sh | zeppelin-site.xml | Default value | Description
ZEPPELIN_PORT | zeppelin.server.port | 8080 | Zeppelin server port
ZEPPELIN_MEM | N/A | -Xmx1024m -XX:MaxPermSize=512m | JVM memory options
ZEPPELIN_INTP_MEM | N/A | ZEPPELIN_MEM | JVM memory options for the interpreter process
ZEPPELIN_JAVA_OPTS | N/A | | JVM options
ZEPPELIN_ALLOWED_ORIGINS | zeppelin.server.allowed.origins | * | Comma-separated list of allowed origins for REST and websockets, e.g. http://localhost:8080
ZEPPELIN_SERVER_CONTEXT_PATH | zeppelin.server.context.path | / | Context path of the web application
ZEPPELIN_SSL | zeppelin.ssl | false |
ZEPPELIN_SSL_CLIENT_AUTH | zeppelin.ssl.client.auth | false |
ZEPPELIN_SSL_KEYSTORE_PATH | zeppelin.ssl.keystore.path | keystore |
ZEPPELIN_SSL_KEYSTORE_TYPE | zeppelin.ssl.keystore.type | JKS |
ZEPPELIN_SSL_KEYSTORE_PASSWORD | zeppelin.ssl.keystore.password | |
ZEPPELIN_SSL_KEY_MANAGER_PASSWORD | zeppelin.ssl.key.manager.password | |
ZEPPELIN_SSL_TRUSTSTORE_PATH | zeppelin.ssl.truststore.path | |
ZEPPELIN_SSL_TRUSTSTORE_TYPE | zeppelin.ssl.truststore.type | |
ZEPPELIN_SSL_TRUSTSTORE_PASSWORD | zeppelin.ssl.truststore.password | |
ZEPPELIN_NOTEBOOK_HOMESCREEN | zeppelin.notebook.homescreen | | Id of the notebook to display on the home screen, e.g. 2A94M5J1Z
ZEPPELIN_NOTEBOOK_HOMESCREEN_HIDE | zeppelin.notebook.homescreen.hide | false | Hide the homescreen notebook from the list when set to "true"
ZEPPELIN_WAR_TEMPDIR | zeppelin.war.tempdir | webapps | Location of the Jetty temporary directory
ZEPPELIN_NOTEBOOK_DIR | zeppelin.notebook.dir | notebook | Where notebook files are saved
ZEPPELIN_NOTEBOOK_S3_BUCKET | zeppelin.notebook.s3.bucket | zeppelin | S3 bucket where notebooks are saved
ZEPPELIN_NOTEBOOK_S3_USER | zeppelin.notebook.s3.user | user | User prefix inside the bucket, e.g. bucket/user/notebook/2A94M5J1Z/note.json
ZEPPELIN_NOTEBOOK_STORAGE | zeppelin.notebook.storage | org.apache.zeppelin.notebook.repo.VFSNotebookRepo | Comma-separated list of notebook storage implementations
ZEPPELIN_INTERPRETERS | zeppelin.interpreters | org.apache.zeppelin.spark.SparkInterpreter, org.apache.zeppelin.spark.PySparkInterpreter, org.apache.zeppelin.spark.SparkSqlInterpreter, org.apache.zeppelin.spark.DepInterpreter, org.apache.zeppelin.markdown.Markdown, org.apache.zeppelin.shell.ShellInterpreter, org.apache.zeppelin.hive.HiveInterpreter, ... | Comma-separated list of interpreter classes; the first interpreter becomes the default
ZEPPELIN_INTERPRETER_DIR | zeppelin.interpreter.dir | interpreter | Zeppelin interpreter directory

You'll also need to configure each interpreter individually. Information can be found in the 'Interpreter' section of this documentation.

For example, Spark.
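
A minimal sketch of a Spark-related setup in conf/zeppelin-env.sh; the path and master URL below are placeholders for your own environment:

# conf/zeppelin-env.sh
export SPARK_HOME=/path/to/spark      # use an existing Spark installation
export MASTER=yarn-client             # or spark://host:7077, local[*], ...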

Start/Stop

Start Zeppelin

bin/zeppelin-daemon.sh start

After successful start, visit http://localhost:8080 with your web browser.

Stop Zeppelin

bin/zeppelin-daemon.sh stop

Hands-on Example:

Zeppelin Tutorial

We will assume you have Zeppelin installed already. If that's not the case, see Install.

Zeppelin's current main backend processing engine is Apache Spark. If you're new to the system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin.

Tutorial with Local File

Data Refine

Before you start the Zeppelin tutorial, you will need to download bank.zip.

First, to transform the data from CSV format into an RDD of Bank objects, run the following script. This will also remove the header using the filter function.

val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
    s=>Bank(s(0).toInt,
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

// convert to DataFrame and create a temporary table
bank.toDF().registerTempTable("bank")

Data Retrieval

Suppose we want to see the age distribution in bank. To do this, run:

%sql select age, count(1) from bank where age < 30 group by age order by age

You can make an input box for setting the age condition by replacing 30 with ${maxAge=30}.

%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age

Now we want to see the age distribution for a certain marital status, with a combo box to select the marital status. Run:

%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age

Tutorial with Streaming Data

Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at Twitter Credential Setup. After you get your API keys, fill in the credential-related values (apiKey, apiSecret, accessToken, accessTokenSecret) in the following script.

This will create RDDs of Tweet objects and register the streaming data as a table:

import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
  val configs = new HashMap[String, String] ++= Seq(
    "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
  println("Configuring Twitter OAuth")
  configs.foreach{ case(key, value) =>
    if (value.trim.isEmpty) {
      throw new Exception("Error setting authentication - value for " + key + " not set")
    }
    val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
    System.setProperty(fullKey, value.trim)
    println("\tProperty " + fullKey + " set as [" + value.trim + "]")
  }
  println()
}

// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

import org.apache.spark.streaming.twitter._
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))

case class Tweet(createdAt:Long, text:String)
twt.map(status=>
  Tweet(status.getCreatedAt().getTime()/1000, status.getText())
).foreachRDD(rdd=>
  // Below line works only in spark 1.3.0.
  // For spark 1.1.x and spark 1.2.x,
  // use rdd.registerTempTable("tweets") instead.
  rdd.toDF().registerAsTable("tweets")
)

twt.print

ssc.start()

Data Retrieval

For each of the following scripts, you will see a different result every time you click the run button, since the queries are based on real-time data.

Let's begin by extracting at most 10 tweets that contain the word "girl".

%sql select * from tweets where text like '%girl%' limit 10

This time, suppose we want to see how many tweets were created per second during the last 60 seconds. To do this, run:

%sql select createdAt, count(1) from tweets group by createdAt order by createdAt

You can create a user-defined function and use it in Spark SQL. Let's try it by making a function named sentiment. This function returns one of three attitudes (positive, negative, neutral) toward its argument.

def sentiment(s:String) : String = {
    val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
    val negative = Array("hate", "bad", "stupid", "is")

    var st = 0;

    val words = s.split(" ")
    positive.foreach(p =>
        words.foreach(w =>
            if(p==w) st = st+1
        )
    )

    negative.foreach(p=>
        words.foreach(w=>
            if(p==w) st = st-1
        )
    )
    if(st>0)
        "positivie"
    else if(st<0)
        "negative"
    else
        "neutral"
}

// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)

To check how people feel about girls using the sentiment function we made above, run this:

%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)