Using sbt's assembly plugin (packaging all dependencies)

1. What is sbt

I'm pretty much a newbie with sbt myself. I picked up Scala in order to work with Spark, and the build tool the Scala material pointed to was sbt (naturally, since sbt itself is written in Scala). At first it just looked like another Maven to me (not that I've used Maven much either), but after building a couple of projects with it I found it quite powerful; the learning curve is just a bit steep.

Then again, what doesn't have a learning curve these days? Anyway, the getting-started guide for version 0.13 is here: http://www.scala-sbt.org/0.13/tutorial/zh-cn/index.html

2. assembly is a packaging plugin for sbt

Here is an example from the getting-started guide:

/* SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
# simple.sbt
name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2"
# Your directory layout should look like this
$ find .
.
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.10/simple-project_2.10-1.0.jar

# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit   --class "SimpleApp"   --master local[4]   target/scala-2.10/simple-project_2.10-1.0.jar
...
Lines with a: 46, Lines with b: 23

So far so good: everything runs fine, because the Spark libraries you depend on already exist on the Spark master and workers. But if you depend on third-party libraries such as the MySQL JDBC driver, packaging with sbt's `package` command alone will not include them in the jar.

Running on Spark will then fail with errors, and if you have multiple worker machines, you would have to install the same runtime environment (jar dependencies) on every one of them.

This is where sbt's assembly plugin comes in. Its job is to package all the dependency jars into a single fat jar.
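For reference, enabling sbt-assembly in an sbt 0.13 project takes a single line in project/plugins.sbt (or project/assembly.sbt). This is a minimal sketch; the plugin version shown is just an example of one current around that time:

# project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

After reloading, running the `assembly` task in the sbt shell builds the fat jar under target/scala-<version>/.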

It is not a silver bullet, though. Things get awkward when you run into files with identical paths coming from different jars.

3. How assembly resolves the SBT Assembly deduplicate error & exclude error

Let's first look at an example of the error:

[error] 1 error was encountered during merge
[trace] Stack trace suppressed: run last *:assembly for the full output.
[error] (*:assembly) deduplicate: different file contents found in the following:
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-handler/jars/netty-handler-4.0.27.Final.jar:META-INF/io.netty.versions.properties
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-buffer/jars/netty-buffer-4.0.27.Final.jar:META-INF/io.netty.versions.properties
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-common/jars/netty-common-4.0.27.Final.jar:META-INF/io.netty.versions.properties
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-transport/jars/netty-transport-4.0.27.Final.jar:META-INF/io.netty.versions.properties
[error] /Users/qpzhang/.ivy2/cache/io.netty/netty-codec/jars/netty-codec-4.0.27.Final.jar:META-INF/io.netty.versions.properties
[error] Total time: 5 s, completed 2015-11-25 20:20:23

Roughly, it is saying that there are several duplicate files with the same path, and it does not know how to handle them. So what do we do?

We have to sort it out by hand. assembly provides rules for excluding and merging files, and these can be written into the build.sbt file.

Reference: https://github.com/sbt/sbt-assembly#excluding-jars-and-files
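Besides merge strategies, that README also shows how to exclude whole jars from the assembly. A hypothetical sketch in sbt 0.13 syntax (the jar name below is only an illustration, not something from this project):

assemblyExcludedJars in assembly := {
  val cp = (fullClasspath in assembly).value
  // keep only the jars we really want inside the fat jar
  cp filter { _.data.getName == "some-unwanted-library.jar" }
}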

In our case the build file looks like this (note: sbt here is the then-latest 0.13 release):

~/scala_code/CassandraTest $ cat build.sbt

name := "CassandraTest"

version := "1.0"

scalaVersion := "2.10.4"
// The Spark dependency is marked "provided": the runtime already has it, so it is not packaged
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"

// The spark-cassandra-connector library we depend on
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2"

// For files whose name ends with .properties, merge with MergeStrategy.first (keep the first occurrence)
assemblyMergeStrategy in assembly := {
  case PathList(ps @ _*) if ps.last endsWith ".properties" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

That settles it. For other cases, just tweak the merge strategy accordingly.
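If other paths collide, the sbt-assembly README lists more strategies (discard, concat, rename, and so on). A hypothetical, more fine-grained example loosely based on the README; the individual cases are illustrative and not taken from this project's build:

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard   // drop jar metadata such as MANIFEST.MF
  case "reference.conf" => MergeStrategy.concat                 // Typesafe config defaults should be concatenated
  case PathList(ps @ _*) if ps.last endsWith ".properties" => MergeStrategy.first
  case x =>
    val oldStrategy = (assemblyMergeStrategy in assembly).value
    oldStrategy(x)
}

Note that the first matching case wins, so put the more specific patterns before the general ones.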

> assembly
[info] Including from cache: slf4j-api-1.7.5.jar
[info] Including from cache: metrics-core-3.0.2.jar
[info] Including from cache: netty-codec-4.0.27.Final.jar
[info] Including from cache: netty-handler-4.0.27.Final.jar
[info] Including from cache: netty-common-4.0.27.Final.jar
[info] Including from cache: joda-time-2.3.jar
[info] Including from cache: netty-buffer-4.0.27.Final.jar
[info] Including from cache: commons-lang3-3.3.2.jar
[info] Including from cache: jsr166e-1.1.0.jar
[info] Including from cache: cassandra-clientutil-2.1.5.jar
[info] Including from cache: joda-convert-1.2.jar
[info] Including from cache: netty-transport-4.0.27.Final.jar
[info] Including from cache: guava-16.0.1.jar
[info] Including from cache: spark-cassandra-connector_2.10-1.5.0-M2.jar
[info] Including from cache: cassandra-driver-core-2.2.0-rc3.jar
[info] Including from cache: scala-reflect-2.10.5.jar
[info] Including from cache: scala-library-2.10.5.jar
[info] Checking every *.class/*.jar file's SHA-1.
[info] Merging files...
[warn] Merging 'META-INF/INDEX.LIST' with strategy 'discard'
[warn] Merging 'META-INF/MANIFEST.MF' with strategy 'discard'
[warn] Merging 'META-INF/io.netty.versions.properties' with strategy 'first'
[warn] Merging 'META-INF/maven/com.codahale.metrics/metrics-core/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.datastax.cassandra/cassandra-driver-core/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.google.guava/guava/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/com.twitter/jsr166e/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-buffer/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-codec/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-common/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-handler/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/io.netty/netty-transport/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/joda-time/joda-time/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.apache.commons/commons-lang3/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.joda/joda-convert/pom.xml' with strategy 'discard'
[warn] Merging 'META-INF/maven/org.slf4j/slf4j-api/pom.xml' with strategy 'discard'
[warn] Strategy 'discard' was applied to 15 files
[warn] Strategy 'first' was applied to a file
[info] SHA-1: d2cb403e090e6a3ae36b08c860b258c79120fc90
[info] Packaging /Users/qpzhang/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar ...
[info] Done packaging.
[success] Total time: 19 s, completed 2015-11-26 10:12:22

4. Execution results
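The source of CassandraTestApp is not included in this post. Judging from the log output below, a rough, hypothetical sketch might look like this (the keyspace, table name, and connection host are assumptions, not the original code):

/* CassandraTest.scala -- hypothetical sketch, not the original source */
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CassandraTestApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("CassandraTestApp")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read the rows that already exist in the table
    val existing = sc.cassandraTable("test", "kv")
    existing.collect().foreach(row => println("Existing Data: " + row))

    // Write a couple of new (key, value) rows and print them
    val newRows = sc.parallelize(Seq((4, "fourth row"), (5, "fifth row")))
    newRows.saveToCassandra("test", "kv", SomeColumns("key", "value"))
    newRows.collect().foreach(row => println("New Data: " + row))

    println("Work completed, stopping the Spark context.")
    sc.stop()
  }
}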

~/project/spark-1.5.2-bin-hadoop2.6 $ ./bin/spark-submit --class "CassandraTestApp" --master local[4] ~/scala_code/CassandraTest/target/scala-2.10/CassandraTest-assembly-1.0.jar
//...........................
15/11/26 11:40:23 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, NODE_LOCAL, 26660 bytes)
15/11/26 11:40:23 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/11/26 11:40:23 INFO Executor: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar with timestamp 1448509221160
15/11/26 11:40:23 INFO CassandraConnector: Disconnected from Cassandra cluster: Test Cluster
15/11/26 11:40:23 INFO Utils: Fetching http://10.60.215.42:57683/jars/CassandraTest-assembly-1.0.jar to /private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf-8489-4540-976e-e98eedf50412/userFiles-63085bda-aa04-4906-9621-c1cedd98c163/fetchFileTemp7487594
894647111926.tmp
15/11/26 11:40:23 INFO Executor: Adding file:/private/var/folders/2l/195zcc1n0sn2wjfjwf9hl9d80000gn/T/spark-4030cadf-8489-4540-976e-e98eedf50412/userFiles-63085bda-aa04-4906-9621-c1cedd98c163/CassandraTest-assembly-1.0.jar to class loader
15/11/26 11:40:24 INFO Cluster: New Cassandra host localhost/127.0.0.1:9042 added
15/11/26 11:40:24 INFO CassandraConnector: Connected to Cassandra cluster: Test Cluster
15/11/26 11:40:25 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2676 bytes result sent to driver
15/11/26 11:40:25 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2462 ms on localhost (1/1)
15/11/26 11:40:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/11/26 11:40:25 INFO DAGScheduler: ResultStage 0 (collect at CassandraTest.scala:32) finished in 2.481 s
15/11/26 11:40:25 INFO DAGScheduler: Job 0 finished: collect at CassandraTest.scala:32, took 2.940601 s
Existing Data: CassandraRow{key: 1, value: first row}
Existing Data: CassandraRow{key: 2, value: second row}
Existing Data: CassandraRow{key: 3, value: third row}
//....................
15/11/26 11:40:27 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/11/26 11:40:27 INFO DAGScheduler: ResultStage 3 (collect at CassandraTest.scala:41) finished in 0.032 s
15/11/26 11:40:27 INFO DAGScheduler: Job 3 finished: collect at CassandraTest.scala:41, took 0.046502 s
New Data: (4,fourth row)
New Data: (5,fifth row)
Work completed, stopping the Spark context.