Distributed Application Monitoring: A Quick-Start Guide to SkyWalking

  Distributed applications run into all kinds of problems. Solving them takes more than the instrumentation points an application adds for itself; you also need external systems that actively probe for and discover issues.

  That is exactly what APM tools are for. SkyWalking is an excellent open-source APM created by Chinese developers, and it is now a top-level Apache project.

  

  Today, let's put SkyWalking into practice.

  Goal of this exercise: monitor a few existing systems, see the call relationships between them clearly, and be able to pinpoint performance problems.

Steps:

  1. Install and run the SkyWalking backend;
  2. Hook up the applications;
  3. Check the results in the web UI;
  4. Analyze and troubleshoot problems;
  5. Dig deeper (if the mood strikes);

1. Installing the SkyWalking Backend

  Download the distribution package:

    # Main download page
    http://skywalking.apache.org/downloads/
    # Follow a mirror link and download, e.g.:
    wget http://mirrors.tuna.tsinghua.edu.cn/apache/skywalking/6.5.0/apache-skywalking-apm-6.5.0.tar.gz
    

  Extract the package:

    tar -xzvf apache-skywalking-apm-6.5.0.tar.gz
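
  To get your bearings, the extracted directory looks roughly like this (a sketch based on the 6.x binary distribution; the folder name and exact contents may vary by version):

    cd apache-skywalking-apm-bin
    ls
    # agent/     - the Java agent you attach to your applications
    # bin/       - start/stop scripts (startup.sh, oapService.sh, webappService.sh, ...)
    # config/    - OAP collector configuration (application.yml, ...)
    # oap-libs/  - OAP server libraries
    # webapp/    - the web UI and its webapp.yml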

  Keep the default ports and the default H2 storage, and start the services directly:

    ./bin/startup.sh

  A good product really is this simple!
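
  Before opening the UI, you can sanity-check that both the OAP collector and the web UI came up (a minimal sketch; the log file names may differ between versions):

    # the three default ports: 11800 (agent gRPC), 12800 (REST), 8080 (web UI)
    ss -lnt | grep -E ':(11800|12800|8080)'
    # tail the logs if something looks off
    tail -n 50 logs/skywalking-oap-server.log logs/webapp.log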

  The backend is now running. Open the web UI to take a look (the default port is 8080): http://localhost:8080. The UI looks like this:

  Of course, the screenshot above is from an instance that already has applications reporting. Right now you won't see any applications, because nothing has been hooked up yet.

2. Hooking Up an Application

  We will only walk through the Java integration here.

  Simply start the application with the javaagent attached:

    java -javaagent:/root/skywalking/agent/skywalking-agent.jar -Dskywalking.agent.service_name=app1 -Dskywalking.collector.backend_service=localhost:11800 -jar myapp.jar

  Parameter notes:

    # Parameter explanation
    skywalking.agent.service_name: the name of this application as shown in SkyWalking
    skywalking.collector.backend_service: the SkyWalking backend address for gRPC reporting; the default port is 11800
    # The same two settings can also be supplied in another form
    SW_AGENT_COLLECTOR_BACKEND_SERVICES: same meaning as skywalking.collector.backend_service
    SW_AGENT_NAME: same meaning as skywalking.agent.service_name
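
  As a concrete example of that alternative form: the SW_* names are environment variables, resolved through the ${SW_...:default} placeholders in agent/config/agent.config, so the same startup can be written as (a minimal sketch, reusing the agent path from above):

    export SW_AGENT_NAME=app1
    export SW_AGENT_COLLECTOR_BACKEND_SERVICES=localhost:11800
    java -javaagent:/root/skywalking/agent/skywalking-agent.jar -jar myapp.jar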

  Hit a few endpoints or pages so the agent has data to report.
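
  For example (the port and paths here are placeholders for whatever your application actually exposes):

    curl http://localhost:8081/hello
    curl http://localhost:8081/orders/list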

  Back in the web UI, the node now shows up, as in the screenshot above.

  We can now also view the relationships between the applications!

  The topology is clear at a glance; no matter how complex the code is, it is no longer intimidating.

  We can also trace individual call chains:

  As long as you know roughly when a problem occurred, you can quickly locate the failing endpoint and system and fix it.

3. SkyWalking Configuration Files

  As shown above, we got the system running without changing a single configuration file. Lucky as that is, we should know more; at the very least, which configuration files exist and what they do.

  config/application.yml : configuration of the collector (OAP server)

  webapp/webapp.yml : configures the web UI port and the IP/port of the OAP (collector) it pulls data from

  agent/config/agent.config : agent settings, such as the SkyWalking OAP (collector) address and the service name

  Below are SkyWalking's default configurations; a sample setup runs without changing anything. Adjust them when moving towards production (a sketch of that follows the application.yml listing).

config/application.yml

cluster:
  standalone:
  # Please check your ZooKeeper is 3.5+, However, it is also compatible with ZooKeeper 3.4.x. Replace the ZooKeeper 3.5+
  # library the oap-libs folder with your ZooKeeper 3.4.x library.
#  zookeeper:
#    nameSpace: ${SW_NAMESPACE:""}
#    hostPort: ${SW_CLUSTER_ZK_HOST_PORT:localhost:2181}
#    #Retry Policy
#    baseSleepTimeMs: ${SW_CLUSTER_ZK_SLEEP_TIME:1000} # initial amount of time to wait between retries
#    maxRetries: ${SW_CLUSTER_ZK_MAX_RETRIES:3} # max number of times to retry
#    # Enable ACL
#    enableACL: ${SW_ZK_ENABLE_ACL:false} # disable ACL in default
#    schema: ${SW_ZK_SCHEMA:digest} # only support digest schema
#    expression: ${SW_ZK_EXPRESSION:skywalking:skywalking}
#  kubernetes:
#    watchTimeoutSeconds: ${SW_CLUSTER_K8S_WATCH_TIMEOUT:60}
#    namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
#    labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}
#    uidEnvName: ${SW_CLUSTER_K8S_UID:SKYWALKING_COLLECTOR_UID}
#  consul:
#    serviceName: ${SW_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
#     Consul cluster nodes, example: 10.0.0.1:8500,10.0.0.2:8500,10.0.0.3:8500
#    hostPort: ${SW_CLUSTER_CONSUL_HOST_PORT:localhost:8500}
#  nacos:
#    serviceName: ${SW_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
#    hostPort: ${SW_CLUSTER_NACOS_HOST_PORT:localhost:8848}
#  # Nacos Configuration namespace
#    namespace: 'public'
#  etcd:
#    serviceName: ${SW_SERVICE_NAME:"SkyWalking_OAP_Cluster"}
#     etcd cluster nodes, example: 10.0.0.1:2379,10.0.0.2:2379,10.0.0.3:2379
#    hostPort: ${SW_CLUSTER_ETCD_HOST_PORT:localhost:2379}
core:
  default:
    # Mixed: Receive agent data, Level 1 aggregate, Level 2 aggregate
    # Receiver: Receive agent data, Level 1 aggregate
    # Aggregator: Level 2 aggregate
    role: ${SW_CORE_ROLE:Mixed} # Mixed/Receiver/Aggregator
    restHost: ${SW_CORE_REST_HOST:0.0.0.0}
    restPort: ${SW_CORE_REST_PORT:12800}
    restContextPath: ${SW_CORE_REST_CONTEXT_PATH:/}
    gRPCHost: ${SW_CORE_GRPC_HOST:0.0.0.0}
    gRPCPort: ${SW_CORE_GRPC_PORT:11800}
    downsampling:
      - Hour
      - Day
      - Month
    # Set a timeout on metrics data. After the timeout has expired, the metrics data will automatically be deleted.
    enableDataKeeperExecutor: ${SW_CORE_ENABLE_DATA_KEEPER_EXECUTOR:true} # Turn it off then automatically metrics data delete will be close.
    dataKeeperExecutePeriod: ${SW_CORE_DATA_KEEPER_EXECUTE_PERIOD:5} # How often the data keeper executor runs periodically, unit is minute
    recordDataTTL: ${SW_CORE_RECORD_DATA_TTL:90} # Unit is minute
    minuteMetricsDataTTL: ${SW_CORE_MINUTE_METRIC_DATA_TTL:90} # Unit is minute
    hourMetricsDataTTL: ${SW_CORE_HOUR_METRIC_DATA_TTL:36} # Unit is hour
    dayMetricsDataTTL: ${SW_CORE_DAY_METRIC_DATA_TTL:45} # Unit is day
    monthMetricsDataTTL: ${SW_CORE_MONTH_METRIC_DATA_TTL:18} # Unit is month
    # Cache metric data for 1 minute to reduce database queries, and if the OAP cluster changes within that minute,
    # the metrics may not be accurate within that minute.
    enableDatabaseSession: ${SW_CORE_ENABLE_DATABASE_SESSION:true}
storage:
#  elasticsearch:
#    nameSpace: ${SW_NAMESPACE:""}
#    clusterNodes: ${SW_STORAGE_ES_CLUSTER_NODES:localhost:9200}
#    protocol: ${SW_STORAGE_ES_HTTP_PROTOCOL:"http"}
#    trustStorePath: ${SW_SW_STORAGE_ES_SSL_JKS_PATH:"../es_keystore.jks"}
#    trustStorePass: ${SW_SW_STORAGE_ES_SSL_JKS_PASS:""}
#    user: ${SW_ES_USER:""}
#    password: ${SW_ES_PASSWORD:""}
#    indexShardsNumber: ${SW_STORAGE_ES_INDEX_SHARDS_NUMBER:2}
#    indexReplicasNumber: ${SW_STORAGE_ES_INDEX_REPLICAS_NUMBER:0}
#    # Those data TTL settings will override the same settings in core module.
#    recordDataTTL: ${SW_STORAGE_ES_RECORD_DATA_TTL:7} # Unit is day
#    otherMetricsDataTTL: ${SW_STORAGE_ES_OTHER_METRIC_DATA_TTL:45} # Unit is day
#    monthMetricsDataTTL: ${SW_STORAGE_ES_MONTH_METRIC_DATA_TTL:18} # Unit is month
#    # Batch process setting, refer to https://www.elastic.co/guide/en/elasticsearch/client/java-api/5.5/java-docs-bulk-processor.html
#    bulkActions: ${SW_STORAGE_ES_BULK_ACTIONS:1000} # Execute the bulk every 1000 requests
#    flushInterval: ${SW_STORAGE_ES_FLUSH_INTERVAL:10} # flush the bulk every 10 seconds whatever the number of requests
#    concurrentRequests: ${SW_STORAGE_ES_CONCURRENT_REQUESTS:2} # the number of concurrent requests
#    resultWindowMaxSize: ${SW_STORAGE_ES_QUERY_MAX_WINDOW_SIZE:10000}
#    metadataQueryMaxSize: ${SW_STORAGE_ES_QUERY_MAX_SIZE:5000}
#    segmentQueryMaxSize: ${SW_STORAGE_ES_QUERY_SEGMENT_SIZE:200}
  h2:
    driver: ${SW_STORAGE_H2_DRIVER:org.h2.jdbcx.JdbcDataSource}
    url: ${SW_STORAGE_H2_URL:jdbc:h2:mem:skywalking-oap-db}
    user: ${SW_STORAGE_H2_USER:sa}
    metadataQueryMaxSize: ${SW_STORAGE_H2_QUERY_MAX_SIZE:5000}
#  mysql:
#    properties:
#      jdbcUrl: ${SW_JDBC_URL:"jdbc:mysql://localhost:3306/swtest"}
#      dataSource.user: ${SW_DATA_SOURCE_USER:root}
#      dataSource.password: ${SW_DATA_SOURCE_PASSWORD:root@1234}
#      dataSource.cachePrepStmts: ${SW_DATA_SOURCE_CACHE_PREP_STMTS:true}
#      dataSource.prepStmtCacheSize: ${SW_DATA_SOURCE_PREP_STMT_CACHE_SQL_SIZE:250}
#      dataSource.prepStmtCacheSqlLimit: ${SW_DATA_SOURCE_PREP_STMT_CACHE_SQL_LIMIT:2048}
#      dataSource.useServerPrepStmts: ${SW_DATA_SOURCE_USE_SERVER_PREP_STMTS:true}
#    metadataQueryMaxSize: ${SW_STORAGE_MYSQL_QUERY_MAX_SIZE:5000}
receiver-sharing-server:
  default:
receiver-register:
  default:
receiver-trace:
  default:
    bufferPath: ${SW_RECEIVER_BUFFER_PATH:../trace-buffer/}  # Path to trace buffer files, suggest to use absolute path
    bufferOffsetMaxFileSize: ${SW_RECEIVER_BUFFER_OFFSET_MAX_FILE_SIZE:100} # Unit is MB
    bufferDataMaxFileSize: ${SW_RECEIVER_BUFFER_DATA_MAX_FILE_SIZE:500} # Unit is MB
    bufferFileCleanWhenRestart: ${SW_RECEIVER_BUFFER_FILE_CLEAN_WHEN_RESTART:false}
    sampleRate: ${SW_TRACE_SAMPLE_RATE:10000} # The sample rate precision is 1/10000. 10000 means 100% sample in default.
    slowDBAccessThreshold: ${SW_SLOW_DB_THRESHOLD:default:200,mongodb:100} # The slow database access thresholds. Unit ms.
receiver-jvm:
  default:
receiver-clr:
  default:
service-mesh:
  default:
    bufferPath: ${SW_SERVICE_MESH_BUFFER_PATH:../mesh-buffer/}  # Path to trace buffer files, suggest to use absolute path
    bufferOffsetMaxFileSize: ${SW_SERVICE_MESH_OFFSET_MAX_FILE_SIZE:100} # Unit is MB
    bufferDataMaxFileSize: ${SW_SERVICE_MESH_BUFFER_DATA_MAX_FILE_SIZE:500} # Unit is MB
    bufferFileCleanWhenRestart: ${SW_SERVICE_MESH_BUFFER_FILE_CLEAN_WHEN_RESTART:false}
istio-telemetry:
  default:
envoy-metric:
  default:
#    alsHTTPAnalysis: ${SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS:k8s-mesh}
#receiver_zipkin:
#  default:
#    host: ${SW_RECEIVER_ZIPKIN_HOST:0.0.0.0}
#    port: ${SW_RECEIVER_ZIPKIN_PORT:9411}
#    contextPath: ${SW_RECEIVER_ZIPKIN_CONTEXT_PATH:/}
query:
  graphql:
    path: ${SW_QUERY_GRAPHQL_PATH:/graphql}
alarm:
  default:
telemetry:
  none:
configuration:
  none:
#  apollo:
#    apolloMeta: http://106.12.25.204:8080
#    apolloCluster: default
#    # apolloEnv: # defaults to null
#    appId: skywalking
#    period: 5
#  nacos:
#    # Nacos Server Host
#    serverAddr: 127.0.0.1
#    # Nacos Server Port
#    port: 8848
#    # Nacos Configuration Group
#    group: 'skywalking'
#    # Nacos Configuration namespace
#    namespace: ''
#    # Unit seconds, sync period. Default fetch every 60 seconds.
#    period : 60
#    # the name of current cluster, set the name if you want to upstream system known.
#    clusterName: "default"
#  zookeeper:
#    period : 60 # Unit seconds, sync period. Default fetch every 60 seconds.
#    nameSpace: /default
#    hostPort: localhost:2181
#    #Retry Policy
#    baseSleepTimeMs: 1000 # initial amount of time to wait between retries
#    maxRetries: 3 # max number of times to retry
#  etcd:
#    period : 60 # Unit seconds, sync period. Default fetch every 60 seconds.
#    group :  'skywalking'
#    serverAddr: localhost:2379
#    clusterName: "default"
#  consul:
#    # Consul host and ports, separated by comma, e.g. 1.2.3.4:8500,2.3.4.5:8500
#    hostAndPorts: ${consul.address}
#    # Sync period in seconds. Defaults to 60 seconds.
#    period: 1

#exporter:
#  grpc:
#    targetHost: ${SW_EXPORTER_GRPC_HOST:127.0.0.1}
#    targetPort: ${SW_EXPORTER_GRPC_PORT:9870}
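
  As promised, a first step towards a production configuration: switch storage from H2 to Elasticsearch by commenting out the h2 block and uncommenting the elasticsearch block above. Because every value is a ${SW_...:default} placeholder, the details can then be supplied from the environment rather than edited in place. A hedged sketch (the ES address and credentials are placeholders):

    # after enabling the elasticsearch storage block in config/application.yml
    export SW_STORAGE_ES_CLUSTER_NODES=es-node1:9200
    export SW_NAMESPACE=skywalking
    export SW_ES_USER=elastic
    export SW_ES_PASSWORD=changeme
    ./bin/oapService.sh   # start just the OAP, or restart everything with ./bin/startup.sh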
    

webapp/webapp.yml

server:
  port: 8080

collector:
  path: /graphql
  ribbon:
    ReadTimeout: 10000
    # Point to all backend's restHost:restPort, split by ,
    listOfServers: 127.0.0.1:12800

agent/config/agent.config

# The agent namespace
# agent.namespace=${SW_AGENT_NAMESPACE:default-namespace}

# The service name in UI
agent.service_name=${SW_AGENT_NAME:Your_ApplicationName}

# The number of sampled traces per 3 seconds
# Negative number means sample traces as many as possible, most likely 100%
# agent.sample_n_per_3_secs=${SW_AGENT_SAMPLE:-1}

# Authentication active is based on backend setting, see application.yml for more details.
# agent.authentication = ${SW_AGENT_AUTHENTICATION:xxxx}

# The max amount of spans in a single segment.
# Through this config item, skywalking keep your application memory cost estimated.
# agent.span_limit_per_segment=${SW_AGENT_SPAN_LIMIT:300}

# Ignore the segments if their operation names end with these suffix.
# agent.ignore_suffix=${SW_AGENT_IGNORE_SUFFIX:.jpg,.jpeg,.js,.css,.png,.bmp,.gif,.ico,.mp3,.mp4,.html,.svg}

# If true, skywalking agent will save all instrumented classes files in `/debugging` folder.
# Skywalking team may ask for these files in order to resolve compatible problem.
# agent.is_open_debugging_class = ${SW_AGENT_OPEN_DEBUG:true}

# The operationName max length
# agent.operation_name_threshold=${SW_AGENT_OPERATION_NAME_THRESHOLD:500}

# Backend service addresses.
collector.backend_service=${SW_AGENT_COLLECTOR_BACKEND_SERVICES:127.0.0.1:11800}

# Logging file_name
logging.file_name=${SW_LOGGING_FILE_NAME:skywalking-api.log}

# Logging level
logging.level=${SW_LOGGING_LEVEL:DEBUG}

# Logging dir
# logging.dir=${SW_LOGGING_DIR:""}

# Logging max_file_size, default: 300 * 1024 * 1024 = 314572800
# logging.max_file_size=${SW_LOGGING_MAX_FILE_SIZE:314572800}

# The max history log files. When rollover happened, if log files exceed this number,
# then the oldest file will be delete. Negative or zero means off, by default.
# logging.max_history_files=${SW_LOGGING_MAX_HISTORY_FILES:-1}

# mysql plugin configuration
# plugin.mysql.trace_sql_parameters=${SW_MYSQL_TRACE_SQL_PARAMETERS:false}
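
  The settings in agent.config can also be overridden per process without editing the file: either through the environment variables behind the ${SW_...:default} placeholders above, or through JVM system properties prefixed with skywalking., as in the startup command from section 2. A small sketch:

    # cap sampling and reduce agent log verbosity for one instance
    java -javaagent:/root/skywalking/agent/skywalking-agent.jar \
         -Dskywalking.agent.service_name=app1 \
         -Dskywalking.agent.sample_n_per_3_secs=100 \
         -Dskywalking.logging.level=INFO \
         -jar myapp.jar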

4. SkyWalking Architecture

  The architecture diagram on the official site is worth a look. In a nutshell: probes for the various client types collect metrics and traces, report them uniformly over gRPC/HTTP to the APM backend, which runs the data through its analysis engine and stores the results in Elasticsearch, H2, MySQL, or other storage; finally, the web UI presents it all through the query engine.

5. What Can It Be Used For?

  Finding where the time goes, i.e. where the bottlenecks are.

  Discovering the call relationships between systems.

  Monitoring service errors.

  Troubleshooting system failures.

Original article: https://www.cnblogs.com/yougewe/p/11973117.html
