Kubernetes: collecting cluster events to monitor cluster behavior

1. Overview

Our production Kubernetes cluster has already been through the Double 11 (Singles' Day) traffic peak; with some network and monitoring tuning beforehand, it held up well. First, a quick rundown of how we use Kubernetes:

Host OS: Ubuntu 16.04 (kernel upgraded to 4.17)

Kubernetes version: 1.13.2

Networking: Calico (BGP mode + BGP route reflector)

kube-proxy: IPVS mode

Monitoring: Prometheus + Grafana

Logging: fluentd + ES

Metrics: metrics-server

HPA: CPU + memory

Alerting: DingTalk

CI/CD: gitlab-ci / gitlab-runner

Application management: helm + chartmuseum (I don't recommend using helm directly; helm charts are hard to read and the learning curve is steep)

kube-gateway: since Kubernetes and the bare-metal environment coexist, we bridge the two networks to provide access

Some of this touches on company internals that I can't publish here, but most of it is covered in my earlier blog posts if you're interested.

A reflection of my own:

After the cluster had been running in production for a while, I realized I had no clear view of what was changing inside it: a pod getting rescheduled, image GC failing on a node, an HPA being triggered, and so on. All of this is visible through events, but events are not stored permanently. Since they capture the state changes of every resource in the cluster, collecting and analyzing them is a good way to understand what is happening internally.

After some searching I found the open-source eventrouter project and adapted it to our business needs, renaming it eventrouter-kafka (https://github.com/cuishuaigit/eventrouter-kafka). It ships events straight to Kafka with a single config change, whereas the original's configuration felt cumbersome to me. Our logs already flow through a Kafka queue anyway, which relieves write pressure on ES. The current event-collection pipeline:

eventrouter ---> kafka ---> logstash (filter/parse) ---> ES ---> elastalert ---> DingTalk

Adding this event collection makes the cluster one step more complete.

2. The pipeline, step by step

Step 1: Deploy eventrouter

eventrouter is written in Go, so you can extend it for your own needs. Deployment is simple; see https://github.com/cuishuaigit/eventrouter-kafka — I won't repeat the details here.
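For reference, each record eventrouter publishes to Kafka is a JSON document that wraps the raw Kubernetes Event under an `event` key, together with a `verb` (ADDED/UPDATED) and, on updates, an `old_event` — these are the fields the Logstash filter and elastalert rules below rely on. A minimal sketch (the field values are made up for illustration, not from a real cluster):

```python
import json

# Hand-built sample of the JSON record eventrouter-kafka writes to the
# "eventrouter" topic (illustrative values only).
record = {
    "verb": "ADDED",  # ADDED for new events, UPDATED for modified ones
    "event": {
        "type": "Warning",
        "reason": "FailedScheduling",
        "message": "0/5 nodes are available: insufficient memory",
        "involvedObject": {
            "kind": "Pod",
            "name": "web-7d9c6b-x2k4f",
            "namespace": "default",
        },
        "source": {"component": "default-scheduler"},
    },
}

payload = json.dumps(record)
print(payload)
```

These nested paths (`event.type`, `event.involvedObject.kind`, `event.reason`) are exactly what the elastalert rule queries later on.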

Step 2: Kafka cluster

See: https://github.com/cuishuaigit/k8s-kafka

Step 3: Logstash

Download the matching version of Logstash: https://www.elastic.co/guide/en/logstash/6.5/installing-logstash.html

Then configure it. Here is my test configuration:

input {
  kafka {
    bootstrap_servers => ["kafka-0.kafka-svc.kafka.svc.cluster.local:9092,kafka-1.kafka-svc.kafka.svc.cluster.local:9092,kafka-2.kafka-svc.kafka.svc.cluster.local:9092"]
    client_id => "eventrouter-prod"
    #auto_offset_reset => "latest"
    group_id => "eventrouter"
    consumer_threads => 2
    #decorate_events => true
    id => "eventrouter"
    topics => ["eventrouter"]
  }
}

filter {
  if [message] =~ /DNSConfigForming/ {
    drop { }
  }
  json {
    source => "message"
  }
  mutate {
    remove_field => [ "message", "old_event" ]
  }
}

output {
  elasticsearch {
    hosts => "10.4.9.28:9200"
    index => "eventrouter-%{+YYYY-MM-dd}"
  }
}

Step 4: Elasticsearch

Elasticsearch and Kibana run via docker-compose (docker-compose.yml):

version: '2'
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.5.1
    container_name: elasticsearch
    environment:
      - cluster.name=docker-cluster
      - bootstrap.memory_lock=true
      - "ES_JAVA_OPTS=-Xms4096m -Xmx4096m"

    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - /data/es1:/usr/share/elasticsearch/data
      - /data/backups:/usr/share/elasticsearch/backups
      - /data/longterm_backups:/usr/share/elasticsearch/longterm_backups
      - ./config/jvm.options:/usr/share/elasticsearch/config/jvm.options
    ports:
      - "9200:9200"
    networks:
      - esnet
#  elasticsearch2:
#    image: docker.elastic.co/elasticsearch/elasticsearch:6.5.1
#    container_name: elasticsearch2
#    environment:
#      - cluster.name=docker-cluster
#      - bootstrap.memory_lock=true
#      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
#      - "discovery.zen.ping.unicast.hosts=elasticsearch"
#    ulimits:
#      memlock:
#        soft: -1
#        hard: -1
#    volumes:
#      - /data/es2:/usr/share/elasticsearch/data
#    networks:
#      - esnet
  kibana:
    image: docker.elastic.co/kibana/kibana:6.5.1
    container_name: kibana
    environment:
      SERVER_NAME: kibana
      SERVER_HOST: "0.0.0.0"
      ELASTICSEARCH_URL: http://elasticsearch:9200
      XPACK_MONITORING_UI_CONTAINER_ELASTICSEARCH_ENABLED: "true"
    volumes:
      - /data/plugin:/usr/share/kibana/plugin
      - /tmp/:/etc/archives
    ports:
      - "5601:5601"
    networks:
      - esnet
    depends_on:
      - elasticsearch
networks:
 esnet:
   driver: bridge

cat config/jvm.options

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms2g
-Xmx2g

################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################

## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
# 10-:-XX:-UseConcMarkSweepGC
# 10-:-XX:-UseCMSInitiatingOccupancyOnly
# 10-:-XX:+UseG1GC
# 10-:-XX:InitiatingHeapOccupancyPercent=75

## optimizations

# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch

## basic

# explicitly set the stack size
-Xss1m

# set to headless, just in case
-Djava.awt.headless=true

# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8

# use our provided JNA always versus the system one
-Djna.nosys=true

# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow

# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0

# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true

-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError

# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=data

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=logs/hs_err_pid%p.log

## JDK 8 GC logging

8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:logs/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch needs to set the provider to COMPAT, otherwise
# time/date parsing will break in an incompatible way for some date patterns and locales
9-:-Djava.locale.providers=COMPAT

# temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
10-:-XX:UseAVX=2

Step 5: elastalert

For deployment, see https://github.com/Yelp/elastalert.git

Usage:

mkdir /etc/elastalert

Copy config.yaml.example from the cloned elastalert directory into the directory created above:

cp elastalert/config.yaml.example /etc/elastalert/config.yaml

You only need to change rules_folder, es_host, and es_port (plus the username/password settings if authentication is enabled).

Create the rules directory:

mkdir /etc/elastalert/rules

Step 6: DingTalk

To create the robot and obtain its access token, see my other blog posts. Then download the DingTalk plugin: https://github.com/xuyaoqiang/elastalert-dingtalk-plugin

Copy elastalert_modules into /etc/elastalert:

cp -r elastalert-dingtalk-plugin/elastalert_modules /etc/elastalert/
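Under the hood, the DingTalk alerter simply POSTs a JSON body to the webhook URL. With dingtalk_msgtype: "text", the body looks roughly like this (a sketch based on DingTalk's text-message webhook format; the content string is illustrative):

```python
import json

def dingtalk_text_payload(content: str) -> str:
    # Shape of a DingTalk "text" robot message; the alerter POSTs this
    # body to dingtalk_webhook with Content-Type: application/json.
    return json.dumps({
        "msgtype": "text",
        "text": {"content": content},
    })

body = dingtalk_text_payload("====elastalert message====\nEventTime>>: ...")
print(body)
```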

Rule example:

# Alert when the rate of events exceeds a threshold

# (Optional)
# Elasticsearch host
es_host: 10.2.9.28

# (Optional)
# Elasticsearch port
es_port: 9200

# (Optional) Connect with SSL to Elasticsearch
#use_ssl: True

# (Optional) basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword

# (Required)
# Rule name, must be unique
name: Other event frequency rule

# (Required)
# Type of alert.
# the frequency rule type alerts when num_events events occur with timeframe time
type: frequency

# (Required)
# Index to search, wildcard supported
index: eventrouter-*

# (Required, frequency specific)
# Alert when this many documents matching the query occur within a timeframe
num_events: 5

# (Required, frequency specific)
# num_events must occur within this amount of time to trigger an alert
timeframe:
  #hours: 4
  minutes: 15
# (Required)
# A list of Elasticsearch filters used to find events
# These filters are joined with AND and nested in a filtered query
# For more info: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
#- term:
#    some_field: "some_value"
- query:
    query_string:
      query: "event.type: Warning NOT event.involvedObject.kind: Node"
# (Required)
# The alert is used when a match is found

#smtp_host: smtp.exmail.qq.com
#smtp_port: 25
#smtp_auth_file: /etc/elastalert/smtp_auth_file.yaml
#email_reply_to: ci@qq.com
#from_addr: ci@qq.com
realert:
  minutes: 5
exponential_realert:
  hours: 1

alert:
#- "email"
- "elastalert_modules.dingtalk_alert.DingTalkAlerter"
dingtalk_webhook: "https://oapi.dingtalk.com/robot/send?access_token=47194e6904c6e3133a9080980984444c8e5d7745e1f76c12cefa99c8c8ac718dd88d4c"
dingtalk_msgtype: "text"

alert_text_type: alert_text_only
alert_text: "
   ====elastalert message====\n

   EventTime>>:  {0}\n

   Event_involvedObject_name>>:  {1}\n

   Event_involvedObject_kind>>:  {2}\n

   Event_involvedObject_namespace>>:  {3}\n

   Message>>:  {4}\n

   Event_reason>>: {5}\n

   verb>>: {6}
"

alert_text_args:
- "@timestamp"
- event.involvedObject.name
- event.involvedObject.kind
- event.involvedObject.namespace
- event.message
- event.reason
- verb
# (required, email specific)
# a list of email addresses to send alerts to
#email:
#- "ci@qq.com"
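To make the rule concrete: type: frequency fires once num_events (5) matching documents land within timeframe (15 minutes). A simplified sketch of that counting logic (timestamps are hypothetical; real elastalert also handles realert silencing and query windows):

```python
from datetime import datetime, timedelta

def frequency_triggered(timestamps, num_events=5,
                        timeframe=timedelta(minutes=15)):
    """Return True if any window of `timeframe` contains at least
    `num_events` matching events (simplified frequency-rule logic)."""
    ts = sorted(timestamps)
    for i in range(len(ts) - num_events + 1):
        if ts[i + num_events - 1] - ts[i] <= timeframe:
            return True
    return False

base = datetime(2019, 3, 22, 12, 0)
burst = [base + timedelta(minutes=m) for m in (0, 2, 4, 6, 8)]   # 5 events in 8 min
sparse = [base + timedelta(minutes=20 * m) for m in range(5)]    # 5 events in 80 min
print(frequency_triggered(burst), frequency_triggered(sparse))   # True False
```

The realert: 5 minutes setting then silences repeat alerts for the same rule, and exponential_realert grows that silence period up to one hour if matches keep arriving.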

The alert, alert_text, and alert_text_args sections above are the alert message format I customized for DingTalk.

For more detail, see the official docs: https://elastalert.readthedocs.io/en/latest/recipes/writing_filters.html#writingfilters

Original post: https://www.cnblogs.com/cuishuai/p/10573586.html
