OpenShift Container Cloud, from Getting Started to Breakdown, Part 9: Container Monitoring

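Container status monitoring

This part mainly monitors pod status, such as restarts and unhealthy pods. The Kubernetes API already reports these states, so they only need to be read out and combined with Zabbix for alerting.

Import the Zabbix template below and link it to the oc master host.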

<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
    <version>3.2</version>
    <date>2019-02-27T07:33:05Z</date>
    <groups>
        <group>
            <name>Templates</name>
        </group>
    </groups>
    <templates>
        <template>
            <template>OC Pods</template>
            <name>OC Pods</name>
            <description/>
            <groups>
                <group>
                    <name>Templates</name>
                </group>
            </groups>
            <applications>
                <application>
                    <name>restartCount</name>
                </application>
                <application>
                    <name>RunningStatus</name>
                </application>
            </applications>
            <items/>
            <discovery_rules>
                <discovery_rule>
                    <name>OC Pods Discover</name>
                    <type>0</type>
                    <snmp_community/>
                    <snmp_oid/>
                    <key>oc.pod.status[discover,discover]</key>
                    <delay>300</delay>
                    <status>0</status>
                    <allowed_hosts/>
                    <snmpv3_contextname/>
                    <snmpv3_securityname/>
                    <snmpv3_securitylevel>0</snmpv3_securitylevel>
                    <snmpv3_authprotocol>0</snmpv3_authprotocol>
                    <snmpv3_authpassphrase/>
                    <snmpv3_privprotocol>0</snmpv3_privprotocol>
                    <snmpv3_privpassphrase/>
                    <delay_flex/>
                    <params/>
                    <ipmi_sensor/>
                    <authtype>0</authtype>
                    <username/>
                    <password/>
                    <publickey/>
                    <privatekey/>
                    <port/>
                    <filter>
                        <evaltype>0</evaltype>
                        <formula/>
                        <conditions/>
                    </filter>
                    <lifetime>7</lifetime>
                    <description/>
                    <item_prototypes>
                        <item_prototype>
                            <name>Pod {#POD_NAME} Get Status</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.pod.status[{#POD_NAME},get_status]</key>
                            <delay>300</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>4</value_type>
                            <allowed_hosts/>
                            <units/>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>RunningStatus</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Pod {#POD_NAME} Restarts</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.pod.status[{#POD_NAME},restarts]</key>
                            <delay>300</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>4</value_type>
                            <allowed_hosts/>
                            <units/>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>restartCount</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Pod {#POD_NAME} Running</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.pod.status[{#POD_NAME},running]</key>
                            <delay>300</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>4</value_type>
                            <allowed_hosts/>
                            <units/>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>RunningStatus</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                    </item_prototypes>
                    <trigger_prototypes>
                        <trigger_prototype>
                            <expression>{OC Pods:oc.pod.status[{#POD_NAME},running].str(Running_true)}=0
and
{OC Pods:oc.pod.status[{#POD_NAME},running].str(Pod deleted)}=0</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Pod {#POD_NAME} Not Running</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>0</status>
                            <priority>1</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                        <trigger_prototype>
                            <expression>{OC Pods:oc.pod.status[{#POD_NAME},restarts].str(Warning)}=1</expression>
                            <recovery_mode>1</recovery_mode>
                            <recovery_expression>{OC Pods:oc.pod.status[{#POD_NAME},restarts].str(Warning,#3)}=0</recovery_expression>
                            <name>Pod {#POD_NAME} restarted Warning</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>0</status>
                            <priority>1</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                    </trigger_prototypes>
                    <graph_prototypes/>
                    <host_prototypes/>
                </discovery_rule>
            </discovery_rules>
            <httptests/>
            <macros/>
            <templates/>
            <screens/>
        </template>
    </templates>
</zabbix_export>
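
The "Pod {#POD_NAME} Not Running" trigger prototype in the template above goes into a problem state only when the latest value of the `running` item contains neither `Running_true` nor `Pod deleted`. As a rough illustration of that substring logic, here is a minimal shell sketch. The function name `not_running_fires` and the sample values are assumptions made for this example; they are not part of the template or of Zabbix.

```shell
#!/bin/sh
# Hypothetical helper (not part of the template): mirrors the
# "Pod {#POD_NAME} Not Running" trigger prototype for one item value.
# Succeeds (exit status 0) when the trigger would report a problem.
not_running_fires() {
  case "$1" in
    *Running_true*) return 1 ;;   # str(Running_true) finds a match, no problem
    *"Pod deleted"*) return 1 ;;  # str(Pod deleted) finds a match, no problem
    *) return 0 ;;                # neither substring found, trigger fires
  esac
}

# Sample values for illustration only
for value in "Running_true" "Running_false" "Pending" "Pod deleted"; do
  if not_running_fires "$value"; then
    echo "$value -> PROBLEM"
  else
    echo "$value -> OK"
  fi
done
```

Under these assumptions only `Running_false` and `Pending` are reported as problems, which is why a deleted pod does not keep alerting until the discovery rule removes its items.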

Zabbix agent configuration

Edit zabbix_agentd.conf:

Timeout=30
UserParameter=oc.pod.status[*],/data/app/zabbix/etc/oc_pod_monitor.sh $1 $2
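
With `oc.pod.status[*]`, the agent substitutes the first and second key parameters for `$1` and `$2`, so an item key such as `oc.pod.status[myproject=web-1-abcde,running]` runs the script with `myproject=web-1-abcde` and `running` as arguments. The sketch below shows one way to take such a `<namespace>=<pod name>` value apart. The helper `split_pod_key` and the sample value are made up for illustration; the script in this post extracts only the pod name (with `sed`) and looks the namespace up separately.

```shell
#!/bin/sh
# Hypothetical helper, not part of oc_pod_monitor.sh: split a
# "<namespace>=<pod name>" parameter with POSIX parameter expansion.
split_pod_key() {
  key="$1"
  namespace="${key%%=*}"   # text before the first "="
  pod="${key##*=}"         # text after the last "="
  echo "namespace=$namespace pod=$pod"
}

# Made-up sample value of {#POD_NAME}; real values come from discovery.
split_pod_key "myproject=web-1-abcde"
```

A value without `=` (such as `discover`) passes through unchanged in both fields, which matches how the `sed` expression in the script behaves for the discovery call.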

Contents of oc_pod_monitor.sh:

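#!/bin/bash
TOKEN=""      # bearer token sent to the API
ENDPOINT=""   # API server address, used as https://$ENDPOINT/...
# $1 is "discover" or "<namespace>=<pod name>"; keep the part after the last "="
POD_NAME="$(echo "$1" | sed 's/.*=\(.*$\)/\1/')"
Monitoring_type="$2"
WORKSPACE="/data/tmp/oc_monitor"
mkdir -p "$WORKSPACE"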

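# Look up the pod's namespace by exact pod name in the cached pod list
NAMESPACE="$(jq -r --arg name "$POD_NAME" '.items[] | select(.metadata.name == $name) | .metadata.namespace' "$WORKSPACE/all_pods.json" 2>/dev/null | head -n 1)"

# Check whether the pod still exists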
if [ "$POD_NAME" == "discover" ]; then
  echo
elif [ ! -n "$NAMESPACE" ]; then
  echo "Pod deleted"
  exit 0
fi
## Low-level discovery
case $Monitoring_type in
   discover)
     # Fetch all pods and cache the response; only the pod names are used below
     curl -k -H "Authorization: Bearer $TOKEN" -H 'Accept: application/json' "https://$ENDPOINT/api/v1/pods" 2>/dev/null > "$WORKSPACE/all_pods.json"

     Pod_Name=(`jq -r '.items | .[] | .metadata | .name' "$WORKSPACE/all_pods.json" |egrep -v 'build|deploy|debug'`)
     # Print the result as JSON
     printf "{\n"
     printf '\t"data":[\n'
     num=$((${#Pod_Name[@]}-1))
     for ((i=0;i<${#Pod_Name[@]};i++))
     do
        # Exact-match lookup of the namespace for this pod name
        NAMESPACE="$(jq -r --arg name "${Pod_Name[i]}" '.items[] | select(.metadata.name == $name) | .metadata.namespace' "$WORKSPACE/all_pods.json" | head -n 1)"
        Pod_Name_N="${NAMESPACE}=${Pod_Name[i]}"
        printf '\t\t{\n'
        if [ "$i" == "${num}" ];
        then
                printf "\t\t\t\"{#POD_NAME}\":\"${Pod_Name_N}\"}\n"
        else
                printf "\t\t\t\"{#POD_NAME}\":\"${Pod_Name_N}\"},\n"
        fi
     done
     printf "\t]\n"
     printf "}\n"
     exit 0
  ;;

   get_status) # Fetch the pod status once and cache it for all other items
     curl -k -H "Authorization: Bearer $TOKEN" -H 'Accept: application/json' "https://${ENDPOINT}/api/v1/namespaces/$NAMESPACE/pods/$POD_NAME/status" 2>/dev/null > "$WORKSPACE/${NAMESPACE}-${POD_NAME}.status"
     Pod_NotFound="$(grep '"code": 404' "$WORKSPACE/${NAMESPACE}-${POD_NAME}.status")"
     if [ -n "$Pod_NotFound" ]; then
       echo "Pod_Status=NotFound"
       exit 0
     else
       echo "Success"
       exit 0
     fi
   ;;
esac

# Load the cached pod status data
if [ -f "$WORKSPACE/${NAMESPACE}-${POD_NAME}.status" ];then
   Pod_Status="`cat $WORKSPACE/${NAMESPACE}-${POD_NAME}.status`"
else
   echo "" > $WORKSPACE/${NAMESPACE}-${POD_NAME}.status
   Pod_Status="`cat $WORKSPACE/${NAMESPACE}-${POD_NAME}.status`"
fi

# Handle abnormal Pod_Status values
if [ ! -n "$Pod_Status" ]; then  # Pod_Status is empty
   echo "Running_true Pod_Status=Null"
   exit 0
elif [ -n "`echo "$Pod_Status" |grep '"code": 404'`" ]; then  # Pod no longer exists but all_pods.json has not been refreshed yet
   echo "Pod_Status=NotFound"
   exit 0
elif [ "`echo "$Pod_Status" |jq -r '.status |.phase'`" = "Pending" ]; then  # Pod is still in the Pending phase
   echo "Pending"
   exit 0
fi

# Select the value to report
case $Monitoring_type in
   restarts) # Report whether the pod has restarted since the last check
     COUNT_FILE="$WORKSPACE/${NAMESPACE}-${POD_NAME}.restartCount"
     # A pod without a counter file yet is treated as new
     if [ ! -f "$COUNT_FILE" ]; then
       echo "Warning New Pod"
       echo "0" > "$COUNT_FILE"
       exit 0
     fi

     ## Previous value: sum of the per-container restart counts saved last time
     Last_state=$(awk '{s+=$1} END {print s+0}' "$COUNT_FILE")

     ## Current value: save one restartCount per container, then sum them
     echo "$Pod_Status" |jq -r '.status |.containerStatuses |.[] |.restartCount' > "$COUNT_FILE"
     Current_state=$(awk '{s+=$1} END {print s+0}' "$COUNT_FILE")

     # Compare the current restartCount total with the previous one
     if [ "$Current_state" -gt "$Last_state" ]; then
       Restart_status="Warning restart_count=$Current_state"
     else
       Restart_status="Normal restart_count=$Current_state"
     fi
     echo "$Restart_status"
  ;;

   running) # Report the pod phase and container readiness as a single string

     # Read the pod phase and the containers' ready flags
     running_status=`echo "$Pod_Status" |jq -r '.status |.phase'`
     Container_status="`echo "$Pod_Status" |jq -r '.status |.containerStatuses |.[] |.ready' |grep false`"
     if [ ! -n "$Container_status" ]; then
        Container_status="_true"
     else
        Container_status="_false"
     fi
     echo "${running_status}${Container_status}"
  ;;

   *)
     echo "Error parameters"
     exit 0
  ;;

esac
exit 0

With this in place, both pod restarts and newly created pods raise an alert.
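
The restart check boils down to comparing the total of the per-container `restartCount` values with the total saved on the previous run. A minimal sketch of just that comparison follows, using POSIX shell only. The function names `sum_counts` and `restart_state` and the sample numbers are assumptions for illustration; in the real script the counts come from the pod status JSON via `jq`.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the restart comparison only.

# Add up the restart counts passed as arguments (one per container).
sum_counts() {
  total=0
  for n in "$@"; do
    total=$((total + n))
  done
  echo "$total"
}

# $1 = total saved on the previous run, $2 = total seen now.
# The output contains "Warning" only when the total has grown,
# which is the string the restart trigger prototype looks for.
restart_state() {
  if [ "$2" -gt "$1" ]; then
    echo "Warning restart_count=$2"
  else
    echo "Normal restart_count=$2"
  fi
}

# Sample numbers: two containers, one of which restarted once more.
previous=$(sum_counts 1 2)
current=$(sum_counts 1 3)
restart_state "$previous" "$current"
```

Because only growth of the total is flagged, a value that stays the same on later polls reports `Normal`; the trigger's recovery expression then clears the alert once the last three values no longer contain `Warning`.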

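Cluster node monitoring

This part mainly monitors unhealthy node conditions, plus LVM volume capacity.

Import the Zabbix template below and link it to the oc master host.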

<?xml version="1.0" encoding="UTF-8"?>
<zabbix_export>
    <version>3.2</version>
    <date>2019-02-27T07:47:32Z</date>
    <groups>
        <group>
            <name>Templates</name>
        </group>
    </groups>
    <templates>
        <template>
            <template>OC Node Status</template>
            <name>OC Node Status</name>
            <description/>
            <groups>
                <group>
                    <name>Templates</name>
                </group>
            </groups>
            <applications>
                <application>
                    <name>oc_node</name>
                </application>
            </applications>
            <items/>
            <discovery_rules>
                <discovery_rule>
                    <name>OC Nodes Discover</name>
                    <type>0</type>
                    <snmp_community/>
                    <snmp_oid/>
                    <key>oc.node.status[discover,discover]</key>
                    <delay>60</delay>
                    <status>0</status>
                    <allowed_hosts/>
                    <snmpv3_contextname/>
                    <snmpv3_securityname/>
                    <snmpv3_securitylevel>0</snmpv3_securitylevel>
                    <snmpv3_authprotocol>0</snmpv3_authprotocol>
                    <snmpv3_authpassphrase/>
                    <snmpv3_privprotocol>0</snmpv3_privprotocol>
                    <snmpv3_privpassphrase/>
                    <delay_flex/>
                    <params/>
                    <ipmi_sensor/>
                    <authtype>0</authtype>
                    <username/>
                    <password/>
                    <publickey/>
                    <privatekey/>
                    <port/>
                    <filter>
                        <evaltype>0</evaltype>
                        <formula/>
                        <conditions/>
                    </filter>
                    <lifetime>7</lifetime>
                    <description/>
                    <item_prototypes>
                        <item_prototype>
                            <name>Node {#NODE_NAME}  DiskPressure</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},DiskPressure]</key>
                            <delay>30</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>1</status>
                            <value_type>4</value_type>
                            <allowed_hosts/>
                            <units/>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>oc_node</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Node {#NODE_NAME} Get Status</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},get_status]</key>
                            <delay>30</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>4</value_type>
                            <allowed_hosts/>
                            <units/>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications/>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Node {#NODE_NAME}  MemoryPressure</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},MemoryPressure]</key>
                            <delay>30</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>1</status>
                            <value_type>4</value_type>
                            <allowed_hosts/>
                            <units/>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>oc_node</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Node {#NODE_NAME}  Ready</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},node_ready]</key>
                            <delay>30</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>4</value_type>
                            <allowed_hosts/>
                            <units/>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>oc_node</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Node {#NODE_NAME} CPU Limits</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},node_resources,cpu_limits]</key>
                            <delay>120</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>3</value_type>
                            <allowed_hosts/>
                            <units>%</units>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>oc_node</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Node {#NODE_NAME} CPU Requests</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},node_resources,cpu_requests]</key>
                            <delay>120</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>3</value_type>
                            <allowed_hosts/>
                            <units>%</units>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>oc_node</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Node {#NODE_NAME} Memory Limits</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},node_resources,memory_limits]</key>
                            <delay>120</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>3</value_type>
                            <allowed_hosts/>
                            <units>%</units>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>oc_node</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Node {#NODE_NAME} Memory Requests</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},node_resources,memory_requests]</key>
                            <delay>120</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>0</status>
                            <value_type>3</value_type>
                            <allowed_hosts/>
                            <units>%</units>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>oc_node</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                        <item_prototype>
                            <name>Node {#NODE_NAME}  OutOfDisk</name>
                            <type>0</type>
                            <snmp_community/>
                            <multiplier>0</multiplier>
                            <snmp_oid/>
                            <key>oc.node.status[{#NODE_NAME},OutOfDisk]</key>
                            <delay>30</delay>
                            <history>7</history>
                            <trends>0</trends>
                            <status>1</status>
                            <value_type>4</value_type>
                            <allowed_hosts/>
                            <units/>
                            <delta>0</delta>
                            <snmpv3_contextname/>
                            <snmpv3_securityname/>
                            <snmpv3_securitylevel>0</snmpv3_securitylevel>
                            <snmpv3_authprotocol>0</snmpv3_authprotocol>
                            <snmpv3_authpassphrase/>
                            <snmpv3_privprotocol>0</snmpv3_privprotocol>
                            <snmpv3_privpassphrase/>
                            <formula>1</formula>
                            <delay_flex/>
                            <params/>
                            <ipmi_sensor/>
                            <data_type>0</data_type>
                            <authtype>0</authtype>
                            <username/>
                            <password/>
                            <publickey/>
                            <privatekey/>
                            <port/>
                            <description/>
                            <inventory_link>0</inventory_link>
                            <applications>
                                <application>
                                    <name>oc_node</name>
                                </application>
                            </applications>
                            <valuemap/>
                            <logtimefmt/>
                            <application_prototypes/>
                        </item_prototype>
                    </item_prototypes>
                    <trigger_prototypes>
                        <trigger_prototype>
                            <expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_resources,cpu_limits].last()}&gt;150</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Node {#NODE_NAME} CPU Limits 150%</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>0</status>
                            <priority>1</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                        <trigger_prototype>
                            <expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_resources,cpu_requests].last()}&gt;100</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Node {#NODE_NAME} CPU Requests 100%</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>0</status>
                            <priority>2</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                        <trigger_prototype>
                            <expression>{OC Node Status:oc.node.status[{#NODE_NAME},DiskPressure].str(DiskPressure_False)}=0</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Node {#NODE_NAME} DiskPressure</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>1</status>
                            <priority>5</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                        <trigger_prototype>
                            <expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_resources,memory_limits].last()}&gt;150</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Node {#NODE_NAME} Memory Limits 150%</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>0</status>
                            <priority>1</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                        <trigger_prototype>
                            <expression>{OC Node Status:oc.node.status[{#NODE_NAME},MemoryPressure].str(MemoryPressure_False)}=0</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Node {#NODE_NAME} MemoryPressure</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>1</status>
                            <priority>5</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                        <trigger_prototype>
                            <expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_resources,memory_requests].last()}&gt;95</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Node {#NODE_NAME} Memory Requests 95%</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>0</status>
                            <priority>2</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                        <trigger_prototype>
                            <expression>{OC Node Status:oc.node.status[{#NODE_NAME},node_ready].str(Ready_True)}=0</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Node {#NODE_NAME} Not Ready</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>0</status>
                            <priority>5</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                        <trigger_prototype>
                            <expression>{OC Node Status:oc.node.status[{#NODE_NAME},OutOfDisk].str(OutOfDisk_False)}=0</expression>
                            <recovery_mode>0</recovery_mode>
                            <recovery_expression/>
                            <name>Node {#NODE_NAME} OutOfDisk</name>
                            <correlation_mode>0</correlation_mode>
                            <correlation_tag/>
                            <url/>
                            <status>1</status>
                            <priority>5</priority>
                            <description/>
                            <type>0</type>
                            <manual_close>1</manual_close>
                            <dependencies/>
                            <tags/>
                        </trigger_prototype>
                    </trigger_prototypes>
                    <graph_prototypes/>
                    <host_prototypes/>
                </discovery_rule>
            </discovery_rules>
            <httptests/>
            <macros/>
            <templates/>
            <screens/>
        </template>
    </templates>
</zabbix_export>

Zabbix agent configuration

Edit zabbix_agentd.conf:

Timeout=30
UserParameter=oc.node.status[*],/data/app/zabbix/etc/oc_node_monitor.sh $1 $2 $3

Contents of oc_node_monitor.sh:

#!/bin/bash
# Fill in the API token and the master endpoint (host:port) before use.
TOKEN=""
ENDPOINT=""
NODE_NAME="$1"
Monitoring_type="$2"
WORKSPACE="/data/tmp/oc_monitor"
mkdir -p "$WORKSPACE"

case $Monitoring_type in
   discover) # low-level discovery of the nodes
     Node_Name=($(curl -k -H "Authorization: Bearer $TOKEN" -H 'Accept: application/json' "https://$ENDPOINT/api/v1/nodes" 2>/dev/null | jq -r '.items|.[]|.metadata|.name'))

     printf "{\n"
     printf '\t"data":[\n'
     num=$((${#Node_Name[@]}-1))
     for ((i=0;i<${#Node_Name[@]};i++))
     do
        printf '\t\t{\n'
        # the last entry must not be followed by a comma
        if [ "$i" == "${num}" ];
        then
                printf "\t\t\t\"{#NODE_NAME}\":\"${Node_Name[$i]}\"}\n"
        else
                printf "\t\t\t\"{#NODE_NAME}\":\"${Node_Name[$i]}\"},\n"
        fi
     done
     printf "\t]\n"
     printf "}\n"
     exit 0
;;
   get_status) # fetch the node object once and cache it for all the other items
     curl -k -H "Authorization: Bearer $TOKEN" -H 'Accept: application/json' "https://${ENDPOINT}/api/v1/nodes/$NODE_NAME" 2>/dev/null > "$WORKSPACE/${NODE_NAME}.status"
     if grep -q '"code": 404' "$WORKSPACE/${NODE_NAME}.status"; then
       echo "Node_Status=NotFound"
       exit 0
     elif [ ! -s "$WORKSPACE/${NODE_NAME}.status" ]; then
       echo "Node_Status=null"
       exit 0
     else
       echo "Success"
       exit 0
     fi
   ;;
esac

# The checks below read the cached file and pick a condition by its position
# in .status.conditions, so they assume the API returns the conditions in the
# order OutOfDisk, MemoryPressure, DiskPressure, Ready.
# When no value can be read, the healthy value is reported so the trigger stays quiet.
case $Monitoring_type in
   OutOfDisk) # is the node out of disk space?
     Node_Status="$(jq -r '.status|.conditions|.[]|.status' "$WORKSPACE/${NODE_NAME}.status" | sed -n 1p)"
     if [ "$Node_Status" == "False" ]; then
       echo "OutOfDisk_False"
     elif [ -z "$Node_Status" ]; then
       echo "OutOfDisk_False"
     else
       echo "OutOfDisk_$Node_Status"
     fi
  ;;

   MemoryPressure) # is the node under memory pressure?
     Node_Status="$(jq -r '.status|.conditions|.[]|.status' "$WORKSPACE/${NODE_NAME}.status" | sed -n 2p)"
     if [ "$Node_Status" == "False" ]; then
       echo "MemoryPressure_False"
     elif [ -z "$Node_Status" ]; then
       echo "MemoryPressure_False"
     else
       echo "MemoryPressure_$Node_Status"
     fi
  ;;

   DiskPressure) # is the node under disk pressure?
     Node_Status="$(jq -r '.status|.conditions|.[]|.status' "$WORKSPACE/${NODE_NAME}.status" | sed -n 3p)"
     if [ "$Node_Status" == "False" ]; then
       echo "DiskPressure_False"
     elif [ -z "$Node_Status" ]; then
       echo "DiskPressure_False"
     else
       echo "DiskPressure_$Node_Status"
     fi
  ;;

   node_ready) # is the node Ready?
     Node_Status="$(jq -r '.status|.conditions|.[]|.status' "$WORKSPACE/${NODE_NAME}.status" | sed -n 4p)"
     if [ "$Node_Status" == "True" ]; then
       echo "Ready_True"
     elif [ -z "$Node_Status" ]; then
       echo "Ready_True"
     else
       echo "Ready_$Node_Status"
     fi
  ;;

   node_resources) # resource allocation of the node, in percent
     # The .resources file is written by the cron job; wait briefly if it is empty (for example mid-rewrite).
     if [ ! -s "$WORKSPACE/${NODE_NAME}.resources" ]; then
        sleep 1
     fi
     # Map the requested metric to its column in the cached line.
     case "$3" in
        cpu_requests)    field=2 ;;
        cpu_limits)      field=4 ;;
        memory_requests) field=6 ;;
        memory_limits)   field=8 ;;
        *) exit 0 ;;
     esac
     data="$(awk -v f="$field" '{print $f}' "$WORKSPACE/${NODE_NAME}.resources" 2>/dev/null | grep -oE '[0-9]+' | head -n 1)"
     # Default to 0 so that an empty value does not break the integer test.
     if [ "${data:-0}" -gt 0 ]; then
       echo "$data"
     else
       echo 0
     fi
  ;;
esac
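
The condition checks above pick rows by position (`sed -n 1p` to `4p`), which only works while the API returns the conditions in the same order. A minimal sketch of a lookup by condition type instead, assuming `jq` is installed; the helper name `condition_status` and the sample JSON are hypothetical and not part of the original script:

```shell
#!/bin/bash
# Hypothetical helper: print the status of one node condition, selected by type.
# $1 = path to a saved node JSON file, $2 = condition type (e.g. Ready)
condition_status() {
  jq -r --arg t "$2" '.status.conditions[]? | select(.type == $t) | .status' "$1"
}

# Made-up sample standing in for $WORKSPACE/<node>.status
sample="$(mktemp)"
cat > "$sample" <<'EOF'
{"status":{"conditions":[
  {"type":"MemoryPressure","status":"False"},
  {"type":"Ready","status":"True"}
]}}
EOF

echo "Ready_$(condition_status "$sample" Ready)"
echo "MemoryPressure_$(condition_status "$sample" MemoryPressure)"
rm -f "$sample"
```

With this sample the two lines printed are `Ready_True` and `MemoryPressure_False`, the same value format the trigger expressions match with `str()`. A condition type that is absent from the response yields an empty string, which the calling code would still have to map to a default.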

crontab -e

*/2 * * * * /data/scripts/oc_master_crontab.sh >/dev/null 2>&1

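Contents of oc_master_crontab.sh:

#!/bin/bash
# For every node, save the line that precedes "Events" in `oc describe node`
# (the allocated-resources percentages) for oc_node_monitor.sh to read.
mkdir -p /data/tmp/oc_monitor
node_name=($(oc get node | grep -v "NAME" | awk '{print $1}'))
for ((i=0;i<${#node_name[*]};i++))
do
oc describe node "${node_name[i]}" | grep -B 1 "Events" | grep -v "Events" > "/data/tmp/oc_monitor/${node_name[i]}.resources"
chmod -R 777 /data/tmp/
done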
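
To make the link between the cached line and fields 2, 4, 6 and 8 in oc_node_monitor.sh concrete, here is a small sketch. The sample line is an assumption about what the last row of "Allocated resources" looks like (the exact layout depends on the oc version), and `resource_percent` is a hypothetical helper, not part of the original scripts:

```shell
#!/bin/bash
# Hypothetical helper: extract one percentage from a cached resources line.
# $1 = the line, $2 = awk field number (2, 4, 6 or 8)
resource_percent() {
  local value
  value=$(printf '%s\n' "$1" | awk -v f="$2" '{print $f}' | grep -oE '[0-9]+' | head -n 1)
  # Fall back to 0 when the field is missing or holds no digits
  echo "${value:-0}"
}

# Assumed sample: CPU requests, CPU limits, memory requests, memory limits
line='  1170m (29%)  2300m (57%)  3474Mi (22%)  5522Mi (35%)'
resource_percent "$line" 2   # CPU requests
resource_percent "$line" 8   # memory limits
```

For this sample the two calls print 29 and 35. If the layout of `oc describe node` differs on your version, the field numbers need to be adjusted in both places.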

Original article: https://www.cnblogs.com/37yan/p/10444009.html

Time: 2024-08-04 12:05:13
