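I. Concepts:
cacti (monitoring tool; collects data and graphs it. Readings such as a CPU load of 0.8 or 1.2 are concrete values that are aggregated and then plotted; alerting is provided by the thold plugin)
nagios:
monitoring tool;
monitored objects (hosts, services|resources, contacts, time periods, commands)
Nagios reduces every check to one of four states (OK, WARNING, CRITICAL, UNKNOWN) and acts only on that state, not on the measured value. For example, for CPU utilization you might define 90% as CRITICAL and 80% as WARNING; anything lower is OK, and if the value cannot be obtained the state is UNKNOWN. Whatever the object, only these four states matter, so the administrator can focus on whether an object is healthy rather than on its current reading. On top of this Nagios provides a very capable notification system, whereas Cacti relies on the thold plugin, whose alerting is far more limited.
Cacti and Nagios have different goals: Cacti collects data, graphs it and shows trends; Nagios evaluates check results, reduces them to one of the four states, and notifies administrators when the state becomes a problem. Nagios is widely deployed and has become a de facto industry standard. It is highly plugin-driven: Nagios Core performs no checks itself; it schedules checks and processes their results, so it is best seen as the platform on which monitoring runs. All actual checks are done by plugins, which may come from the official plugin package or be written by users. Each plugin run reports one of the four states; Nagios Core reads that state and decides what to do next. This design makes Nagios very flexible to operate and configure (and the more flexible, the more complex).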
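To make the four-state idea concrete, here is a minimal sketch of a plugin-style check. The function name check_value and the percentage wording are hypothetical; only the exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN) is the standard plugin contract.

```shell
#!/bin/sh
# Hypothetical plugin sketch: map a measured value to a Nagios state.
# Exit codes follow the plugin convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
check_value() {
    value=$1
    warn=$2
    crit=$3
    # A missing or non-numeric reading cannot be judged, so report UNKNOWN.
    case $value in
        ''|*[!0-9]*) echo "UNKNOWN - no usable reading"; return 3 ;;
    esac
    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is ${value}%"; return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is ${value}%"; return 1
    fi
    echo "OK - value is ${value}%"; return 0
}

# Example with the thresholds from the text (WARNING at 80%, CRITICAL at 90%).
check_value 85 80 90 || echo "exit status: $?"
```

Nagios Core would look only at the exit status; the printed text is shown to the administrator alongside it.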
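Nagios does its work through several kinds of objects:
hosts -- host groups (a host is an object, and so is a host group)
services|resources -- service groups (services and resources are both simply called services)
contacts -- contact groups (a key job of Nagios is to alert when something breaks, so it must know whom to reach: which person or which group receives the notification)
time periods, timeperiod (define when a host or service is checked and when a contact may be notified. For example, if policy says a server must be healthy during the day, a daytime failure must trigger a notification, while a failure at night may not matter and need not notify anyone)
commands, command (a very important object. Nagios checks hosts and services through plugins, which are essentially scripts or programs. Different targets need different checks: Linux and Windows hosts are checked differently, and so are httpd and nginx even though both are web services. Even the same plugin may take different arguments for different targets: on one host 500 concurrent users may be OK, 1000 WARNING and 1500 CRITICAL, while on a weaker host 100 is OK, 200 WARNING and 500 CRITICAL. A command wraps a plugin in a predefined command line that can then be applied to one or more monitored objects)
These objects are closely interrelated, and the relationships can get complex. A host needs contacts (who is told about a failure), a time period in which those contacts may be notified, and a command that checks it, so objects often reference one another. Every monitored object (host|service|resource) must be defined explicitly. For a host this means a name, a description, the check command, which problems trigger a notification (already at WARNING, or only at CRITICAL), who is notified, and when notifications may be sent.
Nagios supports templates. Suppose you must define N hosts that are all linux-server machines: apart from name and description, everything to be monitored is the same. When many objects share most attributes, define an object template (contacts, hosts and services can all use templates), apply it when defining each object so that the shared attributes are inherited, and then add only the attributes unique to that object.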
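As a sketch of how templates cut repetition, the following script prints one host definition per inventory line, each inheriting from the linux-server template of the sample configuration. The function name gen_hosts and the host names and addresses are placeholders, not part of the original setup.

```shell
#!/bin/sh
# Emit one host definition per "name address" pair read from stdin.
# Everything not listed here is inherited from the linux-server template.
gen_hosts() {
    while read -r name addr; do
        [ -n "$name" ] || continue   # skip blank lines
        printf 'define host{\n'
        printf '    use        linux-server\n'
        printf '    host_name  %s\n' "$name"
        printf '    alias      %s\n' "$name"
        printf '    address    %s\n' "$addr"
        printf '}\n'
    done
}

# Hypothetical inventory; names and addresses are placeholders.
gen_hosts <<'EOF'
web1 192.0.2.11
web2 192.0.2.12
EOF
```

The output could be redirected into a file that nagios.cfg includes with cfg_file; only the three per-host attributes ever need to be written by hand.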
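To monitor anything, Nagios needs objects to be defined; objects are the concrete, individually identified entities it works with.
As the figure shows, to check an object Nagios must have some way of obtaining state information from the remote host. Cacti works over SNMP, and Nagios can use SNMP as well, but since Nagios Core performs no checks itself, everything goes through plugins and access methods, which fall into five groups: check_by_ssh, check_nrpe, snmp, NSCA, check_xyz
ssh (the remote, monitored server runs an SSH daemon and must accept SSH commands from the monitoring server; the plugin evaluates the result and returns it to Nagios Core, which decides whether to alert)
nrpe (designed mainly for Linux and Unix hosts. An NRPE daemon is installed on the remote server; it runs the requested plugin locally and returns the result to check_nrpe on the monitoring server, which hands it to Nagios Core. This is a client/server design: check_nrpe on the monitoring server is the client and the NRPE daemon on the monitored host is the server)
snmp (at intervals the monitoring server runs SNMP queries against snmpd on the monitored host (port 161); a local plugin evaluates the answer and returns the result to Nagios Core. SNMP is mainly used for devices that support neither SSH nor NRPE. Windows hosts support both SNMP and NRPE, but Nagios does not usually monitor Windows over SNMP; it uses NSClient++, an agent installed on the Windows host that can draw on Windows facilities such as WMI. Once running, it lets Nagios talk to the Windows host, read the state of its resources, and return the result to Nagios Core)
nsca (SNMP has a trap mechanism in which the monitored side notifies the monitoring side on its own initiative. NSCA works along similar lines: the monitored host submits check results to Nagios, which gives Nagios passive checks)
To monitor linux|unix use nrpe/snmp/nsca; to monitor win install NSClient++ on the Windows host; to monitor router/switch/printer use snmp
ssh|nrpe|NSClient|snmp|nsca are used for hosts in some setups and for services in others. None of them is monitoring in itself: the actual checks are done by plugins, and these mechanisms are only the means by which a plugin obtains its data. Some services can be checked by a plugin directly, without any extra mechanism.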
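For example, to monitor a Linux host:
define a host object (this instantiates the monitored object and states which host, by IP address, is being monitored);
decide which command checks it (define a command object naming the plugin to use. The real work is done by the plugin; plugins can check many kinds of objects and many plugins are available. Put the plugin into a command definition and use that command object to check the host object. Creating a command instantiates a plugin, and creating an object instantiates a monitored target. A host may be checked by several commands, some for host resources and some for its services, so the relationship is not necessarily one to one);
decide who is notified when the host fails (define a contact object with name, email address and mobile number so the recipient is unambiguous; contact groups can be used)
decide when monitoring takes place (define time periods: 7x24, or working days only; during which periods a contact accepts notifications; a minor fault need not wake anyone in the middle of the night; scheduled maintenance windows can be excluded from monitoring)
Nagios can also define dependencies between hosts (say a router has a switch behind it and the switch has N hosts behind it, and Nagios monitors the router, the switch and the hosts. If the switch fails, an alert must be sent, but the hosts behind it obviously cannot be reached either. A dependency can state that when the switch is down those hosts need not be checked; otherwise you receive a flood of alerts)
Dependencies can be mutual (two-way) or hierarchical (parent/child). If two hosts depend on each other, a failure of host1 suppresses alerts for host2 (host2 is not checked) and a failure of host2 suppresses alerts for host1. Likewise, when both a host and services on it are monitored, there is no point checking the services once the host itself is down
Nagios can work out these dependency relationships, but they must be defined in advance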
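The parent/child case can be sketched as follows. This is a simplified single-parent model written for illustration, not Nagios' actual logic; classify_host and the host names are hypothetical. A host whose check fails is reported DOWN if its parent is still up, and UNREACHABLE if the parent has failed too.

```shell
#!/bin/sh
# Simplified reachability model (single parent only; not Nagios' actual code).
# $1 = host, $2 = its parent ("-" if none), remaining args = hosts whose checks failed.
classify_host() {
    host=$1
    parent=$2
    shift 2
    host_failed=no
    parent_failed=no
    for h in "$@"; do
        if [ "$h" = "$host" ]; then host_failed=yes; fi
        if [ "$h" = "$parent" ]; then parent_failed=yes; fi
    done
    if [ "$host_failed" = no ]; then
        echo "$host UP"
    elif [ "$parent_failed" = yes ]; then
        echo "$host UNREACHABLE (parent $parent failed)"
    else
        echo "$host DOWN"
    fi
}

# Hypothetical topology: web1 sits behind switch1, and both checks fail.
classify_host web1 switch1 switch1 web1
classify_host switch1 - switch1 web1
```

Only switch1 is reported as DOWN here; web1 is flagged as UNREACHABLE, which is the distinction that lets notifications focus on the root cause.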
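To summarize: Nagios is a monitoring framework that checks through plugins, and a check returns only one of 4 states: OK, WARNING, CRITICAL, UNKNOWN
A notification is sent to the administrator when an object moves from one state to another (for example OK-->CRITICAL). A special case can arise: Nagios checks a service that is too busy to answer in time (a check costs the monitored host some resources to answer), and the state may then be reported as UNKNOWN
States are either soft or hard. When the monitoring server sees a state change, it repeats the check several times. For example, OK-->UNKNOWN does not notify immediately; only if two further checks still return UNKNOWN does the state become hard and a notification go out (the number of attempts is set by max_check_attempts). The reason is that an error seen in a soft state may be temporary or accidental.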
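The soft/hard distinction can be modelled in a few lines. This is a deliberately simplified sketch, not Nagios' scheduler: track_states is a hypothetical name, the first argument plays the role of max_check_attempts, and changes between different non-OK states are ignored.

```shell
#!/bin/sh
# Simplified model of soft vs. hard states (not Nagios' actual code).
# $1 = max_check_attempts, remaining args = successive check results.
track_states() {
    max=$1
    shift
    attempt=0
    for state in "$@"; do
        if [ "$state" = "OK" ]; then
            attempt=0
            echo "OK"
        else
            attempt=$((attempt + 1))
            if [ "$attempt" -lt "$max" ]; then
                echo "$state SOFT attempt $attempt/$max"
            elif [ "$attempt" -eq "$max" ]; then
                echo "$state HARD notify"
            else
                echo "$state HARD"
            fi
        fi
    done
}

# Three attempts in total, as in the example above (one check plus two rechecks).
track_states 3 OK UNKNOWN UNKNOWN UNKNOWN
```

The notification is attached only to the check that turns the problem into a hard state; the two earlier UNKNOWN results stay soft and silent.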
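There is also an abnormal condition called flapping, where the state changes too often (OK-->WARNING-->CRITICAL-->OK-->UNKNOWN-->OK); when a host or service starts flapping, a notification is sent as well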
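Flap detection is based on how often the state changed over recent checks. The sketch below computes a plain, unweighted percentage of state changes; Nagios itself weights recent changes more heavily and compares the result with configurable thresholds, so treat this only as an illustration. The function name percent_state_change is hypothetical.

```shell
#!/bin/sh
# Unweighted percent state change over a list of check results.
# (Illustration only: Nagios weights recent changes more heavily.)
percent_state_change() {
    changes=0
    total=0
    prev=
    for state in "$@"; do
        if [ -n "$prev" ]; then
            total=$((total + 1))
            if [ "$state" != "$prev" ]; then
                changes=$((changes + 1))
            fi
        fi
        prev=$state
    done
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $((changes * 100 / total))
    fi
}

# The sequence from the text: every check differs from the previous one.
percent_state_change OK WARNING CRITICAL OK UNKNOWN OK
```

A steady sequence yields 0 and the sequence from the text yields the maximum, which is why it would be treated as flapping under any reasonable threshold.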
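Nagios provides a web interface that, like Cacti, displays status (and, beyond displaying it, Nagios also sends alert notifications). The web interface requires httpd; in Nagios 3.x it consists of compiled CGI programs plus a few PHP pages, which is why php is installed as well. MySQL is needed only in some cases: status data is not stored in MySQL unless another tool is used for that. mysql-devel is installed before building because the check_mysql plugin needs the MySQL header and library files at compile time
Nagios usually consists of a main program, nagios (also called Nagios Core), a plugin package (nagios-plugins) and four optional addons (NRPE, NSCA, NSClient++, NDOUtils)
Note: NDOUtils stores the Nagios configuration and the data produced by events in a database so that they can be searched and processed quickly. It can be thought of as a broker: it hooks into Nagios Core through the event broker interface and adds a layer that writes to a database the information Nagios Core would otherwise only keep in files
The Nagios server-side needs nagios, nagios-plugins and httpd
NRPE (to monitor Linux over NRPE, NRPE must be installed on the server and on the client. NRPE depends on nagios-plugins at run time, so install nagios-plugins on the client before installing NRPE)
To monitor other hosts over SNMP, nagios-plugins already provides the SNMP plugins
To monitor Windows, install NSClient++ on the Windows host
To use NSCA, install send_nsca on the client; on the server, run the NSCA daemon and let Nagios accept passive check results
Nagios can monitor Windows in two ways (SNMP and NSClient++)
Note: NSClient++ is very capable. It can check all kinds of Windows resources, such as cpu/memory/disk space/process/services, and it also provides NRPE and NSCA functionality
Nagios can talk to NSClient++ in several ways. The default, simplest and most common one is the check_nt plugin: to check the CPU of a Windows host, for example, check_nt passes some parameters to NSClient++, which runs the check locally and returns the result to check_nt. This is easy to use but the weakest in what it can check. The NRPE function with check_nrpe is recommended because it is more capable. NSCA gives passive checks; the Nagios server then needs the NSCA daemon to receive the results the other side sends
Note: check_nt has limited checking capability; prefer check_nrpe
NRPE (Nagios Remote Plugin Executor)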
II. Procedure:
[[email protected] ~]# uname -a (RHEL 6.5)
Linux localhost.localdomain 2.6.32-431.el6.x86_64 #1 SMP Sun Nov 10 22:19:54 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
Prepare the LAMP environment
Synchronize the system time
Software packages:
nagios-3.3.1.tar.gz
nagios-plugins-1.4.14.tar.gz
[[email protected] ~]# yum -y install httpd php php-mysql mysql mysql-devel mysql-server
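[[email protected] ~]# groupadd nagcmd (Nagios runs under a dedicated user and group. This group is essential: many Nagios management functions and CGI scripts can run only with this group's permissions)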
[[email protected] ~]# useradd -G nagcmd nagios
[[email protected] ~]# passwd nagios
[[email protected] ~]# vim /etc/httpd/conf/httpd.conf (httpd installed from binary packages runs as user and group apache; a source build uses daemon)
User apache
Group apache
[[email protected] ~]# usermod -a -G nagcmd apache
[[email protected] ~]# tar xf nagios-3.3.1.tar.gz
[[email protected] ~]# cd nagios
[[email protected] nagios]# ./configure --help | less
[[email protected] nagios]# ./configure --with-command-group=nagcmd --enable-event-broker --sysconfdir=/etc/nagios (--enable-event-broker enables integration of event broker routines, in preparation for NDOUtils; without this option Nagios has to be recompiled before NDOUtils can be used)
……
Review the options above for accuracy. If they look okay,
type 'make all' to compile the main program and CGIs.
[[email protected] nagios]# make all
[[email protected] nagios]# make install (installs nagios)
[[email protected] nagios]# make install-init (installs the init script, so that service nagios start|stop and so on can be used)
[[email protected] nagios]# make install-commandmode (sets the permissions for external commands)
[[email protected] nagios]# make install-config (installs the sample configuration files)
/usr/bin/install -c -m 775 -o nagios -g nagios -d /etc/nagios
/usr/bin/install -c -m 775 -o nagios -g nagios -d /etc/nagios/objects
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/nagios.cfg /etc/nagios/nagios.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/cgi.cfg /etc/nagios/cgi.cfg
/usr/bin/install -c -b -m 660 -o nagios -g nagios sample-config/resource.cfg /etc/nagios/resource.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/templates.cfg /etc/nagios/objects/templates.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/commands.cfg /etc/nagios/objects/commands.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/contacts.cfg /etc/nagios/objects/contacts.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/timeperiods.cfg /etc/nagios/objects/timeperiods.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/localhost.cfg /etc/nagios/objects/localhost.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/windows.cfg /etc/nagios/objects/windows.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/printer.cfg /etc/nagios/objects/printer.cfg
/usr/bin/install -c -b -m 664 -o nagios -g nagios sample-config/template-object/switch.cfg /etc/nagios/objects/switch.cfg
*** Config files installed ***
Remember, these are *SAMPLE* config files. You'll need to read
the documentation for more information on how to actually define
services, hosts, etc. to fit your particular needs.
[[email protected] nagios]# make install-webconf (automatically creates nagios.conf under /etc/httpd/conf.d/ for the web interface. The web pages live under /usr/local/nagios/share/, and this file essentially defines path aliases for them; afterwards the interface is reachable at http://192.168.23.137/nagios)
/usr/bin/install -c -m 644 sample-config/httpd.conf /etc/httpd/conf.d/nagios.conf
*** Nagios/Apache conf file installed ***
[[email protected] nagios]# htpasswd -c /etc/nagios/htpasswd.users nagiosadmin (login authentication for Nagios is handled by httpd)
New password:
Re-type new password:
Adding password for user nagiosadmin
[[email protected] nagios]# service httpd restart
Stopping httpd: [ OK ]
Starting httpd: [ OK ]
[[email protected] nagios]# chkconfig --add nagios
[[email protected] nagios]# chkconfig --list nagios
nagios 0:off 1:off 2:off 3:on 4:on 5:on 6:off
[[email protected] nagios]# service nagios start
Starting nagios: done.
[[email protected] nagios]# cd ..
[[email protected] ~]# tar xf nagios-plugins-1.4.14.tar.gz
[[email protected] ~]# cd nagios-plugins-1.4.14
[[email protected] nagios-plugins-1.4.14]# ./configure --help | less
[[email protected] nagios-plugins-1.4.14]# ./configure --with-nagios-user=nagios --with-nagios-group=nagios --sysconfdir=/etc/nagios
[[email protected] nagios-plugins-1.4.14]# make && make install
[[email protected] nagios-plugins-1.4.14]# service nagios restart (SELinux must not be enforcing, otherwise it blocks the CGI scripts: # setenforce 0)
Running configuration check...done.
Stopping nagios: done.
Starting nagios: done.
[[email protected] nagios-plugins-1.4.14]# cd
[[email protected] ~]# ls /etc/nagios
cgi.cfg htpasswd.users nagios.cfg objects resource.cfg
[[email protected] ~]# ls /etc/nagios/objects (the object files under objects/ can live anywhere, as long as the main configuration file nagios.cfg includes them)
commands.cfg contacts.cfg localhost.cfg printer.cfg switch.cfg templates.cfg timeperiods.cfg windows.cfg
Browse to http://192.168.23.137/nagios
[[email protected] ~]# vim /etc/nagios/nagios.cfg (every .cfg file under a directory named by cfg_dir is loaded)
log_file=/usr/local/nagios/var/nagios.log
cfg_file=/etc/nagios/objects/commands.cfg
cfg_file=/etc/nagios/objects/contacts.cfg
cfg_file=/etc/nagios/objects/timeperiods.cfg
cfg_file=/etc/nagios/objects/templates.cfg
cfg_file=/etc/nagios/objects/localhost.cfg
#cfg_dir=/etc/nagios/servers
resource_file=/etc/nagios/resource.cfg
status_file=/usr/local/nagios/var/status.dat
status_update_interval=10
check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
lock_file=/usr/local/nagios/var/nagios.lock
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
log_rotation_method=d
……
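Note: command_file=/usr/local/nagios/var/rw/nagios.cmd is the external command file, through which the CGIs and other programs submit commands to the running Nagios process. Who may do so is governed by the permissions on this file (hence the nagcmd group); it is not where command objects are defined
[[email protected] ~]# vim /etc/nagios/resource.cfg (this file defines the $USERn$ macros, which work like variables. Nagios supports 32 of them, $USER1$ to $USER32$; $USER1$ is already used by default, leaving 31 for your own definitions. They can be thought of as environment variables for Nagios. Nagios also has built-in macros that need no definition, such as $HOSTADDRESS$, which expands to the address of whichever host is being checked. resource.cfg is not reachable through the web interface, which keeps user-specific values such as paths and passwords away from the front end and improves security)
$USER1$=/usr/local/nagios/libexec
[[email protected] ~]# ls /usr/local/nagios/libexec (this directory holds the plugins; refer to one as $USER1$/PLUGIN_NAME)
[[email protected] ~]# vim /usr/local/nagios/var/status.dat (the data file holding the current state of every monitored host and service)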
[[email protected] ~]# cd /etc/nagios/objects
[[email protected] objects]# vim commands.cfg
define command{
command_name notify-host-by-email (must be globally unique; no two commands may share a command_name)
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}
……
define command{
command_name check-host-alive
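command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5 (-w is the warning threshold: WARNING when the average round-trip time reaches 3000 ms or packet loss reaches 80%; -c is the critical threshold, read the same way; -p is the number of packets to send)
}
define command{
command_name check_local_disk
command_line $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$ ($ARGn$ lets different hosts pass different arguments)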
}
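The rta,loss% threshold pairs used by check-host-alive can be read as "either limit is enough". A minimal sketch of that rule follows, assuming integer millisecond readings and treating a reading equal to the limit as a breach; exceeds_threshold and ping_state are hypothetical helpers, not part of check_ping.

```shell
#!/bin/sh
# Simplified model of check_ping thresholds (integer milliseconds only).
# $1 = threshold such as "3000.0,80%", $2 = measured RTA in ms, $3 = measured loss in %.
exceeds_threshold() {
    rta_limit=${1%%,*}            # "3000.0"
    rta_limit=${rta_limit%%.*}    # "3000"
    loss_limit=${1#*,}            # "80%"
    loss_limit=${loss_limit%"%"}  # "80"
    [ "$2" -ge "$rta_limit" ] || [ "$3" -ge "$loss_limit" ]
}

# $1 = warning threshold, $2 = critical threshold, $3 = RTA in ms, $4 = loss in %.
ping_state() {
    if exceeds_threshold "$2" "$3" "$4"; then
        echo CRITICAL
    elif exceeds_threshold "$1" "$3" "$4"; then
        echo WARNING
    else
        echo OK
    fi
}

# Thresholds from check-host-alive above; 12 ms and 0% loss are made-up readings.
ping_state '3000.0,80%' '5000.0,100%' 12 0
```

The critical threshold is tested first so that a reading breaching both limits is reported with the more severe state.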
[[email protected] objects]# vim contacts.cfg
define contact{
contact_name nagiosadmin ; Short name of user (contact_name must be globally unique)
use generic-contact ; Inherit default values from generic-contact template (defined above) (use names the template whose attributes are inherited)
alias Nagios Admin ; Full name of user (a descriptive name, for readability)
email [email protected] ;<<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
[[email protected] objects]# vim timeperiods.cfg
define timeperiod{
timeperiod_name 24x7 (timeperiod_name must be globally unique)
alias 24 Hours A Day, 7 Days A Week
sunday 00:00-24:00
monday 00:00-24:00
tuesday 00:00-24:00
wednesday 00:00-24:00
thursday 00:00-24:00
friday 00:00-24:00
saturday 00:00-24:00
}
[[email protected] objects]# vim localhost.cfg
define host{
use linux-server ; Name of host template to use (use names the template to apply)
; This host definition will inherit all variables that are defined
; in (or inherited by) the linux-server host template definition.
host_name localhost (host_name must be globally unique)
alias localhost
address 127.0.0.1
}
define service{
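use local-service ; Name of service template to use
host_name localhost (define the host first, then its services; a service always belongs to a host, and its description must be unique on that host)
service_description PING
check_command check_ping!100.0,20%!500.0,60% (the text after the first ! is the first argument and the text after the second ! is the second argument; check_ping must already be defined in commands.cfg)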
}
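How a check_command such as check_ping!100.0,20%!500.0,60% turns into a command line can be sketched as plain text substitution. This is an illustration of the idea, not Nagios' own macro processor: expand_command is a hypothetical helper that handles only $USER1$, $HOSTADDRESS$, $ARG1$ and $ARG2$, and the check_ping template in the example is assumed to match the sample commands.cfg.

```shell
#!/bin/sh
# Expand $USER1$, $HOSTADDRESS$, $ARG1$ and $ARG2$ in a command_line template.
# $1 = command_line template, $2 = check_command value, $3 = host address, $4 = $USER1$ value.
# Caveat: values containing "|", "&" or "\" would confuse sed in this sketch.
expand_command() {
    template=$1
    check_command=$2
    address=$3
    user1=$4
    # Everything after the first "!" holds the arguments, separated by "!".
    case $check_command in
        *!*) args=${check_command#*!} ;;
        *)   args= ;;
    esac
    arg1=${args%%!*}
    case $args in
        *!*) rest=${args#*!}; arg2=${rest%%!*} ;;
        *)   arg2= ;;
    esac
    printf '%s\n' "$template" | sed \
        -e 's|\$USER1\$|'"$user1"'|g' \
        -e 's|\$HOSTADDRESS\$|'"$address"'|g' \
        -e 's|\$ARG1\$|'"$arg1"'|g' \
        -e 's|\$ARG2\$|'"$arg2"'|g'
}

# The PING service above, expanded for localhost.
expand_command '$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5' \
    'check_ping!100.0,20%!500.0,60%' 127.0.0.1 /usr/local/nagios/libexec
```

Because the thresholds travel with the service definition rather than the command definition, the same check_ping command can serve hosts with very different limits.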
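1.
Monitoring a Windows host with check_nt
windows-side (monitored host):
Install NSClient++ on the Windows host (http://nsclient.org/)
Make sure Allowed hosts is set to the address of the Nagios monitoring server
On Windows, run netstat -an to confirm that port 12489 is open. The default used to be 1248 and has been changed to 12489; this is the port check_nt uses to talk to NSClient++, while 5666 is the port used by NRPE
In the NSC configuration file on Windows, comment out password to simplify configuration on the monitoring side; otherwise every check on the monitoring side needs one more argument to pass the password (in production a password should be set)
Restart the service from the Windows command line (>nsclient++.exe -stop, >nsclient++.exe -start)
nagios-side (monitoring server):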
[[email protected] objects]# ifconfig | grep "inet addr:"
inet addr:192.168.23.138 Bcast:192.168.23.255 Mask:255.255.255.0
inet addr:127.0.0.1 Mask:255.0.0.0
[[email protected] objects]# cd /usr/local/nagios/libexec/
[[email protected] libexec]# ll check_nt
-rwxr-xr-x. 1 nagios nagios 95456 Apr 1 15:59 check_nt
[[email protected] libexec]# ./check_nt -h
Usage:check_nt -H host -v variable [-p port] [-w warning] [-c critical] [-l params] [-d SHOWALL] [-u] [-t timeout]
Note: -H, --hostname=HOST
-v, --variable=STRING (valid variables: CLIENTVERSION, CPULOAD, UPTIME, USEDDISKSPACE, MEMUSE, SERVICESTATE, PROCSTATE, COUNTER, INSTANCES)
[[email protected] libexec]# ./check_nt -H 192.168.23.140 -v UPTIME -p 12489 -s nagios
System Uptime - 0 day(s) 0 hour(s) 40 minute(s)
[[email protected] libexec]# ./check_nt -H 192.168.23.140 -p 12489 -v CPULOAD -w 80 -c 90 -l 5,80,90 -s nagios (the output has a human-readable part and a performance-data part separated by a vertical bar |; if you write your own plugin and emit performance data, it must be separated from the text in the same way)
CPU Load 0% (5 min average) | '5 min avg Load'=0%;80;90;0;100
[[email protected] libexec]# ./check_nt -H 192.168.23.140 -p 12489 -v USEDDISKSPACE -w 80 -c 90 -l C -s nagios
C:\ - total: 40.00 Gb - used: 8.96 Gb (22%) - free 31.04 Gb (78%) | 'C:\ Used Space'=8.96Gb;32.00;36.00;0.00;40.00
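The split between readable text and performance data can be shown with a small helper. This is a sketch for illustration (split_output is a hypothetical name); it cuts the line at the first vertical bar and trims one adjacent space on each side.

```shell
#!/bin/sh
# Split one line of plugin output into its text part and its performance data.
split_output() {
    line=$1
    case $line in
        *'|'*) text=${line%%|*}; perf=${line#*|} ;;
        *)     text=$line; perf= ;;
    esac
    text=${text% }   # drop one trailing space before the bar
    perf=${perf# }   # drop one leading space after the bar
    printf 'text: %s\nperf: %s\n' "$text" "$perf"
}

# The CPULOAD output shown above.
split_output "CPU Load 0% (5 min average) | '5 min avg Load'=0%;80;90;0;100"
```

Tools that graph Nagios data typically read only the part after the bar, so a home-grown plugin that omits the separator simply yields no performance data.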
[[email protected] libexec]# cd /etc/nagios/objects
[[email protected] objects]# vim commands.cfg
define command{
command_name check_nt
command_line $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$
}
[[email protected] objects]# vim windows.cfg
define host{
use windows-server ; Inherit default values from a template
host_name winserver ; The name we're giving to this host
alias My Windows Server ; A longer name associated with the host
address 192.168.23.140 ; IP address of the host
}
define service{
use generic-service
host_name winserver
service_description NSClient++ Version
check_command check_nt!CLIENTVERSION
}
define service{
use generic-service
host_name winserver
service_description Uptime
check_command check_nt!UPTIME
}
define service{
use generic-service
host_name winserver
service_description CPU Load
check_command check_nt!CPULOAD!-l 5,80,90
}
define service{
use generic-service
host_name winserver
service_description Memory Usage
check_command check_nt!MEMUSE!-w 80 -c 90
}
define service{
use generic-service
host_name winserver
service_description C:\ Drive Space
check_command check_nt!USEDDISKSPACE!-l c -w 80 -c 90
}
define service{
use generic-service
host_name winserver
service_description W3SVC
check_command check_nt!SERVICESTATE!-d SHOWALL -l W3SVC
}
define service{
use generic-service
host_name winserver
service_description Explorer
check_command check_nt!PROCSTATE!-d SHOWALL -l Explorer.exe
}
[[email protected] objects]# vim ../nagios.cfg (add the following line)
cfg_file=/etc/nagios/objects/windows.cfg
[[email protected] objects]# /usr/local/nagios/bin/nagios -v /etc/nagios/nagios.cfg
……
Total Warnings: 0
Total Errors: 0
Things look okay - No serious problems were detected during the pre-flight check
[[email protected] objects]# service nagios restart
Running configuration check...done.
Stopping nagios: done.
Starting nagios: done.
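2.
Monitoring Linux with the check_nrpe plugin
Nagios uses the check_nrpe plugin to talk to the NRPE daemon on the monitored host, which listens on port 5666 by default. The NRPE addon must also be installed on the nagios-side monitoring server, but the service does not need to be started there
Monitored host:
[[email protected] ~]# uname -a (CentOS 6.3)
Linux localhost.localdomain 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux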
[[email protected] ~]# ifconfig | grep "inet addr:"
inet addr:192.168.23.132 Bcast:192.168.23.255 Mask:255.255.255.0
inet addr:127.0.0.1 Mask:255.0.0.0
[[email protected] ~]# rpm -i nrpe-2.15-7.el6.src.rpm
[[email protected] ~]# cd rpmbuild
[[email protected] rpmbuild]# ls
SOURCES SPECS
[[email protected] SPECS]# yum -y install tcp_wrappers-devel
[[email protected] SPECS]# rpmbuild -bp nrpe.spec
[[email protected] SPECS]# cd ..
[[email protected] rpmbuild]# ls
BUILD BUILDROOT RPMS SOURCES SPECS SRPMS
[[email protected] rpmbuild]# cd BUILD
[[email protected] BUILD]# ls
nrpe-2.15
[[email protected] BUILD]# cd nrpe-2.15/
[[email protected] nrpe-2.15]# ./configure --with-nrpe-user=nagios --with-nrpe-group=nagios --with-nagios-user=nagios --with-nagios-group=nagios --enable-command-args --enable-ssl --sysconfdir=/etc/nagios (--enable-command-args allows arguments to be passed to commands)
[[email protected] nrpe-2.15]# make all
[[email protected] nrpe-2.15]# make install-plugin
[[email protected] nrpe-2.15]# make install-daemon
[[email protected] nrpe-2.15]# make install-daemon-config
[[email protected] nrpe-2.15]# cd /etc/nagios
[[email protected] nagios]# vim nrpe.cfg
log_facility=daemon
pid_file=/var/run/nrpe/nrpe.pid
server_port=5666
server_address=192.168.23.132 (the address the daemon listens on; defaults to 0.0.0.0 if not set)
nrpe_user=nagios
nrpe_group=nagios
allowed_hosts=192.168.23.138 (the monitoring server allowed to connect)
debug=0
command_timeout=60
connection_timeout=300
# command[<command_name>]=<command_line> (the Nagios monitoring server requests checks over NRPE by name, so each command must first be defined here on the monitored host)
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_sda1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda1
command[check_sda2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/sda2
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 150 -c 200
include_dir=/etc/nrpe.d/
[[email protected] nrpe-2.15]# /usr/local/nagios/bin/nrpe -c /etc/nagios/nrpe.cfg -d (starts the NRPE daemon; an /etc/init.d/nrped script for managing it is given at the end of this document)
[[email protected] nrpe-2.15]# netstat -tnlp |grep :5666
tcp 0 0 192.168.23.132:5666 0.0.0.0:* LISTEN 21662/nrpe
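The command[...] lines in nrpe.cfg act as a lookup table: the monitoring server sends only a name, and the daemon runs whatever command line is registered under it. A sketch of that lookup, written against a temporary file rather than the real /etc/nagios/nrpe.cfg (lookup_nrpe_command is a hypothetical helper, not part of NRPE):

```shell
#!/bin/sh
# Print the command line registered under a given name in an nrpe.cfg-style file.
# $1 = command name, $2 = path to the configuration file.
lookup_nrpe_command() {
    name=$1
    cfg=$2
    # On the first line starting with "command[<name>]=", strip that prefix and stop.
    sed -n "/^command\[$name]=/{s///p;q;}" "$cfg"
}

# Demonstration on a temporary file holding two of the definitions shown above.
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
command[check_users]=/usr/local/nagios/libexec/check_users -w 5 -c 10
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
EOF
lookup_nrpe_command check_load "$cfg"
rm -f "$cfg"
```

A name that is not registered yields nothing, which mirrors why a check_nrpe request for an undefined command cannot succeed until nrpe.cfg is extended.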
Monitoring server:
Install NRPE (same steps as on the monitored host above, but only as far as make all and make install-plugin)
[[email protected] nrpe-2.15]# ls /usr/local/nagios/libexec (check that check_nrpe is present)
[[email protected] nrpe-2.15]# cd !$
cd /usr/local/nagios/libexec
[[email protected] libexec]# ./check_nrpe -h
Usage: check_nrpe -H <host> [ -b <bindaddr> ] [-4] [-6] [-n] [-u] [-p <port>] [-t <timeout>][-c <command>] [-a <arglist...>]
[[email protected] libexec]# vim /etc/nagios/objects/commands.cfg
define command{
command_name check_nrpe
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
}
[[email protected] libexec]# cp /etc/nagios/objects/windows.cfg /etc/nagios/objects/linuxhost.cfg
[[email protected] libexec]# vim !$ (the commands referenced by the services here must match the names defined at the end of nrpe.cfg on the monitored host)
vim /etc/nagios/objects/linuxhost.cfg
define host{
use linux-server ; Inherit default values from a template
host_name linuxserver ; The name we're giving to this host
alias My Linux Server ; A longer name associated with the host
address 192.168.23.132 ; IP address of the host
}
define service{
use generic-service
host_name linuxserver
service_description CHECK_USERS
check_command check_nrpe!check_users
}
define service{
use generic-service
host_name linuxserver
service_description LOAD
check_command check_nrpe!check_load
}
define service{
use generic-service
host_name linuxserver
service_description SDA1
check_command check_nrpe!check_sda1
}
define service{
use generic-service
host_name linuxserver
service_description SDA2
check_command check_nrpe!check_sda2
}
define service{
use generic-service
host_name linuxserver
service_description Zombie
check_command check_nrpe!check_zombie_procs
}
define service{
use generic-service
host_name linuxserver
service_description Totalprocs
check_command check_nrpe!check_total_procs
}
[[email protected] libexec]# vim /etc/nagios/nagios.cfg
cfg_file=/etc/nagios/objects/linuxhost.cfg
[[email protected] libexec]# /usr/local/nagios/bin/nagios -v /etc/nagios/nagios.cfg
[[email protected] libexec]# service nagios restart
Running configuration check...done.
Stopping nagios: done.
Starting nagios: done.
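3.
Monitoring Windows with check_nrpe
Monitored host:
C:\Program Files\NSClient++\NSC ([modules] lists the modules to load; lines starting with a semicolon are comments; allow_arguments controls whether the Nagios monitoring server may pass arguments, set it to 1 to allow; allow_nasty_meta_chars controls whether arguments may contain special characters, set it to 1 to allow; use_ssl, if enabled, forces SSL)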
[modules]
NRPEListener.dll
NSClientListener.dll
NSCAAgent.dll
CheckWMI.dll
FileLogger.dll
CheckSystem.dll
CheckDisk.dll
CheckEventLog.dll
CheckHelpers.dll
[Settings]
use_file=1
allowed_hosts=192.168.23.138
[NSClient]
[NRPE]
port=5666
command_timeout=60
allow_arguments=1
allow_nasty_meta_chars=1
;use_ssl=1
bind_to_address=192.168.23.140
allowed_hosts=192.168.23.138
On the Windows command line:
>cd ../..
>cd "Program Files"
>cd "NSClient++"
>nsclient++ -stop
>nsclient++ -start
Monitoring server:
[[email protected] ~]# cd /usr/local/nagios/libexec
[[email protected] libexec]# ./check_nrpe -H 192.168.23.140 -c checkCPU -a warn=80 crit=90 time=20 time=10 time=5
OK CPU Load ok.|'20'=0%;80;90; '10'=0%;80;90; '5'=0%;80;90;
4.
Under /usr/local/nagios/libexec/, check_http checks web services and check_mysql checks the MySQL service
[[email protected] libexec]# ./check_http -h
Usage: check_http -H <vhost> | -I <IP-address> [-u <uri>] [-p <port>]
[-w <warn time>] [-c <critical time>] [-t <timeout>] [-L]
[-a auth] [-f <ok | warn | critcal | follow | sticky | stickyport>]
[-e <expect>] [-s string] [-l] [-r <regex> | -R <case-insensitive regex>]
[-P string] [-m <min_pg_size>:<max_pg_size>] [-4|-6] [-N] [-M <age>]
[-A string] [-k string] [-S] [-C <age>] [-T <content-type>] [-j method]
Examples:
CHECK CONTENT: check_http -w 5 -c 10 --ssl -H www.verisign.com
[[email protected] libexec]# ./check_mysql -h
Usage: check_mysql [-d database] [-H host] [-P port] [-s socket]
[-u user] [-p password] [-S]
Adding a check for the httpd service:
[[email protected] libexec]# cd /etc/nagios/objects
[[email protected] objects]# vim commands.cfg
define command{
command_name check_http
command_line $USER1$/check_http -I $HOSTADDRESS$ $ARG1$
}
[[email protected] objects]# vim linuxhost.cfg
define service{
use generic-service
host_name linuxserver
service_description Web Server
check_command check_http
}
[[email protected] objects]# /usr/local/nagios/bin/nagios -v /etc/nagios/nagios.cfg
[[email protected] objects]# service nagios restart
Running configuration check...done.
Stopping nagios: done.
Starting nagios: done.
Adding a check for MySQL:
[[email protected] objects]# vim commands.cfg
define command{
command_name check_mysql
command_line $USER1$/check_mysql -H $HOSTADDRESS$ -u $ARG1$ -p $ARG2$
}
[[email protected] objects]# vim linuxhost.cfg
define service{
use generic-service
host_name linuxserver
service_description MySQL Server
check_command check_mysql!root!magedu
}
[[email protected] objects]# /usr/local/nagios/bin/nagios -v /etc/nagios/nagios.cfg
[[email protected] objects]# service nagios restart
Running configuration check...done.
Stopping nagios: done.
Starting nagios: done.
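Note: web services and MySQL already serve requests over the network, so they can be checked directly without extra agents such as NRPE or NSClient++
[[email protected] objects]# vim templates.cfg (both the host and the service templates use the admins contact group)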
define contact{
name generic-contact ; The name of this contact template
service_notification_period 24x7 ; service notifications can be sent anytime
host_notification_period 24x7 ; host notifications can be sent anytime
service_notification_options w,u,c,r,f,s ; send notifications for all service states, flapping events, and scheduled downtime events
host_notification_options d,u,r,f,s ; send notifications for all host states, flapping events, and scheduled downtime events
service_notification_commands notify-service-by-email ; send service notifications via email
host_notification_commands notify-host-by-email ; send host notifications via email
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL CONTACT, JUST A TEMPLATE!
}
define host{
name linux-server ; The name of this host template
use generic-host ; This template inherits other values from the generic-host template
check_period 24x7 ; By default, Linux hosts are checked round the clock
check_interval 5 ; Actively check the host every 5 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 10 ; Check each Linux host 10 times (max)
check_command check-host-alive ; Default command to check Linux hosts
notification_period workhours ; Linux admins hate to be woken up, so we only notify during the day
; Note that the notification_period variable is being overridden from
; the value that is inherited from the generic-host template!
notification_interval 120 ; Resend notifications every 2 hours
notification_options d,u,r ; Only send notifications for specific host states
contact_groups admins ; Notifications get sent to the admins by default
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}
define host{
name windows-server ; The name of this host template
use generic-host ; Inherit default values from the generic-host template
check_period 24x7 ; By default, Windows servers are monitored round the clock
check_interval 5 ; Actively check the server every 5 minutes
retry_interval 1 ; Schedule host check retries at 1 minute intervals
max_check_attempts 10 ; Check each server 10 times (max)
check_command check-host-alive ; Default command to check if servers are "alive"
notification_period 24x7 ; Send notification out at any time - day or night
notification_interval 30 ; Resend notifications every 30 minutes
notification_options d,r ; Only send notifications for specific host states
contact_groups admins ; Notifications get sent to the admins by default
hostgroups windows-servers ; Host groups that Windows servers should be a member of
register 0 ; DONT REGISTER THIS - ITS JUST A TEMPLATE
}
[[email protected] objects]# vim contacts.cfg
define contact{
contact_name nagiosadmin ; Short name of user
use generic-contact ; Inherit default values from generic-contact template (defined above)
alias Nagios Admin ; Full name of user
email [email protected] ;<<***** CHANGE THIS TO YOUR EMAIL ADDRESS ******
}
define contactgroup{
contactgroup_name admins
alias Nagios Administrators
members nagiosadmin
}
[[email protected] objects]# vim commands.cfg
define command{
command_name notify-host-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\nHost: $HOSTNAME$\nState: $HOSTSTATE$\nAddress: $HOSTADDRESS$\nInfo: $HOSTOUTPUT$\n\nDate/Time: $LONGDATETIME$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Host Alert: $HOSTNAME$ is $HOSTSTATE$ **" $CONTACTEMAIL$
}
define command{
command_name notify-service-by-email
command_line /usr/bin/printf "%b" "***** Nagios *****\n\nNotification Type: $NOTIFICATIONTYPE$\n\nService: $SERVICEDESC$\nHost: $HOSTALIAS$\nAddress: $HOSTADDRESS$\nState: $SERVICESTATE$\n\nDate/Time: $LONGDATETIME$\n\nAdditional Info:\n\n$SERVICEOUTPUT$\n" | /bin/mail -s "** $NOTIFICATIONTYPE$ Service Alert: $HOSTALIAS$/$SERVICEDESC$ is $SERVICESTATE$ **" $CONTACTEMAIL$
}
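Note: generic-contact used in contacts.cfg refers to the generic-contact template defined in templates.cfg
admins defined in contacts.cfg is the contact group referenced by contact_groups in templates.cfg
notify-host-by-email defined in commands.cfg is the command referenced by notify-host-by-email in templates.cfg
notify-service-by-email defined in commands.cfg is the command referenced by notify-service-by-email in templates.cfg
With NSCA, note the following when defining the host:
active_checks_enabled set to 0
passive_checks_enabled set to 1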
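For passive checks the monitored side builds the result itself and hands it to send_nsca, which expects one tab-separated line per service result: host name, service description, return code, plugin output. A minimal sketch of building such a line follows; passive_result is a hypothetical helper and "Backup Job" a made-up service that would also need a matching service definition on the Nagios server.

```shell
#!/bin/sh
# Build one passive service check result in the tab-separated form read by send_nsca.
# $1 = host name, $2 = service description, $3 = return code (0-3), $4 = plugin output.
passive_result() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4"
}

# Hypothetical example: report a finished backup as OK for host linuxserver.
passive_result linuxserver "Backup Job" 0 "OK - backup finished"

# With NSCA installed, the line would typically be piped to send_nsca, for example:
#   passive_result ... | send_nsca -H <nagios-server> -c <path to send_nsca.cfg>
```

The return code uses the same four values as an active plugin, so everything downstream in Nagios (soft and hard states, notifications) treats the result alike.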
Appendix: nrped script
# vim /etc/init.d/nrped
-----------------------script start-----------------
#!/bin/sh
#
# chkconfig: - 86 14
# description: starts and stops the NRPE daemon
NRPE_BIN=/usr/local/nagios/bin/nrpe
# Same configuration file as used when starting nrpe by hand above.
NRPE_CFG=/etc/nagios/nrpe.cfg

# Number of running nrpe processes (the grep itself is excluded).
nrpe_num=`ps aux | grep /bin/nrpe | grep -v grep | wc -l`

start_nrpe() {
    $NRPE_BIN -c $NRPE_CFG -d
    echo "nrpe started successfully."
}

stop_nrpe() {
    nrpe_pid=`ps aux | grep /bin/nrpe | grep -v grep | awk '{print $2}'`
    kill $nrpe_pid
    echo "nrpe stopped successfully."
}

case $1 in
start)
    if [ $nrpe_num -ge 1 ]
    then
        echo "Error: nrpe is already running."
    else
        start_nrpe
    fi
    ;;
stop)
    if [ $nrpe_num -ge 1 ]
    then
        stop_nrpe
    else
        echo "Error: nrpe is not running."
    fi
    ;;
restart)
    if [ $nrpe_num -ge 1 ]
    then
        stop_nrpe
        # Give the old daemon a moment to release port 5666.
        sleep 1
    else
        echo "nrpe is not running, starting it."
    fi
    start_nrpe
    ;;
*)
    echo "Usage: $0 {start|stop|restart}"
    exit 1
    ;;
esac
-------------------script end---------------------------