Monitoring pacemaker with nagios
In this article I describe how to monitor the pacemaker cluster resource manager of a Linux cluster with the nagios monitoring system. The nagios check_snmp plugin requests and analyses data from the pacemaker SNMP agent.
The nagios plugin check_snmp
Nagios provides a universal plugin to gather data from SNMP agents: check_snmp. You have to tell the plugin which OID you want to measure and how to interpret the result. The interpretation tells the plugin which values of the measurement indicate a good state (i.e. OK), a not so good state (i.e. WARNING) and a bad state (i.e. CRITICAL). The parameters are passed to the plugin with standard Unix options. The most important are:
check_snmp -H <ip_address> -o <OID> [-w warn_range] [-c crit_range]
You can also configure a community string for SNMPv1 or v2c, or the authentication and encryption settings for SNMPv3, with further options. As always, the plugin lists all its options when you call it with --help.
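As an illustration of the SNMPv3 options, the following sketch prints (rather than runs) a check_snmp call that uses authentication and encryption. The user name monitor, both passphrases and the host name are placeholders, and it assumes that the SNMP agent on the node has a matching SNMPv3 user configured. Which protocols are available depends on your plugin version, so compare with the --help output first.

```shell
#!/bin/sh
# Print a check_snmp call that uses SNMPv3 with authentication and
# encryption. User name and passphrases below are placeholders only.
snmpv3_check() {
    node=$1
    oid=$2
    echo check_snmp -H "$node" -o "$oid" \
        -P 3 -L authPriv \
        -U monitor -a SHA -A 'auth-passphrase' \
        -x AES -X 'priv-passphrase'
}

# Show the command line for the online-nodes OID on a hypothetical host.
snmpv3_check node1.example.com sys4PcmkOnlineNodes.0
```

Remove the echo to actually run the check once the values match your agent configuration.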
We want to check a two node cluster for the following conditions:
- At least one node is online: 2 online nodes are OK, 1 is WARNING and 0 are CRITICAL.
- There are no resources with failures: one failure in any resource gives a WARNING, more are CRITICAL.
The SNMP agent of pacemaker delivers the sys4PcmkOnlineNodes OID. This is the number of nodes in the online state. The nagios check would be:
$ check_snmp -H <node> -C public -o sys4PcmkOnlineNodes.0 -w 2: -c 1:
SNMP OK - 2 | PACEMAKER-MIB::sys4PcmkOnlineNodes.0=2
or in case one node is standby or offline:
SNMP WARNING - *1* | PACEMAKER-MIB::sys4PcmkOnlineNodes.0=1
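How the two range arguments lead to these states can be sketched in a few lines of shell. This is a simplified, integer-only illustration of the plugin threshold convention (a state is raised when the value falls outside the given range), not the code check_snmp itself uses. The function names in_range and state_for are made up for this example.

```shell
#!/bin/sh
# in_range VALUE RANGE: succeed if VALUE lies inside RANGE.
# Simplified integer-only sketch of the threshold range syntax:
#   "N"   means 0..N      "N:"  means N..infinity
#   ":N"  means 0..N      "N:M" means N..M
# ("~" for minus infinity and "@" for inverted ranges are not handled.)
in_range() {
    value=$1
    range=$2
    case $range in
        *:*) low=${range%%:*}; high=${range#*:} ;;
        *)   low=0; high=$range ;;
    esac
    [ -n "$low" ] || low=0
    [ "$value" -ge "$low" ] || return 1
    [ -z "$high" ] || [ "$value" -le "$high" ] || return 1
    return 0
}

# state_for VALUE WARN_RANGE CRIT_RANGE: print the resulting state.
# A value outside the critical range takes precedence over the warning range.
state_for() {
    if ! in_range "$1" "$3"; then
        echo CRITICAL
    elif ! in_range "$1" "$2"; then
        echo WARNING
    else
        echo OK
    fi
}

state_for 2 2: 1:   # two nodes online: prints OK
state_for 1 2: 1:   # one node online:  prints WARNING
state_for 0 2: 1:   # no node online:   prints CRITICAL
```

With -w 2: any value below 2 leaves the warning range, and with -c 1: any value below 1 leaves the critical range, which gives exactly the OK, WARNING and CRITICAL steps wanted for a two node cluster.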
During normal operation the resources in a cluster should not have any errors. Any non-zero failcount in the cluster is a sign of problems that the admins have to take care of. So the total number of failures in the cluster, sys4PcmkResourceFailures, makes a perfect target for monitoring. The check_snmp syntax would be:
$ check_snmp -H <node> -o sys4PcmkResourceFailures.0 -C public -w :0 -c :1
SNMP OK - 0 | PACEMAKER-MIB::sys4PcmkResourceFailures.0=0
or in case of any errors:
SNMP CRITICAL - *4* | PACEMAKER-MIB::sys4PcmkResourceFailures.0=4
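Nagios does not parse the status word in these output lines. It reads the exit status of the plugin: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. A minimal sketch of that mapping follows, with sh -c 'exit N' standing in for the real check_snmp calls so that it runs without a cluster. The helper name run_check is made up for this example.

```shell
#!/bin/sh
# run_check COMMAND [ARGS...]: run a plugin and print the state name
# that corresponds to its exit status (standard plugin convention).
run_check() {
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    case $status in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Stand-ins for real plugin calls.
run_check sh -c 'exit 0'   # prints OK
run_check sh -c 'exit 2'   # prints CRITICAL
```

To try it against a real node, replace the stand-in with the check_snmp command line, for example run_check check_snmp -H <node> -C public -o sys4PcmkResourceFailures.0 followed by your thresholds.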