Monitoring iRODS Resources with Nagios

This post will present one way to monitor iRODS with Nagios.

We will start with a Nagios service that monitors the status of iRODS resource servers and, if a resource server is unresponsive, marks all the iRODS resources on that server down. Next we will create a Nagios service that monitors the percentage of a resource’s capacity in use and raises Nagios warning and critical statuses for resources that have exceeded certain thresholds. As a last example, we will monitor the connection count on each iRODS server.

Basic Setup

As a demonstration, we will set up an iRODS grid with one iCAT-enabled server (catalog provider) and two iRODS resource servers (catalog consumers). Each server will host exactly one resource.

  • ICAT.example.org – 192.168.1.150
  • Resource1.example.org – 192.168.1.151
  • Resource2.example.org – 192.168.1.152

After setting up the servers, make sure each server can access the others by fully qualified domain name (FQDN). Install iRODS on each server. In this demonstration we installed iRODS 4.2 (preview).

We will use the default resources created on Resource1.example.org and Resource2.example.org.

  • Resource1Resource
  • Resource2Resource

The instructions that follow are based on Ubuntu 14. Minor changes would be needed on another Linux distribution.

Building the iping Command

We have created a simple application which will ping the iRODS services on each server. This application performs an rcConnect() to an iRODS server given the IP or hostname and port.

To build the iping binary we need to first build the iRODS externals as described in Building iRODS Externals.

After the externals are built, we need to clone the iRODS contrib package from GitHub, build and then install the iping package.

Perform the following steps on ICAT.example.org:

Now let’s test the iping command:

As another test, bring down one of the resource servers by either shutting down the machine or stopping the iRODS service. Alternatively, simulate a failure by pointing iping at a port other than the default of 1247.

The iping command exits with a return code of 2 on failure. Nagios uses this return code to distinguish success (0, reported as OK) from failure (2, reported as CRITICAL).
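For reference, the standard Nagios plugin convention maps exit codes 0 through 3 to service states. The helper below is only an illustration of that mapping; the function name nagios_state_name is hypothetical and is not part of iping or Nagios.

```shell
#!/bin/bash
# Illustrative helper (hypothetical name): translate a plugin exit code
# into the service state that Nagios assigns to it.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;   # the code iping returns on failure
        3) echo "UNKNOWN" ;;
        *) echo "INVALID" ;;    # outside the 0-3 range used by plugins
    esac
}
```

For example, `nagios_state_name 2` prints CRITICAL, which is the state a failed iping check would show in the Nagios web tool.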

Installing Nagios on the iCAT server

For simplicity, we will install Nagios on the iCAT server since we already have the iping command installed there.

Go to the local Nagios web tool to make sure everything is up. This would be http://localhost/nagios3/ if running from a browser on the iCAT server. Use “nagiosadmin” and the password you specified during Nagios installation.

Configuring Nagios

We need to create a directory for iRODS configuration and update the nagios.cfg to look for configuration files in this new directory.
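A minimal sketch of this step, assuming the main configuration file is /etc/nagios3/nagios.cfg (the usual location for the Ubuntu nagios3 package, but an assumption here). The helper name add_cfg_dir is hypothetical; cfg_dir is the standard Nagios directive for loading every .cfg file in a directory.

```shell
#!/bin/bash
# add_cfg_dir NAGIOS_CFG DIR  (hypothetical helper)
# Creates DIR and registers it in NAGIOS_CFG with a cfg_dir directive.
add_cfg_dir() {
    local cfg="$1" dir="$2"
    mkdir -p "$dir" || return 1
    # Append the directive only if the exact line is not already present,
    # so running the helper twice does not duplicate it.
    if ! grep -qxF "cfg_dir=$dir" "$cfg" 2>/dev/null; then
        echo "cfg_dir=$dir" >> "$cfg"
    fi
}

# Example, run as root (paths are assumptions):
#   add_cfg_dir /etc/nagios3/nagios.cfg /etc/nagios3/irods
```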

Now create /etc/nagios3/irods/irods.cfg which will be a configuration file for the iping service. Put the following contents in this file:

Configuration Explanation

We have created a template called irods-server-template for our iRODS servers. Nagios allows configurations to be inherited from other configurations. Our servers will inherit from this template. This template itself inherits from the generic-host template. Nagios identifies this as a template by the “register 0” parameter. The key configuration in this template is the check_interval which is set to one (1) minute.

We have defined two hosts which are derived from the irods-server-template. The address parameter is set to the hostname for a resource server.

The hostgroup is simply a grouping of hosts. In our case we have a single hostgroup for the two resource servers.

Next we defined two commands:

  • iping-irods-server – This executes a wrapper shell script which will execute the iping binary. The $ARG1$ is the port number which is defined in the service definition.
  • update-irods-resource-state – This executes a shell script which will update the resource state.

The last configuration item is the service definition itself.

The service runs the iping-irods-server command with $ARG1$ of 1247 for the iRODS port (see “!1247” in the configuration). A failure is indicated by a return code of 2. On a change in state, the event_handler is executed. In this case it executes the command update-irods-resource-state which executes the update_irods_resc_state.sh script. This script will update the resource state in the iRODS catalog to either “up” or “down”. Finally, this service runs against all hosts that are defined in the irods-servers hostgroup.
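The definitions described above could look roughly like the following. This is a sketch under stated assumptions, not the original irods.cfg: the host names Resource1 and Resource2, the service description, the use of generic-service, and the macros passed on each command_line are all assumptions. The function write_irods_cfg is a hypothetical helper that simply writes the file.

```shell
#!/bin/bash
# write_irods_cfg FILE  (hypothetical helper)
# The quoted 'EOF' keeps Nagios macros such as $ARG1$ literal.
write_irods_cfg() {
    cat > "$1" <<'EOF'
# Template: "register 0" marks this as a template, not a real host.
define host{
    name            irods-server-template
    use             generic-host
    check_interval  1
    register        0
}

# One host per resource server; the alias holds the resource name.
define host{
    use        irods-server-template
    host_name  Resource1
    alias      Resource1Resource
    address    Resource1.example.org
}

define host{
    use        irods-server-template
    host_name  Resource2
    alias      Resource2Resource
    address    Resource2.example.org
}

define hostgroup{
    hostgroup_name  irods-servers
    members         Resource1,Resource2
}

# $ARG1$ is the port given after "!" in the service's check_command.
define command{
    command_name  iping-irods-server
    command_line  /usr/lib/nagios/plugins/iping.sh $HOSTADDRESS$ $ARG1$
}

# Assumed arguments: the new service state and the resource name.
define command{
    command_name  update-irods-resource-state
    command_line  /usr/lib/nagios/plugins/update_irods_resc_state.sh $SERVICESTATE$ $HOSTALIAS$
}

define service{
    use                  generic-service
    hostgroup_name       irods-servers
    service_description  iping
    check_command        iping-irods-server!1247
    event_handler        update-irods-resource-state
    check_interval       1
}
EOF
}
```

Running `write_irods_cfg /etc/nagios3/irods/irods.cfg` as root would produce the file. Checking the result with Nagios's own configuration verification before restarting is a sensible precaution.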

Create scripts to perform the iping and update the resource states

Now we will create a bash script that is a wrapper around iping. This script will be placed in /usr/lib/nagios/plugins/iping.sh.

Make sure that this script can be executed.
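A wrapper of this kind might be sketched as follows. The `-h` and `-p` flags passed to iping are an assumption about its command line and should be checked against your build. The function name check_irods_server and the IPING variable are hypothetical and exist only to make the sketch easy to try out.

```shell
#!/bin/bash
# Sketch of a Nagios wrapper around iping.
# ASSUMPTION: iping accepts "-h <host> -p <port>".
IPING="${IPING:-iping}"   # hypothetical override, useful for testing

check_irods_server() {
    local host="$1" port="${2:-1247}"
    if "$IPING" -h "$host" -p "$port" >/dev/null 2>&1; then
        echo "OK - iRODS server $host:$port is responding"
        return 0
    fi
    # Any iping failure is reported to Nagios as CRITICAL (2).
    echo "CRITICAL - iRODS server $host:$port is not responding"
    return 2
}

# When saved as /usr/lib/nagios/plugins/iping.sh, the script would end with:
#   check_irods_server "$@"; exit $?
```

The script can be made executable with `chmod +x /usr/lib/nagios/plugins/iping.sh`.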

When the iping service notices a resource state has changed, it will update the resource’s status in the iRODS database. Create the following script in /usr/lib/nagios/plugins/update_irods_resc_state.sh:
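One way such an event handler could be written is sketched below. It assumes the script receives the new service state and the resource name as its two arguments, and that the nagios user has an iRODS environment with rodsadmin privileges. The command `iadmin modresc <name> status <value>` is the standard way to set a resource's status; the function name update_resc_state and the IADMIN variable are hypothetical.

```shell
#!/bin/bash
# Sketch of an event handler that mirrors the Nagios state in the catalog.
# ASSUMPTION: called as  update_irods_resc_state.sh <SERVICESTATE> <resource>
IADMIN="${IADMIN:-iadmin}"   # hypothetical override, useful for testing

update_resc_state() {
    local state="$1" resc="$2" status
    case "$state" in
        OK)       status="up" ;;
        CRITICAL) status="down" ;;
        *)        return 0 ;;   # WARNING/UNKNOWN: leave the catalog untouched
    esac
    "$IADMIN" modresc "$resc" status "$status"
}

# When saved as a script, it would end with:
#   update_resc_state "$@"
```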

Restart Nagios and Test

Restart Nagios

Now bring down a resource server. Wait for approximately a minute and check the resource status.

Bring up the resource server. Wait another minute and check the resource status.
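The status can be checked by reading the long listing from ilsresc. The sketch below assumes that `ilsresc -l <resource>` prints a line of the form `status: <value>`; the function name resc_status and the ILSRESC variable are hypothetical.

```shell
#!/bin/bash
# Sketch: print the status recorded in the catalog for one resource.
# ASSUMPTION: "ilsresc -l" output contains a line starting with "status:".
ILSRESC="${ILSRESC:-ilsresc}"   # hypothetical override, useful for testing

resc_status() {
    "$ILSRESC" -l "$1" | awk '
        /^status:/ {
            sub(/^status:[ \t]*/, "")
            # An empty value means no status has been recorded yet.
            print ($0 == "" ? "unset" : $0)
            found = 1
        }
        END { if (!found) exit 1 }'
}

# Example: resc_status Resource1Resource
```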

Using Resource Status Updates in a Replication Scenario

Now we will create a resource hierarchy with the two resources as children of a replication resource. When a resource server is unavailable and its resource has been marked down, iRODS will serve the file from the replica on the other resource.

Let’s create the resource hierarchy:
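A minimal sketch of building such a hierarchy with iadmin follows. The parent resource name replResc used in the example call is an assumption, as are the helper name create_repl_hierarchy and the IADMIN variable; `iadmin mkresc` and `iadmin addchildtoresc` are the standard commands for this.

```shell
#!/bin/bash
# Sketch: create a replication resource and attach children to it.
IADMIN="${IADMIN:-iadmin}"   # hypothetical override, useful for testing

# create_repl_hierarchy PARENT CHILD...
create_repl_hierarchy() {
    local parent="$1" child
    shift
    "$IADMIN" mkresc "$parent" replication || return 1
    for child in "$@"; do
        "$IADMIN" addchildtoresc "$parent" "$child" || return 1
    done
}

# Example (replResc is a hypothetical name):
#   create_repl_hierarchy replResc Resource1Resource Resource2Resource
```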

With both resources up let’s put a file into iRODS.

Bring Resource1Resource down and quickly try to get test.txt. Since Resource1Resource holds the replica with the lowest replica number, iRODS will attempt to read test.txt from this resource. If Resource1Resource is still marked as up, the get will fail.

Now let’s wait a minute or so and try to get test.txt one more time. By this time Resource1Resource will have been marked as down and the system will retrieve the file from Resource2Resource as we would prefer.

Monitoring Resource Use (bytes)

As another example of using Nagios to monitor resources, we will create a Nagios plugin to monitor the resource use (bytes) so that we get a warning or critical status when the resource nears its max_bytes limit.

The first step is to add a max_bytes setting to our resources’ context string. Just for demonstration purposes we will set the max_bytes to 500 bytes.
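Setting the context string could be sketched as below, using `iadmin modresc <name> context <value>`. Note that this replaces the whole context string, so any existing settings would need to be included in the new value. The helper name set_max_bytes and the IADMIN variable are hypothetical.

```shell
#!/bin/bash
# Sketch: set max_bytes in a resource's context string.
IADMIN="${IADMIN:-iadmin}"   # hypothetical override, useful for testing

# set_max_bytes RESOURCE BYTES
set_max_bytes() {
    local resc="$1" max_bytes="$2"
    case "$max_bytes" in
        ''|*[!0-9]*)
            echo "max_bytes must be a non-negative integer" >&2
            return 1 ;;
    esac
    # This overwrites the existing context string for the resource.
    "$IADMIN" modresc "$resc" context "max_bytes=$max_bytes"
}

# Example for the demonstration values:
#   set_max_bytes Resource1Resource 500
#   set_max_bytes Resource2Resource 500
```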

We will not enforce this max_bytes setting in our example, as we are concerned only with the monitoring aspects. Refer to Composable Resources for information on how to get started with enforcing the max_bytes setting.

Next we need to update our configuration to create a service to monitor the bytes. Update /etc/nagios3/irods/irods.cfg and append the following to the bottom of this file:

We have added a new command called check-resource-use. This command will execute the check_resource_use.sh script (not yet defined) and send the host alias as the first argument. We have already set the host alias to the resource name for the resource on that host. The next two arguments are warning and critical thresholds as percentages of the max_bytes setting.

We then defined a service that executes this command. The service runs against all servers in the irods-resource-server group and executes the check-resource-use command. We have hardcoded the warning level to 90.0% and the critical level to 95.0%.
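The command and service described above might look like the following sketch. The plugin path, the service description, the use of generic-service, and the helper name append_resource_use_cfg are assumptions; the hostgroup name is passed in as an argument so it can match whatever group your hosts are in.

```shell
#!/bin/bash
# append_resource_use_cfg FILE HOSTGROUP  (hypothetical helper)
# Appends a check-resource-use command and service to FILE.
append_resource_use_cfg() {
    local cfg="$1" hostgroup="$2"
    # Backslashes keep the Nagios macros literal; $hostgroup is expanded.
    cat >> "$cfg" <<EOF

# \$HOSTALIAS\$ carries the resource name; \$ARG1\$ and \$ARG2\$ are the
# warning and critical thresholds in percent of max_bytes.
define command{
    command_name  check-resource-use
    command_line  /usr/lib/nagios/plugins/check_resource_use.sh \$HOSTALIAS\$ \$ARG1\$ \$ARG2\$
}

define service{
    use                  generic-service
    hostgroup_name       $hostgroup
    service_description  check-resource-use
    check_command        check-resource-use!90.0!95.0
    check_interval       1
}
EOF
}
```

The two values after the `!` separators in check_command become $ARG1$ and $ARG2$, which is how the 90.0 and 95.0 thresholds reach the script.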

Notes:

In this example we have a single resource for each server. If you had multiple resources you could create a separate host for each resource with an alias that matches the resource name, create a hostgroup for all of the resources, and add this new hostgroup to the check-resource-use service.

We have set the check_interval to 1 (minute) for the check-resource-use service. In a real world scenario it may be more appropriate to do this once an hour (60) or maybe even once a day (1440).

Now we just need to create the check_resource_use.sh script. This script is provided below.
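The threshold logic of such a plugin can be sketched as follows. This is not the original script: it covers only the comparison step, and the function name evaluate_usage is hypothetical. How the used and maximum byte counts are obtained (for example through a catalog query and the resource's context string) is left out.

```shell
#!/bin/bash
# Sketch of the threshold check for a resource-use plugin.
# evaluate_usage USED_BYTES MAX_BYTES WARN_PCT CRIT_PCT
# Prints a one-line status and returns the matching Nagios exit code.
evaluate_usage() {
    awk -v used="$1" -v max="$2" -v warn="$3" -v crit="$4" 'BEGIN {
        if (max <= 0) {
            print "UNKNOWN - max_bytes is not set"
            exit 3
        }
        pct = 100 * used / max
        if (pct >= crit) {
            printf "CRITICAL - %.1f%% of max_bytes used\n", pct
            exit 2
        }
        if (pct >= warn) {
            printf "WARNING - %.1f%% of max_bytes used\n", pct
            exit 1
        }
        printf "OK - %.1f%% of max_bytes used\n", pct
        exit 0
    }'
}
```

With the demonstration limit of 500 bytes, `evaluate_usage 460 500 90.0 95.0` reports a warning at 92.0% and `evaluate_usage 480 500 90.0 95.0` reports critical at 96.0%. awk is used because bash arithmetic does not handle the fractional thresholds.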

Testing Resource Use

Restart Nagios so that it can read the new configuration.

Refresh the Nagios monitoring tool services page. You should now see check-resource-use services for both Resource1 and Resource2. If your resources are empty the status should be green (OK).

Now play around with adding files to the resources so that your use is between the warning and critical levels and then above the critical level. Nagios will update the service status appropriately.

Monitoring Connection Count

As one final example we will monitor the connection count on all iRODS servers.

We will use the irods-grid command to get the connection count. Before this command can be executed, we need to add a few lines to the nagios user’s irods_environment.json. Update ~nagios/.irods/irods_environment.json and add the following lines, adjusting the values to match your server configuration.

Next we need to update our configuration to add a host for the ICAT server, a hostgroup for the ICAT and two resource servers, a command to execute and a service to execute this command. Update /etc/nagios3/irods/irods.cfg and append the following to the bottom of this file:

We now need to create the check_agent_count.sh script. Create /usr/lib/nagios/plugins/check_agent_count.sh with the following contents and make this file executable.

This script takes the hostname as the first argument and performs an irods-grid status command. We use jq to parse the JSON output and write a single line of output for each server/pid combination. Then we get the number of connections for this host by searching for only lines that contain this hostname and counting the number of lines returned.
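The counting step described above can be sketched on its own. The sketch assumes the jq stage has already produced one line per server/pid combination with the hostname in the first column; the jq filter itself is omitted because the irods-grid output format is not shown here. The function names count_agents and report_agent_count are hypothetical.

```shell
#!/bin/bash
# Sketch of the counting step of a connection-count plugin.
# Input on stdin: one "hostname pid" line per agent (assumed format).

# count_agents HOSTNAME: print how many input lines belong to HOSTNAME.
count_agents() {
    # Exact match on the first field avoids partial hostname matches.
    awk -v host="$1" '$1 == host { n++ } END { print n + 0 }'
}

# report_agent_count HOSTNAME: informational only, so always return OK (0).
report_agent_count() {
    local host="$1" count
    count=$(count_agents "$host")
    echo "OK - $count connection(s) on $host"
    return 0
}
```

An exact field comparison is used instead of a plain text search so that a host such as Resource1.example.org is not counted against a longer hostname that happens to contain it.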

Note: If you don’t have jq, you can install it on Ubuntu with “sudo apt-get install jq”.

Since the output is just informational, we will always return 0 (OK) as long as the call was formatted correctly.

Restart Nagios and try it out. Refresh the Nagios monitoring tool services page. By default all three servers should have a count of 1. Try running multiple iRODS commands in parallel (possibly large puts that take a long time) and wait for an update. As long as the processes are still running when the check is performed you should have a count greater than 1.

Accessing Code

The final version of all of the files can be accessed here:

https://github.com/irods/contrib/tree/master/nagios_monitoring
