Nagios Monitoring for Erlang

December 30, 2009. Filed under erlang

One of the joys of my job is getting to work with Erlang. Erlang could offer quite a few benefits in distributed computing and reliability if adoption increases, but in the short term it has the inevitable weakness of not being PHP or Java. Erlang applications may also rely on Mnesia instead of MySQL or PostgreSQL. The end result is that a company's existing infrastructure (ops, monitoring, runbooks, etc.) usually can't support Erlang effectively without some modification.

Taking a stab at one aspect of this, I spent some time over the past few days writing Nagios monitoring scripts for Erlang process groups, nodes and applications. The effort is tentatively named nagios_erlang, although I'll admit the name lacks a certain charm.

More thorough usage details are in the nagios_erlang README, but generally it provides:

  • a check that the host can ping another node,
  • a check that a specific application is running on another node[1],
  • a check that the number of processes in a process group satisfies warning and critical thresholds (e.g. 5 or more is OK, fewer than 5 is a warning, fewer than 3 is critical).
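As a rough illustration of the ping and process-group checks, here is a minimal Erlang sketch. It is not the nagios_erlang source: the module name erlang_checks_sketch, the {ExitCode, Message} return shape, and reading group membership with rpc:call/4 on the monitored node are assumptions for illustration. It uses pg2, the process-group module in OTP at the time of writing; pg2 was deprecated in OTP 23 and removed in OTP 24, where pg:get_members/1 is the rough counterpart. Exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```erlang
-module(erlang_checks_sketch).
-export([check_ping/1, check_group/4, classify/3]).

%% Standard Nagios plugin exit codes.
-define(OK, 0).
-define(WARNING, 1).
-define(CRITICAL, 2).
-define(UNKNOWN, 3).

%% Ping check: CRITICAL unless the remote node answers with pong.
check_ping(Node) ->
    case net_adm:ping(Node) of
        pong -> {?OK, fmt("OK: ~p answered ping", [Node])};
        pang -> {?CRITICAL, fmt("CRITICAL: ~p did not answer ping", [Node])}
    end.

%% Process-group check (assumption: membership is read with pg2 on the
%% monitored node; pg2 no longer exists in OTP 24+).
check_group(Node, Group, Warn, Crit) ->
    case rpc:call(Node, pg2, get_members, [Group]) of
        Pids when is_list(Pids) ->
            Count = length(Pids),
            {Code, Label} = classify(Count, Warn, Crit),
            {Code, fmt("~s: ~b members in group ~p", [Label, Count, Group])};
        {error, {no_such_group, _}} ->
            {?CRITICAL, fmt("CRITICAL: no such group ~p", [Group])};
        {badrpc, Reason} ->
            {?UNKNOWN, fmt("UNKNOWN: rpc to ~p failed: ~p", [Node, Reason])};
        Other ->
            {?UNKNOWN, fmt("UNKNOWN: unexpected reply ~p", [Other])}
    end.

%% "Fewer than" thresholds: Count < Crit is CRITICAL, Count < Warn is
%% WARNING, anything else is OK (Warn = 5, Crit = 3 makes 5+ OK).
classify(Count, _Warn, Crit) when Count < Crit -> {?CRITICAL, "CRITICAL"};
classify(Count, Warn, _Crit) when Count < Warn -> {?WARNING, "WARNING"};
classify(_Count, _Warn, _Crit) -> {?OK, "OK"}.

fmt(Format, Args) -> lists:flatten(io_lib:format(Format, Args)).
```

A Nagios command wrapper would print the message and exit with the code, e.g. {Code, Msg} = erlang_checks_sketch:check_group('app@host', web_workers, 5, 3), io:format("~s~n", [Msg]), halt(Code). The calling node has to be started as a distributed node (-sname or -name) with the same cookie as the monitored node; 'app@host' and web_workers are placeholders.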

At the moment the scripts perform only active checks, but it should be straightforward to extend them to support passive checks as well. The rough plan: add a second wrapper in nagios_erlang.erl that formats output for NSCA, check for a --passive parameter, write the output to a temporary file and pipe it into send_nsca, or something along those lines.
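The passive variant could start from something like the following sketch. The module and function names are hypothetical, and it assumes each active check yields an {ExitCode, Message} pair; the line format is the tab-separated one that send_nsca reads on standard input for service checks (host name, service description, return code, plugin output).

```erlang
-module(nsca_sketch).
-export([format_passive/3, write_passive/4]).

%% Render one passive service-check result as the tab-separated line
%% send_nsca expects: host, service description, return code, output.
%% Result is assumed to be an {ExitCode, Message} pair from an active check.
format_passive(Host, Service, {Code, Output}) when is_integer(Code) ->
    lists:flatten(io_lib:format("~s\t~s\t~b\t~s~n",
                                [Host, Service, Code, Output])).

%% Write the line to a (temporary) file that can then be piped into
%% send_nsca. Returns ok or {error, Reason}, as file:write_file/2 does.
write_passive(Path, Host, Service, Result) ->
    file:write_file(Path, format_passive(Host, Service, Result)).
```

The file could then be submitted with something like send_nsca -H nagios.example.com -c /etc/send_nsca.cfg < /tmp/erlang_check.txt, where the host name and paths are placeholders.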

Full source code is available on GitHub.

  1. Awkwardly, it does this by trying to start the application and checking whether the result says it is already started. I couldn't come up with a more sophisticated approach, but perhaps I am simply blind to an appropriate function in the application module.
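One candidate for that function is application:which_applications/0, which lists the applications currently running on a node without starting anything. Below is a minimal sketch that calls it over rpc:call/4; the module name app_check_sketch and the {ExitCode, Message} return shape are assumptions for illustration, not the nagios_erlang code.

```erlang
-module(app_check_sketch).
-export([check_app/2, app_status/2]).

%% Standard Nagios plugin exit codes used below.
-define(OK, 0).
-define(CRITICAL, 2).
-define(UNKNOWN, 3).

%% Ask Node which applications are running; nothing is started there.
check_app(Node, App) ->
    case rpc:call(Node, application, which_applications, []) of
        Running when is_list(Running) ->
            app_status(App, Running);
        {badrpc, Reason} ->
            {?UNKNOWN, fmt("UNKNOWN: rpc to ~p failed: ~p", [Node, Reason])}
    end.

%% Pure part: which_applications/0 returns {Name, Description, Vsn}
%% tuples, so a key lookup on the first element is enough.
app_status(App, Running) ->
    case lists:keymember(App, 1, Running) of
        true -> {?OK, fmt("OK: ~p is running", [App])};
        false -> {?CRITICAL, fmt("CRITICAL: ~p is not running", [App])}
    end.

fmt(Format, Args) -> lists:flatten(io_lib:format(Format, Args)).
```

Unlike probing with application:start/1, this cannot accidentally start an application that was deliberately stopped on the monitored node.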