One of my joys at work is getting to use Erlang. If adoption increases, Erlang has quite a few benefits to offer in terms of distributed computing and reliability, but in the short term it has the inevitable weakness of not being PHP or Java. Further, Erlang applications may rely on Mnesia instead of MySQL or PostgreSQL, and the end result is that a company's existing infrastructure (ops, monitoring, runbooks, etc.) usually isn't effective at supporting Erlang without some modification.
Taking a stab at one aspect of this, I spent some time over the past few days
writing monitoring scripts for Erlang process groups, nodes and applications
for use with Nagios. The effort is tentatively named nagios_erlang,
although I'll admit a certain weakness in its charm.
More thorough usage details are in the nagios_erlang README, but generally it provides:
- the ability to check that the host can ping another node,
- the ability to check that a specific application is running on another node [1],
- the ability to check that the number of processes in a process group satisfies warning and critical constraints (e.g. more than 5 is OK, fewer than 5 is a warning, fewer than 3 is critical, etc.).
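As a rough sketch of how the process-group thresholds above might map onto Nagios states, the snippet below classifies a member count and translates the result into the standard plugin exit codes (0 OK, 1 warning, 2 critical, 3 unknown). The module and function names are hypothetical and not taken from nagios_erlang, and treating a count exactly equal to a threshold as satisfying it is an assumption.

```erlang
-module(pg_threshold).
-export([classify/3, exit_code/1]).

%% Classify a process-group size against warning and critical minimums.
%% Warn and Crit are hypothetical parameter names (assumption): a count
%% below Crit is critical, a count below Warn is a warning, anything
%% else is ok. A count equal to a threshold counts as meeting it.
classify(Count, _Warn, Crit) when Count < Crit -> critical;
classify(Count, Warn, _Crit) when Count < Warn -> warning;
classify(_Count, _Warn, _Crit) -> ok.

%% Standard Nagios plugin exit codes.
exit_code(ok) -> 0;
exit_code(warning) -> 1;
exit_code(critical) -> 2;
exit_code(unknown) -> 3.
```

With a warning threshold of 5 and a critical threshold of 3, `pg_threshold:classify(4, 5, 3)` gives `warning`, and `pg_threshold:exit_code(warning)` gives the exit status a plugin would return to Nagios.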
At the moment the scripts perform active checks, but it should be straightforward to extend them to support passive checks as well. (Add a second wrapper in nagios_erlang.erl that outputs in NSCA format, check for a --passive parameter, write the output to a temporary file, and pipe it into NSCA's send_nsca; something along those lines.)
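To make the passive idea a little more concrete: NSCA's send_nsca client reads one tab-separated line per service check result (host name, service description, return code, plugin output). Below is a minimal sketch of building such a line. The module name and the example host and service values are hypothetical and not part of nagios_erlang.

```erlang
-module(passive_format).
-export([line/4]).

%% Build one service-check line in the tab-separated layout that
%% send_nsca reads on standard input:
%%   host <TAB> service <TAB> return code <TAB> plugin output <NEWLINE>
%% Host, Service and Output are strings; Code is the Nagios status (0..3).
line(Host, Service, Code, Output) when is_integer(Code) ->
    lists:flatten(
      io_lib:format("~s\t~s\t~B\t~s~n", [Host, Service, Code, Output])).
```

For example, `passive_format:line("web1", "erlang_node", 0, "OK - pong")` produces a line that could be written to a temporary file and piped into send_nsca.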
Full source code is available on GitHub.
[1] Awkwardly, the application check works by trying to start the application and checking whether it is already started. I couldn't come up with a more sophisticated approach, but perhaps I am simply blind to an appropriate function in the application module.
Thanks for the article. A minor note: I think using the ++ operator instead of lists:concat/1 would make format_output/1 look a little simpler.
Try application:which_applications/0 to get a list of running applications on the node.
Out of interest, is there any particular reason you didn't go down the route of using Erlang's SNMP daemon to provide this information? (Besides the fact that SNMP is a pain in the backside from a coding perspective...).
No particular reason, largely a lack of experience with Nagios and not having a strong grasp on the format it expected for passive checks.
Fair enough. I ask only because we faced the same issue here and our Nagios guys were quite insistent that SNMP was "the way to go" where possible. Partly because of the ease of integrating Nagios with it and partly because it's "the standard" for monitoring things, so they could hook in other tools with equal ease and no changes to the code.
Erlang's SNMP stuff took me a day or so to get my head around, but once you've figured it out it's relatively easy to use (at least compared to other SNMP libraries I've tried). The worst bit is writing the damn MIB files...
Would love to see the solution you came up with, if you have the desire to release the code. :)
Maybe this is obvious, but regarding how to tell if an app is running on another node, can't you just call application:which_applications/0 on the other node and check if the app you're interested in is in the returned list?
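A minimal sketch of that suggestion, assuming the checking node can already reach the target node over distributed Erlang (same cookie, visible via epmd). The module and function names are hypothetical.

```erlang
-module(app_check).
-export([is_running/2, is_running_in/2]).

%% Ask Node for its running applications and look for App.
%% Returns true | false, or {error, Reason} if the RPC itself fails
%% (for example when the node is down).
is_running(Node, App) ->
    case rpc:call(Node, application, which_applications, []) of
        {badrpc, Reason} -> {error, Reason};
        Apps when is_list(Apps) -> is_running_in(App, Apps)
    end.

%% application:which_applications/0 returns a list of
%% {Name, Description, Version} tuples, so match on the first element.
is_running_in(App, Apps) ->
    lists:keymember(App, 1, Apps).
```

Unlike application:loaded_applications/0, which also lists applications that are loaded but not started, which_applications/0 only reports applications that are currently running.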
Obvious, but I still managed to miss it. Thanks for pointing it out. I guess I missed it because I was too focused on application:loaded_applications/0.
Hi, please tell us if you'll finish your series about Dynamo and Erlang. I've enjoyed reading them so much...
Next one will be up on Monday, and the follow up will be up before the Monday after that. Really fell off track, sorry for letting you down.
This is great. It looks like you've covered most of the major areas that I can think of and the interface seems pretty simple.
My only thought is whether the boot-up time of the Erlang VM would cause any issues with Nagios. I've found the VM's boot-up to be sluggish compared to other interpreters and environments, and I'd like to hear your thoughts on monitoring nodes/applications within a grid while avoiding the actual VM (i.e. interfacing with epmd, etc.).