Debugging Crashes With Small Scripts

January 23, 2010. Filed under erlang

I ran into a fantastic bug at work this past week. We had a component which would periodically crash once every couple weeks. Memory usage wasn't growing over time, and neither was CPU utilization increasing. There wasn't any pattern to the specific files which were being processed at the time of the crash. A break came when we realized that it consistently crashed after processing approximately thirty-two thousand files.

Further, tracing through our logs we were pretty certain that the line causing the crash was this one:

global:register_name(UniqueId, spawn(some_module, some_func, [some_params])),

And we were also finding this error message in our logs:

=ERROR REPORT==== 23-Jan-2010::11:00:48 ===
Too many processes

I'm not sure exactly how I would have approached this a year ago, but I had recently read Joe Armstrong's interview in Coders at Work and he mentions how he frequently writes simple scripts to check performance, toy with algorithms and so on, and that seemed like exactly the right approach for figuring out this problem.

First I was curious if the issue was in global:register_name/2 so I wanted to try registering a single process more than 32,000 times.

F = fun() -> receive A -> A end end,
Pid = spawn(F),
lists:foreach(fun(X) ->
    io:format("~p~n",[X]),
    global:register_name(list_to_atom(integer_to_list(X)), F)
  end,
  lists:seq(0, 100000)).

This did generate a number of errors (global doesn't like a single pid being registered with multiple names), but it didn't ever generate the Too many processes error, so it seemed that the issue could be reproduced by registering the same process more than 32,000 times.

The next step was to try registering 32,000 different processes.

F = fun() -> receive A -> A end end,
Z = fun(X) ->
    io:format("~p~n", [X]),
    Atom = list_to_atom(integer_to_list(X)),
    global:register_name(Atom, spawn(F))
  end,
  lists:foreach(Z, lists:seq(0, 100000)).

The second attempt produced the desired failure, failing with the Too many processes message after run for a bit. Finally, I needed to clarify if the issue was caused by creating too many processes or, registering too many different processes.

To do that, I added a timeout to the functions so that they only leaved for one second instead of lasting forever.

F = fun() ->
    receive A -> A
    after 1000 -> {error, timeout}
    end
  end,
Z = fun(X) ->
    io:format("~p~n", [X]),
    Atom = list_to_atom(integer_to_list(X)), global:register_name(Atom, spawn(F))
  end,
  lists:foreach(Z, lists:seq(0, 100000)).

This script runs all the way to 100,000 instead of crashing at 32,000. It turned out the exact number is 32768, and it can be overridden by passing +P at the command line as described in the erl documentation.

Of course, increasing the number of allowable processes wasn't quite the right solution, the real solution was to figure out why we had 32,768 processes hanging around and not getting cleaned up (turns out in a partially implemented section of the code something was getting spawned waiting for a message which it never received).

Going forward I think I'll be writing many more small test scripts as the latest addition to my debugging toolkit.