I ran into a fantastic bug at work this past week. We had a component which would periodically crash once every couple weeks.
Memory usage wasn't growing over time, and neither was CPU utilization increasing. There wasn't any pattern to the specific
files which were being processed at the time of the crash. A break came when we realized that it consistently crashed after processing
approximately thirty-two thousand files.
Further, tracing through our logs we were pretty certain that the line causing the crash was this one:
I'm not sure exactly how I would have approached this a year ago, but I had recently read
Joe Armstrong's interview in Coders at Work and he mentions
how he frequently writes simple scripts to check performance, toy with algorithms and so on, and that seemed
like exactly the right approach for figuring out this problem.
First I was curious if the issue was in global:register_name/2 so I wanted to try registering a single process more than 32,000 times.
This did generate a number of errors (global doesn't like a single pid being registered with multiple names), but it didn't
ever generate the Too many processes error, so it seemed that the issue could be reproduced by registering the same process
more than 32,000 times.
The next step was to try registering 32,000 different processes.
The second attempt produced the desired failure, failing with the Too many processes message after run for a bit.
Finally, I needed to clarify if the issue was caused by creating too many processes or, registering too many different processes.
To do that, I added a timeout to the functions so that they only leaved for one second instead of lasting forever.
This script runs all the way to 100,000 instead of crashing at 32,000. It turned out the exact number is 32768,
and it can be overridden by passing +P at the command line as described in the erl
Of course, increasing the number of allowable processes wasn't quite the right solution, the real solution was to figure
out why we had 32,768 processes hanging around and not getting cleaned up (turns out in a partially implemented section
of the code something was getting spawned waiting for a message which it never received).
Going forward I think I'll be writing many more small test scripts as the latest addition to my debugging toolkit.