The Day I Broke Production

February 15, 2010. Filed under erlang

One of my more exciting days as a developer was the day I broke the production environment. Although I was fortunate to escape unscathed, it was a formative experience; I'll never be quite as callous with my typing or my thinking in the production environment again.

Our product is an amalgamation of Erlang, Java and PHP, and figuring out a sane deployment approach involved quite a learning curve. PHP upgrades were usually a no-op, though they sometimes required an Apache restart. Java upgrades involved killing the running process and then starting a new one with the updated code. Erlang upgrades involved killing the beams and then running the start scripts again. For Java wrapped by Erlang, it involved killing and restarting the Erlang process.

Well, that's how it was supposed to work anyway.

The Problem

After some--but not all--deployments we encountered a peculiar problem: we would have dead processes hanging around in pg2. This would cause gen_server:call to fail when pg2:get_closest_pid happened to select one of those no-longer-alive processes, which in turn would cause the callers to blow up. Generally these pushes would douse our house of cards with gasoline and the next incoming request would provide the spark to set the entire system alight.
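To make the failure mode concrete, here is a minimal sketch of what a caller sees when it calls a pid that has already exited, along with a defensive wrapper. The module name dead_call_demo and the functions dead_pid/0 and safe_call/2 are hypothetical names for illustration, not part of our system.

```erlang
-module(dead_call_demo).
-export([dead_pid/0, safe_call/2]).

%% Spawn a process and wait until it has exited, returning its
%% (now dead) pid. This stands in for a stale pg2 member.
dead_pid() ->
    {Pid, Ref} = spawn_monitor(fun() -> ok end),
    receive
        {'DOWN', Ref, process, Pid, _Reason} -> Pid
    end.

%% gen_server:call/2 exits in the caller when the target is gone.
%% This wrapper converts the two most relevant exits into error
%% tuples so the caller can retry instead of crashing.
safe_call(Pid, Request) ->
    try gen_server:call(Pid, Request) of
        Reply -> {ok, Reply}
    catch
        exit:{noproc, _} -> {error, noproc};
        exit:{{nodedown, _Node}, _} -> {error, nodedown}
    end.
```

Calling dead_call_demo:safe_call(dead_call_demo:dead_pid(), ping) should give back an error tuple rather than taking the caller down with it, which is exactly the chain reaction described above.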

It was hardly surprising that there would be some rough patches when deploying an alpha product, but it nonetheless got somewhat grating to have the problem pop up after some but never all pushes. While we eventually figured out better interim solutions and finally understood the underlying problem (more on that later), in the meantime we had a manual cleanup process.

The very first few times this happened the cleanup was very manual, meaning that we typed some commands in the Erlang shell and then mashed down the return key. Surprisingly enough, this segues us into the story of the day I broke production.

Worse than the Disease

After one push it immediately became clear that the dead process plague was visiting our fair system. Up to that point two of the other engineers had been taking care of the zombies, but one of them was on a month-long vacation and the other wasn't immediately available, so I had two choices: leave quietly for a long lunch, or try to fix it myself.

So I fixed the problem. Sort of. What I wrote was something along the lines of:

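RemoteIsAlive = fun(Pid) ->
    %% is_process_alive/1 only accepts local pids, hence the rpc to the pid's node
    rpc:call(node(Pid), erlang, is_process_alive, [Pid])
  end,

lists:foreach(fun(Group) ->
    Pids = pg2:get_members(Group),
    %% the mistake: this keeps the pids that ARE alive
    Filtered = lists:filter(RemoteIsAlive, Pids),
    lists:foreach(fun(Pid) -> pg2:leave(Group, Pid) end, Filtered)
  end, pg2:which_groups()).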

Which happens to have one minor flaw: lists:filter keeps the elements for which the predicate returns true, so it removed all living processes from the process groups, leaving only dead processes in the process groups to service requests. Oops.
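For contrast, a sketch of what the cleanup was presumably meant to do is below: invert the predicate so that only dead pids are removed. It is written as shell expressions and assumes an OTP release that still ships pg2 (it was removed in OTP 24). Treating an unreachable node as dead is an assumption of this sketch, not something the original did.

```erlang
%% Sketch only: remove dead pids (not live ones) from every pg2 group.
IsDead = fun(Pid) ->
    %% rpc:call/4 returns {badrpc, Reason} when the node is unreachable;
    %% anything other than 'true' is treated as dead here.
    rpc:call(node(Pid), erlang, is_process_alive, [Pid]) =/= true
  end,

lists:foreach(fun(Group) ->
    Dead = lists:filter(IsDead, pg2:get_members(Group)),
    lists:foreach(fun(Pid) -> pg2:leave(Group, Pid) end, Dead)
  end, pg2:which_groups()).
```

The only meaningful difference from the broken version is the `=/= true`, which is a good reminder of how small the gap between a fix and an outage can be.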

From there I messaged the QA lead to kindly ignore the incoming wave of test failures that were about to be unleashed, and went about fixing my fix. First, I deleted and recreated all the existing process groups on all the nodes to clean out the dead processes, and then I had to go around to each of the nodes and restart the applications running on them (which would cause them to rejoin the new process groups, thus populating them with live processes).
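The recovery steps can be sketched roughly as follows, again as shell expressions on an OTP release that still includes pg2. The application name my_app is a placeholder, and the exact commands used that day were not recorded, so treat this as an approximation.

```erlang
%% Sketch: drop and recreate every pg2 group, discarding all memberships
%% (dead and alive alike).
Groups = pg2:which_groups(),
lists:foreach(fun(Group) ->
    ok = pg2:delete(Group),
    ok = pg2:create(Group)
  end, Groups).

%% Then, on each node, restart the application so that its processes
%% rejoin the freshly created groups. my_app is a placeholder name.
ok = application:stop(my_app),
ok = application:start(my_app).
```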

Since we only had about eight nodes at that point it only took a few minutes to get it all sorted, but they were some rather tense minutes.

Various and Sundry Solutions

Since that incident we've improved the situation in quite a few ways, in particular:

  • an immediate fix was to filter out dead processes from pg2 by wrapping
    gen_server:call similarly to what is described in this post on load balancing across process groups, with the addition of cleaning out the process group of dead processes before selecting the best process to route a request to,
  • as we began to investigate the underlying issue it turned out that some of our gen_server implementations weren't trapping exits, and thus their terminate functions weren't being called and they never left their process groups after stopping.
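The second point can be sketched as a gen_server that traps exits so that terminate/2 runs when its supervisor shuts it down. The module name group_worker is hypothetical, the callbacks are deliberately empty, and the code assumes an OTP release that still ships pg2.

```erlang
-module(group_worker).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

start_link(Group) ->
    gen_server:start_link(?MODULE, Group, []).

init(Group) ->
    %% Without this flag, a shutdown signal from the supervisor kills
    %% the process outright and terminate/2 never runs.
    process_flag(trap_exit, true),
    ok = pg2:create(Group),          % no-op if the group already exists
    ok = pg2:join(Group, self()),
    {ok, Group}.

handle_call(_Request, _From, Group) -> {reply, ok, Group}.
handle_cast(_Msg, Group) -> {noreply, Group}.
%% With trap_exit set, 'EXIT' messages from other linked processes
%% arrive here; this sketch simply ignores them.
handle_info(_Info, Group) -> {noreply, Group}.

terminate(_Reason, Group) ->
    %% Leave the group explicitly so no stale pid is left behind.
    pg2:leave(Group, self()),
    ok.
```

Note that terminate/2 still won't run if the process is killed brutally or the whole beam is killed from the operating system, so this complements rather than replaces the filtering described in the first point.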

Many experienced Erlangers might identify the underlying problem as deploying the Erlang code in the wrong way. On the simple side there is the shell's l function, which reloads a module, and on the more sophisticated side there are appup files, which help manage upgrading Erlang applications.
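For readers who haven't seen either mechanism: l(Module) in the shell purges the old code and loads the new version of that one module, while an appup file describes how to move a whole application between versions. A minimal appup sketch follows; the application name, module names and version numbers are all made up for illustration.

```erlang
%% my_app.appup (hypothetical application, modules and versions)
{"1.1.0",
 %% Upgrading from 1.0.0 to 1.1.0
 [{"1.0.0", [{load_module, my_util},
             %% 'advanced' makes the release handler call code_change/3
             %% in the running my_worker processes
             {update, my_worker, {advanced, []}}]}],
 %% Downgrading from 1.1.0 back to 1.0.0
 [{"1.0.0", [{update, my_worker, {advanced, []}},
             {load_module, my_util}]}]}.
```

The appeal is that processes keep running (and stay in their process groups) across the upgrade, so the dead-member problem never gets a chance to appear.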

I think what we were doing was unquestionably wrong for a pure Erlang system, or for a situation where the entire team and/or company is Erlang-fluent, but I'm less willing to concede that it doesn't make sense for a larger corporation which has its own deployment mechanisms that everyone is already familiar with. Large companies are always concerned about the cost of adopting new things, and the key to faster adoption is reducing the cost of adoption, often at the cost of purity. (This topic probably merits a real discussion rather than just an afterthought.)

Although the system got away unscathed, this day was definitely a learnable moment for me. Even simple and obvious solutions have a way of blowing up in the furnace of production deployment, and a combination of more defensive coding and more extensive failure testing (a kind of testing that I find many developers neglect to a fault) would have prevented the entire situation from arising.

What were your closest misses with breaking a running system?