Fail Open and Layer Policy

Published on September 20, 2016.

Around 2009, the Dynamo paper materialized into Cassandra. Cassandra escaped Facebook in the fashion of the time: an abrupt code bomb erupting into existence with little ongoing maintenance. Maintaining early versions of Cassandra was itself an explosive experience, and those who ran them developed a shared joke that Cassandra was a Trojan horse released to blot out progress by an entire generation of Silicon Valley startups.

Working at Digg and running Cassandra 0.6, we didn’t laugh much.

While operationally it was a bit of a challenge, it introduced me to what felt like a very novel idea at the time: Cassandra and similarly designed NoSQL databases were bringing scalability to the masses by only providing operations that worked well at scale. Sure, later they released CQL and that rigid enforcement faded, but the idea that you can drive correct user behavior by radically restricting choice was a very powerful one.

A couple jobs later, I joined a team which was exploring the logical extremes of this concept; we had extremely rigid service provisioning and configuration tools, which only allowed a very specific shape of service run in a very specific way (colloquially, The Right Way). Our customers loved that our tooling always ensured they did things in a maintainable, easy-to-scale way, and that it didn’t require them to waste time trying a variety of tools: they could just get to work!

Just kidding. They revolted.

There was an initial response to dig in and explain why our customers should do it our way and not their way, but that was ~surprisingly ineffective, even in cases where the new way felt demonstrably worse. Sensing the tides had turned against us, we eventually relented and built a very flexible and buzzword-compliant solution to service provisioning, allowing teams to provision whatever kind of service with whatever kind of programming language they wanted.

All was blissful for some time, with teams making their own decisions about technology. In particular this approach did a great job of aligning choice and responsibility, where teams who chose to use a non-standard technology ended up absorbing most of the additional complexity from doing so.

Well-aligned and well-designed, we were riding high, but oddly enough our bliss didn’t last forever. Pretty soon another right way graduated to The Right Way (which had relatively little in common with its nominal predecessor), and our tooling suddenly had a new bug: it was too flexible.

It took me a while to figure out what to take away from those experiences, but these days I summarize my takeaway as:

Design systems which fail open and layer policy on top.

In this case, failing open means defaulting to allowing any behavior, even behavior you find undesirable. This might be allowing a user to use unsupported programming languages, store too much data, or perform unindexed queries.

Then layering policies on top means adding filters which enforce the desired behavior. Following the above example, that would be rejecting programming languages or libraries you find undesirable, attempts to store too much data, or queries without proper indexes.
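As a rough illustration, here is a minimal Python sketch of that split. Everything in it is hypothetical and invented for the example (the ServiceRequest shape, the allowed-language list, the 500 GB cap); it is not the tooling described above. The core provision function accepts anything, and the restrictions live in a separate list of policy functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ServiceRequest:
    """Hypothetical request shape, for illustration only."""
    name: str
    language: str
    storage_gb: int


# A policy returns None to allow, or a message explaining the rejection.
Policy = Callable[[ServiceRequest], Optional[str]]


def provision(request: ServiceRequest) -> str:
    """Core capability: fails open, so it accepts any request."""
    return f"provisioned {request.name} ({request.language}, {request.storage_gb} GB)"


def supported_language(request: ServiceRequest) -> Optional[str]:
    allowed = {"python", "go", "java"}  # assumed list, not from the post
    if request.language not in allowed:
        return f"language {request.language!r} is not supported"
    return None


def storage_limit(request: ServiceRequest) -> Optional[str]:
    max_gb = 500  # assumed cap, not from the post
    if request.storage_gb > max_gb:
        return f"{request.storage_gb} GB exceeds the {max_gb} GB limit"
    return None


def check(request: ServiceRequest, policies: List[Policy]) -> List[str]:
    """Run every policy and collect the violation messages."""
    violations = []
    for policy in policies:
        message = policy(request)
        if message is not None:
            violations.append(message)
    return violations


def provision_with_policies(request: ServiceRequest, policies: List[Policy]) -> str:
    """Policy layer: filter requests, then delegate to the open core."""
    violations = check(request, policies)
    if violations:
        raise PermissionError("; ".join(violations))
    return provision(request)
```

When The Right Way changes, only the list of policies passed to provision_with_policies changes; provision itself stays the same.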

The key insight for me is that a sufficiently generic implementation can last forever, but intentional restrictions tend to evolve rapidly over time; if infrastructure maintainers want to avoid rewriting their systems every year or two, then we need to be able to tweak policies to enforce restrictions while independently maintaining and improving the underlying capabilities. (I sometimes also describe this concept as “self-service with guard-rails”, for cases when these layers are more about providing informational norms than about enforcing restrictions.)
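One way to picture the guard-rails variant is to give each policy a mode, so the same check can start out as advice and later be promoted to a hard restriction. The sketch below is a hypothetical Python illustration; the Rule type, the Mode names, and the uses_index field are assumptions made for the example.

```python
import enum
from typing import Callable, List, NamedTuple, Optional, Tuple


class Mode(enum.Enum):
    ADVISE = "advise"    # report the problem, but let the action proceed
    ENFORCE = "enforce"  # report the problem and block the action


class Rule(NamedTuple):
    check: Callable[[dict], Optional[str]]
    mode: Mode


def unindexed_query(query: dict) -> Optional[str]:
    # Hypothetical check: assumes the query carries a "uses_index" flag.
    if not query.get("uses_index", False):
        return "query does not use an index"
    return None


def evaluate(query: dict, rules: List[Rule]) -> Tuple[List[str], List[str]]:
    """Return (warnings, errors); only errors should block the query."""
    warnings: List[str] = []
    errors: List[str] = []
    for rule in rules:
        message = rule.check(query)
        if message is None:
            continue
        if rule.mode is Mode.ENFORCE:
            errors.append(message)
        else:
            warnings.append(message)
    return warnings, errors
```

Switching a rule from Mode.ADVISE to Mode.ENFORCE is a one-line policy change that leaves the underlying query path alone.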

Like most rules unencumbered by nuance, I haven’t found this to be universally applicable, but I have found it useful in reducing the rate at which tools transition into technical debt. The next time you’re iterating on your developer tooling, give it a whirl and see how it feels.