Around 2009, the Dynamo paper materialized into Cassandra.
Cassandra escaped Facebook in the fashion of the time: an abrupt code bomb
erupting into existence with little ongoing maintenance.
Maintaining early versions of Cassandra was itself an explosive experience, and
those who ran early versions developed a shared joke
that Cassandra was a trojan horse released to blot out progress by an
entire generation of Silicon Valley startups.
Working at Digg and running Cassandra 0.6, we didn’t laugh much.
While operationally it was a bit of a challenge, it introduced me to what felt
like a very novel idea at the time: Cassandra and similarly designed NoSQL databases
were bringing scalability to the masses by only providing operations that worked well at scale.
Sure, later they released CQL and that rigid enforcement faded, but
the idea that you can drive correct user behavior by radically restricting choice
was a very powerful one.
A couple of jobs later, I joined a team which was exploring the logical extremes of
this concept; we had extremely rigid service provisioning and configuration tools,
which only allowed a very specific shape of service to be run in a very specific way (colloquially, The Right Way).
Our customers loved that our tooling always ensured they did things in
a maintainable, easy-to-scale way, and that it didn't require them to waste time
trying a variety of tools; they could just get to work!
Just kidding. They revolted.
Our initial response was to dig in and explain why our customers should do
it our way and not their way, but that was ~surprisingly ineffective, even in cases
where the new way felt demonstrably worse. Sensing the tides had turned against us,
we eventually relented and built a very flexible and buzzword compliant
solution to service provisioning, allowing teams to provision whatever kind of service
with whatever kind of programming language they wanted.
All was blissful for some time, with teams making their own decisions about
technology. In particular this approach did a great job of aligning choice and responsibility,
where teams who chose to use a non-standard technology ended up absorbing most of the
additional complexity from doing so.
Well-aligned and well-designed, we were riding high, but
oddly enough our bliss didn't last forever. Pretty soon another right way graduated
to The Right Way (one which had relatively little in common with its nominal predecessor), and our tooling
suddenly had a new bug: it was too flexible.
It took me a while to figure out what to take away from those experiences, but these days
I summarize my takeaway as:
Design systems which fail open and layer policy on top.
In this case, failing open means to default to allowing any behavior, even if you find it undesirable.
This might be allowing a user to use unsupported programming languages, store too much data,
or perform unindexed queries.
Then layering policies on top means adding filters which enforce the desired behavior.
Following the above example, that would be rejecting programming languages or libraries
you find undesirable, users storing too much data, or queries without proper indexes.
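To make the split concrete, here is a minimal sketch (all names are hypothetical, not from any real tooling): the underlying capability, `provision`, fails open and allows any request by default, while each restriction lives in a separate policy function that can be added, removed, or tweaked without touching the capability itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ProvisionRequest:
    team: str
    language: str
    storage_gb: int

# A policy is just a function: it returns a rejection reason, or None to allow.
Policy = Callable[[ProvisionRequest], Optional[str]]

def provision(request: ProvisionRequest, policies: List[Policy]) -> str:
    """The generic capability: fails open, so with no policies every request succeeds."""
    for policy in policies:
        reason = policy(request)
        if reason is not None:
            return f"rejected: {reason}"
    return f"provisioned service for team {request.team}"

def supported_languages(allowed: set) -> Policy:
    """Policy layer: reject programming languages the platform doesn't support."""
    def check(req: ProvisionRequest) -> Optional[str]:
        if req.language not in allowed:
            return f"unsupported language: {req.language}"
        return None
    return check

def storage_quota(max_gb: int) -> Policy:
    """Policy layer: reject requests that store too much data."""
    def check(req: ProvisionRequest) -> Optional[str]:
        if req.storage_gb > max_gb:
            return f"{req.storage_gb}GB exceeds quota of {max_gb}GB"
        return None
    return check

policies = [supported_languages({"go", "python"}), storage_quota(100)]
print(provision(ProvisionRequest("search", "go", 20), policies))
print(provision(ProvisionRequest("ml", "haskell", 20), policies))
```

When The Right Way changes again, only the `policies` list changes; the capability underneath stays stable.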
The key insight for me is that a sufficiently generic implementation can last forever,
but intentional restrictions tend to evolve rapidly over time; if infrastructure maintainers
want to avoid rewriting their systems every year or two, then they need to be able to
tweak policies to enforce restrictions while independently maintaining and improving the
underlying capabilities. (I sometimes also describe this concept as “self-service with guard-rails”,
for cases when these layers are more about providing informational norms than about hard enforcement.)
Like most rules unencumbered by nuance, I haven’t found this to be universally applicable,
but I have found it useful in reducing the rate that tools transition into technical debt.
The next time you're iterating on your developer tooling, give it a whirl and see how it feels.