August 11, 2019.
I've spent a fair amount of time recently discussing how distributed systems fail and strategies for preventing those failures, and the lack of a shared vocabulary poses a significant barrier to high-bandwidth communication.
Given some murky personal plans to write more in this area for the next bit, I've written up the vocabulary I've found most effective. Hopefully this will be useful in its own right, but at worst I'll use it to avoid repeating boilerplate definitions going forward!
Want the original sources instead? Of course!
Any discussion about distributed systems starts with a discussion of the CAP theorem, which argues you can only have two of these three properties:
consistency (C) equivalent to having a single up-to-date copy of the data; high availability (A) of that data (for updates); and tolerance to network partitions (P).
It's useful to note that the original authors themselves have since emphasized that "'2 of 3' is misleading", particularly because (1) production systems make numerous distinct choices between consistency and availability (not one overall choice), and (2) "all three properties are more continuous than binary."
One of the most useful explorations of the continuous nature of consistency and availability is Fox and Brewer's Harvest, Yield, and Scalable Tolerant Systems, which refines the ideas of CAP to better address the non-binary nature of its properties.
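The paper's two key terms are yield, the probability of completing a request, and harvest, the fraction of the complete data reflected in a response. Using this vocabulary, we can talk about availability tradeoffs in a more nuanced way. For example, a search engine can always continue to return results but sometimes omit some of them if a node is unavailable, which is described as "harvest degradation."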
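To make harvest degradation concrete, here is a minimal sketch of a scatter-gather search that skips unavailable shards rather than failing the whole request. The `search` function, the `SearchResult` type, and the shard names are hypothetical, and a shard mapped to `None` stands in for a node that is down.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SearchResult:
    hits: List[str]
    harvest: float  # fraction of shards that contributed to this answer


def search(shards: Dict[str, Optional[List[str]]], term: str) -> SearchResult:
    """Query every shard, skipping the ones that are down instead of failing."""
    hits: List[str] = []
    answered = 0
    for docs in shards.values():
        if docs is None:  # simulated unavailable node: lose harvest, keep yield
            continue
        answered += 1
        hits.extend(doc for doc in docs if term in doc)
    harvest = answered / len(shards) if shards else 1.0
    return SearchResult(hits=sorted(hits), harvest=harvest)


shards = {
    "shard-a": ["cap theorem notes", "raft paper"],
    "shard-b": None,  # simulated outage
    "shard-c": ["cap and base compared"],
}
result = search(shards, "cap")
print(result.hits, result.harvest)
```

The request still succeeds, so yield is unaffected, but the caller can see from `harvest` that only two of the three shards contributed to the answer.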
It also introduces the concept of orthogonal mechanisms, advocating to remove all run-time interfaces between systems, instead relying on compile-time interfaces, such that they operate independently. A real-world example could be building a static cache of your most common content at compile time, and periodically rebuilding and shipping it, but not attempting to update the cache in real-time. (Which is actually how this blog works, with every instance having its own complete snapshot of data that it periodically fetches.)
ACID was described by Jim Gray's The Transaction Concept: Virtues and Limitations in '81, and describes four essential components of a transaction:
This is the approach followed by most SQL databases like MySQL and PostgreSQL, and by some, but far from all, NoSQL databases (e.g. Neo4j follows ACID).
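As a small illustration of atomicity (the "A" in ACID), here is a sketch using Python's built-in `sqlite3` module. The `accounts` table and the `transfer` helper are made up for this example. The point is only that a transfer either applies both updates or neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Violates the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


overdraft_ok = transfer(conn, "alice", "bob", 150)  # credit is rolled back
after_overdraft = dict(conn.execute("SELECT name, balance FROM accounts"))
valid_ok = transfer(conn, "alice", "bob", 40)
after_valid = dict(conn.execute("SELECT name, balance FROM accounts"))
print(overdraft_ok, after_overdraft, valid_ok, after_valid)
```

The credit to `bob` is deliberately issued first, so the failed overdraft shows the rollback undoing an update that had already been applied inside the transaction.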
BASE is an alternative to ACID:
BASE is diametrically opposed to ACID. Where ACID is pessimistic and forces consistency at the end of every operation, BASE is optimistic and accepts that the database consistency will be in a state of flux.
The letters in BASE correspond to:
Many NoSQL databases follow this model, such as Cassandra, Datomic, Dynamo, Riak, and many more.
For more detail, the Towards Robust Distributed Systems presentation by Brewer is another good resource.
I'm hearing more about CALM lately, which is Consistency as Logical Monotonicity, and is detailed in Keeping CALM: When Distributed Consistency is Easy. Its core theorem is:
A program has a consistent, coordination-free distributed implementation if and only if it is monotonic.
Adrian Colyer's summary of Keeping CALM is great additional reading (as are all his summaries).
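A rough way to build intuition for monotonicity is to compare a computation whose output only grows as input arrives with one whose answer can be retracted. This is a toy sketch, not the paper's formalism, and `union_all` and `is_absent` are hypothetical helpers.

```python
from itertools import permutations
from typing import FrozenSet, Iterable


def union_all(batches: Iterable[FrozenSet[str]]) -> FrozenSet[str]:
    """Monotonic: every new batch can only add to the result."""
    seen: FrozenSet[str] = frozenset()
    for batch in batches:
        seen = seen | batch
    return seen


def is_absent(item: str, seen: FrozenSet[str]) -> bool:
    """Non-monotonic: a True answer may be retracted by later input."""
    return item not in seen


batches = [frozenset({"a"}), frozenset({"b", "c"}), frozenset({"a", "d"})]

# Whatever order the batches are delivered in, the union converges to one set.
outcomes = {union_all(order) for order in permutations(batches)}

# The absence check changes its answer as more input arrives, so a node
# cannot answer safely without knowing it has seen everything.
early = is_absent("d", union_all(batches[:1]))
late = is_absent("d", union_all(batches))
print(len(outcomes), early, late)
```

The union needs no coordination because its result does not depend on delivery order, whereas the absence check is only trustworthy once every node agrees that no more input is coming.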
Availability configurations:
Another version of this language uses "hot", "warm" and "cold", with "hot" meaning "active" and "cold" meaning "passive". However, those same terms are sometimes used instead to describe frequency of data retrieval, as in the "cold storage" offered by Amazon S3 Glacier for infrequently accessed data.
To avoid confusion, I recommend using "active" and "passive" to describe availability properties, and using "hot", "warm" and "cold" to describe frequency of data retrieval.
Frequently referenced algorithms:
Consistency levels from Werner Vogels' Eventually Consistent:
Conflict resolution mechanisms for distributed systems:
This is hardly a complete set of terms. There are very, very many terms out there in distributed systems, and I'll keep updating and iterating on this article over time.