I’ve spent a fair amount of time recently discussing how distributed systems fail and
strategies for preventing those failures, and the lack of a shared vocabulary poses a
significant barrier to high-bandwidth communication.
Given some murky personal plans to write more in this area for the next bit,
I’ve written up the vocabulary I’ve found most effective. Hopefully this will
be useful in its own right, but at worst I’ll use it to avoid repeating boilerplate
definitions going forward!
Want the original sources instead? Of course!
- CAP Twelve Years Later: How the “Rules” Have Changed, Brewer
- Harvest, Yield, and Scalable Tolerant Systems, Fox, Brewer
- The Transaction Concept: Virtues and Limitations, Gray
- BASE: an ACID Alternative, Pritchett
- Keeping CALM: When Distributed Consistency is Easy, Hellerstein, Alvaro
- Eventually Consistent, Vogels
- Even more papers worth reading
CAP: Consistency, Availability, Partitions
Any discussion about distributed systems starts with a discussion of the
CAP theorem, which argues you can only have two of these three properties:
consistency (C) equivalent to having a single up-to-date copy of the data;
high availability (A) of that data (for updates); and
tolerance to network partitions (P).
It’s useful to note that the original authors themselves have since emphasized
that “‘2 of 3’ is misleading”, particularly because (1) production systems make
numerous distinct choices between consistency and availability (not one overall choice),
and (2) “all three properties are more continuous than binary.”
One of the most useful explorations of the continuous nature of consistency and availability
is Harvest, Yield and Scalable Tolerant Systems.
Harvest and Yield
Fox and Brewer wrote Harvest, Yield, and Scalable Tolerant Systems,
which refines the ideas of CAP to better address the non-binary nature of its properties.
- Yield is the probability of completing a request (e.g. what you typically measure
when you talk about availability),
- Harvest is the fraction of data in the response relative to the amount that should
be available in an ideal case.
Using this vocabulary, we can talk about availability tradeoffs in a more nuanced way,
for example always continuing to return results from your search engine but sometimes
not including all results if a node is unavailable, which is described as “harvest degradation.”
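To make the distinction concrete, here’s a toy sketch (mine, not from the paper) of a scatter-gather search that preserves yield while letting harvest degrade; the shard functions are hypothetical stand-ins:

```python
# A scatter-gather query that completes (yield) even when shards fail,
# returning a smaller fraction of the ideal data (harvest).

def scatter_gather(shards, query):
    """Query every shard; skip unreachable ones instead of failing."""
    results, reachable = [], 0
    for shard in shards:
        try:
            results.extend(shard(query))
            reachable += 1
        except ConnectionError:
            pass  # degrade harvest, preserve yield
    harvest = reachable / len(shards)  # fraction of ideal data returned
    return results, harvest

# Two healthy shards and one that is down:
ok1 = lambda q: ["a", "b"]
ok2 = lambda q: ["c"]
def down(q):
    raise ConnectionError("shard unreachable")

results, harvest = scatter_gather([ok1, ok2, down], "q")
# The request still succeeds, with harvest 2/3.
```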
It also introduces the concept of orthogonal mechanisms, advocating for removing all
run-time interfaces between systems and instead relying on compile-time interfaces,
such that they operate independently. A real-world example could be building a static cache
of your most common content at compile time, and periodically rebuilding and shipping it,
but not attempting to update the cache in real-time. (Which is actually how this blog works,
with every instance having its own complete snapshot of data that it periodically fetches.)
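As an illustration, here’s a minimal sketch of such an orthogonal mechanism; the function and path names are hypothetical, not from any real system:

```python
# Build time writes a static snapshot; run time only ever reads it.
# There is no run-time interface back to the content store.
import json
import os
import tempfile

CACHE_PATH = os.path.join(tempfile.gettempdir(), "static_cache.json")

def build_static_cache(fetch_popular, path=CACHE_PATH):
    """Build step: snapshot the most common content into a static artifact."""
    with open(path, "w") as f:
        json.dump(fetch_popular(), f)

def serve(key, path=CACHE_PATH):
    """Serving path: reads the shipped snapshot, never the backing store."""
    with open(path) as f:
        return json.load(f).get(key)

# Each deploy rebuilds and ships the snapshot:
build_static_cache(lambda: {"/": "home page", "/about": "about page"})
```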
ACID and BASE
ACID was introduced in Jim Gray’s
The Transaction Concept: Virtues and Limitations in ‘81,
and describes four essential components of a transaction:
- Atomicity - either all changes occur or none occur
- Consistency - transactions cannot bring database into an invalid state (note that this is rather different than consistency in the CAP sense)
- Isolation - concurrent execution doesn’t change results
- Durability - effects survive failures
This is the approach followed by most SQL databases like MySQL and PostgreSQL,
and some, but far from all, NoSQL databases (e.g. Neo4j).
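As a concrete illustration of atomicity, here’s a sketch (my example, not Gray’s) using Python’s built-in sqlite3:

```python
# Either the whole transfer commits or none of it does: a simulated
# crash mid-transaction rolls back the partial debit.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # opens a transaction; rolls back on any exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'"
        )
        raise RuntimeError("simulated crash before bob is credited")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# balances == {'alice': 100, 'bob': 0}
```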
BASE is an alternative to ACID:
BASE is diametrically opposed to ACID.
Where ACID is pessimistic and forces consistency at the end of every operation,
BASE is optimistic and accepts that the database consistency will be in a state of flux.
The letters in BASE correspond to:
- Basically Available - provides CAP’s Availability by prioritizing Yield over Harvest
- Soft state - state evolves over time, even without new input, as changes propagate; there is no single guaranteed-correct version at a point in time
- Eventual consistency - if changes stop, state will eventually become consistent
Many NoSQL databases follow this model, such as Cassandra, Datomic, Dynamo, Riak, and many more.
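A toy sketch of these properties, not modeled on any of those databases, using last-writer-wins merges:

```python
# Replicas accept writes independently (soft state) and converge via
# anti-entropy merges once updates stop (eventual consistency).

def merge(a, b):
    """Merge two replica states; each entry is key -> (timestamp, value)."""
    merged = dict(a)
    for key, (ts, val) in b.items():
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, val)  # last writer wins
    return merged

# Two replicas take conflicting writes while partitioned:
r1 = {"profile": (1, "old name")}
r2 = {"profile": (2, "new name")}

# After the partition heals, merge rounds make them identical:
r1 = merge(r1, r2)
r2 = merge(r2, r1)
# r1 == r2: both replicas now hold the ts=2 value.
```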
For more detail, the Towards Robust Distributed Systems
presentation by Brewer is another good resource.
CALM
I’m hearing more about CALM lately, which is
Consistency as Logical Monotonicity,
and is detailed in Keeping CALM: When Distributed Consistency is Easy.
Its core theorem is:
A program has a consistent, coordination-free distributed implementation
if and only if it is monotonic.
Adrian Colyer’s summary of Keeping CALM
is great additional reading (as are all his summaries).
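A small sketch of what monotonicity means in practice: set union only ever grows as messages arrive, so replicas can apply them in any order and agree; “take the last message” retracts earlier outputs, so order matters and coordination is needed.

```python
# Monotonic: output only grows with input, so message order is irrelevant.
def union_all(messages):
    out = set()
    for m in messages:
        out |= m
    return out

msgs = [{"a"}, {"b"}, {"a", "c"}]
assert union_all(msgs) == union_all(reversed(msgs))  # order-independent

# Non-monotonic: the answer depends on which message arrived last,
# so replicas seeing different orders disagree without coordination.
def latest(messages):
    out = None
    for m in messages:
        out = m
    return out

assert latest(msgs) != latest(list(reversed(msgs)))
```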
Vocabulary for describing availability:
- Active-active implies that a request routed to any node will be handled properly.
- Active-passive implies that a request will always be routed to a single active node,
but that it’s possible to quickly elect a new active node if the current active node fails.
There is another version of this language using “hot”, “cold” and “warm” that
is sometimes used with “hot” meaning “active” and “cold” meaning “passive”.
However, sometimes those same terms are used to instead describe frequency of data retrieval,
in the sense of “cold storage” offered by Amazon S3 Glacier
for infrequently accessed data.
To avoid confusion, I recommend using “active” and “passive” to describe availability properties,
and use “hot”, “warm” and “cold” to describe frequency of data retrieval.
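A toy sketch of active-passive routing with naive promotion (real systems use proper leader election, e.g. via Raft or ZooKeeper, rather than this in-process stand-in):

```python
# Requests always go to the single active node; on failure, a healthy
# passive node is promoted to active.

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes              # name -> healthy?
        self.active = next(iter(nodes)) # first node starts as active

    def route(self, request):
        if not self.nodes[self.active]:
            # Promote the first healthy passive node to active.
            self.active = next(n for n, ok in self.nodes.items() if ok)
        return f"{self.active} handled {request}"

cluster = Cluster({"node-1": True, "node-2": True})
assert cluster.route("r1") == "node-1 handled r1"

cluster.nodes["node-1"] = False  # the active node fails
assert cluster.route("r2") == "node-2 handled r2"
```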
Frequently referenced algorithms:
- Paxos is the original distributed consensus algorithm,
with modified versions used by Chubby
and many others. (Zookeeper’s ZAB is similar to Paxos as well.)
- RAFT is a much simpler consensus algorithm,
used by Consul, among others.
- SWIM is a distributed gossip protocol for group membership
(e.g. for determining members of a caching ring, etc)
- Two-phase commit is a protocol
for atomic distributed commits
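Of these, two-phase commit is simple enough to sketch in a few lines; this is a toy version that ignores coordinator failure:

```python
# Phase 1: the coordinator collects prepare votes from all participants.
# Phase 2: it broadcasts commit only if every vote was yes; else abort.

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]       # phase 1
    decision = "commit" if all(votes) else "abort"
    for p in participants:                            # phase 2
        p.finish(decision)
    return decision

class Participant:
    def __init__(self, can_commit):
        self.can_commit = can_commit
        self.state = "pending"
    def prepare(self):
        return self.can_commit
    def finish(self, decision):
        self.state = decision

ps = [Participant(True), Participant(True)]
assert two_phase_commit(ps) == "commit"

ps = [Participant(True), Participant(False)]  # one participant votes no
assert two_phase_commit(ps) == "abort"        # so everyone aborts
```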
Consistency levels from Werner Vogels’ Eventually Consistent:
- Strong consistency - after an update completes, all further operations correctly use the new value
- Weak consistency - operations after an update completes may not correctly use the new value
- Eventual consistency - if no new updates are made to an object, at some point in future it will return correct value
- Inconsistency window - the period between an update and the system guaranteeing the correct value will be returned
- Causal consistency - once an updated version is communicated to a process, it is guaranteed to not use older versions
- Monotonic read consistency - similar to causal consistency, but from perspective of receiving process: once a process has seen a version, it will never use previous versions
- Monotonic write consistency - writes within a process are serialized
- Read-your-writes consistency - reads after a write return the updated version
- Session consistency - reads after writes are always correct within a given session
- Read level, write level and replica level are the number of nodes in a distributed storage system
involved in the reading, writing and replication of a piece of data.
The Dynamo paper describes this approach
in some detail, and it’s used heavily by both Cassandra and MongoDB as well (among many others).
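A sketch of how read and write levels interact: choosing R + W > N forces every read quorum to overlap the most recent write quorum. The replica sets chosen below are deliberately different to show the overlap; the function names are illustrative:

```python
# N = 3 replicas; W = 2 and R = 2, so R + W > N guarantees overlap.
N = 3
replicas = [{} for _ in range(N)]  # each replica: key -> (version, value)

def write(key, version, value, W=2):
    # The write is acknowledged once W replicas have it (here, the first W).
    for r in replicas[:W]:
        r[key] = (version, value)

def read(key, R=2):
    # Read from a *different* set of R replicas; the quorums still overlap.
    seen = [r[key] for r in replicas[N - R:] if key in r]
    return max(seen)[1] if seen else None  # highest version wins

write("k", 1, "v1")
# Replica index 1 is in both the write quorum {0, 1} and the read
# quorum {1, 2}, so the read observes the latest write.
assert read("k") == "v1"
```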
Conflict resolution mechanisms for distributed systems:
- last writer wins - the most recently written version is the correct version
- read repair - inconsistencies fixed at read time, slowing reads
- write repair - inconsistencies fixed at write, slowing writes
- asynchronous repair - inconsistencies fixed out of band somehow, not synchronously within read or write operation
- vector clocks create a logical clock to reconcile writes, described in Dynamo paper
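Vector clocks are the least obvious of these mechanisms; here’s a sketch of the comparison rule, with hypothetical node names:

```python
# Each version carries a counter per node. One version descends from
# another only if it is >= on every entry; otherwise the two versions
# are concurrent and need application-level conflict resolution.

def descends(a, b):
    """True if version a is equal to or newer than version b everywhere."""
    nodes = set(a) | set(b)
    return all(a.get(n, 0) >= b.get(n, 0) for n in nodes)

v1 = {"node-a": 2, "node-b": 1}
v2 = {"node-a": 1, "node-b": 1}
v3 = {"node-a": 1, "node-b": 2}

assert descends(v1, v2)  # v1 supersedes v2: safe to discard v2
assert not descends(v1, v3) and not descends(v3, v1)  # concurrent writes
```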
- Single point of failure (SPOF) a single component that causes dependent services to fail
- A fault domain is a set of components that share a SPOF
- Fault tolerant describes a system that has multiple fault domains at the same level of functionality
- Replication is streaming changes from one process to another
- Synchronous replication is committing a change on a replica
at the same time as committing it on the primary (e.g. MySQL’s semisynchronous replication).
Typically very, very slow.
This is hardly a complete set of terms; there are very, very many
more out there in distributed systems. I’ll keep updating and iterating
on this article over time.