After reading Lampson's Hints for Computer System Design,
the topic that most grabbed my fancy was fault tolerance
and fault recovery. Inherently fascinating, it has taken on
additional intrigue as we work at bringing sanity-in-failure
to the rather involved system we are building. As a result,
the next paper I picked up was Jim Gray's A Transaction Model, 1980.
From a great height, A Transaction Model discusses how systems fail,
concurrency, degrees of consistency, determining legality of concurrent
transactions in systems with consistency requirements, dangers of
concurrent transactions, solutions for concurrent transactions,
distributed databases and possible aspects of a transaction-centric
programming language.
As a quick prelude to the remainder of my summary, I should
include Gray's definition of transactions, in which he posits
two requirements: a transaction must preserve the system's
consistency constraints, and it must be atomic, meaning either
all of its actions take effect or none of them do.
For many developers, myself included, transactions
are generally only discussed when choosing between
MySQL backends, or when debating the relative merits of MySQL
and PostgreSQL. However, I've increasingly come to think
that transactions apply at much higher levels as well,
and that most systems could be described as transactions,
with the due caveat that they are desired neither to be
particularly reliable nor to be particularly consistent.
Consider for example a Twitter client which is attempting
to post a tweet. The pre-commit stage involves writing the
tweet and displaying it in the client's GUI, and then the
commit stage is the actual sending to Twitter.
However, it is possible (perhaps likely, depending
on your daily diet of snark) that Twitter is not available,
or that you've been temporarily prohibited from further API
usage for the next hour. In that case the client needs
to undo the pre-commit actions (remove your tweet from
its GUI) and either reattempt the pre-commit and commit
stages or simply abandon the transaction entirely.
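To make that concrete, here is a minimal sketch of the tweet-posting flow modeled as a transaction. The client methods (gui_show, api_post and friends) are hypothetical stand-ins of my own invention, not any real Twitter API:

```python
# A minimal sketch of the tweet-as-transaction idea. The client and
# its methods are hypothetical stand-ins, not a real Twitter API.

class TweetTransaction:
    def __init__(self, client, text):
        self.client = client
        self.text = text
        self.undo_log = []  # actions to reverse if we must abort

    def pre_commit(self):
        # Optimistically show the tweet in the GUI, remembering how to undo it.
        widget = self.client.gui_show(self.text)
        self.undo_log.append(lambda: self.client.gui_remove(widget))

    def commit(self):
        # The actual send to Twitter is the commit point.
        self.client.api_post(self.text)

    def abort(self):
        # Reverse the pre-commit actions in LIFO order.
        while self.undo_log:
            self.undo_log.pop()()

def post_tweet(client, text, retries=1):
    for attempt in range(retries + 1):
        txn = TweetTransaction(client, text)
        txn.pre_commit()
        try:
            txn.commit()
            return True
        except IOError:  # Twitter unavailable, or rate-limited
            txn.abort()  # reattempt, or fall through and give up
    return False         # abandon the transaction entirely
```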
I'm not sure that using transactions to model this
Twitter client logic is particularly useful, but
with greater ambition I imagine there are more
interesting examples one could devise where viewing the
entire program as a series of transactions would be
a powerful simplifying model.
The paper goes through a thorough process of outlining
the various kinds of errors--which I won't attempt
to faithfully recreate here--but at the broadest
level discusses two kinds of failures: transaction restarts
and system restarts. The distinction between the two
is exactly what one would imagine from reading the terms.
More interesting is Gray's classification of resources
into four buckets:
real resources represent physical entities (a genuine unit of currency issued from an ATM),
stable resources represent entities which persist through a system restart (a file written to disk),
volatile resources are entities which are lost during a system restart (the current value of a variable),
and volatile-recoverable resources are kept in a fast-access form, but backed in such a way that they can be rebuilt after a system restart (Mnesia running with an in-memory store that is backed to disk).
The net advantage of thinking in these terms is that it makes it
easier to determine where a transaction should reset to, and it
also helps you notice when your system is unrecoverable (that is,
dependent on a volatile value to recover after a system restart).
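As a rough illustration, one might encode the four buckets and the "don't depend on volatile state to recover" rule along these lines; the helper function here is my own invention, not from the paper:

```python
from enum import Enum

class ResourceClass(Enum):
    REAL = "real"            # physical effects, e.g. cash issued from an ATM
    STABLE = "stable"        # survives a system restart (on disk)
    VOLATILE = "volatile"    # lost during a system restart (in memory only)
    VOLATILE_RECOVERABLE = "volatile-recoverable"  # in memory, rebuildable from disk

def recovery_is_sound(recovery_dependencies):
    # A recovery plan is suspect if rebuilding state after a system
    # restart depends on anything volatile.
    return all(r != ResourceClass.VOLATILE for r in recovery_dependencies)

# Recovering an index from its on-disk backing store is fine...
assert recovery_is_sound([ResourceClass.STABLE, ResourceClass.VOLATILE_RECOVERABLE])
# ...but recovering from a counter that lived only in memory is not.
assert not recovery_is_sound([ResourceClass.VOLATILE, ResourceClass.STABLE])
```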
The last concept I'll mention--similar to the event log
discussed in Lampson's Hints for Computer System Design--is
the concept of undo logs and redo logs which enable
a system to rollback a transaction's pre-commit events.
In essence, before taking the pre-commit steps the actions
are written to a redo-log. Then as each step is taken, it
is popped off the redo-log and pushed onto the undo-log,
such that at any time the full list of events in the pre-commit
stage can be performed or reversed as required.
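A minimal sketch of that pop-and-push discipline, assuming each action is paired with its inverse, might look like this:

```python
# A sketch of the redo-log/undo-log discipline described above: planned
# actions sit on the redo log, and each completed step moves to the undo
# log so the transaction can be rolled forward or backed out at any point.

class TransactionLog:
    def __init__(self, steps):
        # Each step is a (do, undo) pair of callables.
        self.redo_log = list(reversed(steps))  # planned work, next step on top
        self.undo_log = []                     # completed work, last step on top

    def step(self):
        do, undo = self.redo_log.pop()
        do()
        self.undo_log.append((do, undo))

    def run(self):
        while self.redo_log:
            self.step()

    def rollback(self):
        # Reverse completed steps in LIFO order, returning
        # them to the redo log in case we want to retry.
        while self.undo_log:
            do, undo = self.undo_log.pop()
            undo()
            self.redo_log.append((do, undo))
```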
With stable undo and redo logs, it becomes possible to
sanely back out transactions which are underway when
a system restart occurs. This is a fairly amazing
property, which many systems would benefit from having,
but rather few have.
I won't belabor this point as it isn't examined in
great depth, but the broad theory here is that when concurrent
transactions are written to a log in such a way that
they are equivalent to some set of serial transactions,
several positive properties arise.
When concurrent transactions cannot be recorded in such a way
that they are equivalent to some serial transaction history,
then the sanity of the system becomes suspect.
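The paper's treatment is more general than this, but as an illustration, the standard conflict-serializability test--build a precedence graph over conflicting actions and look for cycles--captures the idea:

```python
# A sketch of a conflict-serializability check: a history is equivalent
# to some serial order iff the precedence graph over conflicting actions
# (two actions on the same item, from different transactions, at least
# one a write) has no cycle. This is the standard textbook test, used
# here only to illustrate the idea.

def serializable(history):
    # history: list of (txn, action, item), e.g. ("T1", "read", "x")
    edges = set()
    for i, (t1, a1, x1) in enumerate(history):
        for t2, a2, x2 in history[i + 1:]:
            if t1 != t2 and x1 == x2 and "write" in (a1, a2):
                edges.add((t1, t2))  # t1 must precede t2 in any serial order

    # Detect a cycle in the precedence graph via depth-first search.
    nodes = {t for e in edges for t in e}
    def cycles_from(node, visiting):
        if node in visiting:
            return True
        return any(cycles_from(b, visiting | {node})
                   for a, b in edges if a == node)
    return not any(cycles_from(n, set()) for n in nodes)

# T1 reads, writes, then T2 reads: equivalent to running T1 then T2.
ok = [("T1", "read", "x"), ("T1", "write", "x"), ("T2", "read", "x")]
assert serializable(ok)

# Both read x before either writes it: equivalent to no serial order.
bad = [("T1", "read", "x"), ("T2", "read", "x"),
       ("T1", "write", "x"), ("T2", "write", "x")]
assert not serializable(bad)
```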
Degrees of Consistency
Before examining potential implementations for creating
consistent transactions, Gray defines three degrees of consistency:
A Degree 3 consistency was defined before as a protocol which acquires locks
to cover all actions and holds all locks to transaction commit. It was
shown to prevent concurrency anomalies.
In order to support transaction restart, all systems acquire X-mode
locks to cover writes and hold them to transaction commit. This is
called the degree 1 consistency lock protocol.
If the system additionally acquires S-mode locks to cover reads but
releases the locks prior to commit then it provides degree 2 consistency.
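Translated into rough Python (the lock manager and transaction objects are hypothetical stand-ins), the three protocols differ only in which locks are taken and when they are released:

```python
# A sketch contrasting the three lock protocols quoted above. The lock
# manager and txn objects are hypothetical stand-ins of my own invention.

def degree_1(txn, lock_mgr):
    # X-mode locks cover writes and are held until commit.
    for item in txn.writes:
        lock_mgr.acquire(item, mode="X")
    txn.do_work()
    txn.commit()
    lock_mgr.release_all(txn)

def degree_2(txn, lock_mgr):
    # Additionally, S-mode locks cover reads but are released early.
    for item in txn.reads:
        lock_mgr.acquire(item, mode="S")
    for item in txn.writes:
        lock_mgr.acquire(item, mode="X")
    txn.do_work()
    for item in txn.reads:
        lock_mgr.release(item)   # read locks dropped before commit
    txn.commit()
    lock_mgr.release_all(txn)    # write locks held to commit

def degree_3(txn, lock_mgr):
    # All locks, read and write, are held to transaction commit.
    for item in txn.reads:
        lock_mgr.acquire(item, mode="S")
    for item in txn.writes:
        lock_mgr.acquire(item, mode="X")
    txn.do_work()
    txn.commit()
    lock_mgr.release_all(txn)
```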
Regarding the relative performance of using the less thorough degree 1 and
degree 2 consistencies, Gray wrote:
In the experiments we have done, degree 3 consistency has a cost
(throughput and processor overhead) indistinguishable from the
lower degrees of consistency.
I found his brief remarks regarding why others continue using the
(supposedly) comparably efficient and less consistent degree 1
and degree 2 consistencies to be rather amusing, or at least familiar.
Many times I have seen others--and have certainly been guilty
myself--desperately seeking an explanation for why we have
already done something in a backwards fashion, trying after
the fact to frame it as a rational decision instead
of an ill-considered choice. Gray's two example excuses
were complexity of implementation and imagined but unproven
performance benefits, both of which strike me as rather familiar.
The issue of restarting a transaction which--throughout its
lifespan--exists on many different nodes is discussed, and is
a problem that maps closely to the trickiest problem we are
working on at work.
Gray's suggestion is that each node must record the origin
node it receives a given transaction from, record
the redo-undo history for steps occurring on itself, and
record the node the transaction next proceeds onto.
In this way a full redo-undo history may be maintained
across every node the transaction touches.
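A sketch of that per-node bookkeeping, with a record layout of my own invention, might look like:

```python
# Each node records where the transaction came from, the (do, undo)
# pairs for its own steps, and where the transaction went next. Walking
# these records backwards from any node reconstructs the full undo
# history. The record layout here is illustrative, not from the paper.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class NodeRecord:
    node: str
    origin: Optional[str]                      # node the transaction arrived from
    steps: list = field(default_factory=list)  # (do, undo) pairs performed here
    next_node: Optional[str] = None            # node the transaction proceeded to

def undo_everywhere(records, last_node):
    # records: {node_name: NodeRecord}; walk back from the final node,
    # undoing each node's local steps in LIFO order.
    node = last_node
    while node is not None:
        record = records[node]
        for do, undo in reversed(record.steps):
            undo()
        node = record.origin
```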
There is a great deal of interesting thoughtwork to be
done here in regard to in-memory file-backed databases--which
are themselves a kind of distributed database--and then distributing
those in-memory and file-backed aspects across multiple nodes.
Transactions & Programming Language
In the final portion of the paper, Gray very briefly
describes how his results on transactions might be
incorporated into a future programming language.
Part of the described language is quite similar to
Java or Objective-C's method-level locks (in Java
one can add the synchronized keyword to a method,
which transparently wraps it in a lock).
The other portion has found less traction (although
it seems to me that John McCarthy's language Elephant--the
sum of my knowledge about which is that its motto is
an elephant never forgets--may be related), but it amounts to
adding language-level support for maintaining the redo
and undo logs required for rolling back transactions,
adding language-level support for marking attributes
as real, stable, recoverable-volatile and volatile,
and having the language implementation take responsibility
for the record-keeping needed to implement
stable and recoverable-volatile attributes.
This is likely a situation where my youth fails me, but I'm
not aware of any general-purpose languages which have
incorporated these ideas. I do think it would be fairly straightforward
to add a Python (or language-of-your-choice) library that
would support programming using these transaction-centric and
recovery-oriented concepts.
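As a sketch of what such a library might offer--every name and semantic here is invented for illustration--consider a decorator that replays an undo log on failure, plus a descriptor for marking attributes as stable:

```python
# A sketch of what such a Python library might look like. All names and
# semantics here are invented for illustration: a decorator that wraps a
# function in undo-log bookkeeping, and a descriptor that marks an
# attribute with one of Gray's resource classes.

class stable:
    """Descriptor marking an attribute as stable."""
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, obj, objtype=None):
        return obj.__dict__[self.name]
    def __set__(self, obj, value):
        obj.__dict__[self.name] = value
        # A real library would also force this write to stable storage here.

def transactional(fn):
    """Run fn; if it raises, replay the undo log it accumulated."""
    def wrapper(*args, undo_log=None, **kwargs):
        undo_log = [] if undo_log is None else undo_log
        try:
            return fn(*args, undo_log=undo_log, **kwargs)
        except Exception:
            while undo_log:
                undo_log.pop()()   # reverse completed steps, LIFO
            raise
    return wrapper

class Account:
    balance = stable()

@transactional
def transfer(src, dst, amount, undo_log):
    src.balance -= amount
    undo_log.append(lambda: setattr(src, "balance", src.balance + amount))
    dst.balance += amount
    undo_log.append(lambda: setattr(dst, "balance", dst.balance - amount))
```

With this, transfer(a, b, 10) either completes both balance updates or, if an exception interrupts it partway, replays the undo log to restore both accounts.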
A Transaction Model was a very approachable paper which contained
a number of interesting ideas, and served as an excellent introduction
to transactions. It is always surreal to read papers which are twenty-six
years old but still entirely relevant, and this one certainly is.
The most interesting aspect for me was--as mentioned before--fault
tolerance and recovery from failed transactions, and I'll be raiding
the cited works for additional reading material.